Numerical Descriptives and Distributions

Comprehensive reference guide for descriptive statistics of numerical data.

Numerical Descriptives: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of summarising continuous and discrete numerical data all the way through advanced interpretation, reporting, visualisation, assumption checking, and practical usage within the DataStatPro application. Whether you are encountering numerical descriptive statistics for the first time or deepening your understanding of how to characterise, display, and communicate the distribution of numerical variables, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What are Numerical Descriptives?
  3. The Mathematics Behind Numerical Descriptives
  4. Considerations and Data Quality Checks
  5. Types of Numerical Descriptive Measures
  6. Using the Numerical Descriptives Calculator Component
  7. Step-by-Step Procedure
  8. Interpreting the Output
  9. Visualising Numerical Data
  10. Confidence Intervals for Numerical Descriptives
  11. Advanced Topics
  12. Worked Examples
  13. Common Mistakes and How to Avoid Them
  14. Troubleshooting
  15. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into numerical descriptive statistics, it is essential to be comfortable with the following foundational statistical and mathematical concepts. Each is briefly reviewed below.

1.1 Variables and Observations

A variable is any measurable characteristic that can take on different values across observations. A numerical variable (also called a quantitative variable) records values on a numeric scale that has an inherent magnitude, enabling arithmetic operations such as addition, subtraction, and averaging.

  • Observation: A single unit of study (one person, one trial, one time point).
  • Dataset: A rectangular array of observations (rows) and variables (columns).
  • Value: The specific numeric measurement recorded for a variable on a given observation.

1.2 Scales of Measurement: Interval and Ratio

Numerical descriptive statistics are appropriate for variables measured on interval or ratio scales:

| Scale | Properties | True Zero | Examples |
|---|---|---|---|
| Interval | Equal spacing between values; no true zero | No | Temperature (°C or °F), year, IQ score |
| Ratio | Equal spacing and a meaningful true zero | Yes | Height, weight, income, reaction time, count data |

⚠️ For interval scales, differences are meaningful but ratios are not — saying 40°C is "twice as hot" as 20°C is meaningless. For ratio scales, both differences and ratios are meaningful — a person weighing 80 kg is genuinely twice as heavy as one weighing 40 kg. Most numerical descriptives apply to both scales, but ratios such as the coefficient of variation require ratio-scale data.

1.3 Continuous vs. Discrete Numerical Variables

Numerical variables are further classified by the set of values they can take:

  • Continuous variables: Can take any real value within a range; the precision is limited only by the measurement instrument. Examples: height, blood pressure, temperature, time.
  • Discrete variables: Take only countable, distinct values (often non-negative integers). Examples: number of children, number of hospital admissions, count of errors.

Most numerical descriptives apply equally to both types, though visualisation choices (histograms vs. bar charts) and some distribution assumptions differ.

1.4 The Concept of a Distribution

The distribution of a numerical variable describes how its values are spread across the number line. Understanding a distribution requires characterising:

  1. Location (central tendency): Where the centre or typical value lies.
  2. Spread (dispersion): How much values vary around the centre.
  3. Shape: Whether the distribution is symmetric, skewed, peaked, or flat.
  4. Outliers: Whether extreme values are present.

Numerical descriptive statistics provide compact summaries of each of these four distributional properties.

1.5 The Normal Distribution

The normal (Gaussian) distribution is the most important continuous probability distribution in statistics. It is parameterised by its mean $\mu$ and standard deviation $\sigma$:

$$X \sim \mathcal{N}(\mu,\; \sigma^2)$$

Key properties:

  • Perfectly symmetric and bell-shaped around $\mu$.
  • Mean, median, and mode are all equal to $\mu$.
  • Approximately 68% of observations fall within $\pm 1\sigma$ of $\mu$.
  • Approximately 95% of observations fall within $\pm 1.96\sigma$ of $\mu$.
  • Approximately 99.7% of observations fall within $\pm 3\sigma$ of $\mu$.
  • Skewness = 0; excess kurtosis = 0.

Many numerical descriptives (mean, standard deviation, standard error) are most meaningful and interpretable when the underlying distribution is approximately normal.
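
The coverage figures listed above can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `normal_coverage` is an illustrative assumption and is not part of DataStatPro.

```python
from statistics import NormalDist


def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k SDs of its mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 1.96, 3):
    print(f"within +/-{k} SD: {normal_coverage(k):.4f}")
```

Because the result depends only on the number of standard deviations, the same proportions hold for any mean and standard deviation.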

1.6 Robustness

A statistical measure is robust if it is relatively unaffected by outliers or departures from assumed distributional shapes. This is a critical concept for choosing between competing descriptive measures:

  • Non-robust measures: Mean, variance, standard deviation — one extreme outlier can substantially shift these measures.
  • Robust measures: Median, interquartile range, median absolute deviation — designed to be resistant to the influence of extreme values.
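
To make the contrast concrete, the sketch below compares the mean and median of a small made-up sample before and after one value is replaced by an extreme one. The data values are invented purely for illustration.

```python
from statistics import mean, median

clean = [10, 11, 12, 13, 14]
contaminated = [10, 11, 12, 13, 140]  # last value replaced by an extreme one

# The mean is pulled towards the outlier; the median does not move.
print("mean:  ", mean(clean), "->", mean(contaminated))
print("median:", median(clean), "->", median(contaminated))
```

A single extreme value roughly triples the mean here, while the median stays at 12.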

1.7 Population Parameters vs. Sample Statistics

All numerical descriptives computed from data are sample statistics — estimates of unknown population parameters:

| Population Parameter | Sample Statistic | Symbol (Parameter / Statistic) |
|---|---|---|
| Population mean | Sample mean | $\mu$ / $\bar{x}$ |
| Population variance | Sample variance | $\sigma^2$ / $s^2$ |
| Population standard deviation | Sample SD | $\sigma$ / $s$ |
| Population median | Sample median | $\tilde{\mu}$ / $\tilde{x}$ |
| Population correlation | Sample correlation | $\rho$ / $r$ |

Sample statistics carry sampling variability; confidence intervals (Section 10) quantify the precision of these estimates.

1.8 Summation Notation

Numerical descriptives are expressed using summation notation. For a variable $X$ with $n$ observations $x_1, x_2, \ldots, x_n$:

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$

$$\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2$$

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \text{Sum of squared deviations from the mean}$$

Understanding this notation is essential for interpreting the mathematical formulae throughout this tutorial.


2. What are Numerical Descriptives?

2.1 The Core Purpose

Numerical descriptive statistics are mathematical summaries that characterise the distribution of a continuous or discrete numerical variable. Their collective purpose is to replace a raw list of $n$ numbers with a small, interpretable set of values that faithfully conveys the essential features of the data (location, spread, shape, and extremes) without requiring the reader to examine every individual observation.

2.2 The Four Pillars of a Numerical Description

Every complete numerical description addresses four fundamental questions:

| Pillar | Question | Addressed By |
|---|---|---|
| Central tendency | Where does the centre of the distribution lie? | Mean, median, mode, trimmed mean |
| Dispersion | How spread out are the values? | Range, IQR, variance, SD, CV, MAD |
| Shape | Is the distribution symmetric, skewed, peaked, or flat? | Skewness, kurtosis |
| Extremes | Are there unusual observations at the tails? | Minimum, maximum, outlier flags |

No single number captures all four pillars. A complete description always reports at least one measure from each category.

2.3 When to Use Numerical Descriptives

| Condition | Requirement |
|---|---|
| Variable scale | Interval or ratio (continuous or discrete) |
| Data format | Numeric observations |
| Purpose | Summarise the marginal distribution of one variable |
| Sample size | Any; larger $n$ yields more stable and precise estimates |
| Reporting | Always precede inferential tests with descriptive summaries |

2.4 Real-World Applications

| Field | Variable | Key Descriptives |
|---|---|---|
| Clinical Medicine | Blood pressure (mmHg) | Mean ± SD; reference range; outlier flags |
| Finance | Daily stock return (%) | Mean, SD, skewness, kurtosis; Value at Risk |
| Education | Exam score (0–100) | Mean, median, SD, percentiles, min, max |
| Manufacturing | Component diameter (mm) | Mean, SD, CV; process capability indices |
| Environmental Science | Rainfall (mm) | Median, IQR; skewness; seasonal breakdown |
| Sports Analytics | Player sprint speed (m/s) | Mean ± SD; percentile ranks; outlier detection |
| Pharmacology | Drug concentration (ng/mL) | Geometric mean; CV; log-transformed summaries |
| Epidemiology | Body mass index (kg/m²) | Mean, SD, percentiles; skewness |
| Psychology | Reaction time (ms) | Median, IQR; robust measures due to skewness |
| Quality Control | Process yield (%) | Mean, SD; capability index ($C_{pk}$) |

| Goal | Appropriate Method |
|---|---|
| Summarise one numerical variable | Numerical descriptives |
| Compare means across two groups | Independent-samples t-test |
| Compare means across three or more groups | One-way ANOVA |
| Assess relationship between two numerical variables | Pearson or Spearman correlation |
| Predict one numerical variable from another | Simple or multiple regression |
| Test normality of a numerical variable | Shapiro-Wilk, Kolmogorov-Smirnov test |
| Compare spread across two groups | Levene's test, F-test for equality of variances |
| Summarise a categorical variable | Categorical descriptives |

3. The Mathematics Behind Numerical Descriptives

3.1 Notation

Consider a numerical variable $X$ with $n$ valid observations $x_1, x_2, \ldots, x_n$, arranged in ascending order as the order statistics $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$.

3.2 Measures of Central Tendency

3.2.1 Arithmetic Mean

The arithmetic mean (simply "the mean") is the sum of all values divided by the number of observations:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

The mean is the centre of gravity of the data: the point at which the distribution balances. It uses all observations equally and is the most efficient estimator of the population mean $\mu$ when the distribution is normal. However, it is sensitive to outliers.

3.2.2 Median

The median $\tilde{x}$ is the middle value when observations are arranged in order. It divides the distribution into two equal halves:

$$\tilde{x} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \dfrac{x_{(n/2)} + x_{(n/2 + 1)}}{2} & \text{if } n \text{ is even} \end{cases}$$

The median is the 50th percentile of the distribution. It is robust to outliers and is preferred over the mean when the distribution is skewed or when extreme values are present.

3.2.3 Mode

For numerical data, the mode is the value (or range of values) that appears most frequently. For continuous variables, the mode is typically identified from a histogram as the peak(s) of the distribution rather than a specific repeated value. For discrete data with genuine ties, the mode is the most frequently occurring value.

3.2.4 Trimmed Mean

The trimmed mean (or truncated mean) removes a fixed proportion $\alpha$ of the most extreme observations from each tail before computing the mean:

$$\bar{x}_{\alpha} = \frac{1}{n - 2\lfloor \alpha n \rfloor} \sum_{i = \lfloor \alpha n \rfloor + 1}^{n - \lfloor \alpha n \rfloor} x_{(i)}$$

Common choices: $\alpha = 0.05$ (5% trimmed mean) or $\alpha = 0.10$ (10% trimmed mean). The trimmed mean is more robust than the mean while being more efficient than the median, making it a useful middle ground.
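
The trimming rule can be sketched in a few lines of Python. The function name `trimmed_mean` and the sample values are illustrative assumptions; the number of observations dropped from each tail is the floor of the trimming proportion times the sample size, as in the formula above.

```python
import math


def trimmed_mean(values, alpha):
    """Mean after dropping floor(alpha * n) observations from each tail."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 0.5")
    ordered = sorted(values)
    n = len(ordered)
    k = math.floor(alpha * n)   # observations trimmed per tail
    kept = ordered[k:n - k]     # order statistics k+1 .. n-k
    return sum(kept) / len(kept)


scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trimmed_mean(scores, 0.0))   # ordinary mean
print(trimmed_mean(scores, 0.10))  # drops the 1 and the 100
```

With these values the untrimmed mean is 14.5, while the 10% trimmed mean is 5.5, much closer to the bulk of the data.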

3.2.5 Geometric Mean

The geometric mean is appropriate for positively skewed, multiplicative, or ratio-scale data (e.g., concentration values, growth rates, fold changes):

$$\bar{x}_{geom} = \left(\prod_{i=1}^n x_i\right)^{1/n} = \exp\!\left(\frac{1}{n}\sum_{i=1}^n \ln x_i\right)$$

The geometric mean is defined only for strictly positive values ($x_i > 0$). It is the antilog of the mean of the log-transformed values and is always $\leq$ the arithmetic mean (equality holds when all values are identical).

3.2.6 Harmonic Mean

The harmonic mean is appropriate for averaging rates or ratios (e.g., speeds, price-to-earnings ratios):

$$\bar{x}_{harm} = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}}$$

For strictly positive values, the harmonic mean is always $\leq$ the geometric mean $\leq$ the arithmetic mean (the AM-GM-HM inequality), with equality when all values are identical.
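
The ordering of the three means can be illustrated with Python's standard-library `statistics` module (Python 3.8 or later for `geometric_mean`). The sample values are made up for illustration.

```python
from statistics import geometric_mean, harmonic_mean, mean

rates = [2.0, 4.0, 8.0]  # strictly positive, as both means require

am = mean(rates)            # arithmetic mean
gm = geometric_mean(rates)  # exp of the mean of the logs
hm = harmonic_mean(rates)   # n divided by the sum of reciprocals

print(f"HM = {hm:.4f}, GM = {gm:.4f}, AM = {am:.4f}")
```

For these values the harmonic mean is 24/7, the geometric mean is 4, and the arithmetic mean is 14/3, so the inequality holds strictly.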

3.3 Measures of Dispersion

3.3.1 Range

The range is the simplest measure of spread:

$$\text{Range} = x_{(n)} - x_{(1)} = \max(x) - \min(x)$$

The range is easy to compute and interpret but is maximally non-robust: it is determined entirely by the two most extreme observations, and it can only stay the same or grow as more observations are added, so ranges from samples of different sizes are not directly comparable.

3.3.2 Interquartile Range

The interquartile range (IQR) is the range of the middle 50% of the data:

$$IQR = Q_3 - Q_1$$

Where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile. The IQR is the most widely used robust measure of dispersion. It is unaffected by the values of the most extreme observations and provides direct information about the spread of the central bulk of the data.

3.3.3 Percentiles and Quartiles

The $p$-th percentile $P_p$ is the value below which $p\%$ of observations fall. For $n$ ordered observations, the percentile is computed using linear interpolation:

$$L = \frac{p(n-1)}{100} + 1$$

If $L$ is an integer, $P_p = x_{(L)}$. Otherwise, $P_p = x_{(\lfloor L \rfloor)} + (L - \lfloor L \rfloor)(x_{(\lceil L \rceil)} - x_{(\lfloor L \rfloor)})$.

Key percentiles:

| Percentile | Symbol | Alternative Name |
|---|---|---|
| 25th | $Q_1$ | First quartile, lower quartile |
| 50th | $Q_2$ | Median, second quartile |
| 75th | $Q_3$ | Third quartile, upper quartile |
| 10th, 20th, …, 90th | $D_1, D_2, \ldots, D_9$ | Deciles |
| 1st, 2nd, …, 99th | $P_1, P_2, \ldots, P_{99}$ | Percentiles |

⚠️ Multiple methods exist for computing percentiles (e.g., the nine quantile types in R, or Excel's PERCENTILE.INC and PERCENTILE.EXC). These differ in how they handle boundary cases and interpolation. DataStatPro uses the standard linear interpolation method (R's Type 7, equivalent to Excel's inclusive method) by default. The method is stated in the output footnote.
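
The linear interpolation rule can be sketched as follows. The function name `percentile_type7` and the sample data are illustrative assumptions; the code follows the 1-based position formula given in this section and is not DataStatPro's actual implementation.

```python
import math


def percentile_type7(values, p):
    """p-th percentile by linear interpolation between order statistics."""
    if not 0 <= p <= 100:
        raise ValueError("p must lie between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    pos = p * (n - 1) / 100 + 1         # 1-based position L
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo                     # fractional part of L
    # Convert 1-based order-statistic indices to 0-based list indices.
    return ordered[lo - 1] + frac * (ordered[hi - 1] - ordered[lo - 1])


data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
q1, q2, q3 = (percentile_type7(data, p) for p in (25, 50, 75))
print(q1, q2, q3, "IQR =", q3 - q1)
```

For this sample the quartiles are 5.5, 10.0, and 14.5, which agrees with `statistics.quantiles(data, n=4, method="inclusive")` from the Python standard library.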

3.3.4 Variance

The sample variance $s^2$ measures the average squared deviation from the mean. It uses $n - 1$ in the denominator (Bessel's correction) to produce an unbiased estimate of the population variance $\sigma^2$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{\sum_{i=1}^n x_i^2 - n\bar{x}^2}{n-1}$$

The divisor $n - 1$ reflects the one degree of freedom consumed in estimating $\bar{x}$. Using $n$ instead of $n-1$ yields the population variance formula (the maximum likelihood estimator under normality), which underestimates $\sigma^2$ on average; the bias is most noticeable for small samples.
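
The effect of the divisor can be seen in a short sketch. The function name `variance` and its `ddof` argument (the amount subtracted from the sample size in the divisor) are illustrative choices, and the data are made up.

```python
def variance(values, ddof=1):
    """Sum of squared deviations divided by (n - ddof)."""
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more observations than ddof")
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)  # sum of squared deviations
    return ss / (n - ddof)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print("sample variance (n - 1):", variance(data, ddof=1))
print("population formula (n): ", variance(data, ddof=0))

# Computational shortcut: (sum of squares - n * mean^2) / (n - 1)
n, mean = len(data), sum(data) / len(data)
shortcut = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)
print("shortcut form:          ", shortcut)
```

Here the sum of squared deviations is 32, so the sample variance is 32/7 and the divide-by-n version is exactly 4.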

3.3.5 Standard Deviation

The sample standard deviation $s$ is the square root of the sample variance:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

The standard deviation is expressed in the same units as the original variable, making it directly interpretable as a typical distance from the mean. Under normality, approximately 68% of observations fall within $\bar{x} \pm s$.

3.3.6 Standard Error of the Mean

The standard error of the mean (SEM or SE) quantifies the precision of $\bar{x}$ as an estimate of $\mu$; it is the standard deviation of the sampling distribution of the mean:

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

As $n$ increases, $SE_{\bar{x}}$ decreases proportionally to $1/\sqrt{n}$. The SEM is not a measure of the spread of individual observations (that role belongs to $s$).

⚠️ A common and serious error is reporting the SEM instead of the SD as the measure of variability in individual observations. The SEM is always smaller than the SD and can give a misleading impression of how variable the data are. Use SD to describe variability in the data; use SEM to describe precision of the mean estimate.
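
The distinction can be illustrated with a brief sketch. The function name `sem` and the data are illustrative assumptions; `statistics.stdev` computes the sample standard deviation with the n - 1 divisor.

```python
import math
from statistics import stdev


def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(values) / math.sqrt(len(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(f"SD  = {stdev(data):.3f}  (spread of individual observations)")
print(f"SEM = {sem(data):.3f}  (precision of the mean estimate)")
```

Quadrupling the sample size would roughly halve the SEM, whereas the SD would stay at about the same level because it describes the data rather than the estimate.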

3.3.7 Coefficient of Variation

The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean, enabling comparison of variability across variables measured on different scales or with different units:

$$CV = \frac{s}{\bar{x}} \times 100\%$$

The CV is a dimensionless measure of relative variability. It is only meaningful for ratio-scale variables (true zero exists) and when $\bar{x} > 0$. CV is widely used in clinical chemistry, analytical measurement science, and quality control.

| CV | Verbal Label |
|---|---|
| $< 10\%$ | Low variability |
| $10\% - 20\%$ | Moderate variability |
| $20\% - 30\%$ | High variability |
| $> 30\%$ | Very high variability |

3.3.8 Mean Absolute Deviation

The mean absolute deviation (MAD from mean) is an alternative to the standard deviation that is less sensitive to outliers:

$$MAD_{mean} = \frac{1}{n} \sum_{i=1}^n |x_i - \bar{x}|$$

Unlike the variance, which squares deviations (over-weighting outliers), the MAD from mean uses absolute deviations. Because it is still centred on the mean, it is only moderately robust. Under normality, $MAD_{mean} \approx 0.798 \times s$.

3.3.9 Median Absolute Deviation

The median absolute deviation (MAD from median) is the most robust common measure of spread:

$$MAD_{median} = \text{median}\!\left(|x_i - \tilde{x}|\right)$$

To use $MAD_{median}$ as a robust estimator of the standard deviation under normality, apply the consistency factor:

$$\hat{\sigma}_{robust} = 1.4826 \times MAD_{median}$$

$MAD_{median}$ has a 50% breakdown point: it remains stable even when almost half of the observations are outliers. DataStatPro reports both MAD measures, labelling them clearly as "MAD (from mean)" and "MAD (from median)".
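
A minimal sketch of the median-based MAD and its rescaled version follows. The function names `mad_median` and `robust_sd` and the sample values are illustrative assumptions.

```python
from statistics import median, stdev


def mad_median(values):
    """Median of absolute deviations from the sample median."""
    centre = median(values)
    return median(abs(x - centre) for x in values)


def robust_sd(values):
    """MAD rescaled by 1.4826 so it estimates sigma under normality."""
    return 1.4826 * mad_median(values)


data = [1, 2, 3, 4, 100]  # one gross outlier
print("MAD (from median):", mad_median(data))
print("robust SD estimate:", robust_sd(data))
print("ordinary SD:", round(stdev(data), 2))
```

The outlier inflates the ordinary SD to over 40, while the MAD stays at 1 because only the middle of the sorted absolute deviations matters.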

3.4 Measures of Shape

3.4.1 Skewness

Skewness measures the degree and direction of asymmetry in the distribution:

$$g_1 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2\right)^{3/2}}$$

The bias-corrected sample skewness (Fisher's formula, used by most software including DataStatPro):

$$G_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s}\right)^3$$

| $G_1$ | Interpretation |
|---|---|
| $G_1 = 0$ | Perfectly symmetric |
| $G_1 > 0$ | Positively skewed (right-tailed; long right tail; typically mean $>$ median) |
| $G_1 < 0$ | Negatively skewed (left-tailed; long left tail; typically mean $<$ median) |
$G_1
$G_1

3.4.2 Kurtosis

Kurtosis measures the heaviness of the tails of the distribution relative to a normal distribution (it is often loosely described as peakedness, but the statistic is driven mainly by the tails):

$$g_2 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2\right)^{2}}$$

The bias-corrected excess kurtosis (subtracting 3 so that a normal distribution has excess kurtosis = 0) is:

$$G_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$

| $G_2$ | Distribution Type | Interpretation |
|---|---|---|
| $G_2 = 0$ | Mesokurtic | Normal-like tails |
| $G_2 > 0$ | Leptokurtic | Heavier tails than normal; more extreme outliers |
| $G_2 < 0$ | Platykurtic | Lighter tails than normal; fewer extreme values |
$G_2> 2$

⚠️ Both skewness and kurtosis are sensitive to sample size and outliers. With $n < 50$, these estimates are highly unstable. Use formal normality tests (Shapiro-Wilk) to complement visual inspection for normality assessment.
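
The bias-corrected shape statistics can be sketched directly from their formulas. The function names `sample_skewness` and `sample_excess_kurtosis` and the data are illustrative assumptions; this is a plain transcription of the two bias-corrected formulas in this section, not DataStatPro's implementation.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Bias-corrected skewness G1 (requires n >= 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


def sample_excess_kurtosis(values):
    """Bias-corrected excess kurtosis G2 (requires n >= 4)."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 observations")
    m, s = mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * fourth - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]
print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(right_skewed))
```

The symmetric sample gives a skewness of essentially zero and an excess kurtosis of about -1.2 (a flat, light-tailed pattern), while the sample with one large value has clearly positive skewness.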

3.5 Measures of Position

3.5.1 Z-Scores (Standardised Values)

The z-score of observation $x_i$ expresses its distance from the mean in standard deviation units:

$$z_i = \frac{x_i - \bar{x}}{s}$$

Z-scores enable comparison of observations from different variables or different populations. Under normality, $|z_i| > 3$ is a widely used criterion for flagging potential outliers.

3.5.2 Percentile Rank

The percentile rank of observation $x_i$ is the percentage of observations in the sample that are less than or equal to $x_i$:

$$PR(x_i) = \frac{\#\{j : x_j \leq x_i\}}{n} \times 100\%$$

Percentile ranks are used for norm-referenced scoring (e.g., standardised test reporting) and for identifying the relative standing of individual observations.
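
The definition above translates into a one-line count. The function name `percentile_rank` and the scores are illustrative assumptions.

```python
def percentile_rank(values, x):
    """Percentage of observations less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)


scores = [55, 60, 65, 70, 70, 80, 90, 95]
print(percentile_rank(scores, 70))  # 5 of 8 scores are <= 70
```

With this "less than or equal to" convention, tied values share the same percentile rank and the maximum always has a rank of 100%.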

3.6 The Five-Number Summary

The five-number summary provides a compact, robust distributional snapshot:

$$\{\text{Min},\; Q_1,\; \text{Median},\; Q_3,\; \text{Max}\}$$

This summary is the basis for the box plot (Section 9.3) and directly communicates the range, central tendency, spread (IQR), and potential skewness of the distribution. It is resistant to outliers (except for the min and max).

3.7 Normality Assessment Statistics

3.7.1 Shapiro-Wilk Test

The Shapiro-Wilk test is generally regarded as one of the most powerful tests of normality for small to moderate samples ($n \leq 2000$). Its statistic behaves like a squared correlation between the ordered sample values and the expected order statistics under normality:

$$W = \frac{\left(\sum_{i=1}^n a_i x_{(i)}\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

Where $a_i$ are the Shapiro-Wilk coefficients derived from the expected normal order statistics. $W$ ranges from 0 to 1; values close to 1 are consistent with normality.

$H_0$: The data are drawn from a normal distribution.

If $p < \alpha$, reject normality.

3.7.2 Kolmogorov-Smirnov Test (Lilliefors Correction)

The Kolmogorov-Smirnov test (with Lilliefors correction for estimated parameters) computes the maximum absolute difference between the empirical CDF and the normal CDF with estimated $\mu$ and $\sigma$:

$$D = \sup_x |F_n(x) - F(x;\hat{\mu}, \hat{\sigma})|$$

The Lilliefors test is less powerful than Shapiro-Wilk for detecting non-normality but is useful as a supplementary check, especially for larger samples.

3.7.3 Skewness and Kurtosis Tests

Formal tests for normality using the skewness and kurtosis statistics:

Standard errors under normality:

$$SE_{G_1} \approx \sqrt{\frac{6n(n-1)}{(n-2)(n+1)(n+3)}}, \qquad SE_{G_2} \approx 2 \times SE_{G_1} \times \sqrt{\frac{n^2 - 1}{(n-3)(n+5)}}$$

Test statistics:

$$z_{G_1} = \frac{G_1}{SE_{G_1}}, \qquad z_{G_2} = \frac{G_2}{SE_{G_2}}$$

Values $|z| > 1.96$ suggest significant departure from normality at $\alpha = .05$.
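
The standard errors depend only on the sample size, so they are easy to tabulate. The sketch below transcribes the two formulas above; the function names and the example skewness value of 0.60 are illustrative assumptions.

```python
import math


def se_skewness(n):
    """Approximate standard error of G1 under normality."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Approximate standard error of G2 under normality."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


n = 100
g1 = 0.60  # hypothetical observed skewness
z_g1 = g1 / se_skewness(n)
print(f"SE(G1) = {se_skewness(n):.3f}, SE(G2) = {se_kurtosis(n):.3f}")
print(f"z for skewness = {z_g1:.2f}, exceeds 1.96: {abs(z_g1) > 1.96}")
```

For a sample of 100 the standard errors are about 0.241 and 0.478, so a skewness of 0.60 would give a z value near 2.5 and be flagged at the 5% level.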


4. Considerations and Data Quality Checks

4.1 Identifying Outliers

An outlier is an observation that appears inconsistent with the bulk of the data. Outliers can arise from data entry errors, measurement malfunctions, genuine extreme values, or distributional heavy tails. It is critical to identify, investigate, and make a principled decision about outliers before computing and reporting descriptives.

Common outlier detection methods:

| Method | Rule | Appropriate When |
|---|---|---|
| IQR fence (Tukey) | Outlier if $x < Q_1 - 1.5 \times IQR$ or $x > Q_3 + 1.5 \times IQR$ | General use; robust |
| Extreme fence (Tukey) | Extreme outlier if $x < Q_1 - 3 \times IQR$ or $x > Q_3 + 3 \times IQR$ | Identifying severe outliers |
| Z-score rule | Outlier if $\lvert z_i \rvert > 3$ | Approximately normal data (see Section 3.5.1) |
| Modified Z-score (Iglewicz-Hoaglin) | Outlier if $\lvert M_i \rvert > 3.5$ where $M_i = 0.6745(x_i - \tilde{x})/MAD_{median}$ | Robust; preferred for skewed data |
| Grubbs' test | Formal significance test for the single most extreme value | Normal distribution; single outlier |
| Visual inspection | Box plot, histogram, Q-Q plot | Always; complements formal methods |

⚠️ Outlier removal must be justified on scientific or methodological grounds, not statistical convenience. An outlier is not grounds for deletion merely because it is extreme — it may represent the most scientifically interesting observation. Document all outlier decisions transparently. DataStatPro flags outliers but never removes them automatically.
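
Two of the flagging rules from the table can be sketched as follows. The function names and data are illustrative assumptions; quartiles come from `statistics.quantiles` with `method="inclusive"`, which uses linear interpolation between order statistics. In keeping with the note above, the functions only report flagged values and never drop them.

```python
from statistics import median, quantiles


def tukey_outliers(values, k=1.5):
    """Values outside Q1 - k*IQR or Q3 + k*IQR (k=3 gives extreme fences)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


def modified_z_outliers(values, threshold=3.5):
    """Values whose Iglewicz-Hoaglin modified z-score exceeds the threshold."""
    centre = median(values)
    mad = median(abs(x - centre) for x in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [x for x in values if abs(0.6745 * (x - centre) / mad) > threshold]


data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print("Tukey fence flags:     ", tukey_outliers(data))
print("Modified z-score flags:", modified_z_outliers(data))
```

Both rules flag only the value 102 in this sample. Flagged values still need to be investigated before any decision is made about them.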

4.2 Missing Data Assessment

Before any numerical summary is computed, the extent and pattern of missing data must be evaluated. Report $n_{miss}$ and the missing rate $n_{miss}/n_{total}$. Investigate the missing data mechanism (MCAR, MAR, or MNAR) as in any statistical analysis.

Impact of missing data on numerical descriptives:

  • Complete case analysis (default): Compute descriptives on valid observations only. Unbiased under MCAR; potentially biased under MAR or MNAR.
  • Mean/median imputation: Replace missing values with the sample mean or median. Reduces apparent variability; distorts distributional shape. Not recommended.
  • Multiple imputation: Statistically principled but complex; appropriate for inference, less so for purely descriptive purposes.

DataStatPro reports $n_{valid}$ and $n_{miss}$ in all output tables.

4.3 Checking Variable Type and Scale

Before computing numerical descriptives, confirm that the variable is genuinely measured on an interval or ratio scale. Common pitfalls:

| Variable | Problem | Correct Action |
|---|---|---|
| Likert item (1–5) | Ordinal, not interval | Report categorical and/or ordinal descriptives |
| Coded categorical (1 = Male, 2 = Female) | Nominal, not numerical | Recode and use categorical descriptives |
| Year of birth | Interval; ratios not meaningful | Report mean year, not "2× older" |
| Count (non-negative integer) | Discrete ratio; may be skewed | Consider median and IQR; check for zero-inflation |

4.4 Distributional Assumptions

Many uses of numerical descriptives (confidence intervals for the mean, standard error interpretation, power calculations) assume approximate normality. Check this assumption using:

  1. Histograms: Assess overall shape visually.
  2. Q-Q plot: Plot sample quantiles against theoretical normal quantiles; departures from the diagonal indicate non-normality.
  3. Shapiro-Wilk test: Formal test of normality.
  4. Skewness and kurtosis: Numerical shape indicators.

When data are non-normal, prefer robust measures (median, IQR, MAD) over mean-based summaries, especially with small samples.

4.5 Sample Size Adequacy

The stability and interpretability of numerical descriptives depend on $n$:

| $n$ | Guidance |
|---|---|
| $< 10$ | Descriptives are very unstable; report individual values or at most min, max, median |
| $10 - 30$ | Mean and SD interpretable; normality hard to assess; use median/IQR as supplement |
| $30 - 100$ | Most descriptives reasonably stable; normality assessment meaningful |
| $> 100$ | Descriptives reliable; shape statistics informative; skewness and kurtosis stable |
| $> 1000$ | Very precise estimates; even tiny departures from normality detected by formal tests |

4.6 Significant Figures and Rounding

Numerical descriptives should be reported to a level of precision consistent with the measurement instrument:

  • Report the mean and standard deviation to one more decimal place than the raw data.
  • Report the median and IQR to the same precision as the raw data.
  • Report test statistics ($z$, $t$, $W$) to two decimal places.
  • Report p-values to two or three decimal places (or as $< .001$).
  • Do not report spurious precision (e.g., mean $= 3.14159265$ for data measured to the nearest integer).

4.7 Weighted Data

For survey data with design weights, apply weights when computing all numerical descriptives to produce population-representative estimates:

$$\bar{x}^{(w)} = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}, \qquad s^{2(w)} = \frac{\sum_{i=1}^n w_i (x_i - \bar{x}^{(w)})^2}{\sum_{i=1}^n w_i - 1}$$

DataStatPro supports weighted numerical descriptives when a weight variable is specified, reporting both unweighted and weighted estimates side by side.
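
The weighted formulas can be sketched as below. The function names and the toy values are illustrative assumptions. The variance divisor (sum of weights minus one) follows the formula given in this section, which behaves like a frequency-weight convention; other weighting conventions use different divisors.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def weighted_variance(values, weights):
    """Weighted squared deviations divided by (sum of weights - 1)."""
    m = weighted_mean(values, weights)
    ss = sum(w * (x - m) ** 2 for x, w in zip(values, weights))
    return ss / (sum(weights) - 1)


x = [1, 2, 3]
w = [1, 1, 2]  # the third observation counts twice
print("weighted mean:    ", weighted_mean(x, w))
print("weighted variance:", weighted_variance(x, w))
print("unit weights:     ", weighted_mean(x, [1, 1, 1]), weighted_variance(x, [1, 1, 1]))
```

With all weights equal to 1, both functions reduce to the ordinary sample mean and the n - 1 sample variance.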


5. Types of Numerical Descriptive Measures

5.1 Measures of Central Tendency

| Measure | Formula | Robust? | Best For |
|---|---|---|---|
| Arithmetic mean | $\bar{x} = \frac{1}{n}\sum x_i$ | No | Symmetric distributions; no outliers |
| Median | Middle value of sorted data | Yes | Skewed distributions; outliers present |
| Mode | Most frequent value | Yes | Discrete data; bimodal distributions |
| Trimmed mean ($\alpha$%) | Mean after removing $\alpha$% from each tail | Moderate | Mild outliers; alternatives to mean |
| Geometric mean | $(\prod x_i)^{1/n}$ | No | Multiplicative data; log-normal distributions |
| Harmonic mean | $n / \sum (1/x_i)$ | No | Rates and ratios |

5.2 Measures of Dispersion

| Measure | Formula | Robust? | Best For |
|---|---|---|---|
| Range | $x_{(n)} - x_{(1)}$ | No | Quick overview; not for inference |
| IQR | $Q_3 - Q_1$ | Yes | Paired with median; skewed data |
| Variance | $s^2 = \frac{\sum(x_i-\bar{x})^2}{n-1}$ | No | Theoretical derivations; inferential tests |
| Standard deviation | $s = \sqrt{s^2}$ | No | Symmetric data; normal distribution |
| SEM | $s/\sqrt{n}$ | No | Precision of mean estimate; confidence intervals |
| CV | $s/\bar{x} \times 100\%$ | No | Comparing variability across different scales |
| MAD (mean) | $\frac{1}{n}\sum \lvert x_i - \bar{x} \rvert$ | Moderate | Robust alternative to SD |
| MAD (median) | $\text{median}(\lvert x_i - \tilde{x} \rvert)$ | Yes | Maximum robustness; extreme outliers |

5.3 Measures of Shape

| Measure | Formula | Normal Value | Interpretation |
|---|---|---|---|
| Skewness ($G_1$) | Third standardised central moment | 0 | Asymmetry of distribution |
| Excess kurtosis ($G_2$) | Fourth standardised central moment $- 3$ | 0 | Tail heaviness relative to normal |

5.4 Measures of Position

| Measure | Definition | Use |
|---|---|---|
| Minimum ($x_{(1)}$) | Smallest observation | Range; outlier detection |
| Maximum ($x_{(n)}$) | Largest observation | Range; outlier detection |
| Percentiles ($P_p$) | Value below which $p$% of data fall | Norm-referenced scores; reference ranges |
| Quartiles ($Q_1, Q_2, Q_3$) | 25th, 50th, 75th percentiles | Five-number summary; box plot |
| Z-score | $(x_i - \bar{x})/s$ | Standardisation; outlier detection |

5.5 Five-Number Summary

$$\{x_{(1)},\; Q_1,\; \tilde{x},\; Q_3,\; x_{(n)}\}$$

Compact, robust distributional description; forms the basis for box plots.

5.6 Normality Diagnostics

| Diagnostic | Type | Output |
|---|---|---|
| Shapiro-Wilk test | Formal test | $W$ statistic and p-value |
| Kolmogorov-Smirnov (Lilliefors) | Formal test | $D$ statistic and p-value |
| Skewness z-test | Formal test | $z_{G_1}$ and p-value |
| Kurtosis z-test | Formal test | $z_{G_2}$ and p-value |
| Histogram | Visual | Shape assessment |
| Q-Q plot | Visual | Quantile-by-quantile deviation from normality |
| Box plot | Visual | Symmetry, spread, and outlier identification |
Box plotVisualSymmetry, spread, and outlier identification

6. Using the Numerical Descriptives Calculator Component

The Numerical Descriptives Calculator in DataStatPro provides a fully featured tool for computing, diagnosing, visualising, and reporting descriptive statistics for numerical variables.

Step-by-Step Guide

Step 1 — Navigate to the Component

Go to Descriptive Statistics → Numerical Descriptives.

Step 2 — Input Method

Choose how to provide your data:

  • Raw data: Upload a CSV/Excel file or paste a column of numeric values. DataStatPro automatically detects the variable type, identifies non-numeric entries, and flags missing values.
  • Multiple variables: Select two or more numeric columns to run batch descriptives across all selected variables simultaneously, producing a comparative summary table.
  • Grouped analysis: Designate a categorical grouping variable to compute descriptives separately within each group, enabling direct comparison of distributional summaries across subgroups.

Step 3 — Variable Configuration

  • Assign a meaningful variable name and unit of measurement for display.
  • Specify the measurement scale (interval or ratio) to unlock scale-appropriate measures (e.g., CV requires ratio scale).
  • Specify a grouping variable (optional) to produce stratified descriptives.
  • Specify a weight variable (optional) for survey-weighted estimates.
  • Designate whether log-transformation should be applied for geometric mean reporting (appropriate for log-normal data).

Step 4 — Missing Data Handling

Select one of the following:

  • Exclude missing (valid $n$ only): All summaries computed on valid observations. $n_{miss}$ reported separately.
  • Flag and exclude: Missing values flagged and listed; summaries exclude missing.

Step 5 — Outlier Handling

  • Select outlier detection method: Tukey IQR fence (default), Z-score ($|z| > 3$), or Modified Z-score (Iglewicz-Hoaglin).
  • Choose whether to flag only or to compute descriptives both with and without flagged outliers (DataStatPro never removes outliers automatically).

Step 6 — Set Display Options

  • ✅ $n_{valid}$, $n_{miss}$, and total $n$.
  • ✅ Mean, median, mode, trimmed mean (selectable $\alpha$), geometric mean, harmonic mean.
  • ✅ Minimum, maximum, range.
  • ✅ $Q_1$, $Q_3$, IQR, and full percentile table (selectable percentile set).
  • ✅ Variance ($s^2$), standard deviation ($s$), SEM ($SE_{\bar{x}}$).
  • ✅ Coefficient of variation (CV).
  • ✅ MAD (from mean and from median), with robust $\hat{\sigma}$.
  • ✅ Skewness ($G_1$) and excess kurtosis ($G_2$) with standard errors and z-tests.
  • ✅ Five-number summary table.
  • ✅ Shapiro-Wilk and Lilliefors normality tests with p-values.
  • ✅ Outlier table with flagging method, z-score, and modified z-score.
  • ✅ 95% confidence intervals for the mean (t-based), median (bootstrap), and SD.
  • ✅ Histogram with optional normal curve overlay, density curve, and rug plot.
  • ✅ Box plot (standard and notched).
  • ✅ Violin plot.
  • ✅ Q-Q plot with confidence band.
  • ✅ Empirical cumulative distribution function (ECDF) plot.
  • ✅ Dot plot / strip chart.
  • ✅ Stem-and-leaf display (for $n \leq 200$).
  • ✅ Grouped comparison plots (when grouping variable specified).
  • ✅ APA 7th edition results paragraph (auto-generated).
  • ✅ Publication-ready descriptives table.

Step 7 — Run the Analysis

Click "Compute Numerical Descriptives". DataStatPro will:

  1. Validate data: check variable type, identify non-numeric entries, count missing values.
  2. Compute the complete set of central tendency, dispersion, shape, and position measures.
  3. Detect outliers using the selected method and produce an outlier report.
  4. Run Shapiro-Wilk and Lilliefors normality tests.
  5. Compute 95% CIs for the mean (t-distribution), median (bootstrap), and SD (chi-square).
  6. Generate all selected visualisations with customisable formatting.
  7. Produce the APA-compliant results paragraph and formatted descriptives table.

7. Step-by-Step Procedure

7.1 Full Manual Procedure

Step 1 — Define the Variable

State the variable name, unit of measurement, scale (interval or ratio), and the population of observations. Confirm that the variable is genuinely numerical.

Step 2 — Count Total and Missing Observations

$$n_{total} = n_{valid} + n_{miss}$$

Report $n_{miss}$ and the missing rate $n_{miss}/n_{total}$. Apply the chosen missing data strategy before proceeding.

Step 3 — Sort the Data

Sort the $n_{valid}$ observations in ascending order to obtain the order statistics $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$.

Step 4 — Compute Central Tendency Measures

Mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Median:

$$\tilde{x} = \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\ (x_{(n/2)} + x_{(n/2+1)})/2 & n \text{ even} \end{cases}$$

Mode: Identify the most frequently occurring value(s).

Geometric mean (if applicable):

$$\bar{x}_{geom} = \exp\!\left(\frac{1}{n}\sum_{i=1}^n \ln x_i\right)$$

Step 5 — Compute the Five-Number Summary

Identify $\{x_{(1)},\; Q_1,\; \tilde{x},\; Q_3,\; x_{(n)}\}$, using linear interpolation for $Q_1$ and $Q_3$.

$$IQR = Q_3 - Q_1$$

$$\text{Range} = x_{(n)} - x_{(1)}$$
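As a rough illustration of Steps 3 to 5, the sketch below computes the mean, median, and five-number summary with the Python standard library. The helper names `quantile_linear` and `five_number_summary` and the sample values are hypothetical. The interpolation rule (1-based position $(n-1)p + 1$ among the order statistics) is the one used in the worked examples of Section 12; other software may default to a different quantile definition.

```python
import statistics

def quantile_linear(sorted_x, p):
    """Quantile by linear interpolation at 1-based position (n - 1) * p + 1."""
    n = len(sorted_x)
    pos = (n - 1) * p            # same position, expressed as a 0-based index
    lo = int(pos)                # order statistic at or just below the position
    hi = min(lo + 1, n - 1)      # next order statistic (clamped at the maximum)
    frac = pos - lo
    return sorted_x[lo] + frac * (sorted_x[hi] - sorted_x[lo])

def five_number_summary(values):
    x = sorted(values)           # Step 3: order statistics
    return {
        "min": x[0],
        "q1": quantile_linear(x, 0.25),
        "median": statistics.median(x),
        "q3": quantile_linear(x, 0.75),
        "max": x[-1],
    }

# Hypothetical sample of eight measurements
data = [12, 15, 11, 19, 14, 13, 22, 16]
summary = five_number_summary(data)
iqr = summary["q3"] - summary["q1"]
value_range = summary["max"] - summary["min"]
print("mean:", statistics.mean(data))
print("five-number summary:", summary)
print("IQR:", iqr, "range:", value_range)
```

Sorting once and reusing the ordered list keeps the quartile, median, and range calculations consistent with one another.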

Step 6 — Compute Dispersion Measures

Variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$

Standard deviation:

$$s = \sqrt{s^2}$$

Standard error of the mean:

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

Coefficient of variation (ratio scale only):

$$CV = \frac{s}{\bar{x}} \times 100\%$$

Median absolute deviation:

$$MAD_{median} = \text{median}(|x_i - \tilde{x}|)$$
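The dispersion formulas in Step 6 can be sketched in a few lines of standard-library Python. The function name `dispersion_summary` and the data are hypothetical, and the CV line assumes a ratio-scale variable with a positive mean.

```python
import math
import statistics

def dispersion_summary(values):
    n = len(values)
    mean = statistics.mean(values)
    var = statistics.variance(values)        # sample variance, divisor n - 1
    sd = math.sqrt(var)
    sem = sd / math.sqrt(n)                  # standard error of the mean
    cv_percent = sd / mean * 100             # only meaningful on a ratio scale
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return {
        "variance": var,
        "sd": sd,
        "sem": sem,
        "cv_percent": cv_percent,
        "mad_median": mad,
    }

# Hypothetical sample of eight measurements
data = [12, 15, 11, 19, 14, 13, 22, 16]
for name, value in dispersion_summary(data).items():
    print(f"{name}: {value:.3f}")
```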

Step 7 — Compute Shape Measures

Skewness ($G_1$) and excess kurtosis ($G_2$) using bias-corrected formulae (Section 3.4).

Compute $z_{G_1} = G_1/SE_{G_1}$ and $z_{G_2} = G_2/SE_{G_2}$ to formally test departure from normality.
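A minimal sketch of Step 7 follows. Section 3.4 is not reproduced here, so the code assumes the commonly used adjusted Fisher-Pearson estimators for $G_1$ and $G_2$ and their usual normal-theory standard errors; if Section 3.4 defines them differently, the formulas below would need adjusting. The function name `shape_measures` is hypothetical, and the estimators require $n \geq 4$.

```python
import math

def shape_measures(values):
    n = len(values)
    mean = sum(values) / n
    # Central moments with divisor n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    g1 = m3 / m2 ** 1.5                      # uncorrected skewness
    g2 = m4 / m2 ** 2 - 3                    # uncorrected excess kurtosis
    # Bias-corrected (adjusted Fisher-Pearson) versions; assumed to match Section 3.4
    G1 = math.sqrt(n * (n - 1)) / (n - 2) * g1
    G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    # Standard errors under normality
    se_G1 = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_G2 = 2 * se_G1 * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return {
        "G1": G1, "se_G1": se_G1, "z_G1": G1 / se_G1,
        "G2": G2, "se_G2": se_G2, "z_G2": G2 / se_G2,
    }

# Hypothetical symmetric sample
print(shape_measures([1, 2, 3, 4, 5]))
```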

Step 8 — Detect Outliers

Apply Tukey's IQR fence:

$$\text{Lower fence} = Q_1 - 1.5 \times IQR$$

$$\text{Upper fence} = Q_3 + 1.5 \times IQR$$

Flag any $x_i$ outside the fences as a potential outlier. Investigate each flagged observation individually.
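The fence rule can be sketched as below. The function names are hypothetical; the scores and quartiles are those of Example 1 in Section 12 ($Q_1 = 63.50$, $Q_3 = 77.50$). The code only flags values and leaves the decision about each one to the analyst.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for multiplier k."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, q1, q3, k=1.5):
    """List the values lying outside the Tukey fences (flag only, never remove)."""
    lower, upper = tukey_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]

scores = [42, 51, 55, 58, 61, 63, 65, 67, 68, 70, 71,
          72, 73, 74, 75, 76, 78, 80, 82, 85, 88, 93]
print(tukey_fences(63.5, 77.5))            # (42.5, 98.5)
flagged = flag_outliers(scores, 63.5, 77.5)
print(flagged)                             # [42]
```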

Step 9 — Assess Normality

Compute the Shapiro-Wilk statistic $W$ (for $n \leq 2000$). Inspect the histogram and Q-Q plot. Report skewness and kurtosis. Conclude whether the normality assumption is tenable.

Step 10 — Compute Confidence Intervals

95% CI for the mean (t-based):

$$\bar{x} \pm t_{\alpha/2,\; n-1} \times SE_{\bar{x}}$$

Where $t_{\alpha/2,\; n-1}$ is the upper $\alpha/2$ critical value of the t-distribution with $n - 1$ degrees of freedom (for example, $t \approx 2.306$ for $n = 9$ and $t \approx 1.980$ for $n = 120$, approaching $1.960$ as $n \to \infty$).

95% CI for the standard deviation (chi-square based):

$$\left[\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2,\; n-1}}},\;\; \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\; n-1}}}\right]$$

Here $\chi^2_{p,\; n-1}$ denotes the upper-tail critical value (the value exceeded with probability $p$), so the larger critical value $\chi^2_{\alpha/2,\; n-1}$ gives the lower limit.

95% CI for the median (bootstrap): Use $B = 2000$ bootstrap resamples to compute the percentile bootstrap CI.
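A minimal sketch of the percentile bootstrap for the median is shown below, using only the standard library. The function name, the fixed seed, and the data are hypothetical, and this is the simple percentile method rather than the BCa variant; the percentiles are approximated by order statistics of the sorted bootstrap medians.

```python
import random
import statistics

def percentile_bootstrap_ci(values, stat=statistics.median,
                            n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, take empirical percentiles."""
    rng = random.Random(seed)              # fixed seed so the sketch is reproducible
    n = len(values)
    boot = sorted(stat(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = round((alpha / 2) * n_boot)           # e.g. index 50 of 2000
    hi_idx = round((1 - alpha / 2) * n_boot) - 1   # e.g. index 1949 of 2000
    return boot[lo_idx], boot[hi_idx]

# Hypothetical right-skewed sample
data = [3, 5, 7, 8, 9, 11, 12, 14, 18, 25, 31, 47]
lo, hi = percentile_bootstrap_ci(data)
print(f"median = {statistics.median(data)}, 95% CI = [{lo}, {hi}]")
```

Passing a different `stat` (for example `statistics.mean`) reuses the same resampling loop for other statistics.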

Step 11 — Produce Visualisations

Select appropriate chart types (see Section 9), annotate with key descriptive values (mean, median, SD), and ensure axes are labelled with variable name and units.

Step 12 — Interpret and Report

Use APA reporting guidelines (Section 15). Always report $n_{valid}$, $n_{miss}$, mean, SD, median, IQR, min, max, skewness, and the result of the normality assessment. Report CIs for the mean. For non-normal distributions, emphasise the median and IQR.


8. Interpreting the Output

8.1 Central Tendency: Mean vs. Median

| Relationship | Distribution Shape | Preferred Measure |
| :---- | :---- | :---- |
| Mean $\approx$ Median | Symmetric | Either; report both |
| Mean $>$ Median (substantially) | Positively skewed (right tail) | Median |
| Mean $<$ Median (substantially) | Negatively skewed (left tail) | Median |
| Mean $\gg$ Median | Outliers or extreme right skew | Median; investigate outliers |
| Large discrepancy; small $n$ | Insufficient data to assess | Report both with caution |

8.2 Interpreting the Standard Deviation

| Relationship Between $s$ and $\bar{x}$ | Interpretation |
| :---- | :---- |
| $s \ll \bar{x}$ (small CV) | Values tightly clustered around the mean |
| $s \approx \bar{x}$ (CV $\approx$ 100%) | Very high relative variability |
| $s > \bar{x}$ (CV $> 100\%$) | Extreme variability; check for outliers or zero-inflation |
| Under normality: most values in $\bar{x} \pm 2s$ | Approximately 95% of observations lie in this range |

8.3 Interpreting the IQR and Five-Number Summary

| Five-Number Feature | Interpretation |
| :---- | :---- |
| Symmetric: $\tilde{x} - Q_1 \approx Q_3 - \tilde{x}$ | Symmetric distribution |
| $\tilde{x}$ closer to $Q_1$ than $Q_3$ | Positively skewed |
| $\tilde{x}$ closer to $Q_3$ than $Q_1$ | Negatively skewed |
| Large gap between $Q_3$ and Max | Outliers or long right tail |
| Large gap between Min and $Q_1$ | Outliers or long left tail |
| Narrow IQR relative to Range | Outliers or extreme tail values dominate range |

8.4 Interpreting Skewness and Kurtosis

| Skewness $G_1$ | Shape Interpretation |
| :---- | :---- |
| $-0.5 \leq G_1 \leq 0.5$ | Approximately symmetric |
| $0.5 < G_1 \leq 1.0$ | Moderately positively skewed |
| $G_1 > 1.0$ | Substantially positively skewed |
| $-1.0 \leq G_1 < -0.5$ | Moderately negatively skewed |
| $G_1 < -1.0$ | Substantially negatively skewed |

| Excess Kurtosis $G_2$ | Tail Interpretation |
| :---- | :---- |
| $-0.5 \leq G_2 \leq 0.5$ | Normal-like tails |
| $G_2 > 0.5$ | Heavier tails than normal; more outlier-prone |
| $G_2 > 3$ | Substantially heavy tails (e.g., financial returns) |
| $G_2 < -0.5$ | Lighter tails than normal; bounded distribution |

8.5 Interpreting Normality Tests

| Normality Assessment Result | Implication for Reporting |
| :---- | :---- |
| $W$ close to 1; $p > .05$ | Consistent with normality; mean ± SD appropriate |
| $W < 0.95$; $p < .05$ | Evidence of non-normality; prefer median and IQR |
| $W < 0.90$ | Clear non-normality; transformation or non-parametric methods indicated |
| Large $n$ ($> 200$) with $p < .05$ but small $G_1$ … | |

⚠️ Normality tests are extremely sensitive to sample size. With $n > 200$, even trivial departures from normality are often statistically significant. With $n < 20$, even severe departures may not be detected. Always combine formal tests with visual inspection (histogram and Q-Q plot). The practical question is not whether the data are perfectly normal, but whether the departure is large enough to affect the validity of subsequent analyses.

8.6 The Mean-SD vs. Median-IQR Decision

This is one of the most frequently encountered reporting decisions in numerical descriptives:

| Use Mean ± SD When | Use Median [IQR] When |
| :---- | :---- |
| Distribution is approximately normal | Distribution is clearly skewed |
| No substantial outliers | Outliers are present |
| $n$ is reasonably large ($> 30$) | $n$ is small ($< 20$) |
| Parametric inferential tests are planned | Non-parametric tests are planned |
| Data are ratio-scale and symmetric | Data are bounded (e.g., floor/ceiling effects) |

⚠️ The notation "Mean ± SD" means mean plus or minus one standard deviation. "Mean ± SEM" means mean plus or minus one standard error of the mean. These are very different quantities. Always specify which you are reporting. Most journals recommend reporting SD (not SEM) as the measure of variability when describing the sample.

8.7 Contextualising Descriptives: Reference Values

Numerical descriptives are most useful when compared against reference benchmarks:

| Benchmark Type | Example | Source |
| :---- | :---- | :---- |
| Clinical reference range | Blood pressure: systolic 90–120 mmHg | Clinical guidelines |
| Historical baseline | Company revenue: prior year mean ± SD | Internal records |
| Population norm | BMI: adult population median 25–26 kg/m² | Epidemiological surveys |
| Theoretical value | Fair coin flip: proportion heads = 0.50 | Mathematical model |
| Regulatory limit | Contaminant level: max 10 ppb | Government regulation |

9. Visualising Numerical Data

9.1 Histogram

The histogram is the most fundamental visualisation for continuous numerical data. It divides the value range into contiguous bins of equal width and displays the frequency (or density) of observations in each bin as a bar.

Construction choices:

  • Number of bins: Too few bins obscure distributional shape; too many create noisy fluctuations. Common rules:
    • Sturges' rule: $K = \lceil 1 + 3.322 \log_{10} n \rceil$
    • Freedman-Diaconis rule: Bin width $= 2 \times IQR \times n^{-1/3}$ (robust, recommended)
    • Scott's rule: Bin width $= 3.49 \times s \times n^{-1/3}$ (assumes normality)
  • Frequency vs. density: Use density on the y-axis when overlaying a theoretical probability distribution curve.
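The three bin rules above can be compared with a short sketch. The function names are hypothetical, and `bins_from_width` is an assumed convenience step that converts a bin width into a bin count over the observed range.

```python
import math

def sturges_bins(n):
    """Number of bins by Sturges' rule."""
    return math.ceil(1 + 3.322 * math.log10(n))

def freedman_diaconis_width(iqr, n):
    """Bin width by the Freedman-Diaconis rule."""
    return 2 * iqr * n ** (-1 / 3)

def scott_width(sd, n):
    """Bin width by Scott's rule."""
    return 3.49 * sd * n ** (-1 / 3)

def bins_from_width(data_min, data_max, width):
    """Convert a bin width into a whole number of bins covering the range."""
    return max(1, math.ceil((data_max - data_min) / width))

# Hypothetical summary values: n = 100, IQR = 12, SD = 10, range 20 to 80
n, iqr, sd, lo, hi = 100, 12.0, 10.0, 20.0, 80.0
print("Sturges bins:", sturges_bins(n))
print("FD bins:", bins_from_width(lo, hi, freedman_diaconis_width(iqr, n)))
print("Scott bins:", bins_from_width(lo, hi, scott_width(sd, n)))
```

When the rules disagree noticeably, plotting the histogram with each bin count is a quick way to see whether the apparent shape depends on the choice.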

Reading a histogram:

  • Overall shape (symmetric, skewed, bimodal, uniform).
  • Location of the peak (mode).
  • Spread (width of the bulk of the distribution).
  • Outliers (isolated bars far from the main body).

Best practices:

  • Overlay a normal curve to assess departures from normality visually.
  • Optionally add a rug plot (tick marks below the x-axis for each observation).
  • Label the x-axis with the variable name and units; label the y-axis as "Frequency" or "Density".

9.2 Density Plot (Kernel Density Estimate)

The kernel density estimate (KDE) is a smoothed, continuous version of the histogram, constructed by placing a smooth kernel function (typically Gaussian) at each observed value and summing:

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)$$

Where $h$ is the bandwidth (smoothing parameter). A larger $h$ produces a smoother curve; a smaller $h$ reveals more local features. DataStatPro uses Silverman's rule of thumb for the default bandwidth:

$$h = 0.9 \times \min\!\left(s,\; \frac{IQR}{1.34}\right) \times n^{-1/5}$$

Advantages over histogram: Continuous, smooth, and not dependent on bin boundary choices. Excellent for comparing multiple distributions on the same plot.
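The KDE formula and Silverman's bandwidth can be sketched with the standard library as follows. The function names and data are hypothetical, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is assumed here to be an acceptable stand-in for whichever quartile definition DataStatPro uses internally.

```python
import math
import statistics

def silverman_bandwidth(values):
    """Silverman's rule of thumb: 0.9 * min(s, IQR / 1.34) * n^(-1/5)."""
    n = len(values)
    s = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 0.9 * min(s, (q3 - q1) / 1.34) * n ** (-1 / 5)

def gaussian_kde(values, h):
    """Return a function x -> estimated density using a Gaussian kernel."""
    n = len(values)
    norm = 1 / (n * h * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in values)
    return density

# Hypothetical sample of eight measurements
data = [12, 15, 11, 19, 14, 13, 22, 16]
h = silverman_bandwidth(data)
f_hat = gaussian_kde(data, h)
print(f"bandwidth h = {h:.3f}")
for x in (10, 14, 18, 22):
    print(f"f_hat({x}) = {f_hat(x):.4f}")
```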

9.3 Box Plot

The box plot (box-and-whisker plot) is a compact, robust visualisation of the five-number summary and outliers:

  • Box: Spans $Q_1$ to $Q_3$; its length along the value axis represents the IQR.
  • Median line: Horizontal line inside the box at $\tilde{x}$.
  • Whiskers: Extend to the most extreme observations within the Tukey fences ($Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$).
  • Outlier points: Individual observations beyond the whisker fences are plotted as dots or circles.
  • Notched box plot: V-shaped notches around the median line indicate an approximate 95% CI for the median. Non-overlapping notches suggest that the medians of two groups differ at roughly the 5% significance level.

Best practices:

  • Include a jittered strip of individual data points overlaid on the box plot when $n < 100$ (to avoid hiding the raw data).
  • For group comparisons, align box plots side by side with a common y-axis.
  • Always specify whether whiskers represent 1.5×IQR (Tukey, standard), 2×IQR, or min/max.

9.4 Violin Plot

The violin plot combines the compact shape of a box plot with the distributional detail of a density plot. Each "violin" is a mirrored kernel density estimate, showing the full distributional shape. A box plot or five-number summary is often overlaid.

Advantages over box plots: Reveals distributional shape (bimodality, skewness, gaps) that box plots conceal. Particularly useful for comparing distributions across multiple groups.

9.5 Q-Q Plot (Quantile-Quantile Plot)

The Q-Q plot assesses normality by plotting the sample quantiles against the theoretical quantiles of a standard normal distribution:

  • If the data are approximately normal, points fall on or near the diagonal reference line.
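  • Systematic deviations from the diagonal indicate non-normality (with sample quantiles on the vertical axis):
    • S-shaped curve (points below the line at the left end and above it at the right end): Heavier tails than normal.
    • Reversed S-shaped curve (points above the line at the left end and below it at the right end): Lighter tails than normal.
    • Banana-shaped curve with points above the line at both ends: Positive skew.
    • Banana-shaped curve with points below the line at both ends: Negative skew.
    • Points far off the line at one end: Outliers.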

DataStatPro overlays a 95% confidence band (Kolmogorov-Smirnov envelope) on the Q-Q plot. Points outside the band indicate significant departures from normality.

9.6 Stem-and-Leaf Plot

The stem-and-leaf plot is a text-based display that shows the full distribution while preserving every individual value. Each observation is split into a "stem" (leading digits) and a "leaf" (trailing digit). The stems are listed vertically, and the leaves are arranged horizontally.

Appropriate for: $n \leq 200$; exploratory data analysis; teaching contexts. The back-to-back stem-and-leaf plot compares two groups simultaneously.

9.7 Empirical Cumulative Distribution Function (ECDF)

The ECDF plots the proportion of observations less than or equal to each value $x$:

$$\hat{F}(x) = \frac{1}{n}\#\{i : x_i \leq x\}$$

The ECDF is a step function that rises by $1/n$ at each observed value (or by $k/n$ where $k$ observations are tied). It is a non-parametric estimate of the true CDF and is used to:

  • Identify percentiles visually (read the $y$-value corresponding to any $x$).
  • Compare two distributions (Kolmogorov-Smirnov test is based on the maximum vertical distance between two ECDFs).
  • Assess goodness-of-fit to a theoretical distribution.
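The ECDF definition translates almost directly into code. In this sketch the name `make_ecdf` and the data are hypothetical; `bisect_right` counts how many sorted observations are less than or equal to the query value.

```python
from bisect import bisect_right

def make_ecdf(values):
    """Return a function t -> proportion of observations <= t."""
    x = sorted(values)
    n = len(x)
    def ecdf(t):
        return bisect_right(x, t) / n
    return ecdf

# Hypothetical sample with one tied value
ecdf = make_ecdf([3, 1, 2, 2])
print(ecdf(0))    # 0.0
print(ecdf(1))    # 0.25
print(ecdf(2))    # 0.75  (jump of 2/4 at the tied value)
print(ecdf(3))    # 1.0
```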

9.8 Dot Plot / Strip Chart

A strip chart (also called a dot plot or jitter plot) displays every individual observation as a dot along a single axis, with a small random vertical jitter to prevent overplotting. It is the most information-rich plot for small to medium samples ($n \leq 200$).

Advantages: Shows the actual data; reveals gaps, clusters, and potential outliers not visible in summary plots. Highly recommended as a supplement to box plots.

9.9 Error Bar Plot

An error bar plot displays the mean (or median) as a point and uncertainty as symmetric bars. Common configurations:

| Bar Type | What It Represents | When to Use |
| :---- | :---- | :---- |
| Mean ± SD | Variability of individual observations | Describing sample variability |
| Mean ± SEM | Precision of the mean estimate | Showing estimation precision |
| Mean ± 95% CI | 95% CI for the population mean | Inferential comparisons |

⚠️ Always label the error bar type explicitly in figure captions. An unlabelled error bar is uninterpretable — ± SD, ± SEM, and ± 95% CI have very different meanings and widths. Many published figures fail to specify this information.

9.10 Visualisation Selection Guide

| Purpose | $n$ | Recommended Chart(s) |
| :---- | :---- | :---- |
| Full distributional shape | Any | Histogram + density curve |
| Summary of centre and spread | Any | Box plot (+ strip chart for $n < 100$) |
| Normality assessment | Any | Q-Q plot + histogram |
| Individual data visibility | $n \leq 200$ | Strip chart / dot plot |
| Smooth density estimation | $n \geq 30$ | Violin plot or KDE |
| Cumulative distribution | Any | ECDF plot |
| Group comparisons | Any | Side-by-side box plots; violin plots |
| Precise distributional shape | $n \leq 200$ | Stem-and-leaf plot |
| Reporting central tendency with uncertainty | Any | Error bar plot (specify type) |

10. Confidence Intervals for Numerical Descriptives

10.1 Why Confidence Intervals Are Essential

Sample descriptive statistics are estimates of population parameters. A 95% confidence interval (CI) provides a range of plausible values for the true population parameter, given the observed sample statistic. CIs convey both the direction and the precision of an estimate, making them far more informative than a point estimate alone.

10.2 CI for the Mean: t-Distribution

The standard 95% CI for the population mean $\mu$ uses the t-distribution:

$$CI_\mu = \bar{x} \pm t_{\alpha/2,\; n-1} \times \frac{s}{\sqrt{n}}$$

Where $t_{\alpha/2,\; n-1}$ is the upper $\alpha/2$ critical value of the t-distribution with $n - 1$ degrees of freedom. This CI is exact when the population is normal and asymptotically valid for non-normal populations with sufficiently large $n$ (by the Central Limit Theorem).

Critical values for common $n$:

| $n$ | $t_{0.025,\; n-1}$ | $t_{0.005,\; n-1}$ |
| :---- | :---- | :---- |
| 5 | 2.776 | 4.604 |
| 10 | 2.262 | 3.250 |
| 20 | 2.093 | 2.861 |
| 30 | 2.045 | 2.756 |
| 60 | 2.000 | 2.660 |
| 120 | 1.980 | 2.617 |
| $\infty$ | 1.960 | 2.576 |

10.3 CI for the Mean: Bootstrap (Non-Normal Data)

When normality is violated and $n$ is small, a bootstrap CI for the mean provides better coverage. The percentile bootstrap proceeds as follows:

  1. Draw $B = 2000$ (or more) bootstrap resamples of size $n$ with replacement.
  2. Compute $\bar{x}^*_b$ for each resample $b = 1, \ldots, B$.
  3. The 95% CI is the 2.5th and 97.5th percentiles of the $B$ bootstrap means: $CI_{boot} = [Q_{0.025}(\bar{x}^*), Q_{0.975}(\bar{x}^*)]$.

DataStatPro uses the bias-corrected and accelerated (BCa) bootstrap by default, which provides better coverage than the simple percentile bootstrap, especially for small samples.

10.4 CI for the Median: Bootstrap

The median has no simple closed-form CI analogous to the t-interval for the mean (a distribution-free interval based on order statistics exists, but DataStatPro uses the BCa bootstrap CI):

  1. Draw $B = 2000$ bootstrap resamples of size $n$ with replacement.
  2. Compute $\tilde{x}^*_b$ for each resample.
  3. The 95% CI is the BCa-corrected percentile interval of the $B$ bootstrap medians.

10.5 CI for the Standard Deviation: Chi-Square Distribution

The exact 95% CI for the population standard deviation $\sigma$ uses the chi-square distribution (assuming normality):

$$CI_\sigma = \left[\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2,\; n-1}}},\;\; \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\; n-1}}}\right]$$

Where $\chi^2_{p,\; n-1}$ is the upper-tail critical value (the value exceeded with probability $p$) of the chi-square distribution with $n - 1$ degrees of freedom, following the same upper-tail convention as $t_{\alpha/2,\; n-1}$ in Section 10.2. The larger critical value $\chi^2_{\alpha/2,\; n-1}$ therefore gives the lower limit.

⚠️ The chi-square CI for $\sigma$ is sensitive to departures from normality, more so than the t-CI for $\mu$. Use a bootstrap CI for $\sigma$ when the distribution is clearly non-normal.

10.6 CI for the Variance

The exact 95% CI for the population variance $\sigma^2$ (same convention for the critical values):

$$CI_{\sigma^2} = \left[\frac{(n-1)s^2}{\chi^2_{\alpha/2,\; n-1}},\;\; \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\; n-1}}\right]$$

10.7 CI Width as a Function of $n$ and $s$

The width of the 95% CI for the mean is approximately $2 \times 1.96 \times s/\sqrt{n}$. For a given $s$, quadrupling $n$ halves the CI width.

Required $n$ to achieve a target CI half-width $\delta$ (95% CI, $\alpha = .05$):

$$n \approx \left(\frac{1.96 \times s}{\delta}\right)^2$$

Example: If $s = 15$ (e.g., IQ scores), to achieve a CI half-width of ±3 points:

$$n \approx \left(\frac{1.96 \times 15}{3}\right)^2 = (9.8)^2 = 96.04$$

Rounding up to the next whole observation gives $n = 97$.
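The sample-size formula can be sketched as a small helper. The function name `required_n` is hypothetical; it uses the normal critical value from `statistics.NormalDist` (about 1.96 for 95% confidence) and always rounds up to the next whole observation, so its result can be one larger than a value obtained by rounding to the nearest integer.

```python
import math
from statistics import NormalDist

def required_n(sd, half_width, confidence=0.95):
    """Smallest n for which z * sd / sqrt(n) does not exceed half_width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    return math.ceil((z * sd / half_width) ** 2)

# IQ-type scores with s = 15 and a target half-width of 3 points
print(required_n(15, 3))
# Halving the target half-width roughly quadruples the required n
print(required_n(15, 1.5))
```

Because the formula uses the normal rather than the t critical value, it slightly understates the required $n$ for small samples.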

10.8 Confidence Intervals for the Geometric Mean

For log-normally distributed data, compute the CI on the log scale and transform back:

  1. Compute $\bar{y} = \frac{1}{n}\sum_{i=1}^n \ln x_i$ and $s_y = SD(\ln x_i)$.
  2. 95% CI on log scale: $\bar{y} \pm t_{\alpha/2,\; n-1} \times s_y/\sqrt{n}$.
  3. Transform back: $CI_{geom} = [\exp(\bar{y} - t \cdot s_y/\sqrt{n}),\; \exp(\bar{y} + t \cdot s_y/\sqrt{n})]$.
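The three steps above can be sketched as follows. The function name and data are hypothetical, and because the standard library has no t-distribution quantile function, the critical value `t_crit` is assumed to be looked up by the caller (for instance 2.776 for $n = 5$ from the table in Section 10.2).

```python
import math
import statistics

def geometric_mean_ci(values, t_crit):
    """Geometric mean with a CI computed on the log scale and back-transformed."""
    logs = [math.log(v) for v in values]       # requires strictly positive values
    n = len(logs)
    y_bar = statistics.mean(logs)
    se = statistics.stdev(logs) / math.sqrt(n)
    lower = math.exp(y_bar - t_crit * se)
    upper = math.exp(y_bar + t_crit * se)
    return math.exp(y_bar), lower, upper

# Hypothetical positive, right-skewed sample (n = 5, so t_crit = 2.776)
gm, lower, upper = geometric_mean_ci([2, 4, 8, 16, 32], t_crit=2.776)
print(f"geometric mean = {gm:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The interval is symmetric on the log scale, so on the original scale it is asymmetric around the geometric mean.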

11. Advanced Topics

11.1 Subgroup Comparisons and Stratified Descriptives

When the variable of interest is examined across levels of a grouping variable (e.g., treatment vs. control), stratified descriptives provide within-group summaries and enable direct comparison of central tendency, spread, and shape across groups.

A useful compact format is the comparative descriptives table:

| Group | $n$ | Mean ± SD | Median [IQR] | Min – Max | Skewness |
| :---- | :---- | :---- | :---- | :---- | :---- |
| Group A | $n_A$ | $\bar{x}_A \pm s_A$ | $\tilde{x}_A [IQR_A]$ | ... | $G_{1,A}$ |
| Group B | $n_B$ | $\bar{x}_B \pm s_B$ | $\tilde{x}_B [IQR_B]$ | ... | $G_{1,B}$ |

⚠️ Descriptive comparisons between groups do not constitute formal hypothesis tests. A large apparent difference in means may not be statistically significant (underpowered study), and a small apparent difference may be highly significant (very large $n$). Always follow descriptive comparisons with appropriate inferential tests (t-test, Mann-Whitney U, ANOVA).

11.2 Data Transformations for Non-Normal Variables

When a variable is substantially non-normal, transforming it can improve the interpretability of mean-based descriptives and the validity of downstream parametric tests.

Common transformations:

| Transformation | Formula | Appropriate When |
| :---- | :---- | :---- |
| Log transformation | $y = \ln(x)$ or $\log_{10}(x)$ | Positive right-skewed data; multiplicative processes |
| Square root | $y = \sqrt{x}$ | Count data; moderate right skew |
| Reciprocal | $y = 1/x$ | Severely right-skewed rates |
| Box-Cox | $y = (x^\lambda - 1)/\lambda$ | Systematic search for optimal $\lambda$ |
| Arcsine square root | $y = \arcsin(\sqrt{p})$ | Proportion data (though logit preferred) |
| Logit | $y = \ln(p/(1-p))$ | Proportion data bounded in $(0,1)$ |
| Rank transformation | $y = \text{rank}(x)$ | Non-parametric basis; extreme outliers |

⚠️ After transformation, descriptives are computed and reported on the transformed scale. The geometric mean on the original scale equals the antilog of the arithmetic mean on the log scale. Always state explicitly that transformed-scale descriptives are being reported and provide the back-transformed mean (geometric mean) for interpretability.

11.3 Robust Descriptive Statistics

When data contain outliers or are drawn from heavy-tailed distributions, robust estimators are preferred over classical mean-based measures:

| Classical Measure | Robust Alternative | Breakdown Point |
| :---- | :---- | :---- |
| Mean | Median | 50% |
| Mean | 10% trimmed mean | 10% |
| Standard deviation | $1.4826 \times MAD_{median}$ | 50% |
| Standard deviation | $IQR / 1.349$ | 25% |
| Variance | $Q_n$ estimator (Rousseeuw-Croux) | 50% |
| Mean | Huber M-estimator | Tunable |

The breakdown point is the proportion of data that can be replaced by arbitrarily large values before the estimator becomes unreliable.

11.4 Effect Size: Cohen's $d$ and Standardised Differences

When comparing the means of two groups, Cohen's $d$ standardises the mean difference by the pooled standard deviation, producing an interpretable, scale-free effect size:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}, \qquad s_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$$

Cohen's conventions for $|d|$:

| $\lvert d \rvert$ | Effect Size Label |
| :---- | :---------------- |
| $0.20$ | Small |
| $0.50$ | Medium |
| $0.80$ | Large |
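Cohen's $d$ can be computed from summary statistics alone, as in the sketch below. The function names are hypothetical; the inputs are the means, SDs, and group sizes reported for Example 3 in Section 12 (placebo minus caffeine).

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

def cohens_d(mean1, s1, n1, mean2, s2, n2):
    """Standardised mean difference (group 1 minus group 2)."""
    return (mean1 - mean2) / pooled_sd(s1, n1, s2, n2)

# Summary statistics from Example 3: placebo vs. caffeine reaction times (ms)
d = cohens_d(312.6, 32.1, 30, 287.3, 23.5, 30)
print(round(pooled_sd(32.1, 30, 23.5, 30), 1))   # 28.1
print(round(d, 2))                               # 0.9
```

The sign of $d$ depends on the order of the groups, so it is worth stating which group is subtracted from which when reporting.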

11.5 Standardised Scores and Norm-Referencing

Z-scores standardise raw values to a common scale with mean 0 and SD 1, enabling cross-variable and cross-population comparisons. Many applied contexts use T-scores (mean 50, SD 10) or IQ-type scaled scores (mean 100, SD 15):

$$T_i = 50 + 10 \times z_i = 50 + 10 \times \frac{x_i - \bar{x}}{s}$$

$$IQ_i = 100 + 15 \times z_i$$

Percentile ranks from the z-score (under normality): $PR = \Phi(z_i) \times 100$, where $\Phi$ is the standard normal CDF.
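The conversions above can be sketched with `statistics.NormalDist`, which provides the standard normal CDF $\Phi$. The function name `standard_scores` and the example values are hypothetical, and the percentile rank is only meaningful if the variable is approximately normal.

```python
from statistics import NormalDist

def standard_scores(x, mean, sd):
    """Convert a raw value to z, T, IQ-type, and percentile-rank scales."""
    z = (x - mean) / sd
    return {
        "z": z,
        "T": 50 + 10 * z,
        "IQ": 100 + 15 * z,
        "percentile_rank": NormalDist().cdf(z) * 100,   # assumes normality
    }

# A raw score one SD above a reference mean of 100 (SD 15)
scores = standard_scores(115, mean=100, sd=15)
print(scores)
```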

11.6 The Central Limit Theorem and its Implications

The Central Limit Theorem (CLT) states that for any population with finite mean $\mu$ and variance $\sigma^2$, the sampling distribution of $\bar{x}$ approaches a normal distribution as $n \to \infty$:

$$\bar{x} \sim \mathcal{N}\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{approximately, for large } n$$

Practical implications for descriptives:

  • For $n \geq 30$ and mildly non-normal populations, the t-CI for $\mu$ is approximately valid.
  • For highly skewed distributions or $n < 30$, bootstrap CIs are preferred.
  • The CLT justifies reporting the mean and SD even for non-normal data, provided $n$ is sufficiently large and the goal is inference about $\mu$.

11.7 Detecting Bimodality

A bimodal distribution has two distinct peaks, often indicating the presence of two subpopulations. Indicators of bimodality:

  • Histogram shows two visible humps.
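  • Strongly negative excess kurtosis (a flat-topped or two-humped shape) often accompanies bimodality, whereas large positive kurtosis alone does not imply it.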
  • Hartigan's dip test formally tests for unimodality against multimodal alternatives.
  • Bimodality coefficient $BC = (G_1^2 + 1) / (G_2 + 3(n-1)^2/((n-2)(n-3)))$; $BC > 0.555$ suggests bimodality (Pfister et al., 2013).
  • Gaussian mixture models can formally decompose a bimodal distribution into component distributions.

When bimodality is detected, computing a single mean and SD is misleading — report descriptives separately for each subpopulation if they can be identified.

11.8 Process Capability Indices

In quality control and manufacturing, process capability indices relate the distribution of a measured variable to specification limits ($LSL$ = lower, $USL$ = upper):

$$C_p = \frac{USL - LSL}{6s}, \qquad C_{pk} = \min\!\left(\frac{USL - \bar{x}}{3s},\; \frac{\bar{x} - LSL}{3s}\right)$$

  • $C_p \geq 1.33$: Process capable (if centred).
  • $C_{pk} \geq 1.33$: Process capable and well-centred.
  • $C_{pk} < 1$: Process produces an unacceptable proportion of out-of-specification output.

DataStatPro computes and reports $C_p$, $C_{pk}$, $C_{pm}$ (Taguchi index), and the estimated proportion out-of-specification when specification limits are supplied.
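The $C_p$ and $C_{pk}$ formulas can be sketched as below. The function name and the specification limits are hypothetical, and the sketch covers only these two indices, not $C_{pm}$ or the out-of-specification proportion.

```python
def process_capability(mean, sd, lsl, usl):
    """Return (Cp, Cpk) for a process with the given mean, SD, and spec limits."""
    cp = (usl - lsl) / (6 * sd)
    cpk = min((usl - mean) / (3 * sd), (mean - lsl) / (3 * sd))
    return cp, cpk

# Hypothetical process: spec limits 44 to 56, SD 1, mean shifted to 52
cp, cpk = process_capability(mean=52, sd=1, lsl=44, usl=56)
print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}")
```

Because $C_p$ ignores the process mean, an off-centre process shows $C_{pk} < C_p$; the two coincide only when the mean sits midway between the limits.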

11.9 Rolling Descriptives

When a numerical variable is measured repeatedly over time, rolling (moving window) descriptives track how central tendency and variability change:

$$\bar{x}_{rolling,t} = \frac{1}{w}\sum_{i=t-w+1}^{t} x_i$$

Where $w$ is the window width. Rolling means, medians, and standard deviations are plotted over time to reveal trends, seasonal patterns, and structural breaks. DataStatPro supports rolling descriptives with user-specified window width.
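A rolling mean with window width $w$ can be sketched as follows. The function name is hypothetical, and returning `None` until a full window is available is an assumed convention; other tools may instead return partial-window averages.

```python
from collections import deque

def rolling_mean(values, window):
    """Mean of the most recent `window` values; None until a full window exists."""
    out = []
    buf = deque(maxlen=window)     # holds the current window
    total = 0.0
    for v in values:
        if len(buf) == window:
            total -= buf[0]        # the oldest value is about to be dropped
        buf.append(v)
        total += v
        out.append(total / window if len(buf) == window else None)
    return out

# Hypothetical series with window width w = 3
print(rolling_mean([1, 2, 3, 4, 5], window=3))   # [None, None, 2.0, 3.0, 4.0]
```

Keeping a running total avoids re-summing each window, which matters for long series.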

11.10 Sensitivity Analysis: Influence of Outliers

To assess the influence of potential outliers on key descriptives, a sensitivity analysis reports descriptives both including and excluding flagged outliers:

| Statistic | With Outliers | Without Outliers | Difference |
| :---- | :---- | :---- | :---- |
| Mean | $\bar{x}_{full}$ | $\bar{x}_{excl}$ | $\Delta\bar{x}$ |
| Median | $\tilde{x}_{full}$ | $\tilde{x}_{excl}$ | $\Delta\tilde{x}$ |
| SD | $s_{full}$ | $s_{excl}$ | $\Delta s$ |
| Skewness | $G_{1,full}$ | $G_{1,excl}$ | $\Delta G_1$ |

A large $\Delta\bar{x}$ with a small $\Delta\tilde{x}$ confirms that the outlier is exerting disproportionate leverage on the mean. Report both full-data and outlier-excluded descriptives when outliers are detected, along with a justification for any exclusions.


12. Worked Examples

Example 1: Symmetric Distribution — Exam Scores

A class of $n = 25$ students sits an exam (0–100 marks). Three students were absent (missing); valid responses: $n_{valid} = 22$.

Raw scores (sorted):

42, 51, 55, 58, 61, 63, 65, 67, 68, 70, 71, 72, 73, 74, 75, 76, 78, 80, 82, 85, 88, 93

Step 1 — Central Tendency:

$$\bar{x} = \frac{42 + 51 + \cdots + 93}{22} = \frac{1547}{22} = 70.32$$

Median ($n = 22$, even): Average of 11th and 12th values:

$$\tilde{x} = \frac{71 + 72}{2} = 71.50$$

Step 2 — Quartiles:

$Q_1$ (25th percentile, position $= 0.25 \times 21 + 1 = 6.25$):

$$Q_1 = x_{(6)} + 0.25(x_{(7)} - x_{(6)}) = 63 + 0.25(65 - 63) = 63.50$$

$Q_3$ (75th percentile, position $= 0.75 \times 21 + 1 = 16.75$):

$$Q_3 = x_{(16)} + 0.75(x_{(17)} - x_{(16)}) = 76 + 0.75(78 - 76) = 77.50$$

$$IQR = 77.50 - 63.50 = 14.00$$

Step 3 — Dispersion:

$$s^2 = \frac{\sum(x_i - 70.32)^2}{21} = \frac{3156.77}{21} = 150.32, \qquad s = \sqrt{150.32} = 12.26$$

$$SE_{\bar{x}} = \frac{12.26}{\sqrt{22}} = 2.61$$

$$CV = \frac{12.26}{70.32} \times 100\% = 17.4\%$$

Step 4 — Shape:

$$G_1 = -0.37 \quad (\text{slight negative skew}) \qquad G_2 = 0.20 \quad (\text{approximately mesokurtic})$$

Step 5 — Normality:

Shapiro-Wilk: $W = 0.974$, $p = .812$ → Consistent with normality.

Step 6 — Outlier Check:

Lower fence: $63.50 - 1.5(14.00) = 42.50$. Score of 42 is just below the fence.

Upper fence: $77.50 + 1.5(14.00) = 98.50$. No upper outliers.

Score 42 is borderline; investigate (absent-then-returning student?). Reported as a mild outlier.

Step 7 — 95% CI for the Mean:

$$CI_\mu = 70.32 \pm 2.080 \times 2.61 = [64.88,\; 75.75]$$

($t_{0.025,\; 21} = 2.080$; limits computed from unrounded intermediate values)

Summary Table:

| Statistic | Value |
| :---- | :---- |
| Valid $n$ | 22 |
| Missing $n$ | 3 |
| Mean (95% CI) | 70.32 [64.88, 75.75] |
| Median [IQR] | 71.50 [63.50, 77.50] |
| SD | 12.26 |
| SEM | 2.61 |
| CV | 17.4% |
| Min – Max | 42 – 93 |
| Range | 51 |
| Skewness ($G_1$) | −0.37 |
| Excess kurtosis ($G_2$) | 0.20 |
| Shapiro-Wilk $W$ ($p$) | 0.974 (.812) |
| Outliers flagged | 1 (score = 42) |

APA write-up: "Exam scores for 22 students (3 absent) ranged from 42 to 93 ($M = 70.32$, 95% CI [64.88, 75.75], $SD = 12.26$, $Mdn = 71.50$, $IQR = 14.00$). The distribution was approximately symmetric ($G_1 = -0.37$, $G_2 = 0.20$) and consistent with normality (Shapiro-Wilk $W = 0.974$, $p = .812$). One borderline outlier (score = 42) was identified using Tukey's IQR fence."


Example 2: Skewed Distribution — Household Income

A social survey records annual household income (£ thousands) for $n = 150$ households. $n_{miss} = 0$.

Selected descriptive statistics (computed by DataStatPro):

| Statistic | Value |
| :---- | :---- |
| Valid $n$ | 150 |
| Mean (95% CI) | £62.3k [57.1k, 67.5k] |
| Median [IQR] | £48.5k [34.2k, 72.6k] |
| SD | £32.8k |
| SEM | £2.68k |
| CV | 52.7% |
| Min – Max | £11.2k – £198.4k |
| Skewness ($G_1$) | 1.84 |
| Excess kurtosis ($G_2$) | 3.21 |
| Shapiro-Wilk $W$ ($p$) | 0.891 ($p < .001$) |
| MAD (median) | £18.7k |
| Robust $\hat{\sigma}$ | £27.7k |
| Outliers flagged (Tukey) | 8 (all upper) |
| Geometric mean (95% CI) | £52.6k [48.9k, 56.6k] |

Interpretation: The distribution is substantially positively skewed ($G_1 = 1.84$), as is typical of income data. The mean (£62.3k) is considerably higher than the median (£48.5k), indicating that a small number of high-income households pull the mean upward. Eight high-income outliers are flagged. The Shapiro-Wilk test confirms significant non-normality ($p < .001$).

For this variable, the median and IQR (£48.5k [34.2k, 72.6k]) are the appropriate summary measures. The geometric mean (£52.6k) is also informative, as income data are approximately log-normally distributed.

APA write-up: "Household income ($N = 150$) showed substantial positive skewness ($G_1 = 1.84$), and the Shapiro-Wilk test indicated significant departure from normality ($W = 0.891$, $p < .001$). The median income was £48.5k ($IQR$ = £34.2k – £72.6k; range: £11.2k – £198.4k). The geometric mean was £52.6k (95% CI: £48.9k – £56.6k). Eight upper outliers were identified using Tukey's IQR fence."


Example 3: Grouped Descriptives — Reaction Time by Caffeine Condition

A psychology experiment measures reaction time (ms) in two conditions: caffeine ($n = 30$) and placebo ($n = 30$).

| Statistic | Caffeine | Placebo |
| :--- | :--- | :--- |
| Valid $n$ | 30 | 30 |
| Mean (95% CI) | 287.3 [278.5, 296.1] | 312.6 [300.4, 324.8] |
| Median [IQR] | 284.0 [268.5, 302.0] | 309.5 [290.0, 331.0] |
| SD | 23.5 | 32.1 |
| SEM | 4.3 | 5.9 |
| CV | 8.2% | 10.3% |
| Min – Max | 248 – 341 | 251 – 391 |
| Skewness ($G_1$) | 0.41 | 0.62 |
| Excess kurtosis ($G_2$) | −0.18 | 0.55 |
| Shapiro-Wilk $W$ ($p$) | 0.968 (.490) | 0.956 (.253) |
| Outliers flagged | 0 | 1 |

Cohen's $d$:

$$d = \frac{312.6 - 287.3}{s_{pooled}} = \frac{25.3}{\sqrt{\frac{29 \times 23.5^2 + 29 \times 32.1^2}{58}}} = \frac{25.3}{28.1} = 0.90$$

A large effect size (Cohen, 1988).
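The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that works from the summary statistics in the table, not from raw data; the function names `pooled_sd` and `cohens_d` are illustrative and are not part of DataStatPro.

```python
import math

def pooled_sd(n1, s1, n2, s2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardised mean difference (m1 - m2) / s_pooled."""
    return (m1 - m2) / pooled_sd(n1, s1, n2, s2)

# Summary statistics from the table above (placebo minus caffeine)
s_p = pooled_sd(30, 32.1, 30, 23.5)
d = cohens_d(312.6, 32.1, 30, 287.3, 23.5, 30)
print(f"s_pooled = {s_p:.1f}, d = {d:.2f}")  # s_pooled = 28.1, d = 0.90
```

Because the two groups have equal $n$, the pooled variance is simply the average of the two group variances.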

Interpretation: Participants in the caffeine condition responded on average 25.3 ms faster than those in the placebo condition ($d = 0.90$). Both distributions are approximately normal (Shapiro-Wilk $p > .05$ for both groups). The placebo group shows slightly higher variability ($CV = 10.3\%$ vs. $8.2\%$) and one flagged outlier.

APA write-up: "Reaction times in the caffeine condition ($M = 287.3$ ms, $SD = 23.5$, 95% CI [278.5, 296.1]) were lower than in the placebo condition ($M = 312.6$ ms, $SD = 32.1$, 95% CI [300.4, 324.8]). Both distributions were consistent with normality (Shapiro-Wilk $W \geq 0.956$, $p \geq .253$). The effect size was large ($d = 0.90$)."


Example 4: Log-Normal Variable — Bacterial Colony Counts

A microbiology study counts bacterial colonies per sample ($N = 40$; counts range from 3 to 4,820). Raw data are highly right-skewed.

| Statistic | Raw Scale | Log$_{10}$ Scale |
| :--- | :--- | :--- |
| Mean (95% CI) | 684.2 [463.5, 904.9] | 2.41 [2.28, 2.54] |
| Median [IQR] | 412.5 [98.5, 1021.0] | 2.62 [1.99, 3.01] |
| SD | 710.4 | 0.41 |
| Skewness ($G_1$) | 2.31 | −0.18 |
| Shapiro-Wilk $W$ ($p$) | 0.823 ($< .001$) | 0.974 (.483) |
| Geometric mean (95% CI) | 257.0 [195.5, 338.1] | |

Interpretation: Colony counts are log-normally distributed: the raw data are severely right-skewed ($G_1 = 2.31$, Shapiro-Wilk $p < .001$), but log$_{10}$-transformed counts are approximately normally distributed ($G_1 = -0.18$, Shapiro-Wilk $p = .483$). The appropriate measure of central tendency is the geometric mean of 257.0 colonies (equivalent to the antilog of the mean on the log scale: $10^{2.41} = 257.0$). The arithmetic mean (684.2) is substantially inflated by high-count outliers and is not recommended as the primary summary.
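The back-transformation described above can be sketched in Python. The helper `geometric_mean` is an illustrative name, and the `counts` list is hypothetical data chosen only to make the identity easy to verify; the study's raw counts are not reproduced here.

```python
import math

def geometric_mean(values):
    """Geometric mean: antilog of the mean of the logs (values must be > 0)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    mean_log10 = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log10

# Back-transforming the reported log10 mean reproduces the geometric mean
gm_from_summary = 10 ** 2.41
print(round(gm_from_summary, 1))  # 257.0

# Hypothetical counts, for illustration only: mean log10 is 2, so GM is 100
counts = [10, 100, 1000]
gm_counts = geometric_mean(counts)
print(round(gm_counts, 6))
```

The same result is obtained with natural logs and `math.exp`; the base only needs to match between the transform and the antilog.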

APA write-up: "Colony counts ($N = 40$) were log-normally distributed (Shapiro-Wilk on log$_{10}$-transformed data: $W = 0.974$, $p = .483$; on raw data: $W = 0.823$, $p < .001$). The geometric mean colony count was 257.0 (95% CI: 195.5 – 338.1). Descriptive statistics are reported on the log$_{10}$ scale: $M_{log} = 2.41$, $SD_{log} = 0.41$, $Mdn_{log} = 2.62$, $IQR_{log}$ = 1.99 – 3.01."


13. Common Mistakes and How to Avoid Them

Mistake 1: Reporting the SEM as a Measure of Variability

Problem: Reporting "Mean ± SEM" and implying that the SEM describes the spread of individual observations. Because $SE = s/\sqrt{n}$, the SEM shrinks as the sample grows and is smaller than the SD whenever $n > 1$, so it understates the variability in the data. This practice gives a misleading impression of data homogeneity.

Solution: Use the SD to describe the variability of individual observations in the sample. Use the SEM or the 95% CI to describe the precision of the mean estimate. Always label clearly which you are reporting.
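To see why the SEM cannot stand in for the SD, the sketch below holds the SD fixed at the Example 1 value (11.59) and varies $n$. The larger sample sizes are hypothetical and are used only to show how the SEM shrinks while the spread of the data stays the same.

```python
import math

def sem(sd, n):
    """Standard error of the mean: SD divided by sqrt(n)."""
    return sd / math.sqrt(n)

sd = 11.59  # SD from Example 1
print(f"n = 22: SEM = {sem(sd, 22):.2f}")  # 2.47, as reported in Example 1

# Hypothetical larger samples with the same SD, for illustration only
for n in (88, 352):
    print(f"n = {n}: SEM = {sem(sd, n):.2f}")
```

Each fourfold increase in $n$ halves the SEM, even though individual observations are no less variable.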


Mistake 2: Applying the Mean and SD to a Skewed Distribution

Problem: Reporting mean ± SD for a right-skewed variable (e.g., income, hospital length of stay, reaction time) where the mean is not representative of a typical observation and the SD interval ($\bar{x} \pm s$) may extend below zero.

Solution: For skewed distributions, report the median and IQR as the primary summary. Consider also reporting the geometric mean for log-normally distributed data. Always inspect the histogram and Shapiro-Wilk test before choosing between mean-based and median-based summaries.


Mistake 3: Deleting Outliers Without Justification

Problem: Automatically removing observations that fall outside Tukey's fences or that have $|z| > 3$, without investigating whether they are genuine data points or errors. Removing legitimate extreme values biases the descriptives and invalidates downstream inferential tests.

Solution: Investigate every flagged outlier individually. Is it a data entry error? A measurement error? Or a genuine extreme value? Delete only errors; retain genuine extreme values. Report descriptives both with and without suspected outliers and note any exclusions explicitly.
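One way to keep flagging separate from deletion is to have the code return the flagged values and leave the data untouched. The following is a minimal sketch using the Example 1 quartiles ($Q_1 = 63.5$, $Q_3 = 77.5$); the function names and the short `scores` list are hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) Tukey fences for multiplier k."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, q1, q3, k=1.5):
    """Return values outside the fences; nothing is removed from the data."""
    lower, upper = tukey_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]

# Quartiles from Example 1
lower, upper = tukey_fences(63.5, 77.5)
print(lower, upper)  # 42.5 98.5

# Hypothetical subset of scores, for illustration only
scores = [42, 58, 71, 93]
flagged = flag_outliers(scores, 63.5, 77.5)
print(flagged)  # [42]
```

The score of 42 falls just below the lower fence of 42.5, which is why Example 1 describes it as a borderline outlier. Whether it is retained is a decision for the analyst, not for the flagging rule.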


Mistake 4: Confusing Standard Deviation with Standard Error

Problem: Using "SD" and "SE" (or "SEM") interchangeably, or not specifying which is reported. This is one of the most common statistical errors in published research.

Solution: Clearly define all notation at first use. Use SD to describe sample variability; use SE or 95% CI to describe estimation precision. Refer to APA style: "$M = 45.3$, $SD = 8.7$, 95% CI [43.6, 47.0]."


Mistake 5: Over-Interpreting Descriptives from Very Small Samples

Problem: Computing and reporting detailed descriptives (skewness, kurtosis, mode, full percentile table) from samples of $n = 5$ or $n = 10$, as if they were stable estimates of population parameters. With very small $n$, all descriptives are highly unstable.

Solution: For $n < 10$, report individual values and at most the minimum, maximum, and median. For $n < 30$, report mean and SD with wide CIs, note the small sample size, and avoid strong distributional claims. Accompany all small-sample descriptives with explicit acknowledgment of imprecision.


Mistake 6: Not Assessing Normality Before Reporting Mean-Based Summaries

Problem: Automatically reporting mean ± SD for every numerical variable without checking whether the distribution is approximately normal. For skewed data, the mean and SD are poor summaries and can be actively misleading.

Solution: Always assess normality as part of the descriptive analysis, using at minimum a histogram and the Shapiro-Wilk test. Select mean ± SD (normal data) or median [IQR] (skewed data) based on the assessment. Report the normality assessment results alongside the descriptives.


Mistake 7: Truncating the y-Axis in Histograms or Bar Charts

Problem: Starting the frequency axis at a value other than zero to exaggerate differences or make the distribution appear more concentrated. This distorts the visual impression of the data.

Solution: Always start the y-axis of a histogram or bar chart at zero. If the range of values is large, consider a secondary plot zooming in on a region of interest, rather than truncating the primary axis.


Mistake 8: Reporting Spurious Precision

Problem: Reporting the mean as 47.38271 when the raw data are measured to the nearest integer. This implies a level of measurement precision that does not exist and does not aid interpretation.

Solution: Report the mean and SD to one more decimal place than the original data. Data recorded to the nearest unit → report mean to one decimal place. Data recorded to one decimal place → report mean to two decimal places.


Mistake 9: Failing to Distinguish Between Descriptive and Inferential Uses of Confidence Intervals

Problem: Reporting a 95% CI for the mean as a range within which "95% of observations fall". This is a fundamentally incorrect interpretation — that description applies to the prediction interval, not the CI for the mean.

Solution: A 95% CI for the mean means: "If we repeated this study many times, 95% of the constructed intervals would contain the true population mean $\mu$." It is an interval for a parameter, not for individual observations. Use prediction intervals or reference ranges when the goal is to describe the expected range for an individual observation.
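The difference is easy to see numerically. The sketch below uses the Example 1 summary values ($\bar{x} = 69.41$, $s = 11.59$, $n = 22$, $t_{0.025,\,21} = 2.080$) and, as an assumption for illustration, the standard normal-theory prediction interval $\bar{x} \pm t \times s\sqrt{1 + 1/n}$. The function names are hypothetical.

```python
import math

def mean_ci(mean, sd, n, t_crit):
    """t-based confidence interval for the population mean."""
    half = t_crit * sd / math.sqrt(n)
    return mean - half, mean + half

def prediction_interval(mean, sd, n, t_crit):
    """Interval for one future observation, assuming normal data."""
    half = t_crit * sd * math.sqrt(1 + 1 / n)
    return mean - half, mean + half

# Example 1 summary values; t_crit = t(0.025, 21) = 2.080
ci = mean_ci(69.41, 11.59, 22, 2.080)
pi = prediction_interval(69.41, 11.59, 22, 2.080)
print(f"95% CI for the mean:     [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"95% prediction interval: [{pi[0]:.2f}, {pi[1]:.2f}]")
```

The CI spans roughly 10 points, whereas the prediction interval spans nearly 50, because it must cover the spread of individual scores rather than the uncertainty in the mean alone.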


Mistake 10: Comparing Means Across Groups Without Assessing Comparability of Spread

Problem: Reporting that the mean in Group A is higher than in Group B, without noting that Group A has three times the standard deviation. A mean difference is much more practically important when variability is low than when it is high.

Solution: Always report the variability measure (SD or IQR) alongside the central tendency measure for each group. Compute and report Cohen's $d$ or Glass's $\Delta$ as the standardised effect size when comparing group means.
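When the group SDs differ, Glass's $\Delta$ scales the mean difference by the control-group SD alone instead of the pooled SD. A minimal sketch with the Example 3 summary values, treating placebo as the control group (an assumption made for illustration; `glass_delta` is a hypothetical helper name):

```python
def glass_delta(mean_treatment, mean_control, sd_control):
    """Glass's Delta: mean difference scaled by the control-group SD only."""
    if sd_control <= 0:
        raise ValueError("control-group SD must be positive")
    return (mean_treatment - mean_control) / sd_control

# Example 3: caffeine (treatment) vs. placebo (control)
delta = glass_delta(287.3, 312.6, 32.1)
print(f"Glass's Delta = {delta:.2f}")  # -0.79
```

The magnitude (0.79) is smaller than the pooled-SD value of $d = 0.90$ from Example 3 because the placebo group has the larger SD.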


14. Troubleshooting

| Problem | Likely Cause | Solution |
| :--- | :--- | :--- |
| Mean and median differ greatly | Skewed distribution or outliers | Report median and IQR; investigate outliers; consider transformation |
| SD larger than mean | High variability, right skew, or zero-inflated data | Check distribution shape; consider median/IQR; check for data errors |
| SD = 0 | All observations identical | Expected; report as constant; investigate if unexpected |
| Negative variance | Computation error | Verify formula uses $n-1$ denominator; check data integrity |
| CV reported as negative | Negative mean (e.g., change scores, temperatures in °C) | CV is not meaningful for interval-scale variables with possible negative means; omit CV |
| Skewness or kurtosis very large ($\lvert G_1 \rvert > 5$) | | |
| Shapiro-Wilk $p < .001$ with large $n$ | High test power: trivial departures from normality become significant for $n > 200$ | Inspect histogram and Q-Q plot; if visually approximately normal, proceed with mean-based summaries |
| Shapiro-Wilk $p > .05$ with small $n$ | Low power of normality test for $n < 20$ | Visual inspection essential; bootstrap CI safer than t-CI for small non-normal samples |
| Confidence interval includes negative values for SD | Incorrectly applied symmetric CI for SD | Use chi-square CI for SD (Section 10.5); SD CI is inherently asymmetric |
| Geometric mean cannot be computed | One or more zero or negative values | Log transformation undefined at 0; add a small constant or use arithmetic mean; check for data errors |
| Percentile estimates differ from other software | Different percentile computation method (Type 7 vs. other) | Specify the method used; DataStatPro uses Type 7 (linear interpolation) |
| IQR = 0 | More than 50% of observations share the same value | Common with discrete or heavily tied data; report as 0 and note the high frequency of tied values |
| Mean dramatically changes after excluding one outlier | Outlier exerts high leverage | Report sensitivity analysis; consider robust estimators; investigate the outlier |
| MAD (median) = 0 | More than 50% of observations share the median value | Common with integer data; robust $\hat{\sigma}$ is not meaningful; note and use other measures |
| $C_{pk}$ is negative | Process mean outside specification limits | Urgent process adjustment needed; $C_{pk} < 0$ means more than 50% of output is out-of-specification |
| Rolling mean diverges unexpectedly | Structural break or data anomaly in the time series | Investigate the time period around the divergence; check for data entry errors |
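When percentile estimates disagree across packages, it can help to compute the Type 7 value by hand. The sketch below implements the usual Type 7 definition (linear interpolation at position $(n-1)p$ in the sorted data). It illustrates the rule named in the table and is not DataStatPro's own code; the data are hypothetical.

```python
import math

def percentile_type7(values, p):
    """Type 7 quantile: linear interpolation at position (n - 1) * p."""
    if not values or not 0 <= p <= 1:
        raise ValueError("need non-empty data and 0 <= p <= 1")
    xs = sorted(values)
    h = (len(xs) - 1) * p          # fractional index into the sorted data
    lo = math.floor(h)
    hi = math.ceil(h)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

# Hypothetical data, for illustration only
data = [1, 2, 3, 4]
print(percentile_type7(data, 0.25))  # 1.75
print(percentile_type7(data, 0.50))  # 2.5
```

Other definitions place the interpolation point differently, so small samples with few distinct values show the largest discrepancies between packages.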

15. Quick Reference Cheat Sheet

Core Equations

| Formula | Description |
| :--- | :--- |
| $\bar{x} = \frac{1}{n}\sum x_i$ | Arithmetic mean |
| $\tilde{x} = x_{((n+1)/2)}$ or average of middle two | Median |
| $s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}$ | Sample variance |
| $s = \sqrt{s^2}$ | Sample standard deviation |
| $SE = s / \sqrt{n}$ | Standard error of the mean |
| $CV = (s / \bar{x}) \times 100\%$ | Coefficient of variation (ratio scale only) |
| $IQR = Q_3 - Q_1$ | Interquartile range |
| $MAD_{med} = \text{median}(\lvert x_i - \tilde{x} \rvert)$ | Median absolute deviation |
| $\hat{\sigma}_{robust} = 1.4826 \times MAD_{med}$ | Robust SD estimator |
| $G_1$ = bias-corrected third standardised moment | Sample skewness |
| $G_2$ = bias-corrected fourth standardised moment $- 3$ | Excess kurtosis |
| $z_i = (x_i - \bar{x})/s$ | Z-score (standardised value) |
| $\bar{x}_{geom} = \exp(\frac{1}{n}\sum \ln x_i)$ | Geometric mean |
| $CI_\mu = \bar{x} \pm t_{\alpha/2,\; n-1} \times s/\sqrt{n}$ | 95% CI for mean |
| $CI_\sigma = [\sqrt{(n-1)s^2/\chi^2_{upper}},\; \sqrt{(n-1)s^2/\chi^2_{lower}}]$ | 95% CI for SD |
| $d = (\bar{x}_1 - \bar{x}_2)/s_{pooled}$ | Cohen's $d$ (standardised mean difference) |
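As a quick illustration of the MAD and robust SD formulas above, the following sketch uses only Python's standard library. The data are hypothetical and include one extreme value to show how differently the classical and robust estimators respond; the function names are illustrative.

```python
import statistics

def mad_median(values):
    """Median absolute deviation about the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def robust_sigma(values):
    """Robust SD estimate: 1.4826 * MAD (consistent for normal data)."""
    return 1.4826 * mad_median(values)

# Hypothetical data with one extreme value, for illustration only
data = [1, 2, 3, 4, 100]
print(f"SD           = {statistics.stdev(data):.1f}")  # 43.6
print(f"MAD (median) = {mad_median(data)}")            # 1
print(f"robust sigma = {robust_sigma(data):.4f}")      # 1.4826
```

The single value of 100 inflates the SD to about 43.6, while the MAD-based estimate is unaffected by it.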

Measure Applicability by Data Type

MeasureContinuous NormalContinuous SkewedDiscrete CountRatio-Scale Required
Mean, SD, SEM✅ Primary⚠️ Use with caution
Median, IQR✅ Supplement✅ Primary
Geometric mean✅ Log-normal✅ Counts
CV
MAD (median)✅ Primary
Skewness, kurtosis
Percentiles

Mean vs. Median Decision Guide

| Use Mean ± SD When | Use Median [IQR] When |
| :--- | :--- |
| Approximately normal (Shapiro-Wilk $p > .05$) | Substantially skewed |
| No substantial outliers | Outliers present |
| $n \geq 30$ | $n < 20$ with non-normal data |
| Parametric tests planned | Non-parametric tests planned |
| Ratio-scale variable | Bounded or censored data |

Normality Assessment Summary

| Tool | Output | Action Threshold |
| :--- | :--- | :--- |
| Shapiro-Wilk test | $W$, $p$-value | $p < .05$: evidence of non-normality |
| Skewness z-test | $z_{G_1}$ | |
| Kurtosis z-test | $z_{G_2}$ | |
| Histogram | Visual | Obvious skew, bimodality, gaps |
| Q-Q plot | Visual | Points outside confidence band |

Outlier Detection Thresholds

| Method | Lower Bound | Upper Bound |
| :--- | :--- | :--- |
| Tukey IQR fence (mild) | $Q_1 - 1.5 \times IQR$ | $Q_3 + 1.5 \times IQR$ |
| Tukey IQR fence (extreme) | $Q_1 - 3 \times IQR$ | $Q_3 + 3 \times IQR$ |
| Z-score | $\bar{x} - 3s$ | $\bar{x} + 3s$ |
| Modified Z-score ($M_i$) | | |

Cohen's $d$ Benchmarks

| $\lvert d \rvert$ | Effect Size | Contextual Note |
| :----- | :---------- | :-------------- |
| $0.20$ | Small | Barely noticeable in practice |
| $0.50$ | Medium | Visible to a careful observer |
| $0.80$ | Large | Obvious to a casual observer |
| $1.20+$ | Very large | Highly practically significant |

Required Sample Size for Target CI Width

$n \approx (1.96 \times s / \delta)^2$ where $\delta$ = desired half-width of 95% CI.

| $s$ | $\delta = 0.5s$ | $\delta = 0.25s$ | $\delta = 0.1s$ |
| :--- | :--- | :--- | :--- |
| Any | $\approx 16$ | $\approx 62$ | $\approx 385$ |
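The table values follow directly from the formula with the result rounded up to the next whole observation. A minimal sketch (the function name `required_n` is illustrative; it uses the normal approximation $z = 1.96$ as in the formula above, so it slightly understates $n$ when the resulting sample is small):

```python
import math

def required_n(sd, half_width, z=1.96):
    """Approximate n so that a 95% CI for the mean has the given half-width."""
    if sd <= 0 or half_width <= 0:
        raise ValueError("sd and half_width must be positive")
    return math.ceil((z * sd / half_width) ** 2)

# Half-width expressed as a fraction of the SD (sd = 1.0)
for ratio in (0.5, 0.25, 0.1):
    print(f"delta = {ratio}s -> n = {required_n(1.0, ratio)}")
```

Because only the ratio $s/\delta$ enters the formula, the same $n$ applies for any SD, which is why the table row is labelled "Any".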

APA 7th Edition Reporting Templates

Normal distribution (primary: mean and SD): "[Variable] scores ranged from [Min] to [Max] ($M$ = [value], $SD$ = [value], 95% CI [[LB], [UB]]). The distribution was approximately normal (Shapiro-Wilk $W$ = [value], $p$ = [value], $G_1$ = [value], $G_2$ = [value])."

Non-normal distribution (primary: median and IQR): "[Variable] scores ranged from [Min] to [Max] ($Mdn$ = [value], $IQR$ = [LB] – [UB]). The distribution was positively/negatively skewed ($G_1$ = [value]), and the Shapiro-Wilk test indicated significant departure from normality ($W$ = [value], $p$ = [value])."

Log-normal variable (geometric mean): "[Variable] data were log-normally distributed (Shapiro-Wilk on log-transformed data: $W$ = [value], $p$ = [value]). The geometric mean was [value] (95% CI: [LB], [UB])."

Group comparison: "[Group A] ($n$ = [value], $M$ = [value], $SD$ = [value]) showed [higher/lower/similar] [variable] compared to [Group B] ($n$ = [value], $M$ = [value], $SD$ = [value]), with a [small/medium/large] effect size ($d$ = [value])."

With outliers: "[Number] outlier(s) were identified using Tukey's IQR fence method ([list values]). Descriptives are reported for the full sample ($n$ = [total]) and with outliers excluded ($n$ = [excl]): full: $M$ = [value], $SD$ = [value]; excluding outliers: $M$ = [value], $SD$ = [value]."

Reporting Checklist

| Item | Required |
| :--- | :--- |
| Valid $n$ and missing $n$ (with missing rate) | ✅ Always |
| Mean and SD (or median and IQR) | ✅ Always |
| State which: Mean ± SD or Median [IQR], with justification | ✅ Always |
| Minimum and maximum (or range) | ✅ Always |
| 95% CI for the mean (or median) | ✅ Always |
| Skewness ($G_1$) and excess kurtosis ($G_2$) | ✅ Always |
| Normality assessment (Shapiro-Wilk + histogram/Q-Q) | ✅ Always |
| Quartiles ($Q_1$, $Q_3$, $IQR$) | ✅ When reporting median |
| SEM | ✅ When mean precision is the focus (clearly labelled) |
| CV | ✅ For ratio-scale variables; when comparing variability |
| MAD (median) | ✅ For non-normal data; when robustness is relevant |
| Geometric mean | ✅ For log-normal or multiplicative data |
| Outlier detection results | ✅ Always |
| Sensitivity analysis (with/without outliers) | ✅ When outliers are detected |
| Units of measurement stated | ✅ Always |
| Percentile table | ✅ For norm-referenced scores; clinical reference ranges |
| Five-number summary | ✅ When box plot is presented |
| Cohen's $d$ (group comparisons) | ✅ When comparing two group means |
| Bootstrap CI | ✅ When $n < 30$ and normality is violated |
| Transformation stated and justified | ✅ When data are transformed before reporting |
| Measurement scale stated (interval / ratio) | ✅ Always |
| Missing data mechanism discussed | ✅ When $n_{miss} > 5\%$ |
| Weighted estimates (if survey data) | ✅ When design weights provided |

This tutorial provides a comprehensive foundation for understanding, computing, interpreting, visualising, and reporting numerical descriptive statistics within the DataStatPro application. For further reading, consult Tukey's "Exploratory Data Analysis" (1977) for robust and exploratory methods, Altman's "Practical Statistics for Medical Research" (1991) for clinical applications, Wilcox's "Introduction to Robust Estimation and Hypothesis Testing" (4th ed., 2017) for robust descriptives, Cohen's "Statistical Power Analysis for the Behavioral Sciences" (2nd ed., 1988) for effect size conventions, and Hoaglin, Mosteller & Tukey's "Understanding Robust and Exploratory Data Analysis" (1983) for advanced exploratory methods. For feature requests or support, contact the DataStatPro team.