Numerical Descriptives: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of summarising continuous and discrete numerical data all the way through advanced interpretation, reporting, visualisation, assumption checking, and practical usage within the DataStatPro application. Whether you are encountering numerical descriptive statistics for the first time or deepening your understanding of how to characterise, display, and communicate the distribution of numerical variables, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What are Numerical Descriptives?
- The Mathematics Behind Numerical Descriptives
- Considerations and Data Quality Checks
- Types of Numerical Descriptive Measures
- Using the Numerical Descriptives Calculator Component
- Step-by-Step Procedure
- Interpreting the Output
- Visualising Numerical Data
- Confidence Intervals for Numerical Descriptives
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into numerical descriptive statistics, it is essential to be comfortable with the following foundational statistical and mathematical concepts. Each is briefly reviewed below.
1.1 Variables and Observations
A variable is any measurable characteristic that can take on different values across observations. A numerical variable (also called a quantitative variable) records values on a numeric scale that has an inherent magnitude, enabling arithmetic operations such as addition, subtraction, and averaging.
- Observation: A single unit of study (one person, one trial, one time point).
- Dataset: A rectangular array of observations (rows) and variables (columns).
- Value: The specific numeric measurement recorded for a variable on a given observation.
1.2 Scales of Measurement: Interval and Ratio
Numerical descriptive statistics are appropriate for variables measured on interval or ratio scales:
| Scale | Properties | True Zero | Examples |
|---|---|---|---|
| Interval | Equal spacing between values; no true zero | No | Temperature (°C or °F), year, IQ score |
| Ratio | Equal spacing and a meaningful true zero | Yes | Height, weight, income, reaction time, count data |
⚠️ For interval scales, differences are meaningful but ratios are not — saying 40°C is "twice as hot" as 20°C is meaningless. For ratio scales, both differences and ratios are meaningful — a person weighing 80 kg is genuinely twice as heavy as one weighing 40 kg. Most numerical descriptives apply to both scales, but ratios such as the coefficient of variation require ratio-scale data.
1.3 Continuous vs. Discrete Numerical Variables
Numerical variables are further classified by the set of values they can take:
- Continuous variables: Can take any real value within a range; the precision is limited only by the measurement instrument. Examples: height, blood pressure, temperature, time.
- Discrete variables: Take only countable, distinct values (often non-negative integers). Examples: number of children, number of hospital admissions, count of errors.
Most numerical descriptives apply equally to both types, though visualisation choices (histograms vs. bar charts) and some distribution assumptions differ.
1.4 The Concept of a Distribution
The distribution of a numerical variable describes how its values are spread across the number line. Understanding a distribution requires characterising:
- Location (central tendency): Where the centre or typical value lies.
- Spread (dispersion): How much values vary around the centre.
- Shape: Whether the distribution is symmetric, skewed, peaked, or flat.
- Outliers: Whether extreme values are present.
Numerical descriptive statistics provide compact summaries of each of these four distributional properties.
1.5 The Normal Distribution
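The normal (Gaussian) distribution is the most important continuous probability distribution in statistics. It is parameterised by its mean $\mu$ and standard deviation $\sigma$:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$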
Key properties:
- Perfectly symmetric and bell-shaped around $\mu$.
- Mean, median, and mode are all equal to $\mu$.
- Approximately 68% of observations fall within $\pm 1\sigma$ of $\mu$.
- Approximately 95% of observations fall within $\pm 2\sigma$ of $\mu$.
- Approximately 99.7% of observations fall within $\pm 3\sigma$ of $\mu$.
- Skewness = 0; excess kurtosis = 0.
Many numerical descriptives (mean, standard deviation, standard error) are most meaningful and interpretable when the underlying distribution is approximately normal.
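The 68/95/99.7 figures above can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `share_within` is introduced here for illustration only.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal value lies within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.4f}")
```

Because the result depends only on the number of standard deviations, the same proportions hold for any $\mu$ and $\sigma$.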
1.6 Robustness
A statistical measure is robust if it is relatively unaffected by outliers or departures from assumed distributional shapes. This is a critical concept for choosing between competing descriptive measures:
- Non-robust measures: Mean, variance, standard deviation — one extreme outlier can substantially shift these measures.
- Robust measures: Median, interquartile range, median absolute deviation — designed to be resistant to the influence of extreme values.
1.7 Population Parameters vs. Sample Statistics
All numerical descriptives computed from data are sample statistics — estimates of unknown population parameters:
| Population Parameter | Sample Statistic | Symbol (Parameter / Statistic) |
|---|---|---|
| Population mean | Sample mean | $\mu$ / $\bar{x}$ |
| Population variance | Sample variance | $\sigma^2$ / $s^2$ |
| Population standard deviation | Sample SD | $\sigma$ / $s$ |
| Population median | Sample median | $\tilde{\mu}$ / $\tilde{x}$ |
| Population correlation | Sample correlation | $\rho$ / $r$ |
Sample statistics carry sampling variability; confidence intervals (Section 10) quantify the precision of these estimates.
1.8 Summation Notation
Numerical descriptives are expressed using summation notation. For a variable $X$ with $n$ observations $x_1, x_2, \ldots, x_n$:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$
Understanding this notation is essential for interpreting the mathematical formulae throughout this tutorial.
2. What are Numerical Descriptives?
2.1 The Core Purpose
Numerical descriptive statistics are mathematical summaries that characterise the distribution of a continuous or discrete numerical variable. Their collective purpose is to replace a raw list of numbers with a small, interpretable set of values that faithfully conveys the essential features of the data — location, spread, shape, and extremes — without requiring the reader to examine every individual observation.
2.2 The Four Pillars of a Numerical Description
Every complete numerical description addresses four fundamental questions:
| Pillar | Question | Addressed By |
|---|---|---|
| Central tendency | Where does the centre of the distribution lie? | Mean, median, mode, trimmed mean |
| Dispersion | How spread out are the values? | Range, IQR, variance, SD, CV, MAD |
| Shape | Is the distribution symmetric, skewed, peaked, or flat? | Skewness, kurtosis |
| Extremes | Are there unusual observations at the tails? | Minimum, maximum, outlier flags |
No single number captures all four pillars. A complete description always reports at least one measure from each category.
2.3 When to Use Numerical Descriptives
| Condition | Requirement |
|---|---|
| Variable scale | Interval or ratio (continuous or discrete) |
| Data format | Numeric observations |
| Purpose | Summarise the marginal distribution of one variable |
| Sample size | Any; larger yields more stable and precise estimates |
| Reporting | Always precede inferential tests with descriptive summaries |
2.4 Real-World Applications
| Field | Variable | Key Descriptives |
|---|---|---|
| Clinical Medicine | Blood pressure (mmHg) | Mean ± SD; reference range; outlier flags |
| Finance | Daily stock return (%) | Mean, SD, skewness, kurtosis; Value at Risk |
| Education | Exam score (0–100) | Mean, median, SD, percentiles, min, max |
| Manufacturing | Component diameter (mm) | Mean, SD, CV; process capability indices |
| Environmental Science | Rainfall (mm) | Median, IQR; skewness; seasonal breakdown |
| Sports Analytics | Player sprint speed (m/s) | Mean ± SD; percentile ranks; outlier detection |
| Pharmacology | Drug concentration (ng/mL) | Geometric mean; CV; log-transformed summaries |
| Epidemiology | Body mass index (kg/m²) | Mean, SD, percentiles; skewness |
| Psychology | Reaction time (ms) | Median, IQR; robust measures due to skewness |
| Quality Control | Process yield (%) | Mean, SD; capability index |
2.5 Distinguishing Numerical Descriptives from Related Analyses
| Goal | Appropriate Method |
|---|---|
| Summarise one numerical variable | Numerical descriptives |
| Compare means across two groups | Independent-samples t-test |
| Compare means across three or more groups | One-way ANOVA |
| Assess relationship between two numerical variables | Pearson or Spearman correlation |
| Predict one numerical variable from another | Simple or multiple regression |
| Test normality of a numerical variable | Shapiro-Wilk, Kolmogorov-Smirnov test |
| Compare spread across two groups | Levene's test, F-test for equality of variances |
| Summarise a categorical variable | Categorical descriptives |
3. The Mathematics Behind Numerical Descriptives
3.1 Notation
Consider a numerical variable $X$ with $n$ valid observations $x_1, x_2, \ldots, x_n$, arranged in ascending order as the order statistics $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
3.2 Measures of Central Tendency
3.2.1 Arithmetic Mean
The arithmetic mean (simply "the mean") is the sum of all values divided by the number of observations:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The mean is the centre of gravity of the data — the point at which the distribution balances. It uses all observations equally and is the most efficient estimator of the population mean when the distribution is normal. However, it is sensitive to outliers.
3.2.2 Median
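The median is the middle value when observations are arranged in order. It divides the distribution into two equal halves:
$$\tilde{x} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$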
The median is the 50th percentile of the distribution. It is robust to outliers and is preferred over the mean when the distribution is skewed or when extreme values are present.
3.2.3 Mode
For numerical data, the mode is the value (or range of values) that appears most frequently. For continuous variables, the mode is typically identified from a histogram as the peak(s) of the distribution rather than a specific repeated value. For discrete data with genuine ties, the mode is the most frequently occurring value.
3.2.4 Trimmed Mean
The trimmed mean (or truncated mean) removes a fixed proportion $\alpha$ of the most extreme observations from each tail before computing the mean:
$$\bar{x}_{\alpha} = \frac{1}{n - 2k}\sum_{i=k+1}^{n-k} x_{(i)}, \qquad k = \lfloor \alpha n \rfloor$$
Common choices: $\alpha = 0.05$ (5% trimmed mean) or $\alpha = 0.10$ (10% trimmed mean). The trimmed mean is more robust than the mean while being more efficient than the median, making it a useful middle ground.
3.2.5 Geometric Mean
The geometric mean is appropriate for positively skewed, multiplicative, or ratio-scale data (e.g., concentration values, growth rates, fold changes):
$$\bar{x}_G = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right)$$
The geometric mean is defined only for strictly positive values ($x_i > 0$). It is the antilog of the mean of the log-transformed values and is always $\le$ the arithmetic mean (equality holds when all values are identical).
3.2.6 Harmonic Mean
The harmonic mean is appropriate for averaging rates or ratios (e.g., speeds, price-to-earnings ratios):
$$\bar{x}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
For positive values, the harmonic mean is always $\le$ the geometric mean $\le$ the arithmetic mean (the AM-GM-HM inequality), with equality when all values are identical.
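The measures of central tendency in this section can be compared side by side. The sketch below uses Python's standard-library `statistics` module on a small made-up sample with one large value; the `trimmed_mean` helper is an illustrative assumption that drops $\lfloor \alpha n \rfloor$ observations from each tail.

```python
import math
import statistics

def trimmed_mean(values, alpha):
    """Mean after dropping floor(alpha * n) observations from each tail."""
    ordered = sorted(values)
    k = math.floor(alpha * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical sample with one extreme value (illustration only).
data = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 53.0]

arithmetic = statistics.fmean(data)          # pulled upward by 53.0
median = statistics.median(data)             # resistant to 53.0
trimmed = trimmed_mean(data, 0.10)           # drops 2.0 and 53.0
geometric = statistics.geometric_mean(data)  # requires all values > 0
harmonic = statistics.harmonic_mean(data)

print(f"mean={arithmetic}, median={median}, 10% trimmed={trimmed}")
print(f"geometric={geometric:.3f}, harmonic={harmonic:.3f}")
```

On this sample the mean is 10.0 while the median is 5.5 and the 10% trimmed mean is 5.625, which illustrates how a single extreme value shifts the mean but barely affects the robust alternatives.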
3.3 Measures of Dispersion
3.3.1 Range
The range is the simplest measure of spread:
$$R = x_{(n)} - x_{(1)} = \max(x) - \min(x)$$
The range is easy to compute and interpret but is maximally non-robust: it is determined entirely by the two most extreme observations and tends to increase as $n$ grows.
3.3.2 Interquartile Range
The interquartile range (IQR) is the range of the middle 50% of the data:
$$\text{IQR} = Q_3 - Q_1$$
Where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile. The IQR is the most widely used robust measure of dispersion. It is unaffected by the values of the most extreme observations and provides direct information about the spread of the central bulk of the data.
3.3.3 Percentiles and Quartiles
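The $p$-th percentile $P_p$ is the value below which $p\%$ of observations fall. For $n$ ordered observations, the percentile is computed using linear interpolation:
$$h = (n - 1)\frac{p}{100} + 1$$
If $h$ is an integer, $P_p = x_{(h)}$. Otherwise, $P_p = x_{(\lfloor h \rfloor)} + (h - \lfloor h \rfloor)\left(x_{(\lfloor h \rfloor + 1)} - x_{(\lfloor h \rfloor)}\right)$.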
Key percentiles:
| Percentile | Symbol | Alternative Name |
|---|---|---|
| 25th | $Q_1$ | First quartile, lower quartile |
| 50th | $Q_2$ | Median, second quartile |
| 75th | $Q_3$ | Third quartile, upper quartile |
| 10th, 20th, …, 90th | $D_1, \ldots, D_9$ | Deciles |
| 1st, 2nd, …, 99th | $P_1, \ldots, P_{99}$ | Percentiles |
⚠️ Multiple methods exist for computing percentiles (e.g., Type 7 in R, Excel's PERCENTILE.INC). These differ in how they handle boundary cases and interpolation. DataStatPro uses the standard linear interpolation method (Type 7 / inclusive method) by default. The method is stated in the output footnote.
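To make the Type 7 (inclusive) interpolation rule concrete, here is a minimal sketch in Python. The function name `percentile_type7` and the sample values are illustrative assumptions; it places the $p$-th percentile at the 1-based fractional position $h = (n-1)p/100 + 1$ and interpolates linearly between the neighbouring order statistics.

```python
import math

def percentile_type7(values, p):
    """p-th percentile (0 <= p <= 100) by Type 7 linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one observation is required")
    h = (n - 1) * p / 100 + 1      # 1-based fractional position
    lower = math.floor(h)          # index of the order statistic below h
    frac = h - lower               # distance towards the next order statistic
    if lower >= n:                 # p == 100 lands on the maximum
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + frac * (above - below)

# Hypothetical sample (illustration only).
data = [2, 4, 4, 5, 7, 9, 10, 12]
q1, q2, q3 = (percentile_type7(data, p) for p in (25, 50, 75))
print(q1, q2, q3)  # 4.0 6.0 9.25
```

Other percentile definitions (for example the exclusive method) would give slightly different quartiles on the same data, which is why the method should always be stated alongside the result.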
3.3.4 Variance
The sample variance measures the average squared deviation from the mean. It uses $n - 1$ in the denominator (Bessel's correction) to produce an unbiased estimate of the population variance $\sigma^2$:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
The divisor $n - 1$ reflects the one degree of freedom consumed in estimating $\mu$ by $\bar{x}$. Using $n$ instead of $n - 1$ yields the population variance formula (maximum likelihood estimator), which is biased downward, noticeably so in small samples.
3.3.5 Standard Deviation
The sample standard deviation is the square root of the sample variance:
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
The standard deviation is expressed in the same units as the original variable, making it directly interpretable as a typical distance from the mean. Under normality, approximately 68% of observations fall within $\bar{x} \pm s$.
3.3.6 Standard Error of the Mean
The standard error of the mean (SEM or SE) quantifies the precision of $\bar{x}$ as an estimate of $\mu$; it is the standard deviation of the sampling distribution of the mean:
$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$
As $n$ increases, $SE_{\bar{x}}$ decreases proportionally to $1/\sqrt{n}$. The SEM is not a measure of the spread of individual observations (that role belongs to $s$).
⚠️ A common and serious error is reporting the SEM instead of the SD as the measure of variability in individual observations. The SEM is always smaller than the SD and can give a misleading impression of how variable the data are. Use SD to describe variability in the data; use SEM to describe precision of the mean estimate.
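The distinction between the variance divisors, the SD, and the SEM is easy to see in a short computation. This is a minimal sketch on a made-up sample; the variable names are illustrative.

```python
import math
import statistics

# Hypothetical sample (illustration only).
data = [4.0, 8.0, 6.0, 5.0, 3.0, 7.0, 9.0, 6.0]
n = len(data)
mean = statistics.fmean(data)

# Sum of squared deviations from the sample mean.
sum_sq = sum((x - mean) ** 2 for x in data)

sample_variance = sum_sq / (n - 1)      # Bessel's correction (divisor n - 1)
population_variance = sum_sq / n        # divisor n; smaller than the above
sd = math.sqrt(sample_variance)         # spread of individual observations
sem = sd / math.sqrt(n)                 # precision of the sample mean

print(f"s^2={sample_variance}, s={sd}, SEM={sem:.4f}")
```

Here the SD is 2.0 while the SEM is roughly 0.71: reporting the SEM in place of the SD would understate the variability of individual observations by a factor of $\sqrt{n}$.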
3.3.7 Coefficient of Variation
The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean, enabling comparison of variability across variables measured on different scales or with different units:
$$CV = \frac{s}{\bar{x}} \times 100\%$$
The CV is a dimensionless measure of relative variability. It is only meaningful for ratio-scale variables (true zero exists) and when $\bar{x} > 0$. CV is widely used in clinical chemistry, analytical measurement science, and quality control.
| CV | Verbal Label |
|---|---|
| Low variability | |
| Moderate variability | |
| High variability | |
| Very high variability |
3.3.8 Mean Absolute Deviation
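The mean absolute deviation (MAD from mean) is a more outlier-resistant alternative to the standard deviation:
$$\text{MAD}_{\text{mean}} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert$$
Unlike the variance, which squares deviations (over-weighting outliers), the MAD from mean uses absolute deviations. Under normality, $\text{MAD}_{\text{mean}} = \sigma\sqrt{2/\pi} \approx 0.80\,\sigma$.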
3.3.9 Median Absolute Deviation
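The median absolute deviation (MAD from median) is the most robust common measure of spread:
$$\text{MAD}_{\text{median}} = \text{median}\left(\lvert x_i - \tilde{x} \rvert\right)$$
To use $\text{MAD}_{\text{median}}$ as a robust estimator of the standard deviation under normality, apply the consistency factor:
$$\hat{\sigma}_{\text{MAD}} = 1.4826 \times \text{MAD}_{\text{median}}$$
$\text{MAD}_{\text{median}}$ has a 50% breakdown point: it remains stable even when nearly half of the observations are outliers. DataStatPro reports both MAD measures, labelling them clearly as "MAD (from mean)" and "MAD (from median)".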
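The two MAD measures can be sketched in a few lines of Python. The function names and the sample are illustrative assumptions; the factor 1.4826 is the usual normal-consistency constant for the MAD from the median.

```python
import statistics

def mad_from_mean(values):
    """Mean of absolute deviations from the sample mean."""
    mean = statistics.fmean(values)
    return statistics.fmean(abs(x - mean) for x in values)

def mad_from_median(values):
    """Median of absolute deviations from the sample median."""
    med = statistics.median(values)
    return statistics.median(abs(x - med) for x in values)

# Hypothetical sample with one extreme value (illustration only).
data = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 53.0]

sd = statistics.stdev(data)
robust_sigma = 1.4826 * mad_from_median(data)  # normal-consistent scale estimate

print(f"SD={sd:.2f}, MAD(mean)={mad_from_mean(data):.2f}, "
      f"MAD(median)={mad_from_median(data):.2f}, robust sigma={robust_sigma:.2f}")
```

On this sample the single value 53.0 inflates the SD far above the robust estimate, while the MAD from the median (2.5) reflects only the spread of the central bulk.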
3.4 Measures of Shape
3.4.1 Skewness
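Skewness measures the degree and direction of asymmetry in the distribution. In terms of the sample central moments $m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k$:
$$g_1 = \frac{m_3}{m_2^{3/2}}$$
The bias-corrected sample skewness (Fisher's formula, used by most software including DataStatPro):
$$G_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$$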
| $G_1$ | Interpretation |
|---|---|
| $G_1 = 0$ | Perfectly symmetric |
| $G_1 > 0$ | Positively skewed (right-tailed; long right tail; mean > median) |
| $G_1 < 0$ | Negatively skewed (left-tailed; long left tail; mean < median) |
| $ | G_1 |
| $ | G_1 |
3.4.2 Kurtosis
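Kurtosis measures the heaviness of the tails and the peakedness of the distribution relative to a normal distribution:
$$g_2 = \frac{m_4}{m_2^2}$$
where $m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k$ is the $k$-th sample central moment; the population value for a normal distribution is 3.
The bias-corrected excess kurtosis (subtracting 3 so that a normal distribution has excess kurtosis = 0) is:
$$G_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$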
| $G_2$ | Distribution Type | Interpretation |
|---|---|---|
| $G_2 \approx 0$ | Mesokurtic | Normal-like tails |
| $G_2 > 0$ | Leptokurtic | Heavier tails than normal; more extreme outliers |
| $G_2 < 0$ | Platykurtic | Lighter tails than normal; fewer extreme values |
| $ | G_2 | > 2$ |
⚠️ Both skewness and kurtosis are sensitive to sample size and outliers. With small samples, these estimates are highly unstable. Use formal normality tests (Shapiro-Wilk) to complement visual inspection for normality assessment.
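As an illustration of the shape statistics, the sketch below implements the commonly used bias-corrected sample skewness and excess kurtosis (the adjusted Fisher-Pearson forms, with $s$ the sample SD). The function names are hypothetical, and this is one standard formulation rather than a statement of how any particular package computes them.

```python
import math

def _mean_and_sd(values):
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n, mean, sd

def sample_skewness(values):
    """Bias-corrected skewness G1; needs n >= 3 and non-constant data."""
    n, mean, sd = _mean_and_sd(values)
    if n < 3 or sd == 0:
        raise ValueError("skewness needs n >= 3 and non-zero spread")
    cubed = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

def sample_excess_kurtosis(values):
    """Bias-corrected excess kurtosis G2; needs n >= 4 and non-constant data."""
    n, mean, sd = _mean_and_sd(values)
    if n < 4 or sd == 0:
        raise ValueError("kurtosis needs n >= 4 and non-zero spread")
    fourth = sum(((x - mean) / sd) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

# Hypothetical samples (illustration only).
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 10.0]

print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(right_skewed))
```

The evenly spaced sample has zero skewness and negative excess kurtosis (a flat, light-tailed shape), while the second sample's long right tail yields positive skewness.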
3.5 Measures of Position
3.5.1 Z-Scores (Standardised Values)
The z-score of observation $x_i$ expresses its distance from the mean in standard deviation units:
$$z_i = \frac{x_i - \bar{x}}{s}$$
Z-scores enable comparison of observations from different variables or different populations. Under normality, $\lvert z_i \rvert > 3$ is a widely used criterion for flagging potential outliers.
3.5.2 Percentile Rank
The percentile rank of observation $x_i$ is the percentage of observations in the sample that are less than or equal to $x_i$:
$$PR(x_i) = \frac{\#\{j : x_j \le x_i\}}{n} \times 100$$
Percentile ranks are used for norm-referenced scoring (e.g., standardised test reporting) and for identifying the relative standing of individual observations.
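Both measures of position follow directly from their definitions. A minimal sketch, with illustrative function names and a made-up sample:

```python
import statistics

def z_scores(values):
    """Distance of each observation from the mean, in sample-SD units."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (divisor n - 1)
    return [(x - mean) / sd for x in values]

def percentile_rank(values, x):
    """Percentage of observations less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100.0 * at_or_below / len(values)

# Hypothetical sample (illustration only): mean 6.0, sample SD 2.0.
data = [4.0, 8.0, 6.0, 5.0, 3.0, 7.0, 9.0, 6.0]

z = z_scores(data)
print(z[data.index(9.0)])          # z-score of the largest value
print(percentile_rank(data, 6.0))  # share of values at or below 6.0
```

In this sample the value 9.0 lies 1.5 standard deviations above the mean, and 62.5% of observations are at or below 6.0.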
3.6 The Five-Number Summary
The five-number summary provides a compact, robust distributional snapshot:
$$\{\min,\; Q_1,\; \tilde{x},\; Q_3,\; \max\}$$
This summary is the basis for the box plot (Section 9.3) and directly communicates the range, central tendency, spread (IQR), and potential skewness of the distribution. It is resistant to outliers (except for the min and max).
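A five-number summary can be obtained with the standard library alone. This sketch assumes the inclusive (Type 7) quartile definition, which `statistics.quantiles` provides via `method="inclusive"`; the function name and sample are illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive (Type 7) quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Hypothetical sample (illustration only).
data = [2, 4, 4, 5, 7, 9, 10, 12]
summary = five_number_summary(data)
print(summary)  # (2, 4.0, 6.0, 9.25, 12)
```

Here the median sits closer to $Q_1$ than to $Q_3$ (a gap of 2.0 versus 3.25), hinting at mild right skew.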
3.7 Normality Assessment Statistics
3.7.1 Shapiro-Wilk Test
The Shapiro-Wilk test is one of the most powerful tests for normality, particularly in small to moderate samples. It measures the agreement between the ordered sample values and the expected order statistics under normality:
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Where $a_i$ are the Shapiro-Wilk coefficients derived from the expected normal order statistics. $W$ ranges from 0 to 1; values close to 1 indicate normality.
$H_0$: The data are drawn from a normal distribution.
If $p < \alpha$ (conventionally $\alpha = .05$), reject normality.
3.7.2 Kolmogorov-Smirnov Test (Lilliefors Correction)
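The Kolmogorov-Smirnov test (with Lilliefors correction for estimated parameters) computes the maximum absolute difference between the empirical CDF $F_n(x)$ and the normal CDF with estimated $\mu$ and $\sigma$:
$$D = \sup_x \left| F_n(x) - \Phi\!\left(\frac{x - \bar{x}}{s}\right) \right|$$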
The Lilliefors test is less powerful than Shapiro-Wilk for detecting non-normality but is useful as a supplementary check, especially for larger samples.
3.7.3 Skewness and Kurtosis Tests
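Formal tests for normality using the skewness and kurtosis statistics:
Standard errors under normality:
$$SE_{G_1} = \sqrt{\frac{6n(n-1)}{(n-2)(n+1)(n+3)}}, \qquad SE_{G_2} = 2\,SE_{G_1}\sqrt{\frac{n^2 - 1}{(n-3)(n+5)}}$$
Test statistics:
$$z_{G_1} = \frac{G_1}{SE_{G_1}}, \qquad z_{G_2} = \frac{G_2}{SE_{G_2}}$$
Values $\lvert z \rvert > 1.96$ suggest significant departure from normality at $\alpha = .05$.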
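The skewness and kurtosis z-tests need only the sample size and the shape statistics. The sketch below uses the standard-error formulas commonly quoted for the bias-corrected statistics under normality; the function names and the example value $G_1 = 0.60$ are hypothetical.

```python
import math

def se_skewness(n):
    """Standard error of bias-corrected skewness under normality."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Standard error of bias-corrected excess kurtosis under normality."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

def shape_z(statistic, standard_error):
    """z statistic for testing a shape measure against zero."""
    return statistic / standard_error

n = 100
g1 = 0.60  # hypothetical sample skewness (illustration only)
z_skew = shape_z(g1, se_skewness(n))

print(f"SE(G1)={se_skewness(n):.3f}, SE(G2)={se_kurtosis(n):.3f}")
print(f"z={z_skew:.2f}, departs from normality at the 5% level: {abs(z_skew) > 1.96}")
```

For $n = 100$ the standard errors are about 0.241 and 0.478, so a skewness of 0.60 gives $z \approx 2.49$, beyond the 1.96 cut-off.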
4. Considerations and Data Quality Checks
4.1 Identifying Outliers
An outlier is an observation that appears inconsistent with the bulk of the data. Outliers can arise from data entry errors, measurement malfunctions, genuine extreme values, or distributional heavy tails. It is critical to identify, investigate, and make a principled decision about outliers before computing and reporting descriptives.
Common outlier detection methods:
| Method | Rule | Appropriate When |
|---|---|---|
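| IQR fence (Tukey) | Outlier if $x_i < Q_1 - 1.5 \times \text{IQR}$ or $x_i > Q_3 + 1.5 \times \text{IQR}$ | General use; robust |
| Extreme fence (Tukey) | Extreme outlier if $x_i < Q_1 - 3 \times \text{IQR}$ or $x_i > Q_3 + 3 \times \text{IQR}$ | Identifying severe outliers |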
| Z-score rule | Outlier if $ | z_i |
| Modified Z-score (Iglewicz-Hoaglin) | Outlier if $\lvert M_i \rvert > 3.5$ where $M_i = 0.6745\,(x_i - \tilde{x}) / \text{MAD}_{\text{median}}$ | Robust; preferred for skewed data |
| Grubbs' test | Formal significance test for the single most extreme value | Normal distribution; single outlier |
| Visual inspection | Box plot, histogram, Q-Q plot | Always; complements formal methods |
⚠️ Outlier removal must be justified on scientific or methodological grounds, not statistical convenience. An outlier is not grounds for deletion merely because it is extreme — it may represent the most scientifically interesting observation. Document all outlier decisions transparently. DataStatPro flags outliers but never removes them automatically.
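Two of the robust flagging rules in the table can be sketched as follows. The function names are illustrative, quartiles are assumed to use the inclusive (Type 7) method, and the thresholds (1.5 × IQR and 3.5) are the conventional Tukey and Iglewicz-Hoaglin values. In keeping with the warning above, the functions only flag values and never remove them.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k=3 gives the extreme fence."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

def modified_z_outliers(values, threshold=3.5):
    """Values whose modified z-score (based on median and MAD) exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [x for x in values if abs(0.6745 * (x - med) / mad) > threshold]

# Hypothetical sample with one extreme value (illustration only).
data = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 53.0]

print(tukey_outliers(data))         # [53.0]
print(tukey_outliers(data, k=3.0))  # [53.0]
print(modified_z_outliers(data))    # [53.0]
```

All three rules agree on this sample; on real data they can disagree, which is one reason to pair them with a box plot or Q-Q plot before deciding how to treat a flagged value.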
4.2 Missing Data Assessment
Before any numerical summary is computed, the extent and pattern of missing data must be evaluated. Report $n_{\text{missing}}$ and the missing rate $n_{\text{missing}} / N$, where $N$ is the total number of observations. Investigate the missing data mechanism (MCAR, MAR, or MNAR) as in any statistical analysis.
Impact of missing data on numerical descriptives:
- Complete case analysis (default): Compute descriptives on valid observations only. Unbiased under MCAR; potentially biased under MAR or MNAR.
- Mean/median imputation: Replace missing values with the sample mean or median. Reduces apparent variability; distorts distributional shape. Not recommended.
- Multiple imputation: Statistically principled but complex; appropriate for inference, less so for purely descriptive purposes.
DataStatPro reports $n_{\text{valid}}$ and $n_{\text{missing}}$ in all output tables.
4.3 Checking Variable Type and Scale
Before computing numerical descriptives, confirm that the variable is genuinely measured on an interval or ratio scale. Common pitfalls:
| Variable | Problem | Correct Action |
|---|---|---|
| Likert item (1–5) | Ordinal, not interval | Report categorical and/or ordinal descriptives |
| Coded categorical (1 = Male, 2 = Female) | Nominal, not numerical | Recode and use categorical descriptives |
| Year of birth | Interval; ratios not meaningful | Report mean year, not "2× older" |
| Count (non-negative integer) | Discrete ratio; may be skewed | Consider median and IQR; check for zero-inflation |
4.4 Distributional Assumptions
Many uses of numerical descriptives (confidence intervals for the mean, standard error interpretation, power calculations) assume approximate normality. Check this assumption using:
- Histograms: Assess overall shape visually.
- Q-Q plot: Plot sample quantiles against theoretical normal quantiles; departures from the diagonal indicate non-normality.
- Shapiro-Wilk test: Formal test of normality.
- Skewness and kurtosis: Numerical shape indicators.
When data are non-normal, prefer robust measures (median, IQR, MAD) over mean-based summaries, especially with small samples.
4.5 Sample Size Adequacy
The stability and interpretability of numerical descriptives depend on :
| Guidance | |
|---|---|
| Descriptives are very unstable; report individual values or at most min, max, median | |
| Mean and SD interpretable; normality hard to assess; use median/IQR as supplement | |
| Most descriptives reasonably stable; normality assessment meaningful | |
| Descriptives reliable; shape statistics informative; skewness and kurtosis stable | |
| Very precise estimates; even tiny departures from normality detected by formal tests |
4.6 Significant Figures and Rounding
Numerical descriptives should be reported to a level of precision consistent with the measurement instrument:
- Report the mean and standard deviation to one more decimal place than the raw data.
- Report the median and IQR to the same precision as the raw data.
- Report test statistics (such as $W$, $D$, and $z$) to two decimal places.
- Report p-values to two or three decimal places (or as $p < .001$).
- Do not report spurious precision (e.g., a mean quoted to six decimal places for data measured to the nearest integer).
4.7 Weighted Data
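For survey data with design weights $w_i$, apply weights when computing all numerical descriptives to produce population-representative estimates. For example, the weighted mean is:
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$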
DataStatPro supports weighted numerical descriptives when a weight variable is specified, reporting both unweighted and weighted estimates side by side.
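As a small illustration of weighting, the sketch below computes a weighted mean as the weight-scaled sum of values divided by the total weight. The function name, values, and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical survey values and design weights (illustration only).
values = [10.0, 20.0, 30.0]
weights = [1.0, 1.0, 2.0]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(unweighted, weighted)  # 20.0 22.5
```

The third respondent represents twice as many population members as the others, so the weighted estimate moves towards 30.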
5. Types of Numerical Descriptive Measures
5.1 Measures of Central Tendency
| Measure | Formula | Robust? | Best For |
|---|---|---|---|
| Arithmetic mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | No | Symmetric distributions; no outliers |
| Median | Middle value of sorted data | Yes | Skewed distributions; outliers present |
| Mode | Most frequent value | Yes | Discrete data; bimodal distributions |
| Trimmed mean ($\alpha$) | Mean after removing a proportion $\alpha$ from each tail | Moderate | Mild outliers; alternative to the mean |
| Geometric mean | $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$ | No | Multiplicative data; log-normal distributions |
| Harmonic mean | $n \big/ \sum_{i=1}^{n} (1/x_i)$ | No | Rates and ratios |
5.2 Measures of Dispersion
| Measure | Formula | Robust? | Best For |
|---|---|---|---|
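| Range | $x_{(n)} - x_{(1)}$ | No | Quick overview; not for inference |
| IQR | $Q_3 - Q_1$ | Yes | Paired with median; skewed data |
| Variance | $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ | No | Theoretical derivations; inferential tests |
| Standard deviation | $s = \sqrt{s^2}$ | No | Symmetric data; normal distribution |
| SEM | $s / \sqrt{n}$ | No | Precision of mean estimate; confidence intervals |
| CV | $(s / \bar{x}) \times 100\%$ | No | Comparing variability across different scales |
| MAD (mean) | $\frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert$ | Moderate | Robust alternative to SD |
| MAD (median) | $\text{median}(\lvert x_i - \tilde{x} \rvert)$ | Yes | Maximum robustness; extreme outliers |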
5.3 Measures of Shape
| Measure | Formula | Normal Value | Interpretation |
|---|---|---|---|
| Skewness ($G_1$) | Third standardised central moment | 0 | Asymmetry of distribution |
| Excess kurtosis ($G_2$) | Fourth standardised central moment | 0 | Tail heaviness relative to normal |
5.4 Measures of Position
| Measure | Definition | Use |
|---|---|---|
| Minimum ($x_{(1)}$) | Smallest observation | Range; outlier detection |
| Maximum ($x_{(n)}$) | Largest observation | Range; outlier detection |
| Percentiles ($P_p$) | Value below which $p$% of data fall | Norm-referenced scores; reference ranges |
| Quartiles ($Q_1, Q_2, Q_3$) | 25th, 50th, 75th percentiles | Five-number summary; box plot |
| Z-score | $z_i = (x_i - \bar{x}) / s$ | Standardisation; outlier detection |
5.5 Five-Number Summary
Min, $Q_1$, median, $Q_3$, Max: a compact, robust distributional description that forms the basis for box plots.
5.6 Normality Diagnostics
| Diagnostic | Type | Output |
|---|---|---|
| Shapiro-Wilk test | Formal test | $W$ statistic and p-value |
| Kolmogorov-Smirnov (Lilliefors) | Formal test | $D$ statistic and p-value |
| Skewness z-test | Formal test | $z_{G_1}$ and p-value |
| Kurtosis z-test | Formal test | $z_{G_2}$ and p-value |
| Histogram | Visual | Shape assessment |
| Q-Q plot | Visual | Quantile-by-quantile deviation from normality |
| Box plot | Visual | Symmetry, spread, and outlier identification |
6. Using the Numerical Descriptives Calculator Component
The Numerical Descriptives Calculator in DataStatPro provides a fully featured tool for computing, diagnosing, visualising, and reporting descriptive statistics for numerical variables.
Step-by-Step Guide
Step 1 — Navigate to the Component
Go to Descriptive Statistics → Numerical Descriptives.
Step 2 — Input Method
Choose how to provide your data:
- Raw data: Upload a CSV/Excel file or paste a column of numeric values. DataStatPro automatically detects the variable type, identifies non-numeric entries, and flags missing values.
- Multiple variables: Select two or more numeric columns to run batch descriptives across all selected variables simultaneously, producing a comparative summary table.
- Grouped analysis: Designate a categorical grouping variable to compute descriptives separately within each group, enabling direct comparison of distributional summaries across subgroups.
Step 3 — Variable Configuration
- Assign a meaningful variable name and unit of measurement for display.
- Specify the measurement scale (interval or ratio) to unlock scale-appropriate measures (e.g., CV requires ratio scale).
- Specify a grouping variable (optional) to produce stratified descriptives.
- Specify a weight variable (optional) for survey-weighted estimates.
- Designate whether log-transformation should be applied for geometric mean reporting (appropriate for log-normal data).
Step 4 — Missing Data Handling
Select one of the following:
- Exclude missing (valid only): All summaries computed on valid observations. $n_{\text{missing}}$ reported separately.
- Flag and exclude: Missing values flagged and listed; summaries exclude missing.
Step 5 — Outlier Handling
- Select outlier detection method: Tukey IQR fence (default), Z-score (), or Modified Z-score (Iglewicz-Hoaglin).
- Choose whether to flag only or to compute descriptives both with and without flagged outliers (DataStatPro never removes outliers automatically).
Step 6 — Set Display Options
- ✅ $n_{\text{valid}}$, $n_{\text{missing}}$, and total $N$.
- ✅ Mean, median, mode, trimmed mean (selectable $\alpha$), geometric mean, harmonic mean.
- ✅ Minimum, maximum, range.
- ✅ $Q_1$, $Q_3$, IQR, and full percentile table (selectable percentile set).
- ✅ Variance ($s^2$), standard deviation ($s$), SEM ($SE_{\bar{x}}$).
- ✅ Coefficient of variation (CV).
- ✅ MAD (from mean and from median), with robust $\hat{\sigma}_{\text{MAD}}$.
- ✅ Skewness ($G_1$) and excess kurtosis ($G_2$) with standard errors and z-tests.
- ✅ Five-number summary table.
- ✅ Shapiro-Wilk and Lilliefors normality tests with p-values.
- ✅ Outlier table with flagging method, z-score, and modified z-score.
- ✅ 95% confidence intervals for the mean (t-based), median (bootstrap), and SD.
- ✅ Histogram with optional normal curve overlay, density curve, and rug plot.
- ✅ Box plot (standard and notched).
- ✅ Violin plot.
- ✅ Q-Q plot with confidence band.
- ✅ Empirical cumulative distribution function (ECDF) plot.
- ✅ Dot plot / strip chart.
- ✅ Stem-and-leaf display (for small samples).
- ✅ Grouped comparison plots (when grouping variable specified).
- ✅ APA 7th edition results paragraph (auto-generated).
- ✅ Publication-ready descriptives table.
Step 7 — Run the Analysis
Click "Compute Numerical Descriptives". DataStatPro will:
- Validate data: check variable type, identify non-numeric entries, count missing values.
- Compute the complete set of central tendency, dispersion, shape, and position measures.
- Detect outliers using the selected method and produce an outlier report.
- Run Shapiro-Wilk and Lilliefors normality tests.
- Compute 95% CIs for the mean (t-distribution), median (bootstrap), and SD (chi-square).
- Generate all selected visualisations with customisable formatting.
- Produce the APA-compliant results paragraph and formatted descriptives table.
7. Step-by-Step Procedure
7.1 Full Manual Procedure
Step 1 — Define the Variable
State the variable name, unit of measurement, scale (interval or ratio), and the population of observations. Confirm that the variable is genuinely numerical.
Step 2 — Count Total and Missing Observations
Report $n_{\text{valid}}$, $n_{\text{missing}}$, and the missing rate $n_{\text{missing}} / N$. Apply the chosen missing data strategy before proceeding.
Step 3 — Sort the Data
Sort the observations in ascending order to obtain the order statistics $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
Step 4 — Compute Central Tendency Measures
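Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: $\tilde{x} = x_{((n+1)/2)}$ if $n$ is odd; $\tilde{x} = \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$ if $n$ is even
Mode: Identify the most frequently occurring value(s).
Geometric mean (if applicable): $\bar{x}_G = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right)$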
Step 5 — Compute the Five-Number Summary
Identify Min, $Q_1$, median, $Q_3$, and Max, using linear interpolation for $Q_1$ and $Q_3$.
Step 6 — Compute Dispersion Measures
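Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard deviation: $s = \sqrt{s^2}$
Standard error of the mean: $SE_{\bar{x}} = s / \sqrt{n}$
Coefficient of variation (ratio scale only): $CV = (s / \bar{x}) \times 100\%$
Median absolute deviation: $\text{MAD}_{\text{median}} = \text{median}\left(\lvert x_i - \tilde{x} \rvert\right)$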
Step 7 — Compute Shape Measures
Skewness ($G_1$) and excess kurtosis ($G_2$) using bias-corrected formulae (Section 3.4).
Compute $z_{G_1}$ and $z_{G_2}$ to formally test departure from normality.
Step 8 — Detect Outliers
Apply Tukey's IQR fence:
$$\left[\,Q_1 - 1.5 \times \text{IQR},\; Q_3 + 1.5 \times \text{IQR}\,\right]$$
Flag any $x_i$ outside the fence as a potential outlier. Investigate each flagged observation individually.
Step 9 — Assess Normality
Compute the Shapiro-Wilk statistic $W$ (where the sample size is within the range the test supports). Inspect the histogram and Q-Q plot. Report skewness and kurtosis. Conclude whether the normality assumption is tenable.
Step 10 — Compute Confidence Intervals
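95% CI for the mean (t-based):
$$\bar{x} \pm t_{0.975,\,n-1} \times \frac{s}{\sqrt{n}}$$
Where $t_{0.975,\,n-1}$ is the critical value from the t-distribution with $n - 1$ degrees of freedom ($\approx 2.045$ for $n = 30$; $\to 1.96$ as $n \to \infty$).
95% CI for the standard deviation (chi-square based):
$$\sqrt{\frac{(n-1)s^2}{\chi^2_{0.975,\,n-1}}} \le \sigma \le \sqrt{\frac{(n-1)s^2}{\chi^2_{0.025,\,n-1}}}$$
where $\chi^2_{q,\,n-1}$ is the $q$ quantile of the chi-square distribution with $n - 1$ degrees of freedom.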
95% CI for the median (bootstrap): Use bootstrap resamples to compute the percentile bootstrap CI.
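The percentile bootstrap for the median can be sketched with the standard library. The function name, the default of 2,000 resamples, and the fixed seed are assumptions made for this illustration, not DataStatPro settings.

```python
import random
import statistics

def bootstrap_median_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower_index = round(tail * n_resamples)
    upper_index = round((1 - tail) * n_resamples) - 1
    return medians[lower_index], medians[upper_index]

# Hypothetical sample with one extreme value (illustration only).
data = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 53.0]

lower, upper = bootstrap_median_ci(data)
print(f"median={statistics.median(data)}, 95% CI=({lower}, {upper})")
```

The interval endpoints are simply the 2.5th and 97.5th percentiles of the resampled medians. With very small samples the bootstrap distribution of the median is coarse, so the interval should be read as approximate.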
Step 11 — Produce Visualisations
Select appropriate chart types (see Section 9), annotate with key descriptive values (mean, median, SD), and ensure axes are labelled with variable name and units.
Step 12 — Interpret and Report
Use APA reporting guidelines (Section 15). Always report $n_{\text{valid}}$, $n_{\text{missing}}$, mean, SD, median, IQR, min, max, skewness, and the result of the normality assessment. Report CIs for the mean. For non-normal distributions, emphasise the median and IQR.
8. Interpreting the Output
8.1 Central Tendency: Mean vs. Median
| Relationship | Distribution Shape | Preferred Measure |
|---|---|---|
| Mean $\approx$ Median | Symmetric | Either; report both |
| Mean $>$ Median (substantially) | Positively skewed (right tail) | Median |
| Mean $<$ Median (substantially) | Negatively skewed (left tail) | Median |
| Mean $\gg$ Median | Outliers or extreme right skew | Median; investigate outliers |
| Large discrepancy; small $n$ | Insufficient data to assess | Report both with caution |
8.2 Interpreting the Standard Deviation
| Relationship Between $s$ and $\bar{x}$ | Interpretation |
|---|---|
| $s \ll \bar{x}$ (small CV) | Values tightly clustered around the mean |
| $s \approx \bar{x}$ (CV $\approx 100\%$) | Very high relative variability |
| $s > \bar{x}$ (CV $> 100\%$) | Extreme variability; check for outliers or zero-inflation |
| Under normality: most values in $\bar{x} \pm 2s$ | Approximately 95% of observations lie in this range |
8.3 Interpreting the IQR and Five-Number Summary
| Five-Number Feature | Interpretation |
|---|---|
| Symmetric: $Q_2 - Q_1 \approx Q_3 - Q_2$ (where $Q_2$ is the median) | Symmetric distribution |
| $Q_2$ closer to $Q_1$ than $Q_3$ | Positively skewed |
| $Q_2$ closer to $Q_3$ than $Q_1$ | Negatively skewed |
| Large gap between $Q_3$ and Max | Outliers or long right tail |
| Large gap between Min and $Q_1$ | Outliers or long left tail |
| Narrow IQR relative to Range | Outliers or extreme tail values dominate range |
8.4 Interpreting Skewness and Kurtosis
| Skewness ($G_1$) | Shape Interpretation |
|---|---|
| Near zero | Approximately symmetric |
| Moderately positive | Moderately positively skewed |
| Large positive | Substantially positively skewed |
| Moderately negative | Moderately negatively skewed |
| Large negative | Substantially negatively skewed |

| Excess Kurtosis ($G_2$) | Tail Interpretation |
|---|---|
| Near zero | Normal-like tails |
| Positive | Heavier tails than normal; more outlier-prone |
| Large positive | Substantially heavy tails (e.g., financial returns) |
| Negative | Lighter tails than normal; bounded distribution |
8.5 Interpreting Normality Tests
| Normality Assessment Result | Implication for Reporting |
|---|---|
| $W$ close to 1; $p > .05$ | Consistent with normality; mean ± SD appropriate |
| $W$ noticeably below 1; $p < .05$ | Evidence of non-normality; prefer median and IQR |
| Clear non-normality; transformation or non-parametric methods indicated | |
| Large () with but small $ | G_1 |
⚠️ Normality tests are extremely sensitive to sample size. With very large $n$, trivial departures from normality will be statistically significant. With very small $n$, even severe departures may not be detected. Always combine formal tests with visual inspection (histogram and Q-Q plot). The practical question is not whether the data are perfectly normal, but whether the departure is large enough to affect the validity of subsequent analyses.
8.6 The Mean-SD vs. Median-IQR Decision
This is one of the most frequently encountered reporting decisions in numerical descriptives:
| Use Mean ± SD When | Use Median [IQR] When |
|---|---|
| Distribution is approximately normal | Distribution is clearly skewed |
| No substantial outliers | Outliers are present |
| $n$ is reasonably large | $n$ is small |
| Parametric inferential tests are planned | Non-parametric tests are planned |
| Data are ratio-scale and symmetric | Data are bounded (e.g., floor/ceiling effects) |
⚠️ The notation "Mean ± SD" means mean plus or minus one standard deviation. "Mean ± SEM" means mean plus or minus one standard error of the mean. These are very different quantities. Always specify which you are reporting. Most journals recommend reporting SD (not SEM) as the measure of variability when describing the sample.
8.7 Contextualising Descriptives: Reference Values
Numerical descriptives are most useful when compared against reference benchmarks:
| Benchmark Type | Example | Source |
|---|---|---|
| Clinical reference range | Blood pressure: systolic 90–120 mmHg | Clinical guidelines |
| Historical baseline | Company revenue: prior year mean ± SD | Internal records |
| Population norm | BMI: adult population median 25–26 kg/m² | Epidemiological surveys |
| Theoretical value | Fair coin flip: proportion heads = 0.50 | Mathematical model |
| Regulatory limit | Contaminant level: max 10 ppb | Government regulation |
9. Visualising Numerical Data
9.1 Histogram
The histogram is the most fundamental visualisation for continuous numerical data. It divides the value range into contiguous bins of equal width and displays the frequency (or density) of observations in each bin as a bar.
Construction choices:
- Number of bins: Too few bins obscure distributional shape; too many create noisy fluctuations. Common rules:
  - Sturges' rule: $k = \lceil \log_2 n \rceil + 1$
  - Freedman-Diaconis rule: Bin width $h = 2 \times IQR \times n^{-1/3}$ (robust, recommended)
  - Scott's rule: Bin width $h = 3.49 \times s \times n^{-1/3}$ (assumes normality)
- Frequency vs. density: Use density on the y-axis when overlaying a theoretical probability distribution curve.
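The three binning rules can be compared side by side with a short Python sketch. It assumes the usual textbook forms of the rules (Sturges: $\lceil \log_2 n \rceil + 1$ bins; Freedman-Diaconis: width $2 \times IQR \times n^{-1/3}$; Scott: width $3.49 \times s \times n^{-1/3}$); the function name `bin_rules` is illustrative, and the data are the exam scores from Example 1.

```python
import math
import statistics

def bin_rules(values):
    """Return (Sturges bin count, Freedman-Diaconis width, Scott width)."""
    n = len(values)
    s = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    sturges_bins = math.ceil(math.log2(n)) + 1
    fd_width = 2 * (q3 - q1) / n ** (1 / 3)      # based on the IQR, robust to outliers
    scott_width = 3.49 * s / n ** (1 / 3)        # based on the SD, assumes normality
    return sturges_bins, fd_width, scott_width

scores = [42, 51, 55, 58, 61, 63, 65, 67, 68, 70, 71,
          72, 73, 74, 75, 76, 78, 80, 82, 85, 88, 93]
sturges_bins, fd_width, scott_width = bin_rules(scores)
```

A bin width converts to a bin count by dividing the data range by the width and rounding up, which makes the rules directly comparable.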
Reading a histogram:
- Overall shape (symmetric, skewed, bimodal, uniform).
- Location of the peak (mode).
- Spread (width of the bulk of the distribution).
- Outliers (isolated bars far from the main body).
Best practices:
- Overlay a normal curve to assess departures from normality visually.
- Optionally add a rug plot (tick marks below the x-axis for each observation).
- Label the x-axis with the variable name and units; label the y-axis as "Frequency" or "Density".
9.2 Density Plot (Kernel Density Estimate)
The kernel density estimate (KDE) is a smoothed, continuous version of the histogram, constructed by placing a smooth kernel function $K$ (typically Gaussian) at each observed value and summing:
$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$
Where $h$ is the bandwidth (smoothing parameter). A larger $h$ produces a smoother curve; a smaller $h$ reveals more local features. DataStatPro uses Silverman's rule of thumb for the default bandwidth:
$$h = 0.9 \times \min\!\left(s,\ \frac{IQR}{1.34}\right) \times n^{-1/5}$$
Advantages over histogram: Continuous, smooth, and not dependent on bin boundary choices. Excellent for comparing multiple distributions on the same plot.
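To make the construction concrete, here is a minimal sketch of a Gaussian KDE in plain Python. It assumes the common form of Silverman's rule, $h = 0.9 \min(s, IQR/1.34)\, n^{-1/5}$; the function names and the sample values are hypothetical, and DataStatPro's internal implementation may differ.

```python
import math
import statistics

def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth: 0.9 * min(s, IQR / 1.34) * n^(-1/5)."""
    n = len(values)
    s = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 0.9 * min(s, (q3 - q1) / 1.34) * n ** (-1 / 5)

def gaussian_kde(values, h):
    """Return a function x -> estimated density using a Gaussian kernel."""
    n = len(values)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in values)
    return density

data = [1.0, 2.0, 2.5, 3.0, 4.5]          # hypothetical observations
h = silverman_bandwidth(data)
f = gaussian_kde(data, h)
# Riemann sum over [-10, 15) as a sanity check that the density integrates to about 1
area = sum(f(-10 + 0.01 * i) for i in range(2500)) * 0.01
```

Passing a larger or smaller `h` to `gaussian_kde` shows directly how the bandwidth trades smoothness against local detail.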
9.3 Box Plot
The box plot (box-and-whisker plot) is a compact, robust visualisation of the five-number summary and outliers:
- Box: Spans $Q_1$ to $Q_3$; its length represents the IQR.
- Median line: Line inside the box at the median ($Q_2$).
- Whiskers: Extend to the most extreme observations within the Tukey fences ($Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$).
- Outlier points: Individual observations beyond the whisker fences are plotted as dots or circles.
- Notched box plot: V-shaped notches around the median line indicate an approximate 95% CI for the median. Non-overlapping notches suggest the medians of two groups differ significantly.
Best practices:
- Include a jittered strip of individual data points overlaid on the box plot when (to avoid hiding the raw data).
- For group comparisons, align box plots side by side with a common y-axis.
- Always specify whether whiskers represent 1.5×IQR (Tukey, standard), 2×IQR, or min/max.
9.4 Violin Plot
The violin plot combines the compact shape of a box plot with the distributional detail of a density plot. Each "violin" is a mirrored kernel density estimate, showing the full distributional shape. A box plot or five-number summary is often overlaid.
Advantages over box plots: Reveals distributional shape (bimodality, skewness, gaps) that box plots conceal. Particularly useful for comparing distributions across multiple groups.
9.5 Q-Q Plot (Quantile-Quantile Plot)
The Q-Q plot assesses normality by plotting the sample quantiles against the theoretical quantiles of a standard normal distribution:
- If the data are approximately normal, points fall on or near the diagonal reference line.
- Systematic deviations from the diagonal indicate non-normality:
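- S-shaped curve (points below the line at the low end and above it at the high end, with sample quantiles on the vertical axis): Over-dispersed (heavier tails than normal).
- Reversed S-shaped curve (points above the line at the low end and below it at the high end): Under-dispersed (lighter tails than normal).
- Curved (banana-shaped) pattern with points above the line at both ends: Positive skew.
- Curved (banana-shaped) pattern with points below the line at both ends: Negative skew.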
- Points far off the line at one end: Outliers.
DataStatPro overlays a 95% confidence band (Kolmogorov-Smirnov envelope) on the Q-Q plot. Points outside the band indicate significant departures from normality.
9.6 Stem-and-Leaf Plot
The stem-and-leaf plot is a text-based display that shows the full distribution while preserving every individual value. Each observation is split into a "stem" (leading digits) and a "leaf" (trailing digit). The stems are listed vertically, and the leaves are arranged horizontally.
Appropriate for: small samples; exploratory data analysis; teaching contexts. The back-to-back stem-and-leaf plot compares two groups simultaneously.
9.7 Empirical Cumulative Distribution Function (ECDF)
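The ECDF plots the proportion of observations less than or equal to each value $x$:
$$\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(x_i \le x)$$
The ECDF is a step function that rises by $1/n$ at each observed value (or by $k/n$ where $k$ observations are tied). It is a non-parametric estimate of the true CDF and is used to:
- Identify percentiles visually (read the $x$-value corresponding to any $\hat{F}_n(x) = p$).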
- Compare two distributions (Kolmogorov-Smirnov test is based on the maximum vertical distance between two ECDFs).
- Assess goodness-of-fit to a theoretical distribution.
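The ECDF is simple enough to build by hand. The sketch below (the helper name `ecdf` is illustrative) sorts the data once and then counts, for any $x$, how many observations are less than or equal to it.

```python
from bisect import bisect_right

def ecdf(values):
    """Return a function F where F(x) is the proportion of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n
    return F

F = ecdf([3, 1, 2, 2])   # hypothetical sample with one tie
```

Here `F(2)` is 0.75 because three of the four values are at most 2, and the tie at 2 produces a single jump of $2/4$.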
9.8 Dot Plot / Strip Chart
A strip chart (also called a dot plot or jitter plot) displays every individual observation as a dot along a single axis, with a small random vertical jitter to prevent overplotting. It is the most information-rich plot for small to medium samples ().
Advantages: Shows the actual data; reveals gaps, clusters, and potential outliers not visible in summary plots. Highly recommended as a supplement to box plots.
9.9 Error Bar Plot
An error bar plot displays the mean (or median) as a point and uncertainty as symmetric bars. Common configurations:
| Bar Type | What It Represents | When to Use |
|---|---|---|
| Mean ± SD | Variability of individual observations | Describing sample variability |
| Mean ± SEM | Precision of the mean estimate | Showing estimation precision |
| Mean ± 95% CI | 95% CI for the population mean | Inferential comparisons |
⚠️ Always label the error bar type explicitly in figure captions. An unlabelled error bar is uninterpretable — ± SD, ± SEM, and ± 95% CI have very different meanings and widths. Many published figures fail to specify this information.
9.10 Visualisation Selection Guide
| Purpose | Sample Size ($n$) | Recommended Chart(s) |
|---|---|---|
| Full distributional shape | Any | Histogram + density curve |
| Summary of centre and spread | Any | Box plot (+ strip chart for small $n$) |
| Normality assessment | Any | Q-Q plot + histogram |
| Individual data visibility | Small to medium | Strip chart / dot plot |
| Smooth density estimation | Moderate or larger | Violin plot or KDE |
| Cumulative distribution | Any | ECDF plot |
| Group comparisons | Any | Side-by-side box plots; violin plots |
| Precise distributional shape | Small | Stem-and-leaf plot |
| Reporting central tendency with uncertainty | Any | Error bar plot (specify type) |
10. Confidence Intervals for Numerical Descriptives
10.1 Why Confidence Intervals Are Essential
Sample descriptive statistics are estimates of population parameters. A 95% confidence interval (CI) provides a range of plausible values for the true population parameter, given the observed sample statistic. CIs convey both the direction and the precision of an estimate, making them far more informative than a point estimate alone.
10.2 CI for the Mean: t-Distribution
For any numerical variable, the 95% CI for the population mean $\mu$ uses the t-distribution:
$$\bar{x} \pm t_{0.975,\,n-1} \times \frac{s}{\sqrt{n}}$$
Where $t_{0.975,\,n-1}$ is the upper 2.5% critical value of the t-distribution with $n - 1$ degrees of freedom. This CI is exact when the population is normal and asymptotically valid for non-normal populations with sufficiently large $n$ (by the Central Limit Theorem).
Critical values for common $n$:
| $n$ | $t_{0.975,\,n-1}$ (95% CI) | $t_{0.995,\,n-1}$ (99% CI) |
|---|---|---|
| 5 | 2.776 | 4.604 |
| 10 | 2.262 | 3.250 |
| 20 | 2.093 | 2.861 |
| 30 | 2.045 | 2.756 |
| 60 | 2.000 | 2.660 |
| 120 | 1.980 | 2.617 |
| $\infty$ | 1.960 | 2.576 |
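As a minimal sketch, the t-based interval can be computed from summary statistics alone. Python's standard library has no t-distribution quantile function, so the critical value is passed in by hand (here 2.045 for $n = 30$); the function name is illustrative, and the inputs are the caffeine-group summary statistics from Example 3.

```python
import math

def t_confidence_interval(mean, sd, n, t_crit):
    """Return (lower, upper) for mean +/- t_crit * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Caffeine condition in Example 3: mean 287.3, SD 23.5, n = 30
lower, upper = t_confidence_interval(mean=287.3, sd=23.5, n=30, t_crit=2.045)
```

Rounded to one decimal place, the result reproduces the interval [278.5, 296.1] reported for that group.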
10.3 CI for the Mean: Bootstrap (Non-Normal Data)
When normality is violated and $n$ is small, a bootstrap CI for the mean provides better coverage. The percentile bootstrap proceeds as follows:
- Draw a large number $B$ of bootstrap resamples of size $n$ with replacement.
- Compute $\bar{x}^*_b$ for each resample $b = 1, \dots, B$.
- The 95% CI is the 2.5th and 97.5th percentiles of the bootstrap means: $[\bar{x}^*_{(0.025)},\ \bar{x}^*_{(0.975)}]$.
DataStatPro uses the bias-corrected and accelerated (BCa) bootstrap by default, which provides better coverage than the simple percentile bootstrap, especially for small samples.
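The simple percentile bootstrap can be sketched in a few lines of Python. This is not the BCa method that DataStatPro uses by default; it is the plain percentile version, and the function name, the choice of 2,000 resamples, and the fixed seed are all assumptions made for illustration. The data are the exam scores from Example 1.

```python
import random
import statistics

def percentile_bootstrap_ci(values, stat=statistics.fmean,
                            n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for any statistic of a single sample."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    n = len(values)
    estimates = sorted(
        stat(rng.choices(values, k=n))        # resample with replacement
        for _ in range(n_resamples)
    )
    alpha = 1 - level
    lo_idx = round(alpha / 2 * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

scores = [42, 51, 55, 58, 61, 63, 65, 67, 68, 70, 71,
          72, 73, 74, 75, 76, 78, 80, 82, 85, 88, 93]
mean_ci = percentile_bootstrap_ci(scores)
median_ci = percentile_bootstrap_ci(scores, stat=statistics.median)
```

Because the statistic is passed as a function, the same helper covers the mean, the median, or any other single-sample summary.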
10.4 CI for the Median: Bootstrap
No simple closed-form CI exists for the median. DataStatPro uses the BCa bootstrap CI:
- Draw $B$ bootstrap resamples of size $n$ with replacement.
- Compute the median for each resample.
- The 95% CI is the BCa-corrected percentile interval of the bootstrap medians.
10.5 CI for the Standard Deviation: Chi-Square Distribution
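The exact 95% CI for the population standard deviation $\sigma$ uses the chi-square distribution (assuming normality):
$$\left[\sqrt{\frac{(n-1)s^2}{\chi^2_{0.975,\,n-1}}},\ \sqrt{\frac{(n-1)s^2}{\chi^2_{0.025,\,n-1}}}\right]$$
where $\chi^2_{p,\,n-1}$ is the $p$-quantile of the chi-square distribution with $n - 1$ degrees of freedom.
⚠️ The chi-square CI for $\sigma$ is sensitive to departures from normality, more so than the t-CI for $\mu$. Use the bootstrap CI for $\sigma$ when the distribution is clearly non-normal.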
10.6 CI for the Variance
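The exact 95% CI for the population variance $\sigma^2$ (assuming normality), with $\chi^2_{p,\,n-1}$ the $p$-quantile of the chi-square distribution with $n - 1$ degrees of freedom:
$$\left[\frac{(n-1)s^2}{\chi^2_{0.975,\,n-1}},\ \frac{(n-1)s^2}{\chi^2_{0.025,\,n-1}}\right]$$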
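10.7 CI Width as a Function of $n$ and $s$
The width of the 95% CI for the mean is approximately $2 \times 1.96 \times s / \sqrt{n}$. For a given $s$, quadrupling $n$ halves the CI width:
$$\text{Width}(4n) \approx 2 \times 1.96 \times \frac{s}{\sqrt{4n}} = \tfrac{1}{2}\,\text{Width}(n)$$
Required $n$ to achieve target CI half-width $E$ (95% CI, $z = 1.96$):
$$n = \left(\frac{1.96\,\sigma}{E}\right)^2$$
Example: If $\sigma = 15$ (e.g., IQ scores), to achieve a CI half-width of ±3 points: $n = (1.96 \times 15 / 3)^2 = 96.04$, rounded up to $n = 97$.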
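The sample-size calculation can be checked with a short Python sketch. It assumes the normal-approximation formula $n = (z\sigma / E)^2$ with $\sigma = 15$ (the IQ-type scale) and a target half-width of 3 points; the function name `required_n` is illustrative.

```python
import math
from statistics import NormalDist

def required_n(sigma, half_width, level=0.95):
    """Smallest n for which z * sigma / sqrt(n) <= half_width."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    return math.ceil((z * sigma / half_width) ** 2)

n_for_3_points = required_n(sigma=15, half_width=3)
n_for_1_5_points = required_n(sigma=15, half_width=1.5)
```

Halving the target half-width from 3 to 1.5 points raises the required sample size from 97 to 385, roughly a fourfold increase, which is the square-root relationship described above.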
10.8 Confidence Intervals for the Geometric Mean
For log-normally distributed data, compute the CI on the log scale and transform back:
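- Compute $y_i = \ln(x_i)$, then $\bar{y}$ and $s_y$.
- 95% CI on log scale: $[L, U] = \bar{y} \pm t_{0.975,\,n-1} \times s_y / \sqrt{n}$.
- Transform back: $[e^{L},\ e^{U}]$; the geometric mean itself is $e^{\bar{y}}$.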
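The three steps above can be sketched as follows. The code uses natural logarithms, takes the t critical value as an argument (2.776 for $n = 5$, from the table in Section 10.2), and runs on made-up values; the function name `geometric_mean_ci` is an illustrative assumption.

```python
import math
import statistics

def geometric_mean_ci(values, t_crit):
    """Return (geometric mean, lower, upper); all values must be positive."""
    logs = [math.log(x) for x in values]
    n = len(logs)
    mean_log = statistics.fmean(logs)
    se_log = statistics.stdev(logs) / math.sqrt(n)
    lower = math.exp(mean_log - t_crit * se_log)   # back-transform the log-scale limits
    upper = math.exp(mean_log + t_crit * se_log)
    return math.exp(mean_log), lower, upper

# Hypothetical values spanning several orders of magnitude
gm, lower, upper = geometric_mean_ci([10, 100, 1000, 10000, 100000], t_crit=2.776)
```

The back-transformed interval is asymmetric around the geometric mean on the original scale, which is expected because it is symmetric only on the log scale.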
11. Advanced Topics
11.1 Subgroup Comparisons and Stratified Descriptives
When the variable of interest is examined across levels of a grouping variable (e.g., treatment vs. control), stratified descriptives provide within-group summaries and enable direct comparison of central tendency, spread, and shape across groups.
A useful compact format is the comparative descriptives table:
| Group | $n$ | Mean ± SD | Median [IQR] | Min – Max | Skewness |
|---|---|---|---|---|---|
| Group A | ... | ... | ... | ... | ... |
| Group B | ... | ... | ... | ... | ... |
⚠️ Descriptive comparisons between groups do not constitute formal hypothesis tests. A large apparent difference in means may not be statistically significant (underpowered study), and a small apparent difference may be highly significant (very large ). Always follow descriptive comparisons with appropriate inferential tests (t-test, Mann-Whitney U, ANOVA).
11.2 Data Transformations for Non-Normal Variables
When a variable is substantially non-normal, transforming it can improve the interpretability of mean-based descriptives and the validity of downstream parametric tests.
Common transformations:
| Transformation | Formula | Appropriate When |
|---|---|---|
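| Log transformation | $\ln(x)$ or $\log_{10}(x)$ | Positive right-skewed data; multiplicative processes |
| Square root | $\sqrt{x}$ | Count data; moderate right skew |
| Reciprocal | $1/x$ | Severely right-skewed rates |
| Box-Cox | $(x^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$; $\ln(x)$ for $\lambda = 0$ | Systematic search for optimal $\lambda$ |
| Arcsine square root | $\arcsin(\sqrt{p})$ | Proportion data (though logit preferred) |
| Logit | $\ln\left(\frac{p}{1-p}\right)$ | Proportion data bounded in $(0, 1)$ |
| Rank transformation | $\text{rank}(x_i)$ | Non-parametric basis; extreme outliers |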
⚠️ After transformation, descriptives are computed and reported on the transformed scale. The geometric mean on the original scale equals the antilog of the arithmetic mean on the log scale. Always state explicitly that transformed-scale descriptives are being reported and provide the back-transformed mean (geometric mean) for interpretability.
11.3 Robust Descriptive Statistics
When data contain outliers or are drawn from heavy-tailed distributions, robust estimators are preferred over classical mean-based measures:
| Classical Measure | Robust Alternative | Breakdown Point |
|---|---|---|
| Mean | Median | 50% |
| Mean | 10% trimmed mean | 10% |
| Standard deviation | $1.4826 \times \text{MAD}$ | 50% |
| Standard deviation | $IQR / 1.349$ | 25% |
| Variance | $Q_n$ estimator (Rousseeuw-Croux) | 50% |
| Mean | Huber M-estimator | Tunable |
The breakdown point is the proportion of data that can be replaced by arbitrarily large values before the estimator becomes unreliable.
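A short Python sketch shows how two of these robust alternatives behave when one value is wildly extreme. The function names and the data are hypothetical, and the scaling constants are the usual normal-consistency factors (1.4826 for the MAD, 1.349 for the IQR).

```python
import statistics

def trimmed_mean(values, proportion=0.10):
    """Mean after dropping the given proportion from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return statistics.fmean(ordered[k:len(ordered) - k])

def robust_sigma_mad(values):
    """Robust SD estimate: 1.4826 x median absolute deviation."""
    med = statistics.median(values)
    return 1.4826 * statistics.median(abs(x - med) for x in values)

def robust_sigma_iqr(values):
    """Robust SD estimate: IQR / 1.349."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / 1.349

data = list(range(1, 10)) + [1000]     # nine ordinary values plus one extreme value
classical = (statistics.fmean(data), statistics.stdev(data))
robust = (trimmed_mean(data), robust_sigma_mad(data))
```

The classical mean and SD are dominated by the single value of 1000, whereas the trimmed mean (5.5) and the MAD-based estimate stay close to the bulk of the data.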
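11.4 Effect Size: Cohen's $d$ and Standardised Differences
When comparing the means of two groups, Cohen's $d$ standardises the mean difference by the pooled standard deviation, producing an interpretable, scale-free effect size:
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
Cohen's conventions for $\lvert d \rvert$:
| $\lvert d \rvert$ | Effect Size Label |
| :---- | :---------------- |
| $0.2$ | Small |
| $0.5$ | Medium |
| $0.8$ | Large |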
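As a minimal sketch, Cohen's $d$ can be computed from group summary statistics using the pooled-SD form. The function name is illustrative, and the inputs are the placebo and caffeine summaries from Example 3 (Section 12).

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardised mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Placebo (mean 312.6, SD 32.1) versus caffeine (mean 287.3, SD 23.5), n = 30 each
d = cohens_d(312.6, 32.1, 30, 287.3, 23.5, 30)
```

With these inputs the pooled SD is about 28.1 ms and $d$ is about 0.90, which falls in the "large" band of the conventions above. The sign of $d$ depends only on which group is entered first.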
11.5 Standardised Scores and Norm-Referencing
Z-scores standardise raw values to a common scale with mean 0 and SD 1, enabling cross-variable and cross-population comparisons. Many applied contexts use T-scores (mean 50, SD 10) or IQ-type scaled scores (mean 100, SD 15):
$$z_i = \frac{x_i - \bar{x}}{s}, \qquad T_i = 50 + 10\,z_i, \qquad IQ_i = 100 + 15\,z_i$$
Percentile ranks from the z-score (under normality): $PR = 100 \times \Phi(z)$, where $\Phi$ is the standard normal CDF.
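These conversions are one-liners in Python. The sketch below uses the sample mean and SD for standardisation and `statistics.NormalDist` for $\Phi$; the helper names and the raw values are illustrative assumptions.

```python
from statistics import NormalDist, fmean, stdev

def standardise(values):
    """Convert raw values to z-scores using the sample mean and SD."""
    m, s = fmean(values), stdev(values)
    return [(x - m) / s for x in values]

def t_score(z):
    return 50 + 10 * z            # T-score scale: mean 50, SD 10

def iq_score(z):
    return 100 + 15 * z           # IQ-type scale: mean 100, SD 15

def percentile_rank(z):
    return 100 * NormalDist().cdf(z)   # valid only under normality

z_scores = standardise([12, 15, 18, 21, 24])   # hypothetical raw values
```

For example, a z-score of 1.0 corresponds to a T-score of 60, an IQ-type score of 115, and a percentile rank of roughly 84 under normality.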
11.6 The Central Limit Theorem and its Implications
The Central Limit Theorem (CLT) states that for any population with finite mean $\mu$ and variance $\sigma^2$, the sampling distribution of $\bar{x}$ approaches a normal distribution as $n \to \infty$:
$$\bar{x} \ \overset{\text{approx.}}{\sim}\ N\!\left(\mu,\ \frac{\sigma^2}{n}\right)$$
Practical implications for descriptives:
- For moderately large $n$ and mildly non-normal populations, the t-CI for $\mu$ is approximately valid.
- For highly skewed distributions or small $n$, bootstrap CIs are preferred.
- The CLT justifies reporting the mean and SD even for non-normal data, provided $n$ is sufficiently large and the goal is inference about $\mu$.
11.7 Detecting Bimodality
A bimodal distribution has two distinct peaks, often indicating the presence of two subpopulations. Indicators of bimodality:
- Histogram shows two visible humps.
- Large positive kurtosis alone does not imply bimodality.
- Hartigan's dip test formally tests for unimodality against multimodal alternatives.
- Bimodality coefficient $BC = \frac{G_1^2 + 1}{G_2 + \frac{3(n-1)^2}{(n-2)(n-3)}}$; $BC > 0.555$ suggests bimodality (Pfister et al., 2013).
- Gaussian mixture models can formally decompose a bimodal distribution into component distributions.
When bimodality is detected, computing a single mean and SD is misleading — report descriptives separately for each subpopulation if they can be identified.
11.8 Process Capability Indices
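In quality control and manufacturing, process capability indices relate the distribution of a measured variable to specification limits ($LSL$ = lower, $USL$ = upper):
$$C_p = \frac{USL - LSL}{6s}, \qquad C_{pk} = \min\!\left(\frac{USL - \bar{x}}{3s},\ \frac{\bar{x} - LSL}{3s}\right)$$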
- $C_p \geq 1.33$: Process capable (if centred).
- $C_{pk} \geq 1.33$: Process capable and well-centred.
- $C_{pk} < 1.00$: Process produces unacceptable proportion of out-of-specification output.
DataStatPro computes and reports $C_p$, $C_{pk}$, $C_{pm}$ (Taguchi index), and the estimated proportion out-of-specification when specification limits are supplied.
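A minimal sketch of the two basic indices, assuming the standard definitions $C_p = (USL - LSL)/(6s)$ and $C_{pk} = \min(USL - \bar{x},\ \bar{x} - LSL)/(3s)$. The function name and the process figures are hypothetical.

```python
def capability(mean, sd, lsl, usl):
    """Return (Cp, Cpk) for a process with the given mean, SD and spec limits."""
    cp = (usl - lsl) / (6 * sd)                      # potential capability, ignores centring
    cpk = min(usl - mean, mean - lsl) / (3 * sd)     # penalises an off-centre mean
    return cp, cpk

# Hypothetical process: mean 50, SD 2, specification limits 40 to 56
cp, cpk = capability(mean=50, sd=2, lsl=40, usl=56)
```

Here $C_p$ is about 1.33 but $C_{pk}$ is only 1.0, because the mean sits closer to the upper limit than to the lower one. The gap between the two indices is a quick indicator of poor centring.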
11.9 Temporal Trends and Rolling Descriptives
When a numerical variable is measured repeatedly over time, rolling (moving window) descriptives track how central tendency and variability change:
$$\bar{x}_t^{(w)} = \frac{1}{w}\sum_{j=0}^{w-1} x_{t-j}$$
Where $w$ is the window width (shown here for a trailing window). Rolling means, medians, and standard deviations are plotted over time to reveal trends, seasonal patterns, and structural breaks. DataStatPro supports rolling descriptives with user-specified window width.
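A trailing-window version can be sketched in plain Python. The helper name `rolling` and the series are hypothetical, and the window alignment (trailing rather than centred) is an assumption made for this sketch.

```python
import statistics

def rolling(values, window, stat=statistics.fmean):
    """Apply `stat` to each trailing window of the given width."""
    return [
        stat(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]

series = [1, 2, 3, 4, 100]                       # last point is an anomaly
rolling_mean = rolling(series, 3)
rolling_median = rolling(series, 3, statistics.median)
```

The final rolling mean jumps because of the anomalous last value, while the rolling median barely moves, which illustrates why plotting both can help separate genuine shifts from isolated spikes.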
11.10 Sensitivity Analysis: Influence of Outliers
To assess the influence of potential outliers on key descriptives, a sensitivity analysis reports descriptives both including and excluding flagged outliers:
| Statistic | With Outliers | Without Outliers | Difference |
|---|---|---|---|
| Mean | |||
| Median | |||
| SD | |||
| Skewness |
A large $\Delta$Mean with a small $\Delta$Median confirms that the outlier is exerting disproportionate leverage on the mean. Report both full-data and outlier-excluded descriptives when outliers are detected, along with a justification for any exclusions.
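The comparison in the table can be generated automatically. The sketch below flags outliers with Tukey's 1.5 × IQR fence and reports each statistic with and without them; the function names and the data are hypothetical.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return values outside the k x IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [x for x in values if x < q1 - k * iqr or x > q3 + k * iqr]

def describe(values):
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }

def sensitivity_table(values):
    """Map each statistic to (with outliers, without outliers, difference)."""
    flagged = tukey_outliers(values)
    kept = [x for x in values if x not in flagged]
    full, reduced = describe(values), describe(kept)
    return {name: (full[name], reduced[name], full[name] - reduced[name])
            for name in full}

table = sensitivity_table([10, 12, 11, 13, 12, 11, 95])   # one extreme value
```

In this made-up sample the mean changes by nearly 12 units when the value 95 is excluded, while the median changes by only 0.5, the pattern that signals disproportionate influence on the mean.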
12. Worked Examples
Example 1: Symmetric Distribution — Exam Scores
A class of $n = 25$ students sits an exam (0–100 marks). Three students were absent (missing); valid responses: $n = 22$.
Raw scores (sorted):
42, 51, 55, 58, 61, 63, 65, 67, 68, 70, 71, 72, 73, 74, 75, 76, 78, 80, 82, 85, 88, 93
Step 1 — Central Tendency:
Median ($n = 22$, even): Average of 11th and 12th values: $(71 + 72)/2 = 71.50$
Step 2 — Quartiles:
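$Q_1$ (25th percentile, position $1 + 0.25 \times (22 - 1) = 6.25$): $63 + 0.25 \times (65 - 63) = 63.50$
$Q_3$ (75th percentile, position $1 + 0.75 \times (22 - 1) = 16.75$): $76 + 0.75 \times (78 - 76) = 77.50$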
Step 3 — Dispersion:
Step 4 — Shape:
Step 5 — Normality:
Shapiro-Wilk: $W = 0.974$, $p = .812$ → Consistent with normality.
Step 6 — Outlier Check:
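Lower fence: $Q_1 - 1.5 \times IQR = 63.50 - 1.5 \times 14.00 = 42.50$. Score of 42 is just below the fence.
Upper fence: $Q_3 + 1.5 \times IQR = 77.50 + 1.5 \times 14.00 = 98.50$. No upper outliers.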
Score 42 is borderline; investigate (absent-then-returning student?). Reported as a mild outlier.
Step 7 — 95% CI for the Mean:
$69.41 \pm 2.080 \times 2.47 = [64.27, 74.55]$ ($t_{0.975,\,21} = 2.080$)
Summary Table:
| Statistic | Value |
|---|---|
| Valid | 22 |
| Missing | 3 |
| Mean (95% CI) | 69.41 [64.27, 74.55] |
| Median [IQR] | 71.50 [63.50, 77.50] |
| SD | 11.59 |
| SEM | 2.47 |
| CV | 16.7% |
| Min – Max | 42 – 93 |
| Range | 51 |
| Skewness ($G_1$) | −0.42 |
| Excess kurtosis ($G_2$) | 0.15 |
| Shapiro-Wilk $W$ ($p$) | 0.974 (.812) |
| Outliers flagged | 1 (score = 42) |
APA write-up: "Exam scores for 22 students (3 absent) ranged from 42 to 93 ($M = 69.41$, 95% CI [64.27, 74.55], $SD = 11.59$, $Mdn = 71.50$, $IQR = 14.00$). The distribution was approximately symmetric ($G_1 = -0.42$, $G_2 = 0.15$) and consistent with normality (Shapiro-Wilk $W = 0.974$, $p = .812$). One borderline outlier (score = 42) was identified using Tukey's IQR fence."
Example 2: Skewed Distribution — Household Income
A social survey records annual household income (£ thousands) for $n = 150$ households.
Selected descriptive statistics (computed by DataStatPro):
| Statistic | Value |
|---|---|
| Valid | 150 |
| Mean (95% CI) | £62.3k [57.1k, 67.5k] |
| Median [IQR] | £48.5k [34.2k, 72.6k] |
| SD | £32.8k |
| SEM | £2.68k |
| CV | 52.7% |
| Min – Max | £11.2k – £198.4k |
| Skewness ($G_1$) | 1.84 |
| Excess kurtosis ($G_2$) | 3.21 |
| Shapiro-Wilk $W$ ($p$) | 0.891 ($< .001$) |
| MAD (median) | £18.7k |
| Robust $\hat{\sigma}$ ($1.4826 \times \text{MAD}$) | £27.7k |
| Outliers flagged (Tukey) | 8 (all upper) |
| Geometric mean (95% CI) | £52.6k [48.9k, 56.6k] |
Interpretation: The distribution is substantially positively skewed ($G_1 = 1.84$), as is typical of income data. The mean (£62.3k) is considerably higher than the median (£48.5k), indicating that a small number of high-income households pull the mean upward. Eight high-income outliers are flagged. The Shapiro-Wilk test confirms significant non-normality ($p < .001$).
For this variable, the median and IQR (£48.5k [34.2k, 72.6k]) are the appropriate summary measures. The geometric mean (£52.6k) is also informative, as income data are approximately log-normally distributed.
APA write-up: "Household income ($n = 150$) showed substantial positive skewness ($G_1 = 1.84$), and the Shapiro-Wilk test indicated significant departure from normality ($W = 0.891$, $p < .001$). The median income was £48.5k ($IQR$ = £34.2k – £72.6k; range: £11.2k – £198.4k). The geometric mean was £52.6k (95% CI: £48.9k – £56.6k). Eight upper outliers were identified using Tukey's IQR fence."
Example 3: Grouped Descriptives — Reaction Time by Caffeine Condition
A psychology experiment measures reaction time (ms) in two conditions: caffeine ($n = 30$) and placebo ($n = 30$).
| Statistic | Caffeine | Placebo |
|---|---|---|
| Valid | 30 | 30 |
| Mean (95% CI) | 287.3 [278.5, 296.1] | 312.6 [300.4, 324.8] |
| Median [IQR] | 284.0 [268.5, 302.0] | 309.5 [290.0, 331.0] |
| SD | 23.5 | 32.1 |
| SEM | 4.3 | 5.9 |
| CV | 8.2% | 10.3% |
| Min – Max | 248 – 341 | 251 – 391 |
| Skewness ($G_1$) | 0.41 | 0.62 |
| Excess kurtosis ($G_2$) | −0.18 | 0.55 |
| Shapiro-Wilk $W$ ($p$) | 0.968 (.490) | 0.956 (.253) |
| Outliers flagged | 0 | 1 |
Cohen's $d$: with equal group sizes, $s_p = \sqrt{(23.5^2 + 32.1^2)/2} = 28.13$, so $d = (312.6 - 287.3)/28.13 = 0.90$.
A large effect size (Cohen, 1988).
Interpretation: Participants in the caffeine condition responded on average 25.3 ms faster than those in the placebo condition ($d = 0.90$). Both distributions are approximately normal (Shapiro-Wilk $p > .05$ for both groups). The placebo group shows slightly higher variability ($SD = 32.1$ vs. $SD = 23.5$) and one flagged outlier.
APA write-up: "Reaction times in the caffeine condition ($M = 287.3$ ms, $SD = 23.5$, 95% CI [278.5, 296.1]) were lower than in the placebo condition ($M = 312.6$ ms, $SD = 32.1$, 95% CI [300.4, 324.8]). Both distributions were consistent with normality (Shapiro-Wilk $W \geq 0.956$, $p \geq .253$). The effect size was large ($d = 0.90$)."
Example 4: Log-Normal Variable — Bacterial Colony Counts
A microbiology study counts bacterial colonies per sample (; counts range from 3 to 4,820). Raw data are highly right-skewed.
| Statistic | Raw Scale | Log Scale |
|---|---|---|
| Mean (95% CI) | 684.2 [463.5, 904.9] | 2.41 [2.28, 2.54] |
| Median [IQR] | 412.5 [98.5, 1021.0] | 2.62 [1.99, 3.01] |
| SD | 710.4 | 0.41 |
| Skewness ($G_1$) | 2.31 | −0.18 |
| Shapiro-Wilk $W$ ($p$) | 0.823 ($< .001$) | 0.974 (.483) |
| Geometric mean (95% CI) | 257.0 [195.5, 338.1] | — |
Interpretation: Colony counts are log-normally distributed: the raw data are severely right-skewed ($G_1 = 2.31$, Shapiro-Wilk $p < .001$), but log-transformed counts are approximately normally distributed ($G_1 = -0.18$, Shapiro-Wilk $p = .483$). The appropriate measure of central tendency is the geometric mean of 257.0 colonies (equivalent to the antilog of the mean on the log scale: $10^{2.41} \approx 257.0$). The arithmetic mean (684.2) is substantially inflated by high-count outliers and is not recommended as the primary summary.
APA write-up: "Colony counts () were log-normally distributed (Shapiro-Wilk on log-transformed data: , ; on raw data: , ). The geometric mean colony count was 257.0 (95% CI: 195.5 – 338.1). Descriptive statistics are reported on the log scale: , , , ."
13. Common Mistakes and How to Avoid Them
Mistake 1: Reporting the SEM as a Measure of Variability
Problem: Reporting "Mean ± SEM" and implying that the SEM describes the spread of individual observations. Because $SE_{\bar{x}} = s / \sqrt{n}$, the SEM shrinks with larger samples and is always smaller than the SD for $n > 1$, so it understates the variability in the data. This practice gives a misleading impression of data homogeneity.
Solution: Use the SD to describe the variability of individual observations in the sample. Use the SEM or the 95% CI to describe the precision of the mean estimate. Always label clearly which you are reporting.
Mistake 2: Applying the Mean and SD to a Skewed Distribution
Problem: Reporting mean ± SD for a right-skewed variable (e.g., income, hospital length of stay, reaction time) where the mean is not representative of a typical observation and the SD interval ($\bar{x} \pm s$) may extend below zero.
Solution: For skewed distributions, report the median and IQR as the primary summary. Consider also reporting the geometric mean for log-normally distributed data. Always inspect the histogram and Shapiro-Wilk test before choosing between mean-based and median-based summaries.
Mistake 3: Deleting Outliers Without Justification
Problem: Automatically removing observations that fall outside Tukey's fences or that have $|z| > 3$, without investigating whether they are genuine data points or errors. Removing legitimate extreme values biases the descriptives and invalidates downstream inferential tests.
Solution: Investigate every flagged outlier individually. Is it a data entry error? A measurement error? Or a genuine extreme value? Delete only errors; retain genuine extreme values. Report descriptives both with and without suspected outliers and note any exclusions explicitly.
Mistake 4: Confusing Standard Deviation with Standard Error
Problem: Using "SD" and "SE" (or "SEM") interchangeably, or not specifying which is reported. This is one of the most common statistical errors in published research.
Solution: Clearly define all notation at first use. Use SD to describe sample variability; use SE or 95% CI to describe estimation precision. Refer to APA style: ", , 95% CI [43.6, 47.0]."
Mistake 5: Over-Interpreting Descriptives from Very Small Samples
Problem: Computing and reporting detailed descriptives (skewness, kurtosis, mode, full percentile table) from samples of or , as if they were stable estimates of population parameters. With very small , all descriptives are highly unstable.
Solution: For , report individual values and at most the minimum, maximum, and median. For , report mean and SD with wide CIs, note the small sample size, and avoid strong distributional claims. Accompany all small-sample descriptives with explicit acknowledgment of imprecision.
Mistake 6: Not Assessing Normality Before Reporting Mean-Based Summaries
Problem: Automatically reporting mean ± SD for every numerical variable without checking whether the distribution is approximately normal. For skewed data, the mean and SD are poor summaries and can be actively misleading.
Solution: Always assess normality as part of the descriptive analysis, using at minimum a histogram and the Shapiro-Wilk test. Select mean ± SD (normal data) or median [IQR] (skewed data) based on the assessment. Report the normality assessment results alongside the descriptives.
Mistake 7: Truncating the y-Axis in Histograms or Bar Charts
Problem: Starting the frequency axis at a value other than zero to exaggerate differences or make the distribution appear more concentrated. This distorts the visual impression of the data.
Solution: Always start the y-axis of a histogram or bar chart at zero. If the range of values is large, consider a secondary plot zooming in on a region of interest, rather than truncating the primary axis.
Mistake 8: Reporting Spurious Precision
Problem: Reporting the mean as 47.38271 when the raw data are measured to the nearest integer. This implies a level of measurement precision that does not exist and does not aid interpretation.
Solution: Report the mean and SD to one more decimal place than the original data. Data recorded to the nearest unit → report mean to one decimal place. Data recorded to one decimal place → report mean to two decimal places.
Mistake 9: Failing to Distinguish Between Descriptive and Inferential Uses of Confidence Intervals
Problem: Reporting a 95% CI for the mean as a range within which "95% of observations fall". This is a fundamentally incorrect interpretation — that description applies to the prediction interval, not the CI for the mean.
Solution: A 95% CI for the mean means: "If we repeated this study many times, 95% of the constructed intervals would contain the true population mean $\mu$." It is an interval for a parameter, not for individual observations. Use prediction intervals or reference ranges when the goal is to describe the expected range for an individual observation.
Mistake 10: Comparing Means Across Groups Without Assessing Comparability of Spread
Problem: Reporting that the mean in Group A is higher than in Group B, without noting that Group A has three times the standard deviation. A mean difference is much more practically important when variability is low than when it is high.
Solution: Always report the variability measure (SD or IQR) alongside the central tendency measure for each group. Compute and report Cohen's $d$ or Glass's $\Delta$ as the standardised effect size when comparing group means.
14. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| Mean and median differ greatly | Skewed distribution or outliers | Report median and IQR; investigate outliers; consider transformation |
| SD larger than mean | High variability, right skew, or zero-inflated data | Check distribution shape; consider median/IQR; check for data errors |
| SD = 0 | All observations identical | Expected; report as constant; investigate if unexpected |
| Negative variance | Computation error | Verify formula uses $n - 1$ denominator; check data integrity |
| CV reported as negative | Negative mean (e.g., change scores, temperatures in °C) | CV is not meaningful for interval-scale variables with possible negative means; omit CV |
| Skewness or kurtosis very large ($ | G_1 | > 5$) |
| Shapiro-Wilk $p < .05$ with large $n$ | High test power: trivial departures from normality become significant for large $n$ | Inspect histogram and Q-Q plot; if visually approximately normal, proceed with mean-based summaries |
| Shapiro-Wilk $p > .05$ with small $n$ | Low power of normality test for small $n$ | Visual inspection essential; bootstrap CI safer than t-CI for small non-normal samples |
| Confidence interval includes negative values for SD | Incorrectly applied symmetric CI for SD | Use chi-square CI for SD (Section 10.5); SD CI is inherently asymmetric |
| Geometric mean cannot be computed | One or more zero or negative values | Log transformation undefined at 0; add a small constant or use arithmetic mean; check for data errors |
| Percentile estimates differ from other software | Different percentile computation method (Type 7 vs. other) | Specify the method used; DataStatPro uses Type 7 (linear interpolation) |
| IQR = 0 | More than 50% of observations share the same value | Common with discrete or heavily tied data; report as 0 and note the high frequency of tied values |
| Mean dramatically changes after excluding one outlier | Outlier exerts high leverage | Report sensitivity analysis; consider robust estimators; investigate the outlier |
| MAD (median) = 0 | More than 50% of observations share the median value | Common with integer data; robust $\hat{\sigma}$ is not meaningful; note and use other measures |
| $C_{pk}$ is negative | Process mean outside specification limits | Urgent process adjustment needed; $C_{pk} < 0$ means more than 50% of output is out-of-specification |
| Rolling mean diverges unexpectedly | Structural break or data anomaly in the time series | Investigate the time period around the divergence; check for data entry errors |
15. Quick Reference Cheat Sheet
Core Equations
| Formula | Description |
|---|---|
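| $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Arithmetic mean |
| $\tilde{x} = x_{((n+1)/2)}$ (odd $n$) or average of middle two (even $n$) | Median |
| $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ | Sample variance |
| $s = \sqrt{s^2}$ | Sample standard deviation |
| $SE_{\bar{x}} = s / \sqrt{n}$ | Standard error of the mean |
| $CV = (s / \bar{x}) \times 100\%$ | Coefficient of variation (ratio scale only) |
| $IQR = Q_3 - Q_1$ | Interquartile range |
| $MAD = \text{median}(\lvert x_i - \tilde{x} \rvert)$ | Median absolute deviation |
| $\hat{\sigma}_{MAD} = 1.4826 \times MAD$ | Robust SD estimator |
| $G_1$ = bias-corrected third standardised moment | Sample skewness |
| $G_2$ = bias-corrected fourth standardised moment | Excess kurtosis |
| $z_i = (x_i - \bar{x}) / s$ | Z-score (standardised value) |
| $GM = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right)$ | Geometric mean |
| $\bar{x} \pm t_{0.975,\,n-1} \cdot s / \sqrt{n}$ | 95% CI for mean |
| $\left[\sqrt{\frac{(n-1)s^2}{\chi^2_{0.975,\,n-1}}},\ \sqrt{\frac{(n-1)s^2}{\chi^2_{0.025,\,n-1}}}\right]$ | 95% CI for SD ($\chi^2_{p,\,n-1}$ = $p$-th quantile) |
| $d = (\bar{x}_1 - \bar{x}_2) / s_{\text{pooled}}$ | Cohen's $d$ (standardised mean difference) |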
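
As a rough illustration of the core measures, the following sketch computes most of them with Python's standard library. The function name `describe` and the sample data are hypothetical, quartiles are assumed to follow the Type 7 method, and the t-based and chi-square intervals are omitted because they need quantile functions outside the standard library.

```python
import math
import statistics

def describe(values):
    """Return core numerical descriptives for a list of numbers (stdlib only)."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    variance = statistics.variance(values)   # n - 1 denominator
    sd = math.sqrt(variance)
    # Quartiles via the "inclusive" method (assumed equivalent to Type 7).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mad = statistics.median(abs(x - median) for x in values)
    return {
        "n": n,
        "mean": mean,
        "median": median,
        "variance": variance,
        "sd": sd,
        "sem": sd / math.sqrt(n),
        # CV is only meaningful for ratio-scale data with a positive mean.
        "cv_percent": 100 * sd / mean if mean > 0 else None,
        "iqr": q3 - q1,
        "mad": mad,
        "robust_sd": 1.4826 * mad,
        # The geometric mean requires strictly positive values.
        "geometric_mean": (
            statistics.geometric_mean(values) if min(values) > 0 else None
        ),
    }

summary = describe([2, 4, 4, 5, 7, 9, 10, 12])
for name, value in summary.items():
    print(f"{name}: {value}")
```

Returning `None` for the CV and the geometric mean when their preconditions fail mirrors the troubleshooting advice to omit those measures rather than report a misleading number.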
Measure Applicability by Data Type
| Measure | Continuous Normal | Continuous Skewed | Discrete Count | Ratio-Scale Required |
|---|---|---|---|---|
| Mean, SD, SEM | ✅ Primary | ⚠️ Use with caution | ✅ | — |
| Median, IQR | ✅ Supplement | ✅ Primary | ✅ | — |
| Geometric mean | — | ✅ Log-normal | ✅ Counts | ✅ |
| CV | ✅ | ✅ | ✅ | ✅ |
| MAD (median) | ✅ | ✅ Primary | ✅ | — |
| Skewness, kurtosis | ✅ | ✅ | ✅ | — |
| Percentiles | ✅ | ✅ | ✅ | — |
Mean vs. Median Decision Guide
| Use Mean ± SD When | Use Median [IQR] When |
|---|---|
| Approximately normal (Shapiro-Wilk $p > .05$) | Substantially skewed (large $\lvert G_1 \rvert$) |
| No substantial outliers | Outliers present |
| Sufficiently large $n$ | Small $n$ with non-normal data |
| Parametric tests planned | Non-parametric tests planned |
| Ratio-scale variable | Bounded or censored data |
Normality Assessment Summary
| Tool | Output | Action Threshold |
|---|---|---|
| Shapiro-Wilk test | $W$, $p$-value | $p < .05$: evidence of non-normality |
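| Skewness z-test | $z_{G_1} = G_1 / SE_{G_1}$ | $\lvert z_{G_1} \rvert > 1.96$ |
| Kurtosis z-test | $z_{G_2} = G_2 / SE_{G_2}$ | $\lvert z_{G_2} \rvert > 1.96$ |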
| Histogram | Visual | Obvious skew, bimodality, gaps |
| Q-Q plot | Visual | Points outside confidence band |
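
The skewness and kurtosis z-tests can be sketched as follows. This assumes the widely used bias-corrected estimators $G_1$ and $G_2$ (the variants reported by SPSS and SAS) together with their standard-error formulas; whether DataStatPro uses exactly these expressions is an assumption, and the function name is hypothetical.

```python
import math

def skew_kurtosis(values):
    """Bias-corrected skewness (G1) and excess kurtosis (G2) with z-statistics."""
    n = len(values)
    if n < 4:
        raise ValueError("at least 4 observations are required")
    mean = sum(values) / n
    # Central moments with an n denominator.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; shape is undefined")
    g1 = m3 / m2 ** 1.5          # uncorrected skewness
    g2 = m4 / m2 ** 2 - 3        # uncorrected excess kurtosis
    G1 = math.sqrt(n * (n - 1)) / (n - 2) * g1
    G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    se_G1 = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_G2 = 2 * se_G1 * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return {
        "G1": G1, "se_G1": se_G1, "z_G1": G1 / se_G1,
        "G2": G2, "se_G2": se_G2, "z_G2": G2 / se_G2,
    }

shape = skew_kurtosis([1, 1, 2, 2, 3, 10])
print(shape)
```

The z-statistics are only a screening aid. With small samples the standard errors are large, so a histogram or Q-Q plot remains the more informative check.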
Outlier Detection Thresholds
| Method | Lower Bound | Upper Bound |
|---|---|---|
| Tukey IQR fence (mild) | $Q_1 - 1.5 \times IQR$ | $Q_3 + 1.5 \times IQR$ |
| Tukey IQR fence (extreme) | $Q_1 - 3 \times IQR$ | $Q_3 + 3 \times IQR$ |
| Z-score | $z_i < -3$ | $z_i > 3$ |
| Modified Z-score | $M_i < -3.5$ | $M_i > 3.5$ |
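
A minimal sketch of two of these rules, assuming Type 7 quartiles for the Tukey fences and the commonly cited constant 0.6745 with a cut-off of 3.5 for the modified z-score. The function names and the data are hypothetical.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences; k=1.5 is mild, k=3 is extreme."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def modified_z_scores(values):
    """Modified z-scores based on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        raise ValueError("MAD is 0; modified z-scores are undefined")
    return [0.6745 * (x - med) / mad for x in values]

data = [2, 4, 4, 5, 7, 9, 10, 12, 40]

low, high = tukey_fences(data)
flagged_iqr = [x for x in data if x < low or x > high]

scores = modified_z_scores(data)
flagged_mz = [x for x, m in zip(data, scores) if abs(m) > 3.5]

print("Tukey fences:", (low, high), "flagged:", flagged_iqr)
print("Modified z-score flagged:", flagged_mz)
```

Both rules flag the value 40 here. The explicit error when the MAD is 0 reflects the troubleshooting note that heavily tied data make MAD-based measures unusable.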
Cohen's $d$ Benchmarks
| $d$ | Effect Size | Contextual Note |
| :----- | :---------- | :-------------- |
| 0.20 | Small | Barely noticeable in practice |
| 0.50 | Medium | Visible to a careful observer |
| 0.80 | Large | Obvious to a casual observer |
| 1.20 | Very large | Highly practically significant |
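
To make the benchmarks concrete, here is a small sketch of Cohen's $d$ for two independent groups using the pooled standard deviation. The helper names and the example scores are hypothetical, and the labels use only Cohen's conventional 0.2, 0.5, and 0.8 cut-offs.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample SD."""
    n1, n2 = len(group_a), len(group_b)
    v1 = statistics.variance(group_a)
    v2 = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd

def label_effect(d):
    """Map |d| to a conventional verbal label."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

d = cohens_d([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
print(f"d = {d:.2f} ({label_effect(d)})")
```

The verbal labels are conventions rather than rules, so the raw value of $d$ should be reported alongside any label.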
Required Sample Size for Target CI Width
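$$n = \left(\frac{1.96\,\sigma}{E}\right)^2$$
where $E$ = desired half-width of the 95% CI and $\sigma$ = the anticipated standard deviation.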
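
A minimal sketch of the sample-size calculation, assuming the normal-approximation formula $n = (z \cdot \sigma / E)^2$ with a planning value for the standard deviation. The function name is hypothetical, and a t-based calculation would give a slightly larger $n$ for small samples.

```python
import math
from statistics import NormalDist

def n_for_ci_half_width(sd, half_width, confidence=0.95):
    """Smallest n whose normal-approximation CI half-width is <= half_width."""
    if sd <= 0 or half_width <= 0:
        raise ValueError("sd and half_width must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sd / half_width) ** 2)

# Planning SD of 10 units and a target half-width of 2 units.
print(n_for_ci_half_width(sd=10, half_width=2))
```

Because the required $n$ grows with the square of $\sigma / E$, halving the target half-width roughly quadruples the sample size.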
APA 7th Edition Reporting Templates
Normal distribution (primary: mean and SD): "[Variable] scores ranged from [Min] to [Max] ($M$ = [value], $SD$ = [value], 95% CI [[LB], [UB]]). The distribution was approximately normal (Shapiro-Wilk $W$ = [value], $p$ = [value], $G_1$ = [value], $G_2$ = [value])."
Non-normal distribution (primary: median and IQR): "[Variable] scores ranged from [Min] to [Max] ($Mdn$ = [value], $IQR$ = [LB] – [UB]). The distribution was positively/negatively skewed ($G_1$ = [value]), and the Shapiro-Wilk test indicated significant departure from normality ($W$ = [value], $p$ = [value])."
Log-normal variable (geometric mean): "[Variable] data were log-normally distributed (Shapiro-Wilk on log-transformed data: $W$ = [value], $p$ = [value]). The geometric mean was [value] (95% CI: [LB], [UB])."
Group comparison: "[Group A] ($n$ = [value], $M$ = [value], $SD$ = [value]) showed [higher/lower/similar] [variable] compared to [Group B] ($n$ = [value], $M$ = [value], $SD$ = [value]), with a [small/medium/large] effect size ($d$ = [value])."
With outliers: "[Number] outlier(s) were identified using Tukey's IQR fence method ([list values]). Descriptives are reported for the full sample ($n$ = [total]) and with outliers excluded ($n$ = [excl]): full: $M$ = [value], $SD$ = [value]; excluding outliers: $M$ = [value], $SD$ = [value]."
Reporting Checklist
| Item | Required |
|---|---|
| Valid $n$ and missing $n$ (with missing rate) | ✅ Always |
| Mean and SD (or median and IQR) | ✅ Always |
| State which: Mean ± SD or Median [IQR] — with justification | ✅ Always |
| Minimum and maximum (or range) | ✅ Always |
| 95% CI for the mean (or median) | ✅ Always |
| Skewness ($G_1$) and excess kurtosis ($G_2$) | ✅ Always |
| Normality assessment (Shapiro-Wilk + histogram/Q-Q) | ✅ Always |
| Quartiles ($Q_1$, $Q_2$, $Q_3$) | ✅ When reporting median |
| SEM | ✅ When mean precision is the focus (clearly labelled) |
| CV | ✅ For ratio-scale variables; when comparing variability |
| MAD (median) | ✅ For non-normal data; when robustness is relevant |
| Geometric mean | ✅ For log-normal or multiplicative data |
| Outlier detection results | ✅ Always |
| Sensitivity analysis (with/without outliers) | ✅ When outliers are detected |
| Units of measurement stated | ✅ Always |
| Percentile table | ✅ For norm-referenced scores; clinical reference ranges |
| Five-number summary | ✅ When box plot is presented |
| Cohen's $d$ (group comparisons) | ✅ When comparing two group means |
| Bootstrap CI | ✅ When $n$ is small and normality is violated |
| Transformation stated and justified | ✅ When data are transformed before reporting |
| Measurement scale stated (interval / ratio) | ✅ Always |
| Missing data mechanism discussed | ✅ When the missing rate is non-negligible |
| Weighted estimates (if survey data) | ✅ When design weights provided |
This tutorial provides a comprehensive foundation for understanding, computing, interpreting, visualising, and reporting numerical descriptive statistics within the DataStatPro application. For further reading, consult Tukey's "Exploratory Data Analysis" (1977) for robust and exploratory methods, Altman's "Practical Statistics for Medical Research" (1991) for clinical applications, Wilcox's "Introduction to Robust Estimation and Hypothesis Testing" (4th ed., 2017) for robust descriptives, Cohen's "Statistical Power Analysis for the Behavioral Sciences" (2nd ed., 1988) for effect size conventions, and Hoaglin, Mosteller & Tukey's "Understanding Robust and Exploratory Data Analysis" (1983) for advanced exploratory methods. For feature requests or support, contact the DataStatPro team.