Numerical Descriptives and Distributions: Comprehensive Reference Guide

This guide covers descriptive statistics for numerical data: measures of central tendency, variability, and distribution shape, together with percentiles, confidence intervals, normality testing, data transformation, and outlier detection. Each topic includes its mathematical formulation and interpretation guidelines.

Overview

Descriptive statistics summarize and describe the main features of numerical datasets. Understanding these measures is fundamental for data analysis, hypothesis testing, and making informed decisions based on empirical evidence.

Measures of Central Tendency

1. Arithmetic Mean

Purpose: The average value of a dataset, representing the central point around which data values cluster.

Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$

Properties:

Sensitive to outliers
Minimizes sum of squared deviations
Used in parametric statistical tests

Population Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N}x_i$

2. Median

Purpose: The middle value when data is arranged in ascending order, representing the 50th percentile.

Calculation (with $x_{(i)}$ denoting the i-th smallest value):

Odd n: $Median = x_{((n+1)/2)}$
Even n: $Median = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$

Properties:

Robust to outliers
Appropriate for skewed distributions
Divides dataset into two equal halves

3. Mode

Purpose: The most frequently occurring value(s) in a dataset.

Types:

Unimodal: One mode
Bimodal: Two modes
Multimodal: Multiple modes
No mode: All values occur with equal frequency

Properties:

Can be used with categorical data
May not exist or may not be unique
Useful for identifying typical values
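
As a minimal sketch of the three measures above, the following uses Python's standard `statistics` module. The sample values are hypothetical and chosen so that a single large value shows how the mean and median diverge.

```python
from statistics import mean, median, multimode

# Hypothetical sample containing one large outlier (40).
data = [2, 3, 3, 5, 7, 10, 40]

center_mean = mean(data)      # 70 / 7 = 10, pulled upward by the outlier
center_median = median(data)  # middle of the sorted values = 5
modes = multimode(data)       # every most-frequent value, here [3]

print(center_mean, center_median, modes)
```

`multimode` returns a list, so bimodal and multimodal samples are reported in full rather than collapsed to a single value.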

Measures of Variability

1. Variance

Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Population Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

Properties:

Measures average squared deviation from mean
Units are squared original units
Always non-negative

2. Standard Deviation

Sample Standard Deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Population Standard Deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

Properties:

Same units as original data
Approximately 68% of data within 1 SD of mean (normal distribution)
Approximately 95% of data within 2 SD of mean (normal distribution)
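
The difference between the sample and population formulas is only the divisor. A short sketch with the standard `statistics` module, using hypothetical measurements:

```python
from statistics import variance, pvariance, stdev, pstdev

# Hypothetical measurements; mean is 5.5, sum of squared deviations is 17.5.
data = [4, 8, 6, 5, 3, 7]

s2 = variance(data)       # sample variance: 17.5 / (n - 1) = 3.5
sigma2 = pvariance(data)  # population variance: 17.5 / N
s = stdev(data)           # square root of the sample variance
sigma = pstdev(data)      # square root of the population variance

print(s2, sigma2, s, sigma)
```

Because the sample formula divides by n - 1 rather than N, the sample values are always slightly larger than the population values computed from the same numbers.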

3. Range

Formula: $Range = x_{max} - x_{min}$

Properties:

Simple measure of spread
Highly sensitive to outliers
Easy to calculate and interpret

4. Interquartile Range (IQR)

Formula: $IQR = Q_3 - Q_1$

Where:

$Q_1$ = 25th percentile (first quartile)
$Q_3$ = 75th percentile (third quartile)

Properties:

Robust to outliers
Contains middle 50% of data
Used in box plot construction

5. Coefficient of Variation

Formula: $CV = \frac{s}{\bar{x}} \times 100\%$

Properties:

Relative measure of variability
Unitless (allows comparison across different scales)
Useful when comparing variability of datasets with different means or units
Meaningful only for ratio-scale data whose mean is well away from zero
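
The unitless property can be checked directly. In this sketch the helper name `coefficient_of_variation` and the height values are hypothetical, and rejecting non-positive means is an assumption of the example rather than part of the formula.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Return the sample CV as a percentage: 100 * s / x-bar."""
    m = mean(values)
    if m <= 0:
        # Assumption of this sketch: treat non-positive means as invalid input.
        raise ValueError("CV is not meaningful for a non-positive mean")
    return 100 * stdev(values) / m

# The same hypothetical heights expressed in two different units.
heights_cm = [160, 170, 180]
heights_m = [1.60, 1.70, 1.80]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
print(cv_cm, cv_m)  # both are about 5.88 (percent)
```

Changing the unit rescales the mean and the standard deviation by the same factor, so the ratio is unchanged.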

Distribution Shape Measures

1. Skewness

Sample Skewness: $Skewness = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3$

Interpretation:

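Skewness ≈ 0: Approximately symmetric distribution
Skewness > 0: Right-skewed (positive skew, longer right tail)
Skewness < 0: Left-skewed (negative skew, longer left tail)
|Skewness| > 2: Highly skewed (a rule of thumb; some authors use a cutoff of 1)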

2. Kurtosis

Sample Excess Kurtosis: $Kurtosis = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$

Interpretation (the subtracted term makes this excess kurtosis, so the normal distribution scores 0 rather than 3):

Kurtosis ≈ 0: Tails similar to a normal distribution (mesokurtic)
Kurtosis > 0: Heavy-tailed distribution (leptokurtic)
Kurtosis < 0: Light-tailed distribution (platykurtic)
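
The two shape formulas above translate directly into code. This is a minimal sketch with hypothetical function names; it assumes the data are not all identical (otherwise s = 0 and the division fails).

```python
from math import sqrt

def _mean_and_sd(x):
    """Return the sample mean and the sample SD (n - 1 divisor)."""
    n = len(x)
    m = sum(x) / n
    s = sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
    return m, s

def sample_skewness(x):
    """Bias-adjusted sample skewness; needs n >= 3."""
    n = len(x)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    m, s = _mean_and_sd(x)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in x)

def sample_kurtosis(x):
    """Bias-adjusted sample kurtosis relative to the normal; needs n >= 4."""
    n = len(x)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 values")
    m, s = _mean_and_sd(x)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * sum(((v - m) / s) ** 4 for v in x) - correction

symmetric = [1, 2, 3, 4, 5]         # hypothetical, evenly spaced
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical, long right tail

print(sample_skewness(symmetric))     # 0 up to rounding
print(sample_kurtosis(symmetric))     # -1.2 up to rounding (light tails)
print(sample_skewness(right_skewed))  # positive
```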

Percentiles and Quartiles

Percentile Calculation

Formula for kth percentile: $P_k = \text{value below which } k\% \text{ of data falls}$

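Position calculation (one common convention; statistical packages implement several): $Position = \frac{k}{100} \times (n + 1)$

The position is a 1-based rank in the sorted data. A fractional position is resolved by linear interpolation between the two adjacent ordered values.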

Quartiles

Q₁ (25th percentile): $P_{25}$
Q₂ (50th percentile): $P_{50}$ = Median
Q₃ (75th percentile): $P_{75}$

Five-Number Summary

Minimum value
First quartile (Q₁)
Median (Q₂)
Third quartile (Q₃)
Maximum value
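
The position rule and the five-number summary can be sketched as follows. The function names are hypothetical, and two choices here are assumptions of the sketch: fractional positions are linearly interpolated, and positions outside 1..n are clamped to the smallest or largest value.

```python
def percentile(values, k):
    """k-th percentile using the position rule k/100 * (n + 1)."""
    if not values:
        raise ValueError("percentile of empty data")
    xs = sorted(values)
    n = len(xs)
    pos = k / 100 * (n + 1)    # 1-based position in the sorted data
    pos = min(max(pos, 1), n)  # clamp to the valid range (assumption)
    lower = int(pos)           # rank of the value at or below the position
    frac = pos - lower         # fractional part used for interpolation
    if lower >= n:
        return xs[-1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    return (
        min(values),
        percentile(values, 25),
        percentile(values, 50),
        percentile(values, 75),
        max(values),
    )

data = [7, 1, 3, 5, 2, 6, 4]  # hypothetical sample, n = 7
summary = five_number_summary(data)
print(summary)  # positions 2, 4 and 6 give Q1 = 2, median = 4, Q3 = 6
```

Other conventions give slightly different quartiles for the same data, so results may not match a given statistics package exactly. In the Python standard library, `statistics.quantiles` with its default `method='exclusive'` is documented as using the same (n + 1) positioning.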

Confidence Intervals for Means

Confidence Interval for Population Mean (σ known)

$CI = \bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$

Confidence Interval for Population Mean (σ unknown)

$CI = \bar{x} \pm t_{\alpha/2,df} \times \frac{s}{\sqrt{n}}$

Where:

$df = n - 1$ (degrees of freedom)
$t_{\alpha/2,df}$ = critical t-value

Interpretation:

95% CI: If sampling were repeated many times, about 95% of the intervals built this way would contain the true population mean
Narrower intervals indicate more precise estimates
Larger samples generally produce narrower intervals
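
A minimal sketch of the σ-known interval, using `NormalDist` from the standard library for the critical value. The function name and the input numbers are hypothetical. The σ-unknown case needs a t critical value, which the standard library does not provide, so it is not shown here.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """CI for the mean when the population SD (sigma) is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical inputs: sample mean 100, known sigma 15, n = 36.
low, high = z_confidence_interval(100, 15, 36)
print(round(low, 2), round(high, 2))  # 95.1 104.9

# Quadrupling n halves the margin of error.
low4, high4 = z_confidence_interval(100, 15, 144)
```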

Normality Testing

1. Shapiro-Wilk Test

Purpose: Tests whether a sample comes from a normally distributed population.

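Test Statistic: $W = \frac{\left(\sum_{i=1}^{n}a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Where:

$x_{(i)}$ = i-th smallest observation
$a_i$ = constants derived from the expected values and covariances of standard normal order statistics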

Properties:

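Among the most powerful omnibus tests for normality in published simulation comparisons
Originally developed for n ≤ 50; later extensions (Royston) support samples up to several thousand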
Sensitive to outliers

2. Kolmogorov-Smirnov Test

Purpose: Tests whether a sample follows a specified distribution.

Test Statistic: $D = \sup_x |F_n(x) - F_0(x)|$

Where:

$F_n(x)$ = empirical distribution function
$F_0(x)$ = theoretical distribution function

Properties:

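Distribution-free when $F_0$ is fully specified in advance; if its parameters are estimated from the sample, the standard critical values no longer apply (use the Lilliefors correction)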
Can test against any continuous distribution
Less powerful than Shapiro-Wilk for normality
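
The statistic itself is simple to compute. Because the empirical distribution function jumps at each data point, this sketch checks the gap just before and just after every jump. The function name and the sample are hypothetical, and the reference distribution (a standard normal) is assumed to be fixed in advance rather than fitted to the data. No p-value is computed.

```python
from statistics import NormalDist

def ks_statistic(values, cdf):
    """Largest vertical gap between the empirical CDF and `cdf`."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF steps from (i - 1)/n up to i/n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical sample tested against a fully specified N(0, 1).
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
d_normal = ks_statistic(sample, NormalDist(0, 1).cdf)
print(d_normal)

# Check against a case that is easy to verify by hand: Uniform(0, 1).
d_uniform = ks_statistic([0.25, 0.5, 0.75], lambda x: x)
print(d_uniform)  # 0.25
```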

3. Anderson-Darling Test

Purpose: Tests goodness of fit to a specified distribution with emphasis on tail behavior.

Test Statistic: $A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F(x_{(i)}) + \ln\left(1-F(x_{(n+1-i)})\right)\right]$, where $x_{(i)}$ is the i-th smallest observation

Properties:

More sensitive to deviations in tails
Good for detecting departures from normality
Generally more powerful than Kolmogorov-Smirnov, especially for departures in the tails
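
The statistic above can be sketched in a few lines. The function name is hypothetical, the values must be sorted before the formula is applied, and the sketch assumes the reference CDF never returns exactly 0 or 1 at a data point (the logarithm would fail). Critical values depend on the distribution being tested and are not included.

```python
from math import log

def anderson_darling(values, cdf):
    """A^2 statistic for `values` against a fully specified `cdf`."""
    xs = sorted(values)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        lower_tail = log(cdf(xs[i - 1]))      # ln F(x_(i))
        upper_tail = log(1 - cdf(xs[n - i]))  # ln(1 - F(x_(n+1-i)))
        total += (2 * i - 1) * (lower_tail + upper_tail)
    return -n - total / n

# Hand-checkable case: three points against Uniform(0, 1).
a2 = anderson_darling([0.25, 0.5, 0.75], lambda x: x)
print(round(a2, 4))  # 0.2694
```

The logarithms grow without bound as F approaches 0 or 1, which is what gives the test its extra weight in the tails.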

Data Transformation Techniques

1. Log Transformation

Formula: $y = \ln(x) \text{ or } y = \log_{10}(x)$, defined for $x > 0$

Use Cases:

Right-skewed data
Multiplicative relationships
Stabilizing variance

2. Square Root Transformation

Formula: $y = \sqrt{x}$

Use Cases:

Poisson-distributed data
Count data with small values
Moderate right skew

3. Box-Cox Transformation

Formula:

For $\lambda \neq 0$ : $y(\lambda) = \frac{x^\lambda - 1}{\lambda}$

For $\lambda = 0$ : $y(\lambda) = \ln(x)$

Properties:

λ is usually estimated by maximum likelihood, selecting the value that makes the transformed data closest to normal
Includes log transformation as special case
Requires positive values
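
For a given λ the transformation is a one-liner. This sketch, with a hypothetical function name, applies the two-case formula above to a single value. It does not estimate λ.

```python
from math import log

def box_cox(x, lam):
    """Box-Cox transform of a single positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return log(x)
    return (x ** lam - 1) / lam

print(box_cox(4, 0.5))  # (2 - 1) / 0.5 = 2.0, a shifted and scaled square root
print(box_cox(4, 1))    # 3.0, a plain shift that leaves the shape unchanged
print(box_cox(4, 0))    # ln(4)

# As lambda approaches 0 the power form approaches the log form.
near_zero = box_cox(4, 1e-6)
```

The continuity at λ = 0 is the reason the log transformation counts as a special case of the family.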

4. Reciprocal Transformation

Formula: $y = \frac{1}{x}$, defined for $x \neq 0$ (for positive data the transformation reverses the ordering of values)

Use Cases:

Severe right skew
Rate or time data
When larger values need compression

Outlier Detection Methods

1. Z-Score Method

Formula: $z_i = \frac{x_i - \bar{x}}{s}$

Criterion: $|z_i| > 2$ or $|z_i| > 3$ (depending on stringency)
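
The effect of the two cutoffs can be seen in a short sketch. The function name and the data are hypothetical.

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=3.0):
    """Return the values whose |z| exceeds the threshold."""
    m = mean(values)
    s = stdev(values)  # sample SD, as in the formula above
    return [v for v in values if abs((v - m) / s) > threshold]

data = [10, 12, 13, 14, 15, 16, 25]  # hypothetical; mean is 15

lenient = z_score_outliers(data, threshold=2.0)  # 25 has z of about 2.07
strict = z_score_outliers(data, threshold=3.0)   # nothing reaches 3

print(lenient, strict)
```

In small samples the stricter cutoff can be unreachable: the largest possible |z| in a sample of size n is (n - 1)/√n, which is about 2.27 for n = 7. The outlier also inflates the mean and SD used to score it, which is one motivation for the more robust methods.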

2. Interquartile Range (IQR) Method

Outlier boundaries:

Lower fence: $Q_1 - 1.5 \times IQR$
Upper fence: $Q_3 + 1.5 \times IQR$

Extreme outliers:

Lower extreme: $Q_1 - 3 \times IQR$
Upper extreme: $Q_3 + 3 \times IQR$
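
A sketch of both sets of fences, using `statistics.quantiles` for the quartiles. The function name and data are hypothetical, and the flagged values can change with the quartile convention used.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 13, 14, 15, 16, 25]  # hypothetical; Q1 = 12, Q3 = 16

outliers = iqr_outliers(data)       # fences at 6 and 22
extreme = iqr_outliers(data, k=3)   # fences at 0 and 28

print(outliers, extreme)  # 25 is an outlier but not an extreme one
```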

3. Modified Z-Score

Formula: $M_i = \frac{0.6745(x_i - \tilde{x})}{MAD}$

Where:

$\tilde{x}$ = median
$MAD$ = median absolute deviation, $\text{median}(|x_i - \tilde{x}|)$

Criterion: $|M_i| > 3.5$
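
A minimal sketch of the modified z-score with hypothetical names and data. Raising an error when MAD is 0 is an assumption of this sketch; that case arises when more than half of the values are identical.

```python
from statistics import median

def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Assumption of this sketch: refuse rather than divide by zero.
        raise ValueError("MAD is zero; modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]

data = [10, 12, 13, 14, 15, 16, 25]  # hypothetical; median 14, MAD 2
scores = modified_z_scores(data)
flagged = [v for v, m in zip(data, scores) if abs(m) > 3.5]

print(flagged)  # [25], whose score is 0.6745 * 11 / 2, about 3.71
```

Because the median and MAD barely move when one value is extreme, the outlier does not mask itself the way it can with the mean and standard deviation.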

Practical Guidelines

Choosing Appropriate Measures

For Symmetric Distributions:

Central tendency: Mean
Variability: Standard deviation
Use parametric methods

For Skewed Distributions:

Central tendency: Median
Variability: IQR or MAD
Consider transformations or non-parametric methods

For Distributions with Outliers:

Use robust measures (median, IQR)
Investigate outliers before removal
Consider outlier-resistant methods

Sample Size Considerations

Small Samples (n < 30):

Use t-distribution for confidence intervals
Be cautious with normality assumptions
Consider non-parametric alternatives

Large Samples (n ≥ 30):

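The sampling distribution of the mean is approximately normal by the Central Limit Theorem (n ≥ 30 is a rule of thumb; strongly skewed or heavy-tailed data may need more)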
Normal approximation often valid
More robust to assumption violations

Reporting Guidelines

Essential Elements:

Sample size (n)
Measures of central tendency and variability
Confidence intervals when appropriate
Assessment of distributional assumptions
Treatment of outliers and missing data

Example: "The sample (n = 120) had a mean age of 45.2 years (SD = 12.8, 95% CI [42.9, 47.5]). The distribution was approximately normal (Shapiro-Wilk p = 0.18) with no extreme outliers identified."

This comprehensive guide provides the foundation for understanding and applying descriptive statistics for numerical data. Proper application of these concepts is essential for accurate data analysis and valid statistical inference.