Numerical Descriptives and Distributions: Comprehensive Reference Guide
This guide covers descriptive statistics for numerical data: measures of central tendency, variability, and distribution shape; normality testing; and data transformation techniques, with mathematical formulations and interpretation guidelines.
Overview
Descriptive statistics summarize and describe the main features of numerical datasets. Understanding these measures is fundamental for data analysis, hypothesis testing, and making informed decisions based on empirical evidence.
Measures of Central Tendency
1. Arithmetic Mean
Purpose: The average value of a dataset, representing the central point around which data values cluster.
Formula (Sample Mean):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Properties:
- Sensitive to outliers
- Minimizes sum of squared deviations
- Used in parametric statistical tests
Population Mean:
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
2. Median
Purpose: The middle value when data is arranged in ascending order, representing the 50th percentile.
Calculation (with $x_{(i)}$ the ith value in sorted order):
- Odd n: $\tilde{x} = x_{\left(\frac{n+1}{2}\right)}$
- Even n: $\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$
Properties:
- Robust to outliers
- Appropriate for skewed distributions
- Divides dataset into two equal halves
3. Mode
Purpose: The most frequently occurring value(s) in a dataset.
Types:
- Unimodal: One mode
- Bimodal: Two modes
- Multimodal: Multiple modes
- No mode: All values occur with equal frequency
Properties:
- Can be used with categorical data
- May not exist or may not be unique
- Useful for identifying typical values
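The three measures above can be computed with Python's standard library; a minimal sketch (the sample values are illustrative):

```python
# Central tendency with the stdlib statistics module.
import statistics

data = [2, 3, 3, 5, 7, 10, 48]  # 48 is an outlier

mean = statistics.mean(data)      # sensitive to the outlier: pulled toward 48
median = statistics.median(data)  # robust: the middle value, 5
mode = statistics.mode(data)      # most frequent value, 3
```

Note how the outlier drags the mean well above the median, illustrating the robustness property listed above.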
Measures of Variability
1. Variance
Sample Variance:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Population Variance:
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Properties:
- Measures average squared deviation from mean
- Units are squared original units
- Always non-negative
2. Standard Deviation
Sample Standard Deviation:
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
Population Standard Deviation:
$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$
Properties:
- Same units as original data
- Approximately 68% of data within 1 SD of mean (normal distribution)
- Approximately 95% of data within 2 SD of mean (normal distribution)
3. Range
Formula:
$\text{Range} = x_{\max} - x_{\min}$
Properties:
- Simple measure of spread
- Highly sensitive to outliers
- Easy to calculate and interpret
4. Interquartile Range (IQR)
Formula:
$\text{IQR} = Q_3 - Q_1$
Where:
- $Q_1$ = 25th percentile (first quartile)
- $Q_3$ = 75th percentile (third quartile)
Properties:
- Robust to outliers
- Contains middle 50% of data
- Used in box plot construction
5. Coefficient of Variation
Formula:
$CV = \frac{s}{\bar{x}} \times 100\%$
Properties:
- Relative measure of variability
- Unitless (allows comparison across different scales)
- Useful when comparing variability of different datasets
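All five variability measures can be sketched with numpy (the data are illustrative; `ddof=1` selects the sample versions that divide by n − 1):

```python
# Variability measures with numpy.
import numpy as np

data = np.array([4, 8, 6, 5, 3, 7, 9, 5])

var_s = data.var(ddof=1)           # sample variance (divides by n - 1)
sd_s = data.std(ddof=1)            # sample standard deviation
rng = data.max() - data.min()      # range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                      # interquartile range: middle 50% of data
cv = sd_s / data.mean() * 100      # coefficient of variation, in percent
```

Because the CV is unitless, it can compare spread across variables measured on different scales, as noted above.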
Distribution Shape Measures
1. Skewness
Sample Skewness (bias-corrected):
$g_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$
Interpretation:
- Skewness = 0: Symmetric distribution
- Skewness > 0: Right-skewed (positive skew)
- Skewness < 0: Left-skewed (negative skew)
- |Skewness| > 2: Highly skewed
2. Kurtosis
Sample Excess Kurtosis (bias-corrected):
$g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$
Interpretation:
- Kurtosis = 0: Normal distribution (mesokurtic)
- Kurtosis > 0: Heavy-tailed distribution (leptokurtic)
- Kurtosis < 0: Light-tailed distribution (platykurtic)
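A sketch of both shape measures using scipy (`bias=False` gives the bias-corrected sample versions; the exponential sample is illustrative, chosen because it is right-skewed and heavy-tailed):

```python
# Skewness and excess kurtosis with scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
right_skewed = rng.exponential(scale=2.0, size=1000)

skew = stats.skew(right_skewed, bias=False)      # > 0: right skew
kurt = stats.kurtosis(right_skewed, bias=False)  # excess kurtosis; 0 for a normal
```

For an exponential distribution both values are well above zero, matching the "right-skewed, leptokurtic" interpretation above.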
Percentiles and Quartiles
Percentile Calculation
Formula for kth percentile:
Position in the sorted data: $L_k = \frac{k}{100}(n + 1)$
If $L_k$ is not an integer, interpolate linearly between the two surrounding order statistics.
Quartiles
- Q₁ (25th percentile): position $0.25(n+1)$
- Q₂ (50th percentile): position $0.50(n+1)$ = Median
- Q₃ (75th percentile): position $0.75(n+1)$
Five-Number Summary
- Minimum value
- First quartile (Q₁)
- Median (Q₂)
- Third quartile (Q₃)
- Maximum value
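The five-number summary falls out of a single percentile call in numpy (sample values are illustrative; numpy's default percentile method uses linear interpolation):

```python
# Five-number summary via numpy percentiles.
import numpy as np

data = np.array([12, 15, 14, 10, 8, 12, 16, 21, 13, 11])

minimum, q1, q2, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
```

These five values are exactly what a box plot draws: the box spans Q₁ to Q₃, the line inside is the median, and the whiskers extend toward the extremes.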
Confidence Intervals for Means
Confidence Interval for Population Mean (σ known)
$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Confidence Interval for Population Mean (σ unknown)
$\bar{x} \pm t_{\alpha/2,\,df} \frac{s}{\sqrt{n}}$
Where:
- $df = n - 1$ (degrees of freedom)
- $t_{\alpha/2,\,df}$ = critical t-value
Interpretation:
- 95% CI: We are 95% confident the true population mean lies within this interval
- Narrower intervals indicate more precise estimates
- Larger samples generally produce narrower intervals
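The σ-unknown interval can be sketched with scipy (the data are illustrative; `stats.t.ppf` gives the critical t-value for the stated confidence level):

```python
# 95% t-based confidence interval for the mean (sigma unknown).
import numpy as np
from scipy import stats

data = np.array([4.1, 5.2, 6.3, 4.8, 5.5, 5.9, 4.4, 5.1])
n = len(data)
mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(n)    # standard error of the mean, s / sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value, df = n - 1
ci = (mean - t_crit * sem, mean + t_crit * sem)
```

With only n = 8 observations the t critical value (≈ 2.36) is noticeably larger than the normal 1.96, widening the interval as the small-sample guidance below recommends.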
Normality Testing
1. Shapiro-Wilk Test
Purpose: Tests whether a sample comes from a normally distributed population.
Test Statistic:
$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
where $x_{(i)}$ are the order statistics and $a_i$ are tabulated coefficients derived from the expected normal order statistics.
Properties:
- Among the most powerful tests for normality
- Recommended for sample sizes n ≤ 50
- Sensitive to outliers
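A minimal sketch with scipy (the simulated sample is illustrative; a p-value above 0.05 means no evidence against normality, not proof of it):

```python
# Shapiro-Wilk normality test via scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=5, size=40)  # n <= 50, per the guidance above

w_stat, p_value = stats.shapiro(sample)
normal_enough = p_value > 0.05  # fail to reject normality at the 5% level
```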
2. Kolmogorov-Smirnov Test
Purpose: Tests whether a sample follows a specified distribution.
Test Statistic:
$D = \sup_x \left| F_n(x) - F(x) \right|$
Where:
- $F_n(x)$ = empirical distribution function
- $F(x)$ = theoretical (hypothesized) distribution function
Properties:
- Distribution-free test
- Can test against any continuous distribution
- Less powerful than Shapiro-Wilk for normality
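A sketch of the one-sample KS test against a normal distribution in scipy. Note an assumption in this example: estimating the mean and SD from the same sample makes the standard KS p-value conservative (the Lilliefors variant corrects for this):

```python
# One-sample Kolmogorov-Smirnov test against a fitted normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(size=200)

# D is the largest gap between the empirical CDF and the normal CDF
d_stat, p_value = stats.kstest(sample, "norm",
                               args=(sample.mean(), sample.std(ddof=1)))
```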
3. Anderson-Darling Test
Purpose: Tests goodness of fit to a specified distribution with emphasis on tail behavior.
Test Statistic:
$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n} (2i - 1)\left[\ln F(x_{(i)}) + \ln\left(1 - F(x_{(n+1-i)})\right)\right]$
where $x_{(i)}$ are the order statistics and $F$ is the hypothesized CDF.
Properties:
- More sensitive to deviations in tails
- Good for detecting departures from normality
- Provides better power than Kolmogorov-Smirnov
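A sketch with scipy; unlike the other two tests, `stats.anderson` returns critical values at fixed significance levels rather than a p-value:

```python
# Anderson-Darling normality test via scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(size=100)

result = stats.anderson(sample, dist="norm")
# significance_level is [15, 10, 5, 2.5, 1] percent; pick the 5% column
idx_5pct = result.significance_level.tolist().index(5.0)
# reject normality at 5% if the statistic exceeds its critical value
reject_5pct = result.statistic > result.critical_values[idx_5pct]
```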
Data Transformation Techniques
1. Log Transformation
Formula:
$y = \log(x)$ (natural, base-10, or base-2; requires $x > 0$)
Use Cases:
- Right-skewed data
- Multiplicative relationships
- Stabilizing variance
2. Square Root Transformation
Formula:
$y = \sqrt{x}$ (requires $x \geq 0$)
Use Cases:
- Poisson-distributed data
- Count data with small values
- Moderate right skew
3. Box-Cox Transformation
Formula:
For $\lambda \neq 0$: $y(\lambda) = \frac{x^{\lambda} - 1}{\lambda}$
For $\lambda = 0$: $y(\lambda) = \ln(x)$
Properties:
- Optimal λ chosen to maximize normality
- Includes log transformation as special case
- Requires positive values
4. Reciprocal Transformation
Formula:
$y = \frac{1}{x}$ (requires $x \neq 0$)
Use Cases:
- Severe right skew
- Rate or time data
- When larger values need compression
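All four transformations can be sketched in a few lines (the geometric-looking sample is illustrative; `scipy.stats.boxcox` chooses λ by maximum likelihood, matching the "optimal λ" property above):

```python
# The four transformations on strictly positive, right-skewed data.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 64.0])  # right-skewed, all positive

log_x = np.log(x)            # log transform (natural log)
sqrt_x = np.sqrt(x)          # square-root transform
recip_x = 1.0 / x            # reciprocal transform
bc_x, lam = stats.boxcox(x)  # Box-Cox with MLE-estimated lambda
```

If the estimated λ is near 0, Box-Cox is effectively the log transform; near 0.5, the square-root transform; near −1, the reciprocal.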
Outlier Detection Methods
1. Z-Score Method
Formula:
$z_i = \frac{x_i - \bar{x}}{s}$
Criterion: $|z_i| > 2$ or $|z_i| > 3$ (depending on stringency)
2. Interquartile Range (IQR) Method
Outlier boundaries:
- Lower fence: $Q_1 - 1.5 \times \text{IQR}$
- Upper fence: $Q_3 + 1.5 \times \text{IQR}$
Extreme outliers:
- Lower extreme: $Q_1 - 3 \times \text{IQR}$
- Upper extreme: $Q_3 + 3 \times \text{IQR}$
3. Modified Z-Score
Formula:
$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\text{MAD}}$
Where:
- $\tilde{x}$ = median
- $\text{MAD}$ = median absolute deviation, $\text{median}\left(|x_i - \tilde{x}|\right)$
Criterion: $|M_i| > 3.5$
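The three rules can be compared side by side on the same data (the sample and thresholds are the conventional illustrative choices; all three flag the same point here):

```python
# Three outlier-detection rules applied to one dataset.
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 14, 13, 12, 45])  # 45 is suspicious

# 1. Z-score rule: flag |z| > 2
z = (data - data.mean()) / data.std(ddof=1)
z_outliers = data[np.abs(z) > 2]

# 2. IQR fences: flag points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
fence_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# 3. Modified z-score: robust, uses median and MAD; flag |M| > 3.5
med = np.median(data)
mad = np.median(np.abs(data - med))
mod_z = 0.6745 * (data - med) / mad
modz_outliers = data[np.abs(mod_z) > 3.5]
```

Because the outlier inflates the mean and SD used by the plain z-score, the robust modified z-score separates it far more sharply, which is why it is preferred for contaminated data.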
Practical Guidelines
Choosing Appropriate Measures
For Symmetric Distributions:
- Central tendency: Mean
- Variability: Standard deviation
- Use parametric methods
For Skewed Distributions:
- Central tendency: Median
- Variability: IQR or MAD
- Consider transformations or non-parametric methods
For Distributions with Outliers:
- Use robust measures (median, IQR)
- Investigate outliers before removal
- Consider outlier-resistant methods
Sample Size Considerations
Small Samples (n < 30):
- Use t-distribution for confidence intervals
- Be cautious with normality assumptions
- Consider non-parametric alternatives
Large Samples (n ≥ 30):
- Central Limit Theorem applies
- Normal approximation often valid
- More robust to assumption violations
Reporting Guidelines
Essential Elements:
- Sample size (n)
- Measures of central tendency and variability
- Confidence intervals when appropriate
- Assessment of distributional assumptions
- Treatment of outliers and missing data
Example: "The sample (n = 120) had a mean age of 45.2 years (SD = 12.8, 95% CI [42.9, 47.5]). The distribution was approximately normal (Shapiro-Wilk p = 0.18) with no extreme outliers identified."
This comprehensive guide provides the foundation for understanding and applying descriptive statistics for numerical data. Proper application of these concepts is essential for accurate data analysis and valid statistical inference.