Numerical Descriptives and Distributions

Comprehensive reference guide for descriptive statistics of numerical data.

This guide covers descriptive statistics for numerical data: measures of central tendency, variability, and distribution shape, plus normality testing and data transformation techniques, with mathematical formulations and interpretation guidelines.

Overview

Descriptive statistics summarize and describe the main features of numerical datasets. Understanding these measures is fundamental for data analysis, hypothesis testing, and making informed decisions based on empirical evidence.

Measures of Central Tendency

1. Arithmetic Mean

Purpose: The average value of a dataset, representing the central point around which data values cluster.

Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$

Properties:

Population Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N}x_i$

2. Median

Purpose: The middle value when data is arranged in ascending order, representing the 50th percentile.

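Calculation: for odd $n$, the median is the value at position $(n+1)/2$ of the sorted data; for even $n$, it is the average of the values at positions $n/2$ and $n/2 + 1$.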

Properties:

3. Mode

Purpose: The most frequently occurring value(s) in a dataset.

Types:

Properties:
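To make the three measures concrete, the following minimal Python sketch computes them with the standard-library `statistics` module. The sample values are hypothetical, chosen so that one large observation pulls the mean away from the median.

```python
import statistics

# Hypothetical sample; the single large value (40) acts as an outlier.
data = [2, 3, 3, 5, 7, 10, 40]

mean = statistics.mean(data)        # sum of the values divided by n
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # list of all most frequent values

print(mean, median, modes)
```

In this sample the mean (10) is twice the median (5) because of the single value 40, while the mode (3) is unaffected by it.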

Measures of Variability

1. Variance

Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Population Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

Properties:

2. Standard Deviation

Sample Standard Deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Population Standard Deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

Properties:
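The difference between the sample ($n - 1$) and population ($N$) denominators is easiest to see side by side. The sketch below uses Python's `statistics` module on a hypothetical dataset and also recomputes the sample variance directly from the formula as a cross-check.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
n = len(data)
xbar = statistics.mean(data)

sample_var = statistics.variance(data)  # divides by n - 1
pop_var = statistics.pvariance(data)    # divides by N
sample_sd = statistics.stdev(data)      # square root of the sample variance
pop_sd = statistics.pstdev(data)        # square root of the population variance

# Cross-check of the sample variance, written out from the formula.
manual_sample_var = sum((x - xbar) ** 2 for x in data) / (n - 1)

print(sample_var, pop_var, sample_sd, pop_sd)
```

For these values the sum of squared deviations is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7, which is slightly larger, as expected from the smaller denominator.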

3. Range

Formula: $\text{Range} = x_{\max} - x_{\min}$

Properties:

4. Interquartile Range (IQR)

Formula: $IQR = Q_3 - Q_1$

Where: $Q_1$ is the first quartile (25th percentile) and $Q_3$ is the third quartile (75th percentile).

Properties:

5. Coefficient of Variation

Formula: $CV = \frac{s}{\bar{x}} \times 100\%$

Properties:
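As a rough illustration of the range, IQR, and coefficient of variation, the sketch below computes all three for a hypothetical sample. It assumes the quartiles come from `statistics.quantiles` with its default "exclusive" method; other software may use a different quartile convention and give slightly different values.

```python
import statistics

data = [4, 7, 9, 10, 12, 15, 21]  # hypothetical ratio-scale measurements

data_range = max(data) - min(data)

# quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# CV expresses the standard deviation as a percentage of the mean,
# so it is only meaningful when the mean is positive and not near zero.
cv = statistics.stdev(data) / statistics.mean(data) * 100

print(data_range, iqr, round(cv, 1))
```

For this sample the range is 17 and the IQR is 8 (Q₁ = 7, Q₃ = 15), which shows how much of the range is driven by the extremes.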

Distribution Shape Measures

1. Skewness

Sample Skewness: $\text{Skewness} = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3$

Interpretation:

2. Kurtosis

Sample Kurtosis: $\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$

Interpretation:
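The two shape formulas in this section can be transcribed into Python almost term by term. The following is a minimal sketch using only the standard library; the function names are illustrative, and the kurtosis function returns excess kurtosis because the formula subtracts the correction term.

```python
import statistics

def sample_skewness(data):
    """Bias-adjusted sample skewness; requires n >= 3."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in data)

def sample_excess_kurtosis(data):
    """Bias-adjusted sample excess kurtosis; requires n >= 4."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)
    fourth = sum(((x - xbar) / s) ** 4 for x in data)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(right_skewed))
```

The symmetric sample has a skewness of essentially zero and an excess kurtosis of about -1.2 (flatter than a normal distribution), while the sample with one large value has positive skewness.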

Percentiles and Quartiles

Percentile Calculation

Formula for kth percentile: $P_k = \text{value below which } k\% \text{ of data falls}$

Position calculation: $\text{Position} = \frac{k}{100} \times (n + 1)$
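The position formula gives a rank in the sorted data that is often fractional, and the guide does not say how to handle that case. The sketch below assumes linear interpolation between the two neighboring values and clamps positions that fall outside the data; both choices are assumptions, and the function name is illustrative.

```python
def percentile(data, k):
    """kth percentile using position = k/100 * (n + 1).

    Fractional positions are resolved by linear interpolation and
    out-of-range positions are clamped (assumed conventions).
    """
    ordered = sorted(data)
    n = len(ordered)
    position = k / 100 * (n + 1)  # 1-based rank, possibly fractional
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    lower = int(position)          # rank of the value at or just below
    fraction = position - lower    # how far to move toward the next value
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [4, 7, 9, 10, 12, 15, 21]
print(percentile(data, 25), percentile(data, 50), percentile(data, 75))
```

With seven values, the 25th percentile sits at position 0.25 × 8 = 2, the second sorted value (7); the 40th percentile sits at position 3.2, a fifth of the way from 9 to 10.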

Quartiles

Five-Number Summary

  1. Minimum value
  2. First quartile (Q₁)
  3. Median (Q₂)
  4. Third quartile (Q₃)
  5. Maximum value
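A five-number summary can be assembled from the minimum, the three quartile cut points, and the maximum. This is a minimal sketch that assumes the default "exclusive" method of `statistics.quantiles`; the function name is illustrative.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a numeric sample."""
    q1, q2, q3 = statistics.quantiles(data, n=4)  # default method="exclusive"
    return min(data), q1, q2, q3, max(data)

summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8])
print(summary)
```

For the integers 1 through 8 this gives (1, 2.25, 4.5, 6.75, 8); the same five values are what a box plot draws.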

Confidence Intervals for Means

Confidence Interval for Population Mean (σ known)

$CI = \bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$

Confidence Interval for Population Mean (σ unknown)

$CI = \bar{x} \pm t_{\alpha/2,df} \times \frac{s}{\sqrt{n}}$

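Where: $t_{\alpha/2,df}$ is the critical value of the t-distribution with $df = n - 1$ degrees of freedom, $s$ is the sample standard deviation, and $n$ is the sample size.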

Interpretation:
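Both interval formulas share the same structure: a point estimate plus or minus a critical value times the standard error. The sketch below reuses the summary statistics from the reporting example in this guide (n = 120, mean 45.2, SD 12.8). The z critical value comes from the standard library; the t critical value for 119 degrees of freedom is an assumed table lookup (about 1.980), since the standard library has no t-distribution.

```python
import math
from statistics import NormalDist

def mean_ci(xbar, sd, n, critical):
    """Confidence interval: xbar +/- critical * sd / sqrt(n)."""
    margin = critical * sd / math.sqrt(n)
    return xbar - margin, xbar + margin

xbar, sd, n = 45.2, 12.8, 120

# Sigma known: 95% interval uses the standard normal critical value.
z_crit = NormalDist().inv_cdf(0.975)  # about 1.96
z_low, z_high = mean_ci(xbar, sd, n, z_crit)

# Sigma unknown: t critical value for df = 119, assumed from a t-table.
t_crit = 1.980
t_low, t_high = mean_ci(xbar, sd, n, t_crit)

print(round(z_low, 1), round(z_high, 1))
print(round(t_low, 1), round(t_high, 1))
```

Under these assumptions the t-based interval rounds to [42.9, 47.5]. It is slightly wider than the z-based interval, and the gap shrinks as n grows.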

Normality Testing

1. Shapiro-Wilk Test

Purpose: Tests whether a sample comes from a normally distributed population.

Test Statistic: $W = \frac{\left(\sum_{i=1}^{n}a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$

Properties:

2. Kolmogorov-Smirnov Test

Purpose: Tests whether a sample follows a specified distribution.

Test Statistic: $D = \max_i |F_n(x_i) - F_0(x_i)|$

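Where: $F_n$ is the empirical cumulative distribution function of the sample and $F_0$ is the cumulative distribution function of the hypothesized distribution.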

Properties:
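The Kolmogorov-Smirnov statistic can be computed by hand from the sorted sample. The sketch below is a minimal standard-library version that returns only $D$, not a p-value. Because the empirical CDF is a step function, it checks the gap on both sides of each jump. The function name and the choice of a standard normal reference distribution are illustrative assumptions.

```python
from statistics import NormalDist

def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF and a fully specified CDF."""
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        f0 = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x,
        # so compare the hypothesized CDF against both levels.
        d = max(d, i / n - f0, f0 - (i - 1) / n)
    return d

d = ks_statistic([-1.0, 0.0, 1.0], NormalDist(0, 1).cdf)
print(round(d, 4))
```

Note that the reference distribution here is fixed in advance. If its mean and standard deviation are instead estimated from the same sample, the standard critical values for $D$ no longer apply.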

3. Anderson-Darling Test

Purpose: Tests goodness of fit to a specified distribution with emphasis on tail behavior.

Test Statistic: $A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)[\ln F(x_i) + \ln(1-F(x_{n+1-i}))]$

Properties:
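The Anderson-Darling statistic can likewise be transcribed directly, provided the observations are sorted in ascending order before indexing. This is a minimal sketch that returns only $A^2$ against a fully specified distribution; the function name is illustrative, and critical values are not computed.

```python
import math
from statistics import NormalDist

def anderson_darling(data, cdf):
    """A^2 statistic for a sample against a fully specified CDF."""
    ordered = sorted(data)  # the formula indexes the sorted sample
    n = len(ordered)
    total = 0.0
    for i in range(1, n + 1):
        f_low = cdf(ordered[i - 1])   # F(x_i), 1-based index i
        f_high = cdf(ordered[n - i])  # F(x_{n+1-i})
        # math.log raises ValueError if a CDF value is exactly 0 or 1.
        total += (2 * i - 1) * (math.log(f_low) + math.log(1 - f_high))
    return -n - total / n

a2 = anderson_darling([-1.0, 0.0, 1.0], NormalDist(0, 1).cdf)
print(round(a2, 4))
```

The logarithms grow large in magnitude when the CDF is close to 0 or 1, which is how the statistic puts extra weight on the tails.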

Data Transformation Techniques

1. Log Transformation

Formula: $y = \ln(x)$ or $y = \log_{10}(x)$

Use Cases:

2. Square Root Transformation

Formula: $y = \sqrt{x}$

Use Cases:

3. Box-Cox Transformation

Formula:

For $\lambda \neq 0$: $y(\lambda) = \frac{x^\lambda - 1}{\lambda}$

For $\lambda = 0$: $y(\lambda) = \ln(x)$

Properties:

4. Reciprocal Transformation

Formula: $y = \frac{1}{x}$

Use Cases:
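The four transformations in this section can be compared on the same values. The sketch below applies each one to a small hypothetical right-skewed sample and implements Box-Cox for a given $\lambda$ directly from the piecewise formula. Choosing $\lambda$ (for example by maximum likelihood) is outside the scope of this sketch, and the function name is illustrative.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a single strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive x")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

values = [1.0, 4.0, 9.0, 100.0]  # hypothetical right-skewed values

log_t = [math.log(v) for v in values]        # natural log, needs v > 0
sqrt_t = [math.sqrt(v) for v in values]      # needs v >= 0
recip_t = [1 / v for v in values]            # needs v != 0, reverses order
box_half = [box_cox(v, 0.5) for v in values]

print(sqrt_t)
print(box_half)
```

With $\lambda = 0.5$ the Box-Cox result is a shifted and rescaled square root, and as $\lambda$ approaches 0 it approaches the natural log, which is why the $\lambda = 0$ case is defined as $\ln(x)$.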

Outlier Detection Methods

1. Z-Score Method

Formula: $z_i = \frac{x_i - \bar{x}}{s}$

Criterion: $|z_i| > 2$ or $|z_i| > 3$ (depending on stringency)

2. Interquartile Range (IQR) Method

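Outlier boundaries: values below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$

Extreme outliers: values below $Q_1 - 3 \times IQR$ or above $Q_3 + 3 \times IQR$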
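A minimal sketch of the IQR method follows. It assumes the conventional fences at 1.5 × IQR beyond the quartiles for outliers and 3 × IQR for extreme outliers, with quartiles taken from `statistics.quantiles` in its default "exclusive" mode. The function name and data are hypothetical.

```python
import statistics

def iqr_fences(data, k=1.5):
    """Lower and upper fences at k * IQR beyond Q1 and Q3."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [4, 7, 9, 10, 12, 15, 21, 60]  # hypothetical sample

low, high = iqr_fences(data)                      # k = 1.5
outliers = [x for x in data if x < low or x > high]

ext_low, ext_high = iqr_fences(data, k=3.0)       # extreme fences
extreme = [x for x in data if x < ext_low or x > ext_high]

print((low, high), outliers, extreme)
```

Here Q₁ = 7.5 and Q₃ = 19.5, so the fences are -10.5 and 37.5, and the value 60 lies beyond both the ordinary and the extreme upper fence (55.5).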

3. Modified Z-Score

Formula: $M_i = \frac{0.6745(x_i - \tilde{x})}{MAD}$

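Where: $\tilde{x}$ is the sample median and $MAD = \text{median}(|x_i - \tilde{x}|)$ is the median absolute deviation.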

Criterion: $|M_i| > 3.5$
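The modified z-score replaces the mean and standard deviation with the median and the median absolute deviation, which a single extreme value cannot inflate. The sketch below computes both the modified and the ordinary z-scores for a hypothetical sample to show the difference; the function name is illustrative.

```python
import statistics

def modified_z_scores(data):
    """M_i = 0.6745 * (x_i - median) / MAD for each observation."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (x - med) / mad for x in data]

data = [10, 12, 11, 13, 12, 11, 50]  # hypothetical sample with one outlier

m_scores = modified_z_scores(data)
flagged = [x for x, m in zip(data, m_scores) if abs(m) > 3.5]

# Ordinary z-scores for comparison: the outlier inflates s itself.
xbar = statistics.mean(data)
s = statistics.stdev(data)
max_abs_z = max(abs((x - xbar) / s) for x in data)

print(flagged, round(max_abs_z, 2))
```

In this sample the value 50 has a modified z-score of about 25.6 and is flagged, while its ordinary z-score is only about 2.26, below the stricter cutoff of 3, because the outlier itself enlarges the standard deviation.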

Practical Guidelines

Choosing Appropriate Measures

For Symmetric Distributions:

For Skewed Distributions:

For Distributions with Outliers:

Sample Size Considerations

Small Samples (n < 30):

Large Samples (n ≥ 30):

Reporting Guidelines

Essential Elements:

Example: "The sample (n = 120) had a mean age of 45.2 years (SD = 12.8, 95% CI [42.9, 47.5]). The distribution was approximately normal (Shapiro-Wilk p = 0.18) with no extreme outliers identified."

This guide provides the foundation for understanding and applying descriptive statistics to numerical data. Applying these measures correctly is essential for accurate data analysis and valid statistical inference.