
Categorical Descriptives and Association

Comprehensive reference guide for categorical data analysis and association measures.

This comprehensive guide covers descriptive statistics and association measures for categorical data, including frequency analysis, chi-square tests, measures of association, odds ratios, and specialized tests for categorical data analysis.

Overview

Categorical data analysis involves examining the distribution and relationships between variables measured on nominal or ordinal scales. These methods are fundamental for understanding patterns, associations, and dependencies in categorical datasets.

Frequency Tables and Cross-Tabulations

1. Frequency Tables

Purpose: Summarize the distribution of a single categorical variable.

Components:

- Category labels
- Frequency (count of observations in each category)
- Relative frequency (proportion of the total)
- Percentage
- Cumulative percentage

Example Structure:

| Category | Frequency | Relative Frequency | Percentage | Cumulative % |
|---|---|---|---|---|
| A | 25 | 0.25 | 25% | 25% |
| B | 40 | 0.40 | 40% | 65% |
| C | 35 | 0.35 | 35% | 100% |
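The table above can be produced with a short pure-Python helper (a sketch; the function name is illustrative):

```python
from collections import Counter

def frequency_table(values):
    """Return (category, frequency, relative frequency, cumulative %) rows."""
    n = len(values)
    counts = Counter(values)
    rows = []
    cumulative = 0.0
    for category in sorted(counts):
        relative = counts[category] / n
        cumulative += relative
        rows.append((category, counts[category], relative, round(100 * cumulative, 1)))
    return rows

# Reproduces the example: 25 A's, 40 B's, 35 C's
data = ["A"] * 25 + ["B"] * 40 + ["C"] * 35
for row in frequency_table(data):
    print(row)
```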

2. Cross-Tabulation (Contingency Tables)

Purpose: Examine the relationship between two categorical variables.

2×2 Contingency Table:

| Variable A | B₁ | B₂ | Total |
|---|---|---|---|
| A₁ | a | b | a + b |
| A₂ | c | d | c + d |
| Total | a + c | b + d | n |

(Columns are the levels of Variable B.)

Expected Frequencies: $E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{n}$
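The expected-frequency formula translates directly into a small pure-Python helper (a sketch; the function name is illustrative):

```python
def expected_frequencies(observed):
    """E_ij = row_total_i * column_total_j / n for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Row totals 30 and 70, column totals 40 and 60, n = 100
print(expected_frequencies([[10, 20], [30, 40]]))  # [[12.0, 18.0], [28.0, 42.0]]
```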

Proportions and Percentages

1. Sample Proportion

Formula: $\hat{p} = \frac{x}{n}$

Where:

- $x$ = number of observations in the category of interest
- $n$ = total sample size

2. Confidence Interval for Proportion

Normal Approximation (large samples): $CI = \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

Wilson Score Interval (better for small samples): $CI = \frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$
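Both intervals can be sketched as small helpers (illustrative names; $z = 1.96$ gives 95% confidence):

```python
import math

def wald_ci(x, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

def wilson_ci(x, n, z=1.96):
    """Wilson score interval; behaves better for small n or extreme p."""
    p = x / n
    denom = 1 + z ** 2 / n
    centre = p + z ** 2 / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return ((centre - half) / denom, (centre + half) / denom)

print(wald_ci(40, 100))
print(wilson_ci(0, 10))  # Wilson stays sensible even when p-hat = 0
```

Note how the Wald interval collapses to a zero-width interval when $\hat{p} = 0$, while the Wilson interval still gives a usable upper bound.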

3. Difference in Proportions

Formula: $\hat{p}_1 - \hat{p}_2$

Confidence Interval: $CI = (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

Chi-Square Tests

1. Chi-Square Test of Independence

Purpose: Tests whether two categorical variables are independent.

Null Hypothesis: H₀: Variables are independent
Alternative Hypothesis: H₁: Variables are associated

Test Statistic: $\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

Where:

- $O_{ij}$ = observed frequency in cell $(i, j)$
- $E_{ij}$ = expected frequency in cell $(i, j)$
- $r$ = number of rows, $c$ = number of columns

Degrees of Freedom: $df = (r-1)(c-1)$

Assumptions:

- Observations are independent
- Data are counts, not percentages or proportions
- Expected frequency is at least 5 in at least 80% of cells, and no expected frequency is below 1
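A minimal sketch of the statistic and degrees of freedom (illustrative function name; a p-value would additionally require a chi-square CDF such as `scipy.stats.chi2`):

```python
def chi_square_independence(observed):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency E_ij
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

print(chi_square_independence([[10, 20], [30, 40]]))
```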

2. Chi-Square Goodness of Fit Test

Purpose: Tests whether observed frequencies match expected frequencies from a theoretical distribution.

Test Statistic: $\chi^2 = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i}$

Degrees of Freedom: $df = k - 1 - \text{number of estimated parameters}$
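The goodness-of-fit statistic is a one-line sum over categories (a sketch; the function name is illustrative):

```python
def chi_square_gof(observed, expected):
    """Goodness-of-fit chi-square across k categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed [10, 20, 30] against a uniform expectation of 20 per category
print(chi_square_gof([10, 20, 30], [20, 20, 20]))  # 10.0
```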

3. Yates' Continuity Correction

For 2×2 tables: $\chi^2 = \sum_{i,j}\frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}$

Use when: Any expected frequency is between 5 and 10

Measures of Association

1. Cramér's V

Purpose: Measures strength of association between two categorical variables.

Formula: $V = \sqrt{\frac{\chi^2}{n \times \min(r-1, c-1)}}$

Interpretation:

- Ranges from 0 (no association) to 1 (perfect association)
- $V \approx 0.1$: small association
- $V \approx 0.3$: medium association
- $V \approx 0.5$ or above: large association

2. Phi Coefficient (φ)

Purpose: Measures association for 2×2 tables.

Formula: $\phi = \sqrt{\frac{\chi^2}{n}} = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$

Properties:

- Ranges from −1 to +1 for 2×2 tables
- Equals the Pearson correlation between the two variables coded 0/1
- $|\phi|$ follows the same small/medium/large benchmarks as Cramér's V
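Both measures are straightforward to compute from the cell counts or the chi-square statistic (a sketch; function names are illustrative):

```python
import math

def phi_coefficient(a, b, c, d):
    """phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)) for a 2x2 table."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def cramers_v(chi2, n, r, c):
    """Cramér's V from a chi-square statistic for an r x c table."""
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

print(phi_coefficient(10, 20, 30, 40))
print(cramers_v(50 / 63, 100, 2, 2))  # for a 2x2 table, equals |phi|
```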

3. Lambda (λ)

Purpose: Proportional reduction in error measure based on modal categories.

Symmetric Lambda: $\lambda = \frac{(\sum_i f_{i,\max} + \sum_j f_{\max,j}) - (F_{r,\max} + F_{c,\max})}{2n - (F_{r,\max} + F_{c,\max})}$

where $f_{i,\max}$ is the largest frequency in row $i$, $f_{\max,j}$ the largest frequency in column $j$, and $F_{r,\max}$ and $F_{c,\max}$ the largest row and column marginal totals.

Asymmetric Lambda (predicting Y from X): $\lambda_{Y|X} = \frac{\sum_i f_{i,\max} - F_{c,\max}}{n - F_{c,\max}}$, where rows are categories of X, columns are categories of Y, and $F_{c,\max}$ is the largest column marginal total.

Interpretation:

- Ranges from 0 to 1
- $\lambda_{Y|X} = 0.25$ means knowing X reduces errors in predicting Y by 25%
- Can equal 0 even when an association exists, if every row shares the same modal column
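The asymmetric version can be sketched directly from its definition: predict the overall modal column without X, the row-wise modal column with X, and compare the error counts (illustrative function name):

```python
def lambda_y_given_x(table):
    """Goodman-Kruskal lambda: proportional reduction in error when predicting
    the column variable (Y) from the row variable (X); table is rows of counts."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    within_row_max = sum(max(row) for row in table)  # correct predictions using X
    overall_max = max(col_totals)                    # correct predictions ignoring X
    return (within_row_max - overall_max) / (n - overall_max)

print(lambda_y_given_x([[10, 5], [5, 10]]))
```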

4. Contingency Coefficient

Formula: $C = \sqrt{\frac{\chi^2}{\chi^2 + n}}$

Properties:

- Ranges from 0 to less than 1; the attainable maximum depends on table dimensions
- Not directly comparable across tables of different sizes (Cramér's V is preferred for that)

Odds Ratios and Relative Risk

1. Odds Ratio (OR)

For 2×2 table: $OR = \frac{ad}{bc} = \frac{\text{odds in group 1}}{\text{odds in group 2}}$

Log Odds Ratio: $\ln(OR) = \ln(a) + \ln(d) - \ln(b) - \ln(c)$

Confidence Interval for ln(OR): $CI = \ln(OR) \pm z_{\alpha/2}\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$ (exponentiate the endpoints to obtain the CI for OR)

Interpretation:

- OR = 1: no association
- OR > 1: higher odds of the outcome in group 1
- OR < 1: lower odds of the outcome in group 1
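A sketch of the odds ratio with its log-scale confidence interval (illustrative function name; assumes all four cells are nonzero):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and CI via the normal approximation on the log-odds scale."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper

print(odds_ratio_ci(10, 20, 30, 40))
```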

2. Relative Risk (RR)

Formula: $RR = \frac{a/(a+b)}{c/(c+d)} = \frac{\text{risk in exposed group}}{\text{risk in unexposed group}}$

Confidence Interval for ln(RR): $CI = \ln(RR) \pm z_{\alpha/2}\sqrt{\frac{1}{a} - \frac{1}{a+b} + \frac{1}{c} - \frac{1}{c+d}}$

Interpretation:

- RR = 1: equal risk in both groups
- RR > 1: higher risk in the exposed group
- RR < 1: lower risk in the exposed group (protective effect)
- RR approximates OR when the outcome is rare

3. Number Needed to Treat (NNT)

Formula: $NNT = \frac{1}{|p_1 - p_2|} = \frac{1}{\text{Absolute Risk Reduction}}$

Interpretation: Number of patients that need to be treated to prevent one additional adverse outcome.
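Relative risk and NNT are simple ratio calculations on the 2×2 cell counts and group risks (a sketch; function names are illustrative):

```python
def relative_risk(a, b, c, d):
    """RR = risk in the exposed group / risk in the unexposed group."""
    return (a / (a + b)) / (c / (c + d))

def number_needed_to_treat(p1, p2):
    """NNT = 1 / |absolute risk reduction| between the two groups."""
    return 1 / abs(p1 - p2)

print(relative_risk(10, 20, 30, 40))
print(number_needed_to_treat(0.3, 0.2))  # ARR of 10 percentage points
```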

Specialized Tests for Categorical Data

1. Fisher's Exact Test

Purpose: Exact test for 2×2 tables when chi-square assumptions are violated.

Test Statistic: Uses hypergeometric distribution

Probability: $P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$

Use When:

- Any expected frequency is less than 5
- The overall sample size is small
- An exact p-value is required for a 2×2 table
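The hypergeometric probability of a single table can be written with binomial coefficients, which is algebraically identical to the factorial formula above (a sketch; the full test then sums these probabilities over all tables at least as extreme as the observed one):

```python
from math import comb

def fisher_table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with its margins fixed.

    Equivalent to (a+b)!(c+d)!(a+c)!(b+d)! / (a! b! c! d! n!).
    """
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

print(fisher_table_probability(1, 1, 1, 1))
```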

2. McNemar's Test

Purpose: Tests for changes in paired categorical data (before/after designs).

Test Statistic: $\chi^2 = \frac{(b-c)^2}{b+c}$, where $b$ and $c$ are the discordant cell counts.

With Continuity Correction: $\chi^2 = \frac{(|b-c|-1)^2}{b+c}$

Table Structure:

| | After + | After − | Total |
|---|---|---|---|
| Before + | a | b | a + b |
| Before − | c | d | c + d |
| Total | a + c | b + d | n |

Assumptions:

- Each subject is measured twice on a dichotomous outcome (paired data)
- Pairs are independent of one another
- $b + c$ is reasonably large; for small discordant counts, use an exact binomial test instead
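Only the discordant cells enter the statistic, so the computation is a one-liner (a sketch; the function name is illustrative):

```python
def mcnemar_statistic(b, c, corrected=True):
    """McNemar chi-square from the two discordant cell counts b and c."""
    if corrected:  # Yates-style continuity correction
        return (abs(b - c) - 1) ** 2 / (b + c)
    return (b - c) ** 2 / (b + c)

print(mcnemar_statistic(15, 5, corrected=False))  # 5.0
print(mcnemar_statistic(15, 5))                   # corrected version
```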

3. Cochran's Q Test

Purpose: Extension of McNemar's test for more than two time points.

Test Statistic: $Q = \frac{k(k-1)\sum_{j=1}^{k}(C_j - \bar{C})^2}{k\sum_{i=1}^{n}R_i - \sum_{i=1}^{n}R_i^2}$

Where:

- $k$ = number of conditions (time points)
- $C_j$ = number of successes in condition $j$; $\bar{C}$ = mean of the $C_j$
- $R_i$ = number of successes for subject $i$
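A sketch using an algebraically equivalent form of the Q formula, $Q = (k-1)\,\frac{k\sum_j C_j^2 - N^2}{kN - \sum_i R_i^2}$ with $N = \sum_i R_i$ (illustrative function name; input is an n-subjects × k-conditions table of 0/1 outcomes):

```python
def cochrans_q(data):
    """Cochran's Q for an n x k table of binary (0/1) outcomes."""
    k = len(data[0])
    col_totals = [sum(col) for col in zip(*data)]  # C_j
    row_totals = [sum(row) for row in data]        # R_i
    grand = sum(row_totals)                        # N
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand ** 2)
    denominator = k * grand - sum(r * r for r in row_totals)
    return numerator / denominator

# Four subjects, two conditions; for k = 2, Q equals the uncorrected McNemar statistic
print(cochrans_q([[1, 0], [1, 0], [1, 1], [0, 0]]))  # 2.0
```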

Effect Size Measures

1. Cohen's w

For goodness of fit: $w = \sqrt{\sum_{i=1}^{k}\frac{(p_{0i} - p_{1i})^2}{p_{0i}}}$

For independence: $w = \sqrt{\frac{\chi^2}{n}}$

Interpretation:

- $w \approx 0.1$: small effect
- $w \approx 0.3$: medium effect
- $w \approx 0.5$: large effect

2. Cohen's h

For difference in proportions: $h = 2(\arcsin\sqrt{p_1} - \arcsin\sqrt{p_2})$

Interpretation:

- $|h| \approx 0.2$: small effect
- $|h| \approx 0.5$: medium effect
- $|h| \approx 0.8$: large effect

Sample Size and Power Considerations

1. Sample Size for Chi-Square Test

Formula (for $df = 1$): $n = \frac{(z_{\alpha/2} + z_\beta)^2}{w^2}$

Where:

- $w$ = expected effect size (Cohen's w)
- $z_{\alpha/2}$, $z_\beta$ = standard normal quantiles for the chosen significance level and power

2. Sample Size for Proportion

Single proportion: $n = \frac{z_{\alpha/2}^2 \times p(1-p)}{E^2}$, where $E$ is the desired margin of error

Two proportions (per group): $n = \frac{2\bar{p}(1-\bar{p})(z_{\alpha/2} + z_\beta)^2}{(p_1 - p_2)^2}$, where $\bar{p} = (p_1 + p_2)/2$
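Both sample-size formulas, rounded up to whole subjects (a sketch; function names are illustrative, and $z_{\alpha/2} = 1.96$, $z_\beta = 0.84$ correspond to 5% significance and 80% power):

```python
import math

def n_single_proportion(p, margin, z=1.96):
    """Sample size so the proportion estimate has margin of error `margin`."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

def n_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size for detecting a difference between p1 and p2."""
    p_bar = (p1 + p2) / 2
    return math.ceil(2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2
                     / (p1 - p2) ** 2)

# The classic worst case: p = 0.5, 5-point margin, 95% confidence
print(n_single_proportion(0.5, 0.05))  # 385
print(n_two_proportions(0.75, 0.25))
```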

Practical Guidelines

Choosing Appropriate Tests

For Independence:

- Chi-square test of independence (large samples, expected counts adequate)
- Fisher's exact test (small samples or sparse 2×2 tables)

For Paired Data:

- McNemar's test (two paired measurements)
- Cochran's Q test (three or more repeated measurements)

For Association Strength:

- Phi coefficient (2×2 tables)
- Cramér's V (larger tables)
- Lambda (proportional-reduction-in-error interpretation)
- Odds ratio or relative risk (2×2 tables with a defined outcome)

Assumption Checking

Chi-Square Test:

- Verify observations are independent
- Check expected frequencies (at least 5 in at least 80% of cells; none below 1)
- Use raw counts, not percentages

Fisher's Exact Test:

- No minimum expected-frequency requirement
- Conditions on fixed margins; computation grows quickly beyond 2×2 tables

Reporting Guidelines

Essential Elements:

- Test statistic, degrees of freedom, and sample size
- Exact p-value
- An effect size measure (e.g., Cramér's V or Cohen's w)
- Odds ratio or relative risk with its confidence interval, where relevant

Example: "A chi-square test of independence revealed a significant association between treatment group and outcome, χ²(1, N = 200) = 8.45, p = 0.004, Cramér's V = 0.21, indicating a medium effect size. The odds ratio was 2.34 (95% CI [1.32, 4.15]), suggesting patients in the treatment group had 2.34 times higher odds of positive outcomes."

This comprehensive guide provides the foundation for understanding and applying descriptive statistics and association measures for categorical data analysis.