Categorical Descriptives and Association: Comprehensive Reference Guide
This guide covers descriptive statistics and association measures for categorical data: frequency tables, chi-square tests, measures of association, odds ratios and relative risk, and specialized tests for small or paired samples.
Overview
Categorical data analysis involves examining the distribution and relationships between variables measured on nominal or ordinal scales. These methods are fundamental for understanding patterns, associations, and dependencies in categorical datasets.
Frequency Tables and Cross-Tabulations
1. Frequency Tables
Purpose: Summarize the distribution of a single categorical variable.
Components:
- Frequency (f): Count of observations in each category
- Relative Frequency: $f / n$, where $n$ is the total number of observations
- Percentage: $(f / n) \times 100\%$
- Cumulative Frequency: Running total of frequencies
Example Structure:
| Category | Frequency | Relative Frequency | Percentage | Cumulative % |
|---|---|---|---|---|
| A | 25 | 0.25 | 25% | 25% |
| B | 40 | 0.40 | 40% | 65% |
| C | 35 | 0.35 | 35% | 100% |
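As a minimal sketch, the columns of the example table can be computed with the standard library. The function name `frequency_table` and the data list are illustrative assumptions, not part of any particular package.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (category, frequency, relative frequency, cumulative relative frequency)."""
    counts = Counter(values)
    n = sum(counts.values())
    rows, running = [], 0
    for category in sorted(counts):
        running += counts[category]  # running total of frequencies
        rows.append((category, counts[category], counts[category] / n, running / n))
    return rows

# Illustrative data matching the example table: 25 A's, 40 B's, 35 C's
data = ["A"] * 25 + ["B"] * 40 + ["C"] * 35
table = frequency_table(data)
for category, f, rel, cum in table:
    print(f"{category}: f={f}, rel={rel:.2f}, pct={rel:.0%}, cum={cum:.0%}")
```

Categories are sorted alphabetically here; for ordinal data the natural category order should be used instead, since cumulative frequencies only make sense for an ordered scale.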
2. Cross-Tabulation (Contingency Tables)
Purpose: Examine the relationship between two categorical variables.
2×2 Contingency Table:
| Variable A (rows) / Variable B (columns) | B₁ | B₂ | Total |
|---|---|---|---|
| A₁ | a | b | a + b |
| A₂ | c | d | c + d |
| Total | a + c | b + d | n |
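Expected Frequencies (under independence):
$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}$$
For the 2×2 table above, the expected count in cell (A₁, B₁) is $(a + b)(a + c) / n$.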
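Under independence, each expected count is the row total times the column total divided by n. The sketch below applies that rule to a table given as a list of rows; `expected_counts` and the observed counts are hypothetical names and values chosen for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical 2x2 table: a=30, b=20, c=10, d=40
observed = [[30, 20], [10, 40]]
expected = expected_counts(observed)
print(expected)  # [[20.0, 30.0], [20.0, 30.0]]
```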
Proportions and Percentages
1. Sample Proportion
Formula:
$$\hat{p} = \frac{x}{n}$$
Where:
- x = number of successes
- n = sample size
2. Confidence Interval for Proportion
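Normal Approximation (large samples):
$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Wilson Score Interval (better for small samples and for proportions near 0 or 1):
$$\frac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2} \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1 + \dfrac{z_{\alpha/2}^2}{n}}$$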
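To make the contrast between the two intervals concrete, here is a minimal sketch using only the standard library. The function names and the example counts (8 successes in 20 trials) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(x, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(x, n, confidence=0.95):
    """Wilson score interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = x / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical sample: 8 successes out of 20
wald = wald_interval(8, 20)
wilson = wilson_interval(8, 20)
print(wald, wilson)
```

With zero successes the Wald interval collapses to a single point at 0, while the Wilson interval still has a positive upper limit, which is one reason it is preferred for small samples.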
3. Difference in Proportions
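Formula:
$$\hat{p}_1 - \hat{p}_2 = \frac{x_1}{n_1} - \frac{x_2}{n_2}$$
Confidence Interval:
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$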
Chi-Square Tests
1. Chi-Square Test of Independence
Purpose: Tests whether two categorical variables are independent.
Null Hypothesis: H₀: Variables are independent
Alternative Hypothesis: H₁: Variables are associated
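Test Statistic:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Where:
- $O_{ij}$ = observed frequency in cell (i,j)
- $E_{ij}$ = expected frequency in cell (i,j)
Degrees of Freedom:
$$df = (r - 1)(c - 1)$$
where $r$ is the number of rows and $c$ the number of columns.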
Assumptions:
- Independent observations
- Expected frequencies ≥ 5 in at least 80% of cells
- No expected frequencies < 1
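The statistic sums (observed − expected)² / expected over all cells, with (rows − 1)(columns − 1) degrees of freedom. The following sketch computes both and checks the two expected-frequency assumptions listed above. The function name and the counts are hypothetical; it does not compute a p-value, for which a statistics library (for example SciPy's `scipy.stats.chi2_contingency`) would normally be used.

```python
def chi_square_independence(observed):
    """Return (statistic, df, assumptions_ok) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    cells = [e for row in expected for e in row]
    # Rule from the assumptions list: >= 5 in at least 80% of cells, none below 1
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    assumptions_ok = share_at_least_5 >= 0.8 and min(cells) >= 1
    return statistic, df, assumptions_ok

# Hypothetical 2x2 table
statistic, df, ok = chi_square_independence([[30, 20], [10, 40]])
print(round(statistic, 3), df, ok)
```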
2. Chi-Square Goodness of Fit Test
Purpose: Tests whether observed frequencies match expected frequencies from a theoretical distribution.
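Test Statistic:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Degrees of Freedom:
$$df = k - 1$$
where $k$ is the number of categories (subtract one further degree of freedom for each distribution parameter estimated from the data).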
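As an illustration, the sketch below tests the frequencies from the earlier example table (25, 40, 35) against a uniform distribution. The function name and the choice of a uniform null are assumptions made for the example.

```python
def chi_square_gof(observed, expected_props):
    """Goodness-of-fit statistic and df for observed counts vs. hypothesized proportions."""
    n = sum(observed)
    expected = [n * p for p in expected_props]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1

# Observed counts for categories A, B, C; null hypothesis: equal proportions
gof_stat, gof_df = chi_square_gof([25, 40, 35], [1 / 3, 1 / 3, 1 / 3])
print(round(gof_stat, 2), gof_df)  # 3.5 2
```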
3. Yates' Continuity Correction
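For 2×2 tables:
$$\chi^2_{\text{Yates}} = \frac{n \left( |ad - bc| - n/2 \right)^2}{(a + b)(c + d)(a + c)(b + d)}$$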
Use when: Any expected frequency is between 5 and 10
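A short sketch of the corrected statistic for a 2×2 table with cells a, b, c, d follows. The function name and counts are hypothetical, and clamping the corrected difference at zero is an added safeguard against overcorrection, not something stated in the text.

```python
def yates_chi_square(a, b, c, d):
    """Chi-square statistic with Yates' continuity correction for a 2x2 table."""
    n = a + b + c + d
    # Assumption: clamp at zero so the correction never exceeds |ad - bc|
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    return n * corrected**2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical table; the uncorrected statistic for these counts is about 16.67
yates = yates_chi_square(30, 20, 10, 40)
print(round(yates, 3))
```

The corrected value is always smaller than the uncorrected one, which makes the test more conservative.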
Measures of Association
1. Cramér's V
Purpose: Measures strength of association between two categorical variables.
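Formula:
$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\ c - 1)}}$$
where $r$ and $c$ are the numbers of rows and columns.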
Interpretation:
- 0 ≤ V ≤ 1
- V = 0: No association
- V = 1: Perfect association
- Rule of thumb (conventions vary and depend on table size): small V < 0.1, medium 0.1 ≤ V < 0.3, large V ≥ 0.3
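Cramér's V rescales the chi-square statistic by the sample size and the smaller table dimension minus one. As a quick check, the sketch below reproduces the value quoted in the reporting example at the end of this guide (χ² = 8.45, N = 200, 2×2 table); the function name is an illustrative assumption.

```python
from math import sqrt

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a chi-square statistic, sample size, and table dimensions."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

v = cramers_v(8.45, 200, 2, 2)
print(round(v, 2))  # 0.21
```

For a 2×2 table the divisor min(r − 1, c − 1) equals 1, so V coincides with the absolute value of the phi coefficient.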
2. Phi Coefficient (φ)
Purpose: Measures association for 2×2 tables.
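Formula:
$$\varphi = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}, \qquad |\varphi| = \sqrt{\frac{\chi^2}{n}}$$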
Properties:
- -1 ≤ φ ≤ 1
- φ = 0: No association
- |φ| = 1: Perfect association
3. Lambda (λ)
Purpose: Proportional reduction in error measure based on modal categories.
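Symmetric Lambda (with $n_{ij}$ the cell counts, $n_{i\cdot}$ the row totals, and $n_{\cdot j}$ the column totals):
$$\lambda = \frac{\sum_i \max_j n_{ij} + \sum_j \max_i n_{ij} - \max_j n_{\cdot j} - \max_i n_{i\cdot}}{2n - \max_j n_{\cdot j} - \max_i n_{i\cdot}}$$
Asymmetric Lambda (Y dependent on X, with X in rows and Y in columns):
$$\lambda_{Y|X} = \frac{\sum_i \max_j n_{ij} - \max_j n_{\cdot j}}{n - \max_j n_{\cdot j}}$$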
Interpretation:
- 0 ≤ λ ≤ 1
- λ = 0: No reduction in error
- λ = 1: Perfect prediction
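Lambda compares the prediction errors made when always guessing the overall modal category of Y with the errors made when guessing the modal category of Y within each level of X. The sketch below computes the asymmetric version with X in rows and Y in columns; the function name and counts are hypothetical.

```python
def lambda_y_given_x(table):
    """Goodman-Kruskal lambda for predicting the column variable from the row variable."""
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(col_totals)
    errors_without_x = n - max(col_totals)            # always guess the modal column
    errors_with_x = n - sum(max(row) for row in table)  # guess the modal column per row
    # Note: undefined (division by zero) if all observations fall in one column
    return (errors_without_x - errors_with_x) / errors_without_x

# Hypothetical table: knowing the row reduces prediction errors from 40 to 30
lam = lambda_y_given_x([[30, 20], [10, 40]])
print(lam)  # 0.25
```

Lambda can equal 0 even when the variables are associated, namely whenever every row shares the same modal column.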
4. Contingency Coefficient
Formula:
$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}}$$
Properties:
- 0 ≤ C < 1
- Maximum value depends on table size
- Less interpretable than Cramér's V
Odds Ratios and Relative Risk
1. Odds Ratio (OR)
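For 2×2 table:
$$OR = \frac{a / b}{c / d} = \frac{ad}{bc}$$
Log Odds Ratio:
$$\ln(OR) = \ln(a) + \ln(d) - \ln(b) - \ln(c), \qquad SE[\ln(OR)] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$
Confidence Interval for ln(OR):
$$\ln(OR) \pm z_{\alpha/2} \, SE[\ln(OR)]$$
Exponentiate both limits to obtain the interval for OR itself.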
Interpretation:
- OR = 1: No association
- OR > 1: Positive association
- OR < 1: Negative association
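The interval is built on the log scale, where the standard error is the square root of the summed reciprocal cell counts, and then exponentiated. A minimal sketch under that approach, with a hypothetical function name and counts:

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, confidence=0.95):
    """Odds ratio for a 2x2 table with a Wald interval computed on the log scale."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    log_or = log(a * d / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return exp(log_or), exp(log_or - z * se), exp(log_or + z * se)

# Hypothetical table: odds of B1 are 30/20 in row A1 and 10/40 in row A2
or_, lower, upper = odds_ratio_ci(30, 20, 10, 40)
print(round(or_, 2), round(lower, 2), round(upper, 2))
```

The function fails if any cell is zero; a common workaround, not covered in this guide, is to add 0.5 to every cell before computing.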
2. Relative Risk (RR)
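Formula (rows A₁ = exposed, A₂ = unexposed; column B₁ = outcome present):
$$RR = \frac{a / (a + b)}{c / (c + d)}$$
Confidence Interval for ln(RR):
$$\ln(RR) \pm z_{\alpha/2} \sqrt{\frac{1}{a} - \frac{1}{a + b} + \frac{1}{c} - \frac{1}{c + d}}$$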
Interpretation:
- RR = 1: No difference in risk
- RR > 1: Increased risk in exposed group
- RR < 1: Decreased risk in exposed group
3. Number Needed to Treat (NNT)
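Formula:
$$NNT = \frac{1}{ARR} = \frac{1}{|p_{\text{control}} - p_{\text{treatment}}|}$$
where ARR is the absolute risk reduction, the difference in event rates between the control and treatment groups.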
Interpretation: Number of patients that need to be treated to prevent one additional adverse outcome.
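Relative risk is the ratio of the two event rates, while NNT is the reciprocal of their absolute difference. The sketch below computes both from event counts; the function names and the trial figures are made up for illustration.

```python
def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of the event rate in the exposed group to that in the unexposed group."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)

def number_needed_to_treat(risk_control, risk_treated):
    """Reciprocal of the absolute risk reduction."""
    return 1 / abs(risk_control - risk_treated)

# Hypothetical trial: 10/100 adverse events with treatment, 20/100 with control
rr = relative_risk(10, 100, 20, 100)
nnt = number_needed_to_treat(20 / 100, 10 / 100)
print(rr, round(nnt, 1))
```

In practice NNT is usually reported rounded up to the next whole patient.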
Specialized Tests for Categorical Data
1. Fisher's Exact Test
Purpose: Exact test for 2×2 tables when chi-square assumptions are violated.
Test Statistic: Uses hypergeometric distribution
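Probability (of the observed table, given fixed margins):
$$P = \frac{(a + b)!\,(c + d)!\,(a + c)!\,(b + d)!}{n!\,a!\,b!\,c!\,d!}$$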
Use When:
- Small sample sizes
- Expected frequencies < 5
- Need exact p-values
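To show how the hypergeometric probabilities turn into a p-value, here is a small standard-library sketch. It uses one common two-sided convention, summing the probabilities of all tables with the same margins that are no more likely than the observed one; other conventions exist, and the function name and counts are assumptions.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for a 2x2 table with fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so floating-point ties count as "equally likely"
    return sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )

# Hypothetical small table where every expected count is below 5
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))  # 0.4857
```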
2. McNemar's Test
Purpose: Tests for changes in paired categorical data (before/after designs).
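Test Statistic (b and c are the discordant cells in the table below; df = 1):
$$\chi^2 = \frac{(b - c)^2}{b + c}$$
With Continuity Correction:
$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$$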
Table Structure:
| | After + | After - | Total |
|---|---|---|---|
| Before + | a | b | a+b |
| Before - | c | d | c+d |
| Total | a+c | b+d | n |
Assumptions:
- Paired observations
- Dichotomous variables
- Large sample (b + c ≥ 25)
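Only the discordant pairs (b: positive before and negative after; c: negative before and positive after) enter the statistic. The sketch below computes it with or without the continuity correction and derives the p-value from the identity P(χ²₁ > x) = erfc(√(x/2)), which holds for one degree of freedom. The function name and the counts are hypothetical.

```python
from math import erfc, sqrt

def mcnemar(b, c, correction=True):
    """McNemar chi-square statistic (df = 1) and p-value from the discordant counts."""
    diff = abs(b - c) - (1 if correction else 0)
    statistic = max(diff, 0) ** 2 / (b + c)
    p_value = erfc(sqrt(statistic / 2))  # chi-square survival function for df = 1
    return statistic, p_value

# Hypothetical discordant counts with b + c = 35, meeting the large-sample guideline
stat_mc, p_mc = mcnemar(10, 25)
print(round(stat_mc, 2), round(p_mc, 3))
```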
3. Cochran's Q Test
Purpose: Extension of McNemar's test for more than two time points.
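Test Statistic:
$$Q = \frac{(k - 1) \left[ k \sum_{j=1}^{k} C_j^2 - \left( \sum_{j=1}^{k} C_j \right)^2 \right]}{k \sum_{i} R_i - \sum_{i} R_i^2}$$
Where:
- k = number of time points
- $C_j$ = column totals (number of successes at time point j)
- $R_i$ = row totals (number of successes for subject i)
Under H₀, Q approximately follows a chi-square distribution with k − 1 degrees of freedom.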
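Cochran's Q is computed from the per-time-point success counts (column totals) and the per-subject success counts (row totals) of a subjects-by-time-points matrix of 0/1 outcomes. The sketch below uses the standard form Q = (k − 1)[k ΣCⱼ² − T²] / (kT − ΣRᵢ²), with T the total number of successes; the function name and the six-subject data set are invented for illustration.

```python
def cochrans_q(data):
    """Cochran's Q and its df for a list of subjects, each a list of k binary outcomes."""
    k = len(data[0])
    col_totals = [sum(col) for col in zip(*data)]  # successes per time point
    row_totals = [sum(row) for row in data]        # successes per subject
    total = sum(row_totals)
    numerator = (k - 1) * (k * sum(c**2 for c in col_totals) - total**2)
    denominator = k * total - sum(r**2 for r in row_totals)
    return numerator / denominator, k - 1

# Hypothetical data: 6 subjects measured at 3 time points (1 = success)
outcomes = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
]
q_stat, q_df = cochrans_q(outcomes)
print(q_stat, q_df)  # 4.8 2
```

Subjects with all successes or all failures contribute nothing to Q, so a data set this small is shown only to make the arithmetic easy to follow.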
Effect Size Measures
1. Cohen's w
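For goodness of fit (with $p_{0i}$ the proportions under H₀ and $p_{1i}$ the proportions under H₁):
$$w = \sqrt{\sum_{i=1}^{k} \frac{(p_{1i} - p_{0i})^2}{p_{0i}}}$$
For independence:
$$w = \sqrt{\frac{\chi^2}{n}}$$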
Interpretation:
- Small: w = 0.1
- Medium: w = 0.3
- Large: w = 0.5
2. Cohen's h
For difference in proportions:
$$h = 2 \arcsin\sqrt{p_1} - 2 \arcsin\sqrt{p_2}$$
Interpretation:
- Small: h = 0.2
- Medium: h = 0.5
- Large: h = 0.8
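Both effect sizes are one-line computations: w is the square root of χ²/n for a test of independence, and h is the difference between arcsine-transformed proportions. A minimal sketch with illustrative function names and inputs:

```python
from math import asin, sqrt

def cohens_w(chi2, n):
    """Cohen's w for a test of independence, from the chi-square statistic and sample size."""
    return sqrt(chi2 / n)

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

w = cohens_w(8.45, 200)   # figures from the reporting example in this guide
h = cohens_h(0.50, 0.25)  # hypothetical proportions
print(round(w, 2), round(h, 2))  # 0.21 0.52
```

The sign of h indicates direction; benchmarks are applied to its absolute value.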
Sample Size and Power Considerations
1. Sample Size for Chi-Square Test
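Formula (normal approximation for df = 1; larger tables require the noncentral chi-square distribution):
$$n = \frac{(z_{\alpha/2} + z_{\beta})^2}{w^2}$$
Where:
- w = effect size (Cohen's w)
- $z_{\alpha/2}$ = z-value for the chosen significance level
- $z_{\beta}$ = z-value for desired power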
2. Sample Size for Proportion
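Single proportion (to estimate p within margin of error E):
$$n = \frac{z_{\alpha/2}^2 \, p(1 - p)}{E^2}$$
Two proportions (sample size per group):
$$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \left[ p_1(1 - p_1) + p_2(1 - p_2) \right]}{(p_1 - p_2)^2}$$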
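The sketch below implements two commonly used normal-approximation formulas: n = z² p(1 − p) / E² for estimating a single proportion within margin of error E, and the unpooled-variance formula for the per-group size when comparing two proportions. The function names, defaults, and inputs are assumptions for illustration; dedicated power-analysis software may use slightly different formulas.

```python
from math import ceil
from statistics import NormalDist

def n_single_proportion(p, margin, confidence=0.95):
    """Sample size to estimate a proportion p within +/- margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * p * (1 - p) / margin**2)

def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size to detect p1 vs. p2 with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n_one = n_single_proportion(0.5, 0.05)  # worst case p = 0.5, margin of 5 points
n_two = n_two_proportions(0.5, 0.3)     # hypothetical proportions
print(n_one, n_two)  # 385 91
```

Using p = 0.5 in the single-proportion formula gives the most conservative (largest) sample size when no prior estimate is available.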
Practical Guidelines
Choosing Appropriate Tests
For Independence:
- Large samples: Chi-square test
- Small samples: Fisher's exact test
- Ordered categories: Mantel-Haenszel (linear-by-linear association) chi-square test
For Paired Data:
- Two time points: McNemar's test
- Multiple time points: Cochran's Q test
For Association Strength:
- 2×2 tables: Phi coefficient, Odds ratio
- Larger tables: Cramér's V
- Ordinal data: Spearman's rank correlation
Assumption Checking
Chi-Square Test:
- Check expected frequencies
- Ensure independence of observations
- Consider continuity correction for 2×2 tables
Fisher's Exact Test:
- Use when chi-square assumptions violated
- Computationally intensive for large tables
- Provides exact p-values
Reporting Guidelines
Essential Elements:
- Sample sizes and frequencies
- Test statistics and p-values
- Effect sizes and confidence intervals
- Description of categories and coding
Example: "A chi-square test of independence revealed a significant association between treatment group and outcome, χ²(1, N = 200) = 8.45, p = 0.004, Cramér's V = 0.21, indicating a medium effect size. The odds ratio was 2.34 (95% CI [1.32, 4.15]), suggesting patients in the treatment group had 2.34 times higher odds of positive outcomes."
This comprehensive guide provides the foundation for understanding and applying descriptive statistics and association measures for categorical data analysis.