Categorical Descriptives: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of summarising categorical data all the way through advanced interpretation, reporting, visualisation, assumption checking, and practical usage within the DataStatPro application. Whether you are encountering categorical descriptive statistics for the first time or deepening your understanding of how to characterise, display, and communicate the distribution of categorical variables, this guide builds your knowledge systematically from the ground up.

Prerequisites and Background Concepts
What are Categorical Descriptives?
The Mathematics Behind Categorical Descriptives
Considerations and Data Quality Checks
Types of Categorical Descriptive Measures
Using the Categorical Descriptives Calculator Component
Step-by-Step Procedure
Interpreting the Output
Visualising Categorical Data
Confidence Intervals for Proportions
Advanced Topics
Worked Examples
Common Mistakes and How to Avoid Them
Troubleshooting
Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into categorical descriptives, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 What is a Variable?

A variable is any characteristic, attribute, or quantity that can take on different values across observations. In statistics, variables are the building blocks of data, and every analytic method — including descriptive statistics — begins with a clear understanding of the variables at hand.

Observation: A single unit of study (e.g., one person, one product, one country).
Dataset: A rectangular array of observations (rows) and variables (columns).
Value: The specific outcome recorded for a variable on a given observation.

1.2 Scales of Measurement

All variables fall into one of four measurement scales, which determine which statistical operations are valid:

Scale	Properties	Examples
Nominal	Named categories; no order	Blood type, gender, country, species
Ordinal	Ordered categories; no equal spacing	Satisfaction (low/medium/high), education level
Interval	Equal spacing; no true zero	Temperature (°C), year
Ratio	Equal spacing; true zero	Height, weight, income, reaction time

Categorical descriptives apply to nominal and ordinal variables (and, when treated as categorical, to discretised interval or ratio variables).

1.3 Categorical Variables Defined

A categorical variable assigns each observation to exactly one of a finite set of mutually exclusive, exhaustive categories. Two key sub-types exist:

Nominal variables: Categories carry labels only, with no inherent ordering. There is no meaningful sense in which one category is "more" or "less" than another. Examples: eye colour (blue, green, brown, hazel), marital status (single, married, divorced, widowed), preferred mode of transport.
Ordinal variables: Categories carry labels and a meaningful rank order, but the intervals between adjacent categories need not be equal. Examples: pain level (none, mild, moderate, severe), academic grade (A, B, C, D, F), Likert-scale items (strongly disagree to strongly agree).

⚠️ Descriptive statistics appropriate for nominal variables (frequency, proportion, mode) are always valid for ordinal variables, but ordinal variables additionally support rank-based summaries. Never apply ordinal summaries — such as the median — to purely nominal variables.

1.4 Frequency and Frequency Distributions

The most fundamental summary of a categorical variable is its frequency distribution: a complete enumeration of each category and the number of observations falling into it.

Absolute frequency $f_k$ : The raw count of observations in category $k$ .
Relative frequency $p_k$ : The proportion of all observations in category $k$ , defined as $p_k = f_k / N$ .
Percentage $\%_k$ : The relative frequency expressed as a percentage, $\%_k = 100 \times p_k$ .
Cumulative frequency $F_k$ : The count of all observations in categories up to and including $k$ (meaningful only for ordinal variables).
Cumulative relative frequency $P_k$ : The proportion of observations in categories up to and including $k$ .

1.5 The Mode

The mode is the category (or categories) that appears most frequently in the data. It is the only measure of central tendency that is valid for nominal-scale data.

A distribution with one clear most-frequent category is unimodal.
A distribution in which two categories share (or nearly share) the highest frequency is bimodal.
A distribution in which all categories occur with equal frequency is uniform — the mode is undefined or uninformative in this case.

1.6 Population vs. Sample

All descriptive statistics computed from data describe the sample at hand. When the goal is to make inferences about a broader population, sample statistics become estimates subject to sampling variability. Confidence intervals (Section 10) quantify this uncertainty.

Population proportion: $\pi_k$ — the true proportion of the population in category $k$ .
Sample proportion: $\hat{p}_k = f_k / N$ — the estimate of $\pi_k$ from the sample.

1.7 The Concept of a Probability Distribution for Categorical Data

For a categorical variable with $K$ categories, the probability distribution specifies the probability $\pi_k$ assigned to each category $k$ , subject to:

$\sum_{k=1}^{K} \pi_k = 1, \qquad 0 \leq \pi_k \leq 1 \quad \text{for all } k$

This is the categorical distribution (a generalisation of the Bernoulli distribution to $K > 2$ outcomes). When $K = 2$ (binary variable), it reduces to the Bernoulli distribution with parameter $\pi_1 = p$ and $\pi_2 = 1 - p$ .

1.8 Missing Data in Categorical Variables

Missing values are observations for which no category was recorded. They are fundamentally different from a valid category and must be handled deliberately:

Complete case analysis: Exclude observations with missing values from all calculations. Simple but potentially biasing.
Include as a category: Treat missing as its own explicit category (appropriate when missingness is informative, e.g., "refused to answer").
Imputation: Replace missing values with estimated values using mode imputation or multiple imputation methods.

DataStatPro reports the number and percentage of missing values separately from the frequency distribution of valid responses.

2. What are Categorical Descriptives?

2.1 The Core Purpose

Categorical descriptive statistics are numerical and graphical summaries that characterise the distribution of a categorical variable. Their purpose is to answer, in a rigorous and communicable way, the fundamental question: How are observations distributed across the categories of this variable?

Unlike continuous descriptives (mean, standard deviation, skewness), which describe the location and spread of a numeric scale, categorical descriptives quantify the frequency, proportion, and relative dominance of discrete categories.

2.2 What Categorical Descriptives Tell You

Summary	Core Question Answered
Frequency table	How many observations fall into each category?
Proportions / percentages	What share of observations does each category represent?
Mode	Which category is most common?
Variation ratio	How heterogeneous is the distribution of categories?
Entropy	How uncertain or diverse is the distribution?
Concentration index	How much are observations concentrated in one category?
Cumulative frequencies	What proportion of observations fall at or below a given level? (ordinal only)
Confidence intervals	What is the plausible range for the true population proportion?

2.3 When to Use Categorical Descriptives

Condition	Requirement
Variable scale	Nominal or ordinal
Data format	Observations assigned to mutually exclusive categories
Purpose	Summarise the marginal distribution of one variable
Sample size	Any; CIs become more informative with larger $N$
Reporting	Always precede inferential tests with descriptive summaries

2.4 Real-World Applications

Field	Variable	Categories	Descriptive Goal
Public Health	Vaccination status	Vaccinated / Partially vaccinated / Unvaccinated	Estimate population coverage
Marketing	Brand preference	Brand A / B / C / D / None	Identify dominant preference
HR & Organisational	Employment type	Full-time / Part-time / Contract / Casual	Describe workforce composition
Clinical Trials	Adverse event severity	Mild / Moderate / Severe / Life-threatening	Profile safety outcomes
Education	Letter grade	A / B / C / D / F	Characterise grade distribution
Sociology	Religious affiliation	Multiple denominations	Map social structure
Quality Control	Defect category	Type I / II / III / None	Identify dominant failure modes
Political Science	Voting intention	Party A / B / C / Undecided	Track electoral preference

When the analytic goal goes beyond summarising a single categorical variable, a different method applies:

Goal	Appropriate Method
Summarise one categorical variable	Categorical descriptives
Test association between two categorical variables	Chi-square test of association
Test whether one distribution matches a known distribution	Chi-square goodness-of-fit test
Summarise a continuous variable	Continuous descriptives (mean, SD, median, IQR)
Compare proportions across two or more groups	Two-proportion z-test; chi-square test
Summarise the joint distribution of two categorical variables	Contingency table (cross-tabulation)
Model a binary outcome	Logistic regression

3. The Mathematics Behind Categorical Descriptives

3.1 Notation

Consider a categorical variable $X$ with $K$ mutually exclusive categories labelled $x_1, x_2, \ldots, x_K$ . A sample of $N$ observations yields:

$f_k$ = count of observations in category $k$ (absolute frequency)
$N = \sum_{k=1}^{K} f_k$ = total number of valid observations
$\hat{p}_k = f_k / N$ = sample proportion in category $k$ (relative frequency)

3.2 Frequency and Proportion

Absolute frequency:

$f_k = \#\{i : X_i = x_k\} \quad \text{for } k = 1, 2, \ldots, K$

Relative frequency (proportion):

$\hat{p}_k = \frac{f_k}{N}$

Percentage:

$\%_k = 100 \times \hat{p}_k = \frac{100 \times f_k}{N}$

Verification: $\sum_{k=1}^K f_k = N$ and $\sum_{k=1}^K \hat{p}_k = 1$ .
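
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not DataStatPro's implementation; the function name `frequency_table` and the blood-type sample are made up for the example.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (count f_k, proportion p_k, percentage)."""
    counts = Counter(values)
    n = sum(counts.values())  # N: total number of valid observations
    return {cat: (f, f / n, 100 * f / n) for cat, f in counts.items()}

# Made-up sample of ten blood types, for illustration only.
sample = ["O", "A", "A", "B", "O", "O", "AB", "A", "O", "O"]
table = frequency_table(sample)

for cat, (f, p, pct) in sorted(table.items()):
    print(f"{cat:<3} f={f}  p={p:.2f}  %={pct:.1f}")

# Verification from the text: counts sum to N, proportions sum to 1.
assert sum(f for f, _, _ in table.values()) == len(sample)
assert abs(sum(p for _, p, _ in table.values()) - 1) < 1e-12
```

The two assertions at the end mirror the verification step: if either fails, some observation was dropped or double-counted.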

3.3 Cumulative Frequencies (Ordinal Variables)

For an ordinal variable with categories ordered $x_1 < x_2 < \cdots < x_K$ :

Cumulative absolute frequency:

$F_k = \sum_{j=1}^{k} f_j$

Cumulative relative frequency:

$P_k = \sum_{j=1}^{k} \hat{p}_j = \frac{F_k}{N}$

By definition, $F_K = N$ and $P_K = 1$ .
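
As a sketch of the cumulative formulas, the helper below takes the categories in rank order together with their counts. The name `cumulative_table` and the pain-level counts are hypothetical and serve only as an illustration.

```python
from itertools import accumulate

def cumulative_table(ordered_categories, counts):
    """Return rows (category, f_k, F_k, P_k) in the given rank order.

    ordered_categories: labels sorted from lowest to highest rank.
    counts: dict mapping each label to its absolute frequency.
    """
    freqs = [counts.get(cat, 0) for cat in ordered_categories]
    n = sum(freqs)
    cumulative = list(accumulate(freqs))  # F_k = f_1 + ... + f_k
    return [
        (cat, f, big_f, big_f / n)
        for cat, f, big_f in zip(ordered_categories, freqs, cumulative)
    ]

# Made-up ordinal data: pain level for 50 patients.
levels = ["none", "mild", "moderate", "severe"]
pain_counts = {"none": 12, "mild": 18, "moderate": 14, "severe": 6}
rows = cumulative_table(levels, pain_counts)

for cat, f, big_f, big_p in rows:
    print(f"{cat:<9} f={f:>2}  F={big_f:>2}  P={big_p:.2f}")
```

The last row should always show $F_K = N$ and $P_K = 1$; the order of `ordered_categories` is what makes the cumulative columns meaningful.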

3.4 The Mode

The mode $M_o$ is the category with the highest absolute frequency:

$M_o = x_{k^*} \quad \text{where} \quad k^* = \underset{k}{\arg\max}\; f_k$

When two or more categories share the maximum frequency, the distribution is multimodal and all maximum-frequency categories are reported as co-modes.

3.5 The Variation Ratio

The variation ratio ( $VR$ ) measures the proportion of observations that do not fall into the modal category. It is the simplest measure of dispersion for nominal data:

$VR = 1 - \frac{f_{mode}}{N} = 1 - \hat{p}_{mode}$

$VR = 0$ : All observations are in one category (no variation).
$VR = 1 - 1/K$ : Maximum variation; all categories are equally frequent (uniform distribution).
$VR$ ranges from $0$ to $(K-1)/K$ .
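
The mode (Section 3.4) and the variation ratio can be computed together, since both depend on the maximum frequency. The following is a small sketch; the function name and the transport data are illustrative assumptions.

```python
from collections import Counter

def modes_and_variation_ratio(values):
    """Return (sorted list of co-modes, variation ratio)."""
    counts = Counter(values)
    n = sum(counts.values())
    f_max = max(counts.values())
    # Every category that reaches the maximum frequency is a co-mode.
    modes = sorted(cat for cat, f in counts.items() if f == f_max)
    return modes, 1 - f_max / n

# Made-up sample: preferred mode of transport for ten respondents.
transport = ["car"] * 4 + ["bus"] * 4 + ["bike"] * 2
modes, vr = modes_and_variation_ratio(transport)
print(modes, round(vr, 2))
```

Here "bus" and "car" tie at four observations each, so both are reported as co-modes, and $VR = 1 - 4/10 = 0.6$, just below the maximum of $(K-1)/K = 2/3$ for $K = 3$.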

3.6 Shannon's Entropy

Shannon's entropy $H$ (from information theory) quantifies the uncertainty or diversity in a categorical distribution:

$H = -\sum_{k=1}^{K} \hat{p}_k \log_2(\hat{p}_k)$

Measured in bits (using $\log_2$ ). Convention: $0 \times \log_2(0) = 0$ .

$H = 0$ : Minimum entropy — all observations in one category (complete certainty).
$H = \log_2 K$ : Maximum entropy — all categories equally probable (maximum uncertainty).

Normalised entropy (also called the evenness index; sometimes loosely termed relative entropy, a name more commonly reserved for the Kullback–Leibler divergence) rescales $H$ to $[0, 1]$ :

$H_{norm} = \frac{H}{\log_2 K} = \frac{-\sum_{k=1}^{K} \hat{p}_k \log_2(\hat{p}_k)}{\log_2 K}$

$H_{norm} = 0$ indicates complete concentration; $H_{norm} = 1$ indicates maximum diversity across categories.

⚠️ Natural logarithm ( $\ln$ ) is often used instead of $\log_2$ , yielding entropy in nats. The choice of logarithm base affects the numerical value of $H$ but not relative comparisons. DataStatPro uses $\log_2$ (bits) by default.
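
A minimal sketch of both entropy formulas follows. The function names are hypothetical, and the code assumes the proportions are passed as a list with one entry per defined category (including any category with zero observations, so that $K$ is counted correctly).

```python
import math

def shannon_entropy(proportions, base=2):
    """Shannon entropy; terms with p = 0 contribute 0 by convention."""
    total = -sum(p * math.log(p) for p in proportions if p > 0)
    return total / math.log(base)

def normalised_entropy(proportions):
    """H divided by its maximum log2(K); defined as 0 when K < 2."""
    k = len(proportions)
    if k < 2:
        return 0.0
    return shannon_entropy(proportions, base=2) / math.log2(k)

p_hat = [0.5, 0.25, 0.25]           # made-up proportions, K = 3
h_bits = shannon_entropy(p_hat)     # 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits
h_nats = shannon_entropy(p_hat, base=math.e)
print(round(h_bits, 4), round(h_nats, 4), round(normalised_entropy(p_hat), 4))
```

Changing `base` rescales $H$ by a constant factor, which is why the normalised value does not depend on the logarithm used.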

3.7 Herfindahl–Hirschman Index (HHI) and Simpson's Concentration Index

The Herfindahl–Hirschman Index quantifies the degree to which observations are concentrated in a small number of categories:

$HHI = \sum_{k=1}^{K} \hat{p}_k^2$

$HHI = 1/K$ : Minimum concentration — all categories equally frequent.
$HHI = 1$ : Maximum concentration — all observations in a single category.

The complement, Simpson's diversity index $D$ , measures the probability that two observations selected at random (with replacement) belong to different categories:

$D = 1 - HHI = 1 - \sum_{k=1}^{K} \hat{p}_k^2$

$D = 0$ : No diversity (all in one category).
$D = (K-1)/K$ : Maximum diversity (uniform distribution).

3.8 Qualitative Variation Index (IQV)

The Index of Qualitative Variation (IQV), also attributed to Gibbs and Martin (1962), standardises Simpson's diversity index to $[0, 1]$ regardless of the number of categories:

$IQV = \frac{K}{K-1} \times D = \frac{K}{K-1} \left(1 - \sum_{k=1}^{K} \hat{p}_k^2\right)$

$IQV = 0$ : All observations in one category.
$IQV = 1$ : All categories equally represented (maximum heterogeneity).

IQV facilitates comparisons of categorical dispersion across variables with different numbers of categories.
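
HHI, Simpson's $D$, and IQV all derive from the same sum of squared proportions, so one pass over the counts is enough. The sketch below assumes a list of counts with one entry per category; the function name is illustrative.

```python
def concentration_measures(counts):
    """Return (HHI, Simpson's D, IQV) from a list of category counts."""
    n = sum(counts)
    k = len(counts)
    hhi = sum((f / n) ** 2 for f in counts)
    d = 1 - hhi
    # IQV rescales D by K / (K - 1); with a single category it is set to 0.
    iqv = d * k / (k - 1) if k > 1 else 0.0
    return hhi, d, iqv

# Made-up counts for a three-category variable (N = 100).
hhi, d, iqv = concentration_measures([50, 30, 20])
print(round(hhi, 2), round(d, 2), round(iqv, 2))
```

For these counts, $HHI = 0.25 + 0.09 + 0.04 = 0.38$, $D = 0.62$, and $IQV = 1.5 \times 0.62 = 0.93$.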

3.9 The Median for Ordinal Variables

For ordinal variables, the median is the category at which the cumulative relative frequency first reaches or exceeds 0.50:

$\text{Median} = x_{k^*} \quad \text{where} \quad k^* = \min\{k : P_k \geq 0.50\}$

The median is more informative than the mode for ordinal data when the distribution is asymmetric, as it captures the central ordering of responses.

3.10 Percentiles and Quartiles for Ordinal Variables

Percentiles for ordinal variables are defined analogously to the median, using the cumulative frequency distribution:

$\text{Percentile}_q = x_{k^*} \quad \text{where} \quad k^* = \min\{k : P_k \geq q/100\}$

The interquartile range ( $IQR$ ) describes the middle 50% of ordinal responses and spans from the 25th percentile ( $Q_1$ ) to the 75th percentile ( $Q_3$ ):

$IQR = Q_3 - Q_1 \quad \text{(in category units)}$

⚠️ Arithmetic differences between ordinal category labels are not meaningful unless numeric scores are assigned. The IQR for ordinal data should be reported as a range of category labels, not as a single numeric value.
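
The percentile rule from Sections 3.9 and 3.10 can be sketched as a single scan over the ordered categories. The function name and the pain-level counts are assumptions made for illustration; integer arithmetic is used so that the threshold comparison is exact.

```python
def ordinal_percentile(ordered_categories, counts, q):
    """First category whose cumulative proportion reaches q percent."""
    n = sum(counts.get(cat, 0) for cat in ordered_categories)
    if n == 0:
        raise ValueError("no valid observations")
    running = 0
    for cat in ordered_categories:
        running += counts.get(cat, 0)
        # Same test as P_k >= q/100, written without floating point.
        if running * 100 >= q * n:
            return cat
    return ordered_categories[-1]

levels = ["none", "mild", "moderate", "severe"]
pain_counts = {"none": 12, "mild": 18, "moderate": 14, "severe": 6}

q1 = ordinal_percentile(levels, pain_counts, 25)
median = ordinal_percentile(levels, pain_counts, 50)
q3 = ordinal_percentile(levels, pain_counts, 75)
print(f"Median: {median}; IQR: {q1} to {q3}")
```

In line with the warning above, the IQR is reported as the label range "mild to moderate" rather than as a numeric difference.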

3.11 The Geometric Mean of Proportions (Diversity)

For comparing proportional distributions across samples of different sizes, the geometric mean proportion can be used to summarise the average per-category representation:

$\bar{p}_{geom} = \left(\prod_{k=1}^{K} \hat{p}_k\right)^{1/K}$

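This is related to, but not the same as, entropy. Taking logarithms gives $-\ln(\bar{p}_{geom}) = -\frac{1}{K}\sum_{k=1}^{K} \ln(\hat{p}_k)$ , which weights every category equally, whereas Shannon's entropy weights each $\ln(\hat{p}_k)$ term by $\hat{p}_k$ . The two quantities coincide (both equal $\ln K$ nats) only when all categories are equally frequent. Note that $\bar{p}_{geom} = 0$ whenever any category has zero observations.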

4. Considerations and Data Quality Checks

4.1 Mutual Exclusivity and Exhaustiveness

The fundamental validity requirement for a categorical variable is that its categories are:

Mutually exclusive: Each observation belongs to exactly one category. If a respondent can select multiple categories (multi-select questions), the variable violates mutual exclusivity and must be restructured (e.g., as multiple binary indicator variables) before standard categorical descriptives can be applied.
Exhaustive: Every possible observation must map to some category. If the category set does not cover all possibilities, an "Other" category must be added.

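How to check: Confirm that $\sum_{k=1}^K f_k = N$ (valid observations). If $\sum_{k=1}^K f_k < N$ , some observations are unaccounted for and the categories are not exhaustive. If $\sum_{k=1}^K f_k > N$ , some observations have been counted in more than one category and the categories are not mutually exclusive.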

4.2 Category Labelling Consistency

Inconsistent labelling causes artificial category inflation. Common problems include:

Problem	Example	Solution
Case inconsistency	"male" vs. "Male" vs. "MALE"	Standardise case before analysis
Leading/trailing spaces	" Yes" vs. "Yes"	Strip whitespace
Synonymous labels	"N/A" vs. "Not Applicable"	Merge into one category
Abbreviations	"F" vs. "Female"	Choose one consistent label
Encoding issues	"Caf" vs. "Café"	Fix encoding and standardise

DataStatPro flags potential label inconsistencies and offers a category merge tool.
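
Outside the application, most of the problems in the table can be handled with a small normalisation step before counting. The sketch below is one possible approach, not DataStatPro's merge tool; the `SYNONYMS` map is hypothetical and would need to be built for each dataset.

```python
import unicodedata
from collections import Counter

# Hypothetical synonym map; keys are already lower-cased and stripped.
SYNONYMS = {"n/a": "not applicable", "f": "female", "m": "male"}

def normalise_label(raw):
    """Standardise Unicode form, whitespace, and case, then merge synonyms."""
    label = unicodedata.normalize("NFC", raw)
    label = " ".join(label.split())   # strip and collapse whitespace
    label = label.casefold()          # case-insensitive comparison form
    return SYNONYMS.get(label, label)

raw_values = ["male", "Male", "MALE", " F", "Female", "N/A", "Not Applicable"]
clean_counts = Counter(normalise_label(v) for v in raw_values)
print(dict(clean_counts))
```

Seven raw spellings collapse to three categories. Damaged encodings such as "Caf" for "Café" cannot be repaired by a rule like this and still need to be fixed at the source.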

4.3 Missing Data Assessment

Before reporting any descriptive statistics, the extent and pattern of missing data must be evaluated:

Metric	Formula	Interpretation
Missing count	$N_{miss} = N_{total} - N_{valid}$	Number of absent responses
Missing rate	$N_{miss} / N_{total}$	Proportion of data missing
Valid $N$ rate	$N_{valid} / N_{total}$	Proportion of usable responses
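
The three metrics in the table can be computed as follows. This is a sketch under the assumption that missing responses are coded as `None`, an empty string, or the text "NA"; real datasets may use other markers, so `missing_markers` is a parameter.

```python
def missing_summary(values, missing_markers=(None, "", "NA")):
    """Return missing count, missing rate, and valid-N rate."""
    n_total = len(values)
    n_miss = sum(1 for v in values if v in missing_markers)
    n_valid = n_total - n_miss
    if n_total == 0:
        raise ValueError("no observations supplied")
    return {
        "n_total": n_total,
        "n_valid": n_valid,
        "n_miss": n_miss,
        "missing_rate": n_miss / n_total,
        "valid_rate": n_valid / n_total,
    }

# Made-up survey column with three missing responses out of eight.
responses = ["yes", "no", None, "yes", "", "no", "yes", "NA"]
summary = missing_summary(responses)
print(summary)
```

The missing rate and the valid-N rate always sum to 1, which gives a quick consistency check.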

Missing data mechanisms:

MCAR (Missing Completely At Random): Missingness is unrelated to any variable. Complete case analysis is unbiased.
MAR (Missing At Random): Missingness depends on observed variables, not the missing value itself. Imputation or weighting is appropriate.
MNAR (Missing Not At Random): Missingness depends on the value that is missing (e.g., people with extreme views refusing to disclose). Most problematic; requires sensitivity analyses.

4.4 Sample Size Adequacy

While categorical descriptives can be computed for any $N \geq 1$ , interpretability and precision depend on sample size:

$N$	Guidance
$< 10$	Proportions are highly unstable; report counts only
$10 - 30$	Proportions are reported with wide CIs; interpret cautiously
$30 - 100$	Proportions reasonably stable; report CIs using Wilson's method
$> 100$	Proportions stable; standard CIs appropriate; diversity measures reliable
$> 1000$	Fine-grained proportions meaningful; subgroup breakdowns feasible

4.5 Rare Categories

Categories with very few observations (e.g., $f_k < 5$ ) pose challenges:

Instability: Proportions based on tiny counts fluctuate widely across samples.
Privacy: Small cell counts in sensitive data may enable re-identification.
Misleading visuals: Tiny slices in pie charts or bars are hard to read.

Options for handling rare categories:

Retain and flag: Report as-is with a note on small $n$ .
Collapse: Merge rare categories with theoretically similar ones.
"Other" grouping: Create a residual "Other" category for all categories below a frequency threshold.
Suppress: Omit categories below a frequency threshold from public reports.
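
The "Other" grouping option can be sketched as below. The threshold of 5 follows the example in the text; the function name and the counts are illustrative assumptions.

```python
from collections import Counter

def collapse_rare(counts, min_count=5, other_label="Other"):
    """Merge every category with fewer than min_count observations."""
    collapsed = Counter()
    for category, f in counts.items():
        key = category if f >= min_count else other_label
        collapsed[key] += f
    return dict(collapsed)

# Made-up counts: categories C and D fall below the threshold.
raw_counts = {"A": 40, "B": 25, "C": 3, "D": 2}
grouped = collapse_rare(raw_counts)
print(grouped)
```

The total $N$ is unchanged by collapsing, but $K$ shrinks, so measures that depend on $K$ (such as normalised entropy and IQV) should be recomputed on the collapsed table and the collapsing rule reported.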

4.6 Ordered vs. Unordered Presentation

For nominal variables, the order in which categories are displayed is arbitrary. Common orderings include:

Alphabetical (neutral, reproducible).
By descending frequency (highlights dominant categories).
By theoretical grouping (e.g., clinical severity).

For ordinal variables, categories must always be presented in their natural rank order (ascending or descending) to preserve the meaning of cumulative frequencies and the median.

4.7 Weighted Data

In survey research, observations are frequently assigned weights to correct for unequal selection probabilities or to make the sample representative of a target population. When weights $w_i$ are present:

$\hat{f}_k^{(w)} = \sum_{i: X_i = x_k} w_i \qquad \hat{p}_k^{(w)} = \frac{\hat{f}_k^{(w)}}{\sum_{i=1}^N w_i}$

DataStatPro supports weighted frequency tables when a weight variable is specified. Both unweighted and weighted results are reported side by side.
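
A small sketch of the weighted formulas, assuming one weight per observation; the function name and the data are made up for illustration.

```python
from collections import defaultdict

def weighted_proportions(values, weights):
    """Weighted proportion per category: sum of weights / total weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    totals = defaultdict(float)
    for x, w in zip(values, weights):
        totals[x] += w
    total_weight = sum(totals.values())
    return {x: w / total_weight for x, w in totals.items()}

# Made-up example: the second respondent represents three population units.
answers = ["yes", "no", "yes", "no"]
survey_weights = [1.0, 3.0, 1.0, 1.0]
weighted = weighted_proportions(answers, survey_weights)
print({k: round(v, 3) for k, v in weighted.items()})
```

Unweighted, "yes" and "no" each have a proportion of 0.5; with the weights applied, "no" rises to $4/6 \approx 0.667$.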

5. Types of Categorical Descriptive Measures

5.1 Measures of Frequency

The most direct summaries — counts and proportions — form the foundation of all categorical description.

Measure	Symbol	Formula	Scale
Absolute frequency	$f_k$	Count in category $k$	Nominal, Ordinal
Relative frequency	$\hat{p}_k$	$f_k / N$	Nominal, Ordinal
Percentage	$\%_k$	$100 \times f_k / N$	Nominal, Ordinal
Cumulative frequency	$F_k$	$\sum_{j \leq k} f_j$	Ordinal only
Cumulative %	$100 \times P_k$	$100 \times F_k / N$	Ordinal only

5.2 Measures of Central Tendency

Measure	Formula / Definition	Applicable Scale
Mode	Category with maximum $f_k$	Nominal, Ordinal
Median	First category where $P_k \geq 50\%$	Ordinal only
Percentiles ( $Q_1$ , $Q_3$ )	First category where $P_k \geq$ target	Ordinal only

5.3 Measures of Dispersion / Heterogeneity

Measure	Formula	Range	Scale
Variation ratio	$VR = 1 - \hat{p}_{mode}$	$[0,\, (K-1)/K]$	Nominal, Ordinal
Shannon entropy	$H = -\sum \hat{p}_k \log_2 \hat{p}_k$	$[0,\, \log_2 K]$	Nominal, Ordinal
Normalised entropy	$H_{norm} = H / \log_2 K$	$[0, 1]$	Nominal, Ordinal
Simpson's diversity	$D = 1 - \sum \hat{p}_k^2$	$[0,\, (K-1)/K]$	Nominal, Ordinal
HHI (concentration)	$HHI = \sum \hat{p}_k^2$	$[1/K,\, 1]$	Nominal, Ordinal
IQV	$\frac{K}{K-1}(1 - \sum \hat{p}_k^2)$	$[0, 1]$	Nominal, Ordinal
IQR (category range)	$Q_3 - Q_1$	Category units	Ordinal only

5.4 Comparative Descriptives: Subgroup Breakdowns

When a grouping variable $G$ partitions observations into subgroups, categorical descriptives can be computed separately within each group, enabling comparison:

$\hat{p}_{k|g} = \frac{f_{kg}}{N_g}$

Where $f_{kg}$ is the count in category $k$ within group $g$ and $N_g$ is the total in group $g$ . This is the foundation of cross-tabulation and is reported as a conditional frequency table (see Section 11.1).
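
The conditional proportion formula can be illustrated with two parallel lists, one holding the category and one the group of each observation. The names below are hypothetical.

```python
from collections import Counter, defaultdict

def conditional_proportions(categories, groups):
    """Return {group: {category: proportion within that group}}."""
    by_group = defaultdict(Counter)
    for x, g in zip(categories, groups):
        by_group[g][x] += 1          # f_kg: count of category k in group g
    result = {}
    for g, counts in by_group.items():
        n_g = sum(counts.values())   # N_g: total in group g
        result[g] = {x: f / n_g for x, f in counts.items()}
    return result

# Made-up data: brand choice (A or B) in two regions.
brand = ["A", "B", "A", "A", "B", "B", "B", "A"]
region = ["north"] * 4 + ["south"] * 4
profile = conditional_proportions(brand, region)
print(profile)
```

Within each group the proportions sum to 1, so the group profiles can be compared directly even when the group sizes differ.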

5.5 Descriptives for Binary (Dichotomous) Variables

A binary variable is a special case of a categorical variable with $K = 2$ categories (typically coded 0/1 or "No"/"Yes"). All standard categorical descriptives apply, but additional simplifications hold:

The distribution is fully described by a single proportion $\hat{p} = f_1 / N$ (the proportion in the positive/event category); the complementary proportion is $1 - \hat{p}$ .
$VR = 1 - \max(\hat{p},\; 1 - \hat{p})$ — maximum at $\hat{p} = 0.5$ .
$HHI = \hat{p}^2 + (1-\hat{p})^2$ ; $D = 2\hat{p}(1-\hat{p})$ — maximum at $\hat{p} = 0.5$ .
$H = -\hat{p}\log_2\hat{p} - (1-\hat{p})\log_2(1-\hat{p})$ — the binary entropy function.

6. Using the Categorical Descriptives Calculator Component

The Categorical Descriptives Calculator in DataStatPro provides a fully featured tool for computing, diagnosing, visualising, and reporting descriptive statistics for categorical variables.

Step-by-Step Guide

Step 1 — Navigate to the Component

Go to Descriptive Statistics → Categorical Descriptives.

Step 2 — Input Method

Choose how to provide your data:

Raw data: Upload or paste a column of categorical observations. DataStatPro automatically detects the variable type, counts unique categories, and handles missing values.
Pre-aggregated frequency table: Enter category labels and their counts directly into the table grid. Useful when you already have a summary table and wish to compute additional descriptive measures from it.
Multiple variables: Select two or more categorical columns simultaneously to run batch descriptives across all selected variables in one pass.

Step 3 — Variable Configuration

Assign a meaningful variable name and category labels for display.
Specify the measurement scale (nominal or ordinal). If ordinal, define the correct rank ordering of categories using the drag-and-drop interface.
Designate whether the variable is binary to unlock specialised binary proportion summaries and exact confidence intervals.
Specify a grouping variable (optional) to produce stratified breakdowns and conditional frequency tables.
Specify a weight variable (optional) to produce weighted frequency estimates.

Step 4 — Missing Data Handling

Select one of the following:

Exclude missing (valid $N$ only): All summaries based on $N_{valid}$ .
Include missing as category: Missing values form an explicit "Missing" category.
Report missing separately: Missing counts reported in a separate table; all summaries exclude missing.

Step 5 — Set Display Options

✅ Absolute frequencies ( $f_k$ ).
✅ Relative frequencies / proportions ( $\hat{p}_k$ ).
✅ Percentages ( $\%_k$ ) with optional decimal places.
✅ Cumulative frequencies and cumulative percentages (ordinal variables).
✅ Valid $N$ , missing $N$ , and total $N$ .
✅ Mode (with multi-mode detection).
✅ Median and quartiles (ordinal variables).
✅ Variation ratio, Shannon entropy (raw and normalised), HHI, Simpson's $D$ , IQV.
✅ 95% confidence intervals for all proportions (Wilson, Clopper-Pearson, Agresti-Coull, or Wald — selectable in Settings).
✅ Comparison to a reference distribution (goodness-of-fit chi-square).
✅ Weighted estimates (when weight variable specified).
✅ Bar chart (simple, stacked, or grouped).
✅ Pie chart with customisable colour palette.
✅ Donut chart.
✅ Waffle chart (unit square representation).
✅ Lollipop chart.
✅ Diverging bar chart (for ordinal Likert-scale variables).
✅ Cumulative frequency plot (ordinal variables).
✅ APA 7th edition results paragraph (auto-generated).
✅ Publication-ready frequency table (formatted for direct insertion into manuscripts).

Step 6 — Run the Analysis

Click "Compute Categorical Descriptives". DataStatPro will:

Validate data: check mutual exclusivity, label consistency, and missing values.
Compute the full frequency distribution (absolute, relative, cumulative).
Identify the mode(s) and, for ordinal variables, the median and quartiles.
Compute all selected heterogeneity measures ( $VR$ , $H$ , $H_{norm}$ , $D$ , $HHI$ , $IQV$ ).
Compute Wilson 95% CIs for all proportions.
Generate all selected visualisations with customisable formatting.
Produce the APA-compliant results paragraph and formatted frequency table.

7. Step-by-Step Procedure

7.1 Full Manual Procedure

Step 1 — Identify and Define the Variable

State the variable name, its measurement scale (nominal or ordinal), the population of observations, and all valid categories. Confirm mutual exclusivity and exhaustiveness.

Step 2 — Count Total and Missing Observations

$N_{total} = N_{valid} + N_{miss}$

Report $N_{miss}$ and the missing rate $N_{miss}/N_{total}$ explicitly. Decide on missing data handling before proceeding.

Step 3 — Tally Absolute Frequencies

For each category $k = 1, 2, \ldots, K$ :

$f_k = \#\{i : X_i = x_k\}$

Verify: $\sum_{k=1}^K f_k = N_{valid}$ .

Step 4 — Compute Relative Frequencies and Percentages

$\hat{p}_k = \frac{f_k}{N_{valid}}, \qquad \%_k = 100 \times \hat{p}_k$

Step 5 — Compute Cumulative Frequencies (Ordinal Variables Only)

$F_k = \sum_{j=1}^{k} f_j, \qquad P_k = \frac{F_k}{N_{valid}}$

Report $P_k$ as a percentage ( $100 \times P_k$ ) in the frequency table. Verify: $F_K = N_{valid}$ and $P_K = 1$ (100%).

Step 6 — Identify the Mode

$M_o = x_{k^*}, \quad k^* = \underset{k}{\arg\max}\; f_k$

If multiple categories share the maximum $f_k$ , report all co-modes.

Step 7 — Identify the Median (Ordinal Variables Only)

Locate the first category $k^*$ such that $P_{k^*} \geq 50\%$ :

$\text{Median} = x_{k^*}$

Step 8 — Compute Heterogeneity Measures

Variation ratio:

$VR = 1 - \hat{p}_{mode}$

Shannon entropy:

$H = -\sum_{k=1}^{K} \hat{p}_k \log_2(\hat{p}_k)$

Normalised entropy:

$H_{norm} = \frac{H}{\log_2 K}$

HHI and Simpson's diversity:

$HHI = \sum_{k=1}^{K} \hat{p}_k^2, \qquad D = 1 - HHI$

IQV:

$IQV = \frac{K}{K-1} \times D$

Step 9 — Compute Confidence Intervals for Proportions

For each $\hat{p}_k$ , compute a 95% Wilson CI (recommended):

$\text{Wilson CI} = \frac{\hat{p}_k + \frac{z^2}{2N} \pm z\sqrt{\frac{\hat{p}_k(1-\hat{p}_k)}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$

Where $z = 1.960$ for 95% confidence.
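
The Wilson formula translates directly into code. The sketch below follows the expression in Step 9 term by term; the function name is an assumption, and the bounds are clipped to $[0, 1]$ only to absorb floating-point rounding.

```python
import math

def wilson_ci(f, n, z=1.96):
    """Wilson score interval for a proportion f/n (z = 1.96 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = f / n
    centre = p + z * z / (2 * n)
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    lower = (centre - half_width) / denom
    upper = (centre + half_width) / denom
    return max(0.0, lower), min(1.0, upper)

# Made-up example: 30 of 100 valid observations fall in category k.
lb, ub = wilson_ci(30, 100)
print(f"p = 0.300, 95% CI [{lb:.3f}, {ub:.3f}]")
```

Unlike the Wald interval, the Wilson interval is not symmetric around $\hat{p}_k$ and stays within $[0, 1]$ even when $f_k = 0$ or $f_k = N$.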

Step 10 — Construct the Frequency Table

Assemble all computed values into a publication-ready frequency table:

Category	$f$	$\%$	Cumulative $\%$	95% CI
$x_1$	$f_1$	$\%_1$	$P_1$	$[LB_1, UB_1]$
$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\vdots$
$x_K$	$f_K$	$\%_K$	100.0%	$[LB_K, UB_K]$
Total	$N$	100.0%	—	—
Missing	$N_{miss}$	—	—	—

Step 11 — Produce Visualisations

Select appropriate chart types (see Section 9) and annotate with frequencies or percentages. Ensure all axes are labelled and a title is provided.

Step 12 — Interpret and Report

Use APA reporting guidelines (Section 15). Always report $N_{valid}$ , $N_{miss}$ , the complete frequency table, the mode, and at minimum the variation ratio or Shannon entropy. For ordinal variables, also report the median and quartiles.

8. Interpreting the Output

8.1 The Frequency Table

The frequency table is the primary output. Read it as follows:

Observation	Interpretation
One category has $\hat{p}_k \gg 1/K$	Distribution is concentrated; modal category dominates
All $\hat{p}_k \approx 1/K$	Distribution is approximately uniform; no dominant category
$\hat{p}_k = 1$ for one category	All observations in one category; no variation
Small $f_k$ for some categories	Rare categories; consider collapsing or flagging
Large $N_{miss}$	Potential bias; investigate mechanism of missingness

8.2 Mode Interpretation

Mode Pattern	Interpretation
Single clear mode with high $\hat{p}_{mode}$	Strong consensus around one category
Single mode with $\hat{p}_{mode}$ close to $1/K$	Weakly dominant mode; near-uniform distribution
Two co-modes	Bimodal distribution; two competing dominant categories
All categories equal	Uniform distribution; mode is uninformative

8.3 Heterogeneity Measures Interpretation

Measure	Low Value Indicates	High Value Indicates
$VR$	Most observations in modal category	Observations spread across many categories
$H$ (Shannon)	Predictable, concentrated distribution	Diverse, uncertain distribution
$H_{norm}$	Near-complete concentration (near 0)	Near-perfect diversity (near 1)
$HHI$	Spread across categories (near $1/K$)	Near-monopoly in one category (near 1)
$D$ (Simpson)	Low diversity; one category dominates (near 0)	High diversity; categories well-represented
$IQV$	Homogeneous distribution (near 0)	Heterogeneous distribution (near 1)

8.4 Cumulative Frequency Interpretation (Ordinal Variables)

Cumulative Metric	Interpretation
$P_k = 50\%$ at low-rank category	Most responses at the lower end; positively (right) skewed
$P_k = 50\%$ at middle category	Symmetric; median in the middle
$P_k = 50\%$ at high-rank category	Most responses at the upper end; negatively (left) skewed
Wide IQR (many categories span Q1 to Q3)	High ordinal variability
Narrow IQR	Tight concentration around the median category

8.5 Confidence Interval Interpretation

CI Pattern	Interpretation
Narrow CI around $\hat{p}_k$	Precise estimate; large $N$ or extreme proportion
Wide CI around $\hat{p}_k$	Imprecise estimate; small $N$ or $\hat{p}_k$ near 0.50
CI excludes a reference value $\pi_0$	Proportion differs from $\pi_0$ at the corresponding significance level (5% for a 95% CI)
CIs for two categories overlap	Inconclusive; overlap alone does not show that the proportions are equal, so compare them with a formal test

8.6 Contextualising Heterogeneity: Reference Benchmarks

$H_{norm}$	Verbal Label	Description
$0.00 - 0.20$	Very low diversity	One category overwhelmingly dominant
$0.21 - 0.40$	Low diversity	A few categories contain most observations
$0.41 - 0.60$	Moderate diversity	Several categories reasonably represented
$0.61 - 0.80$	High diversity	No single category clearly dominant
$0.81 - 1.00$	Very high diversity	Observations distributed nearly uniformly

⚠️ These benchmarks are heuristic guides, not universal standards. Domain context is essential: an $H_{norm}$ of 0.30 may indicate healthy diversity in clinical adverse event categories but near-monopoly in a competitive market context. Always interpret unnormalised heterogeneity measures (such as $H$ , $D$ , and $VR$ ) relative to their theoretical range for the specific $K$ in your variable.

9. Visualising Categorical Data

9.1 Bar Chart

The bar chart (also called a bar graph) is the most widely recommended visualisation for categorical data. Each category is represented by a rectangular bar whose height (or length, for horizontal bars) is proportional to its frequency or proportion.

Best practices:

Use vertical bars for a small number of categories ( $K \leq 6$ ) with short labels.
Use horizontal bars when labels are long or when $K > 6$ .
Start the frequency axis at zero — truncating the axis distorts relative comparisons.
Sort bars by descending frequency for nominal variables (unless there is a theoretically meaningful order).
For ordinal variables, preserve the natural category order.
Label each bar with its count, percentage, or both for clarity.
Use a single, consistent colour for one-variable displays; reserve colour variation for grouped or stacked charts.

Appropriate for: Nominal and ordinal variables; any $K$ ; frequency and percentage comparisons.
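
Several of the best practices above concern data preparation rather than drawing, so they can be sketched without any plotting library. The helper below is an illustrative assumption: it orders the bars, builds count-and-percentage labels, and suggests an orientation from $K$ alone (label length is not checked).

```python
def bar_chart_spec(counts, ordinal=False, max_vertical=6):
    """Prepare bar order, heights, labels, and orientation for a bar chart.

    counts: dict of category -> frequency; for ordinal variables the dict
    must already be in rank order, which is then preserved.
    """
    n = sum(counts.values())
    items = list(counts.items())
    if not ordinal:
        # Nominal: sort by descending frequency (ties keep input order).
        items.sort(key=lambda pair: pair[1], reverse=True)
    heights = [f for _, f in items]
    return {
        "categories": [cat for cat, _ in items],
        "heights": heights,
        "bar_labels": [f"{f} ({100 * f / n:.1f}%)" for f in heights],
        "orientation": "vertical" if len(items) <= max_vertical else "horizontal",
        "axis_min": 0,  # never truncate the frequency axis
    }

# Made-up brand preference counts (N = 50).
spec = bar_chart_spec({"Brand A": 12, "Brand B": 30, "Brand C": 8})
print(spec["categories"], spec["bar_labels"], spec["orientation"])
```

The resulting lists can be handed to whichever charting tool is in use; the point is that ordering, labelling, and a zero baseline are decided before any drawing happens.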

9.2 Grouped Bar Chart

The grouped bar chart (clustered bar chart) displays the frequency distributions of a categorical variable separately for each level of a grouping variable, with groups of bars placed side by side.

Best practices:

Limit to $K \leq 5$ categories and $G \leq 4$ groups to avoid clutter.
Use a distinct colour for each group; provide a clear legend.
Report percentages within groups (row percentages) when comparing group profiles.

Appropriate for: Comparing the distribution of one categorical variable across multiple groups.

9.3 Stacked Bar Chart

The stacked bar chart represents the proportion of each category stacked within a single bar (or within each group bar). The 100% stacked bar chart is particularly useful for comparing proportional breakdowns across groups.

Best practices:

Use 100% stacked bars when comparing proportional composition across groups.
Place the most important or interpretively central category consistently (either first or last in the stack).
Avoid too many categories in a stack ( $K > 5$ makes stacks hard to read).

Appropriate for: Visualising proportional composition; comparing distributions across groups.

9.4 Pie Chart

The pie chart encodes frequency as the angle (and area) of each slice. It is appropriate only when the number of categories is small ( $K \leq 5$ ) and the primary goal is showing part-to-whole relationships.

Limitations:

Human perception of angular differences is less accurate than of bar lengths.
Very small slices are illegible.
Comparison across multiple pie charts is difficult.

When to avoid: When $K > 5$ , when categories are similar in size, or when precise comparisons between categories are required. Prefer a bar chart in most cases.

9.5 Donut Chart

The donut chart is a variant of the pie chart with a hollow centre. The centre space can be used to display the total $N$ or a key summary statistic. It shares the limitations of pie charts and should be used with equal care.

9.6 Waffle Chart

The waffle chart (or unit chart) represents proportions as filled cells in a $10 \times 10$ (or similar) grid, where each cell represents 1% (or $1/N$ ) of the total. Waffle charts are highly accessible and intuitive for general audiences.

Appropriate for: Communicating proportions to non-technical audiences; displaying one or two categories in a clear, visual format.
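A $10 \times 10$ grid has exactly 100 cells, so independently rounded percentages may not fill it exactly. The sketch below shows one way to allocate cells, assuming a largest-remainder rule; this is an illustrative choice rather than a documented DataStatPro behaviour, and the function name `waffle_cells` is hypothetical.

```python
from math import floor

def waffle_cells(counts, n_cells=100):
    """Allocate grid cells to categories using the largest-remainder method."""
    total = sum(counts.values())
    quotas = {k: v * n_cells / total for k, v in counts.items()}
    cells = {k: floor(q) for k, q in quotas.items()}
    leftover = n_cells - sum(cells.values())
    # Give the remaining cells to the categories with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - cells[k], reverse=True)
    for k in by_remainder[:leftover]:
        cells[k] += 1
    return cells

# Counts taken from the commuting-mode worked example in Section 12.
cells = waffle_cells({"Car": 102, "Public Transport": 78, "Cycling": 41, "Walking": 26})
print(cells)
```

The allocation always sums to the grid size, which independent rounding of each percentage does not guarantee.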

9.7 Lollipop Chart

The lollipop chart is a space-efficient alternative to the bar chart. Each category is represented by a thin line ("stick") topped with a dot ("lollipop"), whose position encodes frequency or proportion.

Best practices:

Sort by descending frequency for nominal variables.
Particularly effective for $K > 10$ categories where bars become visually dense.

9.8 Diverging Bar Chart (Likert Scales)

For Likert-scale ordinal variables (e.g., 5-point agree–disagree scales), the diverging bar chart (also called a diverging stacked bar chart) centres the neutral category at zero and extends positive-direction categories to the right and negative-direction categories to the left.

Construction:

Define a neutral midpoint (e.g., "Neither agree nor disagree").
Positive categories extend rightward from the midpoint.
Negative categories extend leftward from the midpoint.
Each half-bar's length is proportional to the percentage in that response category.

Why it is effective: Enables simultaneous visual assessment of the overall agreement/disagreement balance and the distribution across all response options.

9.9 Cumulative Frequency Plot (Ordinal Variables)

The cumulative frequency (ogive) plot graphs cumulative percentage on the $y$ -axis against ordered category levels on the $x$ -axis. It is the primary tool for visually identifying the median (where the curve crosses 50%), quartiles, and the shape of the ordinal distribution.

Appropriate for: Ordinal variables; assessing cumulative burden or threshold effects.
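Reading the median and quartiles off the cumulative curve amounts to finding the first ordered category whose cumulative proportion reaches the target. A minimal Python sketch of that rule follows; the function name `ordinal_quantile` is hypothetical, and the counts are those of the patient satisfaction example in Section 12.

```python
from itertools import accumulate

def ordinal_quantile(levels, counts, q):
    """Return the first ordered level whose cumulative proportion reaches q."""
    n = sum(counts)
    for level, cum in zip(levels, accumulate(counts)):
        if cum >= q * n:
            return level
    return levels[-1]

levels = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]
counts = [8, 19, 28, 72, 53]

q1 = ordinal_quantile(levels, counts, 0.25)
median = ordinal_quantile(levels, counts, 0.50)
q3 = ordinal_quantile(levels, counts, 0.75)
print(q1, median, q3, sep=" | ")
```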

9.10 Visualisation Selection Guide

Variable Type	$K$	Primary Audience	Recommended Chart
Nominal	2–5	Technical	Bar chart
Nominal	2–4	General	Pie chart or waffle chart
Nominal	6+	Any	Horizontal bar or lollipop
Ordinal (non-Likert)	Any	Any	Bar chart (ordered) or cumulative plot
Ordinal (Likert)	4–7	Any	Diverging bar chart
Binary	2	Any	Single bar, donut, or waffle
Grouped (nominal × group)	3–6 × 2–4	Technical	Grouped or stacked bar chart

10. Confidence Intervals for Proportions

10.1 Why Confidence Intervals Are Essential

Sample proportions are estimates of population proportions. A 95% confidence interval (CI) provides a range of plausible values for the true population proportion $\pi_k$ , given the observed $\hat{p}_k$ and sample size $N$ . CIs are not optional — they are integral to responsible reporting of proportions.

10.2 Wilson Score Interval (Recommended)

The Wilson score interval is the recommended method for most applications, performing well across all sample sizes and values of $\hat{p}_k$ :

$CI_{Wilson} = \frac{\hat{p}_k + \frac{z^2}{2N} \pm z\sqrt{\frac{\hat{p}_k(1-\hat{p}_k)}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$

Where $z = 1.960$ for 95% CI. The Wilson interval maintains coverage probability close to the nominal 95% level even for small $N$ or extreme $\hat{p}_k$ (near 0 or 1).
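The Wilson formula translates directly into a few lines of code. The sketch below uses only the Python standard library; the function name `wilson_ci` is hypothetical and is not a DataStatPro API.

```python
from math import sqrt
from statistics import NormalDist

def wilson_ci(f, n, conf=0.95):
    """Wilson score interval for a proportion f / n."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = f / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Example: 78 of 247 valid respondents chose a category.
lower, upper = wilson_ci(78, 247)
print(f"[{lower:.3f}, {upper:.3f}]")
```

Because the centre is pulled toward 0.5 and the half-width never exceeds the distance to the boundary, the bounds stay within $[0, 1]$ even when $f_k = 0$ or $f_k = N$ .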

10.3 Clopper-Pearson Exact Interval

The Clopper-Pearson interval is an exact (conservative) method based on the binomial distribution:

$CI_{CP} = \left[B\!\left(\frac{\alpha}{2};\; f_k,\; N - f_k + 1\right),\; B\!\left(1 - \frac{\alpha}{2};\; f_k + 1,\; N - f_k\right)\right]$

Where $B(q;\; a,\; b)$ is the $q$ -th quantile of the Beta $(a,b)$ distribution. The Clopper-Pearson interval guarantees that the true coverage is at least $1 - \alpha$ , but is typically wider (more conservative) than necessary. Recommended when a conservative guarantee is required (e.g., regulatory contexts).
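The Beta quantiles above are equivalent to solving the two binomial tail equations $P(X \geq f_k \mid \pi) = \alpha/2$ (lower bound) and $P(X \leq f_k \mid \pi) = \alpha/2$ (upper bound). The sketch below solves them by bisection using only the standard library. The function names are hypothetical, and the direct summation is intended for modest $N$ only; a statistical library's Beta quantile function would normally be used instead.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve_decreasing(func, target, iters=60):
    """Find p in [0, 1] with func(p) = target, for func decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid   # CDF still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson_ci(f, n, alpha=0.05):
    # Lower bound: P(X >= f) = alpha/2, i.e. P(X <= f - 1) = 1 - alpha/2.
    lower = 0.0 if f == 0 else solve_decreasing(
        lambda p: binom_cdf(f - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= f) = alpha/2.
    upper = 1.0 if f == n else solve_decreasing(
        lambda p: binom_cdf(f, n, p), alpha / 2)
    return lower, upper

lower, upper = clopper_pearson_ci(26, 247)
print(f"[{lower:.3f}, {upper:.3f}]")
```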

10.4 Agresti-Coull Interval

The Agresti-Coull interval is a simple approximation that adjusts the observed proportion by adding $z^2/2$ pseudo-successes and $z^2/2$ pseudo-failures:

$\tilde{p}_k = \frac{f_k + z^2/2}{N + z^2}, \qquad \tilde{N} = N + z^2$

$CI_{AC} = \tilde{p}_k \pm z\sqrt{\frac{\tilde{p}_k(1-\tilde{p}_k)}{\tilde{N}}}$

For $z = 1.96$ , this adds approximately 2 pseudo-successes and 2 pseudo-failures. The Agresti-Coull interval is computationally simple, nearly as accurate as Wilson's method, and performs well for $N \geq 10$ .
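A short sketch of the Agresti-Coull adjustment, written in Python with the standard library only (the function name `agresti_coull_ci` is hypothetical):

```python
from math import sqrt
from statistics import NormalDist

def agresti_coull_ci(f, n, conf=0.95):
    """Agresti-Coull interval: Wald formula applied to adjusted counts."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n_tilde = n + z**2                    # adjusted sample size
    p_tilde = (f + z**2 / 2) / n_tilde    # adjusted proportion
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half

lower, upper = agresti_coull_ci(78, 247)
print(f"[{lower:.3f}, {upper:.3f}]")
```

The adjusted proportion $\tilde{p}_k$ equals the centre of the Wilson interval, which is why the two methods give similar results.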

10.5 Wald Interval (Not Recommended for Small Samples)

The Wald interval is the classic textbook method:

$CI_{Wald} = \hat{p}_k \pm z\sqrt{\frac{\hat{p}_k(1-\hat{p}_k)}{N}}$

Limitations: The Wald interval can produce lower bounds below 0 or upper bounds above 1 for extreme proportions. It has poor coverage properties for small $N$ or when $\hat{p}_k$ is near 0 or 1. Use Wilson or Agresti-Coull instead.
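The boundary problem is easy to reproduce. The following sketch (hypothetical function name `wald_ci`, standard library only) shows a negative lower bound for one success in twenty trials and a degenerate zero-width interval when no successes are observed:

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(f, n, conf=0.95):
    """Wald interval; shown for comparison, not recommended for small n."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = f / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

low_count = wald_ci(1, 20)    # lower bound falls below 0
zero_count = wald_ci(0, 20)   # collapses to a single point at 0
print(low_count, zero_count)
```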

10.6 CI Method Comparison

Method	Recommended For	Coverage	Notes
Wilson Score	General use; any $N$	Excellent	Best default choice
Clopper-Pearson	Small $N$ ; conservative guarantee required	Conservative	Wider than necessary for large $N$
Agresti-Coull	Simplicity; $N \geq 10$	Very good	Slightly wider than Wilson
Wald	Large $N$ ( $> 100$ ); $\hat{p}_k$ not extreme	Good only for large $N$	Fails for small $N$ or extreme $\hat{p}_k$

10.7 Simultaneous CIs for Multiple Proportions

When reporting CIs for all $K$ proportions simultaneously, the familywise confidence level is not maintained at 95% — each individual CI has 95% coverage but the joint coverage is lower. To maintain joint 95% coverage:

Bonferroni-adjusted CI: Use $z = z_{\alpha/(2K)}$ instead of $z_{0.025}$ .

For $K = 4$ categories at 95% joint confidence: $z = z_{0.00625} \approx 2.50$ .
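The adjusted critical value can be computed for any $K$ with the standard normal quantile function. A minimal sketch (the function name `bonferroni_z` is hypothetical):

```python
from statistics import NormalDist

def bonferroni_z(k, alpha=0.05):
    """Two-sided critical value giving joint coverage of at least 1 - alpha over k intervals."""
    return NormalDist().inv_cdf(1 - alpha / (2 * k))

for k in (1, 2, 4, 6):
    print(k, round(bonferroni_z(k), 3))
```

The resulting $z$ replaces 1.960 in whichever interval formula is used, for example the Wilson score interval.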

10.8 CI Width as a Function of $N$ and $\hat{p}_k$

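For moderate to large $N$ , the full width of a Wilson 95% CI is approximately $2 \times 1.96 \times \sqrt{\hat{p}(1-\hat{p})/N}$ , which is maximised at $\hat{p} = 0.5$ . The half-width (margin of error) is half of this: $1.96 \times \sqrt{\hat{p}(1-\hat{p})/N}$ .

Approximate 95% CI half-width for $\hat{p} = 0.50$ :

$N$	Approximate 95% CI Half-Width
20	±0.219
50	±0.139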
100	±0.098
200	±0.069
500	±0.044
1000	±0.031
5000	±0.014
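The tabulated values follow from the normal-approximation formula and can be regenerated for any $N$ or $\hat{p}$ . A minimal sketch (the function name `half_width` is hypothetical):

```python
from math import sqrt

def half_width(n, p=0.5, z=1.96):
    """Approximate margin of error z * sqrt(p(1 - p) / n)."""
    return z * sqrt(p * (1 - p) / n)

for n in (20, 50, 100, 200, 500, 1000, 5000):
    print(f"N = {n:5d}   +/- {half_width(n):.3f}")
```

Because the margin shrinks with $\sqrt{N}$ , halving it requires roughly four times as many observations.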

11. Advanced Topics

11.1 Conditional Frequency Tables and Subgroup Comparisons

When a categorical outcome variable is examined across levels of a grouping variable, the result is a conditional frequency table (cross-tabulation). Each row or column shows the distribution of the outcome variable within a subgroup:

$\hat{p}_{k|g} = \frac{f_{kg}}{N_g}$

Comparing $\hat{p}_{k|g}$ across groups reveals whether the distribution of categories differs between subgroups. Formal inferential testing of such differences is the domain of the chi-square test of association.

⚠️ When comparing conditional distributions across groups, report within-group percentages (row percentages when groups define rows), not overall percentages. Reporting overall percentages when groups differ in size produces misleading comparisons.

11.2 Standardisation and Reweighting

When comparing categorical distributions across samples with different population structures (e.g., different age compositions), direct standardisation weights each group's proportions to a common reference population:

$\hat{p}_k^{std} = \sum_{g=1}^G w_g^{ref} \times \hat{p}_{k|g}$

Where $w_g^{ref}$ is the proportion of the reference population in group $g$ . This removes the confounding effect of group composition and enables fair comparisons across samples.
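As an illustration of the weighted sum, the sketch below standardises a two-group sample to a reference population. All numbers, group labels, and the function name `direct_standardise` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def direct_standardise(cond_props, ref_weights):
    """Directly standardised proportions.

    cond_props[g][k] is the within-group proportion of category k in group g;
    ref_weights[g] is the share of group g in the reference population.
    """
    if abs(sum(ref_weights.values()) - 1.0) > 1e-9:
        raise ValueError("reference weights must sum to 1")
    categories = next(iter(cond_props.values())).keys()
    return {
        k: sum(ref_weights[g] * cond_props[g][k] for g in cond_props)
        for k in categories
    }

# Hypothetical within-group proportions and reference weights.
cond_props = {
    "Under 40": {"Yes": 0.20, "No": 0.80},
    "40 and over": {"Yes": 0.60, "No": 0.40},
}
ref_weights = {"Under 40": 0.5, "40 and over": 0.5}

standardised = direct_standardise(cond_props, ref_weights)
print(standardised)
```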

11.3 Goodness-of-Fit: Comparing to a Theoretical Distribution

When a theoretical or historical distribution $\{\pi_{0,1}, \pi_{0,2}, \ldots, \pi_{0,K}\}$ exists for a variable, the observed proportions can be tested against it using the chi-square goodness-of-fit test:

$\chi^2 = \sum_{k=1}^{K} \frac{(f_k - E_k)^2}{E_k}, \quad E_k = N \times \pi_{0,k}$

With $\nu = K - 1$ degrees of freedom. DataStatPro integrates this test directly into the Categorical Descriptives output when a reference distribution is supplied.
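The statistic itself is simple to compute by hand or in code. The sketch below tests the commuting-mode counts from Section 12 against a uniform reference distribution, which is assumed here purely for illustration. It returns the statistic and degrees of freedom only; a p-value requires the chi-square distribution function from a statistical library. The function name `chi_square_gof` is hypothetical.

```python
def chi_square_gof(observed, ref_props):
    """Chi-square goodness-of-fit statistic and its degrees of freedom."""
    n = sum(observed)
    expected = [n * pi for pi in ref_props]
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return stat, len(observed) - 1

observed = [102, 78, 41, 26]          # Car, Public Transport, Cycling, Walking
reference = [0.25, 0.25, 0.25, 0.25]  # assumed uniform reference distribution
stat, df = chi_square_gof(observed, reference)
print(f"chi-square = {stat:.2f}, df = {df}")
```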

11.4 Detecting Digit Preference and Response Bias

In survey data, systematic response biases cause disproportionate selection of certain categories:

Acquiescence bias: Tendency to agree regardless of question content, inflating higher Likert categories.
Centrality bias: Over-selection of the neutral/middle category.
Extremity bias: Over-selection of the highest and lowest categories.
Social desirability bias: Over-reporting of socially preferred responses.

Detection methods include comparing the observed distribution to an expected uniform distribution and inspecting standardised residuals from a goodness-of-fit test.

11.5 Benford's Law for Categorical First Digits

In datasets with naturally occurring numeric counts (e.g., city populations, financial transaction amounts), Benford's Law predicts that the first significant digit $d$ follows the distribution:

$P(D = d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \quad d = 1, 2, \ldots, 9$

Significant departures from Benford's Law — assessed via a chi-square goodness-of-fit test on the first-digit frequency distribution — can flag data fabrication or anomalies in certain contexts.
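The expected first-digit proportions and the observed first-digit counts can be tabulated as follows. This is a minimal sketch for non-zero integer counts; the helper names and the small data list are hypothetical and serve only as illustration.

```python
from collections import Counter
from math import log10

# Expected first-digit proportions under Benford's Law.
BENFORD = {d: log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(value):
    """Leading digit of a non-zero integer count."""
    return int(str(abs(int(value)))[0])

def first_digit_table(values):
    """Map each digit 1-9 to (observed count, expected count under Benford)."""
    digits = Counter(first_digit(v) for v in values if int(v) != 0)
    n = sum(digits.values())
    return {d: (digits.get(d, 0), n * BENFORD[d]) for d in range(1, 10)}

# Hypothetical data, far too few values for a real test.
table = first_digit_table([1200, 1875, 23, 310, 19, 4520, 118, 96, 2750, 1430])
for d, (obs, exp) in table.items():
    print(d, obs, round(exp, 2))
```

The observed and expected counts can then be passed to a chi-square goodness-of-fit test with $\nu = 8$ degrees of freedom.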

11.6 Entropy-Based Feature Selection

In machine learning and data science, information gain and related entropy-based metrics use Shannon entropy to assess the predictive value of a categorical feature $X$ for an outcome variable $Y$ :

$IG(Y; X) = H(Y) - H(Y \mid X)$

Where $H(Y \mid X) = \sum_{k} \hat{p}(X=k) \times H(Y \mid X=k)$ is the conditional entropy of $Y$ given $X$ . Features with high information gain are more useful predictors. DataStatPro reports $H(Y)$ , $H(Y|X)$ , and $IG(Y;X)$ in the advanced output panel when an outcome variable is specified.
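The definition can be sketched in a few lines of Python. The function names and the toy data below are hypothetical and do not reflect DataStatPro's internal implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(x, y):
    """IG(Y; X) = H(Y) - H(Y | X) for paired categorical observations."""
    n = len(y)
    h_cond = 0.0
    for value, count in Counter(x).items():
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        h_cond += (count / n) * entropy(subset)
    return entropy(y) - h_cond

# Toy data: the first feature determines y exactly, the second is uninformative.
y = ["yes", "yes", "no", "no"]
ig_perfect = information_gain(["a", "a", "b", "b"], y)
ig_none = information_gain(["a", "b", "a", "b"], y)
print(ig_perfect, ig_none)
```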

11.7 Sampling Weights and Complex Survey Design

Nationally representative surveys typically use complex sampling designs (stratification, clustering, unequal selection probabilities). In such cases:

Design-weighted proportions $\hat{p}_k^{(w)}$ correctly estimate population proportions.
Unweighted proportions estimate the sample distribution only.
Variance estimation must account for the sampling design (Taylor linearisation or bootstrap replication), not just the binomial formula.

DataStatPro supports Taylor-linearised variance estimation for weighted proportions when design variables (stratum, cluster, weight) are specified.

11.8 Temporal Trends in Categorical Variables

When the same categorical variable is measured at multiple time points, tracking the change in proportions over time reveals trends. Visualisation options include:

Line chart of proportions over time (one line per category).
Stacked area chart (visualises changing composition).
Small multiples (one bar chart per time point).

Formal testing of temporal trends in proportions can be done using the Cochran-Armitage trend test (for binary outcomes) or regression models for categorical outcomes.

11.9 Inter-Rater Reliability for Categorical Classifications

When the same observations are classified independently by two or more raters, the agreement between raters is quantified by Cohen's kappa ( $\kappa$ , for two raters) or Fleiss' kappa (for three or more raters). For two raters:

$\kappa = \frac{P_o - P_e}{1 - P_e}$

Where $P_o = \sum_k \hat{p}_{kk}$ is the observed agreement proportion (sum of diagonal proportions in the $K \times K$ agreement table) and $P_e = \sum_k \hat{p}_{k\cdot}\hat{p}_{\cdot k}$ is the expected agreement by chance.

$\kappa$	Verbal Label
$< 0.00$	Less than chance
$0.00 - 0.20$	Slight
$0.21 - 0.40$	Fair
$0.41 - 0.60$	Moderate
$0.61 - 0.80$	Substantial
$0.81 - 1.00$	Almost perfect
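For two raters, $\kappa$ can be computed directly from the $K \times K$ agreement table. The sketch below uses a hypothetical $2 \times 2$ table; the function name `cohens_kappa` is also hypothetical.

```python
def cohens_kappa(table):
    """Cohen's kappa; table[i][j] counts items rated i by rater 1 and j by rater 2."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n               # observed agreement
    row = [sum(table[i]) / n for i in range(k)]                 # rater 1 marginals
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # rater 2 marginals
    p_e = sum(r * c for r, c in zip(row, col))                  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical agreement table for 50 items.
kappa = cohens_kappa([[20, 5], [10, 15]])
print(round(kappa, 3))
```

Here $P_o = 0.70$ and $P_e = 0.50$ , so $\kappa = 0.40$ , which sits at the upper edge of the "Fair" band.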

12. Worked Examples

Example 1: Nominal Variable — Preferred Mode of Transport ( $K = 4$ )

A transport planning survey collects preferred commuting mode from $N = 250$ adults. There are 3 missing responses. The valid responses ( $N_{valid} = 247$ ) are:

Mode	Count
Car	102
Public Transport	78
Cycling	41
Walking	26

Step 1 — Frequencies and Proportions:

Mode	$f$	$\hat{p}$	$\%$	95% Wilson CI
Car	102	.413	41.3%	[35.5%, 47.4%]
Public Transport	78	.316	31.6%	[26.1%, 37.6%]
Cycling	41	.166	16.6%	[12.4%, 22.0%]
Walking	26	.105	10.5%	[7.2%, 15.1%]
Total (valid)	247	1.000	100%
Missing	3

Step 2 — Mode:

$M_o = \text{Car}$ ( $f = 102$ , $\hat{p} = .413$ ). Unimodal.

Step 3 — Heterogeneity Measures:

$VR = 1 - 0.413 = 0.587$

$H = -(0.413\log_2 0.413 + 0.316\log_2 0.316 + 0.166\log_2 0.166 + 0.105\log_2 0.105)$

$= -(0.413 \times (-1.275) + 0.316 \times (-1.662) + 0.166 \times (-2.590) + 0.105 \times (-3.252))$

$= -(-0.527 - 0.525 - 0.430 - 0.341) = 1.823 \text{ bits}$

$H_{norm} = \frac{1.823}{\log_2 4} = \frac{1.823}{2.000} = 0.912$

$HHI = 0.413^2 + 0.316^2 + 0.166^2 + 0.105^2 = 0.171 + 0.100 + 0.028 + 0.011 = 0.310$

$D = 1 - 0.310 = 0.690$

$IQV = \frac{4}{3} \times 0.690 = 0.920$

Interpretation: Car is the modal category (41.3%) but falls well short of a majority, and a substantial minority prefer public transport (31.6%). The very high $H_{norm} = 0.912$ and $IQV = 0.920$ indicate a highly diverse distribution, with responses spread across all four categories.

APA write-up: "Among 247 valid respondents (3 missing), the most frequently preferred commuting mode was car ( $n = 102$ , 41.3%, 95% CI [35.5%, 47.4%]), followed by public transport ( $n = 78$ , 31.6%, 95% CI [26.1%, 37.6%]), cycling ( $n = 41$ , 16.6%, 95% CI [12.4%, 22.0%]), and walking ( $n = 26$ , 10.5%, 95% CI [7.2%, 15.1%]). The distribution showed high heterogeneity (Shannon entropy = 1.82 bits, $H_{norm}$ = 0.91, IQV = 0.92)."
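The heterogeneity measures in Step 3 can be reproduced from the raw counts. In the sketch below the function name `heterogeneity` is hypothetical. Because the code uses unrounded proportions, results may differ from the hand calculation in the third decimal place.

```python
from math import log2

def heterogeneity(counts):
    """Variation ratio, entropy, HHI, Simpson's D and IQV from category counts."""
    n = sum(counts)
    k = len(counts)
    props = [c / n for c in counts]
    h = -sum(p * log2(p) for p in props if p > 0)   # 0 * log2(0) treated as 0
    hhi = sum(p * p for p in props)
    return {
        "VR": 1 - max(props),
        "H": h,
        "H_norm": h / log2(k),
        "HHI": hhi,
        "D": 1 - hhi,
        "IQV": k / (k - 1) * (1 - hhi),
    }

result = heterogeneity([102, 78, 41, 26])  # Car, Public Transport, Cycling, Walking
for name, value in result.items():
    print(f"{name:7s}{value:.3f}")
```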

Example 2: Ordinal Variable — Patient Satisfaction (5-Point Scale)

A hospital surveys $N = 180$ patients on their satisfaction with care (5-point ordinal scale: Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied). There are no missing values.

Satisfaction	$f$	$\%$	Cumulative $f$	Cumulative $\%$	95% Wilson CI
Very Dissatisfied	8	4.4%	8	4.4%	[2.2%, 8.6%]
Dissatisfied	19	10.6%	27	15.0%	[6.9%, 15.9%]
Neutral	28	15.6%	55	30.6%	[10.9%, 21.5%]
Satisfied	72	40.0%	127	70.6%	[33.3%, 47.1%]
Very Satisfied	53	29.4%	180	100.0%	[23.2%, 36.5%]
Total	180	100%

Mode: Satisfied ( $f = 72$ , $\hat{p} = .400$ ).

Median: The cumulative percentage first reaches 50% at "Satisfied" ( $P_4 = 70.6\%$ ). Median = Satisfied.

Quartiles:

$Q_1$ (25th percentile): First category where $P_k \geq 25\%$ → $P_3 = 30.6\%$ → $Q_1 =$ Neutral
$Q_3$ (75th percentile): First category where $P_k \geq 75\%$ → $P_5 = 100\%$ → $Q_3 =$ Very Satisfied
$IQR =$ Neutral to Very Satisfied (spans 3 category levels)

Heterogeneity:

$VR = 1 - 0.400 = 0.600$

$H = -(0.044\log_2 0.044 + 0.106\log_2 0.106 + 0.156\log_2 0.156 + 0.400\log_2 0.400 + 0.294\log_2 0.294)$

$\approx -(-0.198 - 0.343 - 0.418 - 0.529 - 0.519) \approx 2.008 \text{ bits}$

$H_{norm} = \frac{2.008}{\log_2 5} = \frac{2.008}{2.322} = 0.865$

Interpretation: The distribution is negatively skewed: responses cluster at the higher satisfaction categories, with a tail toward dissatisfaction. The median and mode both fall at "Satisfied", and over 69% of patients rated their care as Satisfied or Very Satisfied. The $H_{norm} = 0.865$ indicates moderately high variability across response options.

APA write-up: "Patient satisfaction ratings ( $N = 180$ ) showed a negatively skewed distribution, with responses concentrated at the higher satisfaction categories. The modal response was Satisfied ( $n = 72$ , 40.0%, 95% CI [33.3%, 47.1%]) and the median was Satisfied ( $Q_1 =$ Neutral; $Q_3 =$ Very Satisfied). A combined 69.4% of patients rated their care as Satisfied or Very Satisfied. The distribution showed moderately high heterogeneity ( $H_{norm} = 0.86$ , $VR = 0.60$ )."

Example 3: Binary Variable — Vaccination Status

A public health register records vaccination status (vaccinated/not vaccinated) for $N = 1\,192$ individuals. $N_{miss} = 14$ , leaving $N_{valid} = 1\,178$ .

Status	$f$	$\%$	95% Wilson CI
Vaccinated	934	79.3%	[76.8%, 81.6%]
Not Vaccinated	244	20.7%	[18.4%, 23.2%]
Total (valid)	1,178	100%
Missing	14	1.2%

Mode: Vaccinated ( $\hat{p} = .793$ ).

Binary entropy:

$H = -(0.793\log_2 0.793 + 0.207\log_2 0.207) = -(-0.2653 - 0.4704) \approx 0.736 \text{ bits}$

$H_{norm} = \frac{0.736}{\log_2 2} = \frac{0.736}{1.000} = 0.736$

$VR = 1 - 0.793 = 0.207$

Interpretation: Approximately 79.3% of individuals with a valid record are vaccinated, and the 95% CI places the true coverage between 76.8% and 81.6%. This falls below the commonly cited 95% herd immunity threshold. The low $VR = 0.207$ confirms that one category (vaccinated) dominates, although a non-negligible 20.7% remain unvaccinated. The $H_{norm} = 0.736$ is still fairly high because binary entropy declines slowly as the split moves away from 50/50.

APA write-up: "Of 1,178 individuals with valid vaccination records (14 missing, 1.2%), 934 (79.3%, 95% CI [76.8%, 81.6%]) were vaccinated and 244 (20.7%, 95% CI [18.4%, 23.2%]) were not vaccinated. Coverage fell below the 95% target threshold."

Example 4: Subgroup Breakdown — Grade Distribution by Teaching Method

Building on the teaching method data from the chi-square tutorial, a researcher reports the grade distribution for each teaching method ( $N = 210$ , 70 per method).

Conditional Frequency Table (Row Percentages):

Method	A	B	C	D/F	Total
Lecture	12 (17.1%)	23 (32.9%)	22 (31.4%)	13 (18.6%)	70
Flipped	24 (34.3%)	27 (38.6%)	16 (22.9%)	3 (4.3%)	70
Online	9 (12.9%)	17 (24.3%)	28 (40.0%)	16 (22.9%)	70

Modes: Lecture = B; Flipped = B; Online = C.

Medians: Lecture = B; Flipped = B; Online = C.

Shannon Entropy by Group:

Method	$H$ (bits)	$H_{norm}$	$VR$	Interpretation
Lecture	1.940	0.970	0.671	High variability; grades spread across all levels
Flipped	1.741	0.871	0.614	Lowest variability of the three; concentrated at upper grades
Online	1.892	0.946	0.600	High variability; concentrated at lower grades

Interpretation: All three methods show moderate-to-high grade variability. The flipped classroom has the highest proportion of A grades (34.3%) and the lowest D/F rate (4.3%), while the online method shows the highest C and D/F rates. Formal testing of these differences is provided by the chi-square test of association.
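The per-group entropy and variation ratio can be recomputed from the counts in the conditional frequency table. A minimal sketch (the function name `summarise` is hypothetical):

```python
from math import log2

# Counts for grades A, B, C, D/F from the conditional frequency table.
grades = {
    "Lecture": [12, 23, 22, 13],
    "Flipped": [24, 27, 16, 3],
    "Online": [9, 17, 28, 16],
}

def summarise(counts):
    """Entropy (bits), normalised entropy and variation ratio for one group."""
    n = sum(counts)
    props = [c / n for c in counts]
    h = -sum(p * log2(p) for p in props if p > 0)
    return {"H": h, "H_norm": h / log2(len(counts)), "VR": 1 - max(props)}

summary = {method: summarise(counts) for method, counts in grades.items()}
for method, s in summary.items():
    print(f"{method:8s} H={s['H']:.3f}  H_norm={s['H_norm']:.3f}  VR={s['VR']:.3f}")
```

Computing each measure within its own group, using the group's $N_g$ as the denominator, is what makes the three rows comparable.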

13. Common Mistakes and How to Avoid Them

Mistake 1: Computing a Mean or Standard Deviation for a Nominal Variable

Problem: Assigning arbitrary numeric codes to categories (e.g., 1 = Male, 2 = Female, 3 = Non-binary) and computing their mean or standard deviation. The resulting number is arithmetically computable but statistically meaningless — the numeric codes carry no magnitude information.

Solution: For nominal variables, report only frequency, proportion, and mode. If a numeric summary of a categorical variable is needed for modelling, create appropriate dummy/indicator variables.

Mistake 2: Treating Ordinal Variables as Fully Continuous

Problem: Computing the arithmetic mean of ordinal scale responses (e.g., mean Likert score = 3.47) as if the intervals between categories were equal. The mean assumes equal spacing; ordinal categories have no such guarantee.

Solution: For ordinal variables, report the median and IQR as the primary central tendency and spread measures. Report the mode as a supplementary measure. Computing means of ordinal variables is acceptable as a pragmatic convention in some fields (notably Likert-scale research), but must be explicitly acknowledged and defended.

Mistake 3: Failing to Report Missing Values

Problem: Computing and reporting proportions from valid observations only, without disclosing the number of missing values. This gives readers no way to assess whether missingness is substantial enough to bias the results.

Solution: Always report both $N_{valid}$ and $N_{miss}$ (and the missing rate). Investigate whether missingness is systematic. Report results for both complete-case and missing-included analyses when missingness is substantial ( $> 5\%$ ).

Mistake 4: Reporting Only Counts Without Proportions (or Vice Versa)

Problem: Reporting only absolute frequencies makes comparisons across groups of different sizes misleading. Reporting only proportions without counts obscures the precision of estimates (a proportion of 50% based on $N = 4$ is very different from one based on $N = 4000$ ).

Solution: Always report both absolute frequency and proportion (or percentage) in frequency tables. Include $N_{valid}$ so readers can recover the raw counts from percentages.

Mistake 5: Using a Pie Chart for More Than 5 Categories

Problem: A pie chart with 6 or more slices becomes illegible. Slices of similar size are virtually indistinguishable, and small categories vanish. Over-reliance on pie charts is one of the most widely cited visualisation errors.

Solution: For $K > 5$ , use a horizontal bar chart sorted by frequency or a lollipop chart. Reserve pie charts for $K \leq 5$ when part-to-whole relationships are the primary message.

Mistake 6: Ignoring Category Order for Ordinal Variables

Problem: Sorting ordinal categories alphabetically or by frequency rather than in their natural rank order (e.g., displaying satisfaction responses as: High, Low, Medium). This disrupts the cumulative frequency interpretation, makes cumulative plots meaningless, and confuses readers.

Solution: For ordinal variables, always display categories in their meaningful rank order. DataStatPro enforces the user-specified rank ordering in all tables and charts when the variable is designated as ordinal.

Mistake 7: Reporting the Wald Interval for Small Samples or Extreme Proportions

Problem: The Wald CI $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/N}$ can produce intervals below 0 or above 1, and has poor coverage when $N < 30$ or $\hat{p} < 0.10$ or $\hat{p} > 0.90$ .

Solution: Use the Wilson score interval as the default for all sample sizes. Use the Clopper-Pearson exact interval when a conservative guarantee is required. DataStatPro defaults to Wilson intervals.

Mistake 8: Comparing Distributions Using Overall Proportions Instead of Conditional Proportions

Problem: When comparing the distribution of a variable across groups of unequal size, reporting overall proportions conflates the group composition with the variable distribution. For example, if 90% of respondents are female and more females prefer Brand A, overall Brand A preference will be high not because it is universally preferred, but because females are overrepresented.

Solution: Always compute and report conditional (within-group) proportions when comparing distributions across groups. Use row percentages when groups are defined by rows.

Mistake 9: Conflating Diversity Measures Across Variables with Different $K$

Problem: Comparing the raw Shannon entropy $H$ of a 3-category variable ( $H_{max} = \log_2 3 = 1.585$ bits) with that of a 6-category variable ( $H_{max} = \log_2 6 = 2.585$ bits). A higher raw $H$ may simply reflect more categories, not greater relative diversity.

Solution: Compare variables using normalised entropy $H_{norm} = H / \log_2 K$ or IQV, both of which are bounded $[0, 1]$ regardless of $K$ .

Mistake 10: Concluding Category Absence from a Zero Frequency

Problem: A category with $f_k = 0$ in the sample may exist in the population but was not observed due to small sample size or sampling variability. Declaring the category "absent" may be premature.

Solution: Report zero-frequency categories explicitly in the table. Compute the 95% CI for $\hat{p}_k = 0$ using the Clopper-Pearson upper bound: $CI_{upper} = B(1 - \alpha/2;\; 1,\; N)$ , which gives a plausible upper bound for the true proportion.
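For $f_k = 0$ the Beta $(1, N)$ quantile has a closed form, $1 - (\alpha/2)^{1/N}$ , so no special software is needed. A minimal sketch (the function name `zero_count_upper_bound` is hypothetical):

```python
def zero_count_upper_bound(n, alpha=0.05):
    """Clopper-Pearson upper bound for a category with zero observed count.

    This is the (1 - alpha/2) quantile of Beta(1, n), which equals
    1 - (alpha/2) ** (1/n).
    """
    return 1 - (alpha / 2) ** (1 / n)

for n in (20, 100, 500):
    print(n, round(zero_count_upper_bound(n), 4))
```

With $N = 20$ , for example, a category that was never observed could still account for roughly 17% of the population, which is why declaring it absent is premature.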

14. Troubleshooting

Problem	Likely Cause	Solution
Proportions do not sum to 1.000	Rounding in display; missing values excluded	Report "rounding may cause totals to differ from 100%"; verify $\sum f_k = N_{valid}$
More categories than expected	Label inconsistencies (case, spaces, synonyms)	Use DataStatPro's category merge/clean tool; standardise labels
Mode is reported as multiple categories	Two or more categories share the maximum frequency	Report all co-modes and describe as a bimodal or multimodal distribution
Cumulative percentage does not reach 100%	Missing category or rounding	Verify all categories are included; check $\sum f_k = N_{valid}$
Confidence interval lower bound is negative (Wald)	Small $N$ or extreme $\hat{p}$ with Wald method	Switch to Wilson or Clopper-Pearson interval; both are bounded $[0,1]$
$H_{norm} = 0$	All observations in one category	Expected result; no variation present; report $VR = 0$
$H_{norm} = 1$	Perfectly uniform distribution	All categories equally represented; may reflect a small sample or genuine uniformity
$IQV > 1$	Computation error or incorrect $K$	Verify formula; $IQV$ is bounded $[0,1]$ by construction
Median not defined	Cumulative % equals exactly 50% at a category boundary, so half the observations lie at or below one category and half lie above it	Report the median as the lower of the two adjacent categories, consistent with $\min\{k : P_k \geq 0.50\}$ ; use interpolation if numeric scores are assigned
Weighted proportions differ greatly from unweighted	Significant over/undersampling of some groups	Expected in complex surveys; report both; weighted estimates are for population inference
All cells have very small counts ( $< 5$ )	Very small total $N$	Report counts only; CIs will be very wide; caution against over-interpreting proportions
Chart shows categories in wrong order	Default alphabetical sorting applied to ordinal variable	Specify ordinal order in DataStatPro's variable settings; reorder manually in chart editor
Missing rate is very high ( $> 20\%$ )	Non-response bias; data collection issue	Investigate MCAR/MAR/MNAR mechanism; perform sensitivity analysis; consider imputation
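Entropy calculated as NaN or negative	$0 \times \log(0)$ not set to 0 (gives NaN), or the leading minus sign omitted (gives a negative value)	Apply the convention $0 \times \log_2(0) \equiv 0$ , justified by the limit $p \log_2 p \to 0$ as $p \to 0^+$ ; check software implementation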

15. Quick Reference Cheat Sheet

Core Equations

Formula	Description
$\hat{p}_k = f_k / N_{valid}$	Sample proportion in category $k$
$\%_k = 100 \times f_k / N_{valid}$	Percentage in category $k$
$F_k = \sum_{j \leq k} f_j$	Cumulative frequency up to category $k$ (ordinal)
$M_o = x_{k^*},\; k^* = \arg\max_k f_k$	Mode: most frequent category
$\text{Median} = x_{k^*},\; k^* = \min\{k : P_k \geq 0.50\}$	Median category (ordinal only)
$VR = 1 - \hat{p}_{mode}$	Variation ratio
$H = -\sum_{k} \hat{p}_k \log_2 \hat{p}_k$	Shannon entropy (bits)
$H_{norm} = H / \log_2 K$	Normalised entropy $\in [0,1]$
$HHI = \sum_{k} \hat{p}_k^2$	Herfindahl–Hirschman Index
$D = 1 - \sum_{k} \hat{p}_k^2$	Simpson's diversity index
$IQV = \frac{K}{K-1}(1 - \sum_{k} \hat{p}_k^2)$	Index of Qualitative Variation $\in [0,1]$
Wilson CI: see Section 10.2	Recommended 95% CI for proportions

Measure Applicability by Scale

Measure	Nominal	Ordinal
Frequency, proportion, %	✅	✅
Cumulative frequency / %	❌	✅
Mode	✅	✅
Median	❌	✅
Quartiles / IQR	❌	✅
$VR$ , $H$ , $H_{norm}$ , $D$ , $HHI$ , $IQV$	✅	✅
Mean, SD	❌	❌ (unless numeric scores assigned)

Chart Selection Guide

Variable Type	$K$	Audience	Recommended Chart
Nominal	2–5	Technical	Vertical bar chart
Nominal	2–4	General	Pie chart / waffle chart
Nominal	6+	Any	Horizontal bar / lollipop
Ordinal (non-Likert)	Any	Any	Bar chart (ordered)
Ordinal (Likert)	4–7	Any	Diverging bar chart
Binary	2	General	Single bar / donut / waffle
Grouped comparisons	3–6 × 2–4	Technical	Grouped / stacked bar
Trend over time	Any	Any	Line chart of proportions

Heterogeneity Benchmarks ( $H_{norm}$ )

$H_{norm}$	Diversity Level
$0.00 - 0.20$	Very low (near-complete concentration)
$0.21 - 0.40$	Low
$0.41 - 0.60$	Moderate
$0.61 - 0.80$	High
$0.81 - 1.00$	Very high (near-uniform)

Confidence Interval Method Selection

Situation	Recommended Method
General use; any $N$	Wilson score interval
Conservative guarantee; small $N$	Clopper-Pearson exact
Simplicity; $N \geq 10$	Agresti-Coull
Large $N > 100$ ; $\hat{p}$ not extreme	Wald (acceptable but not preferred)
Joint CIs for all $K$ proportions	Bonferroni-adjusted Wilson

Approximate 95% CI Half-Width ( $\hat{p} = 0.50$ )

$N$	± Half-Width
20	±0.219
50	±0.139
100	±0.098
200	±0.069
500	±0.044
1000	±0.031

APA 7th Edition Reporting Templates

Single nominal variable: "The most frequently reported [variable name] was [modal category] ( $n$ = [value], [%]%, 95% CI [[LB]%, [UB]%]), followed by [second category] ( $n$ = [value], [%]%, 95% CI [[LB]%, [UB]%]). The full distribution is presented in Table X."

Single ordinal variable: "Responses to [variable name] were distributed as follows: [lowest category] ( $n$ = [value], [%]%), …, [highest category] ( $n$ = [value], [%]%). The median response was [median category] ( $Q_1$ = [Q1 category]; $Q_3$ = [Q3 category])."

Binary variable: "A total of [f] participants ([%]%, 95% CI [[LB]%, [UB]%]) reported [positive category]; the remaining [f] ([%]%, 95% CI [[LB]%, [UB]%]) reported [negative category]."

With missing data: "Of [N total] participants, [N valid] provided valid responses ([N miss] missing, [miss %]%). Among valid responses, …"

With heterogeneity measures: "The distribution showed [low / moderate / high] heterogeneity (Shannon entropy = [value] bits, $H_{norm}$ = [value], IQV = [value])."

With subgroup comparison: "Conditional frequency distributions by [group variable] are presented in Table X. [Category] was most prevalent in [subgroup] ([%]%, 95% CI [[LB]%, [UB]%]) compared to [other subgroup] ([%]%, 95% CI [[LB]%, [UB]%])."

Reporting Checklist

Item	Required
Valid $N$ and missing $N$ (with missing rate)	✅ Always
All category labels clearly defined	✅ Always
Absolute frequencies for all categories	✅ Always
Percentages for all categories	✅ Always
Proportions (or % totalling 100%)	✅ Always
Mode (with indication of multimodality if present)	✅ Always
Median and quartiles	✅ For ordinal variables
Cumulative frequencies / percentages	✅ For ordinal variables
95% CI for all proportions (Wilson recommended)	✅ Always
Shannon entropy ( $H$ and $H_{norm}$ )	✅ When heterogeneity is of substantive interest
$VR$ or $IQV$	✅ When heterogeneity is of substantive interest
HHI / Simpson's $D$	✅ For concentration / diversity analyses
Appropriate chart with axis labels and title	✅ Always
Reference to published frequency table in text	✅ Always
Measurement scale stated (nominal / ordinal)	✅ Always
Missing data mechanism discussed	✅ When the missing rate exceeds 5%
Weighted estimates (if survey data)	✅ When design weights provided
CI method stated	✅ When $N < 100$ or any $\hat{p}_k < 0.10$
Goodness-of-fit against reference distribution	✅ When comparing to theoretical or historical baseline
Subgroup conditional distributions	✅ When a grouping variable is present
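
Several of the "always" items in the checklist above (valid and missing $N$, frequencies, percentages, cumulative percentages, and the mode) can be gathered in one pass. The sketch below is a minimal illustration, not the DataStatPro calculator: the function name `frequency_table`, the use of `None` to mark a missing response, and the example data are all assumptions made for the example.

```python
from collections import Counter


def frequency_table(values, order=None):
    """Summarise one categorical variable; None marks a missing response."""
    valid = [v for v in values if v is not None]
    n_total, n_valid = len(values), len(valid)
    counts = Counter(valid)

    # Ordinal: use the supplied category order. Nominal: sort by descending frequency.
    categories = list(order) if order is not None else [c for c, _ in counts.most_common()]
    unexpected = set(counts) - set(categories)
    if unexpected:
        raise ValueError(f"labels not in the supplied order: {sorted(unexpected)}")

    rows, cumulative = [], 0
    for cat in categories:
        f = counts.get(cat, 0)
        cumulative += f
        rows.append({
            "category": cat,
            "f": f,
            "pct": 100 * f / n_valid if n_valid else 0.0,
            # Cumulative percentages are only meaningful for ordinal variables.
            "cum_pct": 100 * cumulative / n_valid if n_valid else 0.0,
        })

    # Report every category tied for the highest frequency, so multimodality is visible.
    top = max(counts.values(), default=0)
    modes = [c for c in categories if top > 0 and counts.get(c, 0) == top]

    return {
        "n_total": n_total,
        "n_valid": n_valid,
        "n_missing": n_total - n_valid,
        "missing_pct": 100 * (n_total - n_valid) / n_total if n_total else 0.0,
        "rows": rows,
        "modes": modes,
    }


# Hypothetical ordinal responses with one missing value.
responses = ["Agree", "Neutral", None, "Agree", "Disagree"]
table = frequency_table(responses, order=["Disagree", "Neutral", "Agree"])
print(table["n_valid"], table["n_missing"], table["modes"])
for row in table["rows"]:
    print(row)
```

Percentages are computed on valid responses only, matching the "Among valid responses, …" wording of the missing-data template. Raising an error on unexpected labels is a design choice that surfaces inconsistent category labelling instead of silently dropping observations.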

This tutorial provides a comprehensive foundation for understanding, computing, interpreting, visualising, and reporting categorical descriptive statistics within the DataStatPro application. For further reading, consult Agresti's "An Introduction to Categorical Data Analysis" (3rd ed., 2018), Tukey's "Exploratory Data Analysis" (1977), Wickham's "ggplot2: Elegant Graphics for Data Analysis" (3rd ed., 2024) for visualisation principles, and Shannon & Weaver's "The Mathematical Theory of Communication" (1949) for entropy foundations. For feature requests or support, contact the DataStatPro team.
