Categorical Descriptives

Comprehensive reference guide for categorical data analysis and association measures.

Categorical Descriptives: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of summarising categorical data all the way through advanced interpretation, reporting, visualisation, assumption checking, and practical usage within the DataStatPro application. Whether you are encountering categorical descriptive statistics for the first time or deepening your understanding of how to characterise, display, and communicate the distribution of categorical variables, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What are Categorical Descriptives?
  3. The Mathematics Behind Categorical Descriptives
  4. Considerations and Data Quality Checks
  5. Types of Categorical Descriptive Measures
  6. Using the Categorical Descriptives Calculator Component
  7. Step-by-Step Procedure
  8. Interpreting the Output
  9. Visualising Categorical Data
  10. Confidence Intervals for Proportions
  11. Advanced Topics
  12. Worked Examples
  13. Common Mistakes and How to Avoid Them
  14. Troubleshooting
  15. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into categorical descriptives, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 What is a Variable?

A variable is any characteristic, attribute, or quantity that can take on different values across observations. In statistics, variables are the building blocks of data, and every analytic method — including descriptive statistics — begins with a clear understanding of the variables at hand.

  • Observation: A single unit of study (e.g., one person, one product, one country).
  • Dataset: A rectangular array of observations (rows) and variables (columns).
  • Value: The specific outcome recorded for a variable on a given observation.

1.2 Scales of Measurement

All variables fall into one of four measurement scales, which determine which statistical operations are valid:

| Scale | Properties | Examples |
| --- | --- | --- |
| Nominal | Named categories; no order | Blood type, gender, country, species |
| Ordinal | Ordered categories; no equal spacing | Satisfaction (low/medium/high), education level |
| Interval | Equal spacing; no true zero | Temperature (°C), year |
| Ratio | Equal spacing; true zero | Height, weight, income, reaction time |

Categorical descriptives apply to nominal and ordinal variables (and, when treated as categorical, to discretised interval or ratio variables).

1.3 Categorical Variables Defined

A categorical variable assigns each observation to exactly one of a finite set of mutually exclusive, exhaustive categories. Two key sub-types exist:

  • Nominal variables: Categories carry labels only, with no inherent ordering. There is no meaningful sense in which one category is "more" or "less" than another. Examples: eye colour (blue, green, brown, hazel), marital status (single, married, divorced, widowed), preferred mode of transport.

  • Ordinal variables: Categories carry labels and a meaningful rank order, but the intervals between adjacent categories need not be equal. Examples: pain level (none, mild, moderate, severe), academic grade (A, B, C, D, F), Likert-scale items (strongly disagree to strongly agree).

⚠️ Descriptive statistics appropriate for nominal variables (frequency, proportion, mode) are always valid for ordinal variables, but ordinal variables additionally support rank-based summaries. Never apply ordinal summaries — such as the median — to purely nominal variables.

1.4 Frequency and Frequency Distributions

The most fundamental summary of a categorical variable is its frequency distribution: a complete enumeration of each category and the number of observations falling into it.

  • Absolute frequency $f_k$: The raw count of observations in category $k$.
  • Relative frequency $p_k$: The proportion of all observations in category $k$, defined as $p_k = f_k / N$.
  • Percentage $\%_k$: The relative frequency expressed as a percentage, $\%_k = 100 \times p_k$.
  • Cumulative frequency $F_k$: The count of all observations in categories up to and including $k$ (meaningful only for ordinal variables).
  • Cumulative relative frequency $P_k$: The proportion of observations in categories up to and including $k$.

1.5 The Mode

The mode is the category (or categories) that appears most frequently in the data. It is the only measure of central tendency that is valid for nominal-scale data.

  • A distribution with one clear most-frequent category is unimodal.
  • A distribution with two equally or near-equally frequent categories is bimodal.
  • A distribution in which all categories occur with equal frequency is uniform — the mode is undefined or uninformative in this case.

1.6 Population vs. Sample

All descriptive statistics computed from data describe the sample at hand. When the goal is to make inferences about a broader population, sample statistics become estimates subject to sampling variability. Confidence intervals (Section 10) quantify this uncertainty.

  • Population proportion: $\pi_k$, the true proportion of the population in category $k$.
  • Sample proportion: $\hat{p}_k = f_k / N$, the estimate of $\pi_k$ from the sample.

1.7 The Concept of a Probability Distribution for Categorical Data

For a categorical variable with $K$ categories, the probability distribution specifies the probability $\pi_k$ assigned to each category $k$, subject to:

$$\sum_{k=1}^{K} \pi_k = 1, \qquad 0 \leq \pi_k \leq 1 \quad \text{for all } k$$

This is the categorical distribution (a generalisation of the Bernoulli distribution to $K > 2$ outcomes). When $K = 2$ (binary variable), it reduces to the Bernoulli distribution with parameter $\pi_1 = p$ and $\pi_2 = 1 - p$.

1.8 Missing Data in Categorical Variables

Missing values are observations for which no category was recorded. They are fundamentally different from a valid category and must be handled deliberately:

  • Complete case analysis: Exclude observations with missing values from all calculations. Simple but potentially biasing.
  • Include as a category: Treat missing as its own explicit category (appropriate when missingness is informative, e.g., "refused to answer").
  • Imputation: Replace missing values with estimated values using mode imputation or multiple imputation methods.

DataStatPro reports the number and percentage of missing values separately from the frequency distribution of valid responses.


2. What are Categorical Descriptives?

2.1 The Core Purpose

Categorical descriptive statistics are numerical and graphical summaries that characterise the distribution of a categorical variable. Their purpose is to answer, in a rigorous and communicable way, the fundamental question: How are observations distributed across the categories of this variable?

Unlike continuous descriptives (mean, standard deviation, skewness), which describe the location and spread of a numeric scale, categorical descriptives quantify the frequency, proportion, and relative dominance of discrete categories.

2.2 What Categorical Descriptives Tell You

| Summary | Core Question Answered |
| --- | --- |
| Frequency table | How many observations fall into each category? |
| Proportions / percentages | What share of observations does each category represent? |
| Mode | Which category is most common? |
| Variation ratio | How heterogeneous is the distribution of categories? |
| Entropy | How uncertain or diverse is the distribution? |
| Concentration index | How much are observations concentrated in one category? |
| Cumulative frequencies | What proportion of observations fall at or below a given level? (ordinal only) |
| Confidence intervals | What is the plausible range for the true population proportion? |

2.3 When to Use Categorical Descriptives

| Condition | Requirement |
| --- | --- |
| Variable scale | Nominal or ordinal |
| Data format | Observations assigned to mutually exclusive categories |
| Purpose | Summarise the marginal distribution of one variable |
| Sample size | Any; CIs become more informative with larger $N$ |
| Reporting | Always precede inferential tests with descriptive summaries |

2.4 Real-World Applications

| Field | Variable | Categories | Descriptive Goal |
| --- | --- | --- | --- |
| Public Health | Vaccination status | Vaccinated / Partially vaccinated / Unvaccinated | Estimate population coverage |
| Marketing | Brand preference | Brand A / B / C / D / None | Identify dominant preference |
| HR & Organisational | Employment type | Full-time / Part-time / Contract / Casual | Describe workforce composition |
| Clinical Trials | Adverse event severity | Mild / Moderate / Severe / Life-threatening | Profile safety outcomes |
| Education | Letter grade | A / B / C / D / F | Characterise grade distribution |
| Sociology | Religious affiliation | Multiple denominations | Map social structure |
| Quality Control | Defect category | Type I / II / III / None | Identify dominant failure modes |
| Political Science | Voting intention | Party A / B / C / Undecided | Track electoral preference |

| Goal | Appropriate Method |
| --- | --- |
| Summarise one categorical variable | Categorical descriptives |
| Test association between two categorical variables | Chi-square test of association |
| Test whether one distribution matches a known distribution | Chi-square goodness-of-fit test |
| Summarise a continuous variable | Continuous descriptives (mean, SD, median, IQR) |
| Compare proportions across two or more groups | Two-proportion z-test; chi-square test |
| Summarise the joint distribution of two categorical variables | Contingency table (cross-tabulation) |
| Model a binary outcome | Logistic regression |

3. The Mathematics Behind Categorical Descriptives

3.1 Notation

Consider a categorical variable $X$ with $K$ mutually exclusive categories labelled $x_1, x_2, \ldots, x_K$. A sample of $N$ observations yields:

  • $f_k$ = count of observations in category $k$ (absolute frequency)
  • $N = \sum_{k=1}^{K} f_k$ = total number of valid observations
  • $\hat{p}_k = f_k / N$ = sample proportion in category $k$ (relative frequency)

3.2 Frequency and Proportion

Absolute frequency:

$$f_k = \#\{i : X_i = x_k\} \quad \text{for } k = 1, 2, \ldots, K$$

Relative frequency (proportion):

$$\hat{p}_k = \frac{f_k}{N}$$

Percentage:

$$\%_k = 100 \times \hat{p}_k = \frac{100 \times f_k}{N}$$

Verification: $\sum_{k=1}^K f_k = N$ and $\sum_{k=1}^K \hat{p}_k = 1$.
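As a minimal sketch of these definitions in Python (standard library only), the function below tallies the absolute frequency, proportion and percentage for each category. The name `frequency_table`, the use of `None` to mark a missing value, and the sample data are illustrative assumptions and are not part of DataStatPro.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (count, proportion, percentage) over valid values."""
    valid = [v for v in values if v is not None]  # assumption: None marks missing
    n = len(valid)
    if n == 0:
        raise ValueError("no valid observations")
    counts = Counter(valid)
    return {cat: (f, f / n, 100 * f / n) for cat, f in counts.items()}

# Hypothetical eye-colour sample with one missing value
eyes = ["brown", "blue", "brown", "green", "brown", "blue", None, "hazel"]
table = frequency_table(eyes)

# Verification step from the text: counts sum to N, proportions sum to 1
total_count = sum(f for f, _, _ in table.values())
total_prop = sum(p for _, p, _ in table.values())
```

Here `total_count` should equal the number of valid observations and `total_prop` should equal 1 up to floating-point rounding.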

3.3 Cumulative Frequencies (Ordinal Variables)

For an ordinal variable with categories ordered $x_1 < x_2 < \cdots < x_K$:

Cumulative absolute frequency:

$$F_k = \sum_{j=1}^{k} f_j$$

Cumulative relative frequency:

$$P_k = \sum_{j=1}^{k} \hat{p}_j = \frac{F_k}{N}$$

By definition, $F_K = N$ and $P_K = 1$.
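The running sums above can be sketched with `itertools.accumulate`. The function name `cumulative_table` and the pain-level counts are hypothetical; the input is assumed to be already sorted in rank order.

```python
from itertools import accumulate

def cumulative_table(ordered_counts):
    """ordered_counts: list of (category, count) pairs in ascending rank order.

    Returns (category, cumulative count, cumulative proportion) per category.
    """
    freqs = [f for _, f in ordered_counts]
    n = sum(freqs)
    return [
        (cat, running, running / n)
        for (cat, _), running in zip(ordered_counts, accumulate(freqs))
    ]

# Hypothetical ordinal variable: pain level, lowest rank first
pain = [("none", 12), ("mild", 18), ("moderate", 7), ("severe", 3)]
rows = cumulative_table(pain)
```

The last row should always carry the total count and a cumulative proportion of 1, which is a quick check that no category was dropped.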

3.4 The Mode

The mode $M_o$ is the category with the highest absolute frequency:

$$M_o = x_{k^*} \quad \text{where} \quad k^* = \underset{k}{\arg\max}\; f_k$$

When two or more categories share the maximum frequency, the distribution is multimodal and all maximum-frequency categories are reported as co-modes.
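A short sketch of mode detection that reports every co-mode rather than an arbitrary one. The function name `modes` is an assumption chosen for illustration; sorting the result is only a convenience so that the output is deterministic.

```python
from collections import Counter

def modes(values):
    """Return all categories that share the maximum frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(cat for cat, f in counts.items() if f == top)

single = modes(["x", "y", "x"])            # one clear mode
tied = modes(["A", "B", "B", "C", "C"])    # two co-modes
```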

3.5 The Variation Ratio

The variation ratio ($VR$) measures the proportion of observations that do not fall into the modal category. It is the simplest measure of dispersion for nominal data:

$$VR = 1 - \frac{f_{mode}}{N} = 1 - \hat{p}_{mode}$$

  • $VR = 0$: All observations are in one category (no variation).
  • $VR = 1 - 1/K$: Maximum variation; all categories are equally frequent (uniform distribution).
  • $VR$ ranges from $0$ to $(K-1)/K$.
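The two boundary cases listed above can be checked with a small sketch. The function name `variation_ratio` is hypothetical and the input is assumed to contain valid (non-missing) observations only.

```python
from collections import Counter

def variation_ratio(values):
    """Proportion of observations outside the modal category."""
    counts = Counter(values)
    return 1 - max(counts.values()) / len(values)

no_variation = variation_ratio(["a", "a", "a", "a"])   # single category
uniform = variation_ratio(["a", "b", "c", "d"])        # K = 4, all equal
```

With four equally frequent categories the result is the upper bound $(K-1)/K = 0.75$.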

3.6 Shannon's Entropy

Shannon's entropy $H$ (from information theory) quantifies the uncertainty or diversity in a categorical distribution:

$$H = -\sum_{k=1}^{K} \hat{p}_k \log_2(\hat{p}_k)$$

Measured in bits (using $\log_2$). Convention: $0 \times \log_2(0) = 0$.

  • $H = 0$: Minimum entropy; all observations in one category (complete certainty).
  • $H = \log_2 K$: Maximum entropy; all categories equally probable (maximum uncertainty).

Normalised entropy (also called relative entropy or evenness index) rescales $H$ to $[0, 1]$:

$$H_{norm} = \frac{H}{\log_2 K} = \frac{-\sum_{k=1}^{K} \hat{p}_k \log_2(\hat{p}_k)}{\log_2 K}$$

$H_{norm} = 0$ indicates complete concentration; $H_{norm} = 1$ indicates maximum diversity across categories.

⚠️ Natural logarithm ($\ln$) is often used instead of $\log_2$, yielding entropy in nats. The choice of logarithm base affects the numerical value of $H$ but not relative comparisons. DataStatPro uses $\log_2$ (bits) by default.
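The entropy formulas can be sketched as follows, working from a list of category counts. The function names are illustrative assumptions. Empty categories are skipped, which implements the convention that a zero proportion contributes nothing, and normalisation is refused for fewer than two categories because $\log_2 1 = 0$ would make the ratio undefined.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy in bits; categories with zero count contribute 0."""
    n = sum(counts)
    return sum(-(f / n) * math.log2(f / n) for f in counts if f > 0)

def normalised_entropy(counts):
    """Entropy divided by its maximum, log2(K), where K = len(counts)."""
    k = len(counts)
    if k < 2:
        raise ValueError("normalised entropy needs at least two categories")
    return shannon_entropy(counts) / math.log2(k)

h_uniform = shannon_entropy([10, 10, 10, 10])   # four equal categories
h_single = shannon_entropy([40, 0, 0])          # everything in one category
h_norm = normalised_entropy([30, 5, 5])         # one dominant category
```

Note that $K$ is taken here as the number of entries in `counts`, including any empty categories; whether empty categories should count towards $K$ is a reporting decision the sketch does not settle.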

3.7 Herfindahl–Hirschman Index (HHI) and Simpson's Concentration Index

The Herfindahl–Hirschman Index quantifies the degree to which observations are concentrated in a small number of categories:

$$HHI = \sum_{k=1}^{K} \hat{p}_k^2$$

  • $HHI = 1/K$: Minimum concentration; all categories equally frequent.
  • $HHI = 1$: Maximum concentration; all observations in a single category.

The complement, Simpson's diversity index $D$, measures the probability that two observations selected at random (with replacement) belong to different categories:

$$D = 1 - HHI = 1 - \sum_{k=1}^{K} \hat{p}_k^2$$

  • $D = 0$: No diversity (all in one category).
  • $D = (K-1)/K$: Maximum diversity (uniform distribution).

3.8 Qualitative Variation Index (IQV)

The Index of Qualitative Variation (IQV), also attributed to Gibbs and Martin (1962), standardises Simpson's diversity index to $[0, 1]$ regardless of the number of categories:

$$IQV = \frac{K}{K-1} \times D = \frac{K}{K-1} \left(1 - \sum_{k=1}^{K} \hat{p}_k^2\right)$$

  • $IQV = 0$: All observations in one category.
  • $IQV = 1$: All categories equally represented (maximum heterogeneity).

IQV facilitates comparisons of categorical dispersion across variables with different numbers of categories.
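Because HHI, Simpson's diversity and the IQV are all built from the same sum of squared proportions, they can be computed together. The sketch below assumes a list of category counts; the function name `concentration_measures` is hypothetical, and returning NaN for the IQV when there is only one category is a choice made for this example.

```python
def concentration_measures(counts):
    """Return (HHI, Simpson's D, IQV) from a list of category counts."""
    n = sum(counts)
    k = len(counts)
    hhi = sum((f / n) ** 2 for f in counts)
    d = 1 - hhi
    # IQV is undefined for a single category (division by K - 1 = 0)
    iqv = k / (k - 1) * d if k > 1 else float("nan")
    return hhi, d, iqv

hhi_u, d_u, iqv_u = concentration_measures([25, 25, 25, 25])  # uniform, K = 4
hhi_c, d_c, iqv_c = concentration_measures([100, 0, 0, 0])    # fully concentrated
```

For the uniform case the values should sit at the bounds given in the text: HHI at $1/K$, $D$ at $(K-1)/K$, and IQV at 1.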

3.9 The Median for Ordinal Variables

For ordinal variables, the median is the category at which the cumulative relative frequency first reaches or exceeds 0.50:

$$\text{Median} = x_{k^*} \quad \text{where} \quad k^* = \min\{k : P_k \geq 0.50\}$$

The median is more informative than the mode for ordinal data when the distribution is asymmetric, as it captures the central ordering of responses.

3.10 Percentiles and Quartiles for Ordinal Variables

Percentiles for ordinal variables are defined analogously to the median, using the cumulative frequency distribution:

$$\text{Percentile}_q = x_{k^*} \quad \text{where} \quad k^* = \min\{k : P_k \geq q/100\}$$

The interquartile range ($IQR$) describes the middle 50% of ordinal responses and spans from the 25th percentile ($Q_1$) to the 75th percentile ($Q_3$):

$$IQR = Q_3 - Q_1 \quad \text{(in category units)}$$

⚠️ Arithmetic differences between ordinal category labels are not meaningful unless numeric scores are assigned. The IQR for ordinal data should be reported as a range of category labels, not as a single numeric value.
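A sketch of the percentile rule for ordinal data, covering the median and quartiles as special cases. The function name `ordinal_percentile` and the pain-level counts are hypothetical. The comparison is done in integers to avoid floating-point edge cases when a cumulative proportion lands exactly on the target.

```python
def ordinal_percentile(ordered_counts, q):
    """First category whose cumulative proportion reaches q/100.

    ordered_counts: list of (category, count) pairs in ascending rank order.
    """
    n = sum(f for _, f in ordered_counts)
    running = 0
    for cat, f in ordered_counts:
        running += f
        if running * 100 >= q * n:   # same test as running / n >= q / 100
            return cat
    return ordered_counts[-1][0]     # reached only if q > 100

pain = [("none", 12), ("mild", 18), ("moderate", 7), ("severe", 3)]
q1 = ordinal_percentile(pain, 25)
median = ordinal_percentile(pain, 50)
q3 = ordinal_percentile(pain, 75)
```

In line with the warning above, the spread would be reported as the label range from `q1` to `q3` rather than as a numeric difference.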

3.11 The Geometric Mean of Proportions (Diversity)

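For comparing proportional distributions across samples of different sizes, the geometric mean proportion can be used to summarise the average per-category representation:

$$\bar{p}_{geom} = \left(\prod_{k=1}^{K} \hat{p}_k\right)^{1/K}$$

It equals zero whenever any category is empty and reaches its maximum of $1/K$ under a uniform distribution. It is related to entropy but is not a rescaling of $H$: $-\ln(\bar{p}_{geom}) = -\frac{1}{K}\sum_{k=1}^{K} \ln \hat{p}_k$ is the cross-entropy (in nats) between the uniform distribution and $\hat{p}$. It equals the maximum entropy $\ln K$ only when all categories are equally frequent and exceeds $\ln K$ otherwise.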


4. Considerations and Data Quality Checks

4.1 Mutual Exclusivity and Exhaustiveness

The fundamental validity requirement for a categorical variable is that its categories are:

  1. Mutually exclusive: Each observation belongs to exactly one category. If a respondent can select multiple categories (multi-select questions), the variable violates mutual exclusivity and must be restructured (e.g., as multiple binary indicator variables) before standard categorical descriptives can be applied.

  2. Exhaustive: Every possible observation must map to some category. If the category set does not cover all possibilities, an "Other" category must be added.

How to check: Confirm that $\sum_{k=1}^K f_k = N$ (valid observations). If $\sum_{k=1}^K f_k < N$, some observations are unaccounted for.

4.2 Category Labelling Consistency

Inconsistent labelling causes artificial category inflation. Common problems include:

| Problem | Example | Solution |
| --- | --- | --- |
| Case inconsistency | "male" vs. "Male" vs. "MALE" | Standardise case before analysis |
| Leading/trailing spaces | " Yes" vs. "Yes" | Strip whitespace |
| Synonymous labels | "N/A" vs. "Not Applicable" | Merge into one category |
| Abbreviations | "F" vs. "Female" | Choose one consistent label |
| Encoding issues | "Caf" vs. "Café" | Fix encoding and standardise |

DataStatPro flags potential label inconsistencies and offers a category merge tool.
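Outside the application, the same clean-up can be sketched in a few lines. The function name `clean_label` and the synonym mapping are hypothetical. Unicode NFC normalisation is used here only to unify composed and decomposed accented characters; it does not repair text that was decoded with the wrong encoding.

```python
import unicodedata
from collections import Counter

def clean_label(label, synonyms=None):
    """Normalise Unicode form, strip whitespace, fold case, then map synonyms."""
    text = unicodedata.normalize("NFC", label).strip().casefold()
    return (synonyms or {}).get(text, text)

raw = ["male", "Male ", " MALE", "F", "female", "Female"]
synonyms = {"f": "female", "m": "male"}   # hypothetical abbreviation map

before = Counter(raw)                                  # six spurious categories
after = Counter(clean_label(x, synonyms) for x in raw) # two real categories
```

Case folding lowercases every label, so display labels would need to be re-capitalised before reporting.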

4.3 Missing Data Assessment

Before reporting any descriptive statistics, the extent and pattern of missing data must be evaluated:

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Missing count | $N_{miss} = N_{total} - N_{valid}$ | Number of absent responses |
| Missing rate | $N_{miss} / N_{total}$ | Proportion of data missing |
| Valid $N$ rate | $N_{valid} / N_{total}$ | Proportion of usable responses |
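These three metrics can be computed directly from a raw column. In the sketch below, the function name `missing_summary` and the set of values treated as missing (`None`, the empty string and `"NA"`) are assumptions; real datasets may use other markers.

```python
def missing_summary(values, missing_markers=(None, "", "NA")):
    """Return counts and rates of missing and valid observations."""
    n_total = len(values)
    if n_total == 0:
        raise ValueError("empty input")
    n_miss = sum(1 for v in values if v in missing_markers)
    n_valid = n_total - n_miss
    return {
        "n_total": n_total,
        "n_miss": n_miss,
        "n_valid": n_valid,
        "missing_rate": n_miss / n_total,
        "valid_rate": n_valid / n_total,
    }

responses = ["yes", "no", None, "yes", "", "no", "yes", "NA"]
summary = missing_summary(responses)
```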

Missing data mechanisms:

  • MCAR (Missing Completely At Random): Missingness is unrelated to any variable. Complete case analysis is unbiased.
  • MAR (Missing At Random): Missingness depends on observed variables, not the missing value itself. Imputation or weighting is appropriate.
  • MNAR (Missing Not At Random): Missingness depends on the value that is missing (e.g., people with extreme views refusing to disclose). Most problematic; requires sensitivity analyses.

4.4 Sample Size Adequacy

While categorical descriptives can be computed for any $N \geq 1$, interpretability and precision depend on sample size:

| $N$ | Guidance |
| --- | --- |
| $< 10$ | Proportions are highly unstable; report counts only |
| 10–30 | Proportions are reported with wide CIs; interpret cautiously |
| 30–100 | Proportions reasonably stable; report CIs using Wilson's method |
| $> 100$ | Proportions stable; standard CIs appropriate; diversity measures reliable |
| $> 1000$ | Fine-grained proportions meaningful; subgroup breakdowns feasible |

4.5 Rare Categories

Categories with very few observations (e.g., $f_k < 5$) pose challenges:

  • Instability: Proportions based on tiny counts fluctuate widely across samples.
  • Privacy: Small cell counts in sensitive data may enable re-identification.
  • Misleading visuals: Tiny slices in pie charts or bars are hard to read.

Options for handling rare categories:

  1. Retain and flag: Report as-is with a note on small $n$.
  2. Collapse: Merge rare categories with theoretically similar ones.
  3. "Other" grouping: Create a residual "Other" category for all categories below a frequency threshold.
  4. Suppress: Omit categories below a frequency threshold from public reports.
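The "Other" grouping option can be sketched as below. The function name `collapse_rare`, the default threshold of 5 and the defect counts are illustrative assumptions; the appropriate threshold depends on the reporting context.

```python
from collections import Counter

def collapse_rare(counts, threshold=5, other_label="Other"):
    """Merge every category with a count below the threshold into one label."""
    collapsed = Counter()
    for cat, f in counts.items():
        collapsed[cat if f >= threshold else other_label] += f
    return collapsed

defects = {"Type I": 40, "Type II": 12, "Type III": 3, "Type IV": 2}
grouped = collapse_rare(defects)
```

The total count is unchanged by collapsing, so proportions computed afterwards still sum to 1.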

4.6 Ordered vs. Unordered Presentation

For nominal variables, the order in which categories are displayed is arbitrary. Common orderings include:

  • Alphabetical (neutral, reproducible).
  • By descending frequency (highlights dominant categories).
  • By theoretical grouping (e.g., clinical severity).

For ordinal variables, categories must always be presented in their natural rank order (ascending or descending) to preserve the meaning of cumulative frequencies and the median.

4.7 Weighted Data

In survey research, observations are frequently assigned weights to correct for unequal selection probabilities or to make the sample representative of a target population. When weights $w_i$ are present:

$$\hat{f}_k^{(w)} = \sum_{i: X_i = x_k} w_i \qquad \hat{p}_k^{(w)} = \frac{\hat{f}_k^{(w)}}{\sum_{i=1}^N w_i}$$

DataStatPro supports weighted frequency tables when a weight variable is specified. Both unweighted and weighted results are reported side by side.
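For readers computing weighted estimates by hand, a minimal sketch of the weighted proportion formula follows. The function name `weighted_proportions` and the example weights are hypothetical, and observations are assumed to have no missing values or missing weights.

```python
from collections import defaultdict

def weighted_proportions(values, weights):
    """Sum the weights within each category and divide by the total weight."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    total_weight = sum(totals.values())
    return {cat: w / total_weight for cat, w in totals.items()}

answers = ["A", "B", "A", "B"]
weights = [1.0, 3.0, 1.0, 3.0]   # hypothetical survey weights
weighted = weighted_proportions(answers, weights)
```

Unweighted, both categories would have a proportion of 0.5; the weights shift the estimate towards category B.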


5. Types of Categorical Descriptive Measures

5.1 Measures of Frequency

The most direct summaries — counts and proportions — form the foundation of all categorical description.

| Measure | Symbol | Formula | Scale |
| --- | --- | --- | --- |
| Absolute frequency | $f_k$ | Count in category $k$ | Nominal, Ordinal |
| Relative frequency | $\hat{p}_k$ | $f_k / N$ | Nominal, Ordinal |
| Percentage | $\%_k$ | $100 \times f_k / N$ | Nominal, Ordinal |
| Cumulative frequency | $F_k$ | $\sum_{j \leq k} f_j$ | Ordinal only |
| Cumulative % | $P_k$ | $100 \times F_k / N$ | Ordinal only |

5.2 Measures of Central Tendency

| Measure | Formula / Definition | Applicable Scale |
| --- | --- | --- |
| Mode | Category with maximum $f_k$ | Nominal, Ordinal |
| Median | First category at which $P_k \geq 50\%$ | Ordinal only |
| Percentiles ($Q_1$, $Q_3$) | First category at which $P_k \geq$ target | Ordinal only |

5.3 Measures of Dispersion / Heterogeneity

| Measure | Formula | Range | Scale |
| --- | --- | --- | --- |
| Variation ratio | $VR = 1 - \hat{p}_{mode}$ | $[0,\, (K-1)/K]$ | Nominal, Ordinal |
| Shannon entropy | $H = -\sum \hat{p}_k \log_2 \hat{p}_k$ | $[0,\, \log_2 K]$ | Nominal, Ordinal |
| Normalised entropy | $H_{norm} = H / \log_2 K$ | $[0, 1]$ | Nominal, Ordinal |
| Simpson's diversity | $D = 1 - \sum \hat{p}_k^2$ | $[0,\, (K-1)/K]$ | Nominal, Ordinal |
| HHI (concentration) | $HHI = \sum \hat{p}_k^2$ | $[1/K,\, 1]$ | Nominal, Ordinal |
| IQV | $\frac{K}{K-1}(1 - \sum \hat{p}_k^2)$ | $[0, 1]$ | Nominal, Ordinal |
| IQR (category range) | $Q_3 - Q_1$ | Category units | Ordinal only |

5.4 Comparative Descriptives: Subgroup Breakdowns

When a grouping variable $G$ partitions observations into subgroups, categorical descriptives can be computed separately within each group, enabling comparison:

$$\hat{p}_{k|g} = \frac{f_{kg}}{N_g}$$

Where $f_{kg}$ is the count in category $k$ within group $g$ and $N_g$ is the total in group $g$. This is the foundation of cross-tabulation and is reported as a conditional frequency table (see Section 11.1).
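A sketch of the conditional proportion calculation, taking (group, category) pairs as input. The function name `conditional_proportions` and the ward data are hypothetical.

```python
from collections import Counter, defaultdict

def conditional_proportions(pairs):
    """pairs: iterable of (group, category). Returns {group: {category: proportion}}."""
    by_group = defaultdict(Counter)
    for group, category in pairs:
        by_group[group][category] += 1
    result = {}
    for group, counts in by_group.items():
        n_g = sum(counts.values())            # group total N_g
        result[group] = {cat: f / n_g for cat, f in counts.items()}
    return result

records = [
    ("ward1", "mild"), ("ward1", "mild"), ("ward1", "severe"),
    ("ward2", "severe"), ("ward2", "severe"), ("ward2", "mild"), ("ward2", "severe"),
]
profiles = conditional_proportions(records)
```

Each group's proportions sum to 1, so group profiles can be compared even when group sizes differ.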

5.5 Descriptives for Binary (Dichotomous) Variables

A binary variable is a special case of a categorical variable with $K = 2$ categories (typically coded 0/1 or "No"/"Yes"). All standard categorical descriptives apply, but additional simplifications hold:

  • The distribution is fully described by a single proportion $\hat{p} = f_1 / N$ (the proportion in the positive/event category); the complementary proportion is $1 - \hat{p}$.
  • $VR = 1 - \max(\hat{p},\; 1 - \hat{p})$, maximised at $\hat{p} = 0.5$.
  • $HHI = \hat{p}^2 + (1-\hat{p})^2$; $D = 2\hat{p}(1-\hat{p})$, maximised at $\hat{p} = 0.5$.
  • $H = -\hat{p}\log_2\hat{p} - (1-\hat{p})\log_2(1-\hat{p})$ (the binary entropy function).
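The binary simplifications can be collected into one small function of the single proportion. This is a sketch; the name `binary_summary` and the dictionary layout are assumptions made for illustration.

```python
import math

def binary_summary(p):
    """Dispersion measures for a binary variable with event proportion p."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    q = 1 - p
    return {
        "VR": 1 - max(p, q),
        "HHI": p ** 2 + q ** 2,
        "D": 2 * p * q,
        # zero proportions are skipped, following the 0 * log2(0) = 0 convention
        "H": sum(-x * math.log2(x) for x in (p, q) if x > 0),
    }

balanced = binary_summary(0.5)
degenerate = binary_summary(1.0)
```

At a proportion of 0.5 the variation ratio, Simpson's $D$ and the entropy all reach their maxima (0.5, 0.5 and 1 bit respectively).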

6. Using the Categorical Descriptives Calculator Component

The Categorical Descriptives Calculator in DataStatPro provides a fully featured tool for computing, diagnosing, visualising, and reporting descriptive statistics for categorical variables.

Step-by-Step Guide

Step 1 — Navigate to the Component

Go to Descriptive Statistics → Categorical Descriptives.

Step 2 — Input Method

Choose how to provide your data:

  • Raw data: Upload or paste a column of categorical observations. DataStatPro automatically detects the variable type, counts unique categories, and handles missing values.
  • Pre-aggregated frequency table: Enter category labels and their counts directly into the table grid. Useful when you already have a summary table and wish to compute additional descriptive measures from it.
  • Multiple variables: Select two or more categorical columns simultaneously to run batch descriptives across all selected variables in one pass.

Step 3 — Variable Configuration

  • Assign a meaningful variable name and category labels for display.
  • Specify the measurement scale (nominal or ordinal). If ordinal, define the correct rank ordering of categories using the drag-and-drop interface.
  • Designate whether the variable is binary to unlock specialised binary proportion summaries and exact confidence intervals.
  • Specify a grouping variable (optional) to produce stratified breakdowns and conditional frequency tables.
  • Specify a weight variable (optional) to produce weighted frequency estimates.

Step 4 — Missing Data Handling

Select one of the following:

  • Exclude missing (valid $N$ only): All summaries based on $N_{valid}$.
  • Include missing as category: Missing values form an explicit "Missing" category.
  • Report missing separately: Missing counts reported in a separate table; all summaries exclude missing.

Step 5 — Set Display Options

  • ✅ Absolute frequencies ($f_k$).
  • ✅ Relative frequencies / proportions ($\hat{p}_k$).
  • ✅ Percentages ($\%_k$) with optional decimal places.
  • ✅ Cumulative frequencies and cumulative percentages (ordinal variables).
  • ✅ Valid $N$, missing $N$, and total $N$.
  • ✅ Mode (with multi-mode detection).
  • ✅ Median and quartiles (ordinal variables).
  • ✅ Variation ratio, Shannon entropy (raw and normalised), HHI, Simpson's $D$, IQV.
  • ✅ 95% confidence intervals for all proportions (Wilson, Clopper-Pearson, Agresti-Coull, or Wald — selectable in Settings).
  • ✅ Comparison to a reference distribution (goodness-of-fit chi-square).
  • ✅ Weighted estimates (when weight variable specified).
  • ✅ Bar chart (simple, stacked, or grouped).
  • ✅ Pie chart with customisable colour palette.
  • ✅ Donut chart.
  • ✅ Waffle chart (unit square representation).
  • ✅ Lollipop chart.
  • ✅ Diverging bar chart (for ordinal Likert-scale variables).
  • ✅ Cumulative frequency plot (ordinal variables).
  • ✅ APA 7th edition results paragraph (auto-generated).
  • ✅ Publication-ready frequency table (formatted for direct insertion into manuscripts).

Step 6 — Run the Analysis

Click "Compute Categorical Descriptives". DataStatPro will:

  1. Validate data: check mutual exclusivity, label consistency, and missing values.
  2. Compute the full frequency distribution (absolute, relative, cumulative).
  3. Identify the mode(s) and, for ordinal variables, the median and quartiles.
  4. Compute all selected heterogeneity measures ($VR$, $H$, $H_{norm}$, $D$, $HHI$, $IQV$).
  5. Compute Wilson 95% CIs for all proportions.
  6. Generate all selected visualisations with customisable formatting.
  7. Produce the APA-compliant results paragraph and formatted frequency table.

7. Step-by-Step Procedure

7.1 Full Manual Procedure

Step 1 — Identify and Define the Variable

State the variable name, its measurement scale (nominal or ordinal), the population of observations, and all valid categories. Confirm mutual exclusivity and exhaustiveness.

Step 2 — Count Total and Missing Observations

$$N_{total} = N_{valid} + N_{miss}$$

Report $N_{miss}$ and the missing rate $N_{miss}/N_{total}$ explicitly. Decide on missing data handling before proceeding.

Step 3 — Tally Absolute Frequencies

For each category $k = 1, 2, \ldots, K$:

$$f_k = \#\{i : X_i = x_k\}$$

Verify: $\sum_{k=1}^K f_k = N_{valid}$.

Step 4 — Compute Relative Frequencies and Percentages

$$\hat{p}_k = \frac{f_k}{N_{valid}}, \qquad \%_k = 100 \times \hat{p}_k$$

Step 5 — Compute Cumulative Frequencies (Ordinal Variables Only)

$$F_k = \sum_{j=1}^{k} f_j, \qquad P_k = \frac{F_k}{N_{valid}} \times 100$$

Verify: $F_K = N_{valid}$ and $P_K = 100\%$.

Step 6 — Identify the Mode

$$M_o = x_{k^*}, \quad k^* = \underset{k}{\arg\max}\; f_k$$

If multiple categories share the maximum $f_k$, report all co-modes.

Step 7 — Identify the Median (Ordinal Variables Only)

Locate the first category $k^*$ such that $P_{k^*} \geq 50\%$:

$$\text{Median} = x_{k^*}$$

Step 8 — Compute Heterogeneity Measures

Variation ratio:

$$VR = 1 - \hat{p}_{mode}$$

Shannon entropy:

$$H = -\sum_{k=1}^{K} \hat{p}_k \log_2(\hat{p}_k)$$

Normalised entropy:

$$H_{norm} = \frac{H}{\log_2 K}$$

HHI and Simpson's diversity:

$$HHI = \sum_{k=1}^{K} \hat{p}_k^2, \qquad D = 1 - HHI$$

IQV:

$$IQV = \frac{K}{K-1} \times D$$

Step 9 — Compute Confidence Intervals for Proportions

For each $\hat{p}_k$, compute a 95% Wilson CI (recommended):

$$\text{Wilson CI} = \frac{\hat{p}_k + \frac{z^2}{2N} \pm z\sqrt{\frac{\hat{p}_k(1-\hat{p}_k)}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$$

Where $z = 1.960$ for 95% confidence.
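The Wilson formula translates term by term into code. The sketch below is written for illustration and is not DataStatPro's implementation; the function name `wilson_ci` is an assumption, and $N$ is taken to be the number of valid observations.

```python
import math

def wilson_ci(f, n, z=1.96):
    """Wilson score interval for a proportion f / n (z = 1.96 gives 95%)."""
    if n <= 0 or not 0 <= f <= n:
        raise ValueError("require n > 0 and 0 <= f <= n")
    p = f / n
    centre = p + z ** 2 / (2 * n)
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    denominator = 1 + z ** 2 / n
    return (centre - half_width) / denominator, (centre + half_width) / denominator

lower, upper = wilson_ci(18, 40)      # 18 of 40 observations in the category
zero_lower, zero_upper = wilson_ci(0, 40)
```

Unlike the Wald interval, the Wilson interval stays inside $[0, 1]$ and still has positive width when the observed count is zero, which is one reason it is preferred for small samples.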

Step 10 — Construct the Frequency Table

Assemble all computed values into a publication-ready frequency table:

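| Category | $f$ | $\%$ | Cumulative $\%$ | 95% CI |
| --- | --- | --- | --- | --- |
| $x_1$ | $f_1$ | $\%_1$ | $P_1$ | $[LB_1, UB_1]$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| $x_K$ | $f_K$ | $\%_K$ | 100.0% | $[LB_K, UB_K]$ |
| Total | $N$ | 100.0% | | |
| Missing | $N_{miss}$ | | | |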

Step 11 — Produce Visualisations

Select appropriate chart types (see Section 9) and annotate with frequencies or percentages. Ensure all axes are labelled and a title is provided.

Step 12 — Interpret and Report

Use APA reporting guidelines (Section 15). Always report $N_{valid}$, $N_{miss}$, the complete frequency table, the mode, and at minimum the variation ratio or Shannon entropy. For ordinal variables, also report the median and quartiles.


8. Interpreting the Output

8.1 The Frequency Table

The frequency table is the primary output. Read it as follows:

| Observation | Interpretation |
| --- | --- |
| One category has $\hat{p}_k \gg 1/K$ | Distribution is concentrated; modal category dominates |
| All $\hat{p}_k \approx 1/K$ | Distribution is approximately uniform; no dominant category |
| $\hat{p}_k = 1$ for one category | All observations in one category; no variation |
| Small $f_k$ for some categories | Rare categories; consider collapsing or flagging |
| Large $N_{miss}$ | Potential bias; investigate mechanism of missingness |

8.2 Mode Interpretation

| Mode Pattern | Interpretation |
| --- | --- |
| Single clear mode with high $\hat{p}_{mode}$ | Strong consensus around one category |
| Single mode with $\hat{p}_{mode}$ close to $1/K$ | Weakly dominant mode; near-uniform distribution |
| Two co-modes | Bimodal distribution; two competing dominant categories |
| All categories equal | Uniform distribution; mode is uninformative |

8.3 Heterogeneity Measures Interpretation

| Measure | Low Value Indicates | High Value Indicates |
| --- | --- | --- |
| $VR$ | Most observations in modal category | Observations spread across many categories |
| $H$ (Shannon) | Predictable, concentrated distribution | Diverse, uncertain distribution |
| $H_{norm}$ | Near-complete concentration (near 0) | Near-perfect diversity (near 1) |
| $HHI$ | Spread across categories (near $1/K$) | Near-monopoly in one category (near 1) |
| $D$ (Simpson) | Low diversity; one category dominates | High diversity; categories well-represented |
| $IQV$ | Homogeneous distribution (near 0) | Heterogeneous distribution (near 1) |

8.4 Cumulative Frequency Interpretation (Ordinal Variables)

| Cumulative Metric | Interpretation |
| --- | --- |
| $P_k$ first reaches 50% at a low-rank category | Most responses at the lower end; positively (right) skewed |
| $P_k$ first reaches 50% at the middle category | Median in the middle; responses roughly balanced around it |
| $P_k$ first reaches 50% at a high-rank category | Most responses at the upper end; negatively (left) skewed |
| Wide IQR (many categories span $Q_1$ to $Q_3$) | High ordinal variability |
| Narrow IQR | Tight concentration around the median category |

8.5 Confidence Interval Interpretation

| CI Pattern | Interpretation |
| --- | --- |
| Narrow CI around $\hat{p}_k$ | Precise estimate; large $N$ or extreme proportion |
| Wide CI around $\hat{p}_k$ | Imprecise estimate; small $N$ or $\hat{p}_k$ near 0.50 |
| CI excludes a reference value $\pi_0$ | Statistically significant difference from $\pi_0$ at the corresponding level (0.05 for a 95% CI) |
| CIs for two categories overlap | Inconclusive; overlap alone does not show that the proportions are equal, so test the difference formally |

8.6 Contextualising Heterogeneity: Reference Benchmarks

| $H_{norm}$ | Verbal Label | Description |
| --- | --- | --- |
| 0.00–0.20 | Very low diversity | One category overwhelmingly dominant |
| 0.21–0.40 | Low diversity | A few categories contain most observations |
| 0.41–0.60 | Moderate diversity | Several categories reasonably represented |
| 0.61–0.80 | High diversity | No single category clearly dominant |
| 0.81–1.00 | Very high diversity | Observations distributed nearly uniformly |

⚠️ These benchmarks are heuristic guides, not universal standards. Domain context is essential: an $H_{norm} = 0.30$ may indicate healthy diversity in clinical adverse event categories but near-monopoly in a competitive market context. Always interpret heterogeneity measures relative to the theoretical range for the specific $K$ in your variable.


9. Visualising Categorical Data

9.1 Bar Chart

The bar chart (also called a bar graph) is the most widely recommended visualisation for categorical data. Each category is represented by a rectangular bar whose height (or length, for horizontal bars) is proportional to its frequency or proportion.

Best practices:

  • Use vertical bars for a small number of categories ($K \leq 6$) with short labels.
  • Use horizontal bars when labels are long or when $K > 6$.
  • Start the frequency axis at zero — truncating the axis distorts relative comparisons.
  • Sort bars by descending frequency for nominal variables (unless there is a theoretically meaningful order).
  • For ordinal variables, preserve the natural category order.
  • Label each bar with its count, percentage, or both for clarity.
  • Use a single, consistent colour for one-variable displays; reserve colour variation for grouped or stacked charts.

Appropriate for: Nominal and ordinal variables; any $K$; frequency and percentage comparisons.

9.2 Grouped Bar Chart

The grouped bar chart (clustered bar chart) displays the frequency distributions of a categorical variable separately for each level of a grouping variable, with groups of bars placed side by side.

Best practices:

  • Limit to $K \leq 5$ categories and $G \leq 4$ groups to avoid clutter.
  • Use a distinct colour for each group; provide a clear legend.
  • Report percentages within groups (row percentages) when comparing group profiles.

Appropriate for: Comparing the distribution of one categorical variable across multiple groups.

9.3 Stacked Bar Chart

The stacked bar chart represents the proportion of each category stacked within a single bar (or within each group bar). The 100% stacked bar chart is particularly useful for comparing proportional breakdowns across groups.

Best practices:

  • Use 100% stacked bars when comparing proportional composition across groups.
  • Place the most important or interpretively central category consistently (either first or last in the stack).
  • Avoid too many categories in a stack ($K > 5$ makes stacks hard to read).

Appropriate for: Visualising proportional composition; comparing distributions across groups.

9.4 Pie Chart

The pie chart encodes frequency as the angle (and area) of each slice. It is appropriate only when the number of categories is small ($K \leq 5$) and the primary goal is showing part-to-whole relationships.

Limitations:

  • Human perception of angular differences is less accurate than of bar lengths.
  • Very small slices are illegible.
  • Comparison across multiple pie charts is difficult.

When to avoid: When $K > 5$, when categories are similar in size, or when precise comparisons between categories are required. Prefer a bar chart in most cases.

9.5 Donut Chart

The donut chart is a variant of the pie chart with a hollow centre. The centre space can be used to display the total $N$ or a key summary statistic. It shares the limitations of pie charts and should be used with equal care.

9.6 Waffle Chart

The waffle chart (or unit chart) represents proportions as filled cells in a $10 \times 10$ (or similar) grid, where each cell represents 1% (or $1/N$) of the total. Waffle charts are highly accessible and intuitive for general audiences.

Appropriate for: Communicating proportions to non-technical audiences; displaying one or two categories in a clear, visual format.

9.7 Lollipop Chart

The lollipop chart is a space-efficient alternative to the bar chart. Each category is represented by a thin line ("stick") topped with a dot ("lollipop"), whose position encodes frequency or proportion.

Best practices:

  • Sort by descending frequency for nominal variables.
  • Particularly effective for $K > 10$ categories where bars become visually dense.

9.8 Diverging Bar Chart (Likert Scales)

For Likert-scale ordinal variables (e.g., 5-point agree–disagree scales), the diverging bar chart (also called a diverging stacked bar chart) centres the neutral category at zero and extends positive-direction categories to the right and negative-direction categories to the left.

Construction:

  1. Define a neutral midpoint (e.g., "Neither agree nor disagree").
  2. Positive categories extend rightward from the midpoint.
  3. Negative categories extend leftward from the midpoint.
  4. Each half-bar's length is proportional to the percentage in that response category.

Why it is effective: Enables simultaneous visual assessment of the overall agreement/disagreement balance and the distribution across all response options.

9.9 Cumulative Frequency Plot (Ordinal Variables)

The cumulative frequency (ogive) plot graphs cumulative percentage on the $y$-axis against ordered category levels on the $x$-axis. It is the primary tool for visually identifying the median (where the curve crosses 50%), quartiles, and the shape of the ordinal distribution.

Appropriate for: Ordinal variables; assessing cumulative burden or threshold effects.

9.10 Visualisation Selection Guide

| Variable Type | $K$ | Primary Audience | Recommended Chart |
|---|---|---|---|
| Nominal | 2–5 | Technical | Bar chart |
| Nominal | 2–4 | General | Pie chart or waffle chart |
| Nominal | 6+ | Any | Horizontal bar or lollipop |
| Ordinal (non-Likert) | Any | Any | Bar chart (ordered) or cumulative plot |
| Ordinal (Likert) | 4–7 | Any | Diverging bar chart |
| Binary | 2 | Any | Single bar, donut, or waffle |
| Grouped (nominal × group) | 3–6 × 2–4 | Technical | Grouped or stacked bar chart |

10. Confidence Intervals for Proportions

10.1 Why Confidence Intervals Are Essential

Sample proportions are estimates of population proportions. A 95% confidence interval (CI) provides a range of plausible values for the true population proportion $\pi_k$, given the observed $\hat{p}_k$ and sample size $N$. CIs are not optional; they are integral to responsible reporting of proportions.

10.2 Wilson Score Interval

The Wilson score interval is the recommended method for most applications, performing well across all sample sizes and values of $\hat{p}_k$:

$$CI_{Wilson} = \frac{\hat{p}_k + \frac{z^2}{2N} \pm z\sqrt{\frac{\hat{p}_k(1-\hat{p}_k)}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$$

Where $z = 1.960$ for a 95% CI. The Wilson interval maintains coverage probability close to the nominal 95% level even for small $N$ or extreme $\hat{p}_k$ (near 0 or 1).
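The Wilson formula translates directly into a few lines of code. The following is a minimal sketch in plain Python; the function name `wilson_ci` and the clamping of the bounds to $[0, 1]$ (to absorb floating-point noise) are illustrative choices, not DataStatPro's implementation.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a single proportion (illustrative helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp only to remove tiny floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)

low, high = wilson_ci(0, 10)
print(f"0 of 10: [{low:.3f}, {high:.3f}]")
```

Note that a zero count still yields a non-degenerate interval: for 0 of 10 the upper bound is $z^2/(N + z^2)$, which is one reason the Wilson interval behaves well at extreme proportions.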

10.3 Clopper-Pearson Exact Interval

The Clopper-Pearson interval is an exact (conservative) method based on the binomial distribution:

$$CI_{CP} = \left[B\!\left(\frac{\alpha}{2};\; f_k,\; N - f_k + 1\right),\; B\!\left(1 - \frac{\alpha}{2};\; f_k + 1,\; N - f_k\right)\right]$$

Where $B(q;\; a,\; b)$ is the $q$-th quantile of the Beta$(a,b)$ distribution. The Clopper-Pearson interval guarantees that the true coverage is at least $1 - \alpha$, but is typically wider (more conservative) than necessary. Recommended when a conservative guarantee is required (e.g., regulatory contexts).
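The Beta quantiles above are equivalent to inverting the binomial tail probabilities, which allows a dependency-free sketch. The code below solves for the bounds by bisection rather than calling a Beta quantile function; the helper names are illustrative, and the direct use of `math.comb` restricts it to small or moderate $N$ (for large $N$ a statistical library's Beta quantile is the practical route).

```python
import math

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); adequate for small to moderate n."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _solve_decreasing(func, target: float) -> float:
    """Bisection for func(p) == target, where func is decreasing in p on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(f: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) interval via inversion of the binomial CDF."""
    # Upper bound: the p at which P(X <= f) falls to alpha/2.
    upper = 1.0 if f == n else _solve_decreasing(
        lambda p: binom_cdf(f, n, p), alpha / 2)
    # Lower bound: the p at which P(X >= f) = alpha/2, i.e. P(X <= f-1) = 1 - alpha/2.
    lower = 0.0 if f == 0 else _solve_decreasing(
        lambda p: binom_cdf(f - 1, n, p), 1 - alpha / 2)
    return lower, upper

print(clopper_pearson(0, 10))
```

For a zero count the upper bound has the closed form $1 - (\alpha/2)^{1/N}$, which gives a convenient check on the numerical solution.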

10.4 Agresti-Coull Interval

The Agresti-Coull interval is a simple approximation that adjusts the observed proportion by adding $z^2/2$ pseudo-successes and $z^2/2$ pseudo-failures:

$$\tilde{p}_k = \frac{f_k + z^2/2}{N + z^2}, \qquad \tilde{N} = N + z^2$$

$$CI_{AC} = \tilde{p}_k \pm z\sqrt{\frac{\tilde{p}_k(1-\tilde{p}_k)}{\tilde{N}}}$$

For $z = 1.96$, this adds approximately 2 pseudo-successes and 2 pseudo-failures. The Agresti-Coull interval is computationally simple, nearly as accurate as Wilson's method, and performs well for $N \geq 10$.

10.5 Wald Interval

The Wald interval is the classic textbook method:

$$CI_{Wald} = \hat{p}_k \pm z\sqrt{\frac{\hat{p}_k(1-\hat{p}_k)}{N}}$$

Limitations: The Wald interval can produce lower bounds below 0 or upper bounds above 1 for extreme proportions. It has poor coverage properties for small $N$ or when $\hat{p}_k$ is near 0 or 1. Use Wilson or Agresti-Coull instead.
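A small numerical comparison makes the Wald limitations concrete. The sketch below implements the Wald and Agresti-Coull formulas as written above (function names are illustrative) and applies them to two hypothetical small-sample counts.

```python
import math

def wald_ci(f: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n), untruncated."""
    p_hat = f / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def agresti_coull_ci(f: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Agresti-Coull interval: Wald form applied to the adjusted p-tilde and N-tilde."""
    n_tilde = n + z**2
    p_tilde = (f + z**2 / 2) / n_tilde
    half = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half

# Hypothetical counts chosen to expose the failure modes.
for f, n in [(0, 20), (1, 20)]:
    print(f"{f}/{n}  Wald={wald_ci(f, n)}  Agresti-Coull={agresti_coull_ci(f, n)}")
```

With 0 of 20 the Wald interval collapses to zero width, and with 1 of 20 its lower bound is negative. The untruncated Agresti-Coull bounds can also stray slightly outside $[0, 1]$ for extreme counts, so in practice they are usually clipped to that range.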

10.6 CI Method Comparison

| Method | Recommended For | Coverage | Notes |
|---|---|---|---|
| Wilson Score | General use; any $N$ | Excellent | Best default choice |
| Clopper-Pearson | Small $N$; conservative guarantee required | Conservative | Wider than necessary for large $N$ |
| Agresti-Coull | Simplicity; $N \geq 10$ | Very good | Slightly wider than Wilson |
| Wald | Large $N$ ($> 100$); $\hat{p}_k$ not extreme | Good only for large $N$ | Fails for small $N$ or extreme $\hat{p}_k$ |

10.7 Simultaneous CIs for Multiple Proportions

When reporting CIs for all $K$ proportions simultaneously, the familywise confidence level is not maintained at 95%: each individual CI has 95% coverage, but the joint coverage is lower. To maintain joint 95% coverage:

Bonferroni-adjusted CI: Use $z = z_{\alpha/(2K)}$ instead of $z_{0.025}$.

For $K = 4$ categories at 95% joint confidence: $z = z_{0.00625} \approx 2.50$.
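The Bonferroni-adjusted critical value can be obtained from the standard normal quantile function. This sketch uses `statistics.NormalDist` from the Python standard library; the function name `bonferroni_z` is illustrative.

```python
from statistics import NormalDist

def bonferroni_z(k: int, alpha: float = 0.05) -> float:
    """Two-sided critical value z_{alpha/(2K)} for K simultaneous intervals."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return NormalDist().inv_cdf(1 - alpha / (2 * k))

for k in (1, 2, 4, 6):
    print(k, round(bonferroni_z(k), 3))
```

The returned value replaces 1.96 in whichever interval formula is used, for example the Wilson formula, to obtain intervals with joint coverage of at least $1 - \alpha$.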

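10.8 CI Width as a Function of $N$ and $\hat{p}_k$

For moderate to large $N$, the total width of the Wilson 95% CI is approximately $2 \times 1.96 \times \sqrt{\hat{p}(1-\hat{p})/N}$, which is maximised at $\hat{p} = 0.5$. The table below lists the corresponding half-width (margin of error), $1.96\sqrt{\hat{p}(1-\hat{p})/N}$; at $\hat{p} = 0.50$ and small $N$, the exact Wilson interval is somewhat narrower than this approximation.

Approximate CI half-width for $\hat{p} = 0.50$:

| $N$ | Approximate 95% CI Half-Width |
|---|---|
| 20 | ±0.219 |
| 50 | ±0.138 |
| 100 | ±0.098 |
| 200 | ±0.069 |
| 500 | ±0.044 |
| 1000 | ±0.031 |
| 5000 | ±0.014 |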
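The ± values in the table above can be reproduced from the large-sample expression $1.96\sqrt{\hat{p}(1-\hat{p})/N}$. A minimal sketch, with an illustrative function name:

```python
import math

def approx_margin(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Large-sample margin of error (half-width) for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (20, 50, 100, 200, 500, 1000, 5000):
    print(f"N={n:>5}  ±{approx_margin(n):.3f}")
```

Because the margin shrinks with $\sqrt{N}$, halving it requires roughly four times as many observations.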

11. Advanced Topics

11.1 Conditional Frequency Tables and Subgroup Comparisons

When a categorical outcome variable is examined across levels of a grouping variable, the result is a conditional frequency table (cross-tabulation). Each row or column shows the distribution of the outcome variable within a subgroup:

$$\hat{p}_{k|g} = \frac{f_{kg}}{N_g}$$

Comparing $\hat{p}_{k|g}$ across groups reveals whether the distribution of categories differs between subgroups. Formal inferential testing of such differences is the domain of the chi-square test of association.

⚠️ When comparing conditional distributions across groups, report within-group percentages (row percentages when groups define rows), not overall percentages. Reporting overall percentages when groups differ in size produces misleading comparisons.

11.2 Standardisation and Reweighting

When comparing categorical distributions across samples with different population structures (e.g., different age compositions), direct standardisation weights each group's proportions to a common reference population:

$$\hat{p}_k^{std} = \sum_{g=1}^G w_g^{ref} \times \hat{p}_{k|g}$$

Where $w_g^{ref}$ is the proportion of the reference population in group $g$. This removes the confounding effect of group composition and enables fair comparisons across samples.
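Direct standardisation is a weighted sum over groups, as the following sketch shows. The group labels, conditional proportions, and reference weights are hypothetical values chosen purely for illustration.

```python
def direct_standardise(cond_props: dict, ref_weights: dict) -> dict:
    """Weight within-group proportions p_{k|g} by reference weights w_g.

    cond_props:  {group: {category: proportion within that group}}
    ref_weights: {group: share of the reference population}, summing to 1
    """
    if abs(sum(ref_weights.values()) - 1.0) > 1e-9:
        raise ValueError("reference weights must sum to 1")
    categories = next(iter(cond_props.values())).keys()
    return {
        k: sum(ref_weights[g] * cond_props[g][k] for g in ref_weights)
        for k in categories
    }

# Hypothetical example: two age groups, binary outcome.
cond = {"young": {"yes": 0.60, "no": 0.40},
        "old":   {"yes": 0.20, "no": 0.80}}
weights = {"young": 0.5, "old": 0.5}
print(direct_standardise(cond, weights))
```

Applying the same reference weights to every sample being compared is what makes the standardised proportions comparable.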

11.3 Goodness-of-Fit: Comparing to a Theoretical Distribution

When a theoretical or historical distribution $\{\pi_{0,1}, \pi_{0,2}, \ldots, \pi_{0,K}\}$ exists for a variable, the observed proportions can be tested against it using the chi-square goodness-of-fit test:

$$\chi^2 = \sum_{k=1}^{K} \frac{(f_k - E_k)^2}{E_k}, \quad E_k = N \times \pi_{0,k}$$

With $\nu = K - 1$ degrees of freedom. DataStatPro integrates this test directly into the Categorical Descriptives output when a reference distribution is supplied.
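The goodness-of-fit statistic is simple to compute by hand or in code. The sketch below returns the statistic, the degrees of freedom, and the expected counts; it deliberately stops short of a p-value, which requires the chi-square distribution function from a statistical library. The function name and the uniform reference distribution are illustrative assumptions; the counts are the transport-mode frequencies from Worked Example 1.

```python
def chi_square_gof(observed: list, expected_props: list):
    """Chi-square goodness-of-fit statistic against reference proportions."""
    if abs(sum(expected_props) - 1.0) > 1e-9:
        raise ValueError("reference proportions must sum to 1")
    n = sum(observed)
    expected = [n * pi for pi in expected_props]
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df, expected

# Transport-mode counts tested against a (hypothetical) uniform reference.
stat, df, expected = chi_square_gof([102, 78, 41, 26], [0.25] * 4)
print(f"chi-square = {stat:.3f}, df = {df}")
```

The per-category terms $(f_k - E_k)^2 / E_k$ show which categories contribute most to the overall departure from the reference distribution.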

11.4 Detecting Digit Preference and Response Bias

In survey data, systematic response biases cause disproportionate selection of certain categories:

  • Acquiescence bias: Tendency to agree regardless of question content, inflating higher Likert categories.
  • Centrality bias: Over-selection of the neutral/middle category.
  • Extremity bias: Over-selection of the highest and lowest categories.
  • Social desirability bias: Over-reporting of socially preferred responses.

Detection methods include comparing the observed distribution to an expected uniform distribution and inspecting standardised residuals from a goodness-of-fit test.

11.5 Benford's Law for Categorical First Digits

In datasets with naturally occurring numeric counts (e.g., city populations, financial transaction amounts), Benford's Law predicts that the first significant digit $d$ follows the distribution:

$$P(D = d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \quad d = 1, 2, \ldots, 9$$

Significant departures from Benford's Law — assessed via a chi-square goodness-of-fit test on the first-digit frequency distribution — can flag data fabrication or anomalies in certain contexts.
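The expected Benford proportions and the observed first-digit frequencies are the two inputs to such a goodness-of-fit test. A minimal sketch follows; the helper `first_digit` and the sample amounts are illustrative assumptions.

```python
import math
from collections import Counter

# Expected first-digit proportions under Benford's Law.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x: float) -> int:
    """First significant digit of a non-zero number, read from scientific notation."""
    if x == 0:
        raise ValueError("zero has no first significant digit")
    return int(f"{abs(x):e}"[0])

# Hypothetical amounts, used only to show the tabulation step.
amounts = [123.4, 0.0042, 987, 1500, 19.99, 2.5]
observed = Counter(first_digit(a) for a in amounts)

for d in range(1, 10):
    print(d, f"expected={benford[d]:.3f}", f"observed={observed.get(d, 0)}")
```

In a real application the observed counts would come from hundreds or thousands of values, since expected counts for the higher digits are small.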

11.6 Entropy-Based Feature Selection

In machine learning and data science, information gain and related entropy-based metrics use Shannon entropy to assess the predictive value of a categorical feature $X$ for an outcome variable $Y$:

$$IG(Y; X) = H(Y) - H(Y \mid X)$$

Where $H(Y \mid X) = \sum_{k} \hat{p}(X=k) \times H(Y \mid X=k)$ is the conditional entropy of $Y$ given $X$. Features with high information gain are more useful predictors. DataStatPro reports $H(Y)$, $H(Y \mid X)$, and $IG(Y; X)$ in the advanced output panel when an outcome variable is specified.
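Information gain follows directly from the two entropy terms. The sketch below works on raw label sequences; the function names and the toy data are illustrative, and this is not DataStatPro's implementation.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy (bits) of a sequence of category labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(x, y) -> float:
    """IG(Y; X) = H(Y) - H(Y | X) for paired categorical sequences x and y."""
    n = len(y)
    h_cond = 0.0
    for value, count in Counter(x).items():
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        h_cond += (count / n) * entropy(subset)  # weight by p(X = value)
    return entropy(y) - h_cond

# Toy data: the first feature determines y; the second carries no information.
y = [0, 0, 1, 1]
print(information_gain(["a", "a", "b", "b"], y))
print(information_gain(["a", "b", "a", "b"], y))
```

A feature that perfectly separates the outcome recovers all of $H(Y)$, while a feature independent of the outcome has zero gain.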

11.7 Sampling Weights and Complex Survey Design

Nationally representative surveys typically use complex sampling designs (stratification, clustering, unequal selection probabilities). In such cases:

  • Design-weighted proportions $\hat{p}_k^{(w)}$ correctly estimate population proportions.
  • Unweighted proportions estimate the sample distribution only.
  • Variance estimation must account for the sampling design (Taylor linearisation or bootstrap replication), not just the binomial formula.

DataStatPro supports Taylor-linearised variance estimation for weighted proportions when design variables (stratum, cluster, weight) are specified.

11.8 Tracking Changes in Proportions Over Time

When the same categorical variable is measured at multiple time points, tracking the change in proportions over time reveals trends. Visualisation options include:

  • Line chart of proportions over time (one line per category).
  • Stacked area chart (visualises changing composition).
  • Small multiples (one bar chart per time point).

Formal testing of temporal trends in proportions can be done using the Cochran-Armitage trend test (for binary outcomes) or regression models for categorical outcomes.

11.9 Inter-Rater Reliability for Categorical Classifications

When the same observations are classified independently by two or more raters, the agreement between raters is quantified by Cohen's kappa ($\kappa$) or Fleiss' kappa (for three or more raters):

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

Where $P_o = \sum_k \hat{p}_{kk}$ is the observed agreement proportion (sum of diagonal proportions in the $K \times K$ agreement table) and $P_e = \sum_k \hat{p}_{k\cdot}\hat{p}_{\cdot k}$ is the expected agreement by chance.

| $\kappa$ | Verbal Label |
|---|---|
| $< 0.00$ | Less than chance |
| 0.00 – 0.20 | Slight |
| 0.21 – 0.40 | Fair |
| 0.41 – 0.60 | Moderate |
| 0.61 – 0.80 | Substantial |
| 0.81 – 1.00 | Almost perfect |
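Cohen's kappa for two raters can be computed from the $K \times K$ agreement table of counts. A minimal sketch, with an illustrative function name and a hypothetical $2 \times 2$ table:

```python
def cohens_kappa(table: list) -> float:
    """Cohen's kappa from a square agreement table of counts.

    table[i][j] = number of items placed in category i by rater A
    and in category j by rater B.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n            # observed agreement
    row_marg = [sum(table[i]) / n for i in range(k)]         # rater A marginals
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of 50 items by two raters.
agreement = [[20, 5],
             [10, 15]]
print(round(cohens_kappa(agreement), 3))
```

Here the raters agree on 70% of items, but half of that agreement is expected by chance given the marginals, which is why kappa is well below the raw agreement rate.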

12. Worked Examples

Example 1: Nominal Variable — Preferred Mode of Transport ($K = 4$)

A transport planning survey collects preferred commuting mode from $N = 250$ adults. There are 3 missing responses. The valid responses ($N_{valid} = 247$) are:

| Mode | Count |
|---|---|
| Car | 102 |
| Public Transport | 78 |
| Cycling | 41 |
| Walking | 26 |

Step 1 — Frequencies and Proportions:

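| Mode | $f$ | $\hat{p}$ | $\%$ | 95% Wilson CI |
|---|---|---|---|---|
| Car | 102 | .413 | 41.3% | [35.5%, 47.4%] |
| Public Transport | 78 | .316 | 31.6% | [26.1%, 37.6%] |
| Cycling | 41 | .166 | 16.6% | [12.4%, 22.0%] |
| Walking | 26 | .105 | 10.5% | [7.2%, 15.1%] |
| Total (valid) | 247 | 1.000 | 100% | |
| Missing | 3 | | | |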

Step 2 — Mode:

$M_o = \text{Car}$ ($f = 102$, $\hat{p} = .413$). Unimodal.

Step 3 — Heterogeneity Measures:

$$VR = 1 - 0.413 = 0.587$$

$$H = -(0.413\log_2 0.413 + 0.316\log_2 0.316 + 0.166\log_2 0.166 + 0.105\log_2 0.105)$$

$$= -(0.413 \times (-1.275) + 0.316 \times (-1.662) + 0.166 \times (-2.590) + 0.105 \times (-3.252))$$

$$= -(-0.527 - 0.525 - 0.430 - 0.341) = 1.823 \text{ bits}$$

$$H_{norm} = \frac{1.823}{\log_2 4} = \frac{1.823}{2.000} = 0.912$$

$$HHI = 0.413^2 + 0.316^2 + 0.166^2 + 0.105^2 = 0.171 + 0.100 + 0.028 + 0.011 = 0.310$$

$$D = 1 - 0.310 = 0.690$$

$$IQV = \frac{4}{3} \times 0.690 = 0.920$$

Interpretation: Car is the modal category (41.3%), but it accounts for well under half of responses, and a substantial minority use public transport (31.6%). The very high $H_{norm} = 0.912$ and $IQV = 0.920$ indicate a highly diverse distribution, with responses spread across all four categories.

APA write-up: "Among 247 valid respondents (3 missing), the most frequently preferred commuting mode was car ($n = 102$, 41.3%, 95% CI [35.5%, 47.4%]), followed by public transport ($n = 78$, 31.6%, 95% CI [26.1%, 37.6%]), cycling ($n = 41$, 16.6%, 95% CI [12.4%, 22.0%]), and walking ($n = 26$, 10.5%, 95% CI [7.2%, 15.1%]). The distribution showed high heterogeneity (Shannon entropy = 1.82 bits, $H_{norm}$ = 0.91, IQV = 0.92)."
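The heterogeneity measures in this example can be checked with a short script. The sketch below computes all six measures from raw counts using the formulas from the cheat sheet; the function name is illustrative. Because it works from unrounded proportions, results can differ from the hand calculation in the third decimal place.

```python
import math

def heterogeneity(counts: list) -> dict:
    """Variation ratio, entropy, HHI, Simpson's D and IQV from category counts."""
    k = len(counts)
    if k < 2:
        raise ValueError("need at least two categories")
    n = sum(counts)
    props = [c / n for c in counts]
    # Zero-count categories contribute nothing (convention: 0 * log2(0) = 0).
    h = -sum(p * math.log2(p) for p in props if p > 0)
    hhi = sum(p * p for p in props)
    return {
        "VR": 1 - max(props),
        "H": h,
        "H_norm": h / math.log2(k),
        "HHI": hhi,
        "D": 1 - hhi,
        "IQV": k / (k - 1) * (1 - hhi),
    }

# Car, Public Transport, Cycling, Walking
for name, value in heterogeneity([102, 78, 41, 26]).items():
    print(f"{name:7s} {value:.3f}")
```

Passing $K$ implicitly as the length of the count list means zero-frequency categories must be included in the input if they are part of the variable's defined category set.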


Example 2: Ordinal Variable — Patient Satisfaction (5-Point Scale)

A hospital surveys $N = 180$ patients on their satisfaction with care (5-point ordinal scale: Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied). There are no missing values.

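| Satisfaction | $f$ | $\%$ | Cumulative $f$ | Cumulative $\%$ | 95% Wilson CI |
|---|---|---|---|---|---|
| Very Dissatisfied | 8 | 4.4% | 8 | 4.4% | [2.2%, 8.6%] |
| Dissatisfied | 19 | 10.6% | 27 | 15.0% | [6.9%, 15.9%] |
| Neutral | 28 | 15.6% | 55 | 30.6% | [10.9%, 21.5%] |
| Satisfied | 72 | 40.0% | 127 | 70.6% | [33.3%, 47.1%] |
| Very Satisfied | 53 | 29.4% | 180 | 100.0% | [23.2%, 36.5%] |
| Total | 180 | 100% | | | |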

Mode: Satisfied ($f = 72$, $\hat{p} = .400$).

Median: The cumulative percentage first reaches or exceeds 50% at "Satisfied" ($P_4 = 70.6\%$). Median = Satisfied.

Quartiles:

  • $Q_1$ (25th percentile): First category where $P_k \geq 25\%$ → $P_3 = 30.6\%$ → $Q_1 =$ Neutral
  • $Q_3$ (75th percentile): First category where $P_k \geq 75\%$ → $P_5 = 100\%$ → $Q_3 =$ Very Satisfied
  • $IQR =$ Neutral to Very Satisfied (spans 3 category levels)
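The median and quartile rules used here (first category whose cumulative share reaches the target) can be expressed as a single helper. A minimal sketch, with an illustrative function name, applied to the satisfaction counts from this example:

```python
def ordinal_quantile(categories: list, counts: list, q: float):
    """First category (in rank order) whose cumulative share reaches q."""
    n = sum(counts)
    cumulative = 0
    for category, f in zip(categories, counts):
        cumulative += f
        if cumulative >= q * n:   # compare counts to avoid rounding in percentages
            return category
    return categories[-1]

levels = ["Very Dissatisfied", "Dissatisfied", "Neutral",
          "Satisfied", "Very Satisfied"]
freqs = [8, 19, 28, 72, 53]

print("Q1:    ", ordinal_quantile(levels, freqs, 0.25))
print("Median:", ordinal_quantile(levels, freqs, 0.50))
print("Q3:    ", ordinal_quantile(levels, freqs, 0.75))
```

The categories must be supplied in their rank order; the function has no way to infer it from the labels.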

Heterogeneity:

$$VR = 1 - 0.400 = 0.600$$

$$H = -(0.044\log_2 0.044 + 0.106\log_2 0.106 + 0.156\log_2 0.156 + 0.400\log_2 0.400 + 0.294\log_2 0.294)$$

$$\approx -(-0.198 - 0.343 - 0.418 - 0.529 - 0.519) = 2.007 \text{ bits}$$

$$H_{norm} = \frac{2.007}{\log_2 5} = \frac{2.007}{2.322} = 0.864$$

Interpretation: The distribution is negatively skewed: responses concentrate in the higher satisfaction categories, with a tail toward dissatisfaction. The median and mode both fall at "Satisfied", and over 69% of patients rated their care as Satisfied or Very Satisfied. $H_{norm} = 0.864$ indicates high variability across response options.

APA write-up: "Patient satisfaction ratings ($N = 180$) showed a negatively skewed distribution. The modal response was Satisfied ($n = 72$, 40.0%, 95% CI [33.3%, 47.1%]) and the median was Satisfied ($Q_1 =$ Neutral; $Q_3 =$ Very Satisfied). A combined 69.4% of patients rated their care as Satisfied or Very Satisfied. The distribution showed high heterogeneity ($H_{norm} = 0.86$, $VR = 0.60$)."


Example 3: Binary Variable — Vaccination Status

A public health register records vaccination status (vaccinated/not vaccinated) for $N = 1{,}192$ individuals, of whom $N_{miss} = 14$ have no recorded status, leaving $N_{valid} = 1{,}178$.

| Status | $f$ | $\%$ | 95% Wilson CI |
|---|---|---|---|
| Vaccinated | 934 | 79.3% | [76.8%, 81.6%] |
| Not Vaccinated | 244 | 20.7% | [18.4%, 23.2%] |
| Total (valid) | 1,178 | 100% | |
| Missing | 14 | 1.2% | |

Mode: Vaccinated ($\hat{p} = .793$).

Binary entropy:

$$H = -(0.793\log_2 0.793 + 0.207\log_2 0.207) = -(-0.2653 - 0.4704) = 0.736 \text{ bits}$$

$$H_{norm} = \frac{0.736}{\log_2 2} = \frac{0.736}{1.000} = 0.736$$

$$VR = 1 - 0.793 = 0.207$$

Interpretation: Approximately 79.3% of individuals with a valid record are vaccinated, and the 95% CI places the true coverage between 76.8% and 81.6%. This falls below a 95% coverage target (the threshold commonly cited for measles herd immunity). The low $VR = 0.207$ shows that one category (vaccinated) dominates, yet a non-negligible 20.7% remain unvaccinated. $H_{norm} = 0.736$ is still fairly high because binary entropy declines slowly as the split moves away from 50/50, so $VR$ is the more telling measure here.

APA write-up: "Of 1,178 individuals with valid vaccination records (14 missing, 1.2%), 934 (79.3%, 95% CI [76.8%, 81.6%]) were vaccinated and 244 (20.7%, 95% CI [18.4%, 23.2%]) were not vaccinated. Coverage fell below the 95% target threshold."


Example 4: Subgroup Breakdown — Grade Distribution by Teaching Method

Building on the teaching method data from the chi-square tutorial, a researcher reports the grade distribution for each teaching method ($N = 210$, 70 per method).

Conditional Frequency Table (Row Percentages):

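| Method | A | B | C | D/F | Total |
|---|---|---|---|---|---|
| Lecture | 12 (17.1%) | 23 (32.9%) | 22 (31.4%) | 13 (18.6%) | 70 |
| Flipped | 24 (34.3%) | 27 (38.6%) | 16 (22.9%) | 3 (4.3%) | 70 |
| Online | 9 (12.9%) | 17 (24.3%) | 28 (40.0%) | 16 (22.9%) | 70 |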

Modes: Lecture = B; Flipped = B; Online = C.

Medians: Lecture = B; Flipped = B; Online = C.

Shannon Entropy by Group:

| Method | $H$ (bits) | $H_{norm}$ | $VR$ | Interpretation |
|---|---|---|---|---|
| Lecture | 1.940 | 0.970 | 0.671 | High variability; grades spread across all levels |
| Flipped | 1.741 | 0.871 | 0.614 | Lowest variability of the three; concentrated at upper grades |
| Online | 1.892 | 0.946 | 0.600 | High variability; concentrated at lower grades |

Interpretation: All three methods show moderate-to-high grade variability. The flipped classroom has the highest proportion of A grades (34.3%) and the lowest D/F rate (4.3%), while the online method shows the highest C and D/F rates. Formal testing of these differences is provided by the chi-square test of association.


13. Common Mistakes and How to Avoid Them

Mistake 1: Computing a Mean or Standard Deviation for a Nominal Variable

Problem: Assigning arbitrary numeric codes to categories (e.g., 1 = Male, 2 = Female, 3 = Non-binary) and computing their mean or standard deviation. The resulting number is arithmetically computable but statistically meaningless — the numeric codes carry no magnitude information.

Solution: For nominal variables, report only frequency, proportion, and mode. If a numeric summary of a categorical variable is needed for modelling, create appropriate dummy/indicator variables.


Mistake 2: Treating Ordinal Variables as Fully Continuous

Problem: Computing the arithmetic mean of ordinal scale responses (e.g., mean Likert score = 3.47) as if the intervals between categories were equal. The mean assumes equal spacing; ordinal categories have no such guarantee.

Solution: For ordinal variables, report the median and IQR as the primary central tendency and spread measures. Report the mode as a supplementary measure. Computing means of ordinal variables is acceptable as a pragmatic convention in some fields (notably Likert-scale research), but must be explicitly acknowledged and defended.


Mistake 3: Failing to Report Missing Values

Problem: Computing and reporting proportions from valid observations only, without disclosing the number of missing values. This gives readers no way to assess whether missingness is substantial enough to bias the results.

Solution: Always report both $N_{valid}$ and $N_{miss}$ (and the missing rate). Investigate whether missingness is systematic. Report results for both complete-case and missing-included analyses when missingness is substantial ($> 5\%$).


Mistake 4: Reporting Only Counts Without Proportions (or Vice Versa)

Problem: Reporting only absolute frequencies makes comparisons across groups of different sizes misleading. Reporting only proportions without counts obscures the precision of estimates (a proportion of 50% based on $N = 4$ is very different from one based on $N = 4000$).

Solution: Always report both absolute frequency and proportion (or percentage) in frequency tables. Include $N_{valid}$ so readers can recover the raw counts from percentages.


Mistake 5: Using a Pie Chart for More Than 5 Categories

Problem: A pie chart with 6 or more slices becomes illegible. Slices of similar size are virtually indistinguishable, and small categories vanish. Over-reliance on pie charts is one of the most widely cited visualisation errors.

Solution: For $K > 5$, use a horizontal bar chart sorted by frequency or a lollipop chart. Reserve pie charts for $K \leq 5$ when part-to-whole relationships are the primary message.


Mistake 6: Ignoring Category Order for Ordinal Variables

Problem: Sorting ordinal categories alphabetically or by frequency rather than in their natural rank order (e.g., displaying satisfaction responses as: High, Low, Medium). This disrupts the cumulative frequency interpretation, makes cumulative plots meaningless, and confuses readers.

Solution: For ordinal variables, always display categories in their meaningful rank order. DataStatPro enforces the user-specified rank ordering in all tables and charts when the variable is designated as ordinal.


Mistake 7: Reporting the Wald Interval for Small Samples or Extreme Proportions

Problem: The Wald CI $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/N}$ can produce bounds below 0 or above 1, and has poor coverage when $N < 30$, $\hat{p} < 0.10$, or $\hat{p} > 0.90$.

Solution: Use the Wilson score interval as the default for all sample sizes. Use the Clopper-Pearson exact interval when a conservative guarantee is required. DataStatPro defaults to Wilson intervals.


Mistake 8: Comparing Distributions Using Overall Proportions Instead of Conditional Proportions

Problem: When comparing the distribution of a variable across groups of unequal size, reporting overall proportions conflates the group composition with the variable distribution. For example, if 90% of respondents are female and more females prefer Brand A, overall Brand A preference will be high not because it is universally preferred, but because females are overrepresented.

Solution: Always compute and report conditional (within-group) proportions when comparing distributions across groups. Use row percentages when groups are defined by rows.


Mistake 9: Conflating Diversity Measures Across Variables with Different $K$

Problem: Comparing the raw Shannon entropy $H$ of a 3-category variable ($H_{max} = \log_2 3 = 1.585$ bits) with that of a 6-category variable ($H_{max} = \log_2 6 = 2.585$ bits). A higher raw $H$ may simply reflect more categories, not greater relative diversity.

Solution: Compare variables using normalised entropy $H_{norm} = H / \log_2 K$ or IQV, both of which are bounded $[0, 1]$ regardless of $K$.


Mistake 10: Concluding Category Absence from a Zero Frequency

Problem: A category with $f_k = 0$ in the sample may exist in the population but was not observed due to small sample size or sampling variability. Declaring the category "absent" may be premature.

Solution: Report zero-frequency categories explicitly in the table. Compute the 95% CI for $\hat{p}_k = 0$ using the Clopper-Pearson upper bound: $CI_{upper} = B(1 - \alpha/2;\; 1,\; N)$, which gives a plausible upper bound for the true proportion.


14. Troubleshooting

| Problem | Likely Cause | Solution |
|---|---|---|
| Proportions do not sum to 1.000 | Rounding in display; missing values excluded | Report "rounding may cause totals to differ from 100%"; verify $\sum f_k = N_{valid}$ |
| More categories than expected | Label inconsistencies (case, spaces, synonyms) | Use DataStatPro's category merge/clean tool; standardise labels |
| Mode is reported as multiple categories | Two or more categories share the maximum frequency | Report all co-modes and describe as a bimodal or multimodal distribution |
| Cumulative percentage does not reach 100% | Missing category or rounding | Verify all categories are included; check $\sum f_k = N_{valid}$ |
| Confidence interval lower bound is negative (Wald) | Small $N$ or extreme $\hat{p}$ with Wald method | Switch to Wilson or Clopper-Pearson interval; both are bounded $[0,1]$ |
| $H_{norm} = 0$ | All observations in one category | Expected result; no variation present; report $VR = 0$ |
| $H_{norm} = 1$ | Perfectly uniform distribution | All categories equally represented; may reflect a small sample or genuine uniformity |
| $IQV > 1$ | Computation error or incorrect $K$ | Verify formula; $IQV$ is bounded $[0,1]$ by construction |
| Median not defined | Cumulative % equals exactly 50% at a category boundary, so half the observations lie at or below one category and half above it | Report median as the lower of the two surrounding categories; use interpolation if numeric scores assigned |
| Weighted proportions differ greatly from unweighted | Significant over/undersampling of some groups | Expected in complex surveys; report both; weighted estimates are for population inference |
| All cells have very small counts ($< 5$) | Very small total $N$ | Report counts only; CIs will be very wide; caution against over-interpreting proportions |
| Chart shows categories in wrong order | Default alphabetical sorting applied to ordinal variable | Specify ordinal order in DataStatPro's variable settings; reorder manually in chart editor |
| Missing rate is very high ($> 20\%$) | Non-response bias; data collection issue | Investigate MCAR/MAR/MNAR mechanism; perform sensitivity analysis; consider imputation |
| Entropy calculated as negative or undefined | $0 \times \log(0)$ not set to 0 | Apply the convention $0 \times \log_2(0) \equiv 0$, justified by the limit $p \log_2 p \to 0$ as $p \to 0$ (L'Hôpital's rule); check software implementation |

15. Quick Reference Cheat Sheet

Core Equations

| Formula | Description |
|---|---|
| $\hat{p}_k = f_k / N_{valid}$ | Sample proportion in category $k$ |
| $\%_k = 100 \times f_k / N_{valid}$ | Percentage in category $k$ |
| $F_k = \sum_{j \leq k} f_j$ | Cumulative frequency up to category $k$ (ordinal) |
| $M_o = x_{k^*},\; k^* = \arg\max f_k$ | Mode: most frequent category |
| $\text{Median} = x_{k^*},\; k^* = \min\{k : P_k \geq 0.50\}$ | Median category (ordinal only) |
| $VR = 1 - \hat{p}_{mode}$ | Variation ratio |
| $H = -\sum_{k} \hat{p}_k \log_2 \hat{p}_k$ | Shannon entropy (bits) |
| $H_{norm} = H / \log_2 K$ | Normalised entropy $\in [0,1]$ |
| $HHI = \sum_{k} \hat{p}_k^2$ | Herfindahl–Hirschman Index |
| $D = 1 - \sum_{k} \hat{p}_k^2$ | Simpson's diversity index |
| $IQV = \frac{K}{K-1}(1 - \sum_{k} \hat{p}_k^2)$ | Index of Qualitative Variation $\in [0,1]$ |
| Wilson CI: see Section 10.2 | Recommended 95% CI for proportions |

Measure Applicability by Scale

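| Measure | Nominal | Ordinal |
|---|---|---|
| Frequency, proportion, % | ✅ | ✅ |
| Cumulative frequency / % | ❌ | ✅ |
| Mode | ✅ | ✅ |
| Median | ❌ | ✅ |
| Quartiles / IQR | ❌ | ✅ |
| $VR$, $H$, $H_{norm}$, $D$, $HHI$, $IQV$ | ✅ | ✅ |
| Mean, SD | ❌ | ❌ (unless numeric scores assigned) |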

Chart Selection Guide

| Variable Type | $K$ | Audience | Recommended Chart |
|---|---|---|---|
| Nominal | 2–5 | Technical | Vertical bar chart |
| Nominal | 2–4 | General | Pie chart / waffle chart |
| Nominal | 6+ | Any | Horizontal bar / lollipop |
| Ordinal (non-Likert) | Any | Any | Bar chart (ordered) |
| Ordinal (Likert) | 4–7 | Any | Diverging bar chart |
| Binary | 2 | General | Single bar / donut / waffle |
| Grouped comparisons | 3–6 × 2–4 | Technical | Grouped / stacked bar |
| Trend over time | Any | Any | Line chart of proportions |

Heterogeneity Benchmarks ($H_{norm}$)

| $H_{norm}$ | Diversity Level |
|---|---|
| 0.00 – 0.20 | Very low (near-complete concentration) |
| 0.21 – 0.40 | Low |
| 0.41 – 0.60 | Moderate |
| 0.61 – 0.80 | High |
| 0.81 – 1.00 | Very high (near-uniform) |

Confidence Interval Method Selection

| Situation | Recommended Method |
|---|---|
| General use; any $N$ | Wilson score interval |
| Conservative guarantee; small $N$ | Clopper-Pearson exact |
| Simplicity; $N \geq 10$ | Agresti-Coull |
| Large $N > 100$; $\hat{p}$ not extreme | Wald (acceptable but not preferred) |
| Joint CIs for all $K$ proportions | Bonferroni-adjusted Wilson |

Approximate Wilson 95% CI Half-Width ($\hat{p} = 0.50$)

| $N$ | ± Half-width |
|---|---|
| 20 | ±0.201 |
| 50 | ±0.134 |
| 100 | ±0.096 |
| 200 | ±0.069 |
| 500 | ±0.044 |
| 1000 | ±0.031 |
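
Half-widths of this kind can be tabulated for any $N$ with a short script. The sketch below is illustrative only; the function names are hypothetical and the formulas are the standard Wald and Wilson expressions. It prints both, because the two diverge noticeably at small $N$ and converge as $N$ grows, which is worth keeping in mind when using such a table for sample-size planning.

```python
import math
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)   # two-sided 95% critical value


def wald_half_width(n, p=0.5, z=Z_95):
    """Half-width of the Wald interval: z * sqrt(p(1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)


def wilson_half_width(n, p=0.5, z=Z_95):
    """Half-width of the Wilson score interval."""
    return z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / (1 + z ** 2 / n)


if __name__ == "__main__":
    print(f"{'N':>6} {'Wald':>8} {'Wilson':>8}")
    for n in (20, 50, 100, 200, 500, 1000):
        print(f"{n:>6} {wald_half_width(n):>8.3f} {wilson_half_width(n):>8.3f}")
```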

APA 7th Edition Reporting Templates

Single nominal variable: "The most frequently reported [variable name] was [modal category] ($n$ = [value], [%]%, 95% CI [[LB]%, [UB]%]), followed by [second category] ($n$ = [value], [%]%, 95% CI [[LB]%, [UB]%]). The full distribution is presented in Table X."

Single ordinal variable: "Responses to [variable name] were distributed as follows: [lowest category] ($n$ = [value], [%]%), …, [highest category] ($n$ = [value], [%]%). The median response was [median category] ($Q_1$ = [Q1 category]; $Q_3$ = [Q3 category])."

Binary variable: "A total of [f] participants ([%]%, 95% CI [[LB]%, [UB]%]) reported [positive category]; the remaining [f] ([%]%, 95% CI [[LB]%, [UB]%]) reported [negative category]."

With missing data: "Of [N total] participants, [N valid] provided valid responses ([N miss] missing, [miss %]%). Among valid responses, …"

With heterogeneity measures: "The distribution showed [low / moderate / high] heterogeneity (Shannon entropy = [value] bits, $H_{norm}$ = [value], IQV = [value])."

With subgroup comparison: "Conditional frequency distributions by [group variable] are presented in Table X. [Category] was most prevalent in [subgroup] ([%]%, 95% CI [[LB]%, [UB]%]) compared to [other subgroup] ([%]%, 95% CI [[LB]%, [UB]%])."
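
When many categories have to be reported, filling the bracketed slots by hand is error-prone. A small formatter such as the one below can generate the recurring "(n = …, …%, 95% CI […%, …%])" fragment. The function name `apa_frequency_phrase` is hypothetical, and the confidence bounds are assumed to be supplied as proportions already computed by whichever CI method is being reported.

```python
def apa_frequency_phrase(category, freq, n_valid, ci_lower, ci_upper):
    """Format one category as 'Label (n = f, p%, 95% CI [LB%, UB%])'.

    `ci_lower` and `ci_upper` are proportions in [0, 1] computed elsewhere.
    """
    pct = 100 * freq / n_valid
    return (
        f"{category} (n = {freq}, {pct:.1f}%, "
        f"95% CI [{100 * ci_lower:.1f}%, {100 * ci_upper:.1f}%])"
    )


if __name__ == "__main__":
    # Illustrative numbers only: 40 of 100 valid responses, bounds given as proportions
    print(apa_frequency_phrase("Agree", 40, 100, 0.309, 0.498))
```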

Reporting Checklist

| Item | Required |
|---|---|
| Valid $N$ and missing $N$ (with missing rate) | ✅ Always |
| All category labels clearly defined | ✅ Always |
| Absolute frequencies for all categories | ✅ Always |
| Percentages for all categories | ✅ Always |
| Proportions (or % totalling 100%) | ✅ Always |
| Mode (with indication of multimodality if present) | ✅ Always |
| Median and quartiles | ✅ For ordinal variables |
| Cumulative frequencies / percentages | ✅ For ordinal variables |
| 95% CI for all proportions (Wilson recommended) | ✅ Always |
| Shannon entropy ($H$ and $H_{norm}$) | ✅ When heterogeneity is of substantive interest |
| $VR$ or $IQV$ | ✅ When heterogeneity is of substantive interest |
| HHI / Simpson's $D$ | ✅ For concentration / diversity analyses |
| Appropriate chart with axis labels and title | ✅ Always |
| Reference to published frequency table in text | ✅ Always |
| Measurement scale stated (nominal / ordinal) | ✅ Always |
| Missing data mechanism discussed | ✅ When $N_{miss} > 5\%$ |
| Weighted estimates (if survey data) | ✅ When design weights provided |
| CI method stated | ✅ When $N < 100$ or any $\hat{p}_k < 0.10$ |
| Goodness-of-fit against reference distribution | ✅ When comparing to theoretical or historical baseline |
| Subgroup conditional distributions | ✅ When a grouping variable is present |

This tutorial provides a comprehensive foundation for understanding, computing, interpreting, visualising, and reporting categorical descriptive statistics within the DataStatPro application. For further reading, consult Agresti's "An Introduction to Categorical Data Analysis" (3rd ed., 2018), Tukey's "Exploratory Data Analysis" (1977), Wickham's "ggplot2: Elegant Graphics for Data Analysis" (3rd ed., 2024) for visualisation principles, and Shannon & Weaver's "The Mathematical Theory of Communication" (1949) for entropy foundations. For feature requests or support, contact the DataStatPro team.