Categorical Descriptives: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of summarising categorical data all the way through advanced interpretation, reporting, visualisation, assumption checking, and practical usage within the DataStatPro application. Whether you are encountering categorical descriptive statistics for the first time or deepening your understanding of how to characterise, display, and communicate the distribution of categorical variables, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What are Categorical Descriptives?
- The Mathematics Behind Categorical Descriptives
- Considerations and Data Quality Checks
- Types of Categorical Descriptive Measures
- Using the Categorical Descriptives Calculator Component
- Step-by-Step Procedure
- Interpreting the Output
- Visualising Categorical Data
- Confidence Intervals for Proportions
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into categorical descriptives, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.
1.1 What is a Variable?
A variable is any characteristic, attribute, or quantity that can take on different values across observations. In statistics, variables are the building blocks of data, and every analytic method — including descriptive statistics — begins with a clear understanding of the variables at hand.
- Observation: A single unit of study (e.g., one person, one product, one country).
- Dataset: A rectangular array of observations (rows) and variables (columns).
- Value: The specific outcome recorded for a variable on a given observation.
1.2 Scales of Measurement
All variables fall into one of four measurement scales, which determine which statistical operations are valid:
| Scale | Properties | Examples |
|---|---|---|
| Nominal | Named categories; no order | Blood type, gender, country, species |
| Ordinal | Ordered categories; no equal spacing | Satisfaction (low/medium/high), education level |
| Interval | Equal spacing; no true zero | Temperature (°C), year |
| Ratio | Equal spacing; true zero | Height, weight, income, reaction time |
Categorical descriptives apply to nominal and ordinal variables (and, when treated as categorical, to discretised interval or ratio variables).
1.3 Categorical Variables Defined
A categorical variable assigns each observation to exactly one of a finite set of mutually exclusive, exhaustive categories. Two key sub-types exist:
- Nominal variables: Categories carry labels only, with no inherent ordering. There is no meaningful sense in which one category is "more" or "less" than another. Examples: eye colour (blue, green, brown, hazel), marital status (single, married, divorced, widowed), preferred mode of transport.
- Ordinal variables: Categories carry labels and a meaningful rank order, but the intervals between adjacent categories need not be equal. Examples: pain level (none, mild, moderate, severe), academic grade (A, B, C, D, F), Likert-scale items (strongly disagree to strongly agree).
⚠️ Descriptive statistics appropriate for nominal variables (frequency, proportion, mode) are always valid for ordinal variables, but ordinal variables additionally support rank-based summaries. Never apply ordinal summaries — such as the median — to purely nominal variables.
1.4 Frequency and Frequency Distributions
The most fundamental summary of a categorical variable is its frequency distribution: a complete enumeration of each category and the number of observations falling into it.
- Absolute frequency $n_i$: The raw count of observations in category $i$.
- Relative frequency $\hat{p}_i$: The proportion of all observations in category $i$, defined as $\hat{p}_i = n_i / n$.
- Percentage: The relative frequency expressed as a percentage, $100 \times \hat{p}_i$.
- Cumulative frequency $F_i$: The count of all observations in categories up to and including $i$ (meaningful only for ordinal variables).
- Cumulative relative frequency $\hat{F}_i$: The proportion of observations in categories up to and including $i$.
1.5 The Mode
The mode is the category (or categories) that appears most frequently in the data. It is the only measure of central tendency that is valid for nominal-scale data.
- A distribution with one clear most-frequent category is unimodal.
- A distribution with two equally or near-equally frequent categories is bimodal.
- A distribution in which all categories occur with equal frequency is uniform — the mode is undefined or uninformative in this case.
1.6 Population vs. Sample
All descriptive statistics computed from data describe the sample at hand. When the goal is to make inferences about a broader population, sample statistics become estimates subject to sampling variability. Confidence intervals (Section 10) quantify this uncertainty.
- Population proportion: $\pi_i$, the true proportion of the population in category $i$.
- Sample proportion: $\hat{p}_i = n_i / n$, the estimate of $\pi_i$ from the sample.
1.7 The Concept of a Probability Distribution for Categorical Data
For a categorical variable with $k$ categories, the probability distribution specifies the probability $\pi_i$ assigned to each category $i$, subject to:
$$\pi_i \geq 0 \quad \text{for all } i, \qquad \sum_{i=1}^{k} \pi_i = 1$$
This is the categorical distribution (a generalisation of the Bernoulli distribution to $k$ outcomes). When $k = 2$ (binary variable), it reduces to the Bernoulli distribution with parameter $\pi_1 = \pi$ and $\pi_2 = 1 - \pi$.
1.8 Missing Data in Categorical Variables
Missing values are observations for which no category was recorded. They are fundamentally different from a valid category and must be handled deliberately:
- Complete case analysis: Exclude observations with missing values from all calculations. Simple but potentially biasing.
- Include as a category: Treat missing as its own explicit category (appropriate when missingness is informative, e.g., "refused to answer").
- Imputation: Replace missing values with estimated values using mode imputation or multiple imputation methods.
DataStatPro reports the number and percentage of missing values separately from the frequency distribution of valid responses.
2. What are Categorical Descriptives?
2.1 The Core Purpose
Categorical descriptive statistics are numerical and graphical summaries that characterise the distribution of a categorical variable. Their purpose is to answer, in a rigorous and communicable way, the fundamental question: How are observations distributed across the categories of this variable?
Unlike continuous descriptives (mean, standard deviation, skewness), which describe the location and spread of a numeric scale, categorical descriptives quantify the frequency, proportion, and relative dominance of discrete categories.
2.2 What Categorical Descriptives Tell You
| Summary | Core Question Answered |
|---|---|
| Frequency table | How many observations fall into each category? |
| Proportions / percentages | What share of observations does each category represent? |
| Mode | Which category is most common? |
| Variation ratio | How heterogeneous is the distribution of categories? |
| Entropy | How uncertain or diverse is the distribution? |
| Concentration index | How much are observations concentrated in one category? |
| Cumulative frequencies | What proportion of observations fall at or below a given level? (ordinal only) |
| Confidence intervals | What is the plausible range for the true population proportion? |
2.3 When to Use Categorical Descriptives
| Condition | Requirement |
|---|---|
| Variable scale | Nominal or ordinal |
| Data format | Observations assigned to mutually exclusive categories |
| Purpose | Summarise the marginal distribution of one variable |
| Sample size | Any; CIs become more informative with larger $n$ |
| Reporting | Always precede inferential tests with descriptive summaries |
2.4 Real-World Applications
| Field | Variable | Categories | Descriptive Goal |
|---|---|---|---|
| Public Health | Vaccination status | Vaccinated / Partially vaccinated / Unvaccinated | Estimate population coverage |
| Marketing | Brand preference | Brand A / B / C / D / None | Identify dominant preference |
| HR & Organisational | Employment type | Full-time / Part-time / Contract / Casual | Describe workforce composition |
| Clinical Trials | Adverse event severity | Mild / Moderate / Severe / Life-threatening | Profile safety outcomes |
| Education | Letter grade | A / B / C / D / F | Characterise grade distribution |
| Sociology | Religious affiliation | Multiple denominations | Map social structure |
| Quality Control | Defect category | Type I / II / III / None | Identify dominant failure modes |
| Political Science | Voting intention | Party A / B / C / Undecided | Track electoral preference |
2.5 Distinguishing Categorical Descriptives from Related Analyses
| Goal | Appropriate Method |
|---|---|
| Summarise one categorical variable | Categorical descriptives |
| Test association between two categorical variables | Chi-square test of association |
| Test whether one distribution matches a known distribution | Chi-square goodness-of-fit test |
| Summarise a continuous variable | Continuous descriptives (mean, SD, median, IQR) |
| Compare proportions across two or more groups | Two-proportion z-test; chi-square test |
| Summarise the joint distribution of two categorical variables | Contingency table (cross-tabulation) |
| Model a binary outcome | Logistic regression |
3. The Mathematics Behind Categorical Descriptives
3.1 Notation
Consider a categorical variable $X$ with $k$ mutually exclusive categories labelled $c_1, c_2, \dots, c_k$. A sample of $n$ observations yields:
- $n_i$ = count of observations in category $c_i$ (absolute frequency)
- $n = \sum_{i=1}^{k} n_i$ = total number of valid observations
- $\hat{p}_i = n_i / n$ = sample proportion in category $c_i$ (relative frequency)
3.2 Frequency and Proportion
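Absolute frequency:
$$n_i = \sum_{j=1}^{n} \mathbb{1}(x_j = c_i)$$
Relative frequency (proportion):
$$\hat{p}_i = \frac{n_i}{n}$$
Percentage:
$$100 \times \hat{p}_i$$
Verification: $\sum_{i=1}^{k} n_i = n$ and $\sum_{i=1}^{k} \hat{p}_i = 1$.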
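As a minimal sketch of these definitions, the Python snippet below tallies counts, proportions, and percentages from raw values. The function name `frequency_table`, the use of `None` as the missing-value marker, and the blood-type sample are illustrative assumptions and are not part of DataStatPro.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count, proportion, percentage) rows, most frequent first."""
    valid = [v for v in values if v is not None]  # assumption: None marks a missing value
    n = len(valid)                                # assumes at least one valid observation
    counts = Counter(valid)
    return [(cat, n_i, n_i / n, 100 * n_i / n) for cat, n_i in counts.most_common()]

blood = ["O", "A", "O", "B", "A", "O", None, "AB", "O", "A"]  # made-up sample
table = frequency_table(blood)
for cat, n_i, p_i, pct in table:
    print(f"{cat:>2}  n={n_i}  p={p_i:.3f}  {pct:.1f}%")
```

The counts sum to the number of valid observations and the proportions sum to 1, which is the verification step stated above.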
3.3 Cumulative Frequencies (Ordinal Variables)
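For an ordinal variable with categories ordered $c_1 < c_2 < \dots < c_k$:
Cumulative absolute frequency:
$$F_i = \sum_{j=1}^{i} n_j$$
Cumulative relative frequency:
$$\hat{F}_i = \frac{F_i}{n} = \sum_{j=1}^{i} \hat{p}_j$$
By definition, $F_k = n$ and $\hat{F}_k = 1$.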
3.4 The Mode
The mode is the category with the highest absolute frequency:
$$\text{Mode} = \arg\max_{c_i} n_i$$
When two or more categories share the maximum frequency, the distribution is multimodal and all maximum-frequency categories are reported as co-modes.
3.5 The Variation Ratio
The variation ratio ($VR$) measures the proportion of observations that do not fall into the modal category. It is the simplest measure of dispersion for nominal data:
$$VR = 1 - \frac{n_{\text{mode}}}{n} = 1 - \hat{p}_{\text{mode}}$$
- $VR = 0$: All observations are in one category (no variation).
- $VR = \frac{k-1}{k}$: Maximum variation; all categories are equally frequent (uniform distribution).
- $VR$ ranges from $0$ to $\frac{k-1}{k}$.
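The mode and the variation ratio can be computed together from one tally. The sketch below is illustrative only: the function name and the transport data are assumptions, and ties are reported as co-modes in alphabetical order.

```python
from collections import Counter

def modes_and_variation_ratio(values):
    """Return (list of co-modes, variation ratio) for a list of category labels."""
    counts = Counter(values)
    n = sum(counts.values())
    top = max(counts.values())  # frequency of the modal category
    modes = sorted(cat for cat, n_i in counts.items() if n_i == top)
    return modes, 1 - top / n   # share of observations outside the modal category

transport = ["car"] * 5 + ["bus"] * 3 + ["bike"] * 2  # made-up sample
modes, vr = modes_and_variation_ratio(transport)
print(modes, vr)
```

With half of the observations in the modal category, half fall outside it, so the variation ratio here is 0.5.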
3.6 Shannon's Entropy
Shannon's entropy (from information theory) quantifies the uncertainty or diversity in a categorical distribution:
$$H = -\sum_{i=1}^{k} \hat{p}_i \log_2 \hat{p}_i$$
Measured in bits (using $\log_2$). Convention: $0 \log_2 0 = 0$.
- $H = 0$: Minimum entropy; all observations in one category (complete certainty).
- $H = \log_2 k$: Maximum entropy; all categories equally probable (maximum uncertainty).
Normalised entropy (also called relative entropy or evenness index) rescales $H$ to $[0, 1]$:
$$H_{\text{norm}} = \frac{H}{\log_2 k}$$
$H_{\text{norm}} = 0$ indicates complete concentration; $H_{\text{norm}} = 1$ indicates maximum diversity across categories.
⚠️ Natural logarithm ($\ln$) is often used instead of $\log_2$, yielding entropy in nats. The choice of logarithm base affects the numerical value of $H$ but not relative comparisons. DataStatPro uses $\log_2$ (bits) by default.
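A short Python sketch of entropy in bits and its normalised form follows. The function names and the grade data are illustrative assumptions. By default `k` is taken as the number of observed categories, which understates the true `k` when a valid category has zero observations, so pass `k` explicitly in that case.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits of the observed category distribution."""
    counts = Counter(values)
    n = sum(counts.values())
    # Unobserved categories never enter the Counter, matching the 0 * log2(0) = 0 convention.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def normalised_entropy(values, k=None):
    """Entropy divided by log2(k); k defaults to the number of observed categories."""
    k = len(set(values)) if k is None else k
    if k < 2:
        return 0.0  # a single category carries no uncertainty
    return shannon_entropy(values) / math.log2(k)

grades = ["A"] * 4 + ["B"] * 2 + ["C"] * 2  # proportions 0.50, 0.25, 0.25
h = shannon_entropy(grades)
h_norm = normalised_entropy(grades)
print(f"H = {h:.3f} bits, normalised = {h_norm:.3f}")
```

For entropy in nats, `math.log` would replace `math.log2` in both functions.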
3.7 Herfindahl–Hirschman Index (HHI) and Simpson's Concentration Index
The Herfindahl–Hirschman Index quantifies the degree to which observations are concentrated in a small number of categories:
$$HHI = \sum_{i=1}^{k} \hat{p}_i^2$$
- $HHI = 1/k$: Minimum concentration; all categories equally frequent.
- $HHI = 1$: Maximum concentration; all observations in a single category.
The complement, Simpson's diversity index $D$, measures the probability that two randomly selected observations belong to different categories:
$$D = 1 - HHI = 1 - \sum_{i=1}^{k} \hat{p}_i^2$$
- $D = 0$: No diversity (all in one category).
- $D = 1 - 1/k$: Maximum diversity (uniform distribution).
3.8 Qualitative Variation Index (IQV)
The Index of Qualitative Variation (IQV), also attributed to Gibbs and Martin (1962), standardises Simpson's diversity index to $[0, 1]$ regardless of the number of categories:
$$IQV = \frac{k}{k-1}\left(1 - \sum_{i=1}^{k} \hat{p}_i^2\right) = \frac{k}{k-1}\,D$$
- $IQV = 0$: All observations in one category.
- $IQV = 1$: All categories equally represented (maximum heterogeneity).
IQV facilitates comparisons of categorical dispersion across variables with different numbers of categories.
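The three concentration measures share one sum of squared proportions, so they are convenient to compute together. This is a hedged sketch: the function name and data are assumptions, and `k` defaults to the number of observed categories unless supplied.

```python
from collections import Counter

def concentration_measures(values, k=None):
    """Return (HHI, Simpson's D, IQV) for a list of category labels."""
    counts = Counter(values)
    n = sum(counts.values())
    k = len(counts) if k is None else k
    hhi = sum((c / n) ** 2 for c in counts.values())  # sum of squared proportions
    d = 1 - hhi                                       # Simpson's diversity
    iqv = d * k / (k - 1) if k > 1 else 0.0           # rescaled to [0, 1]
    return hhi, d, iqv

brands = ["A"] * 4 + ["B"] * 2 + ["C"] * 2  # proportions 0.50, 0.25, 0.25
hhi, d, iqv = concentration_measures(brands)
print(hhi, d, iqv)
```

Because IQV divides by its maximum for the given `k`, it can be compared across variables with different numbers of categories, whereas raw `D` cannot.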
3.9 The Median for Ordinal Variables
For ordinal variables, the median is the category at which the cumulative relative frequency $\hat{F}_i$ first reaches or exceeds 0.50:
$$\text{Median} = c_m, \qquad m = \min\{\, i : \hat{F}_i \geq 0.50 \,\}$$
The median is more informative than the mode for ordinal data when the distribution is asymmetric, as it captures the central ordering of responses.
3.10 Percentiles and Quartiles for Ordinal Variables
Percentiles for ordinal variables are defined analogously to the median, using the cumulative frequency distribution:
$$P_q = c_m, \qquad m = \min\{\, i : \hat{F}_i \geq q/100 \,\}$$
The interquartile range ($IQR$) describes the middle 50% of ordinal responses and spans from the 25th percentile ($Q_1$) to the 75th percentile ($Q_3$):
$$IQR = [\,Q_1,\; Q_3\,]$$
⚠️ Arithmetic differences between ordinal category labels are not meaningful unless numeric scores are assigned. The IQR for ordinal data should be reported as a range of category labels, not as a single numeric value.
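The "first category whose cumulative proportion reaches the target" rule for the median and quartiles can be sketched as a single scan over the ordered counts. The function name and the pain-level counts are illustrative assumptions; the input must already be in rank order.

```python
def ordinal_percentile(ordered_counts, q):
    """ordered_counts: list of (category, count) in rank order; q is a proportion in (0, 1]."""
    n = sum(count for _, count in ordered_counts)
    running = 0
    for category, count in ordered_counts:
        running += count
        if running / n >= q:  # cumulative relative frequency reaches the target
            return category
    return ordered_counts[-1][0]  # fallback in case of floating-point shortfall

pain = [("none", 12), ("mild", 18), ("moderate", 14), ("severe", 6)]  # made-up counts, n = 50
q1 = ordinal_percentile(pain, 0.25)
median = ordinal_percentile(pain, 0.50)
q3 = ordinal_percentile(pain, 0.75)
print(f"IQR spans {q1} to {q3}; median = {median}")
```

The result is reported as category labels, in line with the warning above about not subtracting ordinal labels.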
3.11 The Geometric Mean of Proportions (Diversity)
For comparing proportional distributions across samples of different sizes, the geometric mean proportion can be used to summarise the average per-category representation:
This is directly related to entropy: .
4. Considerations and Data Quality Checks
4.1 Mutual Exclusivity and Exhaustiveness
The fundamental validity requirement for a categorical variable is that its categories are:
- Mutually exclusive: Each observation belongs to exactly one category. If a respondent can select multiple categories (multi-select questions), the variable violates mutual exclusivity and must be restructured (e.g., as multiple binary indicator variables) before standard categorical descriptives can be applied.
- Exhaustive: Every possible observation must map to some category. If the category set does not cover all possibilities, an "Other" category must be added.
How to check: Confirm that $\sum_{i=1}^{k} n_i = n$ (valid observations). If $\sum_{i=1}^{k} n_i < n$, some observations are unaccounted for.
4.2 Category Labelling Consistency
Inconsistent labelling causes artificial category inflation. Common problems include:
| Problem | Example | Solution |
|---|---|---|
| Case inconsistency | "male" vs. "Male" vs. "MALE" | Standardise case before analysis |
| Leading/trailing spaces | " Yes" vs. "Yes" | Strip whitespace |
| Synonymous labels | "N/A" vs. "Not Applicable" | Merge into one category |
| Abbreviations | "F" vs. "Female" | Choose one consistent label |
| Encoding issues | "Caf" vs. "Café" | Fix encoding and standardise |
DataStatPro flags potential label inconsistencies and offers a category merge tool.
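Outside DataStatPro, the same fixes can be applied before tallying. The sketch below is one possible approach, not the application's merge tool: the function names and the `SYNONYMS` mapping are hypothetical, and real data would need its own mapping.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping of abbreviations and synonyms to one canonical label.
SYNONYMS = {"n/a": "not applicable", "f": "female", "m": "male"}

def clean_label(label):
    """Normalise Unicode form, trim and collapse whitespace, and fold case."""
    text = unicodedata.normalize("NFC", label)
    return " ".join(text.split()).casefold()

def standardise(labels):
    """Clean every label, then merge known synonyms."""
    cleaned = (clean_label(x) for x in labels)
    return [SYNONYMS.get(x, x) for x in cleaned]

raw = ["male", "Male ", " MALE", "F", "Female", "female"]
print(len(set(raw)), "distinct labels before cleaning")
print(Counter(standardise(raw)))
```

Unicode normalisation only unifies equivalent encodings of the same character; text that was truncated or corrupted on import, as in the "Caf" example, has to be repaired at the source.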
4.3 Missing Data Assessment
Before reporting any descriptive statistics, the extent and pattern of missing data must be evaluated:
| Metric | Formula | Interpretation |
|---|---|---|
| Missing count | $n_{\text{miss}} = N - n$ | Number of absent responses |
| Missing rate | $n_{\text{miss}} / N$ | Proportion of data missing |
| Valid rate | $n / N$ | Proportion of usable responses |
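A minimal sketch of these three metrics follows. The function name and the default set of missing-value markers (`None`, empty string, `"NA"`) are assumptions; adjust the markers to match how missingness is coded in your data.

```python
def missing_summary(values, missing_markers=(None, "", "NA")):
    """Count total, valid, and missing observations and the corresponding rates."""
    total = len(values)  # assumes a non-empty list
    n_missing = sum(1 for v in values if v in missing_markers)
    n_valid = total - n_missing
    return {
        "N": total,
        "n_valid": n_valid,
        "n_missing": n_missing,
        "missing_rate": n_missing / total,
        "valid_rate": n_valid / total,
    }

answers = ["yes", "no", None, "yes", "", "no", "yes", "NA"]  # made-up responses
summary = missing_summary(answers)
print(summary)
```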
Missing data mechanisms:
- MCAR (Missing Completely At Random): Missingness is unrelated to any variable. Complete case analysis is unbiased.
- MAR (Missing At Random): Missingness depends on observed variables, not the missing value itself. Imputation or weighting is appropriate.
- MNAR (Missing Not At Random): Missingness depends on the value that is missing (e.g., people with extreme views refusing to disclose). Most problematic; requires sensitivity analyses.
4.4 Sample Size Adequacy
While categorical descriptives can be computed for any , interpretability and precision depend on sample size:
| Guidance | |
|---|---|
| Proportions are highly unstable; report counts only | |
| Proportions are reported with wide CIs; interpret cautiously | |
| Proportions reasonably stable; report CIs using Wilson's method | |
| Proportions stable; standard CIs appropriate; diversity measures reliable | |
| Fine-grained proportions meaningful; subgroup breakdowns feasible |
4.5 Rare Categories
Categories with very few observations (e.g., ) pose challenges:
- Instability: Proportions based on tiny counts fluctuate widely across samples.
- Privacy: Small cell counts in sensitive data may enable re-identification.
- Misleading visuals: Tiny slices in pie charts or bars are hard to read.
Options for handling rare categories:
- Retain and flag: Report as-is with a note on small $n_i$.
- Collapse: Merge rare categories with theoretically similar ones.
- "Other" grouping: Create a residual "Other" category for all categories below a frequency threshold.
- Suppress: Omit categories below a frequency threshold from public reports.
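The "Other" grouping option can be sketched as a relabelling pass over the raw values. The function name, the threshold of 3, and the defect data are illustrative assumptions; the appropriate threshold depends on the reporting context.

```python
from collections import Counter

def collapse_rare(values, min_count, other_label="Other"):
    """Relabel every category observed fewer than min_count times as other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

defects = ["I"] * 10 + ["II"] * 6 + ["III"] * 2 + ["IV"] * 1  # made-up counts
collapsed = collapse_rare(defects, min_count=3)
print(Counter(collapsed))
```

The total count is unchanged, so proportions computed afterwards still sum to 1. Report the threshold and the merged categories alongside the table so readers can see what "Other" contains.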
4.6 Ordered vs. Unordered Presentation
For nominal variables, the order in which categories are displayed is arbitrary. Common orderings include:
- Alphabetical (neutral, reproducible).
- By descending frequency (highlights dominant categories).
- By theoretical grouping (e.g., clinical severity).
For ordinal variables, categories must always be presented in their natural rank order (ascending or descending) to preserve the meaning of cumulative frequencies and the median.
4.7 Weighted Data
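In survey research, observations are frequently assigned weights $w_j$ to correct for unequal selection probabilities or to make the sample representative of a target population. When weights are present, each count is replaced by a sum of weights:
$$n_i^{(w)} = \sum_{j=1}^{n} w_j \,\mathbb{1}(x_j = c_i), \qquad \hat{p}_i^{(w)} = \frac{n_i^{(w)}}{\sum_{j=1}^{n} w_j}$$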
DataStatPro supports weighted frequency tables when a weight variable is specified. Both unweighted and weighted results are reported side by side.
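As a rough illustration of weighted proportions, the sketch below uses the standard definition (sum of weights in a category divided by the total weight). That definition, the function name, and the data are assumptions for illustration, not a description of DataStatPro's internals.

```python
from collections import defaultdict

def weighted_proportions(values, weights):
    """Return {category: weighted proportion} using one weight per observation."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight          # weighted count for this category
    total_weight = sum(totals.values())
    return {cat: w / total_weight for cat, w in totals.items()}

responses = ["yes", "no", "yes", "no"]
weights = [1.0, 3.0, 1.0, 3.0]           # made-up survey weights
weighted = weighted_proportions(responses, weights)
print(weighted)
```

Unweighted, the two answers are split evenly; with these weights "no" accounts for three quarters of the total, which shows why both versions are worth reporting side by side.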
5. Types of Categorical Descriptive Measures
5.1 Measures of Frequency
The most direct summaries — counts and proportions — form the foundation of all categorical description.
| Measure | Symbol | Formula | Scale |
|---|---|---|---|
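| Absolute frequency | $n_i$ | Count in category $c_i$ | Nominal, Ordinal |
| Relative frequency | $\hat{p}_i$ | $n_i / n$ | Nominal, Ordinal |
| Percentage | % | $100 \times \hat{p}_i$ | Nominal, Ordinal |
| Cumulative frequency | $F_i$ | $\sum_{j \leq i} n_j$ | Ordinal only |
| Cumulative % | $100 \times \hat{F}_i$ | $100 \times F_i / n$ | Ordinal only |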
5.2 Measures of Central Tendency
| Measure | Formula / Definition | Applicable Scale |
|---|---|---|
| Mode | Category with maximum $n_i$ | Nominal, Ordinal |
| Median | Category where $\hat{F}_i \geq 0.50$ first | Ordinal only |
| Percentiles ($Q_1$, $Q_3$) | Category where $\hat{F}_i \geq$ target first | Ordinal only |
5.3 Measures of Dispersion / Heterogeneity
| Measure | Formula | Range | Scale |
|---|---|---|---|
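| Variation ratio | $VR = 1 - \hat{p}_{\text{mode}}$ | $[0, (k-1)/k]$ | Nominal, Ordinal |
| Shannon entropy | $H = -\sum_i \hat{p}_i \log_2 \hat{p}_i$ | $[0, \log_2 k]$ | Nominal, Ordinal |
| Normalised entropy | $H_{\text{norm}} = H / \log_2 k$ | $[0, 1]$ | Nominal, Ordinal |
| Simpson's diversity | $D = 1 - \sum_i \hat{p}_i^2$ | $[0, (k-1)/k]$ | Nominal, Ordinal |
| HHI (concentration) | $HHI = \sum_i \hat{p}_i^2$ | $[1/k, 1]$ | Nominal, Ordinal |
| IQV | $IQV = \frac{k}{k-1} D$ | $[0, 1]$ | Nominal, Ordinal |
| IQR (category range) | $[Q_1, Q_3]$ | Category units | Ordinal only |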
5.4 Comparative Descriptives: Subgroup Breakdowns
When a grouping variable partitions observations into subgroups, categorical descriptives can be computed separately within each group, enabling comparison:
$$\hat{p}_{i \mid g} = \frac{n_{ig}}{n_{g}}$$
Where $n_{ig}$ is the count in category $i$ within group $g$ and $n_{g}$ is the total in group $g$. This is the foundation of cross-tabulation and is reported as a conditional frequency table (see Section 11.1).
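A conditional frequency table can be built by tallying categories separately within each group and dividing by the group total. The sketch below assumes the data arrive as (group, category) pairs; the function name and the workforce example are illustrative.

```python
from collections import Counter, defaultdict

def conditional_proportions(pairs):
    """pairs: iterable of (group, category). Returns {group: {category: proportion}}."""
    by_group = defaultdict(Counter)
    for group, category in pairs:
        by_group[group][category] += 1
    return {
        group: {cat: count / sum(counts.values()) for cat, count in counts.items()}
        for group, counts in by_group.items()
    }

# Made-up data: employment type at two sites.
staff = (
    [("Site A", "full-time")] * 6 + [("Site A", "part-time")] * 4
    + [("Site B", "full-time")] * 3 + [("Site B", "part-time")] * 9
)
result = conditional_proportions(staff)
for group, props in result.items():
    print(group, props)
```

Proportions sum to 1 within each group, so the group profiles are comparable even when group sizes differ.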
5.5 Descriptives for Binary (Dichotomous) Variables
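A binary variable is a special case of a categorical variable with $k = 2$ categories (typically coded 0/1 or "No"/"Yes"). All standard categorical descriptives apply, but additional simplifications hold:
- The distribution is fully described by a single proportion $\hat{p}$ (the proportion in the positive/event category); the complementary proportion is $1 - \hat{p}$.
- $VR = \min(\hat{p}, 1 - \hat{p})$, with its maximum at $\hat{p} = 0.5$.
- $HHI = \hat{p}^2 + (1 - \hat{p})^2$; $D = 2\hat{p}(1 - \hat{p})$, with $D$ at its maximum at $\hat{p} = 0.5$.
- $H = -\hat{p}\log_2 \hat{p} - (1 - \hat{p})\log_2(1 - \hat{p})$, the binary entropy function.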
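For a binary variable, the usual two-category closed forms (variation ratio as the smaller of the two proportions, HHI as the sum of the two squared proportions, Simpson's diversity as its complement, and the binary entropy function) can be evaluated directly from the event proportion. The function name and dictionary keys below are illustrative assumptions.

```python
import math

def binary_summaries(p):
    """Closed-form dispersion measures for a binary variable with event proportion p."""
    q = 1 - p
    # Apply the 0 * log2(0) = 0 convention at the boundaries.
    entropy = 0.0 if p in (0, 1) else -(p * math.log2(p) + q * math.log2(q))
    return {"VR": min(p, q), "HHI": p * p + q * q, "D": 2 * p * q, "H": entropy}

half = binary_summaries(0.5)
quarter = binary_summaries(0.25)
print(half)
print(quarter)
```

At an even split the variation ratio, Simpson's diversity, and entropy are all at their maxima, while HHI is at its minimum of 0.5.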
6. Using the Categorical Descriptives Calculator Component
The Categorical Descriptives Calculator in DataStatPro provides a fully featured tool for computing, diagnosing, visualising, and reporting descriptive statistics for categorical variables.
Step-by-Step Guide
Step 1 — Navigate to the Component
Go to Descriptive Statistics → Categorical Descriptives.
Step 2 — Input Method
Choose how to provide your data:
- Raw data: Upload or paste a column of categorical observations. DataStatPro automatically detects the variable type, counts unique categories, and handles missing values.
- Pre-aggregated frequency table: Enter category labels and their counts directly into the table grid. Useful when you already have a summary table and wish to compute additional descriptive measures from it.
- Multiple variables: Select two or more categorical columns simultaneously to run batch descriptives across all selected variables in one pass.
Step 3 — Variable Configuration
- Assign a meaningful variable name and category labels for display.
- Specify the measurement scale (nominal or ordinal). If ordinal, define the correct rank ordering of categories using the drag-and-drop interface.
- Designate whether the variable is binary to unlock specialised binary proportion summaries and exact confidence intervals.
- Specify a grouping variable (optional) to produce stratified breakdowns and conditional frequency tables.
- Specify a weight variable (optional) to produce weighted frequency estimates.
Step 4 — Missing Data Handling
Select one of the following:
- Exclude missing (valid only): All summaries based on valid $n$.
- Include missing as category: Missing values form an explicit "Missing" category.
- Report missing separately: Missing counts reported in a separate table; all summaries exclude missing.
Step 5 — Set Display Options
- ✅ Absolute frequencies ($n_i$).
- ✅ Relative frequencies / proportions ($\hat{p}_i$).
- ✅ Percentages (%) with optional decimal places.
- ✅ Cumulative frequencies and cumulative percentages (ordinal variables).
- ✅ Valid $n$, missing $n_{\text{miss}}$, and total $N$.
- ✅ Mode (with multi-mode detection).
- ✅ Median and quartiles (ordinal variables).
- ✅ Variation ratio, Shannon entropy (raw and normalised), HHI, Simpson's $D$, IQV.
- ✅ 95% confidence intervals for all proportions (Wilson, Clopper-Pearson, Agresti-Coull, or Wald — selectable in Settings).
- ✅ Comparison to a reference distribution (goodness-of-fit chi-square).
- ✅ Weighted estimates (when weight variable specified).
- ✅ Bar chart (simple, stacked, or grouped).
- ✅ Pie chart with customisable colour palette.
- ✅ Donut chart.
- ✅ Waffle chart (unit square representation).
- ✅ Lollipop chart.
- ✅ Diverging bar chart (for ordinal Likert-scale variables).
- ✅ Cumulative frequency plot (ordinal variables).
- ✅ APA 7th edition results paragraph (auto-generated).
- ✅ Publication-ready frequency table (formatted for direct insertion into manuscripts).
Step 6 — Run the Analysis
Click "Compute Categorical Descriptives". DataStatPro will:
- Validate data: check mutual exclusivity, label consistency, and missing values.
- Compute the full frequency distribution (absolute, relative, cumulative).
- Identify the mode(s) and, for ordinal variables, the median and quartiles.
- Compute all selected heterogeneity measures ($VR$, $H$, $H_{\text{norm}}$, $HHI$, $D$, $IQV$).
- Compute Wilson 95% CIs for all proportions.
- Generate all selected visualisations with customisable formatting.
- Produce the APA-compliant results paragraph and formatted frequency table.
7. Step-by-Step Procedure
7.1 Full Manual Procedure
Step 1 — Identify and Define the Variable
State the variable name, its measurement scale (nominal or ordinal), the population of observations, and all valid categories. Confirm mutual exclusivity and exhaustiveness.
Step 2 — Count Total and Missing Observations
Report the total $N$, the valid $n$, the missing count $n_{\text{miss}} = N - n$, and the missing rate explicitly. Decide on missing data handling before proceeding.
Step 3 — Tally Absolute Frequencies
For each category $c_i$:
$$n_i = \text{number of observations with } x_j = c_i$$
Verify: $\sum_{i=1}^{k} n_i = n$.
Step 4 — Compute Relative Frequencies and Percentages
$$\hat{p}_i = \frac{n_i}{n}, \qquad \text{percentage}_i = 100 \times \hat{p}_i$$
Step 5 — Compute Cumulative Frequencies (Ordinal Variables Only)
$$F_i = \sum_{j=1}^{i} n_j, \qquad \hat{F}_i = \frac{F_i}{n}$$
Verify: $F_k = n$ and $\hat{F}_k = 1$.
Step 6 — Identify the Mode
$$\text{Mode} = \arg\max_{c_i} n_i$$
If multiple categories share the maximum $n_i$, report all co-modes.
Step 7 — Identify the Median (Ordinal Variables Only)
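Locate the first category $c_m$ such that $\hat{F}_m \geq 0.50$:
$$\text{Median} = c_m, \qquad m = \min\{\, i : \hat{F}_i \geq 0.50 \,\}$$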
Step 8 — Compute Heterogeneity Measures
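Variation ratio:
$$VR = 1 - \hat{p}_{\text{mode}}$$
Shannon entropy:
$$H = -\sum_{i=1}^{k} \hat{p}_i \log_2 \hat{p}_i$$
Normalised entropy:
$$H_{\text{norm}} = \frac{H}{\log_2 k}$$
HHI and Simpson's diversity:
$$HHI = \sum_{i=1}^{k} \hat{p}_i^2, \qquad D = 1 - HHI$$
IQV:
$$IQV = \frac{k}{k-1}\,D$$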
Step 9 — Compute Confidence Intervals for Proportions
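For each $\hat{p}_i$, compute a 95% Wilson CI (recommended):
$$\frac{\hat{p}_i + \dfrac{z^2}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}_i(1 - \hat{p}_i)}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$
Where $z = 1.96$ for 95% confidence.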
Step 10 — Construct the Frequency Table
Assemble all computed values into a publication-ready frequency table:
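| Category | $n_i$ | % | Cumulative % | 95% CI |
|---|---|---|---|---|
| $c_1$ | $n_1$ | $100\hat{p}_1$ | $100\hat{F}_1$ | [lower, upper] |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| $c_k$ | $n_k$ | $100\hat{p}_k$ | 100.0% | [lower, upper] |
| Total | $n$ | 100.0% | – | – |
| Missing | $n_{\text{miss}}$ | – | – | – |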
Step 11 — Produce Visualisations
Select appropriate chart types (see Section 9) and annotate with frequencies or percentages. Ensure all axes are labelled and a title is provided.
Step 12 — Interpret and Report
Use APA reporting guidelines (Section 15). Always report the valid $n$, the missing count $n_{\text{miss}}$, the complete frequency table, the mode, and at minimum the variation ratio or Shannon entropy. For ordinal variables, also report the median and quartiles.
8. Interpreting the Output
8.1 The Frequency Table
The frequency table is the primary output. Read it as follows:
| Observation | Interpretation |
|---|---|
| One category has | Distribution is concentrated; modal category dominates |
| All $\hat{p}_i \approx 1/k$ | Distribution is approximately uniform; no dominant category |
| $\hat{p}_i = 1$ for one category | All observations in one category; no variation |
| Small $n_i$ for some categories | Rare categories; consider collapsing or flagging |
| Large $n_{\text{miss}}$ | Potential bias; investigate mechanism of missingness |
8.2 Mode Interpretation
| Mode Pattern | Interpretation |
|---|---|
| Single clear mode with high $\hat{p}_{\text{mode}}$ | Strong consensus around one category |
| Single mode with $\hat{p}_{\text{mode}}$ close to $1/k$ | Weakly dominant mode; near-uniform distribution |
| Two co-modes | Bimodal distribution; two competing dominant categories |
| All categories equal | Uniform distribution; mode is uninformative |
8.3 Heterogeneity Measures Interpretation
| Measure | Low Value Indicates | High Value Indicates |
|---|---|---|
| $VR$ | Most observations in modal category | Observations spread across many categories |
| $H$ (Shannon) | Predictable, concentrated distribution | Diverse, uncertain distribution |
| $H_{\text{norm}}$ | Near-complete concentration (near 0) | Near-perfect diversity (near 1) |
| $HHI$ | Spread across categories (near $1/k$) | Near-monopoly in one category (near 1) |
| $D$ (Simpson) | Low diversity; one category dominates | High diversity; categories well-represented |
| $IQV$ | Homogeneous distribution (near 0) | Heterogeneous distribution (near 1) |
8.4 Cumulative Frequency Interpretation (Ordinal Variables)
| Cumulative Metric | Interpretation |
|---|---|
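| $\hat{F}_i \geq 0.50$ at low-rank category | Most responses at the lower end; positively skewed (tail toward higher categories) |
| $\hat{F}_i \geq 0.50$ at middle category | Median in the middle; consistent with a roughly symmetric distribution |
| $\hat{F}_i \geq 0.50$ only at high-rank category | Most responses at the upper end; negatively skewed (tail toward lower categories) |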
| Wide IQR (many categories span Q1 to Q3) | High ordinal variability |
| Narrow IQR | Tight concentration around the median category |
8.5 Confidence Interval Interpretation
| CI Pattern | Interpretation |
|---|---|
| Narrow CI around $\hat{p}_i$ | Precise estimate; large $n$ or extreme proportion |
| Wide CI around $\hat{p}_i$ | Imprecise estimate; small $n$ or $\hat{p}_i$ near 0.50 |
| CI excludes a reference value $\pi_0$ | Statistically significant difference from $\pi_0$ |
| CIs for two categories overlap | The difference may not be statistically significant; overlap alone is not a formal test, so compare the proportions directly if the difference matters |
8.6 Contextualising Heterogeneity: Reference Benchmarks
| Verbal Label | Description | |
|---|---|---|
| Very low diversity | One category overwhelmingly dominant | |
| Low diversity | A few categories contain most observations | |
| Moderate diversity | Several categories reasonably represented | |
| High diversity | No single category clearly dominant | |
| Very high diversity | Observations distributed nearly uniformly |
⚠️ These benchmarks are heuristic guides, not universal standards. Domain context is essential — a may indicate healthy diversity in clinical adverse event categories but near-monopoly in a competitive market context. Always interpret heterogeneity measures relative to the theoretical range for the specific in your variable.
9. Visualising Categorical Data
9.1 Bar Chart
The bar chart (also called a bar graph) is the most widely recommended visualisation for categorical data. Each category is represented by a rectangular bar whose height (or length, for horizontal bars) is proportional to its frequency or proportion.
Best practices:
- Use vertical bars for a small number of categories () with short labels.
- Use horizontal bars when labels are long or when .
- Start the frequency axis at zero — truncating the axis distorts relative comparisons.
- Sort bars by descending frequency for nominal variables (unless there is a theoretically meaningful order).
- For ordinal variables, preserve the natural category order.
- Label each bar with its count, percentage, or both for clarity.
- Use a single, consistent colour for one-variable displays; reserve colour variation for grouped or stacked charts.
Appropriate for: Nominal and ordinal variables; any ; frequency and percentage comparisons.
9.2 Grouped Bar Chart
The grouped bar chart (clustered bar chart) displays the frequency distributions of a categorical variable separately for each level of a grouping variable, with groups of bars placed side by side.
Best practices:
- Limit to categories and groups to avoid clutter.
- Use a distinct colour for each group; provide a clear legend.
- Report percentages within groups (row percentages) when comparing group profiles.
Appropriate for: Comparing the distribution of one categorical variable across multiple groups.
9.3 Stacked Bar Chart
The stacked bar chart represents the proportion of each category stacked within a single bar (or within each group bar). The 100% stacked bar chart is particularly useful for comparing proportional breakdowns across groups.
Best practices:
- Use 100% stacked bars when comparing proportional composition across groups.
- Place the most important or interpretively central category consistently (either first or last in the stack).
- Avoid too many categories in a stack ( makes stacks hard to read).
Appropriate for: Visualising proportional composition; comparing distributions across groups.
9.4 Pie Chart
The pie chart encodes frequency as the angle (and area) of each slice. It is appropriate only when the number of categories is small () and the primary goal is showing part-to-whole relationships.
Limitations:
- Human perception of angular differences is less accurate than of bar lengths.
- Very small slices are illegible.
- Comparison across multiple pie charts is difficult.
When to avoid: When , when categories are similar in size, or when precise comparisons between categories are required. Prefer a bar chart in most cases.
9.5 Donut Chart
The donut chart is a variant of the pie chart with a hollow centre. The centre space can be used to display the total or a key summary statistic. It shares the limitations of pie charts and should be used with equal care.
9.6 Waffle Chart
The waffle chart (or unit chart) represents proportions as filled cells in a (or similar) grid, where each cell represents 1% (or ) of the total. Waffle charts are highly accessible and intuitive for general audiences.
Appropriate for: Communicating proportions to non-technical audiences; displaying one or two categories in a clear, visual format.
9.7 Lollipop Chart
The lollipop chart is a space-efficient alternative to the bar chart. Each category is represented by a thin line ("stick") topped with a dot ("lollipop"), whose position encodes frequency or proportion.
Best practices:
- Sort by descending frequency for nominal variables.
- Particularly effective for categories where bars become visually dense.
9.8 Diverging Bar Chart (Likert Scales)
For Likert-scale ordinal variables (e.g., 5-point agree–disagree scales), the diverging bar chart (also called a diverging stacked bar chart) centres the neutral category at zero and extends positive-direction categories to the right and negative-direction categories to the left.
Construction:
- Define a neutral midpoint (e.g., "Neither agree nor disagree").
- Positive categories extend rightward from the midpoint.
- Negative categories extend leftward from the midpoint.
- Each half-bar's length is proportional to the percentage in that response category.
Why it is effective: Enables simultaneous visual assessment of the overall agreement/disagreement balance and the distribution across all response options.
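The construction steps above can be sketched as a small helper that returns the horizontal span of each segment. This is only an illustration: the function name `diverging_segments` and the percentages are hypothetical, and the sketch assumes an odd number of categories ordered from most negative to most positive, so that the middle category is the neutral one.

```python
def diverging_segments(labels, percentages):
    """Return (label, start, end) spans for one diverging stacked bar.

    Assumes categories run from most negative to most positive and that
    there is an odd number of them, so the middle one is the neutral
    category, which is centred on zero.
    """
    if len(labels) != len(percentages) or len(percentages) % 2 == 0:
        raise ValueError("expected an odd number of matching labels and percentages")
    mid = len(percentages) // 2
    # Left edge: all negative categories plus half of the neutral category.
    start = -(sum(percentages[:mid]) + percentages[mid] / 2)
    segments = []
    for label, pct in zip(labels, percentages):
        segments.append((label, start, start + pct))
        start += pct
    return segments


# Hypothetical 5-point agree-disagree percentages.
labels = ["Strongly disagree", "Disagree", "Neither", "Agree", "Strongly agree"]
percentages = [5.0, 15.0, 20.0, 40.0, 20.0]
for label, start, end in diverging_segments(labels, percentages):
    print(f"{label:<18} {start:7.1f} -> {end:6.1f}")
```

The spans can be passed to any plotting library as the left edge and width of each horizontal bar segment.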
9.9 Cumulative Frequency Plot (Ordinal Variables)
The cumulative frequency (ogive) plot graphs cumulative percentage on the $y$-axis against ordered category levels on the $x$-axis. It is the primary tool for visually identifying the median (where the curve crosses 50%), quartiles, and the shape of the ordinal distribution.
Appropriate for: Ordinal variables; assessing cumulative burden or threshold effects.
9.10 Visualisation Selection Guide
| Variable Type | Number of Categories | Primary Audience | Recommended Chart |
|---|---|---|---|
| Nominal | 2–5 | Technical | Bar chart |
| Nominal | 2–4 | General | Pie chart or waffle chart |
| Nominal | 6+ | Any | Horizontal bar or lollipop |
| Ordinal (non-Likert) | Any | Any | Bar chart (ordered) or cumulative plot |
| Ordinal (Likert) | 4–7 | Any | Diverging bar chart |
| Binary | 2 | Any | Single bar, donut, or waffle |
| Grouped (nominal × group) | 3–6 × 2–4 | Technical | Grouped or stacked bar chart |
10. Confidence Intervals for Proportions
10.1 Why Confidence Intervals Are Essential
Sample proportions are estimates of population proportions. A 95% confidence interval (CI) provides a range of plausible values for the true population proportion $\pi$, given the observed sample proportion $\hat{p}$ and sample size $n$. CIs are not optional: they are integral to responsible reporting of proportions.
10.2 Wilson Score Interval (Recommended)
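The Wilson score interval is the recommended method for most applications, performing well across all sample sizes and values of $\hat{p}$:

$$\text{CI}_{\text{Wilson}} = \frac{\hat{p} + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$

Where $z = 1.96$ for a 95% CI. The Wilson interval maintains coverage probability close to the nominal 95% level even for small $n$ or extreme $\hat{p}$ (near 0 or 1).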
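A minimal sketch of the Wilson score interval using only the Python standard library is given here. The function name `wilson_interval` is illustrative and is not part of DataStatPro; the counts in the usage line are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    # The centre is pulled towards 0.5, which keeps the bounds inside [0, 1].
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


lower, upper = wilson_interval(50, 100)  # hypothetical: 50 of 100 in the category
print(f"95% Wilson CI: [{lower:.3f}, {upper:.3f}]")
```

Unlike the Wald interval, the result is not symmetric around $\hat{p}$ unless $\hat{p} = 0.5$.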
10.3 Clopper-Pearson Exact Interval
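The Clopper-Pearson interval is an exact (conservative) method based on the binomial distribution. For $f$ observed cases in $n$ valid observations:

$$\left[\, B\!\left(\tfrac{\alpha}{2};\; f,\; n - f + 1\right),\;\; B\!\left(1 - \tfrac{\alpha}{2};\; f + 1,\; n - f\right) \right]$$

Where $B(q;\, a,\, b)$ is the $q$-th quantile of the $\text{Beta}(a, b)$ distribution; the lower bound is taken as 0 when $f = 0$ and the upper bound as 1 when $f = n$. The Clopper-Pearson interval guarantees that the true coverage is at least $1 - \alpha$, but is typically wider (more conservative) than necessary. Recommended when a conservative guarantee is required (e.g., regulatory contexts).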
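The standard library has no Beta quantile function, so the sketch below obtains the same Clopper-Pearson bounds by inverting the binomial tail probabilities with bisection, which is an equivalent formulation. The names `binom_cdf` and `clopper_pearson` are illustrative assumptions; in practice a statistics package with a Beta quantile function would normally be used.

```python
import math


def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x + 1))


def _bisect_increasing(func, target: float) -> float:
    """Solve func(p) = target on [0, 1] for a function increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(successes: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) interval via inversion of binomial tails."""
    # Lower bound: P(X >= f | p) = alpha / 2, with the convention 0 when f = 0.
    lower = 0.0 if successes == 0 else _bisect_increasing(
        lambda p: 1 - binom_cdf(successes - 1, n, p), alpha / 2)
    # Upper bound: P(X <= f | p) = alpha / 2, with the convention 1 when f = n.
    upper = 1.0 if successes == n else _bisect_increasing(
        lambda p: 1 - binom_cdf(successes, n, p), 1 - alpha / 2)
    return lower, upper


print(clopper_pearson(5, 10))  # hypothetical: 5 of 10 in the category
```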
10.4 Agresti-Coull Interval
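The Agresti-Coull interval is a simple approximation that adjusts the observed proportion by adding pseudo-successes and pseudo-failures:

$$\tilde{n} = n + z^2, \qquad \tilde{p} = \frac{f + z^2/2}{\tilde{n}}, \qquad \text{CI}_{\text{AC}} = \tilde{p} \pm z\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}$$

For $z = 1.96$, this adds approximately 2 pseudo-successes and 2 pseudo-failures. The Agresti-Coull interval is computationally simple, nearly as accurate as Wilson's method, and performs well for moderate and large $n$.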
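The Agresti-Coull adjustment can be sketched in a few lines. The function name is illustrative, and the clipping of the bounds to $[0, 1]$ is an added convenience assumed here (the unclipped formula can overshoot for extreme counts), not something the text above prescribes.

```python
import math


def agresti_coull_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Agresti-Coull interval: a Wald-type interval around an adjusted proportion."""
    n_adj = n + z**2                         # adds about 4 pseudo-observations
    p_adj = (successes + z**2 / 2) / n_adj   # about 2 of them are pseudo-successes
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Assumption: clip to the valid range for reporting purposes.
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


print(agresti_coull_interval(10, 40))  # hypothetical: 10 of 40 in the category
```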
10.5 Wald Interval (Not Recommended for Small Samples)
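The Wald interval is the classic textbook method:

$$\text{CI}_{\text{Wald}} = \hat{p} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Limitations: The Wald interval can produce lower bounds below 0 or upper bounds above 1 for extreme proportions, and it collapses to zero width when $\hat{p}$ is exactly 0 or 1. It has poor coverage properties for small $n$ or when $\hat{p}$ is near 0 or 1. Use Wilson or Agresti-Coull instead.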
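The two failure modes are easy to reproduce. The following sketch (illustrative function name, hypothetical counts) shows a negative lower bound for 1 case in 20 and a zero-width interval for 0 cases in 20.

```python
import math


def wald_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Classic Wald interval; shown only to illustrate its limitations."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


low, high = wald_interval(1, 20)
print(f"1/20: [{low:.3f}, {high:.3f}]")   # lower bound falls below 0
low, high = wald_interval(0, 20)
print(f"0/20: [{low:.3f}, {high:.3f}]")   # interval has zero width
```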
10.6 CI Method Comparison
| Method | Recommended For | Coverage | Notes |
|---|---|---|---|
| Wilson Score | General use; any $n$ | Excellent | Best default choice |
| Clopper-Pearson | Small $n$; conservative guarantee required | Conservative | Wider than necessary for large $n$ |
| Agresti-Coull | Simplicity; moderate to large $n$ | Very good | Slightly wider than Wilson |
| Wald | Large $n$; $\hat{p}$ not extreme | Good only for large $n$ | Fails for small $n$ or extreme $\hat{p}$ |
10.7 Simultaneous CIs for Multiple Proportions
When reporting CIs for all $k$ proportions simultaneously, the familywise confidence level is not maintained at 95%: each individual CI has 95% coverage, but the joint coverage is lower. To maintain joint 95% coverage:
Bonferroni-adjusted CI: Use $z_{\alpha/(2k)}$ instead of $z_{\alpha/2}$.
For $k$ categories at 95% joint confidence, each interval is therefore computed at the $1 - 0.05/k$ confidence level.
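The adjusted critical value can be obtained from the standard normal quantile function. This is a small sketch with an illustrative function name; the choice of $k = 4$ in the usage line is only an example.

```python
from statistics import NormalDist


def bonferroni_z(k: int, joint_confidence: float = 0.95) -> float:
    """Critical z for k simultaneous two-sided intervals (Bonferroni)."""
    alpha = 1 - joint_confidence
    # Each interval gets alpha / k, split over two tails.
    return NormalDist().inv_cdf(1 - alpha / (2 * k))


print(f"k = 1: z = {bonferroni_z(1):.3f}")
print(f"k = 4: z = {bonferroni_z(4):.3f}")
```

The returned value can be passed as the `z` argument of any of the interval formulas in this section.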
10.8 CI Width as a Function of $n$ and $p$
The half-width of the Wilson 95% CI is approximately $1.96\sqrt{p(1-p)/n}$, which is maximised at $p = 0.5$.
Approximate CI half-width (the ± margin) for $p = 0.5$:
| $n$ | Approximate 95% CI Half-Width |
|---|---|
| 20 | ±0.219 |
| 50 | ±0.139 |
| 100 | ±0.098 |
| 200 | ±0.069 |
| 500 | ±0.044 |
| 1000 | ±0.031 |
| 5000 | ±0.014 |
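The same approximation can be turned around for sample-size planning. The sketch below (illustrative function names) computes the approximate half-width for a given $n$ and the smallest $n$ that achieves a target margin, assuming the large-sample formula $z\sqrt{p(1-p)/n}$.

```python
import math


def approx_half_width(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate CI half-width z * sqrt(p(1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)


def required_n(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n whose approximate half-width does not exceed `margin`."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)


for n in (20, 100, 1000):
    print(n, round(approx_half_width(n), 3))
print(required_n(0.05))  # n needed for a margin of about 5 percentage points
```

Using $p = 0.5$ is the conservative choice, because it gives the widest interval for any $n$.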
11. Advanced Topics
11.1 Conditional Frequency Tables and Subgroup Comparisons
When a categorical outcome variable is examined across levels of a grouping variable, the result is a conditional frequency table (cross-tabulation). Each row or column shows the distribution of the outcome variable within a subgroup:

$$p_{j \mid g} = \frac{f_{jg}}{n_g}$$

Here $f_{jg}$ is the count in outcome category $j$ within group $g$, and $n_g$ is the number of valid observations in group $g$. Comparing $p_{j \mid g}$ across groups reveals whether the distribution of categories differs between subgroups. Formal inferential testing of such differences is the domain of the chi-square test of association.
⚠️ When comparing conditional distributions across groups, report within-group percentages (row percentages when groups define rows), not overall percentages. Reporting overall percentages when groups differ in size produces misleading comparisons.
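A sketch of within-group (row) percentages, using the grade counts from Example 4 of this tutorial. The nested-dictionary layout and the function name are illustrative choices, not DataStatPro's data format.

```python
def row_percentages(table: dict[str, dict[str, int]]) -> dict[str, dict[str, float]]:
    """Convert counts to within-group percentages (each row sums to 100)."""
    result = {}
    for group, counts in table.items():
        group_total = sum(counts.values())
        result[group] = {cat: 100 * c / group_total for cat, c in counts.items()}
    return result


grades = {
    "Lecture": {"A": 12, "B": 23, "C": 22, "D/F": 13},
    "Flipped": {"A": 24, "B": 27, "C": 16, "D/F": 3},
    "Online": {"A": 9, "B": 17, "C": 28, "D/F": 16},
}
for method, pcts in row_percentages(grades).items():
    print(method, {cat: round(p, 1) for cat, p in pcts.items()})
```

Dividing by the group total rather than the grand total is what keeps the comparison fair when groups differ in size.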
11.2 Standardisation and Reweighting
When comparing categorical distributions across samples with different population structures (e.g., different age compositions), direct standardisation weights each group's proportions to a common reference population:

$$p_j^{\text{std}} = \sum_{g} w_g \, p_{j \mid g}$$

Where $w_g$ is the proportion of the reference population in group $g$ (so that $\sum_g w_g = 1$) and $p_{j \mid g}$ is the observed proportion in category $j$ within group $g$. This removes the confounding effect of group composition and enables fair comparisons across samples.
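Direct standardisation is a weighted average of the group-specific proportions. The sketch below uses made-up proportions for two age groups and equal reference weights; all names and numbers are hypothetical.

```python
def direct_standardise(group_props: dict[str, dict[str, float]],
                       ref_weights: dict[str, float]) -> dict[str, float]:
    """Weight each group's category proportions by reference-population shares."""
    if abs(sum(ref_weights.values()) - 1.0) > 1e-9:
        raise ValueError("reference weights must sum to 1")
    categories = next(iter(group_props.values())).keys()
    return {
        cat: sum(ref_weights[g] * group_props[g][cat] for g in ref_weights)
        for cat in categories
    }


# Hypothetical within-group proportions and reference population shares.
props = {
    "Under 40": {"Yes": 0.60, "No": 0.40},
    "40 and over": {"Yes": 0.20, "No": 0.80},
}
weights = {"Under 40": 0.5, "40 and over": 0.5}
print(direct_standardise(props, weights))
```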
11.3 Goodness-of-Fit: Comparing to a Theoretical Distribution
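When a theoretical or historical distribution exists for a variable, the observed proportions can be tested against it using the chi-square goodness-of-fit test:

$$\chi^2 = \sum_{j=1}^{k} \frac{(f_j - n\pi_{0j})^2}{n\pi_{0j}}$$

With $k - 1$ degrees of freedom, where $f_j$ is the observed count in category $j$ and $\pi_{0j}$ is the reference proportion for that category. DataStatPro integrates this test directly into the Categorical Descriptives output when a reference distribution is supplied.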
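The statistic and its standardised (Pearson) residuals can be computed directly. This sketch tests the transport-mode counts from Example 1 against a uniform reference distribution, which is chosen here purely for illustration. The Python standard library has no chi-square distribution, so the p-value is left to a statistics package.

```python
import math


def chi_square_gof(observed: list[int], reference_props: list[float]):
    """Return (statistic, degrees of freedom, Pearson residuals)."""
    n = sum(observed)
    expected = [n * p for p in reference_props]
    residuals = [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]
    statistic = sum(r**2 for r in residuals)
    return statistic, len(observed) - 1, residuals


counts = [102, 78, 41, 26]         # Car, Public Transport, Cycling, Walking
uniform = [0.25, 0.25, 0.25, 0.25]  # illustrative reference distribution
stat, df, resid = chi_square_gof(counts, uniform)
print(f"chi-square = {stat:.2f}, df = {df}")
print([round(r, 2) for r in resid])
```

Large positive or negative residuals indicate which categories are over- or under-represented relative to the reference.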
11.4 Detecting Digit Preference and Response Bias
In survey data, systematic response biases cause disproportionate selection of certain categories:
- Acquiescence bias: Tendency to agree regardless of question content, inflating higher Likert categories.
- Centrality bias: Over-selection of the neutral/middle category.
- Extremity bias: Over-selection of the highest and lowest categories.
- Social desirability bias: Over-reporting of socially preferred responses.
Detection methods include comparing the observed distribution to an expected uniform distribution and inspecting standardised residuals from a goodness-of-fit test.
11.5 Benford's Law for Categorical First Digits
In datasets with naturally occurring numeric counts (e.g., city populations, financial transaction amounts), Benford's Law predicts that the first significant digit $d$ follows the distribution:

$$P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \dots, 9$$
Significant departures from Benford's Law — assessed via a chi-square goodness-of-fit test on the first-digit frequency distribution — can flag data fabrication or anomalies in certain contexts.
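A sketch of the expected Benford proportions alongside observed first-digit proportions. The helper names are illustrative, and the powers of 2 are used only as a convenient sequence known to follow Benford's Law closely.

```python
import math
from collections import Counter


def benford_expected() -> dict[int, float]:
    """Expected first-digit proportions under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_significant_digit(x: float) -> int:
    # In scientific notation the first character is the first significant digit.
    return int(f"{abs(x):.15e}"[0])


def first_digit_proportions(values) -> dict[int, float]:
    digits = [first_significant_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


observed = first_digit_proportions([2**i for i in range(1, 101)])
expected = benford_expected()
for d in range(1, 10):
    print(d, f"observed {observed[d]:.2f}", f"expected {expected[d]:.3f}")
```

The observed counts and expected proportions can then be fed into a chi-square goodness-of-fit test.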
11.6 Entropy-Based Feature Selection
In machine learning and data science, information gain and related entropy-based metrics use Shannon entropy to assess the predictive value of a categorical feature $X$ for an outcome variable $Y$:

$$IG(Y, X) = H(Y) - H(Y \mid X)$$

Where $H(Y \mid X)$ is the conditional entropy of $Y$ given $X$. Features with high information gain are more useful predictors. DataStatPro reports $H(Y)$, $H(Y \mid X)$, and $IG(Y, X)$ in the advanced output panel when an outcome variable is specified.
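Information gain follows directly from two entropy calculations. The sketch below is a plain-Python illustration with hypothetical data; it is not a description of how DataStatPro computes the quantity.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy in bits of a sequence of category labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, outcome) -> float:
    """IG(Y, X) = H(Y) - H(Y | X) for two equal-length label sequences."""
    n = len(outcome)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, outcome) if x == value]
        conditional += len(subset) / n * entropy(subset)  # weighted by P(X = value)
    return entropy(outcome) - conditional


outcome = ["yes", "yes", "no", "no"]
print(information_gain(["a", "a", "b", "b"], outcome))  # feature separates the outcome
print(information_gain(["a", "b", "a", "b"], outcome))  # feature carries no information
```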
11.7 Sampling Weights and Complex Survey Design
Nationally representative surveys typically use complex sampling designs (stratification, clustering, unequal selection probabilities). In such cases:
- Design-weighted proportions correctly estimate population proportions.
- Unweighted proportions estimate the sample distribution only.
- Variance estimation must account for the sampling design (Taylor linearisation or bootstrap replication), not just the binomial formula.
DataStatPro supports Taylor-linearised variance estimation for weighted proportions when design variables (stratum, cluster, weight) are specified.
11.8 Temporal Trends in Categorical Variables
When the same categorical variable is measured at multiple time points, tracking the change in proportions over time reveals trends. Visualisation options include:
- Line chart of proportions over time (one line per category).
- Stacked area chart (visualises changing composition).
- Small multiples (one bar chart per time point).
Formal testing of temporal trends in proportions can be done using the Cochran-Armitage trend test (for binary outcomes) or regression models for categorical outcomes.
11.9 Inter-Rater Reliability for Categorical Classifications
When the same observations are classified independently by two or more raters, the agreement between raters is quantified by Cohen's kappa ($\kappa$, for two raters) or Fleiss' kappa (for three or more raters):

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Where $p_o$ is the observed agreement proportion (sum of diagonal proportions in the agreement table) and $p_e$ is the expected agreement by chance.
| $\kappa$ | Verbal Label |
|---|---|
| $< 0.00$ | Less than chance |
| $0.00$–$0.20$ | Slight |
| $0.21$–$0.40$ | Fair |
| $0.41$–$0.60$ | Moderate |
| $0.61$–$0.80$ | Substantial |
| $0.81$–$1.00$ | Almost perfect |
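Cohen's kappa for two raters can be sketched as follows. The ratings are hypothetical (50 items rated "yes" or "no"), and the function assumes the two raters do not agree purely by construction, since $p_e = 1$ would make the denominator zero.

```python
from collections import Counter


def cohens_kappa(rater1, rater2) -> float:
    """Cohen's kappa for two equal-length sequences of category labels."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of the two raters' marginal proportions.
    p_e = sum(c1[cat] * c2[cat] for cat in set(c1) | set(c2)) / n**2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical agreement table: 20 yes/yes, 5 yes/no, 10 no/yes, 15 no/no.
rater1 = ["yes"] * 25 + ["no"] * 25
rater2 = ["yes"] * 20 + ["no"] * 5 + ["yes"] * 10 + ["no"] * 15
print(round(cohens_kappa(rater1, rater2), 2))
```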
12. Worked Examples
Example 1: Nominal Variable — Preferred Mode of Transport ($k = 4$)
A transport planning survey collects preferred commuting mode from $n = 250$ adults. There are 3 missing responses. The valid responses ($n_{\text{valid}} = 247$) are:
| Mode | Count |
|---|---|
| Car | 102 |
| Public Transport | 78 |
| Cycling | 41 |
| Walking | 26 |
Step 1 — Frequencies and Proportions:
| Mode | $f_j$ | $p_j$ | % | 95% Wilson CI |
|---|---|---|---|---|
| Car | 102 | .413 | 41.3% | [35.3%, 47.5%] |
| Public Transport | 78 | .316 | 31.6% | [26.1%, 37.6%] |
| Cycling | 41 | .166 | 16.6% | [12.5%, 21.7%] |
| Walking | 26 | .105 | 10.5% | [7.3%, 15.0%] |
| Total (valid) | 247 | 1.000 | 100% | |
| Missing | 3 | | | |
Step 2 — Mode:
Mode = Car ($f = 102$, $p = .413$). Unimodal.
Step 3 — Heterogeneity Measures:

$$H = -\sum_{j} p_j \log_2 p_j \approx 1.82 \text{ bits}, \qquad H_{\text{norm}} = \frac{H}{\log_2 4} = \frac{1.82}{2} \approx 0.91$$

$$IQV = \frac{k}{k-1}\left(1 - \sum_{j} p_j^2\right) = \frac{4}{3}\,(1 - .309) \approx 0.92$$

Interpretation: The distribution shows moderate concentration: car is the dominant mode (41.3%), but a substantial minority use public transport (31.6%). The very high $H_{\text{norm}}$ and $IQV$ indicate a highly diverse distribution, with responses spread across all four categories.
APA write-up: "Among 247 valid respondents (3 missing), the most frequently preferred commuting mode was car ($n = 102$, 41.3%, 95% CI [35.3%, 47.5%]), followed by public transport ($n = 78$, 31.6%, 95% CI [26.1%, 37.6%]), cycling ($n = 41$, 16.6%, 95% CI [12.5%, 21.7%]), and walking ($n = 26$, 10.5%, 95% CI [7.3%, 15.0%]). The distribution showed high heterogeneity (Shannon entropy = 1.82 bits, $H_{\text{norm}}$ = 0.91, IQV = 0.92)."
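The heterogeneity figures reported in this example can be reproduced from the four counts. The sketch below uses the standard definitions of each measure; the function name and dictionary keys are illustrative, and at least two categories are assumed (otherwise $\log_2 k = 0$).

```python
import math


def heterogeneity(counts: list[int]) -> dict[str, float]:
    """Variation ratio, entropy (bits), normalised entropy, HHI, Simpson's D, IQV."""
    n, k = sum(counts), len(counts)
    props = [c / n for c in counts]
    # Zero-count categories contribute nothing (convention: 0 * log2(0) = 0).
    h = -sum(p * math.log2(p) for p in props if p > 0)
    sum_sq = sum(p * p for p in props)
    return {
        "VR": 1 - max(props),
        "H": h,
        "H_norm": h / math.log2(k),
        "HHI": sum_sq,
        "D": 1 - sum_sq,
        "IQV": k / (k - 1) * (1 - sum_sq),
    }


transport = [102, 78, 41, 26]  # Car, Public Transport, Cycling, Walking
for name, value in heterogeneity(transport).items():
    print(f"{name}: {value:.3f}")
```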
Example 2: Ordinal Variable — Patient Satisfaction (5-Point Scale)
A hospital surveys $n = 180$ patients on their satisfaction with care (5-point ordinal scale: Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied). There are no missing values.
| Satisfaction | $f_j$ | % | Cumulative $f$ | Cumulative % | 95% Wilson CI |
|---|---|---|---|---|---|
| Very Dissatisfied | 8 | 4.4% | 8 | 4.4% | [2.3%, 8.5%] |
| Dissatisfied | 19 | 10.6% | 27 | 15.0% | [6.9%, 15.9%] |
| Neutral | 28 | 15.6% | 55 | 30.6% | [11.0%, 21.6%] |
| Satisfied | 72 | 40.0% | 127 | 70.6% | [33.1%, 47.3%] |
| Very Satisfied | 53 | 29.4% | 180 | 100.0% | [23.3%, 36.5%] |
| Total | 180 | 100% | | | |
Mode: Satisfied ($f = 72$, $p = .400$).
Median: The cumulative percentage first reaches 50% at "Satisfied" ($70.6\% \geq 50\%$). Median = Satisfied.
Quartiles:
- $Q_1$ (25th percentile): First category where cumulative % $\geq 25\%$ → $30.6\%$ → Neutral
- $Q_3$ (75th percentile): First category where cumulative % $\geq 75\%$ → $100.0\%$ → Very Satisfied
- IQR: Neutral to Very Satisfied (spans 3 category levels)
Heterogeneity:

$$H = -\sum_{j} p_j \log_2 p_j \approx 2.01 \text{ bits}, \qquad H_{\text{norm}} = \frac{H}{\log_2 5} \approx 0.86, \qquad IQV = \frac{5}{4}\,(1 - .284) \approx 0.89$$

Interpretation: The distribution is negatively skewed: responses are concentrated in the higher satisfaction categories, with a tail toward dissatisfaction. The median and mode both fall at "Satisfied", and over 69% of patients rated their care as Satisfied or Very Satisfied. The $IQV$ of 0.89 indicates moderately high variability across response options.
APA write-up: "Patient satisfaction ratings ($N = 180$) showed a negatively skewed distribution. The modal response was Satisfied ($n = 72$, 40.0%, 95% CI [33.1%, 47.3%]) and the median was Satisfied ($Q_1$ = Neutral; $Q_3$ = Very Satisfied). A combined 69.4% of patients rated their care as Satisfied or Very Satisfied. The distribution showed moderately high heterogeneity ($H_{\text{norm}} = 0.86$, $IQV = 0.89$)."
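The median and quartile categories in this example follow a simple rule: take the first category whose cumulative proportion reaches the target. A sketch with an illustrative function name, using the satisfaction counts above:

```python
def ordinal_quantile(categories: list[str], counts: list[int], q: float) -> str:
    """First category (in rank order) whose cumulative proportion is >= q."""
    n = sum(counts)
    cumulative = 0
    for category, count in zip(categories, counts):
        cumulative += count
        if cumulative / n >= q:
            return category
    return categories[-1]  # guards against floating-point edge cases at q = 1


levels = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]
freqs = [8, 19, 28, 72, 53]
for label, q in (("Q1", 0.25), ("Median", 0.50), ("Q3", 0.75)):
    print(label, "=", ordinal_quantile(levels, freqs, q))
```

The categories must be supplied in their meaningful rank order; the rule is not defined for nominal variables.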
Example 3: Binary Variable — Vaccination Status
A public health register records vaccination status (vaccinated/not vaccinated) for $n = 1{,}192$ individuals, of whom $n_{\text{valid}} = 1{,}178$ have a valid record.
| Status | $f_j$ | % | 95% Wilson CI |
|---|---|---|---|
| Vaccinated | 934 | 79.3% | [76.9%, 81.5%] |
| Not Vaccinated | 244 | 20.7% | [18.5%, 23.1%] |
| Total (valid) | 1,178 | 100% | |
| Missing | 14 | 1.2% | |
Mode: Vaccinated ($p = .793$).
Binary entropy:

$$H = -\left[\,.793 \log_2(.793) + .207 \log_2(.207)\,\right] \approx 0.736 \text{ bits}$$

Interpretation: Approximately 79.3% of individuals with a valid record are vaccinated, with the 95% CI indicating that the true coverage is plausibly between 76.9% and 81.5%. This falls below the 95% coverage level commonly cited as a herd immunity threshold for highly contagious diseases such as measles. The low variation ratio ($VR = 1 - .793 = .207$) and moderate $IQV$ (0.657) confirm that one category (vaccinated) dominates, but a non-negligible 20.7% remain unvaccinated.
APA write-up: "Of 1,178 individuals with valid vaccination records (14 missing, 1.2%), 934 (79.3%, 95% CI [76.9%, 81.5%]) were vaccinated and 244 (20.7%, 95% CI [18.5%, 23.1%]) were not vaccinated. Coverage fell below the 95% target threshold."
Example 4: Subgroup Breakdown — Grade Distribution by Teaching Method
Building on the teaching method data from the chi-square tutorial, a researcher reports the grade distribution for each teaching method ($n = 210$, 70 per method).
Conditional Frequency Table (Row Percentages):
| Method | A | B | C | D/F | Total |
|---|---|---|---|---|---|
| Lecture | 12 (17.1%) | 23 (32.9%) | 22 (31.4%) | 13 (18.6%) | 70 |
| Flipped | 24 (34.3%) | 27 (38.6%) | 16 (22.9%) | 3 (4.3%) | 70 |
| Online | 9 (12.9%) | 17 (24.3%) | 28 (40.0%) | 16 (22.9%) | 70 |
Modes: Lecture = B; Flipped = B; Online = C.
Medians: Lecture = B; Flipped = B; Online = C.
Shannon Entropy by Group:
| Method | $H$ (bits) | $H_{\text{norm}}$ | $VR$ | Interpretation |
|---|---|---|---|---|
| Lecture | 1.940 | 0.970 | 0.671 | High variability; grades spread across all levels |
| Flipped | 1.741 | 0.871 | 0.614 | Moderate variability; concentrated at upper grades |
| Online | 1.892 | 0.946 | 0.600 | High variability; concentrated at lower grades |
Interpretation: All three methods show moderate-to-high grade variability. The flipped classroom has the highest proportion of A grades (34.3%) and the lowest D/F rate (4.3%), while the online method shows the highest C and D/F rates. Formal testing of these differences is provided by the chi-square test of association.
13. Common Mistakes and How to Avoid Them
Mistake 1: Computing a Mean or Standard Deviation for a Nominal Variable
Problem: Assigning arbitrary numeric codes to categories (e.g., 1 = Male, 2 = Female, 3 = Non-binary) and computing their mean or standard deviation. The resulting number is arithmetically computable but statistically meaningless — the numeric codes carry no magnitude information.
Solution: For nominal variables, report only frequency, proportion, and mode. If a numeric summary of a categorical variable is needed for modelling, create appropriate dummy/indicator variables.
Mistake 2: Treating Ordinal Variables as Fully Continuous
Problem: Computing the arithmetic mean of ordinal scale responses (e.g., mean Likert score = 3.47) as if the intervals between categories were equal. The mean assumes equal spacing; ordinal categories have no such guarantee.
Solution: For ordinal variables, report the median and IQR as the primary central tendency and spread measures. Report the mode as a supplementary measure. Computing means of ordinal variables is acceptable as a pragmatic convention in some fields (notably Likert-scale research), but must be explicitly acknowledged and defended.
Mistake 3: Failing to Report Missing Values
Problem: Computing and reporting proportions from valid observations only, without disclosing the number of missing values. This gives readers no way to assess whether missingness is substantial enough to bias the results.
Solution: Always report both $n_{\text{valid}}$ and $n_{\text{miss}}$ (and the missing rate). Investigate whether missingness is systematic. Report results for both complete-case and missing-included analyses when missingness is substantial.
Mistake 4: Reporting Only Counts Without Proportions (or Vice Versa)
Problem: Reporting only absolute frequencies makes comparisons across groups of different sizes misleading. Reporting only proportions without counts obscures the precision of estimates (a proportion of 50% based on a handful of observations is very different from one based on several hundred).
Solution: Always report both absolute frequency and proportion (or percentage) in frequency tables. Include $n$ so readers can recover the raw counts from percentages.
Mistake 5: Using a Pie Chart for More Than 5 Categories
Problem: A pie chart with 6 or more slices becomes illegible. Slices of similar size are virtually indistinguishable, and small categories vanish. Over-reliance on pie charts is one of the most widely cited visualisation errors.
Solution: For more than five categories, use a horizontal bar chart sorted by frequency or a lollipop chart. Reserve pie charts for five or fewer categories when part-to-whole relationships are the primary message.
Mistake 6: Ignoring Category Order for Ordinal Variables
Problem: Sorting ordinal categories alphabetically or by frequency rather than in their natural rank order (e.g., displaying satisfaction responses as: High, Low, Medium). This disrupts the cumulative frequency interpretation, makes cumulative plots meaningless, and confuses readers.
Solution: For ordinal variables, always display categories in their meaningful rank order. DataStatPro enforces the user-specified rank ordering in all tables and charts when the variable is designated as ordinal.
Mistake 7: Reporting the Wald Interval for Small Samples or Extreme Proportions
Problem: The Wald CI can produce bounds below 0 or above 1, and has poor coverage when $n$ is small or when $\hat{p}$ is close to 0 or 1.
Solution: Use the Wilson score interval as the default for all sample sizes. Use the Clopper-Pearson exact interval when a conservative guarantee is required. DataStatPro defaults to Wilson intervals.
Mistake 8: Comparing Distributions Using Overall Proportions Instead of Conditional Proportions
Problem: When comparing the distribution of a variable across groups of unequal size, reporting overall proportions conflates the group composition with the variable distribution. For example, if 90% of respondents are female and more females prefer Brand A, overall Brand A preference will be high not because it is universally preferred, but because females are overrepresented.
Solution: Always compute and report conditional (within-group) proportions when comparing distributions across groups. Use row percentages when groups are defined by rows.
Mistake 9: Conflating Diversity Measures Across Variables with Different Numbers of Categories
Problem: Comparing the raw Shannon entropy of a 3-category variable ($H_{\max} = \log_2 3 \approx 1.585$ bits) with that of a 6-category variable ($H_{\max} = \log_2 6 \approx 2.585$ bits). A higher raw $H$ may simply reflect more categories, not greater relative diversity.
Solution: Compare variables using normalised entropy $H_{\text{norm}}$ or IQV, both of which are bounded in $[0, 1]$ regardless of $k$.
Mistake 10: Concluding Category Absence from a Zero Frequency
Problem: A category with $f_j = 0$ in the sample may exist in the population but was not observed due to small sample size or sampling variability. Declaring the category "absent" may be premature.
Solution: Report zero-frequency categories explicitly in the table. Compute the 95% CI for $\pi_j$ using the Clopper-Pearson upper bound for a zero count: $\pi_U = 1 - (\alpha/2)^{1/n}$ with $\alpha = 0.05$, which gives a plausible upper bound for the true proportion.
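For a category with zero observed cases, the exact upper bound has a closed form, so no iterative solver is needed. The sketch below (illustrative function name) gives the two-sided bound and, as an assumption-labelled alternative, the one-sided version, which is close to the familiar "rule of three" value $3/n$.

```python
def zero_count_upper_bound(n: int, alpha: float = 0.05, two_sided: bool = True) -> float:
    """Exact binomial upper bound for a category with zero observed cases."""
    tail = alpha / 2 if two_sided else alpha
    # Solves (1 - p) ** n = tail for p.
    return 1 - tail ** (1 / n)


n = 100  # hypothetical sample size
print(f"two-sided 95% upper bound: {zero_count_upper_bound(n):.4f}")
print(f"one-sided 95% upper bound: {zero_count_upper_bound(n, two_sided=False):.4f}")
print(f"rule of three (3/n):       {3 / n:.4f}")
```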
14. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| Proportions do not sum to 1.000 | Rounding in display; missing values excluded | Report "rounding may cause totals to differ from 100%"; verify $\sum_j f_j = n_{\text{valid}}$ |
| More categories than expected | Label inconsistencies (case, spaces, synonyms) | Use DataStatPro's category merge/clean tool; standardise labels |
| Mode is reported as multiple categories | Two or more categories share the maximum frequency | Report all co-modes and describe as a bimodal or multimodal distribution |
| Cumulative percentage does not reach 100% | Missing category or rounding | Verify all categories are included; check that the final cumulative frequency equals $n_{\text{valid}}$ |
| Confidence interval lower bound is negative (Wald) | Small $n$ or extreme $\hat{p}$ with Wald method | Switch to Wilson or Clopper-Pearson interval; both are bounded within $[0, 1]$ |
| $H = 0$ or $IQV = 0$ | All observations in one category | Expected result; no variation present; report the single observed category as the mode |
| $H_{\text{norm}} = 1$ or $IQV = 1$ | Perfectly uniform distribution | All categories equally represented; may reflect a small sample or genuine uniformity |
| $H_{\text{norm}} > 1$ or $IQV > 1$ | Computation error or incorrect $k$ | Verify formula; both measures are bounded in $[0, 1]$ by construction |
| Median not defined | Perfectly even split at a boundary (cumulative % equals exactly 50% at the end of one category, so the median falls between two adjacent categories) | Report median as the lower of the two surrounding categories; use interpolation if numeric scores assigned |
| Weighted proportions differ greatly from unweighted | Significant over/undersampling of some groups | Expected in complex surveys; report both; weighted estimates are for population inference |
| All cells have very small counts | Very small total $n$ | Report counts only; CIs will be very wide; caution against over-interpreting proportions |
| Chart shows categories in wrong order | Default alphabetical sorting applied to ordinal variable | Specify ordinal order in DataStatPro's variable settings; reorder manually in chart editor |
| Missing rate is very high | Non-response bias; data collection issue | Investigate MCAR/MAR/MNAR mechanism; perform sensitivity analysis; consider imputation |
| Entropy calculated as negative or undefined | Sign error in the formula, or a $0 \log 0$ term not set to 0 | Check the leading minus sign; apply the limit convention $0 \log_2 0 = 0$ (since $\lim_{p \to 0^+} p \log_2 p = 0$); check software implementation |
15. Quick Reference Cheat Sheet
Core Equations
| Formula | Description |
|---|---|
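| $p_j = f_j / n$ | Sample proportion in category $j$ |
| $\%_j = 100 \times p_j$ | Percentage in category $j$ |
| $F_j = \sum_{i \leq j} f_i$ | Cumulative frequency up to category $j$ (ordinal) |
| $\text{Mode} = \arg\max_j f_j$ | Mode: most frequent category |
| $\text{Median} = \min\{\, j : F_j / n \geq 0.5 \,\}$ | Median category (ordinal only) |
| $VR = 1 - p_{\text{mode}}$ | Variation ratio |
| $H = -\sum_{j} p_j \log_2 p_j$ | Shannon entropy (bits) |
| $H_{\text{norm}} = H / \log_2 k$ | Normalised entropy |
| $HHI = \sum_{j} p_j^2$ | Herfindahl–Hirschman Index |
| $D = 1 - \sum_{j} p_j^2$ | Simpson's diversity index |
| $IQV = \dfrac{k}{k-1}\left(1 - \sum_{j} p_j^2\right)$ | Index of Qualitative Variation |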
| Wilson CI: see Section 10.2 | Recommended 95% CI for proportions |
Measure Applicability by Scale
| Measure | Nominal | Ordinal |
|---|---|---|
| Frequency, proportion, % | ✅ | ✅ |
| Cumulative frequency / % | ❌ | ✅ |
| Mode | ✅ | ✅ |
| Median | ❌ | ✅ |
| Quartiles / IQR | ❌ | ✅ |
| $VR$, $H$, $H_{\text{norm}}$, $HHI$, $D$, $IQV$ | ✅ | ✅ |
| Mean, SD | ❌ | ❌ (unless numeric scores assigned) |
Chart Selection Guide
| Variable Type | Number of Categories | Audience | Recommended Chart |
|---|---|---|---|
| Nominal | 2–5 | Technical | Vertical bar chart |
| Nominal | 2–4 | General | Pie chart / waffle chart |
| Nominal | 6+ | Any | Horizontal bar / lollipop |
| Ordinal (non-Likert) | Any | Any | Bar chart (ordered) |
| Ordinal (Likert) | 4–7 | Any | Diverging bar chart |
| Binary | 2 | General | Single bar / donut / waffle |
| Grouped comparisons | 3–6 × 2–4 | Technical | Grouped / stacked bar |
| Trend over time | Any | Any | Line chart of proportions |
Heterogeneity Benchmarks ()
| Diversity Level | |
|---|---|
| Very low (near-complete concentration) | |
| Low | |
| Moderate | |
| High | |
| Very high (near-uniform) |
Confidence Interval Method Selection
| Situation | Recommended Method |
|---|---|
| General use; any $n$ | Wilson score interval |
| Conservative guarantee; small $n$ | Clopper-Pearson exact |
| Simplicity; moderate to large $n$ | Agresti-Coull |
| Large $n$; $\hat{p}$ not extreme | Wald (acceptable but not preferred) |
| Joint CIs for all $k$ proportions | Bonferroni-adjusted Wilson |
Approximate Wilson 95% CI Half-Width ($p = 0.5$)
| $n$ | ± Half-Width |
|---|---|
| 20 | ±0.219 |
| 50 | ±0.139 |
| 100 | ±0.098 |
| 200 | ±0.069 |
| 500 | ±0.044 |
| 1000 | ±0.031 |
APA 7th Edition Reporting Templates
Single nominal variable: "The most frequently reported [variable name] was [modal category] ($n$ = [value], [%]%, 95% CI [[LB]%, [UB]%]), followed by [second category] ($n$ = [value], [%]%, 95% CI [[LB]%, [UB]%]). The full distribution is presented in Table X."
Single ordinal variable: "Responses to [variable name] were distributed as follows: [lowest category] ($n$ = [value], [%]%), …, [highest category] ($n$ = [value], [%]%). The median response was [median category] ($Q_1$ = [Q1 category]; $Q_3$ = [Q3 category])."
Binary variable: "A total of [f] participants ([%]%, 95% CI [[LB]%, [UB]%]) reported [positive category]; the remaining [f] ([%]%, 95% CI [[LB]%, [UB]%]) reported [negative category]."
With missing data: "Of [N total] participants, [N valid] provided valid responses ([N miss] missing, [miss %]%). Among valid responses, …"
With heterogeneity measures: "The distribution showed [low / moderate / high] heterogeneity (Shannon entropy = [value] bits, $H_{\text{norm}}$ = [value], IQV = [value])."
With subgroup comparison: "Conditional frequency distributions by [group variable] are presented in Table X. [Category] was most prevalent in [subgroup] ([%]%, 95% CI [[LB]%, [UB]%]) compared to [other subgroup] ([%]%, 95% CI [[LB]%, [UB]%])."
Reporting Checklist
| Item | Required |
|---|---|
| Valid $n$ and missing $n$ (with missing rate) | ✅ Always |
| All category labels clearly defined | ✅ Always |
| Absolute frequencies for all categories | ✅ Always |
| Percentages for all categories | ✅ Always |
| Proportions (or % totalling 100%) | ✅ Always |
| Mode (with indication of multimodality if present) | ✅ Always |
| Median and quartiles | ✅ For ordinal variables |
| Cumulative frequencies / percentages | ✅ For ordinal variables |
| 95% CI for all proportions (Wilson recommended) | ✅ Always |
| Shannon entropy ($H$ and $H_{\text{norm}}$) | ✅ When heterogeneity is of substantive interest |
| $VR$ or $IQV$ | ✅ When heterogeneity is of substantive interest |
| HHI / Simpson's | ✅ For concentration / diversity analyses |
| Appropriate chart with axis labels and title | ✅ Always |
| Reference to published frequency table in text | ✅ Always |
| Measurement scale stated (nominal / ordinal) | ✅ Always |
| Missing data mechanism discussed | ✅ When missingness is substantial |
| Weighted estimates (if survey data) | ✅ When design weights provided |
| CI method stated | ✅ When $n$ is small or any proportion is near 0 or 1 |
| Goodness-of-fit against reference distribution | ✅ When comparing to theoretical or historical baseline |
| Subgroup conditional distributions | ✅ When a grouping variable is present |
This tutorial provides a comprehensive foundation for understanding, computing, interpreting, visualising, and reporting categorical descriptive statistics within the DataStatPro application. For further reading, consult Agresti's "An Introduction to Categorical Data Analysis" (3rd ed., 2018), Tukey's "Exploratory Data Analysis" (1977), Wickham's "ggplot2: Elegant Graphics for Data Analysis" (3rd ed., 2024) for visualisation principles, and Shannon & Weaver's "The Mathematical Theory of Communication" (1949) for entropy foundations. For feature requests or support, contact the DataStatPro team.