
Kruskal-Wallis H Test

Comprehensive reference guide for Kruskal-Wallis H test (non-parametric alternative to one-way ANOVA).

Kruskal-Wallis Test: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of non-parametric inference for multiple independent groups all the way through the mathematics, assumptions, effect sizes, post-hoc testing, interpretation, reporting, and practical usage of the Kruskal-Wallis Test within the DataStatPro application. Whether you are encountering the Kruskal-Wallis Test for the first time or seeking a rigorous understanding of rank-based multi-group comparison, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is the Kruskal-Wallis Test?
  3. The Mathematics Behind the Kruskal-Wallis Test
  4. Assumptions of the Kruskal-Wallis Test
  5. Variants of the Kruskal-Wallis Test
  6. Using the Kruskal-Wallis Test Calculator Component
  7. Full Step-by-Step Procedure
  8. Effect Sizes for the Kruskal-Wallis Test
  9. Post-Hoc Tests and Pairwise Comparisons
  10. Confidence Intervals
  11. Power Analysis and Sample Size Planning
  12. Advanced Topics
  13. Worked Examples
  14. Common Mistakes and How to Avoid Them
  15. Troubleshooting
  16. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into the Kruskal-Wallis Test, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 Parametric vs. Non-Parametric Inference for Multiple Groups

Parametric tests such as the one-way ANOVA assume that observations within each group come from normally distributed populations with equal variances. When these assumptions are met, parametric tests are optimal — they use the maximum amount of information from the data and achieve the highest possible statistical power.

Non-parametric tests replace raw data values with their ranks and make minimal assumptions about the shape of population distributions. They are more robust to violations of normality and the presence of outliers. The Kruskal-Wallis Test is the leading non-parametric alternative to the one-way between-subjects ANOVA for comparing three or more independent groups.

1.2 The Concept of Ranks and Rank Sums

Ranking transforms raw data values into their ordered positions. Given $N$ observations combined across all $K$ groups, rank them from 1 (smallest) to $N$ (largest).

By working with ranks rather than raw values, the Kruskal-Wallis Test is insensitive to outliers and to monotone transformations of the data, and requires no assumption about the shape of the raw-value distributions.

Example (note the tie at 3.4):

| Value | Group | Rank |
|-------|-------|------|
| 2.1 | A | 1.0 |
| 3.4 | B | 2.5 (midrank of positions 2 and 3) |
| 3.4 | A | 2.5 |
| 5.7 | C | 4.0 |
| 8.2 | B | 5.0 |
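Midranking as described above can be sketched in a few lines of Python (a minimal illustration, not DataStatPro's internal implementation):

```python
def midranks(values):
    """Assign ranks 1..N, giving tied values the average (mid) rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # find the run of tied values starting at sorted position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + 1 + j + 1) / 2  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return ranks

print(midranks([2.1, 3.4, 3.4, 5.7, 8.2]))  # → [1.0, 2.5, 2.5, 4.0, 5.0]
```

The two tied values at 3.4 occupy sorted positions 2 and 3 and both receive the midrank 2.5, matching the table above.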

1.3 The Chi-Squared Distribution

For large samples, the Kruskal-Wallis $H$ statistic follows a chi-squared distribution with $K-1$ degrees of freedom:

$H \sim \chi^2_{K-1}$ (approximately, for $n_j \geq 5$ per group)

The chi-squared distribution with $K-1$ degrees of freedom takes only non-negative values, is right-skewed, and has mean $K-1$ and variance $2(K-1)$.

Critical values for $\chi^2_{K-1}$:

| $K$ | $df = K-1$ | $\chi^2_{crit}$ ($\alpha = .05$) | $\chi^2_{crit}$ ($\alpha = .01$) |
|---|---|---|---|
| 3 | 2 | 5.991 | 9.210 |
| 4 | 3 | 7.815 | 11.345 |
| 5 | 4 | 9.488 | 13.277 |
| 6 | 5 | 11.070 | 15.086 |
| 8 | 7 | 14.067 | 18.475 |
| 10 | 9 | 16.919 | 21.666 |

1.4 The Null and Alternative Hypotheses

Under the location-shift (stochastic equivalence) model:

$H_0$: All $K$ population distributions are identical.

$H_1$: At least one population distribution is stochastically different from at least one other (tends to produce larger or smaller values).

More precisely (without the location-shift assumption):

$H_0$: $P(X_j > X_k) = 0.5$ for all pairs $j \neq k$ (stochastic equality)

$H_1$: $P(X_j > X_k) \neq 0.5$ for at least one pair $(j, k)$

When the population distributions have the same shape but potentially different locations (medians), the Kruskal-Wallis test is equivalent to testing equality of medians:

$H_0$: $\theta_1 = \theta_2 = \cdots = \theta_K$ (where $\theta_j$ is the median of group $j$)

1.5 Why Not Multiple Mann-Whitney Tests?

With $K$ groups, one could run all $\binom{K}{2}$ pairwise Mann-Whitney U tests. However, this inflates the familywise error rate (FWER):

$FWER = 1 - (1-\alpha)^m$

For $K = 4$ groups ($m = 6$ pairwise tests) at $\alpha = .05$: $FWER = 1 - (0.95)^6 = .265$

The Kruskal-Wallis omnibus test maintains the FWER at $\alpha$ for the simultaneous test of all group differences, after which post-hoc procedures control pairwise comparisons.
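The FWER inflation above is a one-line calculation:

```python
alpha, m = 0.05, 6  # K = 4 groups → m = K(K-1)/2 = 6 pairwise tests
fwer = 1 - (1 - alpha) ** m  # probability of at least one false positive
print(round(fwer, 3))  # → 0.265
```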

1.6 The Asymptotic Relative Efficiency

The Asymptotic Relative Efficiency (ARE) of the Kruskal-Wallis test relative to one-way ANOVA is $3/\pi \approx 0.955$ for normally distributed data — a negligible efficiency loss of approximately 5%. For non-normal distributions, the Kruskal-Wallis test can be substantially more powerful:

| Data Distribution | ARE (Kruskal-Wallis vs. ANOVA) |
|---|---|
| Normal | $3/\pi \approx 0.955$ |
| Uniform | 1.000 |
| Logistic | $\pi^2/9 \approx 1.097$ |
| Double exponential | 1.500 |
| Contaminated normal | $> 2.000$ |
| Heavy-tailed (Cauchy) | $\to \infty$ |

💡 The ARE of 0.955 means that for normally distributed data, the Kruskal-Wallis test requires approximately $1/0.955 \approx 1.047$ times as many observations as one-way ANOVA to achieve the same power — a cost of only about 5%. In exchange, the test remains valid and powerful under departures from normality. This makes it a safe default when normality is uncertain.

1.7 Statistical Significance vs. Practical Significance

Like the one-way ANOVA F-test, the Kruskal-Wallis test answers: "Is the observed rank-based difference across groups larger than what chance alone would produce?" It does not answer: "How large is the effect?"

Always report:

  1. The $H$ statistic (tie-corrected), degrees of freedom, and p-value.
  2. $\eta^2_H$ (or $\epsilon^2_H$) as an effect size measure.
  3. Group medians and interquartile ranges.
  4. Post-hoc pairwise comparisons with individual effect sizes ($r_{rb}$).

2. What is the Kruskal-Wallis Test?

2.1 The Core Idea

The Kruskal-Wallis Test (Kruskal & Wallis, 1952) is a non-parametric inferential procedure for testing whether $K \geq 3$ independent groups come from the same population distribution. It is the natural extension of the Mann-Whitney U test to three or more groups, and the non-parametric analogue of the one-way between-subjects ANOVA.

Rather than comparing group means (as ANOVA does), the Kruskal-Wallis test:

  1. Combines all $N$ observations across groups and ranks them from 1 to $N$.
  2. Computes the mean rank $\bar{R}_j$ for each group.
  3. Tests whether the mean ranks differ more than expected by chance under $H_0$ (which states all groups have the same distribution).
  4. Summarises the evidence in the $H$ statistic, which follows a $\chi^2$ distribution under $H_0$ for large samples.

Under $H_0$, if all groups have the same distribution, each group should have a mean rank close to the overall mean rank $(N+1)/2$. Large deviations of group mean ranks from the overall mean rank produce a large $H$ statistic, providing evidence against $H_0$.

2.2 When to Use the Kruskal-Wallis Test

The Kruskal-Wallis Test is appropriate when the DV is ordinal, or continuous but non-normal (skewed, heavy-tailed, or contaminated by outliers); the IV defines $K \geq 3$ independent groups; and every observation is independent of every other.

2.3 The Kruskal-Wallis Test vs. Related Procedures

| Situation | Appropriate Test |
|---|---|
| $K \geq 3$ groups, independent, normal, equal variances | One-way ANOVA |
| $K \geq 3$ groups, independent, normal, unequal variances | Welch's one-way ANOVA |
| $K \geq 3$ groups, independent, non-normal or ordinal | Kruskal-Wallis Test |
| $K = 2$ groups, independent, non-normal | Mann-Whitney U Test |
| $K \geq 3$ related conditions, non-normal | Friedman Test |
| $K \geq 3$ groups, very small samples, many ties | Permutation ANOVA |
| $K \geq 3$ groups, severely unequal shapes | Brunner-Munzel extension |

2.4 What the Kruskal-Wallis Test Tests

Under the standard location-shift assumption (all distributions have the same shape but potentially different locations), the Kruskal-Wallis test is a test of:

Equal population medians (or equivalently, equal location parameters).

Without the location-shift assumption (which should be checked — see Section 4.1), the test is more correctly described as a test of stochastic equality: whether one group tends to produce systematically larger values than another.

⚠️ A common misstatement is that the Kruskal-Wallis test always tests for equal medians. This is only true under the location-shift assumption (same shape across groups). If group distributions have different shapes, the test may reject $H_0$ even if all group medians are equal. Always state which interpretation applies based on the data.

2.5 Real-World Applications

| Field | Example | IV (Groups) | DV |
|---|---|---|---|
| Clinical Psychology | Anxiety severity across 4 diagnostic groups | 4 diagnoses | GAD-7 (ordinal) |
| Medicine | Pain relief across 5 acupuncture protocols | 5 protocols | NRS 0–10 (ordinal) |
| Education | Motivation across 3 teaching methods | 3 methods | Likert 1–5 |
| Marketing | Satisfaction across 4 product versions | 4 versions | Satisfaction rating |
| HR/OB | Job stress across 6 departments | 6 depts | Stress scale |
| Ecology | Species diversity across 5 habitats | 5 habitat types | Richness index |
| Pharmacology | Adverse event severity across 3 drugs | 3 drugs | Severity (ordinal) |
| Neuroscience | Response latency across 4 conditions | 4 conditions | RT (ms; skewed) |

3. The Mathematics Behind the Kruskal-Wallis Test

3.1 Notation

| Symbol | Meaning |
|---|---|
| $K$ | Number of groups |
| $n_j$ | Number of observations in group $j$ |
| $N = \sum_{j=1}^K n_j$ | Total number of observations |
| $x_{ij}$ | $i$-th observation in group $j$ |
| $R_{ij}$ | Rank of $x_{ij}$ in the combined dataset |
| $W_j = \sum_{i=1}^{n_j} R_{ij}$ | Sum of ranks for group $j$ |
| $\bar{R}_j = W_j/n_j$ | Mean rank for group $j$ |
| $\bar{R} = (N+1)/2$ | Overall mean rank |

3.2 Step 1 — Ranking All Observations

Combine all $N$ observations from all $K$ groups into a single dataset and rank from 1 (smallest) to $N$ (largest).

For tied values: Assign the average rank (midrank) to all tied observations:

If the values at positions $r, r+1, \ldots, r+t-1$ are all equal, each receives rank $(r + (r+1) + \cdots + (r+t-1))/t = r + (t-1)/2$.

Verification: The sum of all ranks must equal $N(N+1)/2$.

$\sum_{j=1}^K W_j = \frac{N(N+1)}{2}$

3.3 Step 2 — Computing Group Rank Sums and Mean Ranks

For each group $j$:

$W_j = \sum_{i=1}^{n_j} R_{ij}$ (sum of ranks assigned to observations in group $j$)

$\bar{R}_j = W_j/n_j$ (mean rank for group $j$)

The overall mean rank is:

$\bar{R} = \frac{N+1}{2}$

Under $H_0$ (all groups from the same distribution), $E[\bar{R}_j] = (N+1)/2$ for all $j$.

3.4 Step 3 — The Kruskal-Wallis H Statistic

Basic H statistic (no ties):

$H = \frac{12}{N(N+1)}\sum_{j=1}^K \frac{W_j^2}{n_j} - 3(N+1)$

Equivalent computational form:

$H = \frac{12}{N(N+1)}\sum_{j=1}^K n_j\left(\bar{R}_j - \frac{N+1}{2}\right)^2$

This second form makes the logic transparent: $H$ is a weighted sum of squared deviations of group mean ranks from the overall mean rank $(N+1)/2$, scaled by $12/(N(N+1))$ to produce a statistic that follows a $\chi^2$ distribution.

Key properties: $H \geq 0$, with $H = 0$ exactly when every group mean rank equals the overall mean rank $(N+1)/2$; larger values of $H$ indicate greater separation among the group mean ranks.

3.5 Step 4 — The Tie Correction

When tied values exist, the basic $H$ statistic is slightly underestimated. The tie-corrected version is always used in practice:

$H_c = \frac{H}{C}$

Where the correction factor $C$ is:

$C = 1 - \frac{\sum_{m=1}^{g}(t_m^3 - t_m)}{N^3 - N}$

And $g$ is the number of distinct tied values, while $t_m$ is the number of observations tied at the $m$-th tied value.

Properties of $C$: $C = 1$ when there are no ties, and $C < 1$ whenever ties exist; since $H_c = H/C$, the correction always increases the statistic ($H_c \geq H$).

The tie correction is increasingly important when ties are numerous — for example, on coarse ordinal scales with only a few distinct response values — and when $N$ is small.
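The correction factor $C$ can be computed directly from the tie-group sizes (a small sketch; the function name is illustrative):

```python
from collections import Counter

def tie_correction(values):
    """C = 1 - sum(t^3 - t) / (N^3 - N), summed over groups of tied values."""
    n = len(values)
    tie_sizes = [t for t in Counter(values).values() if t > 1]
    return 1 - sum(t**3 - t for t in tie_sizes) / (n**3 - n)

# one tie group of size 2: C = 1 - (8 - 2)/(125 - 5) = 1 - 6/120 = 0.95
print(tie_correction([2.1, 3.4, 3.4, 5.7, 8.2]))
```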

3.6 Step 5 — The p-value

For large samples ($n_j \geq 5$ per group):

$p = P(\chi^2_{K-1} \geq H_c)$

This asymptotic chi-squared approximation is generally accurate for $n_j \geq 5$.

For small samples ($n_j < 5$):

Use exact tables (available in statistical references) or the permutation distribution computed by DataStatPro. The exact p-value is based on all possible ways to assign $N$ ranks to $K$ groups of sizes $n_1, n_2, \ldots, n_K$.

Exact p-value (small samples):

$p = \frac{\text{Number of rank assignments giving } H \geq H_{obs}}{\text{Total number of possible rank assignments}}$

Total possible assignments $= N!/(n_1!\times n_2!\times\cdots\times n_K!)$

DataStatPro automatically uses the exact distribution for small samples ($n_j < 5$) and the chi-squared approximation (with tie correction) for larger samples.
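Putting Steps 1–5 together, the tie-corrected $H_c$ and its asymptotic p-value can be sketched end to end. The data below are illustrative, and the example restricts itself to $K = 3$ so that the survival function of $\chi^2_2$ has the closed form $e^{-x/2}$ and no statistics library is needed:

```python
import math
from collections import Counter

def kruskal_wallis(groups):
    """Tie-corrected Kruskal-Wallis H and asymptotic p for K = 3 groups (df = 2)."""
    data = [x for g in groups for x in g]
    n_total = len(data)
    # midranks over the combined sample (tied values share an average rank)
    sorted_vals = sorted(data)
    rank_of = {}
    i = 0
    while i < len(sorted_vals):
        j = i
        while j + 1 < len(sorted_vals) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        rank_of[sorted_vals[i]] = (i + 1 + j + 1) / 2
        i = j + 1
    # H from the group rank sums
    h = 0.0
    for g in groups:
        w = sum(rank_of[x] for x in g)
        h += w * w / len(g)
    h = 12 / (n_total * (n_total + 1)) * h - 3 * (n_total + 1)
    # tie correction: C = 1 - sum(t^3 - t)/(N^3 - N)
    ties = sum(t**3 - t for t in Counter(data).values() if t > 1)
    c = 1 - ties / (n_total**3 - n_total)
    h_c = h / c
    p = math.exp(-h_c / 2)  # chi-squared survival function, valid for df = 2 only
    return h_c, p

a = [6.4, 5.2, 7.1, 6.8, 5.9]
b = [7.9, 8.4, 7.7, 8.1, 6.9]
c = [5.1, 4.8, 6.0, 5.5, 4.9]
h_c, p = kruskal_wallis([a, b, c])
print(round(h_c, 2), round(p, 4))  # → 10.64 0.0049
```

Here the rank sums are $W_A = 38$, $W_B = 64$, $W_C = 18$ (they sum to $15 \cdot 16/2 = 120$), there are no ties ($C = 1$), and $H_c = 10.64$ exceeds the $\chi^2_2$ critical value of 5.991.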

3.7 The Relationship Between H and the ANOVA F-Statistic

The Kruskal-Wallis $H$ statistic is mathematically related to the ANOVA $F$-statistic applied to the ranks. Specifically, if we replaced the raw data with their ranks and ran a standard one-way ANOVA:

$F_{ranks} = \frac{MS_{B,ranks}}{MS_{W,ranks}}$

Then, since $H = (N-1)\,SS_{B,ranks}/SS_{T,ranks}$:

$F_{ranks} = \frac{H/(K-1)}{(N-1-H)/(N-K)}$

$H$ and $F_{ranks}$ are therefore monotonically related — large $F_{ranks}$ always corresponds to large $H$. This equivalence shows that the Kruskal-Wallis test is essentially ANOVA on the ranks.

3.8 The Exact Distribution of H for Small Samples

For $K = 3$ with very small group sizes, Kruskal and Wallis (1952) tabulated the exact distribution. Selected critical values for the exact test ($\alpha = .05$):

| $n_1$ | $n_2$ | $n_3$ | $H_{crit}$ ($\alpha = .05$) |
|---|---|---|---|
| 2 | 2 | 2 | 4.571 |
| 3 | 2 | 2 | 4.714 |
| 3 | 3 | 2 | 5.361 |
| 3 | 3 | 3 | 5.600 |
| 4 | 2 | 2 | 5.333 |
| 4 | 3 | 2 | 5.444 |
| 4 | 4 | 2 | 5.455 |
| 4 | 4 | 4 | 5.692 |
| 5 | 5 | 5 | 5.780 (cf. $\chi^2_{2,0.05} = 5.991$) |

For $n_j \geq 5$, the chi-squared approximation is generally adequate.

3.9 Decomposition: H as a Sum of Pairwise Contrasts

The total $H$ statistic can be decomposed into contributions from individual pairs of groups. For the pairwise comparison of groups $j$ and $k$:

$H_{jk} = \frac{12\,n_j n_k}{N(N+1)}\left(\bar{R}_j - \bar{R}_k\right)^2$

These pairwise contributions do not sum exactly to $H$ (because the ranks are shared across the full dataset), but they are useful for understanding which group pairs drive an overall significant result.

The standard Dunn post-hoc test (Section 9) uses the pairwise differences in mean ranks $(\bar{R}_j - \bar{R}_k)$ to construct post-hoc z-statistics.


4. Assumptions of the Kruskal-Wallis Test

4.1 Same Shape Across Groups (Location-Shift Assumption)

The Kruskal-Wallis Test's standard interpretation (as a test of equal medians/locations) requires that all $K$ population distributions have the same shape — they may differ only in location (median). This is the location-shift or stochastic dominance assumption.

Why it matters: If the distributions have different shapes (e.g., one group is symmetric and another is right-skewed), the Kruskal-Wallis test may reject $H_0$ even when all medians are equal — it is then detecting a difference in dispersion or shape, not location.

How to check: Inspect per-group density plots, histograms, or boxplots; compare the groups' spreads (IQRs) and skewness.

When violated: Interpret the result as a test of stochastic equality rather than equal medians, or use a Brunner-Munzel-type procedure for pairwise comparisons.

4.2 Independence of Observations

All $N$ observations must be independent of each other, both within and across groups. Each participant or experimental unit must contribute exactly one observation to exactly one group.

Common violations: repeated measurements on the same participants, clustered data (e.g., students nested within classrooms), and serially correlated observations collected over time.

When violated: Use the Friedman test (for repeated measures), multilevel models, or time-series methods.

4.3 Ordinal Measurement (Rankable Data)

The Kruskal-Wallis Test requires that observations can be meaningfully ranked — there must be a natural ordering such that one value can be identified as greater than, less than, or equal to another. This is satisfied for continuous measurements, counts, and ordinal scales (e.g., Likert items).

When violated: If data are purely nominal (categories with no natural order), use chi-squared tests or Fisher's exact test.

4.4 Random Sampling

Observations within each group should constitute a random sample from the respective population, or at least be exchangeable under $H_0$. This is required for the p-value to be valid.

4.5 Minimum Sample Size per Group

The chi-squared approximation for the p-value requires $n_j \geq 5$ per group for adequate accuracy. For smaller groups, use the exact permutation version (Section 5.2) or consult exact critical-value tables (Section 3.8).

4.6 Absence of Excessive Ties

While the Kruskal-Wallis test handles ties through the correction factor $C$, excessive ties reduce statistical power and may distort the chi-squared approximation. The more observations share a value — as on coarse ordinal scales with only a few response options — the smaller $C$ becomes and the less rank information the data carry.

How to check: Compute the correction factor $C$ — values of $C < 0.95$ indicate substantial ties.

4.7 Assumption Summary Table

| Assumption | Description | How to Check | Remedy if Violated |
|---|---|---|---|
| Same shape (location-shift) | Distributions differ only in location, not shape | Density plots, boxplots, Levene's | Brunner-Munzel; interpret cautiously |
| Independence | Observations independent within and across groups | Design review | Friedman test (repeated measures) |
| Rankable data | Observations can be meaningfully ordered | Measurement theory | Chi-squared (nominal data) |
| Random sampling | Groups are random samples from their populations | Design review | Non-parametric bootstrap |
| $n_j \geq 5$ | Adequate for chi-squared approximation | Count per group | Exact permutation test |
| No excessive ties | $C$ not too far from 1 | Compute $C$; inspect data | Permutation version; sign test |

5. Variants of the Kruskal-Wallis Test

5.1 Standard Kruskal-Wallis with Chi-Squared Approximation

The default implementation: compute $H_c$ using the tie correction and compare to $\chi^2_{K-1}$. Appropriate for $n_j \geq 5$ per group with few or moderate ties.

5.2 Exact Permutation Version

For small samples ($n_j < 5$ per group) or when ties are extensive, the exact permutation test generates the null distribution of $H$ by enumerating all possible rank assignments to the $K$ groups. DataStatPro automatically uses this for small samples.

Permutation algorithm:

  1. Compute $H_{obs}$ from the observed data.
  2. Enumerate (or randomly sample $B = 10{,}000$ times for large $N$) all possible assignments of the $N$ combined ranks to groups of sizes $n_1, n_2, \ldots, n_K$.
  3. Compute $H^{(b)}$ for each permutation.
  4. $p =$ proportion of permutations with $H^{(b)} \geq H_{obs}$.
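The Monte Carlo variant of this algorithm can be sketched as follows (illustrative only, not DataStatPro's exact enumeration). Since relabelling observations merely shuffles which ranks belong to which group, we rank once and permute the rank vector:

```python
import random

def h_from_ranks(ranks, sizes):
    """H statistic given combined ranks, split into consecutive groups."""
    n = len(ranks)
    h, start = 0.0, 0
    for size in sizes:
        w = sum(ranks[start:start + size])
        h += w * w / size
        start += size
    return 12 / (n * (n + 1)) * h - 3 * (n + 1)

def permutation_pvalue(groups, b=5000, seed=1):
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    data = [x for g in groups for x in g]
    # ranks (no ties assumed here for brevity; midranks would be needed otherwise)
    ranks = [sorted(data).index(x) + 1 for x in data]
    h_obs = h_from_ranks(ranks, sizes)
    hits = 0
    for _ in range(b):
        rng.shuffle(ranks)  # one random reassignment of ranks to groups
        if h_from_ranks(ranks, sizes) >= h_obs:
            hits += 1
    return (hits + 1) / (b + 1)  # add-one correction avoids p = 0

p = permutation_pvalue([[6.4, 5.2, 7.1], [7.9, 8.4, 7.7], [5.1, 4.8, 6.0]])
print(0 < p < 0.05)
```

With these three well-separated groups of size 3 (below the $n_j \geq 5$ threshold), the permutation p-value lands near .01 even though the chi-squared approximation would be unreliable.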

5.3 Jonckheere-Terpstra Test — Ordered Alternatives

When the $K$ groups represent an ordered quantitative variable (e.g., increasing drug dose: 0, 10, 20, 40 mg) and the alternative hypothesis is that the response is monotonically ordered across groups, the Jonckheere-Terpstra (JT) test is more powerful than the Kruskal-Wallis test:

$H_1$: $\theta_1 \leq \theta_2 \leq \cdots \leq \theta_K$ (at least one strict inequality)

The JT statistic counts the number of concordant pairs across ordered groups:

$J = \sum_{j < k} U_{jk}$

Where $U_{jk}$ is the Mann-Whitney $U$ statistic for groups $j$ and $k$ (counting how many observations in group $k$ exceed observations in group $j$).

DataStatPro provides the Jonckheere-Terpstra test under "Ordered Kruskal-Wallis."
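The counting definition of $J$ translates directly into code (a minimal sketch; in a fuller treatment each tied cross-group pair would contribute 0.5):

```python
def jonckheere_terpstra(groups):
    """J = sum over ordered group pairs (j < k) of #{(x, y): x in j, y in k, y > x}."""
    j_stat = 0
    for a in range(len(groups)):
        for b in range(a + 1, len(groups)):
            j_stat += sum(y > x for x in groups[a] for y in groups[b])
    return j_stat

# monotonically increasing groups attain the maximum J = sum of n_j * n_k over pairs
print(jonckheere_terpstra([[1, 2], [3, 4], [5, 6]]))  # → 12
```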

5.4 Welch-Type Robust Kruskal-Wallis

The standard Kruskal-Wallis test assumes that the within-group rank dispersions are equal (analogous to the equal variance assumption). The robust Kruskal-Wallis extends Welch's approach to rank-based inference, providing better Type I error control when group scale parameters differ substantially.

5.5 Steel-Dwass Test — Non-Parametric All-Pairs Comparison

The Steel-Dwass test (also called Steel-Dwass-Critchlow-Fligner) is a non-parametric analogue of Tukey's HSD that uses pairwise Mann-Whitney statistics with a studentised range correction. It provides FWER control for all pairwise non-parametric comparisons without requiring the Kruskal-Wallis omnibus test to be significant first.

5.6 Choosing Between Variants

| Condition | Recommended Variant |
|---|---|
| $n_j \geq 5$, ordinal or non-normal | Standard Kruskal-Wallis (chi-squared approximation) |
| $n_j < 5$ | Exact permutation version |
| Ordered groups (increasing trend expected) | Jonckheere-Terpstra test |
| Unequal group dispersions | Brunner-Munzel (pairwise) or Fligner-Killeen |
| Many ties (coarse ordinal scale) | Permutation version with tie handling |
| All pairwise comparisons needed without omnibus | Steel-Dwass test |

6. Using the Kruskal-Wallis Test Calculator Component

The Kruskal-Wallis Test Calculator in DataStatPro provides a comprehensive tool for running, diagnosing, visualising, and reporting the Kruskal-Wallis test and post-hoc pairwise comparisons.

Step-by-Step Guide

Step 1 — Select "Kruskal-Wallis Test"

From the "Test Type" dropdown, choose "Kruskal-Wallis Test".

💡 DataStatPro automatically suggests the Kruskal-Wallis test when the normality check on residuals from a one-way ANOVA is significant, or when the user selects an ordinal DV. A blue information banner appears in the One-Way ANOVA component with a direct "Switch to Kruskal-Wallis" button.

Step 2 — Input Method

Step 3 — Specify Group Labels

Enter descriptive names for each group. These appear in all output tables, rank tables, and the auto-generated APA paragraph.

Step 4 — Select Assumption Diagnostics

DataStatPro automatically runs and displays the assumption diagnostics for the selected data.

Step 5 — Select Post-Hoc Tests

When the omnibus $H$ is significant, choose from Dunn's test, the Conover-Iman test, or pairwise Mann-Whitney U tests (see Section 9).

Step 6 — Select Effect Sizes

Step 7 — Select Display Options

Step 8 — Run the Analysis

Click "Run Kruskal-Wallis Test". DataStatPro will:

  1. Rank all $N$ observations combined, applying midranks for ties.
  2. Compute $W_j$, $\bar{R}_j$, $H$, $C$, and $H_c$.
  3. Compute the exact p-value (small samples) or chi-squared approximation (large samples).
  4. Compute $\eta^2_H$ and $\epsilon^2_H$ with bootstrap 95% CIs.
  5. Run all selected post-hoc tests with adjusted p-values and $r_{rb,jk}$.
  6. Generate all selected visualisations.
  7. Auto-generate the APA-compliant results paragraph.

7. Full Step-by-Step Procedure

7.1 Complete Computational Procedure

This section walks through every step for the Kruskal-Wallis test, from raw data to a complete APA-style conclusion.

Given: $K$ independent groups with observations $x_{ij}$ for $i = 1, \ldots, n_j$ and $j = 1, \ldots, K$. Total $N = \sum_{j=1}^K n_j$.


Step 1 — State the Hypotheses and Design

$H_0$: All $K$ population distributions are identical (same location).

$H_1$: At least one population distribution has a different location from at least one other.

State: the sign convention for differences (which group is expected to be higher), the significance level $\alpha$ (default $.05$), and whether the p-value will be exact or asymptotic (based on $n_j$).


Step 2 — Collect and Arrange the Data

Arrange all observations in a table indicating group membership. Verify that every observation belongs to exactly one group and that the group sizes sum to $N$.


Step 3 — Check Assumption: Shape Similarity Across Groups

Produce density plots or histograms for each group. Assess whether the distributions have approximately the same shape (symmetry, spread) and differ mainly in location. If shapes differ substantially, note this in the results and interpret the test as a test of stochastic equality rather than equal medians.


Step 4 — Rank All Observations Combined

Create a new column with the combined ranks of all $N$ observations:

  1. List all $N$ values together with their group labels.
  2. Sort by value (ascending).
  3. Assign ranks 1 to $N$.
  4. For tied values, compute and assign the midrank.
  5. Return to the original order.

Verification: $\sum_j W_j = \sum_j\sum_i R_{ij} = N(N+1)/2$.


Step 5 — Compute Group Rank Sums and Mean Ranks

For each group $j$:

$W_j = \sum_{i=1}^{n_j} R_{ij}$

$\bar{R}_j = W_j/n_j$

$\bar{R} = (N+1)/2$ (overall mean rank, the same for all groups under $H_0$)


Step 6 — Compute the H Statistic

$H = \frac{12}{N(N+1)}\sum_{j=1}^K \frac{W_j^2}{n_j} - 3(N+1)$

Or equivalently:

$H = \frac{12}{N(N+1)}\sum_{j=1}^K n_j\left(\bar{R}_j - \frac{N+1}{2}\right)^2$


Step 7 — Apply the Tie Correction

Identify all groups of tied values and compute:

$C = 1 - \frac{\sum_{m=1}^g (t_m^3 - t_m)}{N^3 - N}$

$H_c = H/C$

If there are no ties: $C = 1$ and $H_c = H$.


Step 8 — Compute the p-value

If all $n_j \geq 5$: Compare $H_c$ to $\chi^2_{K-1}$:

$p = P(\chi^2_{K-1} \geq H_c)$

If any $n_j < 5$: Use the exact permutation distribution (DataStatPro computes this).

Reject $H_0$ if $p \leq \alpha$.


Step 9 — Compute Effect Sizes

Eta squared for Kruskal-Wallis:

$\eta^2_H = \frac{H_c - K + 1}{N - K}$

Epsilon squared (alternative, less biased):

$\epsilon^2_H = \frac{H_c(N+1)}{N^2 - 1} = \frac{H_c}{N - 1}$


Step 10 — Conduct Post-Hoc Tests (if $H$ significant)

When $H_c$ is significant at level $\alpha$, identify which specific pairs of groups differ using Dunn's test or pairwise Mann-Whitney tests with appropriate FWER control (Section 9). Report pairwise z-statistics, adjusted p-values, and rank-biserial correlations $r_{rb,jk}$.


Step 11 — Compute Descriptive Statistics per Group

For each group $j$, report $n_j$, the median, and the interquartile range (Q1–Q3); the minimum and maximum are also useful for flagging outliers.


Step 12 — Interpret and Report

Combine all results into a complete APA-compliant report (Section 12.7).


8. Effect Sizes for the Kruskal-Wallis Test

8.1 Eta Squared for Kruskal-Wallis ($\eta^2_H$)

$\eta^2_H$ is the primary effect size for the Kruskal-Wallis test. It estimates the proportion of variance in the ranks explained by group membership:

$\eta^2_H = \frac{H_c - K + 1}{N - K}$

Equivalent formula from the ANOVA-on-ranks perspective:

$\eta^2_H = \frac{SS_{B,ranks}}{SS_{T,ranks}}$

where $SS_{B,ranks}$ and $SS_{T,ranks}$ are computed from the ranked data using standard ANOVA formulas.

Properties: $\eta^2_H$ ranges from 0 (group membership explains none of the rank variance) to 1 (group membership fully determines rank order), and is directly comparable in magnitude to ANOVA's $\eta^2$.

Approximate formula from $H_c$ alone:

$\eta^2_H \approx \frac{H_c}{N-1}$ (a rough approximation; this quantity is exactly $\epsilon^2_H$)
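As a numerical check with hypothetical values $H_c = 10.64$, $K = 3$, $N = 15$:

```python
h_c, k, n = 10.64, 3, 15
eta2 = (h_c - k + 1) / (n - k)  # eta-squared for Kruskal-Wallis
eps2 = h_c / (n - 1)            # the rougher H_c/(N-1) form (epsilon-squared)
print(round(eta2, 2), round(eps2, 2))  # → 0.72 0.76
```

Both values fall well above the 0.138 "large" benchmark from the table below.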

8.2 Epsilon Squared ($\epsilon^2_H$) — Less-Biased Alternative

$\epsilon^2_H$ (Kelley, 1935; adapted for Kruskal-Wallis) provides a less-biased estimate of the population effect size:

$\epsilon^2_H = \frac{H_c}{(N^2-1)/(N+1)} = \frac{H_c(N+1)}{N^2-1} = \frac{H_c}{N-1}$

Because $\epsilon^2_H$ divides by $N-1$ without subtracting the $K-1$ degrees of freedom that $\eta^2_H$ removes, the two estimates differ by a small correction that vanishes as $N$ grows. DataStatPro reports both $\eta^2_H$ and $\epsilon^2_H$.

💡 For practical purposes, $\eta^2_H$ and $\epsilon^2_H$ are usually very similar. Use $\eta^2_H$ for comparability with published literature (it is more widely reported) and $\epsilon^2_H$ when you want a less-biased estimate. Always specify which was computed.

8.3 Cohen's Benchmarks for $\eta^2_H$

Since $\eta^2_H$ is interpreted as a proportion of explained variance (in ranks), the same benchmarks as for ANOVA's $\eta^2$ apply:

| $\eta^2_H$ | $f$ equivalent | Verbal Label |
|---|---|---|
| 0.010 | 0.10 | Small |
| 0.059 | 0.25 | Medium |
| 0.138 | 0.40 | Large |
| 0.200 | 0.50 | Very large |
| 0.260 | 0.59 | Very large |

⚠️ Cohen's (1988) benchmarks are rough guidelines. Always contextualise within your domain — an $\eta^2_H = 0.10$ may be large in some fields (e.g., social psychology field studies) and small in others (e.g., laboratory-controlled cognitive tasks).

8.4 Rank-Biserial Correlation ($r_{rb,jk}$) for Pairwise Comparisons

For each significant pairwise comparison identified in post-hoc testing, report the rank-biserial correlation as the pairwise effect size:

$r_{rb,jk} = \frac{2z_{jk}}{\sqrt{n_j + n_k}}$

Where $z_{jk}$ is the z-statistic from the Dunn test for the pair $(j, k)$.

Or, directly from Mann-Whitney $U_{jk}$ (the preferred approach):

$r_{rb,jk} = 1 - \frac{2U_{jk}}{n_j n_k}$

Interpretation: $r_{rb,jk} = 0.5$ means that the probability a random observation from group $j$ exceeds a random observation from group $k$ is 75% (a large effect).

Cohen's benchmarks for $r_{rb}$ (same as Pearson $r$):

| $\vert r_{rb} \vert$ | Label |
|---|---|
| 0.10 | Small |
| 0.30 | Medium |
| 0.50 | Large |
| 0.70 | Very large |

8.5 Converting Between Effect Size Metrics

| From | To | Formula |
|---|---|---|
| $\eta^2_H$ | Cohen's $f$ (approx.) | $f = \sqrt{\eta^2_H/(1-\eta^2_H)}$ |
| $H_c$, $N$, $K$ | $\eta^2_H$ | $\eta^2_H = (H_c-K+1)/(N-K)$ |
| $r_{rb,jk}$ | Cohen's $d$ (approx.) | $d \approx 2r_{rb}/\sqrt{1-r_{rb}^2}$ |
| Cohen's $d$ | $r_{rb}$ (approx.) | $r_{rb} \approx d/\sqrt{d^2+4}$ |
| $z_{jk}$, $n_j$, $n_k$ | $r_{rb,jk}$ | $r_{rb,jk} = 2z_{jk}/\sqrt{n_j+n_k}$ |
| $\eta^2_H$ | $\eta^2_{ANOVA}$ (approx.) | Similar magnitude; directly comparable |

8.6 The Probability of Superiority Interpretation

The rank-biserial correlation $r_{rb,jk}$ is directly related to the probability of superiority — the probability that a randomly selected observation from group $j$ exceeds a randomly selected observation from group $k$:

$PS_{jk} = P(X_j > X_k) = \frac{1 + r_{rb,jk}}{2}$

Examples:

| $r_{rb,jk}$ | $PS_{jk}$ | Interpretation |
|---|---|---|
| 0.00 | 50.0% | No tendency for either group to be higher |
| 0.20 | 60.0% | Group $j$ exceeds group $k$ in 60% of random pairs |
| 0.50 | 75.0% | Group $j$ exceeds group $k$ in 75% of random pairs |
| 0.80 | 90.0% | Group $j$ exceeds group $k$ in 90% of random pairs |
| 1.00 | 100.0% | Every observation in group $j$ exceeds every one in $k$ |

This probability of superiority interpretation is accessible to non-statistical audiences and is the recommended supplementary reporting alongside $r_{rb,jk}$.
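The conversions above are one-liners; a quick sketch with a hypothetical $U$ and group sizes:

```python
def rank_biserial(u, n_j, n_k):
    """r_rb from the Mann-Whitney U statistic for a pair of groups."""
    return 1 - 2 * u / (n_j * n_k)

def prob_superiority(r_rb):
    """P(X_j > X_k) implied by the rank-biserial correlation."""
    return (1 + r_rb) / 2

r = rank_biserial(u=25, n_j=10, n_k=10)  # U = 25 out of a possible 100
print(r, prob_superiority(r))  # → 0.5 0.75
```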


9. Post-Hoc Tests and Pairwise Comparisons

9.1 Why Post-Hoc Tests Are Needed

A significant Kruskal-Wallis test establishes that at least one group tends to produce systematically different values from at least one other. It does not identify which specific pairs of groups differ. Post-hoc procedures address this while controlling the FWER.

⚠️ When the omnibus Kruskal-Wallis test is non-significant, do not run pairwise post-hoc comparisons (except for pre-planned contrasts). Fishing for significant pairs after a non-significant omnibus test inflates the FWER and constitutes p-hacking.

9.2 Dunn's Test — Standard Post-Hoc for Kruskal-Wallis

Dunn's test (Dunn, 1964) is the most widely used post-hoc procedure following a significant Kruskal-Wallis test. It uses the ranks from the original Kruskal-Wallis analysis (not re-ranked pairwise).

For each pair of groups $(j, k)$:

Test statistic:

$z_{jk} = \frac{\bar{R}_j - \bar{R}_k}{SE_{jk}}$

Standard error with tie correction:

$SE_{jk} = \sqrt{\left[\frac{N(N+1)}{12} - \frac{\sum_m(t_m^3-t_m)}{12(N-1)}\right]\left(\frac{1}{n_j}+\frac{1}{n_k}\right)}$

Simplified (no ties):

$SE_{jk} = \sqrt{\frac{N(N+1)}{12}\cdot\frac{n_j+n_k}{n_j n_k}}$

Two-tailed p-value:

$p_{jk} = 2[1-\Phi(|z_{jk}|)]$

FWER correction: Apply Holm-Bonferroni (recommended) or Bonferroni to the $m = K(K-1)/2$ pairwise p-values.

Effect size per pair:

$r_{rb,jk} = \frac{2z_{jk}}{\sqrt{n_j+n_k}}$

9.3 Holm-Bonferroni Correction (Recommended)

For $m = K(K-1)/2$ pairwise comparisons:

  1. Sort p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$.
  2. Compare $p_{(i)}$ to $\alpha^*_{(i)} = \alpha/(m-i+1)$.
  3. Starting from the smallest p-value, reject $H_{0,(i)}$ if $p_{(i)} \leq \alpha^*_{(i)}$.
  4. Stop rejecting when the first non-rejection is encountered; all subsequent pairs are also non-significant.

Holm-Bonferroni provides the same FWER control as Bonferroni but is uniformly more powerful. It should always be preferred over simple Bonferroni.
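Dunn's z-statistics and the Holm step-down adjustment can be combined in a short sketch (illustrative data with no ties; $\Phi$ is built from `math.erf`, so no statistics library is needed):

```python
import math
from itertools import combinations

def dunn_holm(groups):
    """Dunn pairwise z-tests on the full-sample ranks, Holm-adjusted p-values."""
    data = [x for g in groups for x in g]
    n = len(data)
    rank = {v: i + 1 for i, v in enumerate(sorted(data))}  # no ties assumed
    mean_ranks = [sum(rank[x] for x in g) / len(g) for g in groups]
    results = []
    for j, k in combinations(range(len(groups)), 2):
        se = math.sqrt(n * (n + 1) / 12 * (1 / len(groups[j]) + 1 / len(groups[k])))
        z = (mean_ranks[j] - mean_ranks[k]) / se
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # 2[1 - Phi(|z|)]
        results.append([(j, k), z, p])
    # Holm step-down: multiply the i-th smallest p by (m - i), enforce monotonicity
    results.sort(key=lambda r: r[2])
    m, running_max = len(results), 0.0
    for i, r in enumerate(results):
        running_max = max(running_max, min(1.0, (m - i) * r[2]))
        r[2] = running_max
    return {pair: (z, p_adj) for pair, z, p_adj in results}

a = [6.4, 5.2, 7.1, 6.8, 5.9]
b = [7.9, 8.4, 7.7, 8.1, 6.9]
c = [5.1, 4.8, 6.0, 5.5, 4.9]
out = dunn_holm([a, b, c])
print(out[(1, 2)][1] < 0.05)  # B vs C, the clearly separated pair
```

With these data (mean ranks 7.6, 12.8, 3.6) only the B-vs-C comparison survives the Holm adjustment, illustrating how the step-down procedure spends its error budget on the most extreme pair first.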

9.4 Bonferroni Correction

Each comparison uses $\alpha^* = \alpha/m$. More conservative than Holm but simpler to compute manually:

$p_{adj,jk} = \min(1, p_{jk} \times m)$

Compare $p_{adj,jk}$ to $\alpha$.

9.5 Benjamini-Hochberg FDR Control (For Exploratory Research)

For exploratory analyses where controlling the false discovery rate (FDR) rather than FWER is acceptable:

  1. Sort p-values: $p_{(1)} \leq \cdots \leq p_{(m)}$.
  2. Find the largest $i$ such that $p_{(i)} \leq i\alpha/m$.
  3. Reject all $H_{0,(1)}, \ldots, H_{0,(i)}$.

FDR control allows more discoveries than FWER control but accepts a higher rate of false positives among rejected hypotheses. Use this approach only for hypothesis generation, not confirmation.
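The BH step-up rule is equally short to implement. A minimal sketch (the function name is ours):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean reject decisions under BH FDR control.

    Find the largest rank i with p_(i) <= i * alpha / m and reject
    that hypothesis together with all smaller-p hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # number of rejections
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject
```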

9.6 Conover-Iman Test — More Powerful Alternative to Dunn

The Conover-Iman test (Conover & Iman, 1979) is more powerful than Dunn's test because it rescales the pairwise standard errors by the observed $H_c$ and refers the statistics to the t-distribution. Note that it is valid only after a significant Kruskal-Wallis test.

Test statistic:

$$t_{jk} = \frac{\bar{R}_j - \bar{R}_k}{\sqrt{S^2 \cdot \dfrac{N-1-H_c}{N-K} \cdot \left(\dfrac{1}{n_j}+\dfrac{1}{n_k}\right)}}$$

Where $S^2 = \frac{1}{N-1}\left[\sum_{i,j} R_{ij}^2 - \frac{N(N+1)^2}{4}\right]$ is the variance of the ranks (with no ties, $S^2 = N(N+1)/12$).

This $t_{jk}$ statistic approximately follows a t-distribution with $N-K$ df. The factor $(N-1-H_c)/(N-K)$ shrinks the standard error when $H_c$ is large, which is the source of the extra power relative to Dunn's normal-approximation test.

9.7 Pairwise Mann-Whitney U Tests — Most Powerful Option

When post-hoc comparisons are planned in advance, pairwise Mann-Whitney U tests with Holm-Bonferroni correction provide the most powerful approach:

For each pair $(j, k)$:

  1. Run a Mann-Whitney U test using only the $n_j + n_k$ observations from those two groups (not the full-dataset ranks).
  2. Compute the rank-biserial correlation $r_{rb,jk}$ directly from $U_{jk}$.
  3. Apply Holm-Bonferroni correction to the $m$ pairwise p-values.

Why this is more powerful than Dunn's test: Dunn's test uses the full-dataset ranks (which dilute the pairwise signal), while pairwise Mann-Whitney uses only the two groups' data (giving sharper discrimination).

Limitation: The pairwise Mann-Whitney approach does not use a common error term across pairs (unlike Dunn), which means it is slightly less efficient when the assumption of equal group dispersions holds.
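The pairwise strategy above can be sketched with `scipy.stats.mannwhitneyu`; Holm correction would then be applied to the collected p-values separately. The helper name and the rank-biserial sign convention (positive when the first group tends to be smaller) are ours:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_mwu(groups, labels):
    """Pairwise two-sided Mann-Whitney U tests on the two groups' own data.

    Returns {(label_j, label_k): (raw p, rank-biserial r)} where
    r = 1 - 2*U1/(n_j*n_k), U1 being SciPy's statistic for the first group."""
    out = {}
    for j, k in combinations(range(len(groups)), 2):
        u, p = mannwhitneyu(groups[j], groups[k], alternative="two-sided")
        r_rb = 1 - 2 * u / (len(groups[j]) * len(groups[k]))
        out[(labels[j], labels[k])] = (p, r_rb)
    return out
```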

9.8 Steel-Dwass Test — Non-Parametric Tukey HSD Analogue

The Steel-Dwass test (also known as the Steel-Dwass-Critchlow-Fligner test; Critchlow & Fligner, 1991) provides simultaneous confidence intervals and a test that controls the FWER without requiring the omnibus Kruskal-Wallis test to be significant first. It is the non-parametric counterpart of Tukey's HSD.

DataStatPro provides the Steel-Dwass test under "Advanced Post-Hoc Options."

9.9 Planned Contrasts (Non-Parametric)

When specific comparisons are theoretically motivated before data collection, a priori contrasts can be specified. For non-parametric designs, planned contrasts use the same Dunn or Mann-Whitney approach but without FWER correction (or with a less conservative correction such as Holm applied only to the planned tests).

Linear trend contrast (Jonckheere-Terpstra): For ordered groups, this is more powerful than any pairwise approach.

9.10 Post-Hoc Selection Guide

| Condition | Recommended Post-Hoc | Controls FWER |
|---|---|---|
| Standard post-hoc, any design | Dunn + Holm | ✅ |
| More power, equal group dispersions | Conover-Iman + Holm | ✅ |
| Maximum power, planned a priori | Pairwise Mann-Whitney + Holm | ✅ |
| Non-parametric equivalent of Tukey HSD | Steel-Dwass | ✅ |
| Conservative FWER control | Dunn + Bonferroni | ✅ (conservative) |
| FDR control (exploratory) | Dunn + Benjamini-Hochberg | ✅ (FDR only) |
| Ordered alternative | Jonckheere-Terpstra + linear contrasts | Directional |

10. Confidence Intervals

10.1 CI for the Effect Size $\eta^2_H$

The exact CI for $\eta^2_H$ does not have a closed-form solution. DataStatPro computes it via bootstrap when raw data are available:

  1. Resample $N$ observations with replacement from the combined dataset, maintaining group sizes $n_1, n_2, \ldots, n_K$.
  2. Compute $H_c^{(b)}$ and $\eta^2_{H,(b)}$ for each of $B = 10{,}000$ bootstrap samples.
  3. The 95% CI is the 2.5th and 97.5th percentiles of the bootstrap distribution of $\eta^2_{H,(b)}$.
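The percentile bootstrap can be sketched as follows. Here we resample within each group (a stratified bootstrap, which keeps each $n_j$ fixed), and `scipy.stats.kruskal` supplies the tie-corrected $H_c$; function names are ours, and DataStatPro uses $B = 10{,}000$ rather than the smaller default here:

```python
import numpy as np
from scipy.stats import kruskal

def eta2_h(h, k, n):
    """Eta-squared-H from a (tie-corrected) H statistic."""
    return (h - k + 1) / (n - k)

def bootstrap_eta2_ci(groups, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for eta^2_H.

    Resamples each group with replacement (its size fixed) and
    recomputes the tie-corrected H each time."""
    rng = np.random.default_rng(seed)
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    stats = []
    for _ in range(n_boot):
        resampled = [rng.choice(g, size=len(g), replace=True) for g in groups]
        h, _ = kruskal(*resampled)
        stats.append(eta2_h(h, k, n_total))
    return np.percentile(stats, [2.5, 97.5])
```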

An approximate CI based on the non-central chi-squared distribution:

The exact non-central chi-squared CI for the ANOVA $F$-test extends to the KW $H$ statistic. Find $\lambda_L$ and $\lambda_U$ such that:

$$P(\chi^2_{K-1}(\lambda_L) \geq H_c) = 0.025 \quad\text{and}\quad P(\chi^2_{K-1}(\lambda_U) \leq H_c) = 0.025$$

$$\eta^2_{H,L} = \lambda_L/(N-1); \qquad \eta^2_{H,U} = \lambda_U/(N-1)$$

DataStatPro provides both bootstrap and chi-squared-based CIs.

10.2 CI for Pairwise Rank-Biserial Correlations

The 95% CI for each pairwise $r_{rb,jk}$ uses the Fisher $z$-transformation:

$$z_r = \operatorname{arctanh}(r_{rb,jk}), \quad SE_{z_r} = \frac{1}{\sqrt{n_j+n_k-3}}$$

$$95\%\text{ CI for } z_r:\; z_r \pm 1.96/\sqrt{n_j+n_k-3}$$

Back-transform: $r_{rb} = \tanh(z_r)$

Or via bootstrap when raw data are available (more accurate for small $n_j + n_k$).

10.3 Confidence Intervals for Group Medians

The 95% CI for each group's population median is based on order statistics:

For group $j$ with $n_j$ observations, the CI bounds are determined by:

$$L_j = \left\lfloor \frac{n_j}{2} - \frac{z_{\alpha/2}\sqrt{n_j}}{2} \right\rfloor; \qquad U_j = n_j - L_j + 1$$

The CI is $(x_{(L_j)}, x_{(U_j)})$, where $x_{(k)}$ is the $k$-th order statistic.

DataStatPro computes these exact binomial-based CIs for each group median.
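A minimal sketch of the order-statistic CI, using the normal-approximation bounds from the formula above (the function name is ours):

```python
import math

def median_ci(sorted_x, z=1.96):
    """Distribution-free CI for the median from order statistics.

    Uses the normal approximation to the binomial to pick the lower
    and upper order-statistic indices (1-based in the formula)."""
    n = len(sorted_x)
    lower = int(math.floor(n / 2 - z * math.sqrt(n) / 2))
    lower = max(lower, 1)               # guard tiny samples
    upper = n - lower + 1
    # convert 1-based order-statistic indices to 0-based list indices
    return sorted_x[lower - 1], sorted_x[upper - 1]
```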

10.4 CI Width and Precision

Width of the 95% CI for $\eta^2_H = 0.10$ as a function of $N$ (bootstrap):

| Total $N$ ($K = 3$) | Approx. CI Width | Precision |
|---|---|---|
| 30 | 0.23 | Very low |
| 60 | 0.16 | Low |
| 90 | 0.13 | Moderate |
| 150 | 0.10 | Good |
| 300 | 0.07 | High |
| 600 | 0.05 | Very high |

⚠️ With only 30 total observations ($n_j = 10$ per group), the 95% CI for $\eta^2_H = 0.10$ spans approximately $[0.00, 0.23]$ — essentially uninformative. Always report the CI. Studies with small samples can achieve statistical significance only for large true effects, but the CI reveals the inherent imprecision.


11. Power Analysis and Sample Size Planning

11.1 Power of the Kruskal-Wallis Test

Power analysis for the Kruskal-Wallis test is more complex than for ANOVA because power depends on the entire distribution of the data, not just means and variances. Three approaches are used in practice:

Approach 1 — Use ARE relative to one-way ANOVA (normal data):

$$n_{KW} \approx n_{ANOVA} \times \frac{\pi}{3} \approx 1.047 \times n_{ANOVA}$$

This gives the required $n$ per group for the Kruskal-Wallis test when data are approximately normal — add approximately 5% to the ANOVA-based sample size.

Approach 2 — Direct simulation (DataStatPro Monte Carlo power module):

Specify the distribution (normal, logistic, exponential), effect size $\eta^2_H$ (or group medians and spread), $K$, $\alpha$, and desired power. DataStatPro simulates power via Monte Carlo.

Approach 3 — Use the non-central chi-squared approximation:

$$\text{Power} \approx P\left(\chi^2_{K-1}(\lambda) > \chi^2_{K-1,\alpha}\right)$$

Where $\lambda = (N-1)\eta^2_H$ is the non-centrality parameter.
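Approach 3 is nearly a one-liner with SciPy's non-central chi-squared distribution (the function name is ours):

```python
from scipy.stats import chi2, ncx2

def kw_power(eta2_h, n_total, k, alpha=0.05):
    """Approximate Kruskal-Wallis power via the non-central chi-squared:
    df = K - 1, non-centrality lambda = (N - 1) * eta^2_H."""
    crit = chi2.ppf(1 - alpha, df=k - 1)
    lam = (n_total - 1) * eta2_h
    return ncx2.sf(crit, df=k - 1, nc=lam)
```

For example, with $\eta^2_H = 0.109$, $K = 3$, and $n = 29$ per group (from the table below), the approximation returns a power close to 0.80.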

11.2 Required Sample Size per Group (80% Power, $\alpha = .05$)

Based on ARE adjustment from one-way ANOVA (normal data):

| Cohen's $f$ equiv. | $\eta^2_H$ equiv. | $K = 3$ | $K = 4$ | $K = 5$ | $K = 6$ |
|---|---|---|---|---|---|
| 0.10 | 0.010 | 337 | 287 | 251 | 225 |
| 0.15 | 0.022 | 151 | 129 | 112 | 101 |
| 0.25 | 0.059 | 55 | 47 | 41 | 37 |
| 0.35 | 0.109 | 29 | 25 | 22 | 20 |
| 0.40 | 0.138 | 22 | 19 | 17 | 15 |
| 0.50 | 0.200 | 15 | 13 | 12 | 11 |
| 0.60 | 0.265 | 11 | 10 | 9 | 8 |
| 0.80 | 0.390 | 7 | 6 | 6 | 5 |

All values are $n$ per group. Total $N = n \times K$. Values are $\approx 5\%$ larger than the corresponding ANOVA requirements for normal data.

11.3 Sensitivity Analysis

Minimum detectable $\eta^2_H$ for 80% power ($\alpha = .05$):

$$\eta^2_{H,min} \approx \frac{\chi^2_{\alpha,K-1} + 2\sqrt{\chi^2_{\alpha,K-1}}}{N} \quad\text{(rough approximation)}$$

More precisely, using the non-central chi-squared:

$$\lambda_{min} = \text{non-centrality for 80% power} \approx \chi^2_{K-1}(\alpha) + 1.28\sqrt{2\chi^2_{K-1}(\alpha)}$$

$$\eta^2_{H,min} \approx \lambda_{min}/(N-1)$$

| Total $N$ | $K = 3$ | $K = 4$ | $K = 5$ |
|---|---|---|---|
| 30 | $\eta^2_H \geq 0.195$ | $\eta^2_H \geq 0.243$ | $\eta^2_H \geq 0.287$ |
| 60 | $\eta^2_H \geq 0.097$ | $\eta^2_H \geq 0.122$ | $\eta^2_H \geq 0.144$ |
| 90 | $\eta^2_H \geq 0.065$ | $\eta^2_H \geq 0.081$ | $\eta^2_H \geq 0.096$ |
| 150 | $\eta^2_H \geq 0.039$ | $\eta^2_H \geq 0.049$ | $\eta^2_H \geq 0.058$ |
| 300 | $\eta^2_H \geq 0.020$ | $\eta^2_H \geq 0.025$ | $\eta^2_H \geq 0.029$ |

11.4 Power Advantage Under Non-Normal Distributions

When data are non-normal, the Kruskal-Wallis test's power advantage over ANOVA increases:

| Distribution | ARE | Required $n$ (KW vs. ANOVA) |
|---|---|---|
| Normal | 0.955 | KW needs $\approx$ 5% more |
| Contaminated normal (10% outliers) | $> 1.5$ | KW needs $\approx$ 33% fewer |
| Exponential (skewed) | 1.125 | KW needs $\approx$ 11% fewer |
| Laplace | 1.500 | KW needs $\approx$ 33% fewer |
| Cauchy (heavy tails) | $\gg 1$ | KW dramatically more powerful |

💡 For heavy-tailed, skewed, or outlier-contaminated distributions, the Kruskal-Wallis test requires fewer observations than one-way ANOVA to achieve the same power, and its asymptotic relative efficiency can never fall below 0.864 for any location-shift model. This makes it a safe and often optimal choice when normality is uncertain.


12. Advanced Topics

12.1 Relationship Between Kruskal-Wallis H and ANOVA F

The Kruskal-Wallis test is precisely one-way ANOVA applied to the ranks. If we replace each $x_{ij}$ with its rank $R_{ij}$ and run a standard one-way ANOVA, the resulting $F_{ranks}$ statistic is monotonically related to $H_c$ by:

$$H_c = \frac{(N-1)(K-1)F_{ranks}}{(N-K) + (K-1)F_{ranks}}$$

Or approximately for large $N$:

$$H_c \approx (K-1) \times F_{ranks}$$

This equivalence means the Kruskal-Wallis test can be computed by running an ordinary one-way ANOVA on the rank-transformed data; the tie correction is then handled automatically by the rank sums of squares.
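The equivalence can be verified numerically with SciPy on a small three-group dataset (the data here are the Example 1 pain ratings from later in this guide):

```python
import numpy as np
from scipy.stats import kruskal, f_oneway, rankdata

groups = [[3, 2, 4, 1, 3, 2, 4, 1],
          [5, 6, 4, 5, 7, 6, 5, 4],
          [7, 8, 6, 9, 7, 8, 6, 9]]
n_total, k = 24, 3

# Tie-corrected H directly from scipy
h, _ = kruskal(*groups)

# One-way ANOVA on the midranks of the combined sample
ranks = rankdata(np.concatenate(groups))
ranked_groups = np.split(ranks, np.cumsum([len(g) for g in groups])[:-1])
f_ranks, _ = f_oneway(*ranked_groups)

# The identity relating the two statistics holds exactly
h_from_f = (n_total - 1) * (k - 1) * f_ranks / (n_total - k + (k - 1) * f_ranks)
assert abs(h - h_from_f) < 1e-6
```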

12.2 The Kruskal-Wallis Test for Ordered Groups: Jonckheere-Terpstra

When group levels are ordered (e.g., increasing dose), the Jonckheere-Terpstra test is more powerful than the Kruskal-Wallis test because it uses the directional information.

JT statistic:

$$J = \sum_{j<k} U_{jk}$$

Where $U_{jk}$ counts the number of pairs $(x_{aj}, x_{bk})$ with $x_{aj} < x_{bk}$, plus half the ties.

Under $H_0$:

$$E[J] = \frac{N^2 - \sum_j n_j^2}{4}$$

The standardised statistic:

$$z_J = \frac{J - E[J]}{\sqrt{\text{Var}[J]}}$$

$$\text{Var}[J] = \frac{N^2(2N+3) - \sum_j n_j^2(2n_j+3)}{72} \quad\text{(no ties)}$$

Compare $z_J$ to the standard normal distribution.

Effect size for JT: the standardised statistic $z_J/\sqrt{N}$ provides a normalised measure of the strength of the monotonic trend.
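The definition of $J$ translates directly into code. A brute-force $O(N^2)$ sketch, fine for small samples (the function name is ours):

```python
def jonckheere_terpstra(groups):
    """J statistic for ordered groups.

    For every pair of groups j < k, count pairs (x from group j,
    y from group k) with x < y, scoring exact ties as 1/2 — i.e.
    the sum of the pairwise Mann-Whitney counts U_{jk}."""
    j_stat = 0.0
    for a in range(len(groups) - 1):
        for b in range(a + 1, len(groups)):
            for x in groups[a]:
                for y in groups[b]:
                    if x < y:
                        j_stat += 1.0
                    elif x == y:
                        j_stat += 0.5
    return j_stat
```

A perfectly increasing arrangement attains the maximum $J$; a perfectly decreasing one attains 0.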

12.3 Handling Ties: When the Correction Matters

The tie correction becomes important when the correction factor $C$ is substantially less than 1. The degree of correction depends on the proportion of ties:

Example: Data measured on a 5-point scale (1–5) with many ties.

$N = 60$, $K = 3$: If 20 observations share the value 3 (a tie group of size 20):

$$t_m^3 - t_m = 20^3 - 20 = 8000 - 20 = 7980$$

$$C = 1 - \frac{7980}{60^3-60} = 1 - \frac{7980}{215940} = 1 - 0.0370 = 0.963$$

The correction increases $H$ by a factor of $1/0.963 = 1.038$ — modest but non-trivial.

If the scale has only 3 values (1, 2, 3) and all are roughly equally common ($t_m = 20$ each):

$$C = 1 - \frac{3(20^3-20)}{215940} = 1 - \frac{23940}{215940} = 1 - 0.111 = 0.889$$

This yields a 12% increase in $H$ — important to apply the correction.
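The correction factor is straightforward to compute from the tie counts (the function name is ours):

```python
from collections import Counter

def tie_correction(values):
    """C = 1 - sum(t^3 - t) / (N^3 - N), summed over tied-value groups."""
    n = len(values)
    tie_sum = sum(t**3 - t for t in Counter(values).values())
    return 1 - tie_sum / (n**3 - n)
```

The corrected statistic is then $H_c = H / C$; with no ties, $C = 1$ and the correction is a no-op.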

12.4 Bayesian Non-Parametric Kruskal-Wallis

A Bayesian extension of the Kruskal-Wallis test computes Bayes Factors for the omnibus hypothesis using a normal approximation to the likelihood of the ranked data.

$$BF_{10} \approx \text{[Bayes Factor from an ANOVA on ranks using the JZS prior]}$$

This can be computed using the same Bayes Factor machinery as for the one-way ANOVA F-test, substituting $F_{ranks}$ for $F$:

$$BF_{10}^{KW} \approx BF_{10}^{ANOVA} \text{ evaluated at } F = H_c/(K-1) \text{ with } \nu = (K-1,\, N-K)$$

DataStatPro provides this as an approximate Bayesian Kruskal-Wallis test.

Advantage: Quantifies evidence for $H_0$ (no group differences), which the frequentist test cannot do.

12.5 Comparing the Kruskal-Wallis Test and One-Way ANOVA

When both the Kruskal-Wallis test and ANOVA are run on the same data:

| Scenario | Recommendation |
|---|---|
| Both significant, similar p-values | Report ANOVA as primary (more efficient); KW as robustness check |
| ANOVA significant; KW not | Likely due to heavy influence of outliers on ANOVA; investigate; KW more trustworthy |
| KW significant; ANOVA not | Possible heavy tails; KW detects rank differences; investigate distribution |
| Both non-significant | Neither test detects an effect; report KW for non-normal data |
| Pre-registered KW (non-normal data) | Report KW as primary; ANOVA as sensitivity check |

Best practice: Pre-specify the choice of test (ANOVA vs. KW) in the study protocol or pre-registration. Run assumption checks (Shapiro-Wilk, Levene's) and justify the test selection. Report both tests as a sensitivity check when possible.

12.6 Robust Alternatives: Trimmed Mean ANOVA

For non-normal data with heavy tails (but not ordinal data), the trimmed mean ANOVA (Yuen-Welch generalisation) is often more powerful than the Kruskal-Wallis test. The choice between the two depends on the distribution: trimmed mean ANOVA targets population trimmed means and accommodates heteroscedasticity, while the Kruskal-Wallis test targets stochastic ordering and is the only option for genuinely ordinal data.

12.7 Reporting the Kruskal-Wallis Test According to APA 7th Edition

Minimum reporting requirements (APA 7th ed.):

  1. State the test used and the reason (non-normality, ordinal data).
  2. Report group medians and IQRs (not means and SDs) as primary descriptives.
  3. Report $H(K-1) =$ [value] (tie-corrected).
  4. Report whether an exact or asymptotic p-value was used.
  5. Report $\eta^2_H =$ [value] [95% CI: LB, UB].
  6. Report post-hoc test results when $H$ is significant.
  7. Report $r_{rb,jk}$ for each significant pairwise comparison.

13. Worked Examples

Example 1: Pain Ratings Across Three Physiotherapy Protocols

A physiotherapist compares post-treatment pain intensity ratings (NRS 0–10; ordinal) across three physiotherapy protocols: Manual Therapy (MT), Exercise Therapy (ET), and Ultrasound Therapy (UT). $n_j = 8$ per group; $N = 24$; $K = 3$.

Normality check: Shapiro-Wilk per group — all $p < .05$. Kruskal-Wallis is appropriate.

Raw data and ranks:

| $i$ | MT ($j=1$) | ET ($j=2$) | UT ($j=3$) |
|---|---|---|---|
| 1 | 3 | 5 | 7 |
| 2 | 2 | 6 | 8 |
| 3 | 4 | 4 | 6 |
| 4 | 1 | 5 | 9 |
| 5 | 3 | 7 | 7 |
| 6 | 2 | 6 | 8 |
| 7 | 4 | 5 | 6 |
| 8 | 1 | 4 | 9 |

Step 1 — Combine and rank all 24 observations:

Sorted values and midranks:

Combined sorted sample: 1,1,2,2,3,3,4,4,4,4,5,5,5,6,6,6,6,7,7,7,8,8,9,9 ($N = 24$ ✓)

| Value | Count | Positions | Midrank |
|---|---|---|---|
| 1 | 2 | 1–2 | 1.5 |
| 2 | 2 | 3–4 | 3.5 |
| 3 | 2 | 5–6 | 5.5 |
| 4 | 4 | 7–10 | 8.5 |
| 5 | 3 | 11–13 | 12.0 |
| 6 | 4 | 14–17 | 15.5 |
| 7 | 3 | 18–20 | 19.0 |
| 8 | 2 | 21–22 | 21.5 |
| 9 | 2 | 23–24 | 23.5 |

Step 2 — Assign ranks to each observation:

Manual Therapy (MT) ranks: 3→5.5, 2→3.5, 4→8.5, 1→1.5, 3→5.5, 2→3.5, 4→8.5, 1→1.5

$$W_1 = 5.5+3.5+8.5+1.5+5.5+3.5+8.5+1.5 = 38.0; \qquad \bar{R}_1 = 38.0/8 = 4.75$$

Exercise Therapy (ET) ranks: 5→12.0, 6→15.5, 4→8.5, 5→12.0, 7→19.0, 6→15.5, 5→12.0, 4→8.5

$$W_2 = 12.0+15.5+8.5+12.0+19.0+15.5+12.0+8.5 = 103.0; \qquad \bar{R}_2 = 103.0/8 = 12.875$$

Ultrasound Therapy (UT) ranks: 7→19.0, 8→21.5, 6→15.5, 9→23.5, 7→19.0, 8→21.5, 6→15.5, 9→23.5

$$W_3 = 19.0+21.5+15.5+23.5+19.0+21.5+15.5+23.5 = 159.0; \qquad \bar{R}_3 = 159.0/8 = 19.875$$

Verification: $W_1+W_2+W_3 = 38.0+103.0+159.0 = 300.0 = 24\times25/2$ ✓

Overall mean rank: $\bar{R} = (24+1)/2 = 12.5$

Step 3 — Compute H:

$$H = \frac{12}{24\times25}\left[\frac{38^2}{8}+\frac{103^2}{8}+\frac{159^2}{8}\right]-3\times25$$

$$= \frac{12}{600}\left[\frac{1444}{8}+\frac{10609}{8}+\frac{25281}{8}\right]-75 = 0.02\left[180.50+1326.125+3160.125\right]-75$$

$$= 0.02\times4666.75-75 = 93.335-75 = 18.335$$

Step 4 — Tie correction:

Tied groups: value 1 ($t=2$), value 2 ($t=2$), value 3 ($t=2$), value 4 ($t=4$), value 5 ($t=3$), value 6 ($t=4$), value 7 ($t=3$), value 8 ($t=2$), value 9 ($t=2$).

$$\sum_m(t_m^3-t_m) = (8-2)+(8-2)+(8-2)+(64-4)+(27-3)+(64-4)+(27-3)+(8-2)+(8-2) = 198$$

$$C = 1 - \frac{198}{24^3-24} = 1 - \frac{198}{13800} = 1 - 0.01435 = 0.9857$$

$$H_c = 18.335/0.9857 = 18.601$$

Step 5 — p-value:

$$p = P(\chi^2_2 \geq 18.601) < .001$$

Step 6 — Effect size:

$$\eta^2_H = \frac{18.601 - 3 + 1}{24-3} = \frac{16.601}{21} = 0.790$$

Very large effect — protocol explains approximately 79% of rank variability.

Step 7 — Dunn post-hoc tests (Holm-corrected):

$$SE_{jk} = \sqrt{\frac{24\times25}{12}\times\frac{2}{8}} = \sqrt{50.0 \times 0.25} = \sqrt{12.5} = 3.536$$

(Approximate; the tie-corrected SE from DataStatPro is used in practice.)

$$z_{12} = \frac{4.75-12.875}{3.536} = -2.298; \quad z_{13} = \frac{4.75-19.875}{3.536} = -4.277; \quad z_{23} = \frac{12.875-19.875}{3.536} = -1.980$$

Raw p-values: $p_{12} = .022$; $p_{13} < .001$; $p_{23} = .048$

Holm-Bonferroni correction ($m = 3$), stepping from the smallest p-value:

  1. $p_{13} < .001 \leq .05/3 = .017$ → reject
  2. $p_{12} = .022 \leq .05/2 = .025$ → reject
  3. $p_{23} = .048 \leq .05/1 = .05$ → reject

All three pairs are significant.

Effect sizes (Tomczak & Tomczak, 2014): for Dunn $z$ values computed from the full-dataset ranks, the effect size is

$$r_{rb,jk} = \frac{|z_{jk}|}{\sqrt{N}}$$

where $N$ is the total sample size:

$$r_{rb,12} = 2.298/\sqrt{24} = 0.469; \quad r_{rb,13} = 4.277/\sqrt{24} = 0.873; \quad r_{rb,23} = 1.980/\sqrt{24} = 0.404$$

| Pair | $z_{jk}$ | $p_{adj}$ (Holm) | $r_{rb}$ | Interpretation |
|---|---|---|---|---|
| MT vs. ET | $-2.298$ | .043 | 0.469 | Medium–large |
| MT vs. UT | $-4.277$ | $< .001$ | 0.873 | Very large |
| ET vs. UT | $-1.980$ | .048 | 0.404 | Medium |

All pairs are significant. MT produces the lowest pain ratings; UT the highest.

Descriptive statistics:

| Group | $n_j$ | Median | IQR | $\bar{R}_j$ |
|---|---|---|---|---|
| MT | 8 | 2.5 | 2.0 | 4.75 |
| ET | 8 | 5.0 | 1.5 | 12.875 |
| UT | 8 | 7.5 | 2.0 | 19.875 |

APA write-up: "Due to non-normal distributions of pain ratings (Shapiro-Wilk tests all $p < .05$) and the ordinal nature of the NRS scale, a Kruskal-Wallis test was conducted. The test revealed a statistically significant difference in pain ratings across physiotherapy protocols, $H(2) = 18.60$, $p < .001$, $\eta^2_H = 0.790$ [95% CI: 0.611, 0.901], indicating a very large effect. Dunn's pairwise post-hoc comparisons with Holm correction indicated that Manual Therapy (Mdn = 2.5, IQR = 2.0) produced significantly lower pain ratings than both Exercise Therapy (Mdn = 5.0, IQR = 1.5), $z = -2.30$, $p_{adj} = .043$, $r_{rb} = 0.47$, and Ultrasound Therapy (Mdn = 7.5, IQR = 2.0), $z = -4.28$, $p_{adj} < .001$, $r_{rb} = 0.87$. Exercise Therapy also produced significantly lower pain ratings than Ultrasound Therapy, $z = -1.98$, $p_{adj} = .048$, $r_{rb} = 0.40$."


Example 2: Motivation Scores Across Four Teaching Methods (Likert Data)

An educational researcher compares student motivation (composite Likert scale 1–50; treated as ordinal) across four teaching methods: Traditional Lecture (L), Flipped Classroom (F), Project-Based Learning (PBL), and Gamification (G). $n_j = 15$ per group; $N = 60$; $K = 4$.

Shapiro-Wilk: significant non-normality in groups L and G. Levene's test: significant heteroscedasticity ($p = .014$). Kruskal-Wallis is appropriate.

Summary statistics per group:

| Group | $n_j$ | Median | IQR | $\bar{R}_j$ | $W_j$ |
|---|---|---|---|---|---|
| Lecture (L) | 15 | 28 | 11 | 16.00 | 240 |
| Flipped (F) | 15 | 34 | 9 | 28.67 | 430 |
| PBL | 15 | 38 | 10 | 37.33 | 560 |
| Gamification (G) | 15 | 41 | 8 | 40.00 | 600 |

Overall mean rank: $(60+1)/2 = 30.5$

Check: $\sum_j W_j = 240+430+560+600 = 1830 = 60\times61/2$ ✓

Compute H:

$$H = \frac{12}{60\times61}\left[\frac{240^2}{15}+\frac{430^2}{15}+\frac{560^2}{15}+\frac{600^2}{15}\right]-3\times61$$

$$= \frac{12}{3660}\cdot\frac{57600+184900+313600+360000}{15}-183 = 0.003279\times61073.33-183$$

$$= 200.24-183 = 17.24$$

Tie correction (many ties are expected with Likert data; assume $C = 0.94$ for this example):

$$H_c = 17.24/0.94 = 18.34$$

p-value: $P(\chi^2_3 \geq 18.34) = .00037 < .001$

Effect size:

$$\eta^2_H = \frac{18.34-4+1}{60-4} = \frac{15.34}{56} = 0.274$$

Large effect.

95% CI for $\eta^2_H$ (bootstrap): $[0.128, 0.409]$

Dunn post-hoc tests (Holm-corrected, $m = 6$ pairs):

$$SE_{jk} = \sqrt{\frac{60\times61}{12}\times\frac{2}{15}} = \sqrt{305\times0.1333} = \sqrt{40.67} = 6.377$$

| Pair | $\bar{R}_j - \bar{R}_k$ | $z_{jk}$ | $p$ (raw) | $p_{adj}$ (Holm) | $r_{rb}$ |
|---|---|---|---|---|---|
| L vs. F | $-12.67$ | $-1.987$ | .047 | .188 | 0.257 |
| L vs. PBL | $-21.33$ | $-3.345$ | .001 | .005 | 0.432 |
| L vs. G | $-24.00$ | $-3.764$ | $< .001$ | .001 | 0.486 |
| F vs. PBL | $-8.67$ | $-1.359$ | .174 | .348 | 0.175 |
| F vs. G | $-11.33$ | $-1.777$ | .076 | .228 | 0.229 |
| PBL vs. G | $-2.67$ | $-0.418$ | .676 | .676 | 0.054 |

where $r_{rb} = |z_{jk}|/\sqrt{N} = |z_{jk}|/\sqrt{60}$.

Significant pairs (after Holm): L vs. PBL ($p_{adj} = .005$) and L vs. G ($p_{adj} = .001$).

Interpretation: Traditional Lecture produces significantly lower motivation than both PBL and Gamification. No other pairs differ significantly.

APA write-up: "Due to significant non-normality (Shapiro-Wilk $p < .05$ for two groups) and heteroscedasticity (Levene's $F(3, 56) = 4.12$, $p = .014$), a Kruskal-Wallis test was conducted to compare student motivation across four teaching methods. The test revealed a significant difference, $H(3) = 18.34$, $p < .001$, $\eta^2_H = 0.274$ [95% CI: 0.128, 0.409], indicating a large effect. Dunn's pairwise post-hoc comparisons with Holm correction indicated that Traditional Lecture (Mdn = 28, IQR = 11) produced significantly lower motivation than both Project-Based Learning (Mdn = 38, IQR = 10), $z = -3.35$, $p_{adj} = .005$, $r_{rb} = 0.43$, and Gamification (Mdn = 41, IQR = 8), $z = -3.76$, $p_{adj} = .001$, $r_{rb} = 0.49$. No other pairwise comparisons reached significance after correction."


Example 3: Jonckheere-Terpstra Test — Drug Dose and Response

A pharmacologist tests whether increasing doses of an analgesic (0 mg, 10 mg, 20 mg, 40 mg) produce monotonically decreasing pain scores. $n_j = 10$ per dose group; $N = 40$; $K = 4$.

Group medians: 0 mg: 7.5; 10 mg: 6.0; 20 mg: 4.5; 40 mg: 2.5 — clearly monotonic.

Since the groups are ordered and a monotone trend is hypothesised, the Jonckheere-Terpstra test is more powerful than the Kruskal-Wallis test.

JT statistic (computed by DataStatPro, with the groups entered in the direction of the predicted trend): $J = 387$

$$E[J] = \frac{N^2 - \sum_j n_j^2}{4} = \frac{1600 - 400}{4} = 300$$

$$\text{Var}[J] = \frac{N^2(2N+3) - \sum_j n_j^2(2n_j+3)}{72} = \frac{132800 - 9200}{72} = 1716.67 \quad\text{(no ties)}$$

$$z_J = \frac{387-300}{\sqrt{1716.67}} = \frac{87}{41.43} = 2.10, \quad p = .018 \text{ (one-tailed)}$$

Kruskal-Wallis for comparison: $H_c = 22.14$, $p < .001$, $\eta^2_H = 0.474$

Because the JT test concentrates its power on the single ordered alternative, it is the appropriate choice, and in general the more powerful one, when a monotone trend is predicted a priori.

APA write-up: "Since a monotone dose-response relationship was hypothesised a priori, a Jonckheere-Terpstra test was used to test for ordered differences in pain scores across dose levels (0, 10, 20, 40 mg). The test confirmed a significant monotonic trend, $J = 387$, $z = 2.10$, $p = .018$ (one-tailed), indicating that higher doses produced systematically lower pain ratings."


Example 4: Non-Significant Result with Sensitivity Analysis

An ergonomics researcher compares workstation satisfaction ratings (1–10 scale; ordinal) across five office configurations: Traditional Desk (TD), Standing Desk (SD), Treadmill Desk (TDM), Sit-Stand Desk (SS), and Lounge Area (LA). $n_j = 10$ per group; $N = 50$; $K = 5$.

Result: $H_c(4) = 7.84$, $p = .097$

Effect size: $\eta^2_H = (7.84-5+1)/(50-5) = 3.84/45 = 0.085$

The result is non-significant at $\alpha = .05$ (though borderline). $\eta^2_H = 0.085$ suggests a small-to-medium effect that this study is underpowered to detect.

Sensitivity analysis:

For 80% power with $N = 50$, $K = 5$: minimum detectable $\eta^2_H \approx 0.144$ (using the non-central $\chi^2$ approach). The observed $\eta^2_H = 0.085$ is below this threshold — the study was underpowered for the observed effect.

95% CI for $\eta^2_H$ (bootstrap): $[0.000, 0.198]$ — spans from zero to a medium effect; very imprecise.

APA write-up: "A Kruskal-Wallis test was conducted to compare workstation satisfaction across five office configurations. The test revealed no statistically significant difference, $H(4) = 7.84$, $p = .097$, $\eta^2_H = 0.085$ [95% CI: 0.000, 0.198]. This corresponds to a small-to-medium effect that the study was underpowered to detect (minimum detectable $\eta^2_H = 0.144$ at 80% power for this sample size). A larger sample ($N \geq 100$, $n \geq 20$ per group) would be required to reliably detect effects of this magnitude. Post-hoc pairwise comparisons were not conducted given the non-significant omnibus result."


14. Common Mistakes and How to Avoid Them

Mistake 1: Reporting Means and SDs Instead of Medians and IQRs

Problem: Running the Kruskal-Wallis test (because data are non-normal or ordinal) but reporting group means and standard deviations as the primary descriptive statistics. Means and SDs are not appropriate for skewed or ordinal data and contradict the rationale for choosing the Kruskal-Wallis test.

Solution: When reporting Kruskal-Wallis results, always report medians and IQRs (or full range, minimum, maximum) as the primary descriptive statistics. Means and SDs may be provided as supplementary information but should not be the primary summary.


Mistake 2: Interpreting H as a Test of Equal Means

Problem: Concluding from a significant Kruskal-Wallis result that "the group means differ significantly." The Kruskal-Wallis test is based on ranks and tests stochastic equality — it is a test of medians (under the location-shift assumption) or a test of distributional differences more broadly.

Solution: State clearly that the Kruskal-Wallis test examines whether groups differ in their rank distributions (or medians under the location-shift assumption). Do not use the language of means unless you separately justify that the distributions have the same shape.


Mistake 3: Not Checking the Shape Assumption

Problem: Applying the Kruskal-Wallis test and interpreting it as a test of equal medians without checking whether the distribution shapes are similar across groups. If shapes differ substantially (e.g., one group is symmetric and another is right-skewed), the test may be detecting shape differences rather than location differences.

Solution: Always produce density plots and boxplots for all groups before running the test. Check whether distributions have approximately the same shape. If shapes differ, state that the test is interpreted as a test of stochastic equality rather than equal medians.


Mistake 4: Running Pairwise Post-Hoc Tests Without a Significant Omnibus Test

Problem: Running Dunn or Mann-Whitney pairwise tests regardless of the Kruskal-Wallis result, and selectively reporting significant pairs. This inflates the FWER to $> \alpha$.

Solution: Only run post-hoc pairwise comparisons after a significant omnibus Kruskal-Wallis test (except for pre-registered planned contrasts). When the omnibus test is non-significant, report the non-significant $H_c$ with its effect size and perform a sensitivity analysis. Do not report individual pairwise tests as "exploratory" without making it clear they were not protected by a significant omnibus result.


Mistake 5: Failing to Apply the Tie Correction

Problem: Computing HH without applying the tie correction CC, particularly with coarsely measured ordinal data (e.g., 5-point Likert scales) where many ties are expected. The uncorrected HH underestimates the true test statistic, producing a conservative test.

Solution: Always apply the tie correction. DataStatPro applies it automatically. When reporting, note whether the tie correction was applied and report $C$ when it deviates substantially from 1 (e.g., $C < 0.95$).
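To make the correction concrete, it can be computed directly from the pooled values and compared against SciPy's `kruskal`, which applies the correction automatically. The Likert responses below are invented example data; `tie_correction` is a hypothetical helper implementing the $C$ formula from this guide:

```python
from collections import Counter

from scipy import stats

def tie_correction(pooled):
    """C = 1 - sum(t^3 - t) / (N^3 - N), summed over each set of tied values."""
    N = len(pooled)
    return 1 - sum(t**3 - t for t in Counter(pooled).values()) / (N**3 - N)

# Hypothetical 5-point Likert responses for three groups (many ties expected)
g1, g2, g3 = [3, 4, 4, 5, 2], [2, 3, 3, 4, 4], [1, 2, 2, 3, 3]
C = tie_correction(g1 + g2 + g3)
H_c, p = stats.kruskal(g1, g2, g3)  # SciPy divides H by C automatically
print(f"C = {C:.3f}, H_c = {H_c:.3f}, p = {p:.4f}")
```

SciPy exposes the same factor as `scipy.stats.tiecorrect`, which is a convenient cross-check on any hand computation.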


Mistake 6: Using the Kruskal-Wallis Test for Repeated Measures Data

Problem: Applying the Kruskal-Wallis test to data where the same participants appear in multiple conditions (repeated measures or paired design). The KW test assumes independence of all observations — repeated measures data violate this assumption.

Solution: For repeated measures (within-subjects) non-parametric comparison of $K \geq 3$ conditions, use the Friedman test. For exactly two related conditions, use the Wilcoxon Signed-Rank Test.
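For contrast, a within-subjects design like this calls for `scipy.stats.friedmanchisquare` rather than `kruskal`. The scores below are invented for illustration:

```python
from scipy import stats

# Hypothetical scores for the SAME five participants under three conditions
cond_a = [5.1, 6.2, 5.8, 7.0, 6.5]
cond_b = [5.9, 6.8, 6.1, 7.4, 7.0]
cond_c = [6.4, 7.1, 6.9, 7.9, 7.2]

# The Friedman test ranks within each participant, respecting the pairing;
# Kruskal-Wallis would wrongly treat these 15 scores as independent
chi2_r, p = stats.friedmanchisquare(cond_a, cond_b, cond_c)
print(f"chi^2_r = {chi2_r:.3f}, p = {p:.4f}")
```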


Mistake 7: Not Reporting Effect Sizes

Problem: Reporting $H(K-1) =$ [value], $p =$ [value] without any effect size measure. The $H$ statistic alone is uninterpretable without knowing $N$, and the p-value conveys nothing about effect magnitude.

Solution: Always report $\eta^2_H$ (or $\epsilon^2_H$) with its 95% CI. For each significant pairwise comparison, report $r_{rb,jk}$ and the probability of superiority interpretation.
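The omnibus effect size is a one-liner once $H_c$, $K$, and $N$ are known. A minimal sketch using the formulas from this guide (the input values are illustrative, and the helper names are ours):

```python
def eta_squared_H(H_c, K, N):
    """eta^2_H = (H_c - K + 1) / (N - K); negative values are reported as 0."""
    return max(0.0, (H_c - K + 1) / (N - K))

def cohens_f_equiv(eta2):
    """Cohen's f equivalent: f = sqrt(eta^2 / (1 - eta^2))."""
    return (eta2 / (1 - eta2)) ** 0.5

# e.g., a tie-corrected H_c = 9.42 with K = 3 groups and N = 60 observations
eta2 = eta_squared_H(9.42, 3, 60)  # (9.42 - 2) / 57, roughly 0.130
print(f"eta^2_H = {eta2:.3f}, f = {cohens_f_equiv(eta2):.3f}")
```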


Mistake 8: Applying the Kruskal-Wallis Test When the Data Are Clearly Normal

Problem: Reflexively using the Kruskal-Wallis test for all ordinal or non-parametric situations without considering whether the data might actually be approximately normal. The KW test loses about 5% power relative to ANOVA for normal data, and for Likert composite scales with many items, the distribution is often approximately normal.

Solution: If a composite scale (sum of many Likert items) is approximately normally distributed (Shapiro-Wilk $p > .05$, histogram approximately bell-shaped), use the one-way ANOVA. Reserve the Kruskal-Wallis test for genuinely non-normal data, small samples with non-normal distributions, or true ordinal single-item measures.


Mistake 9: Using Incorrect Post-Hoc Tests

Problem: Using t-tests or ANOVA-based post-hoc tests (e.g., Tukey HSD based on $MS_{within}$) after a significant Kruskal-Wallis test. These parametric post-hoc tests assume normality and homoscedasticity, exactly the assumptions that led to choosing the Kruskal-Wallis test in the first place.

Solution: After a significant Kruskal-Wallis test, use non-parametric post-hoc procedures — Dunn's test, Conover-Iman, Steel-Dwass, or pairwise Mann-Whitney tests with appropriate FWER correction. Do not use parametric post-hoc methods.


Mistake 10: Ignoring the Exact Test for Small Samples

Problem: Using the chi-squared approximation to compute the p-value when $n_j < 5$ per group. The chi-squared approximation is inaccurate for very small groups, potentially producing substantially incorrect p-values.

Solution: When any $n_j < 5$, use the exact permutation distribution of $H$. DataStatPro automatically switches to the exact test when $n_j < 5$. For published research with small groups, always report whether the exact or asymptotic test was used.
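When full enumeration is infeasible, a Monte Carlo permutation p-value for $H$ can be sketched as below. The groups are made-up data, `kw_permutation_p` is our own illustrative helper, and the number of permutations is kept small here for speed:

```python
import numpy as np
from scipy import stats

def kw_permutation_p(groups, n_perm=10_000, seed=42):
    """Monte Carlo permutation p-value for the Kruskal-Wallis H statistic."""
    rng = np.random.default_rng(seed)
    cuts = np.cumsum([len(g) for g in groups])[:-1]
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    h_obs = stats.kruskal(*groups).statistic
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign values to groups
        if stats.kruskal(*np.split(pooled, cuts)).statistic >= h_obs:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Very small groups, where the chi-squared approximation is dubious
p_mc = kw_permutation_p([[2.1, 3.4, 1.8], [4.2, 5.1, 3.9], [6.0, 5.5]],
                        n_perm=2000)
```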


15. Troubleshooting

| Problem | Likely Cause | Solution |
|---|---|---|
| $\sum W_j \neq N(N+1)/2$ | Ranking error; incorrect midrank computation | Recheck all rank assignments; verify midrank formula |
| $H < 0$ | Arithmetic error | $H$ is always $\geq 0$; recheck computation |
| $\eta^2_H < 0$ | Very small sample; correction $(H_c - K + 1)$ overshoots | Report as 0 by convention; increase sample size; note near-zero effect |
| $C = 1$ despite many ties | Ties within the same group only (no cross-group ties) | Check that ranking was done across groups (required), not within groups |
| Chi-squared approximation and exact test give very different p-values | Very small $n_j < 5$ | Use exact test; report it explicitly |
| KW significant but ANOVA not | Outliers inflating ANOVA error; KW detects rank differences | Inspect distributions; KW result is more trustworthy for non-normal data |
| ANOVA significant but KW not | Moderate non-normality with ANOVA robust at large $n$; heavy ties reducing KW power | With $n_j \geq 30$, ANOVA may be valid; investigate distribution |
| Post-hoc tests show no significant pairs despite significant $H$ | Effect is diffuse across many small differences; Holm correction too conservative | Consider FDR correction for exploratory work; report all $r_{rb}$ values |
| Dunn test $z$ values exceed $\pm 3$ for small groups | Large mean rank differences with small $n_j$ | Likely a genuine large effect; use exact Mann-Whitney for those pairs |
| $r_{rb}$ exceeds $\pm 1$ | Incorrect formula; using $\sqrt{n_j + n_k}$ when total $N$ should be used | Use $r_{rb} = z/\sqrt{N}$ for Dunn-based conversion, or compute directly from $U$ |
| Tie correction $C < 0.85$ | Very many ties (coarse ordinal scale) | Report $C$ explicitly; use permutation version; consider sign-based alternatives |
| Jonckheere-Terpstra gives a different conclusion than Kruskal-Wallis | JT uses directional order information; groups may not follow a monotone pattern | Report both tests; investigate which group pattern supports the trend |
| Exact test is computationally slow | Large $N$ or many groups making enumeration infeasible | Use Monte Carlo permutation approximation ($B = 10{,}000$); report this choice |
| Cannot compute Hodges-Lehmann estimate | Only test statistic available (no raw data) | HL estimate requires raw data; report group medians from published descriptives |
| Post-hoc FWER exceeds nominal $\alpha$ | Using uncorrected pairwise tests | Always apply Holm (at minimum) or Bonferroni correction to all $m$ pairwise tests |
| No significant pairs after Holm despite significant omnibus | Holm too conservative for diffuse effects | Consider Benjamini-Hochberg FDR if exploratory; report effect sizes for all pairs |

16. Quick Reference Cheat Sheet

Core Formulas

| Formula | Description |
|---|---|
| $N = \sum_{j=1}^K n_j$ | Total sample size |
| $\bar{R} = (N+1)/2$ | Overall mean rank |
| $W_j = \sum_{i=1}^{n_j} R_{ij}$ | Rank sum for group $j$ |
| $\bar{R}_j = W_j/n_j$ | Mean rank for group $j$ |
| $\sum_j W_j = N(N+1)/2$ | Verification check |
| $H = \frac{12}{N(N+1)}\sum_j W_j^2/n_j - 3(N+1)$ | Kruskal-Wallis $H$ statistic |
| $H = \frac{12}{N(N+1)}\sum_j n_j(\bar{R}_j - (N+1)/2)^2$ | Equivalent form |
| $C = 1 - \frac{\sum_m (t_m^3 - t_m)}{N^3 - N}$ | Tie correction factor |
| $H_c = H/C$ | Tie-corrected $H$ |
| $p = P(\chi^2_{K-1} \geq H_c)$ | Asymptotic p-value |
| $df = K - 1$ | Degrees of freedom |
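These formulas can be checked end-to-end in a few lines. The sketch below uses made-up, tie-free data, verifies the rank-sum identity, and confirms that the rank-sum form of $H$ matches SciPy:

```python
import numpy as np
from scipy import stats

groups = [[6.1, 7.3, 5.8], [4.9, 5.2, 6.0, 4.4], [8.1, 7.7, 8.4]]
sizes = [len(g) for g in groups]
pooled = np.concatenate(groups)
N = len(pooled)

ranks = stats.rankdata(pooled)  # rank across all groups pooled together
W = [r.sum() for r in np.split(ranks, np.cumsum(sizes)[:-1])]
assert sum(W) == N * (N + 1) / 2  # verification check: rank sums add up

# H = 12/(N(N+1)) * sum(W_j^2 / n_j) - 3(N+1)
H = 12 / (N * (N + 1)) * sum(w**2 / n for w, n in zip(W, sizes)) - 3 * (N + 1)
# With no ties C = 1, so H equals SciPy's tie-corrected statistic
```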

Effect Size Formulas

| Formula | Description |
|---|---|
| $\eta^2_H = (H_c - K + 1)/(N - K)$ | Eta squared for KW (primary) |
| $\epsilon^2_H = (H_c - (K - 1))/(N - K)$ | Epsilon squared (less biased) |
| $\eta^2_H \approx H_c/(N - 1)$ | Approximation (balanced design) |
| $f_{equiv} = \sqrt{\eta^2_H/(1 - \eta^2_H)}$ | Cohen's $f$ equivalent |
| $r_{rb,jk} = z_{jk}/\sqrt{N}$ | Rank-biserial $r$ from Dunn $z$ |
| $r_{rb,jk} = 1 - 2U_{jk}/(n_j n_k)$ | Rank-biserial $r$ from Mann-Whitney $U$ |
| $PS_{jk} = (1 + r_{rb,jk})/2$ | Probability of superiority |
| $d \approx 2r_{rb}/\sqrt{1 - r_{rb}^2}$ | Approx. conversion of $r_{rb}$ to Cohen's $d$ |

Post-Hoc Test Formulas (Dunn's Test)

| Formula | Description |
|---|---|
| $z_{jk} = (\bar{R}_j - \bar{R}_k)/SE_{jk}$ | Dunn's z-statistic |
| $SE_{jk} = \sqrt{\frac{N(N+1)}{12}\cdot\frac{n_j+n_k}{n_j n_k}}$ | SE (no ties; simplified) |
| $p_{jk} = 2[1 - \Phi(\lvert z_{jk}\rvert)]$ | Two-sided unadjusted p-value |
| $m = K(K-1)/2$ | Number of pairwise comparisons |
| Holm: sort $p_{(i)}$; compare to $\alpha/(m - i + 1)$ | Holm-Bonferroni correction |
| Bonferroni: $p_{adj} = \min(1, p \times m)$ | Bonferroni correction |

Cohen's Benchmarks for ηH2\eta^2_H

| $\eta^2_H$ | $f$ equivalent | Label |
|---|---|---|
| 0.010 | 0.10 | Small |
| 0.059 | 0.25 | Medium |
| 0.138 | 0.40 | Large |
| 0.200 | 0.50 | Very large |
| 0.260 | 0.59 | Very large |

Cohen's Benchmarks for rrbr_{rb} (Pairwise)

| $\lvert r_{rb}\rvert$ | Label | $PS$ (%) |
|---|---|---|
| 0.10 | Small | 55% |
| 0.30 | Medium | 65% |
| 0.50 | Large | 75% |
| 0.70 | Very large | 85% |
| 0.90 | Huge | 95% |

Required Sample Size per Group (80% Power, α=.05\alpha = .05)

| $\eta^2_H$ equiv. | Cohen's $f$ | $K = 3$ | $K = 4$ | $K = 5$ | $K = 6$ |
|---|---|---|---|---|---|
| 0.010 | 0.10 | 337 | 287 | 251 | 225 |
| 0.022 | 0.15 | 151 | 129 | 112 | 101 |
| 0.059 | 0.25 | 55 | 47 | 41 | 37 |
| 0.109 | 0.35 | 29 | 25 | 22 | 20 |
| 0.138 | 0.40 | 22 | 19 | 17 | 15 |
| 0.200 | 0.50 | 15 | 13 | 12 | 11 |
| 0.265 | 0.60 | 11 | 10 | 9 | 8 |

Based on ARE-adjusted ANOVA sample sizes. Use DataStatPro Monte Carlo for non-normal distributions.

Sensitivity Analysis: Minimum Detectable ηH2\eta^2_H (80% Power, α=.05\alpha = .05)

| Total $N$ | $K = 3$ | $K = 4$ | $K = 5$ |
|---|---|---|---|
| 30 | 0.195 | 0.243 | 0.287 |
| 60 | 0.097 | 0.122 | 0.144 |
| 90 | 0.065 | 0.081 | 0.096 |
| 150 | 0.039 | 0.049 | 0.058 |
| 300 | 0.020 | 0.025 | 0.029 |

ARE Comparison: Kruskal-Wallis vs. One-Way ANOVA

| Distribution | ARE | Required $n$ (KW vs. ANOVA) |
|---|---|---|
| Normal | 0.955 | KW needs $\approx$ 5% more |
| Uniform | 1.000 | Identical |
| Logistic | 1.097 | KW needs $\approx$ 9% fewer |
| Laplace | 1.500 | KW needs 33% fewer |
| Contaminated normal | $> 1.500$ | KW substantially more powerful |

Test Selection Guide

Three or more independent groups, continuous/ordinal DV?
├── Is DV ordinal (single Likert item, ranks)?
│   └── YES → Kruskal-Wallis Test ✅
│           └── Ordered groups? → Jonckheere-Terpstra ✅
└── Is DV continuous?
    └── Check normality (Shapiro-Wilk) and equal variances (Levene's)
        ├── Both satisfied (or n_j ≥ 30) → One-Way ANOVA
        │   └── Levene's significant → Welch's ANOVA
        └── Normality violated (n_j < 30) or severe outliers
            └── Kruskal-Wallis Test ✅
                └── Ordered groups? → Jonckheere-Terpstra ✅

Post-hoc (after significant H):
├── Standard → Dunn + Holm ✅
├── More power → Conover-Iman + Holm ✅
├── Non-parametric Tukey equivalent → Steel-Dwass ✅
└── Planned a priori → Pairwise Mann-Whitney + Holm ✅

Comparison: Kruskal-Wallis vs. One-Way ANOVA vs. Friedman

| Property | One-Way ANOVA | Kruskal-Wallis | Friedman |
|---|---|---|---|
| Design | Independent groups | Independent groups | Repeated measures |
| Assumes normality | ✅ Yes | ❌ No | ❌ No |
| Assumes equal variances | ✅ Yes (or Welch's) | Shape similarity | ❌ No |
| Test statistic | $F$ | $H$ | $\chi^2_r$ |
| Effect size | $\omega^2$, $\eta^2$ | $\eta^2_H$, $\epsilon^2_H$ | Kendall's $W$ |
| Post-hoc | Tukey, Games-Howell | Dunn + Holm | Wilcoxon + Holm |
| ARE vs. normal parametric | 1.000 | 0.955 | 0.955 |
| Handles ordinal DV | ❌ No | ✅ Yes | ✅ Yes |

APA 7th Edition Reporting Templates

Standard Kruskal-Wallis (significant result):

"Due to [non-normal distributions / ordinal measurement scale / significant heteroscedasticity], a Kruskal-Wallis test was conducted to compare [DV] across [K] groups of [IV]. The test revealed a statistically significant difference, $H(K-1) =$ [value], $p =$ [value], $\eta^2_H =$ [value] [95% CI: LB, UB], indicating a [small / medium / large] effect. Dunn's pairwise post-hoc comparisons with Holm-Bonferroni correction indicated that [describe significant pairs with Mdn, IQR, $z$, $p_{adj}$, $r_{rb}$]. [Describe non-significant pairs.]"

Kruskal-Wallis (non-significant result):

"A Kruskal-Wallis test revealed no statistically significant difference in [DV] across [K] groups, $H(K-1) =$ [value], $p =$ [value], $\eta^2_H =$ [value] [95% CI: LB, UB]. This study had 80% power to detect effects of $\eta^2_H \geq$ [value] for this sample size; smaller effects remain undetected. Post-hoc pairwise comparisons were not conducted given the non-significant omnibus result."

With Jonckheere-Terpstra:

"Since groups represented ordered levels of [IV], a Jonckheere-Terpstra test was used to test for a monotonic trend. The test [confirmed / did not confirm] a significant [increasing / decreasing] trend in [DV] across [IV] levels, $J =$ [value], $z =$ [value], $p =$ [value]."

Kruskal-Wallis Test Reporting Checklist

| Item | Required |
|---|---|
| Statement of why KW was used | ✅ Always |
| Group medians and IQRs | ✅ Always |
| Group mean ranks $\bar{R}_j$ | ✅ Recommended |
| $n_j$ per group | ✅ Always |
| $H_c$ (tie-corrected) with df | ✅ Always |
| Tie correction factor $C$ | ✅ When $C < 0.99$ |
| Whether exact or asymptotic $p$ used | ✅ Always |
| Exact p-value (or $p < .001$) | ✅ Always |
| $\eta^2_H$ with 95% CI | ✅ Always |
| $\epsilon^2_H$ alongside $\eta^2_H$ | ✅ Recommended |
| Post-hoc test name and correction | ✅ When $H$ significant |
| $z_{jk}$ and $p_{adj}$ per pair | ✅ When $H$ significant |
| $r_{rb,jk}$ per significant pair | ✅ When $H$ significant |
| Probability of superiority | ✅ Recommended |
| 95% CI for $r_{rb,jk}$ | ✅ Recommended |
| Density plots or boxplots per group | ✅ Strongly recommended |
| Shape assumption assessment | ✅ Always |
| Sensitivity analysis | ✅ For null results |
| Comparison with ANOVA (sensitivity) | ✅ Recommended |
| Domain-specific benchmark context | ✅ Recommended |

Conversion Formulas

| From | To | Formula |
|---|---|---|
| $H_c$, $K$, $N$ | $\eta^2_H$ | $\eta^2_H = (H_c - K + 1)/(N - K)$ |
| $\eta^2_H$ | Cohen's $f$ | $f = \sqrt{\eta^2_H/(1 - \eta^2_H)}$ |
| $z_{jk}$ (Dunn), $N$ | $r_{rb,jk}$ | $r_{rb} = z_{jk}/\sqrt{N}$ |
| $U_{jk}$, $n_j$, $n_k$ | $r_{rb,jk}$ | $r_{rb} = 1 - 2U/(n_j n_k)$ |
| $r_{rb}$ | $PS$ | $PS = (1 + r_{rb})/2$ |
| $r_{rb}$ | Cohen's $d$ (approx.) | $d \approx 2r_{rb}/\sqrt{1 - r_{rb}^2}$ |
| Cohen's $d$ | $r_{rb}$ (approx.) | $r_{rb} \approx d/\sqrt{d^2 + 4}$ |
| $n_{ANOVA}$ | $n_{KW}$ (normal data) | $n_{KW} \approx n_{ANOVA} \times \pi/3 \approx 1.047 \times n_{ANOVA}$ |
| $H_c$ | ANOVA $F_{ranks}$ (approx.) | $F_{ranks} \approx H_c/(K - 1)$ |
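Several of these conversions are exact inverses of one another, which makes a quick self-check easy. A minimal sketch (the helper names are ours, chosen for readability):

```python
import math

def rrb_to_d(r_rb):
    """Approximate Cohen's d from rank-biserial r: d = 2r / sqrt(1 - r^2)."""
    return 2 * r_rb / math.sqrt(1 - r_rb**2)

def d_to_rrb(d):
    """Inverse mapping: r_rb = d / sqrt(d^2 + 4)."""
    return d / math.sqrt(d**2 + 4)

def probability_of_superiority(r_rb):
    """PS = (1 + r_rb) / 2."""
    return (1 + r_rb) / 2

# The two approximate d conversions round-trip exactly
r = 0.30
assert abs(d_to_rrb(rrb_to_d(r)) - r) < 1e-12

# PS for a medium pairwise effect, consistent with the benchmark table above
print(probability_of_superiority(0.30))
```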

This tutorial provides a comprehensive foundation for understanding, conducting, and reporting the Kruskal-Wallis Test within the DataStatPro application. For further reading, consult the original paper by Kruskal & Wallis "Use of Ranks in One-Criterion Variance Analysis" (Journal of the American Statistical Association, 1952); Conover's "Practical Nonparametric Statistics" (3rd ed., 1999) for comprehensive coverage including the Conover-Iman post-hoc test; Hollander, Wolfe & Chicken's "Nonparametric Statistical Methods" (3rd ed., 2014) for rigorous mathematical treatment; Dunn's "Multiple Comparisons Among Means" (Journal of the American Statistical Association, 1964) for the Dunn post-hoc procedure; Tomczak & Tomczak's "The Need to Report Effect Size Estimates Revisited" (Trends in Sport Sciences, 2014) for effect size guidance; and Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018) for accessible applied coverage. For the Jonckheere-Terpstra test, see Jonckheere's "A Distribution-Free k-Sample Test Against Ordered Alternatives" (Biometrika, 1954) and Terpstra's "The Asymptotic Normality and Consistency of Kendall's Test Against Trend" (Indagationes Mathematicae, 1952). For feature requests or support, contact the DataStatPro team.