Mann-Whitney U Test: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of non-parametric inference all the way through the complete theory, mathematics, assumptions, effect sizes, interpretation, reporting, and practical usage of the Mann-Whitney U Test within the DataStatPro application. Whether you are encountering the Mann-Whitney U Test for the first time or seeking a deeper understanding of rank-based methods for comparing two independent groups, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is the Mann-Whitney U Test?
- The Mathematics Behind the Mann-Whitney U Test
- Assumptions of the Mann-Whitney U Test
- Variants and Related Tests
- Using the Mann-Whitney U Test Calculator Component
- Exact vs. Approximate Methods
- Effect Sizes for the Mann-Whitney U Test
- Confidence Intervals
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into the Mann-Whitney U Test, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.
1.1 Parametric vs. Non-Parametric Inference
Parametric tests (e.g., the independent samples t-test) make explicit assumptions about the shape of the population distribution — typically normality — and estimate specific population parameters (e.g., $\mu$, $\sigma^2$). Their validity depends on those distributional assumptions being met.
Non-parametric tests (also called distribution-free tests) do not assume a specific functional form for the population distribution. The Mann-Whitney U Test is non-parametric: it does not assume normality. Instead of operating on raw scores, it operates on the ranks of those scores.
⚠️ "Distribution-free" is not synonymous with "assumption-free." The Mann-Whitney U Test has its own set of assumptions, reviewed in Section 4. Violating these assumptions can invalidate its conclusions just as surely as violating normality invalidates the t-test.
1.2 Ordinal Data and Ranks
Ordinal data convey the relative ordering of observations but not the magnitude of differences between them. Examples include:
- Likert scale responses (1 = Strongly Disagree, 5 = Strongly Agree).
- Pain ratings on a 0–10 visual analogue scale.
- Customer satisfaction rankings.
- Academic grade categories (Distinction, Merit, Pass, Fail).
Ranking is the process of replacing each raw score with its position in an ordered list. For $N$ total observations:
- The smallest observation receives rank 1.
- The largest receives rank $N$.
- Tied observations receive the average of the ranks they would have occupied (mid-ranks).
Example: Raw scores $(4, 7, 7, 9)$ become ranks $(1, 2.5, 2.5, 4)$ (the two tied 7s share ranks 2 and 3, so each receives $(2 + 3)/2 = 2.5$).
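The mid-rank procedure can be sketched in a few lines of pure Python. This is an illustrative sketch, not DataStatPro's internal code; SciPy's `scipy.stats.rankdata` with `method='average'` performs the same computation:

```python
def midranks(values):
    """Replace each value with its 1-based rank; tied values share the mid-rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j to the end of the run of tied values
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        i = j + 1
    return ranks

print(midranks([4, 7, 7, 9]))  # the two tied 7s share ranks 2 and 3 -> 2.5 each
```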
1.3 The Concept of Stochastic Dominance
The Mann-Whitney U Test is fundamentally a test of stochastic dominance. Group 1 stochastically dominates Group 2 if a randomly chosen observation from Group 1 tends to be larger than a randomly chosen observation from Group 2:

$$P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2) > 0.5$$

The test statistic $U_1$ directly estimates this probability (via $U_1 / (n_1 n_2)$), making the Mann-Whitney U Test one of the most intuitively interpretable inferential tests in statistics.
1.4 The Independent Samples t-Test and Its Limitations
The independent samples t-test is the parametric alternative to the Mann-Whitney U Test. It is appropriate when:
- Data are continuous and at least interval-scaled.
- Both groups' populations are approximately normally distributed.
- Population variances are equal (or Welch's correction is applied).
When normality is markedly violated (especially with small samples) or when data are ordinal, the t-test is inappropriate and the Mann-Whitney U Test is preferred.
1.5 Statistical Power and Asymptotic Relative Efficiency
The Asymptotic Relative Efficiency (ARE) compares the power of two tests as $N \to \infty$. The ARE of the Mann-Whitney U Test relative to the t-test is:

$$\mathrm{ARE} = \frac{3}{\pi} \approx 0.955 \quad \text{(under normality)}$$
This means:
- When data are normally distributed, you need approximately 5% more observations with the Mann-Whitney test to achieve the same power as the t-test.
- When data are non-normal (heavy-tailed, skewed, or contaminated), the Mann-Whitney test can be substantially more powerful than the t-test.
This near-equivalence under normality makes the Mann-Whitney test a safe default when normality is uncertain.
1.6 The Probability of Superiority
The probability of superiority (PS) — equivalent to the Common Language Effect Size — is the probability that a randomly selected observation from Group 1 exceeds a randomly selected observation from Group 2:

$$PS = P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2)$$

- $PS = 0.5$: No tendency for one group to exceed the other (null hypothesis).
- $PS = 0.75$: 75% of the time, a random Group 1 observation exceeds a random Group 2 observation — a large, practically meaningful difference.
- $PS = 1.0$: Every Group 1 observation exceeds every Group 2 observation (perfect separation).
This interpretation is central to understanding the Mann-Whitney U Test's effect size.
1.7 Hypothesis Testing Framework
Every Mann-Whitney U Test operates within the standard hypothesis testing framework:
Step 1 — State the hypotheses (see Section 4 for precise formulations).
Step 2 — Choose $\alpha$ — the significance level (conventionally $\alpha = .05$).
Step 3 — Compute the test statistic $U$ (or its standardised form $z$).
Step 4 — Compute the p-value — the probability of observing a statistic at least as extreme as the one obtained, assuming $H_0$.
Step 5 — Make a decision — reject $H_0$ if $p \le \alpha$.
Step 6 — Compute and report the effect size — the rank-biserial correlation $r$ with its 95% confidence interval.
2. What is the Mann-Whitney U Test?
2.1 The Core Idea
The Mann-Whitney U Test (also called the Wilcoxon Rank-Sum Test, or Wilcoxon-Mann-Whitney Test) is a non-parametric test for comparing two independent groups. Rather than comparing group means directly (as the t-test does), it assesses whether observations from one group tend to be systematically larger or smaller than observations from the other group.
The test was independently developed by:
- Frank Wilcoxon (1945) — proposed the Rank-Sum Test.
- Henry Mann and Donald Whitney (1947) — developed the equivalent statistic and derived its exact null distribution.
The two formulations are mathematically equivalent: they produce the same p-value.
2.2 Research Questions the Mann-Whitney U Test Answers
The Mann-Whitney U Test answers:
"Do observations from Group 1 tend to have systematically higher (or lower) values than observations from Group 2?"
More formally, under the location-shift assumption (see Section 4):
"Is the median of Group 1 equal to the median of Group 2?"
2.3 When to Use the Mann-Whitney U Test
The Mann-Whitney U Test is the appropriate choice when:
| Condition | Details |
|---|---|
| Two independent groups | Different participants in each group |
| Non-normal distribution | Normality violated; Shapiro-Wilk significant |
| Ordinal dependent variable | Likert scales, pain ratings, satisfaction scores |
| Small sample size | Small $n$ per group; the CLT may not apply |
| Presence of outliers | Extreme values distort the t-test |
| Skewed distributions | Reaction times, income, response latencies |
| Bounded scales | Ceiling or floor effects distorting normality |
2.4 The Mann-Whitney U Test vs. the Independent Samples t-Test
| Property | Independent t-Test | Mann-Whitney U Test |
|---|---|---|
| Tests | Mean difference | Distributional dominance / median shift |
| Scale | Interval / Ratio | Ordinal or higher |
| Assumes normality | ✅ Yes | ❌ No |
| Sensitive to outliers | ✅ High | ❌ Low (rank-based) |
| Power (when normal) | Slightly higher | $\approx 95.5\%$ of the t-test |
| Power (when non-normal) | Can be lower | Can exceed t-test |
| Effect size | Cohen's $d$ | Rank-biserial $r$ |
| Parametric | ✅ Yes | ❌ No |
2.5 Real-World Applications
| Field | Application | Example |
|---|---|---|
| Clinical Psychology | Symptom severity between two treatment arms | PTSD symptom score: EMDR vs. CBT |
| Medicine | Recovery time between two surgical techniques | Days to discharge: laparoscopic vs. open |
| Education | Exam performance between two instructional methods | Grades: problem-based vs. lecture |
| Marketing | Consumer preference ratings for two products | Rating (1–10): Product A vs. B |
| Ecology | Species abundance between two habitats | Count of species: Forest A vs. B |
| Neuroscience | Response latencies between patient and control groups | RT (ms): ADHD vs. neurotypical |
| Organisational Psychology | Job satisfaction between two departments | Survey score: Dept A vs. Dept B |
| Public Health | Physical activity levels between two communities | Steps/day: urban vs. rural |
3. The Mathematics Behind the Mann-Whitney U Test
3.1 The Rank-Sum Formulation (Wilcoxon)
Step 1 — Pool and rank all observations.
Combine all observations from both groups into a single ordered list. Assign ranks from 1 (smallest) to $N = n_1 + n_2$ (largest). For tied values, assign average ranks (mid-ranks).
Step 2 — Compute the rank sums $R_1$ and $R_2$, the sums of the ranks in Group 1 and Group 2 respectively.
Verification (always check): $R_1 + R_2 = \dfrac{N(N+1)}{2}$
3.2 The U Statistic (Mann-Whitney)
The U statistic counts the number of times a Group 1 observation exceeds a Group 2 observation across all possible pairwise comparisons:

$$U_1 = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\left[\mathbb{1}(X_i > Y_j) + \tfrac{1}{2}\,\mathbb{1}(X_i = Y_j)\right]$$

Equivalent formulas using rank sums (computationally simpler):

$$U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \qquad U_2 = R_2 - \frac{n_2(n_2+1)}{2}$$

Key verification: $U_1 + U_2 = n_1 n_2$

The test statistic is: $U = \min(U_1, U_2)$

For large-sample tests, it is more convenient to use $U_1$ directly (with the sign of $U_1 - n_1 n_2 / 2$ determining the direction of the difference).
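The rank-sum formulas above translate into a few lines of Python. This is an illustrative sketch (not DataStatPro's implementation); the rank sums used in the usage line are the ones that appear in the Section 11 worked example:

```python
def u_from_rank_sums(r1, r2, n1, n2):
    """U_i = R_i - n_i(n_i + 1)/2; U_1 counts pairs where a Group 1 value wins."""
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    assert u1 + u2 == n1 * n2, "rank sums and group sizes are inconsistent"
    return u1, u2, min(u1, u2)

# Rank sums from the Section 11 worked example (n1 = 7, n2 = 6)
print(u_from_rank_sums(30.5, 60.5, 7, 6))  # (2.5, 39.5, 2.5)
```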
3.3 The Null Distribution of U
Under $H_0$ (the two populations are identical), the U statistic has a known exact distribution for small samples. The null distribution is symmetric about:

$$\mu_U = \frac{n_1 n_2}{2}$$

With variance:

$$\sigma_U^2 = \frac{n_1 n_2 (N + 1)}{12}$$

Without ties: This formula is exact.
With ties: The variance must be corrected:

$$\sigma_{U,\text{ties}}^2 = \frac{n_1 n_2}{12}\left[(N + 1) - \frac{\sum_{k=1}^{g}(t_k^3 - t_k)}{N(N-1)}\right]$$

Where $g$ is the number of distinct tied groups and $t_k$ is the number of observations in the $k$-th tied group. The term $\sum_k (t_k^3 - t_k)/(N(N-1))$ is the tie correction factor.
3.4 The z-Approximation for Large Samples
For large samples (or generally when exact tables are unavailable), the standardised U statistic is approximately standard normal:

Without continuity correction:

$$z = \frac{U - \mu_U}{\sigma_U} = \frac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (N + 1)/12}}$$

With continuity correction (improves approximation for smaller samples):

$$z = \frac{U - \mu_U \pm 0.5}{\sigma_U}$$

Where $+0.5$ is used when $U < \mu_U$ and $-0.5$ when $U > \mu_U$ (the correction always shrinks $|U - \mu_U|$ by 0.5).
With tie correction: replace $\sigma_U$ with $\sigma_{U,\text{ties}}$ from Section 3.3.
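The z-computation with tie and continuity corrections can be sketched in pure Python (an illustrative sketch; the caller supplies the tie-group sizes $t_k$):

```python
import math

def mw_z(u, n1, n2, tie_sizes=(), continuity=True):
    """Standardised U with tie-corrected variance and optional continuity correction."""
    N = n1 + n2
    mu = n1 * n2 / 2
    tie_term = sum(t**3 - t for t in tie_sizes) / (N * (N - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((N + 1) - tie_term))
    dev = u - mu
    if continuity and dev != 0:
        dev += -0.5 if dev > 0 else 0.5  # shrink |U - mu| by 0.5
    return dev / sigma
```

Using the Section 11 worked example (`mw_z(2.5, 7, 6, tie_sizes=(2, 2, 3, 2), continuity=False)`) reproduces the $z \approx -2.67$ reported there.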
3.5 Computing the Two-Tailed p-Value
Using the exact distribution (small samples): $p = 2 \cdot P(U_{\text{null}} \le U_{\text{obs}})$, using the symmetry of the null distribution about $\mu_U$.
Using the z-approximation (large samples): $p = 2 \cdot \Phi(-|z|)$
Using one-tailed tests:
Upper tail ($H_1$: Group 1 tends to be larger): $p = P(Z \ge z) = 1 - \Phi(z)$
Lower tail ($H_1$: Group 1 tends to be smaller): $p = P(Z \le z) = \Phi(z)$
3.6 The Exact Computation via Pairwise Comparisons
The U statistic can also be computed directly by comparing all $n_1 n_2$ possible pairs of observations across the two groups. For each pair $(X_i, Y_j)$, score 1 if $X_i > Y_j$, $\tfrac{1}{2}$ if $X_i = Y_j$, and 0 if $X_i < Y_j$; $U_1$ is the sum of these scores.
This formulation makes the connection to the probability of superiority transparent:

$$\widehat{PS} = \frac{U_1}{n_1 n_2}$$

Under $H_0$: $\widehat{PS} = 0.5$ (since $E[U_1] = n_1 n_2 / 2$).
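The pairwise definition translates directly into code. This brute-force sketch is $O(n_1 n_2)$, which is fine at the sample sizes where it would be used:

```python
def u1_pairwise(g1, g2):
    """Direct definition: 1 point per pair where x > y, half a point per tie."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in g1 for y in g2)

def prob_superiority(g1, g2):
    """PS-hat = U1 / (n1 * n2)."""
    return u1_pairwise(g1, g2) / (len(g1) * len(g2))
```

On the Section 11 worked data, `u1_pairwise` matches the rank-sum formula exactly (including the half-points contributed by ties).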
3.7 Critical Values for Small Samples
For small samples, compare $U = \min(U_1, U_2)$ to the critical value $U_{\text{crit}}$. Reject $H_0$ (two-tailed, $\alpha = .05$) if $U \le U_{\text{crit}}$. Rows give $n_1$; the remaining columns give $U_{\text{crit}}$ for increasing $n_2$:

| $n_1$ | $U_{\text{crit}}$ for increasing $n_2$ → | | | | | | |
|---|---|---|---|---|---|---|---|
| 5 | 2 | 5 | 6 | 8 | 11 | 20 | 27 |
| 6 | 5 | 7 | 8 | 10 | 14 | 24 | 34 |
| 7 | 6 | 8 | 11 | 13 | 17 | 28 | 39 |
| 8 | 8 | 10 | 13 | 15 | 20 | 33 | 45 |
| 10 | 11 | 14 | 17 | 20 | 27 | 42 | 59 |
Reject $H_0$ if $U \le U_{\text{crit}}$. Values above are for $\alpha = .05$, two-tailed.
💡 DataStatPro computes exact p-values for all sample sizes using complete enumeration (for small samples) or the exact permutation distribution. The z-approximation is used only when exact computation is infeasible (very large samples).
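Complete enumeration can be sketched with `itertools.combinations`. This is an illustrative sketch, not DataStatPro's implementation, and it is only feasible while $\binom{N}{n_1}$ stays small:

```python
from itertools import combinations

def u1_of(g1, g2):
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in g1 for y in g2)

def exact_p_two_tailed(g1, g2):
    """Enumerate all C(N, n1) equally likely splits of the pooled data under H0;
    p = share of splits whose U deviates from n1*n2/2 at least as much as observed."""
    pooled = list(g1) + list(g2)
    n1, N = len(g1), len(g1) + len(g2)
    mu = n1 * len(g2) / 2
    obs = abs(u1_of(g1, g2) - mu)
    hits = total = 0
    for idx in combinations(range(N), n1):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(N) if i not in chosen]
        hits += abs(u1_of(a, b) - mu) >= obs - 1e-12
        total += 1
    return hits / total
```

Because it works on raw values rather than a tie-free rank table, this enumeration handles tied observations correctly without any separate correction.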
4. Assumptions of the Mann-Whitney U Test
4.1 Independence of Observations
All observations must be independent of each other, both within and across groups. No observation should influence or be influenced by any other.
Why it matters: Dependence between observations (e.g., measurements from the same participant appearing in both groups, or clustered observations) inflates the false positive rate and invalidates the null distribution of $U$.
How to check: Review the study design. Independence is a design property, not detectable from the data alone.
When violated: Use the Wilcoxon Signed-Rank Test for paired data. For nested or clustered data, use multilevel non-parametric methods.
4.2 Ordinal or Higher Scale of Measurement
The dependent variable must be at least ordinally scaled — observations must be meaningfully rankable. The Mann-Whitney U Test is appropriate for:
- Ordinal data (Likert scales, ranked responses).
- Continuous data that violate parametric assumptions.
- Discrete count data with many ties.
When violated: If observations cannot be meaningfully ordered (i.e., the variable is truly nominal with no natural ordering), use the chi-squared test or Fisher's exact test instead.
4.3 Two Independent Groups
The test requires exactly two groups composed of different (independent) participants. Groups may have unequal sizes ($n_1 \ne n_2$), and the test remains valid.
When violated: For three or more independent groups, use the Kruskal-Wallis Test. For two related (paired) groups, use the Wilcoxon Signed-Rank Test.
4.4 The Location-Shift Assumption (for Median Interpretation)
This is the most commonly misunderstood assumption. The Mann-Whitney U Test tests:
Without the location-shift assumption:
$H_0$: $P(X_1 > X_2) = P(X_1 < X_2)$ (stochastic equality) vs. $H_1$: $P(X_1 > X_2) \ne P(X_1 < X_2)$ (stochastic dominance)
This is always valid under the independence and ordinal assumptions alone.
With the location-shift assumption (same distribution shape, just shifted):
$H_0$: $\theta_1 = \theta_2$ (equal medians) vs. $H_1$: $\theta_1 \ne \theta_2$ (unequal medians)
The location-shift assumption requires that the two population distributions have the same shape and spread — only their location (median) differs:

$$F_1(x) = F_2(x - \Delta) \quad \text{for some shift } \Delta$$
Why this matters: If the distributions differ in shape or spread (not just location), then a significant Mann-Whitney result may reflect differences in variability or distribution shape rather than a difference in central tendency. In this case, the Brunner-Munzel Test (Section 10) is more appropriate.
How to check:
- Inspect boxplots for each group: similar spread and shape?
- Compare interquartile ranges (IQR) across groups.
- Run Levene's test on the raw data (though this tests variances, not full distribution shape).
4.5 No Assumption of Normality
Unlike the independent samples t-test, the Mann-Whitney U Test makes no normality assumption. This is its primary advantage and the most common reason for choosing it over the t-test.
4.6 Handling Ties
Ties (observations with identical values) reduce the power of the Mann-Whitney test slightly. The tie correction to the variance formula (Section 3.3) accounts for this. Excessive ties (e.g., more than 20% of observations tied) can reduce power substantially and should be noted in the methods section.
⚠️ When many ties are present, especially with small samples, the exact distribution of (rather than the normal approximation) should be used for p-values, as the normal approximation may be poor.
4.7 Assumption Summary Table
| Assumption | Required | How to Check | Remedy if Violated |
|---|---|---|---|
| Independence of observations | ✅ Yes | Study design review | Wilcoxon signed-rank (paired); multilevel methods (clustered) |
| Ordinal or higher scale | ✅ Yes | Measurement theory | Chi-squared (nominal outcome) |
| Two independent groups | ✅ Yes | Study design | Kruskal-Wallis ($k \ge 3$ groups); Wilcoxon signed-rank (paired) |
| Location-shift (for median interpretation) | ⚠️ Conditionally | Boxplots, IQR comparison | Brunner-Munzel test (unequal shapes) |
| Normality | ❌ Not required | — | — |
| Equal variances | ❌ Not required | — | — |
5. Variants and Related Tests
5.1 The Wilcoxon Rank-Sum Test
The Wilcoxon Rank-Sum Test and the Mann-Whitney U Test are two names for the same procedure. They differ only in which test statistic is reported:
- Wilcoxon: Reports $W = R_1$ (the rank sum of Group 1).
- Mann-Whitney: Reports $U_1$ and $U_2$ (or $U = \min(U_1, U_2)$).
The relationship: $U_1 = W - \dfrac{n_1(n_1 + 1)}{2}$
Both produce identical p-values. DataStatPro reports both $W$ and $U$ for completeness.
5.2 One-Tailed vs. Two-Tailed Tests
Two-tailed (default): Use when the direction of the difference is not predicted in advance.
One-tailed (upper): Use when specifically predicting that Group 1 tends to be larger. When the observed effect is in the predicted direction, the one-tailed p-value is the two-tailed p-value divided by 2.
One-tailed (lower): Use when specifically predicting Group 1 tends to be smaller.
⚠️ One-tailed tests must be justified and pre-registered before data collection. Switching to one-tailed after observing the data direction is p-hacking.
5.3 The Brunner-Munzel Test
The Brunner-Munzel Test (Brunner & Munzel, 2000) is a robust alternative to the Mann-Whitney test when the location-shift assumption may be violated — that is, when the two distributions may differ in shape and spread, not just location.
It tests the same null hypothesis: $H_0\!: P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2) = 0.5$
But uses separate within-group rankings to construct a test statistic that is valid regardless of whether the distribution shapes are equal.
The Brunner-Munzel statistic:

$$W_{BM} = \frac{n_1 n_2\,(\bar{R}_2 - \bar{R}_1)}{(n_1 + n_2)\sqrt{n_1 S_1^2 + n_2 S_2^2}}$$

Where $\bar{R}_i$ are the mean pooled ranks, the within-group ranks (each group ranked separately within itself) are subtracted from the pooled ranks, and $S_i^2$ are the resulting within-group variance estimates of the ranks.
Degrees of freedom are approximated using a Welch-Satterthwaite-type formula. DataStatPro reports the Brunner-Munzel test when the location-shift assumption appears violated.
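A minimal sketch of the statistic as defined above, in pure Python. Note that it degenerates when the two groups are completely separated (both rank variances become zero); `scipy.stats.brunnermunzel` is a tested implementation that also returns degrees of freedom and a p-value:

```python
import math

def midranks(values):
    """Mid-ranks (tied values share the average rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def brunner_munzel_stat(g1, g2):
    """W = n1*n2*(mean pooled rank of group 2 minus group 1) /
    ((n1+n2) * sqrt(n1*S1^2 + n2*S2^2))."""
    n1, n2 = len(g1), len(g2)
    R = midranks(list(g1) + list(g2))
    R1, R2 = R[:n1], R[n1:]
    w1, w2 = midranks(g1), midranks(g2)  # within-group ranks
    m1, m2 = sum(R1) / n1, sum(R2) / n2
    s1 = sum((R1[k] - w1[k] - m1 + (n1 + 1) / 2) ** 2 for k in range(n1)) / (n1 - 1)
    s2 = sum((R2[k] - w2[k] - m2 + (n2 + 1) / 2) ** 2 for k in range(n2)) / (n2 - 1)
    return n1 * n2 * (m2 - m1) / ((n1 + n2) * math.sqrt(n1 * s1 + n2 * s2))
```

A positive statistic here indicates that Group 2 tends to have the larger values.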
5.4 Permutation Test Alternative
The permutation test (randomisation test) for two independent groups:
- Compute the observed statistic $T_{\text{obs}}$ (e.g., $U$, or the difference in means/medians).
- Randomly reassign all observations to two groups of sizes $n_1$ and $n_2$.
- Recompute $T^*$ for each permutation.
- The p-value is the proportion of permutations where $|T^*| \ge |T_{\text{obs}}|$.
The permutation test is exact (no approximation needed), handles ties perfectly, and makes no distributional assumptions beyond exchangeability. DataStatPro offers this as an option under the Advanced Settings panel.
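The four steps above can be sketched as a Monte Carlo permutation test. This is an illustrative sketch (not DataStatPro's implementation); the `stat` argument can be the U statistic or any other comparison statistic:

```python
import random

def permutation_p(g1, g2, stat, n_perm=10_000, seed=42):
    """Two-sided Monte Carlo permutation p-value for any statistic stat(a, b)."""
    rng = random.Random(seed)
    pooled = list(g1) + list(g2)
    n1 = len(g1)
    t_obs = abs(stat(g1, g2))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(stat(pooled[:n1], pooled[n1:])) >= t_obs
    return (hits + 1) / (n_perm + 1)  # +1 keeps the estimate away from exactly 0

mean_diff = lambda a, b: sum(a) / len(a) - sum(b) / len(b)
```

Random sampling of permutations (rather than full enumeration) is what makes this practical for large samples; the `+1` adjustment is a common conservative convention for Monte Carlo p-values.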
6. Using the Mann-Whitney U Test Calculator Component
The Mann-Whitney U Test Calculator component in DataStatPro provides a comprehensive tool for conducting, diagnosing, visualising, and reporting the Mann-Whitney U Test and its alternatives.
Step-by-Step Guide
Step 1 — Select the Test
From the "Non-Parametric Tests" menu, select "Mann-Whitney U Test (Independent Samples)". DataStatPro will also display Wilcoxon Rank-Sum notation alongside for software compatibility.
Step 2 — Input Method
Choose how to provide the data:
- Raw data: Upload or paste each group's data. DataStatPro ranks observations automatically, applies tie corrections, and computes all statistics.
- Summary (ranks): Enter pre-computed rank sums $R_1$ and $R_2$ with group sizes.
- U statistic + group sizes: Enter $U$ (or $U_1$), $n_1$, and $n_2$ directly to compute p-values and effect sizes from a published result.
💡 Always use raw data when available. The rank-based computation is automatic, and raw data enable exact p-values, assumption checks, and visualisation of the full distribution.
Step 3 — Specify the Alternative Hypothesis
- Two-tailed (default): $H_1\!: P(X_1 > X_2) \ne P(X_1 < X_2)$
- Upper one-tailed: $H_1\!: P(X_1 > X_2) > P(X_1 < X_2)$ (pre-registered directional prediction)
- Lower one-tailed: $H_1\!: P(X_1 > X_2) < P(X_1 < X_2)$ (pre-registered directional prediction)
Step 4 — Select the p-Value Method
- Exact (recommended for small samples): Enumerates the exact null distribution of $U$.
- Normal approximation with tie correction (default for large samples): Uses the standardised z-statistic with the tie-corrected variance formula.
- Permutation test: Randomly samples from the permutation distribution (10,000 permutations by default; customisable up to 100,000).
Step 5 — Select the Continuity Correction
For the normal approximation:
- With continuity correction: Recommended for smaller samples; improves accuracy of the normal approximation to the discrete U distribution.
- Without continuity correction: Appropriate for large samples.
Step 6 — Select Effect Size Options
- ✅ Rank-biserial correlation $r$ (primary, always computed).
- ✅ Probability of superiority ($PS = U_1 / (n_1 n_2)$).
- ✅ Common Language Effect Size (equivalent to $PS$).
- ✅ 95% CI for $r$ via Fisher $z$-transformation.
- ✅ Hodges-Lehmann estimator (median of all pairwise differences as a robust location estimate).
Step 7 — Select Display Options
- ✅ $U_1$, $U_2$, $U$, $W$, $z$-statistic, and p-value.
- ✅ Rank sums and mean ranks per group.
- ✅ Medians, IQR, and descriptive statistics per group.
- ✅ Raincloud plot (half violin + boxplot + raw data points) per group.
- ✅ Ranked dot plot (showing all ranks with group membership colour-coded).
- ✅ Effect size visualisation ($r$ on a number line with Cohen's benchmarks).
- ✅ Tie summary table (if ties present).
- ✅ APA 7th edition results paragraph (auto-generated).
- ✅ Comparison with independent samples t-test (when raw data available).
Step 8 — Run the Analysis
Click "Run Mann-Whitney U Test". DataStatPro will:
- Pool and rank all observations with tie correction.
- Compute $R_1$, $R_2$, $U_1$, $U_2$.
- Compute the exact or approximate p-value.
- Compute $r$, $PS$, and their 95% CIs.
- Compute the Hodges-Lehmann median difference estimate.
- Generate all selected visualisations.
- Generate an APA-compliant results paragraph.
7. Exact vs. Approximate Methods
7.1 When to Use Exact Methods
The exact Mann-Whitney distribution enumerates all possible arrangements of observations into two groups of sizes $n_1$ and $n_2$ and computes the proportion of these that yield a $U$ statistic at least as extreme as the observed value. This is computationally intensive but exact.
Use exact methods when:
- Samples are small per group (exact tables available; the approximation may be poor).
- There are many ties (approximation deteriorates with heavy ties).
- Precision is critical (clinical trials, regulatory submissions).
The total number of equally likely arrangements under $H_0$:

$$\binom{N}{n_1} = \binom{n_1 + n_2}{n_1}$$

For $n_1 = n_2 = 10$: $\binom{20}{10} = 184{,}756$ arrangements — computationally feasible for exact enumeration.
7.2 The Normal Approximation — When Is It Adequate?
The normal approximation is adequate when:
- Both $n_1$ and $n_2$ are reasonably large — a common rule of thumb is $n_i \ge 20$ (exact methods preferable when feasible).
- Ties are not excessive (less than 25% of observations tied).
- The continuity correction is applied for smaller samples.
Accuracy of the approximation: The approximation error for the p-value is of order $O(1/\sqrt{N})$, meaning it improves as sample size increases.
7.3 The Permutation Approach
The permutation approach avoids the normal approximation entirely by directly estimating the null distribution from the data. It is:
- Exact in principle (but Monte Carlo sampling introduces small simulation error).
- Robust to all distributional assumptions beyond exchangeability.
- Practical for large samples where full enumeration is infeasible.
With $B$ permutations, the Monte Carlo standard error of the p-value estimate is $\sqrt{\hat{p}(1 - \hat{p})/B}$ — adequate for most purposes. DataStatPro uses $B = 10{,}000$ by default (configurable up to 100,000).
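The Monte Carlo error formula in code (trivial, but handy when choosing $B$):

```python
import math

def mc_se(p_hat, n_perm):
    """Monte Carlo standard error of a permutation p-value: sqrt(p(1-p)/B)."""
    return math.sqrt(p_hat * (1 - p_hat) / n_perm)

print(mc_se(0.05, 10_000))  # about 0.0022
```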
8. Effect Sizes for the Mann-Whitney U Test
8.1 The Rank-Biserial Correlation ($r$)
The rank-biserial correlation is the standard effect size for the Mann-Whitney U Test. It directly measures the probability of superiority on a standardised scale from $-1$ to $+1$.
Formula from U statistics:

$$r = \frac{U_1 - U_2}{n_1 n_2}$$

Equivalently:

$$r = \frac{2U_1}{n_1 n_2} - 1 = 2 \cdot PS - 1$$

Or, when $U_1$ is the statistic for Group 1:

$$r = 1 - \frac{2U_2}{n_1 n_2}$$

Formula from mean ranks:

$$r = \frac{2(\bar{R}_1 - \bar{R}_2)}{N}$$

Where $\bar{R}_i$ is the mean rank of group $i$.
Formula from the z-statistic:

$$r \approx \frac{z}{\sqrt{N}} \quad \text{(approximate)}$$

A more precise formula:

$$r = z\,\sqrt{\frac{N + 1}{3\,n_1 n_2}}$$
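The U-statistic and mean-rank forms in code (an illustrative sketch; the usage values reproduce the Section 11 worked example, where $U_B = 39.5$ with $n_1 = 7$, $n_2 = 6$):

```python
def rank_biserial(u1, n1, n2):
    """r = 2*U1/(n1*n2) - 1 = (U1 - U2)/(n1*n2); sign follows Group 1's advantage."""
    return 2 * u1 / (n1 * n2) - 1

def rank_biserial_from_mean_ranks(mr1, mr2, N):
    """Equivalent form: r = 2*(mean rank of Group 1 - mean rank of Group 2)/N."""
    return 2 * (mr1 - mr2) / N
```

Both forms give the same value on the same data, which is a useful consistency check when working from published summary statistics.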
8.2 Interpreting the Rank-Biserial Correlation
| $r$ | $PS$ | Verbal Interpretation |
|---|---|---|
| 0.00 | 0.50 | No tendency; equally likely to exceed |
| 0.10 | 0.55 | Very small effect; Group 1 slightly higher |
| 0.20 | 0.60 | Small effect |
| 0.25 | 0.625 | Small-to-medium effect |
| 0.30 | 0.65 | Medium effect (Cohen's convention) |
| 0.40 | 0.70 | Medium-large effect |
| 0.50 | 0.75 | Large effect (Cohen's convention) |
| 0.70 | 0.85 | Very large effect |
| 1.00 | 1.00 | Perfect — every Group 1 obs. exceeds every Group 2 obs. |

Cohen's (1988) benchmarks for $|r|$ (same as Pearson $r$):

| Label | $|r|$ |
| :---- | :---- |
| Small | 0.10 |
| Medium | 0.30 |
| Large | 0.50 |
⚠️ Cohen's benchmarks were not specifically developed for the rank-biserial correlation. Always contextualise effect sizes within your research domain and compare to typical effect sizes from meta-analyses in the same field.
8.3 The Probability of Superiority ()
The probability of superiority is the most intuitive interpretation of the Mann-Whitney effect size:

$$PS = P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2) = \frac{U_1}{n_1 n_2}$$

Relationship to $r$: $PS = \dfrac{r + 1}{2}$ and $r = 2 \cdot PS - 1$
Interpretation: If $PS = 0.75$, then in 75% of all possible pairings of one observation from Group 1 with one from Group 2, the Group 1 observation is larger.
Confidence interval for $PS$ (using the Fisher $z$-transformation of $r$):
95% CI for $z_r$: $z_r \pm 1.96 \cdot SE_z$
Back-transform: $r_{\text{bound}} = \tanh(z_{\text{bound}})$; then $PS_{\text{bound}} = \dfrac{r_{\text{bound}} + 1}{2}$.
8.4 The Hodges-Lehmann Estimator
The Hodges-Lehmann estimator $\hat\Delta$ is a robust, rank-based point estimate of the location shift between the two groups. It is the median of all possible pairwise differences:

$$\hat\Delta = \operatorname{median}\{X_i - Y_j : i = 1, \dots, n_1;\ j = 1, \dots, n_2\}$$

- There are $n_1 n_2$ pairwise differences.
- Under the location-shift assumption, $\hat\Delta$ estimates the median difference $\Delta$.
- $\hat\Delta \approx 0$ under $H_0$ (when the populations are identical).
Confidence interval for $\hat\Delta$: Using the exact Mann-Whitney distribution to determine which order statistics of the pairwise differences form the CI bounds.
The Hodges-Lehmann estimator is reported by DataStatPro alongside the Mann-Whitney U test as a meaningful, robust measure of the magnitude of the location shift in the original measurement units.
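The estimator is a near one-liner in Python (illustrative sketch; $O(n_1 n_2)$ memory, which is fine at tutorial scale):

```python
from statistics import median

def hodges_lehmann(g1, g2):
    """HL shift estimate: median of all n1*n2 pairwise differences x - y."""
    return median(x - y for x in g1 for y in g2)
```

On the Section 11 worked data it returns $-3.0$: Technique A scores are typically 3 points lower than Technique B scores.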
8.5 Comparing Effect Sizes Across Studies
When comparing Mann-Whitney effect sizes to t-test effect sizes from other studies, use the following conversions:
$r$ to Cohen's $d$ (approximate, under normality and equal group sizes):

$$d \approx \frac{2r}{\sqrt{1 - r^2}}$$

Or more precisely, using the point-biserial relationship with group proportions $p = n_1/N$ and $q = n_2/N$:

$$d = \frac{r}{\sqrt{pq\,(1 - r^2)}}$$

Cohen's $d$ to $r$:

$$r \approx \frac{d}{\sqrt{d^2 + 4}} \quad \text{(for equal group sizes)}$$
⚠️ These conversions assume normality for the -to- direction and may not hold for non-normal data. Use conversions with caution and clearly state the assumption.
9. Confidence Intervals
9.1 Confidence Interval for the Rank-Biserial Correlation
The 95% CI for $r$ uses the Fisher $z$-transformation:

$$z_r = \operatorname{artanh}(r) = \frac{1}{2}\ln\frac{1 + r}{1 - r}$$

Standard error (approximate):

$$SE_z \approx \frac{1}{\sqrt{N - 3}}$$

A more precise standard error accounting for group sizes:

$$SE_z = \sqrt{\frac{N + 1}{3\,n_1 n_2}}$$

95% CI in $z$ space: $z_r \pm 1.96 \cdot SE_z$
Back-transform to the $r$ scale: $r_{\text{bound}} = \tanh(z_{\text{bound}})$
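The transform-and-back-transform procedure in code (a sketch using the group-size-aware SE; it reproduces the CI reported in the Section 11 worked example):

```python
import math

def r_rb_ci(r, n1, n2, z_crit=1.96):
    """Fisher-z CI for the rank-biserial r, with SE = sqrt((N+1)/(3*n1*n2))."""
    zr = math.atanh(r)
    se = math.sqrt((n1 + n2 + 1) / (3 * n1 * n2))
    return math.tanh(zr - z_crit * se), math.tanh(zr + z_crit * se)
```

For $r = 0.881$ with $n_1 = 7$, $n_2 = 6$ this gives approximately $[0.62, 0.97]$.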
9.2 Confidence Interval for the Hodges-Lehmann Estimator
The CI for $\hat\Delta$ is derived from the exact Mann-Whitney null distribution. The procedure:
- Order all $n_1 n_2$ pairwise differences: $D_{(1)} \le D_{(2)} \le \cdots \le D_{(n_1 n_2)}$.
- Find the critical value $U_{\text{crit}}$ from the Mann-Whitney distribution table ($\alpha = .05$, two-tailed).
- The 95% CI for $\Delta$ is: $\left[D_{(U_{\text{crit}} + 1)},\ D_{(n_1 n_2 - U_{\text{crit}})}\right]$
DataStatPro computes this exactly for small samples and uses a normal approximation for larger datasets.
9.3 Confidence Interval for the Probability of Superiority
After computing the CI for $r$ (Section 9.1): $PS_{\text{bound}} = \dfrac{r_{\text{bound}} + 1}{2}$
Example: If $r = 0.50$ with 95% CI $[0.20, 0.72]$, then $PS = 0.75$ with 95% CI $[0.60, 0.86]$.
10. Advanced Topics
10.1 The Mann-Whitney Test as a Test of Stochastic Equality
Without the location-shift assumption, the Mann-Whitney test tests the general null:

$$H_0\!: P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2) = 0.5$$

This null is called stochastic equality. It does not require equal medians, equal shapes, or any distributional assumption. The alternative:

$$H_1\!: P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2) \ne 0.5$$
This is the most general and defensible interpretation of the Mann-Whitney U Test.
Practical implication: If Group 1 has a higher median but larger spread, and Group 2 has a lower median but smaller spread, the distributions may overlap substantially and $PS$ may be close to 0.5 even though the medians differ — the Mann-Whitney test (correctly) may not detect a significant difference.
10.2 Asymptotic Relative Efficiency Across Distributions
The ARE of the Mann-Whitney test relative to the t-test depends on the true underlying distribution:
| Distribution | ARE (Mann-Whitney vs. t-test) |
|---|---|
| Normal | $3/\pi \approx 0.955$ |
| Uniform | $1.000$ |
| Logistic | $\pi^2/9 \approx 1.097$ |
| Double exponential (Laplace) | $1.500$ |
| Cauchy (heavy-tailed) | $\infty$ |
| Contaminated normal | Often $\gg 1$ |
For heavy-tailed distributions — common in psychology (reaction times), medicine (survival times), and economics (income) — the Mann-Whitney test is substantially more powerful than the t-test.
10.3 Sample Size and Power for the Mann-Whitney Test
Power of the Mann-Whitney test under a location-shift alternative can be approximated using the ARE relationship:

$$n_{\text{MW}} \approx \frac{n_t}{0.955} \quad \text{(for normal data)}$$

For non-normal data, the required $n$ for the Mann-Whitney test is computed using the non-central distribution of $U$ (or equivalently, the non-central normal distribution for large samples). Solving for $n$ per group (equal group sizes, $n_1 = n_2 = n$) yields the values below.
Required $n$ per group (80% power, $\alpha = .05$, two-tailed, equal group sizes):

| Effect size | Label | $n$ per group (Mann-Whitney) | $n$ per group (t-test) |
| :---------- | :---- | :--------------------------- | :--------------------- |
| | Small | 414 | 394 |
| | Small | 99 | 97 |
| | Medium | 44 | 43 |
| | Medium | 21 | 20 |
| | Large | 16 | 15 |
| | Large | 10 | 9 |
Note: Mann-Whitney requires approximately 5% more observations than the t-test under normality, consistent with the ARE of $3/\pi \approx 0.955$.
10.4 Rank-Based Post-Hoc Comparisons After Kruskal-Wallis
When the Kruskal-Wallis test (the non-parametric ANOVA equivalent) is significant, pairwise Mann-Whitney U tests are conducted as post-hoc comparisons with appropriate FWER correction (Bonferroni, Holm, or Dunn's test). The effect size for each comparison is the rank-biserial $r$.
Each pairwise comparison uses only the two groups being compared (not the full ranked dataset from the Kruskal-Wallis test), though using the full-dataset ranks is also acceptable and provides a consistent ranking across comparisons.
10.5 Comparing the Mann-Whitney and Kolmogorov-Smirnov Tests
Both the Mann-Whitney and the two-sample Kolmogorov-Smirnov (KS) test are non-parametric tests for comparing two independent groups. Key differences:
| Property | Mann-Whitney U | Kolmogorov-Smirnov |
|---|---|---|
| Tests | Stochastic dominance / location shift | Any distributional difference |
| Sensitive to | Location differences | Location, spread, and shape differences |
| Power (location shifts) | ✅ Higher | ❌ Lower |
| Power (spread/shape differences) | ❌ Lower | ✅ Higher |
| Effect size | $r$, $PS$ | No standard effect size |
| Handles ties | With correction | Poorly (assumes continuous) |
Use Mann-Whitney when you are specifically interested in whether one group tends to have higher values (location shift). Use Kolmogorov-Smirnov when you want a general test of whether the two distributions differ in any way.
10.6 Bootstrap Confidence Intervals for
For small samples or when the Fisher -approximation may be imprecise, DataStatPro offers bootstrap CIs for :
- Draw $B$ bootstrap samples (resample with replacement separately from Group 1 and Group 2).
- Compute $r^*$ for each bootstrap sample.
- The 95% bootstrap CI is the 2.5th and 97.5th percentiles of the bootstrap $r^*$ values.
The bias-corrected and accelerated (BCa) bootstrap CI is preferred over the simple percentile method for small samples.
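The plain percentile variant can be sketched in pure Python (illustrative only; for BCa intervals, `scipy.stats.bootstrap` with `method='BCa'` is a tested implementation):

```python
import random

def u1_of(a, b):
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def bootstrap_ci(g1, g2, n_boot=2000, seed=7):
    """Percentile bootstrap CI for rank-biserial r (resample within each group)."""
    rng = random.Random(seed)
    rrb = lambda a, b: 2 * u1_of(a, b) / (len(a) * len(b)) - 1
    rs = sorted(
        rrb([rng.choice(g1) for _ in g1], [rng.choice(g2) for _ in g2])
        for _ in range(n_boot)
    )
    return rs[int(0.025 * n_boot)], rs[int(0.975 * n_boot) - 1]
```

Resampling within each group (rather than from the pooled data) preserves the two-independent-samples design in every bootstrap replicate.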
10.7 Reporting the Mann-Whitney U Test According to APA 7th Edition
Minimum required reporting elements:
- Test statistic: $U$ = [value]
- p-value: $p$ = [value] (exact or approximation — specify which)
- Effect size with 95% CI: $r$ = [value] [95% CI: LB, UB]
- Medians and IQR (or full range) per group
- Whether exact or asymptotic p-value was used
- Tie correction: whether applied and number of ties
- Which alternative hypothesis was tested (two-tailed or directional)
APA template:
"A Mann-Whitney U test revealed [a significant / no significant] difference in [DV] between [Group 1] (Mdn = [value], IQR = [range]) and [Group 2] (Mdn = [value], IQR = [range]), $U$ = [value], $z$ = [value], $p$ = [value], $r$ = [value] [95% CI: LB, UB], indicating a [small / medium / large] effect."
11. Worked Examples
Example 1: Small Sample with Exact p-Value — Pain Relief Ratings
A physiotherapist compares pain relief ratings (0 = no relief, 10 = complete relief) for two manual therapy techniques. Normality is violated (Shapiro-Wilk $p < .05$).
Data:
| Technique A ($n_A = 7$) | Technique B ($n_B = 6$) |
|---|---|
| 3, 4, 5, 5, 6, 7, 8 | 7, 8, 8, 9, 9, 10 |
Step 1 — Pool and rank all observations:
Sorted values: 3, 4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 10
| Obs | Group | Rank |
|---|---|---|
| 3 | A | 1.0 |
| 4 | A | 2.0 |
| 5 | A | 3.5 |
| 5 | A | 3.5 |
| 6 | A | 5.0 |
| 7 | A | 6.5 |
| 7 | B | 6.5 |
| 8 | A | 9.0 |
| 8 | B | 9.0 |
| 8 | B | 9.0 |
| 9 | B | 11.5 |
| 9 | B | 11.5 |
| 10 | B | 13.0 |
Step 2 — Rank sums:

$$R_A = 1 + 2 + 3.5 + 3.5 + 5 + 6.5 + 9 = 30.5, \qquad R_B = 6.5 + 9 + 9 + 11.5 + 11.5 + 13 = 60.5$$

Check: $R_A + R_B = 91 = \dfrac{13 \times 14}{2}$ ✅
Step 3 — U statistics:

$$U_A = R_A - \frac{7 \times 8}{2} = 30.5 - 28 = 2.5, \qquad U_B = R_B - \frac{6 \times 7}{2} = 60.5 - 21 = 39.5$$

Check: $U_A + U_B = 42 = n_A n_B$ ✅
Step 4 — Tie correction and z-statistic:
Ties: value 5 ($t = 2$), 7 ($t = 2$), 8 ($t = 3$), 9 ($t = 2$), so $\sum(t^3 - t) = 6 + 6 + 24 + 6 = 42$:

$$\sigma_U^2 = \frac{7 \times 6}{12}\left[14 - \frac{42}{13 \times 12}\right] = 3.5 \times 13.731 = 48.058, \qquad \sigma_U = 6.93$$

Using $U_B = 39.5$ to reflect direction (positive $z$ means Technique B tends to score higher):

$$z = \frac{39.5 - 21}{6.93} = 2.67$$

Two-tailed: $p = 2\,\Phi(-2.67) \approx .0076$
Exact p-value (DataStatPro): computed by complete enumeration of the tied null distribution and reported alongside the approximation.
Step 5 — Effect size:

$$r = \frac{U_B - U_A}{n_A n_B} = \frac{39.5 - 2.5}{42} = 0.88$$

(Positive $r$ here: Group B tends to have higher values)
95% CI for $r$ (Fisher $z$): $z_r = \operatorname{artanh}(0.88) = 1.38$, $SE_z = \sqrt{\dfrac{14}{3 \times 42}} = 0.33$
95% CI: $\tanh(1.38 \pm 1.96 \times 0.33) = [0.62, 0.97]$
Hodges-Lehmann estimator $\hat\Delta$:
All $7 \times 6 = 42$ pairwise differences (Technique A $-$ Technique B) are computed and sorted.
Median of these 42 differences: $\hat\Delta = -3.0$ (Technique A scores are typically 3 points lower)
Summary:
| Statistic | Value | Interpretation |
|---|---|---|
| Technique A Median (IQR) | 5.0 (3.5–7.5) | Lower ratings |
| Technique B Median (IQR) | 8.5 (7.75–9.25) | Higher ratings |
| $U$, $p$ (exact) | $U = 2.5$, $p < .01$ | Significant at $\alpha = .05$ |
| $r$ | 0.88 | Very large effect |
| 95% CI for $r$ | [0.62, 0.97] | Excludes 0 |
| Hodges-Lehmann $\hat\Delta$ | $-3.0$ points | A is 3 points lower |
APA write-up: "A Mann-Whitney U test (exact) was conducted to compare pain relief ratings between Technique A ($n = 7$, Mdn $= 5.0$, IQR $= 3.5$–$7.5$) and Technique B ($n = 6$, Mdn $= 8.5$, IQR $= 7.75$–$9.25$). Technique B produced significantly higher ratings, $U = 2.5$, $p < .01$ (exact), $r = .88$ [95% CI: 0.62, 0.97], indicating a very large effect. The Hodges-Lehmann estimator indicated a median difference of 3.0 points (Technique B higher)."
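The entire worked example can be checked with a short script (a pure-Python sketch mirroring Steps 1 through 3 and 5; `scipy.stats.mannwhitneyu` would report the matching U and p-value from the same raw data):

```python
def midranks(values):
    """Mid-ranks for the pooled sample (tied values share the average rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

A = [3, 4, 5, 5, 6, 7, 8]   # Technique A
B = [7, 8, 8, 9, 9, 10]     # Technique B
ranks = midranks(A + B)
rA, rB = sum(ranks[:len(A)]), sum(ranks[len(A):])
uA = rA - len(A) * (len(A) + 1) / 2
uB = rB - len(B) * (len(B) + 1) / 2
r_rb = (uB - uA) / (len(A) * len(B))
print(rA, rB, uA, uB, round(r_rb, 3))  # 30.5 60.5 2.5 39.5 0.881
```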
Example 2: Larger Sample with Normal Approximation — Reaction Times
A cognitive psychologist compares simple reaction times (ms) between neurotypical adults and adults with ADHD. Data are positively skewed (significant Shapiro-Wilk test for the ADHD group).
Summary statistics (pre-computed): $n_1$ (neurotypical), $n_2$ (ADHD), $N = n_1 + n_2$.
Neurotypical: Mdn = [value] ms, IQR = [LB]–[UB] ms. ADHD: Mdn = [value] ms, IQR = [LB]–[UB] ms.
Step 1 — U statistics:
Compute the rank sums $R_1$ (neurotypical) and $R_2$ (ADHD) from the pooled ranks, then $U_1 = R_1 - \frac{n_1(n_1+1)}{2}$ and $U_2 = R_2 - \frac{n_2(n_2+1)}{2}$.
Check: $U_1 + U_2 = n_1 n_2$ ✅
Step 2 — z-statistic (no ties assumed for this example):
(The sign of $z$ depends on which group's $U$ is entered; read the direction of the effect from the mean ranks — here the ADHD group has the higher mean rank, i.e., longer reaction times.)
Two-tailed: $p = 2[1 - \Phi(\lvert z\rvert)]$
Step 3 — Effect size:
Because ranks are assigned to reaction times, a higher rank means a longer (slower) RT. The ADHD group tends to have higher RTs, hence higher ranks and the larger rank sum. For clear directionality, always specify which group is Group 1 and which is Group 2.
Let Group 1 = ADHD and Group 2 = Neurotypical. Then $U_1$ counts the pairings in which the ADHD participant has the higher RT, $U_2$ counts the pairings the neurotypical participant "wins", and
$r_{rb} = \frac{2U_1}{n_1 n_2} - 1 = 2 \times .685 - 1 = .37$
Interpretation: the ADHD group tends to have higher RT values than the neurotypical group; the positive $r_{rb}$ indicates that neurotypical observations tend to be smaller (faster).
$r_{rb} = .37$ — a medium-to-large effect (above Cohen's medium benchmark of 0.30).
Interpretation: In only 31.5% of all pairings does an ADHD participant have a faster RT than a neurotypical participant — i.e., neurotypical participants are faster in 68.5% of pairings.
95% CI for $r_{rb}$ (Fisher z-transform with $SE = \sqrt{(n_1 + n_2 + 1)/(3 n_1 n_2)}$):
95% CI: [0.05, 0.62]
Summary:
| Statistic | Value |
|---|---|
| Neurotypical: Mdn (IQR) | [value] ms ([LB]–[UB]) |
| ADHD: Mdn (IQR) | [value] ms ([LB]–[UB]) |
| $U$, $z$, $p$ (two-tailed, approx.) | $p < .05$ |
| $r_{rb}$ | .37 (medium-large) |
| 95% CI for $r_{rb}$ | [0.05, 0.62] |
| $PS$ | .685 (ADHD slower than neurotypical in 68.5% of pairings) |
APA write-up: "A Mann-Whitney U test was conducted to compare reaction times between neurotypical adults ($n$ = [value], Mdn = [value] ms, IQR = [LB]–[UB]) and adults with ADHD ($n$ = [value], Mdn = [value] ms, IQR = [LB]–[UB]). Adults with ADHD showed significantly longer reaction times, $U$ = [value], $z$ = [value], $p$ = [value], $r_{rb} = .37$ [95% CI: 0.05, 0.62], indicating a medium-to-large effect. In 68.5% of all possible pairings, a neurotypical participant was faster than an ADHD participant."
Example 3: Interpreting a Non-Significant Result
A researcher tests whether customer satisfaction ratings differ between two service delivery formats (in-person vs. online; 5-point scale).
Given: $U$, $z$, $p > .05$, and a small $r_{rb}$.
95% CI for $r_{rb}$: using the Fisher z-transform with $SE = \sqrt{(n_1 + n_2 + 1)/(3 n_1 n_2)}$, as in Examples 1 and 2, the interval spans (approximately) from practically zero to a small negative effect.
Equivalence test: With equivalence bounds set at a trivially small effect, the 90% CI for $r_{rb}$ has its lower bound comfortably inside the bounds but its upper bound only just inside — equivalence is borderline. Increase the sample size for a more powerful equivalence test.
Interpretation: The test is not significant. The effect size is trivially small ($r_{rb}$ near zero, with a 95% CI spanning from practically zero to a small negative effect). The CI is relatively wide given the sample size. This is genuinely null-like, but a formal equivalence test with a larger sample per group would provide more definitive evidence.
APA write-up: "A Mann-Whitney U test found no significant difference in satisfaction ratings between in-person ($n$ = [value]) and online ($n$ = [value]) formats, $U$ = [value], $z$ = [value], $p$ = [value], $r_{rb}$ = [value] [95% CI: LB, UB]. The small effect size and wide confidence interval suggest that any true difference, if present, is negligibly small. An equivalence test would be required to formally establish the absence of a meaningful difference."
12. Common Mistakes and How to Avoid Them
Mistake 1: Using the Mann-Whitney Test for Paired Data
Problem: Applying the Mann-Whitney U Test to pre-post or matched-pairs data as if the groups were independent. This ignores the within-pair correlation, produces an inflated error term, and substantially reduces power. It also violates the independence assumption.
Solution: For paired or matched data, use the Wilcoxon Signed-Rank Test — the non-parametric equivalent of the paired t-test. Verify whether data represent independent groups (different participants) or related measurements (same participants or matched pairs) before choosing the test.
Mistake 2: Interpreting the Mann-Whitney Test as Always Testing Medians
Problem: Claiming that "the Mann-Whitney test compares medians" without acknowledging that this interpretation requires the location-shift assumption (equal distribution shapes). When distributions differ in shape or spread, the test may be significant even when medians are equal, or non-significant when medians differ considerably.
Solution: State the null hypothesis precisely: "The Mann-Whitney U test tests whether the probability that a randomly selected observation from Group 1 exceeds a randomly selected observation from Group 2 is 0.5." Only invoke median interpretation when the location-shift assumption is plausible and checked.
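This null quantity — the probability that a randomly selected Group 1 observation exceeds a Group 2 observation, with ties counted as one half — can be estimated directly by comparing every pair. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X == Y) over all len(x) * len(y) pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    wins = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return (wins + 0.5 * ties) / (x.size * y.size)

# With the Example 1 data this equals U_A / (n1 * n2) = 2.5 / 42
a = [3, 4, 5, 5, 6, 7, 8]
b = [7, 8, 8, 9, 9, 10]
print(round(prob_superiority(a, b), 4))  # 0.0595 — A rarely exceeds B
```

A value of 0.5 corresponds to the null hypothesis; the further the estimate is from 0.5, the stronger the stochastic dominance of one group.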
Mistake 3: Reporting Only U and p Without an Effect Size
Problem: Reporting only $U$ and $p$, without the rank-biserial correlation or probability of superiority. Like the t-test, the Mann-Whitney test's p-value is influenced by sample size — a significant result says nothing about the magnitude of the effect.
Solution: Always report $r_{rb}$ (and/or $PS$) with its 95% CI alongside $U$ and $p$. DataStatPro computes these automatically. Significant results with small $r_{rb}$ should be interpreted cautiously; non-significant results with moderate $r_{rb}$ may indicate insufficient power.
Mistake 4: Defaulting to Mann-Whitney When t-Test Assumptions Are Met
Problem: Using the Mann-Whitney test "to be safe" when the independent t-test's assumptions are fully satisfied (normal data, no severe outliers, equal variances). The Mann-Whitney test sacrifices approximately 5% power under normality — a real but small cost.
Solution: Run the Shapiro-Wilk test and inspect Q-Q plots. If normality holds (and sample sizes are adequate), use the independent t-test (Welch's version) for slightly greater power and the ability to report intuitive mean differences. Reserve Mann-Whitney for genuinely non-normal or ordinal data.
Mistake 5: Not Applying Tie Correction
Problem: Using the uncorrected variance formula when ties are present. This overestimates $\sigma_U$, producing a z-statistic that is too small and a p-value that is too large — making the test conservative.
Solution: Always apply the tie correction to the variance when ties are present. DataStatPro applies this automatically. Report the number and proportion of tied observations in the methods section.
Mistake 6: Reporting Means Instead of Medians for Mann-Whitney
Problem: Reporting group means alongside a Mann-Whitney test result. Since the test is rank-based and makes no assumptions about means, reporting means as the primary descriptive statistic is inconsistent with the test's rationale.
Solution: Report medians and IQRs (interquartile ranges) as the primary descriptive statistics alongside Mann-Whitney results. Means can be additionally reported as secondary information if useful, clearly labelled as supplementary.
Mistake 7: Confusing $U_1$ and $U_2$, Leading to Sign Errors in $r_{rb}$
Problem: Confusing which $U$ statistic belongs to which group. With $r_{rb} = \frac{2U_1}{n_1 n_2} - 1$, a positive value means Group 1 tends to be larger; negative means Group 2 tends to be larger. Swapping $U_1$ and $U_2$ reverses the sign.
Solution: Always clearly label which group is Group 1 and which is Group 2 before computing $r_{rb}$. State the direction of the effect in the results: "Group X tended to have higher values than Group Y."
Mistake 8: Using Mann-Whitney for More Than Two Groups
Problem: Running multiple pairwise Mann-Whitney tests across three or more groups without a prior omnibus test and without FWER correction. This inflates the familywise Type I error rate.
Solution: For three or more independent groups, run the Kruskal-Wallis Test as the omnibus test first. Only if significant, conduct pairwise Mann-Whitney or Dunn's tests with Bonferroni or Holm FWER correction.
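The recommended workflow can be sketched as follows, using hypothetical data and a hand-rolled Holm adjustment (`statsmodels.stats.multitest.multipletests` provides the same correction; the data and group labels here are illustrative only):

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu

def holm_adjust(pvals):
    """Holm step-down adjustment of a list of raw p-values."""
    m = len(pvals)
    order = np.argsort(pvals)
    adjusted = [0.0] * m
    running_max = 0.0
    for pos, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - pos) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical ordinal ratings for three independent groups
rng = np.random.default_rng(42)
groups = {
    "A": rng.integers(1, 6, 20),
    "B": rng.integers(2, 7, 20),
    "C": rng.integers(1, 6, 20),
}

h, p_omnibus = kruskal(*groups.values())  # omnibus test first
if p_omnibus < 0.05:
    pairs = list(combinations(groups, 2))
    raw = [mannwhitneyu(groups[g1], groups[g2]).pvalue for g1, g2 in pairs]
    for (g1, g2), p_adj in zip(pairs, holm_adjust(raw)):
        print(f"{g1} vs {g2}: Holm-adjusted p = {p_adj:.4f}")
else:
    print(f"Omnibus not significant (p = {p_omnibus:.3f}); no pairwise tests")
```

Holm is uniformly more powerful than plain Bonferroni while still controlling the familywise error rate, which is why it is usually the better default.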
Mistake 9: Ignoring the Directionality of a Significant Result
Problem: Reporting a significant Mann-Whitney result without stating which group had higher rankings. A significant p-value does not tell you direction — you must inspect the rank sums or mean ranks to determine which group tends to be higher.
Solution: Always report mean ranks or medians alongside $U$ and $p$, so the direction is unambiguous. Check: if $U_1 > \frac{n_1 n_2}{2}$, Group 1 tends to have higher values; if $U_1 < \frac{n_1 n_2}{2}$, Group 2 does.
Mistake 10: Treating a Non-Significant Mann-Whitney as Proof of Equal Distributions
Problem: Concluding from $p > .05$ that the two populations are identical (or that the medians are equal). A non-significant result indicates insufficient evidence against $H_0$ — not evidence for $H_0$.
Solution: Report $r_{rb}$ and its 95% CI. A non-significant result with a wide CI indicates low power, not a true null effect. Conduct an equivalence test with pre-specified bounds to positively establish that the effect is negligibly small.
13. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| $R_1 + R_2 \ne \frac{N(N+1)}{2}$ | Calculation error in $R_1$ or $R_2$ | Verify the ranks first; recompute $U$ from the corrected rank sums |
| Tied values assigned different ranks | Ranking error; ties not averaged | Recheck averaging of tied ranks; verify all observations are ranked |
| Exact p noticeably different from approximate p | Small $n$ or many ties | Use the exact p for small samples; report exact if available |
| $U_1 > n_1 n_2$ or $U_2 > n_1 n_2$ | Computational error | Check that $U_1 + U_2$ sum to $n_1 n_2$; verify direction of subtraction |
| Very large effect but non-significant p | $z$ severely underestimated due to ties without correction | Apply tie correction; use exact p-value |
| Mann-Whitney significant but t-test not | Non-normality causing t-test to lose power; or different hypotheses | Trust Mann-Whitney for non-normal data; they test different things |
| t-test significant but Mann-Whitney not | Outlier driving mean difference but not systematically affecting ranks | Investigate outlier; report both with explanation |
| Many ties | Coarse measurement scale (e.g., 5-point Likert) | Use tie-corrected variance; report tie proportion; consider ordinal regression |
| Hodges-Lehmann $\hat\Delta \approx 0$ but p significant | Distributions differ in shape/spread but not location shift | Report $PS$ rather than $\hat\Delta$; location-shift assumption may be violated |
| Brunner-Munzel and Mann-Whitney give different conclusions | Distribution shapes differ; location-shift violated | Use Brunner-Munzel as more appropriate; report both |
| Exact p-value computation takes too long | $n_1$, $n_2$ too large for enumeration | Switch to permutation test or normal approximation with tie correction |
| Negative $U$ value | Formula error ($U$ cannot be negative) | Re-examine formula; always $0 \le U \le n_1 n_2$ |
| Bootstrap CI very wide | Small samples | Report wide CI as reflecting genuine uncertainty; collect more data |
| $r$ from z-formula differs from $r_{rb}$ from U-formula | Formula approximation discrepancy | Use $r_{rb}$ as primary; the z-based formula is only approximate |
14. Quick Reference Cheat Sheet
Core Formulas
| Formula | Description |
|---|---|
| $R_i = \sum (\text{ranks in group } i)$ | Rank sum for group $i$ |
| $R_1 + R_2 = \frac{N(N+1)}{2}$ | Verification check |
| $U_1 = R_1 - \frac{n_1(n_1+1)}{2}$ | $U$ for Group 1 |
| $U_2 = R_2 - \frac{n_2(n_2+1)}{2}$ | $U$ for Group 2 |
| $U_1 + U_2 = n_1 n_2$ | Verification check |
| $U = \min(U_1, U_2)$ | Test statistic |
| $\mu_U = \frac{n_1 n_2}{2}$ | Mean of $U$ under $H_0$ |
| $\sigma_U^2 = \frac{n_1 n_2 (N+1)}{12}$ | Variance of $U$ (no ties) |
| $\sigma_U^2 = \frac{n_1 n_2}{12}\left[(N+1) - \frac{\sum_j (t_j^3 - t_j)}{N(N-1)}\right]$ | Variance (with tie correction) |
| $z = \frac{U - \mu_U}{\sigma_U}$ | Standardised statistic |
| $p = 2[1 - \Phi(\lvert z\rvert)]$ | Two-tailed p-value |
| $PS = \frac{U_1}{n_1 n_2}$ | Probability of superiority |
Effect Size Formulas
| Formula | Description |
|---|---|
| $r_{rb} = \frac{2U_1}{n_1 n_2} - 1$ | Rank-biserial correlation (primary) |
| $r_{rb} = 1 - \frac{2U}{n_1 n_2}$ | From $U = \min(U_1, U_2)$ (magnitude) |
| $r_{rb} = \frac{2(\bar{R}_1 - \bar{R}_2)}{N}$ | From mean ranks |
| $r \approx \frac{z}{\sqrt{N}}$ | Approximate from $z$ |
| $PS = \frac{U_1}{n_1 n_2}$ | Probability Group 1 > Group 2 |
| $PS = \frac{r_{rb} + 1}{2}$ | Convert $r_{rb}$ to PS |
| $r_{rb} = 2\,PS - 1$ | Convert PS to $r_{rb}$ |
| $\hat\Delta = \operatorname{median}\{x_{1i} - x_{2j}\}$ | Hodges-Lehmann estimator |
| $z_r = \frac{1}{2}\ln\frac{1 + r_{rb}}{1 - r_{rb}}$ | Fisher $z$-transform of $r_{rb}$ |
| $SE = \sqrt{\frac{n_1 + n_2 + 1}{3 n_1 n_2}}$ | SE for CI of $z_r$ |
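The effect-size and CI formulas above translate directly into code. A minimal sketch, checked against Worked Example 1 ($U_B = 39.5$, $n_1 = 7$, $n_2 = 6$); the SE used is the large-sample approximation from the table:

```python
import math

def rank_biserial(u_group1, n1, n2):
    """Rank-biserial correlation; u_group1 counts pairs won by Group 1."""
    return 2 * u_group1 / (n1 * n2) - 1

def rank_biserial_ci(r, n1, n2, z_crit=1.96):
    """Approximate CI: Fisher z-transform with SE = sqrt((n1+n2+1)/(3*n1*n2))."""
    zr = math.atanh(r)
    se = math.sqrt((n1 + n2 + 1) / (3 * n1 * n2))
    return math.tanh(zr - z_crit * se), math.tanh(zr + z_crit * se)

# Worked Example 1, with Group 1 = Technique B
r = rank_biserial(39.5, 7, 6)
lo, hi = rank_biserial_ci(r, 7, 6)
ps = (r + 1) / 2  # convert r_rb to the probability of superiority
print(round(r, 2), round(lo, 2), round(hi, 2), round(ps, 2))  # 0.88 0.62 0.97 0.94
```

Transforming to the Fisher-z scale before adding the margin of error and transforming back keeps the interval inside $[-1, 1]$, which a naive symmetric CI on $r_{rb}$ would not guarantee.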
Conversions Between Effect Sizes
| From | To | Formula |
|---|---|---|
| Cohen's $d$ | $PS$ | $PS = \Phi(d/\sqrt{2})$ |
| $PS$ | Cohen's $d$ | $d = \sqrt{2}\,\Phi^{-1}(PS)$ (equal groups) |
| $r_{rb}$ | Cohen's $d$ | $d \approx \frac{2\,r_{rb}}{\sqrt{1 - r_{rb}^2}}$ (approx) |
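Under the model of two normal populations with equal variances, Cohen's $d$ and the probability of superiority are linked by $PS = \Phi(d/\sqrt{2})$. A sketch of the conversions using only the standard library (function names are illustrative):

```python
from statistics import NormalDist

_nd = NormalDist()

def d_to_ps(d):
    """Cohen's d -> probability of superiority (two normals, equal variances)."""
    return _nd.cdf(d / 2 ** 0.5)

def ps_to_d(ps):
    """Probability of superiority -> Cohen's d (same model)."""
    return 2 ** 0.5 * _nd.inv_cdf(ps)

def ps_to_rrb(ps):
    """Probability of superiority -> rank-biserial correlation."""
    return 2 * ps - 1

ps = d_to_ps(0.5)  # a "medium" Cohen's d
print(round(ps, 3), round(ps_to_rrb(ps), 3))  # 0.638 0.276
```

Note these conversions are model-based: with skewed or heteroscedastic data, the mapping between $d$ and the rank-based effect sizes is only approximate.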
Cohen's Benchmarks for $r_{rb}$
| Label | $\lvert r_{rb}\rvert$ | $PS$ |
|---|---|---|
| Negligible | .00–.10 | .50–.55 |
| Small | .10–.30 | .55–.65 |
| Medium | .30–.50 | .65–.75 |
| Large | .50–.70 | .75–.85 |
| Very large | ≥ .70 | ≥ .85 |
Required $n$ per Group (80% Power, $\alpha = .05$, Two-Tailed)
| $r_{rb}$ | Label | Mann-Whitney | t-Test (if normal) |
|---|---|---|---|
| 0.10 | Small | 414 | 394 |
| 0.20 | Small | 99 | 97 |
| 0.30 | Medium | 44 | 43 |
| 0.44 | Medium | 21 | 20 |
| 0.50 | Large | 16 | 15 |
| 0.64 | Large | 10 | 9 |
Decision Guide: Mann-Whitney vs. Alternatives
| Situation | Test |
|---|---|
| Two independent groups, non-normal or ordinal | Mann-Whitney U ✅ |
| Two independent groups, normal, equal variances | Independent t-test (or Welch's) |
| Two independent groups, unequal shapes/spreads | Brunner-Munzel test |
| Two paired/related groups, non-normal | Wilcoxon Signed-Rank test |
| Three or more independent groups, non-normal | Kruskal-Wallis test |
| Two independent groups, very small samples | Fisher's exact (binary), exact Mann-Whitney |
| General distributional difference (not just location) | Kolmogorov-Smirnov test |
Tie Correction Reference
| Proportion of Ties | Impact on $\sigma_U$ | Recommended p-Value Method |
|---|---|---|
| Low | Negligible | Standard or tie-corrected approximation |
| Moderate | Moderate reduction | Tie-corrected approximation |
| High | Substantial reduction | Exact p-value; permutation test |
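Counting the tie proportion and applying the tie-corrected variance can be sketched as follows (illustrated with the Example 1 data, in which 9 of the 13 pooled observations are tied):

```python
import numpy as np

def tie_summary(pooled):
    """Return (proportion of tied observations, sum of t^3 - t over tie groups)."""
    _, counts = np.unique(np.asarray(pooled), return_counts=True)
    tied = counts[counts > 1]
    return tied.sum() / len(pooled), np.sum(counts**3 - counts)

def var_u(n1, n2, tie_term=0.0):
    """Variance of U under H0, with optional tie correction."""
    n = n1 + n2
    return n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))

pooled = [3, 4, 5, 5, 6, 7, 8, 7, 8, 8, 9, 9, 10]
prop, tie_term = tie_summary(pooled)
print(round(prop, 2), tie_term)                      # 0.69 42
print(var_u(7, 6), round(var_u(7, 6, tie_term), 2))  # 49.0 48.06
```

Values with a count of 1 contribute $t^3 - t = 0$, so summing over all unique values (as above) gives the same tie term as summing over tie groups only.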
APA Reporting Template
"A Mann-Whitney U test [with / without] continuity correction [with exact / asymptotic p-value] was conducted to compare [DV] between [Group 1] (Mdn = [value], IQR = [LB]–[UB], n = [value]) and [Group 2] (Mdn = [value], IQR = [LB]–[UB], n = [value]). [Group X] had significantly [higher / lower] [DV] than [Group Y], U = [value], z = [value], p = [value], r_rb = [value] [95% CI: LB, UB], indicating a [small / medium / large] effect. In [PS%] of all possible pairings, a [Group X] observation exceeded a [Group Y] observation."
Assumption Checks Checklist
| Check | Method | Action if Violated |
|---|---|---|
| Independence | Study design review | Wilcoxon SR (paired); multilevel methods (clustered) |
| Ordinal scale | Measurement review | Chi-squared (nominal); ordinal regression |
| Two independent groups | Design review | Kruskal-Wallis (3+ groups); Wilcoxon SR (paired) |
| Location-shift assumption | Boxplots; IQR comparison | Brunner-Munzel test |
| Excessive ties | Count ties | Exact p-value; permutation test; tie-corrected variance |
Mann-Whitney Reporting Checklist
| Item | Required |
|---|---|
| $U$ statistic | ✅ Always |
| $z$-statistic (if approximation used) | ✅ When applicable |
| Whether exact or asymptotic p-value | ✅ Always |
| Exact p-value | ✅ Preferred for small samples per group |
| $r_{rb}$ with 95% CI | ✅ Always |
| Probability of superiority ($PS$) | ✅ Recommended |
| Medians and IQRs per group | ✅ Always |
| Sample sizes per group | ✅ Always |
| Direction of the effect | ✅ Always |
| Tie correction applied | ✅ When ties present |
| Number/proportion of tied observations | ✅ When ties substantial |
| Hodges-Lehmann estimator | ✅ Recommended |
| Whether two-tailed or directional | ✅ Always |
| Normality violation justification | ✅ When used instead of t-test |
This tutorial provides a comprehensive foundation for understanding, conducting, and reporting the Mann-Whitney U Test within the DataStatPro application. For further reading, consult Mann & Whitney's (1947) original paper "On a test of whether one of two random variables is stochastically larger than the other" (Annals of Mathematical Statistics), Wilcoxon's (1945) foundational paper, Hollander, Wolfe & Chicken's "Nonparametric Statistical Methods" (3rd ed., 2014) for rigorous theoretical coverage, Conover's "Practical Nonparametric Statistics" (3rd ed., 1999) for applied guidance, and Brunner & Munzel's (2000) "The Nonparametric Behrens-Fisher Problem" (Biometrical Journal) for the robust alternative when distribution shapes differ. For the rank-biserial correlation as an effect size, see Kerby (2014) in the Comprehensive Psychology journal. For feature requests or support, contact the DataStatPro team.