Mann-Whitney U Test: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of non-parametric inference all the way through the complete theory, mathematics, assumptions, effect sizes, interpretation, reporting, and practical usage of the Mann-Whitney U Test within the DataStatPro application. Whether you are encountering the Mann-Whitney U Test for the first time or seeking a deeper understanding of rank-based methods for comparing two independent groups, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is the Mann-Whitney U Test?
- The Mathematics Behind the Mann-Whitney U Test
- Assumptions of the Mann-Whitney U Test
- Variants and Related Tests
- Using the Mann-Whitney U Test Calculator Component
- Exact vs. Approximate Methods
- Effect Sizes for the Mann-Whitney U Test
- Confidence Intervals
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into the Mann-Whitney U Test, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.
1.1 Parametric vs. Non-Parametric Inference
Parametric tests (e.g., the independent samples t-test) make explicit assumptions about the shape of the population distribution — typically normality — and estimate specific population parameters (e.g., $\mu$, $\sigma^2$). Their validity depends on those distributional assumptions being met.
Non-parametric tests (also called distribution-free tests) do not assume a specific functional form for the population distribution. The Mann-Whitney U Test is non-parametric: it does not assume normality. Instead of operating on raw scores, it operates on the ranks of those scores.
⚠️ "Distribution-free" is not synonymous with "assumption-free." The Mann-Whitney U Test has its own set of assumptions, reviewed in Section 4. Violating these assumptions can invalidate its conclusions just as surely as violating normality invalidates the t-test.
1.2 Ordinal Data and Ranks
Ordinal data convey the relative ordering of observations but not the magnitude of differences between them. Examples include:
- Likert scale responses (1 = Strongly Disagree, 5 = Strongly Agree).
- Pain ratings on a 0–10 visual analogue scale.
- Customer satisfaction rankings.
- Academic grade categories (Distinction, Merit, Pass, Fail).
Ranking is the process of replacing each raw score with its position in an ordered list. For $N$ total observations:
- The smallest observation receives rank 1.
- The largest receives rank $N$.
- Tied observations receive the average of the ranks they would have occupied (mid-ranks).
Example: Raw scores $(4, 7, 7, 9)$ become ranks $(1, 2.5, 2.5, 4)$ (the two tied 7s share ranks 2 and 3, so each receives $(2 + 3)/2 = 2.5$).
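The mid-rank procedure can be sketched in a few lines of pure Python. This is an illustrative sketch, not DataStatPro's internal code; SciPy's `scipy.stats.rankdata` with `method='average'` performs the same computation:

```python
def midranks(values):
    """Replace each value with its 1-based rank; tied values share the mid-rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j to the end of the run of tied values
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        i = j + 1
    return ranks

print(midranks([4, 7, 7, 9]))  # the two tied 7s share ranks 2 and 3 -> 2.5 each
```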
1.3 The Concept of Stochastic Dominance
The Mann-Whitney U Test is fundamentally a test of stochastic dominance. Group 1 stochastically dominates Group 2 if a randomly chosen observation from Group 1 tends to be larger than a randomly chosen observation from Group 2:

$$P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2) > 0.5$$

The test statistic $U_1$ directly estimates this probability (via $U_1 / (n_1 n_2)$), making the Mann-Whitney U Test one of the most intuitively interpretable inferential tests in statistics.
1.4 The Independent Samples t-Test and Its Limitations
The independent samples t-test is the parametric alternative to the Mann-Whitney U Test. It is appropriate when:
- Data are continuous and at least interval-scaled.
- Both groups' populations are approximately normally distributed.
- Population variances are equal (or Welch's correction is applied).
When normality is markedly violated (especially with small samples) or when data are ordinal, the t-test is inappropriate and the Mann-Whitney U Test is preferred.
1.5 Statistical Power and Asymptotic Relative Efficiency
The Asymptotic Relative Efficiency (ARE) compares the power of two tests as $N \to \infty$. The ARE of the Mann-Whitney U Test relative to the t-test is:

$$\mathrm{ARE} = \frac{3}{\pi} \approx 0.955 \quad \text{(under normality)}$$
This means:
- When data are normally distributed, you need approximately 5% more observations with the Mann-Whitney test to achieve the same power as the t-test.
- When data are non-normal (heavy-tailed, skewed, or contaminated), the Mann-Whitney test can be substantially more powerful than the t-test.
This near-equivalence under normality makes the Mann-Whitney test a safe default when normality is uncertain.
1.6 The Probability of Superiority
The probability of superiority (PS) — equivalent to the Common Language Effect Size — is the probability that a randomly selected observation from Group 1 exceeds a randomly selected observation from Group 2:

$$PS = P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2)$$

- $PS = 0.5$: No tendency for one group to exceed the other (null hypothesis).
- $PS = 0.75$: 75% of the time, a random Group 1 observation exceeds a random Group 2 observation — a large, practically meaningful difference.
- $PS = 1.0$: Every Group 1 observation exceeds every Group 2 observation (perfect separation).
This interpretation is central to understanding the Mann-Whitney U Test's effect size.
1.7 Hypothesis Testing Framework
Every Mann-Whitney U Test operates within the standard hypothesis testing framework:
Step 1 — State the hypotheses (see Section 4 for precise formulations).
Step 2 — Choose $\alpha$ — the significance level (conventionally $\alpha = .05$).
Step 3 — Compute the test statistic $U$ (or its standardised form $z$).
Step 4 — Compute the p-value — the probability of observing a statistic at least as extreme as the one obtained, assuming $H_0$.
Step 5 — Make a decision — reject $H_0$ if $p \le \alpha$.
Step 6 — Compute and report the effect size — the rank-biserial correlation $r$ with its 95% confidence interval.
2. What is the Mann-Whitney U Test?
2.1 The Core Idea
The Mann-Whitney U Test (also called the Wilcoxon Rank-Sum Test, or Wilcoxon-Mann-Whitney Test) is a non-parametric test for comparing two independent groups. Rather than comparing group means directly (as the t-test does), it assesses whether observations from one group tend to be systematically larger or smaller than observations from the other group.
The test was independently developed by:
- Frank Wilcoxon (1945) — proposed the Rank-Sum Test.
- Henry Mann and Donald Whitney (1947) — developed the equivalent statistic and derived its exact null distribution.
The two formulations are mathematically equivalent: they produce the same p-value.
2.2 Research Questions the Mann-Whitney U Test Answers
The Mann-Whitney U Test answers:
"Do observations from Group 1 tend to have systematically higher (or lower) values than observations from Group 2?"
More formally, under the location-shift assumption (see Section 4):
"Is the median of Group 1 equal to the median of Group 2?"
2.3 When to Use the Mann-Whitney U Test
The Mann-Whitney U Test is the appropriate choice when:
| Condition | Details |
|---|---|
| Two independent groups | Different participants in each group |
| Non-normal distribution | Normality violated; Shapiro-Wilk significant |
| Ordinal dependent variable | Likert scales, pain ratings, satisfaction scores |
| Small sample size | Small $n$ per group; the CLT may not apply |
| Presence of outliers | Extreme values distort the t-test |
| Skewed distributions | Reaction times, income, response latencies |
| Bounded scales | Ceiling or floor effects distorting normality |
2.4 The Mann-Whitney U Test vs. the Independent Samples t-Test
| Property | Independent t-Test | Mann-Whitney U Test |
|---|---|---|
| Tests | Mean difference | Distributional dominance / median shift |
| Scale | Interval / Ratio | Ordinal or higher |
| Assumes normality | ✅ Yes | ❌ No |
| Sensitive to outliers | ✅ High | ❌ Low (rank-based) |
| Power (when normal) | Slightly higher | $\approx 95.5\%$ of the t-test |
| Power (when non-normal) | Can be lower | Can exceed t-test |
| Effect size | Cohen's $d$ | Rank-biserial $r$ |
| Parametric | ✅ Yes | ❌ No |
2.5 Real-World Applications
| Field | Application | Example |
|---|---|---|
| Clinical Psychology | Symptom severity between two treatment arms | PTSD symptom score: EMDR vs. CBT |
| Medicine | Recovery time between two surgical techniques | Days to discharge: laparoscopic vs. open |
| Education | Exam performance between two instructional methods | Grades: problem-based vs. lecture |
| Marketing | Consumer preference ratings for two products | Rating (1–10): Product A vs. B |
| Ecology | Species abundance between two habitats | Count of species: Forest A vs. B |
| Neuroscience | Response latencies between patient and control groups | RT (ms): ADHD vs. neurotypical |
| Organisational Psychology | Job satisfaction between two departments | Survey score: Dept A vs. Dept B |
| Public Health | Physical activity levels between two communities | Steps/day: urban vs. rural |
3. The Mathematics Behind the Mann-Whitney U Test
3.1 The Rank-Sum Formulation (Wilcoxon)
Step 1 — Pool and rank all observations.
Combine all observations from both groups into a single ordered list. Assign ranks from 1 (smallest) to $N = n_1 + n_2$ (largest). For tied values, assign average ranks (mid-ranks).
Step 2 — Compute the rank sums $R_1$ and $R_2$, the sums of the ranks in Group 1 and Group 2 respectively.
Verification (always check): $R_1 + R_2 = \dfrac{N(N+1)}{2}$
3.2 The U Statistic (Mann-Whitney)
The U statistic counts the number of times a Group 1 observation exceeds a Group 2 observation across all possible pairwise comparisons:

$$U_1 = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\left[\mathbb{1}(X_i > Y_j) + \tfrac{1}{2}\,\mathbb{1}(X_i = Y_j)\right]$$

Equivalent formulas using rank sums (computationally simpler):

$$U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \qquad U_2 = R_2 - \frac{n_2(n_2+1)}{2}$$

Key verification: $U_1 + U_2 = n_1 n_2$

The test statistic is: $U = \min(U_1, U_2)$

For large-sample tests, it is more convenient to use $U_1$ directly (with the sign of $U_1 - n_1 n_2 / 2$ determining the direction of the difference).
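The rank-sum formulas above translate into a few lines of Python. This is an illustrative sketch (not DataStatPro's implementation); the rank sums used in the usage line are the ones that appear in the Section 11 worked example:

```python
def u_from_rank_sums(r1, r2, n1, n2):
    """U_i = R_i - n_i(n_i + 1)/2; U_1 counts pairs where a Group 1 value wins."""
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    assert u1 + u2 == n1 * n2, "rank sums and group sizes are inconsistent"
    return u1, u2, min(u1, u2)

# Rank sums from the Section 11 worked example (n1 = 7, n2 = 6)
print(u_from_rank_sums(30.5, 60.5, 7, 6))  # (2.5, 39.5, 2.5)
```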
3.3 The Null Distribution of U
Under $H_0$ (the two populations are identical), the U statistic has a known exact distribution for small samples. The null distribution is symmetric about:

$$\mu_U = \frac{n_1 n_2}{2}$$

With variance:

$$\sigma_U^2 = \frac{n_1 n_2 (N + 1)}{12}$$

Without ties: This formula is exact.
With ties: The variance must be corrected:

$$\sigma_{U,\text{ties}}^2 = \frac{n_1 n_2}{12}\left[(N + 1) - \frac{\sum_{k=1}^{g}(t_k^3 - t_k)}{N(N-1)}\right]$$

Where $g$ is the number of distinct tied groups and $t_k$ is the number of observations in the $k$-th tied group. The term $\sum_k (t_k^3 - t_k)/(N(N-1))$ is the tie correction factor.
3.4 The z-Approximation for Large Samples
For large samples (or generally when exact tables are unavailable), the standardised U statistic is approximately standard normal:

Without continuity correction:

$$z = \frac{U - \mu_U}{\sigma_U} = \frac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (N + 1)/12}}$$

With continuity correction (improves approximation for smaller samples):

$$z = \frac{U - \mu_U \pm 0.5}{\sigma_U}$$

Where $+0.5$ is used when $U < \mu_U$ and $-0.5$ when $U > \mu_U$ (the correction always shrinks $|U - \mu_U|$ by 0.5).
With tie correction: replace $\sigma_U$ with $\sigma_{U,\text{ties}}$ from Section 3.3.
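The z-computation with tie and continuity corrections can be sketched in pure Python (an illustrative sketch; the caller supplies the tie-group sizes $t_k$):

```python
import math

def mw_z(u, n1, n2, tie_sizes=(), continuity=True):
    """Standardised U with tie-corrected variance and optional continuity correction."""
    N = n1 + n2
    mu = n1 * n2 / 2
    tie_term = sum(t**3 - t for t in tie_sizes) / (N * (N - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((N + 1) - tie_term))
    dev = u - mu
    if continuity and dev != 0:
        dev += -0.5 if dev > 0 else 0.5  # shrink |U - mu| by 0.5
    return dev / sigma
```

Using the Section 11 worked example (`mw_z(2.5, 7, 6, tie_sizes=(2, 2, 3, 2), continuity=False)`) reproduces the $z \approx -2.67$ reported there.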
3.5 Computing the Two-Tailed p-Value
Using the exact distribution (small samples): $p = 2 \cdot P(U_{\text{null}} \le U_{\text{obs}})$, using the symmetry of the null distribution about $\mu_U$.
Using the z-approximation (large samples): $p = 2 \cdot \Phi(-|z|)$
Using one-tailed tests:
Upper tail ($H_1$: Group 1 tends to be larger): $p = P(Z \ge z) = 1 - \Phi(z)$
Lower tail ($H_1$: Group 1 tends to be smaller): $p = P(Z \le z) = \Phi(z)$
3.6 The Exact Computation via Pairwise Comparisons
The U statistic can also be computed directly by comparing all $n_1 n_2$ possible pairs of observations across the two groups. For each pair $(X_i, Y_j)$, score 1 if $X_i > Y_j$, $\tfrac{1}{2}$ if $X_i = Y_j$, and 0 if $X_i < Y_j$; $U_1$ is the sum of these scores.
This formulation makes the connection to the probability of superiority transparent:

$$\widehat{PS} = \frac{U_1}{n_1 n_2}$$

Under $H_0$: $\widehat{PS} = 0.5$ (since $E[U_1] = n_1 n_2 / 2$).
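The pairwise definition translates directly into code. This brute-force sketch is $O(n_1 n_2)$, which is fine at the sample sizes where it would be used:

```python
def u1_pairwise(g1, g2):
    """Direct definition: 1 point per pair where x > y, half a point per tie."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in g1 for y in g2)

def prob_superiority(g1, g2):
    """PS-hat = U1 / (n1 * n2)."""
    return u1_pairwise(g1, g2) / (len(g1) * len(g2))
```

On the Section 11 worked data, `u1_pairwise` matches the rank-sum formula exactly (including the half-points contributed by ties).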
3.7 Critical Values for Small Samples
For small samples, compare $U = \min(U_1, U_2)$ to the critical value $U_{\text{crit}}$. Reject $H_0$ (two-tailed, $\alpha = .05$) if $U \le U_{\text{crit}}$. Rows give $n_1$; the remaining columns give $U_{\text{crit}}$ for increasing $n_2$:

| $n_1$ | $U_{\text{crit}}$ for increasing $n_2$ → | | | | | | |
|---|---|---|---|---|---|---|---|
| 5 | 2 | 5 | 6 | 8 | 11 | 20 | 27 |
| 6 | 5 | 7 | 8 | 10 | 14 | 24 | 34 |
| 7 | 6 | 8 | 11 | 13 | 17 | 28 | 39 |
| 8 | 8 | 10 | 13 | 15 | 20 | 33 | 45 |
| 10 | 11 | 14 | 17 | 20 | 27 | 42 | 59 |
Reject $H_0$ if $U \le U_{\text{crit}}$. Values above are for $\alpha = .05$, two-tailed.
💡 DataStatPro computes exact p-values for all sample sizes using complete enumeration (for small samples) or the exact permutation distribution. The z-approximation is used only when exact computation is infeasible (very large samples).
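Complete enumeration can be sketched with `itertools.combinations`. This is an illustrative sketch, not DataStatPro's implementation, and it is only feasible while $\binom{N}{n_1}$ stays small:

```python
from itertools import combinations

def u1_of(g1, g2):
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in g1 for y in g2)

def exact_p_two_tailed(g1, g2):
    """Enumerate all C(N, n1) equally likely splits of the pooled data under H0;
    p = share of splits whose U deviates from n1*n2/2 at least as much as observed."""
    pooled = list(g1) + list(g2)
    n1, N = len(g1), len(g1) + len(g2)
    mu = n1 * len(g2) / 2
    obs = abs(u1_of(g1, g2) - mu)
    hits = total = 0
    for idx in combinations(range(N), n1):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(N) if i not in chosen]
        hits += abs(u1_of(a, b) - mu) >= obs - 1e-12
        total += 1
    return hits / total
```

Because it works on raw values rather than a tie-free rank table, this enumeration handles tied observations correctly without any separate correction.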
4. Assumptions of the Mann-Whitney U Test
4.1 Independence of Observations
All observations must be independent of each other, both within and across groups. No observation should influence or be influenced by any other.
Why it matters: Dependence between observations (e.g., measurements from the same participant appearing in both groups, or clustered observations) inflates the false positive rate and invalidates the null distribution of $U$.
How to check: Review the study design. Independence is a design property, not detectable from the data alone.
When violated: Use the Wilcoxon Signed-Rank Test for paired data. For nested or clustered data, use multilevel non-parametric methods.
4.2 Ordinal or Higher Scale of Measurement
The dependent variable must be at least ordinally scaled — observations must be meaningfully rankable. The Mann-Whitney U Test is appropriate for:
- Ordinal data (Likert scales, ranked responses).
- Continuous data that violate parametric assumptions.
- Discrete count data with many ties.
When violated: If observations cannot be meaningfully ordered (i.e., the variable is truly nominal with no natural ordering), use the chi-squared test or Fisher's exact test instead.
4.3 Two Independent Groups
The test requires exactly two groups composed of different (independent) participants. Groups may have unequal sizes ($n_1 \ne n_2$), and the test remains valid.
When violated: For three or more independent groups, use the Kruskal-Wallis Test. For two related (paired) groups, use the Wilcoxon Signed-Rank Test.
4.4 The Location-Shift Assumption (for Median Interpretation)
This is the most commonly misunderstood assumption. The Mann-Whitney U Test tests:
Without the location-shift assumption:
$H_0$: $P(X_1 > X_2) = P(X_1 < X_2)$ (stochastic equality) vs. $H_1$: $P(X_1 > X_2) \ne P(X_1 < X_2)$ (stochastic dominance)
This is always valid under the independence and ordinal assumptions alone.
With the location-shift assumption (same distribution shape, just shifted):
$H_0$: $\theta_1 = \theta_2$ (equal medians) vs. $H_1$: $\theta_1 \ne \theta_2$ (unequal medians)
The location-shift assumption requires that the two population distributions have the same shape and spread — only their location (median) differs:

$$F_1(x) = F_2(x - \Delta) \quad \text{for some shift } \Delta$$
Why this matters: If the distributions differ in shape or spread (not just location), then a significant Mann-Whitney result may reflect differences in variability or distribution shape rather than a difference in central tendency. In this case, the Brunner-Munzel Test (Section 10) is more appropriate.
How to check:
- Inspect boxplots for each group: similar spread and shape?
- Compare interquartile ranges (IQR) across groups.
- Run Levene's test on the raw data (though this tests variances, not full distribution shape).
4.5 No Assumption of Normality
Unlike the independent samples t-test, the Mann-Whitney U Test makes no normality assumption. This is its primary advantage and the most common reason for choosing it over the t-test.
4.6 Handling Ties
Ties (observations with identical values) reduce the power of the Mann-Whitney test slightly. The tie correction to the variance formula (Section 3.3) accounts for this. Excessive ties (e.g., more than 20% of observations tied) can reduce power substantially and should be noted in the methods section.
⚠️ When many ties are present, especially with small samples, the exact distribution of (rather than the normal approximation) should be used for p-values, as the normal approximation may be poor.
4.7 Assumption Summary Table
| Assumption | Required | How to Check | Remedy if Violated |
|---|---|---|---|
| Independence of observations | ✅ Yes | Study design review | Wilcoxon signed-rank (paired); multilevel methods (clustered) |
| Ordinal or higher scale | ✅ Yes | Measurement theory | Chi-squared (nominal outcome) |
| Two independent groups | ✅ Yes | Study design | Kruskal-Wallis ($k \ge 3$ groups); Wilcoxon signed-rank (paired) |
| Location-shift (for median interpretation) | ⚠️ Conditionally | Boxplots, IQR comparison | Brunner-Munzel test (unequal shapes) |
| Normality | ❌ Not required | — | — |
| Equal variances | ❌ Not required | — | — |
5. Variants and Related Tests
5.1 The Wilcoxon Rank-Sum Test
The Wilcoxon Rank-Sum Test and the Mann-Whitney U Test are two names for the same procedure. They differ only in which test statistic is reported:
- Wilcoxon: Reports $W = R_1$ (the rank sum of Group 1).
- Mann-Whitney: Reports $U_1$ and $U_2$ (or $U = \min(U_1, U_2)$).
The relationship: $U_1 = W - \dfrac{n_1(n_1 + 1)}{2}$
Both produce identical p-values. DataStatPro reports both $W$ and $U$ for completeness.
5.2 One-Tailed vs. Two-Tailed Tests
Two-tailed (default): Use when the direction of the difference is not predicted in advance.
One-tailed (upper): Use when specifically predicting that Group 1 tends to be larger. When the observed effect is in the predicted direction, the one-tailed p-value is the two-tailed p-value divided by 2.
One-tailed (lower): Use when specifically predicting Group 1 tends to be smaller.
⚠️ One-tailed tests must be justified and pre-registered before data collection. Switching to one-tailed after observing the data direction is p-hacking.
5.3 The Brunner-Munzel Test
The Brunner-Munzel Test (Brunner & Munzel, 2000) is a robust alternative to the Mann-Whitney test when the location-shift assumption may be violated — that is, when the two distributions may differ in shape and spread, not just location.
It tests the same null hypothesis: $H_0\!: P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2) = 0.5$
But uses separate within-group rankings to construct a test statistic that is valid regardless of whether the distribution shapes are equal.
The Brunner-Munzel statistic:

$$W_{BM} = \frac{n_1 n_2\,(\bar{R}_2 - \bar{R}_1)}{(n_1 + n_2)\sqrt{n_1 S_1^2 + n_2 S_2^2}}$$

Where $\bar{R}_i$ are the mean pooled ranks, the within-group ranks (each group ranked separately within itself) are subtracted from the pooled ranks, and $S_i^2$ are the resulting within-group variance estimates of the ranks.
Degrees of freedom are approximated using a Welch-Satterthwaite-type formula. DataStatPro reports the Brunner-Munzel test when the location-shift assumption appears violated.
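A minimal sketch of the statistic as defined above, in pure Python. Note that it degenerates when the two groups are completely separated (both rank variances become zero); `scipy.stats.brunnermunzel` is a tested implementation that also returns degrees of freedom and a p-value:

```python
import math

def midranks(values):
    """Mid-ranks (tied values share the average rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def brunner_munzel_stat(g1, g2):
    """W = n1*n2*(mean pooled rank of group 2 minus group 1) /
    ((n1+n2) * sqrt(n1*S1^2 + n2*S2^2))."""
    n1, n2 = len(g1), len(g2)
    R = midranks(list(g1) + list(g2))
    R1, R2 = R[:n1], R[n1:]
    w1, w2 = midranks(g1), midranks(g2)  # within-group ranks
    m1, m2 = sum(R1) / n1, sum(R2) / n2
    s1 = sum((R1[k] - w1[k] - m1 + (n1 + 1) / 2) ** 2 for k in range(n1)) / (n1 - 1)
    s2 = sum((R2[k] - w2[k] - m2 + (n2 + 1) / 2) ** 2 for k in range(n2)) / (n2 - 1)
    return n1 * n2 * (m2 - m1) / ((n1 + n2) * math.sqrt(n1 * s1 + n2 * s2))
```

A positive statistic here indicates that Group 2 tends to have the larger values.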
5.4 Permutation Test Alternative
The permutation test (randomisation test) for two independent groups:
- Compute the observed statistic $T_{\text{obs}}$ (e.g., $U$, or the difference in means/medians).
- Randomly reassign all observations to two groups of sizes $n_1$ and $n_2$.
- Recompute $T^*$ for each permutation.
- The p-value is the proportion of permutations where $|T^*| \ge |T_{\text{obs}}|$.
The permutation test is exact (no approximation needed), handles ties perfectly, and makes no distributional assumptions beyond exchangeability. DataStatPro offers this as an option under the Advanced Settings panel.
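The four steps above can be sketched as a Monte Carlo permutation test. This is an illustrative sketch (not DataStatPro's implementation); the `stat` argument can be the U statistic or any other comparison statistic:

```python
import random

def permutation_p(g1, g2, stat, n_perm=10_000, seed=42):
    """Two-sided Monte Carlo permutation p-value for any statistic stat(a, b)."""
    rng = random.Random(seed)
    pooled = list(g1) + list(g2)
    n1 = len(g1)
    t_obs = abs(stat(g1, g2))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(stat(pooled[:n1], pooled[n1:])) >= t_obs
    return (hits + 1) / (n_perm + 1)  # +1 keeps the estimate away from exactly 0

mean_diff = lambda a, b: sum(a) / len(a) - sum(b) / len(b)
```

Random sampling of permutations (rather than full enumeration) is what makes this practical for large samples; the `+1` adjustment is a common conservative convention for Monte Carlo p-values.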
6. Using the Mann-Whitney U Test Calculator Component
The Mann-Whitney U Test Calculator component in DataStatPro provides a comprehensive tool for conducting, diagnosing, visualising, and reporting the Mann-Whitney U Test and its alternatives.
Step-by-Step Guide
Step 1 — Select the Test
From the "Non-Parametric Tests" menu, select "Mann-Whitney U Test (Independent Samples)". DataStatPro will also display Wilcoxon Rank-Sum notation alongside for software compatibility.
Step 2 — Input Method
Choose how to provide the data:
- Raw data: Upload or paste each group's data. DataStatPro ranks observations automatically, applies tie corrections, and computes all statistics.
- Summary (ranks): Enter pre-computed rank sums $R_1$ and $R_2$ with group sizes.
- U statistic + group sizes: Enter $U$ (or $U_1$), $n_1$, and $n_2$ directly to compute p-values and effect sizes from a published result.
💡 Always use raw data when available. The rank-based computation is automatic, and raw data enable exact p-values, assumption checks, and visualisation of the full distribution.
Step 3 — Specify the Alternative Hypothesis
- Two-tailed (default): $H_1\!: P(X_1 > X_2) \ne P(X_1 < X_2)$
- Upper one-tailed: $H_1\!: P(X_1 > X_2) > P(X_1 < X_2)$ (pre-registered directional prediction)
- Lower one-tailed: $H_1\!: P(X_1 > X_2) < P(X_1 < X_2)$ (pre-registered directional prediction)
Step 4 — Select the p-Value Method
- Exact (recommended for small samples): Enumerates the exact null distribution of $U$.
- Normal approximation with tie correction (default for large samples): Uses the standardised z-statistic with the tie-corrected variance formula.
- Permutation test: Randomly samples from the permutation distribution (10,000 permutations by default; customisable up to 100,000).
Step 5 — Select the Continuity Correction
For the normal approximation:
- With continuity correction: Recommended for smaller samples; improves accuracy of the normal approximation to the discrete U distribution.
- Without continuity correction: Appropriate for large samples.
Step 6 — Select Effect Size Options
- ✅ Rank-biserial correlation $r$ (primary, always computed).
- ✅ Probability of superiority ($PS = U_1 / (n_1 n_2)$).
- ✅ Common Language Effect Size (equivalent to $PS$).
- ✅ 95% CI for $r$ via Fisher $z$-transformation.
- ✅ Hodges-Lehmann estimator (median of all pairwise differences as a robust location estimate).
Step 7 — Select Display Options
- ✅ $U_1$, $U_2$, $U$, $W$, $z$-statistic, and p-value.
- ✅ Rank sums and mean ranks per group.
- ✅ Medians, IQR, and descriptive statistics per group.
- ✅ Raincloud plot (half violin + boxplot + raw data points) per group.
- ✅ Ranked dot plot (showing all ranks with group membership colour-coded).
- ✅ Effect size visualisation ($r$ on a number line with Cohen's benchmarks).
- ✅ Tie summary table (if ties present).
- ✅ APA 7th edition results paragraph (auto-generated).
- ✅ Comparison with independent samples t-test (when raw data available).
Step 8 — Run the Analysis
Click "Run Mann-Whitney U Test". DataStatPro will:
- Pool and rank all observations with tie correction.
- Compute $R_1$, $R_2$, $U_1$, $U_2$.
- Compute the exact or approximate p-value.
- Compute $r$, $PS$, and their 95% CIs.
- Compute the Hodges-Lehmann median difference estimate.
- Generate all selected visualisations.
- Generate an APA-compliant results paragraph.
7. Exact vs. Approximate Methods
7.1 When to Use Exact Methods
The exact Mann-Whitney distribution enumerates all possible arrangements of observations into two groups of sizes $n_1$ and $n_2$ and computes the proportion of these that yield a $U$ statistic at least as extreme as the observed value. This is computationally intensive but exact.
Use exact methods when:
- Samples are small per group (exact tables available; the approximation may be poor).
- There are many ties (approximation deteriorates with heavy ties).
- Precision is critical (clinical trials, regulatory submissions).
The total number of equally likely arrangements under $H_0$:

$$\binom{N}{n_1} = \binom{n_1 + n_2}{n_1}$$

For $n_1 = n_2 = 10$: $\binom{20}{10} = 184{,}756$ arrangements — computationally feasible for exact enumeration.
7.2 The Normal Approximation — When Is It Adequate?
The normal approximation is adequate when:
- Both $n_1$ and $n_2$ are reasonably large — a common rule of thumb is $n_i \ge 20$ (exact methods preferable when feasible).
- Ties are not excessive (less than 25% of observations tied).
- The continuity correction is applied for smaller samples.
Accuracy of the approximation: The approximation error for the p-value is of order $O(1/\sqrt{N})$, meaning it improves as sample size increases.
7.3 The Permutation Approach
The permutation approach avoids the normal approximation entirely by directly estimating the null distribution from the data. It is:
- Exact in principle (but Monte Carlo sampling introduces small simulation error).
- Robust to all distributional assumptions beyond exchangeability.
- Practical for large samples where full enumeration is infeasible.
With $B$ permutations, the Monte Carlo standard error of the p-value estimate is $\sqrt{\hat{p}(1 - \hat{p})/B}$ — adequate for most purposes. DataStatPro uses $B = 10{,}000$ by default (configurable up to 100,000).
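The Monte Carlo error formula in code (trivial, but handy when choosing $B$):

```python
import math

def mc_se(p_hat, n_perm):
    """Monte Carlo standard error of a permutation p-value: sqrt(p(1-p)/B)."""
    return math.sqrt(p_hat * (1 - p_hat) / n_perm)

print(mc_se(0.05, 10_000))  # about 0.0022
```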
8. Effect Sizes for the Mann-Whitney U Test
8.1 The Rank-Biserial Correlation ($r$)
The rank-biserial correlation is the standard effect size for the Mann-Whitney U Test. It directly measures the probability of superiority on a standardised scale from $-1$ to $+1$.
Formula from U statistics:

$$r = \frac{U_1 - U_2}{n_1 n_2}$$

Equivalently:

$$r = \frac{2U_1}{n_1 n_2} - 1 = 2 \cdot PS - 1$$

Or, when $U_1$ is the statistic for Group 1:

$$r = 1 - \frac{2U_2}{n_1 n_2}$$

Formula from mean ranks:

$$r = \frac{2(\bar{R}_1 - \bar{R}_2)}{N}$$

Where $\bar{R}_i$ is the mean rank of group $i$.
Formula from the z-statistic:

$$r \approx \frac{z}{\sqrt{N}} \quad \text{(approximate)}$$

A more precise formula:

$$r = z\,\sqrt{\frac{N + 1}{3\,n_1 n_2}}$$
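The U-statistic and mean-rank forms in code (an illustrative sketch; the usage values reproduce the Section 11 worked example, where $U_B = 39.5$ with $n_1 = 7$, $n_2 = 6$):

```python
def rank_biserial(u1, n1, n2):
    """r = 2*U1/(n1*n2) - 1 = (U1 - U2)/(n1*n2); sign follows Group 1's advantage."""
    return 2 * u1 / (n1 * n2) - 1

def rank_biserial_from_mean_ranks(mr1, mr2, N):
    """Equivalent form: r = 2*(mean rank of Group 1 - mean rank of Group 2)/N."""
    return 2 * (mr1 - mr2) / N
```

Both forms give the same value on the same data, which is a useful consistency check when working from published summary statistics.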
8.2 Interpreting the Rank-Biserial Correlation
| $r$ | $PS$ | Verbal Interpretation |
|---|---|---|
| 0.00 | 0.50 | No tendency; equally likely to exceed |
| 0.10 | 0.55 | Very small effect; Group 1 slightly higher |
| 0.20 | 0.60 | Small effect |
| 0.25 | 0.625 | Small-to-medium effect |
| 0.30 | 0.65 | Medium effect (Cohen's convention) |
| 0.40 | 0.70 | Medium-large effect |
| 0.50 | 0.75 | Large effect (Cohen's convention) |
| 0.70 | 0.85 | Very large effect |
| 1.00 | 1.00 | Perfect — every Group 1 obs. exceeds every Group 2 obs. |

Cohen's (1988) benchmarks for $|r|$ (same as Pearson $r$):

| Label | $|r|$ |
| :---- | :---- |
| Small | 0.10 |
| Medium | 0.30 |
| Large | 0.50 |
⚠️ Cohen's benchmarks were not specifically developed for the rank-biserial correlation. Always contextualise effect sizes within your research domain and compare to typical effect sizes from meta-analyses in the same field.
8.3 The Probability of Superiority ()
The probability of superiority is the most intuitive interpretation of the Mann-Whitney effect size:

$$PS = P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2) = \frac{U_1}{n_1 n_2}$$

Relationship to $r$: $PS = \dfrac{r + 1}{2}$ and $r = 2 \cdot PS - 1$
Interpretation: If $PS = 0.75$, then in 75% of all possible pairings of one observation from Group 1 with one from Group 2, the Group 1 observation is larger.
Confidence interval for $PS$ (using the Fisher $z$-transformation of $r$):
95% CI for $z_r$: $z_r \pm 1.96 \cdot SE_z$
Back-transform: $r_{\text{bound}} = \tanh(z_{\text{bound}})$; then $PS_{\text{bound}} = \dfrac{r_{\text{bound}} + 1}{2}$.
8.4 The Hodges-Lehmann Estimator
The Hodges-Lehmann estimator $\hat\Delta$ is a robust, rank-based point estimate of the location shift between the two groups. It is the median of all possible pairwise differences:

$$\hat\Delta = \operatorname{median}\{X_i - Y_j : i = 1, \dots, n_1;\ j = 1, \dots, n_2\}$$

- There are $n_1 n_2$ pairwise differences.
- Under the location-shift assumption, $\hat\Delta$ estimates the median difference $\Delta$.
- $\hat\Delta \approx 0$ under $H_0$ (when the populations are identical).
Confidence interval for $\hat\Delta$: Using the exact Mann-Whitney distribution to determine which order statistics of the pairwise differences form the CI bounds.
The Hodges-Lehmann estimator is reported by DataStatPro alongside the Mann-Whitney U test as a meaningful, robust measure of the magnitude of the location shift in the original measurement units.
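The estimator is a near one-liner in Python (illustrative sketch; $O(n_1 n_2)$ memory, which is fine at tutorial scale):

```python
from statistics import median

def hodges_lehmann(g1, g2):
    """HL shift estimate: median of all n1*n2 pairwise differences x - y."""
    return median(x - y for x in g1 for y in g2)
```

On the Section 11 worked data it returns $-3.0$: Technique A scores are typically 3 points lower than Technique B scores.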
8.5 Comparing Effect Sizes Across Studies
When comparing Mann-Whitney effect sizes to t-test effect sizes from other studies, use the following conversions:
$r$ to Cohen's $d$ (approximate, under normality and equal group sizes):

$$d \approx \frac{2r}{\sqrt{1 - r^2}}$$

Or more precisely, using the point-biserial relationship with group proportions $p = n_1/N$ and $q = n_2/N$:

$$d = \frac{r}{\sqrt{pq\,(1 - r^2)}}$$

Cohen's $d$ to $r$:

$$r \approx \frac{d}{\sqrt{d^2 + 4}} \quad \text{(for equal group sizes)}$$
⚠️ These conversions assume normality for the -to- direction and may not hold for non-normal data. Use conversions with caution and clearly state the assumption.
9. Confidence Intervals
9.1 Confidence Interval for the Rank-Biserial Correlation
The 95% CI for $r$ uses the Fisher $z$-transformation:

$$z_r = \operatorname{artanh}(r) = \frac{1}{2}\ln\frac{1 + r}{1 - r}$$

Standard error (approximate):

$$SE_z \approx \frac{1}{\sqrt{N - 3}}$$

A more precise standard error accounting for group sizes:

$$SE_z = \sqrt{\frac{N + 1}{3\,n_1 n_2}}$$

95% CI in $z$ space: $z_r \pm 1.96 \cdot SE_z$
Back-transform to the $r$ scale: $r_{\text{bound}} = \tanh(z_{\text{bound}})$
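The transform-and-back-transform procedure in code (a sketch using the group-size-aware SE; it reproduces the CI reported in the Section 11 worked example):

```python
import math

def r_rb_ci(r, n1, n2, z_crit=1.96):
    """Fisher-z CI for the rank-biserial r, with SE = sqrt((N+1)/(3*n1*n2))."""
    zr = math.atanh(r)
    se = math.sqrt((n1 + n2 + 1) / (3 * n1 * n2))
    return math.tanh(zr - z_crit * se), math.tanh(zr + z_crit * se)
```

For $r = 0.881$ with $n_1 = 7$, $n_2 = 6$ this gives approximately $[0.62, 0.97]$.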
9.2 Confidence Interval for the Hodges-Lehmann Estimator
The CI for $\hat\Delta$ is derived from the exact Mann-Whitney null distribution. The procedure:
- Order all $n_1 n_2$ pairwise differences: $D_{(1)} \le D_{(2)} \le \cdots \le D_{(n_1 n_2)}$.
- Find the critical value $U_{\text{crit}}$ from the Mann-Whitney distribution table ($\alpha = .05$, two-tailed).
- The 95% CI for $\Delta$ is: $\left[D_{(U_{\text{crit}} + 1)},\ D_{(n_1 n_2 - U_{\text{crit}})}\right]$
DataStatPro computes this exactly for small samples and uses a normal approximation for larger datasets.
9.3 Confidence Interval for the Probability of Superiority
After computing the CI for $r$ (Section 9.1): $PS_{\text{bound}} = \dfrac{r_{\text{bound}} + 1}{2}$
Example: If $r = 0.50$ with 95% CI $[0.20, 0.72]$, then $PS = 0.75$ with 95% CI $[0.60, 0.86]$.
10. Advanced Topics
10.1 The Mann-Whitney Test as a Test of Stochastic Equality
Without the location-shift assumption, the Mann-Whitney test tests the general null:

$$H_0\!: P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2) = 0.5$$

This null is called stochastic equality. It does not require equal medians, equal shapes, or any distributional assumption. The alternative:

$$H_1\!: P(X_1 > X_2) + \tfrac{1}{2}P(X_1 = X_2) \ne 0.5$$
This is the most general and defensible interpretation of the Mann-Whitney U Test.
Practical implication: If Group 1 has a higher median but larger spread, and Group 2 has a lower median but smaller spread, the distributions may overlap substantially and $PS$ may be close to 0.5 even though the medians differ — the Mann-Whitney test (correctly) may not detect a significant difference.
10.2 Asymptotic Relative Efficiency Across Distributions
The ARE of the Mann-Whitney test relative to the t-test depends on the true underlying distribution:
| Distribution | ARE (Mann-Whitney vs. t-test) |
|---|---|
| Normal | $3/\pi \approx 0.955$ |
| Uniform | $1.000$ |
| Logistic | $\pi^2/9 \approx 1.097$ |
| Double exponential (Laplace) | $1.500$ |
| Cauchy (heavy-tailed) | $\infty$ |
| Contaminated normal | Often $\gg 1$ |
For heavy-tailed distributions — common in psychology (reaction times), medicine (survival times), and economics (income) — the Mann-Whitney test is substantially more powerful than the t-test.
10.3 Sample Size and Power for the Mann-Whitney Test
Power of the Mann-Whitney test under a location-shift alternative can be approximated using the ARE relationship:

$$n_{\text{MW}} \approx \frac{n_t}{0.955} \quad \text{(for normal data)}$$

For non-normal data, the required $n$ for the Mann-Whitney test is computed using the non-central distribution of $U$ (or equivalently, the non-central normal distribution for large samples). Solving for $n$ per group (equal group sizes, $n_1 = n_2 = n$) yields the values below.
Required $n$ per group (80% power, $\alpha = .05$, two-tailed, equal group sizes):

| Effect size | Label | $n$ per group (Mann-Whitney) | $n$ per group (t-test) |
| :---------- | :---- | :--------------------------- | :--------------------- |
| | Small | 414 | 394 |
| | Small | 99 | 97 |
| | Medium | 44 | 43 |
| | Medium | 21 | 20 |
| | Large | 16 | 15 |
| | Large | 10 | 9 |
Note: Mann-Whitney requires approximately 5% more observations than the t-test under normality, consistent with the ARE of $3/\pi \approx 0.955$.
10.4 Rank-Based Post-Hoc Comparisons After Kruskal-Wallis
When the Kruskal-Wallis test (the non-parametric ANOVA equivalent) is significant, pairwise Mann-Whitney U tests are conducted as post-hoc comparisons with appropriate FWER correction (Bonferroni, Holm, or Dunn's test). The effect size for each comparison is the rank-biserial $r$.
Each pairwise comparison uses only the two groups being compared (not the full ranked dataset from the Kruskal-Wallis test), though using the full-dataset ranks is also acceptable and provides a consistent ranking across comparisons.
10.5 Comparing the Mann-Whitney and Kolmogorov-Smirnov Tests
Both the Mann-Whitney and the two-sample Kolmogorov-Smirnov (KS) test are non-parametric tests for comparing two independent groups. Key differences:
| Property | Mann-Whitney U | Kolmogorov-Smirnov |
|---|---|---|
| Tests | Stochastic dominance / location shift | Any distributional difference |
| Sensitive to | Location differences | Location, spread, and shape differences |
| Power (location shifts) | ✅ Higher | ❌ Lower |
| Power (spread/shape differences) | ❌ Lower | ✅ Higher |
| Effect size | $r$, $PS$ | No standard effect size |
| Handles ties | With correction | Poorly (assumes continuous) |
Use Mann-Whitney when you are specifically interested in whether one group tends to have higher values (location shift). Use Kolmogorov-Smirnov when you want a general test of whether the two distributions differ in any way.
10.6 Bootstrap Confidence Intervals for
For small samples or when the Fisher -approximation may be imprecise, DataStatPro offers bootstrap CIs for :
- Draw $B$ bootstrap samples (resample with replacement separately from Group 1 and Group 2).
- Compute $r^*$ for each bootstrap sample.
- The 95% bootstrap CI is the 2.5th and 97.5th percentiles of the bootstrap $r^*$ values.
The bias-corrected and accelerated (BCa) bootstrap CI is preferred over the simple percentile method for small samples.
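The plain percentile variant can be sketched in pure Python (illustrative only; for BCa intervals, `scipy.stats.bootstrap` with `method='BCa'` is a tested implementation):

```python
import random

def u1_of(a, b):
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def bootstrap_ci(g1, g2, n_boot=2000, seed=7):
    """Percentile bootstrap CI for rank-biserial r (resample within each group)."""
    rng = random.Random(seed)
    rrb = lambda a, b: 2 * u1_of(a, b) / (len(a) * len(b)) - 1
    rs = sorted(
        rrb([rng.choice(g1) for _ in g1], [rng.choice(g2) for _ in g2])
        for _ in range(n_boot)
    )
    return rs[int(0.025 * n_boot)], rs[int(0.975 * n_boot) - 1]
```

Resampling within each group (rather than from the pooled data) preserves the two-independent-samples design in every bootstrap replicate.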
10.7 Reporting the Mann-Whitney U Test According to APA 7th Edition
Minimum required reporting elements:
- Test statistic: $U$ = [value]
- p-value: $p$ = [value] (exact or approximation — specify which)
- Effect size with 95% CI: $r$ = [value] [95% CI: LB, UB]
- Medians and IQR (or full range) per group
- Whether exact or asymptotic p-value was used
- Tie correction: whether applied and number of ties
- Which alternative hypothesis was tested (two-tailed or directional)
APA template:
"A Mann-Whitney U test revealed [a significant / no significant] difference in [DV] between [Group 1] (Mdn = [value], IQR = [range]) and [Group 2] (Mdn = [value], IQR = [range]), $U$ = [value], $z$ = [value], $p$ = [value], $r$ = [value] [95% CI: LB, UB], indicating a [small / medium / large] effect."
11. Worked Examples
Example 1: Small Sample with Exact p-Value — Pain Relief Ratings
A physiotherapist compares pain relief ratings (0 = no relief, 10 = complete relief) for two manual therapy techniques. Normality is violated (Shapiro-Wilk $p < .05$).
Data:
| Technique A ($n_A = 7$) | Technique B ($n_B = 6$) |
|---|---|
| 3, 4, 5, 5, 6, 7, 8 | 7, 8, 8, 9, 9, 10 |
Step 1 — Pool and rank all observations:
Sorted values: 3, 4, 5, 5, 6, 7, 7, 8, 8, 8, 9, 9, 10
| Obs | Group | Rank |
|---|---|---|
| 3 | A | 1.0 |
| 4 | A | 2.0 |
| 5 | A | 3.5 |
| 5 | A | 3.5 |
| 6 | A | 5.0 |
| 7 | A | 6.5 |
| 7 | B | 6.5 |
| 8 | A | 9.0 |
| 8 | B | 9.0 |
| 8 | B | 9.0 |
| 9 | B | 11.5 |
| 9 | B | 11.5 |
| 10 | B | 13.0 |
Step 2 — Rank sums:

$$R_A = 1 + 2 + 3.5 + 3.5 + 5 + 6.5 + 9 = 30.5, \qquad R_B = 6.5 + 9 + 9 + 11.5 + 11.5 + 13 = 60.5$$

Check: $R_A + R_B = 91 = \dfrac{13 \times 14}{2}$ ✅
Step 3 — U statistics:

$$U_A = R_A - \frac{7 \times 8}{2} = 30.5 - 28 = 2.5, \qquad U_B = R_B - \frac{6 \times 7}{2} = 60.5 - 21 = 39.5$$

Check: $U_A + U_B = 42 = n_A n_B$ ✅
Step 4 — Tie correction and z-statistic:
Ties: value 5 ($t = 2$), 7 ($t = 2$), 8 ($t = 3$), 9 ($t = 2$), so $\sum(t^3 - t) = 6 + 6 + 24 + 6 = 42$:

$$\sigma_U^2 = \frac{7 \times 6}{12}\left[14 - \frac{42}{13 \times 12}\right] = 3.5 \times 13.731 = 48.058, \qquad \sigma_U = 6.93$$

Using $U_B = 39.5$ to reflect direction (positive $z$ means Technique B tends to score higher):

$$z = \frac{39.5 - 21}{6.93} = 2.67$$

Two-tailed: $p = 2\,\Phi(-2.67) \approx .0076$
Exact p-value (DataStatPro): computed by complete enumeration of the tied null distribution and reported alongside the approximation.
Step 5 — Effect size:

$$r = \frac{U_B - U_A}{n_A n_B} = \frac{39.5 - 2.5}{42} = 0.88$$

(Positive $r$ here: Group B tends to have higher values)
95% CI for $r$ (Fisher $z$): $z_r = \operatorname{artanh}(0.88) = 1.38$, $SE_z = \sqrt{\dfrac{14}{3 \times 42}} = 0.33$
95% CI: $\tanh(1.38 \pm 1.96 \times 0.33) = [0.62, 0.97]$
Hodges-Lehmann estimator $\hat\Delta$:
All $7 \times 6 = 42$ pairwise differences (Technique A $-$ Technique B) are computed and sorted.
Median of these 42 differences: $\hat\Delta = -3.0$ (Technique A scores are typically 3 points lower)
Summary:
| Statistic | Value | Interpretation |
|---|---|---|
| Technique A Median (IQR) | 5.0 (3.5–7.5) | Lower ratings |
| Technique B Median (IQR) | 8.5 (7.75–9.25) | Higher ratings |
| $U$, $p$ (exact) | $U = 2.5$, $p < .01$ | Significant at $\alpha = .05$ |
| $r$ | 0.88 | Very large effect |
| 95% CI for $r$ | [0.62, 0.97] | Excludes 0 |
| Hodges-Lehmann $\hat\Delta$ | $-3.0$ points | A is 3 points lower |
APA write-up: "A Mann-Whitney U test (exact) was conducted to compare pain relief ratings between Technique A ($n = 7$, Mdn $= 5.0$, IQR $= 3.5$–$7.5$) and Technique B ($n = 6$, Mdn $= 8.5$, IQR $= 7.75$–$9.25$). Technique B produced significantly higher ratings, $U = 2.5$, $p < .01$ (exact), $r = .88$ [95% CI: 0.62, 0.97], indicating a very large effect. The Hodges-Lehmann estimator indicated a median difference of 3.0 points (Technique B higher)."
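The entire worked example can be checked with a short script (a pure-Python sketch mirroring Steps 1 through 3 and 5; `scipy.stats.mannwhitneyu` would report the matching U and p-value from the same raw data):

```python
def midranks(values):
    """Mid-ranks for the pooled sample (tied values share the average rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

A = [3, 4, 5, 5, 6, 7, 8]   # Technique A
B = [7, 8, 8, 9, 9, 10]     # Technique B
ranks = midranks(A + B)
rA, rB = sum(ranks[:len(A)]), sum(ranks[len(A):])
uA = rA - len(A) * (len(A) + 1) / 2
uB = rB - len(B) * (len(B) + 1) / 2
r_rb = (uB - uA) / (len(A) * len(B))
print(rA, rB, uA, uB, round(r_rb, 3))  # 30.5 60.5 2.5 39.5 0.881
```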
Example 2: Larger Sample with Normal Approximation — Reaction Times
A cognitive psychologist compares simple reaction times (ms) between neurotypical adults and adults with ADHD. Data are positively skewed (significant Shapiro-Wilk test for the ADHD group).
Summary statistics (pre-computed): $n_1$ (neurotypical), $n_2$ (ADHD), $N = n_1 + n_2$.
Neurotypical: Mdn = [value] ms, IQR = [LB]–[UB] ms. ADHD: Mdn = [value] ms, IQR = [LB]–[UB] ms.
Step 1 — U statistics:
Compute the rank sums $R_1$ (neurotypical) and $R_2$ (ADHD) from the pooled ranks, then $U_1 = R_1 - \frac{n_1(n_1+1)}{2}$ and $U_2 = R_2 - \frac{n_2(n_2+1)}{2}$.
Check: $U_1 + U_2 = n_1 n_2$ ✅
Step 2 — z-statistic (no ties assumed for this example):
(The sign of $z$ depends on which group's $U$ is entered; read the direction of the effect from the mean ranks — here the ADHD group has the higher mean rank, i.e., longer reaction times.)
Two-tailed: $p = 2[1 - \Phi(\lvert z\rvert)]$
Step 3 — Effect size:
Because ranks are assigned to reaction times, a higher rank means a longer (slower) RT. The ADHD group tends to have higher RTs, hence higher ranks and the larger rank sum. For clear directionality, always specify which group is Group 1 and which is Group 2.
Let Group 1 = ADHD and Group 2 = Neurotypical. Then $U_1$ counts the pairings in which the ADHD participant has the higher RT, $U_2$ counts the pairings the neurotypical participant "wins", and
$r_{rb} = \frac{2U_1}{n_1 n_2} - 1 = 2 \times .685 - 1 = .37$
Interpretation: the ADHD group tends to have higher RT values than the neurotypical group; the positive $r_{rb}$ indicates that neurotypical observations tend to be smaller (faster).
$r_{rb} = .37$ — a medium-to-large effect (above Cohen's medium benchmark of 0.30).
Interpretation: In only 31.5% of all pairings does an ADHD participant have a faster RT than a neurotypical participant — i.e., neurotypical participants are faster in 68.5% of pairings.
95% CI for $r_{rb}$ (Fisher z-transform with $SE = \sqrt{(n_1 + n_2 + 1)/(3 n_1 n_2)}$):
95% CI: [0.05, 0.62]
Summary:
| Statistic | Value |
|---|---|
| Neurotypical: Mdn (IQR) | [value] ms ([LB]–[UB]) |
| ADHD: Mdn (IQR) | [value] ms ([LB]–[UB]) |
| $U$, $z$, $p$ (two-tailed, approx.) | $p < .05$ |
| $r_{rb}$ | .37 (medium-large) |
| 95% CI for $r_{rb}$ | [0.05, 0.62] |
| $PS$ | .685 (ADHD slower than neurotypical in 68.5% of pairings) |
APA write-up: "A Mann-Whitney U test was conducted to compare reaction times between neurotypical adults ($n$ = [value], Mdn = [value] ms, IQR = [LB]–[UB]) and adults with ADHD ($n$ = [value], Mdn = [value] ms, IQR = [LB]–[UB]). Adults with ADHD showed significantly longer reaction times, $U$ = [value], $z$ = [value], $p$ = [value], $r_{rb} = .37$ [95% CI: 0.05, 0.62], indicating a medium-to-large effect. In 68.5% of all possible pairings, a neurotypical participant was faster than an ADHD participant."
Example 3: Interpreting a Non-Significant Result
A researcher tests whether customer satisfaction ratings differ between two service delivery formats (in-person vs. online; 5-point scale).
Given: $U$, $z$, $p > .05$, and a small $r_{rb}$.
95% CI for $r_{rb}$: using the Fisher z-transform with $SE = \sqrt{(n_1 + n_2 + 1)/(3 n_1 n_2)}$, as in Examples 1 and 2, the interval spans (approximately) from practically zero to a small negative effect.
Equivalence test: With equivalence bounds set at a trivially small effect, the 90% CI for $r_{rb}$ has its lower bound comfortably inside the bounds but its upper bound only just inside — equivalence is borderline. Increase the sample size for a more powerful equivalence test.
Interpretation: The test is not significant. The effect size is trivially small ($r_{rb}$ near zero, with a 95% CI spanning from practically zero to a small negative effect). The CI is relatively wide given the sample size. This is genuinely null-like, but a formal equivalence test with a larger sample per group would provide more definitive evidence.
APA write-up: "A Mann-Whitney U test found no significant difference in satisfaction ratings between in-person ($n$ = [value]) and online ($n$ = [value]) formats, $U$ = [value], $z$ = [value], $p$ = [value], $r_{rb}$ = [value] [95% CI: LB, UB]. The small effect size and wide confidence interval suggest that any true difference, if present, is negligibly small. An equivalence test would be required to formally establish the absence of a meaningful difference."
12. Common Mistakes and How to Avoid Them
Mistake 1: Using the Mann-Whitney Test for Paired Data
Problem: Applying the Mann-Whitney U Test to pre-post or matched-pairs data as if the groups were independent. This ignores the within-pair correlation, produces an inflated error term, and substantially reduces power. It also violates the independence assumption.
Solution: For paired or matched data, use the Wilcoxon Signed-Rank Test — the non-parametric equivalent of the paired t-test. Verify whether data represent independent groups (different participants) or related measurements (same participants or matched pairs) before choosing the test.
Mistake 2: Interpreting the Mann-Whitney Test as Always Testing Medians
Problem: Claiming that "the Mann-Whitney test compares medians" without acknowledging that this interpretation requires the location-shift assumption (equal distribution shapes). When distributions differ in shape or spread, the test may be significant even when medians are equal, or non-significant when medians differ considerably.
Solution: State the null hypothesis precisely: "The Mann-Whitney U test tests whether the probability that a randomly selected observation from Group 1 exceeds a randomly selected observation from Group 2 is 0.5." Only invoke median interpretation when the location-shift assumption is plausible and checked.
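This null quantity — the probability that a randomly selected Group 1 observation exceeds a Group 2 observation, with ties counted as one half — can be estimated directly by comparing every pair. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X == Y) over all len(x) * len(y) pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    wins = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return (wins + 0.5 * ties) / (x.size * y.size)

# With the Example 1 data this equals U_A / (n1 * n2) = 2.5 / 42
a = [3, 4, 5, 5, 6, 7, 8]
b = [7, 8, 8, 9, 9, 10]
print(round(prob_superiority(a, b), 4))  # 0.0595 — A rarely exceeds B
```

A value of 0.5 corresponds to the null hypothesis; the further the estimate is from 0.5, the stronger the stochastic dominance of one group.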
Mistake 3: Reporting Only U and p Without an Effect Size
Problem: Reporting only $U$ and $p$, without the rank-biserial correlation or probability of superiority. Like the t-test, the Mann-Whitney test's p-value is influenced by sample size — a significant result says nothing about the magnitude of the effect.
Solution: Always report $r_{rb}$ (and/or $PS$) with its 95% CI alongside $U$ and $p$. DataStatPro computes these automatically. Significant results with small $r_{rb}$ should be interpreted cautiously; non-significant results with moderate $r_{rb}$ may indicate insufficient power.
Mistake 4: Defaulting to Mann-Whitney When t-Test Assumptions Are Met
Problem: Using the Mann-Whitney test "to be safe" when the independent t-test's assumptions are fully satisfied (normal data, no severe outliers, equal variances). The Mann-Whitney test sacrifices approximately 5% power under normality — a real but small cost.
Solution: Run the Shapiro-Wilk test and inspect Q-Q plots. If normality holds (and sample sizes are adequate), use the independent t-test (Welch's version) for slightly greater power and the ability to report intuitive mean differences. Reserve Mann-Whitney for genuinely non-normal or ordinal data.
Mistake 5: Not Applying Tie Correction
Problem: Using the uncorrected variance formula when ties are present. This overestimates $\sigma_U$, producing a z-statistic that is too small and a p-value that is too large — making the test conservative.
Solution: Always apply the tie correction to the variance when ties are present. DataStatPro applies this automatically. Report the number and proportion of tied observations in the methods section.
Mistake 6: Reporting Means Instead of Medians for Mann-Whitney
Problem: Reporting group means alongside a Mann-Whitney test result. Since the test is rank-based and makes no assumptions about means, reporting means as the primary descriptive statistic is inconsistent with the test's rationale.
Solution: Report medians and IQRs (interquartile ranges) as the primary descriptive statistics alongside Mann-Whitney results. Means can be additionally reported as secondary information if useful, clearly labelled as supplementary.
Mistake 7: Confusing $U_1$ and $U_2$, Leading to Sign Errors in $r_{rb}$
Problem: Confusing which $U$ statistic belongs to which group. With $r_{rb} = \frac{2U_1}{n_1 n_2} - 1$, a positive value means Group 1 tends to be larger; negative means Group 2 tends to be larger. Swapping $U_1$ and $U_2$ reverses the sign.
Solution: Always clearly label which group is Group 1 and which is Group 2 before computing $r_{rb}$. State the direction of the effect in the results: "Group X tended to have higher values than Group Y."
Mistake 8: Using Mann-Whitney for More Than Two Groups
Problem: Running multiple pairwise Mann-Whitney tests across three or more groups without a prior omnibus test and without FWER correction. This inflates the familywise Type I error rate.
Solution: For three or more independent groups, run the Kruskal-Wallis Test as the omnibus test first. Only if significant, conduct pairwise Mann-Whitney or Dunn's tests with Bonferroni or Holm FWER correction.
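The recommended workflow can be sketched as follows, using hypothetical data and a hand-rolled Holm adjustment (`statsmodels.stats.multitest.multipletests` provides the same correction; the data and group labels here are illustrative only):

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu

def holm_adjust(pvals):
    """Holm step-down adjustment of a list of raw p-values."""
    m = len(pvals)
    order = np.argsort(pvals)
    adjusted = [0.0] * m
    running_max = 0.0
    for pos, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - pos) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical ordinal ratings for three independent groups
rng = np.random.default_rng(42)
groups = {
    "A": rng.integers(1, 6, 20),
    "B": rng.integers(2, 7, 20),
    "C": rng.integers(1, 6, 20),
}

h, p_omnibus = kruskal(*groups.values())  # omnibus test first
if p_omnibus < 0.05:
    pairs = list(combinations(groups, 2))
    raw = [mannwhitneyu(groups[g1], groups[g2]).pvalue for g1, g2 in pairs]
    for (g1, g2), p_adj in zip(pairs, holm_adjust(raw)):
        print(f"{g1} vs {g2}: Holm-adjusted p = {p_adj:.4f}")
else:
    print(f"Omnibus not significant (p = {p_omnibus:.3f}); no pairwise tests")
```

Holm is uniformly more powerful than plain Bonferroni while still controlling the familywise error rate, which is why it is usually the better default.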
Mistake 9: Ignoring the Directionality of a Significant Result
Problem: Reporting a significant Mann-Whitney result without stating which group had higher rankings. A significant p-value does not tell you direction — you must inspect the rank sums or mean ranks to determine which group tends to be higher.
Solution: Always report mean ranks or medians alongside $U$ and $p$, so the direction is unambiguous. Check: if $U_1 > \frac{n_1 n_2}{2}$, Group 1 tends to have higher values; if $U_1 < \frac{n_1 n_2}{2}$, Group 2 does.
Mistake 10: Treating a Non-Significant Mann-Whitney as Proof of Equal Distributions
Problem: Concluding from $p > .05$ that the two populations are identical (or that the medians are equal). A non-significant result indicates insufficient evidence against $H_0$ — not evidence for $H_0$.
Solution: Report $r_{rb}$ and its 95% CI. A non-significant result with a wide CI indicates low power, not a true null effect. Conduct an equivalence test with pre-specified bounds to positively establish that the effect is negligibly small.
13. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| $R_1 + R_2 \ne \frac{N(N+1)}{2}$ | Calculation error in $R_1$ or $R_2$ | Verify the ranks first; recompute $U$ from the corrected rank sums |
| Tied values assigned different ranks | Ranking error; ties not averaged | Recheck averaging of tied ranks; verify all observations are ranked |
| Exact p noticeably different from approximate p | Small $n$ or many ties | Use the exact p for small samples; report exact if available |
| $U_1 > n_1 n_2$ or $U_2 > n_1 n_2$ | Computational error | Check that $U_1 + U_2$ sum to $n_1 n_2$; verify direction of subtraction |
| Very large effect but non-significant p | $z$ severely underestimated due to ties without correction | Apply tie correction; use exact p-value |
| Mann-Whitney significant but t-test not | Non-normality causing t-test to lose power; or different hypotheses | Trust Mann-Whitney for non-normal data; they test different things |
| t-test significant but Mann-Whitney not | Outlier driving mean difference but not systematically affecting ranks | Investigate outlier; report both with explanation |
| Many ties | Coarse measurement scale (e.g., 5-point Likert) | Use tie-corrected variance; report tie proportion; consider ordinal regression |
| Hodges-Lehmann $\hat\Delta \approx 0$ but p significant | Distributions differ in shape/spread but not location shift | Report $PS$ rather than $\hat\Delta$; location-shift assumption may be violated |
| Brunner-Munzel and Mann-Whitney give different conclusions | Distribution shapes differ; location-shift violated | Use Brunner-Munzel as more appropriate; report both |
| Exact p-value computation takes too long | $n_1$, $n_2$ too large for enumeration | Switch to permutation test or normal approximation with tie correction |
| Negative $U$ value | Formula error ($U$ cannot be negative) | Re-examine formula; always $0 \le U \le n_1 n_2$ |
| Bootstrap CI very wide | Small samples | Report wide CI as reflecting genuine uncertainty; collect more data |
| $r$ from z-formula differs from $r_{rb}$ from U-formula | Formula approximation discrepancy | Use $r_{rb}$ as primary; the z-based formula is only approximate |
14. Quick Reference Cheat Sheet
Core Formulas
| Formula | Description |
|---|---|
| $R_i = \sum (\text{ranks in group } i)$ | Rank sum for group $i$ |
| $R_1 + R_2 = \frac{N(N+1)}{2}$ | Verification check |
| $U_1 = R_1 - \frac{n_1(n_1+1)}{2}$ | $U$ for Group 1 |
| $U_2 = R_2 - \frac{n_2(n_2+1)}{2}$ | $U$ for Group 2 |
| $U_1 + U_2 = n_1 n_2$ | Verification check |
| $U = \min(U_1, U_2)$ | Test statistic |
| $\mu_U = \frac{n_1 n_2}{2}$ | Mean of $U$ under $H_0$ |
| $\sigma_U^2 = \frac{n_1 n_2 (N+1)}{12}$ | Variance of $U$ (no ties) |
| $\sigma_U^2 = \frac{n_1 n_2}{12}\left[(N+1) - \frac{\sum_j (t_j^3 - t_j)}{N(N-1)}\right]$ | Variance (with tie correction) |
| $z = \frac{U - \mu_U}{\sigma_U}$ | Standardised statistic |
| $p = 2[1 - \Phi(\lvert z\rvert)]$ | Two-tailed p-value |
| $PS = \frac{U_1}{n_1 n_2}$ | Probability of superiority |
Effect Size Formulas
| Formula | Description |
|---|---|
| $r_{rb} = \frac{2U_1}{n_1 n_2} - 1$ | Rank-biserial correlation (primary) |
| $r_{rb} = 1 - \frac{2U}{n_1 n_2}$ | From $U = \min(U_1, U_2)$ (magnitude) |
| $r_{rb} = \frac{2(\bar{R}_1 - \bar{R}_2)}{N}$ | From mean ranks |
| $r \approx \frac{z}{\sqrt{N}}$ | Approximate from $z$ |
| $PS = \frac{U_1}{n_1 n_2}$ | Probability Group 1 > Group 2 |
| $PS = \frac{r_{rb} + 1}{2}$ | Convert $r_{rb}$ to PS |
| $r_{rb} = 2\,PS - 1$ | Convert PS to $r_{rb}$ |
| $\hat\Delta = \operatorname{median}\{x_{1i} - x_{2j}\}$ | Hodges-Lehmann estimator |
| $z_r = \frac{1}{2}\ln\frac{1 + r_{rb}}{1 - r_{rb}}$ | Fisher $z$-transform of $r_{rb}$ |
| $SE = \sqrt{\frac{n_1 + n_2 + 1}{3 n_1 n_2}}$ | SE for CI of $z_r$ |
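The effect-size and CI formulas above translate directly into code. A minimal sketch, checked against Worked Example 1 ($U_B = 39.5$, $n_1 = 7$, $n_2 = 6$); the SE used is the large-sample approximation from the table:

```python
import math

def rank_biserial(u_group1, n1, n2):
    """Rank-biserial correlation; u_group1 counts pairs won by Group 1."""
    return 2 * u_group1 / (n1 * n2) - 1

def rank_biserial_ci(r, n1, n2, z_crit=1.96):
    """Approximate CI: Fisher z-transform with SE = sqrt((n1+n2+1)/(3*n1*n2))."""
    zr = math.atanh(r)
    se = math.sqrt((n1 + n2 + 1) / (3 * n1 * n2))
    return math.tanh(zr - z_crit * se), math.tanh(zr + z_crit * se)

# Worked Example 1, with Group 1 = Technique B
r = rank_biserial(39.5, 7, 6)
lo, hi = rank_biserial_ci(r, 7, 6)
ps = (r + 1) / 2  # convert r_rb to the probability of superiority
print(round(r, 2), round(lo, 2), round(hi, 2), round(ps, 2))  # 0.88 0.62 0.97 0.94
```

Transforming to the Fisher-z scale before adding the margin of error and transforming back keeps the interval inside $[-1, 1]$, which a naive symmetric CI on $r_{rb}$ would not guarantee.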
Conversions Between Effect Sizes
| From | To | Formula |
|---|---|---|
| Cohen's $d$ | $PS$ | $PS = \Phi(d/\sqrt{2})$ |
| $PS$ | Cohen's $d$ | $d = \sqrt{2}\,\Phi^{-1}(PS)$ (equal groups) |
| $r_{rb}$ | Cohen's $d$ | $d \approx \frac{2\,r_{rb}}{\sqrt{1 - r_{rb}^2}}$ (approx) |
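Under the model of two normal populations with equal variances, Cohen's $d$ and the probability of superiority are linked by $PS = \Phi(d/\sqrt{2})$. A sketch of the conversions using only the standard library (function names are illustrative):

```python
from statistics import NormalDist

_nd = NormalDist()

def d_to_ps(d):
    """Cohen's d -> probability of superiority (two normals, equal variances)."""
    return _nd.cdf(d / 2 ** 0.5)

def ps_to_d(ps):
    """Probability of superiority -> Cohen's d (same model)."""
    return 2 ** 0.5 * _nd.inv_cdf(ps)

def ps_to_rrb(ps):
    """Probability of superiority -> rank-biserial correlation."""
    return 2 * ps - 1

ps = d_to_ps(0.5)  # a "medium" Cohen's d
print(round(ps, 3), round(ps_to_rrb(ps), 3))  # 0.638 0.276
```

Note these conversions are model-based: with skewed or heteroscedastic data, the mapping between $d$ and the rank-based effect sizes is only approximate.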
Cohen's Benchmarks for $r_{rb}$
| Label | $\lvert r_{rb}\rvert$ | $PS$ |
|---|---|---|
| Negligible | .00–.10 | .50–.55 |
| Small | .10–.30 | .55–.65 |
| Medium | .30–.50 | .65–.75 |
| Large | .50–.70 | .75–.85 |
| Very large | ≥ .70 | ≥ .85 |
Required $n$ per Group (80% Power, $\alpha = .05$, Two-Tailed)
| $r_{rb}$ | Label | Mann-Whitney | t-Test (if normal) |
|---|---|---|---|
| 0.10 | Small | 414 | 394 |
| 0.20 | Small | 99 | 97 |
| 0.30 | Medium | 44 | 43 |
| 0.44 | Medium | 21 | 20 |
| 0.50 | Large | 16 | 15 |
| 0.64 | Large | 10 | 9 |
Decision Guide: Mann-Whitney vs. Alternatives
| Situation | Test |
|---|---|
| Two independent groups, non-normal or ordinal | Mann-Whitney U ✅ |
| Two independent groups, normal, equal variances | Independent t-test (or Welch's) |
| Two independent groups, unequal shapes/spreads | Brunner-Munzel test |
| Two paired/related groups, non-normal | Wilcoxon Signed-Rank test |
| Three or more independent groups, non-normal | Kruskal-Wallis test |
| Two independent groups, very small samples | Fisher's exact (binary), exact Mann-Whitney |
| General distributional difference (not just location) | Kolmogorov-Smirnov test |
Tie Correction Reference
| Proportion of Ties | Impact on $\sigma_U$ | Recommended p-Value Method |
|---|---|---|
| Low | Negligible | Standard or tie-corrected approximation |
| Moderate | Moderate reduction | Tie-corrected approximation |
| High | Substantial reduction | Exact p-value; permutation test |
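Counting the tie proportion and applying the tie-corrected variance can be sketched as follows (illustrated with the Example 1 data, in which 9 of the 13 pooled observations are tied):

```python
import numpy as np

def tie_summary(pooled):
    """Return (proportion of tied observations, sum of t^3 - t over tie groups)."""
    _, counts = np.unique(np.asarray(pooled), return_counts=True)
    tied = counts[counts > 1]
    return tied.sum() / len(pooled), np.sum(counts**3 - counts)

def var_u(n1, n2, tie_term=0.0):
    """Variance of U under H0, with optional tie correction."""
    n = n1 + n2
    return n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))

pooled = [3, 4, 5, 5, 6, 7, 8, 7, 8, 8, 9, 9, 10]
prop, tie_term = tie_summary(pooled)
print(round(prop, 2), tie_term)                      # 0.69 42
print(var_u(7, 6), round(var_u(7, 6, tie_term), 2))  # 49.0 48.06
```

Values with a count of 1 contribute $t^3 - t = 0$, so summing over all unique values (as above) gives the same tie term as summing over tie groups only.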
APA Reporting Template
"A Mann-Whitney U test [with / without] continuity correction [with exact / asymptotic p-value] was conducted to compare [DV] between [Group 1] (Mdn = [value], IQR = [LB]–[UB], n = [value]) and [Group 2] (Mdn = [value], IQR = [LB]–[UB], n = [value]). [Group X] had significantly [higher / lower] [DV] than [Group Y], U = [value], z = [value], p = [value], r_rb = [value] [95% CI: LB, UB], indicating a [small / medium / large] effect. In [PS%] of all possible pairings, a [Group X] observation exceeded a [Group Y] observation."
Assumption Checks Checklist
| Check | Method | Action if Violated |
|---|---|---|
| Independence | Study design review | Wilcoxon SR (paired); multilevel methods (clustered) |
| Ordinal scale | Measurement review | Chi-squared (nominal); ordinal regression |
| Two independent groups | Design review | Kruskal-Wallis (3+ groups); Wilcoxon SR (paired) |
| Location-shift assumption | Boxplots; IQR comparison | Brunner-Munzel test |
| Excessive ties | Count ties | Exact p-value; permutation test; tie-corrected variance |
Mann-Whitney Reporting Checklist
| Item | Required |
|---|---|
| $U$ statistic | ✅ Always |
| $z$-statistic (if approximation used) | ✅ When applicable |
| Whether exact or asymptotic p-value | ✅ Always |
| Exact p-value | ✅ Preferred for small samples per group |
| $r_{rb}$ with 95% CI | ✅ Always |
| Probability of superiority ($PS$) | ✅ Recommended |
| Medians and IQRs per group | ✅ Always |
| Sample sizes per group | ✅ Always |
| Direction of the effect | ✅ Always |
| Tie correction applied | ✅ When ties present |
| Number/proportion of tied observations | ✅ When ties substantial |
| Hodges-Lehmann estimator | ✅ Recommended |
| Whether two-tailed or directional | ✅ Always |
| Normality violation justification | ✅ When used instead of t-test |
This tutorial provides a comprehensive foundation for understanding, conducting, and reporting the Mann-Whitney U Test within the DataStatPro application. For further reading, consult Mann & Whitney's (1947) original paper "On a test of whether one of two random variables is stochastically larger than the other" (Annals of Mathematical Statistics), Wilcoxon's (1945) foundational paper, Hollander, Wolfe & Chicken's "Nonparametric Statistical Methods" (3rd ed., 2014) for rigorous theoretical coverage, Conover's "Practical Nonparametric Statistics" (3rd ed., 1999) for applied guidance, and Brunner & Munzel's (2000) "The Nonparametric Behrens-Fisher Problem" (Biometrical Journal) for the robust alternative when distribution shapes differ. For the rank-biserial correlation as an effect size, see Kerby (2014) in the Comprehensive Psychology journal. For feature requests or support, contact the DataStatPro team.