
Mann-Whitney U Test

Comprehensive reference guide for the Mann-Whitney U test (the non-parametric alternative to the independent t-test).

Mann-Whitney U Test: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of non-parametric inference all the way through the complete theory, mathematics, assumptions, effect sizes, interpretation, reporting, and practical usage of the Mann-Whitney U Test within the DataStatPro application. Whether you are encountering the Mann-Whitney U Test for the first time or seeking a deeper understanding of rank-based methods for comparing two independent groups, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is the Mann-Whitney U Test?
  3. The Mathematics Behind the Mann-Whitney U Test
  4. Assumptions of the Mann-Whitney U Test
  5. Variants and Related Tests
  6. Using the Mann-Whitney U Test Calculator Component
  7. Exact vs. Approximate Methods
  8. Effect Sizes for the Mann-Whitney U Test
  9. Confidence Intervals
  10. Advanced Topics
  11. Worked Examples
  12. Common Mistakes and How to Avoid Them
  13. Troubleshooting
  14. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into the Mann-Whitney U Test, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 Parametric vs. Non-Parametric Inference

Parametric tests (e.g., the independent samples t-test) make explicit assumptions about the shape of the population distribution — typically normality — and estimate specific population parameters (e.g., $\mu$, $\sigma^2$). Their validity depends on those distributional assumptions being met.

Non-parametric tests (also called distribution-free tests) do not assume a specific functional form for the population distribution. The Mann-Whitney U Test is non-parametric: it does not assume normality. Instead of operating on raw scores, it operates on the ranks of those scores.

⚠️ "Distribution-free" is not synonymous with "assumption-free." The Mann-Whitney U Test has its own set of assumptions, reviewed in Section 4. Violating these assumptions can invalidate its conclusions just as surely as violating normality invalidates the t-test.

1.2 Ordinal Data and Ranks

Ordinal data convey the relative ordering of observations but not the magnitude of differences between them. Examples include Likert-scale responses, pain ratings, and satisfaction scores.

Ranking is the process of replacing each raw score with its position in an ordered list. For $N$ total observations, ranks run from 1 (smallest) to $N$ (largest), and tied values each receive the average of the rank positions they would otherwise occupy (mid-ranks).

Example: Raw scores $\{3, 7, 7, 9\}$ become ranks $\{1, 2.5, 2.5, 4\}$ (the two tied 7s share ranks 2 and 3, so each receives $(2+3)/2 = 2.5$).
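The mid-rank rule can be sketched in a few lines of Python (a minimal illustration; the helper name `midranks` is ours, not part of any library):

```python
def midranks(values):
    """Replace each score with its rank 1..N; tied scores share the
    average of the rank positions they occupy (mid-ranks)."""
    s = sorted(values)
    rank = {}                          # distinct value -> its mid-rank
    i = 0
    while i < len(s):
        j = i
        while j + 1 < len(s) and s[j + 1] == s[i]:
            j += 1                     # extend over the run of tied values
        rank[s[i]] = (i + j) / 2 + 1   # average of positions i+1 .. j+1
        i = j + 1
    return [rank[v] for v in values]

print(midranks([3, 7, 7, 9]))          # [1.0, 2.5, 2.5, 4.0]
```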

1.3 The Concept of Stochastic Dominance

The Mann-Whitney U Test is fundamentally a test of stochastic dominance. Group 1 stochastically dominates Group 2 if a randomly chosen observation from Group 1 tends to be larger than a randomly chosen observation from Group 2:

$P(X_1 > X_2) > 0.5$

The scaled statistic $U_1/(n_1 n_2)$ directly estimates this probability, making the Mann-Whitney U Test one of the most intuitively interpretable inferential tests in statistics.

1.4 The Independent Samples t-Test and Its Limitations

The independent samples t-test is the parametric counterpart of the Mann-Whitney U Test. It is appropriate when the dependent variable is interval- or ratio-scaled, both groups are approximately normally distributed (or samples are large enough for the central limit theorem to apply), and group variances are roughly equal (or a Welch correction is used).

When normality is markedly violated (especially with small samples) or when data are ordinal, the t-test is inappropriate and the Mann-Whitney U Test is preferred.

1.5 Statistical Power and Asymptotic Relative Efficiency

The Asymptotic Relative Efficiency (ARE) compares the power of two tests as $n \to \infty$. The ARE of the Mann-Whitney U Test relative to the t-test is:

$ARE = \frac{3}{\pi} \approx 0.955$ (under normality)

This means that, under normality, the Mann-Whitney test retains about 95.5% of the t-test's efficiency; equivalently, it needs roughly 5% more observations to achieve the same power.

This near-equivalence under normality makes the Mann-Whitney test a safe default when normality is uncertain.

1.6 The Probability of Superiority

The probability of superiority (PS) — equivalent to the Common Language Effect Size — is the probability that a randomly selected observation from Group 1 exceeds a randomly selected observation from Group 2:

$PS = P(X_1 > X_2) + 0.5 \cdot P(X_1 = X_2)$

This interpretation is central to understanding the Mann-Whitney U Test's effect size.

1.7 Hypothesis Testing Framework

Every Mann-Whitney U Test operates within the standard hypothesis testing framework:

Step 1 — State the hypotheses (see Section 4 for precise formulations).

Step 2 — Choose $\alpha$ — the significance level (conventionally $\alpha = .05$).

Step 3 — Compute the test statistic $U$ (or its standardised form $z$).

Step 4 — Compute the p-value — the probability of observing a $U$ statistic at least as extreme as the one obtained, assuming $H_0$.

Step 5 — Make a decision — reject $H_0$ if $p \leq \alpha$.

Step 6 — Compute and report the effect size $r_{rb}$ (rank-biserial correlation) with its 95% confidence interval.


2. What is the Mann-Whitney U Test?

2.1 The Core Idea

The Mann-Whitney U Test (also called the Wilcoxon Rank-Sum Test or Wilcoxon-Mann-Whitney Test) is a non-parametric test for comparing two independent groups. Rather than comparing group means directly (as the t-test does), it assesses whether observations from one group tend to be systematically larger or smaller than observations from the other group.

The test was developed independently by Frank Wilcoxon (1945), who introduced the rank-sum statistic $W$ for two equal-sized samples, and by Henry Mann and Donald Whitney (1947), who generalised the procedure to unequal sample sizes and introduced the $U$ statistic.

The two formulations are mathematically equivalent: they produce the same p-value.

2.2 Research Questions the Mann-Whitney U Test Answers

The Mann-Whitney U Test answers:

"Do observations from Group 1 tend to have systematically higher (or lower) values than observations from Group 2?"

More formally, under the location-shift assumption (see Section 4):

"Is the median of Group 1 equal to the median of Group 2?"

2.3 When to Use the Mann-Whitney U Test

The Mann-Whitney U Test is the appropriate choice when:

| Condition | Details |
| :--- | :--- |
| Two independent groups | Different participants in each group |
| Non-normal distribution | Normality violated; Shapiro-Wilk significant |
| Ordinal dependent variable | Likert scales, pain ratings, satisfaction scores |
| Small sample size | $n < 15$ per group; CLT may not apply |
| Presence of outliers | Extreme values distort the t-test |
| Skewed distributions | Reaction times, income, response latencies |
| Bounded scales | Ceiling or floor effects distorting normality |

2.4 The Mann-Whitney U Test vs. the Independent Samples t-Test

| Property | Independent t-Test | Mann-Whitney U Test |
| :--- | :--- | :--- |
| Tests | Mean difference | Distributional dominance / median shift |
| Scale | Interval / Ratio | Ordinal or higher |
| Assumes normality | ✅ Yes | ❌ No |
| Sensitive to outliers | ✅ High | ❌ Low (rank-based) |
| Power (when normal) | Slightly higher | $\approx 95.5\%$ of t-test |
| Power (when non-normal) | Can be lower | Can exceed t-test |
| Effect size | Cohen's $d$ | Rank-biserial $r_{rb}$ |
| Parametric | ✅ Yes | ❌ No |

2.5 Real-World Applications

| Field | Application | Example |
| :--- | :--- | :--- |
| Clinical Psychology | Symptom severity between two treatment arms | PTSD symptom score: EMDR vs. CBT |
| Medicine | Recovery time between two surgical techniques | Days to discharge: laparoscopic vs. open |
| Education | Exam performance between two instructional methods | Grades: problem-based vs. lecture |
| Marketing | Consumer preference ratings for two products | Rating (1–10): Product A vs. B |
| Ecology | Species abundance between two habitats | Count of species: Forest A vs. B |
| Neuroscience | Response latencies between patient and control groups | RT (ms): ADHD vs. neurotypical |
| Organisational Psychology | Job satisfaction between two departments | Survey score: Dept A vs. Dept B |
| Public Health | Physical activity levels between two communities | Steps/day: urban vs. rural |

3. The Mathematics Behind the Mann-Whitney U Test

3.1 The Rank-Sum Formulation (Wilcoxon)

Step 1 — Pool and rank all observations.

Combine all $n_1 + n_2 = N$ observations from both groups into a single ordered list. Assign ranks from 1 (smallest) to $N$ (largest). For tied values, assign average ranks (mid-ranks).

Step 2 — Compute the rank sums.

$W_1 = \sum_{i=1}^{n_1} R_i \quad \text{(sum of ranks for Group 1)}$

$W_2 = \sum_{j=1}^{n_2} R_j \quad \text{(sum of ranks for Group 2)}$

Verification (always check):

$W_1 + W_2 = \frac{N(N+1)}{2}$

3.2 The U Statistic (Mann-Whitney)

The U statistic counts the number of times a Group 1 observation exceeds a Group 2 observation across all $n_1 \times n_2$ possible pairwise comparisons:

$U_1 = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\mathbf{1}(x_{1i} > x_{2j}) + \frac{1}{2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\mathbf{1}(x_{1i} = x_{2j})$

Equivalent formulas using rank sums (computationally simpler):

$U_1 = W_1 - \frac{n_1(n_1+1)}{2}$

$U_2 = W_2 - \frac{n_2(n_2+1)}{2}$

Key verification:

$U_1 + U_2 = n_1 n_2$

The test statistic is:

$U = \min(U_1, U_2)$

For large-sample tests, it is more convenient to use $U_1$ directly, with the sign of the resulting $z$ statistic indicating the direction of the difference.
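The rank-sum route to $U_1$ and $U_2$ can be sketched as follows (a minimal illustration of Sections 3.1–3.2, not DataStatPro's implementation; the function name is ours):

```python
def mann_whitney_u(group1, group2):
    """Return (U1, U2) computed from pooled mid-ranks."""
    pooled = sorted(group1 + group2)
    rank = {}                               # distinct value -> mid-rank
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + j) / 2 + 1   # average rank of the tied run
        i = j + 1
    n1, n2 = len(group1), len(group2)
    w1 = sum(rank[x] for x in group1)       # rank sum for Group 1
    u1 = w1 - n1 * (n1 + 1) / 2             # U1 = W1 - n1(n1+1)/2
    u2 = n1 * n2 - u1                       # identity: U1 + U2 = n1*n2
    return u1, u2

print(mann_whitney_u([3, 7], [7, 9]))       # (0.5, 3.5)
```

The returned pair always sums to $n_1 n_2$, which is a convenient built-in sanity check.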

3.3 The Null Distribution of U

Under $H_0$ (the two populations are identical), the U statistic has a known exact distribution for small samples. The null distribution is symmetric about:

$\mu_U = \frac{n_1 n_2}{2}$

With variance:

$\sigma_U^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}$

Without ties: This formula is exact.

With ties: The variance must be corrected:

$\sigma_U^2 = \frac{n_1 n_2}{12}\left[(n_1+n_2+1) - \frac{\sum_{k=1}^{g}(t_k^3 - t_k)}{(n_1+n_2)(n_1+n_2-1)}\right]$

Where $g$ is the number of distinct tied groups and $t_k$ is the number of observations in the $k$-th tied group. The term $\sum_k(t_k^3 - t_k)$ is the tie correction factor.

3.4 The z-Approximation for Large Samples

For $n_1, n_2 \geq 10$ (or generally when exact tables are unavailable), the standardised U statistic is approximately standard normal:

Without continuity correction:

$z = \frac{U_1 - \mu_U}{\sigma_U} = \frac{U_1 - n_1 n_2/2}{\sqrt{n_1 n_2(N+1)/12}}$

With continuity correction (improves approximation for smaller samples):

$z_c = \frac{U_1 - n_1 n_2/2 \pm 0.5}{\sigma_U}$

Where $-0.5$ is used when $U_1 > \mu_U$ and $+0.5$ when $U_1 < \mu_U$.

With tie correction:

$z = \frac{U_1 - n_1 n_2/2}{\sqrt{\dfrac{n_1 n_2}{12}\left[(N+1) - \dfrac{\sum_k(t_k^3-t_k)}{N(N-1)}\right]}}$
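The full approximation, with both corrections, can be sketched in code (the function name and `tie_groups` argument are ours; `tie_groups` holds the sizes $t_k$ of the tied runs in the pooled sample):

```python
import math

def z_statistic(u1, n1, n2, tie_groups=(), continuity=True):
    """z-approximation for U (Section 3.4) with tie and continuity
    corrections. A sketch, not DataStatPro's implementation."""
    N = n1 + n2
    mu_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in tie_groups) / (N * (N - 1))
    sigma_u = math.sqrt(n1 * n2 / 12 * ((N + 1) - tie_term))
    diff = u1 - mu_u
    if continuity and diff != 0:
        diff -= 0.5 if diff > 0 else -0.5   # shrink |diff| by 0.5
    return diff / sigma_u

print(round(z_statistic(20, 10, 10), 2))    # -2.23
```

Note that the tie correction shrinks $\sigma_U$, so with ties the same $U_1$ yields a slightly larger $|z|$.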

3.5 Computing the Two-Tailed p-Value

Using the exact distribution (small samples, $n_j \leq 10$):

$p = 2 \times P(U \leq U_{obs}) \quad \text{(using exact tables or enumeration)}$

Using the z-approximation (large samples):

$p = 2 \times P(Z \geq |z|) = 2[1 - \Phi(|z|)]$

For one-tailed tests:

Upper tail ($H_1: P(X_1 > X_2) > 0.5$):

$p = P(Z \geq z)$

Lower tail ($H_1: P(X_1 > X_2) < 0.5$):

$p = P(Z \leq z)$

3.6 The Exact Computation via Pairwise Comparisons

The U statistic can also be computed directly by comparing all possible pairs of observations across the two groups. For each pair $(x_{1i}, x_{2j})$:

$S_{ij} = \begin{cases} 1 & \text{if } x_{1i} > x_{2j} \\ 0.5 & \text{if } x_{1i} = x_{2j} \\ 0 & \text{if } x_{1i} < x_{2j} \end{cases}$

$U_1 = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} S_{ij}$

This formulation makes the connection to the probability of superiority transparent:

$\widehat{PS} = \frac{U_1}{n_1 n_2}$

Under $H_0$, $E[\widehat{PS}] = 0.5$ (since $E[U_1] = n_1 n_2/2$).
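The pairwise formulation translates almost literally into code (an illustrative sketch; the function name is ours):

```python
def u1_pairwise(group1, group2):
    """U1 via all n1*n2 pairwise comparisons (Section 3.6). Each pair
    contributes 1 if the Group 1 value is larger, 0.5 if tied, else 0."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in group1 for b in group2)

g1, g2 = [3, 7], [7, 9]
u1 = u1_pairwise(g1, g2)
ps_hat = u1 / (len(g1) * len(g2))   # estimated probability of superiority
print(u1, ps_hat)                   # 0.5 0.125
```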

3.7 Critical Values for Small Samples

For small samples ($n_1, n_2 \leq 20$), compare $U = \min(U_1, U_2)$ to the critical value $U_{crit}$. Reject $H_0$ (two-tailed, $\alpha = .05$) if $U \leq U_{crit}$:

| $n_1$ | $n_2 = 5$ | $n_2 = 6$ | $n_2 = 7$ | $n_2 = 8$ | $n_2 = 10$ | $n_2 = 15$ | $n_2 = 20$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 5 | 2 | 5 | 6 | 8 | 11 | 20 | 27 |
| 6 | 5 | 7 | 8 | 10 | 14 | 24 | 34 |
| 7 | 6 | 8 | 11 | 13 | 17 | 28 | 39 |
| 8 | 8 | 10 | 13 | 15 | 20 | 33 | 45 |
| 10 | 11 | 14 | 17 | 20 | 27 | 42 | 59 |

Reject $H_0$ if $U \leq U_{crit}$. Values above are for $\alpha = .05$, two-tailed.

💡 DataStatPro computes exact p-values for all sample sizes using complete enumeration (for small samples) or the exact permutation distribution. The z-approximation is used only when exact computation is infeasible ($N > 200$).
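Complete enumeration can be sketched with the standard library (a minimal illustration of the idea, not DataStatPro's routine; the function name is ours, and it is feasible only for small $N$):

```python
from itertools import combinations

def exact_p_two_tailed(group1, group2):
    """Exact two-tailed p-value: enumerate every C(N, n1) assignment of
    the pooled observations and count those at least as extreme as the
    observed arrangement (distance of U1 from its null mean)."""
    def u1(g1, g2):
        return sum(1.0 if a > b else 0.5 if a == b else 0.0
                   for a in g1 for b in g2)

    pooled = list(group1) + list(group2)
    n1, n2 = len(group1), len(group2)
    mu = n1 * n2 / 2
    observed = abs(u1(group1, group2) - mu)
    hits = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(u1(g1, g2) - mu) >= observed - 1e-12:
            hits += 1
    return hits / total
```

For example, `exact_p_two_tailed([1, 2], [3, 4])` enumerates the $\binom{4}{2} = 6$ assignments and returns $2/6 \approx .333$.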


4. Assumptions of the Mann-Whitney U Test

4.1 Independence of Observations

All observations must be independent of each other, both within and across groups. No observation should influence or be influenced by any other.

Why it matters: Dependence between observations (e.g., measurements from the same participant appearing in both groups, or clustered observations) inflates the false positive rate and invalidates the null distribution of $U$.

How to check: Review the study design. Independence is a design property, not detectable from the data alone.

When violated: Use the Wilcoxon Signed-Rank Test for paired data. For nested or clustered data, use multilevel non-parametric methods.

4.2 Ordinal or Higher Scale of Measurement

The dependent variable must be at least ordinally scaled — observations must be meaningfully rankable. The Mann-Whitney U Test is therefore appropriate for ordinal data (e.g., Likert items, ratings) as well as for interval or ratio data whose distributions are non-normal.

When violated: If observations cannot be meaningfully ordered (i.e., the variable is truly nominal with no natural ordering), use the chi-squared test or Fisher's exact test instead.

4.3 Two Independent Groups

The test requires exactly two groups composed of different (independent) participants. Groups may have unequal sizes ($n_1 \neq n_2$), and the test remains valid.

When violated: For three or more independent groups, use the Kruskal-Wallis Test. For two related (paired) groups, use the Wilcoxon Signed-Rank Test.

4.4 The Location-Shift Assumption (for Median Interpretation)

This is the most commonly misunderstood assumption. The Mann-Whitney U Test tests:

Without the location-shift assumption: $H_0: P(X_1 > X_2) = 0.5$ (stochastic equality); $H_1: P(X_1 > X_2) \neq 0.5$ (stochastic dominance).

This is always valid under the independence and ordinal assumptions alone.

With the location-shift assumption (same distribution shape, just shifted): $H_0: \theta_1 = \theta_2$ (equal medians); $H_1: \theta_1 \neq \theta_2$ (unequal medians).

The location-shift assumption requires that the two population distributions have the same shape and spread — only their location (median) differs:

$F_2(x) = F_1(x - \Delta)$ for some shift $\Delta$

Why this matters: If the distributions differ in shape or spread (not just location), then a significant Mann-Whitney result may reflect differences in variability or distribution shape rather than a difference in central tendency. In this case, the Brunner-Munzel Test (Section 10) is more appropriate.

How to check: Compare the two groups' boxplots or histograms for similar shape and spread, and compare their IQRs; marked differences in spread or skew suggest the location-shift assumption does not hold.

4.5 No Assumption of Normality

Unlike the independent samples t-test, the Mann-Whitney U Test makes no normality assumption. This is its primary advantage and the most common reason for choosing it over the t-test.

4.6 Handling Ties

Ties (observations with identical values) reduce the power of the Mann-Whitney test slightly. The tie correction to the variance formula (Section 3.3) accounts for this. Excessive ties (e.g., more than 20% of observations tied) can reduce power substantially and should be noted in the methods section.

⚠️ When many ties are present, especially with small samples, the exact distribution of $U$ (rather than the normal approximation) should be used for p-values, as the normal approximation may be poor.

4.7 Assumption Summary Table

| Assumption | Required | How to Check | Remedy if Violated |
| :--- | :--- | :--- | :--- |
| Independence of observations | ✅ Yes | Study design review | Wilcoxon signed-rank (paired); multilevel methods (clustered) |
| Ordinal or higher scale | ✅ Yes | Measurement theory | Chi-squared (nominal outcome) |
| Two independent groups | ✅ Yes | Study design | Kruskal-Wallis ($k > 2$ groups); Wilcoxon signed-rank (paired) |
| Location-shift (for median interpretation) | ⚠️ Conditionally | Boxplots, IQR comparison | Brunner-Munzel test (unequal shapes) |
| Normality | ❌ Not required | — | — |
| Equal variances | ❌ Not required | — | — |

5. Variants and Related Tests

5.1 The Wilcoxon Rank-Sum Test

The Wilcoxon Rank-Sum Test and the Mann-Whitney U Test are two names for the same procedure. They differ only in which test statistic is reported: the Wilcoxon formulation reports the rank sum $W$, while the Mann-Whitney formulation reports $U$.

The relationship: $U_1 = W_1 - n_1(n_1+1)/2$

Both produce identical p-values. DataStatPro reports both $U$ and $W$ for completeness.

5.2 One-Tailed vs. Two-Tailed Tests

Two-tailed (default): $H_1: P(X_1 > X_2) \neq 0.5$. Use when the direction of the difference is not predicted in advance.

One-tailed (upper): $H_1: P(X_1 > X_2) > 0.5$. Use when specifically predicting that Group 1 tends to be larger. Halve the two-tailed p-value.

One-tailed (lower): $H_1: P(X_1 > X_2) < 0.5$. Use when specifically predicting that Group 1 tends to be smaller.

⚠️ One-tailed tests must be justified and pre-registered before data collection. Switching to one-tailed after observing the data direction is p-hacking.

5.3 The Brunner-Munzel Test

The Brunner-Munzel Test (Brunner & Munzel, 2000) is a robust alternative to the Mann-Whitney test when the location-shift assumption may be violated — that is, when the two distributions may differ in shape and spread, not just location.

It tests the same null hypothesis:

$H_0: P(X_1 > X_2) + 0.5 \cdot P(X_1 = X_2) = 0.5$

But uses separate within-group rankings to construct a test statistic that is valid regardless of whether the distribution shapes are equal.

The Brunner-Munzel statistic:

$t_{BM} = \frac{n_1 n_2 (\bar{R}_1^{(int)} - \bar{R}_2^{(int)})}{N\sqrt{n_1 \hat{S}_1^2 + n_2 \hat{S}_2^2}}$

Where $\bar{R}_j^{(int)}$ are internal mean ranks (each group is also ranked separately within itself, alongside the pooled ranking), and $\hat{S}_j^2$ are within-group variance estimates of the ranks.

Degrees of freedom are approximated using a Welch-Satterthwaite-type formula. DataStatPro reports the Brunner-Munzel test when the location-shift assumption appears violated.

5.4 Permutation Test Alternative

The permutation test (randomisation test) for two independent groups:

  1. Compute the observed $U$ statistic (or difference in means/medians).
  2. Randomly reassign all $N$ observations to two groups of sizes $n_1$ and $n_2$.
  3. Recompute $U^*$ for each permutation.
  4. The p-value is the proportion of permutations whose $U^*$ is at least as extreme as $U_{obs}$.

The permutation test is exact (no approximation needed), handles ties perfectly, and makes no distributional assumptions beyond exchangeability. DataStatPro offers this as an option under the Advanced Settings panel.
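The four steps above can be sketched as a Monte Carlo permutation test (a minimal illustration; the function name and the fixed `seed` are ours, and DataStatPro's own routine may differ):

```python
import random

def permutation_p(group1, group2, B=10_000, seed=1):
    """Monte Carlo permutation p-value for U: shuffle group labels B
    times and count statistics at least as extreme as the observed one."""
    def u1(g1, g2):
        return sum(1.0 if a > b else 0.5 if a == b else 0.0
                   for a in g1 for b in g2)

    rng = random.Random(seed)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    mu = n1 * len(group2) / 2
    observed = abs(u1(group1, group2) - mu)
    hits = 0
    for _ in range(B):
        rng.shuffle(pooled)
        if abs(u1(pooled[:n1], pooled[n1:]) - mu) >= observed - 1e-12:
            hits += 1
    return hits / B
```

With complete separation (e.g., `[1, 2, 3]` vs. `[7, 8, 9]`) the returned p-value settles near $2/\binom{6}{3} \cdot 10 = 0.1$, the exact two-tailed value.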


6. Using the Mann-Whitney U Test Calculator Component

The Mann-Whitney U Test Calculator component in DataStatPro provides a comprehensive tool for conducting, diagnosing, visualising, and reporting the Mann-Whitney U Test and its alternatives.

Step-by-Step Guide

Step 1 — Select the Test

From the "Non-Parametric Tests" menu, select "Mann-Whitney U Test (Independent Samples)". DataStatPro will also display Wilcoxon Rank-Sum notation alongside for software compatibility.

Step 2 — Input Method

Choose how to provide the data:

💡 Always use raw data when available. The rank-based computation is automatic, and raw data enable exact p-values, assumption checks, and visualisation of the full distribution.

Step 3 — Specify the Alternative Hypothesis

Step 4 — Select the p-Value Method

Step 5 — Select the Continuity Correction

For the normal approximation:

Step 6 — Select Effect Size Options

Step 7 — Select Display Options

Step 8 — Run the Analysis

Click "Run Mann-Whitney U Test". DataStatPro will:

  1. Pool and rank all observations with tie correction.
  2. Compute $U_1$, $U_2$, $W_1$, $W_2$.
  3. Compute the exact or approximate p-value.
  4. Compute $r_{rb}$, $\hat{PS}$, and their 95% CIs.
  5. Compute the Hodges-Lehmann median difference estimate.
  6. Generate all selected visualisations.
  7. Generate an APA-compliant results paragraph.

7. Exact vs. Approximate Methods

7.1 When to Use Exact Methods

The exact Mann-Whitney distribution enumerates all possible arrangements of $N$ observations into two groups of sizes $n_1$ and $n_2$ and computes the proportion of these that yield a $U$ statistic as extreme as the observed value. This is computationally intensive but exact.

Use exact methods when samples are small (roughly $n_1, n_2 \leq 10$), when the data contain many ties, or when an exact p-value is required for reporting.

The total number of equally likely arrangements under $H_0$:

$\binom{N}{n_1} = \frac{N!}{n_1!\, n_2!}$

For $n_1 = n_2 = 10$: $\binom{20}{10} = 184{,}756$ arrangements — computationally feasible for exact enumeration.
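This count is straightforward to verify with the standard library:

```python
import math

# Number of equally likely group assignments under H0 for n1 = n2 = 10
print(math.comb(20, 10))  # 184756
```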

7.2 The Normal Approximation — When Is It Adequate?

The normal approximation is adequate when both group sizes are at least 10 and ties are not excessive.

Accuracy of the approximation: The approximation error for the p-value is $O(1/N)$, meaning it improves as sample size increases.

7.3 The Permutation Approach

The permutation approach avoids the normal approximation entirely by directly estimating the null distribution from the data. It is exact up to Monte Carlo error, handles ties naturally, and assumes only exchangeability under $H_0$.

With $B = 10{,}000$ permutations, the Monte Carlo standard error of the p-value estimate is $\sqrt{p(1-p)/B} \leq \sqrt{0.25/10000} = 0.005$ — adequate for most purposes. DataStatPro uses $B = 100{,}000$ by default for higher precision.


8. Effect Sizes for the Mann-Whitney U Test

8.1 The Rank-Biserial Correlation (rrbr_{rb})

The rank-biserial correlation is the standard effect size for the Mann-Whitney U Test. It directly measures the probability of superiority on a standardised scale from $-1$ to $+1$.

Formula from U statistics:

$r_{rb} = \frac{U_1 - U_2}{n_1 n_2} = \frac{U_1}{n_1 n_2} - \frac{U_2}{n_1 n_2}$

Equivalently, in magnitude:

$|r_{rb}| = 1 - \frac{2U}{n_1 n_2} \quad \text{(where } U = \min(U_1, U_2)\text{)}$

Or, when $U_1$ is the statistic for Group 1:

$r_{rb} = \frac{2U_1}{n_1 n_2} - 1$

Formula from mean ranks:

$r_{rb} = \frac{\bar{R}_1 - \bar{R}_2}{(n_1+n_2)/2}$

Where $\bar{R}_j = W_j/n_j$ is the mean rank of group $j$.

Formula from the z-statistic:

$r_{rb} = \frac{2 z\, \sigma_U}{n_1 n_2} = z\sqrt{\frac{N+1}{3\, n_1 n_2}}$

For equal group sizes this reduces to approximately $z\sqrt{4/(3N)}$, slightly larger than the familiar approximation $r = z/\sqrt{N}$ used for the point-biserial correlation.
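The formulas in this section all agree numerically, which is easy to check (illustrative values only; the function name is ours):

```python
def rank_biserial(u1, n1, n2):
    """Rank-biserial correlation from U1 (Section 8.1)."""
    return 2 * u1 / (n1 * n2) - 1

# Three routes to the same number: from U1, from U1 - U2, and from PS-hat
n1, n2, u1 = 7, 6, 2.5
u2 = n1 * n2 - u1
r_from_u1 = rank_biserial(u1, n1, n2)
r_from_diff = (u1 - u2) / (n1 * n2)
ps_hat = u1 / (n1 * n2)               # probability of superiority
print(round(r_from_u1, 3))            # -0.881
```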

8.2 Interpreting the Rank-Biserial Correlation

| $r_{rb}$ | $\hat{PS}$ | Verbal Interpretation |
| :--- | :--- | :--- |
| 0.00 | 0.50 | No tendency; equally likely to exceed |
| 0.10 | 0.55 | Very small effect; Group 1 slightly higher |
| 0.20 | 0.60 | Small effect |
| 0.30 | 0.65 | Small-to-medium effect |
| 0.44 | 0.72 | Medium effect (Cohen's convention) |
| 0.50 | 0.75 | Medium-large effect |
| 0.64 | 0.82 | Large effect (Cohen's convention) |
| 0.80 | 0.90 | Very large effect |
| 1.00 | 1.00 | Perfect — every Group 1 obs. exceeds every Group 2 obs. |

Cohen's (1988) benchmarks for $r_{rb}$ (same as Pearson $r$):

| Label | $\lvert r_{rb} \rvert$ |
| :---- | :--------- |
| Small | 0.10 |
| Medium | 0.30 |
| Large | 0.50 |

⚠️ Cohen's benchmarks were not specifically developed for the rank-biserial correlation. Always contextualise effect sizes within your research domain and compare to typical effect sizes from meta-analyses in the same field.

8.3 The Probability of Superiority (PS^\hat{PS})

The probability of superiority is the most intuitive interpretation of the Mann-Whitney effect size:

$\hat{PS} = \frac{U_1}{n_1 n_2}$

Relationship to $r_{rb}$:

$\hat{PS} = \frac{r_{rb} + 1}{2}, \qquad r_{rb} = 2\hat{PS} - 1$

Interpretation: If $\hat{PS} = 0.75$, then in 75% of all possible pairings of one observation from Group 1 with one from Group 2, the Group 1 observation is larger.

Confidence interval for $\hat{PS}$ (using the Fisher $z$-transformation of $r_{rb}$):

$z_{r} = \operatorname{arctanh}(r_{rb}), \quad SE_{z_r} \approx \frac{1}{\sqrt{n_1 n_2 / 3}}$

95% CI:

$z_r \pm 1.96 \times SE_{z_r}$

Back-transform: $r_{rb} = \tanh(z_r)$; then $\hat{PS} = (r_{rb}+1)/2$.

8.4 The Hodges-Lehmann Estimator

The Hodges-Lehmann estimator $\hat{\Delta}$ is a robust, rank-based point estimate of the location shift between the two groups. It is the median of all possible pairwise differences:

$\hat{\Delta} = \operatorname{median}\{x_{1i} - x_{2j} : i = 1,\ldots,n_1;\; j = 1,\ldots,n_2\}$

Confidence interval for $\hat{\Delta}$: Use the exact Mann-Whitney distribution to determine which order statistics of the pairwise differences form the CI bounds.

The Hodges-Lehmann estimator is reported by DataStatPro alongside the Mann-Whitney U test as a meaningful, robust measure of the magnitude of the location shift in the original measurement units.
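The point estimate itself is a one-liner with the standard library (a sketch using the Section 11 pain-relief data; the exact CI requires the order-statistic procedure of Section 9.2):

```python
import statistics

def hodges_lehmann(group1, group2):
    """Hodges-Lehmann shift estimate: the median of all n1*n2
    pairwise differences x1 - x2 (Section 8.4)."""
    diffs = [x - y for x in group1 for y in group2]
    return statistics.median(diffs)

print(hodges_lehmann([3, 6, 5, 8, 4, 7, 5], [7, 9, 8, 10, 9, 8]))  # -3.0
```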

8.5 Comparing Effect Sizes Across Studies

When comparing Mann-Whitney effect sizes to t-test effect sizes from other studies, use the following conversions:

$r_{rb}$ to Cohen's $d$ (approximate, under normality and equal group sizes):

$d \approx \frac{2 r_{rb}}{\sqrt{1-r_{rb}^2}}$

Or more precisely, using the relationship $r_{rb} \approx r_{pb}$ (point-biserial $r$):

$d = \frac{2r}{\sqrt{1-r^2}}$

Cohen's $d$ to $r_{rb}$:

$r_{rb} \approx \frac{d}{\sqrt{d^2+4}}$ (for equal group sizes)

⚠️ These conversions assume normality for the $d$-to-$r$ direction and may not hold for non-normal data. Use conversions with caution and clearly state the assumption.
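The two conversions are algebraic inverses of each other, which is worth checking when implementing them (a minimal sketch; function names are ours):

```python
import math

def d_from_r(r):
    """Cohen's d from r (approximate; assumes normality, equal n)."""
    return 2 * r / math.sqrt(1 - r ** 2)

def r_from_d(d):
    """r from Cohen's d (equal group sizes)."""
    return d / math.sqrt(d ** 2 + 4)

# Round-trip: converting r -> d -> r recovers the original value
print(round(d_from_r(0.5), 3))  # 1.155
```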


9. Confidence Intervals

9.1 Confidence Interval for the Rank-Biserial Correlation

The 95% CI for $r_{rb}$ uses the Fisher $z$-transformation:

$z_{r_{rb}} = \operatorname{arctanh}(r_{rb}) = \frac{1}{2}\ln\!\left(\frac{1+r_{rb}}{1-r_{rb}}\right)$

Standard error (approximate):

$SE_{z} \approx \frac{1}{\sqrt{n_1 n_2/3}}$

A more precise standard error accounting for group sizes:

$SE_{z} = \sqrt{\frac{n_1+n_2+1}{3 n_1 n_2}}$

95% CI in $z$ space:

$\left[z_{r_{rb}} - 1.96 \times SE_z,\; z_{r_{rb}} + 1.96 \times SE_z\right]$

Back-transform to the $r_{rb}$ scale:

$r_{rb,L} = \tanh(z_{r_{rb}} - 1.96 \times SE_z), \qquad r_{rb,U} = \tanh(z_{r_{rb}} + 1.96 \times SE_z)$
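The whole procedure fits in a few lines (a sketch using the more precise, group-size-aware standard error; the function name is ours):

```python
import math

def r_rb_ci(r_rb, n1, n2, z_crit=1.96):
    """Fisher-z confidence interval for the rank-biserial correlation
    (Section 9.1), with SE = sqrt((n1 + n2 + 1) / (3 * n1 * n2))."""
    z_r = math.atanh(r_rb)
    se = math.sqrt((n1 + n2 + 1) / (3 * n1 * n2))
    return math.tanh(z_r - z_crit * se), math.tanh(z_r + z_crit * se)

lo, hi = r_rb_ci(0.45, 20, 20)
ps_lo, ps_hi = (lo + 1) / 2, (hi + 1) / 2   # CI for probability of superiority
print(round(lo, 2), round(hi, 2))            # 0.12 0.69
```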

9.2 Confidence Interval for the Hodges-Lehmann Estimator

The CI for $\hat{\Delta}$ is derived from the exact Mann-Whitney null distribution. The procedure:

  1. Order all $n_1 \times n_2$ pairwise differences $D_{(1)} \leq D_{(2)} \leq \cdots \leq D_{(n_1 n_2)}$.
  2. Find the critical value $C_{\alpha}$ from the Mann-Whitney distribution table: $C_{\alpha} = U_{crit,\,\alpha/2}$ (two-tailed).
  3. The 95% CI for $\hat{\Delta}$ is:

$\left[D_{(C_\alpha+1)},\; D_{(n_1 n_2 - C_\alpha)}\right]$

DataStatPro computes this exactly for $n_1 n_2 \leq 5{,}000$ and uses a normal approximation for larger datasets.

9.3 Confidence Interval for the Probability of Superiority

After computing the CI for $r_{rb}$ (Section 9.1):

$\hat{PS}_L = \frac{r_{rb,L}+1}{2}, \qquad \hat{PS}_U = \frac{r_{rb,U}+1}{2}$

Example: If $r_{rb} = 0.45$ with 95% CI $[0.18, 0.67]$, then $\hat{PS} = 0.725$ with 95% CI $[0.59, 0.835]$.


10. Advanced Topics

10.1 The Mann-Whitney Test as a Test of Stochastic Equality

Without the location-shift assumption, the Mann-Whitney test tests the general null:

$H_0: P(X_1 > X_2) + \frac{1}{2}P(X_1 = X_2) = \frac{1}{2}$

This null is called stochastic equality. It does not require equal medians, equal shapes, or any distributional assumption. The alternative:

$H_1: P(X_1 > X_2) + \frac{1}{2}P(X_1 = X_2) \neq \frac{1}{2}$

This is the most general and defensible interpretation of the Mann-Whitney U Test.

Practical implication: If Group 1 has a higher median but larger spread, and Group 2 has a lower median but smaller spread, the distributions may overlap substantially and $P(X_1 > X_2)$ may be close to 0.5 even though the medians differ — the Mann-Whitney test (correctly) may not detect a significant difference.

10.2 Asymptotic Relative Efficiency Across Distributions

The ARE of the Mann-Whitney test relative to the t-test depends on the true underlying distribution:

| Distribution | ARE (Mann-Whitney vs. t-test) |
| :--- | :--- |
| Normal | $3/\pi \approx 0.955$ |
| Uniform | $1.000$ |
| Logistic | $\pi^2/9 \approx 1.097$ |
| Double exponential (Laplace) | $1.500$ |
| Cauchy (heavy-tailed) | $\infty$ |
| Contaminated normal | Often $> 1.5$ |

For heavy-tailed distributions — common in psychology (reaction times), medicine (survival times), and economics (income) — the Mann-Whitney test is substantially more powerful than the t-test.

10.3 Sample Size and Power for the Mann-Whitney Test

Power of the Mann-Whitney test under a location-shift alternative with effect size $r_{rb}$ can be approximated using the ARE relationship:

$n_{MW} \approx \frac{n_t}{ARE} \approx \frac{\pi}{3}\, n_t$ (for normal data)

For non-normal data, the required $n$ for the Mann-Whitney test is obtained from the normal approximation to the distribution of $U$ under the alternative. The standardised shift must satisfy:

$z_{1-\alpha/2} + z_{1-\beta} = (PS - 0.5)\sqrt{\frac{12\, n_1 n_2}{N+1}}$

Solving for $n$ per group (equal group sizes, $n_1 = n_2 = n$, where $12n^2/(N+1) \approx 6n$):

$n \approx \frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{6\,(PS - 0.5)^2}$ (Noether's approximation)

Required $n$ per group (80% power, $\alpha = .05$, two-tailed, equal group sizes):

| $\lvert r_{rb} \rvert$ | Label | $n$ per group (Mann-Whitney) | $n$ per group (t-test) |
| :--- | :--- | :--- | :--- |
| 0.10 | Small | 414 | 394 |
| 0.20 | Small | 99 | 97 |
| 0.30 | Medium | 44 | 43 |
| 0.44 | Medium | 21 | 20 |
| 0.50 | Large | 16 | 15 |
| 0.64 | Large | 10 | 9 |

Note: The Mann-Whitney test requires approximately 5% more observations than the t-test under normality, consistent with the ARE of $3/\pi \approx 0.955$.
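Noether's approximation can be sketched with the standard library (the function name is ours; different planning methods, including t-test-based conversions, can give somewhat different values than this formula):

```python
import math
from statistics import NormalDist

def n_per_group(ps, alpha=0.05, power=0.80):
    """Per-group n for the Mann-Whitney test via Noether's approximation
    (two-tailed, equal group sizes). `ps` is the anticipated probability
    of superiority; a planning sketch, not DataStatPro's exact routine."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1 - alpha/2}
    z_b = NormalDist().inv_cdf(power)           # z_{1 - beta}
    return math.ceil((z_a + z_b) ** 2 / (6 * (ps - 0.5) ** 2))

print(n_per_group(0.75))  # 21
```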

10.4 Rank-Based Post-Hoc Comparisons After Kruskal-Wallis

When the Kruskal-Wallis test (the non-parametric ANOVA equivalent) is significant, pairwise Mann-Whitney U tests are conducted as post-hoc comparisons with appropriate FWER correction (Bonferroni, Holm, or Dunn's test). The effect size for each comparison is $r_{rb}$.

Each pairwise comparison uses only the two groups being compared (not the full ranked dataset from the Kruskal-Wallis test), though using the full-dataset ranks is also acceptable and provides a consistent ranking across comparisons.

10.5 Comparing the Mann-Whitney and Kolmogorov-Smirnov Tests

Both the Mann-Whitney and the two-sample Kolmogorov-Smirnov (KS) test are non-parametric tests for comparing two independent groups. Key differences:

| Property | Mann-Whitney U | Kolmogorov-Smirnov |
| :--- | :--- | :--- |
| Tests | Stochastic dominance / location shift | Any distributional difference |
| Sensitive to | Location differences | Location, spread, and shape differences |
| Power (location shifts) | ✅ Higher | ❌ Lower |
| Power (spread/shape differences) | ❌ Lower | ✅ Higher |
| Effect size | $r_{rb}$, $\hat{PS}$ | No standard effect size |
| Handles ties | With correction | Poorly (assumes continuous data) |

Use Mann-Whitney when you are specifically interested in whether one group tends to have higher values (location shift). Use Kolmogorov-Smirnov when you want a general test of whether the two distributions differ in any way.

10.6 Bootstrap Confidence Intervals for $r_{rb}$

For small samples or when the Fisher $z$-approximation may be imprecise, DataStatPro offers bootstrap CIs for $r_{rb}$:

  1. Draw $B = 10{,}000$ bootstrap samples (resample with replacement separately from Group 1 and Group 2).
  2. Compute $r_{rb}^*$ for each bootstrap sample.
  3. The 95% bootstrap CI is given by the 2.5th and 97.5th percentiles of the $B$ bootstrap values.

The bias-corrected and accelerated (BCa) bootstrap CI is preferred over the simple percentile method for small $n$.
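The percentile procedure in steps 1–3 can be sketched in a few lines of pure Python (an illustrative sketch, not DataStatPro's implementation, which additionally offers BCa intervals):

```python
import random

def rank_biserial(x, y):
    """r_rb = (U_1 - U_2)/(n1*n2); here U_1 counts pairings in which the
    Group 2 value exceeds the Group 1 value (this guide's U convention)."""
    n1, n2 = len(x), len(y)
    u1 = sum((yj > xi) + 0.5 * (yj == xi) for xi in x for yj in y)
    return (u1 - (n1 * n2 - u1)) / (n1 * n2)

def bootstrap_ci_rrb(x, y, n_boot=10_000, alpha=0.05, seed=1):
    """Percentile bootstrap CI: resample within each group, recompute r_rb."""
    rng = random.Random(seed)
    stats = sorted(
        rank_biserial([rng.choice(x) for _ in x], [rng.choice(y) for _ in y])
        for _ in range(n_boot)
    )
    return (stats[int(alpha / 2 * n_boot)],
            stats[int((1 - alpha / 2) * n_boot) - 1])

# Worked Example 1 data (Techniques A and B):
a = [3, 6, 5, 8, 4, 7, 5]
b = [7, 9, 8, 10, 9, 8]
print(rank_biserial(a, b))        # point estimate ≈ 0.881
print(bootstrap_ci_rrb(a, b))
```

Resampling is done separately within each group so that the bootstrap respects the two-independent-samples design.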

10.7 Reporting the Mann-Whitney U Test According to APA 7th Edition

Minimum required reporting elements:

  1. Test statistic: $U =$ [value]
  2. p-value: $p =$ [value] (exact or approximation — specify which)
  3. Effect size with 95% CI: $r_{rb} =$ [value] [95% CI: LB, UB]
  4. Medians and IQR (or full range) per group
  5. Whether exact or asymptotic p-value was used
  6. Tie correction: whether applied and number of ties
  7. Which alternative hypothesis was tested (two-tailed or directional)

APA template:

"A Mann-Whitney U test revealed [a significant / no significant] difference in [DV] between [Group 1] (Mdn = , IQR = ) and [Group 2] (Mdn = , IQR = ), $U =$ [value], $z =$ [value], $p =$ [value], $r_{rb} =$ [value] [95% CI: LB, UB], indicating a [small / medium / large] effect."


11. Worked Examples

Example 1: Small Sample with Exact p-Value — Pain Relief Ratings

A physiotherapist compares pain relief ratings (0 = no relief, 10 = complete relief) for two manual therapy techniques. Normality is violated (Shapiro-Wilk: $p < .05$).

Data:

| Technique A ($n_1 = 7$) | Technique B ($n_2 = 6$) |
| :--- | :--- |
| $3, 6, 5, 8, 4, 7, 5$ | $7, 9, 8, 10, 9, 8$ |

Step 1 — Pool and rank all $N = 13$ observations:

Sorted values (rank in parentheses): $3\,(1),\; 4\,(2),\; 5\,(3.5),\; 5\,(3.5),\; 6\,(5),\; 7\,(6.5),\; 7\,(6.5),\; 8\,(9),\; 8\,(9),\; 8\,(9),\; 9\,(11.5),\; 9\,(11.5),\; 10\,(13)$

| Obs | Group | Rank |
| :--- | :--- | :--- |
| 3 | A | 1.0 |
| 4 | A | 2.0 |
| 5 | A | 3.5 |
| 5 | A | 3.5 |
| 6 | A | 5.0 |
| 7 | A | 6.5 |
| 7 | B | 6.5 |
| 8 | A | 9.0 |
| 8 | B | 9.0 |
| 8 | B | 9.0 |
| 9 | B | 11.5 |
| 9 | B | 11.5 |
| 10 | B | 13.0 |

Step 2 — Rank sums:

$W_A = 1.0+2.0+3.5+3.5+5.0+6.5+9.0 = 30.5$

$W_B = 6.5+9.0+9.0+11.5+11.5+13.0 = 60.5$

Check: $30.5 + 60.5 = 91 = 13 \times 14/2$ ✓

Step 3 — U statistics:

$U_A = 7 \times 6 + \frac{7 \times 8}{2} - 30.5 = 42 + 28 - 30.5 = 39.5$

$U_B = 7 \times 6 + \frac{6 \times 7}{2} - 60.5 = 42 + 21 - 60.5 = 2.5$

Check: $39.5 + 2.5 = 42 = n_1 n_2$ ✓

$U = \min(39.5, 2.5) = 2.5$

Step 4 — Tie correction and z-statistic:

$\mu_U = 7 \times 6 / 2 = 21$

Ties: value 5 ($t=2$), 7 ($t=2$), 8 ($t=3$), 9 ($t=2$):

$\sum_k(t_k^3-t_k) = (8-2)+(8-2)+(27-3)+(8-2) = 6+6+24+6 = 42$

$\sigma_U^2 = \frac{7 \times 6}{12}\left[(13+1) - \frac{42}{13 \times 12}\right] = \frac{42}{12}\left[14 - \frac{42}{156}\right] = 3.5[14 - 0.269] = 3.5 \times 13.731 = 48.059$

$\sigma_U = \sqrt{48.059} = 6.933$

$z = \frac{U_B - \mu_U}{\sigma_U} = \frac{2.5 - 21}{6.933} = \frac{-18.5}{6.933} = -2.669$

(The sign depends on which $U$ is standardised: using $U_A$ instead gives $z = +2.669$, and $|z| = 2.669$ either way. Here $U_B$ is used, so the negative $z$ reflects that Technique B's rank sum is higher than expected under $H_0$.)

Two-tailed: $p = 2 \times P(Z \leq -2.669) = 2 \times .0038 = .008$

Exact p-value (DataStatPro): $p_{exact} = .006$
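The exact p-value can be reproduced by brute-force enumeration: reassign the 13 tie-averaged ranks to the two groups in every possible way and count the arrangements at least as extreme as the observed $U = 2.5$. A minimal sketch (equivalent in spirit to, but not a copy of, DataStatPro's exact routine):

```python
from itertools import combinations
from math import comb

def average_ranks(values):
    """Ranks of `values` in their original order, ties averaged."""
    sorted_vals = sorted(values)
    rank_of = {}
    for v in set(values):
        positions = [i + 1 for i, s in enumerate(sorted_vals) if s == v]
        rank_of[v] = sum(positions) / len(positions)
    return [rank_of[v] for v in values]

a = [3, 6, 5, 8, 4, 7, 5]   # Technique A (n1 = 7)
b = [7, 9, 8, 10, 9, 8]     # Technique B (n2 = 6)
n1, n2 = len(a), len(b)
n = n1 + n2
ranks = average_ranks(a + b)

def u_min(ranks_b):
    """min(U_A, U_B) for a given assignment of ranks to group B."""
    u_b = n1 * n2 + n2 * (n2 + 1) / 2 - sum(ranks_b)
    return min(u_b, n1 * n2 - u_b)

observed = u_min(ranks[n1:])          # 2.5
count = sum(
    u_min([ranks[i] for i in idx]) <= observed
    for idx in combinations(range(n), n2)
)
p_exact = count / comb(n, n2)         # 10 of 1716 arrangements ≈ .006
```

With only $\binom{13}{6} = 1716$ arrangements, complete enumeration is instant; for larger samples a random-permutation test replaces full enumeration.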

Step 5 — Effect size:

$r_{rb} = 1 - \frac{2U}{n_1 n_2} = 1 - \frac{2 \times 2.5}{42} = 1 - 0.119 = 0.881$

(Positive $r_{rb}$: Group B tends to have higher values)

$\hat{PS} = U_A/(n_1 n_2) = 39.5/42 = 0.940$

(the probability that a randomly chosen Technique B rating exceeds a randomly chosen Technique A rating; under this guide's convention, $U_A$ counts the pairings in which the Technique B value is higher)

95% CI for $r_{rb}$:

$z_r = \text{arctanh}(0.881) = 1.375$

$SE_z = \sqrt{(7+6+1)/(3 \times 7 \times 6)} = \sqrt{14/126} = \sqrt{0.1111} = 0.333$

95% CI: $1.375 \pm 1.96 \times 0.333 = [0.722, 2.028]$

$r_{rb,L} = \tanh(0.722) = 0.619, \quad r_{rb,U} = \tanh(2.028) = 0.967$

Hodges-Lehmann estimator $\hat{\Delta}$:

All $7 \times 6 = 42$ pairwise differences (Technique A − Technique B) are computed.

Median of these 42 differences $= -3.0$ (Technique A scores are typically 3 points lower)

Summary:

| Statistic | Value | Interpretation |
| :--- | :--- | :--- |
| Technique A Median (IQR) | $5.0$ ($3.5$–$7.5$) | Lower ratings |
| Technique B Median (IQR) | $8.5$ ($7.75$–$9.25$) | Higher ratings |
| $U$ | $2.5$ | |
| $z$ | $-2.669$ | |
| $p$ (exact) | $.006$ | Significant at $\alpha = .05$ |
| $r_{rb}$ | $0.881$ | Very large effect |
| 95% CI for $r_{rb}$ | $[0.619, 0.967]$ | |
| $\hat{PS}$ | $94.0\%$ | |
| Hodges-Lehmann $\hat{\Delta}$ | $-3.0$ points | A is 3 points lower |

APA write-up: "A Mann-Whitney U test (exact) was conducted to compare pain relief ratings between Technique A ($n = 7$, Mdn $= 5.0$, IQR $= 3.5$–$7.5$) and Technique B ($n = 6$, Mdn $= 8.5$, IQR $= 7.75$–$9.25$). Technique B produced significantly higher ratings, $U = 2.5$, $p = .006$ (exact), $r_{rb} = 0.88$ [95% CI: 0.62, 0.97], indicating a very large effect. The Hodges-Lehmann estimator indicated a median difference of 3.0 points (Technique B higher)."


Example 2: Larger Sample with Normal Approximation — Reaction Times

A cognitive psychologist compares simple reaction times (ms) between neurotypical adults ($n_1 = 25$) and adults with ADHD ($n_2 = 22$). Data are positively skewed (Shapiro-Wilk $p < .01$ for the ADHD group).

Summary statistics (pre-computed):

$W_1 = 498$ (neurotypical), $W_2 = 630$ (ADHD), $N = 47$

Check: $W_1 + W_2 = 498 + 630 = 1128 = 47 \times 48/2$ ✓

Neurotypical: Mdn $= 248$ ms, IQR $= 218$–$274$ ms; ADHD: Mdn $= 291$ ms, IQR $= 265$–$318$ ms

Step 1 — U statistics:

$U_1 = 25 \times 22 + \frac{25 \times 26}{2} - 498 = 550 + 325 - 498 = 377$

$U_2 = 25 \times 22 + \frac{22 \times 23}{2} - 630 = 550 + 253 - 630 = 173$

Check: $377 + 173 = 550 = n_1 n_2$ ✓

$U = \min(377, 173) = 173$

Step 2 — z-statistic (no ties assumed for this example):

$\mu_U = 550/2 = 275$

$\sigma_U = \sqrt{\frac{25 \times 22 \times 48}{12}} = \sqrt{\frac{26400}{12}} = \sqrt{2200} = 46.90$

$z = \frac{U_1 - \mu_U}{\sigma_U} = \frac{377 - 275}{46.90} = \frac{102}{46.90} = 2.175$

(Here $U_1$ — the neurotypical group's $U$, which counts the pairings in which the ADHD value is higher — is standardised, so the positive sign reflects the ADHD group's tendency toward longer reaction times.)

Two-tailed: $p = 2 \times P(Z \geq 2.175) = 2 \times .015 = .030$

Step 3 — Effect size:

With Group 1 = neurotypical and Group 2 = ADHD (as above):

$r_{rb} = \frac{U_1 - U_2}{n_1 n_2} = \frac{377 - 173}{550} = \frac{204}{550} = 0.371$

Direction: under this guide's convention, $U_1$ counts the pairings in which the Group 2 (ADHD) value exceeds the Group 1 (neurotypical) value, so a positive $r_{rb}$ indicates that the ADHD group tends to have higher (slower) reaction times. For clear directionality, always state which group is Group 1 and which is Group 2 — swapping the labels reverses the sign.

$|r_{rb}| = 0.371$ — a medium-to-large effect (exceeding Cohen's medium benchmark of 0.30).

$\hat{PS} = U_1/(n_1 n_2) = 377/550 = 0.685$

Interpretation: in 68.5% of all possible pairings, the neurotypical participant has a faster (lower) reaction time than the ADHD participant.

95% CI for $r_{rb}$:

$z_r = \text{arctanh}(0.371) = 0.389$

$SE_z = \sqrt{48/(3 \times 550)} = \sqrt{0.02909} = 0.1706$

95% CI: $0.389 \pm 1.96 \times 0.171 = [0.054, 0.724]$

$r_{rb,L} = \tanh(0.054) = 0.054, \quad r_{rb,U} = \tanh(0.724) = 0.619$
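The Fisher-transform interval computed above generalises to any $r_{rb}$ and group sizes; a minimal sketch (the function name `rrb_ci` is illustrative):

```python
import math

def rrb_ci(r_rb, n1, n2, z_crit=1.96):
    """Fisher z-transform CI for the rank-biserial correlation:
    the interval in z-space is arctanh(r) +/- z_crit * SE_z with
    SE_z = sqrt((N+1)/(3*n1*n2)), then back-transformed with tanh."""
    zr = math.atanh(r_rb)
    se = math.sqrt((n1 + n2 + 1) / (3 * n1 * n2))
    return math.tanh(zr - z_crit * se), math.tanh(zr + z_crit * se)

lo, hi = rrb_ci(0.371, 25, 22)   # Example 2: about (0.055, 0.619)
```

Because $\tanh$ maps back into $(-1, 1)$, the resulting bounds can never fall outside the admissible range of a correlation, unlike a naive symmetric interval around $r_{rb}$.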

Summary:

| Statistic | Value |
| :--- | :--- |
| Neurotypical: Mdn (IQR) | $248$ ms ($218$–$274$) |
| ADHD: Mdn (IQR) | $291$ ms ($265$–$318$) |
| $U$ | $173$ |
| $z$ | $2.175$ |
| $p$ (two-tailed, approx.) | $.030$ |
| $r_{rb}$ | $0.371$ (medium-large) |
| 95% CI for $r_{rb}$ | $[0.054, 0.619]$ |
| $\hat{PS}$ (neurotypical faster) | $68.5\%$ |

APA write-up: "A Mann-Whitney U test was conducted to compare reaction times between neurotypical adults ($n = 25$, Mdn $= 248$ ms, IQR $= 218$–$274$) and adults with ADHD ($n = 22$, Mdn $= 291$ ms, IQR $= 265$–$318$). Adults with ADHD showed significantly longer reaction times, $U = 173$, $z = 2.18$, $p = .030$, $r_{rb} = 0.37$ [95% CI: 0.05, 0.62], indicating a medium-to-large effect. In 68.5% of all possible pairings, a neurotypical participant was faster than an ADHD participant."


Example 3: Interpreting a Non-Significant Result

A researcher tests whether customer satisfaction ratings differ between two service delivery formats (in-person vs. online; $n_1 = n_2 = 30$; 5-point scale).

Given: $U = 408$, $n_1 = n_2 = 30$, $N = 60$.

$\mu_U = 30 \times 30/2 = 450$

$\sigma_U = \sqrt{30 \times 30 \times 61/12} = \sqrt{4575} = 67.64$

$z = (408 - 450)/67.64 = -42/67.64 = -0.621$

$p = 2 \times P(Z \leq -0.621) = 2 \times .267 = .534$

$U_2 = n_1 n_2 - U_1 = 900 - 408 = 492$, so $r_{rb} = (408 - 492)/(30 \times 30) = -84/900 = -0.093$

95% CI for $r_{rb}$:

$z_r = \text{arctanh}(-0.093) = -0.093$, $SE_z = \sqrt{61/(3 \times 900)} = \sqrt{0.0226} = 0.150$

95% CI in $z$: $-0.093 \pm 1.96 \times 0.150 = [-0.388, 0.201]$ → $r_{rb}$ CI: $[-0.370, 0.198]$

Equivalence test: With bounds $r_{rb} = \pm 0.20$ (trivially small effect), the 90% CI for $r_{rb}$ is $[-0.327, 0.153]$. The upper bound lies within $+0.20$, but the lower bound $(-0.327)$ falls outside $(-0.20)$ — equivalence cannot be concluded at these bounds. Increase $n$ for a more powerful equivalence test.

Interpretation: The test is not significant ($p = .534$) and the point estimate is trivially small ($|r_{rb}| = 0.093$), but the 95% CI $[-0.370, 0.198]$ is wide, spanning from a modest negative effect to a small positive effect. The result is consistent with a null effect but does not establish one; a formal equivalence test with $n = 100+$ per group would provide more definitive evidence.

APA write-up: "A Mann-Whitney U test found no significant difference in satisfaction ratings between in-person ($n = 30$) and online ($n = 30$) formats, $U = 408$, $z = -0.62$, $p = .534$, $r_{rb} = -0.09$ [95% CI: $-0.37$, $0.20$]. The small point estimate but wide confidence interval indicate that any true difference is likely small, though a meaningful difference cannot be ruled out; an equivalence test with a larger sample would be required to formally establish the absence of a meaningful difference."
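Example 3's statistics can be reproduced from $U$ and the group sizes alone (no-ties variance), with the normal CDF built from `math.erf`. A minimal sketch — the function name is illustrative, not a DataStatPro API:

```python
import math

def mw_summary_from_u(u1, n1, n2):
    """z, two-tailed normal-approximation p, and r_rb from U_1 alone."""
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # no-ties variance
    z = (u1 - mu) / sigma
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    p = 2 * (1 - phi)                                  # two-tailed
    r_rb = (u1 - (n1 * n2 - u1)) / (n1 * n2)           # (U1 - U2)/(n1 n2)
    return z, p, r_rb

z, p, r_rb = mw_summary_from_u(408, 30, 30)   # Example 3: z ≈ -0.62, p ≈ .53
```

This is exactly why reporting $U$ with both sample sizes suffices for readers to recompute $z$, $p$, and $r_{rb}$ — provided ties were negligible.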


12. Common Mistakes and How to Avoid Them

Mistake 1: Using the Mann-Whitney Test for Paired Data

Problem: Applying the Mann-Whitney U Test to pre-post or matched-pairs data as if the groups were independent. This ignores the within-pair correlation, produces an inflated error term, and substantially reduces power. It also violates the independence assumption.

Solution: For paired or matched data, use the Wilcoxon Signed-Rank Test — the non-parametric equivalent of the paired t-test. Verify whether data represent independent groups (different participants) or related measurements (same participants or matched pairs) before choosing the test.


Mistake 2: Interpreting the Mann-Whitney Test as Always Testing Medians

Problem: Claiming that "the Mann-Whitney test compares medians" without acknowledging that this interpretation requires the location-shift assumption (equal distribution shapes). When distributions differ in shape or spread, the test may be significant even when medians are equal, or non-significant when medians differ considerably.

Solution: State the null hypothesis precisely: "The Mann-Whitney U test tests whether the probability that a randomly selected observation from Group 1 exceeds a randomly selected observation from Group 2 is 0.5." Only invoke median interpretation when the location-shift assumption is plausible and checked.


Mistake 3: Reporting Only U and p Without an Effect Size

Problem: Reporting $U = 23$, $p = .03$ without the rank-biserial correlation or probability of superiority. Like the t-test, the Mann-Whitney test is influenced by sample size — a significant result says nothing about the magnitude of the effect.

Solution: Always report $r_{rb}$ (and/or $\hat{PS}$) with 95% CI alongside $U$ and $p$. DataStatPro computes these automatically. Small-sample significant results with small $r_{rb}$ should be interpreted cautiously; non-significant results with a sizeable $r_{rb}$ (e.g., $0.40$) may indicate insufficient power.


Mistake 4: Defaulting to Mann-Whitney When t-Test Assumptions Are Met

Problem: Using the Mann-Whitney test "to be safe" when the independent t-test's assumptions are fully satisfied (normal data, no severe outliers, equal variances). The Mann-Whitney test sacrifices approximately 5% power under normality — a real but small cost.

Solution: Run the Shapiro-Wilk test and inspect Q-Q plots. If normality holds (and sample sizes are adequate), use the independent t-test (Welch's version) for slightly greater power and the ability to report intuitive mean differences. Reserve Mann-Whitney for genuinely non-normal or ordinal data.


Mistake 5: Not Applying Tie Correction

Problem: Using the uncorrected variance formula $\sigma_U^2 = n_1 n_2(N+1)/12$ when ties are present. This overestimates $\sigma_U$, producing a z-statistic that is too small and a p-value that is too large — making the test conservative.

Solution: Always apply the tie correction to the variance when ties are present. DataStatPro applies this automatically. Report the number and proportion of tied observations in the methods section.


Mistake 6: Reporting Means Instead of Medians for Mann-Whitney

Problem: Reporting group means alongside a Mann-Whitney test result. Since the test is rank-based and makes no assumptions about means, reporting means as the primary descriptive statistic is inconsistent with the test's rationale.

Solution: Report medians and IQRs (interquartile ranges) as the primary descriptive statistics alongside Mann-Whitney results. Means can be additionally reported as secondary information if useful, clearly labelled as supplementary.


Mistake 7: Confusing $U_1$ and $U_2$, Leading to Sign Errors in $r_{rb}$

Problem: Confusing which $U$ statistic belongs to which group. With this guide's definition $U_j = n_1 n_2 + n_j(n_j+1)/2 - W_j$, the sign convention is $r_{rb} = (U_1-U_2)/(n_1 n_2) > 0$ when Group 2 tends to be larger and $< 0$ when Group 1 tends to be larger; swapping $U_1$ and $U_2$ reverses the sign.

Solution: Always clearly label which group is Group 1 and which is Group 2 before computing. State the direction of the effect in the results: "Group X tended to have higher values than Group Y."


Mistake 8: Using Mann-Whitney for More Than Two Groups

Problem: Running multiple pairwise Mann-Whitney tests across three or more groups without a prior omnibus test and without FWER correction. This inflates the familywise Type I error rate.

Solution: For three or more independent groups, run the Kruskal-Wallis Test as the omnibus test first. Only if significant, conduct pairwise Mann-Whitney or Dunn's tests with Bonferroni or Holm FWER correction.


Mistake 9: Ignoring the Directionality of a Significant Result

Problem: Reporting a significant Mann-Whitney result without stating which group had higher rankings. A significant $U$ does not tell you direction — you must inspect the rank sums or mean ranks to determine which group tends to be higher.

Solution: Always report mean ranks $(\bar{R}_1, \bar{R}_2)$ or medians alongside $U$, so the direction is unambiguous. Check: if $\bar{R}_1 > \bar{R}_2$, Group 1 tends to have higher values (with this guide's $U$ convention, this corresponds to $r_{rb} = (U_1 - U_2)/(n_1 n_2) < 0$).


Mistake 10: Treating a Non-Significant Mann-Whitney as Proof of Equal Distributions

Problem: Concluding from $p > .05$ that the two populations are identical (or that the medians are equal). A non-significant result indicates insufficient evidence against $H_0$ — not evidence for $H_0$.

Solution: Report $r_{rb}$ and its 95% CI. A non-significant result with a wide CI (e.g., $r_{rb} = 0.15$ [95% CI: $-0.10$, $0.38$]) indicates low power, not a true null effect. Conduct an equivalence test with pre-specified bounds to positively establish that the effect is negligibly small.


13. Troubleshooting

| Problem | Likely Cause | Solution |
| :--- | :--- | :--- |
| $U_1 + U_2 \neq n_1 n_2$ | Calculation error in $U$ or $W$ | Verify $W_1 + W_2 = N(N+1)/2$ first; recompute $U$ from corrected $W$ |
| $W_1 + W_2 \neq N(N+1)/2$ | Ranking error; ties not averaged | Recheck averaging of tied ranks; verify all observations are ranked |
| $p_{exact} \neq p_{approx}$ noticeably | Small $n$ or many ties | Use exact $p$ for $n < 15$ per group; report exact if available |
| $r_{rb} > 1.0$ or $< -1.0$ | Computational error | Check that $U_1, U_2$ sum to $n_1 n_2$; verify direction of subtraction |
| Test seems too conservative ($z$ smaller than expected) | $\sigma_U$ overestimated because the tie correction was not applied | Apply tie correction; use exact p-value |
| Mann-Whitney significant but t-test not | Non-normality causing t-test to lose power; or different hypotheses | Trust Mann-Whitney for non-normal data; they test different things |
| t-test significant but Mann-Whitney not | Outlier driving mean difference but not systematically affecting ranks | Investigate outlier; report both with explanation |
| Many ties ($> 30\%$) | Coarse measurement scale (e.g., 5-point Likert) | Use tie-corrected variance; report tie proportion; consider ordinal regression |
| Hodges-Lehmann $\hat{\Delta} = 0$ but $r_{rb} \neq 0$ | Distributions differ in shape/spread but not location shift | Report $\hat{PS}$ rather than $\hat{\Delta}$; location-shift assumption may be violated |
| Brunner-Munzel and Mann-Whitney give different conclusions | Distribution shapes differ; location-shift violated | Use Brunner-Munzel as more appropriate; report both |
| Exact p-value computation takes too long | $N$ too large for enumeration | Switch to permutation test or normal approximation with tie correction |
| Negative $U$ value | Formula error ($U$ cannot be negative) | Re-examine formula; $U = n_1 n_2 + n_j(n_j+1)/2 - W_j \geq 0$ always |
| Bootstrap CI very wide | Small $n$ | Report wide CI as reflecting genuine uncertainty; collect more data |
| $r_{rb}$ from z-formula $\neq$ $r_{rb}$ from $U$ formula | Formula approximation discrepancy | Use $r_{rb} = (U_1-U_2)/(n_1 n_2)$ as primary; the z-based formula is only approximate |

14. Quick Reference Cheat Sheet

Core Formulas

| Formula | Description |
| :--- | :--- |
| $W_j = \sum_{i=1}^{n_j} R_i$ | Rank sum for group $j$ |
| $W_1 + W_2 = N(N+1)/2$ | Verification check |
| $U_1 = n_1 n_2 + n_1(n_1+1)/2 - W_1$ | $U$ for Group 1 |
| $U_2 = n_1 n_2 + n_2(n_2+1)/2 - W_2$ | $U$ for Group 2 |
| $U_1 + U_2 = n_1 n_2$ | Verification check |
| $U = \min(U_1, U_2)$ | Test statistic |
| $\mu_U = n_1 n_2/2$ | Mean of $U$ under $H_0$ |
| $\sigma_U^2 = n_1 n_2(N+1)/12$ | Variance of $U$ (no ties) |
| $\sigma_U^2 = \frac{n_1 n_2}{12}\left[(N+1) - \frac{\sum_k(t_k^3-t_k)}{N(N-1)}\right]$ | Variance (with tie correction) |
| $z = (U_1 - \mu_U)/\sigma_U$ | Standardised statistic |
| $p = 2[1-\Phi(\lvert z \rvert)]$ | Two-tailed p-value |
| $\hat{PS} = U_1/(n_1 n_2)$ | Probability of superiority, $P(X_2 > X_1)$ under the $U_1$ definition above |
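The core formulas above chain together into a complete calculation. The sketch below runs them end to end and reproduces Worked Example 1 ($U = 2.5$, $|z| = 2.669$); it is illustrative only, not DataStatPro's implementation:

```python
import math
from collections import Counter

def mann_whitney(x, y):
    """U = min(U1, U2) and tie-corrected z for two independent samples."""
    n1, n2, n = len(x), len(y), len(x) + len(y)
    pooled = sorted(x + y)
    rank_of = {}                      # tie-averaged rank of each value
    for v in set(pooled):
        pos = [i + 1 for i, s in enumerate(pooled) if s == v]
        rank_of[v] = sum(pos) / len(pos)
    w1 = sum(rank_of[v] for v in x)                   # rank sum, group 1
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - w1
    u2 = n1 * n2 - u1
    mu = n1 * n2 / 2
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - mu) / math.sqrt(var)    # sign follows U1, as in the table
    return min(u1, u2), z

u, z = mann_whitney([3, 6, 5, 8, 4, 7, 5], [7, 9, 8, 10, 9, 8])
```

Note that `tie_term` is zero when all pooled values are distinct, in which case the variance reduces to the no-ties formula in the table.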

Effect Size Formulas

| Formula | Description |
| :--- | :--- |
| $r_{rb} = (U_1 - U_2)/(n_1 n_2)$ | Rank-biserial correlation (primary) |
| $r_{rb} = 1 - 2U/(n_1 n_2)$ | Magnitude from $U = \min(U_1,U_2)$ |
| $r_{rb} = (\bar{R}_2-\bar{R}_1)/((n_1+n_2)/2)$ | From mean ranks |
| $r_{rb} \approx 2z/\sqrt{3N}$ | Approximate from $z$ (equal groups, no ties) |
| $\hat{PS} = U_1/(n_1 n_2)$ | Probability Group 2 > Group 1 (this guide's $U_1$) |
| $\hat{PS} = (r_{rb}+1)/2$ | Convert $r_{rb}$ to PS |
| $r_{rb} = 2\hat{PS}-1$ | Convert PS to $r_{rb}$ |
| $\hat{\Delta} = \text{median}\{x_{1i}-x_{2j}\}$ | Hodges-Lehmann estimator |
| $z_r = \text{arctanh}(r_{rb})$ | Fisher $z$-transform of $r_{rb}$ |
| $SE_z = \sqrt{(N+1)/(3n_1 n_2)}$ | SE for CI of $r_{rb}$ |

Conversions Between Effect Sizes

| From | To | Formula |
| :--- | :--- | :--- |
| $r_{rb}$ | Cohen's $d$ | $d \approx 2r_{rb}/\sqrt{1-r_{rb}^2}$ |
| Cohen's $d$ | $r_{rb}$ | $r_{rb} \approx d/\sqrt{d^2+4}$ (equal groups) |
| $r_{rb}$ | $\hat{PS}$ | $\hat{PS} = (r_{rb}+1)/2$ |
| $\hat{PS}$ | $r_{rb}$ | $r_{rb} = 2\hat{PS}-1$ |
| $z$ | $r_{rb}$ | $r_{rb} \approx 2z/\sqrt{3N}$ (equal groups, no ties) |
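These conversions are one-liners; the two Cohen's-$d$ formulas are exact algebraic inverses of each other. A sketch with illustrative function names:

```python
import math

def rrb_to_d(r):   return 2 * r / math.sqrt(1 - r * r)   # r_rb -> Cohen's d
def d_to_rrb(d):   return d / math.sqrt(d * d + 4)       # equal group sizes
def rrb_to_ps(r):  return (r + 1) / 2                    # r_rb -> PS-hat
def ps_to_rrb(ps): return 2 * ps - 1                     # PS-hat -> r_rb

# Round trip through d recovers r_rb up to float rounding:
print(d_to_rrb(rrb_to_d(0.371)))
```

The PS conversions are linear and therefore lossless; the $d$ conversions remain approximations of the underlying effect whenever the data are non-normal, even though they invert each other exactly.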

Cohen's Benchmarks for $r_{rb}$

| Label | $\lvert r_{rb} \rvert$ | $\hat{PS}$ |
| :--- | :--- | :--- |
| Negligible | $< 0.10$ | $0.45$–$0.55$ |
| Small | $0.10$ | $0.55$ |
| Medium | $0.30$ | $0.65$ |
| Large | $0.50$ | $0.75$ |
| Very large | $0.70$ | $0.85$ |

Required $n$ per Group (80% Power, $\alpha = .05$, Two-Tailed)

| $\lvert r_{rb} \rvert$ | Label | Mann-Whitney | t-Test (if normal) |
| :--- | :--- | :--- | :--- |
| 0.10 | Small | 414 | 394 |
| 0.20 | Small | 99 | 97 |
| 0.30 | Medium | 44 | 43 |
| 0.44 | Medium | 21 | 20 |
| 0.50 | Large | 16 | 15 |
| 0.64 | Large | 10 | 9 |

Decision Guide: Mann-Whitney vs. Alternatives

| Situation | Test |
| :--- | :--- |
| Two independent groups, non-normal or ordinal | Mann-Whitney U |
| Two independent groups, normal, equal variances | Independent t-test (or Welch's) |
| Two independent groups, unequal shapes/spreads | Brunner-Munzel test |
| Two paired/related groups, non-normal | Wilcoxon Signed-Rank test |
| Three or more independent groups, non-normal | Kruskal-Wallis test |
| Two independent groups, very small $n$ ($< 5$ per group) | Fisher's exact (binary), exact Mann-Whitney |
| General distributional difference (not just location) | Kolmogorov-Smirnov test |

Tie Correction Reference

| Proportion of Ties | Impact on $\sigma_U$ | Recommended p-Value Method |
| :--- | :--- | :--- |
| $< 10\%$ | Negligible | Standard or tie-corrected approximation |
| $10\%$–$25\%$ | Moderate reduction | Tie-corrected approximation |
| $> 25\%$ | Substantial reduction | Exact p-value; permutation test |

APA Reporting Template

"A Mann-Whitney U test [with / without] continuity correction [with exact / asymptotic p-value] was conducted to compare [DV] between [Group 1] (Mdn = [value], IQR = [LB]–[UB], $n =$ ) and [Group 2] (Mdn = [value], IQR = [LB]–[UB], $n =$ ). [Group X] had significantly [higher / lower] [DV] than [Group Y], $U =$ [value], $z =$ [value], $p =$ [value], $r_{rb} =$ [value] [95% CI: LB, UB], indicating a [small / medium / large] effect. In [PS%] of all possible pairings, a [Group X] observation exceeded a [Group Y] observation."

Assumption Checks Checklist

| Check | Method | Action if Violated |
| :--- | :--- | :--- |
| Independence | Study design review | Wilcoxon SR (paired); multilevel methods (clustered) |
| Ordinal scale | Measurement review | Chi-squared (nominal); ordinal regression |
| Two independent groups | Design review | Kruskal-Wallis ($K > 2$); Wilcoxon SR (paired) |
| Location-shift assumption | Boxplots; IQR comparison | Brunner-Munzel test |
| Excessive ties ($> 25\%$) | Count ties | Exact p-value; permutation test; tie-corrected variance |

Mann-Whitney Reporting Checklist

| Item | Required |
| :--- | :--- |
| $U$ statistic | ✅ Always |
| $z$-statistic (if approximation used) | ✅ When applicable |
| Whether exact or asymptotic p-value | ✅ Always |
| Exact p-value | ✅ Preferred for $n < 15$ per group |
| $r_{rb}$ with 95% CI | ✅ Always |
| Probability of superiority ($\hat{PS}$) | ✅ Recommended |
| Medians and IQRs per group | ✅ Always |
| Sample sizes per group | ✅ Always |
| Direction of the effect | ✅ Always |
| Tie correction applied | ✅ When ties present |
| Number/proportion of tied observations | ✅ When ties substantial |
| Hodges-Lehmann estimator | ✅ Recommended |
| Whether two-tailed or directional | ✅ Always |
| Normality violation justification | ✅ When used instead of t-test |

This tutorial provides a comprehensive foundation for understanding, conducting, and reporting the Mann-Whitney U Test within the DataStatPro application. For further reading, consult Mann & Whitney's (1947) original paper "On a test of whether one of two random variables is stochastically larger than the other" (Annals of Mathematical Statistics), Wilcoxon's (1945) foundational paper, Hollander, Wolfe & Chicken's "Nonparametric Statistical Methods" (3rd ed., 2014) for rigorous theoretical coverage, Conover's "Practical Nonparametric Statistics" (3rd ed., 1999) for applied guidance, and Brunner & Munzel's (2000) "The Nonparametric Behrens-Fisher Problem" (Biometrical Journal) for the robust alternative when distribution shapes differ. For the rank-biserial correlation as an effect size, see Kerby (2014) in the Comprehensive Psychology journal. For feature requests or support, contact the DataStatPro team.