
Wilcoxon Signed Rank Test

Comprehensive reference guide for the Wilcoxon Signed-Rank Test (non-parametric alternative to the paired t-test).

Wilcoxon Signed-Rank Test: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of non-parametric inference all the way through the mathematics, assumptions, variants, effect sizes, interpretation, reporting, and practical usage of the Wilcoxon Signed-Rank Test within the DataStatPro application. Whether you are encountering the Wilcoxon Signed-Rank Test for the first time or seeking a rigorous understanding of rank-based within-subjects comparison, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is the Wilcoxon Signed-Rank Test?
  3. The Mathematics Behind the Wilcoxon Signed-Rank Test
  4. Assumptions of the Wilcoxon Signed-Rank Test
  5. Variants of the Wilcoxon Signed-Rank Test
  6. Using the Wilcoxon Signed-Rank Test Calculator Component
  7. Full Step-by-Step Procedure
  8. Effect Sizes for the Wilcoxon Signed-Rank Test
  9. Confidence Intervals
  10. Power Analysis and Sample Size Planning
  11. Advanced Topics
  12. Worked Examples
  13. Common Mistakes and How to Avoid Them
  14. Troubleshooting
  15. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into the Wilcoxon Signed-Rank Test, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 Parametric vs. Non-Parametric Inference

Parametric tests (such as the paired t-test) make specific assumptions about the shape of the population distribution — typically that data are drawn from a normally distributed population. Their test statistics are derived from distributional assumptions, and their validity depends on how well those assumptions are met.

Non-parametric tests (also called distribution-free tests) do not assume a specific parametric form for the population distribution. Instead, they are based on the ranks of the data rather than the raw values themselves. Because ranks carry less information than raw values, non-parametric tests are generally less powerful than their parametric counterparts when parametric assumptions are met — but they can be more powerful when those assumptions are violated.

The Wilcoxon Signed-Rank Test is the leading non-parametric alternative to the paired t-test for comparing two related conditions when the normality of difference scores cannot be assumed.

1.2 The Concept of Ranks

Ranking transforms raw data values into their relative order positions. Given a set of values $\{x_1, x_2, \ldots, x_n\}$, each value receives its position in the sorted order, from rank 1 (smallest) to rank $n$ (largest); tied values share the average (midrank) of the positions they occupy.

Example:

| Value | Rank |
| --- | --- |
| 3.1 | 1 |
| 5.4 | 2.5 (midrank of ranks 2 and 3) |
| 5.4 | 2.5 (midrank of ranks 2 and 3) |
| 7.9 | 4 |
| 12.1 | 5 |

Ranking discards information about the precise magnitude of differences between values (e.g., whether the gap between ranks 1 and 2 is 0.1 or 100 units) but preserves the ordinal information (which values are larger or smaller). This makes rank-based tests robust to extreme values and non-normal distributions.
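The midrank rule can be sketched in a few lines of Python (a minimal illustration; in practice `scipy.stats.rankdata`, whose default method is the midrank, does the same job):

```python
def midranks(values):
    """Rank 1..n by sorted position; tied values share the average
    (midrank) of the positions they jointly occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # average of positions i+1..j+1
        i = j + 1
    return ranks

print(midranks([3.1, 5.4, 5.4, 7.9, 12.1]))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```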

1.3 Ordinal, Interval, and Ratio Scales

The level of measurement determines which statistical tests are appropriate:

| Scale | Properties | Examples | Appropriate Summaries |
| --- | --- | --- | --- |
| Nominal | Categories only | Gender, blood type | Mode, frequencies |
| Ordinal | Ordered categories; unequal intervals | Likert items, pain ratings, ranks | Median, percentiles |
| Interval | Equal intervals; no true zero | Temperature (°C), IQ scores | Mean, SD |
| Ratio | Equal intervals; true zero | Height, weight, reaction time | Mean, SD, ratios |

The Wilcoxon Signed-Rank Test is appropriate for ordinal data and for interval/ratio data that violate the normality assumption of the paired t-test.

1.4 The Median as a Measure of Central Tendency

The median is the value that divides the distribution into two equal halves — 50% of observations fall below it and 50% above it. Unlike the mean, the median is robust to outliers and skewness and remains well-defined for ordinal data.

The pseudo-median (also called the Hodges-Lehmann estimator) is the median of all pairwise averages $(d_i + d_j)/2$ for $i \leq j$, including each observation paired with itself.

1.5 Signed Ranks: Combining Magnitude and Direction

The Wilcoxon Signed-Rank Test uniquely combines two pieces of information from difference scores:

  1. Magnitude: How large is each difference, relative to the others? (Captured by the rank of the absolute difference.)
  2. Direction: Is each difference positive or negative? (Captured by the sign attached to the rank.)

By ranking absolute differences and then restoring the sign, the test gives more weight to large differences than to small ones — unlike the sign test, which ignores magnitude entirely. This is why the Wilcoxon Signed-Rank Test is more powerful than the sign test.

1.6 The Null and Alternative Hypotheses

The Wilcoxon Signed-Rank Test operates under the following hypotheses:

Under the symmetry assumption:

$H_0$: The population of difference scores is symmetrically distributed about zero.

$H_1$: The population of difference scores is not symmetrically distributed about zero.

Equivalently (under symmetry):

$H_0$: pseudo-median of differences $= 0$

Without the symmetry assumption (more general interpretation):

$H_0$: $P(d_i > 0) = P(d_i < 0) = 0.5$

(The probability of a positive difference equals the probability of a negative difference.)

Directional alternatives:

$H_1$: $P(d_i > 0) > P(d_i < 0)$ (upper one-tailed)

$H_1$: $P(d_i > 0) < P(d_i < 0)$ (lower one-tailed)

1.7 The Asymptotic Relative Efficiency

The Asymptotic Relative Efficiency (ARE) of a non-parametric test relative to its parametric counterpart quantifies the relative sample sizes needed to achieve the same power as $n \to \infty$.

For the Wilcoxon Signed-Rank Test vs. the paired t-test:

$ARE = \dfrac{3}{\pi} \approx 0.955$ (for normally distributed data)

This means that for normally distributed data, the Wilcoxon test requires approximately $1/0.955 \approx 1.047$ times as many observations as the paired t-test to achieve the same power — a loss of only about 5%. In exchange for this small efficiency cost, the Wilcoxon test gains robustness to non-normality.

For non-normal distributions, the Wilcoxon test can be substantially more efficient than the t-test:

| Distribution | ARE (Wilcoxon vs. t-test) |
| --- | --- |
| Normal | $3/\pi \approx 0.955$ |
| Uniform | $1.000$ |
| Double exponential (Laplace) | $1.500$ |
| Logistic | $\pi^2/9 \approx 1.097$ |
| Contaminated normal (10% outliers) | $> 2.000$ |
| Heavy-tailed distributions | Can be very large |

💡 For data that are approximately normal, using the Wilcoxon test costs you only 5% efficiency. For data with heavy tails or outliers, the Wilcoxon test can dramatically outperform the t-test. This asymmetry makes the Wilcoxon test a safe default when normality is uncertain.

1.8 Type I Error, Power, and the Role of Sample Size

The Wilcoxon test achieves nearly identical power to the paired t-test for normal data and superior power for non-normal data, making it a generally safe and efficient choice for paired comparisons.


2. What is the Wilcoxon Signed-Rank Test?

2.1 The Core Idea

The Wilcoxon Signed-Rank Test (Wilcoxon, 1945) is a non-parametric inferential procedure for testing whether two related conditions (measured on the same participants or matched pairs) have the same distribution. It is the non-parametric alternative to the paired t-test when the assumption of normally distributed difference scores cannot be met.

Rather than working with raw difference scores and computing means and standard deviations (as the paired t-test does), the Wilcoxon test:

  1. Computes the absolute values of the difference scores $|d_i|$.
  2. Ranks the absolute differences from smallest to largest.
  3. Restores the sign of each difference to its rank.
  4. Computes the sum of the positive ranks $W^+$ and the sum of the negative ranks $W^-$ as the test statistics.
  5. Evaluates whether $W^+$ and $W^-$ are sufficiently different from what would be expected by chance if $H_0$ were true.

Under $H_0$, positive and negative differences should be roughly equally common and roughly equally large — so $W^+$ and $W^-$ should be approximately equal (each approximately $n(n+1)/4$). Large discrepancies between $W^+$ and $W^-$ provide evidence against $H_0$.
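The steps above can be sketched directly in Python; the before/after data are hypothetical and chosen to include one zero difference and one tie in $|d|$:

```python
def wilcoxon_rank_sums(x1, x2):
    """Steps 1-4 of the outline above: difference scores, drop zeros,
    midrank |d|, then sum the ranks separately by sign of d."""
    d = [a - b for a, b in zip(x1, x2) if a != b]   # drop zero differences
    abs_d = [abs(v) for v in d]
    order = sorted(range(len(d)), key=lambda i: abs_d[i])
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs_d[order[j + 1]] == abs_d[order[i]]:
            j += 1
        for k in range(i, j + 1):                   # midrank for ties in |d|
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    return w_plus, w_minus

# Hypothetical before/after scores for six participants
before = [12, 15, 9, 14, 11, 10]
after = [10, 14, 11, 9, 8, 10]
wp, wm = wilcoxon_rank_sums(before, after)
print(wp, wm)  # 12.5 2.5; check: wp + wm = 15 = n'(n'+1)/2 with n' = 5
```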

2.2 When to Use the Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is the appropriate choice when two related measurements are taken on the same participants (or matched pairs) and the difference scores are ordinal, non-normally distributed, or contaminated by outliers.

2.3 The Wilcoxon Signed-Rank Test vs. Related Procedures

| Situation | Appropriate Test |
| --- | --- |
| Two related conditions, differences normal | Paired t-test (preferred for power) |
| Two related conditions, differences non-normal | Wilcoxon Signed-Rank Test |
| Two related conditions, only direction of difference known | Sign Test (less powerful) |
| One group vs. known value, non-normal | Wilcoxon Signed-Rank (one-sample version) |
| Three or more related conditions, non-normal | Friedman Test |
| Two independent groups, non-normal | Mann-Whitney U Test |
| Two related conditions, Bayesian non-parametric | Bayesian Signed-Rank Test |

2.4 The Wilcoxon Signed-Rank Test vs. the Sign Test

The Wilcoxon Signed-Rank Test and the Sign Test are both non-parametric tests for paired data, but they differ in the information they use:

| Property | Wilcoxon Signed-Rank | Sign Test |
| --- | --- | --- |
| Information used | Rank of $\vert d_i \vert$ plus its sign | Sign of $d_i$ only |
| Requires rankable differences | ✅ Yes | ❌ No |
| Power | Higher | Lower |
| Robustness to outliers | High | Very high |
| ARE vs. t-test (normal data) | 0.955 | 0.637 |
| Suitable when only direction known | ❌ No | ✅ Yes |

The Wilcoxon test is preferred over the sign test in virtually all circumstances where the absolute magnitude of differences can be ranked, because it makes better use of the available information.

2.5 Two Versions: Paired and One-Sample

The Wilcoxon Signed-Rank Test has two closely related applications:

Paired version: Compare two related conditions. Compute $d_i = x_{1i} - x_{2i}$ for each pair, then apply the test to the difference scores.

One-sample version: Test whether a single sample's population median (or pseudo-median) equals a hypothesised value $\theta_0$. Compute $d_i = x_i - \theta_0$ and apply the test to these adjusted values.

Both versions are mathematically identical — they differ only in how the difference scores are constructed.


3. The Mathematics Behind the Wilcoxon Signed-Rank Test

3.1 Computing Difference Scores

Paired version: For $n$ pairs $(x_{1i}, x_{2i})$, $i = 1, \ldots, n$:

$$d_i = x_{1i} - x_{2i}$$

One-sample version: For $n$ observations $x_i$ tested against $\theta_0$:

$$d_i = x_i - \theta_0$$

3.2 Handling Zero Differences

Pairs where $d_i = 0$ (exactly) are excluded from the analysis because they carry no information about the direction of an effect. Let $n'$ denote the number of non-zero differences remaining after exclusion. All subsequent steps use $n'$.

⚠️ A large number of zero differences substantially reduces the effective sample size and thus statistical power. This is most common with coarsely measured ordinal scales (e.g., 5-point Likert items). If more than 20% of differences are zero, interpret results with particular caution and consider reporting the number of zero differences explicitly.

3.3 Ranking the Absolute Differences

Rank the absolute values $|d_i|$ from smallest (rank 1) to largest (rank $n'$).

For tied absolute values, assign the average (midrank) of the ranks they would have occupied:

If three observations are tied at the 4th, 5th, and 6th positions, each receives rank $(4+5+6)/3 = 5$.

Notation: Let $R_i$ denote the rank assigned to $|d_i|$.

3.4 Computing the Test Statistics $W^+$ and $W^-$

Restore the original sign to each rank:

Sum of positive ranks (ranks corresponding to $d_i > 0$):

$$W^+ = \sum_{\{i:\; d_i > 0\}} R_i$$

Sum of negative ranks (ranks corresponding to $d_i < 0$):

$$W^- = \sum_{\{i:\; d_i < 0\}} R_i$$

Verification check:

$$W^+ + W^- = \frac{n'(n'+1)}{2}$$

This provides an arithmetic check: if $W^+ + W^-$ does not equal $n'(n'+1)/2$, there is a computational error.

Under $H_0$, the expected values are:

$$E[W^+] = E[W^-] = \frac{n'(n'+1)}{4}$$

3.5 The Test Statistic $W$

The conventional test statistic is:

$$W = \min(W^+, W^-)$$

Small values of $W$ (far below the expected value $n'(n'+1)/4$) provide evidence against $H_0$.

Alternatively, many software implementations report $W^+$ directly (or $T^+$), with the p-value computed from the appropriate tail of the sampling distribution.

DataStatPro reports both $W^+$ and $W^-$, highlights the minimum, and computes exact and asymptotic p-values.

3.6 Exact Distribution (Small Samples, $n' \leq 25$)

For small samples without ties, the exact null distribution of $W^+$ can be enumerated: under $H_0$, each of the $2^{n'}$ possible sign assignments is equally likely, giving $W^+$ a discrete distribution that can be tabulated exactly.

Exact p-value (two-tailed):

$$p = 2 \times \min\left[P(W^+ \leq W^+_{obs}),\; P(W^+ \geq W^+_{obs})\right]$$

DataStatPro always computes the exact p-value when $n' \leq 25$ and there are no (or few) ties, and automatically switches to the normal approximation for larger samples.

3.7 Normal Approximation (Large Samples, $n' > 25$)

For larger samples, $W^+$ is approximately normally distributed:

$$E[W^+] = \frac{n'(n'+1)}{4}$$

$$\mathrm{Var}[W^+] = \frac{n'(n'+1)(2n'+1)}{24}$$

z-statistic (without continuity correction):

$$z = \frac{W^+ - E[W^+]}{\sqrt{\mathrm{Var}[W^+]}} = \frac{W^+ - n'(n'+1)/4}{\sqrt{n'(n'+1)(2n'+1)/24}}$$

z-statistic (with continuity correction, more accurate for discrete distributions):

$$z_{cc} = \frac{|W^+ - E[W^+]| - 0.5}{\sqrt{\mathrm{Var}[W^+]}}$$

Two-tailed p-value:

$$p = 2 \times [1 - \Phi(|z|)]$$

Where $\Phi$ is the standard normal CDF.
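A minimal Python sketch of this approximation, using only the standard library ($\Phi$ via `math.erf`; tie correction is handled separately in the next section, and the input values below are hypothetical):

```python
import math

def wilcoxon_z_p(w_plus, n_eff, continuity=True):
    """Normal approximation for W+ (no tie correction):
    z from the mean/variance formulas above, two-tailed p = 2*(1 - Phi(|z|))."""
    mean = n_eff * (n_eff + 1) / 4
    var = n_eff * (n_eff + 1) * (2 * n_eff + 1) / 24
    num = abs(w_plus - mean) - (0.5 if continuity else 0.0)
    z = num / math.sqrt(var)
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # standard normal CDF at |z|
    return z, 2 * (1 - phi)

z, p = wilcoxon_z_p(w_plus=300, n_eff=30)  # hypothetical W+ and n'
```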

3.8 Tie Correction for the Variance

When there are tied absolute difference values, the variance formula must be corrected:

$$\mathrm{Var}_{corrected}[W^+] = \frac{n'(n'+1)(2n'+1)}{24} - \frac{\sum_{k=1}^{g}(t_k^3 - t_k)}{48}$$

Where $g$ is the number of groups of tied absolute differences and $t_k$ is the number of tied values in the $k$-th group.

The correction reduces the variance, increasing the z-statistic slightly and thus providing a more accurate p-value when ties are present.

Corrected z-statistic:

$$z_{tie} = \frac{W^+ - n'(n'+1)/4}{\sqrt{\mathrm{Var}_{corrected}[W^+]}}$$
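The tie-corrected variance can be computed directly from the multiset of absolute differences (a small sketch; `collections.Counter` tallies the tie-group sizes $t_k$):

```python
from collections import Counter

def tie_corrected_variance(abs_diffs):
    """Var[W+] under H0 with the tie correction above:
    n'(n'+1)(2n'+1)/24 - sum(t_k^3 - t_k)/48 over tie groups of |d|."""
    n = len(abs_diffs)
    base = n * (n + 1) * (2 * n + 1) / 24
    correction = sum(t ** 3 - t for t in Counter(abs_diffs).values()) / 48
    return base - correction

print(tie_corrected_variance([2, 1, 2, 5, 3]))  # 13.75 - 6/48 = 13.625
```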

3.9 The Exact Probability Under $H_0$: Deriving the Null Distribution

Under $H_0$, each non-zero difference score $d_i$ is equally likely to be positive or negative, independently of its magnitude. This means each of the $2^{n'}$ possible sign assignments to the $n'$ ranks is equally probable.

$W^+$ can take values from $0$ (all differences negative) to $n'(n'+1)/2$ (all differences positive). The probability of any specific value of $W^+$ is the number of sign assignments producing that value divided by $2^{n'}$.

Example for $n' = 4$ (ranks 1, 2, 3, 4; total $= 10$):

$W^+$ can range from 0 to 10. $P(W^+ = 0) = 1/16$ (all negative). $P(W^+ = 10) = 1/16$ (all positive). $P(W^+ = 5) = 2/16 = 0.125$ (two sign assignments, the positive-rank subsets $\{1, 4\}$ and $\{2, 3\}$, give $W^+ = 5$).
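This enumeration is easy to verify by brute force. The sketch below tabulates the exact distribution of $W^+$ for $n' = 4$ over all $2^4 = 16$ sign assignments:

```python
from itertools import product

# Enumerate all 2^4 equally likely sign assignments to ranks 1..4
# and tabulate the null distribution of W+.
n = 4
counts = {}
for signs in product([False, True], repeat=n):
    w_plus = sum(rank for rank, positive in zip(range(1, n + 1), signs) if positive)
    counts[w_plus] = counts.get(w_plus, 0) + 1

total = 2 ** n
print(counts[0] / total)   # 0.0625 = 1/16 (all negative)
print(counts[10] / total)  # 0.0625 = 1/16 (all positive)
print(counts[5] / total)   # 0.125  = 2/16 (subsets {1,4} and {2,3})
```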

3.10 Relationship Between Wilcoxon $W$ and the Mann-Whitney $U$

The Wilcoxon Signed-Rank statistic $W^+$ is algebraically related to the Mann-Whitney $U$ statistic. Specifically, for the one-sample or paired case (provided no Walsh average is exactly zero), $W^+$ counts the number of Walsh averages $(d_i + d_j)/2$ (for $i \leq j$) that are positive:

$$W^+ = \#\left\{(i,j) : i \leq j \text{ and } (d_i + d_j)/2 > 0\right\}$$

This connection to Walsh averages is the foundation of the Hodges-Lehmann estimator of the pseudo-median, which serves as the point estimate associated with the Wilcoxon test.
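Both facts can be checked numerically. The sketch below uses hypothetical difference scores (chosen with no zeros and no ties in $|d_i|$), counts positive Walsh averages, and takes their median as the Hodges-Lehmann estimate:

```python
import statistics

def walsh_averages(d):
    """All pairwise averages (d_i + d_j)/2 for i <= j,
    including each difference paired with itself."""
    n = len(d)
    return [(d[i] + d[j]) / 2 for i in range(n) for j in range(i, n)]

# Hypothetical difference scores: no zeros, no ties in |d|
d = [1.2, -0.5, 2.3, 0.8, -1.9, 3.1]
walsh = walsh_averages(d)

w_plus = sum(1 for a in walsh if a > 0)   # W+ as a count of positive Walsh averages
hl = statistics.median(walsh)             # Hodges-Lehmann pseudo-median estimate
print(len(walsh), w_plus, round(hl, 2))   # 21 16 0.9
```

For these data the rank-based computation also gives $W^+ = 16$ (ranks 3, 5, 2, 6 attach to the four positive differences), matching the Walsh-average count.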


4. Assumptions of the Wilcoxon Signed-Rank Test

4.1 Symmetry of the Difference Score Distribution

The Wilcoxon Signed-Rank Test's primary assumption is that the population distribution of difference scores is symmetric about its median (pseudo-median). This is weaker than the normality assumption of the paired t-test but is still a meaningful constraint.

Why symmetry matters: The test is designed so that, under $H_0$, positive and negative ranks of equal magnitude are equally likely. If the difference distribution is asymmetric, the test is not testing only the location of the median — it may also respond to the shape of the distribution. In that case, $H_0$ conflates "no location shift" with "symmetric distribution."

How to check: Inspect a histogram, boxplot, or Q-Q plot of the difference scores and examine their skewness.

When violated: If difference scores are severely asymmetric (heavily skewed in one direction), the Wilcoxon test's p-value may not correctly reflect only a location shift. In this case, consider the Sign Test, a symmetrising transformation, or a bootstrap approach.

⚠️ The symmetry assumption is often overlooked. A common error is applying the Wilcoxon Signed-Rank Test to heavily right-skewed difference scores (e.g., when data represent counts or reaction times with occasional very long responses) without checking symmetry. In such cases, the Sign Test or bootstrap methods are more appropriate.

4.2 Independence of Pairs

All pairs $(x_{1i}, x_{2i})$ must be independent of each other. That is, knowing the difference score for pair $i$ gives no information about the difference score for pair $j$ ($j \neq i$). Within each pair, the two measurements are of course dependent — this is the point of the paired design.

Common violations: the same participant contributing multiple pairs, clustered sampling (e.g., patients within clinics), and serial dependence in measurements collected over time.

When violated: Use multilevel models or time-series methods.

4.3 Continuous (or At Least Ordinal and Rankable) Differences

The test requires that the absolute differences can be meaningfully ranked — there must be a natural ordering of the magnitudes. This is satisfied whenever the data are measured on at least an ordinal scale with enough resolution that larger absolute differences genuinely represent larger changes.

When violated: If differences cannot be ranked (e.g., nominal categories), use the McNemar test (for binary outcomes) or other categorical tests.

4.4 Exchangeability Under $H_0$

Under $H_0$, the distribution of $d_i$ must be exchangeable with respect to sign: $d_i$ and $-d_i$ must have the same distribution. This is satisfied when the difference distribution is symmetric about zero.

This condition is equivalent to stating that the probability of a positive difference equals the probability of a negative difference of the same magnitude.

4.5 Absence of Excessive Ties

The Wilcoxon Signed-Rank Test is designed for continuous data where ties in absolute differences are rare. Excessive ties (especially many zero differences) can affect the accuracy of the p-value.

Types of ties: zero differences ($d_i = 0$, which tie with the null value) and ties among the non-zero absolute differences $|d_i|$.

How to check: Count the number of zero differences and the number of tied absolute differences. If more than 20–25% of differences are zero, the effective sample size is substantially reduced.

When excessive ties present: Use the exact permutation test version of the Wilcoxon test, which handles ties exactly. DataStatPro automatically applies the exact test with ties when $n' \leq 25$ and the standard tie-corrected approximation for larger samples.

4.6 Assumption Summary Table

| Assumption | Description | How to Check | Remedy if Violated |
| --- | --- | --- | --- |
| Symmetry of differences | $d_i$ distribution is symmetric about $\theta_0$ | Histogram, Q-Q plot, skewness of $d_i$ | Sign Test; transform data |
| Independence of pairs | Pairs are independent across observations | Design review | Multilevel model |
| Rankable differences | $d_i$ can be meaningfully ordered | Measurement-scale review | McNemar test (binary); other categorical tests |
| Exchangeability | $d_i$ and $-d_i$ have same distribution | Symmetry check | Sign Test; bootstrap |
| No excessive ties | Few zero or tied absolute differences | Count zeros and ties | Exact permutation test; sign test |

5. Variants of the Wilcoxon Signed-Rank Test

5.1 Paired Version (Two-Condition Comparison)

The paired version compares two related conditions. Difference scores are computed as $d_i = x_{1i} - x_{2i}$ and the test evaluates whether the pseudo-median of the differences equals zero.

This is the most common application of the Wilcoxon Signed-Rank Test and is the primary focus of this tutorial.

5.2 One-Sample Version (Against a Hypothesised Median)

The one-sample version tests whether the population pseudo-median of a single sample equals a specified value $\theta_0$:

$$H_0: \theta = \theta_0$$

Compute adjusted differences: $d_i = x_i - \theta_0$

Then apply the standard Wilcoxon procedure to these adjusted values.

Common applications:

5.3 Exact vs. Approximate (Asymptotic) p-values

Exact p-value: Computes the p-value from the complete enumeration of all possible rank assignments under $H_0$. Appropriate for small samples ($n' \leq 25$) and when ties are absent or few. DataStatPro always provides the exact p-value when feasible.

Asymptotic p-value: Uses the normal approximation to the distribution of $W^+$. Appropriate for $n' > 25$. The tie-corrected version is more accurate when ties are present.

With continuity correction: The continuity correction ($\pm 0.5$ adjustment to $W^+$) improves the accuracy of the normal approximation for moderate sample sizes by accounting for the discrete nature of $W^+$.

Recommendation: Use the exact p-value whenever possible ($n' \leq 25$, few ties). For larger samples, the tie-corrected asymptotic p-value with continuity correction is generally accurate.

5.4 Permutation Version

The permutation (randomisation) version of the Wilcoxon test generates the null distribution by randomly reassigning the signs of the absolute differences $B$ times (e.g., $B = 10{,}000$) and computing $W^+$ for each permutation. The p-value is the proportion of permuted statistics at least as extreme as the observed $W^+$.

This approach avoids reliance on the normal approximation, remains valid in the presence of ties, and scales to any sample size.

DataStatPro offers the permutation version under the "Exact / Permutation" option.
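The permutation version can be sketched as follows (a hypothetical helper, not DataStatPro's implementation): signs are flipped at random and the two-tailed p-value is the share of permuted statistics at least as far from $E[W^+]$ as the observed one.

```python
import random

def wilcoxon_perm_p(d, n_perm=10_000, seed=0):
    """Two-tailed permutation p-value: randomly flip the signs attached
    to the midranks of |d| and count statistics at least as extreme."""
    d = [v for v in d if v != 0]                  # drop zero differences
    n = len(d)
    # midranks of |d|
    by_abs = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[by_abs[j + 1]]) == abs(d[by_abs[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[by_abs[k]] = (i + j) / 2 + 1
        i = j + 1
    w_obs = sum(r for r, v in zip(ranks, d) if v > 0)
    mean = n * (n + 1) / 4                        # E[W+] under H0
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        w = sum(r for r in ranks if rng.random() < 0.5)
        if abs(w - mean) >= abs(w_obs - mean):
            extreme += 1
    return extreme / n_perm
```

With $B = 10{,}000$ permutations the Monte Carlo error of the p-value is on the order of $\pm 0.004$ for $p$ near $0.05$.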

5.5 Pratt's Method for Zero Differences

Two conventions exist for handling zero differences ($d_i = 0$):

Wilcoxon's original method (default): Exclude all zero differences; analyse only the $n'$ non-zero differences.

Pratt's method (1959): Include the zero differences when ranking the absolute values, but drop their ranks from the signed-rank sums. This preserves the influence of the zeros on the ranks assigned to the non-zero differences.

DataStatPro provides both methods when zero differences are present.


6. Using the Wilcoxon Signed-Rank Test Calculator Component

The Wilcoxon Signed-Rank Test Calculator in DataStatPro provides a comprehensive tool for running, diagnosing, visualising, and reporting the test and associated effect sizes.

Step-by-Step Guide

Step 1 — Select "Wilcoxon Signed-Rank Test"

From the "Test Type" dropdown, select:

💡 DataStatPro automatically suggests the Wilcoxon Signed-Rank Test when the normality check on difference scores is significant in the Paired t-Test component. A yellow warning banner will appear with a direct link to the Wilcoxon component.

Step 2 — Input Method

Choose how to provide the data:

Step 3 — Specify the Null Hypothesis Value $\theta_0$

Step 4 — Select the Alternative Hypothesis

Step 5 — Select p-value Method

Step 6 — Handle Zero Differences

DataStatPro reports $n$ (total pairs), $n^0$ (zero differences excluded), $n^+$ (positive differences), $n^-$ (negative differences), and $n' = n - n^0$ (effective sample size).

Step 7 — Select Display Options

Step 8 — Run the Analysis

Click "Run Wilcoxon Test". DataStatPro will:

  1. Compute all difference scores and rank them.
  2. Apply zero-exclusion (or Pratt) and tie-correction.
  3. Compute $W^+$, $W^-$, $z$, exact p-value, and asymptotic p-value.
  4. Estimate the Hodges-Lehmann pseudo-median with exact 95% CI.
  5. Compute effect sizes $r_W$ and $r_{rb}$ with CIs.
  6. Run assumption checks and display symmetry diagnostics.
  7. Auto-generate the APA-compliant results paragraph.

7. Full Step-by-Step Procedure

7.1 Complete Computational Procedure

This section walks through every computational step for the Wilcoxon Signed-Rank Test, from raw data to a full APA-style conclusion.

Given: $n$ pairs of observations $(x_{1i}, x_{2i})$ for $i = 1, 2, \ldots, n$.


Step 1 — Establish Sign Convention and Compute Difference Scores

Define $d_i = x_{1i} - x_{2i}$ consistently for all pairs. A positive $d_i$ means Condition 1 yields a higher value than Condition 2 for participant $i$.

State the sign convention explicitly: "Positive differences indicate higher scores in Condition 1 than Condition 2."


Step 2 — Identify and Exclude Zero Differences

Identify all pairs where $d_i = 0$ exactly. Remove these from further analysis.

$n' = n - n^0$ (effective sample size after exclusion)

Record $n^0$ for reporting. If $n^0 > 0$, state explicitly that $n^0$ pairs with $d_i = 0$ were excluded.


Step 3 — Compute Absolute Differences and Check Symmetry

Compute $|d_i|$ for all $n'$ non-zero differences.

Symmetry check: Before proceeding, inspect a histogram or Q-Q plot of the $d_i$ to confirm approximate symmetry about a single centre.


Step 4 — Rank the Absolute Differences

Rank $|d_1|, |d_2|, \ldots, |d_{n'}|$ from smallest (rank 1) to largest (rank $n'$).

For tied absolute values, assign the average rank to all tied observations.

Notation: $R_i$ = rank of $|d_i|$.

Verification: $\sum_{i=1}^{n'} R_i = n'(n'+1)/2$


Step 5 — Assign Signed Ranks

Restore the sign of each difference to its rank:

$R_i^{+} = R_i$ if $d_i > 0$; $R_i^{-} = R_i$ if $d_i < 0$

Create a table with columns: $i$, $d_i$, $|d_i|$, $R_i$ (rank of $|d_i|$), and the signed rank ($+R_i$ if positive, $-R_i$ if negative).


Step 6 — Compute the Rank Sums

$W^+ = \sum_{\{i:\; d_i > 0\}} R_i$ (sum of ranks where $d_i > 0$)

$W^- = \sum_{\{i:\; d_i < 0\}} R_i$ (sum of ranks where $d_i < 0$)

Verification: $W^+ + W^- = n'(n'+1)/2$

Test statistic: $W = \min(W^+, W^-)$

Count: $n^+$ = number of positive differences; $n^-$ = number of negative differences; $n^+ + n^- = n'$.


Step 7 — Compute the p-value

If $n' \leq 25$ and few ties: Use the exact null distribution (from tables or software enumeration).

If $n' > 25$ or many ties: Use the normal approximation with tie correction:

$$E[W^+] = \frac{n'(n'+1)}{4}$$

$$\mathrm{Var}_{corrected}[W^+] = \frac{n'(n'+1)(2n'+1)}{24} - \frac{\sum_k (t_k^3 - t_k)}{48}$$

$$z = \frac{W^+ - E[W^+]}{\sqrt{\mathrm{Var}_{corrected}[W^+]}}$$

With continuity correction:

$$z_{cc} = \frac{|W^+ - E[W^+]| - 0.5}{\sqrt{\mathrm{Var}_{corrected}[W^+]}}$$

Two-tailed p-value:

$$p = 2 \times [1 - \Phi(|z_{cc}|)]$$

Compare $p$ to $\alpha$. Reject $H_0$ if $p \leq \alpha$.


Step 8 — Compute the Hodges-Lehmann Point Estimate

The Hodges-Lehmann estimator $\hat{\theta}$ is the point estimate of the pseudo-median associated with the Wilcoxon test. It is the median of all pairwise averages of the non-zero differences:

$$\hat{\theta} = \mathrm{Median}\left\{\frac{d_i + d_j}{2} : 1 \leq i \leq j \leq n'\right\}$$

There are $n'(n'+1)/2$ such averages (including each difference paired with itself).
This estimator is:


Step 9 — Compute the 95% CI for the Pseudo-Median

The exact 95% CI for the pseudo-median uses the order statistics of the Walsh averages (all $n'(n'+1)/2$ pairwise averages). The CI bounds are determined by the critical values of the Wilcoxon null distribution.

Let $W^+_{\alpha/2}$ be the lower critical value from the exact Wilcoxon table:

CI bounds $=$ the $W^+_{\alpha/2}$-th smallest and the $W^+_{\alpha/2}$-th largest (i.e., the $(n'(n'+1)/2 - W^+_{\alpha/2} + 1)$-th smallest) Walsh averages.

DataStatPro computes these exact CI bounds numerically.

Approximate 95% CI (for large $n'$):

Find $C_\alpha = z_{\alpha/2}\sqrt{\mathrm{Var}[W^+]}$, then the CI consists of the $(W^+_{\alpha/2} + 1)$-th to $W^+_{1-\alpha/2}$-th ordered Walsh averages.


Step 10 — Compute Effect Sizes

Effect size $r_W$ (from the z-statistic):

$$r_W = \frac{z}{\sqrt{n'}}$$

Matched-pairs rank-biserial correlation $r_{rb}$ (Kerby, 2014):

$$r_{rb} = \frac{W^+ - W^-}{W^+ + W^-} = \frac{W^+ - W^-}{n'(n'+1)/2}$$

Or equivalently:

$$r_{rb} = 1 - \frac{4W^-}{n'(n'+1)} = \frac{4W^+}{n'(n'+1)} - 1$$

Both $r_W$ and $r_{rb}$ range from $-1$ to $+1$.

Common Language Effect Size (CL):

$$CL = \frac{W^+}{n'(n'+1)/2} \times 100\%$$

(when $W^+ > W^-$, i.e., most differences positive)

More precisely: $CL = \dfrac{\#\{\text{Walsh averages} > 0\}}{n'(n'+1)/2}$
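The effect-size formulas in this step reduce to a few lines (the helper name is illustrative; $n'$ is recovered from $W^+ + W^- = n'(n'+1)/2$, and the example rank sums are hypothetical):

```python
import math

def wilcoxon_effect_sizes(w_plus, w_minus, z=None):
    """Effect sizes from the rank sums, per the formulas above:
    matched-pairs rank-biserial r_rb, and r_W = z / sqrt(n') when z is given."""
    total = w_plus + w_minus                     # = n'(n'+1)/2
    r_rb = (w_plus - w_minus) / total
    # recover n' from n'(n'+1)/2 = total
    n_eff = int(round((-1 + math.sqrt(1 + 8 * total)) / 2))
    r_w = z / math.sqrt(n_eff) if z is not None else None
    return r_rb, r_w

r_rb, _ = wilcoxon_effect_sizes(w_plus=12.5, w_minus=2.5)  # hypothetical rank sums
print(round(r_rb, 3))  # 0.667
```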


Step 11 — Interpret and Report

Combine all results into a complete APA-compliant report:

  1. State the test used and the reason (non-normality, ordinal data).
  2. Report group/condition medians.
  3. Report $W^+$ (or $W$), $z$, and $p$.
  4. Report the Hodges-Lehmann estimate with 95% CI.
  5. Report the effect size $r_{rb}$ (or $r_W$) with its 95% CI.
  6. State the practical conclusion.

8. Effect Sizes for the Wilcoxon Signed-Rank Test

8.1 The Rank-Biserial Correlation $r_{rb}$ — Primary Effect Size

The matched-pairs rank-biserial correlation (Kerby, 2014) is the recommended primary effect size for the Wilcoxon Signed-Rank Test. It has several equivalent formulations:

From rank sums:

$$r_{rb} = \frac{W^+ - W^-}{n'(n'+1)/2}$$

From positive and negative rank proportions:

$$r_{rb} = \frac{\text{sum of positive ranks} - \text{sum of negative ranks}}{\text{total rank sum}}$$

Interpretation: $r_{rb}$ represents the difference between the proportion of favourable and unfavourable evidence in the data.

This property is related to the probability-of-superiority interpretation:

$P(d_i > 0) = \dfrac{1 + r_{rb}}{2}$ (approximately, under the symmetry assumption)

8.2 The $r_W$ Effect Size — From the z-Statistic

$r_W$ (sometimes written $r$ or $r_{Wilcoxon}$) is the effect size computed directly from the standardised test statistic:

$$r_W = \frac{z}{\sqrt{n'}}$$

Where $z$ is the z-approximation to the Wilcoxon statistic and $n'$ is the effective sample size (excluding zero differences).

$r_W$ has the same range as a Pearson correlation ($-1$ to $+1$) and uses the same verbal benchmarks as Pearson $r$. It is mathematically equivalent to the point-biserial correlation between a binary indicator of condition and the observed rank differences.

Relationship between $r_W$ and $r_{rb}$:

For large $n'$ without ties, $r_W \approx r_{rb}$. They can differ for small samples or with many ties.

💡 DataStatPro reports both $r_W$ and $r_{rb}$. For primary reporting, $r_{rb}$ is recommended because it is interpretable without reference to the z-approximation and has a direct probability-of-superiority interpretation. Use $r_W$ when comparing to literature that reports this variant.

8.3 Cohen's Benchmarks for $r_W$ and $r_{rb}$

Since $r_W$ and $r_{rb}$ behave like correlation coefficients, Cohen's (1988) benchmarks for Pearson $r$ are applied:

| $\vert r_W \vert$ or $\vert r_{rb} \vert$ | Verbal Label | Equivalent $d$ | $n'$ pairs needed |
| --- | --- | --- | --- |
| 0.10 | Small | 0.20 | $\approx 264$ |
| 0.30 | Medium | 0.62 | $\approx 52$ |
| 0.50 | Large | 1.15 | $\approx 20$ |
| 0.70 | Very large | 1.96 | $\approx 9$ |
| 0.90 | Huge | 4.13 | $\approx 5$ |

Sample sizes are for a two-tailed Wilcoxon test at $\alpha = .05$ with 80% power.

⚠️ These benchmarks from Cohen (1988) are rough guidelines. Always contextualise effect sizes against domain-specific norms. An $r_{rb} = 0.30$ may be large in some fields (e.g., large-scale educational interventions) and small in others (e.g., lab-controlled cognitive experiments).

8.4 Converting Between Effect Size Metrics

| From | To | Formula |
| --- | --- | --- |
| $r_{rb}$ | $d$ (approx.) | $d = \dfrac{2r_{rb}}{\sqrt{1 - r_{rb}^2}}$ |
| $d$ | $r_{rb}$ (approx.) | $r_{rb} = \dfrac{d}{\sqrt{d^2 + 4}}$ |
| $r_W$ | $d$ | $d = \dfrac{2r_W}{\sqrt{1 - r_W^2}}$ |
| $W^+$, $n'$ | $r_{rb}$ | $r_{rb} = \dfrac{2W^+ - n'(n'+1)/2}{n'(n'+1)/2}$ |
| $z$, $n'$ | $r_W$ | $r_W = z / \sqrt{n'}$ |

⚠️ The conversions between $r$ and $d$ above use the equal-groups formula and are only approximations. Do not use these conversions for meta-analytic aggregation without accounting for the design structure.

8.5 The Hodges-Lehmann Estimator as an Effect Size

The Hodges-Lehmann pseudo-median \hat{\theta} is the point estimate in original measurement units associated with the Wilcoxon test. It is:

  - the median of all M = n'(n'+1)/2 Walsh averages (d_i + d_j)/2;
  - expressed in the original units of measurement, so it is directly interpretable;
  - robust to outliers and to moderate skew in the difference scores.

Reporting recommendation: Always report \hat{\theta} with its 95% CI alongside r_{rb}. This parallels the paired t-test practice of reporting both the mean difference (in original units) and Cohen's d.

8.6 The Common Language Effect Size for the Wilcoxon Test

The Common Language Effect Size (CL) for the Wilcoxon context is:

CL = P(\text{randomly selected pair has } d_i > 0)

Estimated from the data:

\widehat{CL} = \frac{n^+}{n'} \times 100\% (simple version based on counts)

Or, more precisely using Walsh averages:

\widehat{CL} = \frac{\text{number of Walsh averages } (d_i+d_j)/2 > 0}{n'(n'+1)/2} \times 100\%

This is the probability that a randomly selected participant scores higher in Condition 1 than in Condition 2, estimated non-parametrically from the data.
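Both estimators can be computed directly from the difference scores. A minimal sketch (the function names and the example data are made up for illustration):

```python
from itertools import combinations_with_replacement

def cl_simple(diffs):
    """Count-based CL: proportion of non-zero differences that are positive."""
    nonzero = [d for d in diffs if d != 0]
    return sum(d > 0 for d in nonzero) / len(nonzero)

def cl_walsh(diffs):
    """Walsh-average CL: proportion of the n'(n'+1)/2 pairwise averages above zero."""
    nonzero = [d for d in diffs if d != 0]
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(nonzero, 2)]
    return sum(w > 0 for w in walsh) / len(walsh)
```

For the toy differences [2, -1, 3, 4], the count-based version gives 3/4 while the Walsh-average version gives 9/10, illustrating how the second estimator uses magnitude information that the first discards.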


9. Confidence Intervals

9.1 Exact CI for the Hodges-Lehmann Pseudo-Median

The natural CI to report with the Wilcoxon Signed-Rank Test is the exact confidence interval for the pseudo-median (Hodges-Lehmann CI), expressed in the original measurement units.

Algorithm:

  1. Compute all M = n'(n'+1)/2 Walsh averages (d_i + d_j)/2 for 1 \leq i \leq j \leq n'.
  2. Sort the Walsh averages in ascending order: A_{(1)} \leq A_{(2)} \leq \cdots \leq A_{(M)}.
  3. Find the lower critical value C_L from the exact Wilcoxon null distribution at the chosen \alpha level.
  4. The 95% CI is \left[A_{(C_L+1)}, A_{(M-C_L)}\right].

Where C_L is the largest value of W^+ for which P(W^+ \leq C_L) \leq \alpha/2 under H_0.

DataStatPro computes this exact CI automatically.
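The four steps above can be sketched as follows, building the exact null distribution of W^+ by dynamic programming over subset sums of the ranks 1…n' (a simplified illustration that assumes no ties among the |d_i|; the function names are illustrative):

```python
from itertools import combinations_with_replacement
from statistics import median

def signed_rank_cdf_counts(n):
    """counts[s] = number of sign assignments with W+ = s, for ranks 1..n."""
    counts = [1] + [0] * (n * (n + 1) // 2)
    for rank in range(1, n + 1):
        # 0/1-knapsack update: each rank is either positive (counted) or not
        for s in range(len(counts) - 1, rank - 1, -1):
            counts[s] += counts[s - rank]
    return counts

def hodges_lehmann_ci(diffs, alpha=0.05):
    """Hodges-Lehmann point estimate and exact CI from the Walsh averages."""
    n = len(diffs)
    walsh = sorted((a + b) / 2 for a, b in combinations_with_replacement(diffs, 2))
    m = len(walsh)                                  # m = n(n+1)/2
    counts, total = signed_rank_cdf_counts(n), 2 ** n
    cum, c_l = 0, None
    for s, c in enumerate(counts):                  # largest s with P(W+ <= s) <= alpha/2
        cum += c
        if cum / total <= alpha / 2:
            c_l = s
    if c_l is None:                                 # n too small for this confidence level
        return median(walsh), (walsh[0], walsh[-1])
    # 1-based [A_(C_L+1), A_(M-C_L)] -> 0-based indices c_l and m-c_l-1
    return median(walsh), (walsh[c_l], walsh[m - c_l - 1])
```

For the differences 1…8 this returns a pseudo-median of 4.5 (the Walsh-average set is symmetric about 4.5) with an exact 95% CI of [2.0, 7.0].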

9.2 Number of Walsh Averages for Common Sample Sizes

| n' pairs | M = n'(n'+1)/2 Walsh averages |
|---|---|
| 5 | 15 |
| 10 | 55 |
| 15 | 120 |
| 20 | 210 |
| 30 | 465 |
| 50 | 1275 |
| 100 | 5050 |

9.3 Interpreting the Hodges-Lehmann CI

The Hodges-Lehmann CI has the same interpretation as any confidence interval: if the study were repeated many times, approximately 95% of the resulting intervals would contain the true population pseudo-median.

CI interpretation rules:

| CI property | Interpretation |
|---|---|
| Entirely above zero | Pseudo-median is significantly positive; Condition 1 tends to produce higher values |
| Entirely below zero | Pseudo-median is significantly negative; Condition 2 tends to produce higher values |
| Contains zero | Result is not statistically significant at level \alpha |
| Narrow CI | Precise estimate (large n') |
| Wide CI | Imprecise estimate (small n'); interpret cautiously |

9.4 CI for the Effect Size r_{rb}

A bootstrap 95% CI for r_{rb} is available in DataStatPro when raw data are provided:

  1. Resample n' pairs with replacement B = 10{,}000 times.
  2. Compute r_{rb}^{(b)} for each bootstrap sample.
  3. The 95% CI is the 2.5th and 97.5th percentile of the bootstrap distribution.

An asymptotic CI can also be computed using Fisher's z-transformation:

z_{r_{rb}} = \text{arctanh}(r_{rb}), \quad SE_{z_{r_{rb}}} = \frac{1}{\sqrt{n'-3}}

95\%\text{ CI for } z_{r_{rb}}: z_{r_{rb}} \pm 1.96/\sqrt{n'-3}

Back-transform: r_{rb} = \tanh(z_{r_{rb}})
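Both interval constructions can be sketched with NumPy/SciPy (illustrative helper names; `rank_biserial` recomputes r_{rb} from raw differences using midranks on |d_i|):

```python
import numpy as np
from scipy.stats import rankdata

def rank_biserial(diffs):
    """r_rb = (W+ - W-) / (n'(n'+1)/2), with midranks on |d_i|."""
    d = diffs[diffs != 0]
    ranks = rankdata(np.abs(d))
    w_pos, w_neg = ranks[d > 0].sum(), ranks[d < 0].sum()
    return (w_pos - w_neg) / (len(d) * (len(d) + 1) / 2)

def bootstrap_ci(diffs, b=10_000, seed=0):
    """Percentile bootstrap CI: resample pairs with replacement b times."""
    rng = np.random.default_rng(seed)
    stats = [rank_biserial(rng.choice(diffs, size=len(diffs), replace=True))
             for _ in range(b)]
    return np.percentile(stats, [2.5, 97.5])

def fisher_ci(r_rb, n):
    """Asymptotic CI via Fisher's z-transformation."""
    z = np.arctanh(r_rb)
    half = 1.96 / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)

diffs = np.array([7, 4, 12, 3, -2, 2, 3, 10, -1, 9, 3, 2], dtype=float)
```

The example `diffs` vector is made up for illustration; with real data the bootstrap CI is usually preferred for small n', since the Fisher interval relies on a large-sample normal approximation.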

9.5 Width of the CI as a Function of Sample Size

For r_{rb} = 0.30 using the Fisher z approximation:

| n' | SE_{z_r} | Approx. CI width (r) | Precision |
|---|---|---|---|
| 10 | 0.378 | 1.16 | Very low |
| 20 | 0.243 | 0.79 | Low |
| 30 | 0.189 | 0.63 | Moderate |
| 50 | 0.145 | 0.49 | Moderate |
| 100 | 0.102 | 0.35 | Good |
| 200 | 0.071 | 0.25 | High |

⚠️ The CI for r_{rb} is very wide for small samples. Always report the CI to convey the uncertainty in the effect size estimate. A precise-looking point estimate of r_{rb} = 0.50 from n' = 10 pairs has a CI of approximately [-0.15, 0.84] — nearly uninformative about the true effect magnitude.


10. Power Analysis and Sample Size Planning

10.1 Power of the Wilcoxon Signed-Rank Test

Power analysis for the Wilcoxon Signed-Rank Test is more complex than for parametric tests because the power depends on the entire distribution of difference scores, not just the mean and variance. Three approaches are used:

Approach 1 — Use the ARE relative to the paired t-test:

Since ARE = 3/\pi \approx 0.955 for normal data, the required n' for the Wilcoxon test is approximately 1/0.955 \approx 1.047 times the n required for the paired t-test at the same power.

n'_{Wilcoxon} \approx n_{paired\;t} \times \frac{\pi}{3} \approx 1.047 \times n_{paired\;t}

This is the most practical planning approach when d_z is known or estimated.

Approach 2 — Use the effect size r_{rb} directly (simulation-based):

DataStatPro uses Monte Carlo simulation to estimate power for specified r_{rb} (or d_z), n', \alpha, and distributional shape (normal, logistic, exponential).
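A simulation-based power estimate in the spirit of Approach 2 can be sketched with SciPy (a minimal illustration assuming normally distributed difference scores; `wilcoxon_power` is a hypothetical helper, not the DataStatPro module itself):

```python
import numpy as np
from scipy.stats import wilcoxon

def wilcoxon_power(d_z, n, alpha=0.05, n_sim=2000, seed=0):
    """Monte Carlo power: fraction of simulated studies (normal difference
    scores with standardised mean d_z) in which the Wilcoxon test reaches
    p < alpha."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        d = rng.normal(loc=d_z, scale=1.0, size=n)
        hits += wilcoxon(d).pvalue < alpha
    return hits / n_sim
```

Swapping `rng.normal` for a skewed or heavy-tailed generator (e.g., `rng.laplace`) reproduces the non-normal planning scenarios discussed in Section 10.4.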

Approach 3 — Use the normal approximation (large samples):

For large n', power is approximately:

\text{Power} \approx \Phi\!\left(|z_{\lambda}| - z_{\alpha/2}\right)

Where z_\lambda = r_{rb}\sqrt{n'(n'+1)/2} / \sqrt{\text{Var}[W^+]} is the non-centrality parameter.

10.2 Required Sample Size for 80% Power (\alpha = .05, Two-Tailed)

Based on converting d_z to Wilcoxon n' via ARE (normal data):

n'_{Wilcoxon} \approx \frac{7.849}{d_z^2} \times \frac{\pi}{3}

| d_z equivalent | r_{rb} (approx) | n' Wilcoxon (80% power) | n paired t (80% power) | Overhead |
|---|---|---|---|---|
| 0.20 | 0.099 | 277 | 264 | +5% |
| 0.30 | 0.148 | 125 | 119 | +5% |
| 0.50 | 0.243 | 46 | 44 | +5% |
| 0.80 | 0.372 | 19 | 18 | +6% |
| 1.00 | 0.447 | 14 | 13 | +8% |
| 1.20 | 0.514 | 10 | 9 | +11% |
| 1.50 | 0.600 | 7 | 7 | \approx 0% |

Note: For non-normal distributions (heavy tails, skewed), the Wilcoxon test may require fewer observations than the paired t-test.

10.3 Sensitivity Analysis

The minimum detectable effect size r_{rb,min} for a given n' and power (80%):

Using the ARE-based approximation:

d_{z,min} \approx \sqrt{\frac{7.849 \times \pi/3}{n'}} = \sqrt{\frac{8.211}{n'}}

r_{rb,min} \approx \frac{d_{z,min}}{\sqrt{d_{z,min}^2 + 4}}

| n' pairs | Min. detectable d_z | Min. detectable r_{rb} |
|---|---|---|
| 10 | 0.906 | 0.411 |
| 20 | 0.641 | 0.306 |
| 30 | 0.523 | 0.253 |
| 50 | 0.405 | 0.199 |
| 100 | 0.286 | 0.142 |
| 200 | 0.202 | 0.101 |

10.4 Power Advantage Under Non-Normality

For non-normal distributions, the Wilcoxon test's power advantage over the t-test grows:

| Distribution of d_i | ARE | Implication |
|---|---|---|
| Normal | 0.955 | Wilcoxon needs 5% more pairs |
| Contaminated normal (5% outliers) | 1.34 | Wilcoxon needs 25% fewer pairs |
| Laplace (double exponential) | 1.50 | Wilcoxon needs 33% fewer pairs |
| Logistic | 1.10 | Wilcoxon needs 9% fewer pairs |
| Heavy Cauchy tails | \gg 1 | Wilcoxon dramatically more powerful |

💡 When the distribution of difference scores is expected to be non-normal (e.g., for Likert-type scales, skewed physiological data, or time-to-event measures), plan sample size using the Wilcoxon test directly via DataStatPro's Monte Carlo power module rather than the ARE-based approximation.


11. Advanced Topics

11.1 Comparing the Wilcoxon Signed-Rank Test and the Paired t-Test

A common question is: given that both tests are available, which should be reported?

Decision criteria:

| Condition | Recommendation |
|---|---|
| Difference scores clearly normal, no outliers, n \geq 15 | Paired t-test (slightly more powerful) |
| Difference scores non-normal, n < 30 | Wilcoxon Signed-Rank Test |
| Difference scores ordinal or near-ordinal | Wilcoxon Signed-Rank Test |
| Severe outliers in differences that cannot be removed | Wilcoxon Signed-Rank Test |
| Uncertain normality, small n | Wilcoxon Signed-Rank Test (safer) |
| n \geq 30, differences mildly non-normal | Either test (CLT protects t-test) |
| Pre-registered choice, normality assumed | Paired t-test with Wilcoxon as sensitivity |

Best practice: When normality is uncertain, run both tests. If they agree (both significant or both non-significant), report the parametric result as primary with the non-parametric as a sensitivity check. If they disagree, investigate the distribution of differences and report the Wilcoxon as the primary test with an explanation.

11.2 The Sign Test as a Simpler Alternative

The Sign Test is an even simpler non-parametric test that uses only the sign of each difference (ignoring magnitude). It tests H_0: P(d_i > 0) = 0.5 using the binomial distribution:

B = n^+ \sim \text{Binomial}(n', 0.5) under H_0

When to use the Sign Test over Wilcoxon:

  - when differences can be classified only as positive or negative, so that ranking their magnitudes is not meaningful;
  - when the symmetry assumption of the Wilcoxon test cannot be justified.

Efficiency comparison: The Sign Test has ARE = 2/\pi \approx 0.637 relative to the paired t-test — substantially less efficient than the Wilcoxon test's ARE of 0.955. Use the Sign Test only when the Wilcoxon test's symmetry assumption cannot be justified.
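The exact Sign Test is essentially a one-liner with SciPy's `binomtest` (a sketch; `sign_test` is an illustrative wrapper and the data are made up):

```python
from scipy.stats import binomtest

def sign_test(diffs):
    """Exact two-sided sign test: B = n+ ~ Binomial(n', 0.5) under H0."""
    nonzero = [d for d in diffs if d != 0]
    n_pos = sum(d > 0 for d in nonzero)
    return binomtest(n_pos, n=len(nonzero), p=0.5)

# 10 positive differences out of 12 non-zero pairs:
res = sign_test([3, 1, 4, 1, 5, 9, 2, 6, -5, 3, -5, 8])
```

With 10 of 12 non-zero differences positive, the exact two-sided p-value is 158/4096 ≈ .039 — both binomial tails at p = 0.5 summed.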

11.3 Bootstrap Wilcoxon Test

The bootstrap version of the Wilcoxon test generates the null distribution by resampling:

  1. For each bootstrap iteration b = 1, \ldots, B (strictly, a sign-flip randomisation rather than a resampling bootstrap):
     a. Randomly flip the sign of each d_i with probability 0.5 (sign randomisation under H_0).
     b. Compute W^{+,(b)} from the sign-randomised differences.
  2. The bootstrap p-value is the proportion of iterations in which |W^{+,(b)} - E[W^+]| equals or exceeds |W^+_{obs} - E[W^+]|.
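The sign-randomisation scheme above can be sketched in a few lines (a Monte Carlo illustration; `sign_flip_pvalue` is a hypothetical helper):

```python
import random
from scipy.stats import rankdata

def sign_flip_pvalue(diffs, b=10_000, seed=0):
    """Monte Carlo sign-randomisation p-value for the Wilcoxon signed-rank test."""
    rng = random.Random(seed)
    d = [x for x in diffs if x != 0]
    ranks = rankdata([abs(x) for x in d])           # midranks for tied |d_i|
    w_obs = sum(r for r, x in zip(ranks, d) if x > 0)
    expect = len(d) * (len(d) + 1) / 4              # E[W+] under H0
    hits = 0
    for _ in range(b):
        # each rank is positive with probability 0.5 under H0
        w = sum(r for r in ranks if rng.random() < 0.5)
        hits += abs(w - expect) >= abs(w_obs - expect)
    return hits / b
```

Because the observed midranks are reused in every iteration, ties need no special correction here; the randomisation distribution already reflects them.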

This approach:

  - does not depend on the normal approximation or on exact tables;
  - handles ties and midranks naturally, because the observed rank structure is preserved;
  - converges to the exact permutation p-value as B grows.
11.4 Bayesian Non-Parametric Paired Test

The Bayesian Signed-Rank Test (van Doorn et al., 2018; Ly et al., 2016) extends the Bayesian framework to the Wilcoxon setting. It computes a Bayes Factor BF_{10} quantifying evidence for H_1 (pseudo-median \neq 0) vs. H_0 (pseudo-median = 0) without assuming normality.

The prior on the scaled pseudo-median under H_1 is a Cauchy distribution (as in the Bayesian t-test), but the likelihood is based on a normal approximation to the sampling distribution of the Wilcoxon statistic.

BF_{10}^{Wilcoxon} \approx BF_{10}^{t} evaluated at t = z_{Wilcoxon} with \nu = n'-1

This approximation is valid for n' \geq 20. DataStatPro computes the Bayesian Signed-Rank Test using this approximation.

Interpretation of BF_{10}: Same benchmarks as the Bayesian t-test (see Section 11.4 of the Paired t-Test tutorial).

11.5 Multiple Wilcoxon Tests and Familywise Error Control

When multiple Wilcoxon Signed-Rank Tests are conducted simultaneously (e.g., testing the same intervention on five different outcomes), the familywise error rate (FWER) inflates exactly as with multiple t-tests:

FWER = 1 - (1-\alpha)^k

Correction methods applicable to multiple Wilcoxon tests:

| Method | Adjusted \alpha | Properties |
|---|---|---|
| Bonferroni | \alpha/k | Conservative; controls FWER |
| Holm | Sequential | Less conservative than Bonferroni |
| Benjamini-Hochberg | FDR control | Exploratory analyses |

Apply the same correction logic as for multiple parametric tests.
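Holm's step-down adjustment, for example, takes only a few lines (a sketch; `holm_adjust` is an illustrative helper, not a DataStatPro function):

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values (controls the FWER)."""
    k = len(pvalues)
    order = sorted(range(k), key=lambda i: pvalues[i])
    adjusted = [0.0] * k
    running_max = 0.0
    for step, idx in enumerate(order):
        # multiply the s-th smallest p by (k - s), enforcing monotonicity
        running_max = max(running_max, (k - step) * pvalues[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

For p-values [0.01, 0.04, 0.03, 0.005] the Holm-adjusted values are approximately [0.03, 0.06, 0.06, 0.02], so only the first and last tests survive at \alpha = .05.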

11.6 The Wilcoxon Test for Ordinal Likert Scale Data

A common application of the Wilcoxon Signed-Rank Test is to paired Likert scale responses. Consider a satisfaction survey where participants rate two products on a 5-point scale (1 = very dissatisfied, 5 = very satisfied).

Key considerations:

  1. Single Likert items should be treated as ordinal; the Wilcoxon test is appropriate.
  2. Composite Likert scales (sum or average of multiple items) can often be treated as approximately continuous; the paired t-test may be appropriate if the composite is approximately normally distributed.
  3. Floor and ceiling effects are common with Likert data and create many zero differences and ties — check carefully and consider Pratt's method.
  4. The Wilcoxon test cannot distinguish between a systematic shift of 1 point (each participant rates Product 1 exactly 1 point higher) and a mixed pattern (some rate it 2 points higher, others 1 point lower). The Hodges-Lehmann estimate helps clarify the typical magnitude of change.

11.7 Reporting the Wilcoxon Signed-Rank Test According to APA 7th Edition

Minimum reporting requirements (APA 7th ed.):

  1. State that the Wilcoxon Signed-Rank Test was used and why (e.g., non-normal differences, ordinal data).
  2. Report medians for each condition (or the Hodges-Lehmann pseudo-median estimate).
  3. Report the test statistic: T or W^+ (or W, the minimum), and the z-approximation if n' > 25.
  4. Report the exact or asymptotic p-value.
  5. Report the effect size r_{rb} (or r_W) with 95% CI.
  6. Report the Hodges-Lehmann estimate with 95% CI (in original units).
  7. Report n', n^+, n^-, and n^0 (number of zeros excluded).

12. Worked Examples

Example 1: Pre-Post Anxiety Scores (Non-Normal Differences)

A clinical psychologist evaluates an 8-week acceptance and commitment therapy (ACT) programme for anxiety. Generalised Anxiety Disorder 7-item scale (GAD-7; range 0–21; higher = more anxiety) scores are recorded for n = 12 participants before and after the programme.

Shapiro-Wilk test on the difference scores: differences are right-skewed (W = 0.821, p = .017) — normality violated. The Wilcoxon Signed-Rank Test is used.

Raw data:

| i | Pre-ACT (x_{1i}) | Post-ACT (x_{2i}) | d_i = x_{1i}-x_{2i} |
|---|---|---|---|
| 1 | 16 | 9 | 7 |
| 2 | 12 | 8 | 4 |
| 3 | 18 | 6 | 12 |
| 4 | 14 | 11 | 3 |
| 5 | 20 | 8 | 12 |
| 6 | 11 | 9 | 2 |
| 7 | 17 | 14 | 3 |
| 8 | 15 | 5 | 10 |
| 9 | 13 | 11 | 2 |
| 10 | 19 | 10 | 9 |
| 11 | 16 | 13 | 3 |
| 12 | 14 | 12 | 2 |

Step 1 — Zero differences: No d_i = 0, so n' = 12.

Step 2 — Absolute differences and symmetry check:

|d_i|: 7, 4, 12, 3, 12, 2, 3, 10, 2, 9, 3, 2

Symmetry check: all differences are positive (no negative differences), indicating a strong shift. The distribution of d_i is right-skewed (all positive, with some large values of 12), which is consistent with the Shapiro-Wilk violation.

Step 3 — Rank the absolute differences:

Sorted |d_i| values and their ranks (with midranks for ties):

| \vert d_i \vert value | Count | Rank positions | Avg rank |
|---|---|---|---|
| 2 | 3 | 1, 2, 3 | 2.0 |
| 3 | 3 | 4, 5, 6 | 5.0 |
| 4 | 1 | 7 | 7.0 |
| 7 | 1 | 8 | 8.0 |
| 9 | 1 | 9 | 9.0 |
| 10 | 1 | 10 | 10.0 |
| 12 | 2 | 11, 12 | 11.5 |

Rank assignment:

| i | d_i | \vert d_i \vert | Rank R_i | Signed rank |
|---|---|---|---|---|
| 1 | 7 | 7 | 8.0 | +8.0 |
| 2 | 4 | 4 | 7.0 | +7.0 |
| 3 | 12 | 12 | 11.5 | +11.5 |
| 4 | 3 | 3 | 5.0 | +5.0 |
| 5 | 12 | 12 | 11.5 | +11.5 |
| 6 | 2 | 2 | 2.0 | +2.0 |
| 7 | 3 | 3 | 5.0 | +5.0 |
| 8 | 10 | 10 | 10.0 | +10.0 |
| 9 | 2 | 2 | 2.0 | +2.0 |
| 10 | 9 | 9 | 9.0 | +9.0 |
| 11 | 3 | 3 | 5.0 | +5.0 |
| 12 | 2 | 2 | 2.0 | +2.0 |

Step 4 — Rank sums:

W^+ = 8.0+7.0+11.5+5.0+11.5+2.0+5.0+10.0+2.0+9.0+5.0+2.0 = 78.0

W^- = 0.0 (no negative differences)

Check: W^+ + W^- = 78.0 = 12 \times 13/2 = 78

W = \min(78.0, 0.0) = 0.0

n^+ = 12, n^- = 0, n^0 = 0

Step 5 — Exact p-value (n' = 12):

With W = 0 (all differences positive), the exact two-tailed p-value is:

p = 2 \times P(W^+ \leq 0) = 2 \times (1/2^{12}) = 2/4096 = 0.000488

p < .001
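Because W^- = 0, this exact p-value can be verified by brute-force enumeration of all 2^{12} equally likely sign assignments under H_0 (a sketch using the Example 1 differences):

```python
from itertools import product
from scipy.stats import rankdata

diffs = [7, 4, 12, 3, 12, 2, 3, 10, 2, 9, 3, 2]     # Example 1 difference scores
ranks = rankdata([abs(d) for d in diffs])            # midranks; they sum to 78
w_obs = sum(r for r, d in zip(ranks, diffs) if d > 0)
expect = len(diffs) * (len(diffs) + 1) / 4           # E[W+] = 39 under H0

# Count sign assignments at least as extreme (two-tailed) as the observed W+
hits = sum(
    abs(sum(r for r, s in zip(ranks, signs) if s > 0) - expect)
    >= abs(w_obs - expect)
    for signs in product([1, -1], repeat=len(diffs))
)
p_exact = hits / 2 ** len(diffs)                     # 2/4096 = 0.000488
```

Only the all-positive and all-negative assignments reach |W^+ - 39| \geq 39, so exactly 2 of the 4096 assignments count.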

Step 6 — Hodges-Lehmann estimator:

All n'(n'+1)/2 = 12 \times 13/2 = 78 Walsh averages (d_i+d_j)/2 are computed and sorted. The median of 78 values is the average of the 39th and 40th sorted Walsh averages.

Given all differences are positive (2, 2, 2, 3, 3, 3, 4, 7, 9, 10, 12, 12), the Walsh averages range from 2 (minimum) to 12 (maximum), all positive.

\hat{\theta} = 6.0 GAD-7 points (the 39th and 40th sorted Walsh averages both equal 6.0; computed by DataStatPro)

95% CI for pseudo-median (exact): [2.5, 8.0] GAD-7 points

Step 7 — Effect sizes:

Rank-biserial correlation:

r_{rb} = \frac{W^+ - W^-}{n'(n'+1)/2} = \frac{78-0}{78} = 1.000

r_{rb} = 1.0 (perfect: every participant improved)

z-based effect size (n' = 12, asymptotic approximation):

E[W^+] = 12 \times 13/4 = 39

Tie correction: \sum_k(t_k^3-t_k) = (3^3-3)+(3^3-3)+(2^3-2) = 24+24+6 = 54

\text{Var}_{corrected}[W^+] = \frac{12 \times 13 \times 25}{24} - \frac{54}{48} = 162.5 - 1.125 = 161.375

z = (78-39)/\sqrt{161.375} = 39/12.703 = 3.070

r_W = 3.070/\sqrt{12} = 3.070/3.464 = 0.886

Common Language Effect Size:

\widehat{CL} = n^+/n' = 12/12 = 100\% (all participants improved)

Step 8 — Summary:

| Statistic | Value | Interpretation |
|---|---|---|
| Pre-ACT median | 15.5 GAD-7 pts | Moderate-severe anxiety |
| Post-ACT median | 9.5 GAD-7 pts | Mild anxiety |
| n' (non-zero diff.) | 12 | All participants showed positive change |
| n^+ / n^- / n^0 | 12 / 0 / 0 | |
| W^+ | 78.0 | Maximum possible |
| W^- | 0.0 | Zero negative ranks |
| W (minimum) | 0.0 | |
| p (exact, two-tailed) | < .001 | |
| HL pseudo-median \hat{\theta} | 6.0 GAD-7 pts | |
| 95% CI for \hat{\theta} | [2.5, 8.0] pts | Excludes 0; significant |
| r_{rb} | 1.000 | Maximum possible effect |
| r_W | 0.886 | Very large |
| CL | 100% | Every participant improved |

APA write-up: "Due to non-normal distribution of difference scores (Shapiro-Wilk W = 0.82, p = .017), a Wilcoxon Signed-Rank Test was conducted. ACT therapy produced a statistically significant reduction in anxiety (pre-ACT: Mdn = 15.5 GAD-7 points; post-ACT: Mdn = 9.5), W^+ = 78, p < .001 (exact). The Hodges-Lehmann estimate of the median reduction was 6.0 GAD-7 points [95% CI: 2.5, 8.0], r_{rb} = 1.00, indicating a very large treatment effect. All 12 participants showed improvement following ACT."


Example 2: Pain Ratings — Two Physiotherapy Protocols (Ordinal DV)

A physiotherapist compares pain relief (0–10 NRS, ordinal) under two physiotherapy protocols in n = 15 patients with chronic lower back pain. Each patient receives both protocols in randomised order with a 1-week washout. Lower scores indicate less pain. d_i = \text{Protocol A} - \text{Protocol B} (negative = A produces less pain).

Raw data:

| i | Protocol A | Protocol B | d_i |
|---|---|---|---|
| 1 | 4 | 6 | −2 |
| 2 | 7 | 7 | 0 |
| 3 | 3 | 5 | −2 |
| 4 | 6 | 8 | −2 |
| 5 | 5 | 4 | 1 |
| 6 | 4 | 7 | −3 |
| 7 | 6 | 6 | 0 |
| 8 | 3 | 6 | −3 |
| 9 | 5 | 5 | 0 |
| 10 | 7 | 9 | −2 |
| 11 | 4 | 6 | −2 |
| 12 | 6 | 7 | −1 |
| 13 | 5 | 8 | −3 |
| 14 | 3 | 5 | −2 |
| 15 | 6 | 7 | −1 |

Step 1 — Exclude zeros:

d_i = 0 for participants 2, 7, 9 → n^0 = 3; n' = 15-3 = 12.

Non-zero differences: -2, -2, -2, 1, -3, -3, -2, -2, -1, -3, -2, -1

n^+ = 1 (participant 5: d_5 = +1); n^- = 11 (all others).

Step 2 — Absolute differences and ranks:

| \vert d_i \vert value | Count | Rank positions | Avg rank |
|---|---|---|---|
| 1 | 3 | 1, 2, 3 | 2.0 |
| 2 | 6 | 4, 5, 6, 7, 8, 9 | 6.5 |
| 3 | 3 | 10, 11, 12 | 11.0 |

Rank table (non-zero differences only):

| i | d_i | \vert d_i \vert | R_i | Signed rank |
|---|---|---|---|---|
| 1 | −2 | 2 | 6.5 | −6.5 |
| 3 | −2 | 2 | 6.5 | −6.5 |
| 4 | −2 | 2 | 6.5 | −6.5 |
| 5 | +1 | 1 | 2.0 | +2.0 |
| 6 | −3 | 3 | 11.0 | −11.0 |
| 8 | −3 | 3 | 11.0 | −11.0 |
| 10 | −2 | 2 | 6.5 | −6.5 |
| 11 | −2 | 2 | 6.5 | −6.5 |
| 12 | −1 | 1 | 2.0 | −2.0 |
| 13 | −3 | 3 | 11.0 | −11.0 |
| 14 | −2 | 2 | 6.5 | −6.5 |
| 15 | −1 | 1 | 2.0 | −2.0 |

Step 3 — Rank sums:

W^+ = 2.0

W^- = 6.5+6.5+6.5+11.0+11.0+6.5+6.5+2.0+11.0+6.5+2.0 = 76.0

Check: W^+ + W^- = 78 = 12 \times 13/2

W = \min(2.0, 76.0) = 2.0

Step 4 — Exact p-value (n' = 12):

From Wilcoxon signed-rank exact tables: P(W^+ \leq 2) = 0.0020 (one-tail).

Two-tailed: p = 2 \times 0.0020 = .004

Step 5 — z-approximation (with tie correction):

\sum_k(t_k^3-t_k) = (3^3-3)+(6^3-6)+(3^3-3) = 24+210+24 = 258

\text{Var}_{corrected} = 162.5 - 258/48 = 162.5 - 5.375 = 157.125

z = (2.0 - 39)/\sqrt{157.125} = -37/12.535 = -2.952

p = 2 \times \Phi(-2.952) = 2 \times 0.00158 = .003
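The tie-corrected z computation can be verified in a few lines (a sketch using the Example 2 non-zero differences):

```python
import math
from scipy.stats import rankdata

d = [-2, -2, -2, 1, -3, -3, -2, -2, -1, -3, -2, -1]   # Example 2, zeros removed
n = len(d)
ranks = rankdata([abs(x) for x in d])                 # midranks for tied |d_i|
w_pos = sum(r for r, x in zip(ranks, d) if x > 0)     # W+ = 2.0

# Tie-corrected variance of W+ under H0
tie_counts = {}
for x in d:
    tie_counts[abs(x)] = tie_counts.get(abs(x), 0) + 1
tie_term = sum(t ** 3 - t for t in tie_counts.values())        # 258
var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48           # 157.125
z = (w_pos - n * (n + 1) / 4) / math.sqrt(var)                 # -2.952
```

The hand-computed quantities (tie term 258, corrected variance 157.125, z \approx -2.95) all drop straight out of this calculation.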

Step 6 — Hodges-Lehmann estimate:

\hat{\theta} = -2.0 NRS points (median of Walsh averages)

95% CI for \hat{\theta} (exact): [-3.0, -1.0] NRS points

Step 7 — Effect sizes:

r_{rb} = (W^+ - W^-)/(n'(n'+1)/2) = (2-76)/78 = -74/78 = -0.949

|r_{rb}| = 0.949 — very large effect (Protocol A produces substantially less pain)

r_W = z/\sqrt{n'} = -2.952/\sqrt{12} = -2.952/3.464 = -0.852

CL (proportion of differences with d_i > 0, i.e. Protocol A rated more painful):

\widehat{CL} = n^+/n' = 1/12 = 8.3\%

Equivalently, Protocol A produced less pain than Protocol B in 91.7% of patients with non-zero differences.

Summary:

| Statistic | Value | Interpretation |
|---|---|---|
| Protocol A median pain | 5.0 NRS | |
| Protocol B median pain | 6.0 NRS | |
| n' | 12 | 3 zeros excluded |
| n^+ / n^- / n^0 | 1 / 11 / 3 | |
| W^+ | 2.0 | |
| W^- | 76.0 | |
| W (minimum) | 2.0 | |
| Exact p (two-tailed) | .004 | Significant |
| HL estimate \hat{\theta} | −2.0 NRS | Protocol A lowers pain by 2 pts |
| 95% CI | [−3.0, −1.0] | Excludes 0 |
| r_{rb} | −0.949 | Very large |
| r_W | −0.852 | Very large |
| CL | 91.7% favour A | Protocol A clearly superior |

APA write-up: "Due to the ordinal nature of the NRS pain scale and the presence of tied differences, a Wilcoxon Signed-Rank Test was conducted. Three pairs with equal ratings were excluded, leaving n' = 12 pairs. Protocol A (Mdn = 5.0) produced significantly lower pain ratings than Protocol B (Mdn = 6.0), W^+ = 2, p = .004 (exact), r_{rb} = -0.95 [95% CI: −0.99, −0.73]. The Hodges-Lehmann estimate indicated that Protocol A reduced pain by a median of 2.0 NRS points compared to Protocol B [95% CI: 1.0, 3.0]. This represents a very large effect, with Protocol A producing lower pain in 11 of 12 patients with non-zero differences."


Example 3: One-Sample Wilcoxon — Daily Step Counts vs. Health Guideline

A public health researcher tests whether median daily step counts in a sample of n = 18 office workers differ from the recommended health guideline of 10,000 steps per day. The distribution of step counts is right-skewed (Shapiro-Wilk p = .031); the one-sample Wilcoxon Signed-Rank Test is used.

Data (daily steps, thousands):

x_i: 6.2, 8.4, 11.3, 7.1, 9.8, 5.6, 12.4, 8.9, 7.3, 10.1, 6.8, 9.4, 13.2, 7.6, 8.1, 11.8, 6.4, 9.2

Null hypothesis: \theta_0 = 10.0 thousand steps (health guideline)

H_0: \theta = 10.0 vs. H_1: \theta \neq 10.0

Differences from guideline: d_i = x_i - 10.0:

-3.8, -1.6, 1.3, -2.9, -0.2, -4.4, 2.4, -1.1, -2.7, 0.1, -3.2, -0.6, 3.2, -2.4, -1.9, 1.8, -3.6, -0.8

Step 1 — No zero differences: n' = 18; n^0 = 0.

n^+ = 5 (values: 1.3, 2.4, 0.1, 3.2, 1.8); n^- = 13.

Step 2 — Rank absolute differences:

Sorted |d_i|: 0.1, 0.2, 0.6, 0.8, 1.1, 1.3, 1.6, 1.8, 1.9, 2.4, 2.4, 2.7, 2.9, 3.2, 3.2, 3.6, 3.8, 4.4

Ranks 1–18 assigned (midranks for tied values 2.4 and 3.2):

| \vert d_i \vert | R_i | Sign | Signed rank |
|---|---|---|---|
| 0.1 | 1 | + | +1 |
| 0.2 | 2 | − | −2 |
| 0.6 | 3 | − | −3 |
| 0.8 | 4 | − | −4 |
| 1.1 | 5 | − | −5 |
| 1.3 | 6 | + | +6 |
| 1.6 | 7 | − | −7 |
| 1.8 | 8 | + | +8 |
| 1.9 | 9 | − | −9 |
| 2.4 | 10.5 | + | +10.5 |
| 2.4 | 10.5 | − | −10.5 |
| 2.7 | 12 | − | −12 |
| 2.9 | 13 | − | −13 |
| 3.2 | 14.5 | + | +14.5 |
| 3.2 | 14.5 | − | −14.5 |
| 3.6 | 16 | − | −16 |
| 3.8 | 17 | − | −17 |
| 4.4 | 18 | − | −18 |

Step 3 — Rank sums:

W^+ = 1+6+8+10.5+14.5 = 40.0

W^- = 2+3+4+5+7+9+10.5+12+13+14.5+16+17+18 = 131.0

Check: 40.0+131.0 = 171 = 18 \times 19/2

W = \min(40.0, 131.0) = 40.0

Step 4 — Normal approximation (with tie correction; n' = 18 > 14, use asymptotic):

E[W^+] = 18 \times 19/4 = 85.5

Ties: two pairs of ties (2.4 twice, 3.2 twice): \sum_k(t_k^3-t_k) = (2^3-2)+(2^3-2) = 6+6 = 12

\text{Var}_{corrected} = \frac{18 \times 19 \times 37}{24} - \frac{12}{48} = 527.25 - 0.25 = 527.00

z = (40.0 - 85.5)/\sqrt{527.00} = -45.5/22.956 = -1.982

With continuity correction: z_{cc} = (|40.0-85.5|-0.5)/22.956 = 45.0/22.956 = 1.960

p = 2 \times [1-\Phi(1.960)] = 2 \times 0.025 = .050

(Marginal; exact p-value from DataStatPro: p = .047)
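As a cross-check, the same analysis can be run with SciPy's `wilcoxon` (a sketch; the differences are rounded to one decimal so that the tied |d_i| values at 2.4 and 3.2 remain exactly tied in floating point):

```python
import numpy as np
from scipy.stats import wilcoxon

steps = np.array([6.2, 8.4, 11.3, 7.1, 9.8, 5.6, 12.4, 8.9, 7.3,
                  10.1, 6.8, 9.4, 13.2, 7.6, 8.1, 11.8, 6.4, 9.2])
# Round so floating-point noise does not break the ties in |d_i|
d = np.round(steps - 10.0, 1)

# With ties present, SciPy falls back to the normal approximation;
# correction=True applies the continuity correction used above.
res = wilcoxon(d, correction=True)
```

Here `res.statistic` is the smaller rank sum, min(W^+, W^-) = 40.0, and the two-sided p-value lands very close to the hand-computed .050.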

Step 5 — Hodges-Lehmann estimate and CI:

\hat{\theta} = -1.75 thousand steps (estimated median difference from 10,000)

Population pseudo-median: 10.0 - 1.75 = 8.25 thousand steps/day

95% CI for pseudo-median: [-3.50, -0.05] thousand steps from guideline

Step 6 — Effect sizes:

r_{rb} = (40-131)/171 = -91/171 = -0.532 (large effect — below guideline)

r_W = -1.982/\sqrt{18} = -1.982/4.243 = -0.467

Summary:

| Statistic | Value |
|---|---|
| Sample median steps | 8.65k |
| Guideline | 10.0k |
| n' | 18 |
| n^+ / n^- | 5 / 13 |
| W^+ | 40.0 |
| W^- | 131.0 |
| z | −1.982 |
| p (exact, two-tailed) | .047 |
| HL estimate (from guideline) | −1.75k steps |
| 95% CI | [−3.50, −0.05]k steps |
| r_{rb} | −0.532 (Large) |

APA write-up: "A one-sample Wilcoxon Signed-Rank Test was used to examine whether median daily step counts differed from the 10,000-step health guideline, as step counts were right-skewed (Shapiro-Wilk W = 0.87, p = .031). The sample median of 8,650 steps was significantly below the guideline, W^+ = 40, p = .047 (exact), r_{rb} = -0.53 [95% CI: −0.78, −0.09]. The Hodges-Lehmann estimate indicated that office workers fell short of the guideline by a median of 1,750 steps/day [95% CI: 50 to 3,500 steps below], a large effect."


Example 4: Comparing Two Teaching Methods — Non-Significant Result

A teacher compares student performance on matched reading comprehension tests under two instructional methods: silent reading vs. guided discussion, in n = 10 students. Test scores range 0–100.

Data:

| i | Silent (x_{1i}) | Discussion (x_{2i}) | d_i |
|---|---|---|---|
| 1 | 72 | 75 | −3 |
| 2 | 68 | 71 | −3 |
| 3 | 81 | 78 | 3 |
| 4 | 65 | 68 | −3 |
| 5 | 77 | 73 | 4 |
| 6 | 70 | 72 | −2 |
| 7 | 75 | 75 | 0 |
| 8 | 83 | 80 | 3 |
| 9 | 69 | 74 | −5 |
| 10 | 74 | 76 | −2 |

Step 1 — Zero differences: Participant 7: d_7 = 0 → n^0 = 1; n' = 9.

Non-zero d_i: -3, -3, 3, -3, 4, -2, 3, -5, -2

n^+ = 3 (+3, +4, +3); n^- = 6 (-3, -3, -3, -2, -5, -2).

Step 2 — Rank absolute differences:

Sorted |d_i|: 2, 2, 3, 3, 3, 3, 3, 4, 5

| \vert d_i \vert | Count | Avg rank | Sign assignments |
|---|---|---|---|
| 2 | 2 | 1.5 | Both −: −1.5, −1.5 |
| 3 | 5 | 5.0 | Three −, two +: −5.0 (×3), +5.0 (×2) |
| 4 | 1 | 8.0 | One +: +8.0 |
| 5 | 1 | 9.0 | One −: −9.0 |

Step 3 — Rank sums:

W^+ = 5.0+8.0+5.0 = 18.0

W^- = 1.5+1.5+5.0+5.0+5.0+9.0 = 27.0

Check: 18.0+27.0 = 45 = 9 \times 10/2

W = \min(18.0, 27.0) = 18.0

Step 4 — Exact p-value (n' = 9):

Because W^+ = 18 is close to its expected value of 22.5, the two-tailed p-value is large. Using symmetry, p = 2 \times \min[P(W^+ \leq 18), P(W^+ \geq 27)]; from exact tables P(W^+ \leq 18) \approx 0.244 (one-tail), giving p \approx 2 \times 0.244 = .488.

Exact computation with midranks: p = .490 (DataStatPro).

Step 5 — Effect sizes:

r_{rb} = (18-27)/45 = -9/45 = -0.200 (small effect, discussion slightly better)

z_{cc} = (|18-22.5|-0.5)/\sqrt{9\times10\times19/24 - [(2^3-2)+(5^3-5)]/48}

= (4.5-0.5)/\sqrt{71.25-126/48} = 4.0/\sqrt{71.25-2.625} = 4.0/\sqrt{68.625} = 4.0/8.284 = 0.483

r_W = -0.483/\sqrt{9} = -0.483/3 = -0.161

Hodges-Lehmann estimate: \hat{\theta} = -1.0 points

95% CI for \hat{\theta} (exact): [-4.0, 2.5] points (includes 0)

Summary:

| Statistic | Value | Interpretation |
|---|---|---|
| Silent median | 73.0 pts | |
| Discussion median | 74.5 pts | |
| n' | 9 | 1 zero excluded |
| n^+ / n^- / n^0 | 3 / 6 / 1 | |
| W^+ | 18.0 | |
| W^- | 27.0 | |
| p (exact, two-tailed) | .490 | Not significant |
| HL estimate | −1.0 pts | Discussion slightly higher |
| 95% CI for \hat{\theta} | [−4.0, 2.5] pts | Includes 0 |
| r_{rb} | −0.200 | Small effect |

APA write-up: "A Wilcoxon Signed-Rank Test was conducted to compare comprehension scores under silent reading and guided discussion. One pair with identical scores was excluded (n' = 9). There was no significant difference between silent reading (Mdn = 73.0) and guided discussion (Mdn = 74.5), W^+ = 18, p = .490 (exact), r_{rb} = -0.20 [95% CI: −0.69, 0.41]. The Hodges-Lehmann estimate of the median difference was −1.0 points [95% CI: −4.0, 2.5], indicating a small, non-significant advantage for guided discussion. Given the small sample size (n' = 9), this study had limited power to detect small effects (minimum detectable r_{rb} \approx 0.43 at 80% power)."


13. Common Mistakes and How to Avoid Them

Mistake 1: Using the Wilcoxon Signed-Rank Test When the Sign Test is More Appropriate

Problem: Applying the Wilcoxon Signed-Rank Test to data where differences can be assessed for direction but not for meaningful magnitude — for example, nominal categories coded as 0/1, or extremely coarse ordinal data with only a few categories. The Wilcoxon test requires that ranking the absolute differences is meaningful; if it is not, the test is invalid.

Solution: When only the direction of change is known (positive or negative), use the Sign Test. When differences can be meaningfully ranked, use the Wilcoxon test. Examine whether the concept of "a difference of 4 being larger than a difference of 2" makes sense for your measurement scale.


Mistake 2: Ignoring the Symmetry Assumption

Problem: Applying the Wilcoxon Signed-Rank Test without checking whether the difference scores are approximately symmetrically distributed about zero. The test assumes symmetry — without it, the p-value conflates location and shape effects. For instance, with right-skewed positive differences, even a true null hypothesis can be rejected because the large positive outliers inflate W^+.

Solution: Always plot a histogram of the difference scores d_i and assess symmetry visually. Compute skewness (|z_{skew}| < 2). If differences are severely asymmetric, use the Sign Test or a bootstrap-based test instead.


Mistake 3: Not Reporting the Effective Sample Size nn' and the Number of Zeros

Problem: Reporting n = 20 pairs but not mentioning that 5 pairs had d_i = 0 and were excluded, leaving n' = 15 for the analysis. Readers cannot evaluate the precision of the estimate or compare it to power requirements without knowing n'.

Solution: Always report n (total pairs), n^0 (zero differences excluded), n^+, n^-, and n' (effective sample size). State explicitly: "n^0 = 5 pairs with zero difference were excluded from the analysis, leaving n' = 15."


Mistake 4: Reporting Only the p-value Without an Effect Size

Problem: Reporting W^+ = 45, p = .032 without any effect size measure. The Wilcoxon test statistic W^+ is not interpretable without knowing n', and the p-value conveys nothing about the magnitude of the effect.

Solution: Always report the rank-biserial correlation r_{rb} (or r_W) with its 95% CI, and the Hodges-Lehmann estimate \hat{\theta} with its 95% CI. These together convey effect magnitude in both standardised and original-units terms.


Mistake 5: Using the Paired t-Test When the Wilcoxon Test is Clearly Needed

Problem: Observing highly non-normal difference scores with extreme outliers in a small sample (n < 20) and proceeding with the paired t-test because "it's the standard test." The t-test's p-value may be seriously distorted by even a single extreme outlier in small samples.

Solution: Implement a pre-analysis normality check on difference scores (Shapiro-Wilk). If p_{SW} < .05 and n < 30, use the Wilcoxon test as the primary analysis. Run the paired t-test as a sensitivity check and report both results with an explanation of why they may differ.


Mistake 6: Treating a Non-Significant Wilcoxon Result as Evidence of No Difference

Problem: Reporting W^+ = 32, p = .12 and concluding "the two conditions do not differ." As with all hypothesis tests, a non-significant result only indicates insufficient evidence to reject H_0 — it does not establish equivalence or absence of an effect.

Solution: Report the Hodges-Lehmann estimate and its 95% CI. If the CI is wide, note that the study is underpowered and a meaningful effect may exist but be undetected. For claims of equivalence, use a formal equivalence test (e.g., TOST on the Hodges-Lehmann estimator) with pre-specified bounds.


Mistake 7: Mis-Reporting the Test Statistic

Problem: Confusion about which statistic to report. Different software uses different conventions: some report W+W^+ (sum of positive ranks), some report WW^-, some report W=min(W+,W)W = \min(W^+, W^-), some report a zz-statistic, and some report TT (an older notation equivalent to WW). Reporting "W=12W = 12" without specifying that this is the minimum (vs. W+=12W^+ = 12) is ambiguous.

Solution: Clearly specify what was reported. DataStatPro reports $W^+$, $W^-$, and $W = \min(W^+, W^-)$ separately, and the auto-generated APA paragraph uses the convention $W^+$ = [value] to avoid ambiguity. When reporting, specify: "$W^+ = 45$ (sum of positive ranks)" or "$T = 12$ (Wilcoxon signed-rank statistic, minimum of $W^+$ and $W^-$)".


Mistake 8: Applying the Test to Data With Too Many Ties Without Addressing Them

Problem: Using Likert scale data where many participants show no change ($d_i = 0$) and many show changes of exactly 1 point (massive ties in $\lvert d_i \rvert$). Running the standard Wilcoxon test without the tie correction produces inaccurate p-values, and excluding a large proportion of zeros severely reduces power.

Solution: Report the proportion of zero differences ($n^0/n$). Apply the tie correction to the variance (DataStatPro does this automatically). Consider using Pratt's method when zero differences are informative. If more than 30% of differences are zero or tied, acknowledge the limitation and consider whether the Sign Test or a permutation test is more appropriate.
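Pratt's method can be sketched in a few lines of plain Python (illustrative only — `pratt_statistics` is not the DataStatPro implementation): zero differences take part in the midranking of $\lvert d_i \rvert$, pushing the ranks of the non-zero differences up, but their own ranks are then discarded from both rank sums.

```python
def pratt_statistics(d):
    """Sketch of Pratt's zero-handling for the signed-rank statistics.

    Zeros participate in the midranking of |d_i|, but W+ and W- only
    sum the ranks belonging to non-zero differences.
    """
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the block of tied |d_i| values starting at position i.
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # midrank, 1-based
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    return w_plus, w_minus
```

With `d = [0, 1, -2, 3]`, the zero occupies rank 1, so the non-zero differences receive ranks 2–4 and $(W^+, W^-) = (6, 3)$; Wilcoxon's default exclusion would instead rank the three non-zero differences 1–3, giving $(4, 2)$.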


Mistake 9: Comparing $r_{rb}$ from the Wilcoxon Test Directly with Cohen's $d$ from a t-Test

Problem: Reporting $r_{rb} = 0.40$ from a Wilcoxon test alongside Cohen's $d = 0.60$ from a paired t-test on different but related data, and treating them as equivalent effect sizes. $r_{rb}$ and $d$ are on different scales and are only approximately related through conversion formulas.

Solution: Use the conversion formula $d \approx 2r_{rb}/\sqrt{1-r_{rb}^2}$ for rough comparison, and clearly note the approximation. For direct comparison, use the same effect size metric (e.g., convert both to $r$) and acknowledge that the Wilcoxon-based $r_{rb}$ and the t-test-based $d$ measure slightly different aspects of the effect (rank-based vs. mean-based).


Mistake 10: Using the Asymptotic Test When the Exact Test is Available

Problem: With $n' = 12$ non-zero differences and few ties, using the normal approximation to get $p = .048$ when the exact test gives $p = .067$ — and reporting the asymptotic result to achieve significance.

Solution: Always use the exact p-value when $n' \leq 25$ and ties are few. DataStatPro automatically selects the exact test for small samples. Never choose a p-value method post-hoc based on which gives a more favourable result. Pre-specify the method (exact vs. asymptotic) before analysis.
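The exact-vs-asymptotic gap behind Mistake 10 is easy to check directly. The sketch below (plain Python, function names illustrative, no ties assumed) builds the exact null distribution of $W^+$ by dynamic programming and compares it with the continuity-corrected normal approximation from the cheat sheet; the two-sided exact p-value here doubles the smaller tail, which is one common convention among several.

```python
from math import erf, sqrt

def exact_wplus_pvalue(w_plus, n, two_sided=True):
    """Exact p-value for W+ with n non-zero differences, assuming no ties.

    Under H0 each rank 1..n joins W+ independently with probability 1/2,
    so the null distribution of W+ can be built by dynamic programming.
    """
    max_w = n * (n + 1) // 2
    counts = [0] * (max_w + 1)       # counts[w] = #rank subsets summing to w
    counts[0] = 1
    for rank in range(1, n + 1):
        new = counts[:]
        for w in range(max_w - rank + 1):
            if counts[w]:
                new[w + rank] += counts[w]
        counts = new
    total = 2 ** n
    lower = sum(counts[: w_plus + 1]) / total    # P(W+ <= w_plus)
    upper = sum(counts[w_plus:]) / total         # P(W+ >= w_plus)
    if not two_sided:
        return min(lower, upper)
    return min(1.0, 2 * min(lower, upper))       # double the smaller tail

def asymptotic_wplus_pvalue(w_plus, n):
    """Normal approximation with continuity correction (no tie correction)."""
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    z = (abs(w_plus - mean) - 0.5) / sqrt(var)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))   # 2 * [1 - Phi(|z|)]
```

For $n' = 12$ the two methods track each other closely across most of the range but can land on opposite sides of a conventional cutoff, which is exactly the hazard Mistake 10 describes.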


14. Troubleshooting

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| $W^+ + W^- \neq n'(n'+1)/2$ | Arithmetic error in ranking or rank-sum computation | Recheck ranks, midranks, and sums; verify $n'$ (zeros excluded) |
| $W^+ = n'(n'+1)/2$ and $W^- = 0$ | All differences are positive (or all negative) | Verify data; if genuine, $p = 2 \times (0.5)^{n'}$ (no ties); compute exact p-value |
| Exact and asymptotic p-values diverge substantially | Small $n'$ (asymptotic unreliable) or many ties | Use exact p-value for $n' \leq 25$; use permutation test if many ties |
| Many zero differences ($n^0 > n/4$) | Coarse measurement scale; many participants show no change | Report $n^0$ explicitly; consider Pratt's method; consider Sign Test; note reduced power |
| Wilcoxon significant but paired t-test not significant | Outliers in differences inflating $s_d$ (t-test); Wilcoxon more robust | Inspect difference distribution; if outliers present, Wilcoxon result is more reliable |
| Paired t-test significant but Wilcoxon not significant | Small $n'$ after zero exclusion; t-test uses the mean, which is influenced by extreme values | Inspect closely; if differences are symmetric and normal, t-test is appropriate; if not, Wilcoxon |
| $r_{rb} = \pm 1.0$ | All non-zero differences have the same sign | Perfect effect in the data; report with a note that all participants changed in the same direction |
| 95% CI for $\hat{\theta}$ is very wide | Small $n'$ | Report the wide CI; increase sample size; note low precision |
| Hodges-Lehmann estimate differs substantially from mean difference | Presence of outliers or skewness in $d_i$ | Both are valid but measure different things; HL is the natural companion to the Wilcoxon test |
| Skewness check suggests asymmetric differences | Data do not meet the symmetry assumption | Use Sign Test; report both tests; use bootstrap p-value |
| Software reports negative $W^+$ or $W^-$ | Software error or sign-convention confusion | Check software documentation; both $W^+$ and $W^-$ are non-negative by definition |
| Tie correction produces a negative variance | Extreme number of ties; something is wrong | Check data for coding errors; with excessive ties, use permutation test |
| One-sample version gives different result from paired version for same data | Check how differences were defined | One-sample tests $x_i$ against $\theta_0$; paired tests $d_i = x_{1i}-x_{2i}$ against 0; equivalent if $d_i = x_i - \theta_0$ |
| Power is very low despite significant result | Sample size is small; significance is due to an extreme effect size, not adequate power | Report a sensitivity analysis; note that future replications need larger samples |
| Cannot determine $\hat{\theta}$ without raw data | Only summary statistics available | $\hat{\theta}$ requires all $d_i$ values; request data or report only $r_{rb}$ |

15. Quick Reference Cheat Sheet

Core Formulas

| Formula | Description |
| --- | --- |
| $d_i = x_{1i} - x_{2i}$ | Difference score for pair $i$ |
| $n' = n - n^0$ | Effective sample size (excluding zeros) |
| $W^+ = \sum_{\{d_i>0\}} R_i$ | Sum of positive ranks |
| $W^- = \sum_{\{d_i<0\}} R_i$ | Sum of negative ranks |
| $W^+ + W^- = n'(n'+1)/2$ | Verification check |
| $W = \min(W^+, W^-)$ | Wilcoxon test statistic |
| $E[W^+] = n'(n'+1)/4$ | Expected $W^+$ under $H_0$ |
| $\text{Var}[W^+] = n'(n'+1)(2n'+1)/24$ | Variance of $W^+$ (no ties) |
| $\text{Var}_{\text{corrected}}[W^+] = n'(n'+1)(2n'+1)/24 - \sum_k (t_k^3 - t_k)/48$ | Variance with tie correction |
| $z = (W^+ - E[W^+])/\sqrt{\text{Var}_{\text{corrected}}[W^+]}$ | z-statistic |
| $z_{cc} = (\lvert W^+ - E[W^+]\rvert - 0.5)/\sqrt{\text{Var}_{\text{corrected}}}$ | z with continuity correction |
| $p = 2\times[1-\Phi(\lvert z\rvert)]$ | Two-tailed p-value |
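The core formulas can be walked through in a minimal plain-Python sketch (the function name `signed_rank_stats` is illustrative, not a DataStatPro API): zero differences are excluded, tied $\lvert d_i \rvert$ receive midranks, and the tie correction is subtracted from the variance before computing $z$.

```python
from collections import Counter
from math import sqrt

def signed_rank_stats(x1, x2):
    """Compute W+, W-, n', and the tie-corrected z for paired samples.

    Follows the cheat-sheet formulas: zeros excluded (Wilcoxon's rule),
    midranks for tied |d_i|, and sum_k (t_k^3 - t_k)/48 subtracted
    from the no-ties variance n'(n'+1)(2n'+1)/24.
    """
    d = [a - b for a, b in zip(x1, x2) if a - b != 0]   # n' = n - n^0
    n = len(d)
    # Midranks of |d_i|: average the 1-based ranks within tied groups.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    ties = Counter(abs(v) for v in d)
    var = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in ties.values()) / 48
    z = (w_plus - n * (n + 1) / 4) / sqrt(var)
    return w_plus, w_minus, n, z
```

For example, `signed_rank_stats([5, 6, 7, 8], [1, 2, 3, 4])` has four tied differences of 4, each receiving midrank 2.5, so $W^+ = 10$, $W^- = 0$, and $z = 2.0$; the verification check $W^+ + W^- = n'(n'+1)/2$ holds by construction.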

Effect Size Formulas

| Formula | Description |
| --- | --- |
| $r_{rb} = (W^+ - W^-)/(n'(n'+1)/2)$ | Matched-pairs rank-biserial correlation |
| $r_{rb} = 1 - 4W^-/(n'(n'+1))$ | Alternative formula for $r_{rb}$ |
| $r_W = z/\sqrt{n'}$ | Effect size from z-statistic |
| $\hat{\theta} = \text{Median}\{(d_i+d_j)/2 : i \leq j\}$ | Hodges-Lehmann pseudo-median |
| $\widehat{CL} = n^+/n'$ | Common Language Effect Size (simple) |
| $d \approx 2r_{rb}/\sqrt{1-r_{rb}^2}$ | Convert $r_{rb}$ to Cohen's $d$ (approx.) |
| $r_{rb} \approx d/\sqrt{d^2+4}$ | Convert Cohen's $d$ to $r_{rb}$ (approx.) |
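A small sketch tying the first five rows together (function name and argument order are illustrative; inputs are the quantities a Wilcoxon analysis reports). The internal check confirms that the two $r_{rb}$ formulas agree whenever $W^+ + W^- = n'(n'+1)/2$.

```python
from math import sqrt

def effect_sizes(w_plus, w_minus, z, n_pos, n_prime):
    """Effect sizes from the cheat-sheet formulas.

    Assumes zeros were already excluded, so W+ + W- = n'(n'+1)/2.
    """
    s = n_prime * (n_prime + 1) / 2
    r_rb = (w_plus - w_minus) / s              # matched-pairs rank-biserial
    # Cross-check against the alternative formula 1 - 4W-/(n'(n'+1)):
    assert abs(r_rb - (1 - 4 * w_minus / (n_prime * (n_prime + 1)))) < 1e-9
    r_w = z / sqrt(n_prime)                    # z-based effect size
    cl = n_pos / n_prime                       # common-language effect size
    return r_rb, r_w, cl
```

For instance, `effect_sizes(45, 10, 2.0, 8, 10)` gives $r_{rb} = 35/55 \approx .64$, $r_W = 2.0/\sqrt{10} \approx .63$, and $\widehat{CL} = .80$.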

Walsh Averages for Hodges-Lehmann CI

| $n'$ pairs | $M = n'(n'+1)/2$ Walsh averages |
| --- | --- |
| 5 | 15 |
| 10 | 55 |
| 15 | 120 |
| 20 | 210 |
| 25 | 325 |
| 30 | 465 |
| 50 | 1,275 |
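The $M$ values above are simply the number of pairwise Walsh averages that $n'$ differences generate, and the Hodges-Lehmann estimate is the median of that set. A minimal sketch (function names illustrative):

```python
from statistics import median

def walsh_averages(d):
    """All M = n(n+1)/2 pairwise Walsh averages (d_i + d_j)/2, i <= j."""
    return [(d[i] + d[j]) / 2 for i in range(len(d)) for j in range(i, len(d))]

def hodges_lehmann(d):
    """Hodges-Lehmann pseudo-median: the median of the Walsh averages."""
    return median(walsh_averages(d))
```

With $n' = 10$ this yields 55 averages, matching the table; the confidence interval for $\hat{\theta}$ is then read off the ordered Walsh averages.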

Cohen's Benchmarks for $r_{rb}$ and $r_W$

| $\vert r_{rb} \vert$ or $\vert r_W \vert$ | Label | Approx. $\vert d_z \vert$ equiv. |
| :--- | :--- | :--- |
| $< 0.10$ | Negligible | $< 0.20$ |
| $0.10 - 0.29$ | Small | $0.20 - 0.61$ |
| $0.30 - 0.49$ | Medium | $0.62 - 1.13$ |
| $\geq 0.50$ | Large | $\geq 1.15$ |
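These cutoffs can be encoded directly when labelling results in bulk; `rrb_label` is a hypothetical helper, not part of DataStatPro:

```python
def rrb_label(r):
    """Map |r_rb| (or |r_W|) to the conventional benchmark label."""
    r = abs(r)
    if r < 0.10:
        return "negligible"
    if r < 0.30:
        return "small"
    if r < 0.50:
        return "medium"
    return "large"
```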

ARE Comparison: Wilcoxon vs. Paired t-Test

| Data Distribution | ARE | Interpretation |
| --- | --- | --- |
| Normal | $3/\pi \approx 0.955$ | Wilcoxon needs ≈5% more pairs |
| Uniform | $1.000$ | Identical efficiency |
| Logistic | $\pi^2/9 \approx 1.097$ | Wilcoxon needs ≈9% fewer pairs |
| Laplace | $1.500$ | Wilcoxon needs ≈33% fewer pairs |
| Contaminated normal | $> 1.500$ | Wilcoxon substantially more powerful |

Required $n'$ for 90% Power (Two-Tailed $\alpha = .05$, Normal Data)

| $d_z$ equivalent | $r_{rb}$ (approx.) | $n'$ Wilcoxon | $n$ Paired t | Overhead |
| --- | --- | --- | --- | --- |
| 0.20 | 0.10 | 277 | 264 | +5% |
| 0.30 | 0.15 | 125 | 119 | +5% |
| 0.50 | 0.24 | 46 | 44 | +5% |
| 0.80 | 0.37 | 19 | 18 | +6% |
| 1.00 | 0.45 | 14 | 13 | +8% |
| 1.20 | 0.51 | 10 | 9 | +11% |
| 1.50 | 0.60 | 7 | 7 | ≈0% |

Zero and Tie Handling Reference

| Situation | Method | Notes |
| --- | --- | --- |
| $d_i = 0$ (default) | Wilcoxon: exclude | $n' = n - n^0$; report $n^0$ |
| $d_i = 0$ (alternative) | Pratt: include in ranking | Can affect p-value; use when zeros are informative |
| Tied $\lvert d_i \rvert$ | Midranks with tie-corrected variance | Applied automatically by DataStatPro |
| Many ties | Permutation test | Exact handling regardless of tie structure |
| $n' \leq 25$, few ties | Exact p-value | Always preferred |
| $n' > 25$ | Asymptotic + continuity correction | Accurate for most situations |

Test Selection Guide

Two related conditions, continuous or ordinal DV?
├── Are difference scores normally distributed?
│   (Check: Shapiro-Wilk on d_i, Q-Q plot)
│   ├── YES and n ≥ 15 → Paired t-test (more power)
│   │   (Report Wilcoxon as sensitivity check if desired)
│   └── NO, or n < 30 and Shapiro-Wilk p < .05
│       └── Are differences rankable (magnitudes meaningful)?
│           ├── YES → Wilcoxon Signed-Rank Test ✅
│           │   ├── Are differences symmetric?
│           │   │   ├── YES → Standard Wilcoxon ✅
│           │   │   └── NO → Sign Test or Bootstrap
│           │   └── Many zeros? → Consider Pratt's method
│           └── NO (only direction known) → Sign Test
└── Three or more conditions → Friedman Test
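The selection tree above can be mirrored as a small helper (a sketch only — `choose_test` and its boolean inputs are illustrative, with each flag assumed to come from the check the tree names, e.g. Shapiro-Wilk for `normal_diffs`):

```python
def choose_test(n_conditions, normal_diffs, n, rankable, symmetric):
    """Return the recommended test per the selection tree."""
    if n_conditions >= 3:
        return "Friedman Test"
    if normal_diffs and n >= 15:
        return "Paired t-test"            # more power when assumptions hold
    if not rankable:
        return "Sign Test"                # only direction of change known
    if not symmetric:
        return "Sign Test or Bootstrap"   # symmetry assumption violated
    return "Wilcoxon Signed-Rank Test"
```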

Comparison: Wilcoxon vs. Sign Test vs. Paired t-Test

| Property | Paired t-Test | Wilcoxon Signed-Rank | Sign Test |
| --- | --- | --- | --- |
| Uses magnitude of differences | ✅ Full | ✅ Ranks | ❌ No |
| Assumes normality | ✅ Yes | ❌ No | ❌ No |
| Assumes symmetry | (via normality) | ✅ Yes | ❌ No |
| ARE vs. t-test | 1.000 | 0.955 | 0.637 |
| Robust to outliers | ❌ Low | ✅ High | ✅ Very high |
| Handles ordinal DV | ❌ No | ✅ Yes | ✅ Yes |
| Effect size | Cohen's $d_z$ | $r_{rb}$, $r_W$ | $p^+$, $P(d>0)$ |
| Point estimate | $\bar{d}$ | Hodges-Lehmann $\hat{\theta}$ | Median |

APA 7th Edition Reporting Templates

Standard significant result:

"Due to [non-normal difference scores / ordinal measurement scale] (Shapiro-Wilk $W =$ [value], $p =$ [value]), a Wilcoxon Signed-Rank Test was conducted. [Condition 1] (Mdn = [value]) [was / was not] significantly [higher / lower] than [Condition 2] (Mdn = [value]), $W^+ =$ [value], $z =$ [value], $p =$ [value] [(exact)/(asymptotic)]. The Hodges-Lehmann estimate of the median difference was [value] [units] [95% CI: LB, UB], $r_{rb} =$ [value] [95% CI: LB, UB], indicating a [small / medium / large] effect. [$n^0 =$ [value] pairs with zero difference were excluded, leaving $n' =$ [value] pairs for analysis.]"

Non-significant result:

"A Wilcoxon Signed-Rank Test revealed no significant difference between [Condition 1] (Mdn = [value]) and [Condition 2] (Mdn = [value]), $W^+ =$ [value], $p =$ [value], $r_{rb} =$ [value] [95% CI: LB, UB]. The Hodges-Lehmann estimate was [value] [95% CI: LB, UB], indicating a [small / negligible] effect that the study was insufficiently powered to detect (minimum detectable $r_{rb} \approx$ [value] at 80% power)."

One-sample version:

"A one-sample Wilcoxon Signed-Rank Test was conducted to examine whether the population pseudo-median of [DV] differed from [θ₀]. The sample median of [value] [was / was not] significantly different from [θ₀], $W^+ =$ [value], $p =$ [value], $r_{rb} =$ [value]. The Hodges-Lehmann estimate was [value] [units] from the null value [95% CI: LB, UB]."

Wilcoxon Signed-Rank Test Reporting Checklist

| Item | Required |
| --- | --- |
| Statement of why Wilcoxon was used (non-normality, ordinal, outliers) | ✅ Always |
| Median for each condition | ✅ Always |
| $W^+$ (and/or $W^-$ or $W = \min$) — specify which | ✅ Always |
| z-statistic (if asymptotic) | ✅ For $n' > 25$ |
| p-value (exact or asymptotic — specify which) | ✅ Always |
| $n$ (total pairs), $n^0$ (zeros excluded), $n'$ (effective $n$) | ✅ Always |
| $n^+$ and $n^-$ (positive and negative differences) | ✅ Recommended |
| Whether Wilcoxon or Pratt method used for zeros | ✅ When $n^0 > 0$ |
| Whether exact, asymptotic, or permutation p-value used | ✅ Always |
| Tie correction applied | ✅ When ties present |
| $r_{rb}$ (primary effect size) with 95% CI | ✅ Always |
| $r_W$ alongside $r_{rb}$ | ✅ Recommended |
| Hodges-Lehmann estimate $\hat{\theta}$ with 95% CI | ✅ Always |
| Symmetry check on difference scores | ✅ When $n < 50$ |
| Comparison with paired t-test result (sensitivity) | ✅ Recommended |
| Power or sensitivity analysis | ✅ For null results |
| Domain-specific benchmark context for $r_{rb}$ | ✅ Recommended |

Conversion Formulas: Wilcoxon ↔ Other Metrics

| From | To | Formula |
| --- | --- | --- |
| $W^+$, $n'$ | $r_{rb}$ | $r_{rb} = (2W^+ - n'(n'+1)/2) / (n'(n'+1)/2)$ |
| $z$, $n'$ | $r_W$ | $r_W = z/\sqrt{n'}$ |
| $r_{rb}$ | Cohen's $d$ (approx.) | $d \approx 2r_{rb}/\sqrt{1-r_{rb}^2}$ |
| Cohen's $d$ | $r_{rb}$ (approx.) | $r_{rb} \approx d/\sqrt{d^2+4}$ |
| $r_W$ | Cohen's $d$ | $d \approx 2r_W/\sqrt{1-r_W^2}$ |
| $r_{rb}$ | $P(d_i > 0)$ (approx.) | $P = (1+r_{rb})/2$ |
| $n_{t\text{-test}}$ | $n'_{\text{Wilcoxon}}$ (normal data) | $n'_{W} \approx n_t \times \pi/3 \approx 1.047 \times n_t$ |
| $d_z$ | Required $n'$ (80% power) | $n' \approx 8.211/d_z^2$ |
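These conversions translate directly into code. A minimal sketch (function names illustrative; the last two functions implement the table's $\pi/3$ inflation under normal data and the approximate 80%-power formula, both stated above):

```python
from math import ceil, pi, sqrt

def rrb_to_d(r_rb):
    """Approximate conversion r_rb -> Cohen's d."""
    return 2 * r_rb / sqrt(1 - r_rb**2)

def d_to_rrb(d):
    """Approximate conversion Cohen's d -> r_rb (inverse of rrb_to_d)."""
    return d / sqrt(d**2 + 4)

def wplus_to_rrb(w_plus, n_prime):
    """Rank-biserial correlation from W+ and n' non-zero differences."""
    s = n_prime * (n_prime + 1) / 2
    return (2 * w_plus - s) / s

def wilcoxon_n_from_t(n_t):
    """Wilcoxon n' matching a paired-t sample size under normal data."""
    return round(n_t * pi / 3)

def required_n_wilcoxon(d_z):
    """Approximate n' for 80% power at two-tailed alpha = .05."""
    return ceil(8.211 / d_z**2)
```

The two approximate conversions are exact inverses of each other, so round-tripping a value of $r_{rb}$ recovers it.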

This tutorial provides a comprehensive foundation for understanding, conducting, and reporting the Wilcoxon Signed-Rank Test within the DataStatPro application. For further reading, consult Wilcoxon's original paper "Individual Comparisons by Ranking Methods" (Biometrics Bulletin, 1945); Hollander, Wolfe & Chicken's "Nonparametric Statistical Methods" (3rd ed., 2014) for rigorous mathematical treatment; Conover's "Practical Nonparametric Statistics" (3rd ed., 1999) for applied guidance; Kerby's "The Simple Difference Formula: An Approach to Teaching Nonparametric Correlation" (Comprehensive Psychology, 2014) for the matched-pairs rank-biserial correlation; Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018) for accessible applied coverage; and van Doorn et al.'s "Bayesian Inference for Kendall's Rank Correlation Coefficient" (Communications in Statistics, 2018) for the Bayesian extension. For the Hodges-Lehmann estimator and its confidence interval, see Hodges & Lehmann's "Estimates of Location Based on Rank Tests" (Annals of Mathematical Statistics, 1963). For feature requests or support, contact the DataStatPro team.