Wilcoxon Signed-Rank Test: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of non-parametric inference all the way through the mathematics, assumptions, variants, effect sizes, interpretation, reporting, and practical usage of the Wilcoxon Signed-Rank Test within the DataStatPro application. Whether you are encountering the Wilcoxon Signed-Rank Test for the first time or seeking a rigorous understanding of rank-based within-subjects comparison, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is the Wilcoxon Signed-Rank Test?
- The Mathematics Behind the Wilcoxon Signed-Rank Test
- Assumptions of the Wilcoxon Signed-Rank Test
- Variants of the Wilcoxon Signed-Rank Test
- Using the Wilcoxon Signed-Rank Test Calculator Component
- Full Step-by-Step Procedure
- Effect Sizes for the Wilcoxon Signed-Rank Test
- Confidence Intervals
- Power Analysis and Sample Size Planning
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into the Wilcoxon Signed-Rank Test, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.
1.1 Parametric vs. Non-Parametric Inference
Parametric tests (such as the paired t-test) make specific assumptions about the shape of the population distribution — typically that data are drawn from a normally distributed population. Their test statistics are derived from distributional assumptions, and their validity depends on how well those assumptions are met.
Non-parametric tests (also called distribution-free tests) do not assume a specific parametric form for the population distribution. Instead, they are based on the ranks of the data rather than the raw values themselves. Because ranks carry less information than raw values, non-parametric tests are generally less powerful than their parametric counterparts when parametric assumptions are met — but they can be more powerful when those assumptions are violated.
The Wilcoxon Signed-Rank Test is the leading non-parametric alternative to the paired t-test for comparing two related conditions when the normality of difference scores cannot be assumed.
1.2 The Concept of Ranks
Ranking transforms raw data values into their relative order positions. Given a set of values $x_1, x_2, \ldots, x_n$:
- Assign rank 1 to the smallest value, rank 2 to the next smallest, and so on.
- For tied values, assign the average rank (midrank) to all tied observations.
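Example (five values, two of them tied):
| Value (in ascending order) | Rank |
|---|---|
| Smallest | 1 |
| Second smallest (tied with the third) | 2.5 |
| Third smallest (tied with the second) | 2.5 (midrank of ranks 2 and 3) |
| Fourth smallest | 4 |
| Largest | 5 |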
Ranking discards information about the precise magnitude of differences between values (e.g., whether the gap between ranks 1 and 2 is 0.1 or 100 units) but preserves the ordinal information (which values are larger or smaller). This makes rank-based tests robust to extreme values and non-normal distributions.
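The midrank rule can be sketched in a few lines of Python. This is a minimal illustration, not DataStatPro's implementation; the function name `midranks` and the sample values are assumptions chosen for the example.

```python
def midranks(values):
    """Return ranks (1 = smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end + 2) / 2  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


# Illustrative values (not from the tutorial): the two 7s share ranks 2 and 3.
print(midranks([3, 7, 7, 12, 15]))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```

Note that replacing 15 with 1500 would leave every rank unchanged, which is the robustness to extreme values described above.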
1.3 Ordinal, Interval, and Ratio Scales
The level of measurement determines which statistical tests are appropriate:
| Scale | Properties | Examples | Appropriate Summaries |
|---|---|---|---|
| Nominal | Categories only | Gender, blood type | Mode, frequencies |
| Ordinal | Ordered categories; unequal intervals | Likert items, pain ratings, ranks | Median, percentiles |
| Interval | Equal intervals; no true zero | Temperature (°C), IQ scores | Mean, SD |
| Ratio | Equal intervals; true zero | Height, weight, reaction time | Mean, SD, ratios |
The Wilcoxon Signed-Rank Test is appropriate for ordinal data and for interval/ratio data that violate the normality assumption of the paired t-test.
1.4 The Median as a Measure of Central Tendency
The median is the value that divides the distribution into two equal halves — 50% of observations fall below it and 50% above it. Unlike the mean, the median is:
- Resistant to outliers: A single extreme value does not distort the median.
- Appropriate for skewed distributions: The median better represents the "typical" value when distributions are asymmetric.
- The natural parameter for non-parametric tests: The Wilcoxon Signed-Rank Test can be interpreted as testing whether the population pseudo-median of difference scores differs from zero (under the symmetry assumption).
The pseudo-median is the median of all pairwise averages $(d_i + d_j)/2$ for $i \le j$, including each observation paired with itself. Its sample estimate is known as the Hodges-Lehmann estimator.
1.5 Signed Ranks: Combining Magnitude and Direction
The Wilcoxon Signed-Rank Test uniquely combines two pieces of information from difference scores:
- Magnitude: How large is each difference, relative to the others? (Captured by the rank of the absolute difference.)
- Direction: Is each difference positive or negative? (Captured by the sign attached to the rank.)
By ranking absolute differences and then restoring the sign, the test gives more weight to large differences than to small ones — unlike the sign test, which ignores magnitude entirely. This is why the Wilcoxon Signed-Rank Test is more powerful than the sign test.
1.6 The Null and Alternative Hypotheses
The Wilcoxon Signed-Rank Test operates under the following hypotheses, where $d_i$ denotes a difference score and $\theta$ the population pseudo-median of the differences:
Under the symmetry assumption:
$H_0$: The population of difference scores is symmetrically distributed about zero.
$H_1$: The population of difference scores is NOT symmetrically distributed about zero.
Equivalently (under symmetry):
$$H_0: \theta = 0 \quad \text{vs.} \quad H_1: \theta \neq 0$$
Without the symmetry assumption (more general interpretation):
$$H_0: P(d_i > 0) = P(d_i < 0)$$
(The probability of a positive difference equals the probability of a negative difference.)
Directional alternatives:
$H_1: \theta > 0$ (upper one-tailed)
$H_1: \theta < 0$ (lower one-tailed)
1.7 The Asymptotic Relative Efficiency
The Asymptotic Relative Efficiency (ARE) of a non-parametric test relative to its parametric counterpart is the limiting ratio of the sample sizes the two tests need to achieve the same power as $n \to \infty$.
For the Wilcoxon Signed-Rank Test vs. the paired t-test:
$$\text{ARE} = \frac{3}{\pi} \approx 0.955$$
(for normally distributed data)
This means that for normally distributed data, the Wilcoxon test requires approximately $1/0.955 \approx 1.047$ times as many observations as the paired t-test to achieve the same power, a loss of only about 5%. In exchange for this small efficiency cost, the Wilcoxon test no longer depends on normality, although it still assumes symmetric difference scores (Section 4.1).
For non-normal distributions, the Wilcoxon test can be substantially more efficient than the t-test:
| Distribution | ARE (Wilcoxon vs. t-test) |
|---|---|
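| Normal | $3/\pi \approx 0.955$ |
| Uniform | $1.000$ |
| Double exponential (Laplace) | $1.500$ |
| Logistic | $\pi^2/9 \approx 1.097$ |
| Contaminated normal (10% outliers) | Above 1; the exact value depends on the contaminating variance |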
| Heavy-tailed distributions | Can be very large |
💡 For data that are approximately normal, using the Wilcoxon test costs you only 5% efficiency. For data with heavy tails or outliers, the Wilcoxon test can dramatically outperform the t-test. This asymmetry makes the Wilcoxon test a safe default when normality is uncertain.
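The arithmetic behind the 5% figure can be checked directly. The sketch below assumes the normal-data ARE of $3/\pi$; the helper `wilcoxon_pairs_needed` is a hypothetical name, and because the ARE is an asymptotic quantity the result is only a rough planning guide.

```python
import math

are_normal = 3 / math.pi        # ARE of Wilcoxon vs. paired t-test under normality
sample_ratio = 1 / are_normal   # relative number of pairs the Wilcoxon test needs


def wilcoxon_pairs_needed(t_test_pairs, are=are_normal):
    """Approximate pairs the Wilcoxon test needs to match a t-test with t_test_pairs."""
    return math.ceil(t_test_pairs / are)


print(round(are_normal, 3))        # 0.955
print(round(sample_ratio, 3))      # 1.047
print(wilcoxon_pairs_needed(100))  # 105
```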
1.8 Type I Error, Power, and the Role of Sample Size
- Type I error ($\alpha$): The probability of incorrectly rejecting $H_0$ when it is true. The Wilcoxon Signed-Rank Test maintains the nominal $\alpha$ regardless of the shape of the underlying distribution (for continuous, symmetric difference scores).
- Type II error ($\beta$): The probability of failing to detect a true effect.
- Power ($1 - \beta$): The probability of correctly detecting a true effect.
The Wilcoxon test achieves nearly identical power to the paired t-test for normal data and often higher power for heavy-tailed data, making it a generally safe and efficient choice for paired comparisons.
2. What is the Wilcoxon Signed-Rank Test?
2.1 The Core Idea
The Wilcoxon Signed-Rank Test (Wilcoxon, 1945) is a non-parametric inferential procedure for testing whether two related conditions (measured on the same participants or matched pairs) have the same distribution. It is the non-parametric alternative to the paired t-test when the assumption of normally distributed difference scores cannot be met.
Rather than working with raw difference scores and computing means and standard deviations (as the paired t-test does), the Wilcoxon test:
- Computes the absolute values of the difference scores, $|d_i|$.
- Ranks the absolute differences from smallest to largest.
- Restores the sign of each difference to its rank.
- Computes the sum of the positive ranks $W^+$ and the sum of the negative ranks $W^-$ as the test statistics.
- Evaluates whether $W^+$ and $W^-$ are sufficiently different from what would be expected by chance if $H_0$ were true.
Under $H_0$, positive and negative differences should be roughly equally common and roughly equally large, so $W^+$ and $W^-$ should be approximately equal (each approximately $n_r(n_r+1)/4$, where $n_r$ is the number of non-zero differences). Large discrepancies between $W^+$ and $W^-$ provide evidence against $H_0$.
2.2 When to Use the Wilcoxon Signed-Rank Test
The Wilcoxon Signed-Rank Test is the appropriate choice when:
- The DV is measured on an ordinal scale (e.g., Likert items, pain ratings, satisfaction scores) where differences may not be meaningful in interval terms.
- The DV is continuous (interval/ratio) but the difference scores are non-normally distributed and the sample size is small.
- There are extreme outliers in the difference scores that cannot be removed or explained.
- The distribution of differences is heavily skewed, making the mean a poor representation of central tendency.
- The research question concerns whether one condition tends to produce higher values than the other rather than specifically about the mean difference.
2.3 The Wilcoxon Signed-Rank Test vs. Related Procedures
| Situation | Appropriate Test |
|---|---|
| Two related conditions, differences normal | Paired t-test (preferred for power) |
| Two related conditions, differences non-normal | Wilcoxon Signed-Rank Test |
| Two related conditions, only direction of difference known | Sign Test (less powerful) |
| One group vs. known value, non-normal | Wilcoxon Signed-Rank (one-sample version) |
| Three or more related conditions, non-normal | Friedman Test |
| Two independent groups, non-normal | Mann-Whitney U Test |
| Two related conditions, Bayesian non-parametric | Bayesian Signed-Rank Test |
2.4 The Wilcoxon Signed-Rank Test vs. the Sign Test
The Wilcoxon Signed-Rank Test and the Sign Test are both non-parametric tests for paired data, but they differ in the information they use:
| Property | Wilcoxon Signed-Rank | Sign Test |
|---|---|---|
| Information used | Rank of $\lvert d_i \rvert$ and sign of $d_i$ | Sign of $d_i$ only |
| Requires rankable differences | ✅ Yes | ❌ No |
| Power | Higher | Lower |
| Robustness to outliers | High | Very high |
| ARE vs. t-test (normal data) | 0.955 | 0.637 |
| Suitable when only direction known | ❌ No | ✅ Yes |
The Wilcoxon test is preferred over the sign test in virtually all circumstances where the absolute magnitude of differences can be ranked, because it makes better use of the available information.
2.5 Two Versions: Paired and One-Sample
The Wilcoxon Signed-Rank Test has two closely related applications:
Paired version: Compare two related conditions. Compute $d_i = x_{1i} - x_{2i}$ for each pair, then apply the test to the difference scores.
One-sample version: Test whether a single sample's population median (or pseudo-median) equals a hypothesised value $\theta_0$. Compute $d_i = x_i - \theta_0$ and apply the test to these adjusted values.
Both versions are mathematically identical — they differ only in how the difference scores are constructed.
3. The Mathematics Behind the Wilcoxon Signed-Rank Test
3.1 Computing Difference Scores
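Paired version: For $n$ pairs $(x_{1i}, x_{2i})$, $i = 1, \ldots, n$:
$$d_i = x_{1i} - x_{2i}$$
One-sample version: For observations $x_1, \ldots, x_n$ tested against a hypothesised value $\theta_0$:
$$d_i = x_i - \theta_0$$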
3.2 Handling Zero Differences
Pairs where $d_i = 0$ (exactly) are excluded from the analysis because they carry no information about the direction of an effect. Let $n_r$ denote the number of non-zero differences remaining after exclusion. All subsequent steps use $n_r$.
⚠️ A large number of zero differences substantially reduces the effective sample size and thus statistical power. This is most common with coarsely measured ordinal scales (e.g., 5-point Likert items). If more than 20% of differences are zero, interpret results with particular caution and consider reporting the number of zero differences explicitly.
3.3 Ranking the Absolute Differences
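Rank the absolute values $|d_i|$ from smallest (rank 1) to largest (rank $n_r$, the number of non-zero differences).
For tied absolute values, assign the average (midrank) of the ranks they would have occupied:
$$\text{midrank} = \frac{\text{sum of the ranks the tied values would occupy}}{\text{number of tied values}}$$
If three observations are tied at the 4th, 5th, and 6th positions, each receives rank $(4 + 5 + 6)/3 = 5$.
Notation: Let $R_i$ denote the rank assigned to $|d_i|$.
3.4 Computing the Test Statistics $W^+$ and $W^-$
Restore the original sign to each rank:
$$\text{signed rank}_i = \operatorname{sign}(d_i) \cdot R_i$$
Sum of positive ranks (ranks corresponding to $d_i > 0$):
$$W^+ = \sum_{i:\, d_i > 0} R_i$$
Sum of negative ranks (ranks corresponding to $d_i < 0$):
$$W^- = \sum_{i:\, d_i < 0} R_i$$
Verification check:
$$W^+ + W^- = \frac{n_r(n_r + 1)}{2}$$
This provides an arithmetic check: if $W^+ + W^-$ does not equal $n_r(n_r + 1)/2$, there is a computational error.
Under $H_0$, the expected values are:
$$E(W^+) = E(W^-) = \frac{n_r(n_r + 1)}{4}$$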
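The steps of Sections 3.1 to 3.4 can be sketched as follows. This is an illustrative helper (the name `signed_rank_sums` and the difference scores are hypothetical), using Wilcoxon's zero-exclusion convention and midranks for ties.

```python
def signed_rank_sums(differences):
    """Return (w_plus, w_minus, n_r) using zero exclusion and midranks."""
    nonzero = [d for d in differences if d != 0]
    n_r = len(nonzero)
    order = sorted(range(n_r), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n_r
    start = 0
    while start < n_r:
        end = start
        # Group tied absolute differences into one block.
        while end + 1 < n_r and abs(nonzero[order[end + 1]]) == abs(nonzero[order[start]]):
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end + 2) / 2  # midrank of positions start+1..end+1
        start = end + 1
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(nonzero, ranks) if d < 0)
    # Arithmetic check: the two rank sums must add up to n_r(n_r + 1)/2.
    assert w_plus + w_minus == n_r * (n_r + 1) / 2
    return w_plus, w_minus, n_r


# Hypothetical difference scores; the single zero is dropped, leaving n_r = 7.
diffs = [4, -2, 6, 0, 1, -3, 5, 8]
print(signed_rank_sums(diffs))  # (23.0, 5.0, 7)
```

Here the expected value of each rank sum under the null hypothesis is $7 \times 8 / 4 = 14$, so the observed sums of 23 and 5 lean toward positive differences.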
3.5 The Test Statistic $W$
The conventional test statistic is:
$$W = \min(W^+, W^-)$$
Small values of $W$ (far from $n_r(n_r+1)/4$, where $n_r$ is the number of non-zero differences) provide evidence against $H_0$.
Alternatively, many software implementations report $W^+$ directly (or $W^-$), with the p-value computed from the appropriate tail of the sampling distribution.
DataStatPro reports both $W^+$ and $W^-$, highlights the minimum, and computes exact and asymptotic p-values.
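3.6 Exact Distribution (Small Samples)
For small samples without ties, the exact null distribution of $W^+$ can be enumerated: under $H_0$, each of the $2^{n_r}$ possible sign assignments (with $n_r$ the number of non-zero differences) is equally likely, giving a discrete distribution that can be tabulated exactly.
Exact p-value (two-tailed):
$$p = \frac{\#\{\text{sign assignments with } \min(W^+, W^-) \le W_{\text{obs}}\}}{2^{n_r}}$$
where $W_{\text{obs}}$ is the observed value of $\min(W^+, W^-)$.
DataStatPro always computes the exact p-value when the sample is small and there are no (or few) ties, and automatically switches to the normal approximation for larger samples.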
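One way to obtain the exact two-tailed p-value without listing every sign assignment is a small dynamic programme that counts, for each possible rank sum, how many subsets of the ranks produce it. The sketch below assumes no ties and no zeros (so the ranks are exactly 1 to `n_r`); the function name is hypothetical and this is not presented as DataStatPro's algorithm.

```python
def exact_two_sided_p(w_plus, n_r):
    """Exact two-sided p-value for W+ with ranks 1..n_r (no ties, zeros removed)."""
    total = n_r * (n_r + 1) // 2
    # counts[s] = number of sign assignments whose positive ranks sum to s
    counts = [0] * (total + 1)
    counts[0] = 1
    for rank in range(1, n_r + 1):
        # Iterate downward so each rank is used at most once (subset-sum counting).
        for s in range(total, rank - 1, -1):
            counts[s] += counts[s - rank]
    w_min = int(min(w_plus, total - w_plus))
    lower_tail = sum(counts[: w_min + 1])
    # By symmetry of the null distribution, double the lower tail; cap at 1.
    return min(1.0, 2 * lower_tail / 2 ** n_r)


# With n_r = 7 and W+ = 23, min(W+, W-) = 5 and 10 of the 128 assignments give W+ <= 5.
print(exact_two_sided_p(23, 7))  # 0.15625
```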
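3.7 Normal Approximation (Large Samples)
For larger samples, $W^+$ is approximately normally distributed under $H_0$, with mean and variance determined by $n_r$, the number of non-zero differences:
$$\mu_W = \frac{n_r(n_r+1)}{4}, \qquad \sigma_W^2 = \frac{n_r(n_r+1)(2n_r+1)}{24}$$
z-statistic (without continuity correction):
$$z = \frac{W^+ - \mu_W}{\sigma_W}$$
z-statistic (with continuity correction, more accurate for discrete distributions):
$$z_c = \frac{(W^+ - \mu_W) - 0.5\,\operatorname{sign}(W^+ - \mu_W)}{\sigma_W}$$
Two-tailed p-value:
$$p = 2\,[1 - \Phi(|z|)]$$
Where $\Phi$ is the standard normal CDF.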
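A minimal sketch of the normal approximation, assuming no ties and using the standard mean $n_r(n_r+1)/4$ and variance $n_r(n_r+1)(2n_r+1)/24$ of $W^+$ under the null hypothesis. The function name and the example values are hypothetical; the normal CDF is obtained from `math.erf`.

```python
import math


def normal_approximation(w_plus, n_r, continuity=True):
    """Return (z, two-sided p) for W+ under the large-sample normal approximation."""
    mean = n_r * (n_r + 1) / 4
    sd = math.sqrt(n_r * (n_r + 1) * (2 * n_r + 1) / 24)
    deviation = w_plus - mean
    if continuity:
        # Move the statistic 0.5 toward its mean, never past it.
        deviation = math.copysign(max(abs(deviation) - 0.5, 0.0), deviation)
    z = deviation / sd
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF at |z|
    return z, 2 * (1 - phi)


# Hypothetical result: W+ = 350 from n_r = 30 non-zero differences.
z, p = normal_approximation(350, 30)
print(round(z, 3), round(p, 4))
```

The sign of `z` is kept so that it can later be turned into a signed effect size; the p-value uses its absolute value.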
3.8 Tie Correction for the Variance
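When there are tied absolute difference values, the variance formula must be corrected:
$$\sigma_{W,\text{tie}}^2 = \frac{n_r(n_r+1)(2n_r+1)}{24} - \sum_{j=1}^{g} \frac{t_j^3 - t_j}{48}$$
Where:
- $n_r$ = number of non-zero differences.
- $g$ = number of distinct tied groups among the ranked absolute differences.
- $t_j$ = number of observations in the $j$-th tied group.
The correction reduces the variance, increasing the magnitude of the z-statistic slightly and thus providing a more accurate p-value when ties are present.
Corrected z-statistic:
$$z = \frac{W^+ - n_r(n_r+1)/4}{\sigma_{W,\text{tie}}}$$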
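The tie correction can be sketched as below, assuming the usual form in which each group of $t$ tied absolute differences reduces the variance by $(t^3 - t)/48$. The function name and the example data are hypothetical.

```python
from collections import Counter


def tie_corrected_variance(abs_differences):
    """Variance of W+ under H0, corrected for ties among the non-zero |d_i|."""
    n_r = len(abs_differences)
    base = n_r * (n_r + 1) * (2 * n_r + 1) / 24
    # Sizes of groups that contain more than one observation.
    tie_sizes = [t for t in Counter(abs_differences).values() if t > 1]
    correction = sum(t ** 3 - t for t in tie_sizes) / 48
    return base - correction


# Hypothetical |d_i|: one pair tied at 1 and one triple tied at 3.
print(tie_corrected_variance([1, 1, 2, 3, 3, 3, 4]))  # 34.375
```

For these values the uncorrected variance is 35, so the two tied groups remove 0.625. The same number results from summing the squared midranks (1.5, 1.5, 3, 5, 5, 5, 7) and dividing by 4.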
3.9 The Exact Probability Under $H_0$: Deriving the Null Distribution
Under $H_0$, each non-zero difference score is equally likely to be positive or negative, independently of its magnitude. This means each of the $2^{n_r}$ possible sign assignments to the ranks is equally probable, where $n_r$ is the number of non-zero differences.
$W^+$ ranges from $0$ (all differences negative) to $n_r(n_r+1)/2$ (all differences positive). The probability of any specific value of $W^+$ is the number of sign assignments producing that value divided by $2^{n_r}$.
Example for $n_r = 4$ (ranks 1, 2, 3, 4; total $= 10$; $2^4 = 16$ sign assignments):
$W^+$ can range from 0 to 10. $P(W^+ = 0) = 1/16$ (all negative). $P(W^+ = 10) = 1/16$ (all positive). $P(W^+ = 5) = 2/16$ (two sign assignments, with positive ranks $\{1, 4\}$ or $\{2, 3\}$, give $W^+ = 5$).
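For very small samples the null distribution can be confirmed by brute force. The sketch below enumerates all sign assignments for ranks 1 to 4 and counts how many produce each value of the positive rank sum; the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def null_distribution(n_r):
    """Count the sign assignments giving each value of W+ for ranks 1..n_r."""
    counts = Counter()
    for signs in product((0, 1), repeat=n_r):  # 1 marks a positive difference
        w_plus = sum(rank for rank, positive in zip(range(1, n_r + 1), signs) if positive)
        counts[w_plus] += 1
    return dict(sorted(counts.items()))


print(null_distribution(4))
# {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 1, 9: 1, 10: 1}
```

The counts sum to 16 and are symmetric about 5, the expected value of the positive rank sum for four ranks. Dividing each count by 16 gives the exact probabilities.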
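3.10 Relationship Between Wilcoxon $W$ and the Mann-Whitney $U$
The Wilcoxon Signed-Rank statistic $W^+$ is algebraically related to the Mann-Whitney $U$ statistic. Specifically, for the one-sample or paired case (with no zeros and no tied absolute differences), $W^+$ counts the number of Walsh averages $(d_i + d_j)/2$ (for $i \le j$) that are positive:
$$W^+ = \#\left\{(i, j) : i \le j,\ \frac{d_i + d_j}{2} > 0\right\}$$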
This connection to Walsh averages is the foundation of the Hodges-Lehmann estimator of the pseudo-median, which serves as the point estimate associated with the Wilcoxon test.
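The Walsh-average identity is easy to verify numerically. The sketch below counts positive Walsh averages for a set of hypothetical difference scores with no zeros and no tied absolute values; the function name is an assumption for illustration.

```python
def count_positive_walsh_averages(differences):
    """Count pairs i <= j whose Walsh average (d_i + d_j) / 2 is positive."""
    n = len(differences)
    return sum(
        1
        for i in range(n)
        for j in range(i, n)
        if (differences[i] + differences[j]) / 2 > 0
    )


# Hypothetical non-zero differences with distinct absolute values.
diffs = [4, -2, 6, 1, -3, 5, 8]
print(count_positive_walsh_averages(diffs))  # 23
```

Ranking the absolute values of the same data gives positive ranks 4, 6, 1, 5, and 7, whose sum is also 23.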
4. Assumptions of the Wilcoxon Signed-Rank Test
4.1 Symmetry of the Difference Score Distribution
The Wilcoxon Signed-Rank Test's primary assumption is that the population distribution of difference scores is symmetric about its median (pseudo-median). This is weaker than the normality assumption of the paired t-test but is still a meaningful constraint.
Why symmetry matters: The test is designed so that, under $H_0$, positive and negative ranks of equal magnitude are equally likely. If the difference distribution is asymmetric, the test is not testing only the location of the median; it may also respond to the shape of the distribution. In that case, $H_0$ conflates "no location shift" with "symmetric distribution."
How to check:
- Histogram of difference scores: look for approximate left-right symmetry about zero.
- Q-Q plot of difference scores: if symmetric, points should follow a straight line (not necessarily on the normal reference line — just linear).
- Skewness statistic: a value close to zero suggests no severe asymmetry.
- Density plots: visual inspection of the distribution of $d_i$.
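As a numerical companion to the plots, the skewness of the difference scores can be computed directly. The sketch below uses the simple moment-based coefficient $m_3 / m_2^{3/2}$; other software may apply a small-sample adjustment, so values can differ slightly. The function name and data are hypothetical.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [-2, -1, 0, 1, 2]       # hypothetical, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 15]  # hypothetical, one long right tail
print(sample_skewness(symmetric))               # 0.0
print(round(sample_skewness(right_skewed), 2))  # about 1.71
```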
When violated: If difference scores are severely asymmetric (heavily skewed in one direction), the Wilcoxon test's p-value may not correctly reflect only a location shift. In this case:
- Use the Sign Test (which only requires that the median exists, with no symmetry assumption).
- Consider a data transformation (log, square root) to reduce skewness.
- Report the results with an explicit caveat about the asymmetry.
⚠️ The symmetry assumption is often overlooked. A common error is applying the Wilcoxon Signed-Rank Test to heavily right-skewed difference scores (e.g., when data represent counts or reaction times with occasional very long responses) without checking symmetry. In such cases, the Sign Test or bootstrap methods are more appropriate.
4.2 Independence of Pairs
All pairs must be independent of each other. That is, knowing the difference score for pair $i$ gives no information about the difference score for pair $j$ ($i \neq j$). Within each pair, the two measurements are of course dependent; this is the point of the paired design.
Common violations:
- Multiple measurements from the same participant treated as separate pairs.
- Pairs sampled from the same cluster (classroom, family, ward).
- Longitudinal data with autocorrelated measurements.
When violated: Use multilevel models or time-series methods.
4.3 Continuous (or At Least Ordinal and Rankable) Differences
The test requires that the absolute differences can be meaningfully ranked — there must be a natural ordering of the magnitudes. This is satisfied whenever:
- The DV is measured on a ratio or interval scale.
- The DV is ordinal and differences can be ranked (e.g., a 10-point pain scale where a difference of 3 is consistently larger than a difference of 1).
When violated: If differences cannot be ranked (e.g., nominal categories), use the McNemar test (for binary outcomes) or other categorical tests.
4.4 Exchangeability Under $H_0$
Under $H_0$, the distribution of $d_i$ must be exchangeable with respect to sign: $d_i$ and $-d_i$ must have the same distribution. This is satisfied when the difference distribution is symmetric about zero.
This condition is equivalent to stating that the probability of a positive difference equals the probability of a negative difference of the same magnitude.
4.5 Absence of Excessive Ties
The Wilcoxon Signed-Rank Test is designed for continuous data where ties in absolute differences are rare. Excessive ties (especially many zero differences) can affect the accuracy of the p-value.
Types of ties:
- Zero differences ($d_i = 0$): excluded from the analysis, reducing the effective sample size $n_r$.
- Tied absolute differences ($|d_i| = |d_j|$ for $i \neq j$): handled by midranks; the tie correction adjusts the variance.
How to check: Count the number of zero differences and the number of tied absolute differences. If more than 20–25% of differences are zero, the effective sample size is substantially reduced.
When excessive ties are present: Use the exact permutation test version of the Wilcoxon test, which handles ties exactly. DataStatPro automatically applies the exact test with ties for small samples and the standard tie-corrected approximation for larger samples.
4.6 Assumption Summary Table
| Assumption | Description | How to Check | Remedy if Violated |
|---|---|---|---|
| Symmetry of differences | The distribution of $d_i$ is symmetric about its median | Histogram, Q-Q, skewness of $d_i$ | Sign Test; transform data |
| Independence of pairs | Pairs are independent across observations | Design review | Multilevel model |
| Rankable differences | $\lvert d_i \rvert$ can be meaningfully ordered | Review the measurement scale | McNemar test; other categorical tests |
| Exchangeability | $d_i$ and $-d_i$ have same distribution | Symmetry check | Sign Test; bootstrap |
| No excessive ties | Few zero or tied absolute differences | Count zeros and ties | Exact permutation test; sign test |
5. Variants of the Wilcoxon Signed-Rank Test
5.1 Paired Version (Two-Condition Comparison)
The paired version compares two related conditions. Difference scores are computed as $d_i = x_{1i} - x_{2i}$ and the test evaluates whether the pseudo-median of the differences equals zero.
This is the most common application of the Wilcoxon Signed-Rank Test and is the primary focus of this tutorial.
5.2 One-Sample Version (Against a Hypothesised Median)
The one-sample version tests whether the population pseudo-median $\theta$ of a single sample equals a specified value $\theta_0$:
$$H_0: \theta = \theta_0$$
Compute adjusted differences:
$$d_i = x_i - \theta_0$$
Then apply the standard Wilcoxon procedure to these adjusted values.
Common applications:
- Testing whether a sample's median IQ differs from the population norm of 100.
- Testing whether median response time differs from a published normative value.
- Quality control: testing whether median product weight differs from a target.
5.3 Exact vs. Approximate (Asymptotic) p-values
Exact p-value: Computes the p-value from the complete enumeration of all possible sign assignments under $H_0$. Appropriate for small samples and when ties are absent or few. DataStatPro always provides the exact p-value when feasible.
Asymptotic p-value: Uses the normal approximation to the distribution of $W^+$. Appropriate for larger samples. The tie-corrected version is more accurate when ties are present.
With continuity correction: The continuity correction (a $\pm 0.5$ adjustment to $W^+$) improves the accuracy of the normal approximation for moderate sample sizes by accounting for the discrete nature of $W^+$.
Recommendation: Use the exact p-value whenever possible (small samples, few ties). For larger samples, the tie-corrected asymptotic p-value with continuity correction is generally accurate.
5.4 Permutation Version
The permutation (randomisation) version of the Wilcoxon test generates the null distribution by randomly reassigning the signs of the ranked absolute differences $B$ times, where $B$ is a large number of resamples, and computing $W$ for each permutation. The p-value is the proportion of permuted statistics at least as extreme as the observed $W$.
This approach:
- Is valid regardless of ties (handles them exactly).
- Does not rely on any distributional approximation.
- Requires more computation but is exact in principle.
- Is particularly useful for small samples with many ties.
DataStatPro offers the permutation version under the "Exact / Permutation" option.
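A Monte Carlo version of the sign-flip procedure can be sketched as follows. This is an illustrative implementation, not DataStatPro's: the function names, the default of 10,000 resamples, the fixed seed, and the add-one p-value estimate are all assumptions made for the example.

```python
import random


def midranks_of_abs(nonzero):
    """Midranks of |d_i|: average of the 1-based sorted positions each value occupies."""
    abs_sorted = sorted(abs(d) for d in nonzero)
    first, count = {}, {}
    for position, value in enumerate(abs_sorted, start=1):
        first.setdefault(value, position)
        count[value] = count.get(value, 0) + 1
    return [first[abs(d)] + (count[abs(d)] - 1) / 2 for d in nonzero]


def sign_flip_p_value(differences, n_resamples=10_000, seed=0):
    """Two-sided Monte Carlo p-value from random sign flips of the signed ranks."""
    nonzero = [d for d in differences if d != 0]
    ranks = midranks_of_abs(nonzero)
    total = sum(ranks)
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    observed = min(w_plus, total - w_plus)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Each rank is independently assigned a positive sign with probability 0.5.
        w_star = sum(r for r in ranks if rng.random() < 0.5)
        if min(w_star, total - w_star) <= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)  # add-one estimate keeps p above zero


# Hypothetical difference scores; the estimate should land near the exact value 20/128.
print(sign_flip_p_value([4, -2, 6, 0, 1, -3, 5, 8]))
```

Because the resampling is random, the result is an estimate whose precision improves with the number of resamples; full enumeration of all sign assignments gives the exact value when the sample is small enough.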
5.5 Pratt's Method for Zero Differences
Two conventions exist for handling zero differences ($d_i = 0$):
Wilcoxon's original method (default): Exclude all zero differences; analyse only the non-zero differences.
Pratt's method (1959): Include zero differences in the ranking, but exclude them from the sum of signed ranks. This method:
- Retains the information that zero differences exist (they count toward the ranking).
- Can give slightly different p-values from the standard method.
- May be preferred when zeros are informative (e.g., zero change is substantively meaningful).
DataStatPro provides both methods when zero differences are present.
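The difference between the two conventions is easiest to see on a small example. The sketch below ranks all absolute differences including zeros, then sums only the ranks of the non-zero differences, following the description of Pratt's method above. The function name and data are hypothetical.

```python
def pratt_rank_sums(differences):
    """Rank all |d_i| including zeros (midranks), then drop the zeros' ranks."""
    abs_sorted = sorted(abs(d) for d in differences)
    first, count = {}, {}
    for position, value in enumerate(abs_sorted, start=1):
        first.setdefault(value, position)
        count[value] = count.get(value, 0) + 1
    rank = {v: first[v] + (count[v] - 1) / 2 for v in first}
    w_plus = sum(rank[abs(d)] for d in differences if d > 0)
    w_minus = sum(rank[abs(d)] for d in differences if d < 0)
    return w_plus, w_minus


# Hypothetical differences with two zeros.
print(pratt_rank_sums([0, 0, 3, -1, 2]))  # (9.0, 3.0)
```

Under Wilcoxon's original method the same data give ranks 1, 2, 3 for the non-zero values and rank sums of 5 and 1. With Pratt's method the rank sums no longer add up to $n_r(n_r+1)/2$, so the null distribution has to be derived for the retained ranks rather than read from standard tables.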
6. Using the Wilcoxon Signed-Rank Test Calculator Component
The Wilcoxon Signed-Rank Test Calculator in DataStatPro provides a comprehensive tool for running, diagnosing, visualising, and reporting the test and associated effect sizes.
Step-by-Step Guide
Step 1 — Select "Wilcoxon Signed-Rank Test"
From the "Test Type" dropdown, select:
- Wilcoxon Signed-Rank Test (Paired): For comparing two related conditions.
- Wilcoxon Signed-Rank Test (One-Sample): For testing a single sample against a hypothesised median $\theta_0$.
💡 DataStatPro automatically suggests the Wilcoxon Signed-Rank Test when the normality check on difference scores is significant in the Paired t-Test component. A yellow warning banner will appear with a direct link to the Wilcoxon component.
Step 2 — Input Method
Choose how to provide the data:
- Raw data (paired columns): Upload or paste two columns — Condition 1 and Condition 2 — with one row per participant. DataStatPro computes all difference scores, performs symmetry checks, counts zeros and ties, and generates all statistics.
- Raw data (difference scores): Upload a single column of pre-computed difference scores. Useful for the one-sample version (enter $x_i - \theta_0$, where $\theta_0$ is the hypothesised median).
- Summary data (counts): Enter $n_+$ (number of positive differences), $n_-$ (number of negative differences), $n_0$ (ties at zero), and group summary statistics. Only the sign test and approximate statistics are available in this mode.
- Published results: Enter the reported $W$ (or $z$), $n$, and any available tie information to compute p-values and effect sizes from a published result.
Step 3 — Specify the Null Hypothesis Value
- Paired version: Default $\theta_0 = 0$ (testing whether the median difference is zero). Enter a non-zero value for one-sample-style comparisons against a reference.
- One-sample version: Enter the hypothesised population pseudo-median $\theta_0$.
Step 4 — Select the Alternative Hypothesis
- Two-tailed (default): The pseudo-median differs from the hypothesised value $\theta_0$.
- Upper one-tailed: The pseudo-median is greater than $\theta_0$.
- Lower one-tailed: The pseudo-median is less than $\theta_0$.
Step 5 — Select p-value Method
- Exact (recommended for small samples): Uses the complete enumeration of the null distribution. Automatically selected by DataStatPro for small samples.
- Asymptotic + Continuity Correction (recommended for larger samples): Normal approximation with tie correction and continuity correction.
- Permutation ($B$ resamples): Specify the number of resamples $B$ or accept the default. Appropriate for any sample size, handles ties exactly.
Step 6 — Handle Zero Differences
- Wilcoxon method (default): Exclude zero differences; analyse non-zero pairs.
- Pratt's method: Include zeros in ranking but not in rank sums.
DataStatPro reports $n$ (total pairs), $n_0$ (zero differences excluded), $n_+$ (positive differences), $n_-$ (negative differences), and $n_r$ (effective sample size).
Step 7 — Select Display Options
- ✅ $W^+$, $W^-$, $W$ (minimum), $z$, and p-value (exact and/or asymptotic).
- ✅ Descriptive statistics: $n$, $n_0$, $n_+$, $n_-$, median per condition, median difference, Hodges-Lehmann pseudo-median estimate.
- ✅ Hodges-Lehmann estimator with 95% CI.
- ✅ Rank table: individual $d_i$, $|d_i|$, $R_i$, signed rank.
- ✅ Effect size $r$ (rank-biserial correlation) with 95% CI.
- ✅ Matched-pairs rank-biserial correlation $r_{rb}$.
- ✅ Common Language Effect Size (CL%).
- ✅ Assumption check panel: histogram of , Q-Q plot, skewness, zero count, tie count.
- ✅ Distribution visualisation: overlapping density plots per condition; histogram of signed ranks.
- ✅ Dot plot with connecting lines showing individual participant change.
- ✅ Comparison with paired t-test results (runs both; flags discrepancies).
- ✅ Power curve: power vs. $n$ for observed effect size.
- ✅ APA 7th edition-compliant results paragraph (auto-generated).
Step 8 — Run the Analysis
Click "Run Wilcoxon Test". DataStatPro will:
- Compute all difference scores and rank them.
- Apply zero-exclusion (or Pratt) and tie-correction.
- Compute $W^+$, $W^-$, $W$, exact p-value, and asymptotic p-value.
- Estimate the Hodges-Lehmann pseudo-median with exact 95% CI.
- Compute effect sizes $r$ and $r_{rb}$ with CIs.
- Run assumption checks and display symmetry diagnostics.
- Auto-generate the APA-compliant results paragraph.
7. Full Step-by-Step Procedure
7.1 Complete Computational Procedure
This section walks through every computational step for the Wilcoxon Signed-Rank Test, from raw data to a full APA-style conclusion.
Given: $n$ pairs of observations $(x_{1i}, x_{2i})$ for $i = 1, \ldots, n$.
Step 1 — Establish Sign Convention and Compute Difference Scores
Define $d_i = x_{1i} - x_{2i}$ consistently for all pairs. A positive $d_i$ means Condition 1 yields a higher value than Condition 2 for participant $i$.
State the sign convention explicitly: "Positive differences indicate higher scores in Condition 1 than Condition 2."
Step 2 — Identify and Exclude Zero Differences
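Identify all pairs where $d_i = 0$ exactly. Remove these from further analysis.
$$n_r = n - n_0$$
(effective sample size after exclusion, where $n_0$ is the number of zero differences)
Record $n_0$ for reporting. If $n_0 > 0$, state explicitly that $n_0$ pairs with $d_i = 0$ were excluded.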
Step 3 — Compute Absolute Differences and Check Symmetry
Compute $|d_i|$ for all non-zero differences.
Symmetry check:
- Plot a histogram of the difference scores $d_i$.
- Compute skewness: if it is close to zero, symmetry is not severely violated.
- Inspect whether the distribution appears approximately symmetric about zero.
Step 4 — Rank the Absolute Differences
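Rank $|d_i|$ from smallest (rank 1) to largest (rank $n_r$, the number of non-zero differences).
For tied absolute values, assign the average rank to all tied observations.
Notation: $R_i$ = rank of $|d_i|$.
Verification:
$$\sum_{i=1}^{n_r} R_i = \frac{n_r(n_r+1)}{2}$$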
Step 5 — Assign Signed Ranks
Restore the sign of each difference to its rank:
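Signed rank $= +R_i$ if $d_i > 0$; signed rank $= -R_i$ if $d_i < 0$
Create a table with columns: $i$, $d_i$, $|d_i|$, $R_i$ (rank of $|d_i|$), and the signed rank ($+R_i$ if positive, $-R_i$ if negative).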
Step 6 — Compute the Rank Sums
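$$W^+ = \sum_{i:\, d_i > 0} R_i$$
(sum of ranks where $d_i > 0$)
$$W^- = \sum_{i:\, d_i < 0} R_i$$
(sum of ranks where $d_i < 0$)
Verification:
$$W^+ + W^- = \frac{n_r(n_r+1)}{2}$$
Test statistic:
$$W = \min(W^+, W^-)$$
Count: $n_+$ = number of positive differences; $n_-$ = number of negative differences; $n_+ + n_- = n_r$.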
Step 7 — Compute the p-value
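If $n_r$ is small and there are few ties: Use the exact null distribution (from tables or software enumeration).
If $n_r$ is large or there are many ties: Use the normal approximation with tie correction:
$$z = \frac{W^+ - \mu_W}{\sigma_{W,\text{tie}}}, \qquad \mu_W = \frac{n_r(n_r+1)}{4}, \qquad \sigma_{W,\text{tie}}^2 = \frac{n_r(n_r+1)(2n_r+1)}{24} - \sum_{j=1}^{g} \frac{t_j^3 - t_j}{48}$$
where $g$ is the number of tied groups and $t_j$ is the size of the $j$-th tied group.
With continuity correction:
$$z_c = \frac{(W^+ - \mu_W) - 0.5\,\operatorname{sign}(W^+ - \mu_W)}{\sigma_{W,\text{tie}}}$$
Two-tailed p-value:
$$p = 2\,[1 - \Phi(|z|)]$$
Compare $p$ to $\alpha$. Reject $H_0$ if $p \le \alpha$.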
Step 8 — Compute the Hodges-Lehmann Point Estimate
The Hodges-Lehmann estimator is the point estimate of the pseudo-median associated with the Wilcoxon test. It is the median of all pairwise averages of the non-zero differences:
$$\hat{\theta}_{HL} = \operatorname{median}\left\{\frac{d_i + d_j}{2} : 1 \le i \le j \le n_r\right\}$$
There are $n_r(n_r+1)/2$ such averages (including each difference paired with itself).
This estimator is:
- Robust to outliers (like the median).
- More efficient than the median for symmetric distributions.
- The natural point estimate associated with the Wilcoxon test.
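The estimator described in this step can be sketched in a few lines, following the tutorial's convention of using only the non-zero differences. The function name and the data are hypothetical.

```python
from statistics import median


def hodges_lehmann(differences):
    """Median of all Walsh averages (d_i + d_j) / 2 with i <= j, zeros excluded."""
    d = [x for x in differences if x != 0]
    walsh = [(d[i] + d[j]) / 2 for i in range(len(d)) for j in range(i, len(d))]
    return median(walsh)


# Hypothetical differences with one large value.
diffs = [1, 2, 10]
print(hodges_lehmann(diffs))  # 3.75
```

For these three values the six Walsh averages are 1, 1.5, 2, 5.5, 6, and 10, so the estimate is the midpoint of 2 and 5.5. It sits between the sample median (2) and the sample mean (about 4.33).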
Step 9 — Compute the 95% CI for the Pseudo-Median
The exact 95% CI for the pseudo-median uses the order statistics of the Walsh averages (all pairwise averages). The CI bounds are determined by the critical values of the Wilcoxon null distribution.
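Let $W_{\alpha/2}$ be the lower critical value from the exact Wilcoxon table, that is, the largest integer with $P(W^+ \le W_{\alpha/2}) \le \alpha/2$ under $H_0$. The CI runs from
the $(W_{\alpha/2} + 1)$-th smallest to the $(W_{\alpha/2} + 1)$-th largest Walsh average.
DataStatPro computes these exact CI bounds numerically.
Approximate 95% CI (for large $n_r$):
Find $k = \frac{n_r(n_r+1)}{4} - 1.96\sqrt{\frac{n_r(n_r+1)(2n_r+1)}{24}}$ (rounded to an integer), then the CI consists of the $k$-th to $(M - k + 1)$-th ordered Walsh averages, where $M = n_r(n_r+1)/2$ is the number of Walsh averages.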
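The large-sample interval can be sketched as below. The sketch assumes the usual normal-approximation index $k = n_r(n_r+1)/4 - z\sqrt{n_r(n_r+1)(2n_r+1)/24}$ with $z = 1.96$ for a 95% interval, rounds $k$ to the nearest integer, and takes the $k$-th smallest and $k$-th largest Walsh averages as bounds. The function name and data are hypothetical, and exact table-based bounds may differ slightly for small samples.

```python
import math


def hl_confidence_interval(differences, z_crit=1.96):
    """Approximate CI for the pseudo-median from ordered Walsh averages."""
    d = [x for x in differences if x != 0]
    n = len(d)
    walsh = sorted((d[i] + d[j]) / 2 for i in range(n) for j in range(i, n))
    m = len(walsh)  # m = n(n + 1)/2 Walsh averages
    k = n * (n + 1) / 4 - z_crit * math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    k = max(1, round(k))  # the index must be at least 1
    return walsh[k - 1], walsh[m - k]  # k-th smallest and k-th largest


# Hypothetical differences 1, 2, ..., 20 (symmetric about 10.5).
lower, upper = hl_confidence_interval(list(range(1, 21)))
print(lower, upper)
```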
Step 10 — Compute Effect Sizes
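Effect size $r$ (from z-statistic):
$$r = \frac{z}{\sqrt{n_r}}$$
Matched-pairs rank-biserial correlation $r_{rb}$ (from Kerby, 2014):
$$r_{rb} = \frac{W^+ - W^-}{W^+ + W^-}$$
Or equivalently:
$$r_{rb} = \frac{4W^+}{n_r(n_r+1)} - 1$$
Both $r$ and $r_{rb}$ range from $-1$ to $+1$.
Common Language Effect Size (CL):
$$CL \approx \frac{r_{rb} + 1}{2}$$
(above 0.5 when $W^+ > W^-$, i.e., most differences positive)
More precisely:
$$CL = \frac{n_+}{n_r}$$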
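The two correlation-type effect sizes can be computed from the rank sums as sketched below. The sketch assumes no ties and no continuity correction, and follows the tutorial's convention of dividing $z$ by the square root of the effective sample size; the function name and values are hypothetical.

```python
import math


def effect_sizes(w_plus, w_minus, n_r):
    """Return (r, r_rb): r = z / sqrt(n_r) and the matched-pairs rank-biserial r_rb."""
    r_rb = (w_plus - w_minus) / (w_plus + w_minus)
    mean = n_r * (n_r + 1) / 4
    sd = math.sqrt(n_r * (n_r + 1) * (2 * n_r + 1) / 24)
    z = (w_plus - mean) / sd  # signed z, so r keeps the direction of the effect
    r = z / math.sqrt(n_r)
    return r, r_rb


# Hypothetical rank sums from 7 non-zero differences.
r, r_rb = effect_sizes(23, 5, 7)
print(round(r, 3), round(r_rb, 3))  # 0.575 0.643
```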
Step 11 — Interpret and Report
Combine all results into a complete APA-compliant report:
- State the test used and the reason (non-normality, ordinal data).
- Report group/condition medians.
- Report $W$ (or $W^+$), $z$, and $p$.
- Report the Hodges-Lehmann estimate with 95% CI.
- Report the effect size $r_{rb}$ (or $r$) with its 95% CI.
- State the practical conclusion.
8. Effect Sizes for the Wilcoxon Signed-Rank Test
8.1 The Rank-Biserial Correlation $r_{rb}$: Primary Effect Size
The matched-pairs rank-biserial correlation $r_{rb}$ (Kerby, 2014) is the recommended primary effect size for the Wilcoxon Signed-Rank Test. It has several equivalent formulations:
From rank sums:
$$r_{rb} = \frac{W^+ - W^-}{W^+ + W^-}$$
From positive and negative rank proportions:
$$r_{rb} = \frac{W^+}{W^+ + W^-} - \frac{W^-}{W^+ + W^-}$$
Interpretation: $r_{rb}$ represents the difference between the proportion of favourable and unfavourable evidence in the data.
- $r_{rb} = +1$: All differences are positive (every participant scores higher in Condition 1).
- $r_{rb} = -1$: All differences are negative (every participant scores higher in Condition 2).
- $r_{rb} = 0$: Equal evidence for positive and negative effects ($W^+ = W^-$).
- $r_{rb} = +0.50$: 75% of the evidence (the total rank sum) favours Condition 1 over Condition 2.
This last property is related to the probability of superiority interpretation:
$$P(d_i > 0) \approx \frac{r_{rb} + 1}{2}$$
(approximately, under the symmetry assumption)
8.2 The Effect Size $r$: From the z-Statistic
$r$ (notation varies across sources) is the effect size computed directly from the standardised test statistic:
$$r = \frac{z}{\sqrt{n_r}}$$
Where $z$ is the z-approximation to the Wilcoxon statistic and $n_r$ is the effective sample size (excluding zero differences).
$r$ has the same range as a Pearson correlation ($-1$ to $+1$) and uses the same verbal benchmarks as Pearson $r$. It is mathematically equivalent to the point-biserial correlation between a binary indicator of condition and the observed rank differences.
Relationship between $r$ and $r_{rb}$:
$$r = r_{rb}\sqrt{\frac{3(n_r+1)}{2(2n_r+1)}}$$
This holds without ties and without a continuity correction. For large $n_r$, $r \approx 0.87\,r_{rb}$, so $r$ is somewhat smaller in magnitude than $r_{rb}$. With ties or a continuity correction the relationship is only approximate.
💡 DataStatPro reports both $r$ and $r_{rb}$. For primary reporting, $r_{rb}$ is recommended because it is interpretable without reference to the z-approximation and has a direct probability-of-superiority interpretation. Use $r$ when comparing to literature that reports this variant.
8.3 Cohen's Benchmarks for $r_{rb}$ and $r$
Since $r_{rb}$ and $r$ behave like correlation coefficients, Cohen's (1988) benchmarks for Pearson $r$ are applied:
| $\lvert r_{rb} \rvert$ or $\lvert r \rvert$ | Verbal Label | Equivalent $\lvert d \rvert$ | Power needed ($n$ pairs) |
|---|---|---|---|
| 0.10 | Small | | |
| 0.30 | Medium | | |
| 0.50 | Large | | |
| | Very large | | |
| | Huge | | |
Power estimates for two-tailed $\alpha = .05$, 80% power, Wilcoxon test.
⚠️ These benchmarks from Cohen (1988) are rough guidelines. Always contextualise effect sizes against domain-specific norms. The same $r_{rb}$ may be large in some fields (e.g., large-scale educational interventions) and small in others (e.g., lab-controlled cognitive experiments).
8.4 Converting Between Effect Size Metrics
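| From | To | Formula |
|---|---|---|
| $r_{rb}$ | $d$ (approx) | $d = 2r_{rb} / \sqrt{1 - r_{rb}^2}$ |
| $d$ | $r_{rb}$ (approx) | $r_{rb} = d / \sqrt{d^2 + 4}$ |
| $z$, $n_r$ | $r$ | $r = z / \sqrt{n_r}$ |
| $W^+$, $W^-$ | $r_{rb}$ | $r_{rb} = (W^+ - W^-) / (W^+ + W^-)$ |
⚠️ The conversions between $r_{rb}$ and $d$ above use the equal-groups formula and are only approximations. Do not use these conversions for meta-analytic aggregation without accounting for the design structure.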
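As a rough illustration, the equal-groups conversion can be written as two small Python functions. The form $r = d/\sqrt{d^2 + 4}$ is assumed here because it reproduces the $d$ and $r$ pairs in the sample-size tables of Section 10 (for example, $d = 0.50$ maps to $r \approx 0.243$). The function names are hypothetical.

```python
import math


def d_to_r(d):
    """Approximate conversion from Cohen's d to a correlation-type effect size."""
    return d / math.sqrt(d ** 2 + 4)


def r_to_d(r):
    """Inverse conversion; only defined for |r| < 1."""
    if abs(r) >= 1:
        raise ValueError("|r| must be below 1 for this conversion")
    return 2 * r / math.sqrt(1 - r ** 2)


for d in (0.20, 0.50, 0.80, 1.50):
    print(d, round(d_to_r(d), 3))
```

The inverse breaks down as $|r|$ approaches 1, so a perfect rank-biserial correlation has no finite $d$ equivalent under this approximation.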
8.5 The Hodges-Lehmann Estimator as an Effect Size
The Hodges-Lehmann pseudo-median $\hat{\theta}_{HL}$ is the point estimate in original measurement units associated with the Wilcoxon test. It is:
- Reported alongside $r_{rb}$ to provide both a standardised and an unstandardised effect.
- More interpretable than $r_{rb}$ for practitioners who think in original scale units.
- More robust than the mean difference to outliers and skewness.
- The natural "what is the effect size in original units?" companion to the Wilcoxon test.
Reporting recommendation: Always report $\hat{\theta}_{HL}$ with its 95% CI alongside $r_{rb}$. This parallels the paired t-test practice of reporting both the mean difference (in original units) and Cohen's $d$.
8.6 The Common Language Effect Size for the Wilcoxon Test
The Common Language Effect Size (CL) for the Wilcoxon context is:
$$CL = P(X_1 > X_2)$$
Estimated from the data:
$$\widehat{CL} = \frac{n_+}{n_r}$$
(simple version based on counts)
Or, more precisely using Walsh averages:
$$\widehat{CL} = \frac{\#\{(i \le j) : (d_i + d_j)/2 > 0\}}{n_r(n_r+1)/2}$$
This is the probability that a randomly selected participant scores higher in Condition 1 than in Condition 2, estimated non-parametrically from the data.
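Both estimates are easy to compute from the difference scores. The sketch below is illustrative only: the function names are hypothetical, zeros are excluded first, and the Walsh-average version is assumed to be the proportion of strictly positive Walsh averages (a Walsh average of exactly zero is not counted as positive).

```python
from itertools import combinations_with_replacement


def cl_simple(differences):
    """Proportion of non-zero differences that are positive."""
    d = [x for x in differences if x != 0]
    return sum(x > 0 for x in d) / len(d)


def cl_walsh(differences):
    """Proportion of Walsh averages (d_i + d_j)/2, i <= j, that are positive."""
    d = [x for x in differences if x != 0]
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(d, 2)]
    return sum(w > 0 for w in walsh) / len(walsh)


diffs = [3, -1, 4, 0, 2, -2, 5]  # hypothetical data
print(cl_simple(diffs), cl_walsh(diffs))
```

The count-based version uses only signs, while the Walsh-based version also reflects magnitudes, which is why the two can differ noticeably in small samples.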
9. Confidence Intervals
9.1 Exact CI for the Hodges-Lehmann Pseudo-Median
The natural CI to report with the Wilcoxon Signed-Rank Test is the exact confidence interval for the pseudo-median (Hodges-Lehmann CI), expressed in the original measurement units.
Algorithm:
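- Compute all $M = n_r(n_r+1)/2$ Walsh averages: $w_{ij} = (d_i + d_j)/2$ for $1 \le i \le j \le n_r$.
- Sort the Walsh averages in ascending order: $w_{(1)} \le w_{(2)} \le \dots \le w_{(M)}$.
- Find the lower critical value $k$ from the exact Wilcoxon null distribution at the chosen $\alpha$ level.
- The 95% CI is $[w_{(k+1)},\ w_{(M-k)}]$.
Where $k$ is the largest value of $w$ for which $P(W \le w) \le \alpha/2$ under $H_0$.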
DataStatPro computes this exact CI automatically.
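The algorithm above can be sketched in plain Python as follows. This is not DataStatPro's code: the function names are hypothetical, zero differences are dropped before the Walsh averages are formed, and the critical value comes from the no-ties null distribution (counting subsets of the ranks $1, \dots, n_r$ by their sum).

```python
from itertools import combinations_with_replacement
from statistics import median


def signed_rank_null_counts(n):
    """counts[s] = number of subsets of {1..n} whose elements sum to s."""
    total = n * (n + 1) // 2
    counts = [0] * (total + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(total, rank - 1, -1):
            counts[s] += counts[s - rank]
    return counts


def lower_critical_value(n, alpha=0.05):
    """Largest k with P(W <= k) <= alpha/2 under H0, or -1 if no such k exists."""
    limit = (alpha / 2) * 2 ** n
    cumulative, k = 0, -1
    for s, c in enumerate(signed_rank_null_counts(n)):
        cumulative += c
        if cumulative > limit:
            break
        k = s
    return k


def hodges_lehmann_ci(differences, alpha=0.05):
    """Return (pseudo-median, (lower, upper)) from the sorted Walsh averages."""
    d = [x for x in differences if x != 0]
    walsh = sorted((a + b) / 2 for a, b in combinations_with_replacement(d, 2))
    k = lower_critical_value(len(d), alpha)
    if k < 0:
        raise ValueError("sample too small for a CI at this alpha")
    m = len(walsh)
    return median(walsh), (walsh[k], walsh[m - k - 1])  # w_(k+1) and w_(M-k)


estimate, interval = hodges_lehmann_ci([3, -1, 4, 2, -2, 5, 6, 1])  # hypothetical data
print(estimate, interval)
```

Because the null distribution is discrete, the achieved coverage is at least, not exactly, $1 - \alpha$.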
9.2 Number of Walsh Averages for Common Sample Sizes
| $n_r$ pairs | Walsh averages, $n_r(n_r+1)/2$ |
|---|---|
| 5 | 15 |
| 10 | 55 |
| 15 | 120 |
| 20 | 210 |
| 30 | 465 |
| 50 | 1275 |
| 100 | 5050 |
9.3 Interpreting the Hodges-Lehmann CI
The Hodges-Lehmann CI has the same interpretation as any confidence interval: if the study were repeated many times, approximately 95% of the resulting intervals would contain the true population pseudo-median.
CI interpretation rules:
| CI Property | Interpretation |
|---|---|
| Entirely above zero | Pseudo-median is significantly positive; Condition 1 tends to produce higher values |
| Entirely below zero | Pseudo-median is significantly negative; Condition 2 tends to produce higher values |
| Contains zero | Result is not statistically significant at the $\alpha = .05$ level |
| Narrow CI | Precise estimate (large $n_r$) |
| Wide CI | Imprecise estimate (small $n_r$); interpret cautiously |
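9.4 CI for the Effect Size $r_{rb}$
A bootstrap 95% CI for $r_{rb}$ is available in DataStatPro when raw data are provided:
- Resample pairs with replacement $B$ times.
- Compute $r_{rb}$ for each bootstrap sample.
- The 95% CI is the 2.5th and 97.5th percentile of the bootstrap distribution.
An asymptotic CI can also be computed using Fisher's $z$-transformation:
$$z' = \operatorname{arctanh}(r_{rb}), \qquad SE_{z'} = \frac{1}{\sqrt{n_r - 3}}, \qquad z' \pm 1.96\,SE_{z'}$$
Back-transform:
$$\text{CI}_{r_{rb}} = \left[\tanh(z' - 1.96\,SE_{z'}),\ \tanh(z' + 1.96\,SE_{z'})\right]$$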
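A minimal sketch of the Fisher-transformation interval, assuming the standard error $1/\sqrt{n_r - 3}$ (the form that matches the SE column in Section 9.5). The function name is hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.959964):
    """Asymptotic CI for a correlation-type effect size via Fisher's z.

    Requires |r| < 1 and n > 3; z_crit defaults to the two-sided 95% value.
    """
    if abs(r) >= 1 or n <= 3:
        raise ValueError("need |r| < 1 and n > 3")
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


low, high = fisher_ci(0.30, 200)
print(round(low, 3), round(high, 3))
```

The interval is undefined when $|r_{rb}| = 1$ (for example, when every difference has the same sign), so a bootstrap or exact approach is needed in that case.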
9.5 Width of the CI as a Function of Sample Size
For a fixed effect size, using the Fisher approximation with $SE = 1/\sqrt{n_r - 3}$:
| $n_r$ | $SE$ | Approx. CI Width ($r_{rb}$) | Precision |
|---|---|---|---|
| 10 | 0.378 | 1.16 | Very low |
| 20 | 0.243 | 0.79 | Low |
| 30 | 0.192 | 0.63 | Moderate |
| 50 | 0.145 | 0.49 | Moderate |
| 100 | 0.102 | 0.35 | Good |
| 200 | 0.071 | 0.25 | High |
⚠️ The CI for $r_{rb}$ is very wide for small samples. Always report the CI to convey the uncertainty in the effect size estimate. A precise-looking point estimate from $n_r = 10$ pairs has a CI more than one unit wide (first row of the table), which is nearly uninformative about the true effect magnitude.
10. Power Analysis and Sample Size Planning
10.1 Power of the Wilcoxon Signed-Rank Test
Power analysis for the Wilcoxon Signed-Rank Test is more complex than for parametric tests because the power depends on the entire distribution of difference scores, not just the mean and variance. Three approaches are used:
Approach 1 — Use the ARE relative to the paired t-test:
Since $\text{ARE} = 3/\pi \approx 0.955$ for normal data, the required $n$ for the Wilcoxon test is approximately $1/0.955 \approx 1.047$ times the required $n$ for the paired t-test at the same power.
$$n_{\text{Wilcoxon}} \approx \frac{n_{t}}{0.955}$$
This is the most practical planning approach when $d$ is known or estimated.
Approach 2 — Use the effect size directly (simulation-based):
DataStatPro uses Monte Carlo simulation to estimate power for specified $r_{rb}$ (or $d$), $n$, $\alpha$, and distributional shape (normal, logistic, exponential).
Approach 3 — Use the normal approximation (large samples):
For large $n$, power is approximately:
$$\text{Power} \approx \Phi\left(\lambda - z_{1-\alpha/2}\right)$$
Where $\lambda = d\sqrt{\text{ARE} \cdot n}$ is the non-centrality parameter.
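The ARE-based planning calculations can be sketched with the Python standard library. The formulas below are assumptions chosen to be consistent with the sensitivity table in Section 10.3: power is approximated as $\Phi(d\sqrt{\text{ARE} \cdot n} - z_{1-\alpha/2})$, ignoring the negligible opposite tail, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

ARE_NORMAL = 3 / math.pi  # about 0.955 for normally distributed differences


def wilcoxon_power(d, n, alpha=0.05, are=ARE_NORMAL):
    """Large-sample two-tailed power approximation for standardised effect d."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * math.sqrt(are * n) - z_crit)


def min_detectable_d(n, power=0.80, alpha=0.05, are=ARE_NORMAL):
    """Smallest d detectable with the requested power under the same approximation."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) / math.sqrt(are * n)


def wilcoxon_n_from_t(n_t, are=ARE_NORMAL):
    """Inflate a paired t-test sample size by 1/ARE, rounding up."""
    return math.ceil(n_t / are)


print(round(min_detectable_d(20), 3), round(wilcoxon_power(0.5, 46), 3))
```

These are planning approximations only; for small samples or clearly non-normal differences, simulation gives more trustworthy power estimates.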
10.2 Required Sample Size for 80% Power ($\alpha = .05$, Two-Tailed)
Based on converting $d$ to Wilcoxon $n$ via ARE (normal data):
| $d$ equivalent | $r_{rb}$ (approx) | Wilcoxon $n$ (80% power) | Paired t $n$ (80% power) | Overhead |
|---|---|---|---|---|
| 0.20 | 0.099 | 277 | 264 | +5% |
| 0.30 | 0.148 | 125 | 119 | +5% |
| 0.50 | 0.243 | 46 | 44 | +5% |
| 0.80 | 0.372 | 19 | 18 | +6% |
| 1.00 | 0.447 | 14 | 13 | +8% |
| 1.20 | 0.514 | 10 | 9 | +11% |
| 1.50 | 0.600 | 7 | 7 | 0% |
Note: For non-normal distributions (heavy tails, skewed), the Wilcoxon test may require fewer observations than the paired t-test.
10.3 Sensitivity Analysis
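The minimum detectable effect size for a given $n$ and power (80%):
Using the ARE-based approximation:
$$d_{\min} = \frac{z_{1-\alpha/2} + z_{1-\beta}}{\sqrt{0.955\,n}}, \qquad r_{rb,\min} \approx \frac{d_{\min}}{\sqrt{d_{\min}^2 + 4}}$$
| $n$ pairs | Min. detectable $d$ | Min. detectable $r_{rb}$ |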
|---|---|---|
| 10 | 0.906 | 0.411 |
| 20 | 0.641 | 0.306 |
| 30 | 0.523 | 0.253 |
| 50 | 0.405 | 0.199 |
| 100 | 0.286 | 0.142 |
| 200 | 0.202 | 0.101 |
10.4 Power Advantage Under Non-Normality
For non-normal distributions, the Wilcoxon test's power advantage over the t-test grows:
| Distribution of $d_i$ | ARE | Implication |
|---|---|---|
| Normal | 0.955 | Wilcoxon needs 5% more pairs |
| Contaminated normal (5% outliers) | 1.34 | Wilcoxon needs 25% fewer pairs |
| Laplace (double exponential) | 1.50 | Wilcoxon needs 33% fewer pairs |
| Logistic | 1.10 | Wilcoxon needs 9% fewer pairs |
| Cauchy (heavy tails) | $\infty$ | Wilcoxon dramatically more powerful |
💡 When the distribution of difference scores is expected to be non-normal (e.g., for Likert-type scales, skewed physiological data, or time-to-event measures), plan sample size using the Wilcoxon test directly via DataStatPro's Monte Carlo power module rather than the ARE-based approximation.
11. Advanced Topics
11.1 Comparing the Wilcoxon Signed-Rank Test and the Paired t-Test
A common question is: given that both tests are available, which should be reported?
Decision criteria:
| Condition | Recommendation |
|---|---|
| Difference scores clearly normal, no outliers, | Paired t-test (slightly more powerful) |
| Difference scores non-normal, | Wilcoxon Signed-Rank Test |
| Difference scores ordinal or near-ordinal | Wilcoxon Signed-Rank Test |
| Severe outliers in differences that cannot be removed | Wilcoxon Signed-Rank Test |
| Uncertain normality, small $n$ | Wilcoxon Signed-Rank Test (safer) |
| , differences mildly non-normal | Either test (CLT protects t-test) |
| Pre-registered choice, normality assumed | Paired t-test with Wilcoxon as sensitivity |
Best practice: When normality is uncertain, run both tests. If they agree (both significant or both non-significant), report the parametric result as primary with the non-parametric as a sensitivity check. If they disagree, investigate the distribution of differences and report the Wilcoxon as the primary test with an explanation.
11.2 The Sign Test as a Simpler Alternative
The Sign Test is an even simpler non-parametric test that uses only the sign of each difference (ignoring magnitude). It tests $H_0\colon P(d_i > 0) = P(d_i < 0)$ using the binomial distribution:
$$n_+ \sim \text{Binomial}(n_r,\ 0.5)$$
under $H_0$
When to use the Sign Test over Wilcoxon:
- Only the direction of change is known (not the magnitude).
- Data are binary or nominal (e.g., improved vs. not improved).
- The distribution of differences is so severely non-symmetric that even the Wilcoxon test's symmetry assumption is implausible.
Efficiency comparison: For normal data, the Sign Test has ARE $= 2/\pi \approx 0.637$ relative to the paired t-test, substantially less efficient than the Wilcoxon test's ARE of 0.955. Use the Sign Test only when the Wilcoxon test's symmetry assumption cannot be justified.
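For reference, the exact two-sided Sign Test p-value needs only the counts of positive and negative differences. The sketch below doubles the smaller binomial tail and caps the result at 1; the function name is hypothetical.

```python
from math import comb


def sign_test_p(n_plus, n_minus):
    """Exact two-sided Sign Test p-value under Binomial(n_plus + n_minus, 0.5)."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# One positive and eleven negative non-zero differences
print(sign_test_p(1, 11))
```

Zero differences carry no sign information and are excluded before counting, just as in the Wilcoxon procedure.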
11.3 Bootstrap Wilcoxon Test
The bootstrap version of the Wilcoxon test generates the null distribution by resampling:
- For each bootstrap iteration $b = 1, \dots, B$:
  - a. Randomly flip the sign of each $d_i$ with probability 0.5 (sign randomisation under $H_0$).
  - b. Compute $W^+_b$ from the sign-randomised differences.
- The bootstrap p-value is the proportion of $W^+_b$ values at least as extreme as the observed $W^+$.
This approach:
- Is valid regardless of ties, provided the differences are symmetric about zero under $H_0$.
- Produces exact p-values in the limit as $B \to \infty$.
- Is equivalent to the permutation test described in Section 5.4.
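The sign-randomisation procedure can be sketched as a small Monte Carlo routine. This is an illustration under stated assumptions, not DataStatPro's implementation: the function names are hypothetical, the test is two-sided (extremeness is measured as distance of $W^+$ from its null centre), and the p-value uses the common $(\text{hits} + 1)/(B + 1)$ estimator so it can never be exactly zero.

```python
import random


def midranks(values):
    """Average 1-based rank for each value, with ties sharing a rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1  # position of the first occurrence
        rank_of[v] = first + (ordered.count(v) - 1) / 2
    return [rank_of[v] for v in values]


def sign_flip_p_value(differences, n_iter=10_000, seed=0):
    """Two-sided Monte Carlo p-value for W+ under random sign flips."""
    rng = random.Random(seed)
    d = [x for x in differences if x != 0]
    ranks = midranks([abs(x) for x in d])
    centre = sum(ranks) / 2  # null expectation of W+
    observed = abs(sum(r for x, r in zip(d, ranks) if x > 0) - centre)
    hits = 0
    for _ in range(n_iter):
        w_plus_b = sum(r for r in ranks if rng.random() < 0.5)
        if abs(w_plus_b - centre) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_iter + 1)


print(sign_flip_p_value([3, -1, 4, 0, 2, -2, 5]))  # hypothetical data
```

The ranks of $|d_i|$ are computed once, because flipping signs changes only which ranks count towards $W^+$, not the ranks themselves.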
11.4 Bayesian Non-Parametric Paired Test
The Bayesian Signed-Rank Test (van Doorn et al., 2018; Ly et al., 2016) extends the Bayesian framework to the Wilcoxon setting. It computes a Bayes Factor $BF_{10}$ quantifying evidence for $H_1$ (pseudo-median $\ne 0$) vs. $H_0$ (pseudo-median $= 0$) without assuming normality.
The prior on the scaled pseudo-median under $H_1$ is a Cauchy distribution (as in the Bayesian t-test), but the likelihood is based on a normal approximation to the sampling distribution of the Wilcoxon statistic.
evaluated at with
This approximation is valid for . DataStatPro computes the Bayesian Signed-Rank Test using this approximation.
Interpretation of $BF_{10}$: Same benchmarks as the Bayesian t-test (see Section 11.4 of the Paired t-Test tutorial).
11.5 Multiple Wilcoxon Tests and Familywise Error Control
When multiple Wilcoxon Signed-Rank Tests are conducted simultaneously (e.g., testing the same intervention on five different outcomes), the familywise error rate (FWER) inflates exactly as with multiple t-tests. For $m$ independent tests:
$$\text{FWER} = 1 - (1 - \alpha)^m$$
Correction methods applicable to multiple Wilcoxon tests:
| Method | Adjusted $\alpha$ | Properties |
|---|---|---|
| Bonferroni | $\alpha / m$ | Conservative; controls FWER |
| Holm | Sequential | Less conservative than Bonferroni |
| Benjamini-Hochberg | FDR control | Exploratory analyses |
Apply the same correction logic as for multiple parametric tests.
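A short sketch of the familywise error calculation and of the Bonferroni and Holm adjustments, expressed as adjusted p-values so they can be compared directly against the nominal $\alpha$. The function names are hypothetical and the FWER formula assumes independent tests.

```python
def fwer(alpha, m):
    """Familywise error rate for m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni(p_values):
    """Bonferroni-adjusted p-values."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max  # enforce monotone adjusted values
    return adjusted


print(round(fwer(0.05, 5), 3))
print(holm([0.01, 0.04, 0.03]))
```

Holm never produces a larger adjusted p-value than Bonferroni, which is why it is usually preferred when FWER control is required.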
11.6 The Wilcoxon Test for Ordinal Likert Scale Data
A common application of the Wilcoxon Signed-Rank Test is to paired Likert scale responses. Consider a satisfaction survey where participants rate two products on a 5-point scale (1 = very dissatisfied, 5 = very satisfied).
Key considerations:
- Single Likert items should be treated as ordinal; the Wilcoxon test is appropriate.
- Composite Likert scales (sum or average of multiple items) can often be treated as approximately continuous; the paired t-test may be appropriate if the composite is approximately normally distributed.
- Floor and ceiling effects are common with Likert data and create many zero differences and ties — check carefully and consider Pratt's method.
- The Wilcoxon test cannot distinguish between a systematic shift of 1 point (each participant rates Product 1 exactly 1 point higher) and a mixed pattern (some rate it 2 points higher, others 1 point lower). The Hodges-Lehmann estimate helps clarify the typical magnitude of change.
11.7 Reporting the Wilcoxon Signed-Rank Test According to APA 7th Edition
Minimum reporting requirements (APA 7th ed.):
- State that the Wilcoxon Signed-Rank Test was used and why (e.g., non-normal differences, ordinal data).
- Report medians for each condition (or the Hodges-Lehmann pseudo-median estimate).
- Report the test statistic: $W^+$ or $W^-$ (or $W$, the minimum), and the z-approximation when the asymptotic test is used.
- Report the exact or asymptotic p-value.
- Report the effect size $r_{rb}$ (or $r$) with 95% CI.
- Report the Hodges-Lehmann estimate $\hat{\theta}_{HL}$ with 95% CI (in original units).
- Report $n$, $n_+$, $n_-$, and $n_0$ (number of zeros excluded).
12. Worked Examples
Example 1: Pre-Post Anxiety Scores (Non-Normal Differences)
A clinical psychologist evaluates an 8-week acceptance and commitment therapy (ACT) programme for anxiety. Generalised Anxiety Disorder 7-item scale (GAD-7; range 0–21; higher = more anxiety) scores are recorded for $n = 12$ participants before and after the programme.
Shapiro-Wilk test on difference scores: Differences are right-skewed and the test rejects normality, so the Wilcoxon Signed-Rank Test is used.
Raw data:
| $i$ | Pre-ACT ($x_{1i}$) | Post-ACT ($x_{2i}$) | $d_i = x_{1i} - x_{2i}$ |
|---|---|---|---|
| 1 | 16 | 9 | 7 |
| 2 | 12 | 8 | 4 |
| 3 | 18 | 6 | 12 |
| 4 | 14 | 11 | 3 |
| 5 | 20 | 8 | 12 |
| 6 | 11 | 9 | 2 |
| 7 | 17 | 14 | 3 |
| 8 | 15 | 5 | 10 |
| 9 | 13 | 11 | 2 |
| 10 | 19 | 10 | 9 |
| 11 | 16 | 13 | 3 |
| 12 | 14 | 12 | 2 |
Step 1 — Zero differences: No $d_i = 0$, so $n_0 = 0$ and $n_r = 12$.
Step 2 — Absolute differences and symmetry check:
$\lvert d_i \rvert$: 7, 4, 12, 3, 12, 2, 3, 10, 2, 9, 3, 2
Symmetry check: all differences are positive (no negative differences), indicating a strong shift. The distribution of $d_i$ is right-skewed (all positive, with some large values of 12), which is consistent with the Shapiro-Wilk violation.
Step 3 — Rank the absolute differences:
Sorted values and their ranks (with midranks for ties):
| $\lvert d_i \rvert$ value | Count | Rank positions | Avg rank |
|---|---|---|---|
| 2 | 3 | 1, 2, 3 | 2.0 |
| 3 | 3 | 4, 5, 6 | 5.0 |
| 4 | 1 | 7 | 7.0 |
| 7 | 1 | 8 | 8.0 |
| 9 | 1 | 9 | 9.0 |
| 10 | 1 | 10 | 10.0 |
| 12 | 2 | 11, 12 | 11.5 |
Rank assignment:
| $i$ | $d_i$ | $\lvert d_i \rvert$ | Rank | Signed Rank |
|---|---|---|---|---|
| 1 | 7 | 7 | 8.0 | +8.0 |
| 2 | 4 | 4 | 7.0 | +7.0 |
| 3 | 12 | 12 | 11.5 | +11.5 |
| 4 | 3 | 3 | 5.0 | +5.0 |
| 5 | 12 | 12 | 11.5 | +11.5 |
| 6 | 2 | 2 | 2.0 | +2.0 |
| 7 | 3 | 3 | 5.0 | +5.0 |
| 8 | 10 | 10 | 10.0 | +10.0 |
| 9 | 2 | 2 | 2.0 | +2.0 |
| 10 | 9 | 9 | 9.0 | +9.0 |
| 11 | 3 | 3 | 5.0 | +5.0 |
| 12 | 2 | 2 | 2.0 | +2.0 |
Step 4 — Rank sums:
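$W^+ = 8.0 + 7.0 + 11.5 + 5.0 + 11.5 + 2.0 + 5.0 + 10.0 + 2.0 + 9.0 + 5.0 + 2.0 = 78.0$
$W^- = 0$ (no negative differences)
Check: $W^+ + W^- = 78 = 12(13)/2$ ✅
$W = \min(W^+, W^-) = 0$, $n_+ = 12$, $n_- = 0$
Step 5 — Exact p-value ($n_r = 12$):
With $W = 0$ (all differences positive), the exact two-tailed p-value is:
$p = 2 \times (1/2)^{12} = 2/4096 \approx .0005$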
Step 6 — Hodges-Lehmann estimator:
All $12(13)/2 = 78$ Walsh averages are computed and sorted. The median of 78 values is the average of the 39th and 40th sorted Walsh averages.
Given all differences are positive (2, 2, 2, 3, 3, 3, 4, 7, 9, 10, 12, 12), the Walsh averages range from 2 (minimum) to 12 (maximum), all positive.
$\hat{\theta}_{HL} = 6.0$ GAD-7 points (median of Walsh averages: 38 of the 78 values lie below 6.0 and six equal 6.0, so the 39th and 40th sorted values are both 6.0)
95% CI for pseudo-median (exact): GAD-7 points
Step 7 — Effect sizes:
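Rank-biserial correlation:
$r_{rb} = (78 - 0)/78 = 1.00$ (perfect: every participant improved)
z-based effect size ($r$, asymptotic approximation):
Tie correction: $\sum (t_j^3 - t_j)/48 = (24 + 24 + 6)/48 = 1.125$, so $\mathrm{Var}(W^+) = 12(13)(25)/24 - 1.125 = 161.375$
$z = (78 - 39)/\sqrt{161.375} \approx 3.07$ (no continuity correction); $r = 3.07/\sqrt{12} \approx 0.89$
Common Language Effect Size:
$CL = n_+/n_r = 12/12 = 1.00$ (all participants improved)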
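The tie-corrected z-statistic and the resulting $r$ can be checked with a short function. This sketch assumes the usual variance correction $\sum (t_j^3 - t_j)/48$ over groups of tied absolute differences and makes the continuity correction optional; the function name is hypothetical.

```python
import math
from collections import Counter


def wilcoxon_z(w_plus, abs_diffs, continuity=False):
    """Normal-approximation z for W+ with a tie correction on the variance.

    abs_diffs holds the non-zero absolute differences (ties detected by equality).
    """
    n = len(abs_diffs)
    mean = n * (n + 1) / 4
    tie_term = sum(t ** 3 - t for t in Counter(abs_diffs).values()) / 48
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term
    numerator = w_plus - mean
    if continuity:
        # shrink the deviation towards zero by 0.5, keeping its sign
        numerator = math.copysign(max(abs(numerator) - 0.5, 0.0), numerator)
    return numerator / math.sqrt(variance)


abs_d = [7, 4, 12, 3, 12, 2, 3, 10, 2, 9, 3, 2]  # Example 1 absolute differences
z = wilcoxon_z(78.0, abs_d)
print(round(z, 2), round(z / math.sqrt(len(abs_d)), 2))
```

Whether the continuity correction is applied changes $z$, and therefore $r$, slightly, so the choice should be stated when reporting.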
Step 8 — Summary:
| Statistic | Value | Interpretation |
|---|---|---|
| Pre-ACT median | 15.5 GAD-7 pts | Moderate-severe anxiety |
| Post-ACT median | 9.5 GAD-7 pts | Mild anxiety |
| $n_r$ (non-zero diff.) | 12 | All participants showed positive change |
| $n_+$ / $n_-$ / $n_0$ | 12 / 0 / 0 | |
| $W^+$ | 78.0 | Maximum possible |
| $W^-$ | 0 | Zero negative ranks |
| $W$ (minimum) | 0 | |
| $p$ (exact, two-tailed) | .0005 | |
| HL pseudo-median | 6.0 GAD-7 pts | |
| 95% CI for $\hat{\theta}_{HL}$ | [3.0, 9.5] pts | Excludes 0; significant |
| $r_{rb}$ | 1.00 | Maximum possible effect |
| $r$ | 0.89 | Very large |
| CL | 1.00 | Every participant improved |
APA write-up: "Due to non-normal distribution of difference scores (Shapiro-Wilk test), a Wilcoxon Signed-Rank Test was conducted. ACT therapy produced a statistically significant reduction in anxiety (pre-ACT: Mdn = 15.5 GAD-7 points; post-ACT: Mdn = 9.5), $W = 0$, $p < .001$ (exact). The Hodges-Lehmann estimate of the median reduction was 6.0 GAD-7 points [95% CI: 3.0, 9.5], $r_{rb} = 1.00$, indicating a very large treatment effect. All 12 participants showed improvement following ACT."
Example 2: Pain Ratings — Two Physiotherapy Protocols (Ordinal DV)
A physiotherapist compares pain relief (0–10 NRS, ordinal) under two physiotherapy protocols in $n = 15$ patients with chronic lower back pain. Each patient receives both protocols in randomised order with a 1-week washout. Lower scores indicate less pain. Differences are defined as $d_i = A_i - B_i$ (negative = A produces less pain).
Raw data:
| $i$ | Protocol A | Protocol B | $d_i = A_i - B_i$ |
|---|---|---|---|
| 1 | 4 | 6 | −2 |
| 2 | 7 | 7 | 0 |
| 3 | 3 | 5 | −2 |
| 4 | 6 | 8 | −2 |
| 5 | 5 | 4 | 1 |
| 6 | 4 | 7 | −3 |
| 7 | 6 | 6 | 0 |
| 8 | 3 | 6 | −3 |
| 9 | 5 | 5 | 0 |
| 10 | 7 | 9 | −2 |
| 11 | 4 | 6 | −2 |
| 12 | 6 | 7 | −1 |
| 13 | 5 | 8 | −3 |
| 14 | 3 | 5 | −2 |
| 15 | 6 | 7 | −1 |
Step 1 — Exclude zeros:
$d_i = 0$ for participants 2, 7, 9 → $n_0 = 3$; $n_r = 15 - 3 = 12$.
Non-zero differences:
$n_+ = 1$ (participant 5: $d_5 = +1$); $n_- = 11$ (all others).
Step 2 — Absolute differences and ranks:
| $\lvert d_i \rvert$ value | Count | Rank positions | Avg rank |
|---|---|---|---|
| 1 | 3 | 1, 2, 3 | 2.0 |
| 2 | 6 | 4, 5, 6, 7, 8, 9 | 6.5 |
| 3 | 3 | 10, 11, 12 | 11.0 |
Rank table (non-zero differences only):
| $i$ | $d_i$ | $\lvert d_i \rvert$ | Rank | Signed Rank |
|---|---|---|---|---|
| 1 | −2 | 2 | 6.5 | −6.5 |
| 3 | −2 | 2 | 6.5 | −6.5 |
| 4 | −2 | 2 | 6.5 | −6.5 |
| 5 | +1 | 1 | 2.0 | +2.0 |
| 6 | −3 | 3 | 11.0 | −11.0 |
| 8 | −3 | 3 | 11.0 | −11.0 |
| 10 | −2 | 2 | 6.5 | −6.5 |
| 11 | −2 | 2 | 6.5 | −6.5 |
| 12 | −1 | 1 | 2.0 | −2.0 |
| 13 | −3 | 3 | 11.0 | −11.0 |
| 14 | −2 | 2 | 6.5 | −6.5 |
| 15 | −1 | 1 | 2.0 | −2.0 |
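Step 3 — Rank sums:
$W^+ = 2.0$ (participant 5 only)
$W^- = 6(6.5) + 3(11.0) + 2(2.0) = 76.0$
Check: $W^+ + W^- = 2.0 + 76.0 = 78 = 12(13)/2$ ✅
Step 4 — Exact p-value ($n_r = 12$, $W = 2.0$):
From Wilcoxon signed-rank exact tables: $P(W \le 2) = 3/4096 \approx .0007$ (one-tail).
Two-tailed: $p = 2 \times 3/4096 \approx .0015$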
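Standard exact tables assume untied ranks $1, \dots, n_r$. When ties are present, one option is to enumerate the null distribution of $W^+$ conditional on the observed midranks. The sketch below does this with a small dynamic programme over doubled ranks (so that half-ranks become integers). It is an illustrative assumption about how an exact tie-aware p-value could be computed, not a description of DataStatPro's method, and the function name is hypothetical.

```python
from collections import Counter


def exact_signed_rank_p(ranks, w_plus):
    """Two-sided exact p-value for W+ given the (mid)ranks of |d_i|.

    Every sign assignment is equally likely under H0, so the p-value is the
    share of assignments whose W+ is at least as far from the centre as observed.
    """
    doubled = [round(2 * r) for r in ranks]  # midranks are multiples of 0.5
    dist = Counter({0: 1})  # doubled rank sum -> number of sign assignments
    for r in doubled:
        updated = Counter(dist)  # rank r given a negative sign
        for s, c in dist.items():
            updated[s + r] += c  # rank r given a positive sign
        dist = updated
    centre = sum(doubled) / 2
    observed_dev = abs(2 * w_plus - centre)
    extreme = sum(c for s, c in dist.items() if abs(s - centre) >= observed_dev)
    return extreme / 2 ** len(ranks)


# Midranks from Example 2 (three 2.0s, six 6.5s, three 11.0s), observed W+ = 2.0
example2_ranks = [2.0] * 3 + [6.5] * 6 + [11.0] * 3
print(exact_signed_rank_p(example2_ranks, 2.0))
```

With tied ranks this conditional p-value can differ from the tabled value, because more sign assignments can reach the same small rank sum.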
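Step 5 — z-approximation (with tie correction):
Tie groups of size 3, 6, and 3: $\sum (t_j^3 - t_j)/48 = (24 + 210 + 24)/48 = 5.375$, so $\mathrm{Var}(W^+) = 12(13)(25)/24 - 5.375 = 157.125$
$z = (2.0 - 39.0)/\sqrt{157.125} \approx -2.95$ (no continuity correction)
Step 6 — Hodges-Lehmann estimate:
$\hat{\theta}_{HL} = -2.0$ NRS points (median of Walsh averages)
95% CI for $\hat{\theta}_{HL}$ (exact): $[-3.0, -1.0]$ NRS points
Step 7 — Effect sizes:
$r_{rb} = (2.0 - 76.0)/78 \approx -0.95$: very large effect (Protocol A produces substantially less pain)
$r = z/\sqrt{n_r} = -2.95/\sqrt{12} \approx -0.85$
CL (proportion of differences favouring Protocol A):
$CL = n_-/n_r = 11/12 \approx 0.917$
Protocol A outperforms Protocol B in 91.7% of patients with non-zero differences.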
Summary:
| Statistic | Value | Interpretation |
|---|---|---|
| Protocol A median pain | 5.0 NRS | |
| Protocol B median pain | 6.0 NRS | |
| $n_r$ (3 zeros excluded) | 12 | |
| $n_+$ / $n_-$ / $n_0$ | 1 / 11 / 3 | |
| $W$ (minimum) | 2.0 | |
| Exact $p$ (two-tailed) | .001 | Significant |
| HL estimate | −2.0 NRS | Protocol A lowers pain by 2 pts |
| 95% CI | [−3.0, −1.0] | Excludes 0 |
| $r_{rb}$ | −0.95 | Very large |
| $r$ | −0.85 | Very large |
| CL | 91.7% favour A | Protocol A clearly superior |
APA write-up: "Due to the ordinal nature of the NRS pain scale and the presence of tied differences, a Wilcoxon Signed-Rank Test was conducted. Three pairs with equal ratings were excluded, leaving $n_r = 12$ pairs. Protocol A (Mdn = 5.0) produced significantly lower pain ratings than Protocol B (Mdn = 6.0), $W = 2.0$, $p = .001$ (exact), $r_{rb} = -0.95$ [95% CI: −0.99, −0.73]. The Hodges-Lehmann estimate indicated that Protocol A reduced pain by a median of 2.0 NRS points compared to Protocol B [95% CI: 1.0, 3.0]. This represents a very large effect, with Protocol A producing lower pain in 11 of 12 patients with non-zero differences."
Example 3: One-Sample Wilcoxon — Daily Step Counts vs. Health Guideline
A public health researcher tests whether median daily step counts in a sample of $n = 18$ office workers differ from the recommended health guideline of 10,000 steps per day. The distribution of step counts is right-skewed (per a Shapiro-Wilk test); the one-sample Wilcoxon Signed-Rank Test is used.
Data (daily steps, thousands):
$x_i$: 6.2, 8.4, 11.3, 7.1, 9.8, 5.6, 12.4, 8.9, 7.3, 10.1, 6.8, 9.4, 13.2, 7.6, 8.1, 11.8, 6.4, 9.2
Null hypothesis: $H_0\colon \theta = 10.0$ thousand steps (health guideline)
vs. $H_1\colon \theta \ne 10.0$
Differences from guideline: $d_i = x_i - 10.0$:
$d_i$: −3.8, −1.6, +1.3, −2.9, −0.2, −4.4, +2.4, −1.1, −2.7, +0.1, −3.2, −0.6, +3.2, −2.4, −1.9, +1.8, −3.6, −0.8
Step 1 — No zero differences: $n_0 = 0$; $n_r = 18$.
$n_+ = 5$ (values: 1.3, 2.4, 0.1, 3.2, 1.8); $n_- = 13$.
Step 2 — Rank absolute differences:
Sorted $\lvert d_i \rvert$: 0.1, 0.2, 0.6, 0.8, 1.1, 1.3, 1.6, 1.8, 1.9, 2.4, 2.4, 2.7, 2.9, 3.2, 3.2, 3.6, 3.8, 4.4
Ranks 1–18 assigned (midranks for tied values 2.4 and 3.2):
| $\lvert d_i \rvert$ | Rank | Sign | Signed Rank |
|---|---|---|---|
| 0.1 | 1 | + | +1 |
| 0.2 | 2 | − | −2 |
| 0.6 | 3 | − | −3 |
| 0.8 | 4 | − | −4 |
| 1.1 | 5 | − | −5 |
| 1.3 | 6 | + | +6 |
| 1.6 | 7 | − | −7 |
| 1.8 | 8 | + | +8 |
| 1.9 | 9 | − | −9 |
| 2.4 | 10.5 | + | +10.5 |
| 2.4 | 10.5 | − | −10.5 |
| 2.7 | 12 | − | −12 |
| 2.9 | 13 | − | −13 |
| 3.2 | 14.5 | + | +14.5 |
| 3.2 | 14.5 | − | −14.5 |
| 3.6 | 16 | − | −16 |
| 3.8 | 17 | − | −17 |
| 4.4 | 18 | − | −18 |
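Step 3 — Rank sums:
$W^+ = 1 + 6 + 8 + 10.5 + 14.5 = 40.0$
$W^- = 171 - 40.0 = 131.0$
Check: $W^+ + W^- = 171 = 18(19)/2$ ✅
Step 4 — Normal approximation ($n_r = 18$; asymptotic test with tie correction):
$E[W^+] = 18(19)/4 = 85.5$; $\mathrm{Var}(W^+) = 18(19)(37)/24 = 527.25$ before the tie correction
Ties: two pairs of ties (2.4 twice, 3.2 twice): correction $= 2(2^3 - 2)/48 = 0.25$, so $\mathrm{Var}(W^+) = 527.0$
With continuity correction: $z = (40.0 - 85.5 + 0.5)/\sqrt{527.0} \approx -1.96$, two-tailed $p \approx .050$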
(Marginal; exact p-value from DataStatPro: )
Step 5 — Hodges-Lehmann estimate and CI:
$\hat{\theta}_{HL} = -1.25$ thousand steps (estimated median difference from 10,000; the 86th of the 171 sorted Walsh averages)
Population pseudo-median: $10.0 - 1.25 = 8.75$ thousand steps/day
95% CI for pseudo-median: thousand steps from guideline
Step 6 — Effect sizes:
$r_{rb} = (40.0 - 131.0)/171 \approx -0.53$ (large effect, below guideline)
Summary:
| Statistic | Value |
|---|---|
| Sample median steps | 8.65k |
| Guideline | 10.0k |
| $W^+$ / $W^-$ | 40.0 / 131.0 |
| $p$ (exact, two-tailed) | $< .05$ |
| HL estimate (from guideline) | −1.25k steps |
| 95% CI | [−3.50, −0.05]k steps |
| $r_{rb}$ | −0.53 (Large) |
APA write-up: "A one-sample Wilcoxon Signed-Rank Test was used to examine whether median daily step counts differed from the 10,000-step health guideline, as step counts were right-skewed (Shapiro-Wilk test). The sample median of 8,650 steps was significantly below the guideline, $W^+ = 40.0$, $p < .05$ (exact), $r_{rb} = -0.53$ [95% CI: −0.78, −0.09]. The Hodges-Lehmann estimate indicated that office workers fell short of the guideline by a median of 1,250 steps/day [95% CI: 50, 3,500 steps below], a large effect."
Example 4: Comparing Two Teaching Methods — Non-Significant Result
A teacher compares student performance on matched reading comprehension tests under two instructional methods: silent reading vs. guided discussion, in $n = 10$ students. Test scores range 0–100.
Data:
| $i$ | Silent ($x_{1i}$) | Discussion ($x_{2i}$) | $d_i = x_{1i} - x_{2i}$ |
|---|---|---|---|
| 1 | 72 | 75 | −3 |
| 2 | 68 | 71 | −3 |
| 3 | 81 | 78 | 3 |
| 4 | 65 | 68 | −3 |
| 5 | 77 | 73 | 4 |
| 6 | 70 | 72 | −2 |
| 7 | 75 | 75 | 0 |
| 8 | 83 | 80 | 3 |
| 9 | 69 | 74 | −5 |
| 10 | 74 | 76 | −2 |
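Step 1 — Zero differences: Participant 7: $d_7 = 0$ → $n_0 = 1$; $n_r = 9$.
Non-zero $d_i$: −3, −3, 3, −3, 4, −2, 3, −5, −2
$n_+ = 3$ ($d_i = 3, 4, 3$); $n_- = 6$ ($d_i = -3, -3, -3, -2, -5, -2$).
Step 2 — Rank absolute differences:
Sorted $\lvert d_i \rvert$: 2, 2, 3, 3, 3, 3, 3, 4, 5
| $\lvert d_i \rvert$ | Count | Avg Rank | Sign assignments |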
|---|---|---|---|
| 2 | 2 | 1.5 | Both −: −1.5, −1.5 |
| 3 | 5 | 5.0 | Three −, two +: −5.0 (×3), +5.0 (×2) |
| 4 | 1 | 8.0 | One +: +8.0 |
| 5 | 1 | 9.0 | One −: −9.0 |
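Step 3 — Rank sums:
$W^+ = 5.0 + 5.0 + 8.0 = 18.0$
$W^- = 1.5 + 1.5 + 3(5.0) + 9.0 = 27.0$
Check: $W^+ + W^- = 45 = 9(10)/2$ ✅
Step 4 — Exact p-value ($n_r = 9$):
From exact Wilcoxon tables: $P(W \le 18) = 167/512 \approx .326$ (one-tail)
Two-tailed: $p = 2 \times .326 \approx .65$. $W = 18$ lies close to its null expectation of $9(10)/4 = 22.5$, so the result is far from significant.
The tabled value ignores the tied ranks, so an exact computation that conditions on them can differ slightly.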
Step 5 — Effect sizes:
$r_{rb} = (18.0 - 27.0)/45 = -0.20$ (small effect, discussion slightly better)
Hodges-Lehmann estimate: $\hat{\theta}_{HL} = -1.0$ points
95% CI for $\hat{\theta}_{HL}$ (exact): $[-4.0, 2.5]$ points (includes 0)
Summary:
| Statistic | Value | Interpretation |
|---|---|---|
| Silent median | 73.0 pts | |
| Discussion median | 74.5 pts | |
| $n_r$ (1 zero excluded) | 9 | |
| $W^+$ / $W^-$ / $W$ | 18.0 / 27.0 / 18.0 | |
| $p$ (exact, two-tailed) | $\approx .65$ | Not significant |
| HL estimate | −1.0 pts | Discussion slightly higher |
| 95% CI for $\hat{\theta}_{HL}$ | [−4.0, 2.5] pts | Includes 0 |
| $r_{rb}$ | −0.20 | Small effect |
APA write-up: "A Wilcoxon Signed-Rank Test was conducted to compare comprehension scores under silent reading and guided discussion. One pair with identical scores was excluded ($n_r = 9$). There was no significant difference between silent reading (Mdn = 73.0) and guided discussion (Mdn = 74.5), $W = 18.0$, $p \approx .65$ (exact), $r_{rb} = -0.20$ [95% CI: −0.69, 0.41]. The Hodges-Lehmann estimate of the median difference was −1.0 points [95% CI: −4.0, 2.5], indicating a small, non-significant advantage for guided discussion. Given the small sample size ($n_r = 9$), this study had limited power to detect small effects (minimum detectable $d \approx 0.96$ at 80% power by the ARE-based approximation of Section 10.3)."
13. Common Mistakes and How to Avoid Them
Mistake 1: Using the Wilcoxon Signed-Rank Test When the Sign Test is More Appropriate
Problem: Applying the Wilcoxon Signed-Rank Test to data where differences can be assessed for direction but not for meaningful magnitude — for example, nominal categories coded as 0/1, or extremely coarse ordinal data with only a few categories. The Wilcoxon test requires that ranking the absolute differences is meaningful; if it is not, the test is invalid.
Solution: When only the direction of change is known (positive or negative), use the Sign Test. When differences can be meaningfully ranked, use the Wilcoxon test. Examine whether the concept of "a difference of 4 being larger than a difference of 2" makes sense for your measurement scale.
Mistake 2: Ignoring the Symmetry Assumption
Problem: Applying the Wilcoxon Signed-Rank Test without checking whether the difference scores are approximately symmetrically distributed about zero. The test assumes symmetry; without it, the p-value conflates location and shape effects. For instance, with right-skewed positive differences, even a true null hypothesis can be rejected because the large positive outliers inflate $W^+$.
Solution: Always plot a histogram of the difference scores and assess symmetry visually. Compute the sample skewness of the differences. If differences are severely asymmetric, use the Sign Test or a bootstrap-based test instead.
Mistake 3: Not Reporting the Effective Sample Size and the Number of Zeros
Problem: Reporting $n$ pairs but not mentioning that 5 pairs had $d_i = 0$ and were excluded, leaving $n_r = n - 5$ for the analysis. Readers cannot evaluate the precision of the estimate or compare it to power requirements without knowing $n_r$.
Solution: Always report $n$ (total pairs), $n_0$ (zero differences excluded), $n_+$, $n_-$, and $n_r$ (effective sample size). State explicitly: "$n_0$ pairs with zero difference were excluded from the analysis, leaving $n_r$ = [value]."
Mistake 4: Reporting Only the p-value Without an Effect Size
Problem: Reporting only $W$ and $p$, without any effect size measure. The Wilcoxon test statistic is not interpretable without knowing $n_r$, and the p-value conveys nothing about the magnitude of the effect.
Solution: Always report the rank-biserial correlation $r_{rb}$ (or $r$) with its 95% CI, and the Hodges-Lehmann estimate $\hat{\theta}_{HL}$ with its 95% CI. These together convey effect magnitude in both standardised and original-units terms.
Mistake 5: Using the Paired t-Test When the Wilcoxon Test is Clearly Needed
Problem: Observing highly non-normal difference scores with extreme outliers in a small sample and proceeding with the paired t-test because "it's the standard test." The t-test's p-value may be seriously distorted by even a single extreme outlier in small samples.
Solution: Implement a pre-analysis normality check on difference scores (Shapiro-Wilk). If normality is rejected and the sample is small, use the Wilcoxon test as the primary analysis. Run the paired t-test as a sensitivity check and report both results with an explanation of why they may differ.
Mistake 6: Treating a Non-Significant Wilcoxon Result as Evidence of No Difference
Problem: Reporting a non-significant result ($p > .05$) and concluding "the two conditions do not differ." As with all hypothesis tests, a non-significant result only indicates insufficient evidence to reject $H_0$; it does not establish equivalence or absence of an effect.
Solution: Report the Hodges-Lehmann estimate and its 95% CI. If the CI is wide, note that the study is underpowered and a meaningful effect may exist but be undetected. For claims of equivalence, use a formal equivalence test (e.g., TOST on the Hodges-Lehmann estimator) with pre-specified bounds.
Mistake 7: Mis-Reporting the Test Statistic
Problem: Confusion about which statistic to report. Different software uses different conventions: some report $W^+$ (sum of positive ranks), some report $W^-$, some report $W = \min(W^+, W^-)$, some report a $z$-statistic, and some report $T$ (an older notation equivalent to $W$). Reporting a bare "$W$" value without specifying that this is the minimum (vs. $W^+$) is ambiguous.
Solution: Clearly specify what was reported. DataStatPro reports $W^+$, $W^-$, and $W$ separately, and the auto-generated APA paragraph uses the convention $W$ = [value] to avoid ambiguity. When reporting, specify: "$W^+$ = [value] (sum of positive ranks)" or "$W$ = [value] (Wilcoxon signed-rank statistic, minimum of $W^+$ and $W^-$)".
Mistake 8: Applying the Test to Data With Too Many Ties Without Addressing Them
Problem: Using Likert scale data where many participants show no change ($d_i = 0$) and many show changes of exactly 1 point (massive ties in $\lvert d_i \rvert$). Running the standard Wilcoxon test without the tie correction produces inaccurate p-values, and excluding a large proportion of zeros severely reduces power.
Solution: Report the proportion of zero differences ($n_0/n$). Apply the tie correction to the variance (DataStatPro does this automatically). Consider using Pratt's method when zero differences are informative. If more than 30% of differences are zero or tied, acknowledge the limitation and consider whether the Sign Test or a permutation test is more appropriate.
Mistake 9: Comparing $r_{rb}$ from the Wilcoxon Test Directly with Cohen's $d$ from a t-Test
Problem: Reporting $r_{rb}$ from a Wilcoxon test alongside Cohen's $d$ from a paired t-test on different but related data, and treating them as equivalent effect sizes. $r_{rb}$ and $d$ are on different scales and are only approximately related through conversion formulas.
Solution: Use the conversion formula $d = 2r_{rb}/\sqrt{1 - r_{rb}^2}$ for rough comparison, and clearly note the approximation. For direct comparison, use the same effect size metric (e.g., convert both to $d$) and acknowledge that the Wilcoxon-based $r_{rb}$ and the t-test-based $d$ measure slightly different aspects of the effect (rank-based vs. mean-based).
Mistake 10: Using the Asymptotic Test When the Exact Test is Available
Problem: With a small number of non-zero differences and few ties, using the normal approximation because it yields $p < .05$ when the exact test gives $p > .05$, and reporting the asymptotic result to achieve significance.
Solution: Always use the exact p-value when $n_r$ is small and ties are few. DataStatPro automatically selects the exact test for small samples. Never choose a p-value method post-hoc based on which gives a more favourable result. Pre-specify the method (exact vs. asymptotic) before analysis.
14. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| $W^+ + W^- \ne n_r(n_r+1)/2$ | Arithmetic error in ranking or rank sum computation | Recheck ranks, midranks, and sums; verify $W^+ + W^- = n_r(n_r+1)/2$ (zeros excluded) |
| $W = 0$ and $r_{rb} = \pm 1$ | All differences are positive (or all negative) | Verify data; if genuine, $r_{rb} = \pm 1$ is correct; compute exact p-value |
| Exact and asymptotic p-values diverge substantially | Small $n_r$ (asymptotic unreliable) or many ties | Use exact p-value for small $n_r$; use permutation test if many ties |
| Many zero differences (large $n_0$) | Coarse measurement scale; many participants show no change | Report $n_0$ explicitly; consider Pratt's method; consider Sign Test; note reduced power |
| Wilcoxon significant but paired t-test not significant | Outliers in differences inflating $s_d$ (t-test); Wilcoxon more robust | Inspect difference distribution; if outliers present, Wilcoxon result is more reliable |
| Paired t-test significant but Wilcoxon not significant | Small $n_r$ after zero exclusion; t-test using mean which is influenced by extreme values | Inspect $d_i$ closely; if differences are symmetric and normal, t-test is appropriate; if not, Wilcoxon |
| $r_{rb} = \pm 1.00$ | All non-zero differences have the same sign | Perfect effect in the data; report $r_{rb} = \pm 1.00$ with note that all participants changed in the same direction |
| 95% CI for $r_{rb}$ is very wide | Small $n_r$ | Report wide CI; increase sample size; note low precision |
| Hodges-Lehmann estimate differs substantially from mean difference | Presence of outliers or skewness in $d_i$ | Both are valid but measure different things; HL is the natural companion to the Wilcoxon test |
| Skewness check suggests asymmetric differences | Data do not meet symmetry assumption | Use Sign Test; report both tests; use bootstrap p-value |
| Software reports negative $W^+$ or $W^-$ | Software error or sign convention confusion | Check software documentation; both $W^+$ and $W^-$ are non-negative by definition |
| Tie correction produces a negative variance | Extreme number of ties; something is wrong | Check data for coding errors; with excessive ties, use permutation test |
| One-sample version gives different result from paired version for same data | Check how differences were defined | One-sample tests against $\theta_0$; paired tests against 0; should be equivalent if $d_i = x_i - \theta_0$ |
| Power is very low despite significant result | Sample size is small; significance is due to extreme effect size, not adequate power | Report sensitivity analysis; note that future replications need larger samples |
| Cannot determine without raw data | Only summary statistics available | requires all values; request data or report only |
15. Quick Reference Cheat Sheet
Core Formulas
| Formula | Description |
|---|---|
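| $d_i = x_{1i} - x_{2i}$ | Difference score for pair $i$ |
| $n_r = n - n_0$ | Effective sample size (excluding the $n_0$ zero differences) |
| $W^+ = \sum_{d_i > 0} R_i$ | Sum of positive ranks ($R_i$ = rank of $\lvert d_i \rvert$) |
| $W^- = \sum_{d_i < 0} R_i$ | Sum of negative ranks |
| $W^+ + W^- = \dfrac{n_r(n_r + 1)}{2}$ | Verification check |
| $W = \min(W^+, W^-)$ | Wilcoxon test statistic |
| $E(W) = \dfrac{n_r(n_r + 1)}{4}$ | Expected $W$ under $H_0$ |
| $\mathrm{Var}(W) = \dfrac{n_r(n_r + 1)(2n_r + 1)}{24}$ | Variance of $W$ (no ties) |
| $\mathrm{Var}(W) = \dfrac{n_r(n_r + 1)(2n_r + 1)}{24} - \sum_j \dfrac{t_j^3 - t_j}{48}$ | Variance with tie correction ($t_j$ = size of tie group $j$) |
| $z = \dfrac{W - E(W)}{\sqrt{\mathrm{Var}(W)}}$ | z-statistic |
| $z = \dfrac{\lvert W - E(W) \rvert - 0.5}{\sqrt{\mathrm{Var}(W)}}$ | z with continuity correction |
| $p = 2\,[1 - \Phi(\lvert z \rvert)]$ | Two-tailed p-value |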
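
The core quantities in this table can be chained together in a few lines. The following is a minimal sketch rather than the DataStatPro implementation: it assumes zero differences are dropped (the default Wilcoxon handling), tied absolute differences receive average ranks, the test statistic is the smaller rank sum, and the p-value is asymptotic. The function name `signed_rank_asymptotic` and the sample scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def signed_rank_asymptotic(x1, x2, continuity=True):
    """Return (W_plus, W_minus, |z|, two-tailed p) for paired samples."""
    # Difference scores; zero differences are excluded (default Wilcoxon handling).
    d = [a - b for a, b in zip(x1, x2) if a != b]
    n_r = len(d)
    if n_r == 0:
        raise ValueError("all differences are zero")

    # Rank |d_i| from smallest to largest; tied values share their average rank.
    order = sorted(range(n_r), key=lambda i: abs(d[i]))
    ranks = [0.0] * n_r
    tie_sum = 0  # accumulates t^3 - t over tie groups (0 for untied values)
    i = 0
    while i < n_r:
        j = i
        while j + 1 < n_r and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        t = j - i + 1
        tie_sum += t**3 - t
        i = j + 1

    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    assert w_plus + w_minus == n_r * (n_r + 1) / 2  # verification check

    w = min(w_plus, w_minus)
    mean_w = n_r * (n_r + 1) / 4
    var_w = n_r * (n_r + 1) * (2 * n_r + 1) / 24 - tie_sum / 48
    correction = 0.5 if continuity else 0.0
    z_abs = max(abs(w - mean_w) - correction, 0.0) / sqrt(var_w)
    p = 2 * (1 - NormalDist().cdf(z_abs))
    return w_plus, w_minus, z_abs, p


# Hypothetical paired scores for ten participants.
before = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
after = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
w_plus, w_minus, z_abs, p = signed_rank_asymptotic(before, after)
print(w_plus, w_minus, round(z_abs, 3), round(p, 3))
```

With only nine non-zero differences, an exact p-value would normally be preferred; the asymptotic value here only illustrates the arithmetic.
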
Effect Size Formulas
| Formula | Description |
|---|---|
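| $r_{rb} = \dfrac{W^+ - W^-}{W^+ + W^-}$ | Matched-pairs rank-biserial correlation |
| $r_{rb} = \dfrac{2(W^+ - W^-)}{n_r(n_r + 1)}$ | Alternative formula for $r_{rb}$ |
| $r = z / \sqrt{N}$ | Effect size from z-statistic |
| $\hat{\theta}_{HL} = \operatorname{median}\{(d_i + d_j)/2 : i \le j\}$ | Hodges-Lehmann pseudo-median |
| | Common Language Effect Size (simple) |
| $d \approx \dfrac{2r}{\sqrt{1 - r^2}}$ | Convert $r$ to Cohen's $d$ (approx.) |
| $r \approx \dfrac{d}{\sqrt{d^2 + 4}}$ | Convert Cohen's $d$ to $r$ (approx.) |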
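
As a rough illustration of the rank-biserial correlation and the approximate conversions listed in this table, the sketch below uses the simple difference form of the rank-biserial correlation (difference of the rank sums divided by their total) and the common approximation $r \approx d/\sqrt{d^2 + 4}$. That approximation is an assumption here, chosen because it reproduces the r column of the sample-size table later in this cheat sheet (for example, a d of 0.50 maps to about 0.24). All function names are hypothetical.

```python
from math import sqrt


def rank_biserial(w_plus, w_minus):
    """Matched-pairs rank-biserial correlation from the two rank sums."""
    return (w_plus - w_minus) / (w_plus + w_minus)


def d_to_r(d):
    """Approximate conversion from Cohen's d to r."""
    return d / sqrt(d**2 + 4)


def r_to_d(r):
    """Approximate conversion from r to Cohen's d (algebraic inverse of d_to_r)."""
    return 2 * r / sqrt(1 - r**2)


print(rank_biserial(27, 18))            # rank sums favouring positive differences
print(round(d_to_r(0.50), 2))           # d = 0.50 expressed as r
print(round(r_to_d(d_to_r(0.80)), 2))   # round trip recovers d
```

The sign of the rank-biserial value follows the direction of the differences, so it is worth stating how the difference scores were defined when reporting it.
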
Walsh Averages for Hodges-Lehmann CI
| $n$ pairs | Walsh averages, $n(n+1)/2$ |
|---|---|
| 5 | 15 |
| 10 | 55 |
| 15 | 120 |
| 20 | 210 |
| 25 | 325 |
| 30 | 465 |
| 50 | 1,275 |
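
The counts in this table follow from taking every pairwise average of the difference scores, including each score paired with itself, which gives n(n+1)/2 values. A minimal sketch, assuming the Hodges-Lehmann estimate is the median of those Walsh averages; the function names and the five difference scores are hypothetical, and whether zero differences are included is left to the caller.

```python
from itertools import combinations_with_replacement
from statistics import median


def walsh_averages(d):
    """All pairwise averages (d_i + d_j) / 2 with i <= j."""
    return [(a + b) / 2 for a, b in combinations_with_replacement(d, 2)]


def hodges_lehmann(d):
    """Median of the Walsh averages (pseudo-median of the differences)."""
    return median(walsh_averages(d))


diffs = [1, 2, 4, 7, 11]            # hypothetical difference scores
print(len(walsh_averages(diffs)))   # number of Walsh averages for n = 5
print(hodges_lehmann(diffs))
```

Because the number of averages grows quadratically with the number of pairs, this brute-force approach is fine for the sample sizes in the table but becomes wasteful for very large samples.
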
Cohen's Benchmarks for $r_{rb}$ and $r$
| $\lvert r_{rb} \rvert$ or $\lvert r \rvert$ | Label | Approx. equiv. |
| :--------------------- | :---- | :--------------------- |
| $< 0.10$ | Negligible | |
| $0.10$ to $< 0.30$ | Small | |
| $0.30$ to $< 0.50$ | Medium | |
| $\ge 0.50$ | Large | |
ARE Comparison: Wilcoxon vs. Paired t-Test
| Data Distribution | ARE | Interpretation |
|---|---|---|
| Normal | 0.955 | Wilcoxon needs 5% more pairs |
| Uniform | 1.000 | Identical efficiency |
| Logistic | 1.097 | Wilcoxon needs 9% fewer pairs |
| Laplace | 1.500 | Wilcoxon needs 33% fewer pairs |
| Contaminated normal | $> 1$ | Wilcoxon substantially more powerful |
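
The percentages in the Interpretation column follow from the reciprocal of the asymptotic relative efficiency: if the ARE is e, the Wilcoxon test needs roughly 1/e times as many pairs as the paired t-test in large samples. The sketch below uses the standard closed-form ARE constants for each distribution ($3/\pi$ for the normal, $\pi^2/9$ for the logistic); the dictionary and function names are hypothetical.

```python
from math import pi

# Asymptotic relative efficiency of the Wilcoxon signed-rank test vs. the t-test.
ARE = {
    "normal": 3 / pi,
    "uniform": 1.0,
    "logistic": pi**2 / 9,
    "laplace": 1.5,
}


def pair_change_percent(are):
    """Percent change in pairs needed by Wilcoxon (negative means fewer pairs)."""
    return (1 / are - 1) * 100


for name, are in ARE.items():
    print(f"{name:9s} ARE = {are:.3f}  change in pairs = {pair_change_percent(are):+.1f}%")
```

These are large-sample results, so the overhead observed at small sample sizes can differ from the asymptotic figure.
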
Required $n$ for 90% Power (Two-Tailed $\alpha = .05$, Normal Data)
| $d$ equivalent | $r$ (approx.) | Wilcoxon $n$ | Paired t $n$ | Overhead |
|---|---|---|---|---|
| 0.20 | 0.10 | 277 | 264 | +5% |
| 0.30 | 0.15 | 125 | 119 | +5% |
| 0.50 | 0.24 | 46 | 44 | +5% |
| 0.80 | 0.37 | 19 | 18 | +6% |
| 1.00 | 0.45 | 14 | 13 | +8% |
| 1.20 | 0.51 | 10 | 9 | +11% |
| 1.50 | 0.60 | 7 | 7 | 0% |
Zero and Tie Handling Reference
| Situation | Method | Notes |
|---|---|---|
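| $d_i = 0$ (default) | Wilcoxon: exclude | Effective sample size drops by the number of zeros; report the number excluded |
| $d_i = 0$ (alternative) | Pratt: include in ranking | Can affect p-value; use when zeros are informative |
| Tied $\lvert d_i \rvert$ | Average ranks | Apply the tie correction to the variance |
| Many ties | Permutation test | Exact handling regardless of tie structure |
| Small effective sample size, few ties | Exact p-value | Always preferred |
| Larger effective sample size | Asymptotic + continuity correction | Accurate for most situations |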
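
To make the difference between the two zero-handling methods concrete, the sketch below ranks the same difference scores both ways. It assumes the usual definitions: the Wilcoxon method drops zeros before ranking, while Pratt's method ranks the zeros together with the other absolute differences and then discards their ranks. Tied values receive average ranks in both cases. The function names and the `zero_method` labels are hypothetical.

```python
def midranks(values):
    """Ranks of values (1 = smallest), with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sums(d, zero_method="wilcox"):
    """Return (W_plus, W_minus) under the chosen zero-handling method."""
    if zero_method == "wilcox":
        d = [v for v in d if v != 0]      # drop zeros before ranking
    elif zero_method != "pratt":
        raise ValueError("zero_method must be 'wilcox' or 'pratt'")
    ranks = midranks([abs(v) for v in d])  # under Pratt, zeros are ranked too
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    return w_plus, w_minus                 # ranks of zeros contribute to neither sum


diffs = [0, 0, 3, -1, 2, 2, -4]            # hypothetical difference scores
print(rank_sums(diffs, "wilcox"))
print(rank_sums(diffs, "pratt"))
```

Under Pratt's method the non-zero differences occupy higher ranks, so the two rank sums no longer add up to the usual total for the effective sample size, and the null distribution has to be adjusted accordingly.
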
Test Selection Guide
Two related conditions, continuous or ordinal DV?
├── Are difference scores normally distributed?
│ (Check: Shapiro-Wilk on d_i, Q-Q plot)
│ ├── YES and n ≥ 15 → Paired t-test (more power)
│ │ (Report Wilcoxon as sensitivity check if desired)
│ └── NO, or n < 30 and Shapiro-Wilk p < .05
│ └── Are differences rankable (magnitudes meaningful)?
│ ├── YES → Wilcoxon Signed-Rank Test ✅
│ │ ├── Are differences symmetric?
│ │ │ ├── YES → Standard Wilcoxon ✅
│ │ │ └── NO → Sign Test or Bootstrap
│ │ └── Many zeros? → Consider Pratt's method
│ └── NO (only direction known) → Sign Test
└── Three or more conditions → Friedman Test
Comparison: Wilcoxon vs. Sign Test vs. Paired t-Test
| Property | Paired t-Test | Wilcoxon Signed-Rank | Sign Test |
|---|---|---|---|
| Uses magnitude of differences | ✅ Full | ✅ Ranks | ❌ No |
| Assumes normality | ✅ Yes | ❌ No | ❌ No |
| Assumes symmetry | (via normality) | ✅ Yes | ❌ No |
| ARE vs. t-test | 1.000 | 0.955 | 0.637 |
| Robust to outliers | ❌ Low | ✅ High | ✅ Very high |
| Handles ordinal DV | ❌ No | ✅ Yes | ✅ Yes |
| Effect size | Cohen's $d$ | $r_{rb}$, $r$ | |
| Point estimate | Mean difference | Hodges-Lehmann | Median |
APA 7th Edition Reporting Templates
Standard significant result:
"Due to [non-normal difference scores / ordinal measurement scale] (Shapiro-Wilk $W$ = [value], $p$ = [value]), a Wilcoxon Signed-Rank Test was conducted. [Condition 1] (Mdn = [value]) [was / was not] significantly [higher / lower] than [Condition 2] (Mdn = [value]), $W$ = [value], $z$ = [value], $p$ = [value] [(exact)/(asymptotic)]. The Hodges-Lehmann estimate of the median difference was [value] [units] [95% CI: LB, UB], $r_{rb}$ = [value] [95% CI: LB, UB], indicating a [small / medium / large] effect. [[value] pairs with zero difference were excluded, leaving [value] pairs for analysis.]"
Non-significant result:
"A Wilcoxon Signed-Rank Test revealed no significant difference between [Condition 1] (Mdn = [value]) and [Condition 2] (Mdn = [value]), $W$ = [value], $p$ = [value], $r_{rb}$ = [value] [95% CI: LB, UB]. The Hodges-Lehmann estimate was [value] [95% CI: LB, UB], indicating a [small / negligible] effect that the study was insufficiently powered to detect (minimum detectable $r_{rb}$ = [value] at 80% power)."
One-sample version:
"A one-sample Wilcoxon Signed-Rank Test was conducted to examine whether the population pseudo-median of [DV] differed from [θ₀]. The sample median of [value] [was / was not] significantly different from [θ₀], $W$ = [value], $z$ = [value], $p$ = [value]. The Hodges-Lehmann estimate was [value] [units] from the null value [95% CI: LB, UB]."
Wilcoxon Signed-Rank Test Reporting Checklist
| Item | Required |
|---|---|
| Statement of why Wilcoxon was used (non-normality, ordinal, outliers) | ✅ Always |
| Median for each condition | ✅ Always |
| $W$ (and/or $W^+$ or $W^-$), specifying which | ✅ Always |
| z-statistic (if asymptotic) | ✅ When the asymptotic p-value is used |
| p-value (exact or asymptotic; specify which) | ✅ Always |
| $n$ (total pairs), $n_0$ (zeros excluded), $n_r$ (effective $n$) | ✅ Always |
| $n^+$ and $n^-$ (numbers of positive and negative differences) | ✅ Recommended |
| Whether Wilcoxon or Pratt method used for zeros | ✅ When zero differences are present |
| Whether exact, asymptotic, or permutation p-value used | ✅ Always |
| Tie correction applied | ✅ When ties present |
| $r_{rb}$ (primary effect size) with 95% CI | ✅ Always |
| $r$ (from $z$) alongside $r_{rb}$ | ✅ Recommended |
| Hodges-Lehmann estimate with 95% CI | ✅ Always |
| Symmetry check on difference scores | ✅ When |
| Comparison with paired t-test result (sensitivity) | ✅ Recommended |
| Power or sensitivity analysis | ✅ For null results |
| Domain-specific benchmark context for the effect size | ✅ Recommended |
Conversion Formulas: Wilcoxon and Other Metrics
| From | To | Formula |
|---|---|---|
| , | ||
| , | ||
| $r$ | Cohen's $d$ (approx.) | $d \approx 2r / \sqrt{1 - r^2}$ |
| Cohen's $d$ | $r$ (approx.) | $r \approx d / \sqrt{d^2 + 4}$ |
| Cohen's | ||
| (approx.) | ||
| (normal data) | ||
| Required (80% power) |
This tutorial provides a comprehensive foundation for understanding, conducting, and reporting the Wilcoxon Signed-Rank Test within the DataStatPro application. For further reading, consult Wilcoxon's original paper "Individual Comparisons by Ranking Methods" (Biometrics Bulletin, 1945); Hollander, Wolfe & Chicken's "Nonparametric Statistical Methods" (3rd ed., 2014) for rigorous mathematical treatment; Conover's "Practical Nonparametric Statistics" (3rd ed., 1999) for applied guidance; Kerby's "The Simple Difference Formula: An Approach to Teaching Nonparametric Correlation" (Comprehensive Psychology, 2014) for the matched-pairs rank-biserial correlation; Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018) for accessible applied coverage; and van Doorn et al.'s "Bayesian Inference for Kendall's Rank Correlation Coefficient" (The American Statistician, 2018) for the Bayesian extension. For the Hodges-Lehmann estimator and its confidence interval, see Hodges & Lehmann's "Estimates of Location Based on Rank Tests" (Annals of Mathematical Statistics, 1963). For feature requests or support, contact the DataStatPro team.