Paired t-Test: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of dependent-samples inference all the way through the mathematics, assumptions, variants, effect sizes, interpretation, reporting, and practical usage of the Paired t-Test within the DataStatPro application. Whether you are encountering the paired t-test for the first time or seeking a rigorous understanding of within-subjects comparison, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is a Paired t-Test?
- The Mathematics Behind the Paired t-Test
- Assumptions of the Paired t-Test
- Variants of the Paired t-Test
- Using the Paired t-Test Calculator Component
- Full Step-by-Step Procedure
- Effect Sizes for the Paired t-Test
- Confidence Intervals
- Power Analysis and Sample Size Planning
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into the paired t-test, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.
1.1 Populations, Parameters, and Paired Designs
A population is the complete collection of individuals or measurements of interest. A sample is a subset drawn from that population. In a paired design, each participant (or experimental unit) contributes exactly two measurements — one under each of two conditions. The two measurements within a pair are inherently linked.
The paired t-test is an inferential procedure — it uses sample statistics computed from difference scores to draw conclusions about an unknown population parameter, namely the mean of the population difference scores, $\mu_d$.
The fundamental question: "Is the mean difference between the two paired conditions large enough to conclude that a true population-level difference exists?"
1.2 Why Pairing Matters: Removing Between-Person Variability
In most research involving repeated measurements, individuals vary considerably from one another — some participants score high on both measurements, others score low on both. This between-person variability is a source of noise that has nothing to do with the treatment or condition effect.
By computing a difference score $d_i = x_{1i} - x_{2i}$ for each participant, the paired design removes between-person variability from the error term entirely:

$$\sigma_d^2 = \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2$$

When $\rho > 0$ (which is typical when measuring the same people twice), $\sigma_d^2 < \sigma_1^2 + \sigma_2^2$, meaning the paired test has a smaller denominator and greater statistical power than the independent samples t-test for the same data.
1.3 The Sampling Distribution of the Mean Difference
If we repeatedly drew samples of $n$ pairs from a population where the true mean difference is $\mu_d$, the sampling distribution of $\bar{d}$ (the mean of the difference scores) would, by the Central Limit Theorem, be approximately normal:

$$\bar{d} \sim N\!\left(\mu_d, \frac{\sigma_d^2}{n}\right)$$

Because the population standard deviation of differences $\sigma_d$ is unknown, we estimate it with the sample standard deviation $s_d$, giving the estimated standard error of the mean difference:

$$SE_{\bar{d}} = \frac{s_d}{\sqrt{n}}$$
This substitution of $s_d$ for $\sigma_d$ is exactly what introduces the t-distribution rather than the standard normal distribution into the inference.
1.4 The t-Distribution and Degrees of Freedom
The Student's t-distribution arises when estimating a normally distributed population mean from a small sample with unknown variance. It is characterised by its degrees of freedom $df$. For the paired t-test:

$$df = n - 1$$

where $n$ is the number of pairs (not the total number of observations, which would be $2n$). As $df \to \infty$, the t-distribution converges to the standard normal $N(0, 1)$.
The t-distribution has heavier tails than the standard normal, reflecting greater uncertainty from estimating $\sigma_d$ from the data rather than knowing it exactly.
1.5 The Null and Alternative Hypotheses
The paired t-test operates within the Neyman-Pearson hypothesis testing framework:
$$H_0: \mu_d = 0 \quad \text{(the population mean difference is zero)}$$
$$H_1: \mu_d \neq 0 \quad \text{(two-tailed, default)}$$
or directional alternatives:
$$H_1: \mu_d > 0 \quad \text{(upper one-tailed — Condition 1 > Condition 2)}$$
$$H_1: \mu_d < 0 \quad \text{(lower one-tailed — Condition 1 < Condition 2)}$$
The null hypothesis can also be generalised to test against a non-zero value $\mu_0$:
$$H_0: \mu_d = \mu_0$$
which is useful for non-inferiority, superiority, or equivalence testing.
1.6 Statistical Significance vs. Practical Significance
A paired t-test answers: "Is the mean difference statistically distinguishable from zero, given sampling variability?" It does not answer: "Is the difference large enough to matter in practice?"
With a large number of pairs, even a trivially small mean difference can be statistically significant. Always report:
- The t-statistic, degrees of freedom, and p-value (statistical significance).
- An effect size (e.g., Cohen's $d_z$) and its 95% CI (practical significance).
- The 95% CI for the mean difference (in original units).
1.7 Confidence Intervals and Their Relationship to the Test
A 95% confidence interval for $\mu_d$ is directly related to the two-tailed t-test at $\alpha = .05$: the null hypothesis is rejected at $\alpha = .05$ if and only if $\mu_0$ (typically 0) lies outside the 95% CI. The CI provides strictly more information than the p-value because it communicates both the precision and magnitude of the estimated difference in original units.
1.8 Type I and Type II Errors
| Decision | $H_0$ True ($\mu_d = 0$) | $H_0$ False ($\mu_d \neq 0$) |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct — Power ($1 - \beta$) |
| Fail to Reject $H_0$ | Correct ($1 - \alpha$) | Type II error ($\beta$) |
- Type I error: Concluding a difference exists when none does (false positive). Rate controlled by $\alpha$.
- Type II error: Missing a true difference (false negative). Rate $= \beta$; power $= 1 - \beta$.
- Power: Probability of correctly detecting a true effect of a given size.
2. What is a Paired t-Test?
2.1 The Core Idea
The paired t-test (also called: dependent samples t-test, matched pairs t-test, repeated measures t-test, or within-subjects t-test) is a parametric inferential procedure for testing whether the mean of a set of difference scores is significantly different from zero (or another specified value).
Rather than comparing two separate group means directly, the paired t-test:
- Computes a difference score for each pair of observations.
- Reduces the problem to a one-sample t-test on those difference scores.
- Tests whether the mean difference is significantly different from zero.
This reduction is elegant: the paired t-test is mathematically identical to a one-sample t-test applied to the difference scores.
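This equivalence is easy to verify numerically. The sketch below uses hypothetical data and SciPy to show that a paired t-test on the two columns and a one-sample t-test on the difference scores return identical results:

```python
import numpy as np
from scipy import stats

# Hypothetical pre/post scores for 8 participants (illustrative data).
pre  = np.array([24.0, 19.5, 30.1, 22.4, 27.8, 21.0, 25.5, 23.3])
post = np.array([21.2, 18.0, 27.9, 21.5, 24.6, 20.1, 24.0, 20.8])

# Paired t-test on the two columns...
t_paired, p_paired = stats.ttest_rel(pre, post)

# ...is identical to a one-sample t-test on the difference scores.
d = pre - post
t_one, p_one = stats.ttest_1samp(d, popmean=0.0)

assert np.isclose(t_paired, t_one) and np.isclose(p_paired, p_one)
print(t_paired, p_paired)
```

Any software that offers only a one-sample t-test can therefore run a paired analysis: just feed it the column of differences.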
2.2 When to Use a Paired t-Test
A paired t-test is appropriate when:
- The dependent variable is continuous (interval or ratio scale).
- You are comparing exactly two related conditions or two time points.
- Each observation in Condition 1 is meaningfully linked to exactly one observation in Condition 2.
- The difference scores are approximately normally distributed (or $n$ is large enough for the CLT to apply).
2.3 What Makes Observations "Paired"?
Observations are paired when there is a natural, meaningful, one-to-one correspondence between observations in the two conditions:
| Pairing Type | Example |
|---|---|
| Pre-post (same participant) | Depression score before and after therapy |
| Repeated measures (same participant) | Reaction time in noise vs. silence |
| Matched pairs (different participants) | Twins randomised to different conditions |
| Natural pairs | Left hand vs. right hand grip strength |
| Crossover designs | Drug A vs. Drug B, each participant receives both |
| Yoked controls | Each treatment participant matched to a control on age and IQ |
The key criterion is that the pairing must be established before data collection, not post-hoc. The correlation between the paired measurements must be positive (or at least non-negative) for pairing to confer a power advantage.
2.4 The Paired t-Test vs. Related Procedures
| Situation | Appropriate Test |
|---|---|
| Two related conditions, normal differences | Paired t-test |
| Two related conditions, non-normal or ordinal | Wilcoxon Signed-Rank test |
| Two independent groups | Independent samples t-test (Welch's recommended) |
| Three or more related conditions | One-way Repeated Measures ANOVA |
| Two related conditions, Bayesian inference | Bayesian paired t-test (BF) |
| Testing equivalence of two related conditions | TOST equivalence test |
2.5 The Power Advantage of Pairing
The paired t-test is more powerful than the independent samples t-test when:
- The within-pair correlation is positive (which is almost always true for repeated measures on the same participant).
- Between-person variability is large relative to within-person change.
The power gain is quantified by the relationship between the paired and independent standard errors (assuming equal condition variances):

$$SE_{\text{paired}} = SE_{\text{independent}} \times \sqrt{1 - \rho}$$

When $\rho = 0$: $SE_{\text{paired}} = SE_{\text{independent}}$ (no gain).
When $\rho = 0.6$: $SE_{\text{paired}} \approx 0.63 \times SE_{\text{independent}}$ (37% reduction in SE — substantial power gain).
When $\rho < 0$: $SE_{\text{paired}} > SE_{\text{independent}}$ — pairing actually hurts power when the correlation is negative or near zero, because error variance is not reduced and one degree of freedom is lost for pairing.
💡 Pairing is most advantageous when the within-pair correlation is high (e.g., $\rho > .5$). When participants differ greatly from each other but respond consistently to conditions, the paired design dramatically reduces error and increases power.
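Under the equal-variance assumption above, the SE ratio is a one-liner. The sketch below tabulates it for a few correlations (illustrative helper, not a DataStatPro function):

```python
import numpy as np

def se_ratio(rho):
    """SE_paired / SE_independent under equal condition variances: sqrt(1 - rho)."""
    return np.sqrt(1.0 - rho)

for rho in (0.0, 0.3, 0.6, 0.9):
    print(f"rho = {rho:.1f}: SE_paired = {se_ratio(rho):.3f} x SE_indep")
```

At $\rho = 0.9$ the paired SE is less than a third of the independent SE, which is why repeated-measures designs are so efficient.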
3. The Mathematics Behind the Paired t-Test
3.1 The Difference Score Reduction
Let $(x_{1i}, x_{2i})$ denote the pair of observations for participant $i$, where $i = 1, 2, \dots, n$. Define the difference score:

$$d_i = x_{1i} - x_{2i}$$

The sign convention matters: consistently subtracting Condition 2 from Condition 1 means a positive $d_i$ indicates that Condition 1 has higher values.
The mean and standard deviation of the difference scores are:

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i \qquad s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}}$$
3.2 The t-Statistic
The paired t-statistic is:

$$t = \frac{\bar{d} - \mu_0}{s_d / \sqrt{n}}$$

Where:
- $\bar{d}$ = mean of the difference scores.
- $\mu_0$ = null hypothesis value (typically $0$).
- $s_d$ = standard deviation of the difference scores.
- $n$ = number of pairs.
Under $H_0$, this statistic follows a t-distribution with $df = n - 1$ degrees of freedom.
3.3 Standard Error of the Mean Difference
The standard error of the mean difference measures the precision of $\bar{d}$ as an estimate of $\mu_d$:

$$SE_{\bar{d}} = \frac{s_d}{\sqrt{n}}$$

This is the only standard error needed for the paired t-test. Note that it is computed entirely from the difference scores — the original scores $x_{1i}$ and $x_{2i}$ are used only to compute $d_i$.
3.4 The p-value
Two-tailed p-value:
$$p = 2\left[1 - F_t(|t|;\, n-1)\right]$$
One-tailed p-value (upper):
$$p = 1 - F_t(t;\, n-1)$$
One-tailed p-value (lower):
$$p = F_t(t;\, n-1)$$
Where $F_t(\cdot;\, n-1)$ is the CDF of the t-distribution with $n - 1$ degrees of freedom.
3.5 Relationship Between and the Raw Score Statistics
The standard deviation of differences is algebraically related to the standard deviations of the original scores and their correlation:

$$s_d = \sqrt{s_1^2 + s_2^2 - 2 r s_1 s_2}$$

Where:
- $s_1, s_2$ = standard deviations of Condition 1 and Condition 2 scores respectively.
- $r$ = Pearson correlation between the paired measurements.
This relationship has several important implications:
When $r = 0$: $s_d = \sqrt{s_1^2 + s_2^2}$ — the paired test is equivalent to the independent test (no benefit from pairing).
When $r > 0$: $s_d < \sqrt{s_1^2 + s_2^2}$ — pairing reduces error variance and increases power.
When $r < 0$: $s_d > \sqrt{s_1^2 + s_2^2}$ — pairing increases error variance and reduces power. This is rare in practice but can occur with counterbalanced designs where learning effects operate.
3.6 The Mean Difference and Its Relationship to Raw Means
The mean difference score always equals the difference of the condition means:

$$\bar{d} = \bar{x}_1 - \bar{x}_2$$

This means the paired and independent tests produce identical estimates of the mean difference — the only difference is in the standard error used to evaluate that difference.
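Both identities in this section — the $s_d$ decomposition and the equality of $\bar{d}$ with $\bar{x}_1 - \bar{x}_2$ — hold exactly in sample data, as the simulated check below illustrates:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(50, 10, size=200)
x2 = 0.8 * x1 + rng.normal(10, 6, size=200)   # correlated second condition
d = x1 - x2

s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
r = np.corrcoef(x1, x2)[0, 1]

# Identity 1: s_d^2 = s1^2 + s2^2 - 2*r*s1*s2 (exact, not approximate).
sd_direct  = d.std(ddof=1)
sd_formula = np.sqrt(s1**2 + s2**2 - 2 * r * s1 * s2)
assert np.isclose(sd_direct, sd_formula)

# Identity 2: the mean difference equals the difference of means.
assert np.isclose(d.mean(), x1.mean() - x2.mean())
print(sd_direct, sd_formula)
```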
3.7 Computing the t-Statistic from Summary Statistics
If raw data are unavailable, the paired t-statistic can be computed from summary statistics in several ways:
From $\bar{d}$, $s_d$, and $n$:
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
From the correlation and group SDs:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 + s_2^2 - 2 r s_1 s_2}\, / \sqrt{n}}$$
From the t-statistic, recovering the effect size:
$$d_z = \frac{t}{\sqrt{n}}$$
3.8 Non-Central t-Distribution and Exact CIs for Effect Sizes
Under $H_1$ (when a true effect exists), the t-statistic follows a non-central t-distribution with non-centrality parameter:

$$\delta = d_z \sqrt{n}$$

The exact 95% CI for Cohen's $d_z$ inverts this relationship numerically: find non-centrality parameters $\delta_L$ and $\delta_U$ such that

$$P(T \geq t_{\text{obs}} \mid \delta_L) = .025 \qquad \text{and} \qquad P(T \leq t_{\text{obs}} \mid \delta_U) = .025$$

then $d_{z,L} = \delta_L / \sqrt{n}$ and $d_{z,U} = \delta_U / \sqrt{n}$. No closed form exists for these bounds — DataStatPro computes them automatically using numerical iteration on the non-central t-distribution CDF.
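A minimal sketch of that numerical inversion, using SciPy's non-central t CDF and a root-finder (`dz_exact_ci` is a hypothetical helper name, not a DataStatPro API):

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def dz_exact_ci(dz, n, conf=0.95):
    """Exact CI for Cohen's d_z by inverting the non-central t CDF."""
    t_obs = dz * np.sqrt(n)
    df = n - 1
    alpha = 1 - conf
    # Lower bound: ncp where t_obs sits at the upper alpha/2 tail.
    lo = brentq(lambda nc: stats.nct.sf(t_obs, df, nc) - alpha / 2, -50, 50)
    # Upper bound: ncp where t_obs sits at the lower alpha/2 tail.
    hi = brentq(lambda nc: stats.nct.cdf(t_obs, df, nc) - alpha / 2, -50, 50)
    return lo / np.sqrt(n), hi / np.sqrt(n)

print(dz_exact_ci(0.5, 30))
```

For $d_z = 0.5$ with 30 pairs the interval is roughly $[0.1, 0.9]$ — notably wide, foreshadowing the precision discussion in Section 9.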
3.9 Statistical Power of the Paired t-Test
Power is the probability that the paired t-test correctly rejects $H_0$ when a true effect of size $d_z$ exists:

$$\text{Power} = P\!\left(|T| > t_{1-\alpha/2,\, n-1} \mid \delta\right)$$

Where $\delta = d_z\sqrt{n}$ is the non-centrality parameter.
The relationship between power, effect size, and sample size:
| $d_z$ | Power = 0.70 ($n$ pairs) | Power = 0.80 ($n$ pairs) | Power = 0.90 ($n$ pairs) | Power = 0.95 ($n$ pairs) |
|---|---|---|---|---|
| 0.20 | 185 | 264 | 354 | 434 |
| 0.35 | 62 | 88 | 118 | 146 |
| 0.50 | 31 | 44 | 59 | 73 |
| 0.80 | 13 | 18 | 24 | 30 |
| 1.00 | 9 | 13 | 17 | 21 |
| 1.20 | 7 | 9 | 12 | 15 |
| 1.50 | 5 | 7 | 9 | 11 |
All values assume a two-tailed $\alpha = .05$.
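Power values of this kind can be computed directly from the non-central t-distribution; the helper below is an illustrative sketch (`paired_power` is a hypothetical name, not a DataStatPro function):

```python
import numpy as np
from scipy import stats

def paired_power(dz, n, alpha=0.05):
    """Two-tailed power of the paired t-test via the non-central t-distribution."""
    df = n - 1
    ncp = dz * np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # P(|T| > t_crit) under the non-central alternative.
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

print(round(paired_power(0.5, 34), 3))
```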
4. Assumptions of the Paired t-Test
4.1 Normality of Difference Scores
The paired t-test assumes that the difference scores are drawn from a normally distributed population. Note that:
- This is not the same as assuming the individual conditions are normally distributed.
- The differences can be normal even if the raw scores are not, as long as the two conditions' non-normalities cancel out.
- This is a weaker assumption than requiring both raw distributions to be normal.
How to check:
- Shapiro-Wilk test on the difference scores ($H_0$: the differences are normally distributed). A significant result ($p < .05$) suggests departure from normality.
- Q-Q plot of difference scores: points should fall approximately on the diagonal reference line.
- Histogram of difference scores: should be approximately bell-shaped.
- Skewness and kurtosis of the difference scores (values near zero are consistent with normality).
Robustness: The paired t-test is robust to mild violations of normality, especially when:
- $n \geq 30$ pairs (the Central Limit Theorem ensures the sampling distribution of $\bar{d}$ is approximately normal even if the $d_i$ are not).
- The distribution of differences is symmetric, even if not perfectly normal.
- The violation consists of light tails rather than heavy tails or extreme skewness.
When violated: Use the Wilcoxon Signed-Rank test as a non-parametric alternative, or consider data transformations (log, square root) if the differences are right-skewed.
4.2 Independence of Pairs
All pairs must be independent of each other. That is, knowing the difference score for pair $i$ gives no information about the difference score for pair $j$ ($i \neq j$). Within a pair, the two measurements are of course correlated — that is the whole point of the design. It is the independence across pairs that is required.
Common violations:
- Multiple measurements from the same participant treated as separate pairs.
- Family members or social contacts in the same study.
- Clustered data (e.g., pairs sampled from the same school or ward).
- Time series where successive differences are autocorrelated.
How to check: Independence is a property of the study design, not of the data. Inspect the sampling procedure. Check for patterns in residuals over time (Durbin-Watson test) if measurements were sequential.
When violated: For clustered pairs, use multilevel models. For time series, use time-series methods (ARIMA, mixed models with autocorrelation structure).
4.3 Correct Pairing
The pairing must be meaningful and pre-specified. Each observation in Condition 1 must correspond to the correct partner observation in Condition 2. Incorrect or arbitrary pairing does not create a valid paired test — it creates noise.
How to check: Verify the data file structure — each row should represent one pair (one participant or one matched pair), with Condition 1 and Condition 2 values in separate columns.
⚠️ A common data-entry error is accidentally shifting one column so that rows no longer correspond to the same participant across conditions. Always verify that participant IDs match across the two columns before running a paired t-test.
4.4 Interval Scale of Measurement
The dependent variable must be measured on at least an interval scale (equal-spaced intervals between values). Difference scores must be meaningful — they require that the distance between score values is consistent throughout the scale.
When violated: If the DV is ordinal (e.g., a single Likert item, rank data), use the Wilcoxon Signed-Rank test instead.
4.5 Absence of Influential Outliers in Difference Scores
The paired t-test is sensitive to extreme outliers in the difference scores because they distort both $\bar{d}$ and $s_d$.
How to check:
- Boxplot of difference scores: flag values beyond $1.5 \times IQR$ from the quartiles.
- Standardised difference scores: flag $|z_{d_i}| > 3$.
- Grubbs' test for formal outlier detection in the differences.
When outliers are present: Investigate whether the outlier represents a data entry error, measurement error, or a genuine extreme response. Report analyses with and without the outlier(s). Consider using the Wilcoxon Signed-Rank test (which is rank-based and thus robust to outliers in the differences).
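A simple screening helper combining the boxplot fence and the $|z| > 3$ rule might look like the sketch below (illustrative only — flagged values still require substantive investigation, not automatic removal):

```python
import numpy as np

def flag_outliers(d):
    """Flag difference scores beyond the 1.5*IQR fences or with |z| > 3."""
    q1, q3 = np.percentile(d, [25, 75])
    iqr = q3 - q1
    fence = (d < q1 - 1.5 * iqr) | (d > q3 + 1.5 * iqr)
    z = (d - d.mean()) / d.std(ddof=1)
    return fence | (np.abs(z) > 3)

d = np.array([1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 9.0])   # last value is suspect
print(np.where(flag_outliers(d))[0])   # → [6]
```

Note that the extreme value inflates $s_d$ so much that its own z-score stays below 3 — one reason the IQR fence is the more reliable screen in small samples.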
4.6 Assumption Summary Table
| Assumption | Description | How to Check | Remedy if Violated |
|---|---|---|---|
| Normality of differences | $d_i$ are drawn from a normal population | Shapiro-Wilk, Q-Q plot, histogram | Wilcoxon Signed-Rank; transform data |
| Independence of pairs | Pairs are independent of each other | Design review; Durbin-Watson | Multilevel model; time-series methods |
| Correct pairing | Condition 1 and 2 observations are correctly matched | Verify participant IDs in data file | Re-match data; verify recording |
| Interval scale | DV has equal-interval properties | Measurement theory | Wilcoxon Signed-Rank |
| No influential outliers | No extreme values in $d_i$ | Boxplot; $|z_{d_i}| > 3$; Grubbs' test | Investigate; report with and without; Wilcoxon Signed-Rank |
5. Variants of the Paired t-Test
5.1 Overview of Effect Size Variants
Multiple variants of the paired t-test exist primarily because of different choices of effect size standardiser — the denominator of the standardised mean difference. Choosing the wrong variant leads to incomparable effect sizes across studies.
| Variant | t-Statistic | Effect Size | Denominator | Primary Use |
|---|---|---|---|---|
| Standard paired t | $t = \bar{d}/(s_d/\sqrt{n})$ | $d_z$ | SD of differences | Comparing paired designs |
| Average SD standardiser | Same $t$ | $d_{av}$ | Average of group SDs | Comparing to between-subjects |
| Pooled SD standardiser | Same $t$ | Pooled $d$ | Pooled SD (as in between-subjects) | Meta-analysis |
| RM-corrected | Same $t$ | $d_{rm}$ | Adjusted for correlation | Cross-design comparison |
| Pre-test standardiser | Same $t$ | Glass's $\Delta$ | SD of pre-test (Condition 1) | Change from baseline |
5.2 Cohen's $d_z$ — The Standardised Mean Difference of Differences
Cohen's $d_z$ is the most straightforward effect size for the paired t-test. It expresses the mean difference in units of the standard deviation of the difference scores:

$$d_z = \frac{\bar{d}}{s_d}$$

It is directly recoverable from the t-statistic: $d_z = t / \sqrt{n}$.
When to use $d_z$:
- Comparing effect sizes across studies that all use paired designs.
- Within-study power analysis for the same paired design.
- When the research question is about the magnitude of within-person change relative to individual variability in change.
Limitation of $d_z$: It is not directly comparable to Cohen's $d$ from an independent samples design because $s_d$ reflects within-person variability in change, which is typically much smaller than between-person variability. $d_z$ is therefore typically larger than $d$ for the same mean difference.
5.3 Cohen's $d_{av}$ — Average Standard Deviation Standardiser
Cohen's $d_{av}$ (Lakens, 2013) standardises the mean difference by the average of the two condition standard deviations:

$$d_{av} = \frac{\bar{d}}{(s_1 + s_2)/2}$$

When to use $d_{av}$:
- When comparing a paired design effect to a between-subjects design effect using the same measurement scale.
- When the research question concerns the magnitude of mean change relative to the variability of the original measurements.
- Meta-analyses combining within-subjects and between-subjects designs.
5.4 Cohen's $d_{rm}$ — Repeated Measures Corrected
Cohen's $d_{rm}$ (Morris & DeShon, 2002) explicitly accounts for the within-subjects correlation to produce an effect size that is directly comparable to a between-subjects Cohen's $d$:

$$d_{rm} = d_z \sqrt{2(1 - r)}$$

Or equivalently:

$$d_{rm} = \frac{\bar{d}}{s_d / \sqrt{2(1 - r)}}$$
Properties:
- When $r = .5$: $d_{rm} = d_z$.
- When $r > .5$: $d_{rm} < d_z$ (high correlation inflates $d_z$ — the correction deflates it toward a between-subjects comparable value).
- When $r < .5$: $d_{rm} > d_z$.
$d_{rm}$ is the most theoretically appropriate effect size for comparing paired designs to independent samples designs.
5.5 Glass's $\Delta$ for Pre-Post Designs
Glass's $\Delta$ standardises by the pre-test (Condition 1) standard deviation only. This is most appropriate for treatment-control or pre-post designs where the pre-test represents the baseline, unaffected by the treatment:

$$\Delta = \frac{\bar{d}}{s_1}$$
It answers: "How many standard deviations (in the original, pre-intervention metric) does the average participant change?"
When to use $\Delta$: Pre-post designs where the treatment may change the variability of the outcome (e.g., an intervention that reduces both the mean and the variance of depression scores). Standardising by the pre-test SD anchors the effect in the pre-intervention distribution.
5.6 Relationship Between and
and are related through the within-pair correlation :
More directly:
Wait — the exact relationship (Lakens, 2013):
Therefore:
and
This means when (which is almost always the case for repeated measures), explaining why paired designs appear to produce larger effect sizes than between-subjects designs when the same metric is uncritically applied to both.
Numerical example with :
So if , then — nearly 30% larger.
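The variants in this section can be computed side by side from raw paired data; `paired_effect_sizes` below is a hypothetical helper on illustrative data, not part of DataStatPro:

```python
import numpy as np

# Hypothetical pre/post data for 8 participants (illustrative only).
x1 = np.array([10.2, 12.5, 9.8, 11.4, 13.0, 10.9, 12.1, 11.7])
x2 = np.array([9.5, 11.8, 9.9, 10.6, 12.1, 10.2, 11.0, 11.1])

def paired_effect_sizes(x1, x2):
    """Compute d_z, d_av, d_rm, and Glass's Delta from raw paired data."""
    d = x1 - x2
    mean_d, s_d = d.mean(), d.std(ddof=1)
    s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
    r = np.corrcoef(x1, x2)[0, 1]
    return {
        "d_z":   mean_d / s_d,                           # SD of differences
        "d_av":  mean_d / ((s1 + s2) / 2),               # average-SD standardiser
        "d_rm":  (mean_d / s_d) * np.sqrt(2 * (1 - r)),  # RM-corrected
        "glass": mean_d / s1,                            # pre-test SD standardiser
    }

res = paired_effect_sizes(x1, x2)
print(res)
```

With strongly correlated conditions, `d_z` will exceed `d_av` and `d_rm`, illustrating the inflation discussed above.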
6. Using the Paired t-Test Calculator Component
The Paired t-Test Calculator in DataStatPro provides a comprehensive tool for running, diagnosing, visualising, and reporting paired t-tests and their alternatives.
Step-by-Step Guide
Step 1 — Select "Paired Samples t-Test"
From the "Test Type" dropdown, select:
- Paired Samples t-Test — for parametric analysis of normally distributed differences.
- Wilcoxon Signed-Rank Test — the non-parametric alternative (automatically suggested if DataStatPro's normality check flags a violation).
Step 2 — Input Method
Choose how to provide the data:
- Raw data (paired columns): Upload or paste two columns — Condition 1 and Condition 2 — with one row per participant. DataStatPro automatically computes difference scores, runs assumption checks, and generates all statistics.
- Raw data (difference scores): If you already have difference scores, upload a single column and DataStatPro treats it as a one-sample t-test on the differences.
- Summary statistics: Enter $n$, $\bar{d}$, and $s_d$ directly. All test statistics and effect sizes are computed but full assumption checks are unavailable.
- Summary statistics with correlation: Enter $n$, $\bar{x}_1$, $\bar{x}_2$, $s_1$, $s_2$, and $r$ to compute all effect size variants.
- t-statistic and df: Enter $t$ and $df$ to compute p-values and effect sizes from a published result.
💡 When using paired columns, DataStatPro verifies that column lengths are equal, flags any missing data, and alerts you if participant IDs are provided and do not match across columns.
Step 3 — Specify the Null Hypothesis Value
Default: $\mu_0 = 0$ (testing whether the mean difference is zero). To test a non-zero null (e.g., for non-inferiority testing with a pre-specified margin), enter the appropriate value of $\mu_0$.
Step 4 — Select the Alternative Hypothesis
- Two-tailed (default): $H_1: \mu_d \neq \mu_0$ — most appropriate for most research.
- Upper one-tailed: $H_1: \mu_d > \mu_0$ — use only with a strong a priori directional prediction, pre-registered before data collection.
- Lower one-tailed: $H_1: \mu_d < \mu_0$ — use only with a pre-registered directional prediction.
Step 5 — Choose the Significance Level
Select $\alpha$ (default: $.05$). DataStatPro simultaneously shows results for $\alpha = .10$, $.05$, and $.01$ for reference.
Step 6 — Select Effect Size Variants
Choose which effect sizes to compute and report:
- ✅ Cohen's $d_z$ (default) — standardised by $s_d$.
- ✅ Cohen's $d_{av}$ — standardised by the average of the condition SDs.
- ✅ Cohen's $d_{rm}$ — RM-corrected (requires $r$; computed from raw data or entered manually).
- ✅ Glass's $\Delta$ — standardised by the Condition 1 (pre-test) SD.
- ✅ Hedges' $g_z$ — bias-corrected $d_z$.
- ✅ Common Language Effect Size (CL).
Step 7 — Select Display Options
- ✅ t-statistic, df, p-value, and decision.
- ✅ Descriptive statistics: $n$, $\bar{x}_1$, $\bar{x}_2$, $s_1$, $s_2$, $\bar{d}$, $s_d$, $r$.
- ✅ 95% CI for mean difference (in original units).
- ✅ All selected effect sizes with exact 95% CIs.
- ✅ Common Language Effect Size (CL%).
- ✅ Assumption test panel: Shapiro-Wilk on differences, Q-Q plot, histogram of differences, boxplot of differences.
- ✅ Visualisation: overlapping density plots for Conditions 1 and 2; distribution of difference scores with reference line at zero.
- ✅ Individual trajectory plot: each participant's scores connected across conditions.
- ✅ Cohen's $d$ diagram ($U_1$, $U_3$, and overlap statistics).
- ✅ Power curve: power vs. $n$ for the observed effect size.
- ✅ Equivalence test (TOST) output.
- ✅ Bayesian paired t-test (Bayes Factor $BF_{10}$).
- ✅ APA 7th edition-compliant results paragraph (auto-generated).
Step 8 — Run the Analysis
Click "Run Paired t-Test". DataStatPro will:
- Compute difference scores and all descriptive statistics.
- Run Shapiro-Wilk normality test on the differences.
- Compute the t-statistic, df, and p-value.
- Construct exact 95% CIs for the mean difference and all effect sizes.
- Generate all selected visualisations.
- Auto-generate the APA results paragraph.
7. Full Step-by-Step Procedure
7.1 Complete Computational Procedure
This section walks through every computational step for the paired t-test, from raw data to a full APA-style conclusion.
Given: $n$ pairs of observations $(x_{1i}, x_{2i})$ for $i = 1, \dots, n$.
Step 1 — Verify and Arrange the Data
Arrange the data in a table with one row per pair:

| Pair | $x_{1i}$ (Condition 1) | $x_{2i}$ (Condition 2) | $d_i = x_{1i} - x_{2i}$ |
|---|---|---|---|
| 1 | $x_{11}$ | $x_{21}$ | $d_1$ |
| 2 | $x_{12}$ | $x_{22}$ | $d_2$ |
| ⋮ | ⋮ | ⋮ | ⋮ |

Establish the sign convention: a positive $d_i$ means the participant scored higher in Condition 1 than in Condition 2. State this convention explicitly before analysis.
Step 2 — Compute the Mean Difference

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$$

Equivalently: $\bar{d} = \bar{x}_1 - \bar{x}_2$.
Step 3 — Compute the Standard Deviation of Differences

$$s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}}$$

Step 4 — Compute the Standard Error

$$SE_{\bar{d}} = \frac{s_d}{\sqrt{n}}$$
Step 5 — Check the Normality Assumption
Run the Shapiro-Wilk test on the difference scores $d_i$:
- If $p > .05$: normality is not contradicted; proceed with the paired t-test.
- If $p \leq .05$ and $n < 30$: consider the Wilcoxon Signed-Rank test.
- If $p \leq .05$ and $n \geq 30$: proceed with caution (the CLT generally provides protection); inspect the Q-Q plot for severe violations.
Step 6 — Compute the t-Statistic
For the default null ($\mu_0 = 0$):

$$t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{\bar{d}}{SE_{\bar{d}}}$$
Step 7 — Determine Degrees of Freedom

$$df = n - 1$$
Step 8 — Compute the p-value
Using the t-distribution with $n - 1$ df:
Two-tailed:
$$p = 2\left[1 - F_t(|t|;\, n-1)\right]$$
Compare to $\alpha$. Reject $H_0$ if $p \leq \alpha$.
Step 9 — Compute the 95% Confidence Interval for $\mu_d$

$$\bar{d} \pm t_{.975,\, n-1} \times \frac{s_d}{\sqrt{n}}$$

The CI directly answers: "What are plausible values for the true population mean difference, given this sample?"
Step 10 — Compute Effect Sizes
Cohen's $d_z$:
$$d_z = \frac{\bar{d}}{s_d}$$
Hedges' $g_z$ (bias-corrected $d_z$):
$$g_z = J \times d_z$$
Where $J \approx 1 - \dfrac{3}{4\,df - 1}$ is the bias correction factor.
Cohen's $d_{av}$ (requires $s_1$ and $s_2$):
$$d_{av} = \frac{\bar{d}}{(s_1 + s_2)/2}$$
Cohen's $d_{rm}$ (requires $r$):
$$d_{rm} = d_z \sqrt{2(1 - r)}$$
Common Language Effect Size (CL):
$$CL = \Phi(d_z)$$
CL is the probability that a randomly selected participant scores higher in Condition 1 than in Condition 2 (for positive $\bar{d}$).
Step 11 — Compute the 95% CI for Cohen's $d_z$
Exact CI (via the non-central t-distribution — computed by DataStatPro):
Find non-centrality parameters $\delta_L$ and $\delta_U$ such that
$$P(T \geq t_{\text{obs}} \mid \delta_L) = .025 \qquad \text{and} \qquad P(T \leq t_{\text{obs}} \mid \delta_U) = .025$$
then divide each by $\sqrt{n}$.
Approximate CI (adequate for larger samples, roughly $n \geq 50$):
$$d_z \pm 1.96\sqrt{\frac{1}{n} + \frac{d_z^2}{2n}}$$
Step 12 — Interpret and Report
Combine all results into a complete, APA-compliant report:
- Report $t(df) = \text{[value]}$, $p = \text{[value]}$ (or $p < .001$).
- Report $\bar{d}$ and $s_d$ with units.
- Report the 95% CI for the mean difference.
- Report Cohen's $d_z$ (and/or $d_{rm}$) with its 95% CI.
- Classify the effect size using benchmarks.
- State the practical conclusion.
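The computational steps above can be sketched end-to-end in a few lines. The helper below is an illustrative reimplementation on hypothetical data, not DataStatPro's own code:

```python
import numpy as np
from scipy import stats

def paired_t_report(x1, x2, alpha=0.05):
    """Steps 2-10 in one pass: t, df, p, CI for the mean difference, d_z,
    plus a Shapiro-Wilk check on the differences. Illustrative sketch only."""
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    n = len(d)
    mean_d, s_d = d.mean(), d.std(ddof=1)
    se = s_d / np.sqrt(n)                          # Step 4
    t = mean_d / se                                # Step 6 (mu_0 = 0)
    df = n - 1                                     # Step 7
    p = 2 * stats.t.sf(abs(t), df)                 # Step 8 (two-tailed)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    ci = (mean_d - t_crit * se, mean_d + t_crit * se)   # Step 9
    d_z = mean_d / s_d                             # Step 10
    _, p_sw = stats.shapiro(d)                     # Step 5 (normality check)
    return {"t": t, "df": df, "p": p, "ci": ci, "d_z": d_z, "shapiro_p": p_sw}

# Hypothetical data for 8 participants.
pre  = np.array([12.1, 14.3, 11.8, 13.5, 15.0, 12.7, 14.1, 13.2])
post = np.array([11.0, 13.1, 11.5, 12.2, 13.8, 12.0, 13.5, 12.1])
print(paired_t_report(pre, post))
```

The `t` and `p` values agree with `scipy.stats.ttest_rel`, which is a convenient cross-check for any hand-rolled implementation.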
8. Effect Sizes for the Paired t-Test
8.1 Cohen's $d_z$ — Step-by-Step

$$d_z = \frac{\bar{d}}{s_d} = \frac{t}{\sqrt{n}}$$

Interpretation: $d_z = 0.5$ means the mean difference is half a standard deviation of the difference scores. This is not directly comparable to Cohen's $d$ from an independent samples design without knowing $r$.
8.2 Hedges' $g_z$ — Bias Correction
Cohen's $d_z$ is slightly positively biased in small samples — it overestimates the true population effect. Hedges' $g_z$ applies the bias correction:

$$g_z = d_z \left(1 - \frac{3}{4\,df - 1}\right)$$

More precise gamma-function form:

$$g_z = d_z \times \frac{\Gamma(df/2)}{\sqrt{df/2}\; \Gamma\!\left(\frac{df - 1}{2}\right)}$$

The bias is negligible for $df \geq 20$ (less than 5%) but can be substantial for very small samples ($df < 10$):
| $df$ | $J$ (correction factor) | Bias of $d_z$ (%) |
|---|---|---|
| 5 | 0.8406 | 15.9% |
| 10 | 0.9227 | 7.7% |
| 15 | 0.9484 | 5.2% |
| 20 | 0.9613 | 3.9% |
| 30 | 0.9742 | 2.6% |
| 50 | 0.9848 | 1.5% |
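The exact gamma-function form of the correction factor closely reproduces the table values (first column read as degrees of freedom); the helper below is an illustrative sketch:

```python
import math

def hedges_J(df):
    """Exact small-sample correction factor via the gamma-function form."""
    return math.gamma(df / 2) / (math.sqrt(df / 2) * math.gamma((df - 1) / 2))

for df in (5, 10, 20, 50):
    print(df, round(hedges_J(df), 4))
```

The simple approximation $1 - 3/(4\,df - 1)$ agrees with the exact factor to within a few thousandths across this range.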
8.3 Cohen's $d$ Benchmark Classification
Cohen (1988) proposed the following conventions for $|d|$ (applying equally to $d_z$, $d_{av}$, and $d_{rm}$):

| $|d|$ | Verbal Label | $U_3$ (%) | CL (%) | Overlap (%) |
|---|---|---|---|---|
| 0.0 | No effect | 50.0 | 50.0 | 100.0 |
| 0.2 | Small | 57.9 | 55.6 | 92.0 |
| 0.5 | Medium | 69.1 | 63.8 | 80.3 |
| 0.8 | Large | 78.8 | 71.4 | 68.9 |
| 1.2 | Very large | 88.5 | 80.2 | 54.9 |
| 2.0 | Huge | 97.7 | 92.1 | 31.7 |
⚠️ Cohen's benchmarks were intended as rough conventions of last resort — to be used only when no domain-specific information is available. Always contextualise effect sizes within your research domain: in some fields (e.g., clinical psychology) a numerically "small" $d$ may be practically important, while in others even a nominally "large" $d$ may be unremarkable relative to typical findings.
Extended benchmarks (Sawilowsky, 2009):

| $|d|$ | Label |
|---|---|
| 0.01 | Tiny |
| 0.10 | Very small |
| 0.20 | Small |
| 0.50 | Medium |
| 0.80 | Large |
| 1.20 | Very large |
| 2.00 | Huge |
8.4 The Common Language Effect Size
The Common Language Effect Size (McGraw & Wong, 1992) translates $d_z$ into a probability that is intuitive for non-statistical audiences:

$$CL = \Phi(d_z)$$

$CL = .70$ means: "In 70% of pairs, the Condition 1 score exceeds the Condition 2 score."
8.5 Which Effect Size to Report: A Decision Guide
| Research Goal | Recommended Effect Size | Rationale |
|---|---|---|
| Within-study power analysis and paired design comparison | $d_z$ | Direct function of the t-statistic; reflects paired design power |
| Comparing to between-subjects literature | $d_{av}$ or $d_{rm}$ | Standardises by original-scale SD; comparable to independent $d$ |
| Clinical pre-post change evaluation | Raw $\bar{d}$ or Glass's $\Delta$ | Anchored in a clinically meaningful scale |
| Meta-analysis combining paired and independent designs | $d_{rm}$ | Design-adjusted; most comparable across designs |
| Small sample ($n < 20$) | $g_z$ (Hedges') | Reduces the positive bias of $d_z$ |
| Reporting all relevant variants | $d_z$ + $d_{av}$ | Provides a complete picture; specify which is primary |
💡 Always specify which effect size variant was computed. Writing "Cohen's $d$" without specifying whether it is $d_z$, $d_{av}$, or $d_{rm}$ is ambiguous and prevents accurate meta-analytic synthesis.
9. Confidence Intervals
9.1 CI for the Mean Difference (Original Units)
The 95% CI for the population mean difference provides the most practically interpretable interval — it is expressed in the original measurement units and directly answers: "How large might the true effect be?"

$$\bar{d} \pm t_{.975,\, n-1} \times \frac{s_d}{\sqrt{n}}$$
Interpreting the CI:
| CI Property | Interpretation |
|---|---|
| Entirely above zero | Effect is significantly positive ($p < .05$) |
| Entirely below zero | Effect is significantly negative ($p < .05$) |
| Contains zero | Not statistically significant at the $\alpha = .05$ level |
| Narrow CI | Precise estimate; large $n$ |
| Wide CI | Imprecise estimate; small $n$ — interpret the point estimate cautiously |
| Entirely within trivial range | Effect is definitively small (equivalence established) |
9.2 CI for Cohen's $d_z$ (Standardised)
The exact 95% CI for $d_z$ uses the non-central t-distribution (computed automatically by DataStatPro). The approximate CI (adequate for larger samples, roughly $n \geq 50$) is:

$$d_z \pm 1.96\sqrt{\frac{1}{n} + \frac{d_z^2}{2n}} \qquad \text{(approximate, two-tailed } \alpha = .05\text{)}$$

Width of the 95% CI for $d_z$ as a function of $n$ (for a true $d_z = 0.5$):
| $n$ (pairs) | Approx. $SE(d_z)$ | 95% CI Width | Precision |
|---|---|---|---|
| 10 | 0.334 | 1.31 | Very low |
| 20 | 0.232 | 0.91 | Low |
| 30 | 0.188 | 0.74 | Moderate |
| 50 | 0.145 | 0.57 | Moderate-good |
| 100 | 0.102 | 0.40 | Good |
| 200 | 0.072 | 0.28 | High |
| 500 | 0.046 | 0.18 | Very high |
⚠️ With $n = 10$ pairs, the 95% CI for $d_z = 0.5$ spans approximately $[-0.16, 1.16]$ — from "negligible" to "very large." A point estimate of $d_z = 0.5$ from a study of only 10 pairs is essentially uninterpretable without the CI. Always report the CI.
9.3 CI for Other Effect Size Variants
95% CI for $d_{rm}$: Convert using $d_{rm} = d_z\sqrt{2(1 - r)}$ and apply the same conversion to both CI bounds.
95% CI for $d_{av}$: DataStatPro computes this by bootstrapping when raw data are available, or via the delta method for summary statistics.
10. Power Analysis and Sample Size Planning
10.1 A Priori Power Analysis
A priori power analysis determines the required number of pairs $n$ before data collection to achieve desired power $1 - \beta$ at significance level $\alpha$ for a hypothesised effect of size $d_z$.
Required $n$ for a two-tailed test:
The exact calculation uses the non-central t-distribution (numerical). An excellent approximation:

$$n \approx \frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{d_z^2} + \frac{z_{1-\alpha/2}^2}{2}$$

For $\alpha = .05$, two-tailed:
| $d_z$ | Power = 0.80 ($n$) | Power = 0.90 ($n$) | Power = 0.95 ($n$) | Power = 0.99 ($n$) |
|---|---|---|---|---|
| 0.20 | 198 | 265 | 326 | 441 |
| 0.30 | 89 | 119 | 147 | 198 |
| 0.50 | 34 | 45 | 55 | 75 |
| 0.80 | 15 | 19 | 23 | 32 |
| 1.00 | 10 | 13 | 16 | 22 |
| 1.20 | 8 | 10 | 12 | 16 |
| 1.50 | 6 | 8 | 9 | 12 |
All values assume a two-tailed $\alpha = .05$. Add 1–2 pairs to account for rounding and approximation error.
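The normal-approximation formula above is straightforward to implement; `n_pairs_approx` is a hypothetical helper for cross-checking the table:

```python
import math
from scipy import stats

def n_pairs_approx(dz, power=0.80, alpha=0.05):
    """Normal-approximation sample size for a two-tailed paired t-test."""
    za = stats.norm.ppf(1 - alpha / 2)
    zb = stats.norm.ppf(power)
    n = (za + zb) ** 2 / dz ** 2 + za ** 2 / 2
    return math.ceil(n)

print(n_pairs_approx(0.5, 0.80))   # → 34
```

For very small effects the approximation can land one pair above or below the exact non-central t answer, which is why adding a pair or two as a buffer is good practice.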
10.2 Sensitivity Analysis (Post-Hoc Power)
Sensitivity analysis determines the minimum effect size that could have been detected with the study's sample size at a specified power level. It answers: "What was the smallest effect this study was designed to detect?"
For $n$ pairs at $\alpha = .05$ and power $= .80$, the minimum detectable effect is approximately:
$$d_{z,\min} \approx \frac{z_{1-\alpha/2} + z_{1-\beta}}{\sqrt{n}} = \frac{2.80}{\sqrt{n}}$$
A small study can therefore reliably detect only effects near or above Cohen's "large" threshold ($d_{z,\min} \approx 0.8$ at roughly $n = 12$). Smaller effects may exist but would frequently be missed.
⚠️ Post-hoc power computed from the observed effect size (sometimes called "observed power") is circular, redundant with the p-value, and should NOT be reported as a justification for a non-significant result. Sensitivity analysis using the minimum detectable effect is the appropriate post-hoc power tool.
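A sensitivity analysis can be sketched by numerically inverting the same exact power function; `min_detectable_dz` is an illustrative name:

```python
# Illustrative sensitivity analysis: minimum d_z detectable with n pairs
# at two-tailed alpha and the stated power.
from scipy.stats import nct, t as t_dist
from scipy.optimize import brentq

def min_detectable_dz(n, alpha=0.05, power=0.80):
    df = n - 1
    t_crit = t_dist.ppf(1 - alpha / 2, df)

    def power_at(dz):
        ncp = dz * n ** 0.5
        return nct.sf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)

    # Find the d_z at which achieved power equals the target power
    return brentq(lambda dz: power_at(dz) - power, 1e-6, 5.0)

mde = min_detectable_dz(34)   # inverse of the a-priori calculation: close to 0.5
```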
10.3 Planning Based on $d_{av}$ Instead of $d_z$
When planning based on an expected $d_{av}$ (e.g., from a published between-subjects study), first convert to $d_z$ using the anticipated within-pair correlation $r$:
$$d_z = \frac{d_{av}}{\sqrt{2(1-r)}}$$
Then apply the standard formula. If $r$ is unknown, use a conservative estimate of $r = .5$: with $r = .5$, $\sqrt{2(1-r)} = 1$, so $d_z = d_{av}$ and the sample size formula is the same.
10.4 The Effect of Pre-Post Correlation on Required Sample Size
The required sample size for a paired design decreases as $r$ increases — reflecting the power advantage of pairing. Compared to an independent-samples design with the same $d_{av}$:
| $r$ | $\sqrt{2(1-r)}$ | $d_z / d_{av}$ | Relative efficiency |
|---|---|---|---|
| 0.00 | 1.414 | 0.707 | 0.500 |
| 0.20 | 1.265 | 0.791 | 0.625 |
| 0.50 | 1.000 | 1.000 | 1.000 |
| 0.70 | 0.775 | 1.291 | 1.667 |
| 0.80 | 0.632 | 1.581 | 2.500 |
| 0.90 | 0.447 | 2.236 | 5.000 |
The relative efficiency column ($\approx 1/[2(1-r)]$, normalised to 1 at $r = .5$) is the ratio of total observations needed (the paired design collects $2n$ observations from $n$ pairs; the independent design needs $2m$ observations for the same power on $d_{av}$).
💡 For $r = .8$, the paired design requires only 40% as many participants as the independent design to achieve the same power. When within-pair correlations are high, pairing provides a dramatic efficiency gain.
11. Advanced Topics
11.1 The Paired t-Test as a One-Sample t-Test
The paired t-test is mathematically identical to a one-sample t-test applied to the difference scores. This has several practical implications:
- Software implementation: Many software packages implement the paired t-test by computing difference scores and running a one-sample test.
- Missing data: If some participants have data for only one condition, those pairs cannot contribute difference scores and are excluded entirely from the analysis.
- Non-zero null: Testing $H_0: \mu_d = \mu_0$ for a non-zero $\mu_0$ (e.g., "does the mean improvement exceed a clinically significant threshold of 5 points?") is as straightforward as testing $H_0: \mu_d = 0$.
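The identity is easy to verify with SciPy; the data below are made up purely for illustration:

```python
# The paired t-test equals a one-sample t-test on the difference scores.
from scipy import stats

pre  = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0]   # hypothetical scores
post = [10.0, 13.0, 10.0, 11.0, 12.0, 13.0]
d = [a - b for a, b in zip(pre, post)]         # difference scores

t_paired = stats.ttest_rel(pre, post)           # paired t-test
t_onesample = stats.ttest_1samp(d, popmean=0)   # one-sample t on differences
# The two t-statistics (and p-values) are identical.

# Non-zero null: does the mean change differ from a threshold of 1.5 points?
t_shifted = stats.ttest_1samp(d, popmean=1.5)
```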
11.2 The Relationship Between Paired and Independent t-Tests
For the same dataset, the paired and independent t-statistics are related through the within-pair correlation $r$. The ratio of the t-statistics (for large $n$):
$$\frac{t_{\text{paired}}}{t_{\text{independent}}} \approx \sqrt{\frac{s_1^2 + s_2^2}{s_1^2 + s_2^2 - 2 r s_1 s_2}}$$
For equal SDs ($s_1 = s_2$):
$$\frac{t_{\text{paired}}}{t_{\text{independent}}} \approx \frac{1}{\sqrt{1-r}}$$
When $r = .75$: $t_{\text{paired}} \approx 2\, t_{\text{independent}}$ — the paired t-statistic is twice as large, corresponding to vastly higher power.
Also note the degrees of freedom differ: $n - 1$ (paired) vs. $2n - 2$ (independent). The paired test loses degrees of freedom by pairing, but gains far more through the reduced error term when $r$ is high.
11.3 Equivalence Testing with TOST
Standard paired t-testing can reject $H_0: \mu_d = 0$ but cannot establish that the mean difference is negligibly small. The Two One-Sided Tests (TOST) procedure tests whether the mean difference falls within a pre-specified equivalence interval $[-\Delta, +\Delta]$:
$H_{01}: \mu_d \leq -\Delta$ (the difference is meaningfully negative); $H_{02}: \mu_d \geq +\Delta$ (the difference is meaningfully positive)
Equivalence is concluded when both one-sided tests reject their respective nulls at level $\alpha$ — equivalently, when the 90% CI (for $\alpha = .05$) for $\mu_d$ falls entirely within $[-\Delta, +\Delta]$.
The TOST t-statistics:
$$t_L = \frac{\bar{d} + \Delta}{SE_{\bar{d}}}, \qquad t_U = \frac{\bar{d} - \Delta}{SE_{\bar{d}}}$$
$t_L$ must exceed $+t_{1-\alpha,\,n-1}$ and $t_U$ must fall below $-t_{1-\alpha,\,n-1}$ (each one-tailed) for equivalence to be declared.
Choosing equivalence bounds: A common default based on Cohen's benchmarks is to set bounds corresponding to a "small" effect, $d_z = \pm 0.2$ (i.e., $\Delta = 0.2\, s_d$). In practice, bounds should be domain-specific and set before data collection.
💡 TOST for paired t-tests is critical for crossover drug bioequivalence studies (where "no difference" between formulations must be positively demonstrated), for measurement instrument validation (demonstrating that a new instrument agrees with a gold standard), and for null results that claim two conditions are equivalent.
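The procedure can be sketched in a few lines; `paired_tost` is an illustrative helper, and the usage line reuses the Drug A − Drug B differences from Worked Example 2 later in this tutorial:

```python
# Illustrative paired TOST (two one-sided tests) for equivalence.
import math
from scipy import stats

def paired_tost(d, delta, alpha=0.05):
    """Equivalence of paired differences within [-delta, +delta]."""
    n = len(d)
    mean_d = sum(d) / n
    sd_d = math.sqrt(sum((x - mean_d) ** 2 for x in d) / (n - 1))
    se = sd_d / math.sqrt(n)
    t_lower = (mean_d + delta) / se        # tests H01: mu_d <= -delta
    t_upper = (mean_d - delta) / se        # tests H02: mu_d >= +delta
    p_lower = stats.t.sf(t_lower, n - 1)   # one-sided p-values
    p_upper = stats.t.cdf(t_upper, n - 1)
    # Equivalent iff BOTH one-sided tests reject at alpha
    return max(p_lower, p_upper) < alpha, t_lower, t_upper

d = [7, 7, -8, 9, 7, -6, 5, -7, 6, 6, -5, 4]   # Example 2 differences
eq, tl, tu = paired_tost(d, delta=10)
```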
11.4 Bayesian Paired t-Test
The Bayesian paired t-test (Rouder et al., 2009) quantifies evidence for and against the null hypothesis using the Bayes Factor $BF_{10}$.
Under the default JZS prior (Jeffreys–Zellner–Siow), the prior on the standardised effect $\delta$ under $H_1$ is a Cauchy distribution with scale $r = \sqrt{2}/2 \approx 0.707$.
Interpreting Bayes Factors:
| $BF_{10}$ | Evidence |
|---|---|
| $> 100$ | Extreme for $H_1$ |
| $30 - 100$ | Very strong for $H_1$ |
| $10 - 30$ | Strong for $H_1$ |
| $3 - 10$ | Moderate for $H_1$ |
| $1 - 3$ | Anecdotal for $H_1$ |
| $= 1$ | No evidence (equal support) |
| $1/3 - 1$ | Anecdotal for $H_0$ |
| $1/10 - 1/3$ | Moderate for $H_0$ |
| $< 1/10$ | Strong or stronger for $H_0$ |
Advantages of Bayesian paired t-test:
- Quantifies evidence for (null results can be informative, not just "inconclusive").
- Valid for sequential testing (no correction needed for looking at data multiple times).
- Provides a posterior distribution for the effect size $\delta$.
- Avoids the all-or-nothing dichotomy of significance testing.
Reporting: "A Bayesian paired t-test with the default Cauchy prior ($r = 0.707$) provided [strong / moderate / anecdotal / no] evidence for the alternative hypothesis, $BF_{10} =$ [value]."
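The JZS Bayes factor reduces to a one-dimensional numerical integral (Rouder et al., 2009). The sketch below is illustrative, not DataStatPro's internal implementation; it uses the standard representation of the Cauchy prior as a scale mixture of normals:

```python
# Illustrative JZS Bayes factor for a paired/one-sample t-test.
import math
from scipy.integrate import quad

def jzs_bf10(t, n, r=math.sqrt(2) / 2):
    """BF10 with a Cauchy(0, r) prior on the standardised effect delta."""
    v = n - 1                                     # degrees of freedom
    # Marginal likelihood under H0 (up to a constant shared with H1)
    m0 = (1 + t ** 2 / v) ** (-(v + 1) / 2)

    # Under H1: delta | g ~ N(0, g) with g ~ InverseGamma(1/2, r^2/2),
    # which makes the implied marginal prior on delta a Cauchy(0, r).
    def integrand(g):
        if g < 1e-12:                             # guard against 0**-1.5 overflow
            return 0.0
        prior = (r / math.sqrt(2 * math.pi)) * g ** -1.5 * math.exp(-r ** 2 / (2 * g))
        like = (1 + n * g) ** -0.5 * (1 + t ** 2 / ((1 + n * g) * v)) ** (-(v + 1) / 2)
        return like * prior

    m1, _ = quad(integrand, 0, math.inf)
    return m1 / m0

bf_null = jzs_bf10(0.0, 20)   # t = 0: data favour H0, so BF10 < 1
bf_big = jzs_bf10(5.0, 20)    # large t: strong evidence for H1, BF10 >> 1
```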
11.5 Robust Alternatives: Trimmed Mean Paired t-Test
Yuen's paired trimmed-mean t-test (Yuen, 1974) uses $\gamma$-trimmed means of the difference scores as the measure of central tendency. With 20% trimming ($\gamma = 0.2$):
$h = n - 2\lfloor \gamma n \rfloor$ (effective sample size after trimming)
$\bar{d}_t$ = 20%-trimmed mean of the $d_i$
$s_w^2$ = Winsorised variance of the $d_i$
$$t_y = \frac{\bar{d}_t - \mu_0}{s_w \big/ \left[(1 - 2\gamma)\sqrt{n}\right]}, \qquad df = h - 1,$$
compared to the critical value $t_{1-\alpha/2,\,h-1}$.
The trimmed-mean paired t-test is substantially more powerful than the Wilcoxon signed-rank test for symmetric heavy-tailed distributions, while maintaining good Type I error control under non-normality.
11.6 Handling Missing Data in Paired Designs
In the paired t-test, both observations must be present for a pair to contribute to the analysis. Options for handling missing data:
| Approach | Description | When Appropriate |
|---|---|---|
| Complete case analysis | Use only pairs with both observations | MCAR assumption; small proportion missing |
| Multiple imputation | Impute missing values using predictive models | MAR assumption; principled approach |
| Maximum likelihood (FIML) | Use all available data via full-information maximum likelihood | MAR assumption; preferred for repeated measures |
| Last observation carried forward (LOCF) | Replace missing post-value with last observation | Clinical trials; conservative assumption |
⚠️ Listwise deletion (complete case analysis) is the default in most software but can introduce bias when data are not Missing Completely At Random (MCAR). For more than 5% missing data, multiple imputation or maximum likelihood estimation are strongly preferred.
11.7 Multi-Level Extensions of the Paired Design
The paired t-test assumes that pairs are sampled from a common population. When pairs themselves are nested within clusters (e.g., twin pairs from the same family, or pre-post measurements from patients in the same hospital), standard paired t-tests underestimate standard errors and produce inflated Type I error rates.
The appropriate extension is a two-level mixed model:
$$d_{ij} = \mu_d + u_j + \varepsilon_{ij}$$
Where $u_j$ is the cluster-level random effect and $\varepsilon_{ij}$ is the residual within clusters. The Intraclass Correlation Coefficient ICC $= \sigma_u^2 / (\sigma_u^2 + \sigma_\varepsilon^2)$ quantifies the degree of clustering.
11.8 Reporting the Paired t-Test According to APA 7th Edition
Minimum reporting requirements (APA 7th ed.):
- Mean and SD for each condition: $M_1$, $SD_1$, $M_2$, $SD_2$.
- Mean and SD of difference scores: $M_d$, $SD_d$.
- t-statistic with df: $t(n-1) =$ [value].
- Exact p-value (or $p < .001$).
- Effect size with 95% CI: $d_z =$ [value] [95% CI: LB, UB].
- 95% CI for mean difference in original units.
- Specification of which effect size variant was reported.
- Whether the normality assumption was checked and met.
12. Worked Examples
Example 1: Pre-Post Mindfulness Intervention — PHQ-9 Depression Scores
A clinical psychologist evaluates whether an 8-week Mindfulness-Based Cognitive Therapy (MBCT) programme significantly reduces depression symptoms. PHQ-9 scores (0–27; higher = more depression) are recorded for $n = 15$ participants immediately before and after the programme.
Raw data:
| Participant | Pre-MBCT () | Post-MBCT () | |
|---|---|---|---|
| 1 | 18 | 11 | 7 |
| 2 | 22 | 14 | 8 |
| 3 | 15 | 10 | 5 |
| 4 | 20 | 16 | 4 |
| 5 | 25 | 17 | 8 |
| 6 | 13 | 9 | 4 |
| 7 | 19 | 12 | 7 |
| 8 | 17 | 14 | 3 |
| 9 | 21 | 13 | 8 |
| 10 | 16 | 11 | 5 |
| 11 | 24 | 16 | 8 |
| 12 | 14 | 10 | 4 |
| 13 | 20 | 15 | 5 |
| 14 | 18 | 12 | 6 |
| 15 | 23 | 15 | 8 |
Step 1 — Normality check on differences:
Differences: 7, 8, 5, 4, 8, 4, 7, 3, 8, 5, 8, 4, 5, 6, 8
Shapiro-Wilk on the differences was non-significant ($p > .05$) — normality not violated; proceed with the paired t-test.
Step 2 — Descriptive statistics:
Condition means and SDs:
Pre-MBCT: $M_1 = 285/15 = 19.00$, $SD_1 = 3.63$
Post-MBCT: $M_2 = 195/15 = 13.00$, $SD_2 = 2.51$
Mean difference: $\bar{d} = 90/15 = 6.00$
Within-pair correlation: $r = \dfrac{s_1^2 + s_2^2 - s_d^2}{2 s_1 s_2}$ (high, as expected for pre-post data)
Step 3 — Standard error: $SE_{\bar{d}} = s_d / \sqrt{n} = s_d / \sqrt{15}$
Step 4 — t-statistic: $t = \bar{d} / SE_{\bar{d}}$
Step 5 — Degrees of freedom and p-value: $df = n - 1 = 14$; two-tailed $p < .001$
Step 6 — 95% CI for mean difference: $\bar{d} \pm t_{.975,14} \times SE_{\bar{d}}$, with $t_{.975,14} = 2.145$, giving [4.89, 7.11]
Step 7 — Effect sizes:
Cohen's $d_z = \bar{d} / s_d$
Hedges' $g_z = d_z \times \left(1 - \dfrac{3}{4(n-1) - 1}\right)$
Cohen's $d_{av} = \bar{d} \big/ \dfrac{s_1 + s_2}{2}$
Cohen's $d_{rm} = d_z \sqrt{2(1-r)}$
Common Language Effect Size: $CL = \Phi(d_z / \sqrt{2}) = .983$
Step 8 — 95% CI for $d_z$ (approximate): $d_z \pm 1.96 \sqrt{\dfrac{1}{n} + \dfrac{d_z^2}{2n}} = [1.78, 4.22]$
Summary table:
| Statistic | Value | Interpretation |
|---|---|---|
| Pre-MBCT mean | 19.00 PHQ-9 pts | Moderate-severe depression |
| Post-MBCT mean | 13.00 PHQ-9 pts | Mild depression |
| Mean difference ($\bar{d}$) | 6.00 pts | 6-point reduction |
| $s_d$ (pts) | | Low variability in change |
| $r$ (pre-post) | | High pre-post correlation |
| $t(14)$, two-tailed $p$ | $p < .001$ | Highly significant |
| 95% CI for $\mu_d$ | [4.89, 7.11] | Excludes 0 |
| Cohen's $d_z$ | | Huge effect |
| Hedges' $g_z$ | | Huge (bias-corrected) |
| Cohen's $d_{av}$ | | Huge |
| Cohen's $d_{rm}$ | | Large-very large |
| 95% CI for $d_z$ | [1.78, 4.22] | Excludes 0 |
| $CL$ | .983 | 98.3% improved beyond chance |
APA write-up: "A paired samples t-test was conducted to evaluate whether PHQ-9 depression scores changed from pre- to post-MBCT. Difference scores were normally distributed as assessed by Shapiro-Wilk ($p > .05$). MBCT produced a statistically significant reduction in depression scores from pre-test ($M = 19.00$) to post-test ($M = 13.00$), $t(14)$, $p < .001$, $d_z$ [95% CI: 1.78, 4.22]. The mean reduction of 6.00 PHQ-9 points [95% CI: 4.89, 7.11] represents a clinically large and statistically robust treatment effect. 98.3% of participants showed greater improvement than would be expected by chance."
Example 2: Crossover Drug Trial — Pain Reduction
A pharmacologist conducts a double-blind crossover study comparing Drug A vs. Drug B on pain ratings (0–100 VAS; lower = less pain) in $n = 12$ participants with chronic back pain. Each participant receives both drugs in randomised order with a 2-week washout between. Difference $d_i$ = Drug A − Drug B (positive = Drug A produces more pain).
Raw data:
| Participant | Drug A () | Drug B () | |
|---|---|---|---|
| 1 | 45 | 38 | 7 |
| 2 | 62 | 55 | 7 |
| 3 | 33 | 41 | −8 |
| 4 | 58 | 49 | 9 |
| 5 | 70 | 63 | 7 |
| 6 | 41 | 47 | −6 |
| 7 | 55 | 50 | 5 |
| 8 | 48 | 55 | −7 |
| 9 | 64 | 58 | 6 |
| 10 | 52 | 46 | 6 |
| 11 | 39 | 44 | −5 |
| 12 | 60 | 56 | 4 |
Step 1 — Normality check:
Differences: 7, 7, −8, 9, 7, −6, 5, −7, 6, 6, −5, 4
Shapiro-Wilk on the differences was non-significant ($p > .05$) — normality holds.
Step 2 — Descriptive statistics for differences: $\bar{d} = 25/12 = 2.08$, $s_d = 6.49$
Step 3 — Standard error: $SE_{\bar{d}} = 6.49 / \sqrt{12} = 1.87$
Step 4 — t-statistic: $t = 2.08 / 1.87 = 1.11$
Step 5 — Degrees of freedom and p-value: $df = 11$; two-tailed $p = .29$
Step 6 — 95% CI: $2.08 \pm 2.201 \times 1.87 = [-2.04, 6.20]$
Step 7 — Effect sizes: Cohen's $d_z = 2.08 / 6.49 = 0.32$; Hedges' $g_z = 0.30$
95% CI for $d_z$ (approximate): $0.32 \pm 1.96\sqrt{\dfrac{1}{12} + \dfrac{0.32^2}{24}} = [-0.26, 0.90]$
The CI spans from negative (Drug B better) to positive (Drug A better), confirming non-significance.
Step 8 — Equivalence test (TOST)
The pharmacologist wishes to establish whether the drugs are equivalent within $\Delta = 10$ VAS points ($\alpha = .05$).
$t_L = (2.08 + 10)/1.87 = 6.45$ and $t_U = (2.08 - 10)/1.87 = -4.23$; both exceed the one-tailed critical value $t_{.95,11} = 1.796$ in the required direction.
The 90% CI for $\mu_d$: $2.08 \pm 1.796 \times 1.87 = [-1.28, 5.45]$
Since the 90% CI falls entirely within $[-10, +10]$, equivalence is established at $\alpha = .05$.
Summary:
| Statistic | Value | Interpretation |
|---|---|---|
| Mean diff (A − B) | 2.08 pts | Drug A produces slightly higher pain |
| $s_d$ | 6.49 pts | High variability in within-person differences |
| $t(11)$, two-tailed $p$ | $t = 1.11$, $p = .29$ | Not significant |
| 95% CI for $\mu_d$ | [−2.04, 6.20] | Includes 0 |
| Cohen's $d_z$ | 0.32 | Small |
| Hedges' $g_z$ | 0.30 | |
| 95% CI for $d_z$ | [−0.26, 0.90] | Includes 0 |
| TOST result | Equivalent | 90% CI within ±10 pts |
APA write-up: "A paired samples t-test examined whether Drug A and Drug B differed in pain relief in a crossover design ($n = 12$). Difference scores were normally distributed ($p > .05$). The mean pain rating was not significantly different for Drug A ($M = 52.25$, $SD = 11.25$) vs. Drug B ($M = 50.17$, $SD = 7.42$), $t(11) = 1.11$, $p = .29$, $d_z = 0.32$ [95% CI: −0.26, 0.90]. The 95% CI for the mean difference was [−2.04, 6.20] VAS points. A TOST equivalence test with bounds of ±10 VAS points demonstrated that the drugs are equivalent in pain relief, with the 90% CI [−1.28, 5.45] falling entirely within the equivalence interval."
Example 3: Reaction Time — Noise vs. Silence Condition
A cognitive psychologist tests whether background noise affects simple reaction time (ms) in university students. Each participant completes both a silent and a noise condition (order counterbalanced); $d_i = RT_{\text{noise},i} - RT_{\text{silence},i}$ (positive = noise increases RT).
Summary statistics (raw data not shown):
Step 1 — Compute $s_d$ from the summary statistics: $s_d = \sqrt{s_1^2 + s_2^2 - 2 r s_1 s_2}$
Step 2 — Mean difference: $\bar{d} = M_{\text{noise}} - M_{\text{silence}} = 13.7$ ms
Step 3 — Standard error: $SE_{\bar{d}} = s_d / \sqrt{n}$
Step 4 — t-statistic: $t = \bar{d} / SE_{\bar{d}}$
Step 5 — df and p-value: $df = n - 1$; two-tailed $p < .05$
Step 6 — 95% CI: $\bar{d} \pm t_{.975,\,n-1} \times SE_{\bar{d}} = [3.77, 23.63]$ ms
Step 7 — Effect sizes: $d_z = \bar{d} / s_d$ [95% CI: 0.14, 0.99]; $d_{av}$ and $d_{rm}$ follow from $r$
Contrast: What if this had been run as an (incorrect) independent-samples test?
The incorrect independent t-test fails to reach significance, while the paired test clearly identifies the effect. This illustrates the dramatic power advantage of the paired design when $r$ is high.
Summary:
| Statistic | Value |
|---|---|
| Mean RT: Noise | (ms) |
| Mean RT: Silence | (ms) |
| Mean difference | 13.7 ms (noise slower) |
| $s_d$ | (ms) |
| $r$ | (high pairing efficiency) |
| $t$ (paired) | |
| $p$ (paired, two-tailed) | |
| $t$ (if independent, incorrect) | |
| $p$ (if independent, incorrect) | |
| 95% CI for $\mu_d$ | [3.77, 23.63] ms |
| Cohen's $d_z$ | (Medium) |
| Cohen's $d_{av}$ | (Small-Medium) |
| Cohen's $d_{rm}$ | (Small) |
| $CL$ | |
APA write-up: "A paired samples t-test examined whether background noise affected reaction time. Participants were significantly slower in the noise condition than in the silence condition, $d_z$ [95% CI: 0.14, 0.99]. The mean slowing of 13.7 ms [95% CI: 3.77, 23.63 ms] represents a medium within-subjects effect. The high pre-post correlation confirms the efficiency of the paired design."
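When only summary statistics are available, the whole test can be reconstructed from $M_1$, $M_2$, $s_1$, $s_2$, $r$, and $n$, as in Step 1 above. A sketch with hypothetical RT values (not the actual numbers of this example):

```python
# Illustrative paired t-test from summary statistics alone,
# using s_d = sqrt(s1^2 + s2^2 - 2*r*s1*s2).
import math
from scipy import stats

def paired_t_from_summary(m1, m2, s1, s2, r, n):
    mean_d = m1 - m2
    s_d = math.sqrt(s1 ** 2 + s2 ** 2 - 2 * r * s1 * s2)   # SD of differences
    se = s_d / math.sqrt(n)
    t = mean_d / se
    p = 2 * stats.t.sf(abs(t), n - 1)                      # two-tailed p
    dz = mean_d / s_d                                      # Cohen's d_z
    return t, p, dz

# Hypothetical summary values for a noise-vs-silence RT study:
t, p, dz = paired_t_from_summary(m1=450.0, m2=436.0, s1=55.0, s2=52.0, r=0.9, n=25)
```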
Example 4: Small Sample with Non-Significant Result — Teaching Method
An education researcher compares mathematics test scores ($n = 10$ students) before and after a new tutoring method. Scores range 0–100.
Data:
| Student | Before () | After () | |
|---|---|---|---|
| 1 | 62 | 67 | −5 |
| 2 | 78 | 82 | −4 |
| 3 | 55 | 59 | −4 |
| 4 | 71 | 78 | −7 |
| 5 | 83 | 84 | −1 |
| 6 | 67 | 73 | −6 |
| 7 | 59 | 61 | −2 |
| 8 | 74 | 76 | −2 |
| 9 | 88 | 91 | −3 |
| 10 | 61 | 66 | −5 |
Note: $d_i = \text{Before}_i - \text{After}_i$, so negative $d_i$ indicates improvement.
$\bar{d} = -39/10 = -3.90$; $s_d = 1.91$; $SE_{\bar{d}} = 1.91/\sqrt{10} = 0.60$; $t(9) = -3.90/0.60 = -6.45$, two-tailed $p < .001$
95% CI: $-3.90 \pm 2.262 \times 0.60 = [-5.27, -2.53]$
$d_z = \lvert\bar{d}\rvert / s_d = 3.90/1.91 = 2.04$ (Huge); Hedges' $g_z = 1.87$
Despite the small sample, the effect is large and the result is highly significant because individual differences in change scores are small relative to the mean improvement.
Note on interpreting CIs with small samples:
The 95% CI for $d_z$, [0.91, 3.17], is wide but entirely positive — the effect is definitively large even at its lower bound.
APA write-up: "A paired samples t-test showed that student mathematics scores improved significantly after tutoring, from before ($M = 69.80$, $SD = 10.92$) to after ($M = 73.70$, $SD = 10.44$), $t(9) = -6.45$, $p < .001$, $d_z = 2.04$ [95% CI: 0.91, 3.17]. The mean improvement of 3.9 points [95% CI: 2.53, 5.27] represents a very large within-subjects effect, indicating the tutoring method produced consistent, substantial gains across students."
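The hand calculation can be checked against SciPy using the raw scores from the data table:

```python
# Reproducing Example 4 with scipy as a check on the hand calculation.
from scipy import stats

before = [62, 78, 55, 71, 83, 67, 59, 74, 88, 61]
after  = [67, 82, 59, 78, 84, 73, 61, 76, 91, 66]

# ttest_rel computes d_i = before - after (negative = improvement)
res = stats.ttest_rel(before, after)
# t(9) = -6.45, p < .001
```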
13. Common Mistakes and How to Avoid Them
Mistake 1: Using the Independent Samples t-Test for Paired Data
Problem: Treating pre-post measurements (or matched pairs) as independent groups and running an independent samples t-test. This ignores the within-pair correlation , inflates the error term with between-person variability, severely reduces power, and may produce a non-significant result for a large, real effect.
How serious: In Example 3 above, the paired test correctly detected the effect, while the incorrect independent test did not reach significance. When $r$ is high, the independent test has less than 50% of the power of the paired test.
Solution: Before running any test, determine the study design: if each participant contributes two scores, the paired t-test is required. Check the data file structure — paired data should have one row per participant (or matched pair), not one row per observation.
Mistake 2: Reporting Without Acknowledging It Is Not Comparable to Between-Subjects
Problem: Reporting $d_z$ from a paired design and implying it is comparable to Cohen's $d$ from a between-subjects study. Because $s_d < s$ when $r > .5$, $d_z$ will be systematically larger than $d_{av}$ or $d_s$ from an independent design for the same mean difference.
How serious: For $r = .875$, $d_z$ is $2 \times d_{av}$ — reporting $d_z = 1.0$ when the comparable between-subjects value would be $0.5$ could grossly inflate perceived effect sizes in a research domain.
Solution: Always specify the effect size variant. Report both $d_z$ and $d_{av}$ (or $d_{rm}$). When comparing to between-subjects studies, use $d_{av}$ or $d_{rm}$.
Mistake 3: Not Checking Normality of the Difference Scores
Problem: Applying the paired t-test without checking whether the difference scores are approximately normally distributed. This is especially risky with small samples ($n < 30$), where the CLT does not yet provide adequate protection, and the t-test's p-values may be inaccurate under skewed or heavy-tailed difference distributions.
Solution: Always run the Shapiro-Wilk test on the difference scores (not on the raw scores) and inspect the Q-Q plot of differences. If normality is violated and $n < 30$, use the Wilcoxon Signed-Rank test.
Mistake 4: Running Separate t-Tests on Each Condition Instead of a Paired Test
Problem: Testing whether Condition 1 mean differs from zero, then testing whether Condition 2 mean differs from zero, and comparing the significance of the two tests. This approach is fundamentally flawed — a condition can be significantly different from zero in both tests but not significantly different from each other, or vice versa.
Solution: The appropriate question is whether the mean difference between conditions is significant. Use the paired t-test, which directly tests $H_0: \mu_d = 0$.
Mistake 5: Failing to Report the 95% CI for the Mean Difference in Original Units
Problem: Reporting only $t$, $p$, and $d_z$ without reporting the 95% CI for $\mu_d$ in the original measurement units. The CI in original units is the most practically interpretable result — it tells readers how large the difference is in terms they can evaluate against a minimum important clinical or practical difference.
Solution: Always report the 95% CI for $\mu_d$ in original units, alongside the CI for the effect size $d_z$. For clinical or applied research, also discuss whether the CI for the mean difference exceeds the minimum clinically important difference (MCID).
Mistake 6: Treating a Non-Significant Result as Evidence of No Change
Problem: Reporting $p > .05$ and concluding "the intervention had no effect." A non-significant result only means the data are insufficient to reject $H_0$ under the test's sensitivity — it does NOT establish that the true effect is zero. With small $n$, even large effects fail to reach significance.
Solution: Report the 95% CI for the mean difference. If the CI is wide and includes both clinically trivial and clinically meaningful differences, explicitly acknowledge the study's limited power rather than claiming no effect. Use equivalence testing (TOST) with pre-specified bounds if the research goal is to demonstrate absence of a meaningful effect.
Mistake 7: Applying a One-Tailed Test After Observing the Data Direction
Problem: Observing that $\bar{d}$ is positive, then switching to an upper one-tailed test to achieve $p < .05$ when the two-tailed result was non-significant. This is p-hacking and doubles the effective Type I error rate.
Solution: Directional hypotheses must be pre-registered before data collection. Document the hypothesis direction in a pre-registration (e.g., on the OSF) before seeing any data. In the absence of a pre-registered directional prediction, use two-tailed tests.
Mistake 8: Using the Same Participants Twice Without Pairing
Problem: Collecting data from 30 participants under two conditions but entering all 60 observations as an independent-groups design. This creates pseudo-replication, violates independence, and severely inflates Type I error rates because the 60 observations are not all independent.
Solution: Understand the design. If each participant provided data under both conditions, the observations are paired and the within-subjects structure must be accounted for in the analysis (paired t-test, or repeated measures ANOVA for $k > 2$ conditions).
Mistake 9: Ignoring Carryover Effects in Crossover Designs
Problem: In crossover designs, the effect of the first condition may carry over and influence responses in the second condition. Failing to account for order effects can bias the estimate of the mean difference, making the paired comparison misleading.
Solution: Use proper washout periods between conditions. Test for order effects by including condition order as a factor. If order effects are significant, report this and consider using only the first-condition data or modelling the order effect explicitly.
Mistake 10: Not Specifying $\mu_0$ When Testing Non-Zero Nulls
Problem: Testing whether a treatment effect exceeds a clinically meaningful threshold (e.g., a 5-point improvement on a 100-point scale) using the default $H_0: \mu_d = 0$ instead of $H_0: \mu_d = 5$. The default test does not answer the right question.
Solution: Set the null hypothesis value $\mu_0$ to the minimum clinically important difference (MCID) before running the test. In DataStatPro, enter this value in the "Null Hypothesis Value" field.
14. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| $t$ is extremely large | Very small $s_d$ (participants all changed by nearly the same amount) or data entry error | Check data; if genuine, report with interpretation — very consistent within-person change |
| $s_d = 0$ | All difference scores are identical | Verify data; if genuine, the test is degenerate — all participants changed by exactly the same amount |
| $t = 0$ exactly | $\bar{d} = 0$ exactly | All differences cancel out; report $p = 1.0$, interpret as no mean change |
| Shapiro-Wilk significant on large sample | High power of normality test; minor deviations detected | With $n \geq 30$, CLT provides protection; inspect Q-Q plot for severity; t-test likely valid |
| $d_z$ much larger than $d_{av}$ | High within-pair correlation ($r$ large) | Both are correct; $d_z$ reflects paired design efficiency; $d_{av}$ is more comparable to between-subjects |
| Paired t significant but Wilcoxon signed-rank not significant (or vice versa) | Distributional issues or tied difference scores | Check normality; if differences are non-normal, trust Wilcoxon; report both with rationale |
| 95% CI for $d_z$ is very wide | Small $n$ | Report the wide CI — it is the truthful reflection of low precision; use the exact (non-central t) CI from DataStatPro |
| Equivalence test fails despite small $\bar{d}$ | Equivalence bounds are too tight for available $n$ | Increase $n$ for replication; widen bounds with theoretical justification or accept insufficient precision |
| $r$ is negative | Rare; could arise from counterbalancing with contrast effects | Verify measurement; pairing reduces power when $r < 0$ — consider independent test |
| Hedges' $g_z$ noticeably smaller than $d_z$ | Small $n$; correction factor $< 1$ | Both values correct; report both; specify which is primary |
| Bayes Factor near 1 | Insensitive data; study underpowered | Collect more data; report $BF_{10}$ as reflecting insensitivity rather than evidence for either hypothesis |
| TOST bounds are difficult to specify | Lack of prior knowledge about MCID | Consult domain literature; use $d_z = 0.2$ as a generic "trivially small" effect bound; pre-register choice |
| Dataset has missing values for some pairs | Incomplete data collection; attrition | Use complete-case analysis if MCAR; use multiple imputation or ML if MAR; document clearly |
| Two conditions have very different SDs | Treatment changes variability | Note heteroscedasticity; consider Glass's $\Delta$ (baseline SD) rather than $d_{av}$; Wilcoxon is robust |
15. Quick Reference Cheat Sheet
Core Formulas
| Formula | Description |
|---|---|
| $d_i = x_{1i} - x_{2i}$ | Difference score for pair $i$ |
| $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$ | Mean difference |
| $s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n-1}}$ | SD of differences |
| $SE_{\bar{d}} = s_d / \sqrt{n}$ | Standard error of mean difference |
| $t = (\bar{d} - \mu_0) / SE_{\bar{d}}$ | Paired t-statistic |
| $df = n - 1$ | Degrees of freedom |
| $p = 2 \times P(T_{n-1} \geq \lvert t \rvert)$ | Two-tailed p-value |
| $\bar{d} \pm t_{.975,\,n-1} \times SE_{\bar{d}}$ | 95% CI for mean difference |
| $s_d = \sqrt{s_1^2 + s_2^2 - 2 r s_1 s_2}$ | $s_d$ from raw score statistics |
| $r = \frac{s_1^2 + s_2^2 - s_d^2}{2 s_1 s_2}$ | Pre-post correlation from summary stats |
Effect Size Formulas
| Formula | Description |
|---|---|
| $d_z = \bar{d} / s_d$ | Cohen's $d_z$ (most common for paired) |
| $g_z = d_z \left(1 - \frac{3}{4(n-1)-1}\right)$ | Hedges' $g_z$ (bias-corrected) |
| $d_{av} = \bar{d} \big/ \frac{s_1 + s_2}{2}$ | Cohen's $d_{av}$ (comparable to between) |
| $d_{rm} = d_z \sqrt{2(1-r)}$ | Corrected $d_{rm}$ (most cross-design comparable) |
| $\Delta = \bar{d} / s_1$ | Glass's $\Delta$ (baseline standardiser) |
| $d_{av} \approx d_z \sqrt{2(1-r)}$ | Converting $d_z$ to $d_{av}$ |
| $d_z = d_{av} \big/ \sqrt{2(1-r)}$ | Converting $d_{av}$ to $d_z$ |
| $CL = \Phi(d_z / \sqrt{2})$ | Common Language Effect Size |
| $SE_{d_z} \approx \sqrt{\frac{1}{n} + \frac{d_z^2}{2n}}$ | Approximate SE for CI of $d_z$ |
| $\lambda = d_z \sqrt{n}$ | Non-centrality parameter for power |
TOST Equivalence Test Formulas
| Formula | Description |
|---|---|
| $t_L = (\bar{d} + \Delta) / SE_{\bar{d}}$ | Lower TOST t-statistic |
| $t_U = (\bar{d} - \Delta) / SE_{\bar{d}}$ | Upper TOST t-statistic |
| 90% CI within $[-\Delta, +\Delta]$ | Equivalence decision criterion |
Effect Size Variant Comparison
| Variant | Denominator | Comparable To | When to Use |
|---|---|---|---|
| $d_z$ | $s_d$ | Other paired designs only | Within-study; paired vs. paired |
| $g_z$ (corrected) | $s_d$ | Other paired designs | Small samples ($n < 20$) |
| $d_{av}$ | $(s_1 + s_2)/2$ | Between-subjects | Cross-design comparison |
| $d_{rm}$ | $s_d$ corrected by $\sqrt{2(1-r)}$ | Most generalised | Meta-analysis; cross-design |
| Glass's $\Delta$ | $s_1$ (pre-test) | Between-subjects from baseline | Pre-post change from baseline |
Cohen's Benchmarks for
| Label | $d_z$ | $CL$ (%) | Overlap (%) |
|---|---|---|---|
| Tiny | |||
| Very small | |||
| Small | |||
| Medium | |||
| Large | |||
| Very large | |||
| Huge |
Required Sample Size (Pairs) — Two-Tailed $\alpha = .05$
| $d_z$ | Power = 0.70 | Power = 0.80 | Power = 0.90 | Power = 0.95 |
|---|---|---|---|---|
| 0.20 | 185 | 264 | 354 | 434 |
| 0.30 | 83 | 119 | 160 | 196 |
| 0.40 | 47 | 67 | 90 | 111 |
| 0.50 | 31 | 44 | 59 | 73 |
| 0.60 | 22 | 31 | 42 | 52 |
| 0.80 | 13 | 18 | 24 | 30 |
| 1.00 | 9 | 13 | 17 | 21 |
| 1.20 | 7 | 9 | 13 | 15 |
| 1.50 | 5 | 7 | 9 | 11 |
| 2.00 | 4 | 5 | 6 | 8 |
APA 7th Edition Reporting Templates
Full APA report (raw data available):
"A paired samples t-test was conducted to examine whether [DV] differed between [Condition 1] and [Condition 2]. Difference scores were [normally / not normally] distributed as assessed by Shapiro-Wilk ($W =$ [value], $p =$ [value]). [Condition 1] ($M =$ [value], $SD =$ [value]) [was / was not] significantly [higher / lower] than [Condition 2] ($M =$ [value], $SD =$ [value]), $t(df) =$ [value], $p =$ [value], $d_z =$ [value] [95% CI: LB, UB]. The mean difference of [value] [units] [95% CI: LB, UB] represents a [small / medium / large] within-subjects effect."
Compact format (for results section):
$t(df) =$ [value], $p =$ [value], $d_z =$ [value] [95% CI: LB, UB], $\bar{d} =$ [value] [units] [95% CI: LB, UB].
Non-significant result with equivalence:
"The mean difference was not statistically significant, $t(df) =$ [value], $p =$ [value], $d_z =$ [value] [95% CI: LB, UB]. A TOST equivalence test with bounds of $\pm\Delta$ [units] [demonstrated / failed to demonstrate] equivalence at $\alpha = .05$, with the 90% CI [LB, UB] falling [entirely within / outside] the equivalence interval."
Bayesian paired t-test:
"A Bayesian paired t-test with the default Cauchy prior ($r = 0.707$) provided [extreme / very strong / strong / moderate / anecdotal / no] evidence for [$H_1$: $\mu_d \neq 0$ / $H_0$: $\mu_d = 0$], $BF_{10} =$ [value]."
Test Decision Flowchart
Two related conditions, continuous DV?
├── YES
│ └── Are difference scores approximately normally distributed?
│ (Check: Shapiro-Wilk on d_i; Q-Q plot of d_i)
│ ├── YES (or n ≥ 30)
│ │ └── Paired t-test ✅
│ │ ├── Significant: Report t, p, CI, d_z, d_av
│ │ ├── Non-significant: Report CI, sensitivity analysis
│ │ └── Claiming equivalence: Add TOST
│ └── NO (and n < 30)
│ └── Wilcoxon Signed-Rank Test ✅
│ └── Report W, z, p, r_rb
└── NO
├── Ordinal DV → Wilcoxon Signed-Rank Test
└── 3+ conditions → Repeated Measures ANOVA
Assumption Checks Reference
| Assumption | Check | Tool | Action if Violated |
|---|---|---|---|
| Normality of | Shapiro-Wilk on differences | shapiro.test(d) in R | Wilcoxon signed-rank; transform |
| Independence of pairs | Design review | Study protocol | Multilevel model if clustered |
| Correct pairing | ID matching | Inspect data file | Re-match; verify data entry |
| Interval scale | Measurement theory | Conceptual check | Wilcoxon signed-rank |
| No influential outliers | Boxplot and z-scores of $d_i$ | boxplot(d) | Investigate; robust t-test |
Paired t-Test Reporting Checklist
| Item | Required |
|---|---|
| Mean and SD for each condition | ✅ Always |
| Mean and SD of difference scores | ✅ Always |
| t-statistic with | ✅ Always |
| Exact p-value (or ) | ✅ Always |
| 95% CI for mean difference (original units) | ✅ Always |
| Cohen's with 95% CI | ✅ Always |
| Which variant reported (, , etc.) | ✅ Always |
| Sample size (number of pairs) | ✅ Always |
| Shapiro-Wilk result on differences | ✅ When $n < 30$ |
| Hedges' $g_z$ instead of $d_z$ | ✅ When $n < 20$ |
| $d_{av}$ or $d_{rm}$ alongside $d_z$ | ✅ When comparing to between-subjects |
| $r$ (pre-post/within-pair correlation) | ✅ Recommended |
| TOST result if claiming null | ✅ When claiming no meaningful difference |
| Bayes Factor | ✅ For ambiguous or null results |
| Power or sensitivity analysis | ✅ For null or inconclusive results |
| Direction of effect stated | ✅ Always |
| Domain-specific benchmark context | ✅ Recommended |
Conversion Formulas: Paired $d_z$ ↔ Other Metrics
| From | To | Formula |
|---|---|---|
| $t$, $n$ | $d_z$ | $d_z = t / \sqrt{n}$ |
| $d_z$, $n$ | $t$ | $t = d_z \sqrt{n}$ |
| $\bar{d}$, $s_1$, $s_2$, $r$ | $d_z$ | $d_z = \bar{d} \big/ \sqrt{s_1^2 + s_2^2 - 2 r s_1 s_2}$ |
| $d_z$, $r$ | $d_{av}$ | $d_{av} \approx d_z \sqrt{2(1-r)}$ |
| $d_{av}$, $r$ | $d_z$ | $d_z = d_{av} \big/ \sqrt{2(1-r)}$ |
| $t$, $df$ | $r$ (point-biserial) | $r = \sqrt{t^2 / (t^2 + df)}$ (approx) |
| $d_z$, $r$ | $d$ (comparable) | Use $d_{rm} = d_z \sqrt{2(1-r)}$ |
This tutorial provides a comprehensive foundation for understanding, conducting, and reporting the Paired t-Test within the DataStatPro application. For further reading, consult Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018) for an applied introduction; Lakens's "Calculating and Reporting Effect Sizes to Facilitate Cumulative Science" (Frontiers in Psychology, 2013) for the critical discussion of $d_z$ vs. $d_{av}$ and $d_{rm}$; Morris & DeShon's "Combining Effect Size Estimates in Meta-Analysis With Repeated Measures and Independent-Groups Designs" (Psychological Methods, 2002) for the $d_{rm}$ formula; Rouder et al.'s "Bayesian t-Tests for Accepting and Rejecting the Null Hypothesis" (Psychonomic Bulletin & Review, 2009) for the Bayesian approach; and Lakens's "Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses" (Social Psychological and Personality Science, 2017) for TOST equivalence testing. For feature requests or support, contact the DataStatPro team.