Paired t-Test: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of dependent-samples inference all the way through the mathematics, assumptions, variants, effect sizes, interpretation, reporting, and practical usage of the Paired t-Test within the DataStatPro application. Whether you are encountering the paired t-test for the first time or seeking a rigorous understanding of within-subjects comparison, this guide builds your knowledge systematically from the ground up.

Prerequisites and Background Concepts
What is a Paired t-Test?
The Mathematics Behind the Paired t-Test
Assumptions of the Paired t-Test
Variants of the Paired t-Test
Using the Paired t-Test Calculator Component
Full Step-by-Step Procedure
Effect Sizes for the Paired t-Test
Confidence Intervals
Power Analysis and Sample Size Planning
Advanced Topics
Worked Examples
Common Mistakes and How to Avoid Them
Troubleshooting
Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into the paired t-test, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 Populations, Parameters, and Paired Designs

A population is the complete collection of individuals or measurements of interest. A sample is a subset drawn from that population. In a paired design, each participant (or experimental unit) contributes exactly two measurements — one under each of two conditions. The two measurements within a pair are inherently linked.

The paired t-test is an inferential procedure — it uses sample statistics computed from difference scores to draw conclusions about an unknown population parameter, namely the mean of the population difference scores $\mu_d$ .

The fundamental question: "Is the mean difference between the two paired conditions large enough to conclude that a true population-level difference exists?"

1.2 Why Pairing Matters: Removing Between-Person Variability

In most research involving repeated measurements, individuals vary considerably from one another — some participants score high on both measurements, others score low on both. This between-person variability is a source of noise that has nothing to do with the treatment or condition effect.

By computing difference scores $d_i = x_{1i} - x_{2i}$ for each participant, the paired design removes between-person variability from the error term entirely:

$s_d^2 = s_1^2 + s_2^2 - 2r_{12}s_1s_2$

When $r_{12} > 0$ (which is typical when measuring the same people twice), $s_d < s_{pooled}$ , meaning the paired test has a smaller denominator and greater statistical power than the independent samples t-test for the same data.

1.3 The Sampling Distribution of the Mean Difference

If we repeatedly drew samples of $n$ pairs from a population where the true mean difference is $\mu_d$ , the sampling distribution of $\bar{d}$ (the mean of the difference scores) would, by the Central Limit Theorem, be approximately normal:

$\bar{d} \sim \mathcal{N}\!\left(\mu_d,\; \frac{\sigma_d^2}{n}\right)$

Because the population standard deviation of differences $\sigma_d$ is unknown, we estimate it with the sample standard deviation $s_d$ , giving the estimated standard error of the mean difference:

$\widehat{SE}_{\bar{d}} = \frac{s_d}{\sqrt{n}}$

This substitution of $s_d$ for $\sigma_d$ is exactly what introduces the t-distribution rather than the standard normal distribution into the inference.

1.4 The t-Distribution and Degrees of Freedom

The Student's t-distribution arises when estimating a normally distributed population mean from a small sample with unknown variance. It is characterised by degrees of freedom $\nu$ . For the paired t-test:

$\nu = n - 1$

where $n$ is the number of pairs (not the total number of observations, which would be $2n$ ). As $\nu \to \infty$ , the t-distribution converges to the standard normal $Z$ .

The t-distribution has heavier tails than the standard normal, reflecting greater uncertainty from estimating $\sigma_d$ from the data rather than knowing it exactly.

1.5 The Null and Alternative Hypotheses

The paired t-test operates within the Neyman-Pearson hypothesis testing framework:

$H_0: \mu_d = 0$ (the population mean difference is zero)

$H_1: \mu_d \neq 0$ (two-tailed, default)

or directional alternatives:

$H_1: \mu_d > 0$ (upper one-tailed — Condition 1 > Condition 2)

$H_1: \mu_d < 0$ (lower one-tailed — Condition 1 < Condition 2)

The null hypothesis can also be generalised to test against a non-zero value $\delta_0$ :

$H_0: \mu_d = \delta_0$

which is useful for non-inferiority, superiority, or equivalence testing.

1.6 Statistical Significance vs. Practical Significance

A paired t-test answers: "Is the mean difference statistically distinguishable from zero, given sampling variability?" It does not answer: "Is the difference large enough to matter in practice?"

With a large number of pairs, even a trivially small mean difference can be statistically significant. Always report:

The t-statistic, degrees of freedom, and p-value (statistical significance).
An effect size (e.g., Cohen's $d_z$ ) and its 95% CI (practical significance).
The 95% CI for the mean difference (in original units).

1.7 Confidence Intervals and Their Relationship to the Test

A 95% confidence interval for $\mu_d$ is directly related to the two-tailed t-test at $\alpha = .05$ : the null hypothesis $H_0: \mu_d = 0$ is rejected at $\alpha = .05$ if and only if $0$ lies outside the 95% CI. The CI provides strictly more information than the p-value because it communicates both the precision and magnitude of the estimated difference in original units.

1.8 Type I and Type II Errors

Decision	$H_0$ True ( $\mu_d = 0$ )	$H_0$ False ( $\mu_d \neq 0$ )
Reject $H_0$	Type I error ( $\alpha$ )	Correct — Power ( $1-\beta$ )
Fail to Reject $H_0$	Correct ( $1-\alpha$ )	Type II error ( $\beta$ )

Type I error: Concluding a difference exists when none does (false positive). Rate controlled by $\alpha$ .
Type II error: Missing a true difference (false negative). Rate $\beta$ ; power $= 1-\beta$ .
Power: Probability of correctly detecting a true effect of a given size.

2. What is a Paired t-Test?

2.1 The Core Idea

The paired t-test (also called: dependent samples t-test, matched pairs t-test, repeated measures t-test, or within-subjects t-test) is a parametric inferential procedure for testing whether the mean of a set of difference scores is significantly different from zero (or another specified value).

Rather than comparing two separate group means directly, the paired t-test:

Computes a difference score $d_i$ for each pair of observations.
Reduces the problem to a one-sample t-test on those difference scores.
Tests whether the mean difference $\bar{d}$ is significantly different from zero.

This reduction is elegant: the paired t-test is mathematically identical to a one-sample t-test applied to the difference scores.

2.2 When to Use a Paired t-Test

A paired t-test is appropriate when:

The dependent variable is continuous (interval or ratio scale).
You are comparing exactly two related conditions or two time points.
Each observation in Condition 1 is meaningfully linked to exactly one observation in Condition 2.
The difference scores are approximately normally distributed (or $n$ is large enough for the CLT to apply).

2.3 What Makes Observations "Paired"?

Observations are paired when there is a natural, meaningful, one-to-one correspondence between observations in the two conditions:

Pairing Type	Example
Pre-post (same participant)	Depression score before and after therapy
Repeated measures (same participant)	Reaction time in noise vs. silence
Matched pairs (different participants)	Twins randomised to different conditions
Natural pairs	Left hand vs. right hand grip strength
Crossover designs	Drug A vs. Drug B, each participant receives both
Yoked controls	Each treatment participant matched to a control on age and IQ

The key criterion is that the pairing must be established before data collection, not post-hoc. The correlation between the paired measurements must be positive (or at least non-negative) for pairing to confer a power advantage.

2.4 The Paired t-Test vs. Related Procedures

Situation	Appropriate Test
Two related conditions, normal differences	Paired t-test
Two related conditions, non-normal or ordinal	Wilcoxon Signed-Rank test
Two independent groups	Independent samples t-test (Welch's recommended)
Three or more related conditions	One-way Repeated Measures ANOVA
Two related conditions, Bayesian inference	Bayesian paired t-test (BF $_{10}$ )
Testing equivalence of two related conditions	TOST equivalence test

2.5 The Power Advantage of Pairing

The paired t-test is more powerful than the independent samples t-test when:

The within-pair correlation $r_{12}$ is positive (which is almost always true for repeated measures on the same participant).
Between-person variability is large relative to within-person change.

The power gain is quantified by the relationship between paired and independent standard errors:

$SE_{paired} = SE_{independent} \times \sqrt{2(1-r_{12})}$

When $r_{12} = 0.50$ : $SE_{paired} = SE_{independent} \times \sqrt{1.0} = SE_{independent}$ (no gain).

When $r_{12} = 0.80$ : $SE_{paired} = SE_{independent} \times \sqrt{0.4} = 0.632 \times SE_{independent}$ (37% reduction in SE — substantial power gain).

When $r_{12} = 0.30$ : $SE_{paired} = SE_{independent} \times \sqrt{1.4} = 1.183 \times SE_{independent}$ — pairing actually hurts power when correlation is low and one degree of freedom is lost for pairing.

💡 Pairing is most advantageous when the within-pair correlation is high ( $r_{12} > 0.50$ ). When participants differ greatly from each other but respond consistently to conditions, the paired design dramatically reduces error and increases power.

3. The Mathematics Behind the Paired t-Test

3.1 The Difference Score Reduction

Let $(x_{1i}, x_{2i})$ denote the pair of observations for participant $i$ , where $i = 1, 2, \ldots, n$ . Define the difference score:

$d_i = x_{1i} - x_{2i}$

The sign convention matters: consistently subtracting Condition 2 from Condition 1 means a positive $\bar{d}$ indicates that Condition 1 has higher values.

The mean and standard deviation of the difference scores are:

$\bar{d} = \frac{1}{n}\sum_{i=1}^n d_i$

$s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (d_i - \bar{d})^2}$

3.2 The t-Statistic

The paired t-statistic is:

$t = \frac{\bar{d} - \delta_0}{s_d / \sqrt{n}}$

Where:

$\bar{d}$ = mean of the difference scores.
$\delta_0$ = null hypothesis value (typically $0$ ).
$s_d$ = standard deviation of the difference scores.
$n$ = number of pairs.

Under $H_0: \mu_d = \delta_0$ , this statistic follows a t-distribution with $\nu = n - 1$ degrees of freedom.

3.3 Standard Error of the Mean Difference

The standard error of the mean difference measures the precision of $\bar{d}$ as an estimate of $\mu_d$ :

$SE_{\bar{d}} = \frac{s_d}{\sqrt{n}}$

This is the only standard error needed for the paired t-test. Note that it is computed entirely from the difference scores — the original scores $x_{1i}$ and $x_{2i}$ are used only to compute $d_i$ .

3.4 The p-value

Two-tailed p-value:

$p = 2 \times P(T_{n-1} \geq |t_{obs}|)= 2 \times [1 - F_{t,\;n-1}(|t_{obs}|)]$

One-tailed p-value (upper): $H_1: \mu_d > \delta_0$

$p = P(T_{n-1} \geq t_{obs}) = 1 - F_{t,\;n-1}(t_{obs})$

One-tailed p-value (lower): $H_1: \mu_d < \delta_0$

$p = P(T_{n-1} \leq t_{obs}) = F_{t,\;n-1}(t_{obs})$

Where $F_{t,\;n-1}$ is the CDF of the t-distribution with $n-1$ degrees of freedom.

3.5 Relationship Between $s_d$ and the Raw Score Statistics

The standard deviation of differences $s_d$ is algebraically related to the standard deviations of the original scores and their correlation:

$s_d^2 = s_1^2 + s_2^2 - 2r_{12}s_1 s_2$

Where:

$s_1, s_2$ = standard deviations of Condition 1 and Condition 2 scores respectively.
$r_{12}$ = Pearson correlation between the paired measurements.

This relationship has several important implications:

When $r_{12} = 0$ : $s_d^2 = s_1^2 + s_2^2$ — the paired test is equivalent to the independent test (no benefit from pairing).

When $r_{12} > 0$ : $s_d^2 < s_1^2 + s_2^2$ — pairing reduces error variance and increases power.

When $r_{12} < 0$ : $s_d^2 > s_1^2 + s_2^2$ — pairing increases error variance and reduces power. This is rare in practice but can occur with counterbalanced designs where learning effects operate.

3.6 The Mean Difference and Its Relationship to Raw Means

The mean difference score always equals the difference of the condition means:

$\bar{d} = \bar{x}_1 - \bar{x}_2$

This means the paired and independent tests produce identical estimates of the mean difference — the only difference is in the standard error used to evaluate that difference.

3.7 Computing the t-Statistic from Summary Statistics

If raw data are unavailable, the paired t-statistic can be computed from summary statistics in several ways:

From $\bar{d}$ and $s_d$ :

$t = \frac{\bar{d}}{s_d/\sqrt{n}}$

From the correlation $r_{12}$ and group SDs:

$s_d = \sqrt{s_1^2 + s_2^2 - 2r_{12}s_1 s_2}$

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(s_1^2 + s_2^2 - 2r_{12}s_1 s_2)/n}}$

From the t-statistic, recovering effect size:

$d_z = \frac{t}{\sqrt{n}}$

3.8 Non-Central t-Distribution and Exact CIs for Effect Sizes

Under $H_1$ (when a true effect exists), the t-statistic follows a non-central t-distribution with non-centrality parameter:

$\lambda = d_z \sqrt{n} = \frac{\mu_d}{\sigma_d}\sqrt{n}$

The exact 95% CI for Cohen's $d_z$ inverts this relationship numerically:

$P(T_{n-1}(\lambda_L) \geq t_{obs}) = 0.025$ and $P(T_{n-1}(\lambda_U) \leq t_{obs}) = 0.025$

$d_{z,L} = \frac{\lambda_L}{\sqrt{n}}, \qquad d_{z,U} = \frac{\lambda_U}{\sqrt{n}}$

No closed form exists for these bounds — DataStatPro computes them automatically using numerical iteration of the non-central t-distribution CDF.

3.9 Statistical Power of the Paired t-Test

Power is the probability that the paired t-test correctly rejects $H_0$ when a true effect of size $d_z$ exists:

$\text{Power} = P\!\left(T_{n-1}(\lambda) > t_{\alpha/2,\;n-1}\right)$

Where $\lambda = d_z\sqrt{n}$ is the non-centrality parameter.

The relationship between power, effect size, and sample size:

$d_z$	Power = 0.70 ( $n$ pairs)	Power = 0.80 ( $n$ pairs)	Power = 0.90 ( $n$ pairs)	Power = 0.95 ( $n$ pairs)
0.20	185	264	354	434
0.35	62	88	118	146
0.50	31	44	59	73
0.80	13	18	24	30
1.00	9	13	17	21
1.20	7	9	12	15
1.50	5	7	9	11

All values assume two-tailed $\alpha = .05$ .

4. Assumptions of the Paired t-Test

4.1 Normality of Difference Scores

The paired t-test assumes that the difference scores $d_i = x_{1i} - x_{2i}$ are drawn from a normally distributed population. Note that:

This is not the same as assuming the individual conditions are normally distributed.
The differences can be normal even if the raw scores are not, as long as the two conditions' non-normalities cancel out.
This is a weaker assumption than requiring both raw distributions to be normal.

How to check:

Shapiro-Wilk test on the difference scores ( $n < 50$ ): $H_0$ : differences are normally distributed. A significant result suggests departure from normality.
Q-Q plot of difference scores: points should fall approximately on the diagonal reference line.
Histogram of difference scores: should be approximately bell-shaped.
Skewness ( $|z_{skew}| < 2$ ) and kurtosis ( $|z_{kurt}| < 7$ ).

Robustness: The paired t-test is robust to mild violations of normality, especially when:

$n \geq 30$ pairs (Central Limit Theorem ensures the sampling distribution of $\bar{d}$ is approximately normal even if $d_i$ are not).
The distribution of differences is symmetric, even if not perfectly normal.
The violation consists of light tails rather than heavy tails or extreme skewness.

When violated: Use the Wilcoxon Signed-Rank test as a non-parametric alternative, or consider data transformations (log, square root) if the differences are right-skewed.

4.2 Independence of Pairs

All pairs must be independent of each other. That is, knowing the difference score for pair $i$ gives no information about the difference score for pair $j$ . Within a pair, the two measurements are of course correlated — that is the whole point of the design. It is the independence across pairs that is required.

Common violations:

Multiple measurements from the same participant treated as separate pairs.
Family members or social contacts in the same study.
Clustered data (e.g., pairs sampled from the same school or ward).
Time series where successive differences are autocorrelated.

How to check: Independence is a property of the study design, not of the data. Inspect the sampling procedure. Check for patterns in residuals over time (Durbin-Watson test) if measurements were sequential.

When violated: For clustered pairs, use multilevel models. For time series, use time-series methods (ARIMA, mixed models with autocorrelation structure).

4.3 Correct Pairing

The pairing must be meaningful and pre-specified. Each observation in Condition 1 must correspond to the correct partner observation in Condition 2. Incorrect or arbitrary pairing does not create a valid paired test — it creates noise.

How to check: Verify the data file structure — each row should represent one pair (one participant or one matched pair), with Condition 1 and Condition 2 values in separate columns.

⚠️ A common data-entry error is accidentally shifting one column so that rows no longer correspond to the same participant across conditions. Always verify that participant IDs match across the two columns before running a paired t-test.

4.4 Interval Scale of Measurement

The dependent variable must be measured on at least an interval scale (equal-spaced intervals between values). Difference scores must be meaningful — they require that the distance between score values is consistent throughout the scale.

When violated: If the DV is ordinal (e.g., a single Likert item, rank data), use the Wilcoxon Signed-Rank test instead.

4.5 Absence of Influential Outliers in Difference Scores

The paired t-test is sensitive to extreme outliers in the difference scores because they distort both $\bar{d}$ and $s_d$ .

How to check:

Boxplot of difference scores: values beyond $1.5 \times IQR$ from the quartiles.
Standardised difference scores: flag $|z_i| > 3$ .
Grubbs' test for formal outlier detection in the differences.

When outliers are present: Investigate whether the outlier represents a data entry error, measurement error, or a genuine extreme response. Report analyses with and without the outlier(s). Consider using the Wilcoxon Signed-Rank test (which is rank-based and thus robust to outliers in the differences).

4.6 Assumption Summary Table

Assumption	Description	How to Check	Remedy if Violated
Normality of differences	$d_i \sim \mathcal{N}(\mu_d, \sigma_d^2)$	Shapiro-Wilk, Q-Q, histogram	Wilcoxon Signed-Rank; transform data
Independence of pairs	Pairs are independent of each other	Design review; Durbin-Watson	Multilevel model; time-series methods
Correct pairing	Conditions 1 and 2 observations are correctly matched	Verify participant IDs in data file	Re-match data; verify recording
Interval scale	DV has equal-interval properties	Measurement theory	Wilcoxon Signed-Rank
No influential outliers	No extreme values in $d_i$	Boxplot; $	z_{d_i}

5. Variants of the Paired t-Test

5.1 Overview of Effect Size Variants

Multiple variants of the paired t-test exist primarily because of different choices of effect size standardiser — the denominator of the standardised mean difference. Choosing the wrong variant leads to incomparable effect sizes across studies.

Variant	t-Statistic	Effect Size	Denominator	Primary Use
Standard paired t	$\bar{d}/(s_d/\sqrt{n})$	$d_z = \bar{d}/s_d$	SD of differences	Comparing paired designs
Average SD standardiser	Same t	$d_{av} = \bar{d}/s_{av}$	Average of group SDs	Comparing to between-subjects
Pooled SD standardiser	Same t	$d_s = \bar{d}/s_{pooled}$	Pooled SD (like between)	Meta-analysis
RM-corrected	Same t	$d_{rm} = d_{av}\sqrt{2(1-r)}$	Adjusted for correlation	Cross-design comparison
Pre-test standardiser	Same t	$d_{pre} = \bar{d}/s_{pre}$	SD of pre-test (Condition 1)	Change from baseline

5.2 Cohen's $d_z$ — The Standardised Mean Difference of Differences

Cohen's $d_z$ is the most straightforward effect size for the paired t-test. It expresses the mean difference in units of the standard deviation of the difference scores:

$d_z = \frac{\bar{d}}{s_d}$

It is directly recoverable from the t-statistic: $d_z = t/\sqrt{n}$ .

When to use $d_z$ :

Comparing effect sizes across studies that all use paired designs.
Within-study power analysis for the same paired design.
When the research question is about the magnitude of within-person change relative to individual variability in change.

Limitation of $d_z$ : It is not directly comparable to Cohen's $d$ from an independent samples design because $s_d$ reflects within-person variability in change, which is typically much smaller than between-person variability. $d_z$ is therefore typically larger than $d_{av}$ for the same mean difference.

5.3 Cohen's $d_{av}$ — Average Standard Deviation Standardiser

Cohen's $d_{av}$ (Lakens, 2013) standardises the mean difference by the average of the two condition standard deviations:

$s_{av} = \frac{s_1 + s_2}{2}$

$d_{av} = \frac{\bar{d}}{s_{av}} = \frac{\bar{x}_1 - \bar{x}_2}{(s_1 + s_2)/2}$

When to use $d_{av}$ :

When comparing a paired design effect to a between-subjects design effect using the same measurement scale.
When the research question concerns the magnitude of mean change relative to the variability of the original measurements.
Meta-analyses combining within-subjects and between-subjects designs.

5.4 Cohen's $d_{rm}$ — Repeated Measures Corrected

Cohen's $d_{rm}$ (Morris & DeShon, 2002) explicitly accounts for the within-subjects correlation $r_{12}$ to produce an effect size that is directly comparable to a between- subjects Cohen's $d$ :

$d_{rm} = d_{av} \times \sqrt{2(1-r_{12})}$

Or equivalently:

$d_{rm} = \frac{\bar{d}}{s_{av}} \times \sqrt{2(1-r_{12})}$

Properties:

When $r_{12} = 0.5$ : $d_{rm} = d_{av} \times 1.0 = d_{av}$ .
When $r_{12} > 0.5$ : $d_{rm} < d_{av}$ (correlation inflated $d_{av}$ — correction deflates it toward a between-subjects comparable value).
When $r_{12} < 0.5$ : $d_{rm} > d_{av}$ .

$d_{rm}$ is the most theoretically appropriate effect size for comparing paired designs to independent samples designs.

5.5 Glass's $\Delta$ for Pre-Post Designs

Glass's $\Delta$ standardises by the pre-test (Condition 1) standard deviation only. This is most appropriate for treatment-control or pre-post designs where the pre-test represents the baseline, unaffected by the treatment:

$\Delta = \frac{\bar{d}}{s_{pre}} = \frac{\bar{x}_1 - \bar{x}_2}{s_1}$

It answers: "How many standard deviations (in the original, pre-intervention metric) does the average participant change?"

When to use $\Delta$ : Pre-post designs where the treatment may change the variability of the outcome (e.g., an intervention that reduces both mean and variance of depression scores). Standardising by the pre-test SD anchors the effect in the pre-intervention distribution.

5.6 Relationship Between $d_z$ and $d_{av}$

$d_z$ and $d_{av}$ are related through the within-pair correlation $r_{12}$ :

$d_z = d_{av} \times \sqrt{\frac{2}{1 + r_{12}}} \times \frac{1}{\sqrt{2(1-r_{12})}} \times \sqrt{2(1-r_{12})}$

More directly:

$d_z = \frac{d_{av}}{\sqrt{1 - r_{12}}} \times \frac{1}{\sqrt{2}}$

Wait — the exact relationship (Lakens, 2013):

$d_z = \frac{\bar{d}}{s_d} = \frac{\bar{d}}{s_{av}\sqrt{2(1-r_{12})}} = \frac{d_{av}}{\sqrt{2(1-r_{12})}}$

Therefore:

$d_z = \frac{d_{av}}{\sqrt{2(1-r_{12})}}$ and $d_{av} = d_z \times \sqrt{2(1-r_{12})}$

This means $d_z > d_{av}$ when $r_{12} > 0$ (which is almost always the case for repeated measures), explaining why paired designs appear to produce larger effect sizes than between-subjects designs when the same $d_z$ metric is uncritically applied to both.

Numerical example with $r_{12} = 0.70$ :

$d_z = d_{av}/\sqrt{2(1-0.70)} = d_{av}/\sqrt{0.60} = d_{av}/0.775 = 1.291 \times d_{av}$

So if $d_{av} = 0.60$ , then $d_z = 0.775$ — nearly 30% larger.

6. Using the Paired t-Test Calculator Component

The Paired t-Test Calculator in DataStatPro provides a comprehensive tool for running, diagnosing, visualising, and reporting paired t-tests and their alternatives.

Step-by-Step Guide

Step 1 — Select "Paired Samples t-Test"

From the "Test Type" dropdown, select:

Paired Samples t-Test — for parametric analysis of normally distributed differences.
Wilcoxon Signed-Rank Test — the non-parametric alternative (automatically suggested if DataStatPro's normality check flags a violation).

Step 2 — Input Method

Choose how to provide the data:

Raw data (paired columns): Upload or paste two columns — Condition 1 and Condition 2 — with one row per participant. DataStatPro automatically computes difference scores, runs assumption checks, and generates all statistics.
Raw data (difference scores): If you already have difference scores, upload a single column and DataStatPro treats it as a one-sample t-test on the differences.
Summary statistics: Enter $n$ , $\bar{d}$ , $s_d$ directly. All test statistics and effect sizes are computed but full assumption checks are unavailable.
Summary statistics with correlation: Enter $n$ , $\bar{x}_1$ , $\bar{x}_2$ , $s_1$ , $s_2$ , $r_{12}$ to compute all effect size variants.
t-statistic and df: Enter $t$ and $n-1$ to compute p-values and effect sizes from a published result.

💡 When using paired columns, DataStatPro verifies that column lengths are equal, flags any missing data, and alerts you if participant IDs are provided and do not match across columns.

Step 3 — Specify the Null Hypothesis Value $\delta_0$

Default: $\delta_0 = 0$ (testing whether the mean difference is zero). To test a non-zero null (e.g., for non-inferiority testing with a margin of $\delta_0 = -2$ points), enter the appropriate value.

Step 4 — Select the Alternative Hypothesis

Two-tailed (default): $H_1: \mu_d \neq 0$ — most appropriate for most research.
Upper one-tailed: $H_1: \mu_d > 0$ — use only with strong a priori directional prediction, pre-registered before data collection.
Lower one-tailed: $H_1: \mu_d < 0$ — use only with pre-registered directional prediction.

Step 5 — Choose the Significance Level

Select $\alpha$ (default: $.05$ ). DataStatPro simultaneously shows results for $\alpha = .05$ , $\alpha = .01$ , and $\alpha = .001$ for reference.

Step 6 — Select Effect Size Variants

Choose which effect sizes to compute and report:

✅ Cohen's $d_z$ (default) — standardised by $s_d$ .
✅ Cohen's $d_{av}$ — standardised by average of condition SDs.
✅ Cohen's $d_{rm}$ — RM-corrected (requires $r_{12}$ ; from raw data or entered manually).
✅ Glass's $\Delta$ — standardised by Condition 1 (pre-test) SD.
✅ Hedges' $g_z$ — bias-corrected $d_z$ .
✅ Common Language Effect Size (CL).

Step 7 — Select Display Options

✅ t-statistic, df, p-value, and decision.
✅ Descriptive statistics: $n$ , $\bar{x}_1$ , $\bar{x}_2$ , $\bar{d}$ , $s_1$ , $s_2$ , $s_d$ , $r_{12}$ .
✅ 95% CI for mean difference (in original units).
✅ All selected effect sizes with exact 95% CIs.
✅ Common Language Effect Size (CL%).
✅ Assumption test panel: Shapiro-Wilk on differences, Q-Q plot, histogram of differences, boxplot of differences.
✅ Visualisation: overlapping density plots for Conditions 1 and 2; distribution of difference scores with reference line at zero.
✅ Individual trajectory plot: each participant's scores connected across conditions.
✅ Cohen's $d$ diagram ( $U_1$ , $U_2$ , $U_3$ overlap statistics).
✅ Power curve: power vs. $n$ for the observed effect size.
✅ Equivalence test (TOST) output.
✅ Bayesian paired t-test (Bayes Factor $BF_{10}$ ).
✅ APA 7th edition-compliant results paragraph (auto-generated).

Step 8 — Run the Analysis

Click "Run Paired t-Test". DataStatPro will:

Compute difference scores and all descriptive statistics.
Run Shapiro-Wilk normality test on the differences.
Compute the t-statistic, df, and p-value.
Construct exact 95% CIs for the mean difference and all effect sizes.
Generate all selected visualisations.
Auto-generate the APA results paragraph.

7. Full Step-by-Step Procedure

7.1 Complete Computational Procedure

This section walks through every computational step for the paired t-test, from raw data to a full APA-style conclusion.

Given: $n$ pairs of observations $(x_{1i}, x_{2i})$ for $i = 1, 2, \ldots, n$ .

Step 1 — Verify and Arrange the Data

Arrange data in a table with one row per pair:

Pair $i$	$x_{1i}$ (Condition 1)	$x_{2i}$ (Condition 2)	$d_i = x_{1i} - x_{2i}$
1	$x_{11}$	$x_{21}$	$d_1$
2	$x_{12}$	$x_{22}$	$d_2$
$\vdots$	$\vdots$	$\vdots$	$\vdots$
$n$	$x_{1n}$	$x_{2n}$	$d_n$

Establish the sign convention: a positive $d_i$ means the participant scored higher in Condition 1 than Condition 2. State this convention explicitly before analysis.

Step 2 — Compute the Mean Difference

$\bar{d} = \frac{1}{n}\sum_{i=1}^n d_i = \frac{\sum_{i=1}^n d_i}{n}$

Equivalently: $\bar{d} = \bar{x}_1 - \bar{x}_2$

Step 3 — Compute the Standard Deviation of Differences

$s_d = \sqrt{\frac{\sum_{i=1}^n (d_i - \bar{d})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^n d_i^2 - n\bar{d}^2}{n-1}}$

Step 4 — Compute the Standard Error

$SE_{\bar{d}} = \frac{s_d}{\sqrt{n}}$

Step 5 — Check the Normality Assumption

Run the Shapiro-Wilk test on the difference scores $d_i$ :

If $p_{SW} > .05$ : normality is not contradicted; proceed with paired t-test.
If $p_{SW} \leq .05$ and $n < 30$ : consider the Wilcoxon Signed-Rank test.
If $p_{SW} \leq .05$ and $n \geq 30$ : proceed with caution (CLT generally provides protection); inspect Q-Q plot for severe violations.

Step 6 — Compute the t-Statistic

$t = \frac{\bar{d} - \delta_0}{SE_{\bar{d}}} = \frac{\bar{d} - \delta_0}{s_d/\sqrt{n}}$

For the default null ( $\delta_0 = 0$ ): $t = \bar{d} \cdot \sqrt{n} / s_d$

Step 7 — Determine Degrees of Freedom

$\nu = n - 1$

Step 8 — Compute the p-value

Using the t-distribution with $\nu = n-1$ df:

Two-tailed: $p = 2 \times P(T_{n-1} \geq |t|)$

Compare $p$ to $\alpha$ . Reject $H_0$ if $p \leq \alpha$ .

Step 9 — Compute the 95% Confidence Interval for $\mu_d$

$\bar{d} \pm t_{\alpha/2,\; n-1} \times SE_{\bar{d}} = \bar{d} \pm t_{\alpha/2,\; n-1} \times \frac{s_d}{\sqrt{n}}$

The CI directly answers: "What are plausible values for the true population mean difference, given this sample?"

Step 10 — Compute Effect Sizes

Cohen's $d_z$ :

$d_z = \frac{\bar{d}}{s_d} = \frac{t}{\sqrt{n}}$

Hedges' $g_z$ (bias-corrected $d_z$ ):

$g_z = d_z \times \left(1 - \frac{3}{4(n-1) - 1}\right) = d_z \times J$

Where $J = 1 - 3/(4n-5)$ is the bias correction factor.

Cohen's $d_{av}$ (requires $s_1$ and $s_2$ ):

$s_{av} = \frac{s_1 + s_2}{2}, \qquad d_{av} = \frac{\bar{d}}{s_{av}}$

Cohen's $d_{rm}$ (requires $r_{12}$ ):

$d_{rm} = d_{av} \times \sqrt{2(1-r_{12})}$

Common Language Effect Size (CL):

$CL = \Phi\!\left(\frac{d_z}{\sqrt{2}}\right)$

$CL$ is the probability that a randomly selected participant scores higher in Condition 1 than in Condition 2 (for positive $d_z$ ).

Step 11 — Compute the 95% CI for Cohen's $d_z$

Exact CI (via non-central t-distribution — computed by DataStatPro):

Find $\lambda_L$ and $\lambda_U$ such that:

$P(T_{n-1}(\lambda_L) \geq t_{obs}) = 0.025$ and $P(T_{n-1}(\lambda_U) \leq t_{obs}) = 0.025$

$d_{z,L} = \lambda_L/\sqrt{n}, \qquad d_{z,U} = \lambda_U/\sqrt{n}$

Approximate CI (adequate for $n > 30$ ):

$SE_{d_z} \approx \sqrt{\frac{1}{n} + \frac{d_z^2}{2(n-1)}}$

$d_z \pm 1.96 \times SE_{d_z}$

Step 12 — Interpret and Report

Combine all results into a complete, APA-compliant report:

Report $t(\nu) =$ [value], $p =$ [value] (or $p < .001$ ).
Report $\bar{d}$ and $s_d$ with units.
Report the 95% CI for the mean difference.
Report Cohen's $d_z$ (and/or $d_{av}$ ) with 95% CI.
Classify the effect size using benchmarks.
State the practical conclusion.

8. Effect Sizes for the Paired t-Test

8.1 Cohen's $d_z$ — Step-by-Step

$d_z = \frac{\bar{d}}{s_d}$

Interpretation: $d_z = 0.50$ means the mean difference is half a standard deviation of the difference scores. This is not directly comparable to Cohen's $d$ from an independent samples design without knowing $r_{12}$ .

8.2 Hedges' $g_z$ — Bias Correction

Cohen's $d_z$ is slightly positively biased in small samples — it overestimates the true population effect. Hedges' $g_z$ applies the bias correction:

$g_z = d_z \times J, \qquad J = 1 - \frac{3}{4(n-1)-1}$

More precise gamma function form:

$J = \frac{\Gamma((n-1)/2)}{\sqrt{(n-1)/2} \cdot \Gamma((n-2)/2)}$

The bias is negligible for $n > 20$ (less than 5%) but can be substantial for very small samples ( $n < 10$ ):

$n$	$J$	Bias (%)
5	0.8406	15.9%
10	0.9227	7.7%
15	0.9484	5.2%
20	0.9613	3.9%
30	0.9742	2.6%
50	0.9848	1.5%

8.3 Cohen's Benchmark Classification

Cohen (1988) proposed the following conventions for $d_z$ (and equivalently for $d_{av}$ and $d_{rm}$ ):

$\vert d_z \vert$	Verbal Label	$CL$ (%)	$U_3$ (%)	Overlap (%)
$0.00$	No effect	$50.0\%$	$50.0\%$	$100.0\%$
$0.20$	Small	$55.6\%$	$57.9\%$	$85.3\%$
$0.50$	Medium	$63.8\%$	$69.1\%$	$66.9\%$
$0.80$	Large	$71.4\%$	$78.8\%$	$52.5\%$
$1.20$	Very large	$80.2\%$	$88.5\%$	$35.9\%$
$2.00$	Huge	$92.1\%$	$97.7\%$	$16.9\%$

⚠️ Cohen's benchmarks were intended as rough conventions of last resort — to be used only when no domain-specific information is available. Always contextualise within your research domain. In clinical psychology, $d_z = 0.50$ may be a meaningful effect; in some neuroimaging contexts, $d_z = 0.20$ may be large relative to typical findings.

Extended benchmarks (Sawilowsky, 2009):

Label	$\vert d_z \vert$
Tiny	$< 0.10$
Very small	$0.10 - 0.19$
Small	$0.20 - 0.49$
Medium	$0.50 - 0.79$
Large	$0.80 - 1.19$
Very large	$1.20 - 1.99$
Huge	$\geq 2.00$

8.4 The Common Language Effect Size

The Common Language Effect Size (McGraw & Wong, 1992) translates $d_z$ into a probability that is intuitive for non-statistical audiences:

$CL = \Phi\!\left(\frac{d_z}{\sqrt{2}}\right)$

$CL = 0.70$ means: "In 70% of repeated measurements of the same individual, their score in Condition 1 exceeds their score in Condition 2."

8.5 Which Effect Size to Report: A Decision Guide

Research Goal	Recommended Effect Size	Rationale
Within-study power analysis and paired design comparison	$d_z$	Direct function of the t-statistic; reflects paired design power
Comparing to between-subjects literature	$d_{av}$ or $d_{rm}$	Standardises by original-scale SD; comparable to independent $d$
Clinical pre-post change evaluation	$d_{av}$ or Glass's $\Delta$	Anchored in clinically meaningful scale
Meta-analysis combining paired and independent	$d_{rm}$	Design-adjusted; most comparable across designs
Small sample ( $n < 20$ )	$g_z$ (Hedges')	Reduces positive bias of $d_z$
Reporting all relevant variants	$d_z$ + $d_{av}$	Provides complete picture; specify which is primary

💡 Always specify which effect size variant was computed. Writing "Cohen's $d = 0.78$ " without specifying whether it is $d_z$ , $d_{av}$ , or $d_{rm}$ is ambiguous and prevents accurate meta-analytic synthesis.

9. Confidence Intervals

9.1 CI for the Mean Difference (Original Units)

The 95% CI for the population mean difference $\mu_d$ provides the most practically interpretable interval — it is expressed in the original measurement units and directly answers: "How large might the true effect be?"

$\bar{d} \pm t_{\alpha/2,\; n-1} \times \frac{s_d}{\sqrt{n}}$

Interpreting the CI:

CI Property	Interpretation
Entirely above zero	Effect is significantly positive ( $p < \alpha$ )
Entirely below zero	Effect is significantly negative ( $p < \alpha$ )
Contains zero	Not statistically significant at level $\alpha$
Narrow CI	Precise estimate; large $n$
Wide CI	Imprecise estimate; small $n$ — interpret point estimate cautiously
Entirely within trivial range	Effect is definitively small (equivalence established)

9.2 CI for Cohen's $d_z$ (Standardised)

The exact 95% CI for $d_z$ uses the non-central t-distribution (computed automatically by DataStatPro). The approximate CI (adequate for $n > 30$ ) is:

$SE_{d_z} \approx \sqrt{\frac{1}{n} + \frac{d_z^2}{2(n-1)}}$

$d_z \pm 1.96 \times SE_{d_z}$ (approximate, two-tailed $\alpha = .05$ )

Width of the 95% CI for $d_z$ as a function of $n$ (for true $d_z = 0.50$ ):

$n$ pairs	Approx. $SE_{d_z}$	95% CI Width	Precision
10	0.334	1.31	Very low
20	0.232	0.91	Low
30	0.188	0.74	Moderate
50	0.145	0.57	Moderate-good
100	0.102	0.40	Good
200	0.072	0.28	High
500	0.046	0.18	Very high

⚠️ With $n = 10$ pairs, the 95% CI for $d_z = 0.50$ spans approximately [−0.16, 1.16] — from "negligible" to "very large." A point estimate of $d_z = 0.50$ from a study of only 10 pairs is essentially uninterpretable without the CI. Always report the CI.

9.3 CI for Other Effect Size Variants

95% CI for $d_{av}$ : Convert using $d_{av} = d_z \times \sqrt{2(1-r_{12})}$ and apply the same conversion to both CI bounds.

$d_{av,L} = d_{z,L} \times \sqrt{2(1-r_{12})}, \qquad d_{av,U} = d_{z,U} \times \sqrt{2(1-r_{12})}$

95% CI for $d_{rm}$ : DataStatPro computes this by bootstrapping when raw data are available, or via the delta method for summary statistics.

10. Power Analysis and Sample Size Planning

10.1 A Priori Power Analysis

A priori power analysis determines the required number of pairs before data collection to achieve desired power $1-\beta$ at significance level $\alpha$ for a hypothesised effect of size $d_z$ .

Required $n$ for two-tailed test:

The exact calculation uses the non-central t-distribution (numerical). An excellent approximation:

$n \approx \frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{d_z^2} + \frac{z_{1-\alpha/2}^2}{2}$

For $\alpha = .05$ (two-tailed, $z_{.975} = 1.96$ ) and power $= 0.80$ ( $z_{.80} = 0.842$ ):

$n \approx \frac{(1.96 + 0.842)^2}{d_z^2} = \frac{7.849}{d_z^2}$

$d_z$	Power = 0.80 ( $n$ )	Power = 0.90 ( $n$ )	Power = 0.95 ( $n$ )	Power = 0.99 ( $n$ )
0.20	198	265	326	441
0.30	89	119	147	198
0.50	34	45	55	75
0.80	15	19	23	32
1.00	10	13	16	22
1.20	8	10	12	16
1.50	6	8	9	12

All values assume two-tailed $\alpha = .05$ . Add 1–2 pairs to account for rounding.

10.2 Sensitivity Analysis (Post-Hoc Power)

Sensitivity analysis determines the minimum effect size that could have been detected with the study's sample size at a specified power level. It answers: "What was the smallest effect this study was designed to detect?"

$d_{z,\;min} = \sqrt{\frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{n}}$

For $n = 20$ pairs, $\alpha = .05$ , power $= 0.80$ :

$d_{z,\;min} = \sqrt{7.849/20} = \sqrt{0.392} = 0.626$

This study could reliably detect only effects of $d_z \geq 0.63$ — near Cohen's "large" threshold. Smaller effects may exist but would frequently be missed.

⚠️ Post-hoc power computed from the observed effect size (sometimes called "observed power") is circular, redundant with the p-value, and should NOT be reported as a justification for a non-significant result. Sensitivity analysis using the minimum detectable effect is the appropriate post-hoc power tool.

10.3 Planning Based on $d_{av}$ Instead of $d_z$

When planning based on an expected $d_{av}$ (e.g., from a published between-subjects study), first convert to $d_z$ using the anticipated within-pair correlation $r_{12}$ :

$d_z = \frac{d_{av}}{\sqrt{2(1-r_{12})}}$

Then apply the standard $n$ formula. If $r_{12}$ is unknown, use a conservative estimate of $r_{12} = 0.50$ :

$d_z = \frac{d_{av}}{\sqrt{2(1-0.50)}} = \frac{d_{av}}{\sqrt{1.0}} = d_{av}$

With $r_{12} = 0.50$ , $d_z = d_{av}$ , so the sample size formula is the same.

10.4 The Effect of Pre-Post Correlation on Required Sample Size

The required sample size for a paired design decreases as $r_{12}$ increases — reflecting the power advantage of pairing. Compared to an independent samples design with the same $d_{av}$ :

$r_{12}$	Factor $\sqrt{2(1-r_{12})}$	$d_z/d_{av}$	$n_{paired}/n_{independent}$
0.00	1.414	0.707	0.500
0.20	1.265	0.791	0.625
0.50	1.000	1.000	1.000
0.70	0.775	1.291	1.667
0.80	0.632	1.581	2.500
0.90	0.447	2.236	5.000

$n_{paired}/n_{independent}$ is the ratio of total observations needed (paired has $n$ pairs = $n$ total; independent has $2n$ total for same power on $d_{av}$ ).

💡 For $r_{12} = 0.80$ , the paired design requires only 40% as many participants as the independent design to achieve the same power. When within-pair correlations are high, pairing provides a dramatic efficiency gain.

11. Advanced Topics

11.1 The Paired t-Test as a One-Sample t-Test

The paired t-test is mathematically identical to a one-sample t-test applied to the difference scores. This has several practical implications:

Software implementation: Many software packages implement the paired t-test by computing difference scores and running a one-sample test.
Missing data: If some participants have data for only one condition, those pairs cannot contribute difference scores and are excluded entirely from the analysis.
Non-zero null: Testing $H_0: \mu_d = 5$ (e.g., "does the mean improvement exceed a clinically significant threshold of 5 points?") is as straightforward as testing $H_0: \mu_d = 0$ .

11.2 The Relationship Between Paired and Independent t-Tests

For the same dataset, the paired and independent t-statistics are related through the within-pair correlation $r_{12}$ :

$t_{paired} = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{(s_1^2+s_2^2-2r_{12}s_1s_2)/n}}$

$t_{independent} = \frac{\bar{x}_1-\bar{x}_2}{s_{pooled}\sqrt{2/n}} = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{(s_1^2+s_2^2)/n}}$

The ratio of the t-statistics:

$\frac{t_{paired}}{t_{independent}} = \frac{\sqrt{s_1^2+s_2^2}}{\sqrt{s_1^2+s_2^2-2r_{12}s_1s_2}} = \frac{1}{\sqrt{1-\frac{2r_{12}s_1s_2}{s_1^2+s_2^2}}}$

For equal SDs ( $s_1 = s_2 = s$ ):

$\frac{t_{paired}}{t_{independent}} = \frac{1}{\sqrt{1-r_{12}}}$

When $r_{12} = 0.75$ : $t_{paired}/t_{independent} = 1/\sqrt{0.25} = 2.0$ — the paired t-statistic is twice as large, corresponding to vastly higher power.

Also note the degrees of freedom differ: $\nu_{paired} = n-1$ vs. $\nu_{independent} = 2n-2$ . The paired test loses $n-1$ degrees of freedom by pairing, but gains far more through the reduced error term when $r_{12}$ is high.

11.3 Equivalence Testing with TOST

Standard paired t-testing can reject $H_0: \mu_d = 0$ but cannot establish that the mean difference is negligibly small. The Two One-Sided Tests (TOST) procedure tests whether the mean difference falls within a pre-specified equivalence interval $[-\Delta_L, \Delta_U]$ :

$H_{01}: \mu_d \leq -\Delta_L$ (the difference is meaningfully negative) $H_{02}: \mu_d \geq \Delta_U$ (the difference is meaningfully positive)

Equivalence is concluded when both one-sided tests reject their respective nulls at level $\alpha$ — equivalently, when the 90% CI (for $\alpha = .05$ ) for $\mu_d$ falls entirely within $(-\Delta_L, \Delta_U)$ .

The TOST t-statistics:

$t_1 = \frac{\bar{d} - (-\Delta_L)}{s_d/\sqrt{n}}, \qquad t_2 = \frac{\bar{d} - \Delta_U}{s_d/\sqrt{n}}$

Both must exceed $t_{\alpha,\;n-1}$ (one-tailed) for equivalence to be declared.

Choosing equivalence bounds: A common choice based on Cohen's $d_z$ is to set bounds corresponding to a "small" effect: $\Delta = d_{z,small} \times s_d = 0.20 \times s_d$ . In practice, bounds should be domain-specific and set before data collection.

💡 TOST for paired t-tests is critical for crossover drug bioequivalence studies (where "no difference" between formulations must be positively demonstrated), for measurement instrument validation (demonstrating that a new instrument agrees with a gold standard), and for null results that claim two conditions are equivalent.

11.4 Bayesian Paired t-Test

The Bayesian paired t-test (Rouder et al., 2009) quantifies evidence for and against the null hypothesis using the Bayes Factor $BF_{10}$ :

$BF_{10} = \frac{P(\text{data} \mid H_1: \mu_d \neq 0)}{P(\text{data} \mid H_0: \mu_d = 0)}$

Under the default JZS prior (Jeffrey-Zellner-Siow), the prior on $d_z$ under $H_1$ is a Cauchy distribution with scale $r = \sqrt{2}/2 \approx 0.707$ .

Interpreting Bayes Factors:

$BF_{10}$	Evidence for $H_1$
$> 100$	Extreme
$30 - 100$	Very strong
$10 - 30$	Strong
$3 - 10$	Moderate
$1 - 3$	Anecdotal
$1$	No evidence (equal support)
$1/3 - 1$	Anecdotal for $H_0$
$1/10 - 1/3$	Moderate for $H_0$
$< 1/10$	Strong or stronger for $H_0$

Advantages of Bayesian paired t-test:

Quantifies evidence for $H_0$ (null results can be informative, not just "inconclusive").
Valid for sequential testing (no correction needed for looking at data multiple times).
Provides a posterior distribution for $d_z$ .
Avoids the all-or-nothing dichotomy of significance testing.

Reporting: "A Bayesian paired t-test with the default Cauchy prior ( $r = \sqrt{2}/2$ ) provided [strong / moderate / anecdotal / no] evidence for the alternative hypothesis, $BF_{10} =$ [value]."

11.5 Robust Alternatives: Trimmed Mean Paired t-Test

Yuen's paired trimmed mean t-test (Yuen, 1974) uses $\alpha$ -trimmed means of the difference scores as the measure of central tendency. With 20% trimming:

$h = n - 2\lfloor 0.2n \rfloor$ (effective sample size after trimming)

$\bar{d}_{trim}$ = 20%-trimmed mean of $d_i$

$s_{w,d}^2$ = Winsorised variance of $d_i$

$t_{trim} = \frac{\bar{d}_{trim}}{s_{w,d}/\sqrt{h(h-1)}}$ , compared to $t_{h-1}$

The trimmed mean paired t-test is substantially more powerful than the Wilcoxon signed- rank test for symmetric heavy-tailed distributions, while maintaining good Type I error control under non-normality.

11.6 Handling Missing Data in Paired Designs

In the paired t-test, both observations must be present for a pair to contribute to the analysis. Options for handling missing data:

Approach	Description	When Appropriate
Complete case analysis	Use only pairs with both observations	MCAR assumption; small proportion missing
Multiple imputation	Impute missing values using predictive models	MAR assumption; principled approach
Maximum likelihood (MLM)	Use all available data via FIML	MAR assumption; preferred for repeated measures
Last observation carried forward (LOCF)	Replace missing post-value with last observation	Clinical trials; conservative assumption

⚠️ Listwise deletion (complete case analysis) is the default in most software but can introduce bias when data are not Missing Completely At Random (MCAR). For more than 5% missing data, multiple imputation or maximum likelihood estimation are strongly preferred.

11.7 Multi-Level Extensions of the Paired Design

The paired t-test assumes that pairs are sampled from a common population. When pairs themselves are nested within clusters (e.g., twin pairs from the same family, or pre-post measurements from patients in the same hospital), standard paired t-tests underestimate standard errors and produce inflated Type I error rates.

The appropriate extension is a two-level mixed model:

$d_{ij} = \gamma_{00} + u_{0j} + \varepsilon_{ij}$

Where $u_{0j} \sim \mathcal{N}(0, \tau^2)$ is the cluster-level random effect and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ is the residual within clusters. The Intraclass Correlation Coefficient (ICC) = $\tau^2/(\tau^2 + \sigma^2)$ quantifies the degree of clustering.

11.8 Reporting the Paired t-Test According to APA 7th Edition

Minimum reporting requirements (APA 7th ed.):

Mean and SD for each condition: $M_1$ , $SD_1$ , $M_2$ , $SD_2$ .
Mean and SD of difference scores: $M_d$ , $SD_d$ .
t-statistic with df: $t(n-1) =$ [value].
Exact p-value (or $p < .001$ ).
Effect size with 95% CI: $d_z =$ [value] [95% CI: LB, UB].
95% CI for mean difference in original units.
Specification of which effect size variant was reported.
Whether the normality assumption was checked and met.

12. Worked Examples

Example 1: Pre-Post Mindfulness Intervention — PHQ-9 Depression Scores

A clinical psychologist evaluates whether an 8-week Mindfulness-Based Cognitive Therapy (MBCT) programme significantly reduces depression symptoms. PHQ-9 scores (0–27; higher = more depression) are recorded for $n = 15$ participants immediately before and after the programme.

Raw data:

Participant	Pre-MBCT ( $x_{1i}$ )	Post-MBCT ( $x_{2i}$ )	$d_i = x_{1i} - x_{2i}$
1	18	11	7
2	22	14	8
3	15	10	5
4	20	16	4
5	25	17	8
6	13	9	4
7	19	12	7
8	17	14	3
9	21	13	8
10	16	11	5
11	24	16	8
12	14	10	4
13	20	15	5
14	18	12	6
15	23	15	8

Step 1 — Normality check on differences:

Differences: 7, 8, 5, 4, 8, 4, 7, 3, 8, 5, 8, 4, 5, 6, 8

Shapiro-Wilk: $W = 0.913$ , $p = .154$ — normality not violated; proceed with paired t-test.

Step 2 — Descriptive statistics:

$\sum d_i = 7+8+5+4+8+4+7+3+8+5+8+4+5+6+8 = 90$

$\bar{d} = 90/15 = 6.000$

$\sum d_i^2 = 49+64+25+16+64+16+49+9+64+25+64+16+25+36+64 = 596$

$s_d = \sqrt{\frac{\sum d_i^2 - n\bar{d}^2}{n-1}} = \sqrt{\frac{596 - 15(36)}{14}} = \sqrt{\frac{596-540}{14}} = \sqrt{\frac{56}{14}} = \sqrt{4.000} = 2.000$

Condition means and SDs:

Pre-MBCT: $\bar{x}_1 = (18+22+15+20+25+13+19+17+21+16+24+14+20+18+23)/15 = 285/15 = 19.000$

$s_1 = \sqrt{\sum(x_{1i}-19)^2/14}$ : values $(1,9,16,1,36,36,0,4,4,9,25,25,1,1,16)$ , sum $= 184$ , $s_1 = \sqrt{184/14} = \sqrt{13.143} = 3.625$

Post-MBCT: $\bar{x}_2 = 19.000 - 6.000 = 13.000$

$s_2 = \sqrt{\sum(x_{2i}-13)^2/14}$ : values $(4,1,9,9,16,16,1,1,0,4,9,9,4,1,4)$ , sum $= 88$ , $s_2 = \sqrt{88/14} = \sqrt{6.286} = 2.507$

Within-pair correlation:

$r_{12} = \frac{s_1^2 + s_2^2 - s_d^2}{2s_1 s_2} = \frac{13.143 + 6.286 - 4.000}{2 \times 3.625 \times 2.507} = \frac{15.429}{18.176} = 0.849$

Step 3 — Standard error:

$SE_{\bar{d}} = s_d/\sqrt{n} = 2.000/\sqrt{15} = 2.000/3.873 = 0.5164$

Step 4 — t-statistic:

$t = \bar{d}/SE_{\bar{d}} = 6.000/0.5164 = 11.619$

Step 5 — Degrees of freedom and p-value:

$\nu = 15-1 = 14$

$p = 2 \times P(T_{14} \geq 11.619) < .001$

Step 6 — 95% CI for mean difference:

$t_{.025,\;14} = 2.145$

$6.000 \pm 2.145 \times 0.5164 = 6.000 \pm 1.108 = [4.892, 7.108]$

Step 7 — Effect sizes:

Cohen's $d_z$ :

$d_z = \bar{d}/s_d = 6.000/2.000 = 3.000$

$d_z = t/\sqrt{n} = 11.619/\sqrt{15} = 11.619/3.873 = 3.000$ ✅

Hedges' $g_z$ :

$J = 1 - 3/(4 \times 14 - 1) = 1 - 3/55 = 1 - 0.0545 = 0.9455$

$g_z = 3.000 \times 0.9455 = 2.836$

Cohen's $d_{av}$ :

$s_{av} = (3.625 + 2.507)/2 = 3.066$

$d_{av} = 6.000/3.066 = 1.957$

Cohen's $d_{rm}$ :

$d_{rm} = d_{av} \times \sqrt{2(1-r_{12})} = 1.957 \times \sqrt{2(1-0.849)} = 1.957 \times \sqrt{0.302} = 1.957 \times 0.550 = 1.076$

Common Language Effect Size:

$CL = \Phi(3.000/\sqrt{2}) = \Phi(2.121) = 0.983$

Step 8 — 95% CI for $d_z$ (approximate):

$SE_{d_z} = \sqrt{1/15 + 3.000^2/(2 \times 14)} = \sqrt{0.0667 + 0.3214} = \sqrt{0.3881} = 0.623$

$95\%\text{ CI}: 3.000 \pm 1.96(0.623) = [1.779, 4.221]$

Summary table:

Statistic	Value	Interpretation
Pre-MBCT mean	$19.00$ PHQ-9 pts	Moderate-severe depression
Post-MBCT mean	$13.00$ PHQ-9 pts	Mild depression
Mean difference ( $\bar{d}$ )	$6.00$ pts	6-point reduction
$SD_d$	$2.00$ pts	Low variability in change
$r_{12}$ (pre-post)	$0.849$	High pre-post correlation
$t(14)$	$11.619$
$p$ (two-tailed)	$< .001$	Highly significant
95% CI for $\mu_d$	$[4.89, 7.11]$	Excludes 0
Cohen's $d_z$	$3.000$	Huge effect
Hedges' $g_z$	$2.836$	Huge (bias-corrected)
Cohen's $d_{av}$	$1.957$	Huge
Cohen's $d_{rm}$	$1.076$	Large-very large
95% CI for $d_z$	$[1.78, 4.22]$
CL	$98.3\%$

APA write-up: "A paired samples t-test was conducted to evaluate whether PHQ-9 depression scores changed from pre- to post-MBCT. Difference scores were normally distributed as assessed by Shapiro-Wilk ( $W = 0.91$ , $p = .154$ ). MBCT produced a statistically significant reduction in depression scores ( $M_{pre} = 19.00$ , $SD_{pre} = 3.63$ ; $M_{post} = 13.00$ , $SD_{post} = 2.51$ ), $t(14) = 11.62$ , $p < .001$ , $d_z = 3.00$ [95% CI: 1.78, 4.22], $d_{av} = 1.96$ . The mean reduction of 6.00 PHQ-9 points [95% CI: 4.89, 7.11] represents a clinically large and statistically robust treatment effect. 98.3% of participants showed greater improvement than would be expected by chance."

Example 2: Crossover Drug Trial — Pain Reduction

A pharmacologist conducts a double-blind crossover study comparing Drug A vs. Drug B on pain ratings (0–100 VAS; lower = less pain) in $n = 12$ participants with chronic back pain. Each participant receives both drugs in randomised order with a 2-week washout between. Difference = Drug A − Drug B (positive = Drug A produces more pain).

Raw data:

Participant	Drug A ( $x_{1i}$ )	Drug B ( $x_{2i}$ )	$d_i$
1	45	38	7
2	62	55	7
3	33	41	−8
4	58	49	9
5	70	63	7
6	41	47	−6
7	55	50	5
8	48	55	−7
9	64	58	6
10	52	46	6
11	39	44	−5
12	60	56	4

Step 1 — Normality check:

Differences: 7, 7, −8, 9, 7, −6, 5, −7, 6, 6, −5, 4

Shapiro-Wilk: $W = 0.964$ , $p = .836$ — normality holds.

Step 2 — Descriptive statistics for differences:

$\sum d_i = 7+7-8+9+7-6+5-7+6+6-5+4 = 25$

$\bar{d} = 25/12 = 2.083$

$\sum d_i^2 = 49+49+64+81+49+36+25+49+36+36+25+16 = 515$

$s_d = \sqrt{(515 - 12 \times 2.083^2)/11} = \sqrt{(515 - 52.08)/11} = \sqrt{462.92/11} = \sqrt{42.084} = 6.487$

Step 3 — Standard error:

$SE_{\bar{d}} = 6.487/\sqrt{12} = 6.487/3.464 = 1.873$

Step 4 — t-statistic:

$t = 2.083/1.873 = 1.112$

Step 5 — Degrees of freedom and p-value:

$\nu = 12-1 = 11$

$p = 2 \times P(T_{11} \geq 1.112) = 2 \times 0.146 = .292$

Step 6 — 95% CI:

$t_{.025,\;11} = 2.201$

$2.083 \pm 2.201 \times 1.873 = 2.083 \pm 4.121 = [-2.038, 6.204]$

Step 7 — Effect sizes:

$d_z = \bar{d}/s_d = 2.083/6.487 = 0.321$

$g_z = 0.321 \times (1 - 3/(4 \times 11-1)) = 0.321 \times (1 - 3/43) = 0.321 \times 0.930 = 0.299$

95% CI for $d_z$ (approximate):

$SE_{d_z} = \sqrt{1/12 + 0.321^2/(2 \times 11)} = \sqrt{0.0833 + 0.00468} = \sqrt{0.0880} = 0.297$

$95\%\text{ CI}: 0.321 \pm 1.96(0.297) = [-0.261, 0.903]$

The CI spans from negative (Drug B better) to positive (Drug A better), confirming non-significance.

Step 8 — Equivalence test (TOST)

The pharmacologist wishes to establish whether the drugs are equivalent within $\pm 10$ VAS points ( $\Delta = 10$ ).

$t_1 = (2.083 - (-10))/1.873 = 12.083/1.873 = 6.449$

$t_2 = (2.083 - 10)/1.873 = -7.917/1.873 = -4.227$

Both $t_1 = 6.449$ and $|t_2| = 4.227$ exceed $t_{.05,\;11} = 1.796$ (one-tailed).

The 90% CI for $\mu_d$ :

$2.083 \pm 1.796 \times 1.873 = 2.083 \pm 3.364 = [-1.281, 5.447]$

Since the 90% CI $[-1.28, 5.45]$ falls entirely within $[-10, +10]$ , equivalence is established at $\alpha = .05$ .

Summary:

Statistic	Value	Interpretation
Mean diff (A − B)	$2.08$ pts	Drug A produces slightly higher pain
$SD_d$	$6.49$ pts	High variability in within-person differences
$t(11)$	$1.112$
$p$ (two-tailed)	$.292$	Not significant
95% CI for $\mu_d$	$[-2.04, 6.20]$	Includes 0
Cohen's $d_z$	$0.321$ (Small)
Hedges' $g_z$	$0.299$
95% CI for $d_z$	$[-0.26, 0.90]$
TOST result	Equivalent	90% CI within $\pm$ 10 pts

APA write-up: "A paired samples t-test examined whether Drug A and Drug B differed in pain relief in a crossover design ( $n = 12$ ). Difference scores were normally distributed ( $W = 0.964$ , $p = .836$ ). The mean pain rating was not significantly different for Drug A ( $M = 52.25$ , $SD = 11.43$ ) vs. Drug B ( $M = 50.17$ , $SD = 8.42$ ), $t(11) = 1.11$ , $p = .292$ , $d_z = 0.32$ [95% CI: $-$ 0.26, 0.90]. The 95% CI for the mean difference was [ $-$ 2.04, 6.20] VAS points. A TOST equivalence test with bounds of $\pm$ 10 VAS points demonstrated that the drugs are equivalent in pain relief, with the 90% CI [ $-$ 1.28, 5.45] falling entirely within the equivalence interval."

Example 3: Reaction Time — Noise vs. Silence Condition

A cognitive psychologist tests whether background noise affects simple reaction time (ms) in $n = 25$ university students. Each participant completes both a silent and a noise condition (order counterbalanced). $d_i = RT_{noise} - RT_{silence}$ (positive = noise increases RT).

Summary statistics (raw data not shown):

$n = 25, \quad \bar{x}_{noise} = 312.4\text{ ms}, \quad \bar{x}_{silence} = 298.7\text{ ms}$

$s_{noise} = 41.3\text{ ms}, \quad s_{silence} = 38.6\text{ ms}, \quad r_{12} = 0.82$

Step 1 — Compute $s_d$ :

$s_d = \sqrt{s_1^2 + s_2^2 - 2r_{12}s_1s_2} = \sqrt{41.3^2 + 38.6^2 - 2(0.82)(41.3)(38.6)}$

$= \sqrt{1705.69 + 1489.96 - 2617.40} = \sqrt{578.25} = 24.05\text{ ms}$

Step 2 — Mean difference:

$\bar{d} = 312.4 - 298.7 = 13.7\text{ ms}$

Step 3 — Standard error:

$SE_{\bar{d}} = 24.05/\sqrt{25} = 24.05/5 = 4.81\text{ ms}$

Step 4 — t-statistic:

$t = 13.7/4.81 = 2.849$

Step 5 — df and p-value:

$\nu = 24$ ; $p = 2 \times P(T_{24} \geq 2.849) = 2 \times .0045 = .009$

Step 6 — 95% CI:

$t_{.025,\;24} = 2.064$

$13.7 \pm 2.064 \times 4.81 = 13.7 \pm 9.93 = [3.77, 23.63]\text{ ms}$

Step 7 — Effect sizes:

$d_z = 13.7/24.05 = 0.570 \quad (\text{or } t/\sqrt{n} = 2.849/5 = 0.570$ )

$g_z = 0.570 \times (1-3/(4\times24-1)) = 0.570 \times 0.969 = 0.552$

$s_{av} = (41.3+38.6)/2 = 39.95\text{ ms}$

$d_{av} = 13.7/39.95 = 0.343$

$d_{rm} = 0.343 \times \sqrt{2(1-0.82)} = 0.343 \times \sqrt{0.36} = 0.343 \times 0.600 = 0.206$

$CL = \Phi(0.570/\sqrt{2}) = \Phi(0.403) = 0.657$

Contrast: What if this had been run as (incorrect) independent test?

$s_{pooled} = \sqrt{(24 \times 41.3^2 + 24 \times 38.6^2)/48} = \sqrt{(40936.56+35757.84)/48} = \sqrt{1597.80} = 39.97$

$t_{ind} = 13.7/(39.97\sqrt{2/25}) = 13.7/11.318 = 1.211, \quad p = .230$

The incorrect independent t-test fails to detect the effect ( $p = .230$ ) while the paired test clearly identifies it ( $p = .009$ ). This illustrates the dramatic power advantage of the paired design when $r_{12} = 0.82$ .

Summary:

Statistic	Value
Mean RT: Noise	$312.4$ ms
Mean RT: Silence	$298.7$ ms
Mean difference $\bar{d}$	$13.7$ ms (noise slower)
$s_d$	$24.05$ ms
$r_{12}$	$0.820$ (high pairing efficiency)
$t(24)$ (paired)	$2.849$
$p$ (paired, two-tailed)	$.009$
$t$ (if independent, incorrect)	$1.211$
$p$ (if independent, incorrect)	$.230$
95% CI for $\mu_d$	$[3.77, 23.63]$ ms
Cohen's $d_z$	$0.570$ (Medium)
Cohen's $d_{av}$	$0.343$ (Small-Medium)
Cohen's $d_{rm}$	$0.206$ (Small)
CL	$65.7\%$

APA write-up: "A paired samples t-test examined whether background noise affected reaction time. Participants were significantly slower in the noise condition ( $M = 312.4$ ms, $SD = 41.3$ ms) than in the silence condition ( $M = 298.7$ ms, $SD = 38.6$ ms), $t(24) = 2.85$ , $p = .009$ , $d_z = 0.57$ [95% CI: 0.14, 0.99], $d_{av} = 0.34$ . The mean slowing of 13.7 ms [95% CI: 3.77, 23.63 ms] represents a medium within-subjects effect. The high pre-post correlation ( $r = 0.82$ ) confirms the efficiency of the paired design."

Example 4: Small Sample with Non-Significant Result — Teaching Method

An education researcher compares mathematics test scores ( $n = 10$ students) before and after a new tutoring method. Scores range 0–100.

Data:

Student	Before ( $x_{1i}$ )	After ( $x_{2i}$ )	$d_i$
1	62	67	−5
2	78	82	−4
3	55	59	−4
4	71	78	−7
5	83	84	−1
6	67	73	−6
7	59	61	−2
8	74	76	−2
9	88	91	−3
10	61	66	−5

Note: $d_i = Before - After$ , so negative $d_i$ indicates improvement.

$\bar{d} = (-5-4-4-7-1-6-2-2-3-5)/10 = -39/10 = -3.900$

$\sum d_i^2 = 25+16+16+49+1+36+4+4+9+25 = 185$

$s_d = \sqrt{(185-10\times3.900^2)/9} = \sqrt{(185-152.10)/9} = \sqrt{32.9/9} = \sqrt{3.656} = 1.912$

$SE = 1.912/\sqrt{10} = 0.605$

$t = -3.900/0.605 = -6.446$ ; $\nu = 9$ ; $p < .001$

95% CI: $-3.900 \pm 2.262 \times 0.605 = [-5.268, -2.532]$

$d_z = 3.900/1.912 = 2.039$ (Huge); $g_z = 2.039 \times (1-3/35) = 2.039 \times 0.914 = 1.864$

Despite the small sample, the effect is large and the result is highly significant because individual differences in change scores are small relative to the mean improvement.

Note on interpreting CIs with small samples:

$SE_{d_z} = \sqrt{1/10 + 2.039^2/18} = \sqrt{0.100+0.231} = \sqrt{0.331} = 0.575$

$95\%\text{ CI for } d_z: 2.039 \pm 1.96(0.575) = [0.912, 3.166]$

Wide CI but entirely positive — effect is definitively large even at the lower bound.

APA write-up: "A paired samples t-test showed that student mathematics scores improved significantly after tutoring ( $M_{before} = 69.8$ , $SD = 10.8$ ; $M_{after} = 73.7$ , $SD = 10.8$ ), $t(9) = 6.45$ , $p < .001$ , $d_z = 2.04$ [95% CI: 0.91, 3.17]. The mean improvement of 3.9 points [95% CI: 2.53, 5.27] represents a very large within-subjects effect, indicating the tutoring method produced consistent, substantial gains across students."

13. Common Mistakes and How to Avoid Them

Mistake 1: Using the Independent Samples t-Test for Paired Data

Problem: Treating pre-post measurements (or matched pairs) as independent groups and running an independent samples t-test. This ignores the within-pair correlation $r_{12}$ , inflates the error term with between-person variability, severely reduces power, and may produce a non-significant result for a large, real effect.

How serious: In Example 3 above, the paired test correctly detected an effect at $p = .009$ , while the incorrect independent test gave $p = .230$ . When $r_{12} = 0.82$ , the independent test has less than 50% of the power of the paired test.

Solution: Before running any test, determine the study design: if each participant contributes two scores, the paired t-test is required. Check the data file structure — paired data should have one row per participant (or matched pair), not one row per observation.

Mistake 2: Reporting $d_z$ Without Acknowledging It Is Not Comparable to Between-Subjects $d$

Problem: Reporting $d_z = 1.50$ from a paired design and implying this is comparable to $d = 1.50$ from a between-subjects study. Because $s_d < s_{pooled}$ when $r_{12} > 0$ , $d_z$ will be systematically larger than $d_{av}$ or $d$ from an independent design for the same mean difference.

How serious: For $r_{12} = 0.80$ , $d_z$ is $1.58 \times d_{av}$ — reporting $d_z = 1.50$ when the comparable between-subjects $d$ would be $\approx 0.95$ could grossly inflate perceived effect sizes in a research domain.

Solution: Always specify the effect size variant. Report both $d_z$ and $d_{av}$ (or $d_{rm}$ ). When comparing to between-subjects studies, use $d_{av}$ or $d_{rm}$ .

Mistake 3: Not Checking Normality of the Difference Scores

Problem: Applying the paired t-test without checking whether the difference scores are approximately normally distributed. This is especially risky with small samples ( $n < 30$ ), where the CLT does not yet provide adequate protection, and the t-test's p-values may be inaccurate under skewed or heavy-tailed difference distributions.

Solution: Always run the Shapiro-Wilk test on the difference scores (not on the raw scores) and inspect the Q-Q plot of differences. If normality is violated and $n < 30$ , use the Wilcoxon Signed-Rank test.

Mistake 4: Running Separate t-Tests on Each Condition Instead of a Paired Test

Problem: Testing whether Condition 1 mean differs from zero, then testing whether Condition 2 mean differs from zero, and comparing the significance of the two tests. This approach is fundamentally flawed — a condition can be significantly different from zero in both tests but not significantly different from each other, or vice versa.

Solution: The appropriate question is whether the mean difference between conditions is significant. Use the paired t-test, which directly tests $H_0: \mu_d = 0$ .

Mistake 5: Failing to Report the 95% CI for the Mean Difference in Original Units

Problem: Reporting only $t$ , $p$ , and $d_z$ without reporting the 95% CI for $\mu_d$ in the original measurement units. The CI in original units is the most practically interpretable result — it tells readers how large the difference is in terms they can evaluate against a minimum important clinical or practical difference.

Solution: Always report the 95% CI for $\bar{d}$ in original units, alongside the CI for the effect size $d_z$ . For clinical or applied research, also discuss whether the CI for the mean difference exceeds a minimally important clinical difference (MICD).

Mistake 6: Treating a Non-Significant Result as Evidence of No Change

Problem: Reporting $t(n) = 1.20$ , $p = .25$ and concluding "the intervention had no effect." A non-significant result only means the data are insufficient to reject $H_0$ under the test's sensitivity — it does NOT establish that the true effect is zero. With small $n$ , even large effects fail to reach significance.

Solution: Report the 95% CI for the mean difference. If the CI is wide and includes both clinically trivial and clinically meaningful differences, explicitly acknowledge the study's limited power rather than claiming no effect. Use equivalence testing (TOST) with pre-specified bounds if the research goal is to demonstrate absence of a meaningful effect.

Mistake 7: Applying a One-Tailed Test After Observing the Data Direction

Problem: Observing that $\bar{d}$ is positive, then switching to an upper one-tailed test to achieve $p = .03$ when the two-tailed result was $p = .06$ . This is p-hacking and doubles the effective Type I error rate.

Solution: Directional hypotheses must be pre-registered before data collection. Document the hypothesis direction in a pre-registration (e.g., on the OSF) before seeing any data. In the absence of a pre-registered directional prediction, use two-tailed tests.

Mistake 8: Using the Same Participants Twice Without Pairing

Problem: Collecting data from 30 participants under two conditions but entering all 60 observations as an independent-groups design. This creates pseudo-replication, violates independence, and severely inflates Type I error rates because the 60 observations are not all independent.

Solution: Understand the design. If each participant provided data under both conditions, the observations are paired and the within-subjects structure must be accounted for in the analysis (paired t-test, or repeated measures ANOVA for $K > 2$ ).

Mistake 9: Ignoring Carryover Effects in Crossover Designs

Problem: In crossover designs, the effect of the first condition may carry over and influence responses in the second condition. Failing to account for order effects can bias the estimate of the mean difference, making the paired comparison misleading.

Solution: Use proper washout periods between conditions. Test for order effects by including condition order as a factor. If order effects are significant, report this and consider using only the first-condition data or modelling the order effect explicitly.

Mistake 10: Not Specifying $\delta_0$ When Testing Non-Zero Nulls

Problem: Testing whether a treatment effect exceeds a clinically meaningful threshold (e.g., a 5-point improvement on a 100-point scale) using the default $H_0: \mu_d = 0$ instead of $H_0: \mu_d = 5$ . The default test does not answer the right question.

Solution: Set the null hypothesis value $\delta_0$ to the minimum clinically important difference (MCID) before running the test. In DataStatPro, enter this value in the "Null Hypothesis Value" field.

14. Troubleshooting

Problem	Likely Cause	Solution
$t$ is extremely large ( $\vert t \vert > 10$ )	Very small $s_d$ (participants all changed by nearly the same amount) or data entry error	Check data; if genuine, report with interpretation — very consistent within-person change
$\vert s_d \vert = 0$	All difference scores are identical	Verify data; if genuine, the test is degenerate — all participants changed by exactly the same amount
$p = 1.000$ exactly	$\bar{d} = 0$ exactly	All differences cancel out; report $d_z = 0$ , interpret as no mean change
Shapiro-Wilk significant on large sample ( $n > 50$ )	High power of normality test; minor deviations detected	With $n \geq 30$ , CLT provides protection; inspect Q-Q plot for severity; t-test likely valid
$d_z \gg d_{av}$	High within-pair correlation ( $r_{12}$ large)	Both are correct; $d_z$ reflects paired design efficiency; $d_{av}$ more comparable to between-subjects $d$
Paired t significant but Wilcoxon signed-rank not significant (or vice versa)	Distributional issues or tied difference scores	Check normality; if differences are non-normal, trust Wilcoxon; report both with rationale
95% CI for $d_z$ is very wide	Small $n$ ( $n < 15$ )	Report wide CI — it is the truthful reflection of low precision; use exact (non-central t) CI from DataStatPro
Equivalence test fails despite small $d_z$	Equivalence bounds are too tight for available $n$	Increase $n$ for replication; widen bounds with theoretical justification or accept insufficient precision
$r_{12}$ is negative	Rare; could arise from counterbalancing with contrast effects	Verify measurement; pairing reduces power when $r_{12} < 0$ — consider independent test
$d_{rm} > d_{av}$	$r_{12} < 0.50$ ; correction factor $\sqrt{2(1-r_{12})} > 1$	Both values correct; report both; specify which is primary
Bayes Factor $BF_{10} \approx 1$	Insensitive data; study underpowered	Collect more data; report $BF_{10}$ as reflecting insensitivity rather than evidence for either hypothesis
TOST bounds are difficult to specify	Lack of prior knowledge about MCID	Consult domain literature; use $d_z = 0.20$ as a generic "trivially small" effect bound; pre-register choice
Dataset has missing values for some pairs	Incomplete data collection; attrition	Use complete-case analysis if MCAR; use multiple imputation or MLM if MAR; document clearly
Two conditions have very different SDs ( $s_1/s_2 > 2$ )	Treatment changes variability	Note heteroscedasticity; consider Glass's $\Delta$ (baseline SD) rather than $d_{av}$ ; Wilcoxon is robust

15. Quick Reference Cheat Sheet

Core Formulas

Formula	Description
$d_i = x_{1i} - x_{2i}$	Difference score for pair $i$
$\bar{d} = \frac{1}{n}\sum d_i = \bar{x}_1 - \bar{x}_2$	Mean difference
$s_d = \sqrt{\frac{\sum(d_i-\bar{d})^2}{n-1}}$	SD of differences
$SE_{\bar{d}} = s_d/\sqrt{n}$	Standard error of mean difference
$t = (\bar{d} - \delta_0)/(s_d/\sqrt{n})$	Paired t-statistic
$\nu = n-1$	Degrees of freedom
$p = 2 \times P(T_{n-1} \geq	t
$\bar{d} \pm t_{\alpha/2,\;n-1} \times s_d/\sqrt{n}$	95% CI for mean difference
$s_d^2 = s_1^2 + s_2^2 - 2r_{12}s_1s_2$	$s_d$ from raw score statistics
$r_{12} = (s_1^2+s_2^2-s_d^2)/(2s_1s_2)$	Pre-post correlation from summary stats

Effect Size Formulas

Formula	Description
$d_z = \bar{d}/s_d = t/\sqrt{n}$	Cohen's $d_z$ (most common for paired)
$g_z = d_z \times (1-3/(4n-5))$	Hedges' $g_z$ (bias-corrected)
$d_{av} = \bar{d}/s_{av},\quad s_{av}=(s_1+s_2)/2$	Cohen's $d_{av}$ (comparable to between)
$d_{rm} = d_{av}\sqrt{2(1-r_{12})}$	Corrected $d$ (most cross-design comparable)
$\Delta = \bar{d}/s_1$	Glass's $\Delta$ (baseline standardiser)
$d_z = d_{av}/\sqrt{2(1-r_{12})}$	Converting $d_{av}$ to $d_z$
$d_{av} = d_z\sqrt{2(1-r_{12})}$	Converting $d_z$ to $d_{av}$
$CL = \Phi(d_z/\sqrt{2})$	Common Language Effect Size
$SE_{d_z} \approx \sqrt{1/n + d_z^2/(2(n-1))}$	Approximate SE for CI of $d_z$
$\lambda = d_z\sqrt{n}$	Non-centrality parameter for power

TOST Equivalence Test Formulas

Formula	Description
$t_1 = (\bar{d}+\Delta_L)/(s_d/\sqrt{n})$	Lower TOST t-statistic
$t_2 = (\bar{d}-\Delta_U)/(s_d/\sqrt{n})$	Upper TOST t-statistic
90% CI within $(-\Delta_L, \Delta_U)$	Equivalence decision criterion

Effect Size Variant Comparison

Variant	Denominator	Comparable To	When to Use
$d_z$	$s_d$	Other paired designs only	Within-study; paired vs. paired
$g_z$	$s_d$ (corrected)	Other paired designs	Small samples ( $n < 20$ )
$d_{av}$	$(s_1+s_2)/2$	Between-subjects $d$	Cross-design comparison
$d_{rm}$	$(s_1+s_2)/2$ corrected	Most generalised	Meta-analysis; cross-design
Glass's $\Delta$	$s_1$ (pre-test)	Between-subjects from baseline	Pre-post change from baseline

Cohen's Benchmarks for $d_z$

$\vert d_z \vert$	Label	$CL$ (%)	$U_3$ (%)
$< 0.10$	Tiny	$< 52.8\%$	$< 54.0\%$
$0.10 - 0.19$	Very small	$52.8 - 55.3\%$	$54.0 - 57.5\%$
$0.20 - 0.49$	Small	$55.6 - 63.4\%$	$57.9 - 68.8\%$
$0.50 - 0.79$	Medium	$63.8 - 71.1\%$	$69.1 - 78.5\%$
$0.80 - 1.19$	Large	$71.4 - 80.0\%$	$78.8 - 88.3\%$
$1.20 - 1.99$	Very large	$80.2 - 92.0\%$	$88.5 - 97.7\%$
$\geq 2.00$	Huge	$\geq 92.1\%$	$\geq 97.7\%$

Required Sample Size (Pairs) — Two-Tailed $\alpha = .05$

$d_z$	Power = 0.70	Power = 0.80	Power = 0.90	Power = 0.95
0.20	185	264	354	434
0.30	83	119	160	196
0.40	47	67	90	111
0.50	31	44	59	73
0.60	22	31	42	52
0.80	13	18	24	30
1.00	9	13	17	21
1.20	7	9	13	15
1.50	5	7	9	11
2.00	4	5	6	8

APA 7th Edition Reporting Templates

Full APA report (raw data available):

"A paired samples t-test was conducted to examine whether [DV] differed between [Condition 1] and [Condition 2]. Difference scores were [normally / not normally] distributed as assessed by Shapiro-Wilk ( $W =$ [value], $p =$ [value]). [Condition 1] ( $M =$ , $SD =$ ) [was / was not] significantly [higher / lower] than [Condition 2] ( $M =$ , $SD =$ ), $t(n-1) =$ [value], $p =$ [value], $d_z =$ [value] [95% CI: LB, UB]. The mean difference of [value] [units] [95% CI: LB, UB] represents a [small / medium / large] within-subjects effect."

Compact format (for results section):

$t(n-1) =$ [value], $p =$ [value], $d_z =$ [value] [95% CI: LB, UB], $M_d =$ [value] [units] [95% CI: LB, UB].

Non-significant result with equivalence:

"The mean difference was not statistically significant, $t(n-1) =$ [value], $p =$ [value], $d_z =$ [value] [95% CI: LB, UB]. A TOST equivalence test with bounds of $\pm\Delta$ [units] [demonstrated / failed to demonstrate] equivalence at $\alpha = .05$ , with the 90% CI [LB, UB] falling [entirely within / outside] the equivalence interval."

Bayesian paired t-test:

"A Bayesian paired t-test with the default Cauchy prior ( $r = \sqrt{2}/2$ ) provided [extreme / very strong / strong / moderate / anecdotal / no] evidence for [H $_1$ : $\mu_d \neq 0$ / H $_0$ : $\mu_d = 0$ ], $BF_{10} =$ [value]."

Test Decision Flowchart

Two related conditions, continuous DV?
├── YES
│   └── Are difference scores approximately normally distributed?
│       (Check: Shapiro-Wilk on d_i; Q-Q plot of d_i)
│       ├── YES (or n ≥ 30)
│       │   └── Paired t-test ✅
│       │       ├── Significant F: Report t, p, CI, d_z, d_av
│       │       ├── Non-significant: Report CI, sensitivity analysis
│       │       └── Claiming equivalence: Add TOST
│       └── NO (and n < 30)
│           └── Wilcoxon Signed-Rank Test ✅
│               └── Report W, z, p, r_rb
└── NO
    ├── Ordinal DV → Wilcoxon Signed-Rank Test
    └── 3+ conditions → Repeated Measures ANOVA

Assumption Checks Reference

Assumption	Check	Tool	Action if Violated
Normality of $d_i$	Shapiro-Wilk on differences	`shapiro.test(d)` in R	Wilcoxon signed-rank; transform
Independence of pairs	Design review	Study protocol	Multilevel model if clustered
Correct pairing	ID matching	Inspect data file	Re-match; verify data entry
Interval scale	Measurement theory	Conceptual check	Wilcoxon signed-rank
No influential outliers	Boxplot, $z_i > 3$ of $d_i$	`boxplot(d)`	Investigate; robust t-test

Paired t-Test Reporting Checklist

Item	Required
Mean and SD for each condition	✅ Always
Mean and SD of difference scores	✅ Always
t-statistic with $\nu = n-1$	✅ Always
Exact p-value (or $p < .001$ )	✅ Always
95% CI for mean difference (original units)	✅ Always
Cohen's $d_z$ with 95% CI	✅ Always
Which $d$ variant reported ( $d_z$ , $d_{av}$ , etc.)	✅ Always
Sample size $n$ (number of pairs)	✅ Always
Shapiro-Wilk result on differences	✅ When $n < 50$
Hedges' $g_z$ instead of $d_z$	✅ When $n < 20$
$d_{av}$ or $d_{rm}$ alongside $d_z$	✅ When comparing to between-subjects
$r_{12}$ (pre-post/within-pair correlation)	✅ Recommended
TOST result if claiming null	✅ When claiming no meaningful difference
Bayes Factor	✅ For ambiguous or null results
Power or sensitivity analysis	✅ For null or inconclusive results
Direction of effect stated	✅ Always
Domain-specific benchmark context	✅ Recommended

Conversion Formulas: Paired $\leftrightarrow$ Other Metrics

From	To	Formula
$t$ , $n$	$d_z$	$d_z = t/\sqrt{n}$
$d_z$ , $n$	$t$	$t = d_z\sqrt{n}$
$d_z$ , $r_{12}$ , $s_1$ , $s_2$	$d_{av}$	$d_{av} = d_z\sqrt{2(1-r_{12})}$
$d_{av}$ , $r_{12}$	$d_z$	$d_z = d_{av}/\sqrt{2(1-r_{12})}$
$d_{av}$ , $r_{12}$	$d_{rm}$	$d_{rm} = d_{av}\sqrt{2(1-r_{12})}$
$d_z$ , $n$	$r$ (point-biserial)	$r = d_z/\sqrt{d_z^2+1}$ (approx)
$d_z$	$CL$	$CL = \Phi(d_z/\sqrt{2})$
$d_z$	$U_3$	$U_3 = \Phi(d_z)$
$d_z$ , $r_{12}$	$d_{between}$ (comparable)	Use $d_{rm}$

This tutorial provides a comprehensive foundation for understanding, conducting, and reporting the Paired t-Test within the DataStatPro application. For further reading, consult Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018) for an applied introduction; Lakens's "Calculating and Reporting Effect Sizes to Facilitate Cumulative Science" (Frontiers in Psychology, 2013) for the critical discussion of $d_z$ vs. $d_{av}$ and $d_{rm}$ ; Morris & DeShon's "Combining Effect Size Estimates in Meta-Analysis With Repeated Measures and Independent-Groups Designs" (Psychological Methods, 2002) for the $d_{rm}$ formula; Rouder et al.'s "Bayesian t-Tests for Accepting and Rejecting the Null Hypothesis" (Psychonomic Bulletin & Review, 2009) for the Bayesian approach; and Lakens's "Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses" (Social Psychological and Personality Science, 2017) for TOST equivalence testing. For feature requests or support, contact the DataStatPro team.

Participant	Pre-MBCT ( $x_{1i}$ )	Post-MBCT ( $x_{2i}$ )	$d_i = x_{1i} - x_{2i}$
1	18	11	7
2	22	14	8
3	15	10	5
4	20	16	4
5	25	17	8
6	13	9	4
7	19	12	7
8	17	14	3
9	21	13	8
10	16	11	5
11	24	16	8
12	14	10	4
13	20	15	5
14	18	12	6
15	23	15	8

Participant	Pre-MBCT ( $x_{1i}$ )	Post-MBCT ( $x_{2i}$ )	$d_i = x_{1i} - x_{2i}$
1	18	11	7
2	22	14	8
3	15	10	5
4	20	16	4
5	25	17	8
6	13	9	4
7	19	12	7
8	17	14	3
9	21	13	8
10	16	11	5
11	24	16	8
12	14	10	4
13	20	15	5
14	18	12	6
15	23	15	8

Paired Samples t-Test

Paired t-Test: Zero to Hero Tutorial

Table of Contents

1. Prerequisites and Background Concepts

1.1 Populations, Parameters, and Paired Designs

1.2 Why Pairing Matters: Removing Between-Person Variability

1.3 The Sampling Distribution of the Mean Difference

1.4 The t-Distribution and Degrees of Freedom

1.5 The Null and Alternative Hypotheses

1.6 Statistical Significance vs. Practical Significance

1.7 Confidence Intervals and Their Relationship to the Test

1.8 Type I and Type II Errors

2. What is a Paired t-Test?

2.1 The Core Idea

2.2 When to Use a Paired t-Test

2.3 What Makes Observations "Paired"?

2.4 The Paired t-Test vs. Related Procedures

2.5 The Power Advantage of Pairing

3. The Mathematics Behind the Paired t-Test

3.1 The Difference Score Reduction

3.2 The t-Statistic

3.3 Standard Error of the Mean Difference

3.4 The p-value

3.5 Relationship Between sds_dsd​ and the Raw Score Statistics

3.6 The Mean Difference and Its Relationship to Raw Means

3.7 Computing the t-Statistic from Summary Statistics

3.8 Non-Central t-Distribution and Exact CIs for Effect Sizes

3.9 Statistical Power of the Paired t-Test

4. Assumptions of the Paired t-Test

4.1 Normality of Difference Scores

4.2 Independence of Pairs

4.3 Correct Pairing

4.4 Interval Scale of Measurement

4.5 Absence of Influential Outliers in Difference Scores

4.6 Assumption Summary Table

5. Variants of the Paired t-Test

5.1 Overview of Effect Size Variants

5.2 Cohen's dzd_zdz​ — The Standardised Mean Difference of Differences

5.3 Cohen's davd_{av}dav​ — Average Standard Deviation Standardiser

5.4 Cohen's drmd_{rm}drm​ — Repeated Measures Corrected

5.5 Glass's Δ\DeltaΔ for Pre-Post Designs

5.6 Relationship Between dzd_zdz​ and davd_{av}dav​

6. Using the Paired t-Test Calculator Component

Step-by-Step Guide

7. Full Step-by-Step Procedure

7.1 Complete Computational Procedure

Step 1 — Verify and Arrange the Data

Step 2 — Compute the Mean Difference

Step 3 — Compute the Standard Deviation of Differences

Step 4 — Compute the Standard Error

Step 5 — Check the Normality Assumption

Step 6 — Compute the t-Statistic

Step 7 — Determine Degrees of Freedom

Step 8 — Compute the p-value

Step 9 — Compute the 95% Confidence Interval for μd\mu_dμd​

Step 10 — Compute Effect Sizes

Step 11 — Compute the 95% CI for Cohen's dzd_zdz​

Step 12 — Interpret and Report

8. Effect Sizes for the Paired t-Test

8.1 Cohen's dzd_zdz​ — Step-by-Step

8.2 Hedges' gzg_zgz​ — Bias Correction

8.3 Cohen's Benchmark Classification

8.4 The Common Language Effect Size

8.5 Which Effect Size to Report: A Decision Guide

9. Confidence Intervals

9.1 CI for the Mean Difference (Original Units)

9.2 CI for Cohen's dzd_zdz​ (Standardised)

9.3 CI for Other Effect Size Variants

10. Power Analysis and Sample Size Planning

10.1 A Priori Power Analysis

10.2 Sensitivity Analysis (Post-Hoc Power)

10.3 Planning Based on davd_{av}dav​ Instead of dzd_zdz​

10.4 The Effect of Pre-Post Correlation on Required Sample Size

11. Advanced Topics

11.1 The Paired t-Test as a One-Sample t-Test

11.2 The Relationship Between Paired and Independent t-Tests

11.3 Equivalence Testing with TOST

11.4 Bayesian Paired t-Test

11.5 Robust Alternatives: Trimmed Mean Paired t-Test

11.6 Handling Missing Data in Paired Designs

3.5 Relationship Between $s_d$ and the Raw Score Statistics

5.2 Cohen's $d_z$ — The Standardised Mean Difference of Differences

5.3 Cohen's $d_{av}$ — Average Standard Deviation Standardiser

5.4 Cohen's $d_{rm}$ — Repeated Measures Corrected

5.5 Glass's $\Delta$ for Pre-Post Designs

5.6 Relationship Between $d_z$ and $d_{av}$

Step 9 — Compute the 95% Confidence Interval for $\mu_d$

Step 11 — Compute the 95% CI for Cohen's $d_z$

8.1 Cohen's $d_z$ — Step-by-Step

8.2 Hedges' $g_z$ — Bias Correction

9.2 CI for Cohen's $d_z$ (Standardised)

10.3 Planning Based on $d_{av}$ Instead of $d_z$

Mistake 2: Reporting $d_z$ Without Acknowledging It Is Not Comparable to Between-Subjects $d$

Mistake 10: Not Specifying $\delta_0$ When Testing Non-Zero Nulls

Cohen's Benchmarks for $d_z$

Required Sample Size (Pairs) — Two-Tailed $\alpha = .05$

Conversion Formulas: Paired $\leftrightarrow$ Other Metrics

Participant	Pre-MBCT ( $x_{1i}$ )	Post-MBCT ( $x_{2i}$ )	$d_i = x_{1i} - x_{2i}$
1	18	11	7
2	22	14	8
3	15	10	5
4	20	16	4
5	25	17	8
6	13	9	4
7	19	12	7
8	17	14	3
9	21	13	8
10	16	11	5
11	24	16	8
12	14	10	4
13	20	15	5
14	18	12	6
15	23	15	8