
Paired Samples t-Test

Step-by-step guide to conducting paired samples t-tests using DataStatPro.

Paired t-Test: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of dependent-samples inference all the way through the mathematics, assumptions, variants, effect sizes, interpretation, reporting, and practical usage of the Paired t-Test within the DataStatPro application. Whether you are encountering the paired t-test for the first time or seeking a rigorous understanding of within-subjects comparison, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is a Paired t-Test?
  3. The Mathematics Behind the Paired t-Test
  4. Assumptions of the Paired t-Test
  5. Variants of the Paired t-Test
  6. Using the Paired t-Test Calculator Component
  7. Full Step-by-Step Procedure
  8. Effect Sizes for the Paired t-Test
  9. Confidence Intervals
  10. Power Analysis and Sample Size Planning
  11. Advanced Topics
  12. Worked Examples
  13. Common Mistakes and How to Avoid Them
  14. Troubleshooting
  15. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into the paired t-test, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 Populations, Parameters, and Paired Designs

A population is the complete collection of individuals or measurements of interest. A sample is a subset drawn from that population. In a paired design, each participant (or experimental unit) contributes exactly two measurements — one under each of two conditions. The two measurements within a pair are inherently linked.

The paired t-test is an inferential procedure — it uses sample statistics computed from difference scores to draw conclusions about an unknown population parameter, namely the mean of the population difference scores $\mu_d$.

The fundamental question: "Is the mean difference between the two paired conditions large enough to conclude that a true population-level difference exists?"

1.2 Why Pairing Matters: Removing Between-Person Variability

In most research involving repeated measurements, individuals vary considerably from one another — some participants score high on both measurements, others score low on both. This between-person variability is a source of noise that has nothing to do with the treatment or condition effect.

By computing difference scores $d_i = x_{1i} - x_{2i}$ for each participant, the paired design removes between-person variability from the error term entirely:

$$s_d^2 = s_1^2 + s_2^2 - 2r_{12}s_1s_2$$

When $r_{12} > 0$ (which is typical when measuring the same people twice), $s_d^2 < s_1^2 + s_2^2$, so the paired test has a smaller standard error, and therefore greater statistical power, than the independent samples t-test applied to the same data.

1.3 The Sampling Distribution of the Mean Difference

If we repeatedly drew samples of $n$ pairs from a population where the true mean difference is $\mu_d$, the sampling distribution of $\bar{d}$ (the mean of the difference scores) would, by the Central Limit Theorem, be approximately normal:

$$\bar{d} \sim \mathcal{N}\!\left(\mu_d,\; \frac{\sigma_d^2}{n}\right)$$

Because the population standard deviation of differences $\sigma_d$ is unknown, we estimate it with the sample standard deviation $s_d$, giving the estimated standard error of the mean difference:

$$\widehat{SE}_{\bar{d}} = \frac{s_d}{\sqrt{n}}$$

This substitution of $s_d$ for $\sigma_d$ is exactly what introduces the t-distribution rather than the standard normal distribution into the inference.

1.4 The t-Distribution and Degrees of Freedom

The Student's t-distribution arises when estimating a normally distributed population mean from a small sample with unknown variance. It is characterised by its degrees of freedom $\nu$. For the paired t-test:

$$\nu = n - 1$$

where $n$ is the number of pairs (not the total number of observations, which would be $2n$). As $\nu \to \infty$, the t-distribution converges to the standard normal $Z$.

The t-distribution has heavier tails than the standard normal, reflecting the greater uncertainty that comes from estimating $\sigma_d$ from the data rather than knowing it exactly.

1.5 The Null and Alternative Hypotheses

The paired t-test operates within the Neyman-Pearson hypothesis testing framework:

$H_0: \mu_d = 0$ (the population mean difference is zero)

$H_1: \mu_d \neq 0$ (two-tailed, default)

or directional alternatives:

$H_1: \mu_d > 0$ (upper one-tailed — Condition 1 > Condition 2)

$H_1: \mu_d < 0$ (lower one-tailed — Condition 1 < Condition 2)

The null hypothesis can also be generalised to test against a non-zero value $\delta_0$:

$$H_0: \mu_d = \delta_0$$

which is useful for non-inferiority, superiority, or equivalence testing.

1.6 Statistical Significance vs. Practical Significance

A paired t-test answers: "Is the mean difference statistically distinguishable from zero, given sampling variability?" It does not answer: "Is the difference large enough to matter in practice?"

With a large number of pairs, even a trivially small mean difference can be statistically significant. Always report:

  1. The t-statistic, degrees of freedom, and p-value (statistical significance).
  2. An effect size (e.g., Cohen's $d_z$) and its 95% CI (practical significance).
  3. The 95% CI for the mean difference (in original units).

1.7 Confidence Intervals and Their Relationship to the Test

A 95% confidence interval for $\mu_d$ is directly related to the two-tailed t-test at $\alpha = .05$: the null hypothesis $H_0: \mu_d = 0$ is rejected at $\alpha = .05$ if and only if $0$ lies outside the 95% CI. The CI provides strictly more information than the p-value because it communicates both the precision and the magnitude of the estimated difference in original units.

1.8 Type I and Type II Errors

| Decision | $H_0$ True ($\mu_d = 0$) | $H_0$ False ($\mu_d \neq 0$) |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct — Power ($1-\beta$) |
| Fail to Reject $H_0$ | Correct ($1-\alpha$) | Type II error ($\beta$) |

2. What is a Paired t-Test?

2.1 The Core Idea

The paired t-test (also called: dependent samples t-test, matched pairs t-test, repeated measures t-test, or within-subjects t-test) is a parametric inferential procedure for testing whether the mean of a set of difference scores is significantly different from zero (or another specified value).

Rather than comparing two separate group means directly, the paired t-test:

  1. Computes a difference score $d_i$ for each pair of observations.
  2. Reduces the problem to a one-sample t-test on those difference scores.
  3. Tests whether the mean difference $\bar{d}$ is significantly different from zero.

This reduction is elegant: the paired t-test is mathematically identical to a one-sample t-test applied to the difference scores.
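This identity is easy to verify numerically. A minimal Python sketch with made-up data (any paired columns would do):

```python
# The paired t-test is identical to a one-sample t-test on the differences.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
pre = rng.normal(50, 10, size=20)
post = pre + rng.normal(2, 4, size=20)   # correlated second measurement

t_paired, p_paired = stats.ttest_rel(pre, post)
t_onesample, p_onesample = stats.ttest_1samp(pre - post, popmean=0.0)
# the two tests agree exactly, statistic and p-value alike
```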

2.2 When to Use a Paired t-Test

A paired t-test is appropriate when:

- Each participant (or matched pair of units) contributes exactly two measurements.
- The dependent variable is continuous (interval or ratio scale).
- The difference scores are approximately normally distributed, or $n$ is large.
- The pairs are independent of one another.

2.3 What Makes Observations "Paired"?

Observations are paired when there is a natural, meaningful, one-to-one correspondence between observations in the two conditions:

| Pairing Type | Example |
|---|---|
| Pre-post (same participant) | Depression score before and after therapy |
| Repeated measures (same participant) | Reaction time in noise vs. silence |
| Matched pairs (different participants) | Twins randomised to different conditions |
| Natural pairs | Left hand vs. right hand grip strength |
| Crossover designs | Drug A vs. Drug B, each participant receives both |
| Yoked controls | Each treatment participant matched to a control on age and IQ |

The key criterion is that the pairing must be established before data collection, not post-hoc. The correlation between the paired measurements must be positive (or at least non-negative) for pairing to confer a power advantage.

2.4 The Paired t-Test vs. Related Procedures

| Situation | Appropriate Test |
|---|---|
| Two related conditions, normal differences | Paired t-test |
| Two related conditions, non-normal or ordinal | Wilcoxon Signed-Rank test |
| Two independent groups | Independent samples t-test (Welch's recommended) |
| Three or more related conditions | One-way Repeated Measures ANOVA |
| Two related conditions, Bayesian inference | Bayesian paired t-test (BF$_{10}$) |
| Testing equivalence of two related conditions | TOST equivalence test |

2.5 The Power Advantage of Pairing

The paired t-test is more powerful than the independent samples t-test when:

  1. The within-pair correlation $r_{12}$ is positive (which is almost always true for repeated measures on the same participant).
  2. Between-person variability is large relative to within-person change.

The power gain is quantified by the relationship between paired and independent standard errors:

$$SE_{paired} = SE_{independent} \times \sqrt{1-r_{12}}$$

(assuming equal condition SDs; in general $SE_{paired}^2 = (s_1^2 + s_2^2 - 2r_{12}s_1s_2)/n$ versus $SE_{independent}^2 = (s_1^2 + s_2^2)/n$).

When $r_{12} = 0.50$: $SE_{paired} = SE_{independent} \times \sqrt{0.50} = 0.707 \times SE_{independent}$ (a 29% reduction in SE).

When $r_{12} = 0.80$: $SE_{paired} = SE_{independent} \times \sqrt{0.20} = 0.447 \times SE_{independent}$ (a 55% reduction in SE — substantial power gain).

When $r_{12} = 0.30$: $SE_{paired} = SE_{independent} \times \sqrt{0.70} = 0.837 \times SE_{independent}$ — a modest 16% reduction, which at very small $n$ can be offset by the degree of freedom lost to pairing.

💡 Pairing is most advantageous when the within-pair correlation is high ($r_{12} > 0.50$). When participants differ greatly from each other but respond consistently to conditions, the paired design dramatically reduces error and increases power.
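The variance algebra behind this can be checked with a few lines of Python (the SD and correlation values are illustrative); with equal condition SDs the ratio of standard errors works out to $\sqrt{1-r_{12}}$:

```python
# SE of the mean difference: paired design vs. two independent groups.
import math

def se_ratio(s1, s2, r12):
    """SE_paired / SE_independent for the mean difference (same n)."""
    return math.sqrt((s1 ** 2 + s2 ** 2 - 2 * r12 * s1 * s2) / (s1 ** 2 + s2 ** 2))

for r in (0.30, 0.50, 0.80):
    print(r, round(se_ratio(10, 10, r), 3))   # -> 0.837, 0.707, 0.447
```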


3. The Mathematics Behind the Paired t-Test

3.1 The Difference Score Reduction

Let $(x_{1i}, x_{2i})$ denote the pair of observations for participant $i$, where $i = 1, 2, \ldots, n$. Define the difference score:

$$d_i = x_{1i} - x_{2i}$$

The sign convention matters: consistently subtracting Condition 2 from Condition 1 means a positive $\bar{d}$ indicates that Condition 1 has higher values.

The mean and standard deviation of the difference scores are:

$$\bar{d} = \frac{1}{n}\sum_{i=1}^n d_i$$

$$s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (d_i - \bar{d})^2}$$

3.2 The t-Statistic

The paired t-statistic is:

$$t = \frac{\bar{d} - \delta_0}{s_d / \sqrt{n}}$$

Where:

- $\bar{d}$ is the sample mean of the difference scores,
- $\delta_0$ is the hypothesised population mean difference (usually $0$),
- $s_d$ is the sample standard deviation of the differences, and
- $n$ is the number of pairs.

Under $H_0: \mu_d = \delta_0$, this statistic follows a t-distribution with $\nu = n - 1$ degrees of freedom.

3.3 Standard Error of the Mean Difference

The standard error of the mean difference measures the precision of $\bar{d}$ as an estimate of $\mu_d$:

$$SE_{\bar{d}} = \frac{s_d}{\sqrt{n}}$$

This is the only standard error needed for the paired t-test. Note that it is computed entirely from the difference scores — the original scores $x_{1i}$ and $x_{2i}$ are used only to compute $d_i$.

3.4 The p-value

Two-tailed p-value:

$$p = 2 \times P(T_{n-1} \geq |t_{obs}|) = 2 \times [1 - F_{t,\;n-1}(|t_{obs}|)]$$

One-tailed p-value (upper), $H_1: \mu_d > \delta_0$:

$$p = P(T_{n-1} \geq t_{obs}) = 1 - F_{t,\;n-1}(t_{obs})$$

One-tailed p-value (lower), $H_1: \mu_d < \delta_0$:

$$p = P(T_{n-1} \leq t_{obs}) = F_{t,\;n-1}(t_{obs})$$

Where $F_{t,\;n-1}$ is the CDF of the t-distribution with $n-1$ degrees of freedom.
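These formulas map directly onto code. A Python sketch with hypothetical difference scores, cross-checked against scipy's built-in test:

```python
# t, df, and two-tailed p-value computed by hand from difference scores.
import numpy as np
from scipy import stats

d = np.array([2.1, -0.4, 1.8, 3.0, 0.9, 1.2, -0.2, 2.5])  # difference scores
n = d.size

d_bar = d.mean()
s_d = d.std(ddof=1)
t_obs = d_bar / (s_d / np.sqrt(n))            # t-statistic
p_two = 2 * stats.t.sf(abs(t_obs), df=n - 1)  # two-tailed p-value

t_ref, p_ref = stats.ttest_1samp(d, 0.0)      # scipy cross-check
```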

3.5 Relationship Between $s_d$ and the Raw Score Statistics

The standard deviation of differences $s_d$ is algebraically related to the standard deviations of the original scores and their correlation:

$$s_d^2 = s_1^2 + s_2^2 - 2r_{12}s_1 s_2$$

Where:

- $s_1$ and $s_2$ are the standard deviations of Condition 1 and Condition 2, and
- $r_{12}$ is the Pearson correlation between the paired scores.

This relationship has several important implications:

When $r_{12} = 0$: $s_d^2 = s_1^2 + s_2^2$ — the paired test has the same standard error as the independent test (no benefit from pairing).

When $r_{12} > 0$: $s_d^2 < s_1^2 + s_2^2$ — pairing reduces error variance and increases power.

When $r_{12} < 0$: $s_d^2 > s_1^2 + s_2^2$ — pairing increases error variance and reduces power. This is rare in practice but can occur with counterbalanced designs where learning effects operate.

3.6 The Mean Difference and Its Relationship to Raw Means

The mean difference score always equals the difference of the condition means:

$$\bar{d} = \bar{x}_1 - \bar{x}_2$$

This means the paired and independent tests produce identical estimates of the mean difference — the only difference is in the standard error used to evaluate that difference.

3.7 Computing the t-Statistic from Summary Statistics

If raw data are unavailable, the paired t-statistic can be computed from summary statistics in several ways:

From $\bar{d}$ and $s_d$:

$$t = \frac{\bar{d}}{s_d/\sqrt{n}}$$

From the correlation $r_{12}$ and the condition means and SDs:

$$s_d = \sqrt{s_1^2 + s_2^2 - 2r_{12}s_1 s_2}$$

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(s_1^2 + s_2^2 - 2r_{12}s_1 s_2)/n}}$$

From the t-statistic, recovering the effect size:

$$d_z = \frac{t}{\sqrt{n}}$$
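A sketch of the summary-statistics route in Python (all input values are hypothetical):

```python
# Paired t-statistic from summary statistics alone,
# using s_d^2 = s1^2 + s2^2 - 2*r12*s1*s2.
import math

def paired_t_from_summary(m1, m2, s1, s2, r12, n):
    """t-statistic from means, SDs, correlation, and number of pairs."""
    s_d = math.sqrt(s1 ** 2 + s2 ** 2 - 2 * r12 * s1 * s2)
    return (m1 - m2) / (s_d / math.sqrt(n))

t = paired_t_from_summary(m1=24.5, m2=21.8, s1=6.0, s2=5.5, r12=0.70, n=25)
d_z = t / math.sqrt(25)   # effect size recovered from t
```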

3.8 Non-Central t-Distribution and Exact CIs for Effect Sizes

Under $H_1$ (when a true effect exists), the t-statistic follows a non-central t-distribution with non-centrality parameter:

$$\lambda = d_z \sqrt{n} = \frac{\mu_d}{\sigma_d}\sqrt{n}$$

The exact 95% CI for Cohen's $d_z$ inverts this relationship numerically:

$$P(T_{n-1}(\lambda_L) \geq t_{obs}) = 0.025 \qquad \text{and} \qquad P(T_{n-1}(\lambda_U) \leq t_{obs}) = 0.025$$

$$d_{z,L} = \frac{\lambda_L}{\sqrt{n}}, \qquad d_{z,U} = \frac{\lambda_U}{\sqrt{n}}$$

No closed form exists for these bounds — DataStatPro computes them automatically using numerical iteration of the non-central t-distribution CDF.
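The same inversion can be sketched with scipy's non-central t CDF and a root finder (the function name here is my own, not DataStatPro's):

```python
# Exact noncentral-t CI for d_z by inverting the CDF numerically.
import numpy as np
from scipy import stats, optimize

def dz_ci(t_obs, n, conf=0.95):
    """Exact CI for Cohen's d_z via the noncentral t-distribution."""
    df = n - 1
    alpha = 1 - conf
    # lower bound: P(T_df(nc) >= t_obs) = alpha/2  <=>  cdf(t_obs) = 1 - alpha/2
    nc_lo = optimize.brentq(
        lambda nc: stats.nct.cdf(t_obs, df, nc) - (1 - alpha / 2), -50, 50)
    # upper bound: P(T_df(nc) <= t_obs) = alpha/2  <=>  cdf(t_obs) = alpha/2
    nc_hi = optimize.brentq(
        lambda nc: stats.nct.cdf(t_obs, df, nc) - alpha / 2, -50, 50)
    return nc_lo / np.sqrt(n), nc_hi / np.sqrt(n)

lo, hi = dz_ci(t_obs=2.5, n=20)   # brackets the point estimate 2.5/sqrt(20)
```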

3.9 Statistical Power of the Paired t-Test

Power is the probability that the paired t-test correctly rejects $H_0$ when a true effect of size $d_z$ exists:

$$\text{Power} = P\!\left(T_{n-1}(\lambda) > t_{\alpha/2,\;n-1}\right)$$

Where $\lambda = d_z\sqrt{n}$ is the non-centrality parameter.

The relationship between power, effect size, and sample size:

| $d_z$ | Power = 0.70 ($n$ pairs) | Power = 0.80 ($n$ pairs) | Power = 0.90 ($n$ pairs) | Power = 0.95 ($n$ pairs) |
|---|---|---|---|---|
| 0.20 | 185 | 264 | 354 | 434 |
| 0.35 | 62 | 88 | 118 | 146 |
| 0.50 | 31 | 44 | 59 | 73 |
| 0.80 | 13 | 18 | 24 | 30 |
| 1.00 | 9 | 13 | 17 | 21 |
| 1.20 | 7 | 9 | 12 | 15 |
| 1.50 | 5 | 7 | 9 | 11 |

All values assume two-tailed $\alpha = .05$.
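The power formula itself can be evaluated directly with scipy (a sketch; the example values are illustrative):

```python
# Power of the two-tailed paired t-test via the noncentral t-distribution.
import numpy as np
from scipy import stats

def paired_power(d_z, n, alpha=0.05):
    """Power for effect size d_z with n pairs, two-tailed test."""
    df = n - 1
    nc = d_z * np.sqrt(n)                    # non-centrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value
    return (1 - stats.nct.cdf(t_crit, df, nc)) + stats.nct.cdf(-t_crit, df, nc)

power = paired_power(d_z=0.5, n=34)          # achieved power for 34 pairs
```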


4. Assumptions of the Paired t-Test

4.1 Normality of Difference Scores

The paired t-test assumes that the difference scores $d_i = x_{1i} - x_{2i}$ are drawn from a normally distributed population. Note that:

- The assumption applies to the differences, not to the raw scores in each condition.
- The raw scores may each be non-normal while the differences are approximately normal.

How to check:

- Run a Shapiro-Wilk test on the difference scores.
- Inspect a Q-Q plot and a histogram of the differences.

Robustness: The paired t-test is robust to mild violations of normality, especially when:

- The number of pairs is moderate to large ($n \geq 30$, by the Central Limit Theorem).
- The distribution of differences is roughly symmetric.

When violated: Use the Wilcoxon Signed-Rank test as a non-parametric alternative, or consider data transformations (log, square root) if the differences are right-skewed.
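A minimal Python sketch of the check (hypothetical difference scores):

```python
# Shapiro-Wilk normality check on the difference scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
d = rng.normal(1.5, 2.0, size=25)   # hypothetical difference scores

w, p = stats.shapiro(d)
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")
# If p <= .05 (and n is small), prefer stats.wilcoxon on the paired scores
```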

4.2 Independence of Pairs

All pairs must be independent of each other. That is, knowing the difference score for pair $i$ gives no information about the difference score for pair $j$. Within a pair, the two measurements are of course correlated — that is the whole point of the design. It is independence across pairs that is required.

Common violations:

- Multiple pairs contributed by the same participant.
- Pairs clustered within families, classrooms, or clinics.
- Sequential measurements where one pair's outcome influences the next.

How to check: Independence is a property of the study design, not of the data. Inspect the sampling procedure. Check for patterns in residuals over time (Durbin-Watson test) if measurements were sequential.

When violated: For clustered pairs, use multilevel models. For time series, use time-series methods (ARIMA, mixed models with autocorrelation structure).

4.3 Correct Pairing

The pairing must be meaningful and pre-specified. Each observation in Condition 1 must correspond to the correct partner observation in Condition 2. Incorrect or arbitrary pairing does not create a valid paired test — it creates noise.

How to check: Verify the data file structure — each row should represent one pair (one participant or one matched pair), with Condition 1 and Condition 2 values in separate columns.

⚠️ A common data-entry error is accidentally shifting one column so that rows no longer correspond to the same participant across conditions. Always verify that participant IDs match across the two columns before running a paired t-test.

4.4 Interval Scale of Measurement

The dependent variable must be measured on at least an interval scale (equal-spaced intervals between values). Difference scores must be meaningful — they require that the distance between score values is consistent throughout the scale.

When violated: If the DV is ordinal (e.g., a single Likert item, rank data), use the Wilcoxon Signed-Rank test instead.

4.5 Absence of Influential Outliers in Difference Scores

The paired t-test is sensitive to extreme outliers in the difference scores because they distort both $\bar{d}$ and $s_d$.

How to check:

- Inspect a boxplot of the difference scores.
- Flag standardised differences with $|z| > 3$.
- Examine the tails of the Q-Q plot of the differences.

When outliers are present: Investigate whether the outlier represents a data entry error, measurement error, or a genuine extreme response. Report analyses with and without the outlier(s). Consider using the Wilcoxon Signed-Rank test (which is rank-based and thus robust to outliers in the differences).

4.6 Assumption Summary Table

| Assumption | Description | How to Check | Remedy if Violated |
|---|---|---|---|
| Normality of differences | $d_i \sim \mathcal{N}(\mu_d, \sigma_d^2)$ | Shapiro-Wilk, Q-Q, histogram | Wilcoxon Signed-Rank; transform data |
| Independence of pairs | Pairs are independent of each other | Design review; Durbin-Watson | Multilevel model; time-series methods |
| Correct pairing | Condition 1 and 2 observations are correctly matched | Verify participant IDs in data file | Re-match data; verify recording |
| Interval scale | DV has equal-interval properties | Measurement theory | Wilcoxon Signed-Rank |
| No influential outliers | No extreme values in $d_i$ | Boxplot; $\vert z_{d_i} \vert > 3$ | Investigate; report with and without; Wilcoxon Signed-Rank |

5. Variants of the Paired t-Test

5.1 Overview of Effect Size Variants

Multiple variants of the paired t-test exist primarily because of different choices of effect size standardiser — the denominator of the standardised mean difference. Choosing the wrong variant leads to incomparable effect sizes across studies.

| Variant | t-Statistic | Effect Size | Denominator | Primary Use |
|---|---|---|---|---|
| Standard paired t | $\bar{d}/(s_d/\sqrt{n})$ | $d_z = \bar{d}/s_d$ | SD of differences | Comparing paired designs |
| Average SD standardiser | Same t | $d_{av} = \bar{d}/s_{av}$ | Average of group SDs | Comparing to between-subjects |
| Pooled SD standardiser | Same t | $d_s = \bar{d}/s_{pooled}$ | Pooled SD (like between) | Meta-analysis |
| RM-corrected | Same t | $d_{rm} = d_z\sqrt{2(1-r_{12})}$ | Adjusted for correlation | Cross-design comparison |
| Pre-test standardiser | Same t | $d_{pre} = \bar{d}/s_{pre}$ | SD of pre-test (Condition 1) | Change from baseline |

5.2 Cohen's $d_z$ — The Standardised Mean Difference of Differences

Cohen's $d_z$ is the most straightforward effect size for the paired t-test. It expresses the mean difference in units of the standard deviation of the difference scores:

$$d_z = \frac{\bar{d}}{s_d}$$

It is directly recoverable from the t-statistic: $d_z = t/\sqrt{n}$.

When to use $d_z$:

- Power analysis and sample size planning for paired designs.
- Comparing effects across studies that all used paired designs.

Limitation of $d_z$: It is not directly comparable to Cohen's $d$ from an independent samples design, because $s_d$ reflects within-person variability in change, which is typically much smaller than between-person variability. $d_z$ is therefore typically larger than $d_{av}$ for the same mean difference.

5.3 Cohen's $d_{av}$ — Average Standard Deviation Standardiser

Cohen's $d_{av}$ (Lakens, 2013) standardises the mean difference by the average of the two condition standard deviations:

$$s_{av} = \frac{s_1 + s_2}{2}$$

$$d_{av} = \frac{\bar{d}}{s_{av}} = \frac{\bar{x}_1 - \bar{x}_2}{(s_1 + s_2)/2}$$

When to use $d_{av}$:

- Comparing a paired-design effect to effects from between-subjects studies.
- Reporting change in units of the original between-person variability.
5.4 Cohen's $d_{rm}$ — Repeated Measures Corrected

Cohen's $d_{rm}$ (Morris & DeShon, 2002) explicitly accounts for the within-subjects correlation $r_{12}$ to produce an effect size that is directly comparable to a between-subjects Cohen's $d$:

$$d_{rm} = d_z \times \sqrt{2(1-r_{12})}$$

Or equivalently:

$$d_{rm} = \frac{\bar{d}}{s_d} \times \sqrt{2(1-r_{12})}$$

Properties:

- When $s_1 = s_2$, $d_{rm}$ equals $d_{av}$.
- The $\sqrt{2(1-r_{12})}$ factor undoes the shrinkage of $s_d$ produced by the within-pair correlation.

$d_{rm}$ is the most theoretically appropriate effect size for comparing paired designs to independent samples designs.

5.5 Glass's $\Delta$ for Pre-Post Designs

Glass's $\Delta$ standardises by the pre-test (Condition 1) standard deviation only. This is most appropriate for treatment-control or pre-post designs where the pre-test represents the baseline, unaffected by the treatment:

$$\Delta = \frac{\bar{d}}{s_{pre}} = \frac{\bar{x}_1 - \bar{x}_2}{s_1}$$

It answers: "How many standard deviations (in the original, pre-intervention metric) does the average participant change?"

When to use $\Delta$: Pre-post designs where the treatment may change the variability of the outcome (e.g., an intervention that reduces both the mean and the variance of depression scores). Standardising by the pre-test SD anchors the effect in the pre-intervention distribution.

5.6 Relationship Between $d_z$ and $d_{av}$

$d_z$ and $d_{av}$ are related through the within-pair correlation $r_{12}$. Assuming approximately equal condition SDs, $s_d = s_{av}\sqrt{2(1-r_{12})}$, so (Lakens, 2013):

$$d_z = \frac{\bar{d}}{s_d} = \frac{\bar{d}}{s_{av}\sqrt{2(1-r_{12})}} = \frac{d_{av}}{\sqrt{2(1-r_{12})}}$$

Therefore:

$$d_z = \frac{d_{av}}{\sqrt{2(1-r_{12})}} \qquad \text{and} \qquad d_{av} = d_z \times \sqrt{2(1-r_{12})}$$

This means $d_z > d_{av}$ whenever $r_{12} > 0.50$ (commonly the case for repeated measures), explaining why paired designs appear to produce larger effect sizes than between-subjects designs when the $d_z$ metric is uncritically compared across both.

Numerical example with $r_{12} = 0.70$:

$$d_z = d_{av}/\sqrt{2(1-0.70)} = d_{av}/\sqrt{0.60} = d_{av}/0.775 = 1.291 \times d_{av}$$

So if $d_{av} = 0.60$, then $d_z = 0.775$ — nearly 30% larger.
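The two conversions as small Python helpers (assuming roughly equal condition SDs, as above):

```python
# Converting between d_z and d_av via the within-pair correlation.
import math

def dav_to_dz(d_av, r12):
    return d_av / math.sqrt(2 * (1 - r12))

def dz_to_dav(d_z, r12):
    return d_z * math.sqrt(2 * (1 - r12))

print(round(dav_to_dz(0.60, 0.70), 3))   # -> 0.775, the worked example above
```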


6. Using the Paired t-Test Calculator Component

The Paired t-Test Calculator in DataStatPro provides a comprehensive tool for running, diagnosing, visualising, and reporting paired t-tests and their alternatives.

Step-by-Step Guide

Step 1 — Select "Paired Samples t-Test"

From the "Test Type" dropdown, select:

- Paired Samples t-Test

Step 2 — Input Method

Choose how to provide the data:

- Raw data: two paired columns (Condition 1 and Condition 2)
- Summary statistics: $\bar{d}$, $s_d$, and $n$ (or the condition means, SDs, and $r_{12}$)

💡 When using paired columns, DataStatPro verifies that column lengths are equal, flags any missing data, and alerts you if participant IDs are provided and do not match across columns.

Step 3 — Specify the Null Hypothesis Value $\delta_0$

Default: $\delta_0 = 0$ (testing whether the mean difference is zero). To test a non-zero null (e.g., for non-inferiority testing with a margin of $\delta_0 = -2$ points), enter the appropriate value.

Step 4 — Select the Alternative Hypothesis

- Two-tailed (default): $\mu_d \neq \delta_0$
- One-tailed (greater): $\mu_d > \delta_0$
- One-tailed (less): $\mu_d < \delta_0$

Step 5 — Choose the Significance Level

Select $\alpha$ (default: $.05$). DataStatPro simultaneously shows results for $\alpha = .05$, $\alpha = .01$, and $\alpha = .001$ for reference.

Step 6 — Select Effect Size Variants

Choose which effect sizes to compute and report:

- Cohen's $d_z$
- Hedges' $g_z$
- Cohen's $d_{av}$
- Cohen's $d_{rm}$
- Common Language Effect Size ($CL$)

Step 7 — Select Display Options

Step 8 — Run the Analysis

Click "Run Paired t-Test". DataStatPro will:

  1. Compute difference scores and all descriptive statistics.
  2. Run Shapiro-Wilk normality test on the differences.
  3. Compute the t-statistic, df, and p-value.
  4. Construct exact 95% CIs for the mean difference and all effect sizes.
  5. Generate all selected visualisations.
  6. Auto-generate the APA results paragraph.

7. Full Step-by-Step Procedure

7.1 Complete Computational Procedure

This section walks through every computational step for the paired t-test, from raw data to a full APA-style conclusion.

Given: $n$ pairs of observations $(x_{1i}, x_{2i})$ for $i = 1, 2, \ldots, n$.


Step 1 — Verify and Arrange the Data

Arrange data in a table with one row per pair:

| Pair $i$ | $x_{1i}$ (Condition 1) | $x_{2i}$ (Condition 2) | $d_i = x_{1i} - x_{2i}$ |
|---|---|---|---|
| 1 | $x_{11}$ | $x_{21}$ | $d_1$ |
| 2 | $x_{12}$ | $x_{22}$ | $d_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| $n$ | $x_{1n}$ | $x_{2n}$ | $d_n$ |

Establish the sign convention: a positive $d_i$ means the participant scored higher in Condition 1 than in Condition 2. State this convention explicitly before analysis.


Step 2 — Compute the Mean Difference

$$\bar{d} = \frac{1}{n}\sum_{i=1}^n d_i = \frac{\sum_{i=1}^n d_i}{n}$$

Equivalently: $\bar{d} = \bar{x}_1 - \bar{x}_2$


Step 3 — Compute the Standard Deviation of Differences

$$s_d = \sqrt{\frac{\sum_{i=1}^n (d_i - \bar{d})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^n d_i^2 - n\bar{d}^2}{n-1}}$$


Step 4 — Compute the Standard Error

$$SE_{\bar{d}} = \frac{s_d}{\sqrt{n}}$$


Step 5 — Check the Normality Assumption

Run the Shapiro-Wilk test on the difference scores $d_i$:

- If $p > .05$: the normality assumption is tenable; proceed with the paired t-test.
- If $p \leq .05$ (especially with small $n$): consider the Wilcoxon Signed-Rank test or a transformation.


Step 6 — Compute the t-Statistic

$$t = \frac{\bar{d} - \delta_0}{SE_{\bar{d}}} = \frac{\bar{d} - \delta_0}{s_d/\sqrt{n}}$$

For the default null ($\delta_0 = 0$): $t = \bar{d}\sqrt{n}/s_d$


Step 7 — Determine Degrees of Freedom

$$\nu = n - 1$$


Step 8 — Compute the p-value

Using the t-distribution with $\nu = n-1$ df:

Two-tailed: $p = 2 \times P(T_{n-1} \geq |t|)$

Compare $p$ to $\alpha$. Reject $H_0$ if $p \leq \alpha$.


Step 9 — Compute the 95% Confidence Interval for $\mu_d$

$$\bar{d} \pm t_{\alpha/2,\;n-1} \times SE_{\bar{d}} = \bar{d} \pm t_{\alpha/2,\;n-1} \times \frac{s_d}{\sqrt{n}}$$

The CI directly answers: "What are plausible values for the true population mean difference, given this sample?"


Step 10 — Compute Effect Sizes

Cohen's $d_z$:

$$d_z = \frac{\bar{d}}{s_d} = \frac{t}{\sqrt{n}}$$

Hedges' $g_z$ (bias-corrected $d_z$):

$$g_z = d_z \times \left(1 - \frac{3}{4(n-1) - 1}\right) = d_z \times J$$

Where $J = 1 - 3/(4n-5)$ is the bias correction factor.

Cohen's $d_{av}$ (requires $s_1$ and $s_2$):

$$s_{av} = \frac{s_1 + s_2}{2}, \qquad d_{av} = \frac{\bar{d}}{s_{av}}$$

Cohen's $d_{rm}$ (requires $r_{12}$):

$$d_{rm} = d_z \times \sqrt{2(1-r_{12})}$$

Common Language Effect Size (CL):

$$CL = \Phi\!\left(\frac{d_z}{\sqrt{2}}\right)$$

$CL$ is the probability that a randomly selected participant scores higher in Condition 1 than in Condition 2 (for positive $d_z$).
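All of these quantities can be computed in a few lines of Python (the data are hypothetical; $d_{rm}$ here uses the Lakens (2013) correction $d_z\sqrt{2(1-r_{12})}$):

```python
# Effect sizes for the paired t-test from raw paired data.
import numpy as np
from scipy import stats

x1 = np.array([14.0, 11.5, 16.2, 13.8, 15.1, 12.9, 14.4, 13.2, 15.8, 12.1])
x2 = np.array([12.2, 11.0, 14.9, 12.1, 13.6, 12.5, 13.0, 12.8, 14.2, 11.4])
d = x1 - x2
n = d.size

d_z = d.mean() / d.std(ddof=1)                 # Cohen's d_z
g_z = d_z * (1 - 3 / (4 * n - 5))              # Hedges' g_z
s_av = (x1.std(ddof=1) + x2.std(ddof=1)) / 2
d_av = d.mean() / s_av                         # Cohen's d_av
r12 = np.corrcoef(x1, x2)[0, 1]
d_rm = d_z * np.sqrt(2 * (1 - r12))            # Lakens (2013) d_rm
cl = stats.norm.cdf(d_z / np.sqrt(2))          # common language effect size
```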


Step 11 — Compute the 95% CI for Cohen's $d_z$

Exact CI (via the non-central t-distribution — computed by DataStatPro):

Find $\lambda_L$ and $\lambda_U$ such that:

$$P(T_{n-1}(\lambda_L) \geq t_{obs}) = 0.025 \qquad \text{and} \qquad P(T_{n-1}(\lambda_U) \leq t_{obs}) = 0.025$$

$$d_{z,L} = \lambda_L/\sqrt{n}, \qquad d_{z,U} = \lambda_U/\sqrt{n}$$

Approximate CI (adequate for $n > 30$):

$$SE_{d_z} \approx \sqrt{\frac{1}{n} + \frac{d_z^2}{2(n-1)}}$$

$$d_z \pm 1.96 \times SE_{d_z}$$


Step 12 — Interpret and Report

Combine all results into a complete, APA-compliant report:

  1. Report $t(\nu) =$ [value], $p =$ [value] (or $p < .001$).
  2. Report $\bar{d}$ and $s_d$ with units.
  3. Report the 95% CI for the mean difference.
  4. Report Cohen's $d_z$ (and/or $d_{av}$) with 95% CI.
  5. Classify the effect size using benchmarks.
  6. State the practical conclusion.

8. Effect Sizes for the Paired t-Test

8.1 Cohen's $d_z$ — Step-by-Step

$$d_z = \frac{\bar{d}}{s_d}$$

Interpretation: $d_z = 0.50$ means the mean difference is half a standard deviation of the difference scores. This is not directly comparable to Cohen's $d$ from an independent samples design without knowing $r_{12}$.

8.2 Hedges' $g_z$ — Bias Correction

Cohen's $d_z$ is slightly positively biased in small samples — it overestimates the true population effect. Hedges' $g_z$ applies the bias correction:

$$g_z = d_z \times J, \qquad J = 1 - \frac{3}{4(n-1)-1}$$

More precise gamma function form:

$$J = \frac{\Gamma((n-1)/2)}{\sqrt{(n-1)/2} \cdot \Gamma((n-2)/2)}$$

The bias is negligible for $n > 20$ (less than 5%) but can be substantial for very small samples ($n < 10$):

| $n$ | $J$ | Bias (%) |
|---|---|---|
| 5 | 0.7979 | 20.2% |
| 10 | 0.9138 | 8.6% |
| 15 | 0.9453 | 5.5% |
| 20 | 0.9599 | 4.0% |
| 30 | 0.9738 | 2.6% |
| 50 | 0.9846 | 1.5% |

(Values use the exact gamma-function form with degrees of freedom $n - 1$.)
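The exact and approximate corrections are easy to compare in Python (log-gamma avoids overflow for larger $n$):

```python
# Hedges' small-sample correction J: exact gamma form vs. approximation.
import math

def J_exact(n):
    """Exact correction for df = n - 1, computed via log-gamma."""
    df = n - 1
    return math.exp(math.lgamma(df / 2) - math.lgamma((df - 1) / 2)) / math.sqrt(df / 2)

def J_approx(n):
    """The 1 - 3/(4n - 5) approximation used in the text."""
    return 1 - 3 / (4 * n - 5)

for n in (5, 10, 20, 50):
    print(n, round(J_exact(n), 4), round(J_approx(n), 4))
```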

8.3 Cohen's Benchmark Classification

Cohen (1988) proposed the following conventions for $d_z$ (and equivalently for $d_{av}$ and $d_{rm}$):

| $\vert d_z \vert$ | Verbal Label | $CL$ (%) | $U_3$ (%) | Overlap (%) |
|---|---|---|---|---|
| 0.00 | No effect | 50.0% | 50.0% | 100.0% |
| 0.20 | Small | 55.6% | 57.9% | 85.3% |
| 0.50 | Medium | 63.8% | 69.1% | 66.9% |
| 0.80 | Large | 71.4% | 78.8% | 52.5% |
| 1.20 | Very large | 80.2% | 88.5% | 35.9% |
| 2.00 | Huge | 92.1% | 97.7% | 16.9% |

⚠️ Cohen's benchmarks were intended as rough conventions of last resort — to be used only when no domain-specific information is available. Always contextualise within your research domain. In clinical psychology, $d_z = 0.50$ may be a meaningful effect; in some neuroimaging contexts, $d_z = 0.20$ may be large relative to typical findings.

Extended benchmarks (Sawilowsky, 2009):

| Label | $\vert d_z \vert$ |
|---|---|
| Tiny | $< 0.10$ |
| Very small | $0.10 - 0.19$ |
| Small | $0.20 - 0.49$ |
| Medium | $0.50 - 0.79$ |
| Large | $0.80 - 1.19$ |
| Very large | $1.20 - 1.99$ |
| Huge | $\geq 2.00$ |

8.4 The Common Language Effect Size

The Common Language Effect Size (McGraw & Wong, 1992) translates $d_z$ into a probability that is intuitive for non-statistical audiences:

$$CL = \Phi\!\left(\frac{d_z}{\sqrt{2}}\right)$$

$CL = 0.70$ means: for a randomly selected pair, there is a 70% probability that the Condition 1 score exceeds the Condition 2 score.

8.5 Which Effect Size to Report: A Decision Guide

| Research Goal | Recommended Effect Size | Rationale |
|---|---|---|
| Within-study power analysis and paired design comparison | $d_z$ | Direct function of the t-statistic; reflects paired design power |
| Comparing to between-subjects literature | $d_{av}$ or $d_{rm}$ | Standardises by original-scale SD; comparable to independent $d$ |
| Clinical pre-post change evaluation | $d_{av}$ or Glass's $\Delta$ | Anchored in clinically meaningful scale |
| Meta-analysis combining paired and independent designs | $d_{rm}$ | Design-adjusted; most comparable across designs |
| Small sample ($n < 20$) | $g_z$ (Hedges') | Reduces positive bias of $d_z$ |
| Reporting all relevant variants | $d_z$ + $d_{av}$ | Provides complete picture; specify which is primary |

💡 Always specify which effect size variant was computed. Writing "Cohen's $d = 0.78$" without specifying whether it is $d_z$, $d_{av}$, or $d_{rm}$ is ambiguous and prevents accurate meta-analytic synthesis.


9. Confidence Intervals

9.1 CI for the Mean Difference (Original Units)

The 95% CI for the population mean difference $\mu_d$ provides the most practically interpretable interval — it is expressed in the original measurement units and directly answers: "How large might the true effect be?"

$$\bar{d} \pm t_{\alpha/2,\;n-1} \times \frac{s_d}{\sqrt{n}}$$

Interpreting the CI:

| CI Property | Interpretation |
|---|---|
| Entirely above zero | Effect is significantly positive ($p < \alpha$) |
| Entirely below zero | Effect is significantly negative ($p < \alpha$) |
| Contains zero | Not statistically significant at level $\alpha$ |
| Narrow CI | Precise estimate; large $n$ |
| Wide CI | Imprecise estimate; small $n$ — interpret point estimate cautiously |
| Entirely within trivial range | Effect is definitively small (equivalence established) |

9.2 CI for Cohen's $d_z$ (Standardised)

The exact 95% CI for $d_z$ uses the non-central t-distribution (computed automatically by DataStatPro). The approximate CI (adequate for $n > 30$) is:

$$SE_{d_z} \approx \sqrt{\frac{1}{n} + \frac{d_z^2}{2(n-1)}}$$

$$d_z \pm 1.96 \times SE_{d_z} \quad \text{(approximate, two-tailed } \alpha = .05\text{)}$$

Width of the 95% CI for dzd_z as a function of nn (for true dz=0.50d_z = 0.50):

nn pairsApprox. SEdzSE_{d_z}95% CI WidthPrecision
100.3341.31Very low
200.2320.91Low
300.1880.74Moderate
500.1450.57Moderate-good
1000.1020.40Good
2000.0720.28High
5000.0460.18Very high

⚠️ With n=10n = 10 pairs, the 95% CI for dz=0.50d_z = 0.50 spans approximately [−0.16, 1.16] — from "negligible" to "very large." A point estimate of dz=0.50d_z = 0.50 from a study of only 10 pairs is essentially uninterpretable without the CI. Always report the CI.

9.3 CI for Other Effect Size Variants

95% CI for davd_{av}: Convert using dav=dz×2(1r12)d_{av} = d_z \times \sqrt{2(1-r_{12})} and apply the same conversion to both CI bounds.

dav,L=dz,L×2(1r12),dav,U=dz,U×2(1r12)d_{av,L} = d_{z,L} \times \sqrt{2(1-r_{12})}, \qquad d_{av,U} = d_{z,U} \times \sqrt{2(1-r_{12})}

95% CI for drmd_{rm}: DataStatPro computes this by bootstrapping when raw data are available, or via the delta method for summary statistics.


10. Power Analysis and Sample Size Planning

10.1 A Priori Power Analysis

A priori power analysis determines the required number of pairs before data collection to achieve desired power 1β1-\beta at significance level α\alpha for a hypothesised effect of size dzd_z.

Required nn for two-tailed test:

The exact calculation uses the non-central t-distribution (numerical). An excellent approximation:

n(z1α/2+z1β)2dz2+z1α/222n \approx \frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{d_z^2} + \frac{z_{1-\alpha/2}^2}{2}

For α=.05\alpha = .05 (two-tailed, z.975=1.96z_{.975} = 1.96) and power =0.80= 0.80 (z.80=0.842z_{.80} = 0.842):

n \approx \frac{(1.96 + 0.842)^2}{d_z^2} + \frac{1.96^2}{2} = \frac{7.849}{d_z^2} + 1.92

| d_z | Power = 0.80 (n) | Power = 0.90 (n) | Power = 0.95 (n) | Power = 0.99 (n) |
| --- | --- | --- | --- | --- |
| 0.20 | 198 | 265 | 326 | 441 |
| 0.30 | 89 | 119 | 147 | 198 |
| 0.50 | 34 | 45 | 55 | 75 |
| 0.80 | 15 | 19 | 23 | 32 |
| 1.00 | 10 | 13 | 16 | 22 |
| 1.20 | 8 | 10 | 12 | 16 |
| 1.50 | 6 | 8 | 9 | 12 |

All values assume two-tailed α=.05\alpha = .05. Add 1–2 pairs to account for rounding.
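The z-based approximation translates directly into code. This is a sketch of the approximation only; the exact requirement comes from the non-central t-distribution:

```python
import math
from statistics import NormalDist

def pairs_needed(d_z: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate number of pairs for a two-tailed paired t-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1-alpha/2}
    z_b = NormalDist().inv_cdf(power)           # z_{1-beta}
    return math.ceil((z_a + z_b) ** 2 / d_z ** 2 + z_a ** 2 / 2)

print(pairs_needed(0.50))  # 34, matching the table above
```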

10.2 Sensitivity Analysis (Post-Hoc Power)

Sensitivity analysis determines the minimum effect size that could have been detected with the study's sample size at a specified power level. It answers: "What was the smallest effect this study was designed to detect?"

dz,  min=(z1α/2+z1β)2nd_{z,\;min} = \sqrt{\frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{n}}

For n=20n = 20 pairs, α=.05\alpha = .05, power =0.80= 0.80:

dz,  min=7.849/20=0.392=0.626d_{z,\;min} = \sqrt{7.849/20} = \sqrt{0.392} = 0.626

This study could reliably detect only effects of dz0.63d_z \geq 0.63 — near Cohen's "large" threshold. Smaller effects may exist but would frequently be missed.
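The minimum detectable effect formula can be sketched the same way (normal approximation; function name is ours):

```python
from statistics import NormalDist

def min_detectable_dz(n: int, power: float = 0.80, alpha: float = 0.05) -> float:
    """Smallest d_z reliably detectable with n pairs at the given power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ((z_a + z_b) ** 2 / n) ** 0.5

print(round(min_detectable_dz(20), 3))  # 0.626
```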

⚠️ Post-hoc power computed from the observed effect size (sometimes called "observed power") is circular, redundant with the p-value, and should NOT be reported as a justification for a non-significant result. Sensitivity analysis using the minimum detectable effect is the appropriate post-hoc power tool.

10.3 Planning Based on davd_{av} Instead of dzd_z

When planning based on an expected davd_{av} (e.g., from a published between-subjects study), first convert to dzd_z using the anticipated within-pair correlation r12r_{12}:

dz=dav2(1r12)d_z = \frac{d_{av}}{\sqrt{2(1-r_{12})}}

Then apply the standard n formula. If r_{12} is unknown, a common default is r_{12} = 0.50 (true correlations above 0.50 make this choice conservative; values below it leave the study underpowered):

dz=dav2(10.50)=dav1.0=davd_z = \frac{d_{av}}{\sqrt{2(1-0.50)}} = \frac{d_{av}}{\sqrt{1.0}} = d_{av}

With r12=0.50r_{12} = 0.50, dz=davd_z = d_{av}, so the sample size formula is the same.
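The conversion is simple enough to script (a sketch; the function name is ours):

```python
import math

def dz_from_dav(d_av: float, r12: float) -> float:
    """Convert a between-subjects-style d_av into the paired-design d_z."""
    return d_av / math.sqrt(2 * (1 - r12))

print(round(dz_from_dav(0.50, 0.80), 3))  # 0.791: high r12 amplifies the planning effect size
```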

10.4 The Effect of Pre-Post Correlation on Required Sample Size

The required sample size for a paired design decreases as r12r_{12} increases — reflecting the power advantage of pairing. Compared to an independent samples design with the same davd_{av}:

| r_{12} | Factor \sqrt{2(1-r_{12})} | d_z/d_{av} | n_{paired}/n_{independent} |
| --- | --- | --- | --- |
| 0.00 | 1.414 | 0.707 | 0.500 |
| 0.20 | 1.265 | 0.791 | 0.400 |
| 0.50 | 1.000 | 1.000 | 0.250 |
| 0.70 | 0.775 | 1.291 | 0.150 |
| 0.80 | 0.632 | 1.581 | 0.100 |
| 0.90 | 0.447 | 2.236 | 0.050 |

n_{paired}/n_{independent} is the ratio of participants required: the paired design needs n pairs (n participants, each measured twice), while an equally powered independent design needs 2 \times n_{group} participants. This ratio equals (1-r_{12})/2; in terms of total measurements the ratio is (1-r_{12}).

💡 For r_{12} = 0.80, the paired design requires only about 10% as many participants (and about 20% as many total measurements) as an equally powered independent design. When within-pair correlations are high, pairing provides a dramatic efficiency gain.


11. Advanced Topics

11.1 The Paired t-Test as a One-Sample t-Test

The paired t-test is mathematically identical to a one-sample t-test applied to the difference scores. This has several practical implications:

  1. Software implementation: Many software packages implement the paired t-test by computing difference scores and running a one-sample test.
  2. Missing data: If some participants have data for only one condition, those pairs cannot contribute difference scores and are excluded entirely from the analysis.
  3. Non-zero null: Testing H0:μd=5H_0: \mu_d = 5 (e.g., "does the mean improvement exceed a clinically significant threshold of 5 points?") is as straightforward as testing H0:μd=0H_0: \mu_d = 0.

11.2 The Relationship Between Paired and Independent t-Tests

For the same dataset, the paired and independent t-statistics are related through the within-pair correlation r12r_{12}:

tpaired=dˉsd/n=xˉ1xˉ2(s12+s222r12s1s2)/nt_{paired} = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{(s_1^2+s_2^2-2r_{12}s_1s_2)/n}}

tindependent=xˉ1xˉ2spooled2/n=xˉ1xˉ2(s12+s22)/nt_{independent} = \frac{\bar{x}_1-\bar{x}_2}{s_{pooled}\sqrt{2/n}} = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{(s_1^2+s_2^2)/n}}

The ratio of the t-statistics:

tpairedtindependent=s12+s22s12+s222r12s1s2=112r12s1s2s12+s22\frac{t_{paired}}{t_{independent}} = \frac{\sqrt{s_1^2+s_2^2}}{\sqrt{s_1^2+s_2^2-2r_{12}s_1s_2}} = \frac{1}{\sqrt{1-\frac{2r_{12}s_1s_2}{s_1^2+s_2^2}}}

For equal SDs (s1=s2=ss_1 = s_2 = s):

tpairedtindependent=11r12\frac{t_{paired}}{t_{independent}} = \frac{1}{\sqrt{1-r_{12}}}

When r12=0.75r_{12} = 0.75: tpaired/tindependent=1/0.25=2.0t_{paired}/t_{independent} = 1/\sqrt{0.25} = 2.0 — the paired t-statistic is twice as large, corresponding to vastly higher power.

Also note the degrees of freedom differ: νpaired=n1\nu_{paired} = n-1 vs. νindependent=2n2\nu_{independent} = 2n-2. The paired test loses n1n-1 degrees of freedom by pairing, but gains far more through the reduced error term when r12r_{12} is high.
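The identity between the paired test and a one-sample test on difference scores, and the power gap relative to an (incorrectly applied) independent test, can be demonstrated on simulated data. The scenario below is invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
pre = rng.normal(50, 10, size=30)                  # stable person-level baselines
post = pre - 3 + rng.normal(0, 4, size=30)         # true mean change of -3, high r12

t_rel, p_rel = stats.ttest_rel(pre, post)          # paired t-test
t_one, p_one = stats.ttest_1samp(pre - post, 0.0)  # one-sample t on differences
t_ind, p_ind = stats.ttest_ind(pre, post)          # ignores the pairing

print(np.isclose(t_rel, t_one))                    # paired == one-sample on d_i
```

With these settings the within-pair correlation is high, so `t_rel` comes out far larger than `t_ind`, mirroring the ratio formula above.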

11.3 Equivalence Testing with TOST

Standard paired t-testing can reject H0:μd=0H_0: \mu_d = 0 but cannot establish that the mean difference is negligibly small. The Two One-Sided Tests (TOST) procedure tests whether the mean difference falls within a pre-specified equivalence interval [ΔL,ΔU][-\Delta_L, \Delta_U]:

H_{01}: \mu_d \leq -\Delta_L (the difference is meaningfully negative)

H_{02}: \mu_d \geq \Delta_U (the difference is meaningfully positive)

Equivalence is concluded when both one-sided tests reject their respective nulls at level α\alpha — equivalently, when the 90% CI (for α=.05\alpha = .05) for μd\mu_d falls entirely within (ΔL,ΔU)(-\Delta_L, \Delta_U).

The TOST t-statistics:

t1=dˉ(ΔL)sd/n,t2=dˉΔUsd/nt_1 = \frac{\bar{d} - (-\Delta_L)}{s_d/\sqrt{n}}, \qquad t_2 = \frac{\bar{d} - \Delta_U}{s_d/\sqrt{n}}

Equivalence is declared only if t_1 > t_{\alpha,\;n-1} and t_2 < -t_{\alpha,\;n-1}, each assessed one-tailed at level \alpha.

Choosing equivalence bounds: A common choice based on Cohen's dzd_z is to set bounds corresponding to a "small" effect: Δ=dz,small×sd=0.20×sd\Delta = d_{z,small} \times s_d = 0.20 \times s_d. In practice, bounds should be domain-specific and set before data collection.

💡 TOST for paired t-tests is critical for crossover drug bioequivalence studies (where "no difference" between formulations must be positively demonstrated), for measurement instrument validation (demonstrating that a new instrument agrees with a gold standard), and for null results that claim two conditions are equivalent.

11.4 Bayesian Paired t-Test

The Bayesian paired t-test (Rouder et al., 2009) quantifies evidence for and against the null hypothesis using the Bayes Factor BF10BF_{10}:

BF10=P(dataH1:μd0)P(dataH0:μd=0)BF_{10} = \frac{P(\text{data} \mid H_1: \mu_d \neq 0)}{P(\text{data} \mid H_0: \mu_d = 0)}

Under the default JZS prior (Jeffreys-Zellner-Siow), the prior on d_z under H_1 is a Cauchy distribution with scale r = \sqrt{2}/2 \approx 0.707.

Interpreting Bayes Factors:

| BF_{10} | Evidence |
| --- | --- |
| > 100 | Extreme for H_1 |
| 30 - 100 | Very strong for H_1 |
| 10 - 30 | Strong for H_1 |
| 3 - 10 | Moderate for H_1 |
| 1 - 3 | Anecdotal for H_1 |
| 1 | No evidence (equal support) |
| 1/3 - 1 | Anecdotal for H_0 |
| 1/10 - 1/3 | Moderate for H_0 |
| < 1/10 | Strong or stronger for H_0 |

Advantages of the Bayesian paired t-test:

- Evidence for the null can be quantified, not merely a failure to reject it.
- Evidence can be monitored as data accumulate; optional stopping does not invalidate the Bayes Factor interpretation.
- The result does not depend on unstated sampling intentions, unlike the p-value.

Reporting: "A Bayesian paired t-test with the default Cauchy prior (r=2/2r = \sqrt{2}/2) provided [strong / moderate / anecdotal / no] evidence for the alternative hypothesis, BF10=BF_{10} = [value]."

11.5 Robust Alternatives: Trimmed Mean Paired t-Test

Yuen's paired trimmed mean t-test (Yuen, 1974) uses α\alpha-trimmed means of the difference scores as the measure of central tendency. With 20% trimming:

h=n20.2nh = n - 2\lfloor 0.2n \rfloor (effective sample size after trimming)

dˉtrim\bar{d}_{trim} = 20%-trimmed mean of did_i

sw,d2s_{w,d}^2 = Winsorised variance of did_i

ttrim=dˉtrimsw,d/h(h1)t_{trim} = \frac{\bar{d}_{trim}}{s_{w,d}/\sqrt{h(h-1)}}, compared to th1t_{h-1}

The trimmed mean paired t-test can be more powerful than the Wilcoxon signed-rank test for symmetric heavy-tailed distributions, while maintaining good Type I error control under non-normality.
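A direct transcription of these formulas as a sketch (note that published implementations, e.g. Wilcox's, differ in small details):

```python
import math
from statistics import mean, variance

def trimmed_paired_t(diffs, trim=0.20):
    """Trimmed-mean paired t on difference scores, per the formulas above."""
    d = sorted(diffs)
    n = len(d)
    g = math.floor(trim * n)
    h = n - 2 * g                                          # effective sample size
    d_trim = mean(d[g:n - g])                              # trimmed mean
    wins = [min(max(x, d[g]), d[n - g - 1]) for x in d]    # Winsorised differences
    t = d_trim / math.sqrt(variance(wins) / (h * (h - 1)))
    return t, h - 1                                        # statistic, df

t_stat, df = trimmed_paired_t([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```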

11.6 Handling Missing Data in Paired Designs

In the paired t-test, both observations must be present for a pair to contribute to the analysis. Options for handling missing data:

| Approach | Description | When Appropriate |
| --- | --- | --- |
| Complete case analysis | Use only pairs with both observations | MCAR assumption; small proportion missing |
| Multiple imputation | Impute missing values using predictive models | MAR assumption; principled approach |
| Maximum likelihood (FIML) | Use all available data via full-information maximum likelihood | MAR assumption; preferred for repeated measures |
| Last observation carried forward (LOCF) | Replace missing post-value with last observation | Clinical trials; conservative assumption |

⚠️ Listwise deletion (complete case analysis) is the default in most software but can introduce bias when data are not Missing Completely At Random (MCAR). For more than 5% missing data, multiple imputation or maximum likelihood estimation are strongly preferred.

11.7 Multi-Level Extensions of the Paired Design

The paired t-test assumes that pairs are sampled from a common population. When pairs themselves are nested within clusters (e.g., twin pairs from the same family, or pre-post measurements from patients in the same hospital), standard paired t-tests underestimate standard errors and produce inflated Type I error rates.

The appropriate extension is a two-level mixed model:

dij=γ00+u0j+εijd_{ij} = \gamma_{00} + u_{0j} + \varepsilon_{ij}

Where u0jN(0,τ2)u_{0j} \sim \mathcal{N}(0, \tau^2) is the cluster-level random effect and εijN(0,σ2)\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2) is the residual within clusters. The Intraclass Correlation Coefficient (ICC) = τ2/(τ2+σ2)\tau^2/(\tau^2 + \sigma^2) quantifies the degree of clustering.

11.8 Reporting the Paired t-Test According to APA 7th Edition

Minimum reporting requirements (APA 7th ed.):

  1. Mean and SD for each condition: M1M_1, SD1SD_1, M2M_2, SD2SD_2.
  2. Mean and SD of difference scores: MdM_d, SDdSD_d.
  3. t-statistic with df: t(n-1) = [value].
  4. Exact p-value (or p<.001p < .001).
  5. Effect size with 95% CI: dz=d_z = [value] [95% CI: LB, UB].
  6. 95% CI for mean difference in original units.
  7. Specification of which effect size variant was reported.
  8. Whether the normality assumption was checked and met.

12. Worked Examples

Example 1: Pre-Post Mindfulness Intervention — PHQ-9 Depression Scores

A clinical psychologist evaluates whether an 8-week Mindfulness-Based Cognitive Therapy (MBCT) programme significantly reduces depression symptoms. PHQ-9 scores (0–27; higher = more depression) are recorded for n=15n = 15 participants immediately before and after the programme.

Raw data:

| Participant | Pre-MBCT (x_{1i}) | Post-MBCT (x_{2i}) | d_i = x_{1i} - x_{2i} |
| --- | --- | --- | --- |
| 1 | 18 | 11 | 7 |
| 2 | 22 | 14 | 8 |
| 3 | 15 | 10 | 5 |
| 4 | 20 | 16 | 4 |
| 5 | 25 | 17 | 8 |
| 6 | 13 | 9 | 4 |
| 7 | 19 | 12 | 7 |
| 8 | 17 | 14 | 3 |
| 9 | 21 | 13 | 8 |
| 10 | 16 | 11 | 5 |
| 11 | 24 | 16 | 8 |
| 12 | 14 | 10 | 4 |
| 13 | 20 | 15 | 5 |
| 14 | 18 | 12 | 6 |
| 15 | 23 | 15 | 8 |

Step 1 — Normality check on differences:

Differences: 7, 8, 5, 4, 8, 4, 7, 3, 8, 5, 8, 4, 5, 6, 8

Shapiro-Wilk: W=0.913W = 0.913, p=.154p = .154 — normality not violated; proceed with paired t-test.

Step 2 — Descriptive statistics:

\sum d_i = 7+8+5+4+8+4+7+3+8+5+8+4+5+6+8 = 90

\bar{d} = 90/15 = 6.000

\sum d_i^2 = 49+64+25+16+64+16+49+9+64+25+64+16+25+36+64 = 586

s_d = \sqrt{\frac{\sum d_i^2 - n\bar{d}^2}{n-1}} = \sqrt{\frac{586 - 15(36)}{14}} = \sqrt{\frac{46}{14}} = \sqrt{3.286} = 1.813

Condition means and SDs:

Pre-MBCT: \bar{x}_1 = (18+22+15+20+25+13+19+17+21+16+24+14+20+18+23)/15 = 285/15 = 19.000

s_1 = \sqrt{\sum(x_{1i}-19)^2/14}: values (1,9,16,1,36,36,0,4,4,9,25,25,1,1,16), sum = 184, s_1 = \sqrt{184/14} = \sqrt{13.143} = 3.625

Post-MBCT: \bar{x}_2 = 19.000 - 6.000 = 13.000

s_2 = \sqrt{\sum(x_{2i}-13)^2/14}: values (4,1,9,9,16,16,1,1,0,4,9,9,4,1,4), sum = 88, s_2 = \sqrt{88/14} = \sqrt{6.286} = 2.507

Within-pair correlation:

r_{12} = \frac{s_1^2 + s_2^2 - s_d^2}{2s_1 s_2} = \frac{13.143 + 6.286 - 3.286}{2 \times 3.625 \times 2.507} = \frac{16.143}{18.176} = 0.888

Step 3 — Standard error:

SE_{\bar{d}} = s_d/\sqrt{n} = 1.813/\sqrt{15} = 1.813/3.873 = 0.468

Step 4 — t-statistic:

t = \bar{d}/SE_{\bar{d}} = 6.000/0.468 = 12.82

Step 5 — Degrees of freedom and p-value:

\nu = 15-1 = 14

p = 2 \times P(T_{14} \geq 12.82) < .001

Step 6 — 95% CI for mean difference:

t_{.025,\;14} = 2.145

6.000 \pm 2.145 \times 0.468 = 6.000 \pm 1.004 = [4.996, 7.004]

Step 7 — Effect sizes:

Cohen's d_z:

d_z = \bar{d}/s_d = 6.000/1.813 = 3.310

d_z = t/\sqrt{n} = 12.82/\sqrt{15} = 12.82/3.873 = 3.310

Hedges' g_z:

J = 1 - 3/(4 \times 14 - 1) = 1 - 3/55 = 1 - 0.0545 = 0.9455

g_z = 3.310 \times 0.9455 = 3.130

Cohen's d_{av}:

s_{av} = (3.625 + 2.507)/2 = 3.066

d_{av} = 6.000/3.066 = 1.957

Cohen's d_{rm}:

d_{rm} = d_{av} \times \sqrt{2(1-r_{12})} = 1.957 \times \sqrt{2(1-0.888)} = 1.957 \times \sqrt{0.224} = 1.957 \times 0.473 = 0.926

Common Language Effect Size:

CL = \Phi(3.310/\sqrt{2}) = \Phi(2.341) = 0.990

Step 8 — 95% CI for d_z (approximate):

SE_{d_z} = \sqrt{1/15 + 3.310^2/(2 \times 14)} = \sqrt{0.0667 + 0.3913} = \sqrt{0.4580} = 0.677

95\%\text{ CI}: 3.310 \pm 1.96(0.677) = [1.98, 4.64]

Summary table:

| Statistic | Value | Interpretation |
| --- | --- | --- |
| Pre-MBCT mean | 19.00 PHQ-9 pts | Moderate-severe depression |
| Post-MBCT mean | 13.00 PHQ-9 pts | Mild depression |
| Mean difference (\bar{d}) | 6.00 pts | 6-point reduction |
| SD_d | 1.81 pts | Low variability in change |
| r_{12} (pre-post) | 0.888 | High pre-post correlation |
| t(14) | 12.82 | |
| p (two-tailed) | < .001 | Highly significant |
| 95% CI for \mu_d | [5.00, 7.00] | Excludes 0 |
| Cohen's d_z | 3.310 | Huge effect |
| Hedges' g_z | 3.130 | Huge (bias-corrected) |
| Cohen's d_{av} | 1.957 | Huge |
| Cohen's d_{rm} | 0.926 | Large |
| 95% CI for d_z | [1.98, 4.64] | |
| CL | 99.0% | |

APA write-up: "A paired samples t-test was conducted to evaluate whether PHQ-9 depression scores changed from pre- to post-MBCT. Difference scores were normally distributed as assessed by Shapiro-Wilk (W = 0.91, p = .154). MBCT produced a statistically significant reduction in depression scores (M_{pre} = 19.00, SD_{pre} = 3.63; M_{post} = 13.00, SD_{post} = 2.51), t(14) = 12.82, p < .001, d_z = 3.31 [95% CI: 1.98, 4.64], d_{av} = 1.96. The mean reduction of 6.00 PHQ-9 points [95% CI: 5.00, 7.00] represents a clinically large and statistically robust treatment effect. The Common Language Effect Size of 99.0% indicates that improvement is expected for virtually every participant."
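The hand calculation can be cross-checked in a few lines (illustrative SciPy code, not DataStatPro's engine):

```python
from scipy import stats

pre  = [18, 22, 15, 20, 25, 13, 19, 17, 21, 16, 24, 14, 20, 18, 23]
post = [11, 14, 10, 16, 17,  9, 12, 14, 13, 11, 16, 10, 15, 12, 15]

t, p = stats.ttest_rel(pre, post)   # paired t-test on the raw columns
d_z = t / len(pre) ** 0.5           # d_z = t / sqrt(n)
print(round(t, 2), round(d_z, 2))   # 12.82 3.31
```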


Example 2: Crossover Drug Trial — Pain Reduction

A pharmacologist conducts a double-blind crossover study comparing Drug A vs. Drug B on pain ratings (0–100 VAS; lower = less pain) in n=12n = 12 participants with chronic back pain. Each participant receives both drugs in randomised order with a 2-week washout between. Difference = Drug A − Drug B (positive = Drug A produces more pain).

Raw data:

| Participant | Drug A (x_{1i}) | Drug B (x_{2i}) | d_i |
| --- | --- | --- | --- |
| 1 | 45 | 38 | 7 |
| 2 | 62 | 55 | 7 |
| 3 | 33 | 41 | -8 |
| 4 | 58 | 49 | 9 |
| 5 | 70 | 63 | 7 |
| 6 | 41 | 47 | -6 |
| 7 | 55 | 50 | 5 |
| 8 | 48 | 55 | -7 |
| 9 | 64 | 58 | 6 |
| 10 | 52 | 46 | 6 |
| 11 | 39 | 44 | -5 |
| 12 | 60 | 56 | 4 |

Step 1 — Normality check:

Differences: 7, 7, −8, 9, 7, −6, 5, −7, 6, 6, −5, 4

Shapiro-Wilk: W=0.964W = 0.964, p=.836p = .836 — normality holds.

Step 2 — Descriptive statistics for differences:

di=7+78+9+76+57+6+65+4=25\sum d_i = 7+7-8+9+7-6+5-7+6+6-5+4 = 25

dˉ=25/12=2.083\bar{d} = 25/12 = 2.083

di2=49+49+64+81+49+36+25+49+36+36+25+16=515\sum d_i^2 = 49+49+64+81+49+36+25+49+36+36+25+16 = 515

sd=(51512×2.0832)/11=(51552.08)/11=462.92/11=42.084=6.487s_d = \sqrt{(515 - 12 \times 2.083^2)/11} = \sqrt{(515 - 52.08)/11} = \sqrt{462.92/11} = \sqrt{42.084} = 6.487

Step 3 — Standard error:

SEdˉ=6.487/12=6.487/3.464=1.873SE_{\bar{d}} = 6.487/\sqrt{12} = 6.487/3.464 = 1.873

Step 4 — t-statistic:

t=2.083/1.873=1.112t = 2.083/1.873 = 1.112

Step 5 — Degrees of freedom and p-value:

ν=121=11\nu = 12-1 = 11

p=2×P(T111.112)=2×0.146=.292p = 2 \times P(T_{11} \geq 1.112) = 2 \times 0.146 = .292

Step 6 — 95% CI:

t.025,  11=2.201t_{.025,\;11} = 2.201

2.083±2.201×1.873=2.083±4.121=[2.038,6.204]2.083 \pm 2.201 \times 1.873 = 2.083 \pm 4.121 = [-2.038, 6.204]

Step 7 — Effect sizes:

dz=dˉ/sd=2.083/6.487=0.321d_z = \bar{d}/s_d = 2.083/6.487 = 0.321

gz=0.321×(13/(4×111))=0.321×(13/43)=0.321×0.930=0.299g_z = 0.321 \times (1 - 3/(4 \times 11-1)) = 0.321 \times (1 - 3/43) = 0.321 \times 0.930 = 0.299

95% CI for dzd_z (approximate):

SEdz=1/12+0.3212/(2×11)=0.0833+0.00468=0.0880=0.297SE_{d_z} = \sqrt{1/12 + 0.321^2/(2 \times 11)} = \sqrt{0.0833 + 0.00468} = \sqrt{0.0880} = 0.297

95% CI:0.321±1.96(0.297)=[0.261,0.903]95\%\text{ CI}: 0.321 \pm 1.96(0.297) = [-0.261, 0.903]

The CI spans from negative (Drug B better) to positive (Drug A better), confirming non-significance.

Step 8 — Equivalence test (TOST)

The pharmacologist wishes to establish whether the drugs are equivalent within ±10\pm 10 VAS points (Δ=10\Delta = 10).

t1=(2.083(10))/1.873=12.083/1.873=6.449t_1 = (2.083 - (-10))/1.873 = 12.083/1.873 = 6.449

t2=(2.08310)/1.873=7.917/1.873=4.227t_2 = (2.083 - 10)/1.873 = -7.917/1.873 = -4.227

Both t1=6.449t_1 = 6.449 and t2=4.227|t_2| = 4.227 exceed t.05,  11=1.796t_{.05,\;11} = 1.796 (one-tailed).

The 90% CI for μd\mu_d:

2.083±1.796×1.873=2.083±3.364=[1.281,5.447]2.083 \pm 1.796 \times 1.873 = 2.083 \pm 3.364 = [-1.281, 5.447]

Since the 90% CI [1.28,5.45][-1.28, 5.45] falls entirely within [10,+10][-10, +10], equivalence is established at α=.05\alpha = .05.

Summary:

| Statistic | Value | Interpretation |
| --- | --- | --- |
| Mean diff (A - B) | 2.08 pts | Drug A produces slightly higher pain |
| SD_d | 6.49 pts | High variability in within-person differences |
| t(11) | 1.112 | |
| p (two-tailed) | .292 | Not significant |
| 95% CI for \mu_d | [-2.04, 6.20] | Includes 0 |
| Cohen's d_z | 0.321 | Small |
| Hedges' g_z | 0.299 | |
| 95% CI for d_z | [-0.26, 0.90] | |
| TOST result | Equivalent | 90% CI within \pm10 pts |

APA write-up: "A paired samples t-test examined whether Drug A and Drug B differed in pain relief in a crossover design (n=12n = 12). Difference scores were normally distributed (W=0.964W = 0.964, p=.836p = .836). The mean pain rating was not significantly different for Drug A (M=52.25M = 52.25, SD=11.43SD = 11.43) vs. Drug B (M=50.17M = 50.17, SD=8.42SD = 8.42), t(11)=1.11t(11) = 1.11, p=.292p = .292, dz=0.32d_z = 0.32 [95% CI: -0.26, 0.90]. The 95% CI for the mean difference was [-2.04, 6.20] VAS points. A TOST equivalence test with bounds of ±\pm10 VAS points demonstrated that the drugs are equivalent in pain relief, with the 90% CI [-1.28, 5.45] falling entirely within the equivalence interval."
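The equivalence conclusion can be double-checked through the 90% CI route described in Section 11.3 (a sketch using SciPy):

```python
import math
from statistics import mean, stdev

from scipy import stats

d = [7, 7, -8, 9, 7, -6, 5, -7, 6, 6, -5, 4]   # Drug A - Drug B differences
se = stdev(d) / math.sqrt(len(d))
t_crit = stats.t.ppf(0.95, df=len(d) - 1)       # 90% CI pairs with TOST at alpha = .05
lo, hi = mean(d) - t_crit * se, mean(d) + t_crit * se
print(round(lo, 2), round(hi, 2))               # -1.28 5.45, inside the +/-10 bounds
```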


Example 3: Reaction Time — Noise vs. Silence Condition

A cognitive psychologist tests whether background noise affects simple reaction time (ms) in n=25n = 25 university students. Each participant completes both a silent and a noise condition (order counterbalanced). di=RTnoiseRTsilenced_i = RT_{noise} - RT_{silence} (positive = noise increases RT).

Summary statistics (raw data not shown):

n=25,xˉnoise=312.4 ms,xˉsilence=298.7 msn = 25, \quad \bar{x}_{noise} = 312.4\text{ ms}, \quad \bar{x}_{silence} = 298.7\text{ ms}

snoise=41.3 ms,ssilence=38.6 ms,r12=0.82s_{noise} = 41.3\text{ ms}, \quad s_{silence} = 38.6\text{ ms}, \quad r_{12} = 0.82

Step 1 — Compute sds_d:

s_d = \sqrt{s_1^2 + s_2^2 - 2r_{12}s_1s_2} = \sqrt{41.3^2 + 38.6^2 - 2(0.82)(41.3)(38.6)}

= \sqrt{1705.69 + 1489.96 - 2614.46} = \sqrt{581.19} = 24.11\text{ ms}

Step 2 — Mean difference:

\bar{d} = 312.4 - 298.7 = 13.7\text{ ms}

Step 3 — Standard error:

SE_{\bar{d}} = 24.11/\sqrt{25} = 24.11/5 = 4.822\text{ ms}

Step 4 — t-statistic:

t = 13.7/4.822 = 2.841

Step 5 — df and p-value:

\nu = 24; p = 2 \times P(T_{24} \geq 2.841) = 2 \times .0045 = .009

Step 6 — 95% CI:

t_{.025,\;24} = 2.064

13.7 \pm 2.064 \times 4.822 = 13.7 \pm 9.95 = [3.75, 23.65]\text{ ms}

Step 7 — Effect sizes:

d_z = 13.7/24.11 = 0.568 \quad (\text{or } t/\sqrt{n} = 2.841/5 = 0.568)

g_z = 0.568 \times (1 - 3/(4 \times 24 - 1)) = 0.568 \times 0.968 = 0.550

s_{av} = (41.3 + 38.6)/2 = 39.95\text{ ms}

d_{av} = 13.7/39.95 = 0.343

d_{rm} = 0.343 \times \sqrt{2(1-0.82)} = 0.343 \times \sqrt{0.36} = 0.343 \times 0.600 = 0.206

CL = \Phi(0.568/\sqrt{2}) = \Phi(0.402) = 0.656

Contrast: What if this had been run as an (incorrect) independent test?

s_{pooled} = \sqrt{(24 \times 41.3^2 + 24 \times 38.6^2)/48} = \sqrt{(40936.56 + 35759.04)/48} = \sqrt{1597.83} = 39.97

t_{ind} = 13.7/(39.97\sqrt{2/25}) = 13.7/11.31 = 1.212, \quad p = .230

The incorrect independent t-test fails to detect the effect (p = .230) while the paired test clearly identifies it (p = .009). This illustrates the dramatic power advantage of the paired design when r_{12} = 0.82.
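When only summary statistics are available, as here, the paired t can be recovered from the condition SDs and the within-pair correlation. A sketch (the function name is ours):

```python
import math

def paired_t_from_summary(m1, m2, s1, s2, r12, n):
    """Paired t-statistic reconstructed from condition summaries and r12."""
    s_d = math.sqrt(s1 ** 2 + s2 ** 2 - 2 * r12 * s1 * s2)
    return (m1 - m2) / (s_d / math.sqrt(n)), n - 1

t, df = paired_t_from_summary(312.4, 298.7, 41.3, 38.6, 0.82, 25)
print(round(t, 2), df)   # 2.84 24
```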

Summary:

| Statistic | Value |
| --- | --- |
| Mean RT: Noise | 312.4 ms |
| Mean RT: Silence | 298.7 ms |
| Mean difference \bar{d} | 13.7 ms (noise slower) |
| s_d | 24.11 ms |
| r_{12} | 0.820 (high pairing efficiency) |
| t(24) (paired) | 2.841 |
| p (paired, two-tailed) | .009 |
| t (if independent, incorrect) | 1.212 |
| p (if independent, incorrect) | .230 |
| 95% CI for \mu_d | [3.75, 23.65] ms |
| Cohen's d_z | 0.568 (Medium) |
| Cohen's d_{av} | 0.343 (Small-Medium) |
| Cohen's d_{rm} | 0.206 (Small) |
| CL | 65.6% |

APA write-up: "A paired samples t-test examined whether background noise affected reaction time. Participants were significantly slower in the noise condition (M = 312.4 ms, SD = 41.3 ms) than in the silence condition (M = 298.7 ms, SD = 38.6 ms), t(24) = 2.84, p = .009, d_z = 0.57 [95% CI: 0.14, 0.99], d_{av} = 0.34. The mean slowing of 13.7 ms [95% CI: 3.75, 23.65 ms] represents a medium within-subjects effect. The high pre-post correlation (r = 0.82) confirms the efficiency of the paired design."


Example 4: Small Sample with a Large, Significant Effect — Teaching Method

An education researcher compares mathematics test scores (n=10n = 10 students) before and after a new tutoring method. Scores range 0–100.

Data:

| Student | Before (x_{1i}) | After (x_{2i}) | d_i |
| --- | --- | --- | --- |
| 1 | 62 | 67 | -5 |
| 2 | 78 | 82 | -4 |
| 3 | 55 | 59 | -4 |
| 4 | 71 | 78 | -7 |
| 5 | 83 | 84 | -1 |
| 6 | 67 | 73 | -6 |
| 7 | 59 | 61 | -2 |
| 8 | 74 | 76 | -2 |
| 9 | 88 | 91 | -3 |
| 10 | 61 | 66 | -5 |

Note: di=BeforeAfterd_i = Before - After, so negative did_i indicates improvement.

dˉ=(5447162235)/10=39/10=3.900\bar{d} = (-5-4-4-7-1-6-2-2-3-5)/10 = -39/10 = -3.900

di2=25+16+16+49+1+36+4+4+9+25=185\sum d_i^2 = 25+16+16+49+1+36+4+4+9+25 = 185

sd=(18510×3.9002)/9=(185152.10)/9=32.9/9=3.656=1.912s_d = \sqrt{(185-10\times3.900^2)/9} = \sqrt{(185-152.10)/9} = \sqrt{32.9/9} = \sqrt{3.656} = 1.912

SE=1.912/10=0.605SE = 1.912/\sqrt{10} = 0.605

t=3.900/0.605=6.446t = -3.900/0.605 = -6.446; ν=9\nu = 9; p<.001p < .001

95% CI: 3.900±2.262×0.605=[5.268,2.532]-3.900 \pm 2.262 \times 0.605 = [-5.268, -2.532]

dz=3.900/1.912=2.039d_z = 3.900/1.912 = 2.039 (Huge); gz=2.039×(13/35)=2.039×0.914=1.864g_z = 2.039 \times (1-3/35) = 2.039 \times 0.914 = 1.864

Despite the small sample, the effect is large and the result is highly significant because individual differences in change scores are small relative to the mean improvement.

Note on interpreting CIs with small samples:

SEdz=1/10+2.0392/18=0.100+0.231=0.331=0.575SE_{d_z} = \sqrt{1/10 + 2.039^2/18} = \sqrt{0.100+0.231} = \sqrt{0.331} = 0.575

95% CI for dz:2.039±1.96(0.575)=[0.912,3.166]95\%\text{ CI for } d_z: 2.039 \pm 1.96(0.575) = [0.912, 3.166]

Wide CI but entirely positive — effect is definitively large even at the lower bound.

APA write-up: "A paired samples t-test showed that student mathematics scores improved significantly after tutoring (M_{before} = 69.8, SD = 10.92; M_{after} = 73.7, SD = 10.44), t(9) = -6.45, p < .001, d_z = 2.04 [95% CI: 0.91, 3.17]. The mean improvement of 3.9 points [95% CI: 2.53, 5.27] represents a very large within-subjects effect, indicating the tutoring method produced consistent, substantial gains across students."


13. Common Mistakes and How to Avoid Them

Mistake 1: Using the Independent Samples t-Test for Paired Data

Problem: Treating pre-post measurements (or matched pairs) as independent groups and running an independent samples t-test. This ignores the within-pair correlation r12r_{12}, inflates the error term with between-person variability, severely reduces power, and may produce a non-significant result for a large, real effect.

How serious: In Example 3 above, the paired test correctly detected an effect at p=.009p = .009, while the incorrect independent test gave p=.230p = .230. When r12=0.82r_{12} = 0.82, the independent test has less than 50% of the power of the paired test.

Solution: Before running any test, determine the study design: if each participant contributes two scores, the paired t-test is required. Check the data file structure — paired data should have one row per participant (or matched pair), not one row per observation.


Mistake 2: Reporting dzd_z Without Acknowledging It Is Not Comparable to Between-Subjects dd

Problem: Reporting dz=1.50d_z = 1.50 from a paired design and implying this is comparable to d=1.50d = 1.50 from a between-subjects study. Because sd<spooleds_d < s_{pooled} when r12>0r_{12} > 0, dzd_z will be systematically larger than davd_{av} or dd from an independent design for the same mean difference.

How serious: For r12=0.80r_{12} = 0.80, dzd_z is 1.58×dav1.58 \times d_{av} — reporting dz=1.50d_z = 1.50 when the comparable between-subjects dd would be 0.95\approx 0.95 could grossly inflate perceived effect sizes in a research domain.

Solution: Always specify the effect size variant. Report both dzd_z and davd_{av} (or drmd_{rm}). When comparing to between-subjects studies, use davd_{av} or drmd_{rm}.


Mistake 3: Not Checking Normality of the Difference Scores

Problem: Applying the paired t-test without checking whether the difference scores are approximately normally distributed. This is especially risky with small samples (n<30n < 30), where the CLT does not yet provide adequate protection, and the t-test's p-values may be inaccurate under skewed or heavy-tailed difference distributions.

Solution: Always run the Shapiro-Wilk test on the difference scores (not on the raw scores) and inspect the Q-Q plot of differences. If normality is violated and n<30n < 30, use the Wilcoxon Signed-Rank test.


Mistake 4: Running Separate t-Tests on Each Condition Instead of a Paired Test

Problem: Testing whether Condition 1 mean differs from zero, then testing whether Condition 2 mean differs from zero, and comparing the significance of the two tests. This approach is fundamentally flawed — a condition can be significantly different from zero in both tests but not significantly different from each other, or vice versa.

Solution: The appropriate question is whether the mean difference between conditions is significant. Use the paired t-test, which directly tests H0:μd=0H_0: \mu_d = 0.


Mistake 5: Failing to Report the 95% CI for the Mean Difference in Original Units

Problem: Reporting only tt, pp, and dzd_z without reporting the 95% CI for μd\mu_d in the original measurement units. The CI in original units is the most practically interpretable result — it tells readers how large the difference is in terms they can evaluate against a minimum important clinical or practical difference.

Solution: Always report the 95% CI for dˉ\bar{d} in original units, alongside the CI for the effect size dzd_z. For clinical or applied research, also discuss whether the CI for the mean difference exceeds a minimally important clinical difference (MICD).


Mistake 6: Treating a Non-Significant Result as Evidence of No Change

Problem: Reporting t = 1.20, p = .25 and concluding "the intervention had no effect." A non-significant result only means the data are insufficient to reject H_0 given the test's sensitivity; it does NOT establish that the true effect is zero. With small n, even large effects fail to reach significance.

Solution: Report the 95% CI for the mean difference. If the CI is wide and includes both clinically trivial and clinically meaningful differences, explicitly acknowledge the study's limited power rather than claiming no effect. Use equivalence testing (TOST) with pre-specified bounds if the research goal is to demonstrate absence of a meaningful effect.


Mistake 7: Applying a One-Tailed Test After Observing the Data Direction

Problem: Observing that $\bar{d}$ is positive, then switching to an upper one-tailed test to achieve $p = .03$ when the two-tailed result was $p = .06$. This is p-hacking and doubles the effective Type I error rate.

Solution: Directional hypotheses must be specified before data collection. Document the hypothesis direction in a pre-registration (e.g., on the OSF) before seeing any data. In the absence of a pre-registered directional prediction, use two-tailed tests.


Mistake 8: Using the Same Participants Twice Without Pairing

Problem: Collecting data from 30 participants under two conditions but entering all 60 observations as an independent-groups design. This creates pseudo-replication, violates independence, and severely inflates Type I error rates because the 60 observations are not all independent.

Solution: Understand the design. If each participant provided data under both conditions, the observations are paired and the within-subjects structure must be accounted for in the analysis (paired t-test, or repeated measures ANOVA for $K > 2$).
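The consequence of ignoring the pairing can be shown numerically. In this sketch (invented data) the between-person spread is large but every participant improves by about one point; the paired analysis recovers the effect, while a pooled-variance two-sample t on the same 20 observations mis-estimates the standard error and misses it entirely. (With other data patterns the independent analysis can instead overstate the evidence.)

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical pre/post scores for 10 participants: large between-person
# spread, small but consistent within-person improvement.
pre = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0]
gains = [1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.1, 0.9, 1.0]
post = [p + g for p, g in zip(pre, gains)]
n = len(pre)

# Correct: paired analysis on difference scores.
diffs = [b - a for a, b in zip(pre, post)]
t_paired = mean(diffs) / (stdev(diffs) / sqrt(n))

# Wrong: treating the 20 observations as two independent groups
# (pooled-variance two-sample t with equal n). The between-person
# spread swamps the within-person change.
sp = sqrt((stdev(pre) ** 2 + stdev(post) ** 2) / 2)  # pooled SD
t_indep = (mean(post) - mean(pre)) / (sp * sqrt(2 / n))

print(t_paired, t_indep)
```

The two statistics differ by more than an order of magnitude on identical data; the design, not the numbers, determines the correct analysis.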


Mistake 9: Ignoring Carryover Effects in Crossover Designs

Problem: In crossover designs, the effect of the first condition may carry over and influence responses in the second condition. Failing to account for order effects can bias the estimate of the mean difference, making the paired comparison misleading.

Solution: Use proper washout periods between conditions. Test for order effects by including condition order as a factor. If order effects are significant, report this and consider using only the first-condition data or modelling the order effect explicitly.


Mistake 10: Not Specifying δ0\delta_0 When Testing Non-Zero Nulls

Problem: Testing whether a treatment effect exceeds a clinically meaningful threshold (e.g., a 5-point improvement on a 100-point scale) using the default $H_0: \mu_d = 0$ instead of $H_0: \mu_d = 5$. The default test does not answer the right question.

Solution: Set the null hypothesis value $\delta_0$ to the minimum clinically important difference (MCID) before running the test. In DataStatPro, enter this value in the "Null Hypothesis Value" field.
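A minimal sketch of the difference this makes, using $t = (\bar{d} - \delta_0)/(s_d/\sqrt{n})$. The improvement scores, the assumed MCID of 5 points, and the hard-coded critical value $t_{.025,\,9} = 2.262$ are all illustrative:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical improvement scores (points on a 100-point scale), n = 10 pairs.
d = [4.8, 6.0, 4.2, 6.6, 5.4, 6.3, 4.5, 6.3, 5.0, 4.9]
n = len(d)
se = stdev(d) / sqrt(n)

# Default null H0: mu_d = 0 ("is there any change at all?").
t_default = (mean(d) - 0.0) / se

# Clinically relevant null H0: mu_d = 5 (an assumed MCID):
# "does the improvement exceed the minimum clinically important difference?"
delta0 = 5.0
t_mcid = (mean(d) - delta0) / se

# With df = 9, two-tailed alpha = .05, the critical value is about 2.262:
# the change differs clearly from zero, but not demonstrably from the MCID.
print(t_default, t_mcid)
```

The same data give a decisive answer to one question and an inconclusive answer to the other; $\delta_0$ must match the research question.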


14. Troubleshooting

| Problem | Likely Cause | Solution |
|---|---|---|
| $t$ is extremely large ($\vert t \vert > 10$) | Very small $s_d$ (participants all changed by nearly the same amount) or data entry error | Check data; if genuine, report with interpretation: very consistent within-person change |
| $s_d = 0$ | All difference scores are identical | Verify data; if genuine, the test is degenerate: all participants changed by exactly the same amount |
| $p = 1.000$ exactly | $\bar{d} = 0$ exactly | All differences cancel out; report $d_z = 0$, interpret as no mean change |
| Shapiro-Wilk significant on large sample ($n > 50$) | High power of normality test; minor deviations detected | With $n \geq 30$, the CLT provides protection; inspect the Q-Q plot for severity; t-test likely valid |
| $d_z \gg d_{av}$ | High within-pair correlation ($r_{12}$ large) | Both are correct; $d_z$ reflects paired-design efficiency; $d_{av}$ is more comparable to between-subjects $d$ |
| Paired t significant but Wilcoxon signed-rank not significant (or vice versa) | Distributional issues or tied difference scores | Check normality; if differences are non-normal, trust the Wilcoxon; report both with rationale |
| 95% CI for $d_z$ is very wide | Small $n$ ($n < 15$) | Report the wide CI; it is the truthful reflection of low precision; use the exact (non-central t) CI from DataStatPro |
| Equivalence test fails despite small $d_z$ | Equivalence bounds too tight for the available $n$ | Increase $n$ in a replication; widen bounds with theoretical justification or accept insufficient precision |
| $r_{12}$ is negative | Rare; could arise from counterbalancing with contrast effects | Verify measurement; pairing reduces power when $r_{12} < 0$; consider an independent-samples test |
| $d_{rm} > d_{av}$ | $r_{12} < 0.50$; correction factor $\sqrt{2(1-r_{12})} > 1$ | Both values are correct; report both; specify which is primary |
| Bayes Factor $BF_{10} \approx 1$ | Insensitive data; study underpowered | Collect more data; report $BF_{10}$ as reflecting insensitivity rather than evidence for either hypothesis |
| TOST bounds are difficult to specify | Lack of prior knowledge about the MCID | Consult the domain literature; use $d_z = 0.20$ as a generic "trivially small" effect bound; pre-register the choice |
| Dataset has missing values for some pairs | Incomplete data collection; attrition | Use complete-case analysis if MCAR; use multiple imputation or a multilevel model if MAR; document clearly |
| Two conditions have very different SDs ($s_1/s_2 > 2$) | Treatment changes variability | Note the heteroscedasticity; consider Glass's $\Delta$ (baseline SD) rather than $d_{av}$; the Wilcoxon is robust |

15. Quick Reference Cheat Sheet

Core Formulas

| Formula | Description |
|---|---|
| $d_i = x_{1i} - x_{2i}$ | Difference score for pair $i$ |
| $\bar{d} = \frac{1}{n}\sum d_i = \bar{x}_1 - \bar{x}_2$ | Mean difference |
| $s_d = \sqrt{\frac{\sum(d_i-\bar{d})^2}{n-1}}$ | SD of differences |
| $SE_{\bar{d}} = s_d/\sqrt{n}$ | Standard error of mean difference |
| $t = (\bar{d} - \delta_0)/(s_d/\sqrt{n})$ | Paired t-statistic |
| $\nu = n-1$ | Degrees of freedom |
| $p = 2 \times P(T_{n-1} \geq \vert t \vert)$ | Two-tailed p-value |
| $\bar{d} \pm t_{\alpha/2,\;n-1} \times s_d/\sqrt{n}$ | 95% CI for mean difference |
| $s_d^2 = s_1^2 + s_2^2 - 2r_{12}s_1s_2$ | $s_d$ from raw-score statistics |
| $r_{12} = (s_1^2+s_2^2-s_d^2)/(2s_1s_2)$ | Pre-post correlation from summary stats |
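The core formulas can be verified end-to-end on a small made-up dataset. The numbers are illustrative, and the critical value $t_{.025,\,9} = 2.262$ is hard-coded rather than computed:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical pre/post data for n = 10 pairs.
x1 = [12.0, 14.5, 11.2, 15.8, 13.0, 12.6, 14.1, 15.0, 11.8, 13.5]
x2 = [11.1, 13.9, 10.8, 14.6, 12.4, 12.5, 13.0, 14.2, 11.5, 12.4]
n = len(x1)

d = [a - b for a, b in zip(x1, x2)]    # d_i = x_{1i} - x_{2i}
d_bar = mean(d)                        # mean difference
s_d = stdev(d)                         # SD of differences
se = s_d / sqrt(n)                     # standard error of the mean difference
t = (d_bar - 0.0) / se                 # paired t-statistic (delta_0 = 0)
df = n - 1                             # degrees of freedom

# 95% CI for the mean difference (t_{.025, 9} = 2.262, hard-coded).
ci = (d_bar - 2.262 * se, d_bar + 2.262 * se)

# Cross-check the summary-stats identity: s_d^2 = s1^2 + s2^2 - 2 r12 s1 s2.
s1, s2 = stdev(x1), stdev(x2)
m1, m2 = mean(x1), mean(x2)
r12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / ((n - 1) * s1 * s2)
s_d_from_summary = sqrt(s1 ** 2 + s2 ** 2 - 2 * r12 * s1 * s2)

print(t, df, ci, s_d, s_d_from_summary)
```

The two routes to $s_d$ (raw differences vs. summary statistics) agree to floating-point precision, which is a useful sanity check when only summary data are available.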

Effect Size Formulas

| Formula | Description |
|---|---|
| $d_z = \bar{d}/s_d = t/\sqrt{n}$ | Cohen's $d_z$ (most common for paired) |
| $g_z = d_z \times (1-3/(4n-5))$ | Hedges' $g_z$ (bias-corrected) |
| $d_{av} = \bar{d}/s_{av},\quad s_{av}=(s_1+s_2)/2$ | Cohen's $d_{av}$ (comparable to between) |
| $d_{rm} = d_{av}\sqrt{2(1-r_{12})}$ | Corrected $d$ (most cross-design comparable) |
| $\Delta = \bar{d}/s_1$ | Glass's $\Delta$ (baseline standardiser) |
| $d_z = d_{av}/\sqrt{2(1-r_{12})}$ | Converting $d_{av}$ to $d_z$ |
| $d_{av} = d_z\sqrt{2(1-r_{12})}$ | Converting $d_z$ to $d_{av}$ |
| $CL = \Phi(d_z/\sqrt{2})$ | Common Language Effect Size |
| $SE_{d_z} \approx \sqrt{1/n + d_z^2/(2(n-1))}$ | Approximate SE for CI of $d_z$ |
| $\lambda = d_z\sqrt{n}$ | Non-centrality parameter for power |
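These effect-size formulas translate directly into code. A sketch on invented pre/post data, using `statistics.NormalDist` for $\Phi$:

```python
from math import sqrt
from statistics import mean, stdev, NormalDist

# Hypothetical pre/post data for n = 10 pairs.
x1 = [12.0, 14.5, 11.2, 15.8, 13.0, 12.6, 14.1, 15.0, 11.8, 13.5]
x2 = [11.1, 13.9, 10.8, 14.6, 12.4, 12.5, 13.0, 14.2, 11.5, 12.4]
n = len(x1)
d = [a - b for a, b in zip(x1, x2)]

d_z = mean(d) / stdev(d)              # Cohen's d_z
t = d_z * sqrt(n)                     # since d_z = t / sqrt(n)
g_z = d_z * (1 - 3 / (4 * n - 5))     # Hedges' small-sample bias correction
s_av = (stdev(x1) + stdev(x2)) / 2
d_av = mean(d) / s_av                 # Cohen's d_av
cl = NormalDist().cdf(d_z / sqrt(2))  # Common Language Effect Size
u3 = NormalDist().cdf(d_z)            # Cohen's U3

print(d_z, t, g_z, d_av, cl, u3)
```

Note that $g_z$ is always slightly smaller in magnitude than $d_z$, and that $d_z$ and $d_{av}$ can diverge substantially when the within-pair correlation is high.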

TOST Equivalence Test Formulas

| Formula | Description |
|---|---|
| $t_1 = (\bar{d}+\Delta_L)/(s_d/\sqrt{n})$ | Lower TOST t-statistic |
| $t_2 = (\bar{d}-\Delta_U)/(s_d/\sqrt{n})$ | Upper TOST t-statistic |
| 90% CI within $(-\Delta_L, \Delta_U)$ | Equivalence decision criterion |
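A sketch of the TOST procedure from these formulas. The difference scores, the equivalence bounds of $\pm 0.5$ units, and the hard-coded one-tailed critical value $t_{.05,\,9} = 1.833$ are illustrative assumptions (here the signed lower bound $-\Delta_L$ is written as `lower`, so $(\bar{d} + \Delta_L)$ becomes `mean(d) - lower`):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical difference scores for n = 10 pairs, and assumed
# equivalence bounds of +/- 0.5 units (smallest effect of interest).
d = [0.3, -0.2, 0.1, 0.0, 0.2, -0.1, 0.1, -0.3, 0.2, -0.1]
lower, upper = -0.5, 0.5
n = len(d)
se = stdev(d) / sqrt(n)
t_crit = 1.833                 # one-tailed t_{.05, 9}, hard-coded

t1 = (mean(d) - lower) / se    # test against the lower bound
t2 = (mean(d) - upper) / se    # test against the upper bound

# Equivalent decision via the 90% CI: declare equivalence only if
# the entire interval falls inside the bounds.
ci90 = (mean(d) - t_crit * se, mean(d) + t_crit * se)
equivalent = (t1 > t_crit) and (t2 < -t_crit)

print(t1, t2, ci90, equivalent)
```

The two-statistics criterion and the 90%-CI criterion always agree, which is why either can be reported.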

Effect Size Variant Comparison

| Variant | Denominator | Comparable To | When to Use |
|---|---|---|---|
| $d_z$ | $s_d$ | Other paired designs only | Within-study; paired vs. paired |
| $g_z$ | $s_d$ (corrected) | Other paired designs | Small samples ($n < 20$) |
| $d_{av}$ | $(s_1+s_2)/2$ | Between-subjects $d$ | Cross-design comparison |
| $d_{rm}$ | $(s_1+s_2)/2$, corrected | Most generalised | Meta-analysis; cross-design |
| Glass's $\Delta$ | $s_1$ (pre-test) | Between-subjects from baseline | Pre-post change from baseline |

Cohen's Benchmarks for dzd_z

| $\vert d_z \vert$ | Label | $CL$ (%) | $U_3$ (%) |
|---|---|---|---|
| $< 0.10$ | Tiny | $< 52.8\%$ | $< 54.0\%$ |
| $0.10 - 0.19$ | Very small | $52.8 - 55.3\%$ | $54.0 - 57.5\%$ |
| $0.20 - 0.49$ | Small | $55.6 - 63.4\%$ | $57.9 - 68.8\%$ |
| $0.50 - 0.79$ | Medium | $63.8 - 71.1\%$ | $69.1 - 78.5\%$ |
| $0.80 - 1.19$ | Large | $71.4 - 80.0\%$ | $78.8 - 88.3\%$ |
| $1.20 - 1.99$ | Very large | $80.2 - 92.0\%$ | $88.5 - 97.7\%$ |
| $\geq 2.00$ | Huge | $\geq 92.1\%$ | $\geq 97.7\%$ |

Required Sample Size (Pairs) — Two-Tailed α=.05\alpha = .05

| $d_z$ | Power = 0.70 | Power = 0.80 | Power = 0.90 | Power = 0.95 |
|---|---|---|---|---|
| 0.20 | 185 | 264 | 354 | 434 |
| 0.30 | 83 | 119 | 160 | 196 |
| 0.40 | 47 | 67 | 90 | 111 |
| 0.50 | 31 | 44 | 59 | 73 |
| 0.60 | 22 | 31 | 42 | 52 |
| 0.80 | 13 | 18 | 24 | 30 |
| 1.00 | 9 | 13 | 17 | 21 |
| 1.20 | 7 | 9 | 13 | 15 |
| 1.50 | 5 | 7 | 9 | 11 |
| 2.00 | 4 | 5 | 6 | 8 |
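For quick back-of-envelope planning, the normal approximation $n \approx ((z_{1-\alpha/2} + z_{\text{power}})/d_z)^2$ gives a rough lower bound on the required number of pairs. This sketch is an approximation only: exact planning uses the noncentral t distribution (as DataStatPro does), and the results will not reproduce the table above exactly.

```python
from math import ceil
from statistics import NormalDist

def pairs_needed(dz, power=0.80, alpha=0.05):
    """Approximate pairs needed for a two-tailed paired t-test,
    via the normal approximation plus a crude small-sample buffer."""
    z = NormalDist().inv_cdf
    n = ((z(1 - alpha / 2) + z(power)) / dz) ** 2
    return ceil(n) + 2  # buffer for using t rather than z

for dz in (0.2, 0.5, 0.8):
    print(dz, pairs_needed(dz))
```

As with the table, required $n$ scales roughly with $1/d_z^2$: halving the expected effect quadruples the required number of pairs.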

APA 7th Edition Reporting Templates

Full APA report (raw data available):

"A paired samples t-test was conducted to examine whether [DV] differed between [Condition 1] and [Condition 2]. Difference scores were [normally / not normally] distributed, as assessed by Shapiro-Wilk ($W =$ [value], $p =$ [value]). [Condition 1] ($M =$ [value], $SD =$ [value]) [was / was not] significantly [higher / lower] than [Condition 2] ($M =$ [value], $SD =$ [value]), $t(n-1) =$ [value], $p =$ [value], $d_z =$ [value] [95% CI: LB, UB]. The mean difference of [value] [units] [95% CI: LB, UB] represents a [small / medium / large] within-subjects effect."

Compact format (for results section):

$t(n-1) =$ [value], $p =$ [value], $d_z =$ [value] [95% CI: LB, UB], $M_d =$ [value] [units] [95% CI: LB, UB].

Non-significant result with equivalence:

"The mean difference was not statistically significant, $t(n-1) =$ [value], $p =$ [value], $d_z =$ [value] [95% CI: LB, UB]. A TOST equivalence test with bounds of $\pm\Delta$ [units] [demonstrated / failed to demonstrate] equivalence at $\alpha = .05$, with the 90% CI [LB, UB] falling [entirely within / outside] the equivalence interval."

Bayesian paired t-test:

"A Bayesian paired t-test with the default Cauchy prior ($r = \sqrt{2}/2$) provided [extreme / very strong / strong / moderate / anecdotal / no] evidence for [H$_1$: $\mu_d \neq 0$ / H$_0$: $\mu_d = 0$], $BF_{10} =$ [value]."

Test Decision Flowchart

```
Two related conditions, continuous DV?
├── YES
│   └── Are difference scores approximately normally distributed?
│       (Check: Shapiro-Wilk on d_i; Q-Q plot of d_i)
│       ├── YES (or n ≥ 30)
│       │   └── Paired t-test ✅
│       │       ├── Significant t: Report t, p, CI, d_z, d_av
│       │       ├── Non-significant: Report CI, sensitivity analysis
│       │       └── Claiming equivalence: Add TOST
│       └── NO (and n < 30)
│           └── Wilcoxon Signed-Rank Test ✅
│               └── Report W, z, p, r_rb
└── NO
    ├── Ordinal DV → Wilcoxon Signed-Rank Test
    └── 3+ conditions → Repeated Measures ANOVA
```

Assumption Checks Reference

| Assumption | Check | Tool | Action if Violated |
|---|---|---|---|
| Normality of $d_i$ | Shapiro-Wilk on differences | `shapiro.test(d)` in R | Wilcoxon signed-rank; transform |
| Independence of pairs | Design review | Study protocol | Multilevel model if clustered |
| Correct pairing | ID matching | Inspect data file | Re-match; verify data entry |
| Interval scale | Measurement theory | Conceptual check | Wilcoxon signed-rank |
| No influential outliers | Boxplot; $\vert z_i \vert > 3$ on $d_i$ | `boxplot(d)` | Investigate; robust t-test |
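The outlier screen in the last row can be sketched in a few lines. The normality check itself is best left to a dedicated routine (R's `shapiro.test` or DataStatPro's built-in check); this example covers only the z-score screen, on invented difference scores with one suspicious value:

```python
from statistics import mean, stdev

# Hypothetical difference scores (n = 15) with one suspicious value.
d = [0.4, 0.6, 0.5, 0.3, 0.7, 0.5, 0.4, 0.6, 0.5,
     0.4, 0.6, 0.5, 0.5, 0.4, 6.0]

m, s = mean(d), stdev(d)
z = [(x - m) / s for x in d]
outliers = [(i, x) for i, (x, zi) in enumerate(zip(d, z)) if abs(zi) > 3]

# Flagged pairs should be investigated (data entry error? genuine extreme
# change?) before choosing between the paired t-test and a robust alternative.
print(outliers)
```

Note one limitation of this screen: with a sample standard deviation, $\vert z_i \vert$ cannot exceed $(n-1)/\sqrt{n}$, so for very small $n$ (below about 11) no point can ever cross the 3-SD threshold and a boxplot is more informative.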

Paired t-Test Reporting Checklist

| Item | Required |
|---|---|
| Mean and SD for each condition | ✅ Always |
| Mean and SD of difference scores | ✅ Always |
| t-statistic with $\nu = n-1$ | ✅ Always |
| Exact p-value (or $p < .001$) | ✅ Always |
| 95% CI for mean difference (original units) | ✅ Always |
| Cohen's $d_z$ with 95% CI | ✅ Always |
| Which $d$ variant reported ($d_z$, $d_{av}$, etc.) | ✅ Always |
| Sample size $n$ (number of pairs) | ✅ Always |
| Shapiro-Wilk result on differences | ✅ When $n < 50$ |
| Hedges' $g_z$ instead of $d_z$ | ✅ When $n < 20$ |
| $d_{av}$ or $d_{rm}$ alongside $d_z$ | ✅ When comparing to between-subjects |
| $r_{12}$ (pre-post/within-pair correlation) | ✅ Recommended |
| TOST result if claiming null | ✅ When claiming no meaningful difference |
| Bayes Factor | ✅ For ambiguous or null results |
| Power or sensitivity analysis | ✅ For null or inconclusive results |
| Direction of effect stated | ✅ Always |
| Domain-specific benchmark context | ✅ Recommended |

Conversion Formulas: Paired \leftrightarrow Other Metrics

| From | To | Formula |
|---|---|---|
| $t$, $n$ | $d_z$ | $d_z = t/\sqrt{n}$ |
| $d_z$, $n$ | $t$ | $t = d_z\sqrt{n}$ |
| $d_z$, $r_{12}$, $s_1$, $s_2$ | $d_{av}$ | $d_{av} = d_z\sqrt{2(1-r_{12})}$ |
| $d_{av}$, $r_{12}$ | $d_z$ | $d_z = d_{av}/\sqrt{2(1-r_{12})}$ |
| $d_{av}$, $r_{12}$ | $d_{rm}$ | $d_{rm} = d_{av}\sqrt{2(1-r_{12})}$ |
| $d_z$, $n$ | $r$ (point-biserial) | $r = d_z/\sqrt{d_z^2+1}$ (approx.) |
| $d_z$ | $CL$ | $CL = \Phi(d_z/\sqrt{2})$ |
| $d_z$ | $U_3$ | $U_3 = \Phi(d_z)$ |
| $d_z$, $r_{12}$ | $d_{between}$ (comparable) | Use $d_{rm}$ |
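The conversion table collapses into a handful of helper functions. These names are this example's own (not DataStatPro's API), and the $d_{rm}$ converter mirrors the formula as given above:

```python
from math import sqrt
from statistics import NormalDist

def dz_from_t(t, n):        return t / sqrt(n)
def t_from_dz(dz, n):       return dz * sqrt(n)
def dav_from_dz(dz, r12):   return dz * sqrt(2 * (1 - r12))
def dz_from_dav(dav, r12):  return dav / sqrt(2 * (1 - r12))
def drm_from_dav(dav, r12): return dav * sqrt(2 * (1 - r12))
def r_from_dz(dz):          return dz / sqrt(dz ** 2 + 1)   # approximation
def cl_from_dz(dz):         return NormalDist().cdf(dz / sqrt(2))
def u3_from_dz(dz):         return NormalDist().cdf(dz)

# Round trips recover the original value:
dz = dz_from_t(3.2, 25)
assert abs(t_from_dz(dz, 25) - 3.2) < 1e-12
assert abs(dz_from_dav(dav_from_dz(dz, 0.7), 0.7) - dz) < 1e-12
print(dz, cl_from_dz(dz), u3_from_dz(dz))
```

Keeping the converters paired like this makes it easy to sanity-check any reported effect size against the $t$ and $n$ in the same paper.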

This tutorial provides a comprehensive foundation for understanding, conducting, and reporting the Paired t-Test within the DataStatPro application. For further reading, consult Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018) for an applied introduction; Lakens's "Calculating and Reporting Effect Sizes to Facilitate Cumulative Science" (Frontiers in Psychology, 2013) for the critical discussion of $d_z$ vs. $d_{av}$ and $d_{rm}$; Morris & DeShon's "Combining Effect Size Estimates in Meta-Analysis With Repeated Measures and Independent-Groups Designs" (Psychological Methods, 2002) for the $d_{rm}$ formula; Rouder et al.'s "Bayesian t-Tests for Accepting and Rejecting the Null Hypothesis" (Psychonomic Bulletin & Review, 2009) for the Bayesian approach; and Lakens's "Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses" (Social Psychological and Personality Science, 2017) for TOST equivalence testing. For feature requests or support, contact the DataStatPro team.