
T-Tests and Alternatives

Comprehensive reference guide for t-tests and non-parametric alternatives.

t-Tests and Alternatives: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of hypothesis testing all the way through advanced t-test variants, non-parametric alternatives, interpretation, reporting, and practical usage within the DataStatPro application. Whether you are encountering t-tests for the first time or seeking a deeper understanding of when and how to apply parametric and non-parametric tests for comparing means, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is a t-Test?
  3. The Mathematics Behind t-Tests
  4. Assumptions of t-Tests
  5. Types of t-Tests
  6. Using the t-Test Calculator Component
  7. One-Sample t-Test
  8. Independent Samples t-Test
  9. Paired Samples t-Test
  10. Welch's t-Test — Unequal Variances
  11. Non-Parametric Alternatives
  12. Advanced Topics
  13. Worked Examples
  14. Common Mistakes and How to Avoid Them
  15. Troubleshooting
  16. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into t-tests, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 Populations, Samples, and Parameters

A population is the complete set of individuals or observations of interest. A sample is a subset drawn from the population. Parameters describe population characteristics (e.g., $\mu$, $\sigma$), while statistics describe sample characteristics (e.g., $\bar{x}$, $s$).

The t-test is an inferential procedure — it uses sample statistics to draw conclusions about unknown population parameters. The fundamental question in every t-test is: "Is the difference between observed means large enough to conclude that the true population means differ?"

1.2 The Sampling Distribution of the Mean

If we repeatedly drew samples of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$, the distribution of sample means $\bar{x}$ would itself be a distribution — the sampling distribution of the mean. By the Central Limit Theorem (CLT):

\bar{x} \sim \mathcal{N}\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{as } n \to \infty

The standard error of the mean is:

SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}

When the population $\sigma$ is unknown (as in virtually all real applications), it is estimated by the sample standard deviation $s$, giving the estimated standard error:

\widehat{SE}_{\bar{x}} = \frac{s}{\sqrt{n}}

This substitution is what necessitates the use of the t-distribution rather than the standard normal distribution.
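
As a concrete illustration, the estimated standard error can be computed with the standard library alone (a minimal sketch with made-up data, not DataStatPro's implementation):

```python
import math
import statistics

def estimated_standard_error(sample):
    """Estimated SE of the mean: s / sqrt(n), with s the sample SD (n-1 denominator)."""
    s = statistics.stdev(sample)           # sample standard deviation
    return s / math.sqrt(len(sample))

scores = [1, 2, 3, 4, 5]                   # toy data
se = estimated_standard_error(scores)      # s = sqrt(2.5), so se = sqrt(0.5) ≈ 0.7071
```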

1.3 The t-Distribution

The Student's t-distribution was derived by William Sealy Gosset (publishing under the pseudonym "Student") in 1908. It arises whenever we estimate a normally distributed population's mean using a small sample and an unknown variance.

The t-distribution is characterised by a single parameter: degrees of freedom $\nu$. As $\nu \to \infty$, the t-distribution converges to the standard normal $\mathcal{N}(0,1)$.

Key properties:

  - Symmetric and bell-shaped, centred at zero.
  - Heavier tails than the standard normal, reflecting the extra uncertainty from estimating $\sigma$ with $s$.
  - Variance $\nu/(\nu-2)$ for $\nu > 2$, always greater than 1.
  - Converges to $\mathcal{N}(0,1)$ as $\nu \to \infty$.

Critical values for common $\alpha$ levels:

| $\nu$ (df) | $t_{.025}$ (two-tailed $\alpha=.05$) | $t_{.005}$ (two-tailed $\alpha=.01$) | $t_{.0005}$ (two-tailed $\alpha=.001$) |
|---|---|---|---|
| 5 | 2.571 | 4.032 | 6.869 |
| 10 | 2.228 | 3.169 | 4.587 |
| 20 | 2.086 | 2.845 | 3.850 |
| 30 | 2.042 | 2.750 | 3.646 |
| 60 | 2.000 | 2.660 | 3.460 |
| 120 | 1.980 | 2.617 | 3.373 |
| $\infty$ | 1.960 | 2.576 | 3.291 |

1.4 Hypothesis Testing Framework

Every t-test operates within the Neyman-Pearson hypothesis testing framework:

Step 1 — State the hypotheses:

  - Null hypothesis $H_0$: no difference (e.g., $\mu = \mu_0$ or $\mu_1 = \mu_2$).
  - Alternative hypothesis $H_1$: a difference exists (two-tailed, e.g., $\mu_1 \neq \mu_2$) or lies in a specified direction (one-tailed).

Step 2 — Choose $\alpha$: The significance level is the acceptable Type I error rate (conventionally $\alpha = .05$). It is the probability of rejecting $H_0$ when it is true.

Step 3 — Compute the test statistic: The t-statistic measures how many standard errors the observed result is from the null hypothesis value.

Step 4 — Compute the p-value: The probability of observing a t-statistic at least as extreme as the one obtained, assuming $H_0$ is true.

Step 5 — Make a decision: Reject $H_0$ if $p \leq \alpha$; fail to reject $H_0$ if $p > \alpha$.

Step 6 — Compute and report the effect size with CI: Statistical significance alone is insufficient. Always accompany the t-test result with Cohen's $d$ (or equivalent) and its 95% confidence interval.

1.5 Type I and Type II Errors

| Decision | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct (Power = $1-\beta$) |
| Fail to reject $H_0$ | Correct ($1-\alpha$) | Type II error ($\beta$) |

1.6 One-Tailed vs. Two-Tailed Tests

A two-tailed test places the rejection region in both tails of the distribution and is appropriate when the direction of the effect is not specified in advance:

H_1: \mu_1 \neq \mu_2

A one-tailed test places the entire rejection region in one tail, appropriate only when a directional prediction is made before data collection on strong theoretical grounds:

H_1: \mu_1 > \mu_2 \quad \text{or} \quad H_1: \mu_1 < \mu_2

⚠️ One-tailed tests should be pre-registered and theoretically justified before data collection. Using a one-tailed test post-hoc to achieve significance is p-hacking. In the absence of a strong directional prediction, always use a two-tailed test.

1.7 Confidence Intervals and Their Relationship to t-Tests

A $(1-\alpha) \times 100\%$ confidence interval for the mean difference is directly related to the two-tailed t-test at significance level $\alpha$: the null hypothesis $H_0: \mu_1 - \mu_2 = 0$ is rejected at level $\alpha$ if and only if $0$ lies outside the $(1-\alpha) \times 100\%$ CI.

The CI provides strictly more information than the p-value — it communicates both the direction and precision of the estimate and enables assessment of practical significance.


2. What is a t-Test?

2.1 The Core Idea

A t-test is a parametric inferential statistical test used to determine whether there is a statistically significant difference between means. The general form of the t-statistic is:

t = \frac{\text{Observed difference} - \text{Null hypothesis value}}{\text{Estimated standard error of the difference}}

The denominator — the standard error — is the key: it scales the observed difference by the sampling variability, allowing us to determine whether the difference is larger than what we would typically expect from sampling variation alone.

2.2 When to Use a t-Test

A t-test is appropriate when:

  - The research question concerns one mean or the difference between two means.
  - The outcome variable is continuous (interval/ratio).
  - Observations are independent (or properly paired).
  - The data are approximately normal, or the sample is large enough for the CLT to apply.

2.3 The Three Versions of the t-Test

| t-Test Type | Research Question | Example |
|---|---|---|
| One-sample | Does a sample mean differ from a known/hypothesised value? | Is the average exam score different from 70? |
| Independent samples | Do two unrelated groups have different means? | Do males and females differ on anxiety? |
| Paired samples | Do two related measurements differ within the same units? | Does anxiety change from pre- to post-treatment? |

2.4 The t-Test in Context

The t-test is one member of a broader family of inferential tests:

| Situation | Test |
|---|---|
| One group vs. known value (normal data) | One-sample t-test |
| Two independent groups (normal, equal variances) | Student's independent t-test |
| Two independent groups (normal, unequal variances) | Welch's t-test |
| Two related groups (normal data) | Paired samples t-test |
| Two independent groups (non-normal or ordinal data) | Mann-Whitney U test |
| Two related groups (non-normal or ordinal data) | Wilcoxon signed-rank test |
| One group vs. known value (non-normal) | Wilcoxon signed-rank (one-sample) |
| More than two groups (normal data) | One-way ANOVA (F-test) |
| More than two groups (non-normal) | Kruskal-Wallis test |

2.5 Statistical Significance vs. Practical Significance

A t-test answers: "Is the observed mean difference larger than expected by chance?" It does not answer: "Is the difference large enough to matter in practice?"

With large samples, trivially small differences become statistically significant. A study comparing two teaching methods with $n = 5{,}000$ per group might find $t(9998) = 3.20$, $p = .001$, for a mean difference of 0.3 points on a 100-point scale — significant but practically meaningless.

Always report:

  1. The t-statistic and p-value (statistical significance).
  2. Cohen's $d$ or equivalent effect size (practical significance).
  3. The 95% CI for the mean difference and for the effect size.

3. The Mathematics Behind t-Tests

3.1 The One-Sample t-Statistic

The one-sample t-test tests whether a sample mean $\bar{x}$ differs significantly from a hypothesised population mean $\mu_0$:

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

Where:

  - $\bar{x}$ is the sample mean,
  - $\mu_0$ is the hypothesised population mean under $H_0$,
  - $s$ is the sample standard deviation,
  - $n$ is the sample size.

Under $H_0: \mu = \mu_0$, this statistic follows a t-distribution with $\nu = n - 1$ degrees of freedom.

The 95% CI for the population mean:

\bar{x} \pm t_{\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}

3.2 The Independent Samples t-Statistic (Student's)

The independent samples t-test (Student's version) tests whether two population means are equal, assuming homogeneity of variance:

t = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}

Where the pooled standard deviation is:

s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}

Degrees of freedom: $\nu = n_1 + n_2 - 2$

The 95% CI for the mean difference $(\mu_1 - \mu_2)$:

(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; n_1+n_2-2} \cdot s_{pooled}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
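
These formulas can be verified with a short, self-contained sketch (standard library only; the data are illustrative):

```python
import math
import statistics

def student_t_independent(g1, g2):
    """Student's independent samples t with pooled SD; returns (t, df)."""
    n1, n2 = len(g1), len(g2)
    v1, v2 = statistics.variance(g1), statistics.variance(g2)        # n-1 variances
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))  # pooled SD
    t = (statistics.mean(g1) - statistics.mean(g2)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

t, df = student_t_independent([1, 2, 3], [2, 3, 4])   # t = -sqrt(3/2), df = 4
```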

3.3 Welch's t-Statistic — Unequal Variances

Welch's t-test does not assume equal population variances. It computes a separate variance estimate for each group:

t_W = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}

The degrees of freedom are approximated by the Welch-Satterthwaite equation:

\nu_W = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1-1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2-1}}

Note: $\nu_W$ is generally non-integer and is typically rounded down. Welch's df are always $\leq n_1 + n_2 - 2$ (i.e., never more df than Student's t-test, making it the more conservative test).

The 95% CI for the mean difference:

(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; \nu_W} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

3.4 The Paired Samples t-Statistic

The paired t-test treats the data as a set of $n$ difference scores $d_i = x_{1i} - x_{2i}$ computed for each pair. It tests whether the mean difference $\bar{d}$ is significantly different from zero:

t = \frac{\bar{d}}{s_d / \sqrt{n}}

Where:

  - $\bar{d}$ is the mean of the difference scores,
  - $s_d$ is the standard deviation of the difference scores,
  - $n$ is the number of pairs.

Degrees of freedom: $\nu = n - 1$

The 95% CI for the mean difference:

\bar{d} \pm t_{\alpha/2,\; n-1} \cdot \frac{s_d}{\sqrt{n}}

The relationship between the paired t-statistic, the correlation $r_{12}$ between paired measurements, and the independent samples t-statistic:

s_d^2 = s_1^2 + s_2^2 - 2r_{12}s_1 s_2

This shows that when $r_{12} > 0$ (paired measurements are positively correlated), the paired test has a smaller denominator (less error variance) and thus greater statistical power than the independent samples test for the same data.
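
The identity above is easy to verify numerically. The sketch below (standard library only, toy data) computes $s_d^2$ both directly and via the identity, then the paired t-statistic:

```python
import math
import statistics

pre  = [5, 6, 7]
post = [3, 5, 8]
diffs = [a - b for a, b in zip(pre, post)]

# Direct route: variance of the difference scores.
sd2_direct = statistics.variance(diffs)

# Identity route: s_d^2 = s1^2 + s2^2 - 2 * r12 * s1 * s2.
s1, s2 = statistics.stdev(pre), statistics.stdev(post)
m1, m2 = statistics.mean(pre), statistics.mean(post)
cov = sum((a - m1) * (b - m2) for a, b in zip(pre, post)) / (len(pre) - 1)
r12 = cov / (s1 * s2)
sd2_identity = s1 ** 2 + s2 ** 2 - 2 * r12 * s1 * s2

# Paired t-statistic from the difference scores.
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))
```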

3.5 The p-value

The p-value is computed from the t-statistic and degrees of freedom using the cumulative distribution function (CDF) of the t-distribution:

Two-tailed p-value:

p = 2 \times P(T_\nu \geq \lvert t_{obs} \rvert) = 2 \times [1 - F_{t,\nu}(\lvert t_{obs} \rvert)]

One-tailed p-value (upper tail):

p = P(T_\nu \geq t_{obs}) = 1 - F_{t,\nu}(t_{obs})

One-tailed p-value (lower tail):

p = P(T_\nu \leq t_{obs}) = F_{t,\nu}(t_{obs})

Where $F_{t,\nu}$ is the CDF of the t-distribution with $\nu$ degrees of freedom.

3.6 Computing Effect Sizes from t-Statistics

When raw data are unavailable, effect sizes can be computed directly from the reported t-statistic:

Cohen's $d$ from independent samples t-test:

d = t\sqrt{\frac{n_1 + n_2}{n_1 n_2}} = \frac{t\sqrt{n_1+n_2}}{\sqrt{n_1 n_2}}

For equal group sizes ($n_1 = n_2 = n$):

d = \frac{2t}{\sqrt{2n}} = t\sqrt{\frac{2}{n}}

Cohen's $d_z$ from one-sample or paired t-test:

d_z = \frac{t}{\sqrt{n}}

Pearson $r$ from any t-statistic:

r = \sqrt{\frac{t^2}{t^2 + \nu}}

Where $\nu$ is the degrees of freedom. Note: this $r$ is equivalent to the point-biserial correlation between the binary group variable and the continuous outcome.
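
These conversions are one-liners. The sketch below uses the values from the hypothetical teaching-methods example in Section 2.5 ($t = 3.20$, $n = 5{,}000$ per group):

```python
import math

def d_from_t_independent(t, n1, n2):
    """Cohen's d recovered from an independent samples t-statistic."""
    return t * math.sqrt((n1 + n2) / (n1 * n2))

def r_from_t(t, df):
    """Point-biserial r recovered from any t-statistic and its df."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

d = d_from_t_independent(3.20, 5000, 5000)   # 3.20 * sqrt(10000 / 25e6) = 0.064
r = r_from_t(3.20, 9998)                     # tiny effect despite p = .001
```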

3.7 The Non-Central t-Distribution and Exact CIs for $d$

Under the alternative hypothesis (when a true effect exists), the t-statistic does not follow a central t-distribution — it follows a non-central t-distribution with non-centrality parameter:

\lambda = \frac{\mu_1 - \mu_2}{\sigma\sqrt{1/n_1 + 1/n_2}} = d \cdot \sqrt{\frac{n_1 n_2}{n_1 + n_2}}

This non-centrality parameter links the population effect size $d$ to the expected t-statistic. Exact 95% CIs for Cohen's $d$ invert this relationship numerically (no closed form exists) — a computation performed automatically by DataStatPro.

3.8 Statistical Power of the t-Test

Power is the probability that the t-test correctly rejects $H_0$ when a true effect $d$ exists:

\text{Power} = P\!\left(T_\nu(\lambda) > t_{crit}\right)

Where $T_\nu(\lambda)$ is the non-central t-distribution with non-centrality parameter:

\lambda = d\sqrt{\frac{n_1 n_2}{n_1 + n_2}} \quad \text{(independent)} \qquad \text{or} \qquad \lambda = d\sqrt{n} \quad \text{(one-sample or paired)}

For the independent samples t-test with equal groups, the approximate required sample size for power $1-\beta$ at two-sided level $\alpha$:

n_{per\;group} \approx \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}

| $d$ | Power = 0.80 ($n$/group) | Power = 0.90 ($n$/group) | Power = 0.95 ($n$/group) |
|---|---|---|---|
| 0.20 (small) | 394 | 527 | 651 |
| 0.50 (medium) | 64 | 85 | 105 |
| 0.80 (large) | 26 | 34 | 42 |
| 1.00 | 17 | 22 | 27 |
| 1.20 | 12 | 16 | 20 |
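
The normal-approximation formula can be implemented with the standard library alone ($z$ quantiles via statistics.NormalDist). Because it ignores the t-distribution's heavier tails, it runs a participant or two below the exact t-based values tabulated above:

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided independent samples t-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)            # e.g. 0.84 for power = .80
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

n = n_per_group(0.5)   # ≈ 63; the exact t-based value in the table is 64
```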

4. Assumptions of t-Tests

4.1 Normality of the Sampling Distribution

The t-test assumes that the sampling distribution of the mean difference is normal. This is satisfied when either:

  - the population distribution of the outcome is itself normal, or
  - the sample is large enough for the CLT to make the sampling distribution of the mean approximately normal (commonly $n \geq 30$ per group as a rule of thumb).

How to check:

  - Inspect histograms and Q-Q plots of each group (or of the difference scores for paired designs).
  - Run a Shapiro-Wilk test, bearing in mind that with large $n$ it flags even trivial departures from normality.

Robustness: The t-test is remarkably robust to mild non-normality, especially for larger samples. For moderate non-normality with $n \geq 20$ per group, the t-test's Type I error rate remains close to the nominal $\alpha$.

When violated: Use the Mann-Whitney U test (independent) or Wilcoxon signed-rank test (paired) as non-parametric alternatives. Consider data transformation (log, square root) if the distribution is strongly skewed.

4.2 Homogeneity of Variance (for Independent Samples t-Test)

Student's independent t-test assumes that the two populations have equal variances ($\sigma_1^2 = \sigma_2^2$). This assumption is required for the pooled standard deviation to be a valid common estimator.

How to check:

  - Levene's test (or the Brown-Forsythe variant) for equality of variances.
  - Compare the two sample variances directly; a ratio of larger to smaller above roughly 3 is a warning sign.

⚠️ A statistically significant Levene's test does not automatically invalidate Student's t-test for large equal-sized samples (the test is robust). However, when groups are unequal in size AND have unequal variances, Student's t-test can be severely anti-conservative (inflated Type I error). In this case, always use Welch's t-test.

When violated: Use Welch's t-test, which does not assume equal variances and is generally recommended as the default for independent samples comparisons (see Section 10).

4.3 Independence of Observations

Within each group, all observations must be independent — the score of one participant must not influence the score of any other. This is an assumption about the study design, not about the data, and cannot be tested statistically.

Common violations:

  - Clustered data (students within classrooms, patients within clinics): scores within a cluster are correlated.
  - Repeated measurements of the same participant analysed as if they were independent.
  - Dyadic data (couples, twins) treated as unrelated individuals.

When violated: For clustered data, use multilevel models. For repeated measures within the same participant, use the paired t-test or repeated measures ANOVA.

4.4 Scale of Measurement

t-Tests assume the dependent variable is measured on at least an interval scale — that is, the differences between values are meaningful and equal across the scale.

When violated: If the outcome is ordinal (ranked categories) or continuous but severely non-normal, use non-parametric alternatives (Mann-Whitney U, Wilcoxon signed-rank).

4.5 Random Sampling

For inferential conclusions to generalise to the population, the sample should be randomly selected from the population of interest. In practice, many research samples are convenience samples; this limits the generalisability of conclusions but does not invalidate the mathematical procedure of the t-test itself.

4.6 Absence of Influential Outliers

Extreme outliers can dramatically distort the mean and standard deviation, leading to inflated or deflated t-statistics. The t-test is sensitive to outliers, particularly in small samples.

How to check:

  - Boxplots of each group, flagging points beyond 1.5 × IQR from the quartiles.
  - Standardised ($z$) scores; values beyond $\lvert z \rvert \approx 3$ deserve scrutiny.

When outliers are present: Investigate whether outliers represent data entry errors, measurement errors, or genuine extreme values. Report analyses with and without outliers. Consider using the trimmed mean t-test or a robust alternative.

4.7 Assumption Summary Table

| Assumption | One-Sample | Independent | Paired | How to Check | Remedy if Violated |
|---|---|---|---|---|---|
| Normality | ✅ | ✅ | ✅ (differences) | Shapiro-Wilk, Q-Q | Mann-Whitney / Wilcoxon |
| Equal variances | — | ✅ | — | Levene's test | Welch's t-test |
| Independence | ✅ | ✅ (within groups) | ✅ (between pairs) | Design check | Multilevel models |
| Interval scale | ✅ | ✅ | ✅ | Measurement theory | Non-parametric test |
| No severe outliers | ✅ | ✅ | ✅ | Boxplots, $z$-scores | Trimmed mean / robust test |

5. Types of t-Tests

5.1 Decision Flowchart for Test Selection

The following logic guides selection of the appropriate t-test or alternative:

Is the outcome variable continuous (interval/ratio)?
├── NO  → Use chi-squared / Fisher's exact (categorical outcomes)
└── YES → How many groups?
          ├── MORE THAN 2 → Use ANOVA (or Kruskal-Wallis)
          └── 1 OR 2 → Are observations independent or paired?
                        ├── PAIRED (same units, two conditions)
                        │   ├── Normal differences? → Paired t-test
                        │   └── Non-normal?         → Wilcoxon signed-rank
                        └── INDEPENDENT (different participants)
                            ├── One group vs. known value?
                            │   ├── Normal?     → One-sample t-test
                            │   └── Non-normal? → Wilcoxon signed-rank (one-sample)
                            └── Two independent groups
                                ├── Normal + equal variances → Student's t-test
                                ├── Normal + unequal variances → Welch's t-test ✅ (recommended default)
                                └── Non-normal or ordinal → Mann-Whitney U
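
The flowchart's branching can be captured as a small helper function (an illustrative sketch; the function name and parameters are not a DataStatPro API):

```python
def choose_test(n_groups, continuous, paired=False, normal=True,
                equal_variances=False, vs_known_value=False):
    """Map the decision flowchart above onto a test recommendation."""
    if not continuous:
        return "chi-squared / Fisher's exact"
    if n_groups > 2:
        return "One-way ANOVA" if normal else "Kruskal-Wallis"
    if paired:
        return "Paired t-test" if normal else "Wilcoxon signed-rank"
    if vs_known_value or n_groups == 1:
        return "One-sample t-test" if normal else "Wilcoxon signed-rank (one-sample)"
    if not normal:
        return "Mann-Whitney U"
    return "Student's t-test" if equal_variances else "Welch's t-test"

choose_test(2, continuous=True, normal=True)   # → "Welch's t-test" (recommended default)
```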

5.2 Choosing Between Student's and Welch's t-Test

A persistent question in applied statistics is whether to use Student's t-test (assuming equal variances) or Welch's t-test (not assuming equal variances) for independent samples.

The consensus recommendation: Use Welch's t-test as the default for independent samples comparisons:

| Scenario | Student's t-test | Welch's t-test |
|---|---|---|
| Equal $n$, equal $\sigma$ | ✅ Correct size | ✅ Correct size |
| Equal $n$, unequal $\sigma$ | ⚠️ Slightly liberal | ✅ Correct size |
| Unequal $n$, equal $\sigma$ | ✅ Correct size | ✅ Slightly conservative |
| Unequal $n$, unequal $\sigma$ | ❌ Severely liberal | ✅ Correct size |

Simulation studies (Ruxton, 2006; Delacre et al., 2017) consistently show that Welch's t-test controls Type I error across all conditions, whereas Student's t-test fails when $n$ and $\sigma$ are both unequal. The loss of power from using Welch's test when variances are truly equal is negligible.

💡 The recommendation to default to Welch's t-test is supported by simulation evidence and is increasingly standard practice. DataStatPro reports both Student's and Welch's results by default, with Welch's highlighted as the recommended result.


6. Using the t-Test Calculator Component

The t-Test Calculator component in DataStatPro provides a comprehensive tool for running, diagnosing, visualising, and reporting t-tests and their alternatives.

Step-by-Step Guide

Step 1 — Select the Test Type

Choose from the "Test Type" dropdown:

Step 2 — Input Method

Choose how to provide the data:

💡 When using raw data, DataStatPro automatically runs Shapiro-Wilk tests for normality and Levene's test for equality of variances, and displays the results alongside the main output with colour-coded warnings for violations.

Step 3 — Specify the Null Hypothesis Value

Step 4 — Select the Alternative Hypothesis

Step 5 — Choose the Significance Level

Select $\alpha$ (default: $.05$). DataStatPro also provides results for $\alpha = .01$ and $\alpha = .001$ simultaneously for reference.

Step 6 — Select the Variance Assumption

For independent samples tests:

Step 7 — Select Display Options

Choose which outputs to display:

Step 8 — Run the Analysis

Click "Run t-Test". DataStatPro will:

  1. Compute the t-statistic, degrees of freedom, and p-value.
  2. Construct the 95% CI for the mean difference.
  3. Compute Cohen's $d$, Hedges' $g$, and their exact CIs.
  4. Run all selected assumption tests and display warnings.
  5. Generate all selected visualisations.
  6. Generate an APA 7th edition-compliant results paragraph.

7. One-Sample t-Test

7.1 Purpose and Design

The one-sample t-test answers the question: "Is the mean of my sample significantly different from a specific, theoretically or practically meaningful value $\mu_0$?"

Common applications:

  - Comparing a sample mean against a published norm or known population value.
  - Testing whether ratings differ from a scale midpoint.
  - Checking whether a process output matches a quality-control target.

7.2 Full Procedure

Given: A sample of $n$ observations with mean $\bar{x}$ and standard deviation $s$. Test $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$.

Step 1 — Compute the sample mean and SD

\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}

Step 2 — Compute the standard error

SE = \frac{s}{\sqrt{n}}

Step 3 — Compute the t-statistic

t = \frac{\bar{x} - \mu_0}{SE} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}

Step 4 — Determine degrees of freedom

\nu = n - 1

Step 5 — Compute the p-value

p = 2 \times P(T_{n-1} \geq \lvert t \rvert)

Compare to $\alpha$. Reject $H_0$ if $p \leq \alpha$.

Step 6 — Compute the 95% CI for $\mu$

\bar{x} \pm t_{\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}

Step 7 — Compute Cohen's $d$

d = \frac{\bar{x} - \mu_0}{s}

Hedges' $g$ (bias-corrected):

g = d \times \left(1 - \frac{3}{4(n-1) - 1}\right)
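
Steps 1-7 can be strung together in a few lines (a standard-library sketch with toy data; the p-value step needs a t-distribution CDF, available in SciPy as scipy.stats.t, so it is omitted here):

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """One-sample t, df, Cohen's d, and Hedges' g (p-value requires a t CDF)."""
    n = len(sample)
    xbar, s = statistics.mean(sample), statistics.stdev(sample)
    t = (xbar - mu0) / (s / math.sqrt(n))       # Steps 1-3
    d = (xbar - mu0) / s                        # Step 7: Cohen's d
    g = d * (1 - 3 / (4 * (n - 1) - 1))         # Step 7: Hedges' g
    return t, n - 1, d, g

t, df, d, g = one_sample_t([1, 2, 3, 4, 5], mu0=2)   # t = sqrt(2), df = 4
```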

7.3 Interpreting the One-Sample t-Test

| Result | Interpretation |
|---|---|
| $p \leq \alpha$ and CI excludes $\mu_0$ | Reject $H_0$: sample mean differs significantly from $\mu_0$ |
| $p > \alpha$ and CI includes $\mu_0$ | Fail to reject $H_0$: insufficient evidence of a difference |
| Large $d$, $p \leq \alpha$ | Significant AND practically meaningful departure from $\mu_0$ |
| Small $d$, $p \leq \alpha$ (large $n$) | Significant but practically negligible departure from $\mu_0$ |
| Large $d$, $p > \alpha$ (small $n$) | Non-significant due to low power; effect may be real but undetected |

8. Independent Samples t-Test

8.1 Purpose and Design

The independent samples t-test answers: "Do two independent groups have the same population mean?" It requires that the two groups are composed of entirely different participants with no systematic pairing or matching.

Common applications:

  - Randomised experiments comparing a treatment group with a control group.
  - Comparing two naturally occurring groups (e.g., two schools or two clinics) on a continuous outcome.

8.2 Full Procedure (Student's)

Given: Group 1 with $n_1$ observations ($\bar{x}_1$, $s_1$) and Group 2 with $n_2$ observations ($\bar{x}_2$, $s_2$). Test $H_0: \mu_1 = \mu_2$.

Step 1 — Compute summary statistics

\bar{x}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ji}, \qquad s_j = \sqrt{\frac{1}{n_j-1}\sum_{i=1}^{n_j}(x_{ji}-\bar{x}_j)^2}, \quad j \in \{1, 2\}

Step 2 — Check variance homogeneity (Levene's test)

Run Levene's test. If $p_{Levene} \leq .05$, favour Welch's t-test (Section 10). Regardless, reporting both is best practice.

Step 3 — Compute pooled standard deviation

s_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}

Step 4 — Compute the t-statistic

t = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}\sqrt{1/n_1 + 1/n_2}}

Step 5 — Degrees of freedom and p-value

\nu = n_1 + n_2 - 2

p = 2 \times P(T_\nu \geq \lvert t \rvert)

Step 6 — 95% CI for the mean difference

(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\;\nu} \cdot s_{pooled}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

Step 7 — Effect sizes

d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}, \qquad g = d \times \left(1 - \frac{3}{4(n_1+n_2-2)-1}\right)

Common Language Effect Size:

CL = \Phi\!\left(\frac{d}{\sqrt{2}}\right)

8.3 APA Reporting Template

"An independent samples t-test revealed [a significant / no significant] difference in [DV] between [Group 1] ($M =$ , $SD =$ ) and [Group 2] ($M =$ , $SD =$ ), $t(\nu) =$ , $p =$ , $d =$ [95% CI: , ]. This represents a [small / medium / large] effect according to Cohen's (1988) benchmarks."

Example: "An independent samples Welch's t-test revealed a significant difference in anxiety scores between the CBT group ($M = 12.3$, $SD = 4.1$) and the waitlist control group ($M = 18.7$, $SD = 5.2$), $t(57.4) = -5.62$, $p < .001$, $d = 1.38$ [95% CI: 0.87, 1.88]. This represents a large treatment effect."


9. Paired Samples t-Test

9.1 Purpose and Design

The paired samples t-test (also: dependent samples, matched pairs, or repeated measures t-test) answers: "Do two related measurements differ significantly from each other?"

When observations are paired:

  - Repeated measures: the same participants measured at two time points (pre/post) or under two conditions.
  - Matched pairs: participants matched on key variables, one of each pair assigned to each condition.
  - Natural pairs: twins, couples, or paired body sites (left/right).

Advantage over independent t-test: By comparing within-person differences, the paired design removes between-person variability from the error term, substantially increasing power when the within-person correlation $r_{12}$ is positive.

9.2 Full Procedure

Given: $n$ pairs of observations $(x_{1i}, x_{2i})$.

Step 1 — Compute difference scores

d_i = x_{1i} - x_{2i}, \qquad i = 1, 2, \ldots, n

Step 2 — Compute mean and SD of differences

\bar{d} = \frac{1}{n}\sum_{i=1}^n d_i, \qquad s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(d_i - \bar{d})^2}

Step 3 — Compute the standard error of the mean difference

SE_{\bar{d}} = \frac{s_d}{\sqrt{n}}

Step 4 — Compute the t-statistic

t = \frac{\bar{d}}{SE_{\bar{d}}} = \frac{\bar{d}}{s_d/\sqrt{n}}

Step 5 — Degrees of freedom and p-value

\nu = n - 1

p = 2 \times P(T_{n-1} \geq \lvert t \rvert)

Step 6 — 95% CI for the mean difference

\bar{d} \pm t_{\alpha/2,\; n-1} \cdot \frac{s_d}{\sqrt{n}}

Step 7 — Effect sizes

Cohen's $d_z$ (most commonly reported for paired designs):

d_z = \frac{\bar{d}}{s_d} = \frac{t}{\sqrt{n}}

Cohen's $d_{rm}$ (repeated measures $d$, accounting for the pre-post correlation):

d_{rm} = \frac{\bar{d}}{s_d} \times \sqrt{2(1 - r_{12})} = d_z \sqrt{2(1 - r_{12})}

Where $s_{av} = (s_1 + s_2)/2$, $s_d$ is the SD of the difference scores, and $r_{12}$ is the correlation between the two measurements. Note that $d_{rm}$ is more comparable to $d$ from independent samples designs than $d_z$ is.

Cohen's $d_{av}$ (standardised by the average SD):

d_{av} = \frac{\bar{d}}{s_{av}} = \frac{\bar{x}_1 - \bar{x}_2}{(s_1+s_2)/2}

9.3 Comparing Paired and Independent t-Tests for the Same Data

When data are paired (pre-post), computing the incorrect independent t-test is a serious error. The two t-statistics are related through the within-pair correlation $r_{12}$. Assuming equal SDs in the two conditions ($s_1 = s_2$):

SE_{paired} = SE_{independent} \cdot \sqrt{1 - r_{12}}

so that

t_{paired} = \frac{t_{independent}}{\sqrt{1 - r_{12}}}

When $r_{12} > 0$ (typical for repeated measures): $SE_{paired} < SE_{independent}$, so $t_{paired} > t_{independent}$ — the paired test is more powerful. When $r_{12} = 0$, the two tests are approximately equivalent (they still differ in df). When $r_{12} < 0$ (rare), the independent test is more powerful.

⚠️ Never apply an independent samples t-test to paired data. Doing so ignores the within-pair correlation, produces an inflated standard error, and loses statistical power. Conversely, applying a paired t-test to genuinely independent data violates the independence assumption of the difference scores.


10. Welch's t-Test — Unequal Variances

10.1 Why Welch's is Preferred

Welch's t-test (1947) is a modification of Student's t-test that does not assume equal population variances. It is the recommended default for independent samples comparisons for three reasons:

  1. Robustness: It maintains correct Type I error rates regardless of whether variances are equal or unequal.
  2. Negligible power loss: When variances are truly equal, Welch's test loses very little power compared to Student's.
  3. Correct coverage: The CI from Welch's has the correct nominal coverage probability across all variance ratio conditions.

10.2 Full Procedure

Step 1 — Compute group summary statistics

$\bar{x}_1, s_1, n_1$ and $\bar{x}_2, s_2, n_2$

Step 2 — Compute separate variance estimates

v_1 = \frac{s_1^2}{n_1}, \qquad v_2 = \frac{s_2^2}{n_2}

Step 3 — Compute Welch's t-statistic

t_W = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{v_1 + v_2}}

Step 4 — Compute Welch-Satterthwaite degrees of freedom

\nu_W = \frac{(v_1 + v_2)^2}{\dfrac{v_1^2}{n_1-1} + \dfrac{v_2^2}{n_2-1}}

Round $\nu_W$ down to the nearest integer for conservative inference.

Step 5 — p-value

p = 2 \times P(T_{\nu_W} \geq \lvert t_W \rvert)

Step 6 — 95% CI for mean difference

(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; \nu_W} \cdot \sqrt{v_1 + v_2}

Step 7 — Effect size (Glass's $\Delta$ or Welch's $d$)

When variances are unequal, the appropriate standardiser for Cohen's $d$ is debated. Options include:

Pooled SD (ignores heterogeneity — caution):

d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}

Glass's $\Delta$ (control group SD as standardiser — recommended for treatment/control designs):

\Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_2}

Average SD (unbiased when neither group is the reference):

d_{av} = \frac{\bar{x}_1 - \bar{x}_2}{(s_1 + s_2)/2}

💡 DataStatPro reports all three standardisers alongside Welch's t-test, with Glass's $\Delta$ highlighted when one group is a designated control, and $d_{av}$ highlighted when neither group is a natural reference.
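
Steps 1-4 of Welch's procedure can be sketched as follows (standard library only, toy data; in practice DataStatPro, or scipy.stats.ttest_ind with equal_var=False, handles this automatically):

```python
import math
import statistics

def welch_t(g1, g2):
    """Welch's t and Welch-Satterthwaite df, using separate variance estimates."""
    n1, n2 = len(g1), len(g2)
    v1 = statistics.variance(g1) / n1           # v1 = s1^2 / n1
    v2 = statistics.variance(g2) / n2           # v2 = s2^2 / n2
    t = (statistics.mean(g1) - statistics.mean(g2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

t, df = welch_t([1, 2, 3], [0, 2, 4, 6])   # t = -1/sqrt(2), df = 216/53 ≈ 4.075
```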

10.3 Student's vs. Welch's: A Direct Comparison

| Property | Student's t-test | Welch's t-test |
|---|---|---|
| Assumes equal variances | ✅ Yes | ❌ No |
| df | $n_1 + n_2 - 2$ | Welch-Satterthwaite (always $\leq$ Student's) |
| Type I error (equal $n$, unequal $\sigma$) | ≈ nominal | ≈ nominal |
| Type I error (unequal $n$, unequal $\sigma$) | ❌ Inflated | ✅ Nominal |
| Power (equal variances) | Marginally higher | ≈ equivalent |
| Recommendation | Avoid as default | ✅ Recommended default |

11. Non-Parametric Alternatives

11.1 When to Use Non-Parametric Tests

Non-parametric tests (also called distribution-free tests) are appropriate when:

  - The outcome is ordinal rather than interval/ratio.
  - The distribution is markedly non-normal and the sample is too small for the CLT to help.
  - The data contain extreme outliers that cannot be removed or justified.

Trade-off: Non-parametric tests are more robust to assumption violations but have lower statistical power than their parametric counterparts when parametric assumptions ARE met. When normality holds, using a non-parametric test discards information.

💡 Non-parametric does not mean "assumption-free." The Mann-Whitney U test assumes that the two distributions have the same shape (just shifted); violation of this shape assumption means U tests the combined null of equal location AND equal shape, not just equal medians.

11.2 Mann-Whitney U Test (Non-Parametric Independent Samples)

The Mann-Whitney U test (also Wilcoxon rank-sum test) is the non-parametric alternative to the independent samples t-test. It tests whether the distributions of two independent groups are identical (or, under the shape assumption, whether one group tends to have higher ranks than the other).

Procedure:

Step 1 — Rank all observations

Combine all $n_1 + n_2$ observations and assign ranks from 1 (smallest) to $N = n_1 + n_2$. For tied values, assign the average of the tied ranks.

Step 2 — Compute the rank sums

$$W_1 = \sum_{i=1}^{n_1} R_i \quad \text{(sum of ranks for Group 1)}$$

$$W_2 = \sum_{j=1}^{n_2} R_j \quad \text{(sum of ranks for Group 2)}$$

Check: $W_1 + W_2 = N(N+1)/2$

Step 3 — Compute U statistics

$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - W_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - W_2$$

Check: $U_1 + U_2 = n_1 n_2$

The test statistic is $U = \min(U_1, U_2)$.

Step 4 — Compute the z-approximation (for $n_1, n_2 > 10$)

Under $H_0$, without ties:

$$z = \frac{U - n_1 n_2/2}{\sqrt{n_1 n_2(N+1)/12}}$$

(A continuity correction, when applied, moves the numerator 0.5 closer to zero.) For ties, the variance requires a correction factor:

$$z = \frac{U - n_1 n_2/2}{\sqrt{\dfrac{n_1 n_2}{12}\!\left(N+1 - \dfrac{\sum_{k}(t_k^3 - t_k)}{N(N-1)}\right)}}$$

Where $t_k$ is the number of observations in the $k$-th tied group and $N = n_1 + n_2$.

Step 5 — Effect size: Rank-biserial correlation

$$r_{rb} = \frac{U_1 - U_2}{n_1 n_2}, \qquad \lvert r_{rb} \rvert = 1 - \frac{2U}{n_1 n_2} \;\text{ with } U = \min(U_1, U_2)$$

(A related $z$-based measure, Rosenthal's $r = z/\sqrt{N}$, is sometimes reported; it is not identical to $r_{rb}$.)

Interpretation: $r_{rb} = 0.5$ means that 75% of the pairwise comparisons favour Group 1 over Group 2 — the probability of superiority is $(r_{rb}+1)/2 = 0.75$.

Cohen's benchmarks for $\lvert r_{rb} \rvert$ (same as $r$):

| $\lvert r_{rb} \rvert$ | Label |
| --- | --- |
| $0.10$ | Small |
| $0.30$ | Medium |
| $0.50$ | Large |
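The five steps above can be executed by hand with `scipy.stats.rankdata` (SciPy assumed; the two small samples here are illustrative):

```python
# Mann-Whitney U by hand, following Steps 1-5 above.
import numpy as np
from scipy import stats

group1 = np.array([12, 15, 11, 18, 14])
group2 = np.array([16, 19, 17, 21])
n1, n2 = group1.size, group2.size
N = n1 + n2

ranks = stats.rankdata(np.concatenate([group1, group2]))  # Step 1 (avg ranks for ties)
w1, w2 = ranks[:n1].sum(), ranks[n1:].sum()               # Step 2
assert w1 + w2 == N * (N + 1) / 2                         # rank-sum check

u1 = n1 * n2 + n1 * (n1 + 1) / 2 - w1                     # Step 3
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - w2
assert u1 + u2 == n1 * n2                                 # U check
u = min(u1, u2)

z = (u - n1 * n2 / 2) / np.sqrt(n1 * n2 * (N + 1) / 12)   # Step 4 (no ties)
r_rb = 1 - 2 * u / (n1 * n2)                              # Step 5 (magnitude)
print(f"U = {u}, z = {z:.3f}, |r_rb| = {r_rb:.3f}")
```

The two `assert` lines are exactly the consistency checks listed in Steps 2 and 3.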

11.3 Wilcoxon Signed-Rank Test (Non-Parametric Paired)

The Wilcoxon signed-rank test is the non-parametric alternative to the paired t-test. It tests whether the distribution of difference scores is symmetric about zero.

Procedure:

Step 1 — Compute and rank absolute differences

Compute $d_i = x_{1i} - x_{2i}$. Remove pairs where $d_i = 0$. Let $n'$ = number of non-zero differences.

Rank $\lvert d_i \rvert$ from 1 (smallest) to $n'$ (largest), assigning average ranks for ties.

Step 2 — Sum positive and negative ranks

$$W^+ = \sum_{d_i > 0} R_i \quad \text{(sum of ranks of positive differences)}$$

$$W^- = \sum_{d_i < 0} R_i \quad \text{(sum of ranks of negative differences)}$$

Check: $W^+ + W^- = n'(n'+1)/2$

The test statistic is $W = \min(W^+, W^-)$.

Step 3 — z-approximation (for $n' > 10$)

$$z = \frac{W^+ - n'(n'+1)/4}{\sqrt{n'(n'+1)(2n'+1)/24}}$$

With tie correction:

$$z = \frac{W^+ - n'(n'+1)/4}{\sqrt{\dfrac{n'(n'+1)(2n'+1)}{24} - \dfrac{\sum_k(t_k^3-t_k)}{48}}}$$

Step 4 — Effect size

$$r_W = \frac{z}{\sqrt{n'}}$$

Or, the matched-pairs rank-biserial correlation:

$$r_{rb} = 1 - \frac{4W^+}{n'(n'+1)}$$
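The procedure above can be checked against SciPy's `stats.wilcoxon` (SciPy assumed; the pre/post arrays are illustrative):

```python
# Wilcoxon signed-rank by hand, verified against scipy.stats.wilcoxon.
import numpy as np
from scipy import stats

pre = np.array([10, 12, 9, 15, 11, 13, 8, 14, 12, 10, 16, 9])
post = np.array([8, 11, 10, 12, 9, 12, 7, 11, 10, 8, 13, 8])

d = pre - post
d = d[d != 0]                        # Step 1: drop zero differences
ranks = stats.rankdata(np.abs(d))    # average ranks for tied |d|
w_plus = ranks[d > 0].sum()          # Step 2
w_minus = ranks[d < 0].sum()
n = d.size
assert w_plus + w_minus == n * (n + 1) / 2   # the Step 2 check

res = stats.wilcoxon(pre, post)      # two-sided; statistic = min(W+, W-)
print(f"W+ = {w_plus}, W- = {w_minus}, scipy W = {res.statistic}, p = {res.pvalue:.4f}")
```

With these data every difference but one is positive, so $W^-$ is tiny and the test is strongly one-sided in practice.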

11.4 One-Sample Wilcoxon Signed-Rank Test

The one-sample version tests $H_0$: the population median equals $\theta_0$. Compute $d_i = x_i - \theta_0$ and apply the Wilcoxon signed-rank procedure as above.

11.5 Comparing Parametric and Non-Parametric Tests

| Property | t-Test (parametric) | Mann-Whitney / Wilcoxon (non-parametric) |
| --- | --- | --- |
| Tests | Mean difference | Distribution shift (median/rank) |
| Assumes normality | ✅ Yes | ❌ No |
| Sensitive to outliers | ✅ Yes | ❌ No (rank-based) |
| Power (when normal) | Higher | ≈ 95% efficiency of the t-test |
| Power (when non-normal) | Lower | Can exceed the t-test |
| Effect size | Cohen's $d$, Hedges' $g$ | Rank-biserial $r_{rb}$ |
| Handles ordinal data | ❌ Questionable | ✅ Appropriate |
| Interpretability | Mean difference | Probability of superiority |

⚠️ The Asymptotic Relative Efficiency (ARE) of the Mann-Whitney U test relative to the t-test is $3/\pi \approx 0.955$ for normal data — meaning you only need about 5% more observations with Mann-Whitney to achieve the same power as the t-test. This near-equality of efficiency makes Mann-Whitney a safe choice when normality is questionable.

11.6 Brunner-Munzel Test — Handling Unequal Shapes

When the two distributions have different shapes (not just different locations), the Mann-Whitney test actually tests a compound null of equal location AND equal shape. The Brunner-Munzel test (Brunner & Munzel, 2000) is a robust alternative that tests only the stochastic equality of the two groups without the shape assumption:

$$H_0: P(X_1 > X_2) + 0.5 \cdot P(X_1 = X_2) = 0.5$$

The test statistic uses ranked data with separate within-group rankings:

$$t_{BM} = \frac{n_1 n_2 (\bar{R}_1^{(int)} - \bar{R}_2^{(int)})}{N\sqrt{n_1\hat{S}_1^2 + n_2\hat{S}_2^2}}$$

Where $\bar{R}_j^{(int)}$ are internal group ranks. DataStatPro reports the Brunner-Munzel test as an option under the Non-Parametric menu when the distribution shape assumption may be violated.
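SciPy ships an implementation of this test (`scipy.stats.brunnermunzel`, assumed SciPy ≥ 1.2). A minimal sketch on simulated data with equal locations but very different spreads:

```python
# Brunner-Munzel test via SciPy on two same-location, different-shape samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=30)    # narrow distribution
y = rng.normal(0.0, 4.0, size=30)    # same centre, much wider spread

bm = stats.brunnermunzel(x, y)
print(f"Brunner-Munzel statistic = {bm.statistic:.3f}, p = {bm.pvalue:.3f}")
```

Under the Brunner-Munzel null these two samples are stochastically equal despite their different shapes, so a large p-value is the typical outcome here.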


12. Advanced Topics

12.1 Robust t-Tests — Trimmed Means and Winsorisation

Trimmed mean t-tests use the $\alpha$-trimmed mean (removing the top and bottom $\alpha$ proportion of observations) as the measure of central tendency. Yuen's (1974) $t$-test for trimmed means:

$$t_{trim} = \frac{\bar{x}_{t1} - \bar{x}_{t2}}{SE_{trim}}$$

With 20% trimming from each tail ($\alpha = 0.20$):

$$\bar{x}_{t} = \frac{\sum_{i=h+1}^{n-h} x_{(i)}}{n - 2h}, \quad h = \lfloor \alpha n \rfloor$$

$$SE_{trim} = \sqrt{\frac{W_1}{h_1(h_1-1)} + \frac{W_2}{h_2(h_2-1)}}$$

Where $W_j$ are the Winsorised sums of squared deviations, and $h_j = n_j - 2\lfloor \alpha n_j \rfloor$ are the effective (post-trimming) sample sizes.

Trimmed mean t-tests are substantially more powerful than rank-based tests for heavy-tailed symmetric distributions while maintaining good Type I error control.
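SciPy exposes Yuen's test through the `trim` argument of `stats.ttest_ind` (assumed SciPy ≥ 1.7, where this argument was added; the heavy-tailed data are simulated for illustration):

```python
# Yuen's 20%-trimmed t-test via SciPy's `trim` argument,
# compared with the untrimmed test on heavy-tailed data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
g1 = rng.standard_t(df=3, size=40) + 0.8   # heavy-tailed, shifted up
g2 = rng.standard_t(df=3, size=40)

res_plain = stats.ttest_ind(g1, g2)             # ordinary t-test
res_yuen = stats.ttest_ind(g1, g2, trim=0.2)    # Yuen's trimmed t-test

print(f"Untrimmed: t = {res_plain.statistic:.3f}, p = {res_plain.pvalue:.4f}")
print(f"Yuen 20%:  t = {res_yuen.statistic:.3f}, p = {res_yuen.pvalue:.4f}")
```

On heavy-tailed data like these, the trimmed version typically yields a more stable statistic because extreme observations no longer dominate the means and variances.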

12.2 Bootstrap t-Tests

The bootstrap t-test (Efron & Tibshirani, 1994) makes no parametric distributional assumptions. It constructs the null distribution of the t-statistic empirically:

Procedure:

  1. Compute the observed t-statistic $t_{obs}$.
  2. Centre both samples on a common mean: $x_i^* = x_i - \bar{x}_j + \bar{x}_{grand}$.
  3. Draw $B$ bootstrap samples (typically $B = 10{,}000$) from the centred samples with replacement and compute $t^*_b$ for each.
  4. The p-value is the proportion of bootstrap t-statistics exceeding $\lvert t_{obs} \rvert$.

Percentile bootstrap CI for the mean difference:

Resample from the original data and compute the mean difference for each bootstrap sample. The 95% CI is given by the 2.5th and 97.5th percentiles of the bootstrap distribution.

The bootstrap is particularly valuable for small, non-normal samples where both parametric and asymptotic approximations may be poor.
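The four steps above can be sketched directly (SciPy and NumPy assumed; $B$ is reduced to 2,000 here for speed, and the function name is ours):

```python
# Centred-resampling bootstrap t-test, following Steps 1-4 above.
import numpy as np
from scipy import stats

def bootstrap_t_test(x, y, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    t_obs = stats.ttest_ind(x, y, equal_var=False).statistic   # Step 1
    grand = np.concatenate([x, y]).mean()
    xc, yc = x - x.mean() + grand, y - y.mean() + grand        # Step 2: impose H0
    count = 0
    for _ in range(B):                                         # Step 3
        xb = rng.choice(xc, size=x.size, replace=True)
        yb = rng.choice(yc, size=y.size, replace=True)
        tb = stats.ttest_ind(xb, yb, equal_var=False).statistic
        if abs(tb) >= abs(t_obs):
            count += 1
    return t_obs, count / B                                    # Step 4: p-value

rng = np.random.default_rng(5)
x = rng.exponential(1.0, size=25) + 0.7   # skewed, shifted
y = rng.exponential(1.0, size=25)
t_obs, p_boot = bootstrap_t_test(x, y)
print(f"t_obs = {t_obs:.3f}, bootstrap p = {p_boot:.4f}")
```

Recentring both samples on the grand mean is what makes the resampling distribution a null distribution while preserving each sample's shape.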

12.3 Bayesian t-Tests

The Bayesian t-test (Rouder et al., 2009; Jeffreys, 1961) quantifies evidence for both $H_0$ (no effect) and $H_1$ (an effect exists) using the Bayes Factor ($BF_{10}$):

$$BF_{10} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}$$

$BF_{10}$ represents how many times more likely the data are under $H_1$ than under $H_0$.

For the default Bayesian t-test (Rouder et al., 2009), the prior on effect size under $H_1$ is a Cauchy distribution with scale $r$:

$$\delta \sim \text{Cauchy}(0, r)$$

The default scale $r = \sqrt{2}/2 \approx 0.707$ represents a "medium" effect size prior.

Interpreting Bayes Factors:

| $BF_{10}$ | Evidence for $H_1$ |
| --- | --- |
| $> 100$ | Extreme |
| $30 - 100$ | Very strong |
| $10 - 30$ | Strong |
| $3 - 10$ | Moderate |
| $1 - 3$ | Anecdotal |
| $1$ | No evidence (equal) |
| $1/3 - 1$ | Anecdotal for $H_0$ |
| $1/10 - 1/3$ | Moderate for $H_0$ |
| $1/30 - 1/10$ | Strong for $H_0$ |
| $< 1/30$ | Very strong for $H_0$ |

Advantages of Bayesian t-tests:

  1. They can quantify evidence for $H_0$, not only against it.
  2. Evidence can be monitored as data accumulate without an alpha correction (see Section 12.5).
  3. Prior knowledge about plausible effect sizes is incorporated explicitly.

Limitations: Sensitive to the choice of prior. Results should always be reported with the prior specification and checked for sensitivity to alternative priors.

12.4 Equivalence Testing with TOST

Standard null hypothesis testing only allows rejection of $H_0: d = 0$. When the goal is to demonstrate absence of a meaningful effect (e.g., showing that a generic drug is bioequivalent to a brand-name drug), the Two One-Sided Tests (TOST) procedure is required.

Specify equivalence bounds $[-\Delta_L, \Delta_U]$ (e.g., $d = \pm 0.20$, corresponding to a "trivially small" effect).

TOST procedure:

Test $H_{01}: \mu_1 - \mu_2 \leq -\Delta_L$ using an upper one-tailed test. Test $H_{02}: \mu_1 - \mu_2 \geq \Delta_U$ using a lower one-tailed test.

Equivalence is concluded at level $\alpha$ when both one-tailed tests reject their respective nulls — equivalently, when the $90\%$ CI (for $\alpha = .05$) for the mean difference falls entirely within $(-\Delta_L, \Delta_U)$.

💡 Note that TOST uses a 90% CI (not 95%) when $\alpha = .05$, because each one-tailed test is at the $\alpha = .05$ level. The 90% CI corresponds to two one-tailed 5% tests.

Minimum detectable equivalence with $n$ per group:

$$\Delta_{min} = t_{\alpha,\; n-1} \cdot \sqrt{\frac{2s^2}{n}} + t_{\beta,\; n-1} \cdot \sqrt{\frac{2s^2}{n}}$$
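The TOST logic above can be sketched as two one-sided Welch tests (SciPy assumed; the function name, bounds, and data are illustrative):

```python
# TOST for two independent groups via two one-sided Welch tests.
import numpy as np
from scipy import stats

def tost_welch(x, y, low, high):
    """Two one-sided Welch tests against raw-score bounds (low, high)."""
    v1 = x.var(ddof=1) / x.size
    v2 = y.var(ddof=1) / y.size
    se = np.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (x.size - 1) + v2**2 / (y.size - 1))
    diff = x.mean() - y.mean()
    p_lower = stats.t.sf((diff - low) / se, df)    # H01: diff <= low
    p_upper = stats.t.cdf((diff - high) / se, df)  # H02: diff >= high
    return diff, max(p_lower, p_upper)             # both tests must reject

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=80)
y = rng.normal(0.0, 1.0, size=80)
diff, p_tost = tost_welch(x, y, low=-0.5, high=0.5)
print(f"diff = {diff:.3f}, TOST p = {p_tost:.4f}")  # p < .05 -> conclude equivalence
```

Taking the maximum of the two one-sided p-values is the standard shortcut: equivalence is declared only when both tests reject.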

12.5 Sequential t-Tests and Alpha Spending

Traditional t-tests do not allow for interim analyses — looking at the data multiple times inflates the Type I error rate. Sequential approaches address this:

Sequential Probability Ratio Test (SPRT): Compute a likelihood ratio $\Lambda$ after each observation. Stop when $\Lambda \leq B$ (accept $H_0$) or $\Lambda \geq A$ (reject $H_0$), where $A = (1-\beta)/\alpha$ and $B = \beta/(1-\alpha)$.

Alpha spending functions (O'Brien-Fleming, Pocock): Pre-specify how the total $\alpha$ budget is distributed across planned interim and final analyses.

Bayesian sequential testing: Use Bayes Factors to monitor evidence continuously. Unlike frequentist sequential testing, Bayesian sequential testing is valid at any stopping point without an alpha correction.
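The SPRT stopping boundaries above are simple to compute; a small sketch for conventional (illustrative) error rates:

```python
# SPRT stopping boundaries A and B from the formulas above,
# for illustrative error rates alpha = .05, beta = .20 (80% power).
alpha, beta = 0.05, 0.20
A = (1 - beta) / alpha    # reject H0 when the likelihood ratio >= A
B = beta / (1 - alpha)    # accept H0 when the likelihood ratio <= B
print(f"A = {A:.2f}, B = {B:.3f}")   # A = 16.00, B = 0.211
```

So with these settings, sampling continues while the likelihood ratio stays between roughly 0.21 and 16.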

12.6 Multiple Comparisons and t-Tests

When multiple t-tests are conducted within the same study, the familywise error rate (FWER) — the probability of at least one Type I error — inflates:

$$FWER = 1 - (1-\alpha)^k$$

Where $k$ is the number of tests. For $k = 5$ independent tests at $\alpha = .05$: $FWER = 1 - (0.95)^5 = .226$.

Correction methods:

| Method | Adjusted $\alpha$ | Properties |
| --- | --- | --- |
| Bonferroni | $\alpha/k$ | Conservative; controls FWER; simple |
| Holm | Sequential Bonferroni | Less conservative than Bonferroni |
| Benjamini-Hochberg | Controls FDR | Less conservative; for exploratory work |
| Šidák | $1-(1-\alpha)^{1/k}$ | Slightly less conservative than Bonferroni |

⚠️ Corrections for multiple comparisons should be planned before data collection and applied to the entire family of tests. Post-hoc correction of selected tests is not valid. When tests are planned contrasts from a theoretically derived framework, no correction may be necessary — this should be justified explicitly.
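The FWER formula and two of the corrections from the table can be checked directly (NumPy assumed; the p-values and the `holm` helper are illustrative):

```python
# FWER inflation, Bonferroni, and Holm's sequential Bonferroni.
import numpy as np

alpha, k = 0.05, 5
fwer = 1 - (1 - alpha)**k
print(f"FWER for k={k} independent tests: {fwer:.3f}")   # ~0.226

def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: returns a reject/retain flag per test."""
    order = np.argsort(pvals)
    reject = np.zeros(len(pvals), dtype=bool)
    for step, idx in enumerate(order):
        if pvals[idx] <= alpha / (len(pvals) - step):
            reject[idx] = True
        else:
            break                     # stop at the first non-rejection
    return reject

pvals = np.array([0.001, 0.011, 0.02, 0.04, 0.30])
print("Bonferroni rejects:", pvals <= alpha / k)   # only p = .001
print("Holm rejects:      ", holm(pvals))          # p = .001 and p = .011
```

Holm rejects one more hypothesis than Bonferroni here (p = .011 ≤ .05/4) while still controlling the FWER, which is why it is described as less conservative.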

12.7 Effect Sizes for t-Tests: Choosing the Right $d$ Variant

Multiple variants of Cohen's $d$ exist for the paired design, and choosing the wrong one leads to incomparable effect sizes across studies:

| Variant | Formula | Denominator | Comparability |
| --- | --- | --- | --- |
| $d_z$ | $\bar{d}/s_d$ | SD of differences | Paired designs only |
| $d_{av}$ | $\bar{d}/s_{av}$ | Average of group SDs | Comparable to between-subjects |
| $d_{rm}$ | $d_z\sqrt{2(1-r)}$ | $s_d$, rescaled by the correlation | Makes the paired effect comparable across designs |
| $\Delta$ (Glass) | $\bar{d}/s_{pre}$ | Pre-test (baseline) SD | Treatment-control; Glass's $\Delta$ |

Which to use:

  1. $d_z$ — when planning or power-analysing another paired design.
  2. $d_{av}$ or $d_{rm}$ — when the effect must be comparable to between-subjects studies (e.g., for meta-analysis).
  3. Glass's $\Delta$ — when the baseline (pre-test) SD is the natural reference scale.

Always state explicitly which variant was computed.

12.8 Reporting t-Tests According to APA 7th Edition

The APA Publication Manual (7th ed.) requires:

  1. Test statistic: $t(\nu) =$ value
  2. p-value: $p =$ value (report the exact value; use $p < .001$ only when below .001)
  3. Effect size with 95% CI: Cohen's $d =$ [LB, UB] or Hedges' $g$
  4. Group means and standard deviations
  5. Whether Welch's correction was applied (for independent t-tests)
  6. Whether the CI is for the mean difference or the standardised effect size

Full APA template:

"[Group 1] ($M =$ , $SD =$ , $n =$ ) and [Group 2] ($M =$ , $SD =$ , $n =$ ) were compared using [Student's / Welch's] independent samples t-test. The test revealed [a significant / no significant] mean difference, $t(\nu) =$ , $p =$ , $d =$ [95% CI: , ], indicating a [small / medium / large] effect."


13. Worked Examples

Example 1: One-Sample t-Test — Comparing Response Time to a Normative Standard

A cognitive neuroscience researcher measures simple reaction times (in ms) for $n = 25$ adults diagnosed with ADHD. The published population norm for neurotypical adults is $\mu_0 = 250$ ms. The researcher tests whether the ADHD sample has a significantly different mean reaction time.

Data summary:

$$n = 25, \quad \bar{x} = 281.4 \text{ ms}, \quad s = 42.8 \text{ ms}$$

Step 1 — Standard error:

$$SE = \frac{42.8}{\sqrt{25}} = \frac{42.8}{5} = 8.56 \text{ ms}$$

Step 2 — t-statistic:

$$t = \frac{281.4 - 250}{8.56} = \frac{31.4}{8.56} = 3.668$$

Step 3 — Degrees of freedom:

$$\nu = 25 - 1 = 24$$

Step 4 — p-value:

$$p = 2 \times P(T_{24} \geq 3.668) = 2 \times 0.00062 = .001$$

Step 5 — 95% CI for $\mu$:

$$t_{.025, 24} = 2.064$$

$$281.4 \pm 2.064 \times 8.56 = 281.4 \pm 17.7 = [263.7, 299.1]$$

Step 6 — Cohen's $d$:

$$d = \frac{281.4 - 250}{42.8} = \frac{31.4}{42.8} = 0.734$$

Hedges' $g$:

$$g = 0.734 \times \left(1 - \frac{3}{4(24)-1}\right) = 0.734 \times \left(1 - \frac{3}{95}\right) = 0.734 \times 0.9684 = 0.711$$

95% CI for $d$ (approximate):

$$SE_d = \sqrt{\frac{1}{25} + \frac{0.734^2}{2(24)}} = \sqrt{0.04 + 0.01121} = \sqrt{0.0512} = 0.226$$

$$95\% \text{ CI}: 0.734 \pm 1.96(0.226) = [0.291, 1.177]$$

Common Language Effect Size:

$$CL = \Phi\!\left(\frac{0.734}{\sqrt{2}}\right) = \Phi(0.519) = 0.698$$

Summary:

| Statistic | Value | Interpretation |
| --- | --- | --- |
| $t(24)$ | $3.668$ | |
| $p$ (two-tailed) | $.001$ | Significant at $\alpha = .05$ |
| Mean difference | $31.4$ ms | ADHD group is 31.4 ms slower |
| 95% CI (ms) | $[13.7, 49.1]$ | Excludes 0; significant |
| Cohen's $d$ | $0.734$ | Medium-large effect |
| 95% CI for $d$ | $[0.291, 1.177]$ | From small to very large — wide CI |
| Hedges' $g$ | $0.711$ | Minimal bias correction |
| CL | $69.8\%$ | ADHD group exceeds norm $69.8\%$ of the time |

APA write-up: "Adults with ADHD ($M = 281.4$ ms, $SD = 42.8$ ms) showed significantly longer reaction times than the neurotypical normative mean of 250 ms, $t(24) = 3.67$, $p = .001$, $d = 0.73$ [95% CI: 0.29, 1.18]. This represents a medium-to-large deviation from the normative standard. The 95% CI for the mean difference was [13.7, 49.1] ms."
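The entire example can be reproduced from the summary statistics alone, using SciPy's t distribution (no raw data are needed):

```python
# Example 1 recomputed from its summary statistics.
import math
from scipy import stats

n, xbar, s, mu0 = 25, 281.4, 42.8, 250.0
se = s / math.sqrt(n)                       # Step 1: 8.56
t = (xbar - mu0) / se                       # Step 2: 3.668
p = 2 * stats.t.sf(abs(t), df=n - 1)        # Step 4: two-tailed p
tcrit = stats.t.ppf(0.975, df=n - 1)        # Step 5: 2.064
ci = (xbar - tcrit * se, xbar + tcrit * se)
d = (xbar - mu0) / s                        # Step 6: Cohen's d = 0.734

print(f"t({n - 1}) = {t:.3f}, p = {p:.4f}")
print(f"95% CI = [{ci[0]:.1f}, {ci[1]:.1f}], d = {d:.3f}")
```

Each line mirrors one step of the worked example, so the printed values should match the hand calculation to rounding.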


Example 2: Welch's Independent Samples t-Test — Sleep Duration by Shift Type

A workplace health researcher compares average nightly sleep duration (hours) between day-shift ($n_1 = 40$) and night-shift ($n_2 = 35$) nurses.

Summary statistics:

| Group | $n$ | Mean (hrs) | SD |
| --- | --- | --- | --- |
| Day shift | 40 | 7.21 | 1.02 |
| Night shift | 35 | 5.84 | 1.73 |

Levene's test: $F(1, 73) = 9.82$, $p = .002$ — significant heterogeneity of variances. → Use Welch's t-test.

Step 1 — Variance estimates:

$$v_1 = \frac{1.02^2}{40} = \frac{1.040}{40} = 0.02601$$

$$v_2 = \frac{1.73^2}{35} = \frac{2.993}{35} = 0.08551$$

Step 2 — Welch's t-statistic:

$$t_W = \frac{7.21 - 5.84}{\sqrt{0.02601 + 0.08551}} = \frac{1.37}{0.3339} = 4.103$$

Step 3 — Welch-Satterthwaite df:

$$\nu_W = \frac{(0.02601 + 0.08551)^2}{\dfrac{0.02601^2}{39} + \dfrac{0.08551^2}{34}} = \frac{0.012437}{0.0000174 + 0.000215} = \frac{0.012437}{0.000232} = 53.6$$

Rounded down: $\nu_W = 53$.

Step 4 — p-value:

$$p = 2 \times P(T_{53} \geq 4.103) < .001$$

Step 5 — 95% CI:

$$t_{.025, 53} = 2.006$$

$$(7.21 - 5.84) \pm 2.006 \times 0.3339 = 1.37 \pm 0.670 = [0.700, 2.040]$$

Step 6 — Effect sizes:

$$s_{pooled} = \sqrt{\frac{39(1.02)^2 + 34(1.73)^2}{73}} = \sqrt{\frac{40.56 + 101.78}{73}} = \sqrt{1.949} = 1.396$$

Cohen's $d$:

$$d = \frac{7.21 - 5.84}{1.396} = \frac{1.37}{1.396} = 0.981$$

Glass's $\Delta$ (using the night-shift SD as the standardiser — the "comparison" group):

$$\Delta = \frac{7.21 - 5.84}{1.73} = \frac{1.37}{1.73} = 0.792$$

Average SD: $s_{av} = (1.02 + 1.73)/2 = 1.375$

$$d_{av} = \frac{1.37}{1.375} = 0.996$$

Summary:

| Statistic | Value |
| --- | --- |
| Levene's $F$ | $9.82$, $p = .002$ (unequal variances confirmed) |
| $t_W(53.6)$ | $4.103$ |
| $p$ (two-tailed) | $< .001$ |
| Mean difference | $1.37$ hrs (day > night) |
| 95% CI (hrs) | $[0.700, 2.040]$ |
| Cohen's $d$ | $0.981$ (Large) |
| Glass's $\Delta$ | $0.792$ (Large) |
| $d_{av}$ | $0.996$ (Large) |

APA write-up: "Day-shift nurses ($M = 7.21$ hrs, $SD = 1.02$) slept significantly longer than night-shift nurses ($M = 5.84$ hrs, $SD = 1.73$). Due to significant variance heterogeneity (Levene's $F(1, 73) = 9.82$, $p = .002$), Welch's t-test was applied. Results indicated a significant difference, $t_W(53.6) = 4.10$, $p < .001$, $d = 0.98$ [95% CI: 0.54, 1.43], representing a large effect. Night-shift nurses slept on average 1.37 hours less per night [95% CI: 0.70, 2.04]."
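SciPy can run this test straight from the summary statistics via `stats.ttest_ind_from_stats` (assumed available in SciPy ≥ 0.16):

```python
# Example 2 recomputed from summary statistics with SciPy.
from scipy import stats

res = stats.ttest_ind_from_stats(
    mean1=7.21, std1=1.02, nobs1=40,
    mean2=5.84, std2=1.73, nobs2=35,
    equal_var=False,                 # Welch's test
)
print(f"t_W = {res.statistic:.3f}, p = {res.pvalue:.6f}")
```

The statistic should reproduce the hand-computed $t_W = 4.103$ to rounding, with $p < .001$.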


Example 3: Paired Samples t-Test — Pre-Post Mindfulness Intervention

A clinical psychologist tests whether an 8-week mindfulness-based stress reduction (MBSR) programme reduces perceived stress. Perceived Stress Scale (PSS-10; range 0–40) scores are recorded before and after the programme for $n = 20$ participants.

Summary statistics:

| Measurement | Mean | SD | $r$ (pre-post) |
| --- | --- | --- | --- |
| Pre-MBSR | 24.7 | 5.8 | |
| Post-MBSR | 18.3 | 5.1 | $r_{12} = 0.74$ |
| Differences ($d_i = \text{pre} - \text{post}$) | 6.4 | 4.1 | |

Step 1 — t-statistic:

$$t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{6.4}{4.1/\sqrt{20}} = \frac{6.4}{0.917} = 6.979$$

Step 2 — Degrees of freedom and p-value:

$$\nu = 20 - 1 = 19$$

$$p = 2 \times P(T_{19} \geq 6.979) < .001$$

Step 3 — 95% CI for mean difference:

$$t_{.025, 19} = 2.093$$

$$6.4 \pm 2.093 \times 0.917 = 6.4 \pm 1.919 = [4.48, 8.32]$$

Step 4 — Effect sizes:

$$d_z = \frac{6.4}{4.1} = 1.561$$

$$d_{av} = \frac{6.4}{(5.8+5.1)/2} = \frac{6.4}{5.45} = 1.174$$

$$d_{rm} = d_z\sqrt{2(1-r_{12})} = 1.561 \times \sqrt{2(1-0.74)} = 1.561 \times 0.721 = 1.126$$

Note the difference: $d_z$ is the largest because the high pre-post correlation ($r_{12} = 0.74$) shrinks the SD of the differences; $d_{av}$ standardises by the average group SD and is comparable to a between-subjects $d$; $d_{rm}$ adjusts $d_z$ for the correlation between measures. Specify which variant is reported.

Comparison: what if the independent t-test had been (incorrectly) applied?

$$s_{pooled} = \sqrt{\frac{19(5.8^2) + 19(5.1^2)}{38}} = \sqrt{\frac{639.16 + 494.19}{38}} = \sqrt{29.83} = 5.462$$

$$t_{independent} = \frac{24.7 - 18.3}{5.462\sqrt{1/20+1/20}} = \frac{6.4}{5.462 \times 0.3162} = \frac{6.4}{1.727} = 3.706$$

The paired test ($t = 6.98$) is substantially more powerful than the incorrect independent test ($t = 3.71$) — reflecting the benefit of removing between-person variance through pairing.

Summary:

| Statistic | Value |
| --- | --- |
| $t(19)$ | $6.979$ |
| $p$ (two-tailed) | $< .001$ |
| Mean reduction | $6.4$ PSS points |
| 95% CI for difference | $[4.48, 8.32]$ |
| $d_z$ | $1.561$ |
| $d_{av}$ | $1.174$ |
| $d_{rm}$ | $1.126$ |
| $t$ (if independent, incorrect) | $3.706$ |

APA write-up: "Perceived stress scores decreased significantly from pre-MBSR ($M = 24.7$, $SD = 5.8$) to post-MBSR ($M = 18.3$, $SD = 5.1$), $t(19) = 6.98$, $p < .001$, $d_z = 1.56$ [95% CI: 0.99, 2.11]. The mean reduction of 6.4 PSS points (95% CI: [4.48, 8.32]) represents a large within-person effect."
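As with Example 1, the paired test can be reproduced from the difference-score summary alone (SciPy assumed):

```python
# Example 3 recomputed from the difference-score summary statistics.
import math
from scipy import stats

n, dbar, sd = 20, 6.4, 4.1
se = sd / math.sqrt(n)
t = dbar / se                                # 6.98
p = 2 * stats.t.sf(t, df=n - 1)              # two-tailed p
tcrit = stats.t.ppf(0.975, df=n - 1)         # 2.093
ci = (dbar - tcrit * se, dbar + tcrit * se)  # [4.48, 8.32]
d_z = dbar / sd                              # 1.561

print(f"t({n - 1}) = {t:.3f}, p = {p:.2e}, d_z = {d_z:.3f}")
print(f"95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

This works because the paired t-test is just a one-sample t-test on the difference scores.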


Example 4: Mann-Whitney U Test — Non-Parametric Independent Comparison

A researcher compares pain ratings (0–10 scale, ordinal) between two physiotherapy protocols. Shapiro-Wilk tests indicate non-normality in both groups. Group 1 (Protocol A, $n_1 = 8$): ratings $\{3, 5, 2, 7, 4, 6, 3, 5\}$. Group 2 (Protocol B, $n_2 = 7$): ratings $\{7, 8, 6, 9, 7, 8, 6\}$.

Step 1 — Rank all $N = 15$ observations:

Combined sorted values: 2, 3, 3, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 9

| Value | Freq | Ranks | Avg Rank | Group |
| --- | --- | --- | --- | --- |
| 2 | 1 | 1 | 1.0 | A |
| 3 | 2 | 2–3 | 2.5 | A, A |
| 4 | 1 | 4 | 4.0 | A |
| 5 | 2 | 5–6 | 5.5 | A, A |
| 6 | 3 | 7–9 | 8.0 | A, B, B |
| 7 | 3 | 10–12 | 11.0 | A, B, B |
| 8 | 2 | 13–14 | 13.5 | B, B |
| 9 | 1 | 15 | 15.0 | B |

Step 2 — Rank sums:

$$W_1 = 1.0 + 2.5 + 2.5 + 4.0 + 5.5 + 5.5 + 8.0 + 11.0 = 40.0$$

$$W_2 = 8.0 + 8.0 + 11.0 + 11.0 + 13.5 + 13.5 + 15.0 = 80.0$$

Check: $40.0 + 80.0 = 120 = 15 \times 16/2$

Step 3 — U statistics:

$$U_1 = 8 \times 7 + \frac{8 \times 9}{2} - 40 = 56 + 36 - 40 = 52$$

$$U_2 = 8 \times 7 + \frac{7 \times 8}{2} - 80 = 56 + 28 - 80 = 4$$

Check: $52 + 4 = 56 = n_1 n_2$

Test statistic: $U = \min(52, 4) = 4$

Step 4 — z-approximation:

$$\mu_U = \frac{n_1 n_2}{2} = \frac{56}{2} = 28$$

$$\sigma_U = \sqrt{\frac{8 \times 7 \times 16}{12}} = \sqrt{\frac{896}{12}} = \sqrt{74.67} = 8.64 \quad \text{(without tie correction)}$$

$$z = \frac{4 - 28}{8.64} = \frac{-24}{8.64} = -2.778$$

$$p = 2 \times P(Z \leq -2.778) = 2 \times .0027 = .005$$

Step 5 — Rank-biserial correlation:

$$\lvert r_{rb} \rvert = 1 - \frac{2 \times 4}{56} = 1 - 0.143 = 0.857$$

(Rosenthal's $r = \lvert z \rvert/\sqrt{N} = 2.778/\sqrt{15} = 0.717$ is a related but distinct $z$-based effect size.)

Interpretation: Protocol B produces substantially higher pain ratings — $r_{rb} = 0.857$ indicates a large effect (Protocol A ranks lower/better in $\approx 93\%$ of pairwise comparisons).

Summary:

| Statistic | Value |
| --- | --- |
| $U$ | $4$ |
| $z$ (approximate) | $-2.78$ |
| $p$ (two-tailed) | $.005$ |
| $r_{rb}$ | $0.857$ (Large) |
| Median Protocol A | $4.5$ |
| Median Protocol B | $7.0$ |

APA write-up: "Due to non-normal distributions, a Mann-Whitney U test was conducted. Protocol A ($\text{Mdn} = 4.5$) produced significantly lower pain ratings than Protocol B ($\text{Mdn} = 7.0$), $U = 4$, $z = -2.78$, $p = .005$, $r_{rb} = 0.86$, indicating a large effect."
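The example can be verified with SciPy's `stats.mannwhitneyu`. Note that SciPy reports the U statistic for the first sample, so $\min(U_1, U_2)$ is recovered via the identity $U_1 + U_2 = n_1 n_2$:

```python
# Verifying Example 4 with SciPy.
from scipy import stats

protocol_a = [3, 5, 2, 7, 4, 6, 3, 5]
protocol_b = [7, 8, 6, 9, 7, 8, 6]

res = stats.mannwhitneyu(protocol_a, protocol_b, alternative="two-sided")
u_first = res.statistic                  # U for the first sample (ties count 0.5)
u_min = min(u_first, len(protocol_a) * len(protocol_b) - u_first)
print(f"U = {u_min}, p = {res.pvalue:.4f}")
```

With ties present, SciPy's asymptotic p-value includes the tie correction, so it may differ slightly from the uncorrected hand value of .005.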


14. Common Mistakes and How to Avoid Them

Mistake 1: Using the Independent t-Test for Paired Data

Problem: Treating pre-post measurements or matched pairs as independent samples. This ignores the within-person correlation, inflates the error term, and substantially reduces power.

Solution: Identify the study design before analysis. If each participant contributes two scores (repeated measures, matched pairs), use the paired t-test. Check whether the data file has one row per participant (paired) vs. one row per observation (independent).


Mistake 2: Defaulting to Student's t-Test Without Checking Variance Equality

Problem: SPSS, Excel, and older textbooks default to Student's t-test. When groups differ in sample size AND variance, Student's t-test can have a severely inflated Type I error rate.

Solution: Always use Welch's t-test as the default for independent samples. The cost in power when variances are equal is negligible, whereas the benefit when variances are unequal is substantial. Report Welch's results; note if Levene's test is significant.


Mistake 3: Interpreting a Non-Significant p-Value as Evidence of No Effect

Problem: Concluding that $p > .05$ means $\mu_1 = \mu_2$. A non-significant result means the data are insufficient to reject $H_0$ — it does NOT mean the null hypothesis is true.

Solution: Report the 95% CI for the mean difference alongside the p-value. A wide CI that spans from negative to positive values reflects uncertainty, not evidence of zero effect. To positively establish absence of a meaningful effect, use equivalence testing (TOST) with prespecified bounds.


Mistake 4: Reporting Only p-Values Without Effect Sizes

Problem: Reporting $t(48) = 2.11$, $p = .040$ without Cohen's $d$ conveys nothing about the magnitude of the effect. With $n = 1{,}000$, the same p-value might correspond to $d = 0.08$ (trivial); with $n = 10$, it might correspond to $d = 1.10$ (large).

Solution: Always report Cohen's $d$ (or Hedges' $g$) and its 95% CI alongside every t-test. DataStatPro computes these automatically.


Mistake 5: Switching to One-Tailed Tests After Seeing the Data

Problem: Observing that Group 1 > Group 2, then switching to a one-tailed test to achieve $p < .05$ when the two-tailed result was $p = .07$. This is p-hacking and inflates the Type I error to approximately $10\%$.

Solution: Directional hypotheses must be pre-registered before data collection and must be based on strong theoretical or prior empirical grounds. If in doubt, use a two-tailed test.


Mistake 6: Applying t-Tests to Likert Items Without Justification

Problem: Treating 5-point Likert items as interval-scale data and applying t-tests. Strictly, Likert items are ordinal — the intervals between adjacent scale points are not necessarily equal.

Solution: For a single Likert item, use the Mann-Whitney U (independent) or Wilcoxon signed-rank (paired) test. For a Likert scale (composite of multiple items), the summed score is typically treated as approximately interval, and t-tests are generally considered acceptable. Clearly state this assumption.


Mistake 7: Ignoring Outliers Before Running the t-Test

Problem: The t-test uses means, which are highly sensitive to outliers, especially in small samples. A single extreme value can drastically alter the t-statistic and p-value.

Solution: Always inspect data with boxplots and $z$-scores before running a t-test. Investigate outliers (data entry error? valid extreme value?). Report analyses with and without outliers. Consider using trimmed mean t-tests or the Mann-Whitney test when outliers cannot be removed.


Mistake 8: Confusing Statistical Power with the Probability the Null is False

Problem: Interpreting power $= 0.80$ as meaning "there is an 80% probability the null hypothesis is false, given I found $p < .05$." Power is a property of the study design computed before data collection — it is the probability of getting a significant result IF a true effect of size $d$ exists.

Solution: Understand that power is computed under $H_1$ and is not a posterior probability about $H_0$. The probability that a significant result reflects a true effect (positive predictive value) also depends on the prior probability of $H_1$ being true.


Mistake 9: Using the Wrong dd Variant and Comparing Across Designs

Problem: Reporting $d_z$ from a paired design and comparing it to $d$ from an independent samples study as if they are the same quantity. $d_z$ depends on the pre-post correlation and is typically larger than $d_{av}$ for the same mean difference.

Solution: When comparing effect sizes across designs, convert all effect sizes to a common metric. Use $d_{av}$ for paired designs when comparing to between-subjects studies. Always specify which variant of $d$ was computed.


Mistake 10: Running Multiple t-Tests Instead of ANOVA

Problem: Comparing three groups (A, B, C) with three separate t-tests (A vs. B, A vs. C, B vs. C) inflates the familywise error rate to $\approx 14\%$ instead of the nominal $5\%$.

Solution: When comparing more than two groups, use one-way ANOVA (or Kruskal-Wallis for non-parametric data) followed by appropriate post-hoc tests (Tukey HSD, Bonferroni, Games-Howell for unequal variances). Reserve t-tests for pre-planned pairwise contrasts with appropriate alpha correction.


15. Troubleshooting

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| t-statistic is extremely large ($\lvert t \rvert > 10$) | Very large $n$ or data entry error | Check for duplicate entries and errors; report the effect size — even a large $t$ may indicate a small $d$ |
| $p = 1.000$ or exactly 0 | Floating point overflow; identical group means | Check that both groups have variance; verify data coding |
| Welch's df is very small ($< 5$) | One group has very small $n$ or near-zero variance | Check data; use an exact permutation test for very small $n$ |
| Student's and Welch's give very different results | Unequal variances with unequal $n$ | Levene's test is likely significant; use Welch's result |
| Paired t-test gives larger $t$ than expected | High pre-post correlation (good — this is the efficiency gain) | Report as normal; note the within-person correlation $r_{12}$ |
| Shapiro-Wilk is significant but $n$ is large | Power of the normality test increases with $n$; minor deviations become significant | With $n \geq 30$, the CLT usually ensures a valid t-test; inspect Q-Q plots and skewness |
| Mann-Whitney gives a different conclusion than the t-test | Distribution is non-normal and the sample is small | For non-normal data, trust Mann-Whitney; report both with a note on the assumption violation |
| Effect size CI is very wide | Small sample size | Report the wide CI — it is informative about low precision; conduct an a priori power analysis for the next study |
| Cohen's $d_z$ is much larger than $d_{av}$ | High pre-post correlation ($r_{12}$ is large) | Both are correct; specify which was computed and when each is appropriate |
| Equivalence test fails despite small $d$ | Equivalence bounds are too tight for the sample size | Either increase $n$ or widen the equivalence bounds with justification |
| Negative $p$-value or $p > 1$ reported | Software error or data corruption | Re-check the data file; rerun the analysis in DataStatPro |
| One-tailed $p$ is larger than two-tailed $p$ | Effect is in the direction opposite to $H_1$ | The one-tailed test is not significant in the predicted direction; the effect is in the wrong direction |
| Bootstrap CI does not include 0 but t-test $p > .05$ | Small sample; bootstrap and t-test diverge for highly non-normal data | Investigate the distribution; report both with a rationale for the preferred method |
| $r$ computed from $t$ and $\nu$ seems too small | Correct — $r$ from $t$ is the point-biserial correlation, not Cohen's $d$ | Use $d = 2r/\sqrt{1-r^2}$ to convert to Cohen's $d$ |
| Bayes Factor is not decisive ($BF \approx 1$) | Data provide no evidence in either direction; the study is underpowered | Collect more data; report the BF as evidence of insensitivity; avoid interpreting it as supporting either hypothesis |

16. Quick Reference Cheat Sheet

Core t-Test Formulas

| Formula | Description |
| --- | --- |
| $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$ | One-sample t-statistic |
| $t = (\bar{x}_1-\bar{x}_2)/(s_p\sqrt{1/n_1+1/n_2})$ | Independent samples (Student's) |
| $s_p = \sqrt{[(n_1-1)s_1^2+(n_2-1)s_2^2]/(n_1+n_2-2)}$ | Pooled standard deviation |
| $t_W = (\bar{x}_1-\bar{x}_2)/\sqrt{s_1^2/n_1+s_2^2/n_2}$ | Welch's t-statistic |
| $\nu_W = (v_1+v_2)^2/(v_1^2/(n_1-1)+v_2^2/(n_2-1))$ | Welch-Satterthwaite df |
| $t = \bar{d}/(s_d/\sqrt{n})$ | Paired t-statistic |
| $SE_{\bar{x}} = s/\sqrt{n}$ | Standard error of the mean |
| $\bar{x} \pm t_{\alpha/2,\nu} \cdot SE$ | Confidence interval for mean |
| $p = 2 \times P(T_\nu \geq \lvert t \rvert)$ | Two-tailed p-value |

Effect Size Formulas for t-Tests

| Formula | Description |
|---|---|
| $d = (\bar{x}_1-\bar{x}_2)/s_{pooled}$ | Cohen's $d$ (independent) |
| $d_z = \bar{d}/s_d = t/\sqrt{n}$ | Cohen's $d_z$ (paired) |
| $d_{av} = \bar{d}/s_{av}$ | Cohen's $d_{av}$ (paired, comparable to between-subjects $d$) |
| $d_{rm} = d_z\sqrt{2(1-r_{12})}$ | $d_{rm}$ (corrected for dependency) |
| $\Delta = (\bar{x}_1-\bar{x}_2)/s_{control}$ | Glass's $\Delta$ |
| $g = d \times (1-3/(4\nu-1))$ | Hedges' $g$ (bias-corrected) |
| $r = \sqrt{t^2/(t^2+\nu)}$ | Point-biserial $r$ from $t$ |
| $d = t\sqrt{(n_1+n_2)/(n_1 n_2)}$ | $d$ from independent $t$ |
| $d_z = t/\sqrt{n}$ | $d_z$ from paired/one-sample $t$ |
| $d = 2r/\sqrt{1-r^2}$ | Convert $r$ to $d$ (equal groups) |
| $CL = \Phi(d/\sqrt{2})$ | Common language effect size |
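The two most commonly reported quantities in the table, Cohen's $d$ with its pooled denominator and the Hedges' $g$ bias correction, can be sketched in a few lines of plain Python (illustrative helper names, standard library only):

```python
import math
from statistics import mean, stdev

def cohens_d_independent(x1, x2):
    """Cohen's d = (x1-bar - x2-bar) / s_pooled for two independent groups."""
    n1, n2 = len(x1), len(x2)
    s_pooled = math.sqrt(((n1 - 1) * stdev(x1) ** 2 + (n2 - 1) * stdev(x2) ** 2)
                         / (n1 + n2 - 2))
    return (mean(x1) - mean(x2)) / s_pooled

def hedges_g(d, df):
    """Small-sample bias correction: g = d * (1 - 3 / (4*nu - 1))."""
    return d * (1 - 3 / (4 * df - 1))

x1, x2 = [5, 6, 7, 8, 9], [3, 4, 5, 6, 7]
d = cohens_d_independent(x1, x2)         # ≈ 1.265 (a "large" effect)
g = hedges_g(d, len(x1) + len(x2) - 2)   # ≈ 1.14, shrunk toward zero
```

Note how $g$ is always smaller in magnitude than $d$; the correction matters most for small $\nu$ and becomes negligible beyond roughly 20 per group.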

Non-Parametric Formulas

| Formula | Description |
|---|---|
| $U_1 = n_1 n_2 + n_1(n_1+1)/2 - W_1$ | Mann-Whitney $U$ statistic ($W_1$ = rank sum of group 1) |
| $z = (U - n_1 n_2/2)/\sqrt{n_1 n_2(N+1)/12}$ | Mann-Whitney $z$-approximation |
| $r_{rb} = 1 - 2U/(n_1 n_2)$ | Rank-biserial correlation (Mann-Whitney) |
| $W^+ = \sum_{d_i>0} R_i$ | Wilcoxon positive rank sum |
| $z = (W^+ - n'(n'+1)/4)/\sqrt{n'(n'+1)(2n'+1)/24}$ | Wilcoxon $z$-approximation ($n'$ = number of nonzero differences) |
| $r_W = z/\sqrt{n'}$ | Effect size for the Wilcoxon test |
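The Mann-Whitney $U_1$ and rank-biserial formulas above can be traced by hand in a short Python sketch. This assumes no tied values (ties would require midranks), and the function name is illustrative:

```python
def mann_whitney_u(x1, x2):
    """U1 = n1*n2 + n1*(n1+1)/2 - W1, where W1 is group 1's rank sum.
    Minimal sketch: assumes all values are distinct (no tie handling)."""
    n1, n2 = len(x1), len(x2)
    combined = sorted(x1 + x2)
    w1 = sum(combined.index(v) + 1 for v in x1)   # ranks start at 1
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - w1
    r_rb = 1 - 2 * u1 / (n1 * n2)                 # rank-biserial effect size
    return u1, r_rb

# Group 1 tends to rank lower, so r_rb comes out strongly negative.
u, r = mann_whitney_u([1.2, 3.4, 2.5], [4.1, 5.0, 2.9])   # U1 = 8.0
```

Be aware that software packages differ in whether they report $U_1$, $U_2 = n_1 n_2 - U_1$, or $\min(U_1, U_2)$; check which convention your output uses before computing $r_{rb}$.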

Test Selection Guide

| Design | Normal? | Equal variances? | Recommended test |
|---|---|---|---|
| 1 group vs. known value | ✅ | n/a | One-sample t-test |
| 1 group vs. known value | ❌ | n/a | Wilcoxon signed-rank |
| 2 independent groups | ✅ | Equal or unknown | Welch's t-test |
| 2 independent groups | ✅ | Known unequal | Welch's t-test |
| 2 independent groups | ❌ | n/a | Mann-Whitney U |
| 2 related groups | ✅ (differences) | n/a | Paired t-test |
| 2 related groups | ❌ (differences) | n/a | Wilcoxon signed-rank |
| $> 2$ groups | ✅ | Equal | One-way ANOVA |
| $> 2$ groups | ✅ | Unequal | Welch's ANOVA |
| $> 2$ groups | ❌ | n/a | Kruskal-Wallis |

Cohen's Benchmarks for t-Test Effect Sizes

| Label | $\lvert d \rvert$ | $\lvert r \rvert$ | $n$ per group needed |
|---|---|---|---|
| Small | 0.20 | 0.10 | 394 |
| Medium | 0.50 | 0.24 | 64 |
| Large | 0.80 | 0.37 | 26 |
| Very large | 1.20 | 0.51 | 12 |
| Huge | 2.00 | 0.71 | 5 |

All power figures assume $\alpha = .05$, two-tailed tests, 80% power, and equal group sizes.

Degrees of Freedom Reference

| Test | df |
|---|---|
| One-sample t-test | $n - 1$ |
| Independent t-test (Student's) | $n_1 + n_2 - 2$ |
| Independent t-test (Welch's) | Welch-Satterthwaite (always $\leq n_1+n_2-2$) |
| Paired t-test | $n - 1$ (where $n$ = number of pairs) |

Assumption Checks Reference

| Assumption | Test | Software function | Action if violated |
|---|---|---|---|
| Normality | Shapiro-Wilk | `shapiro.test()` | Mann-Whitney / Wilcoxon |
| Normality | Q-Q plot | `qqnorm()` | Assess visually |
| Equal variances | Levene's | `leveneTest()` | Welch's t-test |
| Equal variances | Brown-Forsythe | `bf.test()` | Welch's t-test |
| Outliers | $z$-score, boxplot | `boxplot()` | Investigate; trimmed mean |
| Independence | Design review | n/a | Multilevel model |

Confidence Interval Interpretation

| CI property | Interpretation |
|---|---|
| Entirely above zero | Effect is significantly positive at the chosen $\alpha$ |
| Entirely below zero | Effect is significantly negative at the chosen $\alpha$ |
| Contains zero | Effect is not statistically significant |
| Narrow CI | Precise estimate (large $n$) |
| Wide CI | Imprecise estimate (small $n$); interpret the point estimate cautiously |
| 90% CI within equivalence bounds | Equivalence demonstrated (TOST) |

APA 7th Edition Reporting Templates

One-sample: $t(\nu) =$ [value], $p =$ [value], $d =$ [value], 95% CI [LB, UB].

Independent samples (Welch's): $t_W(\nu_W) =$ [value], $p =$ [value], $d =$ [value], 95% CI [LB, UB].

Paired samples: $t(\nu) =$ [value], $p =$ [value], $d_z =$ [value], 95% CI [LB, UB].

Mann-Whitney: $U =$ [value], $z =$ [value], $p =$ [value], $r_{rb} =$ [value].

Wilcoxon signed-rank: $W =$ [value], $z =$ [value], $p =$ [value], $r_W =$ [value].

Required Sample Size Quick Reference

Two-sided $\alpha = .05$, two independent groups of equal size:

| Power | $d = 0.20$ | $d = 0.35$ | $d = 0.50$ | $d = 0.80$ | $d = 1.00$ |
|---|---|---|---|---|---|
| 0.70 | 310 | 102 | 50 | 20 | 14 |
| 0.80 | 394 | 130 | 64 | 26 | 17 |
| 0.90 | 527 | 174 | 85 | 34 | 22 |
| 0.95 | 651 | 215 | 105 | 42 | 27 |

All $n$ values are per group; double for total $N$.
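The tabled values can be approximated from the standard normal-approximation formula $n \approx 2(z_{1-\alpha/2} + z_{power})^2 / d^2$ per group. The sketch below uses only the Python standard library; note that this approximation runs one or two participants below the exact noncentral-$t$ values in the table, so treat its output as a lower bound:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Per-group n for a two-sided independent t-test, normal approximation:
    n ≈ 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    Slightly underestimates the exact noncentral-t answer."""
    z = NormalDist()
    n = 2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / d ** 2
    return ceil(n)

n_per_group(0.50)              # 63 (exact value in the table: 64)
n_per_group(0.20, power=0.90)  # compare with the d = 0.20 row
```

For publication-grade planning, use exact power routines (e.g., G*Power or DataStatPro's power module) rather than this approximation.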

t-Test Reporting Checklist

| Item | Required |
|---|---|
| t-statistic with sign | ✅ Always |
| Degrees of freedom | ✅ Always |
| Exact p-value (or $p < .001$) | ✅ Always |
| Mean and SD for each group | ✅ Always |
| 95% CI for the mean difference | ✅ Always |
| Cohen's $d$ or Hedges' $g$ | ✅ Always |
| 95% CI for the effect size | ✅ Always |
| Sample sizes for each group | ✅ Always |
| Whether Student's or Welch's was used | ✅ For independent t-tests |
| Levene's test result | ✅ For independent t-tests |
| Normality check result | ✅ When $n < 30$ per group |
| Which $d$ variant was used ($d_z$, $d_{av}$, etc.) | ✅ For paired designs |
| Power or sensitivity analysis | ✅ For null or inconclusive results |
| Equivalence test if claiming null | ✅ Always for null results |
| Pre-registration of one-tailed hypotheses | ✅ If a one-tailed test was used |

This tutorial provides a comprehensive foundation for understanding, conducting, and reporting t-tests and their alternatives within the DataStatPro application. For further reading, consult Gravetter and Wallnau's "Statistics for the Behavioral Sciences" (10th ed.), Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018), Wilcox's "Introduction to Robust Estimation and Hypothesis Testing" (4th ed., 2017), and Lakens's "Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses" (Social Psychological and Personality Science, 2017). For the recommendation to default to Welch's t-test, see Delacre, Lakens, and Leys (2017), "Why Psychologists Should by Default Use Welch's t-test Instead of Student's t-test" (International Review of Social Psychology). For feature requests or support, contact the DataStatPro team.