
T-Tests and Alternatives

Comprehensive reference guide for t-tests and non-parametric alternatives.

t-Tests and Alternatives: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of hypothesis testing all the way through advanced t-test variants, non-parametric alternatives, interpretation, reporting, and practical usage within the DataStatPro application. Whether you are encountering t-tests for the first time or seeking a deeper understanding of when and how to apply parametric and non-parametric tests for comparing means, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is a t-Test?
  3. The Mathematics Behind t-Tests
  4. Assumptions of t-Tests
  5. Types of t-Tests
  6. Using the t-Test Calculator Component
  7. One-Sample t-Test
  8. Independent Samples t-Test
  9. Paired Samples t-Test
  10. Welch's t-Test — Unequal Variances
  11. Non-Parametric Alternatives
  12. Advanced Topics
  13. Worked Examples
  14. Common Mistakes and How to Avoid Them
  15. Troubleshooting
  16. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into t-tests, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 Populations, Samples, and Parameters

A population is the complete set of individuals or observations of interest. A sample is a subset drawn from the population. Parameters describe population characteristics (e.g., $\mu$, $\sigma$), while statistics describe sample characteristics (e.g., $\bar{x}$, $s$).

The t-test is an inferential procedure — it uses sample statistics to draw conclusions about unknown population parameters. The fundamental question in every t-test is: "Is the difference between observed means large enough to conclude that the true population means differ?"

1.2 The Sampling Distribution of the Mean

If we repeatedly drew samples of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$, the distribution of sample means $\bar{x}$ would itself be a distribution — the sampling distribution of the mean. By the Central Limit Theorem (CLT):

\bar{x} \sim \mathcal{N}\!\left(\mu,\; \frac{\sigma^2}{n}\right) \quad \text{as } n \to \infty

The standard error of the mean is:

SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}

When the population $\sigma$ is unknown (as in virtually all real applications), it is estimated by the sample standard deviation $s$, giving the estimated standard error:

\widehat{SE}_{\bar{x}} = \frac{s}{\sqrt{n}}

This substitution is what necessitates the use of the t-distribution rather than the standard normal distribution.
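
As a concrete illustration, the estimated standard error can be computed with the standard library alone (a minimal sketch with made-up data, not DataStatPro's implementation):

```python
import math
import statistics

def estimated_standard_error(sample):
    """Estimated SE of the mean: s / sqrt(n), with s the sample SD (n-1 denominator)."""
    s = statistics.stdev(sample)           # sample standard deviation
    return s / math.sqrt(len(sample))

scores = [1, 2, 3, 4, 5]                   # toy data
se = estimated_standard_error(scores)      # s = sqrt(2.5), so se = sqrt(0.5) ≈ 0.7071
```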

1.3 The t-Distribution

The Student's t-distribution was derived by William Sealy Gosset (publishing under the pseudonym "Student") in 1908. It arises whenever we estimate a normally distributed population's mean using a small sample and an unknown variance.

The t-distribution is characterised by a single parameter: degrees of freedom $\nu$. As $\nu \to \infty$, the t-distribution converges to the standard normal $\mathcal{N}(0,1)$.

Key properties:

  - Symmetric and bell-shaped, centred at zero.
  - Heavier tails than the standard normal, reflecting the extra uncertainty from estimating $\sigma$ with $s$.
  - Variance $\nu/(\nu-2)$ for $\nu > 2$, always greater than 1.
  - Converges to $\mathcal{N}(0,1)$ as $\nu \to \infty$.

Critical values for common $\alpha$ levels:

| $\nu$ (df) | $t_{.025}$ (two-tailed $\alpha=.05$) | $t_{.005}$ (two-tailed $\alpha=.01$) | $t_{.0005}$ (two-tailed $\alpha=.001$) |
|---|---|---|---|
| 5 | 2.571 | 4.032 | 6.869 |
| 10 | 2.228 | 3.169 | 4.587 |
| 20 | 2.086 | 2.845 | 3.850 |
| 30 | 2.042 | 2.750 | 3.646 |
| 60 | 2.000 | 2.660 | 3.460 |
| 120 | 1.980 | 2.617 | 3.373 |
| $\infty$ | 1.960 | 2.576 | 3.291 |

1.4 Hypothesis Testing Framework

Every t-test operates within the Neyman-Pearson hypothesis testing framework:

Step 1 — State the hypotheses:

  - Null hypothesis $H_0$: no difference (e.g., $\mu = \mu_0$ or $\mu_1 = \mu_2$).
  - Alternative hypothesis $H_1$: a difference exists (two-tailed, e.g., $\mu_1 \neq \mu_2$) or lies in a specified direction (one-tailed).

Step 2 — Choose $\alpha$: The significance level is the acceptable Type I error rate (conventionally $\alpha = .05$). It is the probability of rejecting $H_0$ when it is true.

Step 3 — Compute the test statistic: The t-statistic measures how many standard errors the observed result is from the null hypothesis value.

Step 4 — Compute the p-value: The probability of observing a t-statistic at least as extreme as the one obtained, assuming $H_0$ is true.

Step 5 — Make a decision: Reject $H_0$ if $p \leq \alpha$; fail to reject $H_0$ if $p > \alpha$.

Step 6 — Compute and report the effect size with CI: Statistical significance alone is insufficient. Always accompany the t-test result with Cohen's $d$ (or equivalent) and its 95% confidence interval.

1.5 Type I and Type II Errors

| Decision | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct (Power = $1-\beta$) |
| Fail to reject $H_0$ | Correct ($1-\alpha$) | Type II error ($\beta$) |

1.6 One-Tailed vs. Two-Tailed Tests

A two-tailed test places the rejection region in both tails of the distribution and is appropriate when the direction of the effect is not specified in advance:

H_1: \mu_1 \neq \mu_2

A one-tailed test places the entire rejection region in one tail, appropriate only when a directional prediction is made before data collection on strong theoretical grounds:

H_1: \mu_1 > \mu_2 \quad \text{or} \quad H_1: \mu_1 < \mu_2

⚠️ One-tailed tests should be pre-registered and theoretically justified before data collection. Using a one-tailed test post-hoc to achieve significance is p-hacking. In the absence of a strong directional prediction, always use a two-tailed test.

1.7 Confidence Intervals and Their Relationship to t-Tests

A $(1-\alpha) \times 100\%$ confidence interval for the mean difference is directly related to the two-tailed t-test at significance level $\alpha$: the null hypothesis $H_0: \mu_1 - \mu_2 = 0$ is rejected at level $\alpha$ if and only if $0$ lies outside the $(1-\alpha) \times 100\%$ CI.

The CI provides strictly more information than the p-value — it communicates both the direction and precision of the estimate and enables assessment of practical significance.


2. What is a t-Test?

2.1 The Core Idea

A t-test is a parametric inferential statistical test used to determine whether there is a statistically significant difference between means. The general form of the t-statistic is:

t = \frac{\text{Observed difference} - \text{Null hypothesis value}}{\text{Estimated standard error of the difference}}

The denominator — the standard error — is the key: it scales the observed difference by the sampling variability, allowing us to determine whether the difference is larger than what we would typically expect from sampling variation alone.

2.2 When to Use a t-Test

A t-test is appropriate when:

  - The research question concerns one mean or the difference between two means.
  - The outcome variable is continuous (interval/ratio).
  - Observations are independent (or properly paired).
  - The data are approximately normal, or the sample is large enough for the CLT to apply.

2.3 The Three Versions of the t-Test

| t-Test Type | Research Question | Example |
|---|---|---|
| One-sample | Does a sample mean differ from a known/hypothesised value? | Is the average exam score different from 70? |
| Independent samples | Do two unrelated groups have different means? | Do males and females differ on anxiety? |
| Paired samples | Do two related measurements differ within the same units? | Does anxiety change from pre- to post-treatment? |

2.4 The t-Test in Context

The t-test is one member of a broader family of inferential tests:

| Situation | Test |
|---|---|
| One group vs. known value (normal data) | One-sample t-test |
| Two independent groups (normal, equal variances) | Student's independent t-test |
| Two independent groups (normal, unequal variances) | Welch's t-test |
| Two related groups (normal data) | Paired samples t-test |
| Two independent groups (non-normal or ordinal data) | Mann-Whitney U test |
| Two related groups (non-normal or ordinal data) | Wilcoxon signed-rank test |
| One group vs. known value (non-normal) | Wilcoxon signed-rank (one-sample) |
| More than two groups (normal data) | One-way ANOVA (F-test) |
| More than two groups (non-normal) | Kruskal-Wallis test |

2.5 Statistical Significance vs. Practical Significance

A t-test answers: "Is the observed mean difference larger than expected by chance?" It does not answer: "Is the difference large enough to matter in practice?"

With large samples, trivially small differences become statistically significant. A study comparing two teaching methods with $n = 5{,}000$ per group might find $t(9998) = 3.20$, $p = .001$, for a mean difference of 0.3 points on a 100-point scale — significant but practically meaningless.

Always report:

  1. The t-statistic and p-value (statistical significance).
  2. Cohen's $d$ or equivalent effect size (practical significance).
  3. The 95% CI for the mean difference and for the effect size.

3. The Mathematics Behind t-Tests

3.1 The One-Sample t-Statistic

The one-sample t-test tests whether a sample mean $\bar{x}$ differs significantly from a hypothesised population mean $\mu_0$:

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

Where:

  - $\bar{x}$ is the sample mean,
  - $\mu_0$ is the hypothesised population mean under $H_0$,
  - $s$ is the sample standard deviation,
  - $n$ is the sample size.

Under $H_0: \mu = \mu_0$, this statistic follows a t-distribution with $\nu = n - 1$ degrees of freedom.

The 95% CI for the population mean:

\bar{x} \pm t_{\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}

3.2 The Independent Samples t-Statistic (Student's)

The independent samples t-test (Student's version) tests whether two population means are equal, assuming homogeneity of variance:

t = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}

Where the pooled standard deviation is:

s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}

Degrees of freedom: $\nu = n_1 + n_2 - 2$

The 95% CI for the mean difference $(\mu_1 - \mu_2)$:

(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; n_1+n_2-2} \cdot s_{pooled}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
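
These formulas can be verified with a short, self-contained sketch (standard library only; the data are illustrative):

```python
import math
import statistics

def student_t_independent(g1, g2):
    """Student's independent samples t with pooled SD; returns (t, df)."""
    n1, n2 = len(g1), len(g2)
    v1, v2 = statistics.variance(g1), statistics.variance(g2)        # n-1 variances
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))  # pooled SD
    t = (statistics.mean(g1) - statistics.mean(g2)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

t, df = student_t_independent([1, 2, 3], [2, 3, 4])   # t = -sqrt(3/2), df = 4
```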

3.3 Welch's t-Statistic — Unequal Variances

Welch's t-test does not assume equal population variances. It computes a separate variance estimate for each group:

t_W = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}

The degrees of freedom are approximated by the Welch-Satterthwaite equation:

\nu_W = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1-1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2-1}}

Note: $\nu_W$ is generally non-integer and is typically rounded down. Welch's df are always $\leq n_1 + n_2 - 2$ (i.e., never more df than Student's t-test, making it the more conservative test).

The 95% CI for the mean difference:

(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; \nu_W} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

3.4 The Paired Samples t-Statistic

The paired t-test treats the data as a set of $n$ difference scores $d_i = x_{1i} - x_{2i}$ computed for each pair. It tests whether the mean difference $\bar{d}$ is significantly different from zero:

t = \frac{\bar{d}}{s_d / \sqrt{n}}

Where:

  - $\bar{d}$ is the mean of the difference scores,
  - $s_d$ is the standard deviation of the difference scores,
  - $n$ is the number of pairs.

Degrees of freedom: $\nu = n - 1$

The 95% CI for the mean difference:

\bar{d} \pm t_{\alpha/2,\; n-1} \cdot \frac{s_d}{\sqrt{n}}

The relationship between the paired t-statistic, the correlation $r_{12}$ between paired measurements, and the independent samples t-statistic:

s_d^2 = s_1^2 + s_2^2 - 2r_{12}s_1 s_2

This shows that when $r_{12} > 0$ (paired measurements are positively correlated), the paired test has a smaller denominator (less error variance) and thus greater statistical power than the independent samples test for the same data.
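
The identity above is easy to verify numerically. The sketch below (standard library only, toy data) computes $s_d^2$ both directly and via the identity, then the paired t-statistic:

```python
import math
import statistics

pre  = [5, 6, 7]
post = [3, 5, 8]
diffs = [a - b for a, b in zip(pre, post)]

# Direct route: variance of the difference scores.
sd2_direct = statistics.variance(diffs)

# Identity route: s_d^2 = s1^2 + s2^2 - 2 * r12 * s1 * s2.
s1, s2 = statistics.stdev(pre), statistics.stdev(post)
m1, m2 = statistics.mean(pre), statistics.mean(post)
cov = sum((a - m1) * (b - m2) for a, b in zip(pre, post)) / (len(pre) - 1)
r12 = cov / (s1 * s2)
sd2_identity = s1 ** 2 + s2 ** 2 - 2 * r12 * s1 * s2

# Paired t-statistic from the difference scores.
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))
```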

3.5 The p-value

The p-value is computed from the t-statistic and degrees of freedom using the cumulative distribution function (CDF) of the t-distribution:

Two-tailed p-value:

p = 2 \times P(T_\nu \geq \lvert t_{obs} \rvert) = 2 \times [1 - F_{t,\nu}(\lvert t_{obs} \rvert)]

One-tailed p-value (upper tail):

p = P(T_\nu \geq t_{obs}) = 1 - F_{t,\nu}(t_{obs})

One-tailed p-value (lower tail):

p = P(T_\nu \leq t_{obs}) = F_{t,\nu}(t_{obs})

Where $F_{t,\nu}$ is the CDF of the t-distribution with $\nu$ degrees of freedom.

3.6 Computing Effect Sizes from t-Statistics

When raw data are unavailable, effect sizes can be computed directly from the reported t-statistic:

Cohen's $d$ from independent samples t-test:

d = t\sqrt{\frac{n_1 + n_2}{n_1 n_2}} = \frac{t\sqrt{n_1+n_2}}{\sqrt{n_1 n_2}}

For equal group sizes ($n_1 = n_2 = n$):

d = \frac{2t}{\sqrt{2n}} = t\sqrt{\frac{2}{n}}

Cohen's $d_z$ from one-sample or paired t-test:

d_z = \frac{t}{\sqrt{n}}

Pearson $r$ from any t-statistic:

r = \sqrt{\frac{t^2}{t^2 + \nu}}

Where $\nu$ is the degrees of freedom. Note: this $r$ is equivalent to the point-biserial correlation between the binary group variable and the continuous outcome.
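
These conversions are one-liners. The sketch below uses the values from the hypothetical teaching-methods example in Section 2.5 ($t = 3.20$, $n = 5{,}000$ per group):

```python
import math

def d_from_t_independent(t, n1, n2):
    """Cohen's d recovered from an independent samples t-statistic."""
    return t * math.sqrt((n1 + n2) / (n1 * n2))

def r_from_t(t, df):
    """Point-biserial r recovered from any t-statistic and its df."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

d = d_from_t_independent(3.20, 5000, 5000)   # 3.20 * sqrt(10000 / 25e6) = 0.064
r = r_from_t(3.20, 9998)                     # tiny effect despite p = .001
```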

3.7 The Non-Central t-Distribution and Exact CIs for $d$

Under the alternative hypothesis (when a true effect exists), the t-statistic does not follow a central t-distribution — it follows a non-central t-distribution with non-centrality parameter:

\lambda = \frac{\mu_1 - \mu_2}{\sigma\sqrt{1/n_1 + 1/n_2}} = d \cdot \sqrt{\frac{n_1 n_2}{n_1 + n_2}}

This non-centrality parameter links the population effect size $d$ to the expected t-statistic. Exact 95% CIs for Cohen's $d$ invert this relationship numerically (no closed form exists) — a computation performed automatically by DataStatPro.

3.8 Statistical Power of the t-Test

Power is the probability that the t-test correctly rejects $H_0$ when a true effect $d$ exists:

\text{Power} = P\!\left(T_\nu(\lambda) > t_{crit}\right)

Where $T_\nu(\lambda)$ is the non-central t-distribution with non-centrality parameter:

\lambda = d\sqrt{\frac{n_1 n_2}{n_1 + n_2}} \quad \text{(independent)} \qquad \text{or} \qquad \lambda = d\sqrt{n} \quad \text{(one-sample or paired)}

For the independent samples t-test with equal groups, the approximate required sample size for power $1-\beta$ at two-sided level $\alpha$:

n_{per\;group} \approx \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}

| $d$ | Power = 0.80 ($n$/group) | Power = 0.90 ($n$/group) | Power = 0.95 ($n$/group) |
|---|---|---|---|
| 0.20 (small) | 394 | 527 | 651 |
| 0.50 (medium) | 64 | 85 | 105 |
| 0.80 (large) | 26 | 34 | 42 |
| 1.00 | 17 | 22 | 27 |
| 1.20 | 12 | 16 | 20 |
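
The normal-approximation formula can be implemented with the standard library alone ($z$ quantiles via statistics.NormalDist). Because it ignores the t-distribution's heavier tails, it runs a participant or two below the exact t-based values tabulated above:

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided independent samples t-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)            # e.g. 0.84 for power = .80
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

n = n_per_group(0.5)   # ≈ 63; the exact t-based value in the table is 64
```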

4. Assumptions of t-Tests

4.1 Normality of the Sampling Distribution

The t-test assumes that the sampling distribution of the mean difference is normal. This is satisfied when either:

  - the population distribution of the outcome is itself normal, or
  - the sample is large enough for the CLT to make the sampling distribution of the mean approximately normal (commonly $n \geq 30$ per group as a rule of thumb).

How to check:

  - Inspect histograms and Q-Q plots of each group (or of the difference scores for paired designs).
  - Run a Shapiro-Wilk test, bearing in mind that with large $n$ it flags even trivial departures from normality.

Robustness: The t-test is remarkably robust to mild non-normality, especially for larger samples. For moderate non-normality with $n \geq 20$ per group, the t-test's Type I error rate remains close to the nominal $\alpha$.

When violated: Use the Mann-Whitney U test (independent) or Wilcoxon signed-rank test (paired) as non-parametric alternatives. Consider data transformation (log, square root) if the distribution is strongly skewed.

4.2 Homogeneity of Variance (for Independent Samples t-Test)

Student's independent t-test assumes that the two populations have equal variances ($\sigma_1^2 = \sigma_2^2$). This assumption is required for the pooled standard deviation to be a valid common estimator.

How to check:

  - Levene's test (or the Brown-Forsythe variant) for equality of variances.
  - Compare the two sample variances directly; a ratio of larger to smaller above roughly 3 is a warning sign.

⚠️ A statistically significant Levene's test does not automatically invalidate Student's t-test for large equal-sized samples (the test is robust). However, when groups are unequal in size AND have unequal variances, Student's t-test can be severely anti-conservative (inflated Type I error). In this case, always use Welch's t-test.

When violated: Use Welch's t-test, which does not assume equal variances and is generally recommended as the default for independent samples comparisons (see Section 10).

4.3 Independence of Observations

Within each group, all observations must be independent — the score of one participant must not influence the score of any other. This is an assumption about the study design, not about the data, and cannot be tested statistically.

Common violations:

  - Clustered data (students within classrooms, patients within clinics): scores within a cluster are correlated.
  - Repeated measurements of the same participant analysed as if they were independent.
  - Dyadic data (couples, twins) treated as unrelated individuals.

When violated: For clustered data, use multilevel models. For repeated measures within the same participant, use the paired t-test or repeated measures ANOVA.

4.4 Scale of Measurement

t-Tests assume the dependent variable is measured on at least an interval scale — that is, the differences between values are meaningful and equal across the scale.

When violated: If the outcome is ordinal (ranked categories) or continuous but severely non-normal, use non-parametric alternatives (Mann-Whitney U, Wilcoxon signed-rank).

4.5 Random Sampling

For inferential conclusions to generalise to the population, the sample should be randomly selected from the population of interest. In practice, many research samples are convenience samples; this limits the generalisability of conclusions but does not invalidate the mathematical procedure of the t-test itself.

4.6 Absence of Influential Outliers

Extreme outliers can dramatically distort the mean and standard deviation, leading to inflated or deflated t-statistics. The t-test is sensitive to outliers, particularly in small samples.

How to check:

  - Boxplots of each group, flagging points beyond 1.5 × IQR from the quartiles.
  - Standardised ($z$) scores; values beyond $\lvert z \rvert \approx 3$ deserve scrutiny.

When outliers are present: Investigate whether outliers represent data entry errors, measurement errors, or genuine extreme values. Report analyses with and without outliers. Consider using the trimmed mean t-test or a robust alternative.

4.7 Assumption Summary Table

| Assumption | One-Sample | Independent | Paired | How to Check | Remedy if Violated |
|---|---|---|---|---|---|
| Normality | ✅ | ✅ | ✅ (differences) | Shapiro-Wilk, Q-Q | Mann-Whitney / Wilcoxon |
| Equal variances | — | ✅ | — | Levene's test | Welch's t-test |
| Independence | ✅ | ✅ (within groups) | ✅ (between pairs) | Design check | Multilevel models |
| Interval scale | ✅ | ✅ | ✅ | Measurement theory | Non-parametric test |
| No severe outliers | ✅ | ✅ | ✅ | Boxplots, $z$-scores | Trimmed mean / robust test |

5. Types of t-Tests

5.1 Decision Flowchart for Test Selection

The following logic guides selection of the appropriate t-test or alternative:

Is the outcome variable continuous (interval/ratio)?
├── NO  → Use chi-squared / Fisher's exact (categorical outcomes)
└── YES → How many groups?
          ├── MORE THAN 2 → Use ANOVA (or Kruskal-Wallis)
          └── 1 OR 2 → Are observations independent or paired?
                        ├── PAIRED (same units, two conditions)
                        │   ├── Normal differences? → Paired t-test
                        │   └── Non-normal?         → Wilcoxon signed-rank
                        └── INDEPENDENT (different participants)
                            ├── One group vs. known value?
                            │   ├── Normal?     → One-sample t-test
                            │   └── Non-normal? → Wilcoxon signed-rank (one-sample)
                            └── Two independent groups
                                ├── Normal + equal variances → Student's t-test
                                ├── Normal + unequal variances → Welch's t-test ✅ (recommended default)
                                └── Non-normal or ordinal → Mann-Whitney U
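
The flowchart's branching can be captured as a small helper function (an illustrative sketch; the function name and parameters are not a DataStatPro API):

```python
def choose_test(n_groups, continuous, paired=False, normal=True,
                equal_variances=False, vs_known_value=False):
    """Map the decision flowchart above onto a test recommendation."""
    if not continuous:
        return "chi-squared / Fisher's exact"
    if n_groups > 2:
        return "One-way ANOVA" if normal else "Kruskal-Wallis"
    if paired:
        return "Paired t-test" if normal else "Wilcoxon signed-rank"
    if vs_known_value or n_groups == 1:
        return "One-sample t-test" if normal else "Wilcoxon signed-rank (one-sample)"
    if not normal:
        return "Mann-Whitney U"
    return "Student's t-test" if equal_variances else "Welch's t-test"

choose_test(2, continuous=True, normal=True)   # → "Welch's t-test" (recommended default)
```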

5.2 Choosing Between Student's and Welch's t-Test

A persistent question in applied statistics is whether to use Student's t-test (assuming equal variances) or Welch's t-test (not assuming equal variances) for independent samples.

The consensus recommendation: Use Welch's t-test as the default for independent samples comparisons:

| Scenario | Student's t-test | Welch's t-test |
|---|---|---|
| Equal $n$, equal $\sigma$ | ✅ Correct size | ✅ Correct size |
| Equal $n$, unequal $\sigma$ | ⚠️ Slightly liberal | ✅ Correct size |
| Unequal $n$, equal $\sigma$ | ✅ Correct size | ✅ Slightly conservative |
| Unequal $n$, unequal $\sigma$ | ❌ Severely liberal | ✅ Correct size |

Simulation studies (Ruxton, 2006; Delacre et al., 2017) consistently show that Welch's t-test controls Type I error across all conditions, whereas Student's t-test fails when $n$ and $\sigma$ are both unequal. The loss of power from using Welch's test when variances are truly equal is negligible.

💡 The recommendation to default to Welch's t-test is supported by simulation evidence and is increasingly standard practice. DataStatPro reports both Student's and Welch's results by default, with Welch's highlighted as the recommended result.


6. Using the t-Test Calculator Component

The t-Test Calculator component in DataStatPro provides a comprehensive tool for running, diagnosing, visualising, and reporting t-tests and their alternatives.

Step-by-Step Guide

Step 1 — Select the Test Type

Choose from the "Test Type" dropdown:

Step 2 — Input Method

Choose how to provide the data:

💡 When using raw data, DataStatPro automatically runs Shapiro-Wilk tests for normality and Levene's test for equality of variances, and displays the results alongside the main output with colour-coded warnings for violations.

Step 3 — Specify the Null Hypothesis Value

Step 4 — Select the Alternative Hypothesis

Step 5 — Choose the Significance Level

Select $\alpha$ (default: $.05$). DataStatPro also provides results for $\alpha = .01$ and $\alpha = .001$ simultaneously for reference.

Step 6 — Select the Variance Assumption

For independent samples tests:

Step 7 — Select Display Options

Choose which outputs to display:

Step 8 — Run the Analysis

Click "Run t-Test". DataStatPro will:

  1. Compute the t-statistic, degrees of freedom, and p-value.
  2. Construct the 95% CI for the mean difference.
  3. Compute Cohen's $d$, Hedges' $g$, and their exact CIs.
  4. Run all selected assumption tests and display warnings.
  5. Generate all selected visualisations.
  6. Generate an APA 7th edition-compliant results paragraph.

7. One-Sample t-Test

7.1 Purpose and Design

The one-sample t-test answers the question: "Is the mean of my sample significantly different from a specific, theoretically or practically meaningful value $\mu_0$?"

Common applications:

  - Comparing a sample mean against a published norm or known population value.
  - Testing whether ratings differ from a scale midpoint.
  - Checking whether a process output matches a quality-control target.

7.2 Full Procedure

Given: A sample of $n$ observations with mean $\bar{x}$ and standard deviation $s$. Test $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$.

Step 1 — Compute the sample mean and SD

\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}

Step 2 — Compute the standard error

SE = \frac{s}{\sqrt{n}}

Step 3 — Compute the t-statistic

t = \frac{\bar{x} - \mu_0}{SE} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}

Step 4 — Determine degrees of freedom

\nu = n - 1

Step 5 — Compute the p-value

p = 2 \times P(T_{n-1} \geq \lvert t \rvert)

Compare to $\alpha$. Reject $H_0$ if $p \leq \alpha$.

Step 6 — Compute the 95% CI for $\mu$

\bar{x} \pm t_{\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}

Step 7 — Compute Cohen's $d$

d = \frac{\bar{x} - \mu_0}{s}

Hedges' $g$ (bias-corrected):

g = d \times \left(1 - \frac{3}{4(n-1) - 1}\right)
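
Steps 1-7 can be strung together in a few lines (a standard-library sketch with toy data; the p-value step needs a t-distribution CDF, available in SciPy as scipy.stats.t, so it is omitted here):

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """One-sample t, df, Cohen's d, and Hedges' g (p-value requires a t CDF)."""
    n = len(sample)
    xbar, s = statistics.mean(sample), statistics.stdev(sample)
    t = (xbar - mu0) / (s / math.sqrt(n))       # Steps 1-3
    d = (xbar - mu0) / s                        # Step 7: Cohen's d
    g = d * (1 - 3 / (4 * (n - 1) - 1))         # Step 7: Hedges' g
    return t, n - 1, d, g

t, df, d, g = one_sample_t([1, 2, 3, 4, 5], mu0=2)   # t = sqrt(2), df = 4
```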

7.3 Interpreting the One-Sample t-Test

| Result | Interpretation |
|---|---|
| $p \leq \alpha$ and CI excludes $\mu_0$ | Reject $H_0$: sample mean differs significantly from $\mu_0$ |
| $p > \alpha$ and CI includes $\mu_0$ | Fail to reject $H_0$: insufficient evidence of a difference |
| Large $d$, $p \leq \alpha$ | Significant AND practically meaningful departure from $\mu_0$ |
| Small $d$, $p \leq \alpha$ (large $n$) | Significant but practically negligible departure from $\mu_0$ |
| Large $d$, $p > \alpha$ (small $n$) | Non-significant due to low power; effect may be real but undetected |

8. Independent Samples t-Test

8.1 Purpose and Design

The independent samples t-test answers: "Do two independent groups have the same population mean?" It requires that the two groups are composed of entirely different participants with no systematic pairing or matching.

Common applications:

  - Randomised experiments comparing a treatment group with a control group.
  - Comparing two naturally occurring groups (e.g., two schools or two clinics) on a continuous outcome.

8.2 Full Procedure (Student's)

Given: Group 1 with $n_1$ observations ($\bar{x}_1$, $s_1$) and Group 2 with $n_2$ observations ($\bar{x}_2$, $s_2$). Test $H_0: \mu_1 = \mu_2$.

Step 1 — Compute summary statistics

\bar{x}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ji}, \qquad s_j = \sqrt{\frac{1}{n_j-1}\sum_{i=1}^{n_j}(x_{ji}-\bar{x}_j)^2}, \quad j \in \{1, 2\}

Step 2 — Check variance homogeneity (Levene's test)

Run Levene's test. If $p_{Levene} \leq .05$, favour Welch's t-test (Section 10). Regardless, reporting both is best practice.

Step 3 — Compute pooled standard deviation

s_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}

Step 4 — Compute the t-statistic

t = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}\sqrt{1/n_1 + 1/n_2}}

Step 5 — Degrees of freedom and p-value

\nu = n_1 + n_2 - 2

p = 2 \times P(T_\nu \geq \lvert t \rvert)

Step 6 — 95% CI for the mean difference

(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\;\nu} \cdot s_{pooled}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

Step 7 — Effect sizes

d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}, \qquad g = d \times \left(1 - \frac{3}{4(n_1+n_2-2)-1}\right)

Common Language Effect Size:

CL = \Phi\!\left(\frac{d}{\sqrt{2}}\right)

8.3 APA Reporting Template

"An independent samples t-test revealed [a significant / no significant] difference in [DV] between [Group 1] ($M =$ , $SD =$ ) and [Group 2] ($M =$ , $SD =$ ), $t(\nu) =$ , $p =$ , $d =$ [95% CI: , ]. This represents a [small / medium / large] effect according to Cohen's (1988) benchmarks."

Example: "An independent samples Welch's t-test revealed a significant difference in anxiety scores between the CBT group ($M = 12.3$, $SD = 4.1$) and the waitlist control group ($M = 18.7$, $SD = 5.2$), $t(57.4) = -5.62$, $p < .001$, $d = 1.38$ [95% CI: 0.87, 1.88]. This represents a large treatment effect."


9. Paired Samples t-Test

9.1 Purpose and Design

The paired samples t-test (also: dependent samples, matched pairs, or repeated measures t-test) answers: "Do two related measurements differ significantly from each other?"

When observations are paired:

  - Repeated measures: the same participants measured at two time points (pre/post) or under two conditions.
  - Matched pairs: participants matched on key variables, one of each pair assigned to each condition.
  - Natural pairs: twins, couples, or paired body sites (left/right).

Advantage over independent t-test: By comparing within-person differences, the paired design removes between-person variability from the error term, substantially increasing power when the within-person correlation $r_{12}$ is positive.

9.2 Full Procedure

Given: $n$ pairs of observations $(x_{1i}, x_{2i})$.

Step 1 — Compute difference scores

d_i = x_{1i} - x_{2i}, \qquad i = 1, 2, \ldots, n

Step 2 — Compute mean and SD of differences

\bar{d} = \frac{1}{n}\sum_{i=1}^n d_i, \qquad s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(d_i - \bar{d})^2}

Step 3 — Compute the standard error of the mean difference

SE_{\bar{d}} = \frac{s_d}{\sqrt{n}}

Step 4 — Compute the t-statistic

t = \frac{\bar{d}}{SE_{\bar{d}}} = \frac{\bar{d}}{s_d/\sqrt{n}}

Step 5 — Degrees of freedom and p-value

\nu = n - 1

p = 2 \times P(T_{n-1} \geq \lvert t \rvert)

Step 6 — 95% CI for the mean difference

\bar{d} \pm t_{\alpha/2,\; n-1} \cdot \frac{s_d}{\sqrt{n}}

Step 7 — Effect sizes

Cohen's $d_z$ (most commonly reported for paired designs):

d_z = \frac{\bar{d}}{s_d} = \frac{t}{\sqrt{n}}

Cohen's $d_{rm}$ (repeated measures $d$, accounting for the pre-post correlation):

d_{rm} = \frac{\bar{d}}{s_d} \times \sqrt{2(1 - r_{12})} = d_z \sqrt{2(1 - r_{12})}

Where $s_{av} = (s_1 + s_2)/2$, $s_d$ is the SD of the difference scores, and $r_{12}$ is the correlation between the two measurements. Note that $d_{rm}$ is more comparable to $d$ from independent samples designs than $d_z$ is.

Cohen's $d_{av}$ (standardised by the average SD):

d_{av} = \frac{\bar{d}}{s_{av}} = \frac{\bar{x}_1 - \bar{x}_2}{(s_1+s_2)/2}

9.3 Comparing Paired and Independent t-Tests for the Same Data

When data are paired (pre-post), computing the incorrect independent t-test is a serious error. The two t-statistics are related through the within-pair correlation $r_{12}$. Assuming equal SDs in the two conditions ($s_1 = s_2$):

SE_{paired} = SE_{independent} \cdot \sqrt{1 - r_{12}}

so that

t_{paired} = \frac{t_{independent}}{\sqrt{1 - r_{12}}}

When $r_{12} > 0$ (typical for repeated measures): $SE_{paired} < SE_{independent}$, so $t_{paired} > t_{independent}$ — the paired test is more powerful. When $r_{12} = 0$, the two tests are approximately equivalent (they still differ in df). When $r_{12} < 0$ (rare), the independent test is more powerful.

⚠️ Never apply an independent samples t-test to paired data. Doing so ignores the within-pair correlation, produces an inflated standard error, and loses statistical power. Conversely, applying a paired t-test to genuinely independent data violates the independence assumption of the difference scores.


10. Welch's t-Test — Unequal Variances

10.1 Why Welch's is Preferred

Welch's t-test (1947) is a modification of Student's t-test that does not assume equal population variances. It is the recommended default for independent samples comparisons for three reasons:

  1. Robustness: It maintains correct Type I error rates regardless of whether variances are equal or unequal.
  2. Negligible power loss: When variances are truly equal, Welch's test loses very little power compared to Student's.
  3. Correct coverage: The CI from Welch's has the correct nominal coverage probability across all variance ratio conditions.

10.2 Full Procedure

Step 1 — Compute group summary statistics

$\bar{x}_1, s_1, n_1$ and $\bar{x}_2, s_2, n_2$

Step 2 — Compute separate variance estimates

v_1 = \frac{s_1^2}{n_1}, \qquad v_2 = \frac{s_2^2}{n_2}

Step 3 — Compute Welch's t-statistic

t_W = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{v_1 + v_2}}

Step 4 — Compute Welch-Satterthwaite degrees of freedom

\nu_W = \frac{(v_1 + v_2)^2}{\dfrac{v_1^2}{n_1-1} + \dfrac{v_2^2}{n_2-1}}

Round $\nu_W$ down to the nearest integer for conservative inference.

Step 5 — p-value

p = 2 \times P(T_{\nu_W} \geq \lvert t_W \rvert)

Step 6 — 95% CI for mean difference

(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; \nu_W} \cdot \sqrt{v_1 + v_2}

Step 7 — Effect size (Glass's $\Delta$ or Welch's $d$)

When variances are unequal, the appropriate standardiser for Cohen's $d$ is debated. Options include:

Pooled SD (ignores heterogeneity — caution):

d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}

Glass's $\Delta$ (control group SD as standardiser — recommended for treatment/control designs):

\Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_2}

Average SD (unbiased when neither group is the reference):

d_{av} = \frac{\bar{x}_1 - \bar{x}_2}{(s_1 + s_2)/2}

💡 DataStatPro reports all three standardisers alongside Welch's t-test, with Glass's $\Delta$ highlighted when one group is a designated control, and $d_{av}$ highlighted when neither group is a natural reference.
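
Steps 1-4 of Welch's procedure can be sketched as follows (standard library only, toy data; in practice DataStatPro, or scipy.stats.ttest_ind with equal_var=False, handles this automatically):

```python
import math
import statistics

def welch_t(g1, g2):
    """Welch's t and Welch-Satterthwaite df, using separate variance estimates."""
    n1, n2 = len(g1), len(g2)
    v1 = statistics.variance(g1) / n1           # v1 = s1^2 / n1
    v2 = statistics.variance(g2) / n2           # v2 = s2^2 / n2
    t = (statistics.mean(g1) - statistics.mean(g2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

t, df = welch_t([1, 2, 3], [0, 2, 4, 6])   # t = -1/sqrt(2), df = 216/53 ≈ 4.075
```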

10.3 Student's vs. Welch's: A Direct Comparison

| Property | Student's t-test | Welch's t-test |
|---|---|---|
| Assumes equal variances | ✅ Yes | ❌ No |
| df | $n_1 + n_2 - 2$ | Welch-Satterthwaite (always $\leq$ Student's) |
| Type I error (equal $n$, unequal $\sigma$) | ≈ nominal | ≈ nominal |
| Type I error (unequal $n$, unequal $\sigma$) | ❌ Inflated | ✅ Nominal |
| Power (equal variances) | Marginally higher | ≈ equivalent |
| Recommendation | Avoid as default | ✅ Recommended default |

11. Non-Parametric Alternatives

11.1 When to Use Non-Parametric Tests

Non-parametric tests (also called distribution-free tests) are appropriate when:

  - The outcome is ordinal rather than interval/ratio.
  - The distribution is markedly non-normal and the sample is too small for the CLT to help.
  - The data contain extreme outliers that cannot be removed or justified.

Trade-off: Non-parametric tests are more robust to assumption violations but have lower statistical power than their parametric counterparts when parametric assumptions ARE met. When normality holds, using a non-parametric test discards information.

💡 Non-parametric does not mean "assumption-free." The Mann-Whitney U test assumes that the two distributions have the same shape (just shifted); violation of this shape assumption means U tests the combined null of equal location AND equal shape, not just equal medians.

11.2 Mann-Whitney U Test (Non-Parametric Independent Samples)

The Mann-Whitney U test (also Wilcoxon rank-sum test) is the non-parametric alternative to the independent samples t-test. It tests whether the distributions of two independent groups are identical (or, under the shape assumption, whether one group tends to have higher ranks than the other).

Procedure:

Step 1 — Rank all observations

Combine all $n_1 + n_2$ observations and assign ranks from 1 (smallest) to $N = n_1 + n_2$. For tied values, assign the average of the tied ranks.

Step 2 — Compute the rank sums

$$W_1 = \sum_{i=1}^{n_1} R_i \quad \text{(sum of ranks for Group 1)}$$

$$W_2 = \sum_{j=1}^{n_2} R_j \quad \text{(sum of ranks for Group 2)}$$

Check: $W_1 + W_2 = N(N+1)/2$

Step 3 — Compute U statistics

$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - W_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - W_2$$

Check: $U_1 + U_2 = n_1 n_2$

The test statistic is $U = \min(U_1, U_2)$.

Step 4 — Compute the z-approximation (for $n_1, n_2 > 10$)

Under $H_0$, without ties:

$$z = \frac{U - n_1 n_2/2}{\sqrt{n_1 n_2(N+1)/12}}$$

(A continuity correction, when applied, moves the numerator 0.5 closer to zero.) For ties, the variance requires a correction factor:

$$z = \frac{U - n_1 n_2/2}{\sqrt{\dfrac{n_1 n_2}{12}\!\left(N+1 - \dfrac{\sum_{k}(t_k^3 - t_k)}{N(N-1)}\right)}}$$

Where $t_k$ is the number of observations in the $k$-th tied group and $N = n_1 + n_2$.

Step 5 — Effect size: Rank-biserial correlation

$$r_{rb} = \frac{U_1 - U_2}{n_1 n_2}, \qquad \lvert r_{rb} \rvert = 1 - \frac{2U}{n_1 n_2} \;\text{ with } U = \min(U_1, U_2)$$

(A related $z$-based measure, Rosenthal's $r = z/\sqrt{N}$, is sometimes reported; it is not identical to $r_{rb}$.)

Interpretation: $r_{rb} = 0.5$ means that 75% of the pairwise comparisons favour Group 1 over Group 2 — the probability of superiority is $(r_{rb}+1)/2 = 0.75$.

Cohen's benchmarks for $\lvert r_{rb} \rvert$ (same as $r$):

| $\lvert r_{rb} \rvert$ | Label |
| --- | --- |
| $0.10$ | Small |
| $0.30$ | Medium |
| $0.50$ | Large |
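The five steps above can be executed by hand with `scipy.stats.rankdata` (SciPy assumed; the two small samples here are illustrative):

```python
# Mann-Whitney U by hand, following Steps 1-5 above.
import numpy as np
from scipy import stats

group1 = np.array([12, 15, 11, 18, 14])
group2 = np.array([16, 19, 17, 21])
n1, n2 = group1.size, group2.size
N = n1 + n2

ranks = stats.rankdata(np.concatenate([group1, group2]))  # Step 1 (avg ranks for ties)
w1, w2 = ranks[:n1].sum(), ranks[n1:].sum()               # Step 2
assert w1 + w2 == N * (N + 1) / 2                         # rank-sum check

u1 = n1 * n2 + n1 * (n1 + 1) / 2 - w1                     # Step 3
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - w2
assert u1 + u2 == n1 * n2                                 # U check
u = min(u1, u2)

z = (u - n1 * n2 / 2) / np.sqrt(n1 * n2 * (N + 1) / 12)   # Step 4 (no ties)
r_rb = 1 - 2 * u / (n1 * n2)                              # Step 5 (magnitude)
print(f"U = {u}, z = {z:.3f}, |r_rb| = {r_rb:.3f}")
```

The two `assert` lines are exactly the consistency checks listed in Steps 2 and 3.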

11.3 Wilcoxon Signed-Rank Test (Non-Parametric Paired)

The Wilcoxon signed-rank test is the non-parametric alternative to the paired t-test. It tests whether the distribution of difference scores is symmetric about zero.

Procedure:

Step 1 — Compute and rank absolute differences

Compute $d_i = x_{1i} - x_{2i}$. Remove pairs where $d_i = 0$. Let $n'$ = number of non-zero differences.

Rank $\lvert d_i \rvert$ from 1 (smallest) to $n'$ (largest), assigning average ranks for ties.

Step 2 — Sum positive and negative ranks

$$W^+ = \sum_{d_i > 0} R_i \quad \text{(sum of ranks of positive differences)}$$

$$W^- = \sum_{d_i < 0} R_i \quad \text{(sum of ranks of negative differences)}$$

Check: $W^+ + W^- = n'(n'+1)/2$

The test statistic is $W = \min(W^+, W^-)$.

Step 3 — z-approximation (for $n' > 10$)

$$z = \frac{W^+ - n'(n'+1)/4}{\sqrt{n'(n'+1)(2n'+1)/24}}$$

With tie correction:

$$z = \frac{W^+ - n'(n'+1)/4}{\sqrt{\dfrac{n'(n'+1)(2n'+1)}{24} - \dfrac{\sum_k(t_k^3-t_k)}{48}}}$$

Step 4 — Effect size

$$r_W = \frac{z}{\sqrt{n'}}$$

Or, the matched-pairs rank-biserial correlation:

$$r_{rb} = 1 - \frac{4W^+}{n'(n'+1)}$$
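The procedure above can be checked against SciPy's `stats.wilcoxon` (SciPy assumed; the pre/post arrays are illustrative):

```python
# Wilcoxon signed-rank by hand, verified against scipy.stats.wilcoxon.
import numpy as np
from scipy import stats

pre = np.array([10, 12, 9, 15, 11, 13, 8, 14, 12, 10, 16, 9])
post = np.array([8, 11, 10, 12, 9, 12, 7, 11, 10, 8, 13, 8])

d = pre - post
d = d[d != 0]                        # Step 1: drop zero differences
ranks = stats.rankdata(np.abs(d))    # average ranks for tied |d|
w_plus = ranks[d > 0].sum()          # Step 2
w_minus = ranks[d < 0].sum()
n = d.size
assert w_plus + w_minus == n * (n + 1) / 2   # the Step 2 check

res = stats.wilcoxon(pre, post)      # two-sided; statistic = min(W+, W-)
print(f"W+ = {w_plus}, W- = {w_minus}, scipy W = {res.statistic}, p = {res.pvalue:.4f}")
```

With these data every difference but one is positive, so $W^-$ is tiny and the test is strongly one-sided in practice.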

11.4 One-Sample Wilcoxon Signed-Rank Test

The one-sample version tests $H_0$: the population median equals $\theta_0$. Compute $d_i = x_i - \theta_0$ and apply the Wilcoxon signed-rank procedure as above.

11.5 Comparing Parametric and Non-Parametric Tests

| Property | t-Test (parametric) | Mann-Whitney / Wilcoxon (non-parametric) |
| --- | --- | --- |
| Tests | Mean difference | Distribution shift (median/rank) |
| Assumes normality | ✅ Yes | ❌ No |
| Sensitive to outliers | ✅ Yes | ❌ No (rank-based) |
| Power (when normal) | Higher | ≈ 95% efficiency of the t-test |
| Power (when non-normal) | Lower | Can exceed the t-test |
| Effect size | Cohen's $d$, Hedges' $g$ | Rank-biserial $r_{rb}$ |
| Handles ordinal data | ❌ Questionable | ✅ Appropriate |
| Interpretability | Mean difference | Probability of superiority |

⚠️ The Asymptotic Relative Efficiency (ARE) of the Mann-Whitney U test relative to the t-test is $3/\pi \approx 0.955$ for normal data — meaning you only need about 5% more observations with Mann-Whitney to achieve the same power as the t-test. This near-equality of efficiency makes Mann-Whitney a safe choice when normality is questionable.

11.6 Brunner-Munzel Test — Handling Unequal Shapes

When the two distributions have different shapes (not just different locations), the Mann-Whitney test actually tests a compound null of equal location AND equal shape. The Brunner-Munzel test (Brunner & Munzel, 2000) is a robust alternative that tests only the stochastic equality of the two groups without the shape assumption:

$$H_0: P(X_1 > X_2) + 0.5 \cdot P(X_1 = X_2) = 0.5$$

The test statistic uses ranked data with separate within-group rankings:

$$t_{BM} = \frac{n_1 n_2 (\bar{R}_1^{(int)} - \bar{R}_2^{(int)})}{N\sqrt{n_1\hat{S}_1^2 + n_2\hat{S}_2^2}}$$

Where $\bar{R}_j^{(int)}$ are internal group ranks. DataStatPro reports the Brunner-Munzel test as an option under the Non-Parametric menu when the distribution shape assumption may be violated.
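SciPy ships an implementation of this test (`scipy.stats.brunnermunzel`, assumed SciPy ≥ 1.2). A minimal sketch on simulated data with equal locations but very different spreads:

```python
# Brunner-Munzel test via SciPy on two same-location, different-shape samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=30)    # narrow distribution
y = rng.normal(0.0, 4.0, size=30)    # same centre, much wider spread

bm = stats.brunnermunzel(x, y)
print(f"Brunner-Munzel statistic = {bm.statistic:.3f}, p = {bm.pvalue:.3f}")
```

Under the Brunner-Munzel null these two samples are stochastically equal despite their different shapes, so a large p-value is the typical outcome here.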


12. Advanced Topics

12.1 Robust t-Tests — Trimmed Means and Winsorisation

Trimmed mean t-tests use the $\alpha$-trimmed mean (removing the top and bottom $\alpha$ proportion of observations) as the measure of central tendency. Yuen's (1974) $t$-test for trimmed means:

$$t_{trim} = \frac{\bar{x}_{t1} - \bar{x}_{t2}}{SE_{trim}}$$

With 20% trimming from each tail ($\alpha = 0.20$):

$$\bar{x}_{t} = \frac{\sum_{i=h+1}^{n-h} x_{(i)}}{n - 2h}, \quad h = \lfloor \alpha n \rfloor$$

$$SE_{trim} = \sqrt{\frac{W_1}{h_1(h_1-1)} + \frac{W_2}{h_2(h_2-1)}}$$

Where $W_j$ are the Winsorised sums of squared deviations, and $h_j = n_j - 2\lfloor \alpha n_j \rfloor$ are the effective (post-trimming) sample sizes.

Trimmed mean t-tests are substantially more powerful than rank-based tests for heavy-tailed symmetric distributions while maintaining good Type I error control.
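SciPy exposes Yuen's test through the `trim` argument of `stats.ttest_ind` (assumed SciPy ≥ 1.7, where this argument was added; the heavy-tailed data are simulated for illustration):

```python
# Yuen's 20%-trimmed t-test via SciPy's `trim` argument,
# compared with the untrimmed test on heavy-tailed data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
g1 = rng.standard_t(df=3, size=40) + 0.8   # heavy-tailed, shifted up
g2 = rng.standard_t(df=3, size=40)

res_plain = stats.ttest_ind(g1, g2)             # ordinary t-test
res_yuen = stats.ttest_ind(g1, g2, trim=0.2)    # Yuen's trimmed t-test

print(f"Untrimmed: t = {res_plain.statistic:.3f}, p = {res_plain.pvalue:.4f}")
print(f"Yuen 20%:  t = {res_yuen.statistic:.3f}, p = {res_yuen.pvalue:.4f}")
```

On heavy-tailed data like these, the trimmed version typically yields a more stable statistic because extreme observations no longer dominate the means and variances.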

12.2 Bootstrap t-Tests

The bootstrap t-test (Efron & Tibshirani, 1994) makes no parametric distributional assumptions. It constructs the null distribution of the t-statistic empirically:

Procedure:

  1. Compute the observed t-statistic $t_{obs}$.
  2. Centre both samples on a common mean: $x_i^* = x_i - \bar{x}_j + \bar{x}_{grand}$.
  3. Draw $B$ bootstrap samples (typically $B = 10{,}000$) from the centred samples with replacement and compute $t^*_b$ for each.
  4. The p-value is the proportion of bootstrap t-statistics exceeding $\lvert t_{obs} \rvert$.

Percentile bootstrap CI for the mean difference:

Resample from the original data and compute the mean difference for each bootstrap sample. The 95% CI is given by the 2.5th and 97.5th percentiles of the bootstrap distribution.

The bootstrap is particularly valuable for small, non-normal samples where both parametric and asymptotic approximations may be poor.
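The four steps above can be sketched directly (SciPy and NumPy assumed; $B$ is reduced to 2,000 here for speed, and the function name is ours):

```python
# Centred-resampling bootstrap t-test, following Steps 1-4 above.
import numpy as np
from scipy import stats

def bootstrap_t_test(x, y, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    t_obs = stats.ttest_ind(x, y, equal_var=False).statistic   # Step 1
    grand = np.concatenate([x, y]).mean()
    xc, yc = x - x.mean() + grand, y - y.mean() + grand        # Step 2: impose H0
    count = 0
    for _ in range(B):                                         # Step 3
        xb = rng.choice(xc, size=x.size, replace=True)
        yb = rng.choice(yc, size=y.size, replace=True)
        tb = stats.ttest_ind(xb, yb, equal_var=False).statistic
        if abs(tb) >= abs(t_obs):
            count += 1
    return t_obs, count / B                                    # Step 4: p-value

rng = np.random.default_rng(5)
x = rng.exponential(1.0, size=25) + 0.7   # skewed, shifted
y = rng.exponential(1.0, size=25)
t_obs, p_boot = bootstrap_t_test(x, y)
print(f"t_obs = {t_obs:.3f}, bootstrap p = {p_boot:.4f}")
```

Recentring both samples on the grand mean is what makes the resampling distribution a null distribution while preserving each sample's shape.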

12.3 Bayesian t-Tests

The Bayesian t-test (Rouder et al., 2009; Jeffreys, 1961) quantifies evidence for both $H_0$ (no effect) and $H_1$ (an effect exists) using the Bayes Factor ($BF_{10}$):

$$BF_{10} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}$$

$BF_{10}$ represents how many times more likely the data are under $H_1$ than under $H_0$.

For the default Bayesian t-test (Rouder et al., 2009), the prior on effect size under $H_1$ is a Cauchy distribution with scale $r$:

$$\delta \sim \text{Cauchy}(0, r)$$

The default scale $r = \sqrt{2}/2 \approx 0.707$ represents a "medium" effect size prior.

Interpreting Bayes Factors:

| $BF_{10}$ | Evidence for $H_1$ |
| --- | --- |
| $> 100$ | Extreme |
| $30 - 100$ | Very strong |
| $10 - 30$ | Strong |
| $3 - 10$ | Moderate |
| $1 - 3$ | Anecdotal |
| $1$ | No evidence (equal) |
| $1/3 - 1$ | Anecdotal for $H_0$ |
| $1/10 - 1/3$ | Moderate for $H_0$ |
| $1/30 - 1/10$ | Strong for $H_0$ |
| $< 1/30$ | Very strong for $H_0$ |

Advantages of Bayesian t-tests:

  1. They can quantify evidence for $H_0$, not only against it.
  2. Evidence can be monitored as data accumulate without an alpha correction (see Section 12.5).
  3. Prior knowledge about plausible effect sizes is incorporated explicitly.

Limitations: Sensitive to the choice of prior. Results should always be reported with the prior specification and checked for sensitivity to alternative priors.

12.4 Equivalence Testing with TOST

Standard null hypothesis testing only allows rejection of $H_0: d = 0$. When the goal is to demonstrate absence of a meaningful effect (e.g., showing that a generic drug is bioequivalent to a brand-name drug), the Two One-Sided Tests (TOST) procedure is required.

Specify equivalence bounds $[-\Delta_L, \Delta_U]$ (e.g., $d = \pm 0.20$, corresponding to a "trivially small" effect).

TOST procedure:

Test $H_{01}: \mu_1 - \mu_2 \leq -\Delta_L$ using an upper one-tailed test. Test $H_{02}: \mu_1 - \mu_2 \geq \Delta_U$ using a lower one-tailed test.

Equivalence is concluded at level $\alpha$ when both one-tailed tests reject their respective nulls — equivalently, when the $90\%$ CI (for $\alpha = .05$) for the mean difference falls entirely within $(-\Delta_L, \Delta_U)$.

💡 Note that TOST uses a 90% CI (not 95%) when $\alpha = .05$, because each one-tailed test is at the $\alpha = .05$ level. The 90% CI corresponds to two one-tailed 5% tests.

Minimum detectable equivalence with $n$ per group:

$$\Delta_{min} = t_{\alpha,\; n-1} \cdot \sqrt{\frac{2s^2}{n}} + t_{\beta,\; n-1} \cdot \sqrt{\frac{2s^2}{n}}$$
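The TOST logic above can be sketched as two one-sided Welch tests (SciPy assumed; the function name, bounds, and data are illustrative):

```python
# TOST for two independent groups via two one-sided Welch tests.
import numpy as np
from scipy import stats

def tost_welch(x, y, low, high):
    """Two one-sided Welch tests against raw-score bounds (low, high)."""
    v1 = x.var(ddof=1) / x.size
    v2 = y.var(ddof=1) / y.size
    se = np.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (x.size - 1) + v2**2 / (y.size - 1))
    diff = x.mean() - y.mean()
    p_lower = stats.t.sf((diff - low) / se, df)    # H01: diff <= low
    p_upper = stats.t.cdf((diff - high) / se, df)  # H02: diff >= high
    return diff, max(p_lower, p_upper)             # both tests must reject

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=80)
y = rng.normal(0.0, 1.0, size=80)
diff, p_tost = tost_welch(x, y, low=-0.5, high=0.5)
print(f"diff = {diff:.3f}, TOST p = {p_tost:.4f}")  # p < .05 -> conclude equivalence
```

Taking the maximum of the two one-sided p-values is the standard shortcut: equivalence is declared only when both tests reject.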

12.5 Sequential t-Tests and Alpha Spending

Traditional t-tests do not allow for interim analyses — looking at the data multiple times inflates the Type I error rate. Sequential approaches address this:

Sequential Probability Ratio Test (SPRT): Compute a likelihood ratio $\Lambda$ after each observation. Stop when $\Lambda \leq B$ (accept $H_0$) or $\Lambda \geq A$ (reject $H_0$), where $A = (1-\beta)/\alpha$ and $B = \beta/(1-\alpha)$.

Alpha spending functions (O'Brien-Fleming, Pocock): Pre-specify how the total $\alpha$ budget is distributed across planned interim and final analyses.

Bayesian sequential testing: Use Bayes Factors to monitor evidence continuously. Unlike frequentist sequential testing, Bayesian sequential testing is valid at any stopping point without an alpha correction.
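The SPRT stopping boundaries above are simple to compute; a small sketch for conventional (illustrative) error rates:

```python
# SPRT stopping boundaries A and B from the formulas above,
# for illustrative error rates alpha = .05, beta = .20 (80% power).
alpha, beta = 0.05, 0.20
A = (1 - beta) / alpha    # reject H0 when the likelihood ratio >= A
B = beta / (1 - alpha)    # accept H0 when the likelihood ratio <= B
print(f"A = {A:.2f}, B = {B:.3f}")   # A = 16.00, B = 0.211
```

So with these settings, sampling continues while the likelihood ratio stays between roughly 0.21 and 16.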

12.6 Multiple Comparisons and t-Tests

When multiple t-tests are conducted within the same study, the familywise error rate (FWER) — the probability of at least one Type I error — inflates:

$$FWER = 1 - (1-\alpha)^k$$

Where $k$ is the number of tests. For $k = 5$ independent tests at $\alpha = .05$: $FWER = 1 - (0.95)^5 = .226$.

Correction methods:

| Method | Adjusted $\alpha$ | Properties |
| --- | --- | --- |
| Bonferroni | $\alpha/k$ | Conservative; controls FWER; simple |
| Holm | Sequential Bonferroni | Less conservative than Bonferroni |
| Benjamini-Hochberg | Controls FDR | Less conservative; for exploratory work |
| Šidák | $1-(1-\alpha)^{1/k}$ | Slightly less conservative than Bonferroni |

⚠️ Corrections for multiple comparisons should be planned before data collection and applied to the entire family of tests. Post-hoc correction of selected tests is not valid. When tests are planned contrasts from a theoretically derived framework, no correction may be necessary — this should be justified explicitly.
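The FWER formula and two of the corrections from the table can be checked directly (NumPy assumed; the p-values and the `holm` helper are illustrative):

```python
# FWER inflation, Bonferroni, and Holm's sequential Bonferroni.
import numpy as np

alpha, k = 0.05, 5
fwer = 1 - (1 - alpha)**k
print(f"FWER for k={k} independent tests: {fwer:.3f}")   # ~0.226

def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: returns a reject/retain flag per test."""
    order = np.argsort(pvals)
    reject = np.zeros(len(pvals), dtype=bool)
    for step, idx in enumerate(order):
        if pvals[idx] <= alpha / (len(pvals) - step):
            reject[idx] = True
        else:
            break                     # stop at the first non-rejection
    return reject

pvals = np.array([0.001, 0.011, 0.02, 0.04, 0.30])
print("Bonferroni rejects:", pvals <= alpha / k)   # only p = .001
print("Holm rejects:      ", holm(pvals))          # p = .001 and p = .011
```

Holm rejects one more hypothesis than Bonferroni here (p = .011 ≤ .05/4) while still controlling the FWER, which is why it is described as less conservative.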

12.7 Effect Sizes for t-Tests: Choosing the Right $d$ Variant

Multiple variants of Cohen's $d$ exist for the paired design, and choosing the wrong one leads to incomparable effect sizes across studies:

| Variant | Formula | Denominator | Comparability |
| --- | --- | --- | --- |
| $d_z$ | $\bar{d}/s_d$ | SD of differences | Paired designs only |
| $d_{av}$ | $\bar{d}/s_{av}$ | Average of group SDs | Comparable to between-subjects |
| $d_{rm}$ | $d_z\sqrt{2(1-r)}$ | $s_d$, rescaled by the correlation | Makes the paired effect comparable across designs |
| $\Delta$ (Glass) | $\bar{d}/s_{pre}$ | Pre-test (baseline) SD | Treatment-control; Glass's $\Delta$ |

Which to use:

  1. $d_z$ — when planning or power-analysing another paired design.
  2. $d_{av}$ or $d_{rm}$ — when the effect must be comparable to between-subjects studies (e.g., for meta-analysis).
  3. Glass's $\Delta$ — when the baseline (pre-test) SD is the natural reference scale.

Always state explicitly which variant was computed.

12.8 Reporting t-Tests According to APA 7th Edition

The APA Publication Manual (7th ed.) requires:

  1. Test statistic: $t(\nu) =$ value
  2. p-value: $p =$ value (report the exact value; use $p < .001$ only when below .001)
  3. Effect size with 95% CI: Cohen's $d =$ [LB, UB] or Hedges' $g$
  4. Group means and standard deviations
  5. Whether Welch's correction was applied (for independent t-tests)
  6. Whether the CI is for the mean difference or the standardised effect size

Full APA template:

"[Group 1] ($M =$ , $SD =$ , $n =$ ) and [Group 2] ($M =$ , $SD =$ , $n =$ ) were compared using [Student's / Welch's] independent samples t-test. The test revealed [a significant / no significant] mean difference, $t(\nu) =$ , $p =$ , $d =$ [95% CI: , ], indicating a [small / medium / large] effect."


13. Worked Examples

Example 1: One-Sample t-Test — Comparing Response Time to a Normative Standard

A cognitive neuroscience researcher measures simple reaction times (in ms) for $n = 25$ adults diagnosed with ADHD. The published population norm for neurotypical adults is $\mu_0 = 250$ ms. The researcher tests whether the ADHD sample has a significantly different mean reaction time.

Data summary:

$$n = 25, \quad \bar{x} = 281.4 \text{ ms}, \quad s = 42.8 \text{ ms}$$

Step 1 — Standard error:

$$SE = \frac{42.8}{\sqrt{25}} = \frac{42.8}{5} = 8.56 \text{ ms}$$

Step 2 — t-statistic:

$$t = \frac{281.4 - 250}{8.56} = \frac{31.4}{8.56} = 3.668$$

Step 3 — Degrees of freedom:

$$\nu = 25 - 1 = 24$$

Step 4 — p-value:

$$p = 2 \times P(T_{24} \geq 3.668) = 2 \times 0.00062 = .001$$

Step 5 — 95% CI for $\mu$:

$$t_{.025, 24} = 2.064$$

$$281.4 \pm 2.064 \times 8.56 = 281.4 \pm 17.7 = [263.7, 299.1]$$

Step 6 — Cohen's $d$:

$$d = \frac{281.4 - 250}{42.8} = \frac{31.4}{42.8} = 0.734$$

Hedges' $g$:

$$g = 0.734 \times \left(1 - \frac{3}{4(24)-1}\right) = 0.734 \times \left(1 - \frac{3}{95}\right) = 0.734 \times 0.9684 = 0.711$$

95% CI for $d$ (approximate):

$$SE_d = \sqrt{\frac{1}{25} + \frac{0.734^2}{2(24)}} = \sqrt{0.04 + 0.01121} = \sqrt{0.0512} = 0.226$$

$$95\% \text{ CI}: 0.734 \pm 1.96(0.226) = [0.291, 1.177]$$

Common Language Effect Size:

$$CL = \Phi\!\left(\frac{0.734}{\sqrt{2}}\right) = \Phi(0.519) = 0.698$$

Summary:

| Statistic | Value | Interpretation |
| --- | --- | --- |
| $t(24)$ | $3.668$ | |
| $p$ (two-tailed) | $.001$ | Significant at $\alpha = .05$ |
| Mean difference | $31.4$ ms | ADHD group is 31.4 ms slower |
| 95% CI (ms) | $[13.7, 49.1]$ | Excludes 0; significant |
| Cohen's $d$ | $0.734$ | Medium-large effect |
| 95% CI for $d$ | $[0.291, 1.177]$ | From small to very large — wide CI |
| Hedges' $g$ | $0.711$ | Minimal bias correction |
| CL | $69.8\%$ | ADHD group exceeds norm $69.8\%$ of the time |

APA write-up: "Adults with ADHD ($M = 281.4$ ms, $SD = 42.8$ ms) showed significantly longer reaction times than the neurotypical normative mean of 250 ms, $t(24) = 3.67$, $p = .001$, $d = 0.73$ [95% CI: 0.29, 1.18]. This represents a medium-to-large deviation from the normative standard. The 95% CI for the mean difference was [13.7, 49.1] ms."
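The entire example can be reproduced from the summary statistics alone, using SciPy's t distribution (no raw data are needed):

```python
# Example 1 recomputed from its summary statistics.
import math
from scipy import stats

n, xbar, s, mu0 = 25, 281.4, 42.8, 250.0
se = s / math.sqrt(n)                       # Step 1: 8.56
t = (xbar - mu0) / se                       # Step 2: 3.668
p = 2 * stats.t.sf(abs(t), df=n - 1)        # Step 4: two-tailed p
tcrit = stats.t.ppf(0.975, df=n - 1)        # Step 5: 2.064
ci = (xbar - tcrit * se, xbar + tcrit * se)
d = (xbar - mu0) / s                        # Step 6: Cohen's d = 0.734

print(f"t({n - 1}) = {t:.3f}, p = {p:.4f}")
print(f"95% CI = [{ci[0]:.1f}, {ci[1]:.1f}], d = {d:.3f}")
```

Each line mirrors one step of the worked example, so the printed values should match the hand calculation to rounding.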


Example 2: Welch's Independent Samples t-Test — Sleep Duration by Shift Type

A workplace health researcher compares average nightly sleep duration (hours) between day-shift ($n_1 = 40$) and night-shift ($n_2 = 35$) nurses.

Summary statistics:

| Group | $n$ | Mean (hrs) | SD |
| --- | --- | --- | --- |
| Day shift | 40 | 7.21 | 1.02 |
| Night shift | 35 | 5.84 | 1.73 |

Levene's test: $F(1, 73) = 9.82$, $p = .002$ — significant heterogeneity of variances. → Use Welch's t-test.

Step 1 — Variance estimates:

$$v_1 = \frac{1.02^2}{40} = \frac{1.040}{40} = 0.02601$$

$$v_2 = \frac{1.73^2}{35} = \frac{2.993}{35} = 0.08551$$

Step 2 — Welch's t-statistic:

$$t_W = \frac{7.21 - 5.84}{\sqrt{0.02601 + 0.08551}} = \frac{1.37}{0.3339} = 4.103$$

Step 3 — Welch-Satterthwaite df:

$$\nu_W = \frac{(0.02601 + 0.08551)^2}{\dfrac{0.02601^2}{39} + \dfrac{0.08551^2}{34}} = \frac{0.012437}{0.0000174 + 0.000215} = \frac{0.012437}{0.000232} = 53.6$$

Rounded down: $\nu_W = 53$.

Step 4 — p-value:

$$p = 2 \times P(T_{53} \geq 4.103) < .001$$

Step 5 — 95% CI:

$$t_{.025, 53} = 2.006$$

$$(7.21 - 5.84) \pm 2.006 \times 0.3339 = 1.37 \pm 0.670 = [0.700, 2.040]$$

Step 6 — Effect sizes:

$$s_{pooled} = \sqrt{\frac{39(1.02)^2 + 34(1.73)^2}{73}} = \sqrt{\frac{40.56 + 101.78}{73}} = \sqrt{1.949} = 1.396$$

Cohen's $d$:

$$d = \frac{7.21 - 5.84}{1.396} = \frac{1.37}{1.396} = 0.981$$

Glass's $\Delta$ (using the night-shift SD as the standardiser — the "comparison" group):

$$\Delta = \frac{7.21 - 5.84}{1.73} = \frac{1.37}{1.73} = 0.792$$

Average SD: $s_{av} = (1.02 + 1.73)/2 = 1.375$

$$d_{av} = \frac{1.37}{1.375} = 0.996$$

Summary:

| Statistic | Value |
| --- | --- |
| Levene's $F$ | $9.82$, $p = .002$ (unequal variances confirmed) |
| $t_W(53.6)$ | $4.103$ |
| $p$ (two-tailed) | $< .001$ |
| Mean difference | $1.37$ hrs (day > night) |
| 95% CI (hrs) | $[0.700, 2.040]$ |
| Cohen's $d$ | $0.981$ (Large) |
| Glass's $\Delta$ | $0.792$ (Large) |
| $d_{av}$ | $0.996$ (Large) |

APA write-up: "Day-shift nurses ($M = 7.21$ hrs, $SD = 1.02$) slept significantly longer than night-shift nurses ($M = 5.84$ hrs, $SD = 1.73$). Due to significant variance heterogeneity (Levene's $F(1, 73) = 9.82$, $p = .002$), Welch's t-test was applied. Results indicated a significant difference, $t_W(53.6) = 4.10$, $p < .001$, $d = 0.98$ [95% CI: 0.54, 1.43], representing a large effect. Night-shift nurses slept on average 1.37 hours less per night [95% CI: 0.70, 2.04]."
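SciPy can run this test straight from the summary statistics via `stats.ttest_ind_from_stats` (assumed available in SciPy ≥ 0.16):

```python
# Example 2 recomputed from summary statistics with SciPy.
from scipy import stats

res = stats.ttest_ind_from_stats(
    mean1=7.21, std1=1.02, nobs1=40,
    mean2=5.84, std2=1.73, nobs2=35,
    equal_var=False,                 # Welch's test
)
print(f"t_W = {res.statistic:.3f}, p = {res.pvalue:.6f}")
```

The statistic should reproduce the hand-computed $t_W = 4.103$ to rounding, with $p < .001$.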


Example 3: Paired Samples t-Test — Pre-Post Mindfulness Intervention

A clinical psychologist tests whether an 8-week mindfulness-based stress reduction (MBSR) programme reduces perceived stress. Perceived Stress Scale (PSS-10; range 0–40) scores are recorded before and after the programme for $n = 20$ participants.

Summary statistics:

| Measurement | Mean | SD | $r$ (pre-post) |
| --- | --- | --- | --- |
| Pre-MBSR | 24.7 | 5.8 | |
| Post-MBSR | 18.3 | 5.1 | $r_{12} = 0.74$ |
| Differences ($d_i = \text{pre} - \text{post}$) | 6.4 | 4.1 | |

Step 1 — t-statistic:

$$t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{6.4}{4.1/\sqrt{20}} = \frac{6.4}{0.917} = 6.979$$

Step 2 — Degrees of freedom and p-value:

$$\nu = 20 - 1 = 19$$

$$p = 2 \times P(T_{19} \geq 6.979) < .001$$

Step 3 — 95% CI for mean difference:

$$t_{.025, 19} = 2.093$$

$$6.4 \pm 2.093 \times 0.917 = 6.4 \pm 1.919 = [4.48, 8.32]$$

Step 4 — Effect sizes:

$$d_z = \frac{6.4}{4.1} = 1.561$$

$$d_{av} = \frac{6.4}{(5.8+5.1)/2} = \frac{6.4}{5.45} = 1.174$$

$$d_{rm} = d_z\sqrt{2(1-r_{12})} = 1.561 \times \sqrt{2(1-0.74)} = 1.561 \times 0.721 = 1.126$$

Note the difference: $d_z$ is the largest because the high pre-post correlation ($r_{12} = 0.74$) shrinks the SD of the differences; $d_{av}$ standardises by the average group SD and is comparable to a between-subjects $d$; $d_{rm}$ adjusts $d_z$ for the correlation between measures. Specify which variant is reported.

Comparison: what if the independent t-test had been (incorrectly) applied?

$$s_{pooled} = \sqrt{\frac{19(5.8^2) + 19(5.1^2)}{38}} = \sqrt{\frac{639.16 + 494.19}{38}} = \sqrt{29.83} = 5.462$$

$$t_{independent} = \frac{24.7 - 18.3}{5.462\sqrt{1/20+1/20}} = \frac{6.4}{5.462 \times 0.3162} = \frac{6.4}{1.727} = 3.706$$

The paired test ($t = 6.98$) is substantially more powerful than the incorrect independent test ($t = 3.71$) — reflecting the benefit of removing between-person variance through pairing.

Summary:

| Statistic | Value |
| --- | --- |
| $t(19)$ | $6.979$ |
| $p$ (two-tailed) | $< .001$ |
| Mean reduction | $6.4$ PSS points |
| 95% CI for difference | $[4.48, 8.32]$ |
| $d_z$ | $1.561$ |
| $d_{av}$ | $1.174$ |
| $d_{rm}$ | $1.126$ |
| $t$ (if independent, incorrect) | $3.706$ |

APA write-up: "Perceived stress scores decreased significantly from pre-MBSR ($M = 24.7$, $SD = 5.8$) to post-MBSR ($M = 18.3$, $SD = 5.1$), $t(19) = 6.98$, $p < .001$, $d_z = 1.56$ [95% CI: 0.99, 2.11]. The mean reduction of 6.4 PSS points (95% CI: [4.48, 8.32]) represents a large within-person effect."
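As with Example 1, the paired test can be reproduced from the difference-score summary alone (SciPy assumed):

```python
# Example 3 recomputed from the difference-score summary statistics.
import math
from scipy import stats

n, dbar, sd = 20, 6.4, 4.1
se = sd / math.sqrt(n)
t = dbar / se                                # 6.98
p = 2 * stats.t.sf(t, df=n - 1)              # two-tailed p
tcrit = stats.t.ppf(0.975, df=n - 1)         # 2.093
ci = (dbar - tcrit * se, dbar + tcrit * se)  # [4.48, 8.32]
d_z = dbar / sd                              # 1.561

print(f"t({n - 1}) = {t:.3f}, p = {p:.2e}, d_z = {d_z:.3f}")
print(f"95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

This works because the paired t-test is just a one-sample t-test on the difference scores.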


Example 4: Mann-Whitney U Test — Non-Parametric Independent Comparison

A researcher compares pain ratings (0–10 scale, ordinal) between two physiotherapy protocols. Shapiro-Wilk tests indicate non-normality in both groups. Group 1 (Protocol A, $n_1 = 8$): ratings $\{3, 5, 2, 7, 4, 6, 3, 5\}$. Group 2 (Protocol B, $n_2 = 7$): ratings $\{7, 8, 6, 9, 7, 8, 6\}$.

Step 1 — Rank all $N = 15$ observations:

Combined sorted values: 2, 3, 3, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 9

| Value | Freq | Ranks | Avg Rank | Group |
| --- | --- | --- | --- | --- |
| 2 | 1 | 1 | 1.0 | A |
| 3 | 2 | 2–3 | 2.5 | A, A |
| 4 | 1 | 4 | 4.0 | A |
| 5 | 2 | 5–6 | 5.5 | A, A |
| 6 | 3 | 7–9 | 8.0 | A, B, B |
| 7 | 3 | 10–12 | 11.0 | A, B, B |
| 8 | 2 | 13–14 | 13.5 | B, B |
| 9 | 1 | 15 | 15.0 | B |

Step 2 — Rank sums:

$$W_1 = 1.0 + 2.5 + 2.5 + 4.0 + 5.5 + 5.5 + 8.0 + 11.0 = 40.0$$

$$W_2 = 8.0 + 8.0 + 11.0 + 11.0 + 13.5 + 13.5 + 15.0 = 80.0$$

Check: $40.0 + 80.0 = 120 = 15 \times 16/2$

Step 3 — U statistics:

$$U_1 = 8 \times 7 + \frac{8 \times 9}{2} - 40 = 56 + 36 - 40 = 52$$

$$U_2 = 8 \times 7 + \frac{7 \times 8}{2} - 80 = 56 + 28 - 80 = 4$$

Check: $52 + 4 = 56 = n_1 n_2$

Test statistic: $U = \min(52, 4) = 4$

Step 4 — z-approximation:

$$\mu_U = \frac{n_1 n_2}{2} = \frac{56}{2} = 28$$

$$\sigma_U = \sqrt{\frac{8 \times 7 \times 16}{12}} = \sqrt{\frac{896}{12}} = \sqrt{74.67} = 8.64 \quad \text{(without tie correction)}$$

$$z = \frac{4 - 28}{8.64} = \frac{-24}{8.64} = -2.778$$

$$p = 2 \times P(Z \leq -2.778) = 2 \times .0027 = .005$$

Step 5 — Rank-biserial correlation:

$$\lvert r_{rb} \rvert = 1 - \frac{2 \times 4}{56} = 1 - 0.143 = 0.857$$

(Rosenthal's $r = \lvert z \rvert/\sqrt{N} = 2.778/\sqrt{15} = 0.717$ is a related but distinct $z$-based effect size.)

Interpretation: Protocol B produces substantially higher pain ratings — $r_{rb} = 0.857$ indicates a large effect (Protocol A ranks lower/better in $\approx 93\%$ of pairwise comparisons).

Summary:

| Statistic | Value |
| --- | --- |
| $U$ | $4$ |
| $z$ (approximate) | $-2.78$ |
| $p$ (two-tailed) | $.005$ |
| $r_{rb}$ | $0.857$ (Large) |
| Median Protocol A | $4.5$ |
| Median Protocol B | $7.0$ |

APA write-up: "Due to non-normal distributions, a Mann-Whitney U test was conducted. Protocol A ($\text{Mdn} = 4.5$) produced significantly lower pain ratings than Protocol B ($\text{Mdn} = 7.0$), $U = 4$, $z = -2.78$, $p = .005$, $r_{rb} = 0.86$, indicating a large effect."
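The example can be verified with SciPy's `stats.mannwhitneyu`. Note that SciPy reports the U statistic for the first sample, so $\min(U_1, U_2)$ is recovered via the identity $U_1 + U_2 = n_1 n_2$:

```python
# Verifying Example 4 with SciPy.
from scipy import stats

protocol_a = [3, 5, 2, 7, 4, 6, 3, 5]
protocol_b = [7, 8, 6, 9, 7, 8, 6]

res = stats.mannwhitneyu(protocol_a, protocol_b, alternative="two-sided")
u_first = res.statistic                  # U for the first sample (ties count 0.5)
u_min = min(u_first, len(protocol_a) * len(protocol_b) - u_first)
print(f"U = {u_min}, p = {res.pvalue:.4f}")
```

With ties present, SciPy's asymptotic p-value includes the tie correction, so it may differ slightly from the uncorrected hand value of .005.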


14. Common Mistakes and How to Avoid Them

Mistake 1: Using the Independent t-Test for Paired Data

Problem: Treating pre-post measurements or matched pairs as independent samples. This ignores the within-person correlation, inflates the error term, and substantially reduces power.

Solution: Identify the study design before analysis. If each participant contributes two scores (repeated measures, matched pairs), use the paired t-test. Check whether the data file has one row per participant (paired) vs. one row per observation (independent).


Mistake 2: Defaulting to Student's t-Test Without Checking Variance Equality

Problem: SPSS, Excel, and older textbooks default to Student's t-test. When groups differ in sample size AND variance, Student's t-test can have a severely inflated Type I error rate.

Solution: Always use Welch's t-test as the default for independent samples. The cost in power when variances are equal is negligible, whereas the benefit when variances are unequal is substantial. Report Welch's results; note if Levene's test is significant.


Mistake 3: Interpreting a Non-Significant p-Value as Evidence of No Effect

Problem: Concluding that $p > .05$ means $\mu_1 = \mu_2$. A non-significant result means the data are insufficient to reject $H_0$ — it does NOT mean the null hypothesis is true.

Solution: Report the 95% CI for the mean difference alongside the p-value. A wide CI that spans from negative to positive values reflects uncertainty, not evidence of zero effect. To positively establish absence of a meaningful effect, use equivalence testing (TOST) with prespecified bounds.


Mistake 4: Reporting Only p-Values Without Effect Sizes

Problem: Reporting $t(48) = 2.11$, $p = .040$ without Cohen's $d$ conveys nothing about the magnitude of the effect. With $n = 1{,}000$, the same p-value might correspond to $d = 0.08$ (trivial); with $n = 10$, it might correspond to $d = 1.10$ (large).

Solution: Always report Cohen's $d$ (or Hedges' $g$) and its 95% CI alongside every t-test. DataStatPro computes these automatically.


Mistake 5: Switching to One-Tailed Tests After Seeing the Data

Problem: Observing that Group 1 > Group 2, then switching to a one-tailed test to achieve $p < .05$ when the two-tailed result was $p = .07$. This is p-hacking and inflates the Type I error to approximately $10\%$.

Solution: Directional hypotheses must be pre-registered before data collection and must be based on strong theoretical or prior empirical grounds. If in doubt, use a two-tailed test.


Mistake 6: Applying t-Tests to Likert Items Without Justification

Problem: Treating 5-point Likert items as interval-scale data and applying t-tests. Strictly, Likert items are ordinal — the intervals between adjacent scale points are not necessarily equal.

Solution: For a single Likert item, use the Mann-Whitney U (independent) or Wilcoxon signed-rank (paired) test. For a Likert scale (composite of multiple items), the summed score is typically treated as approximately interval, and t-tests are generally considered acceptable. Clearly state this assumption.


Mistake 7: Ignoring Outliers Before Running the t-Test

Problem: The t-test uses means, which are highly sensitive to outliers, especially in small samples. A single extreme value can drastically alter the t-statistic and p-value.

Solution: Always inspect data with boxplots and $z$-scores before running a t-test. Investigate outliers (data entry error? valid extreme value?). Report analyses with and without outliers. Consider using trimmed mean t-tests or the Mann-Whitney test when outliers cannot be removed.


Mistake 8: Confusing Statistical Power with the Probability the Null is False

Problem: Interpreting power $= 0.80$ as meaning "there is an 80% probability the null hypothesis is false, given I found $p < .05$." Power is a property of the study design computed before data collection — it is the probability of getting a significant result IF a true effect of size $d$ exists.

Solution: Understand that power is computed under $H_1$ and is not a posterior probability about $H_0$. The probability that a significant result reflects a true effect (positive predictive value) also depends on the prior probability of $H_1$ being true.


Mistake 9: Using the Wrong dd Variant and Comparing Across Designs

Problem: Reporting $d_z$ from a paired design and comparing it to $d$ from an independent samples study as if they are the same quantity. $d_z$ depends on the pre-post correlation and is typically larger than $d_{av}$ for the same mean difference.

Solution: When comparing effect sizes across designs, convert all effect sizes to a common metric. Use $d_{av}$ for paired designs when comparing to between-subjects studies. Always specify which variant of $d$ was computed.


Mistake 10: Running Multiple t-Tests Instead of ANOVA

Problem: Comparing three groups (A, B, C) with three separate t-tests (A vs. B, A vs. C, B vs. C) inflates the familywise error rate to $\approx 14\%$ instead of the nominal $5\%$.

Solution: When comparing more than two groups, use one-way ANOVA (or Kruskal-Wallis for non-parametric data) followed by appropriate post-hoc tests (Tukey HSD, Bonferroni, Games-Howell for unequal variances). Reserve t-tests for pre-planned pairwise contrasts with appropriate alpha correction.


15. Troubleshooting

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| t-statistic is extremely large ($\lvert t \rvert > 10$) | Very large $n$ or data entry error | Check for duplicate entries and errors; report the effect size — even a large $t$ may indicate a small $d$ |
| $p = 1.000$ or exactly 0 | Floating point overflow; identical group means | Check that both groups have variance; verify data coding |
| Welch's df is very small ($< 5$) | One group has very small $n$ or near-zero variance | Check data; use an exact permutation test for very small $n$ |
| Student's and Welch's give very different results | Unequal variances with unequal $n$ | Levene's test is likely significant; use Welch's result |
| Paired t-test gives larger $t$ than expected | High pre-post correlation (good — this is the efficiency gain) | Report as normal; note the within-person correlation $r_{12}$ |
| Shapiro-Wilk is significant but $n$ is large | Power of the normality test increases with $n$; minor deviations become significant | With $n \geq 30$, the CLT usually ensures a valid t-test; inspect Q-Q plots and skewness |
| Mann-Whitney gives a different conclusion than the t-test | Distribution is non-normal and the sample is small | For non-normal data, trust Mann-Whitney; report both with a note on the assumption violation |
| Effect size CI is very wide | Small sample size | Report the wide CI — it is informative about low precision; conduct an a priori power analysis for the next study |
| Cohen's $d_z$ is much larger than $d_{av}$ | High pre-post correlation ($r_{12}$ is large) | Both are correct; specify which was computed and when each is appropriate |
| Equivalence test fails despite small $d$ | Equivalence bounds are too tight for the sample size | Either increase $n$ or widen the equivalence bounds with justification |
| Negative $p$-value or $p > 1$ reported | Software error or data corruption | Re-check the data file; rerun the analysis in DataStatPro |
| One-tailed $p$ is larger than two-tailed $p$ | Effect is in the direction opposite to $H_1$ | The one-tailed test is not significant in the predicted direction; the effect is in the wrong direction |
| Bootstrap CI does not include 0 but t-test $p > .05$ | Small sample; bootstrap and t-test diverge for highly non-normal data | Investigate the distribution; report both with a rationale for the preferred method |
| $r$ computed from $t$ and $\nu$ seems too small | Correct — $r$ from $t$ is the point-biserial correlation, not Cohen's $d$ | Use $d = 2r/\sqrt{1-r^2}$ to convert to Cohen's $d$ |
| Bayes Factor is not decisive ($BF \approx 1$) | Data provide no evidence in either direction; the study is underpowered | Collect more data; report the BF as evidence of insensitivity; avoid interpreting it as supporting either hypothesis |

16. Quick Reference Cheat Sheet

Core t-Test Formulas

| Formula | Description |
| --- | --- |
| $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$ | One-sample t-statistic |
| $t = (\bar{x}_1-\bar{x}_2)/(s_p\sqrt{1/n_1+1/n_2})$ | Independent samples (Student's) |
| $s_p = \sqrt{[(n_1-1)s_1^2+(n_2-1)s_2^2]/(n_1+n_2-2)}$ | Pooled standard deviation |
| $t_W = (\bar{x}_1-\bar{x}_2)/\sqrt{s_1^2/n_1+s_2^2/n_2}$ | Welch's t-statistic |
| $\nu_W = (v_1+v_2)^2/(v_1^2/(n_1-1)+v_2^2/(n_2-1))$ | Welch-Satterthwaite df |
| $t = \bar{d}/(s_d/\sqrt{n})$ | Paired t-statistic |
| $SE_{\bar{x}} = s/\sqrt{n}$ | Standard error of the mean |
| $\bar{x} \pm t_{\alpha/2,\nu} \cdot SE$ | Confidence interval for mean |
| $p = 2 \times P(T_\nu \geq \lvert t \rvert)$ | Two-tailed p-value |

Effect Size Formulas for t-Tests

| Formula | Description |
|---|---|
| $d = (\bar{x}_1-\bar{x}_2)/s_{pooled}$ | Cohen's $d$ (independent) |
| $d_z = \bar{d}/s_d = t/\sqrt{n}$ | Cohen's $d_z$ (paired) |
| $d_{av} = \bar{d}/s_{av}$ | Cohen's $d_{av}$ (paired, comparable to between-subjects $d$) |
| $d_{rm} = d_z\sqrt{2(1-r_{12})}$ | $d_{rm}$ (corrected for dependency) |
| $\Delta = (\bar{x}_1-\bar{x}_2)/s_{control}$ | Glass's $\Delta$ |
| $g = d \times (1-3/(4\nu-1))$ | Hedges' $g$ (bias-corrected) |
| $r = \sqrt{t^2/(t^2+\nu)}$ | Point-biserial $r$ from $t$ |
| $d = t\sqrt{(n_1+n_2)/(n_1 n_2)}$ | $d$ from independent $t$ |
| $d_z = t/\sqrt{n}$ | $d_z$ from paired/one-sample $t$ |
| $d = 2r/\sqrt{1-r^2}$ | Convert $r$ to $d$ (equal groups) |
| $CL = \Phi(d/\sqrt{2})$ | Common language effect size |
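The two most commonly reported quantities in the table, Cohen's $d$ with its pooled denominator and the Hedges' $g$ bias correction, can be sketched in a few lines of plain Python (illustrative helper names, standard library only):

```python
import math
from statistics import mean, stdev

def cohens_d_independent(x1, x2):
    """Cohen's d = (x1-bar - x2-bar) / s_pooled for two independent groups."""
    n1, n2 = len(x1), len(x2)
    s_pooled = math.sqrt(((n1 - 1) * stdev(x1) ** 2 + (n2 - 1) * stdev(x2) ** 2)
                         / (n1 + n2 - 2))
    return (mean(x1) - mean(x2)) / s_pooled

def hedges_g(d, df):
    """Small-sample bias correction: g = d * (1 - 3 / (4*nu - 1))."""
    return d * (1 - 3 / (4 * df - 1))

x1, x2 = [5, 6, 7, 8, 9], [3, 4, 5, 6, 7]
d = cohens_d_independent(x1, x2)         # ≈ 1.265 (a "large" effect)
g = hedges_g(d, len(x1) + len(x2) - 2)   # ≈ 1.14, shrunk toward zero
```

Note how $g$ is always smaller in magnitude than $d$; the correction matters most for small $\nu$ and becomes negligible beyond roughly 20 per group.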

Non-Parametric Formulas

| Formula | Description |
|---|---|
| $U_1 = n_1 n_2 + n_1(n_1+1)/2 - W_1$ | Mann-Whitney $U$ statistic ($W_1$ = rank sum of group 1) |
| $z = (U - n_1 n_2/2)/\sqrt{n_1 n_2(N+1)/12}$ | Mann-Whitney $z$-approximation |
| $r_{rb} = 1 - 2U/(n_1 n_2)$ | Rank-biserial correlation (Mann-Whitney) |
| $W^+ = \sum_{d_i>0} R_i$ | Wilcoxon positive rank sum |
| $z = (W^+ - n'(n'+1)/4)/\sqrt{n'(n'+1)(2n'+1)/24}$ | Wilcoxon $z$-approximation ($n'$ = number of nonzero differences) |
| $r_W = z/\sqrt{n'}$ | Effect size for the Wilcoxon test |
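The Mann-Whitney $U_1$ and rank-biserial formulas above can be traced by hand in a short Python sketch. This assumes no tied values (ties would require midranks), and the function name is illustrative:

```python
def mann_whitney_u(x1, x2):
    """U1 = n1*n2 + n1*(n1+1)/2 - W1, where W1 is group 1's rank sum.
    Minimal sketch: assumes all values are distinct (no tie handling)."""
    n1, n2 = len(x1), len(x2)
    combined = sorted(x1 + x2)
    w1 = sum(combined.index(v) + 1 for v in x1)   # ranks start at 1
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - w1
    r_rb = 1 - 2 * u1 / (n1 * n2)                 # rank-biserial effect size
    return u1, r_rb

# Group 1 tends to rank lower, so r_rb comes out strongly negative.
u, r = mann_whitney_u([1.2, 3.4, 2.5], [4.1, 5.0, 2.9])   # U1 = 8.0
```

Be aware that software packages differ in whether they report $U_1$, $U_2 = n_1 n_2 - U_1$, or $\min(U_1, U_2)$; check which convention your output uses before computing $r_{rb}$.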

Test Selection Guide

| Design | Normal? | Equal variances? | Recommended test |
|---|---|---|---|
| 1 group vs. known value | ✅ | n/a | One-sample t-test |
| 1 group vs. known value | ❌ | n/a | Wilcoxon signed-rank |
| 2 independent groups | ✅ | Equal or unknown | Welch's t-test |
| 2 independent groups | ✅ | Known unequal | Welch's t-test |
| 2 independent groups | ❌ | n/a | Mann-Whitney U |
| 2 related groups | ✅ (differences) | n/a | Paired t-test |
| 2 related groups | ❌ (differences) | n/a | Wilcoxon signed-rank |
| $> 2$ groups | ✅ | Equal | One-way ANOVA |
| $> 2$ groups | ✅ | Unequal | Welch's ANOVA |
| $> 2$ groups | ❌ | n/a | Kruskal-Wallis |

Cohen's Benchmarks for t-Test Effect Sizes

| Label | $\lvert d \rvert$ | $\lvert r \rvert$ | $n$ per group needed |
|---|---|---|---|
| Small | 0.20 | 0.10 | 394 |
| Medium | 0.50 | 0.24 | 64 |
| Large | 0.80 | 0.37 | 26 |
| Very large | 1.20 | 0.51 | 12 |
| Huge | 2.00 | 0.71 | 5 |

All power figures assume $\alpha = .05$, two-tailed tests, 80% power, and equal group sizes.

Degrees of Freedom Reference

| Test | df |
|---|---|
| One-sample t-test | $n - 1$ |
| Independent t-test (Student's) | $n_1 + n_2 - 2$ |
| Independent t-test (Welch's) | Welch-Satterthwaite (always $\leq n_1+n_2-2$) |
| Paired t-test | $n - 1$ (where $n$ = number of pairs) |

Assumption Checks Reference

| Assumption | Test | Software function | Action if violated |
|---|---|---|---|
| Normality | Shapiro-Wilk | `shapiro.test()` | Mann-Whitney / Wilcoxon |
| Normality | Q-Q plot | `qqnorm()` | Assess visually |
| Equal variances | Levene's | `leveneTest()` | Welch's t-test |
| Equal variances | Brown-Forsythe | `bf.test()` | Welch's t-test |
| Outliers | $z$-score, boxplot | `boxplot()` | Investigate; trimmed mean |
| Independence | Design review | n/a | Multilevel model |

Confidence Interval Interpretation

| CI property | Interpretation |
|---|---|
| Entirely above zero | Effect is significantly positive at the chosen $\alpha$ |
| Entirely below zero | Effect is significantly negative at the chosen $\alpha$ |
| Contains zero | Effect is not statistically significant |
| Narrow CI | Precise estimate (large $n$) |
| Wide CI | Imprecise estimate (small $n$); interpret the point estimate cautiously |
| 90% CI within equivalence bounds | Equivalence demonstrated (TOST) |

APA 7th Edition Reporting Templates

One-sample: $t(\nu) =$ [value], $p =$ [value], $d =$ [value], 95% CI [LB, UB].

Independent samples (Welch's): $t_W(\nu_W) =$ [value], $p =$ [value], $d =$ [value], 95% CI [LB, UB].

Paired samples: $t(\nu) =$ [value], $p =$ [value], $d_z =$ [value], 95% CI [LB, UB].

Mann-Whitney: $U =$ [value], $z =$ [value], $p =$ [value], $r_{rb} =$ [value].

Wilcoxon signed-rank: $W =$ [value], $z =$ [value], $p =$ [value], $r_W =$ [value].

Required Sample Size Quick Reference

Two-sided $\alpha = .05$, two independent groups of equal size:

| Power | $d = 0.20$ | $d = 0.35$ | $d = 0.50$ | $d = 0.80$ | $d = 1.00$ |
|---|---|---|---|---|---|
| 0.70 | 310 | 102 | 50 | 20 | 14 |
| 0.80 | 394 | 130 | 64 | 26 | 17 |
| 0.90 | 527 | 174 | 85 | 34 | 22 |
| 0.95 | 651 | 215 | 105 | 42 | 27 |

All $n$ values are per group; double for total $N$.
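The tabled values can be approximated from the standard normal-approximation formula $n \approx 2(z_{1-\alpha/2} + z_{power})^2 / d^2$ per group. The sketch below uses only the Python standard library; note that this approximation runs one or two participants below the exact noncentral-$t$ values in the table, so treat its output as a lower bound:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Per-group n for a two-sided independent t-test, normal approximation:
    n ≈ 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    Slightly underestimates the exact noncentral-t answer."""
    z = NormalDist()
    n = 2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / d ** 2
    return ceil(n)

n_per_group(0.50)              # 63 (exact value in the table: 64)
n_per_group(0.20, power=0.90)  # compare with the d = 0.20 row
```

For publication-grade planning, use exact power routines (e.g., G*Power or DataStatPro's power module) rather than this approximation.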

t-Test Reporting Checklist

| Item | Required |
|---|---|
| t-statistic with sign | ✅ Always |
| Degrees of freedom | ✅ Always |
| Exact p-value (or $p < .001$) | ✅ Always |
| Mean and SD for each group | ✅ Always |
| 95% CI for the mean difference | ✅ Always |
| Cohen's $d$ or Hedges' $g$ | ✅ Always |
| 95% CI for the effect size | ✅ Always |
| Sample sizes for each group | ✅ Always |
| Whether Student's or Welch's was used | ✅ For independent t-tests |
| Levene's test result | ✅ For independent t-tests |
| Normality check result | ✅ When $n < 30$ per group |
| Which $d$ variant was used ($d_z$, $d_{av}$, etc.) | ✅ For paired designs |
| Power or sensitivity analysis | ✅ For null or inconclusive results |
| Equivalence test if claiming null | ✅ Always for null results |
| Pre-registration of one-tailed hypotheses | ✅ If a one-tailed test was used |

This tutorial provides a comprehensive foundation for understanding, conducting, and reporting t-tests and their alternatives within the DataStatPro application. For further reading, consult Gravetter and Wallnau's "Statistics for the Behavioral Sciences" (10th ed.), Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018), Wilcox's "Introduction to Robust Estimation and Hypothesis Testing" (4th ed., 2017), and Lakens's "Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses" (Social Psychological and Personality Science, 2017). For the recommendation to default to Welch's t-test, see Delacre, Lakens, and Leys (2017), "Why Psychologists Should by Default Use Welch's t-test Instead of Student's t-test" (International Review of Social Psychology). For feature requests or support, contact the DataStatPro team.