Knowledge Base / Inferential Statistics / Independent Samples t-Test (28 min read)

Independent Samples t-Test

Step-by-step guide to conducting independent samples t-tests using DataStatPro.

Independent Samples t-Test: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of two-group comparison all the way through advanced implementation, Welch's correction, effect size estimation, reporting, and practical usage within the DataStatPro application. Whether you are encountering the independent samples t-test for the first time or deepening your understanding of between-group inference, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is the Independent Samples t-Test?
  3. The Mathematics Behind the Independent Samples t-Test
  4. Assumptions of the Independent Samples t-Test
  5. Student's vs. Welch's t-Test
  6. Using the Independent Samples t-Test Calculator Component
  7. Step-by-Step Procedure
  8. Interpreting the Output
  9. Effect Sizes
  10. Confidence Intervals
  11. Advanced Topics
  12. Worked Examples
  13. Common Mistakes and How to Avoid Them
  14. Troubleshooting
  15. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

1.1 Between-Subjects vs. Within-Subjects Designs

A between-subjects design assigns different participants to different conditions. Each participant contributes exactly one score to the analysis. This is contrasted with within-subjects (repeated measures) designs where participants appear in multiple conditions.

The independent samples t-test is the appropriate test for comparing two independent groups in a between-subjects design.

1.2 The Standard Error of the Difference Between Means

When we compare two independent sample means $\bar{x}_1$ and $\bar{x}_2$, we are interested in the difference $\bar{x}_1 - \bar{x}_2$. The sampling variability of this difference has a standard error:

$$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

When $\sigma_1^2 = \sigma_2^2 = \sigma^2$ (equal variances), this simplifies to:

$$SE = \sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Since $\sigma$ is unknown, we estimate it from the data using the pooled standard deviation, yielding the estimated standard error.

1.3 The Pooled Variance

When the two populations share a common variance $\sigma^2$, the pooled variance combines the within-group variance estimates from both groups:

$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$$

This is a weighted average of $s_1^2$ and $s_2^2$, where larger groups receive more weight. The pooled estimate is more stable than either group's individual estimate.
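As a minimal sketch, the pooled SD can be computed directly from the two group SDs and sizes (Python; the function name is illustrative, not part of DataStatPro):

```python
import math

def pooled_sd(s1, s2, n1, n2):
    """Pooled standard deviation: square root of the weighted
    average of the two sample variances (weights n_j - 1)."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(sp2)

# Two groups of n = 35 with SDs 3.4 and 4.8 (cf. Example 1 in Section 12)
print(round(pooled_sd(3.4, 4.8, 35, 35), 3))  # 4.159
```

Note that with equal group sizes the weights are equal, so the pooled variance is simply the mean of the two sample variances.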

1.4 Variance Homogeneity and its Consequences

The assumption of equal population variances ($\sigma_1^2 = \sigma_2^2$) is crucial for the pooled t-test. When this assumption is violated:

- the pooled variance no longer estimates either population's variance well;
- with unequal sample sizes, Student's t-test becomes liberal or conservative depending on which group is more variable (see Section 5.1);
- p-values and confidence intervals built on the pooled standard error can be misleading.

This motivates Welch's t-test (Section 5), which does not assume equal variances.

1.5 Effect Sizes for Group Comparisons

A statistically significant result from an independent t-test tells you that the means differ beyond chance. Effect sizes quantify how much they differ in standardised units:

- Cohen's $d$ — the mean difference divided by the pooled SD;
- Hedges' $g$ — a small-sample bias-corrected version of $d$;
- Glass's $\Delta$ — the mean difference divided by the control group SD;
- the common language effect size (CL) — the probability that a randomly chosen member of one group outscores a randomly chosen member of the other.

Each of these is defined formally in Section 3.

1.6 The Relationship Between t and F

For exactly two groups, the independent samples t-test and one-way ANOVA yield identical p-values: $F = t^2$. The t-test is simpler and preferred for two-group comparisons; ANOVA generalises to three or more groups.


2. What is the Independent Samples t-Test?

2.1 The Core Question

The independent samples t-test answers: "Do two independent, unrelated groups have the same population mean?" or equivalently, "Is the observed mean difference between two groups larger than we would expect from random sampling variability alone?"

2.2 The Two Versions

| Version | Assumption | Preferred When |
| :-- | :-- | :-- |
| Student's t-test | Equal population variances ($\sigma_1^2 = \sigma_2^2$) | Confirmed equal variances; historical compatibility |
| Welch's t-test | Unequal population variances allowed | Default recommendation; any situation |

The modern consensus: Use Welch's t-test as the default. When variances are truly equal, Welch's loses negligible power. When variances are unequal, Welch's maintains correct Type I error whereas Student's does not.

2.3 When to Use the Independent Samples t-Test

| Condition | Requirement |
| :-- | :-- |
| Number of groups | Exactly two |
| Relationship between groups | Independent (different participants) |
| Outcome variable | Continuous (interval or ratio scale) |
| Distribution | Approximately normal (or $n \geq 30$ per group) |
| Variances | Equal (Student's) or potentially unequal (Welch's) |

2.4 Real-World Applications

| Field | Research Question |
| :-- | :-- |
| Clinical | Does CBT reduce anxiety more than a control condition? |
| Education | Do students taught by Method A score higher than those taught by Method B? |
| Marketing | Do customers rate Brand A higher than Brand B? |
| Medicine | Does Drug A lower blood pressure more than a placebo? |
| Organisational | Do remote workers report higher job satisfaction than office workers? |
| Neuroscience | Do patients with depression have different cortisol levels than healthy controls? |
| Sport | Do athletes trained with Method X have faster sprint times than those trained with Method Y? |

3. The Mathematics Behind the Independent Samples t-Test

3.1 Student's t-Statistic (Equal Variances)

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

Where the pooled standard deviation is:

$$s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$

Degrees of freedom: $\nu = n_1 + n_2 - 2$

3.2 Welch's t-Statistic (Unequal Variances)

$$t_W = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

Welch-Satterthwaite degrees of freedom:

$$\nu_W = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1} + \dfrac{(s_2^2/n_2)^2}{n_2-1}}$$

$\nu_W$ is generally non-integer and always $\leq n_1 + n_2 - 2$ (fewer or equal df than Student's, making Welch's more conservative when variances are equal).
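The two formulas above translate directly into code. A minimal sketch from summary statistics (Python; an illustrative helper, not DataStatPro's internal API):

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t-statistic and Welch-Satterthwaite df
    from group means, SDs, and sample sizes."""
    v1, v2 = s1**2 / n1, s2**2 / n2              # squared standard errors
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Example 1 data from Section 12: CBT (6.8, 3.4, 35) vs. waitlist (12.1, 4.8, 35)
t, df = welch_t(6.8, 3.4, 35, 12.1, 4.8, 35)
print(round(t, 2), round(df, 1))  # -5.33 61.3
```

Note that the returned df (about 61.3 here) is indeed below the Student's value of $n_1 + n_2 - 2 = 68$.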

3.3 The p-Value

Two-tailed:

$$p = 2 \times P(T_\nu \geq |t_{obs}|)$$

One-tailed (upper):

$$p = P(T_\nu \geq t_{obs})$$

3.4 Confidence Intervals

Student's 95% CI for $\mu_1 - \mu_2$:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; n_1+n_2-2} \cdot s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}$$

Welch's 95% CI for $\mu_1 - \mu_2$:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; \nu_W} \cdot \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$$

3.5 Cohen's $d$ — Standardised Mean Difference

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

Hedges' $g$ (bias-corrected):

$$g = d \times J, \qquad J = 1 - \frac{3}{4(n_1+n_2-2)-1}$$

Glass's $\Delta$ (control group SD as standardiser):

$$\Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_{control}}$$

Average SD standardiser (when neither group is a natural reference):

$$d_{av} = \frac{\bar{x}_1 - \bar{x}_2}{(s_1+s_2)/2}$$

3.6 Computing $d$ from the t-Statistic

$$d = t\sqrt{\frac{n_1+n_2}{n_1 n_2}} = \frac{t\sqrt{n_1+n_2}}{\sqrt{n_1 n_2}}$$

For equal group sizes ($n_1 = n_2 = n$): $d = t\sqrt{2/n}$
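This conversion is a one-liner; a sketch (Python, illustrative name). Strictly, the identity holds for Student's $t$, so treat the result as an approximation when $t$ comes from Welch's test:

```python
import math

def d_from_t(t, n1, n2):
    """Cohen's d recovered from an independent-samples t-statistic."""
    return t * math.sqrt((n1 + n2) / (n1 * n2))
```

For Example 1's $t \approx -5.33$ with $n_1 = n_2 = 35$, this returns about $-1.27$, matching the directly computed $d$.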

3.7 Exact CI for dd via Non-Central t-Distribution

The t-statistic follows a non-central t-distribution under H1H_1 with non-centrality:

λ=dn1n2n1+n2\lambda = d\sqrt{\frac{n_1 n_2}{n_1+n_2}}

Exact 95% CI for dd: invert this numerically (computed automatically by DataStatPro).

Approximate CI (adequate for n>20n > 20 per group):

d±1.96×n1+n2n1n2+d22(n1+n22)d \pm 1.96 \times \sqrt{\frac{n_1+n_2}{n_1 n_2} + \frac{d^2}{2(n_1+n_2-2)}}
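The approximate interval is simple enough to sketch directly (Python; illustrative function name):

```python
import math

def d_ci_approx(d, n1, n2):
    """Approximate 95% CI for Cohen's d (adequate for n > 20 per group)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2 - 2)))
    return d - 1.96 * se, d + 1.96 * se

lo, hi = d_ci_approx(0.50, 50, 50)
print(round(lo, 2), round(hi, 2))  # 0.1 0.9
```

For small samples, prefer the exact non-central-t interval described above.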

3.8 Common Language Effect Size

$$CL = \Phi\!\left(\frac{d}{\sqrt{2}}\right)$$

Interpretation: the probability that a randomly selected person from Group 1 scores higher than a randomly selected person from Group 2.

| $d$ | CL | Interpretation |
| :-- | :-- | :-- |
| 0.00 | 50.0% | No difference |
| 0.20 | 55.6% | Small |
| 0.50 | 63.8% | Medium |
| 0.80 | 71.4% | Large |
| 1.00 | 76.0% | |
| 1.50 | 85.6% | Very large |
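The CL values in the table can be reproduced with the standard normal CDF, which Python's standard library exposes via the error function (a stdlib-only sketch):

```python
import math

def phi(x):
    """Standard normal CDF, Phi(x), via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def common_language(d):
    """CL = Phi(d / sqrt(2)): P(random Group 1 score > random Group 2 score)."""
    return phi(d / math.sqrt(2.0))

for d in (0.0, 0.2, 0.5, 0.8, 1.0, 1.5):
    print(d, f"{common_language(d):.1%}")
```

Running the loop reproduces the 50.0% through 85.6% column of the table.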

3.9 Statistical Power

For equal group sizes, the non-centrality parameter is:

$$\lambda = d\sqrt{\frac{n_1 n_2}{n_1+n_2}} = d\sqrt{n/2} \quad (\text{for } n_1 = n_2 = n)$$

Required $n$ per group for power $1-\beta$, two-sided $\alpha$:

$$n \approx \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}$$

For $\alpha = .05$, power $= 0.80$: $n \approx 15.68/d^2$

Required $n$ per group:

| $d$ | Power = 0.80 | Power = 0.90 | Power = 0.95 |
| :-- | :-- | :-- | :-- |
| 0.20 | 394 | 527 | 651 |
| 0.35 | 130 | 174 | 215 |
| 0.50 | 64 | 85 | 105 |
| 0.80 | 26 | 34 | 42 |
| 1.00 | 17 | 22 | 27 |
| 1.50 | 8 | 11 | 13 |
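The closed-form rule above uses a normal approximation; a sketch (Python, illustrative). It lands one or two participants below the exact t-based values in the table, so treat it as a lower-bound planning estimate:

```python
import math

def n_per_group(d, z_alpha=1.959964, z_beta=0.841621):
    """Approximate n per group: 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2.
    Defaults correspond to two-sided alpha = .05 and power = .80."""
    return math.ceil(2 * (z_alpha + z_beta)**2 / d**2)

print(n_per_group(0.5))   # 63 (exact t-based value: 64)
print(n_per_group(0.2))   # 393 (exact t-based value: 394)
```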

4. Assumptions of the Independent Samples t-Test

4.1 Normality

Data in each group should be approximately normally distributed. The test is robust to mild non-normality, especially when $n \geq 20$ per group.

How to check: Shapiro-Wilk (per group), Q-Q plots, histograms, skewness/kurtosis.

When violated: Use Mann-Whitney U test (non-parametric alternative).

4.2 Homogeneity of Variance (for Student's t)

Student's t-test requires $\sigma_1^2 = \sigma_2^2$.

How to check:

When violated: Use Welch's t-test — the recommended default for all independent samples comparisons regardless of Levene's test result.

4.3 Independence of Observations

All observations within and across groups must be independent. No participant should contribute scores to both groups.

Common violations:

- the same participants measured in both conditions (repeated measures analysed as independent);
- matched pairs (e.g., siblings or couples) split across the two groups;
- clustered data (students within classrooms, patients within clinics).

When violated: Use the paired t-test (if within-subjects) or multilevel models (if clustered).

4.4 Independence Between Groups

The two groups themselves must be independent. Their scores should not be systematically related (e.g., no matching, no family relationships between groups).

4.5 Interval Scale of Measurement

The DV must be measured on at least an interval scale.

When violated: Use Mann-Whitney U test.

4.6 Absence of Severe Outliers

Outliers distort both $\bar{x}$ and $s_p$, biasing the t-statistic.

How to check: Boxplots per group; standardised scores with $|z_i| > 3$ within a group.

When outliers present: Investigate; report with and without; consider Welch's t-test (more robust) or Mann-Whitney U.

4.7 Assumption Summary

| Assumption | Student's | Welch's | How to Check | Remedy |
| :-- | :-- | :-- | :-- | :-- |
| Normality per group | Yes | Yes | Shapiro-Wilk, Q-Q | Mann-Whitney U |
| Equal variances | Yes | No | Levene's | Use Welch's |
| Independence within groups | Yes | Yes | Design review | Multilevel model |
| Independence between groups | Yes | Yes | Design review | Paired t-test |
| Interval scale | Yes | Yes | Measurement theory | Mann-Whitney U |
| No severe outliers | Yes | Yes | Boxplots | Investigate; robust test |

5. Student's vs. Welch's t-Test

5.1 Performance Under Different Conditions

Simulation studies (Ruxton, 2006; Delacre et al., 2017) consistently show:

| Condition | Student's Type I Error | Welch's Type I Error |
| :-- | :-- | :-- |
| Equal $n$, equal $\sigma$ | $\approx \alpha$ | $\approx \alpha$ |
| Equal $n$, unequal $\sigma$ | $\approx \alpha$ (robust) | $\approx \alpha$ |
| Unequal $n$, equal $\sigma$ | $\approx \alpha$ | $\approx \alpha$ (slightly conservative) |
| Unequal $n$, unequal $\sigma$ (larger $n$ in larger-$\sigma$ group) | $< \alpha$ (conservative) | $\approx \alpha$ |
| Unequal $n$, unequal $\sigma$ (larger $n$ in smaller-$\sigma$ group) | $> \alpha$ (liberal) | $\approx \alpha$ |

5.2 Power Comparison

When variances are truly equal:

- Student's t-test is marginally more powerful because it uses more degrees of freedom;
- the difference is negligible at all but the smallest sample sizes.

When variances are unequal:

- power comparison is moot for Student's t-test, since its Type I error rate is no longer controlled;
- Welch's t-test retains both validity and good power.

Recommendation: Always use Welch's t-test as the default. DataStatPro reports both but highlights Welch's results.

5.3 The Decision Framework

For an independent samples comparison:
├── Default: Use Welch's t-test (regardless of Levene's result)
└── If comparability with historical Student's results is needed:
    ├── Levene's p > .05: Either test is acceptable
    └── Levene's p ≤ .05: Use Welch's (do NOT use Student's)

💡 The practice of running Levene's test first and then "choosing" Student's vs. Welch's based on the result (the "pre-test" approach) leads to inflated Type I error because the selection itself is data-driven. Simply using Welch's universally avoids this problem.


6. Using the Independent Samples t-Test Calculator Component

Step-by-Step Guide

Step 1 — Select the Test

Navigate to Statistical Tests → t-Tests → Independent Samples t-Test.

Step 2 — Input Method

Provide either raw data (a score column plus a group column) or summary statistics (mean, SD, and $n$ for each group).

Step 3 — Specify the Comparison

Select the grouping variable (exactly two levels) and the continuous outcome variable.

Step 4 — Select Variance Assumption

Choose Student's (equal variances) or Welch's (unequal variances allowed). DataStatPro reports both but highlights Welch's, the recommended default.

Step 5 — Select Effect Size Standardiser

When variances are unequal, DataStatPro offers:

- Cohen's $d$ (pooled SD) — for comparability with other studies;
- Glass's $\Delta$ (control group SD) — for treatment vs. control designs;
- $d_{av}$ (average SD) — when neither group is a natural reference.

Step 6 — Select Display Options

Choose which optional output to include, such as descriptive plots and the automatically generated APA paragraph.

Step 7 — Run the Analysis

Click "Run Independent t-Test". All results, plots, and the APA paragraph are generated automatically.


7. Step-by-Step Procedure

7.1 Full Manual Procedure (Welch's t-Test)

Step 1 — State Hypotheses

$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2 \quad \text{(two-tailed)}$$

Or equivalently: $H_0: \mu_1 - \mu_2 = 0$ vs. $H_1: \mu_1 - \mu_2 \neq 0$

Step 2 — Check Assumptions

Check normality in each group, independence of observations, and outliers (Section 4). Equal variances need not be assumed, because Welch's test is used.

Step 3 — Compute Summary Statistics

$$\bar{x}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ji}, \qquad s_j = \sqrt{\frac{1}{n_j-1}\sum_{i=1}^{n_j}(x_{ji}-\bar{x}_j)^2}$$

Step 4 — Compute Standard Error Components

$$v_j = \frac{s_j^2}{n_j}, \qquad j \in \{1, 2\}$$

$$SE_W = \sqrt{v_1 + v_2}$$

Step 5 — Compute t-Statistic

$$t_W = \frac{\bar{x}_1 - \bar{x}_2}{SE_W}$$

Step 6 — Compute Welch-Satterthwaite df

$$\nu_W = \frac{(v_1+v_2)^2}{v_1^2/(n_1-1) + v_2^2/(n_2-1)}$$

Round down to the nearest integer.

Step 7 — Compute p-Value

$$p = 2 \times P(T_{\nu_W} \geq |t_W|)$$

Reject $H_0$ if $p \leq \alpha$.

Step 8 — Compute 95% CI for μ1μ2\mu_1 - \mu_2

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; \nu_W} \times SE_W$$

Step 9 — Compute Effect Sizes

$$s_p = \sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}$$

$$d = (\bar{x}_1-\bar{x}_2)/s_p$$

$$g = d\times\left(1-\frac{3}{4(n_1+n_2-2)-1}\right)$$

$$CL = \Phi(d/\sqrt{2})$$

Step 10 — Interpret and Report

Use APA template from Section 15.


8. Interpreting the Output

8.1 Reading the Results Table

| Output | What It Tells You |
| :-- | :-- |
| $t$-statistic | How many SEs the mean difference is from zero |
| df (Welch) | Effective degrees of freedom (accounts for unequal variances) |
| $p$-value | Probability of a difference this extreme or more under $H_0$ |
| Mean difference | Raw unstandardised difference $\bar{x}_1 - \bar{x}_2$ |
| 95% CI for difference | Range of plausible values for $\mu_1 - \mu_2$ |
| Cohen's $d$ | Standardised effect in SD units |
| 95% CI for $d$ | Precision of the effect size estimate |
| Levene's $p$ | Evidence against equal variances |

8.2 The Direction of the Effect

The sign of $t$ and $d$ indicates direction: with the difference computed as Group 1 minus Group 2, a negative value means Group 1 scored lower, and a positive value means Group 1 scored higher.

Always state which group is higher in words — signs alone can be misinterpreted.

8.3 When Student's and Welch's Give Different Conclusions

Disagreement between the two tests signals that variances are unequal AND sample sizes differ. In this case, trust Welch's result, report Levene's test alongside it, and do not switch between tests based on which p-value is smaller.

8.4 Cohen's dd Benchmarks

| $|d|$ | Cohen Label | CL (%) | $U_3$ (%) |
| :-- | :-- | :-- | :-- |
| 0.20 | Small | 55.6 | 57.9 |
| 0.50 | Medium | 63.8 | 69.1 |
| 0.80 | Large | 71.4 | 78.8 |
| 1.00 | | 76.0 | 84.1 |
| 1.20 | Very large | 80.2 | 88.5 |
| 2.00 | Huge | 92.1 | 97.7 |


9. Effect Sizes

9.1 Choosing the Right Standardiser

| Scenario | Recommended Effect Size | Standardiser |
| :-- | :-- | :-- |
| Equal variances, no reference group | Cohen's $d$ | Pooled SD |
| Unequal variances, no reference group | $d_{av}$ | Average SD |
| Treatment vs. control design | Glass's $\Delta$ | Control group SD |
| Small samples ($n < 20$) | Hedges' $g$ | Pooled SD (bias-corrected) |
| Meta-analysis or cross-study comparison | Hedges' $g$ | Pooled SD (bias-corrected) |

9.2 Variance Overlap Statistics

| Statistic | Formula | Interpretation |
| :-- | :-- | :-- |
| $U_1$ | $2\Phi(d/2) - 1$ | Proportion of the two distributions NOT overlapping |
| $U_2$ | $\Phi(d/2)$ | Proportion of Group 2 exceeded by the Group 1 median |
| $U_3$ | $\Phi(d)$ | Proportion of Group 2 below the Group 1 mean |

Example for $d = 0.80$:

$$U_3 = \Phi(0.80) = 0.788 = 78.8\%$$

Interpretation: 78.8% of Group 2 participants score below the mean of Group 1.
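Using the same standard-normal-CDF idea as in Section 3.8, the three overlap statistics can be sketched as (Python; the dictionary layout is illustrative):

```python
import math

def phi(x):
    """Standard normal CDF via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def overlap_stats(d):
    """U1, U2, U3 for an absolute effect size d, per the formulas above."""
    u2 = phi(abs(d) / 2)
    return {"U1": 2 * u2 - 1, "U2": u2, "U3": phi(abs(d))}

print({k: round(v, 3) for k, v in overlap_stats(0.80).items()})
# {'U1': 0.311, 'U2': 0.655, 'U3': 0.788}
```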

9.3 Effect Sizes for Unequal Variances

When Levene's test is significant ($\sigma_1^2 \neq \sigma_2^2$), the choice of standardiser matters.

Glass's $\Delta$ standardises by the control/reference group SD:

$$\Delta = \frac{\bar{x}_{treatment} - \bar{x}_{control}}{s_{control}}$$

Interpretation: The treatment group mean is $\Delta$ standard deviation units above the control group distribution — directly interpretable in terms of how many control-group SDs the treatment group has moved.

When the variance ratio exceeds 4: strongly prefer Glass's $\Delta$ or $d_{av}$ over Cohen's $d$ (which uses the pooled SD and is misleading when variances differ substantially).


10. Confidence Intervals

10.1 CI for the Mean Difference (Unstandardised)

The 95% CI for $\mu_1 - \mu_2$ provides the most directly interpretable estimate in the original measurement units:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; \nu_W} \times \sqrt{s_1^2/n_1 + s_2^2/n_2}$$

Interpretation rules:

| CI Outcome | Conclusion |
| :-- | :-- |
| Entirely positive | Group 1 is significantly higher than Group 2 |
| Entirely negative | Group 1 is significantly lower than Group 2 |
| Contains zero | Not significant at $\alpha$ |
| Narrow CI | Precise estimate of the mean difference |
| Wide CI | Imprecise; larger $n$ needed for better precision |

10.2 CI for Cohen's dd

Approximate 95% CI:

$$d \pm 1.96\sqrt{\frac{n_1+n_2}{n_1 n_2}+\frac{d^2}{2(n_1+n_2-2)}}$$

Exact CI: Uses non-central t-distribution (DataStatPro default).

10.3 Precision as a Function of nn

For equal group sizes and $d = 0.50$:

| $n$ per group | Approx. CI Width for $d$ |
| :-- | :-- |
| 10 | 1.80 |
| 20 | 1.28 |
| 50 | 0.81 |
| 100 | 0.57 |
| 200 | 0.40 |
| 500 | 0.25 |

11. Advanced Topics

11.1 Equivalence Testing for Independent Groups

When claiming two groups are practically equivalent (e.g., two interventions are equally effective), use the TOST procedure:

Specify equivalence bounds $\pm\Delta$ in raw mean-difference units.

The 90% CI for $\bar{x}_1 - \bar{x}_2$ must fall within $(-\Delta, +\Delta)$.

Or equivalently, specify $d_{equiv}$ (the standardised equivalence margin) and test whether the 90% CI for $d$ falls within $(-d_{equiv}, d_{equiv})$.

11.2 Bootstrap Confidence Intervals

When normality is violated and samples are small, bootstrap CIs for the mean difference and Cohen's $d$ are more trustworthy than t-distribution-based CIs:

  1. Draw $B = 10{,}000$ bootstrap samples (with replacement) from each group.
  2. Compute $\bar{x}_1^* - \bar{x}_2^*$ for each bootstrap sample.
  3. 95% CI: 2.5th and 97.5th percentiles of the bootstrap distribution.

DataStatPro computes bootstrap CIs automatically when raw data are provided.
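The three steps above can be sketched with the standard library alone (Python; `n_boot` and the seed are illustrative choices, and DataStatPro's implementation may differ):

```python
import random
import statistics

def bootstrap_ci_mean_diff(group1, group2, n_boot=10_000, seed=1):
    """Percentile bootstrap 95% CI for the difference in group means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        b1 = rng.choices(group1, k=len(group1))   # resample with replacement
        b2 = rng.choices(group2, k=len(group2))
        diffs.append(statistics.fmean(b1) - statistics.fmean(b2))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]
```

With raw data loaded as two lists, the returned pair brackets the plausible mean differences without any normality assumption.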

11.3 Bayesian Independent Samples t-Test

$BF_{10}$ quantifies evidence for $H_1: \mu_1 \neq \mu_2$ vs. $H_0: \mu_1 = \mu_2$, computed from $t$ and $\nu$ (or $n_1$, $n_2$) using the Rouder et al. (2009) default prior. It is particularly valuable for null results, where it can provide positive evidence that the two groups are equivalent.

11.4 Unequal Sample Sizes and Optimal Allocation

When one group is cheaper or easier to sample, unequal allocation can improve statistical power for a fixed total $N$. For two groups with costs $c_1$ and $c_2$ per participant, the optimal allocation is:

$$\frac{n_1}{n_2} = \sqrt{\frac{\sigma_1^2 / c_1}{\sigma_2^2 / c_2}}$$

When costs are equal: $n_1/n_2 = \sigma_1/\sigma_2$ — allocate more participants to the more variable group.
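A quick planning sketch of the allocation rule (Python; the function name is illustrative):

```python
import math

def allocation_ratio(sd1, sd2, cost1=1.0, cost2=1.0):
    """Optimal n1/n2 ratio for estimating a mean difference
    when groups differ in variability and per-participant cost."""
    return math.sqrt((sd1**2 / cost1) / (sd2**2 / cost2))

# Equal costs: the ratio reduces to sd1/sd2
print(round(allocation_ratio(4.8, 3.4), 2))             # 1.41
# Group 2 four times cheaper: sample relatively more of it
print(round(allocation_ratio(1.0, 1.0, 1.0, 0.25), 2))  # 0.5
```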

11.5 Heterogeneity of Variance: When It Matters Substantively

Beyond the technical issue of test validity, unequal variances have substantive implications: if a treatment not only changes the mean but also changes the variability (e.g., a drug works for some patients but not others), the variance difference is itself a scientifically important finding. Always report and discuss unequal variances when they are substantial.


12. Worked Examples

Example 1: CBT vs. Waitlist — Anxiety Scores

A clinical trial randomises $n_1 = 35$ participants to CBT and $n_2 = 35$ to a waitlist control. Anxiety is measured post-treatment (GAD-7; range 0–21).

| Group | $n$ | Mean | SD |
| :-- | :-- | :-- | :-- |
| CBT | 35 | 6.8 | 3.4 |
| Waitlist | 35 | 12.1 | 4.8 |

Levene's test: $F(1, 68) = 4.82$, $p = .032$ → unequal variances → use Welch's.

Welch's t-statistic:

$$SE_W = \sqrt{3.4^2/35 + 4.8^2/35} = \sqrt{11.56/35 + 23.04/35} = \sqrt{0.330 + 0.658} = \sqrt{0.988} = 0.994$$

$$t_W = (6.8 - 12.1)/0.994 = -5.3/0.994 = -5.332$$

Welch-Satterthwaite df:

$$v_1 = 11.56/35 = 0.330, \qquad v_2 = 23.04/35 = 0.658$$

$$\nu_W = \frac{(0.330+0.658)^2}{0.330^2/34 + 0.658^2/34} = \frac{(0.988)^2}{0.1089/34 + 0.4330/34} = \frac{0.976}{0.00320 + 0.01274} = \frac{0.976}{0.01594} = 61.2$$

Rounded down: $\nu_W = 61$.

p-value: $p = 2 \times P(T_{61} \geq 5.332) < .001$

95% CI for the mean difference:

$$t_{.025,61} = 2.000$$

$$(6.8 - 12.1) \pm 2.000 \times 0.994 = -5.3 \pm 1.988 = [-7.288, -3.312]$$

Effect sizes:

$$s_p = \sqrt{(34 \times 3.4^2 + 34 \times 4.8^2)/68} = \sqrt{(393.04 + 783.36)/68} = \sqrt{1176.4/68} = \sqrt{17.30} = 4.159$$

$d = (6.8 - 12.1)/4.159 = -5.3/4.159 = -1.274$ (Large)

Glass's $\Delta$ (standardised by waitlist SD):

$\Delta = -5.3/4.8 = -1.104$ (Large)

Hedges' $g$: $g = -1.274 \times (1 - 3/(4 \times 68 - 1)) = -1.274 \times 0.989 = -1.260$

$CL = \Phi(1.274/\sqrt{2}) = \Phi(0.901) = 0.816$ → CBT participants have lower anxiety than 81.6% of waitlist participants.
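These hand calculations can be double-checked numerically (a Python sketch from the summary statistics; variable names are illustrative):

```python
import math

m1, s1, n1 = 6.8, 3.4, 35    # CBT
m2, s2, n2 = 12.1, 4.8, 35   # waitlist

v1, v2 = s1**2 / n1, s2**2 / n2
se = math.sqrt(v1 + v2)                                    # Welch standard error
t = (m1 - m2) / se                                         # Welch t-statistic
df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))  # Satterthwaite df
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (m1 - m2) / sp                                         # Cohen's d

print(round(se, 3), round(t, 2), math.floor(df), round(d, 2))
# 0.994 -5.33 61 -1.27
```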

APA write-up: "Due to significant variance heterogeneity (Levene's $F(1, 68) = 4.82$, $p = .032$), Welch's t-test was applied. CBT participants ($M = 6.8$, $SD = 3.4$) showed significantly lower post-treatment anxiety than waitlist controls ($M = 12.1$, $SD = 4.8$), $t_W(61.2) = -5.33$, $p < .001$, $d = -1.27$ [95% CI: $-1.72$, $-0.82$]. This represents a large treatment effect. CBT participants scored lower than 81.6% of waitlist participants (CL = 81.6%). The mean difference of 5.3 GAD-7 points [95% CI: 3.31, 7.29] exceeds the clinically meaningful threshold of 4 points."


Example 2: Reaction Times — Experimental vs. Control

An experimental psychologist compares reaction times (ms) between two attention conditions: focused ($n_1 = 25$) and divided ($n_2 = 30$).

| Group | $n$ | Mean (ms) | SD |
| :-- | :-- | :-- | :-- |
| Focused | 25 | 312.4 | 38.2 |
| Divided | 30 | 364.8 | 42.7 |

Levene's test: $F(1, 53) = 0.44$, $p = .51$ → variances not significantly different. Use Welch's anyway (the recommended default):

$$v_1 = 38.2^2/25 = 1459.24/25 = 58.37$$

$$v_2 = 42.7^2/30 = 1823.29/30 = 60.78$$

$$SE_W = \sqrt{58.37 + 60.78} = \sqrt{119.15} = 10.916$$

$$t_W = (312.4 - 364.8)/10.916 = -52.4/10.916 = -4.800$$

$$\nu_W = \frac{(58.37+60.78)^2}{58.37^2/24 + 60.78^2/29} = \frac{(119.15)^2}{141.96 + 127.39} = \frac{14196.7}{269.35} = 52.7$$

$$p = 2 \times P(T_{52} \geq 4.800) < .001$$

95% CI:

$$(312.4-364.8) \pm 2.007 \times 10.916 = -52.4 \pm 21.9 = [-74.3, -30.5]$$

Cohen's $d$:

$$s_p = \sqrt{(24\times38.2^2 + 29\times42.7^2)/53} = \sqrt{(35021.8+52875.4)/53} = \sqrt{1658.4} = 40.72$$

$d = -52.4/40.72 = -1.287$ (Large)

APA write-up: "Welch's independent samples t-test revealed that focused attention participants ($M = 312.4$ ms, $SD = 38.2$ ms) had significantly faster reaction times than divided attention participants ($M = 364.8$ ms, $SD = 42.7$ ms), $t_W(52.7) = -4.80$, $p < .001$, $d = -1.29$ [95% CI: $-1.76$, $-0.80$]. The mean difference of 52.4 ms [95% CI: 30.5, 74.3 ms] represents a large effect of attention condition."


13. Common Mistakes and How to Avoid Them

Mistake 1: Using Student's Instead of Welch's as the Default

Problem: Defaulting to Student's t-test without considering whether the equal-variance assumption holds. When groups differ in both size and variance, Student's t-test produces invalid p-values.

Solution: Use Welch's t-test as the universal default for independent samples comparisons. The power cost when variances are truly equal is negligible.


Mistake 2: Running the Independent t-Test on Paired Data

Problem: Treating matched pairs or pre-post measurements as independent groups. This inflates the error term (ignores within-person correlation) and substantially reduces power.

Solution: Before choosing a test, ask: "Did the same participants contribute to both groups?" If yes, use the paired t-test.


Mistake 3: Not Reporting Glass's $\Delta$ When Variances Are Unequal

Problem: Reporting Cohen's $d$ (using the pooled SD) when $\sigma_1^2 \neq \sigma_2^2$. The pooled SD is a blend of two different distributions — not an appropriate standardiser for either group.

Solution: When Levene's is significant, report Glass's $\Delta$ (using the control group SD) or $d_{av}$ (average of both SDs) alongside Cohen's $d$.


Mistake 4: Conflating Statistical Significance with Practical Importance

Problem: Reporting $p < .001$ and concluding the effect is "large." With a large enough sample, even a trivial difference (say, 0.5 points on a 100-point scale, $d = 0.05$) reaches statistical significance.

Solution: Always report Cohen's $d$ with its 95% CI. Interpret the magnitude in the context of the measurement scale and the research domain.


Mistake 5: Ignoring the CI for the Mean Difference

Problem: Reporting only $t$ and $p$ without the 95% CI for $\mu_1 - \mu_2$. The CI provides the most directly actionable information: the range of plausible values for the true mean difference in the original units.

Solution: Always report the 95% CI for the mean difference in the abstract or results section. In clinical research, compare this CI to established minimal clinically important differences (MCIDs).


Mistake 6: Making Multiple Independent t-Tests Instead of ANOVA

Problem: Comparing three or more groups with all possible pairwise t-tests, inflating the familywise error rate.

Solution: Use one-way ANOVA (or Welch's ANOVA) followed by appropriate post-hoc tests when comparing three or more groups.


Mistake 7: Not Checking Outliers Before Running the Test

Problem: A single extreme value can drastically shift the mean and inflate the SD within a small group, producing either a falsely significant or falsely non-significant result.

Solution: Always inspect boxplots per group. Investigate outliers and report analyses with and without them. Welch's t-test is more robust to outliers than Student's when outliers affect variance.


14. Troubleshooting

| Problem | Likely Cause | Solution |
| :-- | :-- | :-- |
| Student's and Welch's give very different $p$-values | Unequal variances with unequal $n$ | Trust Welch's; report Levene's result |
| Welch's df is very small | One group has very small $n$ or near-zero variance | Check data; use exact permutation test |
| $d$ is positive but $t$ is negative | Group labelling: Group 2 > Group 1 | Relabel or state direction explicitly |
| Levene's is significant but $n$s are equal | Genuine variance heterogeneity (Student's stays robust with equal $n$, but the standardiser still matters) | Report both $d$ and Glass's $\Delta$; note variance heterogeneity |
| $p$-value is significant but CI for $d$ includes zero | Rounding error or very wide CI | Use exact CI from non-central $t$; check calculations |
| Bootstrap CI disagrees with $t$-distribution CI | Non-normality in a small sample | Trust bootstrap CI; note non-normality |
| Large $d$ but non-significant $p$ | Underpowered study | Report power; conduct sensitivity analysis; plan larger replication |
| Very wide CI for $d$ | Small $n$ per group | Report as genuine uncertainty; plan adequately powered study |
| Effect size changes substantially with vs. without an outlier | Outlier has large leverage | Report both analyses; consider robust test |

15. Quick Reference Cheat Sheet

Core Equations

| Formula | Description |
| :-- | :-- |
| $t = (\bar{x}_1-\bar{x}_2)/(s_p\sqrt{1/n_1+1/n_2})$ | Student's t-statistic |
| $s_p = \sqrt{[(n_1-1)s_1^2+(n_2-1)s_2^2]/(n_1+n_2-2)}$ | Pooled SD |
| $t_W = (\bar{x}_1-\bar{x}_2)/\sqrt{s_1^2/n_1+s_2^2/n_2}$ | Welch's t-statistic |
| $\nu_W = (v_1+v_2)^2/(v_1^2/(n_1-1)+v_2^2/(n_2-1))$ | Welch-Satterthwaite df |
| $\nu_{Student} = n_1+n_2-2$ | Student's df |
| $d = (\bar{x}_1-\bar{x}_2)/s_p$ | Cohen's $d$ |
| $d = t\sqrt{(n_1+n_2)/(n_1 n_2)}$ | $d$ from $t$-statistic |
| $g = d\times(1-3/(4(n_1+n_2-2)-1))$ | Hedges' $g$ |
| $\Delta = (\bar{x}_1-\bar{x}_2)/s_{control}$ | Glass's $\Delta$ |
| $CL = \Phi(d/\sqrt{2})$ | Common language effect size |
| $U_3 = \Phi(d)$ | Proportion of Group 2 below the Group 1 mean |
| $n \approx 15.68/d^2$ | Required $n$/group (80% power, $\alpha=.05$) |

Variance Standardiser Selection

| Condition | Use |
| :-- | :-- |
| Equal variances, no reference | Cohen's $d$ (pooled SD) |
| Unequal variances, treatment vs. control | Glass's $\Delta$ (control SD) |
| Unequal variances, no reference | $d_{av}$ (average SD) |
| Small $n$ (any) | Hedges' $g$ |
| Meta-analysis | Hedges' $g$ |

APA 7th Edition Reporting Templates

Welch's (recommended): "[Group 1] ($M =$ [value], $SD =$ [value], $n =$ [value]) and [Group 2] ($M =$ [value], $SD =$ [value], $n =$ [value]) were compared using Welch's independent samples t-test. [Levene's test result here if relevant.] The test revealed [a significant / no significant] difference, $t_W(\nu_W) =$ [value], $p =$ [value], $d =$ [value] [95% CI: LB, UB]. The mean difference was [value] [original units] [95% CI: LB, UB]."

Student's (when variances confirmed equal): "... $t(n_1+n_2-2) =$ [value], $p =$ [value], $d =$ [value] [95% CI: LB, UB]."

With Glass's $\Delta$: "... Glass's $\Delta =$ [value] [95% CI: LB, UB] (standardised by the control group SD)."

Reporting Checklist

| Item | Required |
| :-- | :-- |
| t-statistic with sign | ✅ Always |
| Degrees of freedom (specify Welch or Student) | ✅ Always |
| Exact p-value | ✅ Always |
| Means and SDs for both groups | ✅ Always |
| Sample sizes for both groups | ✅ Always |
| 95% CI for mean difference | ✅ Always |
| Cohen's $d$ or Hedges' $g$ with 95% CI | ✅ Always |
| Which test used (Student's vs. Welch's) | ✅ Always |
| Levene's test result | ✅ Always for independent designs |
| Normality check per group | ✅ When $n < 30$ per group |
| Glass's $\Delta$ | ✅ When variances are unequal |
| CL effect size | Recommended |
| Power analysis | ✅ For null or underpowered results |
| Equivalence test | ✅ When claiming equivalence |

This tutorial provides a comprehensive foundation for understanding, conducting, and reporting independent samples t-tests within the DataStatPro application. For further reading, see Ruxton (2006) "The unequal variance t-test is an underused alternative" (Behavioral Ecology), Delacre, Lakens & Leys (2017) "Why Psychologists Should by Default Use Welch's t-Test" (International Review of Social Psychology), and Lakens (2013) "Calculating and Reporting Effect Sizes" (Frontiers in Psychology). For feature requests or support, contact the DataStatPro team.