t-Tests and Alternatives: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of hypothesis testing all the way through advanced t-test variants, non-parametric alternatives, interpretation, reporting, and practical usage within the DataStatPro application. Whether you are encountering t-tests for the first time or seeking a deeper understanding of when and how to apply parametric and non-parametric tests for comparing means, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is a t-Test?
- The Mathematics Behind t-Tests
- Assumptions of t-Tests
- Types of t-Tests
- Using the t-Test Calculator Component
- One-Sample t-Test
- Independent Samples t-Test
- Paired Samples t-Test
- Welch's t-Test — Unequal Variances
- Non-Parametric Alternatives
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into t-tests, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.
1.1 Populations, Samples, and Parameters
A population is the complete set of individuals or observations of interest. A sample is a subset drawn from the population. Parameters describe population characteristics (e.g., $\mu$, $\sigma$), while statistics describe sample characteristics (e.g., $\bar{x}$, $s$).
The t-test is an inferential procedure — it uses sample statistics to draw conclusions about unknown population parameters. The fundamental question in every t-test is: "Is the difference between observed means large enough to conclude that the true population means differ?"
1.2 The Sampling Distribution of the Mean
If we repeatedly drew samples of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$, the distribution of sample means would itself be a distribution — the sampling distribution of the mean. By the Central Limit Theorem (CLT), this distribution is approximately normal for large $n$: $\bar{x} \sim N(\mu, \sigma^2/n)$.
The standard error of the mean is:

$$SE = \frac{\sigma}{\sqrt{n}}$$

When the population standard deviation $\sigma$ is unknown (as in virtually all real applications), it is estimated by the sample standard deviation $s$, giving the estimated standard error:

$$\widehat{SE} = \frac{s}{\sqrt{n}}$$

This substitution is what necessitates the use of the $t$-distribution rather than the standard normal distribution.
1.3 The t-Distribution
The Student's t-distribution was derived by William Sealy Gosset (publishing under the pseudonym "Student") in 1908. It arises whenever we estimate a normally distributed population's mean using a small sample and an unknown variance.
The t-distribution is characterised by a single parameter: the degrees of freedom $\nu$. As $\nu \to \infty$, the t-distribution converges to the standard normal $N(0, 1)$.
Key properties:
- Symmetric and bell-shaped, centred at 0.
- Has heavier tails than the standard normal (more probability in the extremes).
- Heavier tails for smaller $\nu$ — reflecting greater uncertainty from estimating $\sigma$ with $s$.
- Requires looking up critical values $t_{\alpha/2,\,\nu}$ for hypothesis testing.
Critical values for common $\alpha$ levels:
| $\nu$ (df) | two-tailed $\alpha = .05$ | two-tailed $\alpha = .01$ | two-tailed $\alpha = .001$ |
|---|---|---|---|
| 5 | 2.571 | 4.032 | 6.869 |
| 10 | 2.228 | 3.169 | 4.587 |
| 20 | 2.086 | 2.845 | 3.850 |
| 30 | 2.042 | 2.750 | 3.646 |
| 60 | 2.000 | 2.660 | 3.460 |
| 120 | 1.980 | 2.617 | 3.373 |
| $\infty$ | 1.960 | 2.576 | 3.291 |
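These critical values can be reproduced with SciPy's t-distribution quantile function; this is a quick sketch, not part of DataStatPro itself:

```python
# Upper critical value t_{alpha/2, nu} is the (1 - alpha/2) quantile.
from scipy import stats

crit_05_df5 = stats.t.ppf(0.975, 5)     # alpha = .05 two-tailed, df = 5
crit_001_df5 = stats.t.ppf(0.9995, 5)   # alpha = .001 two-tailed, df = 5
crit_05_inf = stats.norm.ppf(0.975)     # limiting normal value as df -> inf

print(round(crit_05_df5, 3), round(crit_001_df5, 3), round(crit_05_inf, 3))
# 2.571 6.869 1.96
```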
1.4 Hypothesis Testing Framework
Every t-test operates within the Neyman-Pearson hypothesis testing framework:
Step 1 — State the hypotheses:
- $H_0$ (null hypothesis): The parameter equals a specified value (e.g., $\mu = \mu_0$).
- $H_1$ (alternative hypothesis): The parameter differs from that value.
Step 2 — Choose $\alpha$: The significance level $\alpha$ is the acceptable Type I error rate (conventionally $\alpha = .05$). It is the probability of rejecting $H_0$ when it is true.
Step 3 — Compute the test statistic: The t-statistic measures how many standard errors the observed result is from the null hypothesis value.
Step 4 — Compute the p-value: The probability of observing a t-statistic at least as extreme as the one obtained, assuming $H_0$ is true.
Step 5 — Make a decision: Reject $H_0$ if $p \le \alpha$; fail to reject $H_0$ if $p > \alpha$.
Step 6 — Compute and report the effect size with CI: Statistical significance alone is insufficient. Always accompany the t-test result with Cohen's $d$ (or equivalent) and its 95% confidence interval.
1.5 Type I and Type II Errors
| Decision | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct (Power = $1-\beta$) |
| Fail to Reject $H_0$ | Correct ($1-\alpha$) | Type II error ($\beta$) |
- Type I error ($\alpha$): Concluding there is an effect when there is none (false positive).
- Type II error ($\beta$): Missing a true effect (false negative).
- Power ($1-\beta$): The probability of correctly detecting a true effect.
1.6 One-Tailed vs. Two-Tailed Tests
A two-tailed test places the rejection region in both tails of the distribution and is appropriate when the direction of the effect is not specified in advance:

$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_1: \mu \ne \mu_0$$

A one-tailed test places the entire rejection region in one tail, appropriate only when a directional prediction is made before data collection on strong theoretical grounds:

$$H_0: \mu \le \mu_0 \quad \text{vs.} \quad H_1: \mu > \mu_0 \quad \text{(or the reverse)}$$
⚠️ One-tailed tests should be pre-registered and theoretically justified before data collection. Using a one-tailed test post-hoc to achieve significance is p-hacking. In the absence of a strong directional prediction, always use a two-tailed test.
1.7 Confidence Intervals and Their Relationship to t-Tests
A $100(1-\alpha)\%$ confidence interval for the mean difference is directly related to the two-tailed t-test at significance level $\alpha$: the null hypothesis is rejected at level $\alpha$ if and only if the null value (e.g., $\mu_0$, or 0 for a difference) lies outside the CI.
The CI provides strictly more information than the p-value — it communicates both the direction and precision of the estimate and enables assessment of practical significance.
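This duality is easy to demonstrate with SciPy on invented illustrative data — the two-tailed test at $\alpha = .05$ rejects exactly when $\mu_0$ falls outside the 95% CI:

```python
import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 5.6, 4.9, 5.3, 5.7, 5.0, 5.4])  # illustrative sample
mu0 = 5.0

t_stat, p = stats.ttest_1samp(x, mu0)

# Build the 95% CI by hand from the t critical value.
n = len(x)
se = x.std(ddof=1) / np.sqrt(n)
half = stats.t.ppf(0.975, n - 1) * se
ci = (x.mean() - half, x.mean() + half)

# These two booleans always agree, whatever the data:
print(p < 0.05, not (ci[0] <= mu0 <= ci[1]))
```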
2. What is a t-Test?
2.1 The Core Idea
A t-test is a parametric inferential statistical test used to determine whether there is a statistically significant difference between means. The general form of the t-statistic is:

$$t = \frac{\text{observed difference} - \text{null value}}{\text{standard error of the difference}}$$
The denominator — the standard error — is the key: it scales the observed difference by the sampling variability, allowing us to determine whether the difference is larger than what we would typically expect from sampling variation alone.
2.2 When to Use a t-Test
A t-test is appropriate when:
- The outcome variable is continuous (interval or ratio scale).
- You are comparing one or two group means.
- The data are approximately normally distributed (or $n$ is large enough for the CLT to apply).
- Observations within each group are independent (for independent t-tests).
2.3 The Three Versions of the t-Test
| t-Test Type | Research Question | Example |
|---|---|---|
| One-sample | Does a sample mean differ from a known/hypothesised value? | Is average exam score different from 70? |
| Independent samples | Do two unrelated groups have different means? | Do males and females differ on anxiety? |
| Paired samples | Do two related measurements differ within the same units? | Does anxiety change from pre- to post-treatment? |
2.4 The t-Test in Context
The t-test is one member of a broader family of inferential tests:
| Situation | Test |
|---|---|
| One group vs. known value (normal data) | One-sample t-test |
| Two independent groups (normal, equal variances) | Student's independent t-test |
| Two independent groups (normal, unequal variances) | Welch's t-test |
| Two related groups (normal data) | Paired samples t-test |
| Two independent groups (non-normal or ordinal data) | Mann-Whitney U test |
| Two related groups (non-normal or ordinal data) | Wilcoxon signed-rank test |
| One group vs. known value (non-normal) | Wilcoxon signed-rank (one-sample) |
| More than two groups (normal data) | One-way ANOVA (F-test) |
| More than two groups (non-normal) | Kruskal-Wallis test |
2.5 Statistical Significance vs. Practical Significance
A t-test answers: "Is the observed mean difference larger than expected by chance?" It does not answer: "Is the difference large enough to matter in practice?"
With large samples, trivially small differences become statistically significant. A study comparing two teaching methods with several thousand students per group might find a highly significant p-value for a mean difference of only 0.3 points on a 100-point scale — significant but practically meaningless.
Always report:
- The t-statistic and p-value (statistical significance).
- Cohen's $d$ or an equivalent effect size (practical significance).
- The 95% CI for the mean difference and for the effect size.
3. The Mathematics Behind t-Tests
3.1 The One-Sample t-Statistic
The one-sample t-test tests whether a sample mean $\bar{x}$ differs significantly from a hypothesised population mean $\mu_0$:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

Where:
- $\bar{x}$ = sample mean
- $\mu_0$ = hypothesised population mean (null value)
- $s$ = sample standard deviation
- $n$ = sample size
Under $H_0$, this statistic follows a t-distribution with $\nu = n - 1$ degrees of freedom.
The 95% CI for the population mean:

$$\bar{x} \pm t_{.025,\,n-1} \cdot \frac{s}{\sqrt{n}}$$
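The Section 3.1 formula can be checked against SciPy's built-in routine; the data below are illustrative:

```python
import numpy as np
from scipy import stats

x = np.array([102.0, 98.0, 110.0, 105.0, 95.0, 108.0, 101.0, 99.0])
mu0 = 100.0
n = len(x)

# Manual computation: t = (xbar - mu0) / (s / sqrt(n)), df = n - 1.
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(x, mu0)
assert np.isclose(t_manual, t_scipy) and np.isclose(p_manual, p_scipy)
```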
3.2 The Independent Samples t-Statistic (Student's)
The independent samples t-test (Student's version) tests whether two population means are equal ($H_0: \mu_1 = \mu_2$), assuming homogeneity of variance:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Where the pooled standard deviation is:

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

Degrees of freedom:

$$\nu = n_1 + n_2 - 2$$

The 95% CI for the mean difference $\mu_1 - \mu_2$:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{.025,\,\nu} \cdot s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
3.3 Welch's t-Statistic — Unequal Variances
Welch's t-test does not assume equal population variances. It computes a separate variance estimate for each group:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

The degrees of freedom are approximated by the Welch-Satterthwaite equation:

$$\nu_W = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Note: $\nu_W$ is generally non-integer and is typically rounded down. Welch's df satisfy $\nu_W \le n_1 + n_2 - 2$ (i.e., always fewer or equal df than Student's t-test, making it a more conservative test).
The 95% CI for the mean difference:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{.025,\,\nu_W} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
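A minimal sketch of Welch's statistic and the Welch–Satterthwaite df, checked against `scipy.stats.ttest_ind(equal_var=False)` on illustrative data:

```python
import numpy as np
from scipy import stats

g1 = np.array([12.1, 14.3, 13.5, 15.0, 12.8, 14.1])
g2 = np.array([10.2, 11.5, 9.8, 12.0, 10.9, 11.1, 10.4, 11.8])

# Per-group squared standard errors s_j^2 / n_j.
v1 = g1.var(ddof=1) / len(g1)
v2 = g2.var(ddof=1) / len(g2)

t_manual = (g1.mean() - g2.mean()) / np.sqrt(v1 + v2)
df_ws = (v1 + v2) ** 2 / (v1**2 / (len(g1) - 1) + v2**2 / (len(g2) - 1))

res = stats.ttest_ind(g1, g2, equal_var=False)
assert np.isclose(t_manual, res.statistic)
# df_ws is non-integer and never exceeds Student's n1 + n2 - 2:
assert df_ws <= len(g1) + len(g2) - 2
```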
3.4 The Paired Samples t-Statistic
The paired t-test treats the data as a set of difference scores $d_i = x_{i1} - x_{i2}$ computed for each pair. It tests whether the mean difference is significantly different from zero:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

Where:
- $\bar{d}$ = mean of the difference scores
- $s_d$ = SD of the difference scores
- $n$ = number of pairs
Degrees of freedom: $\nu = n - 1$
The 95% CI for the mean difference:

$$\bar{d} \pm t_{.025,\,n-1} \cdot \frac{s_d}{\sqrt{n}}$$

The relationship between the paired t-statistic, the correlation $r$ between paired measurements, and the independent samples t-statistic follows from the variance of a difference:

$$s_d^2 = s_1^2 + s_2^2 - 2 r s_1 s_2$$

This shows that when $r > 0$ (paired measurements are positively correlated), the paired test has a smaller denominator (less error variance) and thus greater statistical power than the independent samples test for the same data.
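The paired statistic and the variance identity above can be verified directly (illustrative pre/post data):

```python
import numpy as np
from scipy import stats

pre  = np.array([24.0, 19.0, 31.0, 27.0, 22.0, 26.0, 30.0, 21.0])
post = np.array([20.0, 17.0, 25.0, 24.0, 21.0, 22.0, 26.0, 19.0])
d = pre - post
n = len(d)

# Paired t by hand vs scipy.stats.ttest_rel.
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))
t_scipy, _ = stats.ttest_rel(pre, post)
assert np.isclose(t_manual, t_scipy)

# Variance identity: s_d^2 = s1^2 + s2^2 - 2 r s1 s2 (holds exactly).
s1, s2 = pre.std(ddof=1), post.std(ddof=1)
r = np.corrcoef(pre, post)[0, 1]
assert np.isclose(d.var(ddof=1), s1**2 + s2**2 - 2 * r * s1 * s2)
```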
3.5 The p-value
The p-value is computed from the t-statistic and degrees of freedom using the cumulative distribution function (CDF) of the t-distribution:
Two-tailed p-value: $p = 2\left[1 - F_\nu(|t|)\right]$
One-tailed p-value (upper tail): $p = 1 - F_\nu(t)$
One-tailed p-value (lower tail): $p = F_\nu(t)$
Where $F_\nu$ is the CDF of the t-distribution with $\nu$ degrees of freedom.
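The three formulas translate directly into SciPy calls, where `t.sf` is the survival function $1 - F_\nu$ and `t.cdf` is $F_\nu$ (the t-statistic and df here are illustrative):

```python
from scipy import stats

t_obs, df = 2.30, 18  # illustrative observed statistic and df

p_two   = 2 * stats.t.sf(abs(t_obs), df)  # two-tailed
p_upper = stats.t.sf(t_obs, df)           # H1: upper tail
p_lower = stats.t.cdf(t_obs, df)          # H1: lower tail

# Sanity checks: for positive t, p_two = 2 * p_upper, and sf + cdf = 1.
assert abs(p_two - 2 * p_upper) < 1e-12
assert abs(p_upper + p_lower - 1.0) < 1e-12
```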
3.6 Computing Effect Sizes from t-Statistics
When raw data are unavailable, effect sizes can be computed directly from the reported t-statistic:
Cohen's $d$ from an independent samples t-test:

$$d = t\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

For equal group sizes ($n_1 = n_2$), approximately:

$$d = \frac{2t}{\sqrt{df}}$$

Cohen's $d$ from a one-sample or paired t-test:

$$d = \frac{t}{\sqrt{n}}$$

Pearson $r$ from any t-statistic:

$$r = \sqrt{\frac{t^2}{t^2 + df}}$$

Where $df$ is the degrees of freedom. Note: this $r$ is equivalent to the point-biserial correlation between the binary group variable and the continuous outcome.
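A short sketch of recovering effect sizes from a reported t-statistic (the published values below are invented for illustration):

```python
import math

t, n1, n2 = 2.75, 40, 40  # illustrative reported result
df = n1 + n2 - 2

d = t * math.sqrt(1 / n1 + 1 / n2)  # Cohen's d, independent groups
d_equal_n = 2 * t / math.sqrt(df)   # approximate shortcut for n1 == n2
r = math.sqrt(t**2 / (t**2 + df))   # point-biserial r

# The shortcut agrees closely with the exact formula for moderate n.
assert abs(d - d_equal_n) < 0.01
```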
3.7 The Non-Central t-Distribution and Exact CIs for
Under the alternative hypothesis (when a true effect exists), the t-statistic does not follow a central t-distribution — it follows a non-central t-distribution with non-centrality parameter:

$$\delta = d\sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

This non-centrality parameter links the population effect size $d$ to the expected t-statistic. Exact 95% CIs for Cohen's $d$ invert this relationship numerically (no closed form exists) — a computation performed automatically by DataStatPro.
3.8 Statistical Power of the t-Test
Power is the probability that the t-test correctly rejects $H_0$ when a true effect exists:

$$\text{Power} = 1 - F_{\nu,\delta}(t_{crit}) + F_{\nu,\delta}(-t_{crit})$$

Where $F_{\nu,\delta}$ is the CDF of the non-central t-distribution with non-centrality parameter:
$\delta = d\sqrt{\frac{n_1 n_2}{n_1 + n_2}}$ (independent) or $\delta = d\sqrt{n}$ (one-sample or paired)
For the independent samples t-test with equal groups, the approximate required sample size per group for power $1-\beta$ at two-sided level $\alpha$:

$$n \approx \frac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{d^2}$$
| $d$ | Power = 0.80 ($n$/group) | Power = 0.90 ($n$/group) | Power = 0.95 ($n$/group) |
|---|---|---|---|
| 0.20 (small) | 394 | 527 | 651 |
| 0.50 (medium) | 64 | 85 | 105 |
| 0.80 (large) | 26 | 34 | 42 |
| 1.00 | 17 | 22 | 27 |
| 1.20 | 12 | 16 | 20 |
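The table entries can be reproduced from the non-central t-distribution in SciPy; this sketch recomputes the $d = 0.5$, $n = 64$/group cell:

```python
import math
from scipy import stats

d, n = 0.5, 64                 # effect size and per-group n
df = 2 * n - 2
delta = d * math.sqrt(n / 2)   # non-centrality parameter for equal groups
t_crit = stats.t.ppf(0.975, df)

# Power = P(|T| > t_crit) under the non-central t with parameter delta.
power = stats.nct.sf(t_crit, df, delta) + stats.nct.cdf(-t_crit, df, delta)
print(round(power, 3))  # close to 0.80, matching the table
```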
4. Assumptions of t-Tests
4.1 Normality of the Sampling Distribution
The t-test assumes that the sampling distribution of the mean difference is normal. This is satisfied when either:
- The population from which data are drawn is normally distributed, OR
- The sample size is sufficiently large for the CLT to ensure approximate normality of $\bar{x}$ (generally $n \ge 30$ per group, though skewed distributions may require larger $n$).
How to check:
- Shapiro-Wilk test for smaller samples (among the most powerful normality tests for small $n$).
- Kolmogorov-Smirnov or Lilliefors test for larger samples.
- Q-Q (quantile-quantile) plots: points should fall approximately on the diagonal line.
- Histograms and density plots: assess approximate bell shape.
- Skewness and kurtosis statistics (values far from 0 suggest non-normality).
Robustness: The t-test is remarkably robust to mild non-normality, especially for larger samples. For moderate non-normality with roughly $n \ge 30$ per group, the t-test's Type I error rate remains close to the nominal $\alpha$.
When violated: Use the Mann-Whitney U test (independent) or Wilcoxon signed-rank test (paired) as non-parametric alternatives. Consider data transformation (log, square root) if the distribution is strongly skewed.
4.2 Homogeneity of Variance (for Independent Samples t-Test)
Student's independent t-test assumes that the two populations have equal variances ($\sigma_1^2 = \sigma_2^2$). This assumption is required for the pooled standard deviation $s_p$ to be a valid common estimator.
How to check:
- Levene's test (preferred — robust to non-normality); $H_0$: equal variances.
- Brown-Forsythe test: a more robust variant of Levene's for non-normal data.
- Variance ratio rule of thumb: if the larger variance exceeds the smaller by a factor of about 4, heterogeneity is potentially problematic.
- $F$-test of equality of variances (sensitive to non-normality — use with caution).
⚠️ A statistically significant Levene's test does not automatically invalidate Student's t-test for large equal-sized samples (the test is robust). However, when groups are unequal in size AND have unequal variances, Student's t-test can be severely anti-conservative (inflated Type I error). In this case, always use Welch's t-test.
When violated: Use Welch's t-test, which does not assume equal variances and is generally recommended as the default for independent samples comparisons (see Section 10).
4.3 Independence of Observations
Within each group, all observations must be independent — the score of one participant must not influence the score of any other. This is an assumption about the study design, not about the data, and cannot be tested statistically.
Common violations:
- Students sampled from the same classroom (classroom effect).
- Patients sampled from the same hospital ward.
- Animals from the same litter.
- Repeated measurements on the same participant (use paired t-test instead).
- Family members in the same study.
When violated: For clustered data, use multilevel models. For repeated measures within the same participant, use the paired t-test or repeated measures ANOVA.
4.4 Scale of Measurement
t-Tests assume the dependent variable is measured on at least an interval scale — that is, the differences between values are meaningful and equal across the scale.
When violated: If the outcome is ordinal (ranked categories) or continuous but severely non-normal, use non-parametric alternatives (Mann-Whitney U, Wilcoxon signed-rank).
4.5 Random Sampling
For inferential conclusions to generalise to the population, the sample should be randomly selected from the population of interest. In practice, many research samples are convenience samples; this limits the generalisability of conclusions but does not invalidate the mathematical procedure of the t-test itself.
4.6 Absence of Influential Outliers
Extreme outliers can dramatically distort the mean and standard deviation, leading to inflated or deflated t-statistics. The t-test is sensitive to outliers, particularly in small samples.
How to check:
- Boxplots: inspect for values more than $1.5 \times \text{IQR}$ beyond the quartiles.
- Standardised scores: flag $|z| > 3$ as potential outliers.
- Grubbs' test or generalised ESD method for formal outlier detection.
When outliers are present: Investigate whether outliers represent data entry errors, measurement errors, or genuine extreme values. Report analyses with and without outliers. Consider using the trimmed mean t-test or a robust alternative.
4.7 Assumption Summary Table
| Assumption | One-Sample | Independent | Paired | How to Check | Remedy if Violated |
|---|---|---|---|---|---|
| Normality | ✅ | ✅ | ✅ (differences) | Shapiro-Wilk, Q-Q | Mann-Whitney / Wilcoxon |
| Equal variances | — | ✅ | — | Levene's test | Welch's t-test |
| Independence | ✅ | ✅ (within groups) | ✅ (between pairs) | Design check | Multilevel models |
| Interval scale | ✅ | ✅ | ✅ | Measurement theory | Non-parametric test |
| No severe outliers | ✅ | ✅ | ✅ | Boxplots, $z$-scores | Trimmed mean / robust test |
5. Types of t-Tests
5.1 Decision Flowchart for Test Selection
The following logic guides selection of the appropriate t-test or alternative:
Is the outcome variable continuous (interval/ratio)?
├── NO → Use chi-squared / Fisher's exact (categorical outcomes)
└── YES → How many groups?
├── MORE THAN 2 → Use ANOVA (or Kruskal-Wallis)
└── 1 OR 2 → Are observations independent or paired?
├── PAIRED (same units, two conditions)
│ ├── Normal differences? → Paired t-test
│ └── Non-normal? → Wilcoxon signed-rank
└── INDEPENDENT (different participants)
├── One group vs. known value?
│ ├── Normal? → One-sample t-test
│ └── Non-normal? → Wilcoxon signed-rank (one-sample)
└── Two independent groups
├── Normal + equal variances → Student's t-test
├── Normal + unequal variances → Welch's t-test ✅ (recommended default)
└── Non-normal or ordinal → Mann-Whitney U
5.2 Choosing Between Student's and Welch's t-Test
A persistent question in applied statistics is whether to use Student's t-test (assuming equal variances) or Welch's t-test (not assuming equal variances) for independent samples.
The consensus recommendation: Use Welch's t-test as the default for independent samples comparisons:
| Scenario | Student's t-test | Welch's t-test |
|---|---|---|
| Equal $n$, equal $\sigma^2$ | ✅ Correct size | ✅ Correct size |
| Equal $n$, unequal $\sigma^2$ | ⚠️ Slightly liberal | ✅ Correct size |
| Unequal $n$, equal $\sigma^2$ | ✅ Correct size | ✅ Slightly conservative |
| Unequal $n$, unequal $\sigma^2$ | ❌ Severely liberal | ✅ Correct size |
Simulation studies (Ruxton, 2006; Delacre et al., 2017) consistently show that Welch's t-test controls Type I error across all conditions, whereas Student's t-test fails when $n$ and $\sigma^2$ are both unequal. The loss of power from using Welch's test when variances are truly equal is negligible.
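A small Monte Carlo sketch of this point (simulation settings are illustrative, not taken from the cited studies): both populations share the same mean, so every rejection is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reps, alpha = 4000, 0.05
rej_student = rej_welch = 0
for _ in range(reps):
    a = rng.normal(0, 4, size=10)   # small group with large SD
    b = rng.normal(0, 1, size=40)   # large group with small SD
    rej_student += stats.ttest_ind(a, b, equal_var=True).pvalue < alpha
    rej_welch   += stats.ttest_ind(a, b, equal_var=False).pvalue < alpha

# Student's false-positive rate is inflated well above .05;
# Welch's stays near the nominal level.
print(rej_student / reps, rej_welch / reps)
```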
💡 The recommendation to default to Welch's t-test is supported by simulation evidence and is increasingly standard practice. DataStatPro reports both Student's and Welch's results by default, with Welch's highlighted as the recommended result.
6. Using the t-Test Calculator Component
The t-Test Calculator component in DataStatPro provides a comprehensive tool for running, diagnosing, visualising, and reporting t-tests and their alternatives.
Step-by-Step Guide
Step 1 — Select the Test Type
Choose from the "Test Type" dropdown:
- One-Sample t-Test: Compare a sample mean to a known or hypothesised value.
- Independent Samples t-Test: Compare means from two unrelated groups.
- Paired Samples t-Test: Compare means from two related measurements.
- Welch's t-Test: Independent samples without the equal variance assumption.
- Mann-Whitney U Test: Non-parametric independent samples comparison.
- Wilcoxon Signed-Rank Test: Non-parametric paired or one-sample comparison.
Step 2 — Input Method
Choose how to provide the data:
- Raw data: Upload or paste the dataset directly. DataStatPro computes all statistics automatically, checks assumptions, and flags potential issues.
- Summary statistics: Enter $n$, $\bar{x}$, and $s$ for each group. Full assumption checks are not available but all test statistics and effect sizes are computed.
- Test statistic + df: Enter the $t$-statistic and degrees of freedom to compute p-values, effect sizes, and CIs from a published result.
💡 When using raw data, DataStatPro automatically runs Shapiro-Wilk tests for normality and Levene's test for equality of variances, and displays the results alongside the main output with colour-coded warnings for violations.
Step 3 — Specify the Null Hypothesis Value
- One-sample: Enter $\mu_0$ (default: 0). This is the population mean under $H_0$.
- Independent/Paired: Default is $\mu_1 - \mu_2 = 0$. Specify a non-zero value for non-inferiority or superiority testing (e.g., a pre-specified margin $\Delta$).
Step 4 — Select the Alternative Hypothesis
- Two-tailed (default): $H_1: \mu \ne \mu_0$ — most common, requires no directional prediction.
- Upper one-tailed: $H_1: \mu > \mu_0$ — pre-registered directional prediction.
- Lower one-tailed: $H_1: \mu < \mu_0$ — pre-registered directional prediction.
Step 5 — Choose the Significance Level
Select $\alpha$ (default: $\alpha = .05$). DataStatPro also provides results for additional $\alpha$ levels simultaneously for reference.
Step 6 — Select the Variance Assumption
For independent samples tests:
- Equal variances (Student's): Uses pooled SD; $\nu = n_1 + n_2 - 2$.
- Unequal variances (Welch's) — Recommended: Uses Welch-Satterthwaite df.
- Auto: Run both; flag discrepancies based on Levene's test result.
Step 7 — Select Display Options
Choose which outputs to display:
- ✅ t-statistic, df, p-value, and decision.
- ✅ Means, SDs, and standard errors for each group.
- ✅ Mean difference with 95% CI (exact).
- ✅ Cohen's $d$ (or $d_z$) and Hedges' $g$ with 95% CI.
- ✅ Common Language Effect Size (CL %).
- ✅ Assumption test results (Shapiro-Wilk, Levene's).
- ✅ Distribution visualisation (two overlapping density curves with shaded overlap).
- ✅ Effect size visualisation (Cohen's $d$ diagram with distribution overlap statistics).
- ✅ Power analysis and required $n$ for 80%, 90%, 95% power.
- ✅ APA-style results paragraph.
- ✅ Equivalence test (TOST) for assessing negligibility of the effect.
Step 8 — Run the Analysis
Click "Run t-Test". DataStatPro will:
- Compute the t-statistic, degrees of freedom, and p-value.
- Construct the 95% CI for the mean difference.
- Compute Cohen's $d$, Hedges' $g$, and their exact CIs.
- Run all selected assumption tests and display warnings.
- Generate all selected visualisations.
- Generate an APA 7th edition-compliant results paragraph.
7. One-Sample t-Test
7.1 Purpose and Design
The one-sample t-test answers the question: "Is the mean of my sample significantly different from a specific, theoretically or practically meaningful value $\mu_0$?"
Common applications:
- Testing whether a sample's mean IQ differs from the population mean of 100.
- Testing whether a sample's mean reaction time differs from a published normative value.
- Quality control: testing whether a machine produces items with a target weight.
- Testing whether a clinical sample's mean score differs from the healthy population norm.
7.2 Full Procedure
Given: A sample of $n$ observations with mean $\bar{x}$ and standard deviation $s$. Test $H_0: \mu = \mu_0$ vs. $H_1: \mu \ne \mu_0$.
Step 1 — Compute the sample mean and SD

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$

Step 2 — Compute the standard error

$$SE = \frac{s}{\sqrt{n}}$$

Step 3 — Compute the t-statistic

$$t = \frac{\bar{x} - \mu_0}{SE}$$

Step 4 — Determine degrees of freedom: $\nu = n - 1$
Step 5 — Compute the p-value
Compare $p$ to $\alpha$. Reject $H_0$ if $p \le \alpha$.
Step 6 — Compute the 95% CI for $\mu$

$$\bar{x} \pm t_{.025,\,n-1} \cdot SE$$

Step 7 — Compute Cohen's $d$

$$d = \frac{\bar{x} - \mu_0}{s}$$

Hedges' $g$ (bias-corrected):

$$g = d\left(1 - \frac{3}{4\nu - 1}\right)$$
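The seven steps above can be sketched in code (the exam-score sample and $\mu_0 = 70$ are illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([72.0, 69.0, 75.0, 71.0, 68.0, 74.0, 73.0, 70.0, 76.0, 67.0])
mu0 = 70.0
n = len(x)

mean, sd = x.mean(), x.std(ddof=1)         # Step 1
se = sd / np.sqrt(n)                       # Step 2
t_stat = (mean - mu0) / se                 # Step 3
df = n - 1                                 # Step 4
p = 2 * stats.t.sf(abs(t_stat), df)        # Step 5
half = stats.t.ppf(0.975, df) * se
ci = (mean - half, mean + half)            # Step 6
d = (mean - mu0) / sd                      # Step 7: Cohen's d
g = d * (1 - 3 / (4 * df - 1))             # Hedges' g correction

print(round(t_stat, 2), round(p, 3), round(d, 2))
```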
7.3 Interpreting the One-Sample t-Test
| Result | Interpretation |
|---|---|
| $p \le \alpha$ and CI excludes $\mu_0$ | Reject $H_0$: sample mean differs significantly from $\mu_0$ |
| $p > \alpha$ and CI includes $\mu_0$ | Fail to reject $H_0$: insufficient evidence of a difference |
| Large $d$, $p \le \alpha$ | Significant AND practically meaningful departure from $\mu_0$ |
| Small $d$, $p \le \alpha$ (large $n$) | Significant but practically negligible departure from $\mu_0$ |
| Large $d$, $p > \alpha$ (small $n$) | Non-significant due to low power; effect may be real but undetected |
8. Independent Samples t-Test
8.1 Purpose and Design
The independent samples t-test answers: "Do two independent groups have the same population mean?" It requires that the two groups are composed of entirely different participants with no systematic pairing or matching.
Common applications:
- Comparing test scores between a treatment group and a control group.
- Comparing anxiety levels between males and females.
- Comparing response times between two experimental conditions (between-subjects design).
- Comparing customer satisfaction between two product versions.
8.2 Full Procedure (Student's)
Given: Group 1 with $n_1$ observations ($\bar{x}_1$, $s_1$) and Group 2 with $n_2$ observations ($\bar{x}_2$, $s_2$). Test $H_0: \mu_1 = \mu_2$.
Step 1 — Compute summary statistics for each group: $\bar{x}_1$, $s_1$, $\bar{x}_2$, $s_2$.
Step 2 — Check variance homogeneity (Levene's test)
Run Levene's test. If $p < .05$, favour Welch's t-test (Section 10). Regardless, reporting both is best practice.
Step 3 — Compute pooled standard deviation

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

Step 4 — Compute the t-statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Step 5 — Degrees of freedom and p-value: $\nu = n_1 + n_2 - 2$; $p = 2[1 - F_\nu(|t|)]$.
Step 6 — 95% CI for the mean difference

$$(\bar{x}_1 - \bar{x}_2) \pm t_{.025,\,\nu} \cdot s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Step 7 — Effect sizes

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad g = d\left(1 - \frac{3}{4\nu - 1}\right)$$

Common Language Effect Size:

$$CL = \Phi\!\left(\frac{d}{\sqrt{2}}\right)$$
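The full Section 8.2 procedure, end to end, on illustrative data (pooled t, effect sizes, and the common-language effect size $CL = \Phi(d/\sqrt{2})$):

```python
import numpy as np
from scipy import stats

g1 = np.array([78.0, 82.0, 75.0, 80.0, 77.0, 85.0, 79.0, 81.0])
g2 = np.array([72.0, 74.0, 70.0, 76.0, 73.0, 71.0, 75.0, 69.0])
n1, n2 = len(g1), len(g2)

# Pooled SD, t-statistic, df, and two-tailed p-value.
sp = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1))
             / (n1 + n2 - 2))
t_stat = (g1.mean() - g2.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
p = 2 * stats.t.sf(abs(t_stat), df)

d = (g1.mean() - g2.mean()) / sp            # Cohen's d
g = d * (1 - 3 / (4 * df - 1))              # Hedges' g
cl = stats.norm.cdf(d / np.sqrt(2))         # P(random G1 obs > random G2 obs)

assert np.isclose(t_stat, stats.ttest_ind(g1, g2, equal_var=True).statistic)
```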
8.3 APA Reporting Template
"An independent samples t-test revealed [a significant / no significant] difference in [DV] between [Group 1] (M = [M₁], SD = [SD₁]) and [Group 2] (M = [M₂], SD = [SD₂]), t([df]) = [t], p = [p], d = [d] [95% CI: [LL], [UL]]. This represents a [small / medium / large] effect according to Cohen's (1988) benchmarks."
Example: "An independent samples Welch's t-test revealed a significant difference in anxiety scores between the CBT group and the waitlist control group, with a large standardised mean difference, 95% CI for d [0.87, 1.88]. This represents a large treatment effect."
9. Paired Samples t-Test
9.1 Purpose and Design
The paired samples t-test (also: dependent samples, matched pairs, or repeated measures t-test) answers: "Do two related measurements differ significantly from each other?"
When observations are paired:
- Pre-post designs: The same participant measured before and after an intervention.
- Matched pairs: Participants matched on key characteristics (age, sex, IQ) and randomised to different conditions.
- Within-subjects designs: Each participant experiences both conditions.
- Natural pairs: Twins, left hand vs. right hand, matched siblings.
Advantage over independent t-test: By comparing within-person differences, the paired design removes between-person variability from the error term, substantially increasing power when the within-person correlation is positive.
9.2 Full Procedure
Given: $n$ pairs of observations $(x_{i1}, x_{i2})$.
Step 1 — Compute difference scores: $d_i = x_{i1} - x_{i2}$
Step 2 — Compute mean and SD of differences

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}}$$

Step 3 — Compute the standard error of the mean difference

$$SE = \frac{s_d}{\sqrt{n}}$$

Step 4 — Compute the t-statistic

$$t = \frac{\bar{d}}{SE}$$

Step 5 — Degrees of freedom and p-value: $\nu = n - 1$; $p = 2[1 - F_\nu(|t|)]$.
Step 6 — 95% CI for the mean difference

$$\bar{d} \pm t_{.025,\,n-1} \cdot SE$$

Step 7 — Effect sizes
Cohen's $d_z$ (most commonly reported for paired designs):

$$d_z = \frac{\bar{d}}{s_d}$$

Cohen's $d_{rm}$ (repeated measures $d$, accounting for the pre-post correlation):

$$d_{rm} = d_z \sqrt{2(1 - r)}$$

Where $s_d = \sqrt{s_1^2 + s_2^2 - 2 r s_1 s_2}$ and $r$ is the correlation between the two measurements. Note that $d_{rm}$ is more comparable to $d$ from independent samples designs than $d_z$ is.
Cohen's $d_{av}$ (standardised by the average SD):

$$d_{av} = \frac{\bar{d}}{(s_1 + s_2)/2}$$
9.3 Comparing Paired and Independent t-Tests for the Same Data
When data are paired (pre-post), computing the incorrect independent t-test is a serious error. Assuming comparable group variances, the two t-statistics are related through the within-pair correlation $r$:

$$t_{paired} \approx \frac{t_{indep}}{\sqrt{1 - r}}$$

When $r > 0$ (typical for repeated measures): $\sqrt{1 - r} < 1$, so $t_{paired} > t_{indep}$ — the paired test is more powerful. When $r = 0$ (independence), the two tests are approximately equivalent. When $r < 0$ (rare), the independent test is more powerful.
⚠️ Never apply an independent samples t-test to paired data. Doing so ignores the within-pair correlation, produces an inflated standard error, and loses statistical power. Conversely, applying a paired t-test to genuinely independent data violates the independence assumption of the difference scores.
10. Welch's t-Test — Unequal Variances
10.1 Why Welch's is Preferred
Welch's t-test (1947) is a modification of Student's t-test that does not assume equal population variances. It is the recommended default for independent samples comparisons for three reasons:
- Robustness: It maintains correct Type I error rates regardless of whether variances are equal or unequal.
- Negligible power loss: When variances are truly equal, Welch's test loses very little power compared to Student's.
- Correct coverage: The CI from Welch's has the correct nominal coverage probability across all variance ratio conditions.
10.2 Full Procedure
Step 1 — Compute group summary statistics
($\bar{x}_1$, $s_1$, $n_1$) and ($\bar{x}_2$, $s_2$, $n_2$)
Step 2 — Compute separate variance estimates

$$\frac{s_1^2}{n_1} \quad \text{and} \quad \frac{s_2^2}{n_2}$$

Step 3 — Compute Welch's t-statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Step 4 — Compute Welch-Satterthwaite degrees of freedom

$$\nu_W = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Round $\nu_W$ down to the nearest integer for conservative inference.
Step 5 — p-value: $p = 2[1 - F_{\nu_W}(|t|)]$
Step 6 — 95% CI for mean difference

$$(\bar{x}_1 - \bar{x}_2) \pm t_{.025,\,\nu_W} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Step 7 — Effect size (Glass's $\Delta$ or $d_{av}$)
When variances are unequal, the appropriate standardiser for Cohen's $d$ is debated. Options include:
Pooled SD (ignores heterogeneity — caution):

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

Glass's $\Delta$ (control group SD as standardiser — recommended for treatment/control):

$$\Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_{control}}$$

Average SD (unbiased when neither group is the reference):

$$d_{av} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(s_1^2 + s_2^2)/2}}$$

💡 DataStatPro reports all three standardisers alongside Welch's t-test, with Glass's $\Delta$ highlighted when one group is a designated control, and $d_{av}$ highlighted when neither group is a natural reference.
10.3 Student's vs. Welch's: A Direct Comparison
| Property | Student's t-test | Welch's t-test |
|---|---|---|
| Assumes equal variances | ✅ Yes | ❌ No |
| df | $n_1 + n_2 - 2$ | Welch-Satterthwaite (always $\le$ Student's) |
| Type I error (equal $n$, unequal $\sigma^2$) | ⚠️ Slightly liberal | ✅ ≈ nominal |
| Type I error (unequal $n$, unequal $\sigma^2$) | ❌ Inflated | ✅ Nominal |
| Power (equal variances) | Marginally higher | ≈ equivalent |
| Recommendation | Avoid as default | ✅ Recommended default |
11. Non-Parametric Alternatives
11.1 When to Use Non-Parametric Tests
Non-parametric tests (also called distribution-free tests) are appropriate when:
- Data are ordinal (ranks, Likert items treated as ordinal).
- Data are severely non-normally distributed and sample sizes are small.
- There are extreme outliers that cannot be removed and that distort parametric statistics.
- The research question concerns ranks or medians rather than means.
Trade-off: Non-parametric tests are more robust to assumption violations but have lower statistical power than their parametric counterparts when parametric assumptions ARE met. When normality holds, using a non-parametric test discards information.
💡 Non-parametric does not mean "assumption-free." The Mann-Whitney U test assumes that the two distributions have the same shape (just shifted); violation of this shape assumption means U tests the combined null of equal location AND equal shape, not just equal medians.
11.2 Mann-Whitney U Test (Non-Parametric Independent Samples)
The Mann-Whitney U test (also Wilcoxon rank-sum test) is the non-parametric alternative to the independent samples t-test. It tests whether the distributions of two independent groups are identical (or, under the shape assumption, whether one group tends to have higher ranks than the other).
Procedure:
Step 1 — Rank all observations
Combine all $N = n_1 + n_2$ observations and assign ranks from 1 (smallest) to $N$ (largest). For tied values, assign the average of the tied ranks.
Step 2 — Compute the rank sums
$R_1$ = sum of ranks in Group 1; $R_2$ = sum of ranks in Group 2.
Check: $R_1 + R_2 = \frac{N(N+1)}{2}$
Step 3 — Compute U statistics

$$U_1 = R_1 - \frac{n_1(n_1 + 1)}{2}, \qquad U_2 = n_1 n_2 - U_1$$

Check: $U_1 + U_2 = n_1 n_2$
The test statistic is $U = \min(U_1, U_2)$.
Step 4 — Compute z-approximation (for larger samples, e.g., both $n > 20$)
Under $H_0$, with continuity correction:

$$z = \frac{\left|U - \frac{n_1 n_2}{2}\right| - 0.5}{\sqrt{\frac{n_1 n_2 (N + 1)}{12}}}$$

For ties, the variance requires a correction factor:

$$\sigma_U^2 = \frac{n_1 n_2}{12}\left[(N + 1) - \frac{\sum_j (t_j^3 - t_j)}{N(N - 1)}\right]$$

Where $t_j$ is the number of observations in the $j$-th tied group.
Step 5 — Effect size: Rank-biserial correlation

$$r_{rb} = 1 - \frac{2U}{n_1 n_2}$$

Or equivalently, the probability of superiority (common-language effect size):

$$f = \frac{U_1}{n_1 n_2}$$

Interpretation: $f = .75$ means that a randomly drawn Group 1 observation exceeds a randomly drawn Group 2 observation 75% of the time.
Cohen's benchmarks for $r_{rb}$ (same as for $r$):
| Label | $|r|$ |
|---|---|
| Small | .10 |
| Medium | .30 |
| Large | .50 |
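A sketch of the Section 11.2 procedure on illustrative data, checked against `scipy.stats.mannwhitneyu` (SciPy reports $U_1$, the U for the first sample):

```python
import numpy as np
from scipy import stats

g1 = np.array([3.1, 4.5, 2.8, 5.0, 4.2, 3.9])
g2 = np.array([2.0, 2.5, 3.0, 1.8, 2.2, 2.9, 2.4])
n1, n2 = len(g1), len(g2)

# Rank the combined sample (average ranks for ties), then the rank-sum route.
ranks = stats.rankdata(np.concatenate([g1, g2]))
r1 = ranks[:n1].sum()
u1 = r1 - n1 * (n1 + 1) / 2
u2 = n1 * n2 - u1

res = stats.mannwhitneyu(g1, g2, alternative="two-sided")
assert np.isclose(res.statistic, u1)

# Rank-biserial correlation from U = min(U1, U2).
r_rb = 1 - 2 * min(u1, u2) / (n1 * n2)
```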
11.3 Wilcoxon Signed-Rank Test (Non-Parametric Paired)
The Wilcoxon signed-rank test is the non-parametric alternative to the paired t-test. It tests whether the distribution of difference scores is symmetric about zero.
Procedure:
Step 1 — Compute and rank absolute differences
Compute $d_i = x_{i1} - x_{i2}$. Remove pairs where $d_i = 0$. Let $n'$ = the number of non-zero differences.
Rank $|d_i|$ from 1 (smallest) to $n'$ (largest), assigning average ranks for ties.
Step 2 — Sum positive and negative ranks
$W^+$ = sum of ranks of positive differences; $W^-$ = sum of ranks of negative differences.
Check: $W^+ + W^- = \frac{n'(n' + 1)}{2}$
The test statistic is $W = \min(W^+, W^-)$.
Step 3 — z-approximation (for larger $n'$, e.g., $n' > 20$)

$$z = \frac{W - \frac{n'(n' + 1)}{4}}{\sqrt{\frac{n'(n' + 1)(2n' + 1)}{24}}}$$

With tie correction, the variance becomes:

$$\sigma_W^2 = \frac{n'(n' + 1)(2n' + 1)}{24} - \frac{\sum_j (t_j^3 - t_j)}{48}$$

Step 4 — Effect size

$$r = \frac{|z|}{\sqrt{n'}}$$

Or, the matched-pairs rank-biserial correlation:

$$r_{rb} = \frac{W^+ - W^-}{W^+ + W^-}$$
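The signed-rank steps above, on illustrative pre/post data, checked against `scipy.stats.wilcoxon` (whose two-sided statistic is $\min(W^+, W^-)$):

```python
import numpy as np
from scipy import stats

pre  = np.array([14.0, 11.0, 16.0, 12.0, 15.0, 13.0, 17.0, 10.0])
post = np.array([12.0, 11.5, 13.0, 12.5, 13.5, 11.0, 15.0, 10.5])
d = pre - post
d = d[d != 0]                      # drop zero differences
n = len(d)

# Rank absolute differences (average ranks for ties), sum by sign.
ranks = stats.rankdata(np.abs(d))
w_plus = ranks[d > 0].sum()
w_minus = ranks[d < 0].sum()
assert w_plus + w_minus == n * (n + 1) / 2

res = stats.wilcoxon(pre, post)
assert res.statistic == min(w_plus, w_minus)

# Matched-pairs rank-biserial correlation.
r_rb = (w_plus - w_minus) / (w_plus + w_minus)
```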
11.4 One-Sample Wilcoxon Signed-Rank Test
The one-sample version tests $H_0$: the population median equals $m_0$. Compute $d_i = x_i - m_0$ and apply the Wilcoxon signed-rank procedure as above.
11.5 Comparing Parametric and Non-Parametric Tests
| Property | t-Test (parametric) | Mann-Whitney / Wilcoxon (non-parametric) |
|---|---|---|
| Tests | Mean difference | Distribution shift (median/rank) |
| Assumes normality | ✅ Yes | ❌ No |
| Sensitive to outliers | ✅ Yes | ❌ No (rank-based) |
| Power (when normal) | ✅ Higher | ✅ 95% efficiency of t-test |
| Power (when non-normal) | ❌ Lower | ✅ Can exceed t-test |
| Effect size | Cohen's , Hedges' | Rank-biserial |
| Handles ordinal data | ❌ Questionable | ✅ Appropriate |
| Interpretability | Mean difference | Probability of superiority |
⚠️ The Asymptotic Relative Efficiency (ARE) of the Mann-Whitney U test relative to the t-test is $3/\pi \approx 0.955$ for normal data — meaning you only need about 5% more observations with Mann-Whitney to achieve the same power as the t-test. This near-equality of efficiency makes Mann-Whitney a safe choice when normality is questionable.
11.6 Brunner-Munzel Test — Handling Unequal Shapes
When the two distributions have different shapes (not just different locations), the Mann-Whitney test actually tests a compound null of equal location AND equal spread. The Brunner-Munzel test (Brunner & Munzel, 2000) is a robust alternative that tests only the stochastic equality of the two groups without the shape assumption:
The test statistic compares the mean overall ranks of the two groups, standardised by a variance estimate built from each group's internal ranks (the ranks of observations within their own group), and is referred to a t distribution with Satterthwaite-type estimated degrees of freedom. DataStatPro reports the Brunner-Munzel test as an option under the Non-Parametric menu when the distribution shape assumption may be violated.
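SciPy (>= 1.2) ships an implementation of this test; the two samples below are illustrative:

```python
# Brunner-Munzel test via SciPy. It tests "stochastic equality",
# P(X < Y) + 0.5 * P(X = Y) = 0.5, without assuming the two distributions
# share a common shape. Data are illustrative.
from scipy.stats import brunnermunzel

group1 = [1.1, 2.3, 2.9, 3.2, 4.0, 4.4, 5.1]
group2 = [3.5, 4.1, 4.9, 5.5, 6.2, 7.0, 8.3]

stat, p = brunnermunzel(group1, group2)
print(f"BM statistic = {stat:.3f}, p = {p:.4f}")
```

By default the statistic is referred to a t distribution with estimated df, which behaves better than the normal approximation for small samples.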
12. Advanced Topics
12.1 Robust t-Tests — Trimmed Means and Winsorisation
Trimmed mean t-tests use the $\gamma$-trimmed mean (removing the top and bottom proportion $\gamma$ of observations) as the measure of central tendency. Yuen's (1974) $t$-test for trimmed means:
$t_y = \frac{\bar{x}_{t1} - \bar{x}_{t2}}{\sqrt{d_1 + d_2}}, \qquad d_j = \frac{SSD_{wj}}{h_j(h_j - 1)}$
With 20% trimming from each tail:
$h_j = n_j - 2\lfloor 0.2\, n_j \rfloor$
Where $SSD_{wj}$ are the Winsorised sums of squared deviations, $h_j$ is the effective sample size after trimming, and the degrees of freedom follow a Satterthwaite-type formula in $d_1$ and $d_2$.
Trimmed mean t-tests are substantially more powerful than rank-based tests for heavy-tailed symmetric distributions while maintaining good Type I error control.
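SciPy (>= 1.7) exposes Yuen's test through the `trim` argument of `ttest_ind`. The sketch below uses synthetic data with two gross outliers to show why trimming helps; the specific values are illustrative:

```python
# Yuen's trimmed-mean t-test via SciPy's ttest_ind(trim=...), compared with
# the untrimmed Welch test on data contaminated by outliers. Illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = np.concatenate([rng.normal(10, 2, 28), [40, 45]])   # two gross outliers
b = rng.normal(12, 2, 30)

classic = stats.ttest_ind(a, b, equal_var=False)            # Welch
yuen = stats.ttest_ind(a, b, equal_var=False, trim=0.2)     # 20% per tail

print(f"Welch: t = {classic.statistic:.2f}, p = {classic.pvalue:.4f}")
print(f"Yuen : t = {yuen.statistic:.2f}, p = {yuen.pvalue:.4f}")
```

The outliers drag the first group's mean toward the second group's, masking the underlying difference; the trimmed test recovers it.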
12.2 Bootstrap t-Tests
The bootstrap t-test (Efron & Tibshirani, 1994) makes no parametric distributional assumptions. It constructs the null distribution of the t-statistic empirically:
Procedure:
- Compute the observed t-statistic $t_{obs}$.
- Centre both samples on a common mean: $\tilde{x}_i = x_i - \bar{x}_1 + \bar{x}$ and $\tilde{y}_i = y_i - \bar{x}_2 + \bar{x}$, where $\bar{x}$ is the combined (grand) mean. This imposes $H_0$ while preserving each sample's shape.
- Draw $B$ bootstrap samples (typically $B \geq 9999$) from the centred samples with replacement and compute $t^*_b$ for each.
- The p-value is the proportion of bootstrap t-statistics with $|t^*_b| \geq |t_{obs}|$.
Percentile bootstrap CI for the mean difference:
Resample from the original data and compute the mean difference for each bootstrap sample. The 95% CI is the 2.5th and 97.5th percentiles of the bootstrap distribution.
The bootstrap is particularly valuable for small, non-normal samples where both parametric and asymptotic approximations may be poor.
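A minimal sketch of the centring procedure above, for two independent samples. The data and the helper name `bootstrap_t_test` are illustrative, and $B$ is kept modest for speed:

```python
# Bootstrap t-test sketch: centre both samples on the grand mean (imposing
# H0), resample with replacement, and compare bootstrap t-statistics to the
# observed one. Illustrative data; use B >= 9999 in practice.
import numpy as np

def bootstrap_t_test(x, y, B=4999, seed=0):
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)

    def welch_t(a, b):
        va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
        return (a.mean() - b.mean()) / np.sqrt(va + vb)

    t_obs = welch_t(x, y)
    grand = np.concatenate([x, y]).mean()
    xc, yc = x - x.mean() + grand, y - y.mean() + grand   # impose H0

    count = 0
    for _ in range(B):
        xs = rng.choice(xc, size=len(x), replace=True)
        ys = rng.choice(yc, size=len(y), replace=True)
        if abs(welch_t(xs, ys)) >= abs(t_obs):
            count += 1
    return t_obs, (count + 1) / (B + 1)    # add-one p-value correction

x = [4.1, 5.2, 3.8, 6.0, 4.9, 5.5, 4.4, 12.0]   # skewed by one large value
y = [6.8, 7.4, 6.1, 8.0, 7.2, 6.6, 7.9, 7.0]
t_obs, p = bootstrap_t_test(x, y)
print(f"t_obs = {t_obs:.2f}, bootstrap p = {p:.4f}")
```

The add-one correction avoids reporting a p-value of exactly zero when no bootstrap statistic exceeds the observed one.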
12.3 Bayesian t-Tests
The Bayesian t-test (Rouder et al., 2009; Jeffreys, 1961) quantifies evidence for both $H_0$ (no effect) and $H_1$ (an effect exists) using the Bayes Factor ($BF_{10}$):
$BF_{10} = \frac{p(\text{data} \mid H_1)}{p(\text{data} \mid H_0)}$
$BF_{10}$ represents how many times more likely the data are under $H_1$ than under $H_0$.
For the default Bayesian t-test (Rouder et al., 2009), the prior on the effect size $\delta$ under $H_1$ is a Cauchy distribution with scale $r$:
$\delta \sim \mathrm{Cauchy}(0, r)$
The default scale $r = \sqrt{2}/2 \approx 0.707$ represents a "medium" effect size prior.
Interpreting Bayes Factors:
| Evidence | $BF_{10}$ |
|---|---|
| Extreme for $H_1$ | $> 100$ |
| Very strong for $H_1$ | 30–100 |
| Strong for $H_1$ | 10–30 |
| Moderate for $H_1$ | 3–10 |
| Anecdotal for $H_1$ | 1–3 |
| No evidence (equal) | $= 1$ |
| Anecdotal for $H_0$ | 1/3–1 |
| Moderate for $H_0$ | 1/10–1/3 |
| Strong for $H_0$ | 1/30–1/10 |
| Very strong for $H_0$ | 1/100–1/30 |
Advantages of Bayesian t-tests:
- Quantify evidence for (null results can be informative).
- Avoid the optional stopping problem (p-values are invalid for sequential testing).
- Provide a single, direct statement about the relative evidence for each hypothesis.
Limitations: Sensitive to the choice of prior. Results should always be reported with the prior specification and checked for sensitivity to alternative priors.
12.4 Equivalence Testing with TOST
Standard null hypothesis testing only allows rejection of . When the goal is to demonstrate absence of a meaningful effect (e.g., showing that a generic drug is bioequivalent to a brand-name drug), the Two One-Sided Tests (TOST) procedure is required.
Specify equivalence bounds $(\Delta_L, \Delta_U)$ corresponding to a "trivially small" effect (e.g., symmetric bounds $\pm\Delta$ stated in raw units or Cohen's $d$ units).
TOST procedure:
Test $H_{01}: \mu_1 - \mu_2 \leq \Delta_L$ using an upper one-tailed test. Test $H_{02}: \mu_1 - \mu_2 \geq \Delta_U$ using a lower one-tailed test.
Equivalence is concluded at level $\alpha$ when both one-tailed tests reject their respective nulls — equivalently, when the $(1 - 2\alpha)$ CI (90% for $\alpha = .05$) for the mean difference falls entirely within $(\Delta_L, \Delta_U)$.
💡 Note that TOST uses a 90% CI (not 95%) when $\alpha = .05$, because each one-tailed test is conducted at the $\alpha = .05$ level. The 90% CI corresponds to two one-tailed 5% tests.
The smallest equivalence bound detectable with a given $n$ per group follows from a TOST power analysis; tighter bounds require larger samples.
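A minimal TOST sketch for two independent groups using Welch standard errors, with bounds on the raw mean-difference scale. The data, bounds, and the helper name `tost_welch` are illustrative, not DataStatPro's internal routine:

```python
# TOST sketch: two one-sided tests against the lower and upper equivalence
# bounds; equivalence is concluded when the larger of the two p-values is
# below alpha. Welch SE and df; illustrative data and bounds.
import numpy as np
from scipy import stats

def tost_welch(x, y, low, high):
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    se = np.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    diff = x.mean() - y.mean()
    p_lower = stats.t.sf((diff - low) / se, df)    # H0: diff <= low
    p_upper = stats.t.cdf((diff - high) / se, df)  # H0: diff >= high
    return diff, max(p_lower, p_upper)             # equivalent if < alpha

x = np.array([9.1, 10.4, 9.8, 10.9, 9.5, 10.2, 9.9, 10.6, 9.3, 10.1, 9.7, 10.3])
y = np.array([9.4, 10.1, 9.6, 10.8, 9.2, 10.5, 9.8, 10.2, 9.5, 10.0, 9.9, 10.4])
diff, p = tost_welch(x, y, low=-1.0, high=1.0)
print(f"diff = {diff:.3f}, TOST p = {p:.4f} (equivalent if p < .05)")
```

Reporting `max(p_lower, p_upper)` as the TOST p-value is the usual convention, since both one-sided tests must reject.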
12.5 Sequential t-Tests and Alpha Spending
Traditional t-tests do not allow for interim analyses — looking at the data multiple times inflates the Type I error rate. Sequential approaches address this:
Sequential Probability Ratio Test (SPRT): Compute a likelihood ratio $\Lambda_n$ after each observation. Stop when $\Lambda_n \leq B$ (accept $H_0$) or $\Lambda_n \geq A$ (reject $H_0$), where $A = \frac{1-\beta}{\alpha}$ and $B = \frac{\beta}{1-\alpha}$.
Alpha spending functions (O'Brien-Fleming, Pocock): Pre-specify how the total $\alpha$ budget is distributed across planned interim and final analyses.
Bayesian sequential testing: Use Bayes Factors to monitor evidence continuously. Unlike frequentist sequential testing, Bayesian sequential testing is valid at any stopping point without an alpha correction.
12.6 Multiple Comparisons and t-Tests
When multiple t-tests are conducted within the same study, the familywise error rate (FWER) — the probability of at least one Type I error — inflates:
$\mathrm{FWER} = 1 - (1 - \alpha)^m$
Where $m$ is the number of tests. For example, $m = 10$ independent tests at $\alpha = .05$ give $\mathrm{FWER} = 1 - 0.95^{10} \approx .40$.
Correction methods:
| Method | Adjusted $\alpha$ | Properties |
|---|---|---|
| Bonferroni | $\alpha / m$ | Conservative; controls FWER; simple |
| Holm | $\alpha / (m - k + 1)$ (step-down) | Sequential Bonferroni; less conservative than Bonferroni |
| Benjamini-Hochberg | $\frac{k}{m}\alpha$ threshold (step-up) | Controls FDR; less conservative; for exploratory work |
| Šidák | $1 - (1 - \alpha)^{1/m}$ | Slightly less conservative than Bonferroni |
⚠️ Corrections for multiple comparisons should be planned before data collection and applied to the entire family of tests. Post-hoc correction of selected tests is not valid. When tests are planned contrasts from a theoretically derived framework, no correction may be necessary — this should be justified explicitly.
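The Bonferroni and Holm adjustments from the table can be sketched by hand; the p-values below are illustrative:

```python
# Hand-rolled Bonferroni and Holm (step-down) adjusted p-values.
# Illustrative p-values; Holm enforces monotonicity via a running maximum.
import numpy as np

def holm(pvals):
    """Return Holm-adjusted p-values (step-down Bonferroni)."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p[idx])
        adj[idx] = min(1.0, running_max)
    return adj

pvals = [0.010, 0.020, 0.030, 0.400]
bonf = np.minimum(1.0, np.array(pvals) * len(pvals))
print("Bonferroni:", bonf)        # [0.04, 0.08, 0.12, 1.0]
print("Holm      :", holm(pvals))  # [0.04, 0.06, 0.06, 0.4]
```

Holm rejects everything Bonferroni rejects and sometimes more, which is why it is generally preferred when strict FWER control is needed.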
12.7 Effect Sizes for t-Tests: Choosing the Right Variant
Multiple variants of Cohen's $d$ exist for the paired design, and choosing the wrong one leads to incomparable effect sizes across studies:
| Variant | Formula | Denominator | Comparability |
|---|---|---|---|
| $d_z$ | $\bar{d} / s_d$ | SD of differences | Paired designs only |
| $d_{av}$ | $\bar{d} / \frac{s_1 + s_2}{2}$ | Average of group SDs | Comparable to between-subjects |
| $d_{rm}$ | $d_z \sqrt{2(1 - r)}$ | Adjusted for correlation | Corrects for non-independence |
| $\Delta$ (Glass) | $\bar{d} / s_{pre}$ | Pre-test (baseline) SD | Treatment-control; Glass's $\Delta$ |
Which to use:
- Report $d_z$ when comparing paired designs to other paired designs.
- Report $d_{av}$ (or $d_{rm}$) when comparing a paired design to a between-subjects design.
- Always specify which variant of $d$ was used.
12.8 Reporting t-Tests According to APA 7th Edition
The APA Publication Manual (7th ed.) requires:
- Test statistic: $t(df)$ value
- p-value: report the exact value (use $p < .001$ only when below .001)
- Effect size with 95% CI: Cohen's $d$ [LB, UB] or Hedges' $g$
- Group means and standard deviations
- Whether Welch's correction was applied (for independent t-tests)
- Whether the CI is for the mean difference or the standardised effect size
Full APA template:
"[Group 1] ($n_1$ = , $M$ = , $SD$ = ) and [Group 2] ($n_2$ = , $M$ = , $SD$ = ) were compared using [Student's / Welch's] independent samples t-test. The test revealed [a significant / no significant] mean difference, $t(df)$ = , $p$ = , $d$ = [95% CI: LB, UB], indicating a [small / medium / large] effect."
13. Worked Examples
Example 1: One-Sample t-Test — Comparing Response Time to a Normative Standard
A cognitive neuroscience researcher measures simple reaction times (in ms) for adults diagnosed with ADHD. The published population norm for neurotypical adults is ms. The researcher tests whether the ADHD sample has a significantly different mean reaction time.
Data summary: sample size $n$, sample mean $\bar{x}$, and standard deviation $s$ for the ADHD group.
Step 1 — Standard error: $SE = \frac{s}{\sqrt{n}}$
Step 2 — t-statistic: $t = \frac{\bar{x} - \mu_0}{SE}$
Step 3 — Degrees of freedom: $df = n - 1$
Step 4 — p-value: two-tailed, $p = 2\,P(T_{df} \geq |t|)$
Step 5 — 95% CI for $\mu$: $\bar{x} \pm t_{0.975,\,df} \times SE$
Step 6 — Cohen's $d$: $d = \frac{\bar{x} - \mu_0}{s}$
Hedges' $g$: $g = d\left(1 - \frac{3}{4\,df - 1}\right)$
95% CI for $d$ (approximate): $d \pm 1.96\sqrt{\frac{1}{n} + \frac{d^2}{2n}}$
Common Language Effect Size: $CL = \Phi(d)$, the probability that a randomly drawn individual from the ADHD population exceeds the norm.
Summary:
| Statistic | Value | Interpretation |
|---|---|---|
| $t$ (two-tailed) | | Significant at $\alpha = .05$ |
| Mean difference | 31.4 ms | ADHD group is 31.4 ms slower |
| 95% CI (ms) | [13.7, 49.1] | Excludes 0; significant |
| Cohen's $d$ | | Medium-large effect |
| 95% CI for $d$ | [0.29, 1.18] | From small to very large — wide CI |
| Hedges' $g$ | | Minimal bias correction |
| CL | | ADHD group exceeds the norm in the majority of cases |
APA write-up: "Adults with ADHD showed significantly longer mean reaction times than the neurotypical normative mean of 250 ms, with Cohen's $d$ 95% CI [0.29, 1.18]. This represents a medium-to-large deviation from the normative standard. The 95% CI for the mean difference was [13.7, 49.1] ms."
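The one-sample workflow can be run end-to-end with SciPy. The raw reaction times in this example are not reproduced above, so the values below are hypothetical stand-ins; only the norm of 250 ms comes from the example:

```python
# One-sample t-test workflow with SciPy: t, p, Cohen's d, Hedges' g, and the
# 95% CI for the mean. Reaction-time values are hypothetical stand-ins.
import numpy as np
from scipy import stats

mu0 = 250.0
rt = np.array([272, 301, 254, 289, 310, 266, 247, 295, 278, 263,
               291, 305, 258, 284, 270, 299, 261, 288, 276, 293])

res = stats.ttest_1samp(rt, mu0)
n, mean, sd = len(rt), rt.mean(), rt.std(ddof=1)
d = (mean - mu0) / sd                     # Cohen's d (one-sample)
g = d * (1 - 3 / (4 * (n - 1) - 1))       # Hedges' correction
ci = stats.t.interval(0.95, n - 1, loc=mean, scale=sd / np.sqrt(n))

print(f"t({n - 1}) = {res.statistic:.2f}, p = {res.pvalue:.4f}")
print(f"d = {d:.2f}, g = {g:.2f}, 95% CI = [{ci[0]:.1f}, {ci[1]:.1f}]")
```

Note that `ttest_1samp` returns a two-tailed p-value by default, matching Step 4.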
Example 2: Welch's Independent Samples t-Test — Sleep Duration by Shift Type
A workplace health researcher compares average nightly sleep duration (hours) between day-shift () and night-shift () nurses.
Summary statistics:
| Group | $n$ | Mean (hrs) | SD |
|---|---|---|---|
| Day shift | 40 | 7.21 | 1.02 |
| Night shift | 35 | 5.84 | 1.73 |
Levene's test is significant ($p < .05$), indicating heterogeneity of variances. → Use Welch's t-test.
Step 1 — Variance estimates:
$\frac{s_1^2}{n_1} = \frac{1.02^2}{40} = 0.0260, \qquad \frac{s_2^2}{n_2} = \frac{1.73^2}{35} = 0.0855$
Step 2 — Welch's t-statistic:
$t = \frac{7.21 - 5.84}{\sqrt{0.0260 + 0.0855}} = \frac{1.37}{0.334} = 4.10$
Step 3 — Welch-Satterthwaite df:
$\nu = \frac{(0.0260 + 0.0855)^2}{\frac{0.0260^2}{39} + \frac{0.0855^2}{34}} \approx 53.5$
Rounded down: $df = 53$.
Step 4 — p-value: $t(53) = 4.10$, two-tailed $p < .001$.
Step 5 — 95% CI: $1.37 \pm 2.006 \times 0.334 = [0.70, 2.04]$ hrs.
Step 6 — Effect sizes:
Cohen's $d$ (pooled): $s_p = \sqrt{\frac{39(1.02^2) + 34(1.73^2)}{73}} = 1.40$, so $d = \frac{1.37}{1.40} = 0.98$.
Glass's $\Delta$ (using the night-shift SD as the standardiser — the "comparison" group): $\Delta = \frac{1.37}{1.73} = 0.79$.
Average-SD variant: $d_{av} = \frac{1.37}{(1.02 + 1.73)/2} = \frac{1.37}{1.375} = 1.00$.
Summary:
| Statistic | Value |
|---|---|
| Levene's test | Significant (unequal variances confirmed) |
| $t(53)$ (two-tailed) | 4.10, $p < .001$ |
| Mean difference | 1.37 hrs (day > night) |
| 95% CI (hrs) | [0.70, 2.04] |
| Cohen's $d$ | 0.98 (Large) |
| Glass's $\Delta$ | 0.79 (Large) |
| $d_{av}$ | 1.00 (Large) |
APA write-up: "Day-shift nurses ($M = 7.21$ hrs, $SD = 1.02$) slept significantly longer than night-shift nurses ($M = 5.84$ hrs, $SD = 1.73$). Due to significant variance heterogeneity (Levene's test, $p < .05$), Welch's t-test was applied. Results indicated a significant difference, $t(53) = 4.10$, $p < .001$, $d = 0.98$ [95% CI: 0.54, 1.43], representing a large effect. Night-shift nurses slept on average 1.37 hours less per night [95% CI: 0.70, 2.04]."
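Because this example reports full summary statistics, it can be verified without the raw data using SciPy's `ttest_ind_from_stats`:

```python
# Welch's t-test from summary statistics alone (Example 2's means, SDs, and
# sample sizes), via SciPy's ttest_ind_from_stats.
from scipy.stats import ttest_ind_from_stats

res = ttest_ind_from_stats(mean1=7.21, std1=1.02, nobs1=40,
                           mean2=5.84, std2=1.73, nobs2=35,
                           equal_var=False)   # Welch's test
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.5f}")
```

This reproduces the hand calculation ($t \approx 4.10$, $p < .001$) and is a convenient sanity check when only published summaries are available.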
Example 3: Paired Samples t-Test — Pre-Post Mindfulness Intervention
A clinical psychologist tests whether an 8-week mindfulness-based stress reduction (MBSR) programme reduces perceived stress. Perceived Stress Scale (PSS-10; range 0–40) scores are recorded before and after the programme for $n = 20$ participants.
Summary statistics:
| Measurement | Mean | SD | $r$ (pre-post) |
|---|---|---|---|
| Pre-MBSR | 24.7 | 5.8 | .72 |
| Post-MBSR | 18.3 | 5.1 | |
| Differences ($d_i$) | 6.4 | 4.1 | |
Step 1 — t-statistic:
$t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{6.4}{4.1 / \sqrt{20}} = \frac{6.4}{0.917} = 6.98$
Step 2 — Degrees of freedom and p-value: $df = 19$, two-tailed $p < .001$.
Step 3 — 95% CI for mean difference:
$6.4 \pm 2.093 \times 0.917 = [4.48, 8.32]$
Step 4 — Effect sizes:
$d_z = \frac{6.4}{4.1} = 1.56, \qquad d_{av} = \frac{6.4}{(5.8 + 5.1)/2} = 1.17, \qquad d_{rm} = d_z\sqrt{2(1 - r)} = 1.56\sqrt{2(1 - .72)} \approx 1.16$
Note the difference:
- $d_z$: appropriate for within-study power analysis and comparison to other paired studies.
- $d_{av}$: better for comparing to between-subjects studies using the same scale.
- $d_{rm}$: corrects for the dependency structure; most generalisable.
Comparison: what if the independent t-test had been (incorrectly) applied? With a pooled SD of about 5.46, the standard error rises to roughly 1.73, giving $t(38) \approx 3.71$.
The paired test ($t(19) = 6.98$) is substantially more powerful than the incorrect independent test ($t(38) \approx 3.71$) — reflecting the benefit of removing between-person variance through pairing.
Summary:
| Statistic | Value |
|---|---|
| $t(19)$ (two-tailed) | 6.98, $p < .001$ |
| Mean reduction | 6.4 PSS points |
| 95% CI for difference | [4.48, 8.32] |
| $d_z$ | 1.56 |
| $t$ (if independent, incorrect) | 3.71 |
APA write-up: "Perceived stress scores decreased significantly from pre-MBSR ($M = 24.7$, $SD = 5.8$) to post-MBSR ($M = 18.3$, $SD = 5.1$), $t(19) = 6.98$, $p < .001$, $d_z = 1.56$ [95% CI: 0.99, 2.11]. The mean reduction of 6.4 PSS points (95% CI: [4.48, 8.32]) represents a large within-person effect."
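The paired-test statistics can be reconstructed from the summary values alone (mean difference 6.4, SD of differences 4.1; $n = 20$ pairs, the sample size implied by the reported 95% CI):

```python
# Paired t-test from summary statistics: t = dbar / (sd / sqrt(n)), plus the
# 95% CI for the mean difference and d_z.
from scipy import stats

n, dbar, sd = 20, 6.4, 4.1
se = sd / n ** 0.5
t = dbar / se
ci = stats.t.interval(0.95, n - 1, loc=dbar, scale=se)
dz = dbar / sd

print(f"t({n - 1}) = {t:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}], d_z = {dz:.2f}")
```

This reproduces $t(19) = 6.98$, the CI $[4.48, 8.32]$, and $d_z = 1.56$ without the raw scores.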
Example 4: Mann-Whitney U Test — Non-Parametric Independent Comparison
A researcher compares pain ratings (0–10 scale, ordinal) between two physiotherapy protocols. Shapiro-Wilk tests indicate non-normality in both groups. Group 1 (Protocol A, $n_1 = 8$): ratings 2, 3, 3, 4, 5, 5, 6, 7. Group 2 (Protocol B, $n_2 = 7$): ratings 6, 6, 7, 7, 8, 8, 9.
Step 1 — Rank all observations:
Combined sorted values: 2, 3, 3, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 9
| Value | Freq | Ranks | Avg Rank | Group |
|---|---|---|---|---|
| 2 | 1 | 1 | 1.0 | A |
| 3 | 2 | 2–3 | 2.5 | A, A |
| 4 | 1 | 4 | 4.0 | A |
| 5 | 2 | 5–6 | 5.5 | A, A |
| 6 | 3 | 7–9 | 8.0 | A, B, B |
| 7 | 3 | 10–12 | 11.0 | A, B, B |
| 8 | 2 | 13–14 | 13.5 | B, B |
| 9 | 1 | 15 | 15.0 | B |
Step 2 — Rank sums:
$R_1 = 1.0 + 2.5 + 2.5 + 4.0 + 5.5 + 5.5 + 8.0 + 11.0 = 40, \qquad R_2 = 80$
Check: $40 + 80 = 120 = \frac{15 \times 16}{2}$ ✅
Step 3 — U statistics:
$U_1 = 40 - \frac{8 \times 9}{2} = 4, \qquad U_2 = 80 - \frac{7 \times 8}{2} = 52$
Check: $4 + 52 = 56 = 8 \times 7$ ✅
Test statistic: $U = \min(4, 52) = 4$
Step 4 — z-approximation:
$\mu_U = \frac{8 \times 7}{2} = 28, \qquad \sigma_U = \sqrt{\frac{8 \times 7 \times 16}{12}} = 8.64$
$z = \frac{4 + 0.5 - 28}{8.64} = -2.72$ (without tie correction)
Step 5 — Rank-biserial correlation:
$r_{rb} = 1 - \frac{2 \times 4}{56} = 0.857$
Or: $r_{rb} = 2f - 1 = 2(0.929) - 1 = 0.857$ (from $f = U_2 / (n_1 n_2) = 52/56 = 0.929$)
Interpretation: Protocol B produces substantially higher pain ratings — $r_{rb} = 0.86$ indicates a large effect (Protocol A ranks lower/better with probability 0.93).
Summary:
| Statistic | Value |
|---|---|
| $U$ | 4 |
| $z$ (approximate) | $-2.72$ |
| $p$ (two-tailed) | $\approx .007$ |
| $r_{rb}$ (Large) | 0.86 |
| Median Protocol A | 4.5 |
| Median Protocol B | 7 |
APA write-up: "Due to non-normal distributions, a Mann-Whitney U test was conducted. Protocol A ($Mdn = 4.5$) produced significantly lower pain ratings than Protocol B ($Mdn = 7$), $U = 4$, $z = -2.72$, $p = .007$, $r_{rb} = .86$, indicating a large effect."
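Example 4 can be re-run in SciPy, reading the raw ratings off the rank table above:

```python
# Re-running Example 4: Protocol A (n = 8) vs. Protocol B (n = 7) ratings
# taken from the rank table in the worked example.
from scipy.stats import mannwhitneyu

a = [2, 3, 3, 4, 5, 5, 6, 7]
b = [6, 6, 7, 7, 8, 8, 9]

res = mannwhitneyu(a, b, alternative="two-sided")
u1 = res.statistic                     # U for Protocol A
u = min(u1, len(a) * len(b) - u1)      # U = min(U1, U2)
r_rb = 1 - 2 * u / (len(a) * len(b))   # rank-biserial correlation

print(f"U = {u}, p = {res.pvalue:.4f}, r_rb = {r_rb:.3f}")
```

SciPy's tie-corrected asymptotic p-value may differ slightly from the hand calculation, which omitted the tie correction.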
14. Common Mistakes and How to Avoid Them
Mistake 1: Using the Independent t-Test for Paired Data
Problem: Treating pre-post measurements or matched pairs as independent samples. This ignores the within-person correlation, inflates the error term, and substantially reduces power.
Solution: Identify the study design before analysis. If each participant contributes two scores (repeated measures, matched pairs), use the paired t-test. Check whether the data file has one row per participant (paired) vs. one row per observation (independent).
Mistake 2: Defaulting to Student's t-Test Without Checking Variance Equality
Problem: SPSS, Excel, and older textbooks default to Student's t-test. When groups differ in sample size AND variance, Student's t-test can have a severely inflated Type I error rate.
Solution: Always use Welch's t-test as the default for independent samples. The cost in power when variances are equal is negligible, whereas the benefit when variances are unequal is substantial. Report Welch's results; note if Levene's test is significant.
Mistake 3: Interpreting a Non-Significant p-Value as Evidence of No Effect
Problem: Concluding that $p > .05$ means $\mu_1 = \mu_2$. A non-significant result means the data are insufficient to reject $H_0$ — it does NOT mean the null hypothesis is true.
Solution: Report the 95% CI for the mean difference alongside the p-value. A wide CI that spans from negative to positive values reflects uncertainty, not evidence of zero effect. To positively establish absence of a meaningful effect, use equivalence testing (TOST) with prespecified bounds.
Mistake 4: Reporting Only p-Values Without Effect Sizes
Problem: Reporting $t$ and $p$ without Cohen's $d$ conveys nothing about the magnitude of the effect. With a very large sample, a significant p-value might correspond to a trivially small $d$; with a small sample, the same p-value might correspond to a large $d$.
Solution: Always report Cohen's (or Hedges' ) and its 95% CI alongside every t-test. DataStatPro computes these automatically.
Mistake 5: Switching to One-Tailed Tests After Seeing the Data
Problem: Observing that Group 1 > Group 2, then switching to a one-tailed test to achieve $p < .05$ when the two-tailed result was $p > .05$. This is p-hacking and inflates the Type I error rate to approximately double the nominal level.
Solution: Directional hypotheses must be pre-registered before data collection and must be based on strong theoretical or prior empirical grounds. If in doubt, use a two-tailed test.
Mistake 6: Applying t-Tests to Likert Items Without Justification
Problem: Treating 5-point Likert items as interval-scale data and applying t-tests. Strictly, Likert items are ordinal — the intervals between adjacent scale points are not necessarily equal.
Solution: For a single Likert item, use the Mann-Whitney U (independent) or Wilcoxon signed-rank (paired) test. For a Likert scale (composite of multiple items), the summed score is typically treated as approximately interval, and t-tests are generally considered acceptable. Clearly state this assumption.
Mistake 7: Ignoring Outliers Before Running the t-Test
Problem: The t-test uses means, which are highly sensitive to outliers, especially in small samples. A single extreme value can drastically alter the t-statistic and p-value.
Solution: Always inspect data with boxplots and -scores before running a t-test. Investigate outliers (data entry error? valid extreme value?). Report analyses with and without outliers. Consider using trimmed mean t-tests or the Mann-Whitney test when outliers cannot be removed.
Mistake 8: Confusing Statistical Power with the Probability the Null is False
Problem: Interpreting power as meaning "there is an 80% probability the null hypothesis is false, given I found $p < .05$." Power is a property of the study design computed before data collection — it is the probability of getting a significant result IF a true effect of the assumed size exists.
Solution: Understand that power is computed under an assumed $H_1$ and is not a posterior probability about $H_0$. The probability that a significant result reflects a true effect (positive predictive value) also depends on the prior probability of $H_1$ being true.
Mistake 9: Using the Wrong $d$ Variant and Comparing Across Designs
Problem: Reporting $d_z$ from a paired design and comparing it to $d$ from an independent samples study as if they were the same quantity. $d_z$ depends on the pre-post correlation and is typically larger than $d$ for the same mean difference.
Solution: When comparing effect sizes across designs, convert all effect sizes to a common metric. Use $d_{av}$ (or $d_{rm}$) for paired designs when comparing to between-subjects studies. Always specify which variant of $d$ was computed.
Mistake 10: Running Multiple t-Tests Instead of ANOVA
Problem: Comparing three groups (A, B, C) with three separate t-tests (A vs. B, A vs. C, B vs. C) inflates the familywise error rate to $1 - 0.95^3 \approx .14$ instead of the nominal $.05$.
Solution: When comparing more than two groups, use one-way ANOVA (or Kruskal-Wallis for non-parametric data) followed by appropriate post-hoc tests (Tukey HSD, Bonferroni, Games-Howell for unequal variances). Reserve t-tests for pre-planned pairwise contrasts with appropriate alpha correction.
15. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| t-statistic is extremely large | Very large $n$ or data entry error | Check for duplicate entries and errors; report effect size — even a large $t$ may indicate a small $d$ |
| $t$ is undefined (NaN) or exactly 0 | Zero variance; identical group means | Check that both groups have variance; verify data coding |
| Welch's df is very small | One group has very small $n$ or near-zero variance | Check data; use an exact permutation test for very small $n$ |
| Student's and Welch's give very different results | Unequal variances with unequal $n$ | Levene's test is likely significant; use Welch's result |
| Paired t-test gives larger $t$ than expected | High pre-post correlation (good — this is the efficiency gain) | Report as normal; note the within-person correlation |
| Shapiro-Wilk is significant but $n$ is large | Power of the normality test increases with $n$; minor deviations become significant | With $n \geq 30$ per group, the CLT usually ensures a valid t-test; inspect Q-Q plots and skewness |
| Mann-Whitney gives a different conclusion than the t-test | Distribution is non-normal and sample is small | For non-normal data, trust Mann-Whitney; report both with a note on assumption violation |
| Effect size CI is very wide | Small sample size | Report the wide CI — it is informative about low precision; conduct a priori power analysis for the next study |
| Cohen's $d_z$ is much larger than $d_{av}$ | High pre-post correlation ($r$ is large) | Both are correct; specify which was computed and when each is appropriate |
| Equivalence test fails despite a small observed difference | Equivalence bounds are too tight for the sample size | Either increase $n$ or widen the equivalence bounds with justification |
| Negative p-value or $p > 1$ reported | Software error or data corruption | Re-check data file; rerun analysis in DataStatPro |
| One-tailed $p$ is larger than two-tailed $p$ | Effect is in the direction opposite to the prediction | The one-tailed test is not significant in the predicted direction; the effect is in the wrong direction |
| Bootstrap CI does not include 0 but t-test $p > .05$ | Small sample; bootstrap and t-test diverge for highly non-normal data | Investigate the distribution; report both with a rationale for the preferred method |
| $r$ computed from $t$ and $df$ seems too small | Correct — $r$ from $t$ is the point-biserial correlation, not Cohen's $d$ | Use $d = \frac{2r}{\sqrt{1 - r^2}}$ to convert to Cohen's $d$ |
| Bayes Factor is not decisive ($BF_{10} \approx 1$) | Data provide no evidence in either direction; study is underpowered | Collect more data; report the BF as evidence of insensitivity; avoid interpreting it as supporting either hypothesis |
16. Quick Reference Cheat Sheet
Core t-Test Formulas
| Formula | Description |
|---|---|
| $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ | One-sample t-statistic |
| $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$ | Independent samples (Student's) |
| $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ | Pooled standard deviation |
| $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ | Welch's t-statistic |
| $\nu = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$ | Welch-Satterthwaite df |
| $t = \frac{\bar{d}}{s_d / \sqrt{n}}$ | Paired t-statistic |
| $SE = \frac{s}{\sqrt{n}}$ | Standard error of the mean |
| $\bar{x} \pm t_{1-\alpha/2,\,df} \times SE$ | Confidence interval for mean |
| $p = 2\,P(T_{df} \geq \lvert t \rvert)$ | Two-tailed p-value |
Effect Size Formulas for t-Tests
| Formula | Description |
|---|---|
| $d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$ | Cohen's $d$ (independent) |
| $d_z = \frac{\bar{d}}{s_d}$ | Cohen's $d$ (paired) |
| $d_{av} = \frac{\bar{d}}{(s_1 + s_2)/2}$ | Cohen's $d$ (paired, comparable to between) |
| $d_{rm} = d_z \sqrt{2(1 - r)}$ | $d$ (corrected for dependency) |
| $\Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_{control}}$ | Glass's $\Delta$ |
| $g = d\left(1 - \frac{3}{4\,df - 1}\right)$ | Hedges' $g$ (bias-corrected) |
| $r = \sqrt{\frac{t^2}{t^2 + df}}$ | Point-biserial $r$ from $t$ |
| $d = t\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ | $d$ from independent $t$ |
| $d_z = \frac{t}{\sqrt{n}}$ | $d$ from paired/one-sample $t$ |
| $d = \frac{2t}{\sqrt{df}}$ | Convert $t$ to $d$ (equal groups) |
| $CL = \Phi\!\left(\frac{d}{\sqrt{2}}\right)$ | Common Language Effect Size |
Non-Parametric Formulas
| Formula | Description |
|---|---|
| $U_i = R_i - \frac{n_i(n_i + 1)}{2};\ U = \min(U_1, U_2)$ | Mann-Whitney $U$ statistic |
| $z = \frac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (N + 1)/12}}$ | Mann-Whitney $z$-approximation |
| $r_{rb} = 1 - \frac{2U}{n_1 n_2}$ | Rank-biserial correlation (Mann-Whitney) |
| $W_+ = \sum_{d_i > 0} \operatorname{rank}(\lvert d_i \rvert)$ | Wilcoxon positive rank sum |
| $z = \frac{W - n'(n'+1)/4}{\sqrt{n'(n'+1)(2n'+1)/24}}$ | Wilcoxon $z$-approximation |
| $r = \frac{\lvert z \rvert}{\sqrt{n'}}$ | Effect size for Wilcoxon test |
Test Selection Guide
| Design | Normal? | Equal Variances? | Recommended Test |
|---|---|---|---|
| 1 group vs. known value | ✅ | — | One-sample t-test |
| 1 group vs. known value | ❌ | — | Wilcoxon signed-rank |
| 2 independent groups | ✅ | Equal or unknown | Welch's t-test |
| 2 independent groups | ✅ | Known unequal | Welch's t-test |
| 2 independent groups | ❌ | — | Mann-Whitney U |
| 2 related groups | ✅ (differences) | — | Paired t-test |
| 2 related groups | ❌ (differences) | — | Wilcoxon signed-rank |
| groups | ✅ | Equal | One-way ANOVA |
| groups | ✅ | Unequal | Welch's ANOVA |
| groups | ❌ | — | Kruskal-Wallis |
Cohen's Benchmarks for t-Test Effect Sizes
| Label | $d$ | $n$ needed per group |
|---|---|---|
| Small | 0.2 | 394 |
| Medium | 0.5 | 64 |
| Large | 0.8 | 26 |
| Very large | 1.2 | 12 |
| Huge | 2.0 | 5 |
All power figures assume $\alpha = .05$, two-tailed, 80% power, equal group sizes.
Degrees of Freedom Reference
| Test | df |
|---|---|
| One-sample t-test | $n - 1$ |
| Independent t-test (Student's) | $n_1 + n_2 - 2$ |
| Independent t-test (Welch's) | Welch-Satterthwaite (always $\leq n_1 + n_2 - 2$) |
| Paired t-test | $n - 1$ (where $n$ = number of pairs) |
Assumption Checks Reference
| Assumption | Test | Software Function | Action if Violated |
|---|---|---|---|
| Normality | Shapiro-Wilk | shapiro.test() | Mann-Whitney / Wilcoxon |
| Normality | Q-Q plot | qqnorm() | Assess visually |
| Equal variances | Levene's | leveneTest() | Welch's t-test |
| Equal variances | Brown-Forsythe | bf.test() | Welch's t-test |
| Outliers | $z$-score, boxplot | boxplot() | Investigate; trimmed mean |
| Independence | Design review | — | Multilevel model |
Confidence Interval Interpretation
| CI Property | Interpretation |
|---|---|
| Entirely above zero | Effect is significantly positive at the chosen $\alpha$ |
| Entirely below zero | Effect is significantly negative at the chosen $\alpha$ |
| Contains zero | Effect is not statistically significant |
| Narrow CI | Precise estimate (large $n$) |
| Wide CI | Imprecise estimate (small $n$) — interpret point estimate cautiously |
| 90% CI within equivalence bounds | Equivalence demonstrated (TOST) |
APA 7th Edition Reporting Templates
One-sample: $t(df)$ = [value], $p$ = [value], $d$ = [value], 95% CI [LB, UB].
Independent samples (Welch's): $t(df)$ = [value], $p$ = [value], $d$ = [value], 95% CI [LB, UB].
Paired samples: $t(df)$ = [value], $p$ = [value], $d_z$ = [value], 95% CI [LB, UB].
Mann-Whitney: $U$ = [value], $z$ = [value], $p$ = [value], $r_{rb}$ = [value].
Wilcoxon signed-rank: $W$ = [value], $z$ = [value], $p$ = [value], $r$ = [value].
Required Sample Size Quick Reference
Two-sided $\alpha = .05$, two independent equal groups:
| Power | |||||
|---|---|---|---|---|---|
| 0.70 | 310 | 102 | 50 | 20 | 14 |
| 0.80 | 394 | 130 | 64 | 26 | 17 |
| 0.90 | 527 | 174 | 85 | 34 | 22 |
| 0.95 | 651 | 215 | 105 | 42 | 27 |
All values are per group. Double for total .
t-Test Reporting Checklist
| Item | Required |
|---|---|
| t-statistic with sign | ✅ Always |
| Degrees of freedom | ✅ Always |
| Exact p-value (or $p < .001$) | ✅ Always |
| Mean and SD for each group | ✅ Always |
| 95% CI for mean difference | ✅ Always |
| Cohen's or Hedges' | ✅ Always |
| 95% CI for effect size | ✅ Always |
| Sample sizes for each group | ✅ Always |
| Whether Student's or Welch's was used | ✅ For independent t-tests |
| Levene's test result | ✅ For independent t-tests |
| Normality check result | ✅ When samples are small per group |
| Which $d$ variant was used ($d_z$, $d_{av}$, etc.) | ✅ For paired designs |
| Power or sensitivity analysis | ✅ For null or inconclusive results |
| Equivalence test if claiming null | ✅ Always for null results |
| Pre-registration of one-tailed hypotheses | ✅ If one-tailed test used |
This tutorial provides a comprehensive foundation for understanding, conducting, and reporting t-tests and their alternatives within the DataStatPro application. For further reading, consult Gravetter & Wallnau's "Statistics for the Behavioral Sciences" (10th ed.), Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018), Wilcox's "Introduction to Robust Estimation and Hypothesis Testing" (4th ed., 2017), and Lakens's "Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses" (Social Psychological and Personality Science, 2017). For the recommendation to default to Welch's t-test, see Delacre, Lakens, and Leys (2017), "Why Psychologists Should by Default Use Welch's t-test Instead of Student's t-test" (International Review of Social Psychology). For feature requests or support, contact the DataStatPro team.