One-Sample t-Test: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of single-group inference all the way through advanced interpretation, reporting, assumption checking, and practical usage within the DataStatPro application. Whether you are encountering the one-sample t-test for the first time or deepening your understanding of comparing a sample to a known standard, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is a One-Sample t-Test?
- The Mathematics Behind the One-Sample t-Test
- Assumptions of the One-Sample t-Test
- Variants of the One-Sample t-Test
- Using the One-Sample t-Test Calculator Component
- Step-by-Step Procedure
- Interpreting the Output
- Effect Sizes for the One-Sample t-Test
- Confidence Intervals
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into the one-sample t-test, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.
1.1 The Concept of Statistical Inference
Statistical inference is the process of drawing conclusions about a population from a sample. In the one-sample t-test context:
- Population parameter of interest: $\mu$ (the true population mean).
- Sample statistic: $\bar{x}$ (the sample mean, our best estimate of $\mu$).
- Question asked: "Is it plausible that the true population mean equals some specific, theoretically or practically meaningful value $\mu_0$?"
1.2 The Null and Alternative Hypotheses
Every t-test operates within the hypothesis testing framework:
- $H_0$ (null hypothesis): There is no difference between the population mean and the hypothesised value: $\mu = \mu_0$.
- $H_1$ (alternative hypothesis): The population mean differs from the hypothesised value. The alternative can be:
  - Two-tailed: $\mu \neq \mu_0$ (no directional prediction)
  - Upper one-tailed: $\mu > \mu_0$ (predict the mean is higher)
  - Lower one-tailed: $\mu < \mu_0$ (predict the mean is lower)
1.3 The Standard Error of the Mean
When we draw a sample of size $n$ from a population with standard deviation $\sigma$, the sample mean $\bar{x}$ varies from sample to sample. The standard error of the mean (SEM) quantifies this variability:
$$SE = \frac{\sigma}{\sqrt{n}}$$
In practice, $\sigma$ is unknown and estimated by the sample standard deviation $s$:
$$SE = \frac{s}{\sqrt{n}}$$
A larger $n$ produces a smaller SEM, meaning our estimate of $\mu$ becomes more precise as sample size increases. This is why large samples detect even trivially small deviations from $\mu_0$.
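As a quick numerical sketch (plain Python, no external libraries), the SEM calculation and its dependence on $n$:

```python
import math

def sem(s, n):
    """Standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)

# Same SD, four times the sample size -> half the SEM
print(sem(15.0, 25))   # 3.0
print(sem(15.0, 100))  # 1.5
```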
1.4 The t-Distribution
When the population $\sigma$ is unknown and estimated from the data, the test statistic does not follow the standard normal distribution; it follows the t-distribution with $df = n - 1$ degrees of freedom.
The t-distribution is:
- Symmetric and bell-shaped, centred at zero.
- Heavier-tailed than the standard normal (more probability in the extremes).
- Parameterised by degrees of freedom $df = n - 1$.
- Approaches the standard normal as $df \to \infty$.
The heavier tails reflect additional uncertainty from estimating $\sigma$ with $s$, which is particularly important in small samples.
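The convergence toward the normal is easy to verify numerically; a minimal sketch using scipy (illustrative, not DataStatPro's own code) that reproduces the two-tailed 5% critical values:

```python
from scipy.stats import t, norm

# Two-tailed 5% critical values shrink toward the normal's 1.96 as df grows
for df in (4, 29, 99, 10_000):
    print(df, round(t.ppf(0.975, df), 3))

print("normal", round(norm.ppf(0.975), 3))
```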
1.5 The p-Value and Significance Level
The p-value is the probability of obtaining a test statistic at least as extreme as the observed value, assuming $H_0$ is true. It answers: "How surprising is this sample result if the null hypothesis were correct?"
The significance level (conventionally $\alpha = .05$) is the threshold below which we consider the result sufficiently surprising to reject $H_0$.
⚠️ A small p-value does NOT mean the null is false, the effect is large, or the finding is important. It only means the data are inconsistent with $H_0$ at level $\alpha$. Always accompany p-values with effect sizes and confidence intervals.
1.6 The Central Limit Theorem
The Central Limit Theorem (CLT) states that for sufficiently large $n$, the sampling distribution of $\bar{x}$ is approximately normal regardless of the shape of the population distribution:
$$\bar{x} \;\dot\sim\; N\!\left(\mu, \frac{\sigma^2}{n}\right)$$
This guarantees that the one-sample t-test is robust to non-normality for large samples (generally $n \geq 30$). For smaller samples, the normality of the population itself is important.
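A small simulation illustrates the CLT in action; this is an illustrative numpy sketch (the heavily right-skewed exponential population is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# 10,000 sample means of n = 50 draws each from a skewed exponential population
means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The sampling distribution of the mean is centred on mu = 1
# with SD close to sigma / sqrt(n) = 1 / sqrt(50) ~ 0.141
print(means.mean(), means.std())
```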
1.7 Point Estimates and Interval Estimates
- A point estimate ($\bar{x}$) is a single best guess for the population parameter.
- A confidence interval (CI) provides a range of plausible values for the parameter.
A 95% CI means: if we repeated the study many times, 95% of the resulting intervals would contain the true $\mu$. CIs communicate both the location and precision of the estimate — always report them alongside the t-test result.
2. What is a One-Sample t-Test?
2.1 The Core Question
The one-sample t-test is a parametric inferential test that determines whether the mean of a single sample differs significantly from a known, hypothesised, or theoretically meaningful population value $\mu_0$.
Unlike two-sample tests that compare two groups, the one-sample t-test compares one group to a fixed reference point. The reference point is not estimated from the data — it is specified in advance based on:
- A published population norm (e.g., IQ = 100).
- A theoretical prediction (e.g., reaction time = 250 ms).
- A clinical threshold (e.g., PHQ-9 score = 10 for moderate depression).
- A quality control standard (e.g., tablet weight = 500 mg).
- A chance level (e.g., proportion correct = 0.50 in a binary task).
2.2 The General Logic
The test measures how far the sample mean $\bar{x}$ is from $\mu_0$, standardised by the estimated standard error of the mean:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
A large $\lvert t \rvert$ indicates that the sample mean is many standard error units away from $\mu_0$ — unlikely to occur by chance if $H_0$ is true.
2.3 When to Use the One-Sample t-Test
The one-sample t-test is appropriate when:
| Condition | Requirement |
|---|---|
| Research design | Single sample compared to a fixed standard |
| Outcome variable | Continuous (interval or ratio scale) |
| Distribution | Approximately normal (or $n \geq 30$ via CLT) |
| Reference value | Known $\mu_0$, specified before data collection |
| Observations | Independent of each other |
2.4 Real-World Applications
| Field | Research Question | $\mu_0$ |
|---|---|---|
| Clinical Psychology | Does a clinical sample's depression score differ from the population norm? | Published PHQ-9 norm |
| Cognitive Neuroscience | Is the reaction time of ADHD patients different from the normative 250 ms? | $\mu_0 = 250$ ms |
| Education | Does the class mean exam score differ from the national average of 70%? | $\mu_0 = 70\%$ |
| Quality Control | Does the mean tablet weight differ from the target of 500 mg? | $\mu_0 = 500$ mg |
| Sport Science | Does the team's mean VO₂ max differ from elite athlete norms? | Published norm |
| Nutrition | Does average daily caloric intake differ from the recommended 2000 kcal? | $\mu_0 = 2000$ kcal |
| Finance | Does the mean return of a fund differ from the benchmark return of 8%? | $\mu_0 = 8\%$ |
| Public Health | Is mean blood pressure in a community sample different from the clinical threshold? | Clinical threshold (mmHg) |
2.5 Distinguishing from Related Tests
| Situation | Correct Test |
|---|---|
| One sample vs. known value | One-sample t-test |
| Two independent groups | Independent samples t-test |
| Two related measurements | Paired samples t-test |
| One sample, non-normal, small $n$ | Wilcoxon signed-rank test (one-sample) |
| Proportion vs. known value | One-proportion z-test |
| Variance vs. known value | Chi-squared test for variance |
3. The Mathematics Behind the One-Sample t-Test
3.1 The t-Statistic
Given a sample of $n$ observations $x_1, \dots, x_n$ drawn from a population with unknown mean $\mu$ and unknown standard deviation $\sigma$, the one-sample t-statistic is:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
Where:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Under $H_0$, the statistic follows a t-distribution with $df = n - 1$ degrees of freedom.
3.2 Degrees of Freedom
The degrees of freedom $df = n - 1$ represent the number of independent pieces of information available to estimate the standard deviation. We lose 1 degree of freedom because computing $s$ requires first estimating $\mu$ with $\bar{x}$ from the same data.
Smaller $df$ → heavier tails → more conservative critical values → harder to achieve significance. This appropriately penalises small samples for the additional uncertainty in estimating $\sigma$.
3.3 Computing the p-Value
The p-value is computed from the cumulative distribution function (CDF) of the t-distribution with $df = n - 1$:
Two-tailed (default): $p = 2 \cdot P(T_{df} \geq \lvert t \rvert)$
Upper one-tailed ($H_1: \mu > \mu_0$): $p = P(T_{df} \geq t)$
Lower one-tailed ($H_1: \mu < \mu_0$): $p = P(T_{df} \leq t)$
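In practice these formulas are a one-liner; a sketch using scipy's `ttest_1samp` (illustrative, not DataStatPro's internal code), cross-checked against the manual computation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=105, scale=15, size=30)   # simulated sample

# Two-tailed one-sample t-test against mu0 = 100
res = stats.ttest_1samp(data, popmean=100)

# Manual cross-check of the same statistic and p-value
n = len(data)
t_manual = (data.mean() - 100) / (data.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)
```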
3.4 Critical Values
The decision rule compares the observed $\lvert t \rvert$ to the critical value $t_{crit}$:
$$t_{crit} = t_{1-\alpha/2,\; n-1} \quad \text{(two-tailed)}$$
Reject $H_0$ if $\lvert t \rvert \geq t_{crit}$.
Common critical values ($\alpha = .05$, two-tailed):
| $n$ | $df$ | $t_{crit}$ |
|---|---|---|
| 5 | 4 | 2.776 |
| 10 | 9 | 2.262 |
| 15 | 14 | 2.145 |
| 20 | 19 | 2.093 |
| 30 | 29 | 2.045 |
| 50 | 49 | 2.010 |
| 100 | 99 | 1.984 |
| $\infty$ | $\infty$ | 1.960 |
3.5 The 95% Confidence Interval for $\mu$
The CI for the population mean is:
$$\bar{x} \pm t_{1-\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}$$
This interval is dual to the hypothesis test: $H_0$ is rejected at level $\alpha$ if and only if $\mu_0$ falls outside the CI.
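A minimal helper (illustrative scipy sketch) that builds this CI from summary statistics:

```python
import numpy as np
from scipy import stats

def mean_ci(xbar, s, n, conf=0.95):
    """CI for mu from summary statistics: xbar +/- t_crit * s/sqrt(n)."""
    tcrit = stats.t.ppf(0.5 + conf / 2, df=n - 1)
    half = tcrit * s / np.sqrt(n)
    return xbar - half, xbar + half

lo, hi = mean_ci(xbar=100.0, s=15.0, n=25)
# mu0 outside (lo, hi)  <=>  the two-tailed test rejects at alpha = 1 - conf
print(lo, hi)
```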
3.6 Cohen's $d$ — Effect Size for the One-Sample t-Test
The standardised effect size $d$ expresses how many standard deviation units the sample mean departs from $\mu_0$:
$$d = \frac{\bar{x} - \mu_0}{s}$$
This is directly analogous to a z-score, but standardised by the sample SD $s$ rather than the population SD $\sigma$.
Note that $d$ and $t$ are related:
$$d = \frac{t}{\sqrt{n}}, \qquad t = d\sqrt{n}$$
This relationship reveals that $t$ is a joint function of effect size ($d$) and sample size ($n$). A small $d$ can produce a large $t$ with a large enough $n$.
3.7 Hedges' $g$ — Bias-Corrected Effect Size
Cohen's $d$ is slightly positively biased (overestimates the true effect) in small samples. Hedges' $g$ applies a correction factor $J$:
$$g = J \cdot d, \qquad J \approx 1 - \frac{3}{4\,df - 1}$$
More precisely:
$$J = \frac{\Gamma(df/2)}{\sqrt{df/2}\;\Gamma\!\left((df-1)/2\right)}$$
The correction is negligible in large samples but can be substantial for small ones (roughly $n < 20$). Hedges' $g$ is the preferred effect size for small samples and meta-analysis.
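Both effect sizes can be sketched in a few lines of plain Python (using the approximate correction factor rather than the exact gamma-function form):

```python
def cohens_d(xbar, s, mu0):
    """One-sample Cohen's d: standardised deviation of the mean from mu0."""
    return (xbar - mu0) / s

def hedges_g(d, n):
    """Hedges' g using the approximate correction J = 1 - 3/(4*df - 1)."""
    df = n - 1
    return d * (1 - 3 / (4 * df - 1))

# The correction matters at small n and vanishes at large n
print(hedges_g(0.8, 10))    # noticeably below 0.8
print(hedges_g(0.8, 200))   # essentially 0.8
```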
3.8 Exact Confidence Interval for $d$ via the Non-Central t-Distribution
Under $H_1$, the t-statistic follows a non-central t-distribution with non-centrality parameter:
$$\delta = d\sqrt{n}$$
The exact 95% CI for $d$ inverts this relationship numerically: find $\delta_L$ and $\delta_U$ such that:
$$P(T_{df,\,\delta_L} \geq t_{obs}) = .025, \qquad P(T_{df,\,\delta_U} \leq t_{obs}) = .025$$
Then:
$$d_L = \frac{\delta_L}{\sqrt{n}}, \qquad d_U = \frac{\delta_U}{\sqrt{n}}$$
An approximate 95% CI (adequate for $n \geq 30$):
$$d \pm 1.96 \cdot \sqrt{\frac{1}{n} + \frac{d^2}{2n}}$$
3.9 Statistical Power
Power is the probability of correctly rejecting $H_0$ when a true effect of size $d$ exists:
$$\text{Power} = P\!\left(\lvert T_{df,\,\delta} \rvert \geq t_{crit}\right), \qquad \delta = d\sqrt{n}$$
Required sample size for desired power at two-sided $\alpha$ (normal approximation):
$$n \approx \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2$$
For $d = 0.5$ and power $= 0.80$: $n \approx \left(\frac{1.96 + 0.84}{0.5}\right)^2 \approx 31.4$; the exact t-based calculation raises this slightly, giving $n = 33$.
Required $n$ for common effect sizes:
| Cohen's $d$ | Power = 0.80 | Power = 0.90 | Power = 0.95 |
|---|---|---|---|
| 0.20 (small) | 198 | 264 | 327 |
| 0.50 (medium) | 33 | 44 | 54 |
| 0.80 (large) | 14 | 18 | 22 |
| 1.00 | 9 | 12 | 15 |
| 1.50 | 5 | 7 | 9 |
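Exact power and required $n$ can be computed from the non-central t-distribution; a scipy sketch (results may differ by ±1 from tabulated values depending on rounding conventions):

```python
from scipy.stats import nct, t as t_dist

def power_one_sample(d, n, alpha=0.05):
    """Exact two-tailed power of the one-sample t-test via the non-central t."""
    df = n - 1
    tcrit = t_dist.ppf(1 - alpha / 2, df)
    ncp = d * n ** 0.5                       # non-centrality parameter
    return nct.sf(tcrit, df, ncp) + nct.cdf(-tcrit, df, ncp)

def n_for_power(d, target=0.80, alpha=0.05):
    """Smallest n reaching the target power (simple linear search)."""
    n = 3
    while power_one_sample(d, n, alpha) < target:
        n += 1
    return n
```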
4. Assumptions of the One-Sample t-Test
4.1 Normality of the Data (or Sampling Distribution)
The one-sample t-test assumes that either:
- The population from which the sample is drawn is normally distributed, OR
- The sample size is sufficiently large (generally $n \geq 30$) for the CLT to ensure an approximately normal sampling distribution of $\bar{x}$.
How to check:
| Method | Details |
|---|---|
| Shapiro-Wilk test | Most powerful normality test for small-to-moderate $n$. $H_0$: data are normal; $p > .05$ → no evidence of non-normality |
| Kolmogorov-Smirnov | Alternative for larger samples; use the Lilliefors correction when parameters are estimated from the data |
| Q-Q plot | Plot sample quantiles vs. theoretical normal quantiles; points should fall on the diagonal |
| Histogram + density | Should be approximately bell-shaped |
| Skewness | $\lvert \text{skewness} \rvert < 2$ is generally acceptable |
| Kurtosis | $\lvert \text{excess kurtosis} \rvert < 7$ is generally acceptable |
Robustness: The t-test is remarkably robust to mild-to-moderate non-normality, especially for $n \geq 30$. Symmetric non-normal distributions cause few problems even for small $n$. Severe skewness with small $n$ is the primary concern.
When violated: Use the Wilcoxon signed-rank test (one-sample version) as the non-parametric alternative. Consider log or square-root transformation for right-skewed data.
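A typical check-then-decide workflow might look like this scipy sketch (the exponential data and $\mu_0 = 2.0$ are illustrative placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=100)   # right-skewed illustrative data
mu0 = 2.0

W, p_norm = stats.shapiro(sample)   # H0: the sample comes from a normal population
skew = stats.skew(sample)

if p_norm < 0.05:
    # Normality rejected: fall back to the one-sample Wilcoxon signed-rank,
    # which tests the median against mu0
    res = stats.wilcoxon(sample - mu0)
else:
    res = stats.ttest_1samp(sample, popmean=mu0)
```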
4.2 Independence of Observations
Each observation must be independent — the value of one participant's score must not influence another's. This is a design assumption, not testable statistically.
Common violations:
- Scores from participants who discussed the task with each other.
- Multiple measurements from the same participant treated as independent.
- Cluster sampling without accounting for cluster structure.
When violated: Use mixed models or multilevel approaches that explicitly model the dependency structure.
4.3 Interval or Ratio Scale of Measurement
The dependent variable must be measured on at least an interval scale — that is, the numerical differences between values must be meaningful and equal.
When violated: If the data are ordinal (ranks, Likert items treated as ordinal), use the Wilcoxon signed-rank test. Continuous but severely non-normal data may also warrant a non-parametric approach.
4.4 The Reference Value Must Be Pre-Specified
The hypothesised value $\mu_0$ must be specified before examining the data. Choosing $\mu_0$ based on the observed $\bar{x}$ (e.g., setting $\mu_0$ from a pilot study and then testing the same data) is circular and invalidates the test.
4.5 No Extreme Outliers
Extreme outliers distort both the mean ($\bar{x}$) and the standard deviation ($s$), potentially inflating or deflating the t-statistic.
How to check:
- Boxplots: values beyond $3 \times \text{IQR}$ from the quartiles are extreme.
- Standardised scores: $\lvert z \rvert > 3.29$ are statistical outliers at $\alpha = .001$.
- Grubbs' test for a single outlier in normally distributed data.
When outliers present: Investigate the cause (data entry error? valid extreme value?). Report analyses with and without the outlier. Consider the trimmed mean t-test or Wilcoxon signed-rank as robust alternatives.
4.6 Assumption Summary
| Assumption | How to Check | Remedy if Violated |
|---|---|---|
| Normality | Shapiro-Wilk, Q-Q plot | Wilcoxon signed-rank; transform data |
| Independence | Design review | Mixed models |
| Interval scale | Measurement theory | Wilcoxon signed-rank |
| Pre-specified | Research protocol | Re-specify with new data |
| No severe outliers | Boxplots, $z$-scores | Investigate; trimmed mean t-test |
5. Variants of the One-Sample t-Test
5.1 Standard One-Sample t-Test
The classic form described throughout this tutorial: compare $\bar{x}$ to a fixed value $\mu_0$ assuming approximately normal data.
5.2 One-Sample Wilcoxon Signed-Rank Test
The non-parametric alternative when normality cannot be assumed. Tests whether the population median (not mean) equals $\mu_0$. Procedure:
- Compute $d_i = x_i - \mu_0$ for each observation.
- Remove $d_i = 0$; let $n'$ = number of non-zero differences.
- Rank $\lvert d_i \rvert$ from 1 to $n'$.
- Compute $W^+$ (sum of ranks for positive $d_i$) and $W^-$ (sum of ranks for negative $d_i$).
- Test statistic: $W = \min(W^+, W^-)$ (or use $z$ with the normal approximation).
Effect size:
$$r = \frac{\lvert z \rvert}{\sqrt{n'}}$$
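A worked sketch of this procedure with scipy (the data are illustrative; the $z$ below uses the simple normal approximation without tie correction, so it can differ slightly from software that corrects for ties):

```python
import numpy as np
from scipy import stats

mu0 = 50
x = np.array([44, 47, 49, 50, 52, 53, 55, 41, 46, 48])

d = x - mu0
d = d[d != 0]                  # drop zero differences; n' = 9 remain
res = stats.wilcoxon(d)        # statistic = W = min(W+, W-)

# Effect size r = |z| / sqrt(n') via the normal approximation
n_prime = len(d)
mean_w = n_prime * (n_prime + 1) / 4
sd_w = np.sqrt(n_prime * (n_prime + 1) * (2 * n_prime + 1) / 24)
z = (res.statistic - mean_w) / sd_w
r = abs(z) / np.sqrt(n_prime)
```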
5.3 z-Test for the Mean (Known $\sigma$)
When the population standard deviation $\sigma$ is known (rare in practice), use the one-sample z-test instead:
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
This follows the standard normal distribution exactly, without the need for the t-distribution. This situation arises in standardised testing (where $\sigma$ is known from large normative samples) or simulation studies.
5.4 Equivalence Testing (One-Sample TOST)
Rather than testing $\mu \neq \mu_0$, equivalence testing asks whether $\mu$ is close enough to $\mu_0$ to be considered practically equivalent. The Two One-Sided Tests (TOST) procedure:
Specify equivalence bounds $[\mu_0 - \Delta,\; \mu_0 + \Delta]$.
Test both:
- $H_{01}: \mu \leq \mu_0 - \Delta$ (lower bound)
- $H_{02}: \mu \geq \mu_0 + \Delta$ (upper bound)
Equivalence is concluded when both null hypotheses are rejected — equivalently, when the 90% CI for $\mu$ falls entirely within $[\mu_0 - \Delta,\; \mu_0 + \Delta]$.
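The two one-sided tests reduce to a few lines; a sketch (the helper name and the example data are illustrative) that returns the larger of the two one-sided p-values, so equivalence is concluded when it falls below $\alpha$:

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, mu0, delta, alpha=0.05):
    """TOST: test H0: mu <= mu0 - delta and H0: mu >= mu0 + delta."""
    x = np.asarray(x, float)
    n = len(x)
    df = n - 1
    se = x.std(ddof=1) / np.sqrt(n)
    p_lower = stats.t.sf((x.mean() - (mu0 - delta)) / se, df)   # H0: mu <= mu0 - delta
    p_upper = stats.t.cdf((x.mean() - (mu0 + delta)) / se, df)  # H0: mu >= mu0 + delta
    return max(p_lower, p_upper)    # equivalence if this is < alpha

x = np.array([99.8, 100.1, 100.0, 99.9, 100.2, 100.0, 99.7, 100.3, 100.1, 99.9])
p_tost = tost_one_sample(x, mu0=100.0, delta=1.0)
```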
5.5 Bayesian One-Sample t-Test
The Bayesian one-sample t-test computes a Bayes Factor $BF_{10}$ quantifying evidence for $H_1$ (effect exists) vs. $H_0$ (no effect):
$$BF_{10} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}$$
Under the Rouder et al. (2009) default prior (a Cauchy prior on the standardised effect, commonly with scale $r = \sqrt{2}/2$), $BF_{10}$ can be computed from $t$ and $n$ numerically. $BF_{10} > 3$ indicates moderate evidence for $H_1$; $BF_{10} < 1/3$ indicates moderate evidence for $H_0$.
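The default-prior Bayes factor can be sketched as a numerical integral over the JZS prior (the scale `r` used here is an assumption; implementations differ in their default):

```python
import numpy as np
from scipy import integrate
from math import gamma, sqrt

def jzs_bf10(t, n, r=sqrt(2) / 2):
    """JZS Bayes factor for a one-sample t-statistic (after Rouder et al., 2009),
    integrating numerically over the g prior."""
    v = n - 1
    def integrand(g):
        prior = (r ** 2 / 2) ** 0.5 / gamma(0.5) * g ** -1.5 * np.exp(-r ** 2 / (2 * g))
        marginal = (1 + n * g) ** -0.5 * (1 + t ** 2 / ((1 + n * g) * v)) ** (-(v + 1) / 2)
        return marginal * prior
    numerator, _ = integrate.quad(integrand, 0, np.inf)
    denominator = (1 + t ** 2 / v) ** (-(v + 1) / 2)
    return numerator / denominator
```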
5.6 Trimmed Mean t-Test (Robust Variant)
When outliers are present, the trimmed mean t-test uses the $\gamma$-trimmed mean (removing the top and bottom proportion $\gamma$ of observations) rather than the arithmetic mean. With 20% trimming:
$$\bar{x}_t = \frac{1}{n - 2g}\sum_{i=g+1}^{n-g} x_{(i)}, \qquad g = \lfloor 0.2\,n \rfloor$$
The test statistic uses the Winsorised standard deviation. This is substantially more robust to outliers and heavy tails while retaining reasonable power.
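scipy provides the trimmed mean directly; a small sketch of why it helps (a full Yuen-style trimmed t-test would additionally use the Winsorised SD, which is omitted here):

```python
import numpy as np
from scipy import stats

x = np.array([10, 12, 11, 13, 12, 11, 14, 95.0])   # one extreme outlier

print(x.mean())                    # ordinary mean is dragged toward 95
print(stats.trim_mean(x, 0.2))     # 20% trimmed mean resists the outlier
```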
6. Using the One-Sample t-Test Calculator Component
The One-Sample t-Test Calculator in DataStatPro provides a comprehensive tool for running, diagnosing, and reporting one-sample tests.
Step-by-Step Guide
Step 1 — Select the Test
Navigate to Statistical Tests → t-Tests → One-Sample t-Test.
Step 2 — Input Method
Choose how to provide data:
- Raw data: Paste or upload a column of values. DataStatPro computes all summary statistics and runs assumption checks automatically.
- Summary statistics: Enter $n$, $\bar{x}$, and $s$ directly.
- t-statistic + df: Enter a published $t$ and $df$ to compute p-values and effect sizes from a reported result.
Step 3 — Specify the Hypothesised Value
Enter $\mu_0$ — the value you are testing against. Default is $\mu_0 = 0$. Common values:
- $\mu_0 = 100$ for IQ or standardised test score comparisons.
- $\mu_0 = 0$ for testing whether a mean change differs from zero.
- $\mu_0 = 0.5$ for testing whether a proportion differs from chance.
Step 4 — Select the Alternative Hypothesis
- Two-tailed (default): $\mu \neq \mu_0$.
- Upper one-tailed: $\mu > \mu_0$.
- Lower one-tailed: $\mu < \mu_0$.
⚠️ One-tailed tests require a strong, pre-registered directional prediction. Selecting one-tailed post-hoc to achieve significance is p-hacking.
Step 5 — Set Significance Level and Confidence Level
Default: $\alpha = .05$, 95% CI. DataStatPro also displays results for other common significance levels simultaneously.
Step 6 — Select Display Options
- ✅ t-statistic, df, p-value (exact), and decision.
- ✅ Sample mean, SD, SEM, and 95% CI for $\mu$.
- ✅ Cohen's $d$ and Hedges' $g$ with exact 95% CI.
- ✅ Common Language Effect Size.
- ✅ Assumption check results (Shapiro-Wilk, outlier detection).
- ✅ Sampling distribution diagram showing the observed $t$ and critical region.
- ✅ Effect size diagram (distribution of sample vs. $\mu_0$).
- ✅ Power analysis: current power and required $n$ for 80%, 90%, 95% power.
- ✅ Equivalence test (TOST) panel.
- ✅ Bayesian t-test (Bayes Factor $BF_{10}$).
- ✅ APA 7th edition results paragraph (auto-generated).
Step 7 — Run the Analysis
Click "Run One-Sample t-Test". DataStatPro will:
- Compute $\bar{x}$, $s$, $SE$, $t$, $df$, and the exact p-value.
- Construct the 95% CI for $\mu$.
- Compute Cohen's $d$ and Hedges' $g$ with exact CIs.
- Run Shapiro-Wilk normality test and flag outliers.
- Generate all selected visualisations.
- Output an APA-compliant results paragraph.
7. Step-by-Step Procedure
7.1 Full Manual Procedure
Step 1 — State the Hypotheses
$H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$ (two-tailed)
Specify $\mu_0$ based on theory, norms, or a substantive threshold.
Step 2 — Check Assumptions
- Verify interval/ratio scale.
- Run Shapiro-Wilk test or inspect Q-Q plot.
- Identify and investigate outliers with boxplots and $z$-scores.
- Confirm independence of observations by reviewing study design.
Step 3 — Compute Summary Statistics
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}, \qquad SE = \frac{s}{\sqrt{n}}$$
Step 4 — Compute the t-Statistic
$$t = \frac{\bar{x} - \mu_0}{SE}$$
Step 5 — Determine Degrees of Freedom
$$df = n - 1$$
Step 6 — Compute the p-Value
$p = 2 \cdot P(T_{df} \geq \lvert t \rvert)$ (two-tailed)
Compare to $\alpha$. Reject $H_0$ if $p < \alpha$.
Step 7 — Construct the 95% CI for $\mu$
Find $t_{crit}$ (e.g., $t_{.975,\,df}$ for a 95% CI):
$$\bar{x} \pm t_{crit} \cdot SE$$
Step 8 — Compute Effect Size
$$d = \frac{\bar{x} - \mu_0}{s}$$
Hedges' $g$:
$$g = d\left(1 - \frac{3}{4\,df - 1}\right)$$
Step 9 — Compute Approximate 95% CI for $d$
$$d \pm 1.96\sqrt{\frac{1}{n} + \frac{d^2}{2n}}$$
Step 10 — Interpret and Report
Use the APA reporting template in Section 15. Always report $n$, $\bar{x}$, $s$, $t$, $df$, the exact $p$, the 95% CI for $\mu$, Cohen's $d$ or Hedges' $g$, and the 95% CI for the effect size.
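The whole manual procedure (Steps 3–9) can be collected into one helper; an illustrative Python/scipy sketch, cross-checked against scipy's own `ttest_1samp`:

```python
import numpy as np
from scipy import stats

def one_sample_report(x, mu0, conf=0.95):
    """Steps 3-9 of the manual procedure, from raw data to effect sizes."""
    x = np.asarray(x, float)
    n = len(x)
    df = n - 1
    xbar, s = x.mean(), x.std(ddof=1)
    se = s / np.sqrt(n)
    t = (xbar - mu0) / se
    p = 2 * stats.t.sf(abs(t), df)                    # two-tailed p-value
    tcrit = stats.t.ppf(0.5 + conf / 2, df)
    d = (xbar - mu0) / s                              # Cohen's d
    g = d * (1 - 3 / (4 * df - 1))                    # Hedges' g (approx.)
    se_d = np.sqrt(1 / n + d ** 2 / (2 * n))          # approximate SE of d
    return {"t": t, "df": df, "p": p,
            "ci_mu": (xbar - tcrit * se, xbar + tcrit * se),
            "d": d, "g": g,
            "ci_d": (d - 1.96 * se_d, d + 1.96 * se_d)}

report = one_sample_report([1, 2, 3, 4, 5], mu0=0)
```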
8. Interpreting the Output
8.1 The t-Statistic
| $t$ vs. $t_{crit}$ | Interpretation |
|---|---|
| $\lvert t \rvert < t_{crit}$ | Fail to reject $H_0$; result not significant at $\alpha$ |
| $\lvert t \rvert \geq t_{crit}$ | Reject $H_0$; result significant at $\alpha$ |
| Large $t$ with large $n$ | Can be significant even for tiny $d$ |
| Small $t$ with small $n$ | May be non-significant even for large $d$ (low power) |
8.2 The p-Value
| p-Value | Conventional Interpretation |
|---|---|
| $p \geq .10$ | No evidence against $H_0$ |
| $.05 \leq p < .10$ | Marginal evidence (trend) |
| $p < .05$ | Significant at $\alpha = .05$ |
| $p < .01$ | Significant at $\alpha = .01$ |
| $p < .001$ | Significant at $\alpha = .001$ |
⚠️ These thresholds are arbitrary conventions, not natural boundaries. A result with $p = .049$ is not meaningfully more "significant" than one with $p = .051$. Focus on effect sizes and CIs, not arbitrary thresholds.
8.3 The 95% Confidence Interval
| CI Outcome | Interpretation |
|---|---|
| CI excludes $\mu_0$ | Reject $H_0$; $\mu$ plausibly differs from $\mu_0$ |
| CI includes $\mu_0$ | Fail to reject $H_0$ |
| Narrow CI | Precise estimate of $\mu$; adequate sample size |
| Wide CI | Imprecise estimate; consider increasing $n$ |
8.4 Cohen's $d$ — Magnitude Interpretation
Cohen's (1988) benchmarks:
| $\lvert d \rvert$ | Verbal Label | Overlap Between Distributions |
|---|---|---|
| 0.0 | No effect | 100% |
| 0.2 | Small | ~85% |
| 0.5 | Medium | ~67% |
| 0.8 | Large | ~53% |
| 1.2 | Very large | ~38% |
| 2.0 | Huge | ~19% |
Sawilowsky's (2009) extended benchmarks:
| Label | $\lvert d \rvert$ |
|---|---|
| Tiny | 0.01 |
| Very small | 0.1 |
| Small | 0.2 |
| Medium | 0.5 |
| Large | 0.8 |
| Very large | 1.2 |
| Huge | 2.0 |
⚠️ Cohen himself warned against mechanical application of these benchmarks. They were "offered as conventions of last resort." Always contextualise effect sizes within your specific research domain and compare to typical effect sizes in your field.
9. Effect Sizes for the One-Sample t-Test
9.1 Cohen's $d$ (One-Sample)
$$d = \frac{\bar{x} - \mu_0}{s}$$
Interpretation: The sample mean is $\lvert d \rvert$ standard deviations above (positive $d$) or below (negative $d$) the hypothesised value $\mu_0$.
9.2 Hedges' $g$
$$g = d\left(1 - \frac{3}{4\,df - 1}\right)$$
Preferred over $d$ for small samples (roughly $n < 20$) and for meta-analysis.
9.3 Point-Biserial Correlation $r$
$$r = \sqrt{\frac{t^2}{t^2 + df}}$$
Equivalent to the correlation between a binary "is vs. norm" variable and the continuous outcome. Ranges from 0 to 1; no directionality.
Convert $r$ to $d$: $d = \dfrac{2r}{\sqrt{1 - r^2}}$ (assuming equal split)
9.4 Common Language Effect Size (CL)
The probability that a randomly selected individual from the population has a score above $\mu_0$:
$$CL = \Phi(d) \quad \text{for large } n$$
Where $\Phi$ is the standard normal CDF.
$CL = .50$ → no effect; $d = 0.5$ → $CL \approx .69$; $d = 1.0$ → $CL \approx .84$.
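As a sketch, converting $d$ to the common language effect size is a single call to the standard normal CDF:

```python
from scipy.stats import norm

def common_language(d):
    """CL = Phi(d): probability a random score exceeds mu0 (large-n approximation)."""
    return norm.cdf(d)

print(common_language(0.0))   # 0.5 -> no effect
print(common_language(0.5))   # ~0.69
print(common_language(1.0))   # ~0.84
```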
9.5 Effect Size Summary Table
| Effect Size | Formula | Range | Interpretation |
|---|---|---|---|
| Cohen's $d$ | $(\bar{x} - \mu_0)/s$ | $(-\infty, \infty)$ | SD units above/below $\mu_0$ |
| Hedges' $g$ | $J \cdot d$ | $(-\infty, \infty)$ | Bias-corrected; preferred for small $n$ |
| $r$ | $\sqrt{t^2/(t^2 + df)}$ | $[0, 1]$ | Correlation-like; no direction |
| CL | $\Phi(d)$ (approx) | $[0, 1]$ | Prob. of exceeding $\mu_0$ |
10. Confidence Intervals
10.1 CI for the Population Mean $\mu$
$$\bar{x} \pm t_{1-\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}$$
This directly addresses the primary research question by providing a range of plausible values for the true population mean.
10.2 CI Width as a Function of Sample Size
For data with $s = 1$ (so half-widths are in SD units):
| $n$ | Approx 95% CI Half-Width | Interpretation |
|---|---|---|
| 5 | $\pm 1.24$ | Very imprecise |
| 10 | $\pm 0.72$ | Imprecise |
| 20 | $\pm 0.47$ | Moderate |
| 50 | $\pm 0.28$ | Good |
| 100 | $\pm 0.20$ | High precision |
| 200 | $\pm 0.14$ | Very high precision |
10.3 CI for Cohen's $d$
Approximate (adequate for $n \geq 30$):
$$d \pm 1.96 \cdot SE(d), \qquad SE(d) = \sqrt{\frac{1}{n} + \frac{d^2}{2n}}$$
Exact: Uses the non-central t-distribution (computed automatically by DataStatPro).
10.4 Relationship Between CI and Hypothesis Test
The CI and two-tailed hypothesis test are algebraically equivalent:
- $H_0: \mu = \mu_0$ is rejected at $\alpha$ if and only if $\mu_0$ lies outside the $(1 - \alpha)$ CI.
- The CI provides more information: it shows not just whether $\mu_0$ is excluded but also the entire range of plausible values given the data.
11. Advanced Topics
11.1 Multiple One-Sample Tests on the Same Dataset
When several one-sample t-tests are conducted on data from the same participants (e.g., testing each of 10 subscales against their respective norms), the familywise error rate inflates:
For $k$ independent tests: $FWER = 1 - (1 - \alpha)^k$. With $k = 10$ and $\alpha = .05$, $FWER \approx .40$.
Correction strategies:
- Bonferroni: Adjust $\alpha_{adj} = \alpha / k$. Simple but conservative.
- Holm-Bonferroni: Sequential Bonferroni — less conservative than Bonferroni.
- Benjamini-Hochberg: Controls the False Discovery Rate (FDR) — appropriate for exploratory research.
Report all tests with both original and adjusted p-values.
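Both the inflation and the Holm correction are easy to compute; a self-contained numpy sketch (the example p-values are illustrative):

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjusted p-values (compare directly to alpha)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p by (m - k + 1), enforcing monotonicity
        running_max = max(running_max, min((m - rank) * p[idx], 1.0))
        adjusted[idx] = running_max
    return adjusted

# Familywise error rate for k = 10 uncorrected independent tests at alpha = .05
fwer = 1 - 0.95 ** 10
print(fwer, holm_adjust([0.01, 0.04, 0.03, 0.005]))
```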
11.2 Sensitivity Analysis: Minimum Detectable Effect
Given a fixed sample size $n$, the minimum detectable effect (MDE) at power 80% and $\alpha = .05$:
$$d_{min} \approx \frac{z_{.975} + z_{.80}}{\sqrt{n}} = \frac{2.80}{\sqrt{n}}$$
| $n$ | $d_{min}$ (80% power) |
|---|---|
| 10 | 0.885 |
| 20 | 0.626 |
| 30 | 0.511 |
| 50 | 0.396 |
| 100 | 0.280 |
| 200 | 0.198 |
If the smallest effect of practical interest exceeds $d_{min}$, the study is adequately powered. If not, acknowledge that the study may miss practically important effects.
11.3 Comparing the One-Sample t-Test to the Paired t-Test
The paired t-test (Section on Paired t-Test) is mathematically equivalent to a one-sample t-test applied to the difference scores $d_i = x_{2i} - x_{1i}$, testing $H_0: \mu_d = 0$. Understanding this equivalence clarifies when each is appropriate:
- One-sample t-test: Compare a sample mean to an externally specified value $\mu_0$.
- Paired t-test: Compare the mean of difference scores to zero (or a specified value).
11.4 Bayesian One-Sample t-Test
The Bayes Factor under the Rouder et al. (2009) default prior:
$$BF_{10} = \frac{\displaystyle\int_0^\infty (1 + ng)^{-1/2}\left(1 + \frac{t^2}{(1 + ng)\,df}\right)^{-(df+1)/2}\pi(g)\,dg}{\left(1 + \dfrac{t^2}{df}\right)^{-(df+1)/2}}$$
This integral has no closed form but is computed numerically by DataStatPro.
Interpreting $BF_{10}$:
| $BF_{10}$ | Evidence for $H_1$ over $H_0$ |
|---|---|
| $> 100$ | Extreme |
| $30 - 100$ | Very strong |
| $10 - 30$ | Strong |
| $3 - 10$ | Moderate |
| $1 - 3$ | Anecdotal |
| $= 1$ | No evidence |
| $< 1$ | Evidence for $H_0$ (interpret $1/BF_{10}$ on the same scale) |
Key advantage: the Bayes Factor can provide positive evidence for the null hypothesis — something p-values cannot do.
11.5 TOST Equivalence Testing for the One-Sample t-Test
To establish that $\mu$ is practically equivalent to $\mu_0$ (e.g., that a new scale yields scores equivalent to an established norm):
- Specify $\Delta$ (the equivalence margin — the maximum acceptable deviation from $\mu_0$).
- Test $H_{01}: \mu \leq \mu_0 - \Delta$ with an upper one-tailed t-test.
- Test $H_{02}: \mu \geq \mu_0 + \Delta$ with a lower one-tailed t-test.
- Equivalence is concluded if both tests are significant (i.e., the 90% CI for $\mu$ falls within $[\mu_0 - \Delta,\; \mu_0 + \Delta]$).
12. Worked Examples
Example 1: IQ in a Clinical Sample
A neuropsychologist measures IQ scores in a sample of adults diagnosed with early-stage Alzheimer's disease. The population mean for neurotypical adults is $\mu_0 = 100$.
Data summary: sample size $n$, sample mean $\bar{x}$, and sample SD $s$
Step 1 — Hypotheses: $H_0: \mu = 100$ vs. $H_1: \mu \neq 100$
Step 2 — Standard error: $SE = s/\sqrt{n}$
Step 3 — t-statistic: $t = (\bar{x} - 100)/SE$
Step 4 — df and p-value:
$df = n - 1$; $p = 2 \cdot P(T_{df} \geq \lvert t \rvert)$
Step 5 — 95% CI for $\mu$: $\bar{x} \pm t_{.975,\,df} \cdot SE$
Step 6 — Cohen's $d$:
$d = (\bar{x} - 100)/s$;
Hedges' $g$: $g = d\left(1 - \frac{3}{4\,df - 1}\right)$
95% CI for $d$ (approximate):
$d \pm 1.96\sqrt{1/n + d^2/(2n)}$
Summary:
| Statistic | Interpretation |
|---|---|
| $p$ (two-tailed) | Highly significant |
| $\bar{x} - 100$ | Points below the norm |
| 95% CI for $\mu$ | Excludes 100 |
| Cohen's $d$ | Large effect |
| 95% CI for $d$ | Entirely below zero |
| Hedges' $g$ | Large (bias-corrected) |
APA write-up: "A one-sample t-test revealed that the Alzheimer's group ($M =$ [value], $SD =$ [value], $n =$ [value]) had significantly lower IQ scores than the normative mean of 100, $t(df) =$ [value], $p =$ [value], $g =$ [value] [95% CI: LB, UB]. The 95% CI for the mean, [LB, UB], excluded the normative value. This represents a large deviation from the normative standard."
Example 2: Quality Control — Tablet Weight
A pharmaceutical quality control analyst measures the weight (mg) of tablets from a production batch. The target weight is $\mu_0 = 500$ mg.
Data summary: sample size $n$, mean weight $\bar{x}$ (mg), and SD $s$ (mg)
Step 1 — Hypotheses: $H_0: \mu = 500$ vs. $H_1: \mu \neq 500$
Step 2 — SE and t:
$SE = s/\sqrt{n}$; $t = (\bar{x} - 500)/SE$
Step 3 — df and p-value:
$df = n - 1$; $p = 2 \cdot P(T_{df} \geq \lvert t \rvert)$
Step 4 — 95% CI:
$\bar{x} \pm t_{.975,\,df} \cdot SE$
Step 5 — Effect size:
$d = (\bar{x} - 500)/s$ (small-medium effect)
Step 6 — Equivalence test:
Regulatory limit: deviations of $\pm 5$ mg are acceptable. Test whether $495 \leq \mu \leq 505$.
90% CI for $\mu$: $[495.5,\; 499.1]$
Since $[495.5,\; 499.1] \subset [495,\; 505]$: the batch is within equivalence bounds even though it differs significantly from exactly 500 mg.
Interpretation: The batch mean is statistically significantly below 500 mg, but the deviation is within the acceptable regulatory range — the batch meets quality standards. This illustrates how statistical significance (the mean is not exactly 500 mg) and practical significance (the deviation is within tolerance) can diverge.
APA write-up: "A one-sample t-test revealed that the mean tablet weight ($M =$ [value] mg, $SD =$ [value] mg, $n =$ [value]) was significantly lower than the target of 500 mg, $t(df) =$ [value], $p =$ [value], $d =$ [value] [95% CI: LB, UB]. However, an equivalence test (TOST, $\Delta = 5$ mg) demonstrated that the batch mean was within the acceptable regulatory range, 90% CI [495.5, 499.1] $\subset$ [495, 505], indicating the batch meets quality specifications despite the statistically significant deviation from the nominal target."
Example 3: Exam Scores vs. National Average
A teacher believes their class performs above the national average of $\mu_0 = 68\%$. They measure the exam scores of their students.
Data summary: sample size $n$, class mean $\bar{x}$, and SD $s$
Directional hypothesis: $H_0: \mu = 68$ vs. $H_1: \mu > 68$ (one-tailed, pre-registered)
t-statistic: $t = (\bar{x} - 68)/(s/\sqrt{n})$
p-value (upper one-tailed): $p = P(T_{df} \geq t)$
95% CI for $\mu$ (two-tailed, for reference): $\bar{x} \pm t_{.975,\,df} \cdot s/\sqrt{n}$
Cohen's $d$: $d = (\bar{x} - 68)/s$ (small-medium)
APA write-up: "A one-tailed one-sample t-test (pre-registered directional hypothesis) indicated that the class mean exam score ($M =$ [value], $SD =$ [value], $n =$ [value]) was significantly above the national average of 68%, $t(df) =$ [value], $p =$ [value], $d =$ [value] [95% CI: LB, UB]. The class outperformed the national average by a small-to-medium margin."
13. Common Mistakes and How to Avoid Them
Mistake 1: Choosing $\mu_0$ Based on the Sample Mean
Problem: Examining the data, noting the observed $\bar{x}$, and then testing against a $\mu_0$ chosen because it seems close. Selecting $\mu_0$ based on the observed data is circular and inflates the Type I error rate.
Solution: Always specify $\mu_0$ before data collection, based on theory, published norms, or a substantive threshold. Pre-register the hypothesis if possible.
Mistake 2: Interpreting a Non-Significant Result as "No Difference from $\mu_0$"
Problem: Concluding that $p > .05$ means $\mu = \mu_0$. A non-significant result means insufficient evidence against $H_0$, not evidence that $H_0$ is true. With a very small $n$, almost no test will be significant, regardless of the true effect.
Solution: Report the 95% CI for $\mu$ alongside the p-value. A wide CI spanning across $\mu_0$ reflects uncertainty, not zero effect. Use equivalence testing (TOST) to positively establish equivalence.
Mistake 3: Using a One-Tailed Test Post-Hoc
Problem: Running a two-tailed test, obtaining a p-value just above .05, and switching to a one-tailed test to push it below .05. This doubles the effective Type I error rate.
Solution: Directional hypotheses must be pre-registered. If the research question genuinely allows only one direction (and this was decided before data collection), the one-tailed test is appropriate. Otherwise, use the two-tailed test.
Mistake 4: Reporting Only the p-Value Without Effect Size
Problem: Reporting only "$p < .05$" tells the reader nothing about how large the deviation from $\mu_0$ is in meaningful units. A study with a very large $n$ can produce a significant result for a trivially small effect.
Solution: Always report Cohen's $d$ (or Hedges' $g$) with its 95% CI and interpret the magnitude relative to the research context.
Mistake 5: Using the One-Sample t-Test for Paired Data
Problem: Computing the mean of pre-scores and the mean of post-scores and testing each separately against $\mu_0$. This ignores the within-person correlation and misses the point — the interest is in the change, not the absolute level.
Solution: For pre-post designs, use the paired t-test with difference scores $d_i = \text{post}_i - \text{pre}_i$, testing $H_0: \mu_d = 0$ (which is itself a one-sample t-test on the differences).
Mistake 6: Ignoring Outliers in Small Samples
Problem: With a small $n$, a single outlier can shift $\bar{x}$ substantially and either inflate or deflate the t-statistic dramatically.
Solution: Always inspect data with boxplots and $z$-scores before running the test. Report the analysis with and without outliers, and consider the Wilcoxon signed-rank test or trimmed mean t-test as robust alternatives.
Mistake 7: Not Reporting the 95% CI for $\mu$
Problem: The p-value and $t$ alone give an incomplete picture. The CI for $\mu$ directly answers the question: "What values of the true population mean are plausible given these data?" Without it, readers cannot assess the precision of the estimate.
Solution: Always report the 95% CI for $\mu$ alongside the hypothesis test results.
14. Troubleshooting
| Symptom | Likely Cause | Solution |
|---|---|---|
| $t$-statistic is extremely large ($\lvert t \rvert > 10$) | Data entry error, wrong units, or a misspecified $\mu_0$ | Check the data and the specified $\mu_0$ |
| $p = 0$ exactly | $p$ is smaller than the displayed precision, or a floating point issue | Report as $p < .001$; check data; add more decimal places |
| Shapiro-Wilk $p < .05$ with large $n$ | Small, inconsequential non-normality detected (the test has high power for large $n$) | Inspect Q-Q plot; with $n \geq 30$, minor non-normality rarely affects t-test validity |
| CI for $\mu$ is very wide | Small $n$ | Report the wide CI, which conveys genuine uncertainty; conduct a power analysis for a future study |
| Cohen's $d$ is large but $p > .05$ | Small $n$ (low power) | Study is underpowered; may reflect a real but undetected effect |
| $p$ recomputed from a rounded $t$ differs from the software's $p$ | Rounding in intermediate steps | Compute $t$ directly from summary statistics for accuracy |
| One-tailed $p$ is larger than two-tailed $p$ | Effect is in the wrong direction | Report the result as going against the directional hypothesis; consider reporting two-tailed results |
| Negative $t$ when a positive effect is expected | Check the direction of $\bar{x}$ vs. $\mu_0$ | Verify the formula; report the sign with direction (e.g., sample mean is below/above $\mu_0$) |
| Equivalence test fails despite small $d$ | Equivalence bounds are too narrow for the sample size | Increase $n$ or use wider, substantively justified equivalence bounds |
| Hedges' $g$ and Cohen's $d$ differ substantially | Very small $n$, where the bias correction is large | Report $g$ (preferred); note that estimates are unstable for very small samples |
15. Quick Reference Cheat Sheet
Core Equations
| Formula | Description |
|---|---|
| $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$ | One-sample t-statistic |
| $df = n - 1$ | Degrees of freedom |
| $p = 2 \cdot P(T_{df} \geq \lvert t \rvert)$ | Two-tailed p-value |
| $\bar{x} \pm t_{.975,\,df} \cdot s/\sqrt{n}$ | 95% CI for $\mu$ |
| $d = (\bar{x} - \mu_0)/s$ | Cohen's $d$ |
| $d = t/\sqrt{n}$ | Cohen's $d$ from $t$-statistic |
| $g = d\,(1 - \frac{3}{4\,df - 1})$ | Hedges' $g$ (bias-corrected) |
| $r = \sqrt{t^2/(t^2 + df)}$ | Point-biserial $r$ from $t$ |
| $SE(d) = \sqrt{1/n + d^2/(2n)}$ | SE of Cohen's $d$ |
| $n \approx (2.80/d)^2$ | Required $n$ for 80% power, $\alpha = .05$ |
Decision Guide
| Condition | Recommended Test |
|---|---|
| Normal data, $\mu_0$ known | One-sample t-test |
| Non-normal data or ordinal | Wilcoxon signed-rank (one-sample) |
| Testing equivalence to $\mu_0$ | TOST equivalence test |
| Known population $\sigma$ | One-sample z-test |
| Quantifying evidence for $H_0$ | Bayesian t-test (Bayes Factor) |
Cohen's $d$ Benchmarks
| Label | $\lvert d \rvert$ |
|---|---|
| Small | 0.2 |
| Medium | 0.5 |
| Large | 0.8 |
Required Sample Size
| $d$ | Power = 0.80 | Power = 0.90 |
|---|---|---|
| 0.20 | 198 | 264 |
| 0.50 | 33 | 44 |
| 0.80 | 14 | 18 |
| 1.00 | 9 | 12 |
Assumes $\alpha = .05$, two-tailed.
APA 7th Edition Reporting Templates
Standard: "A one-sample t-test indicated that [sample description] ($M =$ [value], $SD =$ [value], $n =$ [value]) [significantly / did not significantly] differ from the [reference value / normative mean] of $\mu_0 =$ [value], $t(df) =$ [value], $p =$ [value], $d =$ [value] [95% CI: LB, UB]. The 95% CI for the mean, [LB, UB], [excluded / included] $\mu_0$."
With Hedges' $g$: "... $g =$ [value] [95% CI: LB, UB] (bias-corrected for small sample)."
With equivalence test: "A TOST equivalence test with bounds $\pm\Delta$ demonstrated [equivalence / non-equivalence] to the [reference value], 90% CI for $\mu$: [LB, UB]."
With Bayesian t-test: "The Bayesian t-test yielded $BF_{10} =$ [value], indicating [moderate / strong / extreme] evidence for [the alternative / the null] hypothesis."
Reporting Checklist
| Item | Required |
|---|---|
| t-statistic with sign | ✅ Always |
| Degrees of freedom | ✅ Always |
| Exact p-value | ✅ Always |
| Sample mean, SD, and $n$ | ✅ Always |
| 95% CI for $\mu$ | ✅ Always |
| Cohen's $d$ or Hedges' $g$ | ✅ Always |
| 95% CI for effect size | ✅ Always |
| Hypothesised value $\mu_0$ (stated explicitly) | ✅ Always |
| Alternative hypothesis direction | ✅ Always |
| Normality check result | ✅ When $n < 30$ |
| Outlier check result | ✅ When outliers are present or suspected |
| Hedges' $g$ instead of $d$ | ✅ When $n$ is small |
| TOST equivalence test | ✅ When claiming null result |
| Bayes Factor | Recommended for null results |
| Power analysis | ✅ For underpowered or null results |
This tutorial provides a comprehensive foundation for understanding, conducting, and reporting one-sample t-tests within the DataStatPro application. For further reading, consult Gravetter & Wallnau's "Statistics for the Behavioral Sciences" (10th ed.), Cohen's "Statistical Power Analysis for the Behavioral Sciences" (2nd ed., 1988), Lakens's "Equivalence Tests: A Practical Primer" (Social Psychological and Personality Science, 2017), and Rouder et al.'s "Bayesian t-Tests" (Psychonomic Bulletin & Review, 2009). For feature requests or support, contact the DataStatPro team.