Sample Size and Power Analysis: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of statistical power and sample size determination all the way through advanced interpretation, reporting, assumption checking, and practical usage within the DataStatPro application. Whether you are planning a new study for the first time or deepening your understanding of how to design adequately powered research, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What are Sample Size and Power Analysis?
- The Mathematics Behind Power Analysis
- Considerations and Planning Checklist
- Power Analysis for Common Statistical Tests
- Using the Sample Size and Power Analysis Calculator Component
- Step-by-Step Procedure
- Interpreting the Output
- Visualising Power and Sample Size
- Sensitivity Analysis and Robustness Checks
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into sample size and power analysis, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.
1.1 Hypothesis Testing Framework
All power analyses are grounded in the hypothesis testing framework. A statistical test evaluates the evidence in a dataset against a null hypothesis:
- Null hypothesis ($H_0$): The default position; typically, that no effect exists, no difference is present, or variables are unassociated.
- Alternative hypothesis ($H_1$): The research hypothesis; that an effect exists, a difference is present, or variables are associated.
The test produces a test statistic (e.g., $t$, $F$, $z$, $\chi^2$) and a corresponding p-value. If $p < \alpha$, we reject $H_0$ in favour of $H_1$.
1.2 The Four Outcomes of a Hypothesis Test
Every hypothesis test results in one of four possible outcomes, two of which are correct decisions and two of which are errors:
| | $H_0$ is TRUE | $H_0$ is FALSE |
|---|---|---|
| Fail to reject $H_0$ | ✅ Correct decision (True negative) | ❌ Type II error (False negative) |
| Reject $H_0$ | ❌ Type I error (False positive) | ✅ Correct decision (True positive) |
The probabilities associated with each outcome:
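| Outcome | Symbol | Definition |
|---|---|---|
| Type I error rate (false positive rate) | $\alpha$ | $P(\text{reject } H_0 \mid H_0 \text{ true})$ |
| Type II error rate (false negative rate) | $\beta$ | $P(\text{fail to reject } H_0 \mid H_0 \text{ false})$ |
| Significance level | $\alpha$ | Controlled by the researcher; conventionally $\alpha = 0.05$ |
| Statistical power | $1 - \beta$ | $P(\text{reject } H_0 \mid H_0 \text{ false})$ |
| Specificity | $1 - \alpha$ | $P(\text{fail to reject } H_0 \mid H_0 \text{ true})$ |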
1.3 The Significance Level ($\alpha$)
The significance level $\alpha$ is the maximum acceptable probability of a Type I error, that is, the probability of declaring a significant result when the null hypothesis is actually true. The researcher chooses $\alpha$ before collecting data.
Conventional values:
| $\alpha$ | Context |
|---|---|
| $0.05$ | Standard in most social, behavioural, and health sciences |
| $0.01$ | More stringent; clinical trials, policy-relevant decisions |
| $0.001$ | Very stringent; genomics, physics, large-scale testing |
| $0.10$ | Sometimes used in exploratory research or small pilot studies |
1.4 The p-Value
The p-value is the probability of observing a test statistic as extreme as or more extreme than the one obtained, assuming $H_0$ is true:
$$p = P(\text{test statistic at least as extreme as observed} \mid H_0)$$
A small p-value (below $\alpha$) means the observed result is unlikely under $H_0$ and constitutes evidence against $H_0$. Crucially, the p-value does not tell you the probability that $H_0$ is true, nor does it measure the size or practical importance of an effect.
1.5 Effect Size
An effect size is a standardised, scale-free measure of the magnitude of a phenomenon. It is the single most important input to any power analysis. Common effect size measures by test type:
| Test | Effect Size Measure | Symbol | Range |
|---|---|---|---|
| t-test (two groups) | Cohen's $d$ | $d$ | $(-\infty, \infty)$ |
| ANOVA (multiple groups) | Cohen's $f$ | $f$ | $[0, \infty)$ |
| Correlation | Pearson correlation | $r$ | $[-1, 1]$ |
| Chi-square test | Cramér's $V$ (or $w$) | $V$, $w$ | $V \in [0, 1]$; $w \ge 0$ |
| Regression (multiple) | Cohen's $f^2$ | $f^2$ | $[0, \infty)$ |
| Proportion test | Cohen's $h$ (arcsine difference) | $h$ | $[-\pi, \pi]$ |
| Repeated measures | Cohen's $d_z$ or $f$ | — | — |
1.6 Cohen's Conventions for Effect Size
Jacob Cohen (1988) proposed widely used benchmarks for effect size magnitudes across common statistical tests. These are conventions of last resort — domain knowledge always supersedes them:
| Test | Small | Medium | Large |
|---|---|---|---|
| t-test ($d$) | 0.20 | 0.50 | 0.80 |
| ANOVA ($f$) | 0.10 | 0.25 | 0.40 |
| Correlation ($r$) | 0.10 | 0.30 | 0.50 |
| Chi-square ($w$) | 0.10 | 0.30 | 0.50 |
| Regression ($f^2$) | 0.02 | 0.15 | 0.35 |
| Proportion test ($h$) | 0.20 | 0.50 | 0.80 |
1.7 The Normal and Non-Central Distributions
Power analysis relies on understanding how test statistics are distributed under two scenarios:
- Under $H_0$: The test statistic follows a central distribution (e.g., standard normal $N(0, 1)$, central $t$, central $\chi^2$, central $F$).
- Under $H_1$: The test statistic follows a non-central distribution. This is the same family, but shifted by a non-centrality parameter (written $\delta$ for $z$ and $t$ tests and $\lambda$ for $F$ and $\chi^2$ tests), which combines the true size of the effect with the sample size. For the one-sample $z$-test, for example:
$$\delta = d\sqrt{n}$$
Power is the probability that a non-centrally distributed test statistic exceeds the critical value derived from the central distribution.
1.8 Directionality: One-Tailed vs. Two-Tailed Tests
The directionality of a test affects the critical region and therefore the power:
- Two-tailed test: The critical region is split across both tails of the distribution. Rejects $H_0$ for both very large and very small test statistics. Critical value: $z_{1-\alpha/2}$ (e.g., $1.960$ for $\alpha = 0.05$).
- One-tailed test: The critical region is entirely in one tail. More powerful for detecting effects in the predicted direction but cannot detect effects in the opposite direction. Critical value: $z_{1-\alpha}$ (e.g., $1.645$ for $\alpha = 0.05$).
For a given effect size and sample size, a one-tailed test has greater power than a two-tailed test. However, one-tailed tests require a strong directional a priori justification and are vulnerable to criticism if the actual effect is in the opposite direction.
⚠️ Most journals and reporting guidelines recommend two-tailed tests unless there is a strong, pre-registered, directional theoretical justification. DataStatPro defaults to two-tailed tests for all procedures.
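The trade-off described above can be checked numerically. The following is a minimal sketch, assuming a one-sample $z$-test with non-centrality $d\sqrt{n}$; the function names are hypothetical and nothing here reflects DataStatPro's internal code.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_two_tailed(d, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test for standardised effect d."""
    delta = d * sqrt(n)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(delta - z_crit) + Z.cdf(-delta - z_crit)

def power_one_tailed(d, n, alpha=0.05):
    """Power of an upper-tailed one-sample z-test (predicts d > 0)."""
    delta = d * sqrt(n)
    return Z.cdf(delta - Z.inv_cdf(1 - alpha))

# Effect in the predicted direction: the one-tailed test is more powerful.
print(power_two_tailed(0.5, 32), power_one_tailed(0.5, 32))
# Effect in the opposite direction: the one-tailed test is essentially blind.
print(power_two_tailed(-0.5, 32), power_one_tailed(-0.5, 32))
```

The second line of output illustrates why a one-tailed test needs a firm directional justification: an effect of the same size in the wrong direction is almost never detected.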
2. What are Sample Size and Power Analysis?
2.1 The Core Questions
Power analysis and sample size determination address four deeply interconnected questions in research design:
- Power analysis (post-hoc): Given my sample size, effect size, and $\alpha$, what was the probability of detecting a true effect?
- Sample size calculation (a priori): Given a desired power, effect size, and $\alpha$, how many participants do I need?
- Minimum detectable effect: Given my sample size and $\alpha$, what is the smallest effect I have adequate power to detect?
- Sensitivity analysis: Given my sample size and $\alpha$, how does power vary across a range of plausible effect sizes?
These four questions form the four modes of power analysis. A priori sample size calculation — performed before data collection — is the most important and is the focus of most of this tutorial.
2.2 Why Power Analysis Matters
| Consequence of Ignoring Power | Effect |
|---|---|
| Underpowered study | High probability of Type II error; genuine effects missed; wasted resources |
| Overpowered study | Resources wasted; trivially small, practically meaningless effects declared significant |
| Post-hoc power gaming | Misleading; observed power computed from the observed effect size is approximately 50% whenever $p = \alpha$ |
| Non-replicable findings | Underpowered studies produce inflated effect size estimates (the "Winner's Curse") |
| Ethical implications | Exposing participants to risk or burden without adequate chance of meaningful results |
| Grant and ethics requirements | Most funding bodies and ethics committees require a priori power justification |
2.3 The Four Elements of Power Analysis
Every power analysis involves exactly four quantities, any one of which can be computed from the other three:
Power (1 - β) ←─────────────────────────────┐
│ │
Sample size (N) ─────────────────────► │
↕ │
Effect size (ES) ───────────────────► │
│ │
Significance level (α) ─────────────► │
└───────────────────────────────────┘
Specify any three → solve for the fourth.
2.4 Four Modes of Power Analysis
| Mode | Fixed Inputs | Solved Output | When Used |
|---|---|---|---|
| A priori | $\alpha$, $1-\beta$, effect size | $N$ | Before data collection (study planning) |
| Post-hoc | $\alpha$, $N$, effect size | $1-\beta$ | After data collection (result interpretation) |
| Criterion | $1-\beta$, $N$, effect size | $\alpha$ | Rare; sometimes used in quality control |
| Sensitivity | $\alpha$, $1-\beta$, $N$ | Effect size | Before or after collection; what can I detect? |
⚠️ Post-hoc power analysis computed using the observed effect size is widely regarded as uninformative and should be avoided. When $p > \alpha$, observed power will typically be low, but this is a mathematical consequence of the non-significant result, not an independent finding. Instead of post-hoc power, report the 95% confidence interval for the effect size and a sensitivity power analysis.
2.5 Desired Power: Choosing $1 - \beta$
The conventional target for statistical power is 0.80 (80%), implying a 20% Type II error rate. Higher power targets are increasingly recommended:
| Power ($1-\beta$) | $\beta$ | Context |
|---|---|---|
| $0.80$ | $0.20$ | Minimum conventional standard (Cohen, 1988) |
| $0.90$ | $0.10$ | Recommended for clinical trials; replication studies |
| $0.95$ | $0.05$ | High-stakes research; confirmatory studies |
| $0.99$ | $0.01$ | Safety-critical or regulatory contexts |
The ratio of Type I to Type II error rates is also informative:
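- At $\alpha = 0.05$ and power $= 0.80$: ratio $\beta / \alpha = 0.20 / 0.05 = 4$ (Type II errors are 4× more likely than Type I errors).
- At $\alpha = 0.05$ and power $= 0.95$: ratio $\beta / \alpha = 0.05 / 0.05 = 1$ (equal error rates).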
2.6 Real-World Applications
| Field | Context | Typical Power Target |
|---|---|---|
| Clinical Trials | RCT comparing drug vs. placebo | – |
| Psychology | Between-subjects experiment | – |
| Education Research | Intervention effectiveness study | |
| Epidemiology | Case-control study; cohort study | – |
| Genomics / GWAS | Association study () | |
| Marketing Research | A/B test for conversion rate | – |
| Quality Control | Detecting process shift | – |
| Pilot Studies | Feasibility; parameter estimation | – |
3. The Mathematics Behind Power Analysis
3.1 The General Power Framework
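For any test with test statistic $T$ and critical value $c_\alpha$:
$$\text{Power} = 1 - \beta = P(T > c_\alpha \mid H_1)$$
(for a two-tailed test, $|T|$ takes the place of $T$).
Under $H_1$, $T$ follows a non-central distribution with non-centrality parameter $\lambda$. Power is the probability that this non-centrally distributed test statistic exceeds the critical value determined under $H_0$.
The critical value is determined by $\alpha$ and the test type:
- Two-tailed: $c_\alpha = z_{1-\alpha/2}$ (e.g., $1.960$ for $\alpha = 0.05$)
- One-tailed: $c_\alpha = z_{1-\alpha}$ (e.g., $1.645$ for $\alpha = 0.05$)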
3.2 Power for the One-Sample z-Test
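The simplest case: testing whether a population mean $\mu$ equals a known value $\mu_0$, with known population SD $\sigma$ and sample size $n$.
Non-centrality parameter:
$$\delta = \frac{(\mu_1 - \mu_0)\sqrt{n}}{\sigma} = d\sqrt{n}$$
Where $d = (\mu_1 - \mu_0)/\sigma$ is Cohen's $d$ for the one-sample case.
Power (two-tailed):
$$1 - \beta = \Phi\left(\delta - z_{1-\alpha/2}\right) + \Phi\left(-\delta - z_{1-\alpha/2}\right)$$
For practical purposes (when $\delta$ is not near zero):
$$1 - \beta \approx \Phi\left(|\delta| - z_{1-\alpha/2}\right)$$
Where $\Phi$ is the standard normal CDF.
Sample size formula (solving for $n$):
$$n = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2$$
Where $z_{1-\beta}$ is the critical value for the desired power (e.g., $0.842$ for $1-\beta = 0.80$; $1.282$ for $1-\beta = 0.90$; $1.645$ for $1-\beta = 0.95$).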
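The one-sample $z$-test formulae can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `z_test_power` and `z_test_n` are hypothetical, and the power expression is the standard two-tailed normal result with non-centrality $d\sqrt{n}$.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def z_test_power(d, n, alpha=0.05):
    """Two-tailed power of the one-sample z-test."""
    delta = d * sqrt(n)                      # non-centrality parameter
    z_crit = Z.inv_cdf(1 - alpha / 2)        # critical value under H0
    return Z.cdf(delta - z_crit) + Z.cdf(-delta - z_crit)

def z_test_n(d, power=0.80, alpha=0.05):
    """Required n from the closed-form formula, rounded up."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return ceil(((z_a + z_b) / d) ** 2)

n = z_test_n(0.5)                            # medium effect, 80% power
print(n, round(z_test_power(0.5, n), 3))
print(n - 1, round(z_test_power(0.5, n - 1), 3))
```

Printing the power at both $n$ and $n - 1$ shows why the result is rounded up: one fewer observation drops the power just below the target.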
3.3 Power for the Two-Sample Independent t-Test
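Testing whether two population means differ ($H_0: \mu_1 = \mu_2$), with equal group sizes $n_1 = n_2 = n$ and pooled SD $\sigma$.
Cohen's $d$ (standardised mean difference):
$$d = \frac{\mu_1 - \mu_2}{\sigma}$$
Non-centrality parameter:
$$\delta = d\sqrt{\frac{n}{2}}$$
Sample size per group (approximate):
$$n \approx \frac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{d^2}$$
Unequal group sizes: Let $n_2 = \kappa n_1$ (allocation ratio $\kappa$). The total sample size is minimised when $\kappa = 1$ (equal groups). For unequal allocation:
$$n_1 = \frac{1 + \kappa}{\kappa} \cdot \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{d^2}, \qquad n_2 = \kappa n_1$$
⚠️ Unequal group sizes are less efficient than equal groups for a given total $N$. Unless there is a compelling reason (e.g., one group is more expensive to recruit, or a 2:1 allocation is ethically required), equal groups maximise power per participant.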
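The cost of unequal allocation can be illustrated with the normal-approximation formula for two independent groups, $n_1 = \frac{1+\kappa}{\kappa}\,(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2$ with $n_2 = \kappa n_1$. The sketch below is an assumption-laden illustration (hypothetical function name, normal approximation rather than the exact non-central $t$), so exact software will typically report one or two more participants per group.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()

def two_sample_n(d, ratio=1.0, power=0.80, alpha=0.05):
    """Approximate group sizes (n1, n2) with n2 = ratio * n1."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    n1 = ceil((1 + ratio) / ratio * (z_sum / d) ** 2)
    n2 = ceil(ratio * n1)
    return n1, n2

for ratio in (1, 2, 3):
    n1, n2 = two_sample_n(0.5, ratio)
    print(f"ratio {ratio}:1 -> n1={n1}, n2={n2}, total={n1 + n2}")
```

For a fixed effect size the total grows as the allocation moves away from 1:1, which is the efficiency loss the warning above describes.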
3.4 Power for the Paired t-Test
Testing whether the mean of paired differences equals zero.
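Cohen's $d_z$ (based on the SD of differences):
$$d_z = \frac{\mu_D}{\sigma_D}$$
Where $\sigma_D = \sqrt{\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2}$ and $\rho$ is the correlation between paired measurements.
The relationship to Cohen's $d$ for independent groups (assuming equal SDs, $\sigma_1 = \sigma_2 = \sigma$):
$$d_z = \frac{d}{\sqrt{2(1 - \rho)}}$$
This shows that paired designs gain efficiency whenever $\rho > 0$: the higher the correlation between paired measurements, the fewer pairs are needed relative to the per-group size of an independent design.
Sample size (number of pairs):
$$n = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d_z}\right)^2$$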
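To see how the correlation between paired measurements drives the required number of pairs, the sketch below converts an independent-groups $d$ to $d_z$ via $d_z = d / \sqrt{2(1-\rho)}$ (which assumes equal SDs at both measurements) and applies the normal-approximation formula $n = \left((z_{1-\alpha/2} + z_{1-\beta}) / d_z\right)^2$. The function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def paired_n(d, rho, power=0.80, alpha=0.05):
    """Approximate number of pairs for a paired t-test."""
    d_z = d / sqrt(2 * (1 - rho))            # effect size on the differences
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil((z_sum / d_z) ** 2)

for rho in (0.0, 0.5, 0.8):
    print(f"rho={rho}: pairs needed = {paired_n(0.5, rho)}")
```

With $d = 0.5$ the number of pairs falls from 63 at $\rho = 0$ to 13 at $\rho = 0.8$ under this approximation.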
3.5 Power for One-Way ANOVA
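Testing whether $k$ group means are equal ($H_0: \mu_1 = \mu_2 = \dots = \mu_k$; $H_1$: at least two means differ).
Cohen's $f$ (standardised SD of group means):
$$f = \frac{\sigma_m}{\sigma}, \qquad \sigma_m = \sqrt{\frac{1}{k}\sum_{j=1}^{k}\left(\mu_j - \bar{\mu}\right)^2}$$
Where $\sigma_m$ is the SD of the group means and $\sigma$ is the common within-group SD.
Relationship to $\eta^2$ (eta-squared, the proportion of variance explained):
$$f = \sqrt{\frac{\eta^2}{1 - \eta^2}}, \qquad \eta^2 = \frac{f^2}{1 + f^2}$$
Non-centrality parameter:
$$\lambda = f^2 N = f^2 k n$$
Under $H_1$, the F-statistic follows a non-central F distribution with numerator df $= k - 1$, denominator df $= N - k$, and non-centrality parameter $\lambda$.
Sample size per group (equal groups):
$$n = \frac{\lambda^*}{k f^2}$$
Where $\lambda^*$ is the non-centrality parameter needed to achieve the desired power at the specified $\alpha$, $df_1$, and $df_2$ (solved iteratively as $df_2$ depends on $n$).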
3.6 Power for Pearson Correlation
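Testing whether the population correlation $\rho = 0$ ($H_1: \rho \neq 0$).
Effect size: The population correlation coefficient $\rho$ itself.
Fisher's z-transformation: To stabilise the variance:
$$z_r = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right) = \operatorname{arctanh}(r)$$
Non-centrality parameter:
$$\delta = z_\rho\sqrt{N - 3}$$
Where $z_\rho = \operatorname{arctanh}(\rho)$.
Sample size:
$$N = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{z_\rho}\right)^2 + 3$$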
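The Fisher-z sample size formula $N = \left((z_{1-\alpha/2} + z_{1-\beta}) / \operatorname{arctanh}(\rho)\right)^2 + 3$ is easy to evaluate directly. The sketch below is an illustration with a hypothetical function name; it relies on the Fisher approximation, so exact methods may differ by a participant or two.

```python
from math import atanh, ceil
from statistics import NormalDist

Z = NormalDist()

def correlation_n(rho, power=0.80, alpha=0.05):
    """Approximate total N to detect correlation rho (two-tailed)."""
    z_rho = atanh(rho)                       # Fisher z-transform of rho
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil((z_sum / z_rho) ** 2 + 3)

for rho in (0.10, 0.30, 0.50):
    print(f"rho={rho:.2f}: N = {correlation_n(rho)}")
```

The steep rise in $N$ as $\rho$ shrinks reflects the inverse-square dependence of sample size on effect size.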
3.7 Power for the Chi-Square Test of Association
Testing whether two categorical variables are independent. See also the chi-square tutorial for additional detail.
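Effect size (Cohen's $w$):
$$w = \sqrt{\sum_{i}\frac{\left(p_{1i} - p_{0i}\right)^2}{p_{0i}}}$$
Where $p_{0i}$ and $p_{1i}$ are the cell proportions under $H_0$ and $H_1$. For $2 \times 2$ tables, $w = \phi$ (phi coefficient); for larger tables with $r$ rows and $c$ columns, $w$ is related to Cramér's $V$:
$$w = V\sqrt{\min(r, c) - 1}$$
Non-centrality parameter:
$$\lambda = N w^2$$
Under $H_1$, $\chi^2$ follows a non-central chi-square distribution with $df = (r - 1)(c - 1)$ and non-centrality parameter $\lambda$.
Sample size:
$$N = \frac{\lambda^*}{w^2}$$
Where $\lambda^*$ is the non-centrality parameter required for the target power at the given $\alpha$ and $df$.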
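For the special case of one degree of freedom (a $2 \times 2$ table), the non-central chi-square is the square of a shifted normal variable, so the required non-centrality is approximately $(z_{1-\alpha/2} + z_{1-\beta})^2$ and $N$ follows from $N = \lambda / w^2$. The sketch below covers only that case; larger tables need a non-central chi-square routine, which is not shown. Function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def cramers_v_to_w(v, rows, cols):
    """Convert Cramér's V to Cohen's w for an r x c table."""
    return v * sqrt(min(rows, cols) - 1)

def chi2_n_df1(w, power=0.80, alpha=0.05):
    """Approximate total N for a chi-square test with df = 1."""
    lam = (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) ** 2
    return ceil(lam / w ** 2)

print(chi2_n_df1(0.30))                      # medium effect, 2 x 2 table
print(round(cramers_v_to_w(0.20, 3, 4), 3))  # V = 0.20 in a 3 x 4 table
```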
3.8 Power for Proportion Tests
3.8.1 One-Sample Proportion Test
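Testing $H_0: p = p_0$ vs. $H_1: p = p_1$.
Cohen's $h$ (arcsine difference):
$$h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_0}$$
Sample size:
$$n = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{h}\right)^2$$
3.8.2 Two-Sample Proportion Test
Testing $H_0: p_1 = p_2$ vs. $H_1: p_1 \neq p_2$.
Cohen's $h$:
$$h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$$
Sample size per group:
$$n = 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{h}\right)^2$$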
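Cohen's $h$ (the arcsine difference $2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$) and the resulting sample sizes can be computed as follows. This is a minimal sketch with hypothetical function names, based on the arcsine normal approximation in which the one-sample $n$ is $(z_{1-\alpha/2} + z_{1-\beta})^2 / h^2$ and the two-sample per-group $n$ is twice that.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def cohens_h(p1, p2):
    """Arcsine-transformed difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

def proportion_n(p1, p2, two_sample=False, power=0.80, alpha=0.05):
    """n for a one-sample test, or n per group for a two-sample test."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    factor = 2 if two_sample else 1
    return ceil(factor * (z_sum / cohens_h(p1, p2)) ** 2)

print(round(cohens_h(0.60, 0.50), 4))
print(proportion_n(0.60, 0.50))                   # one-sample: p0 = 0.50
print(proportion_n(0.60, 0.50, two_sample=True))  # per group
```

A ten-point difference around 50% corresponds to $h \approx 0.20$, a "small" effect by Cohen's benchmarks, which is why the required samples are in the hundreds.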
3.9 Power for Multiple Regression
Testing whether a set of predictors explains a meaningful proportion of variance in an outcome, or whether a specific predictor contributes above and beyond others.
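Cohen's $f^2$ (for the overall model or incremental $R^2$):
$$f^2 = \frac{R^2}{1 - R^2}$$
For testing an increment in $R^2$ when adding new predictors:
$$f^2 = \frac{R^2_{\text{full}} - R^2_{\text{reduced}}}{1 - R^2_{\text{full}}}$$
Non-centrality parameter:
$$\lambda = f^2 N$$
Under $H_1$, the F-statistic for testing $q$ predictors follows a non-central F distribution with $df_1 = q$ and $df_2 = N - p - 1$ (where $p$ is the total number of predictors in the full model).
Sample size:
$$N = \frac{\lambda^*}{f^2}$$
Where $\lambda^*$ is the non-centrality parameter required for the target power (solved iteratively, since $df_2$ depends on $N$).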
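Converting an expected $R^2$ into Cohen's $f^2$ is the step most often done by hand. The sketch below assumes the definitions $f^2 = R^2 / (1 - R^2)$ for a whole model, $f^2 = (R^2_{\text{full}} - R^2_{\text{reduced}}) / (1 - R^2_{\text{full}})$ for an increment, and $\lambda = f^2 N$; the function names and example values are hypothetical.

```python
def f2_from_r2(r2):
    """Cohen's f^2 for the overall model."""
    return r2 / (1 - r2)

def f2_increment(r2_full, r2_reduced):
    """Cohen's f^2 for predictors added to a reduced model."""
    return (r2_full - r2_reduced) / (1 - r2_full)

def noncentrality(f2, n_total):
    """Non-centrality parameter lambda = f^2 * N."""
    return f2 * n_total

print(round(f2_from_r2(0.13), 4))            # close to Cohen's 'medium'
print(round(f2_increment(0.30, 0.25), 4))    # five extra points of R^2
print(round(noncentrality(f2_from_r2(0.13), 100), 2))
```

Note that the incremental $f^2$ divides by the unexplained variance of the full model, so the same $\Delta R^2$ yields a larger $f^2$ when the full model already explains a lot.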
3.10 Summary of Key Sample Size Formulae
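| Test | Effect Size | Sample Size Formula |
|---|---|---|
| One-sample z/t | $d$ | $n = \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 / d^2$ |
| Two-sample t (equal groups) | $d$ | $n = 2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 / d^2$ per group |
| Paired t | $d_z$ | $n = \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 / d_z^2$ pairs |
| Pearson correlation | $\rho$ ($z_\rho = \operatorname{arctanh}\rho$) | $N = \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 / z_\rho^2 + 3$ |
| One proportion | $h$ | $n = \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 / h^2$ |
| Two proportions | $h$ | $n = 2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 / h^2$ per group |
| Chi-square | $w$ | $N = \lambda^* / w^2$ |
| ANOVA | $f$ | Iterative (non-central F) |
| Regression | $f^2$ | $N = \lambda^* / f^2$ (iterative) |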
3.11 The Power Curve
The power curve plots power ($1 - \beta$) as a function of one of the four analysis inputs while holding the others constant. The most common power curve plots:
- Power vs. $N$: Shows how power increases as sample size grows. For most tests, power rises steeply at first and then plateaus. Used to identify the point of diminishing returns.
- Power vs. effect size: Shows how power changes across a range of effect sizes for a fixed $N$. Used for sensitivity analysis.
- Power vs. $\alpha$: Shows the trade-off between Type I and Type II error rates.
The power curve always satisfies:
- Power $\to \alpha$ as $N \to 0$ or effect size $\to 0$ (the test then rejects only at the false positive rate).
- Power $\to 1$ as $N \to \infty$ or effect size $\to \infty$.
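The shape and limits of the power curve can be tabulated without any plotting library. The sketch below uses the two-tailed one-sample $z$-test (non-centrality $d\sqrt{n}$) as a stand-in for any test; it is an illustrative assumption, not the calculator's method, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power(d, n, alpha=0.05):
    """Two-tailed power of a one-sample z-test."""
    delta = d * sqrt(n)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(delta - z_crit) + Z.cdf(-delta - z_crit)

# Power vs. N at a fixed effect size: steep rise, then a plateau.
for n in (5, 10, 20, 40, 80, 160):
    print(f"N={n:4d}  power={power(0.5, n):.3f}")

# Limits: power equals alpha at zero effect and approaches 1 for large N.
print(round(power(0.0, 50), 3), round(power(0.5, 10_000), 3))
```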
4. Considerations and Planning Checklist
4.1 Specifying the Effect Size: The Critical Decision
The effect size is the single most consequential input to a power analysis. Poorly specified effect sizes lead to either seriously underpowered or wastefully overpowered studies. Use the following hierarchy of evidence for effect size specification:
| Priority | Source | Description |
|---|---|---|
| 1 | Domain-specific minimum effect of interest (SESOI) | The smallest effect that would be practically meaningful. Defined by theory, clinical guidelines, or cost-benefit analysis. |
| 2 | Prior studies or meta-analyses | Effect sizes from published research on the same or very similar questions. Apply a discount for publication bias. |
| 3 | Pilot study | A small preliminary study; note that pilot effect size estimates are imprecise and should be used with caution. |
| 4 | Expert opinion or theoretical prediction | Informed estimates from domain experts or mathematical models. |
| 5 | Cohen's conventions | Use as a last resort only. Small/medium/large benchmarks as described in Section 1.6. |
⚠️ Do not use the effect size from a pilot study directly. Pilot effect sizes are estimated from small samples and are highly unstable. The pilot effect size will often overestimate the true effect (publication bias in miniature). Use pilot data to confirm feasibility and estimate nuisance parameters (e.g., SD, ICC), not to determine the effect size for the power calculation.
4.2 Choosing the Significance Level ($\alpha$)
The choice of $\alpha$ should be deliberate and justified:
| Context | Recommended $\alpha$ | Rationale |
|---|---|---|
| Standard social/behavioural science | $0.05$ | Convention; acceptable Type I:II error ratio |
| Clinical trial (efficacy) | $0.05$ (two-sided) or $0.025$ (one-sided) | Regulatory convention |
| Safety outcomes | $0.01$ or smaller | Consequences of false positives are severe |
| Exploratory / hypothesis-generating | $0.10$ | Higher sensitivity acceptable |
| Multiple primary outcomes | $\alpha / m$ (Bonferroni) | Controlling familywise error rate |
| Genomics / GWAS | $5 \times 10^{-8}$ | Multiple testing across millions of SNPs |
| Equivalence testing | $0.05$ (but applied differently) | TOST framework |
4.3 Choosing the Power Target ($1 - \beta$)
The power target should balance the cost of missing a real effect against the cost of increasing sample size:
| Consider Higher Power When | Consider Lower Power When |
|---|---|
| The cost of a false negative is high (clinical safety) | Resources are very limited |
| The study is confirmatory and pre-registered | The study is exploratory |
| The effect is expected to be small | The effect is expected to be large |
| The study aims to replicate prior findings | A pilot study to assess feasibility |
| Regulatory approval depends on the result | Multiple outcomes with confirmatory follow-up planned |
4.4 Identifying the Primary Outcome and Test
Power analysis is conducted for a single primary hypothesis and its associated primary test. Secondary hypotheses should have their own power analyses if they are to be formally tested.
Before beginning the analysis, clearly specify:
- What is the primary outcome variable (and its scale)?
- What is the primary comparison or test (e.g., mean difference, correlation)?
- What is the statistical test that will be applied?
- Is the test one-tailed or two-tailed?
- What are the key assumptions (e.g., equal variances, paired vs. independent)?
4.5 Accounting for Anticipated Attrition and Missing Data
In longitudinal studies or clinical trials, participants drop out or produce missing data. The required sample size at enrollment must account for this:
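$$N_{\text{enroll}} = \frac{N_{\text{required}}}{1 - a}$$
Where $a$ is the expected attrition rate (as a proportion).
Example: A study that needs $N$ completers and expects 15% attrition must enrol $\lceil N / (1 - 0.15) \rceil = \lceil N / 0.85 \rceil$ participants, roughly 18% more than the number of completers required.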
For multi-wave longitudinal studies, apply the attrition correction at each wave or use the cumulative attrition rate.
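The attrition adjustment amounts to dividing the required number of completers by the expected retention. The sketch below assumes $N_{\text{enroll}} = N / (1 - a)$ and, for multiple waves, a cumulative retention equal to the product of per-wave retention rates; the figure of 128 completers is a hypothetical illustration, as are the function names.

```python
from math import ceil, prod

def enrollment_n(n_completers, attrition):
    """Participants to enrol so that n_completers remain."""
    return ceil(n_completers / (1 - attrition))

def cumulative_attrition(per_wave_rates):
    """Overall attrition implied by independent per-wave dropout rates."""
    return 1 - prod(1 - a for a in per_wave_rates)

print(enrollment_n(128, 0.15))                      # single 15% attrition
two_waves = cumulative_attrition([0.10, 0.10])      # 10% lost at each wave
print(round(two_waves, 2), enrollment_n(128, two_waves))
```

Dividing by $1 - a$ rather than multiplying by $1 + a$ matters: adding 15% to 128 gives 148, which would leave fewer than 128 completers after 15% dropout.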
4.6 Accounting for Stratification and Clustering
In studies with complex sampling designs:
- Stratified designs: Power analysis proceeds separately within strata, then sample sizes are combined.
- Clustered designs (e.g., school classes, clinical sites): The design effect (DEFF) inflates the required sample size to account for within-cluster correlation (intraclass correlation, ICC):
$$\text{DEFF} = 1 + (\bar{m} - 1)\,\rho$$
Where $\bar{m}$ is the average cluster size and $\rho$ is the intraclass correlation coefficient. For clustered randomised trials (CRTs):
| ICC | $\bar{m} = 10$ | $\bar{m} = 20$ | $\bar{m} = 30$ |
|---|---|---|---|
| 0.01 | DEFF = 1.09 | 1.19 | 1.29 |
| 0.05 | DEFF = 1.45 | 1.95 | 2.45 |
| 0.10 | DEFF = 1.90 | 2.90 | 3.90 |
| 0.20 | DEFF = 2.80 | 4.80 | 6.80 |
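The design-effect table can be reproduced, and applied to a sample size, with the formula $\text{DEFF} = 1 + (\bar{m} - 1)\rho$. The sketch below is illustrative only; the unadjusted $N = 128$ and the function names are hypothetical.

```python
from math import ceil

def design_effect(cluster_size, icc):
    """DEFF = 1 + (m - 1) * ICC for average cluster size m."""
    return 1 + (cluster_size - 1) * icc

def clustered_n(n_individual, cluster_size, icc):
    """Return (inflated total N, number of clusters needed)."""
    n_total = ceil(n_individual * design_effect(cluster_size, icc))
    return n_total, ceil(n_total / cluster_size)

for icc in (0.01, 0.05, 0.10, 0.20):
    row = [f"{design_effect(m, icc):.2f}" for m in (10, 20, 30)]
    print(icc, row)

print(clustered_n(128, 20, 0.05))   # hypothetical unadjusted N = 128
```

Even a modest ICC of 0.05 nearly doubles the required sample when clusters contain 20 participants.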
4.7 Multiple Testing Corrections
When testing $m$ hypotheses simultaneously, control the familywise error rate (FWER) or false discovery rate (FDR):
Bonferroni correction (FWER):
$$\alpha_{\text{adj}} = \frac{\alpha}{m}$$
For each test, use $\alpha_{\text{adj}}$ in the power analysis. This increases the required sample size substantially for large $m$.
Holm-Bonferroni (less conservative): Apply a sequential correction; compute power for the $j$-th most significant test using $\alpha / (m - j + 1)$.
Benjamini-Hochberg (FDR): Controls the expected proportion of false positives among significant results. Less conservative than Bonferroni for large-scale testing.
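The sample size cost of a Bonferroni correction can be quantified by substituting $\alpha / m$ into the sample size formula. The sketch below does this for a two-sample comparison using the normal approximation $n = 2(z_{1-\alpha/(2m)} + z_{1-\beta})^2 / d^2$ per group, and also lists the Holm step-down thresholds $\alpha / (m - j + 1)$. Function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()

def n_per_group(d, m=1, power=0.80, alpha=0.05):
    """Per-group n for a two-sample test with Bonferroni over m tests."""
    z_a = Z.inv_cdf(1 - (alpha / m) / 2)
    z_b = Z.inv_cdf(power)
    return ceil(2 * ((z_a + z_b) / d) ** 2)

def holm_thresholds(alpha, m):
    """Thresholds for the 1st, 2nd, ..., m-th smallest p-value."""
    return [alpha / (m - j + 1) for j in range(1, m + 1)]

for m in (1, 3, 5):
    print(f"m={m}: n per group = {n_per_group(0.5, m)}")
print([round(t, 4) for t in holm_thresholds(0.05, 3)])
```

Only the first Holm threshold equals the Bonferroni value; later thresholds are more lenient, which is why powering every test at $\alpha / m$ is the conservative planning choice.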
4.8 Reporting the Power Analysis: Documentation Standards
A complete power analysis report must include:
| Element | Description |
|---|---|
| Analysis type | A priori, post-hoc, sensitivity, or criterion |
| Statistical test | Exact test and variant used |
| Effect size and justification | Value, measure used, and source/rationale |
| Significance level ($\alpha$) | Value and directionality (one- or two-tailed) |
| Desired power ($1 - \beta$) | Value and rationale |
| Computed sample size | Total and per-group if applicable |
| Attrition/non-compliance adjustment | If applicable |
| Design effect | If clustered or stratified |
| Multiple testing correction | If multiple primary outcomes |
| Software and version | e.g., DataStatPro v4.2 |
5. Power Analysis for Common Statistical Tests
5.1 One-Sample t-Test
Research question: Does the population mean $\mu$ differ from a known or hypothesised value $\mu_0$?
Effect size: $d = (\mu_1 - \mu_0) / \sigma$
Key inputs:
- $\mu_0$: Hypothesised value under $H_0$
- $\mu_1$: Expected true mean under $H_1$
- $\sigma$: Population or estimated SD
Power formula: Based on non-central t-distribution with $df = n - 1$ and non-centrality parameter $\delta = d\sqrt{n}$.
5.2 Two-Sample Independent t-Test
Research question: Do two independent group means differ?
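Effect size: $d = (\mu_1 - \mu_2) / \sigma_{\text{pooled}}$
Key inputs:
- $\mu_1$, $\mu_2$: Expected means for each group
- $\sigma_{\text{pooled}}$: Pooled within-group SD
- $\kappa = n_2 / n_1$: Allocation ratio (default $\kappa = 1$, equal groups)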
Assumptions for power analysis:
- Equal variances (if Welch correction is planned, add to )
- Normally distributed outcomes within groups
5.3 Paired t-Test
Research question: Does the mean of within-subject or matched-pair differences differ from zero?
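Effect size: $d_z = \mu_D / \sigma_D$
Key additional input:
- $\rho$: Expected correlation between paired measurements (used to derive $\sigma_D$ from $\sigma_1$ and $\sigma_2$)
Efficiency gain over independent t-test (equal SDs):
$$\frac{n_{\text{pairs}}}{n_{\text{per group}}} = 1 - \rho$$
A within-subjects design with $\rho = 0.5$ requires approximately half as many pairs as an independent-groups design requires participants per group for the same power, and therefore about a quarter of the total participants.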
5.4 One-Way ANOVA (Fixed Effects)
Research question: Do $k$ group means differ?
Effect size: $f = \sigma_m / \sigma$
Key inputs:
- $k$: Number of groups
- $\mu_1, \dots, \mu_k$: Expected group means (or $\sigma_m$, the SD of means)
- $\sigma$: Common within-group SD
- $n_j$: Equal or specified group sizes
Important: ANOVA power analysis requires specifying the pattern of means (which groups differ by how much), not just the overall effect size. The same $f$ can arise from very different mean patterns.
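The point about mean patterns can be made concrete by computing Cohen's $f$ as the population SD of the group means divided by the common within-group SD. In the sketch below, two quite different hypothetical patterns (an even spread versus a single deviating group) yield the same $f$; the function names and all numbers are illustrative assumptions.

```python
from math import sqrt
from statistics import pstdev

def cohens_f(group_means, sigma):
    """Cohen's f = SD of group means / within-group SD (equal n)."""
    return pstdev(group_means) / sigma

def eta_squared(f):
    """Proportion of variance explained implied by f."""
    return f ** 2 / (1 + f ** 2)

evenly_spread = [10, 12, 14]
one_group_differs = [10, 10, 10 + 2 * sqrt(3)]

f_a = cohens_f(evenly_spread, sigma=5)
f_b = cohens_f(one_group_differs, sigma=5)
print(round(f_a, 4), round(f_b, 4), round(eta_squared(f_a), 4))
print(round(f_a ** 2 * 90, 2))   # non-centrality for N = 90 in total
```

Although the omnibus power is identical for both patterns, the power of follow-up pairwise comparisons will differ, which is one reason to specify the means rather than $f$ alone.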
5.5 Factorial ANOVA
Research question: Do main effects and/or interactions exist in a factorial design?
Each effect (main effect A, main effect B, interaction A×B) has its own effect size and its own power analysis. Key additional considerations:
- Power for interaction effects is typically much lower than for main effects of the same nominal magnitude.
- Interaction effect sizes should be estimated directly (not derived from main effects).
- For a $2 \times 2$ factorial design, the interaction effect size for a crossover interaction is often set to half the main effect size as a conservative estimate.
5.6 Repeated Measures ANOVA
Research question: Does a measured variable change across time points or conditions within subjects?
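Effect size: Cohen's $f$ based on within-person variance.
Key additional input:
- $\rho$: Average correlation among repeated measures (intraclass correlation). Higher $\rho$ → greater power advantage of repeated measures over independent groups.
Non-centrality parameter adjustment for repeated measures (within-subjects effect, assuming sphericity):
$$\lambda = \frac{f^2 N m}{1 - \rho}$$
Where $N$ is the number of subjects and $m$ is the number of repeated measurements.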
5.7 Pearson Correlation
Research question: Is there a linear relationship between two continuous variables?
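Effect size: $\rho$ (population correlation coefficient)
Sample size:
$$N = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{z_\rho}\right)^2 + 3$$
Where $z_\rho = \operatorname{arctanh}(\rho)$.
Note: Power for correlation tests is low for small correlations. For example, detecting $\rho = 0.10$ with 80% power at $\alpha = 0.05$ (two-tailed) requires $N \approx 783$ by this formula.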
5.8 Multiple Regression
Research question: Does a set of predictors explain variance in the outcome? Or does adding predictors significantly improve prediction?
Effect size: $f^2 = R^2 / (1 - R^2)$ (overall model) or $f^2 = \Delta R^2 / (1 - R^2_{\text{full}})$ (incremental)
Key inputs:
- $q$: Number of predictors being tested
- $p$: Total predictors in the full model
- $R^2$ or $\Delta R^2$: Expected variance explained
Important: Power in multiple regression depends on the number of predictors being tested ($q$), not on the total model. Testing a single predictor in a model with many covariates uses $q = 1$; testing the full model uses $q = p$.
5.9 Chi-Square Test of Association
Research question: Are two categorical variables associated?
Effect size: Cohen's $w$ (related to Cramér's $V$: $w = V\sqrt{\min(r, c) - 1}$)
Key inputs:
- Table dimensions: $r$ rows, $c$ columns
- Expected cell proportions under $H_1$
Important: For $2 \times 2$ tables, $w = \phi$ and the formula simplifies to the two-proportion case. For larger tables, specifying the full expected cell proportion matrix provides the most accurate power estimate.
5.10 Test Type Comparison Table
| Test | Effect Size | df | Non-Central Dist. | Equal Groups Optimal? |
|---|---|---|---|---|
| One-sample t | $d$ | $n - 1$ | Non-central t | N/A |
| Two-sample t | $d$ | $n_1 + n_2 - 2$ | Non-central t | Yes |
| Paired t | $d_z$ | $n - 1$ | Non-central t | N/A |
| One-way ANOVA | $f$; $\eta^2$ | $k - 1$, $N - k$ | Non-central F | Yes |
| Correlation | $\rho$ | $N - 2$ | Non-central t | N/A |
| Multiple regression | $f^2$; $R^2$ | $q$, $N - p - 1$ | Non-central F | N/A |
| Chi-square | $w$ | $(r - 1)(c - 1)$ | Non-central $\chi^2$ | Yes |
| Proportion (one) | $h$ | — | Normal | N/A |
| Proportion (two) | $h$ | — | Normal | Yes |
6. Using the Sample Size and Power Analysis Calculator Component
The Sample Size and Power Analysis Calculator in DataStatPro provides a comprehensive tool for conducting, visualising, and reporting power analyses for all common statistical tests.
Step-by-Step Guide
Step 1 — Navigate to the Component
Go to Study Design → Sample Size and Power Analysis.
Step 2 — Select the Analysis Mode
Choose one of the four analysis modes:
- A priori: Compute required sample size.
- Post-hoc: Compute achieved power.
- Sensitivity: Compute minimum detectable effect size.
- Criterion: Compute required (advanced use).
Step 3 — Select the Statistical Test
Choose the test family and specific test from the hierarchical menu:
- Mean Tests
- One-Sample t-Test
- Two-Sample Independent t-Test
- Paired t-Test
- One-Way ANOVA
- Factorial ANOVA (Two-Way, Three-Way)
- Repeated Measures ANOVA
- Association Tests
- Pearson Correlation
- Spearman Correlation
- Multiple Regression (Linear)
- Categorical Tests
- Chi-Square Test of Association
- Chi-Square Goodness-of-Fit
- One-Sample Proportion Test
- Two-Sample Proportion Test
- Survival Analysis
- Log-Rank Test
- Cox Proportional Hazards
- Advanced
- Equivalence Test (TOST)
- Non-Inferiority Test
- Clustered Design (CRT)
- Generic Non-Central Distribution
Step 4 — Specify Effect Size
Choose your effect size specification method:
- Direct entry: Enter the effect size measure directly (e.g., $d = 0.50$).
- From parameters: Enter the raw parameters (e.g., $\mu_1$, $\mu_2$, $\sigma$) and DataStatPro computes the effect size automatically.
- From proportions: Enter $p_1$ and $p_2$ for proportion tests; DataStatPro computes Cohen's $h$ automatically.
- From expected table: Enter the full expected cell proportion matrix for chi-square tests.
- Effect size calculator: Use DataStatPro's built-in effect size converter to transform between $d$, $r$, $f$, $f^2$, $\eta^2$, OR, and RR.
Step 5 — Specify Remaining Parameters
Depending on the analysis mode, enter the known quantities:
- Significance level (): Default ; specify or if needed.
- Desired power (): Default ; options , , , , , or custom.
- Directionality: Two-tailed (default) or one-tailed.
- Number of groups / predictors: As applicable to the selected test.
- Allocation ratio : For two-group tests (default ).
- ICC and cluster size: For clustered designs.
- Attrition rate: For enrollment adjustment.
Step 6 — Set Display Options
- ✅ Primary result: Required $N$ (or power, or MDE) with exact formula.
- ✅ Enrollment (attrition-adjusted).
- ✅ Per-group breakdown.
- ✅ Power curve: Power vs. $N$ for current effect size and $\alpha$.
- ✅ Sensitivity curve: Power vs. effect size for current $N$ and $\alpha$.
- ✅ Power contour plot: Power as a function of both $N$ and effect size.
- ✅ Non-centrality parameter and critical value.
- ✅ Type I error rate ($\alpha$), Type II error rate ($\beta$), power ($1 - \beta$).
- ✅ Summary table: Power at , , .
- ✅ Design effect and ICC-adjusted $N$ (for clustered designs).
- ✅ APA 7th edition power analysis paragraph (auto-generated).
Step 7 — Run the Analysis
Click "Compute Sample Size / Power". DataStatPro will:
- Convert effect size inputs to the required format (apply transformations if needed).
- Solve for the requested output using exact non-central distribution methods.
- Apply attrition, ICC, and multiple testing corrections if specified.
- Generate power curve, sensitivity curve, and contour plot.
- Produce the APA-compliant power analysis reporting paragraph.
7. Step-by-Step Procedure
7.1 Full Manual Procedure for A Priori Sample Size Calculation
Step 1 — Identify the Primary Research Question
State the primary outcome, the comparison of interest, and the direction of the hypothesised effect. Confirm the appropriate statistical test.
Step 2 — Choose the Significance Level and Directionality
State and justify the choice. Specify whether the test is one-tailed or two-tailed, with justification.
Identify $z_{1-\alpha/2}$ (two-tailed) or $z_{1-\alpha}$ (one-tailed) from the standard normal distribution:
| $\alpha$ | $z_{1-\alpha/2}$ (two-tailed) | $z_{1-\alpha}$ (one-tailed) |
|---|---|---|
| 0.10 | 1.645 | 1.282 |
| 0.05 | 1.960 | 1.645 |
| 0.025 | 2.241 | 1.960 |
| 0.01 | 2.576 | 2.326 |
| 0.001 | 3.291 | 3.090 |
Step 3 — Choose the Power Target
State the desired power and justify the choice. Identify $z_{1-\beta}$:
| Power ($1 - \beta$) | $z_{1-\beta}$ |
|---|---|
| 0.70 | 0.524 |
| 0.80 | 0.842 |
| 0.85 | 1.036 |
| 0.90 | 1.282 |
| 0.95 | 1.645 |
| 0.99 | 2.326 |
Step 4 — Specify and Justify the Effect Size
State the effect size measure, its numerical value, and the source or rationale. Convert raw parameters to a standardised effect size using the appropriate formula (Section 3).
Step 5 — Apply the Sample Size Formula
Substitute the values of $z_{1-\alpha/2}$ (or $z_{1-\alpha}$), $z_{1-\beta}$, and the effect size into the appropriate formula from Section 3.10. Round up to the nearest whole number.
⚠️ Always round the required $N$ UP, never down. Rounding down results in a study with slightly less power than targeted.
Step 6 — Adjust for Unequal Groups (If Applicable)
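For two-group designs with unequal allocation (ratio $\kappa = n_2 / n_1$):
$$n_1 = \frac{1 + \kappa}{\kappa} \cdot \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{d^2}, \qquad n_2 = \kappa n_1$$
Verify that the total $N = n_1 + n_2$ achieves the target power with exact non-central distribution methods.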
Step 7 — Adjust for Attrition
$$N_{\text{enroll}} = \left\lceil \frac{N}{1 - a} \right\rceil$$
Where $a$ is the expected attrition rate as a proportion (Section 4.5).
Step 8 — Adjust for Clustering (If Applicable)
$$N_{\text{clustered}} = N \times \text{DEFF} = N\left[1 + (\bar{m} - 1)\,\rho\right]$$
Where $\bar{m}$ is the average cluster size and $\rho$ is the ICC (Section 4.6).
Number of clusters:
$$\left\lceil \frac{N_{\text{clustered}}}{\bar{m}} \right\rceil$$
Step 9 — Adjust for Multiple Testing (If Applicable)
Replace $\alpha$ with $\alpha / m$ in the sample size formula (Bonferroni), where $m$ is the number of primary hypotheses.
Step 10 — Verify with Power Curve
Using the computed $N$, confirm the achieved power with exact non-central distribution calculations. Plot the power curve to show power at values of $N$ above and below the target. Confirm the achieved power is at or above the target.
Step 11 — Conduct Sensitivity Analysis
Report the minimum detectable effect size (MDE) at the computed $N$:
$$d_{\text{MDE}} = \left(z_{1-\alpha/2} + z_{1-\beta}\right)\sqrt{\frac{2}{n}}$$
(two-sample t, $n$ per group)
This tells stakeholders the smallest effect the study is designed to detect.
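The minimum detectable effect for a two-sample comparison follows from inverting the sample size formula, giving $d_{\text{MDE}} = (z_{1-\alpha/2} + z_{1-\beta})\sqrt{2/n}$ with $n$ per group. The sketch below uses that normal approximation; the within-group SD of 10 units used to convert back to raw units is a hypothetical value, as is the function name.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def mde_two_sample(n_per_group, power=0.80, alpha=0.05):
    """Smallest standardised difference detectable with the given power."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return z_sum * sqrt(2 / n_per_group)

sd = 10.0  # hypothetical within-group SD in raw units
for n in (32, 64, 128, 256):
    d = mde_two_sample(n)
    print(f"n={n:3d} per group: MDE d = {d:.3f} ({d * sd:.2f} raw units)")
```

Because the MDE shrinks with $\sqrt{n}$, halving it requires roughly four times as many participants.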
Step 12 — Document and Report
Compile all inputs and outputs into a complete power analysis report (APA format provided in Section 15). Retain all working for audit and reproducibility.
8. Interpreting the Output
8.1 Reading the Required Sample Size
| Output Feature | Interpretation |
|---|---|
| Total $N$ | Minimum valid observations needed to complete the analysis |
| Per-group $n$ | Number needed in each arm (for multi-group tests) |
| Enrollment $N$ | Total $N$ inflated for attrition; the number of participants to recruit |
| Exact achieved power | Power at the computed (rounded-up) $N$; should be $\geq$ target |
| Power at $N - 1$ | Confirms one fewer participant would fall below the target power |
8.2 Understanding the Power Curve
| Feature of the Power Curve | Meaning |
|---|---|
| Steep rise at low $N$ | Each additional participant greatly increases power in this range |
| Plateau at high $N$ | Diminishing returns; additional participants add little power |
| Power curve above target line | Current $N$ meets or exceeds the power requirement |
| Power curve crossing 0.50 | This $N$ yields coin-flip odds of detecting the true effect |
| Power at effect size $= 0$ | Equals $\alpha$ (the false positive rate); power cannot fall below this floor |
8.3 Sensitivity Output: Minimum Detectable Effect
The minimum detectable effect (MDE) is the smallest effect the study has the specified power to detect:
| MDE Interpretation | Action |
|---|---|
| MDE $<$ SESOI | Study is adequately powered to detect the smallest meaningful effect |
| MDE $=$ SESOI | Study is precisely powered; just barely detects the minimum meaningful effect |
| MDE $>$ SESOI | Study is underpowered; cannot reliably detect the minimum meaningful effect |
Report the MDE in original units (not just in standardised form) to make it interpretable to domain experts who may not be familiar with Cohen's $d$.
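8.4 The Non-Centrality Parameter ($\lambda$)
The non-centrality parameter summarises the total signal in the study: it captures both the effect size and the sample size. For example, for a two-sample t-test with $n$ per group, $\lambda = d\sqrt{n/2}$.
| $\lambda$ Value | Meaning |
|---|---|
| $\lambda = 0$ | Power $= \alpha$; study cannot distinguish $H_1$ from $H_0$ |
| $\lambda \approx z_{1-\alpha/2} + z_{1-\beta}$ | Power is at the target level (normal approximation, t-test scale) |
| $\lambda$ large | High power; test statistic distribution under $H_1$ well separated from critical value |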
8.5 Interpreting Achieved Post-Hoc Power
Post-hoc power (computed after data collection using the observed effect size) has a deterministic relationship with the p-value:
| Post-Hoc Power Relationship | Meaning |
|---|---|
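| $p = \alpha$ exactly | Post-hoc power $\approx 0.50$ (a mathematical consequence, exact for a one-sided $z$-test) |
| $p < \alpha$ | Post-hoc power $> 0.50$ |
| $p > \alpha$ | Post-hoc power $< 0.50$ |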
Because of this mathematical relationship, post-hoc power using the observed effect size adds no information beyond the p-value. Instead, report:
- The 95% CI for the effect size.
- A sensitivity analysis: "What is the power for a range of plausible true effect sizes?"
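The link between the p-value and post-hoc power can be illustrated for a two-sided $z$-test, where the observed test statistic is recovered from the p-value and then treated as if it were the true non-centrality. This is a sketch under that $z$-test assumption (t-based tests behave similarly but not identically), and `observed_power` is a hypothetical helper name.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Post-hoc power implied by a two-sided z-test p-value."""
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p_value / 2)   # |z| that produced this p-value
    z_crit = z.inv_cdf(1 - alpha / 2)    # rejection threshold
    # Probability of landing in either rejection tail if the true shift equals z_obs
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


for p in (0.001, 0.01, 0.05, 0.20, 0.50):
    print(p, round(observed_power(p), 3))
```

Because the output is a fixed function of `p_value` and `alpha`, it carries no information that the p-value does not already contain.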
8.6 Interpreting the Contour Plot
The power contour plot displays power as a function of both $n$ and effect size simultaneously, with contour lines at specific power levels (e.g., 0.60, 0.70, 0.80, 0.90, 0.95):
| Region of the Contour Plot | Interpretation |
|---|---|
| Above the 0.80 contour | Combinations of $n$ and effect size where power $\geq 0.80$ |
| Below the 0.80 contour | Underpowered for those $n$ and effect size combinations |
| Current study position | Marked on the plot; shows where the study falls relative to power targets |
| Steep contours | Power changes rapidly with $n$ in this region |
| Flat contours | Diminishing returns; large increases in $n$ needed for modest power gains |
9. Visualising Power and Sample Size
9.1 Power Curve (Power vs. Sample Size)
The power curve is the primary visualisation for a priori power analysis. It plots the statistical power ($y$-axis) as a function of sample size ($x$-axis) for fixed $\alpha$ and effect size.
Key annotations on the DataStatPro power curve:
- A horizontal dashed line at the target power level (e.g., 0.80).
- A vertical dashed line at the required $n$.
- The intersection point highlighted and labelled with exact $n$ and power.
- Shaded region below the target power: underpowered zone.
- Optional: Multiple curves for different effect sizes or $\alpha$ levels.
Best practices:
- Set the x-axis range well beyond the required $n$ to show the full shape of the curve.
- Annotate the MDE at the required $n$.
- Use a logarithmic x-axis when the required $n$ is very large to avoid compression of the curve at small $n$.
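The points behind such a curve can be sketched with a normal approximation to the two-sample t-test, where power is $\Phi(d\sqrt{n/2} - z_{1-\alpha/2})$ plus the negligible opposite tail. The function names are illustrative, and the approximation returns a required $n$ about one participant per group smaller than an exact non-central t calculation.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def power_two_sample(n_per_group, d, alpha=0.05):
    """Approximate two-tailed power for a two-sample comparison."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)  # non-centrality on the z scale
    return Z.cdf(ncp - z_crit) + Z.cdf(-ncp - z_crit)


def required_n(d, alpha=0.05, target=0.80):
    """Smallest per-group n whose approximate power reaches the target."""
    n = 2
    while power_two_sample(n, d, alpha) < target:
        n += 1
    return n


# Points for a power curve at an assumed d = 0.5
curve = [(n, round(power_two_sample(n, 0.5), 3)) for n in range(10, 131, 20)]
print(curve)
print("required n per group:", required_n(0.5))
```

Plotting `curve` with a horizontal line at the target and a vertical line at `required_n` reproduces the annotations listed above.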
9.2 Sensitivity Curve (Power vs. Effect Size)
The sensitivity curve plots power ($y$-axis) as a function of effect size ($x$-axis) for a fixed $n$ and $\alpha$.
Use cases:
- Assessing robustness: "What happens to power if the true effect is smaller than anticipated?"
- Reporting minimum detectable effect: The effect size at which the curve crosses the target power line.
- Communicating uncertainty about the effect size to stakeholders.
Best practices:
- Mark Cohen's small/medium/large benchmarks as vertical reference lines.
- Annotate the MDE (the effect size where the curve intersects the target power line).
- Shade the "adequately powered" region (to the right of the MDE).
9.3 Power Contour Plot (Power as a Function of $n$ and Effect Size)
The contour plot provides the most comprehensive two-dimensional view of how power depends on both sample size and effect size. Contour lines connect combinations of $(n, d)$ that yield equal power.
Reading the contour plot:
- The study's planned $(n, d)$ combination is marked.
- Power target contour (e.g., 0.80) divides the plot into adequate and inadequate power regions.
- Researchers can identify the trade-off: increasing the effect size estimate by a given amount allows reducing $n$ by a corresponding amount while maintaining power.
9.4 Error Rate Trade-Off Plot
The error rate trade-off plot visualises the relationship between $\alpha$ (Type I error rate) and $\beta$ (Type II error rate $= 1 -$ power) for a fixed $n$ and effect size:
- As $\alpha$ decreases (stricter threshold), $\beta$ increases (lower power).
- The optimal trade-off depends on the relative costs of Type I and Type II errors in the specific application.
Useful for:
- Choosing between candidate significance levels (e.g., $\alpha = .05$ vs. $\alpha = .01$) given limited sample size.
- Demonstrating to reviewers the implications of changing the significance threshold.
9.5 G*Power-Style Distribution Plot
DataStatPro generates the classic two-distribution diagram showing:
- The central distribution of the test statistic under $H_0$ (blue curve).
- The non-central distribution of the test statistic under $H_1$ (orange curve).
- The critical value ($t_{\text{crit}}$) as a vertical dashed line.
- The $\alpha$ region (critical region under $H_0$, right tail of blue curve).
- The power ($1 - \beta$) region (area under the orange curve beyond $t_{\text{crit}}$).
- The $\beta$ region (area under the orange curve to the left of $t_{\text{crit}}$).
This plot is highly effective for teaching the concept of power and for communicating results to non-statistician audiences.
9.6 Sample Size Comparison Table Plot
For multiple scenarios (e.g., small/medium/large effect; power = 0.80/0.90/0.95), DataStatPro generates a bubble chart or heatmap where:
- Rows represent power targets.
- Columns represent effect sizes.
- Cell values (or bubble sizes) represent the required $n$.
This provides a rapid overview of how sample size requirements vary across the range of plausible inputs, supporting scenario planning.
9.7 Attrition-Adjusted Recruitment Funnel
For longitudinal studies or clinical trials, DataStatPro generates a funnel diagram showing:
- Enrollment target (accounting for attrition).
- Expected completers at each wave.
- Final analytic sample.
- Required sample vs. expected completers — highlighting any shortfall.
10. Sensitivity Analysis and Robustness Checks
10.1 What Is a Sensitivity Analysis in Power Planning?
A sensitivity analysis for power examines how the required sample size (or achieved power) changes as input parameters vary within plausible ranges. It answers: "How robust is my power calculation to uncertainty in the assumed effect size, standard deviation, or other inputs?"
10.2 Varying the Effect Size
The most important sensitivity analysis varies the effect size across a range defined by:
- The SESOI (lower bound — the smallest effect that matters).
- The expected effect from prior literature (central estimate).
- A larger, optimistic effect (upper bound).
Report for each scenario:
| Scenario | Effect Size | $n$ (power = 0.80) | $n$ (power = 0.90) |
|---|---|---|---|
| Pessimistic (SESOI) | Small | Largest | Largest |
| Most likely | Medium | Target | Target |
| Optimistic | Large | Smallest | Smallest |
Decision rule: Plan for the scenario that produces the largest $n$ to ensure adequate power across all plausible effect sizes.
10.3 Varying the Standard Deviation
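For mean-based tests, the effect size depends on $\sigma$: since $d = (\mu_1 - \mu_2)/\sigma$, the required $n$ scales with $\sigma^2$. If $\sigma$ is estimated from a pilot study or literature with uncertainty, sensitivity analysis should vary $\sigma$ across a plausible range (e.g., from $\sigma_{\min}$ to $\sigma_{\max}$):
$n = \dfrac{2(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{(\mu_1 - \mu_2)^2}$ (two-sample, per group)
Report $n$ for $\sigma_{\max}$ and $\sigma_{\min}$ as the worst and best cases.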
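A short sketch of this check, assuming the normal-approximation formula $n = 2(z_{1-\alpha/2} + z_{1-\beta})^2\sigma^2/(\mu_1 - \mu_2)^2$ per group. The raw difference of 5 points and the SD range of 12 plus or minus 20% are hypothetical planning values chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(mean_diff, sigma, alpha=0.05, power=0.80):
    """Per-group n for a two-sample comparison of means (normal approximation)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_sum * sigma / mean_diff) ** 2)


# Best case, central estimate, worst case for the SD (hypothetical values)
for label, sigma in (("sigma_min", 9.6), ("sigma", 12.0), ("sigma_max", 14.4)):
    print(label, sigma, n_per_group(mean_diff=5.0, sigma=sigma))
```

Because $n$ grows with the square of $\sigma$, a 20% underestimate of the SD translates into roughly 44% more participants than planned.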
10.4 The "What If" Power Table
A comprehensive "What If" power table reports power for a grid of values and effect sizes, enabling researchers and reviewers to assess robustness:
| $n$ per group | $d = 0.20$ | $d = 0.30$ | $d = 0.40$ | $d = 0.50$ | $d = 0.80$ |
|---|---|---|---|---|---|
| 20 | .09 | .15 | .23 | .34 | .69 |
| 30 | .12 | .21 | .33 | .48 | .86 |
| 50 | .17 | .32 | .51 | .70 | .98 |
| 80 | .24 | .47 | .71 | .88 | >.99 |
| 100 | .29 | .56 | .80 | .94 | >.99 |
| 150 | .41 | .74 | .93 | .99 | >.99 |
| 200 | .51 | .85 | .98 | >.99 | >.99 |
(Two-sample independent t-test, $\alpha = .05$, two-tailed)
10.5 Bayesian Power Analysis
Classical power analysis assumes a fixed, known effect size. Bayesian power analysis incorporates uncertainty about the effect size by averaging power over a prior distribution of effect sizes:
$\text{Average power} = \int \text{Power}(\delta)\,\pi(\delta)\,d\delta$
Where $\pi(\delta)$ is the prior distribution on the effect size $\delta$ (e.g., a half-normal or truncated normal distribution).
When the target power is high, average power is typically lower than the power at the expected effect size, because the power function flattens above the expected effect and falls steeply below it. If the prior is wide (high uncertainty), average power can be substantially lower than the nominal target. DataStatPro supports average power calculations under normal, half-normal, and uniform prior distributions.
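Average power can be approximated numerically by integrating the power function against the prior. The sketch below assumes a normal prior on Cohen's $d$, a normal-approximation power function for a two-sample test, and simple midpoint integration; the function names and the example prior (mean 0.5, SD 0.15) are illustrative and do not describe DataStatPro's implementation.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate two-tailed power at a fixed true effect d."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)


def average_power(prior_mean, prior_sd, n_per_group, alpha=0.05, steps=2001):
    """Power averaged over a normal prior on d (midpoint rule over +/- 6 SD)."""
    prior = NormalDist(prior_mean, prior_sd)
    lower = prior_mean - 6 * prior_sd
    width = 12 * prior_sd / steps
    total = 0.0
    for i in range(steps):
        d = lower + (i + 0.5) * width
        total += power_two_sample(d, n_per_group, alpha) * prior.pdf(d) * width
    return total


print("power at d = 0.5:", round(power_two_sample(0.5, 64), 3))
print("average power   :", round(average_power(0.5, 0.15, 64), 3))
```

With these assumed inputs the average falls several points below the point estimate, which is the gap the text describes for high power targets.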
10.6 Sequential and Adaptive Designs
Traditional power analysis assumes a fixed sample size collected before any analysis. Sequential designs allow interim analyses with pre-specified stopping rules, which can reduce the expected sample size while controlling error rates.
Key concepts:
| Concept | Description |
|---|---|
| Group sequential design | Planned interim analyses with O'Brien-Fleming or Pocock stopping boundaries |
| Alpha spending | Controls FWER across all interim and final analyses |
| Expected sample size | Average $N$ under $H_0$ and $H_1$; may be less than fixed design |
| Inflation factor | Required maximum $N$ is larger than fixed design to preserve power after early stopping |
DataStatPro supports group sequential design power analysis with O'Brien-Fleming, Pocock, and Kim-DeMets (power family) alpha spending functions.
10.7 Equivalence and Non-Inferiority Tests
Standard power analysis targets superiority — detecting that an effect is non-zero. Equivalence tests (TOST) and non-inferiority tests have different frameworks:
Equivalence (TOST — Two One-Sided Tests):
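$H_0$: $|\delta| \geq \Delta$ (effect is outside equivalence bounds); $H_1$: $|\delta| < \Delta$ (effect is within equivalence bounds)
Sample size for TOST (per group):
$n = \dfrac{2\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2}{(\Delta - |\delta|)^2}$ (use $z_{1-\beta/2}$ in place of $z_{1-\beta}$ when $\delta = 0$)
Where $\Delta$ is the equivalence margin, $\delta$ is the assumed true difference, and $\sigma$ is the common standard deviation.
Non-inferiority:
$H_0$: $\delta \leq -\Delta$ (treatment is inferior by more than the margin); $H_1$: $\delta > -\Delta$ (treatment is not inferior)
Sample size (per group):
$n = \dfrac{2\sigma^2 (z_{1-\alpha} + z_{1-\beta})^2}{(\delta + \Delta)^2}$
Where $\Delta$ is the non-inferiority margin and $\delta$ is the expected true difference ($\delta = 0$ for a conservative assumption).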
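As a hedged sketch, the standard normal-approximation sample size formulas for these two designs can be coded as follows, assuming a common SD `sigma`, a margin `margin`, and an assumed true difference `true_diff` (all names are illustrative). For TOST with a true difference of exactly zero, the Type II error is split across the two one-sided tests, which is why $z_{1-\beta/2}$ replaces $z_{1-\beta}$ in that branch.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()


def n_tost(margin, true_diff=0.0, sigma=1.0, alpha=0.05, power=0.80):
    """Per-group n for a two-group TOST equivalence test (normal approximation)."""
    beta = 1 - power
    # Split beta across both one-sided tests only when the true difference is zero
    z_beta = Z.inv_cdf(1 - beta / 2) if true_diff == 0 else Z.inv_cdf(1 - beta)
    z_alpha = Z.inv_cdf(1 - alpha)
    return ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / (margin - abs(true_diff)) ** 2)


def n_noninferiority(margin, true_diff=0.0, sigma=1.0, alpha=0.025, power=0.80):
    """Per-group n for a one-sided non-inferiority test (normal approximation)."""
    z_sum = Z.inv_cdf(1 - alpha) + Z.inv_cdf(power)
    return ceil(2 * sigma**2 * z_sum**2 / (true_diff + margin) ** 2)


# Hypothetical margin of half a standard deviation
print("TOST:", n_tost(margin=0.5))
print("Non-inferiority:", n_noninferiority(margin=0.5))
```

The one-sided $\alpha = 0.025$ default for non-inferiority is a common convention rather than a requirement; both defaults should be replaced by the study's pre-specified values.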
11. Advanced Topics
11.1 Effect Size Conversion
It is often necessary to convert between effect size measures. DataStatPro's built-in converter handles all common transformations:
| From | To | Formula |
|---|---|---|
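| $d$ | $r$ (correlation) | $r = d/\sqrt{d^2 + 4}$ (equal group sizes) |
| $r$ | $d$ | $d = 2r/\sqrt{1 - r^2} \approx 2r$ (approximately, for small $r$) |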
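Two of the most common conversions, between Cohen's $d$ and the point-biserial correlation $r$, can be sketched as below. The formulas assume two groups of equal size, and the function names are illustrative rather than DataStatPro's converter API.

```python
from math import sqrt


def d_to_r(d):
    """Convert Cohen's d to r, assuming two equally sized groups."""
    return d / sqrt(d * d + 4)


def r_to_d(r):
    """Convert r back to Cohen's d (inverse of d_to_r)."""
    return 2 * r / sqrt(1 - r * r)


print(round(d_to_r(0.5), 3))   # medium d expressed as a correlation
print(round(r_to_d(0.3), 3))   # medium r expressed as d
print(round(r_to_d(0.05), 3))  # for small r the result is close to 2r
```

The round trip `r_to_d(d_to_r(d))` returns the original value, which is a quick sanity check when chaining conversions.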
11.2 The Winner's Curse and Effect Size Inflation
Studies with low power that happen to produce a significant result tend to produce inflated effect size estimates. This phenomenon, the "Winner's Curse", occurs because a small-$n$ study can only reach significance when the observed effect happens to be larger than the true effect by chance.
Consequences:
- Effect sizes from small, significant studies overestimate the true population effect.
- Replication studies using these inflated effect sizes are often underpowered.
- The "replication crisis" in psychology and other sciences is partly driven by this phenomenon.
Mitigation:
- Base power calculations on conservative (smaller) effect size estimates.
- Use effect sizes from meta-analyses rather than individual significant studies.
- Apply a shrinkage factor (e.g., $d_{\text{adj}} = c \times d_{\text{orig}}$ with $c < 1$) as a conservative hedge.
11.3 Power Analysis for Multilevel Models
For multilevel (hierarchical) models with data nested within clusters (students within schools, patients within clinics):
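The effective sample size for a cluster-randomised trial depends on both the number of clusters $J$ and the cluster size $m$:
$N_{\text{eff}} = \dfrac{J\,m}{1 + (m - 1)\rho}$, where $\rho$ is the ICC.
Power depends primarily on the number of clusters (not the number of individuals per cluster) when the ICC is high. Doubling the number of individuals per cluster has diminishing returns once $(m - 1)\rho$ is large relative to 1, because $N_{\text{eff}}$ can never exceed $J/\rho$.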
Optimal allocation: Add more clusters (not more individuals per cluster) when the ICC is high or when between-cluster variance is the limiting factor.
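The allocation advice can be checked numerically with the usual design effect, $\text{DEFF} = 1 + (m - 1)\rho$, and the effective sample size $N_{\text{eff}} = Jm/\text{DEFF}$. The cluster counts, cluster sizes, and ICC below are hypothetical values used only to compare the two strategies.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def effective_n(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)


baseline = effective_n(n_clusters=30, cluster_size=20, icc=0.05)
bigger_clusters = effective_n(n_clusters=30, cluster_size=40, icc=0.05)
more_clusters = effective_n(n_clusters=60, cluster_size=20, icc=0.05)

print(round(baseline, 1), round(bigger_clusters, 1), round(more_clusters, 1))
```

Both alternatives recruit the same 1,200 participants, but doubling the number of clusters doubles the effective sample size while doubling the cluster size adds far less.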
11.4 Power for Survival Analysis (Log-Rank Test)
For survival outcomes (time to event), the log-rank test's power depends on the number of events (not the sample size):
Required number of events (for two-group comparison):
$D = \dfrac{4(z_{1-\alpha/2} + z_{1-\beta})^2}{(\ln \text{HR})^2}$ (equal allocation)
Where $\text{HR}$ is the hypothesised hazard ratio under $H_1$.
Required total $N$ (accounting for censoring rate $c$):
$N = \dfrac{D}{1 - c}$
The key insight is that studies with high censoring rates need larger $N$ to accumulate enough events; extending the follow-up period is often more efficient than increasing $N$.
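A minimal sketch of the events-driven calculation, assuming Schoenfeld's approximation for equal allocation, $D = 4(z_{1-\alpha/2} + z_{1-\beta})^2/(\ln \text{HR})^2$, and a simple inflation by the expected censoring proportion. The hazard ratio of 0.70 and the 40% censoring rate are hypothetical inputs.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Total events needed for a two-group log-rank test (Schoenfeld, 1:1 allocation)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(4 * z_sum**2 / log(hazard_ratio) ** 2)


def required_total(events, censoring_rate):
    """Total N so that the expected number of uncensored cases equals `events`."""
    return ceil(events / (1 - censoring_rate))


events = required_events(0.70)
print("events:", events)
print("total N at 40% censoring:", required_total(events, 0.40))
```

Only the event count depends on the hazard ratio; the censoring rate then determines how many participants must be followed to observe that many events.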
11.5 Precision Analysis: Planning for Confidence Interval Width
An alternative to power analysis is precision analysis: planning to achieve a desired confidence interval width, rather than a desired power level. This is consistent with an estimation-focused approach and does not require specifying the effect size under $H_1$.
Required $n$ for a 95% CI of total width $W$ for the mean:
$n = \left(\dfrac{2 \times 1.96\,\sigma}{W}\right)^2$
Required $n$ for a 95% CI of total width $W$ for a proportion:
$n = \dfrac{4 \times 1.96^2\,p(1 - p)}{W^2}$
Using $p = 0.5$ gives the most conservative (largest) $n$.
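The precision-based calculation can be sketched as follows, assuming that "width" means the full distance between the lower and upper confidence limits and that the normal (Wald) interval is adequate. Function names and the example inputs are illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_for_mean_ci(sigma, width, confidence=0.95):
    """n so that the CI for a mean has the requested total width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((2 * z * sigma / width) ** 2)


def n_for_proportion_ci(p, width, confidence=0.95):
    """n so that the Wald CI for a proportion has the requested total width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(4 * z**2 * p * (1 - p) / width**2)


print(n_for_mean_ci(sigma=15, width=5))        # mean estimated to within +/- 2.5
print(n_for_proportion_ci(p=0.5, width=0.10))  # proportion to within +/- 5 points
```

If a half-width (margin of error) is specified instead, double it before passing it as `width`.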
11.6 Prospective Power Analysis for Replication Studies
When planning a replication study of a previously published finding:
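- Extract the original study's effect size $d_{\text{orig}}$ and its SE (or CI).
- Apply the shrinkage factor: $d_{\text{adj}} = c \times d_{\text{orig}}$ with $c < 1$ (conservative hedge).
- Compute required $n$ for $d_{\text{adj}}$ at an elevated power target, e.g., 0.90 (higher than 0.80 to account for uncertainty).
- Report both the nominal power (if $d_{\text{orig}}$ is correct) and the power at $d_{\text{adj}}$ (robustness check).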
11.7 Negative Findings and Equivalence: Planning for Both
A study designed to test for superiority may fail to reject but not demonstrate equivalence. Planning for both outcomes requires pre-specifying:
- Equivalence margin $\Delta$: The largest effect that would be practically negligible.
- A TOST equivalence test as a secondary analysis alongside the primary superiority test.
- Sufficient power for both: The sample size is the maximum of $n_{\text{superiority}}$ and $n_{\text{equivalence}}$.
11.8 Reporting Power in Pre-Registration
Pre-registration of power analyses on platforms such as the Open Science Framework (OSF), ClinicalTrials.gov, or AsPredicted.org requires:
| Element | Required Detail |
|---|---|
| Research question and primary hypothesis | Specific and testable |
| Primary outcome and statistical test | Named explicitly |
| Effect size and justification | Value, measure, and source |
| $\alpha$, power target, directionality | All three specified |
| Computed $N$ | Total and per group |
| Attrition and design adjustments | If applicable |
| Software used | Name and version |
| Deviation policy | What will happen if $N$ cannot be reached |
Pre-registration creates a public record of the planned analysis and protects against post-hoc power manipulation and researcher degrees of freedom.
12. Worked Examples
Example 1: A Priori — Two-Sample Independent t-Test
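A clinical researcher plans to compare the effectiveness of a new cognitive training programme (Group A) vs. standard care (Group B) on memory scores. Based on a published meta-analysis, the expected Cohen's $d = 0.45$. The researcher wants $\alpha = .05$ (two-tailed) and power $= 0.80$.
Step 1 — Effect size: $d = 0.45$ (from meta-analysis).
Step 2 — Look up constants:
$z_{1-\alpha/2} = z_{0.975} = 1.960$; $z_{1-\beta} = z_{0.80} = 0.842$
Step 3 — Apply formula:
$n = \dfrac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2} = \dfrac{2(1.960 + 0.842)^2}{0.45^2} = \dfrac{15.70}{0.2025} = 77.5$
Round up: $n = 78$ per group, $N = 156$ total.
Step 4 — Verify with exact non-central t:
At $n = 78$ per group: $\lambda = d\sqrt{n/2} = 0.45\sqrt{39} = 2.81$, $df = 154$
Using the non-central t-distribution: Power $\approx 0.80$ ✅ (meets the 0.80 target after rounding).
Step 5 — Attrition adjustment (expecting 12% dropout):
$n_{\text{enroll}} = 78/(1 - 0.12) = 88.6 \rightarrow 89$ per group; $N_{\text{enroll}} = 178$
Step 6 — MDE at $n = 78$ per group:
$d_{\text{MDE}} = (1.960 + 0.842)\sqrt{2/78} = 0.449$
The study is designed to detect effects of $d \geq 0.45$ with 80% power.
Summary:
| Parameter | Value |
|---|---|
| Test | Two-sample independent t-test (two-tailed) |
| Effect size | $d = 0.45$ (meta-analysis) |
| Power target | $1 - \beta = 0.80$ |
| $n$ per group (analysis) | 78 |
| $N$ total (analysis) | 156 |
| Achieved power | $\approx 0.80$ |
| $N$ total (enrollment; 12% attrition) | 178 |
| MDE | $d \approx 0.45$ |
APA write-up: "An a priori power analysis conducted in DataStatPro indicated that 78 participants per group (total $N = 156$) were required to detect an effect of $d = 0.45$ with 80% power at a two-tailed $\alpha = .05$ (achieved power = 0.80). The effect size was based on a published meta-analysis. Assuming 12% attrition, 89 participants per group (total $N = 178$) will be recruited."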
Example 2: A Priori — One-Way ANOVA (Three Groups)
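An education researcher compares three teaching methods on test performance. Literature suggests group means of 65, 70, and 68 with a common within-group SD of 12. $\alpha = .05$, power target $= 0.80$.
Step 1 — Compute Cohen's $f$:
Grand mean: $\bar{\mu} = (65 + 70 + 68)/3 = 67.67$
$\sigma_m = \sqrt{\dfrac{(65 - 67.67)^2 + (70 - 67.67)^2 + (68 - 67.67)^2}{3}} = \sqrt{4.22} = 2.05$
$f = \sigma_m/\sigma = 2.05/12 = 0.171$
This is between Cohen's small ($f = 0.10$) and medium ($f = 0.25$) benchmarks.
Step 2 — Compute $\eta^2$ equivalent:
$\eta^2 = f^2/(1 + f^2) = 0.0293/1.0293 = 0.028$
Step 3 — Required $N$ (iterative, via DataStatPro):
Using non-central F with $df_1 = k - 1 = 2$, $df_2 = N - k$, and $\lambda = f^2 N$:
DataStatPro iterates: at $n = 111$ per group ($N = 333$, $\lambda = 9.76$): power $\approx 0.80$ ✅
Step 4 — Attrition adjustment (8%):
$n_{\text{enroll}} = 111/(1 - 0.08) = 120.7 \rightarrow 121$ per group; $N_{\text{enroll}} = 363$
Summary:
| Parameter | Value |
|---|---|
| Test | One-way ANOVA ($k = 3$ groups), omnibus $F$ test |
| Effect size | $f = 0.171$; $\eta^2 = .028$ |
| Power target | $1 - \beta = 0.80$ |
| $n$ per group (analysis) | 111 |
| $N$ total (analysis) | 333 |
| Achieved power | $\approx 0.80$ |
| $N$ total (enrollment; 8% attrition) | 363 |
APA write-up: "A priori power analysis for a one-way ANOVA with three groups indicated that 111 participants per group (total $N = 333$) were required to detect $f = 0.17$ ($\eta^2 = .028$) with 80% power at $\alpha = .05$ (achieved power = 0.80). The expected group means ($M_1 = 65$, $M_2 = 70$, $M_3 = 68$; pooled $SD = 12$) were derived from the literature. With anticipated 8% attrition, 121 participants per group (total $N = 363$) will be recruited."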
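The effect size step of this example can be reproduced with a few lines, using Cohen's definition $f = \sigma_m/\sigma$, where $\sigma_m$ is the population-style standard deviation of the group means (divisor $k$, equal group sizes assumed). The helper names are illustrative.

```python
from math import sqrt


def cohens_f(group_means, sd_within):
    """Cohen's f from expected group means and a common within-group SD."""
    k = len(group_means)
    grand_mean = sum(group_means) / k
    sigma_m = sqrt(sum((m - grand_mean) ** 2 for m in group_means) / k)
    return sigma_m / sd_within


def eta_squared(f):
    """Proportion of variance explained that corresponds to Cohen's f."""
    return f * f / (1 + f * f)


f = cohens_f([65, 70, 68], sd_within=12)
print(round(f, 3), round(eta_squared(f), 3))
```

The resulting $f$ is then the input to the non-central F calculation; it should be checked against the benchmarks before any sample size is accepted.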
Example 3: A Priori — Pearson Correlation
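A developmental psychologist hypothesises a moderate correlation ($r = .35$) between parental involvement (hours/week) and child academic achievement. $\alpha = .05$ (two-tailed), power $= 0.90$.
Step 1 — Fisher z-transformation:
$z_r = \tfrac{1}{2}\ln\dfrac{1 + r}{1 - r} = \tfrac{1}{2}\ln\dfrac{1.35}{0.65} = 0.3654$
Step 2 — Look up constants:
$z_{1-\alpha/2} = z_{0.975} = 1.960$; $z_{1-\beta} = z_{0.90} = 1.282$
Step 3 — Apply formula:
$N = \left(\dfrac{z_{1-\alpha/2} + z_{1-\beta}}{z_r}\right)^2 + 3 = \left(\dfrac{3.242}{0.3654}\right)^2 + 3 = 78.7 + 3 = 81.7$
Round up: $N = 82$.
Step 4 — MDE (minimum detectable correlation at $N = 82$, power $= 0.90$):
$z_{\text{MDE}} = 3.242/\sqrt{82 - 3} = 0.3648$, so $r_{\text{MDE}} = \tanh(0.3648) = .349$
Summary:
| Parameter | Value |
|---|---|
| Test | Pearson correlation (two-tailed) |
| Effect size | $r = .35$ (literature) |
| Power target | $1 - \beta = 0.90$ |
| Required $N$ | 82 |
| Achieved power | $\approx 0.90$ |
| MDE | $r \approx .35$ |
APA write-up: "Based on an expected correlation of $r = .35$, a priori power analysis indicated that $N = 82$ participants were required to achieve 90% power at $\alpha = .05$ (two-tailed). Calculations were conducted using DataStatPro."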
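The Fisher $z$ approach used in this example can be sketched as below, assuming the approximation $N = ((z_{1-\alpha/2} + z_{1-\beta})/z_r)^2 + 3$ with $z_r = \operatorname{atanh}(r)$. The input $r = 0.35$ is used because it reproduces the example's required $N = 82$ at 90% power; exact bivariate-normal methods can differ by a participant or two.

```python
from math import atanh, ceil, sqrt, tanh
from statistics import NormalDist


def _z_sum(alpha, power):
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Total N to detect a correlation r (Fisher z approximation, two-tailed)."""
    return ceil((_z_sum(alpha, power) / atanh(r)) ** 2 + 3)


def mde_correlation(n_total, alpha=0.05, power=0.80):
    """Smallest correlation detectable with the stated power at a given N."""
    return tanh(_z_sum(alpha, power) / sqrt(n_total - 3))


print(n_for_correlation(0.35, power=0.90))
print(round(mde_correlation(82, power=0.90), 3))
```

Running the MDE function at the rounded-up $N$ should return a value at or just below the planning correlation, which confirms the rounding went in the right direction.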
Example 4: A Priori — Chi-Square Test of Association (3 × 3 Table)
A sociologist examines the association between age group (18–34, 35–54, 55+) and preferred news source (online, print, broadcast). Based on the literature, the expected cell proportions are:
| | Online | Print | Broadcast |
|---|---|---|---|
| 18–34 | .18 | .04 | .11 |
| 35–54 | .10 | .09 | .14 |
| 55+ | .06 | .12 | .16 |
$\alpha = .05$, power $= 0.80$.
Step 1 — Compute marginal proportions:
Row marginals: $p_{1\cdot} = .33$, $p_{2\cdot} = .33$, $p_{3\cdot} = .34$
Column marginals: $p_{\cdot 1} = .34$, $p_{\cdot 2} = .25$, $p_{\cdot 3} = .41$
Step 2 — Compute Cohen's $w$:
$w = \sqrt{\sum_{i,j} \dfrac{(p_{ij} - p_{i\cdot}\,p_{\cdot j})^2}{p_{i\cdot}\,p_{\cdot j}}}$
DataStatPro computes: $w = 0.338$ (using the full cell proportion matrix).
Step 3 — Degrees of freedom:
$df = (3 - 1)(3 - 1) = 4$
Step 4 — Required $N$ (non-central chi-square, DataStatPro):
At $w = 0.338$, $df = 4$, power $= 0.80$: the required non-centrality is $\lambda \approx 11.94$, so $N = \lambda/w^2 = 11.94/0.114 = 104.7$
Round up: $N = 105$.
Summary:
| Parameter | Value |
|---|---|
| Test | Chi-square test of association ($3 \times 3$ table) |
| Effect size | $w = 0.338$ (from expected cell proportions) |
| Power target | $1 - \beta = 0.80$ |
| Required $N$ | 105 |
| Achieved power | $\approx 0.80$ |
APA write-up: "An a priori power analysis for a chi-square test of association indicated that $N = 105$ participants were required to detect $w = 0.34$ with 80% power at $\alpha = .05$. The expected cell proportions were derived from prior survey data."
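Cohen's $w$ for a contingency table follows directly from the expected cell proportions: $w = \sqrt{\sum (p_{ij} - p_{i\cdot}p_{\cdot j})^2/(p_{i\cdot}p_{\cdot j})}$, where the products of the marginals are the proportions expected under independence. The sketch below applies that definition to the age-by-news-source proportions given in this example; `cohens_w` is an illustrative helper, not a DataStatPro function.

```python
from math import sqrt


def cohens_w(table):
    """Cohen's w for a table of cell proportions (or counts)."""
    total = sum(sum(row) for row in table)
    row_tot = [sum(row) / total for row in table]
    col_tot = [sum(col) / total for col in zip(*table)]
    w_squared = 0.0
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            expected = row_tot[i] * col_tot[j]  # proportion under independence
            w_squared += (cell / total - expected) ** 2 / expected
    return sqrt(w_squared)


proportions = [
    [0.18, 0.04, 0.11],  # 18-34
    [0.10, 0.09, 0.14],  # 35-54
    [0.06, 0.12, 0.16],  # 55+
]
df = (len(proportions) - 1) * (len(proportions[0]) - 1)
print(round(cohens_w(proportions), 3), df)
```

The function normalises by the table total, so raw counts from a pilot survey can be passed in place of proportions.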
Example 5: Sensitivity Analysis — Post-Hoc Assessment
A completed study of exam score differences between two teaching conditions produced a non-significant result ($p > .05$).
Observed effect size: Cohen's $d = (\bar{x}_1 - \bar{x}_2)/s_p$, computed from the group means and the pooled SD.
Post-hoc power (observed $d$, NOT recommended as standalone):
At the observed $d$ and the study's per-group $n$, post-hoc power is below 0.50.
This is low, but it is mathematically expected given the non-significant result.
More useful: sensitivity analysis (power vs. effect size at the study's per-group $n$):
| True $d$ | Power at the study's $n$ per group |
|---|---|
| 0.20 | .16 |
| 0.30 | .25 |
| 0.40 | .37 |
| 0.50 | .52 |
| 0.60 | .66 |
| 0.80 | .87 |
95% CI for observed $d$: computed via DataStatPro.
Interpretation: The study had sufficient power only for large effects ($d \geq 0.80$). The non-significant result is uninformative about effects in the small-to-medium range. The 95% CI for $d$ is wide, spanning from negligible to large. A future study designed to detect a medium effect ($d = 0.50$) with 80% power would require $n = 64$ per group.
APA write-up: "The per-group sample provided 52% power to detect a medium effect of $d = 0.50$ at $\alpha = .05$ (two-tailed), indicating the study was substantially underpowered for effects of practical interest. The 95% CI for Cohen's $d$ spans a wide range. A sensitivity power analysis indicated that detecting $d = 0.50$ with 80% power at $\alpha = .05$ would require $n = 64$ per group. The non-significant result should therefore be interpreted with caution rather than as evidence of no effect."
13. Common Mistakes and How to Avoid Them
Mistake 1: Using Post-Hoc Power with the Observed Effect Size
Problem: Computing "observed power" using the effect size estimated from the completed study's data and presenting it as an independent finding. Because observed power is a monotonically decreasing function of the p-value, $p > \alpha$ always gives power below roughly 0.50. The observed power adds no information whatsoever beyond the p-value itself.
Solution: Replace post-hoc power with: (a) A 95% CI for the effect size, and (b) A sensitivity power analysis showing power for a range of plausible true effect sizes. This genuinely informs about what the study could and could not detect.
Mistake 2: Basing Effect Size on a Single Pilot Study
Problem: Running a small pilot study, observing a large effect size, and using this value directly in a power calculation. Small pilot studies produce highly unstable effect size estimates. The true effect could easily be much smaller, leading to a seriously underpowered main study.
Solution: Use pilot studies for feasibility and nuisance parameter estimation (SD, retention rate, ICC) only. Determine the target effect size from the SESOI, published literature, or meta-analyses. If a pilot effect size must be used, apply a conservative discount factor (e.g., multiply by 0.60–0.75).
Mistake 3: Confusing Total with Per-Group
Problem: A formula yields $n = 50$ per group, but the researcher enrolls 50 participants total (25 per group). The study then has only half the required sample, and power falls far below target (to roughly 0.50 for a design planned at 0.80).
Solution: Always explicitly distinguish total from per-group in both calculations and reports. DataStatPro reports both total and the per-group breakdown on all output screens.
Mistake 4: Not Adjusting for Attrition
Problem: Calculating that 120 completers are needed and recruiting exactly 120 participants, then losing 18 to dropout — leaving 102 completers with power substantially below target.
Solution: Always calculate the attrition-adjusted enrollment target: $N_{\text{enroll}} = N_{\text{required}}/(1 - \text{attrition rate})$. Obtain attrition estimates from the literature or previous studies in the same population. Be conservative (overestimate attrition rates).
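The adjustment is a one-line calculation, sketched here with a hypothetical helper `enrollment_target`. The only subtlety is rounding: the result must always be rounded up, and a small guard against floating-point noise avoids adding a spurious extra participant when the division is exact.

```python
from math import ceil


def enrollment_target(n_required, attrition_rate):
    """Participants to recruit so that n_required are expected to complete."""
    if not 0 <= attrition_rate < 1:
        raise ValueError("attrition_rate must be in [0, 1)")
    # Round away tiny floating-point noise before taking the ceiling
    return ceil(round(n_required / (1 - attrition_rate), 9))


print(enrollment_target(120, 0.15))  # the scenario described in the problem above
print(enrollment_target(78, 0.12))   # per-group figure from Example 1
```

Applying the function per group and then summing keeps the arms balanced, which can give a total one or two higher than adjusting the overall $N$ directly.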
Mistake 5: Ignoring the Design Effect in Clustered Studies
Problem: Treating a clustered design (e.g., 20 students per class) as if observations were independent, underestimating the required sample size by a factor of DEFF.
Solution: Always specify the expected ICC ($\rho$) and average cluster size ($m$), and apply the design effect: $\text{DEFF} = 1 + (m - 1)\rho$. Err on the side of overestimating the ICC. Use DataStatPro's clustered design module.
Mistake 6: Using Cohen's Conventions as the Default Effect Size
Problem: Entering $d = 0.50$ ("medium") into a power calculation simply because it is conventional, without any scientific justification. This produces a sample size that may be completely inappropriate for the specific research question: the true effect could be $d = 0.20$ (requiring about 6× more participants).
Solution: Always justify the effect size from the SESOI, prior literature, or meta-analysis. Use Cohen's conventions only as an absolute last resort, and document that they were used in the absence of domain-specific information. Never present Cohen's conventions as though they represent the expected effect.
Mistake 7: Performing a Power Analysis for the Wrong Test
Problem: Computing power for a two-sample t-test when the actual analysis will be a mixed ANOVA (within × between), or computing power for a chi-square test when logistic regression will be used. Different tests have different power functions.
Solution: Identify the exact statistical test to be used (including model specification, covariates, and correction methods) before computing power. The power analysis must match the planned analysis.
Mistake 8: Conducting Multiple Tests but Powering Only for One
Problem: Planning 5 outcome variables but computing power only for the most important one, without applying a multiple testing correction. The familywise false-positive rate for 5 independent tests at $\alpha = .05$ is $1 - (1 - .05)^5 = .226$.
Solution: Clearly specify the single primary outcome and power accordingly. For secondary outcomes, apply Bonferroni or Holm-Bonferroni corrections: $\alpha_{\text{adj}} = \alpha/k$, where $k$ is the number of primary hypotheses. Compute power at $\alpha_{\text{adj}}$ for all primary outcomes, or justify a less conservative correction.
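The cost of the correction can be made concrete with a short sketch: the familywise error rate for independent tests, the Bonferroni-adjusted threshold, and the resulting increase in per-group $n$ under a normal-approximation two-sample formula. The effect size $d = 0.5$ is a hypothetical planning value, and the function names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def n_per_group(d, alpha, power=0.80):
    """Per-group n for a two-sample comparison (normal approximation, two-tailed)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * z_sum**2 / d**2)


alpha, n_tests = 0.05, 5
alpha_adj = alpha / n_tests  # Bonferroni

print(round(familywise_error_rate(alpha, n_tests), 3))
print(n_per_group(0.5, alpha), n_per_group(0.5, alpha_adj))
```

The adjusted threshold raises the required sample by roughly half in this scenario, which is why the number of primary hypotheses should be fixed before the power analysis is run.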
Mistake 9: Treating Non-Significant Results as Evidence of No Effect
Problem: A small study fails to reject $H_0$ ($p > \alpha$) and concludes "the two conditions are equivalent". With a small $n$, power for a medium effect is well below 0.80, so the non-significant result can be as consistent with a medium true effect as with no effect.
Solution: Distinguish between "no evidence of an effect" and "evidence of no effect". To provide evidence of equivalence, use a TOST equivalence test with a pre-specified equivalence margin, or present the 95% CI for the effect size to show that meaningful effects can be ruled out. Power the study for equivalence, not just superiority, if equivalence is a potential conclusion.
Mistake 10: Reporting Sample Size Without Justification
Problem: Stating only "sample size was $N$ = [value]" in a methods section with no reference to power, effect size, or target power. Readers (and reviewers) cannot assess whether the study was adequately powered.
Solution: Always include a complete power analysis justification in the methods section: test used, effect size with source, $\alpha$, power target, computed $N$, and software. Pre-register the power analysis before data collection.
14. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| Required $N$ is extremely large | Effect size is very small; $\alpha$ is very small; power target is very high | Check whether the effect size is realistic; consider whether the study is feasible; explore precision analysis as an alternative |
| Required $N$ is smaller than expected | Effect size is large; one-tailed test used; power target is low (e.g., 0.70) | Verify inputs; confirm directionality; consider increasing power target |
| Power does not reach target even with very large $N$ | Effect size effectively zero; test has an inherent power ceiling | Check whether $H_1$ is correctly specified; effect size of zero gives power $= \alpha$ regardless of $N$ |
| Post-hoc power is very low (e.g., well below 0.50) | Study was substantially underpowered; effect is genuinely small | Expected when $p > \alpha$; replace post-hoc power with CI for effect size and sensitivity analysis |
| DataStatPro gives different $N$ than another power calculator | Different rounding conventions, approximation formulae, or non-central distribution methods | Both may be correct; use exact non-central distribution methods (DataStatPro default); difference is typically 0–2 participants |
| Design effect is very large | Very high ICC or very large cluster size | Consider increasing number of clusters rather than cluster size; add cluster-level covariates to reduce ICC |
| Power is not improved by doubling $m$ (clustered design) | ICC is high; adding individuals within clusters is inefficient | Add more clusters, not more individuals per cluster; consult a biostatistician |
| Power for interaction effect is very low | Interaction effects are inherently smaller and harder to detect than main effects | Plan for 4× the $N$ needed for the main effect to detect a crossover interaction; report as a limitation |
| Cohen's $h$ is unusually large | Proportions are both near 0 or near 1; arcsine transformation stretches the scale | Verify $p_1$ and $p_2$; the arcsine transformation is mathematically correct; large $h$ reflects high sensitivity in that region |
| Achieved power slightly below target after rounding | $n$ formula gives a non-integer; rounding up gives $\geq$ target power; rounding down falls just below | Always round up, never down; add 1–2 participants as a buffer |
| Equivalence test requires much larger $N$ than superiority test | Equivalence requires showing the effect is within a narrow margin; inherently conservative | Use a realistic equivalence margin; consider whether the margin is defined appropriately |
| Sample size for ANOVA with many groups is surprisingly large | Many-group ANOVA has reduced power per group for fixed total $N$; each group has small $n$ | Concentrate comparisons on the most important pairwise contrasts; consider a planned contrast rather than omnibus ANOVA |
| Attrition-adjusted $N$ is unrealistically large | Very high assumed attrition rate | Revisit attrition estimates; consider strategies to reduce dropout; report as a study limitation if $N$ is infeasible |
| Power analysis for regression gives very different $N$ from t-test | Different effect size frameworks ($f^2$ vs. $d$); different degrees of freedom | Convert between effect sizes using DataStatPro's converter; confirm the numerator $df$ (predictors tested) is specified correctly |
15. Quick Reference Cheat Sheet
The Four Elements of Power Analysis
Effect size, sample size ($N$), significance level ($\alpha$), and power ($1 - \beta$): specify any three → solve for the fourth.
Core Sample Size Formulae
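| Test | Effect Size | Per-Group $n$ or Total $N$ |
|---|---|---|
| One-sample t | $d = (\mu - \mu_0)/\sigma$ | $n = (z_{1-\alpha/2} + z_{1-\beta})^2/d^2$ |
| Two-sample t (equal $n$) | $d = (\mu_1 - \mu_2)/\sigma$ | $n = 2(z_{1-\alpha/2} + z_{1-\beta})^2/d^2$ per group |
| Paired t | $d_z = \mu_D/\sigma_D$ | $n = (z_{1-\alpha/2} + z_{1-\beta})^2/d_z^2$ pairs |
| Correlation | $r$ ($z_r = \tfrac{1}{2}\ln\frac{1+r}{1-r}$) | $N = \left((z_{1-\alpha/2} + z_{1-\beta})/z_r\right)^2 + 3$ |
| Proportion (one) | $h = 2\arcsin\sqrt{p} - 2\arcsin\sqrt{p_0}$ | $n = (z_{1-\alpha/2} + z_{1-\beta})^2/h^2$ |
| Proportion (two) | $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$ | $n = 2(z_{1-\alpha/2} + z_{1-\beta})^2/h^2$ per group |
| Chi-square | $w = \sqrt{\sum (p_{1i} - p_{0i})^2/p_{0i}}$ | $N = \lambda/w^2$, with $\lambda$ from the non-central $\chi^2$ distribution |
| Regression | $f^2 = R^2/(1 - R^2)$ | $N = \lambda/f^2$, with $\lambda$ from the non-central $F$ distribution |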
Key Z-Score Constants
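| $\alpha$ (two-tailed) | $z_{1-\alpha/2}$ | Power ($1 - \beta$) | $z_{1-\beta}$ |
|---|---|---|---|
| .10 | 1.645 | 0.70 | 0.524 |
| .05 | 1.960 | 0.80 | 0.842 |
| .025 | 2.241 | 0.85 | 1.036 |
| .01 | 2.576 | 0.90 | 1.282 |
| .001 | 3.291 | 0.95 | 1.645 |
| | | 0.99 | 2.326 |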
Sample Size for Two-Sample t-Test ($\alpha = .05$, Two-Tailed)
| $d$ | Power = 0.70 | Power = 0.80 | Power = 0.90 | Power = 0.95 |
|---|---|---|---|---|
| 0.20 (small) | 310 | 394 | 526 | 650 |
| 0.30 | 139 | 176 | 234 | 290 |
| 0.50 (medium) | 51 | 64 | 86 | 106 |
| 0.80 (large) | 21 | 26 | 34 | 42 |
| 1.00 | 14 | 18 | 24 | 28 |
| 1.20 | 10 | 12 | 16 | 20 |
(Figures are per group; multiply by 2 for total $N$.)
Sample Size for Correlation ($\alpha = .05$, Two-Tailed)
| $r$ | Power = 0.80 | Power = 0.90 |
|---|---|---|
| 0.10 | 782 | 1046 |
| 0.20 | 194 | 259 |
| 0.30 | 84 | 112 |
| 0.40 | 46 | 61 |
| 0.50 | 28 | 37 |
| 0.70 | 12 | 16 |
Cohen's Effect Size Conventions
| Test | Small | Medium | Large |
|---|---|---|---|
| t-test ($d$) | 0.20 | 0.50 | 0.80 |
| ANOVA ($f$) | 0.10 | 0.25 | 0.40 |
| Correlation ($r$) | 0.10 | 0.30 | 0.50 |
| Chi-square ($w$) | 0.10 | 0.30 | 0.50 |
| Regression ($f^2$) | 0.02 | 0.15 | 0.35 |
| Proportion ($h$) | 0.20 | 0.50 | 0.80 |
Attrition Adjustment
| Attrition Rate | Inflation Factor |
|---|---|
| 5% | × 1.053 |
| 10% | × 1.111 |
| 15% | × 1.176 |
| 20% | × 1.250 |
| 25% | × 1.333 |
| 30% | × 1.429 |
Design Effect for Clustered Studies
| ICC | Cluster size $m = 10$ | $m = 20$ | $m = 30$ |
|---|---|---|---|
| 0.01 | 1.09 | 1.19 | 1.29 |
| 0.05 | 1.45 | 1.95 | 2.45 |
| 0.10 | 1.90 | 2.90 | 3.90 |
| 0.20 | 2.80 | 4.80 | 6.80 |
Effect Size Conversions
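| From | To | Formula |
|---|---|---|
| $d$ | $r$ | $r = d/\sqrt{d^2 + 4}$ (equal group sizes) |
| $r$ | $d$ | $d = 2r/\sqrt{1 - r^2}$ |
| $d$ | $f$ | $f = d/2$ (two equal groups) |
| $f$ | $\eta^2$ | $\eta^2 = f^2/(1 + f^2)$ |
| $R^2$ | $f^2$ | $f^2 = R^2/(1 - R^2)$ |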
Four Modes of Power Analysis Decision Guide
| Goal | Analysis Mode | Fixed | Solved |
|---|---|---|---|
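| Plan sample size before data collection | A priori | Effect size, $\alpha$, power | $N$ |
| Assess power of a completed study | Post-hoc | Effect size, $\alpha$, $N$ | Power |
| Find smallest detectable effect | Sensitivity | $\alpha$, power, $N$ | Effect size (MDE) |
| Justify a non-standard $\alpha$ | Criterion | Effect size, power, $N$ | $\alpha$ |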
APA 7th Edition Power Analysis Reporting Templates
A priori (standard): "An a priori power analysis conducted in DataStatPro indicated that [N per group / total N] participants were required to detect [effect size measure] = [value] with [power]% power at a [one/two]-tailed $\alpha$ = [value] (achieved power = [value]). The effect size was based on [source/justification]."
A priori (with attrition): "[As above]. Assuming [X]% attrition, [inflated N] participants will be recruited."
A priori (clustered design): "[As above]. Assuming an ICC of [value] and an average cluster size of [m], the design effect was [DEFF], yielding a required [N clusters] clusters of [m] participants each (total $N$ = [value])."
Sensitivity analysis: "With [N] participants per group, the study had [power]% power to detect an effect of [ES measure] = [MDE value] at $\alpha$ = [value] (two-tailed). Power for a range of effect sizes is provided in [Table/Figure X]."
Non-significant result with sensitivity: "With [N] per group, the study had [power]% power to detect [ES measure] = [value] at $\alpha$ = [value]. The 95% CI for [effect size] = [[LB], [UB]], indicating that effects as large as [UB value] cannot be ruled out. A future study powered to detect [ES measure] = [target value] with 80% power would require [future N] per group."
Power Analysis Reporting Checklist
| Element | Required |
|---|---|
| Analysis mode (a priori / post-hoc / sensitivity) | ✅ Always |
| Statistical test named exactly | ✅ Always |
| Effect size measure and value | ✅ Always |
| Effect size source and justification | ✅ Always |
| Significance level $\alpha$ and directionality | ✅ Always |
| Power target ($1 - \beta$) | ✅ Always |
| Computed $N$ total and per group | ✅ Always |
| Achieved power at computed $N$ | ✅ Always |
| Software and version | ✅ Always |
| Attrition rate and enrollment | ✅ When attrition is anticipated |
| Design effect, ICC, cluster size | ✅ For clustered designs |
| Multiple testing correction and adjusted $\alpha$ | ✅ When multiple primary outcomes |
| MDE in original units | ✅ Recommended |
| Sensitivity power table or curve | ✅ Recommended |
| Equivalence margin (for TOST) | ✅ For equivalence studies |
| Pre-registration reference | ✅ When pre-registered |
| Discussion of feasibility | ✅ When $N$ is large or constrained |
| Bayesian / average power | ✅ When prior uncertainty about effect size is substantial |
This tutorial provides a comprehensive foundation for understanding, conducting, interpreting, visualising, and reporting sample size and power analyses within the DataStatPro application. For further reading, consult Cohen's "Statistical Power Analysis for the Behavioral Sciences" (2nd ed., 1988) for foundational theory and conventions; Lakens, Scheel & Isager's "Equivalence Testing for Psychological Research: A Tutorial" (2018) for TOST methods; Gelman & Carlin's "Beyond Power Calculations" (2014) for design analysis and Type M/S errors; Faul, Erdfelder, Lang & Buchner's "G*Power 3" (2007) for computational methods; and Zar's "Biostatistical Analysis" (5th ed., 2010) for biological and health science applications. For feature requests or support, contact the DataStatPro team.