ANOVA Tests and Alternatives: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of analysis of variance all the way through one-way, factorial, repeated measures, and mixed ANOVA designs, their non-parametric alternatives, post-hoc testing, effect sizes, and practical usage within the DataStatPro application. Whether you are encountering ANOVA for the first time or deepening your understanding of variance decomposition and group comparison, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is ANOVA?
- The Mathematics Behind ANOVA
- Assumptions of ANOVA
- Types of ANOVA
- Using the ANOVA Calculator Component
- One-Way Between-Subjects ANOVA
- Factorial Between-Subjects ANOVA
- One-Way Repeated Measures ANOVA
- Mixed ANOVA
- Post-Hoc Tests and Planned Contrasts
- Non-Parametric Alternatives
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into ANOVA, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.
1.1 The Logic of Variance Decomposition
ANOVA is built upon one foundational insight: total variability in a dataset can be partitioned into components attributable to specific sources. For a one-way design:

$$SS_{total} = SS_{between} + SS_{within}$$

If the between-group variance is substantially larger than the within-group variance, it suggests that group membership explains a meaningful portion of the variability in scores — that is, the groups differ. This ratio of variances is the F-statistic.
1.2 The F-Distribution
The F-distribution arises from the ratio of two independent chi-squared variables, each divided by its degrees of freedom:

$$F = \frac{\chi^2_1 / d_1}{\chi^2_2 / d_2}$$

In the ANOVA context, this becomes the ratio of two mean squares:

$$F = \frac{MS_{between}}{MS_{within}}$$

Under $H_0$ (all group means equal), this ratio is expected to be approximately 1. Values much greater than 1 provide evidence against $H_0$.
The F-distribution is characterised by two parameters:
- $d_1 = k - 1$ = numerator degrees of freedom (between-groups)
- $d_2 = N - k$ = denominator degrees of freedom (within-groups/error)
It is always non-negative and right-skewed.
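As a quick sketch (plain Python with SciPy; illustrative, not DataStatPro's internal implementation), the p-value of an observed F-statistic is the right-tail area of the F-distribution with the stated degrees of freedom. The F-value and df below are hypothetical:

```python
from scipy.stats import f

# Hypothetical observed F-statistic for k = 3 groups, N = 30 (df = 2, 27).
F_obs, df_between, df_within = 4.0, 2, 27

# Survival function sf(x) = P(F >= x): the right-tail p-value.
p_value = f.sf(F_obs, df_between, df_within)
print(round(p_value, 4))
```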
1.3 Why Not Multiple t-Tests?
A natural question is: why not simply run multiple t-tests to compare groups? With $k$ groups, this would require $m = k(k-1)/2$ pairwise tests.
The familywise error rate (FWER) inflates:

$$\text{FWER} = 1 - (1 - \alpha)^m$$

where $m$ is the number of tests. For $k = 5$ groups: $m = 10$ pairwise tests; $\text{FWER} = 1 - (1 - .05)^{10} \approx .40$ — far above the nominal $\alpha = .05$.
ANOVA maintains the Type I error at $\alpha$ for the omnibus null hypothesis that all group means are simultaneously equal.
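The inflation formula above is easy to verify numerically. A minimal sketch (pure Python; the function name `fwer` is just an illustrative helper, not part of DataStatPro):

```python
alpha = 0.05

def fwer(k: int) -> float:
    """Familywise error rate for all pairwise t-tests among k groups,
    treating the tests as independent (an approximation)."""
    m = k * (k - 1) // 2          # number of pairwise comparisons
    return 1 - (1 - alpha) ** m

for k in (3, 5, 10):
    print(k, round(fwer(k), 3))
```

For $k = 10$ groups the 45 pairwise tests push the FWER above .90, which is why the omnibus F-test exists.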
1.4 The Relationship Between ANOVA and Regression
ANOVA and regression are mathematically equivalent. Both are special cases of the General Linear Model (GLM):

$$Y = X\beta + \varepsilon$$
In ANOVA, the predictors are categorical group membership variables (dummy or effect coded). Understanding this equivalence is essential for interpreting interactions and for moving to ANCOVA and mixed models.
1.5 Main Effects and Interactions
In factorial designs (multiple independent variables):
- A main effect is the effect of one independent variable (IV) averaged across all levels of the other IV(s).
- An interaction effect occurs when the effect of one IV depends on the level of another IV.
⚠️ When a significant interaction is present, main effects must be interpreted with extreme caution — the "average" effect of a variable may be misleading when the effect differs substantially across levels of the other variable.
1.6 Fixed vs. Random Effects
- Fixed effects: The levels of the IV are specifically chosen and are the only levels of interest (e.g., three specific drug dosages). Conclusions apply only to those levels.
- Random effects: The levels of the IV are a random sample from a larger population of possible levels (e.g., 10 randomly selected schools). Conclusions generalise to the population of levels.
- Mixed effects: Some factors are fixed, others are random. This is the basis of mixed models (also called hierarchical linear models).
Standard ANOVA assumes all factors are fixed. When factors are random, the denominator of the F-ratio changes.
1.7 Variance Explained: $\eta^2$, $\omega^2$, and $\varepsilon^2$
Effect sizes for ANOVA are variance-explained indices — they quantify what proportion of the total (or residual) variance is attributable to a given effect. These are reviewed in detail in the Mathematics section, but the key formulas are:

$$\eta^2 = \frac{SS_{between}}{SS_{total}} \qquad \omega^2 = \frac{SS_{between} - df_{between}\,MS_{within}}{SS_{total} + MS_{within}} \qquad \varepsilon^2 = \frac{SS_{between} - df_{between}\,MS_{within}}{SS_{total}}$$

Both $\omega^2$ and $\varepsilon^2$ correct for the positive bias of $\eta^2$ in finite samples and are preferred for reporting.
2. What is ANOVA?
2.1 The Core Idea
Analysis of Variance (ANOVA) is a parametric inferential procedure for testing whether three or more population means differ simultaneously. Despite its name, ANOVA tests mean differences by comparing variances — specifically, by assessing whether the variability between groups is larger than expected given the variability within groups.
The general form of the F-statistic:

$$F = \frac{\text{variance between groups}}{\text{variance within groups}} = \frac{MS_{between}}{MS_{within}}$$

A large F indicates that the between-group differences are large relative to random sampling error — evidence that at least one group mean differs from the others.
2.2 What ANOVA Tests and Does Not Test
ANOVA tells you:
- Whether at least one group mean differs from the others (omnibus test).
- How much variance in the outcome is explained by group membership ($\eta^2$, $\omega^2$).
ANOVA does NOT tell you:
- Which specific groups differ from each other (requires post-hoc tests or planned contrasts).
- The direction or magnitude of specific pairwise differences.
- Whether the omnibus difference is practically meaningful (requires effect sizes with CIs).
2.3 The ANOVA Family
| Design | Independent Variables | Participants | ANOVA Type |
|---|---|---|---|
| One factor, different participants per group | 1 (between) | Different | One-way between-subjects |
| One factor, same participants in all groups | 1 (within) | Same | One-way repeated measures |
| Two+ factors, different participants per cell | 2+ (between) | Different | Factorial between-subjects |
| Two factors: one between, one within | 1 between, 1 within | Mixed | Mixed (split-plot) ANOVA |
| Two+ factors, same participants in all cells | 2+ (within) | Same | Fully within-subjects factorial |
2.4 ANOVA in Context
The ANOVA test is one member of a broader family of procedures for comparing group means:
| Situation | Test |
|---|---|
| 2 groups, independent, normal | t-test (Welch's recommended) |
| 3+ groups, independent, normal, equal variances | One-way ANOVA |
| 3+ groups, independent, normal, unequal variances | Welch's one-way ANOVA |
| 3+ groups, independent, non-normal or ordinal | Kruskal-Wallis test |
| 3+ conditions, same participants, normal | Repeated measures ANOVA |
| 3+ conditions, same participants, non-normal | Friedman test |
| 2+ factors (between), normal | Factorial ANOVA |
| 1 between + 1 within factor | Mixed ANOVA |
| Controlling for a covariate | ANCOVA |
| Multiple dependent variables simultaneously | MANOVA |
3. The Mathematics Behind ANOVA
3.1 One-Way ANOVA: Sum of Squares Decomposition
Consider $k$ groups with $n_j$ observations in group $j$ and $N = \sum_j n_j$ total observations. Let $\bar{X}_j$ be the mean of group $j$ and $\bar{X}_{..}$ be the grand mean.
Grand mean:

$$\bar{X}_{..} = \frac{1}{N}\sum_{j=1}^{k}\sum_{i=1}^{n_j} X_{ij}$$

Total sum of squares ($SS_T$): total variability in the data.

$$SS_T = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar{X}_{..})^2$$

Between-groups sum of squares ($SS_B$): variability due to group differences.

$$SS_B = \sum_{j=1}^{k} n_j(\bar{X}_j - \bar{X}_{..})^2$$

Within-groups sum of squares ($SS_W$): variability within groups (error).

$$SS_W = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$$

Verification: $SS_T = SS_B + SS_W$.
3.2 Mean Squares and the F-Statistic
Mean squares are sums of squares divided by their degrees of freedom — they are variance estimates:

$$MS_B = \frac{SS_B}{k-1} \qquad MS_W = \frac{SS_W}{N-k}$$

The F-statistic:

$$F = \frac{MS_B}{MS_W}$$

Under $H_0$, the F-statistic follows an F-distribution with $(k-1,\ N-k)$ degrees of freedom. The p-value is:

$$p = P(F_{k-1,\,N-k} \ge F_{obs})$$
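The whole decomposition can be computed by hand in a few lines and checked against SciPy's built-in one-way ANOVA. A sketch with made-up data (not from this tutorial):

```python
from scipy.stats import f, f_oneway

# Three illustrative groups of four observations each.
groups = [[4, 5, 6, 5], [7, 8, 9, 8], [10, 11, 12, 11]]
all_obs = [x for g in groups for x in g]
N, k = len(all_obs), len(groups)
grand_mean = sum(all_obs) / N

# Sum-of-squares decomposition: SS_T = SS_B + SS_W.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)
F = ms_between / ms_within
p = f.sf(F, k - 1, N - k)

# Cross-check against scipy.stats.f_oneway.
F_scipy, p_scipy = f_oneway(*groups)
print(round(F, 3), round(p, 6))
```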
3.3 The Expected Mean Squares
Understanding why the F-ratio works requires examining the expected values of the mean squares under $H_0$ and $H_1$:

$$E[MS_W] = \sigma^2 \quad \text{(always an unbiased estimate of the error variance)}$$

$$E[MS_B] = \sigma^2 + \frac{\sum_j n_j(\mu_j - \mu)^2}{k-1}$$

Under $H_0$ (all $\mu_j$ equal): $E[MS_B] = \sigma^2$, so $E[F] \approx 1$.
Under $H_1$ (some $\mu_j$ differ): $E[MS_B] > \sigma^2$, so $E[F] > 1$.
The non-centrality parameter of the F-distribution:

$$\lambda = \frac{\sum_j n_j(\mu_j - \mu)^2}{\sigma^2}$$

This links the population effect size to the expected F-statistic and is used for power analysis.
3.4 The ANOVA Source Table
The standard ANOVA output is presented in a source table:
| Source | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Between groups | $SS_B$ | $k-1$ | $MS_B = SS_B/(k-1)$ | $MS_B/MS_W$ | $P(F \ge F_{obs})$ |
| Within groups (Error) | $SS_W$ | $N-k$ | $MS_W = SS_W/(N-k)$ | | |
| Total | $SS_T$ | $N-1$ | | | |
3.5 Effect Sizes for One-Way ANOVA
Eta squared ($\eta^2$) — biased, but widely reported:

$$\eta^2 = \frac{SS_B}{SS_T}$$

Omega squared ($\omega^2$) — bias-corrected, preferred:

$$\omega^2 = \frac{SS_B - (k-1)\,MS_W}{SS_T + MS_W}$$

Epsilon squared ($\varepsilon^2$) — alternative bias correction:

$$\varepsilon^2 = \frac{SS_B - (k-1)\,MS_W}{SS_T}$$

Cohen's $f$ — for power analysis:

$$f = \sqrt{\frac{\eta^2}{1 - \eta^2}}$$

Relationship between effect sizes: $\omega^2 \le \varepsilon^2 \le \eta^2$
Cohen's (1988) benchmarks:
| Label | $\eta^2$ or $\omega^2$ | $f$ |
|---|---|---|
| Small | .01 | .10 |
| Medium | .06 | .25 |
| Large | .14 | .40 |
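All four indices follow directly from an ANOVA table. A sketch with hypothetical table values (pure Python, not DataStatPro's code):

```python
import math

# Hypothetical one-way ANOVA table: SS_B = 100, SS_W = 300, k = 3, N = 30.
ss_between, ss_within = 100.0, 300.0
k, N = 3, 30
df_between, df_within = k - 1, N - k
ss_total = ss_between + ss_within
ms_within = ss_within / df_within

eta_sq = ss_between / ss_total
omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
epsilon_sq = (ss_between - df_between * ms_within) / ss_total
cohens_f = math.sqrt(eta_sq / (1 - eta_sq))

print(round(eta_sq, 3), round(omega_sq, 3), round(epsilon_sq, 3), round(cohens_f, 3))
```

Note how the bias corrections pull the estimates down: $\omega^2 \le \varepsilon^2 \le \eta^2$ holds for these values.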
3.6 Confidence Intervals for ANOVA Effect Sizes
CIs for $\eta^2$ and $\omega^2$ use the non-central F-distribution. The observed F-statistic follows a non-central F-distribution with non-centrality parameter $\lambda$ related to the population effect:

$$F_{obs} \sim F_{k-1,\,N-k}(\lambda)$$

The 95% CI bounds for $\lambda$ are found numerically (inverting the non-central F CDF), then converted to $\eta^2$ via $\eta^2 = \lambda / (\lambda + N)$.
DataStatPro computes these exact CIs automatically using numerical iteration.
3.7 Factorial ANOVA: Partitioning Variance for Multiple Factors
For a two-factor (A × B) between-subjects ANOVA with $a$ levels of A, $b$ levels of B, and $n$ observations per cell ($N = abn$):
| Source | SS | df |
|---|---|---|
| A (Main effect) | $SS_A$ | $a-1$ |
| B (Main effect) | $SS_B$ | $b-1$ |
| A×B (Interaction) | $SS_{AB}$ | $(a-1)(b-1)$ |
| Within (Error) | $SS_W$ | $ab(n-1)$ |
| Total | $SS_T$ | $N-1$ |
Computing each SS:
Let $\bar{A}_i$ = mean of level $i$ of factor A, $\bar{B}_j$ = mean of level $j$ of factor B, $\bar{AB}_{ij}$ = cell mean, $\bar{G}$ = grand mean.

$$SS_A = bn\sum_i (\bar{A}_i - \bar{G})^2 \qquad SS_B = an\sum_j (\bar{B}_j - \bar{G})^2$$

$$SS_{AB} = n\sum_i\sum_j (\bar{AB}_{ij} - \bar{A}_i - \bar{B}_j + \bar{G})^2 \qquad SS_W = \sum_i\sum_j\sum_m (X_{ijm} - \bar{AB}_{ij})^2$$

F-ratios (fixed-effects model, all denominators are $MS_W$): $F_A = MS_A/MS_W$, $F_B = MS_B/MS_W$, $F_{AB} = MS_{AB}/MS_W$.
3.8 Partial Eta Squared ($\eta_p^2$) for Factorial Designs
In factorial ANOVA, partial eta squared isolates the effect of one factor after controlling for other effects:

$$\eta_p^2 = \frac{SS_{effect}}{SS_{effect} + SS_{error}}$$

⚠️ In factorial designs, the sum of all partial $\eta_p^2$ values can exceed 1.0. They are NOT proportions of total variance — only $\eta^2$ carries that interpretation. Always label the statistic precisely: write $\eta_p^2$, not $\eta^2$, for partial values.
Partial omega squared (preferred — bias-corrected):

$$\omega_p^2 = \frac{SS_{effect} - df_{effect}\,MS_{error}}{SS_{effect} + (N - df_{effect})\,MS_{error}}$$
3.9 Repeated Measures ANOVA: Within-Subjects Decomposition
For a one-way repeated measures design with $k$ conditions and $n$ participants:
The key feature: between-subjects variability is removed from the error term, which dramatically increases power when individual differences are large.
| Source | SS | df |
|---|---|---|
| Between subjects | $SS_{subjects}$ | $n-1$ |
| Conditions (Within) | $SS_{conditions}$ | $k-1$ |
| Error (Residual) | $SS_{error}$ | $(k-1)(n-1)$ |
| Total | $SS_T$ | $kn-1$ |
Generalised eta squared ($\eta_G^2$; Olejnik & Algina, 2003) is recommended for repeated measures designs because it is comparable across between-subjects and within-subjects designs:

$$\eta_G^2 = \frac{SS_{conditions}}{SS_{conditions} + SS_{subjects} + SS_{error}}$$
3.10 Sphericity and the Mauchly Test
Repeated measures ANOVA requires the sphericity assumption: the variances of the differences between all pairs of conditions are equal. Formally, for all pairs $(i, j)$:

$$\text{Var}(X_i - X_j) = \text{constant}$$
Mauchly's test evaluates this assumption:
- $H_0$: the sphericity assumption holds.
- A significant result ($p < .05$) indicates a sphericity violation.
Epsilon ($\varepsilon$) corrections adjust the degrees of freedom when sphericity is violated. Two commonly used corrections:
Greenhouse-Geisser (GG) correction:
$\frac{1}{k-1} \le \hat{\varepsilon}_{GG} \le 1$; $\hat{\varepsilon} = 1$ means sphericity holds exactly.
Huynh-Feldt (HF) correction (less conservative than GG, preferred when $\hat{\varepsilon}_{GG} > .75$):
Corrected degrees of freedom: $df_1 = \hat{\varepsilon}(k-1)$, $df_2 = \hat{\varepsilon}(k-1)(n-1)$
Decision rule for epsilon corrections:
| Situation | Recommended Correction |
|---|---|
| Sphericity holds (Mauchly $p > .05$) | None (uncorrected) |
| $\hat{\varepsilon} > .75$ | Huynh-Feldt |
| $\hat{\varepsilon} \le .75$ | Greenhouse-Geisser |
| Severe violation | Multivariate approach (MANOVA) |
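Applying an epsilon correction leaves the F-statistic alone and only shrinks the reference degrees of freedom, which makes the test more conservative. A sketch with hypothetical values (the epsilon of 0.70 is invented for illustration):

```python
from scipy.stats import f

# Hypothetical repeated measures design: k = 4 conditions, n = 20 participants.
k, n = 4, 20
F_obs = 3.8
eps_gg = 0.70                     # hypothetical Greenhouse-Geisser epsilon

df1, df2 = k - 1, (k - 1) * (n - 1)
df1_corr, df2_corr = eps_gg * df1, eps_gg * df2   # corrected (possibly fractional) df

p_uncorrected = f.sf(F_obs, df1, df2)
p_corrected = f.sf(F_obs, df1_corr, df2_corr)     # same F, shrunken df
print(round(p_uncorrected, 4), round(p_corrected, 4))
```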
3.11 Mixed ANOVA: Between + Within Factors
A mixed ANOVA (also split-plot ANOVA) includes both between-subjects and within-subjects factors. For a design with one between factor (A, $a$ levels), one within factor (B, $b$ levels), and $n$ participants per group:
| Source | SS | df | MS | F |
|---|---|---|---|---|
| A (Between) | $SS_A$ | $a-1$ | $MS_A$ | $MS_A / MS_{S(A)}$ |
| S(A) — Subjects within A | $SS_{S(A)}$ | $a(n-1)$ | $MS_{S(A)}$ | — |
| B (Within) | $SS_B$ | $b-1$ | $MS_B$ | $MS_B / MS_{B \times S(A)}$ |
| A×B | $SS_{AB}$ | $(a-1)(b-1)$ | $MS_{AB}$ | $MS_{AB} / MS_{B \times S(A)}$ |
| B×S(A) — Error | $SS_{B \times S(A)}$ | $a(n-1)(b-1)$ | $MS_{B \times S(A)}$ | — |
| Total | $SS_T$ | $abn-1$ | | |
Note the two separate error terms:
- Between-subjects effects (A) are tested against $MS_{S(A)}$ (between-subjects error).
- Within-subjects effects (B, A×B) are tested against $MS_{B \times S(A)}$ (within-subjects error).
4. Assumptions of ANOVA
4.1 Normality of Residuals
ANOVA assumes that the residuals (differences between observed values and group means) are normally distributed within each population:

$$\varepsilon_{ij} \sim N(0, \sigma^2)$$
How to check:
- Shapiro-Wilk test on residuals (most powerful in small-to-moderate samples).
- Q-Q plots of residuals: points should follow the diagonal.
- Histograms of residuals: should be approximately bell-shaped.
- Skewness and excess kurtosis of residuals (values near 0 expected).
Robustness: ANOVA is robust to mild normality violations, particularly when:
- Group sizes are equal (balanced design).
- Samples are moderately large per group (the central limit theorem applies).
- Distributions are symmetric even if non-normal.
When violated: Use the Kruskal-Wallis test (independent groups) or the Friedman test (repeated measures) as non-parametric alternatives. Consider data transformations (log, square root, Box-Cox) for skewed distributions.
4.2 Homogeneity of Variance (Homoscedasticity)
Standard ANOVA assumes that all populations have equal variances:

$$\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$$

This is the homoscedasticity assumption and is required for $MS_W$ to serve as a valid pooled estimate of the common population variance $\sigma^2$.
How to check:
- Levene's test (preferred — robust to non-normality): $H_0$: all group variances are equal.
- Brown-Forsythe test (more robust, uses median rather than mean).
- Bartlett's test (powerful but sensitive to non-normality — avoid for non-normal data).
- Variance ratio rule: if $s^2_{max} / s^2_{min} > 4$, heterogeneity is concerning.
Robustness: ANOVA is relatively robust to heterogeneity when group sizes are equal. When group sizes are unequal AND variances are unequal, ANOVA can have severely inflated or deflated Type I error rates.
When violated: Use Welch's one-way ANOVA (with Games-Howell post-hoc tests), which does not assume equal variances and is recommended as the default for independent designs.
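Welch's one-way ANOVA replaces the pooled error term with precision weights $w_j = n_j / s_j^2$, so unequal variances no longer distort the test. A sketch of the Welch (1951) statistic in plain Python with SciPy for the p-value; the data are invented and the function `welch_anova` is an illustrative helper, not DataStatPro's implementation:

```python
from scipy.stats import f

def welch_anova(groups):
    """Welch's heteroscedasticity-robust one-way ANOVA (sketch)."""
    k = len(groups)
    means = [sum(g) / len(g) for g in groups]
    variances = [sum((x - m) ** 2 for x in g) / (len(g) - 1)
                 for g, m in zip(groups, means)]
    w = [len(g) / v for g, v in zip(groups, variances)]      # weights n_j / s_j^2
    W = sum(w)
    weighted_mean = sum(wj * m for wj, m in zip(w, means)) / W

    # Numerator: weighted between-group variability.
    A = sum(wj * (m - weighted_mean) ** 2 for wj, m in zip(w, means)) / (k - 1)
    # Denominator correction term and Welch degrees of freedom.
    term = sum((1 - wj / W) ** 2 / (len(g) - 1) for wj, g in zip(w, groups))
    B = 1 + (2 * (k - 2) / (k ** 2 - 1)) * term
    F_w = A / B
    df1, df2 = k - 1, (k ** 2 - 1) / (3 * term)
    return F_w, df1, df2, f.sf(F_w, df1, df2)

F_w, df1, df2, p = welch_anova([[1, 2, 3, 4, 5],
                                [11, 12, 13, 14, 15],
                                [21, 22, 23, 24, 25]])
print(round(F_w, 2), df1, round(df2, 1))
```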
4.3 Independence of Observations
All observations must be independent of each other, both within and across groups. Dependence typically arises from:
- Clustered or nested data (students in classrooms, patients in hospitals).
- Repeated measurements on the same participant (use repeated measures ANOVA instead).
- Time series data.
- Family or matched data.
When violated: Use repeated measures ANOVA (for within-subjects data), mixed models (for nested or clustered data), or multilevel ANOVA (for hierarchical designs).
4.4 Sphericity (Repeated Measures Only)
As described in Section 3.10, repeated measures ANOVA additionally requires sphericity — that the variances of all pairwise difference scores are equal. This is a stronger assumption than homogeneity of variance.
When violated: Apply Greenhouse-Geisser or Huynh-Feldt corrections to the degrees of freedom, or use the multivariate approach (MANOVA on the repeated measures).
4.5 Interval Scale of Measurement
The dependent variable must be measured on at least an interval scale (equal-spaced intervals). Ordinal data (e.g., Likert scales) technically violate this assumption.
When violated: Use non-parametric alternatives (Kruskal-Wallis, Friedman) or analyse using ordinal regression.
4.6 Absence of Significant Outliers
ANOVA is based on means and is sensitive to extreme outliers, particularly in small samples. Outliers inflate $SS_W$ and shift group means unpredictably.
How to check:
- Boxplots per group: values beyond $1.5 \times IQR$ from the quartiles are mild outliers; beyond $3 \times IQR$ are extreme.
- Standardised residuals: $|z| > 3$ flags potential outliers.
- Studentised residuals from the ANOVA model.
When outliers present: Investigate the cause. Report analyses with and without outliers. Consider trimmed mean ANOVA or Kruskal-Wallis as robust alternatives.
4.7 Assumption Summary Table
| Assumption | One-Way | Factorial | Repeated Measures | Mixed | How to Check | Remedy |
|---|---|---|---|---|---|---|
| Normality | ✅ | ✅ | ✅ (residuals) | ✅ | Shapiro-Wilk, Q-Q | Kruskal-Wallis / Friedman |
| Homogeneity of variance | ✅ | ✅ | — | ✅ (between) | Levene's | Welch's ANOVA |
| Independence | ✅ | ✅ | ✅ (between subjects) | ✅ | Design review | Mixed models |
| Sphericity | — | — | ✅ | ✅ (within part) | Mauchly's test | GG/HF correction |
| Interval scale | ✅ | ✅ | ✅ | ✅ | Measurement theory | Non-parametric |
| No severe outliers | ✅ | ✅ | ✅ | ✅ | Boxplots, residuals | Trimmed means / robust |
5. Types of ANOVA
5.1 Classification by Design
By Number of Independent Variables
| IVs | Design Name | Example |
|---|---|---|
| 1 | One-way ANOVA | Effect of teaching method (3 levels) on test scores |
| 2 | Two-way (factorial) ANOVA | Effect of drug (3 levels) and sex (2 levels) on pain |
| 3 | Three-way ANOVA | Drug × dose × time on response |
| $k$ | $k$-way factorial ANOVA | Generalisation of the above |
By Type of Factor
| Factor Type | Description | Design |
|---|---|---|
| Between-subjects | Different participants per level | Standard ANOVA |
| Within-subjects | Same participants in all levels | Repeated measures ANOVA |
| Mixed | Combination of between and within | Mixed (split-plot) ANOVA |
5.2 Choosing the Correct ANOVA Design
What is the number of independent variables?
├── 1 IV
│ └── Is the same participant in all conditions?
│ ├── NO (between-subjects) → One-way between-subjects ANOVA
│ └── YES (within-subjects) → One-way repeated measures ANOVA
└── 2+ IVs
└── What type are the IVs?
├── All between-subjects → Factorial between-subjects ANOVA
├── All within-subjects → Fully within-subjects factorial ANOVA
└── Mixed (some between, some within) → Mixed ANOVA
5.3 Type I, II, and III Sums of Squares
In unbalanced designs (unequal cell sizes), the partition of SS depends on the order in which effects are entered. Three conventions exist:
| Type | Description | When to Use |
|---|---|---|
| Type I (Sequential) | SS for each effect controlling for effects entered earlier | When the order of entry is theoretically meaningful |
| Type II (Hierarchical) | SS for each effect controlling for all other effects at the same level | When there is no significant interaction |
| Type III (Marginal) | SS for each effect controlling for all other effects including interactions | When there is a significant interaction; most common default in SPSS |
⚠️ For balanced designs (equal cell sizes), all three types give identical results. For unbalanced designs, Type III is the most commonly reported but requires full-rank parameterisation (effect coding or deviation coding, not dummy coding). Always specify which type was used when reporting factorial ANOVA results.
6. Using the ANOVA Calculator Component
The ANOVA Calculator component in DataStatPro provides a comprehensive tool for running, diagnosing, visualising, and reporting ANOVA designs and their alternatives.
Step-by-Step Guide
Step 1 — Select the ANOVA Design
Choose from the "ANOVA Type" dropdown:
- One-Way Between-Subjects ANOVA: One IV, independent groups.
- Factorial Between-Subjects ANOVA: Two or more IVs, independent groups.
- One-Way Repeated Measures ANOVA: One IV, same participants in all conditions.
- Mixed ANOVA: One or more between-subjects IVs and one or more within-subjects IVs.
- Welch's One-Way ANOVA: Robust to heterogeneity of variance.
- Kruskal-Wallis Test: Non-parametric one-way.
- Friedman Test: Non-parametric repeated measures.
Step 2 — Input Method
- Raw data: Upload or paste the dataset. DataStatPro performs all assumption checks automatically, computes effect sizes, and generates visualisations.
- Summary statistics: Enter group means, SDs, and sample sizes ($n$). Full assumption checks are not available, but all inferential statistics and effect sizes are computed.
- ANOVA table values: Enter SS, df, and MS values from a published table to compute effect sizes, power, and CIs.
Step 3 — Specify the Design Structure
- Number of groups/levels for each factor.
- Factor names and level labels for clear output labelling.
- Cell sizes (equal or unequal — DataStatPro auto-detects balance).
- SS Type for factorial designs (Type I, II, or III — default: Type III).
Step 4 — Select Assumption Tests
DataStatPro automatically runs, with results displayed in a colour-coded panel:
- ✅ Shapiro-Wilk normality test on residuals (per group for small samples).
- ✅ Levene's test for homogeneity of variance (between-subjects designs).
- ✅ Mauchly's test for sphericity (repeated measures designs), with Greenhouse-Geisser and Huynh-Feldt corrections automatically applied when sphericity is violated.
- ✅ Boxplots per group for outlier detection.
Step 5 — Select Post-Hoc Tests
When the omnibus F is significant, specify post-hoc tests:
- Tukey HSD — balanced designs, equal variances (controls FWER).
- Bonferroni — conservative; any design.
- Holm-Bonferroni — less conservative sequential procedure.
- Scheffé — most conservative; allows all possible contrasts.
- Games-Howell — unequal variances or unequal $n$ (recommended with Welch's ANOVA).
- Dunnett's test — comparing all groups to a single control group.
- Custom planned contrasts — specify weights for specific a priori comparisons.
Step 6 — Select Effect Sizes
- $\omega^2$ (preferred): Bias-corrected estimate for one-way ANOVA.
- $\omega_p^2$ (preferred for factorial): Partial omega squared.
- $\eta^2$ (common): Biased; provided for comparison.
- $\eta_p^2$ (common for factorial): Partial eta squared.
- $\eta_G^2$ (recommended for repeated measures): Generalised eta squared.
- Cohen's $f$: For power analysis.
- 95% CIs for all effect sizes via non-central F-distribution.
Step 7 — Select Display Options
- ✅ Full ANOVA source table with F-statistics and p-values.
- ✅ Descriptive statistics (mean, SD, SE, 95% CI) per group/cell.
- ✅ Effect size estimates with 95% CIs for each effect.
- ✅ Assumption test results panel.
- ✅ Post-hoc pairwise comparison table with adjusted p-values and effect sizes.
- ✅ Interaction plot (line plot of cell means) for factorial designs.
- ✅ Profile plots for repeated measures.
- ✅ Raincloud plots (half violin + boxplot + raw data) per group.
- ✅ Power analysis and required $n$ for each effect.
- ✅ APA 7th edition results paragraph (auto-generated).
Step 8 — Run the Analysis
Click "Run ANOVA". DataStatPro will:
- Compute the full ANOVA source table.
- Apply sphericity corrections automatically if Mauchly's test is significant.
- Run all selected post-hoc tests and planned contrasts.
- Compute effect sizes with exact CIs.
- Generate all visualisations.
- Output an APA-compliant results paragraph.
7. One-Way Between-Subjects ANOVA
7.1 Purpose and Design
The one-way between-subjects ANOVA tests whether the means of three or more independent groups differ significantly. It is the generalisation of the independent samples t-test to $k \ge 3$ groups (when $k = 2$, $F = t^2$).
Common applications:
- Comparing exam scores across three teaching methods (Lecture, Flipped, Project-Based).
- Evaluating the effect of four drug dosage levels on pain rating.
- Assessing anxiety differences across five diagnostic categories.
- Comparing productivity across three management styles.
7.2 Full Procedure
Step 1 — State hypotheses: $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$; $H_1$: at least one $\mu_j$ differs.
Step 2 — Compute grand mean and group means: $\bar{X}_{..}$ and $\bar{X}_1, \ldots, \bar{X}_k$.
Step 3 — Compute sums of squares: $SS_B$, $SS_W$, and $SS_T = SS_B + SS_W$ (Section 3.1).
Step 4 — Compute degrees of freedom: $df_B = k - 1$, $df_W = N - k$.
Step 5 — Compute mean squares and F: $MS_B = SS_B/df_B$, $MS_W = SS_W/df_W$, $F = MS_B/MS_W$.
Step 6 — Compute p-value and make a decision
Reject $H_0$ if $p < \alpha$.
Step 7 — Compute effect sizes: $\eta^2$, $\omega^2$, and Cohen's $f$ (Section 3.5).
Step 8 — Conduct post-hoc tests or planned contrasts
If $H_0$ is rejected, identify which groups differ (Section 11).
7.3 Computing and from F
When only the F-statistic, degrees of freedom, and $N$ are reported:

$$\eta^2 = \frac{df_B \cdot F}{df_B \cdot F + df_W} \qquad \omega^2 \approx \frac{df_B\,(F - 1)}{df_B\,(F - 1) + N} \ \text{(approximate)}$$
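These conversions are handy when re-analysing published results. A sketch with hypothetical reported values (pure Python):

```python
# Suppose a paper reports F(2, 27) = 4.00 for a one-way ANOVA.
F, df1, df2 = 4.0, 2, 27
N = df1 + df2 + 1                 # total sample size recovered from the df

eta_sq = (df1 * F) / (df1 * F + df2)
omega_sq = (df1 * (F - 1)) / (df1 * (F - 1) + N)   # approximate
print(round(eta_sq, 4), round(omega_sq, 4))
```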
7.4 Interpreting the Omnibus F-Test
The omnibus F-test is a global test. A significant result tells you only that:
- At least one pair of group means differs significantly.
- This difference is unlikely to have arisen by sampling error alone.
It does not tell you which groups differ, by how much, or in what direction. Post-hoc tests (Section 11) are required to answer these questions.
💡 When groups were theoretically predicted to differ in specific ways before data collection, use planned contrasts rather than (or in addition to) the omnibus F-test. Planned contrasts are more powerful and more informative than post-hoc tests.
8. Factorial Between-Subjects ANOVA
8.1 Purpose and Design
Factorial ANOVA simultaneously examines the effects of two or more IVs and their interactions on a continuous DV. It is more efficient than running separate one-way ANOVAs because it:
- Tests all main effects and interactions simultaneously.
- Controls FWER across all tests.
- Reveals interaction effects that separate one-way analyses cannot detect.
- Requires fewer participants than running separate experiments for each factor.
8.2 The Concept of Interaction
An interaction exists when the effect of one IV differs depending on the level of another IV. Interactions are the most important and often the most theoretically interesting finding in factorial designs.
Types of interactions:
| Type | Description | Pattern |
|---|---|---|
| Ordinal | Lines in interaction plot do not cross; one group always higher | Parallelism violated but ranking preserved |
| Disordinal (crossover) | Lines cross; one group higher at some levels, lower at others | Ranking reverses |
| Spreading | Effect of A increases (or decreases) with level of B | Lines fan out |
Interpreting an interaction:
When a significant AB interaction is found:
- Do not interpret main effects in isolation — they are averages that may be misleading when the interaction is substantial.
- Probe the interaction with simple effects analysis: test the effect of A separately at each level of B (or vice versa).
- Plot the interaction with a line plot: this is essential for understanding the pattern.
8.3 Simple Effects Analysis
Simple effects decompose the interaction by examining the effect of one IV at each level of the other IV. For a 2×3 design (A with levels $a_1, a_2$; B with levels $b_1, b_2, b_3$):
- Simple effect of A at $b_1$: compare $a_1$ vs. $a_2$ at $b_1$.
- Simple effect of A at $b_2$: compare $a_1$ vs. $a_2$ at $b_2$.
- Simple effect of A at $b_3$: compare $a_1$ vs. $a_2$ at $b_3$.
Simple effects use $MS_W$ from the full factorial model as the error term (pooled error), which is more stable than separate-group estimates.
8.4 Effect Sizes in Factorial ANOVA
For factorial designs, report partial effect sizes for each effect:
Partial eta squared (common, biased):

$$\eta_p^2 = \frac{SS_{effect}}{SS_{effect} + SS_W}$$

Partial omega squared (preferred, bias-corrected):

$$\omega_p^2 = \frac{SS_{effect} - df_{effect}\,MS_W}{SS_{effect} + (N - df_{effect})\,MS_W}$$

Generalised eta squared (recommended for between-subjects factorial designs, Olejnik & Algina, 2003):

$$\eta_G^2 = \frac{SS_{effect}}{SS_{effect} + SS_W}$$

For purely between-subjects designs with all factors manipulated, $\eta_G^2 = \eta_p^2$ for each effect.
9. One-Way Repeated Measures ANOVA
9.1 Purpose and Design
One-way repeated measures ANOVA tests whether means differ across conditions when the same participants are measured in all conditions. It is the generalisation of the paired t-test to $k \ge 3$ conditions.
Common applications:
- Measuring depression at three time points (pre, post, follow-up).
- Comparing cognitive performance across four task difficulty levels.
- Evaluating preference ratings for five product variants.
- Assessing physiological response across six stimulus intensities.
Advantages over one-way between-subjects ANOVA:
- Greater statistical power: between-subjects variability is removed from the error term.
- Fewer participants needed: each participant contributes $k$ observations.
- Controls individual difference confounds: the same person is compared across conditions.
Disadvantages:
- Carryover effects: experiencing one condition may affect performance in others (counterbalance with randomised condition order or include adequate wash-out periods).
- Sphericity assumption: more complex than the equal-variance assumption.
- Attrition: losing participants eliminates all their data from all conditions.
9.2 Full Procedure
Step 1 — Compute condition means and participant means
$\bar{C}_j$ = mean of condition $j$; $\bar{P}_i$ = mean of participant $i$; $\bar{G}$ = grand mean.
Step 2 — Compute sums of squares

$$SS_{subjects} = k\sum_i (\bar{P}_i - \bar{G})^2 \qquad SS_{conditions} = n\sum_j (\bar{C}_j - \bar{G})^2$$

$$SS_{error} = SS_T - SS_{subjects} - SS_{conditions}$$

Step 3 — Degrees of freedom: $df_{cond} = k - 1$, $df_{error} = (k-1)(n-1)$.
Step 4 — Mean squares and F: $MS_{cond} = SS_{cond}/df_{cond}$, $MS_{error} = SS_{error}/df_{error}$, $F = MS_{cond}/MS_{error}$.
Step 5 — Sphericity check and correction
Run Mauchly's test. If violated, apply the GG or HF correction to the degrees of freedom: $df' = \hat{\varepsilon} \cdot df$.
The F-statistic is unchanged; only the reference distribution (via corrected df) changes.
Step 6 — Effect size
Generalised eta squared ($\eta_G^2$) — recommended for repeated measures:

$$\eta_G^2 = \frac{SS_{cond}}{SS_{cond} + SS_{subjects} + SS_{error}}$$

Partial eta squared ($\eta_p^2$) — common but inflated:

$$\eta_p^2 = \frac{SS_{cond}}{SS_{cond} + SS_{error}}$$

Partial omega squared ($\omega_p^2$) — bias-corrected, applying the same correction as in between-subjects designs.
💡 Use $\eta_G^2$ for repeated measures when comparing effect sizes across studies using different designs (between-subjects vs. within-subjects), as it is the most design-comparable measure. For purely within-design comparisons, $\omega_p^2$ is the least-biased choice.
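The within-subjects decomposition above fits in a few lines. A sketch with a tiny invented dataset (rows = participants, columns = conditions; not DataStatPro's implementation):

```python
from scipy.stats import f

# Illustrative data: n = 2 participants measured in k = 3 conditions.
data = [[10, 12, 14],
        [14, 18, 22]]
n, k = len(data), len(data[0])
grand = sum(sum(row) for row in data) / (n * k)
p_means = [sum(row) / k for row in data]                      # participant means
c_means = [sum(row[j] for row in data) / n for j in range(k)]  # condition means

ss_total = sum((x - grand) ** 2 for row in data for x in row)
ss_subjects = k * sum((pm - grand) ** 2 for pm in p_means)
ss_conditions = n * sum((cm - grand) ** 2 for cm in c_means)
ss_error = ss_total - ss_subjects - ss_conditions              # residual

df_cond, df_err = k - 1, (k - 1) * (n - 1)
F_rm = (ss_conditions / df_cond) / (ss_error / df_err)
eta_g = ss_conditions / (ss_conditions + ss_subjects + ss_error)
p = f.sf(F_rm, df_cond, df_err)
print(round(F_rm, 2), round(eta_g, 3))
```

Removing the subjects term from the error is exactly why the repeated measures F is larger than a between-subjects F on the same numbers.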
10. Mixed ANOVA
10.1 Purpose and Design
Mixed ANOVA combines at least one between-subjects factor and at least one within-subjects factor. It is among the most commonly used designs in psychology, medicine, and education because most longitudinal experiments involve:
- A between-subjects factor: Treatment group vs. control group.
- A within-subjects factor: Time (pre, post, follow-up).
The mixed ANOVA tests:
- Main effect of the between-subjects factor (e.g., treatment vs. control, collapsed across time).
- Main effect of the within-subjects factor (e.g., change over time, collapsed across groups).
- Interaction (e.g., does the pattern of change over time differ between treatment and control?). This interaction is typically the primary research question.
10.2 The Primary Interaction: Time × Group
In a treatment evaluation study, the Time × Group interaction answers: "Does the treatment group change differently over time compared to the control group?" This is typically the most important test in a mixed ANOVA:
- If the interaction is significant: the time trajectories differ between groups — strong evidence of a treatment effect beyond any change in the control group.
- If the interaction is non-significant: the pattern of change does not differ between groups — the treatment does not appear to differentially affect change over time.
10.3 Probing a Significant Interaction
When the Group × Time interaction is significant:
Option 1 — Simple effects of Time within each Group: Conduct one-way repeated measures ANOVA (or paired t-tests with correction) separately for each group. This answers: "Did each group change significantly over time?"
Option 2 — Simple effects of Group at each Time point: Conduct independent t-tests (or one-way ANOVA) separately at each time point with Bonferroni correction. This answers: "At which time points do the groups differ?"
10.4 Sphericity in Mixed ANOVA
The sphericity assumption applies to the within-subjects factor and any interaction involving the within-subjects factor. Mauchly's test and GG/HF corrections apply specifically to:
- The main effect of the within-subjects factor (B).
- The interaction A×B.
The between-subjects main effect (A) does not require sphericity but does require homogeneity of variance across groups (Levene's test).
10.5 Effect Sizes for Mixed ANOVA
For mixed ANOVA, generalised eta squared ($\eta_G^2$) is strongly recommended for all effects because it accounts for the different variance structures of between-subjects and within-subjects components:

$$\eta_G^2 = \frac{SS_{effect}}{SS_{effect} + SS_{S(A)} + SS_{B \times S(A)}}$$

This allows direct comparison of effect sizes from mixed designs with purely between-subjects or purely within-subjects designs.
11. Post-Hoc Tests and Planned Contrasts
11.1 The Need for Post-Hoc Testing
A significant omnibus F-test tells you only that some group means differ. Post-hoc tests are pairwise comparisons conducted after a significant F-test to determine which specific groups differ, while controlling the familywise error rate.
The key trade-off: Controlling the FWER requires more conservative critical values, which reduces power for individual comparisons. Choosing a post-hoc test involves balancing Type I and Type II error control.
11.2 Overview of Post-Hoc Tests
| Test | FWER Control | Assumes Equal Variances | Best For |
|---|---|---|---|
| Tukey HSD | ✅ Exact for balanced | ✅ Yes | Balanced designs, all pairwise |
| Tukey-Kramer | ✅ Approximate | ✅ Yes | Unbalanced designs, all pairwise |
| Bonferroni | ✅ Conservative | ❌ No | Any design, any set of comparisons |
| Holm-Bonferroni | ✅ Less conservative | ❌ No | Any design; preferred over Bonferroni |
| Scheffé | ✅ Most conservative | ✅ Yes | All possible contrasts (not just pairwise) |
| Games-Howell | ✅ Approximate | ❌ No | Unequal variances or unequal $n$ |
| Dunnett | ✅ Optimal | ✅ Yes | All groups vs. one control group |
| Fisher LSD | ❌ No control | ✅ Yes | Exploratory only; requires significant omnibus $F$ |
11.3 Tukey's HSD — Full Procedure
Tukey's Honestly Significant Difference (HSD) is the most commonly used post-hoc test for balanced designs with equal group variances. It controls the FWER at exactly $\alpha$ for all pairwise comparisons.
Critical value: the studentised range statistic $q_{\alpha,\,k,\,df_W}$, where $k$ is the number of groups and $df_W = N - k$.
Minimum significant difference (MSD): $\text{HSD} = q_{\alpha,\,k,\,df_W} \sqrt{\dfrac{MS_W}{n}}$
For unequal group sizes (Tukey-Kramer method): $\text{HSD}_{ij} = q_{\alpha,\,k,\,df_W} \sqrt{\dfrac{MS_W}{2}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}$
Declare groups $i$ and $j$ significantly different if $|\bar{X}_i - \bar{X}_j| \geq \text{HSD}$.
95% CI for the pairwise difference $\mu_i - \mu_j$: $(\bar{X}_i - \bar{X}_j) \pm q_{.05,\,k,\,df_W} \sqrt{\dfrac{MS_W}{2}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}$
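The procedure above takes only a few lines of Python. The sketch below is illustrative (the helper name `tukey_hsd_pairs` and the toy data are invented here) and uses SciPy's `studentized_range` distribution for the critical value:

```python
# Illustrative sketch: all pairwise Tukey/Tukey-Kramer comparisons using
# SciPy's studentised range distribution. Function name and data are
# hypothetical, not DataStatPro's implementation.
import numpy as np
from scipy.stats import studentized_range

def tukey_hsd_pairs(groups, alpha=0.05):
    """Return (i, j, mean difference, significant?) for every pair."""
    k = len(groups)
    ns = np.array([len(g) for g in groups])
    means = np.array([np.mean(g) for g in groups])
    N = ns.sum()
    # Pooled within-group mean square (the one-way ANOVA error term)
    ms_w = sum(((np.asarray(g, float) - m) ** 2).sum()
               for g, m in zip(groups, means)) / (N - k)
    q_crit = studentized_range.ppf(1 - alpha, k, N - k)
    out = []
    for i in range(k):
        for j in range(i + 1, k):
            # Tukey-Kramer SE; reduces to sqrt(MS_W / n) when balanced
            se = np.sqrt(ms_w / 2 * (1 / ns[i] + 1 / ns[j]))
            diff = means[i] - means[j]
            out.append((i, j, diff, abs(diff) >= q_crit * se))
    return out

pairs = tukey_hsd_pairs([[1, 2, 3, 4, 5],
                         [1.5, 2.5, 3.5, 4.5, 5.5],
                         [11, 12, 13, 14, 15]])
```

Recent SciPy versions also ship a ready-made `scipy.stats.tukey_hsd`; the manual version is shown only to make the MSD computation explicit.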
11.4 Games-Howell Test — Unequal Variances
Games-Howell is the recommended post-hoc test when variances are unequal (Levene's test significant) or group sizes differ substantially. It uses Welch-Satterthwaite degrees of freedom for each pairwise comparison:
Standard error for pair $(i, j)$: $SE_{ij} = \sqrt{\dfrac{s_i^2}{n_i} + \dfrac{s_j^2}{n_j}}$
Test statistic: $q_{ij} = \dfrac{|\bar{X}_i - \bar{X}_j|}{SE_{ij}/\sqrt{2}}$
Degrees of freedom (Welch-Satterthwaite): $df_{ij} = \dfrac{\left(\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}\right)^2}{\frac{(s_i^2/n_i)^2}{n_i - 1} + \frac{(s_j^2/n_j)^2}{n_j - 1}}$
Significance is assessed against the studentised range distribution $q_{\alpha,\,k,\,df_{ij}}$.
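These three quantities map directly onto code. A hedged sketch of a single Games-Howell comparison (the helper name is illustrative; `k` is the total number of groups in the design):

```python
# Hedged sketch of one Games-Howell pairwise comparison (hypothetical
# helper, not a library API).
import numpy as np
from scipy.stats import studentized_range

def games_howell_pair(x, y, k, alpha=0.05):
    x, y = np.asarray(x, float), np.asarray(y, float)
    vi, vj = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    se = np.sqrt(vi + vj)
    # Welch-Satterthwaite df, computed separately for each pair
    df = (vi + vj) ** 2 / (vi ** 2 / (len(x) - 1) + vj ** 2 / (len(y) - 1))
    q = abs(x.mean() - y.mean()) / (se / np.sqrt(2))  # studentised-range scale
    p = studentized_range.sf(q, k, df)
    return q, df, p
```

Note the design choice: each pair gets its own standard error and its own df, which is exactly what makes the test robust to heteroscedasticity.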
11.5 Planned Contrasts — A Priori Comparisons
Planned contrasts (a priori comparisons) are specific, theoretically motivated comparisons formulated before data collection. They are more powerful than post-hoc tests because:
- They do not require a significant omnibus F-test.
- They allow targeted tests of specific hypotheses.
- Fewer comparisons means a less severe FWER correction (or none, for orthogonal contrasts).
Contrast coefficients: A contrast is a weighted sum of group means, $\psi = \sum_j c_j \mu_j$, where $\sum_j c_j = 0$.
Examples for $k = 4$ groups (Control, Drug A, Drug B, Drug C):
| Contrast | $c_1$ (Control) | $c_2$ (A) | $c_3$ (B) | $c_4$ (C) | Comparison |
|---|---|---|---|---|---|
| $\psi_1$ | $3$ | $-1$ | $-1$ | $-1$ | Control vs. all treatments |
| $\psi_2$ | $0$ | $2$ | $-1$ | $-1$ | Drug A vs. B and C |
| $\psi_3$ | $0$ | $0$ | $1$ | $-1$ | Drug B vs. C |
Orthogonal contrasts are statistically independent ($\sum_j c_{1j} c_{2j} = 0$ for each pair of contrasts, assuming equal $n$). A set of $k - 1$ orthogonal contrasts fully partitions $SS_{\text{between}}$ and does not require FWER correction.
Contrast F-statistic: $F = \dfrac{\hat{\psi}^2 / \sum_j (c_j^2 / n_j)}{MS_W}$, with $df = (1,\, N - k)$.
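The contrast F-test is a short computation. A minimal sketch (function name and toy data are hypothetical):

```python
# Sketch of a planned-contrast F-test (helper name is hypothetical).
import numpy as np
from scipy.stats import f as f_dist

def contrast_test(groups, coefs):
    """F-test for psi = sum(c_j * mean_j); coefficients must sum to 0."""
    coefs = np.asarray(coefs, float)
    assert abs(coefs.sum()) < 1e-12, "contrast coefficients must sum to zero"
    ns = np.array([len(g) for g in groups])
    means = np.array([np.mean(g) for g in groups])
    N, k = ns.sum(), len(groups)
    # Pooled error term from the full one-way model
    ms_w = sum(((np.asarray(g, float) - m) ** 2).sum()
               for g, m in zip(groups, means)) / (N - k)
    psi = (coefs * means).sum()
    ss_contrast = psi ** 2 / (coefs ** 2 / ns).sum()  # contrast SS, 1 df
    F = ss_contrast / ms_w
    return psi, F, f_dist.sf(F, 1, N - k)

# "First two groups vs. the third" on toy data
psi, F, p = contrast_test([[1, 2, 3], [1, 2, 3], [7, 8, 9]], [1, 1, -2])
```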
11.6 Effect Sizes for Pairwise Comparisons
After identifying which pairs of groups differ, report effect sizes for each significant pairwise comparison:
Cohen's $d$ for the pairwise comparison $(i, j)$: $d_{ij} = \dfrac{\bar{X}_i - \bar{X}_j}{s}$
Where $s$ can be either the two-group pooled SD or $\sqrt{MS_W}$ from the full ANOVA model (recommended — more stable estimate).
Using $\sqrt{MS_W}$ as the standardiser: $d_{ij} = \dfrac{\bar{X}_i - \bar{X}_j}{\sqrt{MS_W}}$
Hedges' $g$ (bias-corrected): $g = d\left(1 - \dfrac{3}{4\,df_W - 1}\right)$
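Both effect sizes are one-liners; a minimal sketch (helper names are illustrative):

```python
# Minimal helpers for pairwise effect sizes (names are illustrative).
import numpy as np

def pairwise_d(mean_i, mean_j, ms_w):
    """Cohen's d standardised by the ANOVA error term sqrt(MS_W)."""
    return (mean_i - mean_j) / np.sqrt(ms_w)

def hedges_g(d, df_error):
    """Small-sample bias correction applied to Cohen's d."""
    return d * (1 - 3 / (4 * df_error - 1))
```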
12. Non-Parametric Alternatives
12.1 When Non-Parametric Tests Are Appropriate
Non-parametric ANOVA alternatives are appropriate when:
- Data are ordinal (ranked or Likert-type treated as ordinal).
- Data are severely non-normally distributed and sample sizes are small.
- There are extreme outliers that distort mean-based statistics.
- The homogeneity of variance assumption is severely violated.
12.2 Kruskal-Wallis Test — Non-Parametric One-Way ANOVA
The Kruskal-Wallis H test is the non-parametric alternative to one-way between-subjects ANOVA. It tests whether the population distributions of $k$ independent groups are identical (or equivalently, under the location-shift assumption, whether the groups have equal medians).
Procedure:
Step 1 — Rank all observations
Combine all $N$ observations across groups and assign ranks from 1 to $N$. Assign average ranks for ties.
Step 2 — Compute rank sums per group
$R_j = $ sum of ranks for group $j$
Step 3 — Compute the H statistic
$H = \dfrac{12}{N(N+1)} \sum_{j=1}^{k} \dfrac{R_j^2}{n_j} - 3(N+1)$
Tie correction:
$H_{\text{corr}} = H \Big/ \left(1 - \dfrac{\sum_m (t_m^3 - t_m)}{N^3 - N}\right)$
Where $t_m$ is the number of observations in the $m$-th tied group.
Step 4 — p-value
For $k = 3$ and $n_j \leq 5$ per group: use exact tables. For larger samples: $H \sim \chi^2_{k-1}$ approximately.
Step 5 — Effect size:
$\eta^2_H = \dfrac{H - k + 1}{N - k}$
Or, alternatively, epsilon squared: $\epsilon^2_R = \dfrac{H}{N - 1}$
Cohen's benchmarks for $\eta^2_H$ (same as ANOVA $\eta^2$): small = .01, medium = .06, large = .14.
Step 6 — Post-hoc tests for Kruskal-Wallis
When $H$ is significant, pairwise comparisons use the Dunn test with Bonferroni or Holm correction:
$z_{ij} = \dfrac{\bar{R}_i - \bar{R}_j}{\sqrt{\dfrac{N(N+1)}{12}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}}$
Where $\bar{R}_i$ and $\bar{R}_j$ are the mean ranks for groups $i$ and $j$.
Effect size for each pairwise comparison (rank-biserial-style correlation): $r_{ij} = \dfrac{|z_{ij}|}{\sqrt{n_i + n_j}}$
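In practice the omnibus step is a single SciPy call. A minimal sketch (the data and the `dunn_z` helper are illustrative; the helper omits the tie correction for brevity):

```python
# Illustrative Kruskal-Wallis workflow with SciPy (hypothetical data;
# dunn_z is a simplified helper without the tie-correction term).
import numpy as np
from scipy import stats

a, b, c = [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]
H, p = stats.kruskal(a, b, c)        # tie-corrected H, chi-square p-value
k, N = 3, len(a) + len(b) + len(c)
eta2_H = (H - k + 1) / (N - k)       # eta-squared-style effect size

def dunn_z(rbar_i, rbar_j, n_i, n_j, N):
    var = N * (N + 1) / 12 * (1 / n_i + 1 / n_j)
    return (rbar_i - rbar_j) / np.sqrt(var)

ranks = stats.rankdata(a + b + c)
z_ac = dunn_z(ranks[8:].mean(), ranks[:4].mean(), 4, 4, N)
```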
12.3 Friedman Test — Non-Parametric Repeated Measures ANOVA
The Friedman test is the non-parametric alternative to one-way repeated measures ANOVA. It tests whether related conditions (measured on the same participants) have equal population distributions.
Procedure:
Step 1 — Rank within each participant
For each participant $i$, rank their $k$ scores from 1 (lowest) to $k$ (highest). Assign average ranks for ties within a participant.
Step 2 — Compute column rank sums
$R_j = $ sum of ranks in condition $j$ across all $n$ participants
Step 3 — Compute Friedman's statistic
$\chi^2_F = \dfrac{12}{nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1)$
Iman-Davenport F correction (more accurate for small samples), expressible via Kendall's $W$:
$F_{ID} = \dfrac{(n-1)\,W}{1 - W}$, compared to $F_{k-1,\,(n-1)(k-1)}$
Step 4 — p-value
$\chi^2_F \sim \chi^2_{k-1}$ (large-sample approximation)
Step 5 — Effect size: Kendall's $W$
$W = \dfrac{\chi^2_F}{n(k-1)}$ ranges from 0 (no agreement across participants) to 1 (perfect agreement):
| $W$ | Interpretation |
|---|---|
| $W < .10$ | Very weak concordance |
| $.10 \leq W < .30$ | Weak concordance |
| $.30 \leq W < .50$ | Moderate concordance |
| $W \geq .50$ | Strong concordance |
Or report the average pairwise Spearman correlation: $\bar{r} = \dfrac{nW - 1}{n - 1}$
Step 6 — Post-hoc tests for Friedman
Pairwise comparisons using Wilcoxon signed-rank tests with Bonferroni or Holm correction, or the Conover test (more powerful).
Effect size for each pairwise comparison (matched-pairs rank-biserial correlation): $r = \dfrac{T^+ - T^-}{T^+ + T^-}$, where $T^+$ and $T^-$ are the positive and negative signed-rank sums.
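The omnibus Friedman test and Kendall's $W$ are a few lines with SciPy; a sketch on a hypothetical 5 × 3 score matrix:

```python
# Illustrative Friedman workflow (hypothetical participants x conditions
# score matrix).
import numpy as np
from scipy import stats

scores = np.array([[7, 5, 3],
                   [8, 6, 4],
                   [6, 6, 2],
                   [9, 7, 5],
                   [7, 4, 3]])          # 5 participants x 3 conditions

chi2_F, p = stats.friedmanchisquare(*scores.T)  # one sample per condition
n, k = scores.shape
W = chi2_F / (n * (k - 1))                       # Kendall's W
```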
12.4 Welch's One-Way ANOVA — Robust to Heteroscedasticity
Welch's F-test (Welch, 1951) is a parametric ANOVA variant that does not assume homogeneity of variance. It is the recommended default for one-way between-subjects ANOVA when group variances may differ.
Weighted group means: $w_j = \dfrac{n_j}{s_j^2}, \qquad \bar{X}_w = \dfrac{\sum_j w_j \bar{X}_j}{\sum_j w_j}$
Welch's F-statistic: $F_W = \dfrac{\dfrac{1}{k-1}\sum_j w_j (\bar{X}_j - \bar{X}_w)^2}{1 + \dfrac{2(k-2)}{k^2-1} \sum_j \dfrac{(1 - w_j/\sum_j w_j)^2}{n_j - 1}}$
Degrees of freedom (approximate): $df_1 = k - 1, \qquad df_2 = \dfrac{k^2 - 1}{3 \sum_j \dfrac{(1 - w_j/\sum_j w_j)^2}{n_j - 1}}$
Post-hoc: Use Games-Howell pairwise tests when Welch's ANOVA is significant.
💡 Just as Welch's t-test is the recommended default over Student's t-test for two groups, Welch's one-way ANOVA is increasingly recommended as the default over classical ANOVA for three or more independent groups. The loss of power when variances are truly equal is negligible.
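The Welch formulas above translate directly into code. A hedged sketch (the helper name is illustrative; libraries such as pingouin also expose a ready-made Welch ANOVA):

```python
# Hedged sketch of Welch's one-way F-test (helper name is illustrative).
import numpy as np
from scipy.stats import f as f_dist

def welch_anova(*groups):
    k = len(groups)
    ns = np.array([len(g) for g in groups], float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])
    w = ns / variances                       # precision weights n_j / s_j^2
    xw = (w * means).sum() / w.sum()         # weighted grand mean
    num = (w * (means - xw) ** 2).sum() / (k - 1)
    tmp = ((1 - w / w.sum()) ** 2 / (ns - 1)).sum()
    F = num / (1 + 2 * (k - 2) / (k ** 2 - 1) * tmp)
    df1, df2 = k - 1, (k ** 2 - 1) / (3 * tmp)
    return F, df1, df2, f_dist.sf(F, df1, df2)
```

With $k = 2$ this reduces to the square of Welch's t-statistic, which is a useful sanity check.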
13. Advanced Topics
13.1 ANCOVA — Analysis of Covariance
ANCOVA extends ANOVA by including one or more continuous covariates in the model. It serves two purposes:
- Reduce error variance by partialling out variability explained by the covariate, thereby increasing power to detect group differences.
- Adjust group means for pre-existing differences in the covariate (important for quasi-experimental designs).
The ANCOVA model: $Y_{ij} = \mu + \tau_j + \beta\,(X_{ij} - \bar{X}_{..}) + \epsilon_{ij}$
Where $\tau_j$ is the group effect, $\beta$ is the regression coefficient for covariate $X$, and $\epsilon_{ij} \sim N(0, \sigma^2)$.
Additional assumptions for ANCOVA:
- Homogeneity of regression slopes: The relationship between the covariate and DV is the same (parallel) across groups. Test with the Group × Covariate interaction term; if significant, standard ANCOVA is inappropriate.
- Independence of covariate and treatment: In experimental designs, the covariate (e.g., pre-test) should not be affected by the treatment itself.
- Linear relationship between covariate and DV.
Adjusted means (estimated marginal means): $\bar{Y}'_j = \bar{Y}_j - \beta_w\,(\bar{X}_j - \bar{X}_{..})$, where $\beta_w$ is the pooled within-group regression slope.
These are the group means estimated at the grand mean of the covariate.
Effect size for ANCOVA: $\eta^2_p = \dfrac{SS_{\text{group}}}{SS_{\text{group}} + SS_{\text{error}}}$, computed after adjustment for the covariate.
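The adjusted-means formula can be sketched directly (the function name is hypothetical; a full ANCOVA would normally come from a linear model, e.g. a formula-based OLS fit):

```python
# Sketch: ANCOVA adjusted means via the pooled within-group slope
# (hypothetical helper, not DataStatPro's implementation).
import numpy as np

def ancova_adjusted_means(groups_xy):
    """groups_xy: list of (covariate, outcome) array pairs, one per group."""
    xs = [np.asarray(x, float) for x, _ in groups_xy]
    ys = [np.asarray(y, float) for _, y in groups_xy]
    # Pooled within-group slope: within-group cross-products over SS_x
    sxy = sum(((x - x.mean()) * (y - y.mean())).sum() for x, y in zip(xs, ys))
    sxx = sum(((x - x.mean()) ** 2).sum() for x in xs)
    beta = sxy / sxx
    grand_x = np.concatenate(xs).mean()
    return [y.mean() - beta * (x.mean() - grand_x) for x, y in zip(xs, ys)]

# Toy data: y = 2x plus a group effect of +1 for the second group
adj = ancova_adjusted_means([([1, 2, 3], [2, 4, 6]),
                             ([3, 4, 5], [7, 9, 11])])
```

On this toy data the raw means differ by 5, but the adjusted means differ by exactly the true group effect of 1, illustrating what covariate adjustment buys you.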
13.2 Trend Analysis for Ordered Groups
When the levels of the IV represent an ordered quantitative variable (e.g., drug dose: 0, 10, 20, 40 mg), polynomial trend analysis (orthogonal polynomials) is more informative than post-hoc pairwise tests.
Linear trend: Tests whether the means increase or decrease monotonically.
Quadratic trend: Tests whether the means follow a U-shape (accelerating or decelerating pattern).
Standard orthogonal polynomial coefficients for $k$ groups:
| $k$ | Linear ($c_L$) | Quadratic ($c_Q$) | Cubic ($c_C$) |
|---|---|---|---|
| 3 | $-1, 0, 1$ | $1, -2, 1$ | — |
| 4 | $-3, -1, 1, 3$ | $1, -1, -1, 1$ | $-1, 3, -3, 1$ |
| 5 | $-2, -1, 0, 1, 2$ | $2, -1, -2, -1, 2$ | $-1, 2, 0, -2, 1$ |
13.3 Power Analysis for ANOVA
A priori power analysis determines the required sample size before data collection. The primary input is Cohen's $f$ (not $f^2$ — that is for regression): $f = \sqrt{\dfrac{\eta^2}{1 - \eta^2}}$
For one-way ANOVA (equal group sizes), the non-centrality parameter: $\lambda = f^2 \cdot N = f^2 \cdot k n$
Required $n$ per group for power $1 - \beta$ at level $\alpha$:
Iteratively solve: Power $= P\big(F'_{k-1,\,N-k}(\lambda) > F_{\text{crit}}\big) = 1 - \beta$
No closed form exists — DataStatPro uses numerical methods for exact power calculations.
Required $n$ per group for common scenarios (80% power, $\alpha = .05$, one-way):
| Cohen's $f$ | Label | $k = 3$ | $k = 4$ | $k = 5$ | $k = 6$ |
|---|---|---|---|---|---|
| 0.10 | Small | 322 | 274 | 240 | 215 |
| 0.25 | Medium | 52 | 45 | 39 | 35 |
| 0.40 | Large | 21 | 18 | 16 | 14 |
| 0.50 | Large | 14 | 12 | 11 | 10 |
For repeated measures ANOVA, power also depends on the within-subjects correlation $\rho$ (the average correlation among repeated measures):
Higher $\rho$ → greater power benefit from the repeated measures design.
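The iterative power calculation described above can be sketched with SciPy's non-central F distribution (function names are illustrative):

```python
# Exact one-way ANOVA power via the non-central F distribution
# (hypothetical helpers mirroring the iterative approach in the text).
from scipy.stats import f as f_dist, ncf

def anova_power(f_effect, k, n_per_group, alpha=0.05):
    N = k * n_per_group
    df1, df2 = k - 1, N - k
    lam = f_effect ** 2 * N                     # non-centrality parameter
    f_crit = f_dist.ppf(1 - alpha, df1, df2)    # central-F critical value
    return ncf.sf(f_crit, df1, df2, lam)        # P(F' > F_crit)

def n_for_power(f_effect, k, target=0.80, alpha=0.05):
    n = 2
    while anova_power(f_effect, k, n, alpha) < target:
        n += 1
    return n
```

For a medium effect ($f = 0.25$) with three groups, this returns a per-group $n$ in the low fifties, in line with the table above.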
13.4 Dealing with Violations: Transformation Strategies
When the normality or homoscedasticity assumptions are violated, data transformations can sometimes restore assumption validity before applying ANOVA:
| Distribution Shape | Suggested Transformation | Formula |
|---|---|---|
| Right skew (positive) | Log | $\ln(X)$ or $\ln(X + 1)$ |
| Moderate right skew | Square root | $\sqrt{X}$ |
| Severe right skew | Reciprocal | $1/X$ |
| Proportion data | Arcsine | $\arcsin(\sqrt{p})$ |
| Count data | Square root | $\sqrt{X + 0.5}$ |
⚠️ Transformed means cannot be back-transformed directly to obtain the mean of the original variable. Back-transforming estimates the median (for log), not the mean. Always report descriptive statistics in the original scale alongside transformed results.
13.5 Robust ANOVA: Trimmed Means
Trimmed mean ANOVA (Wilcox, 2017) replaces standard means with $\gamma$-trimmed means (conventionally $\gamma = 0.20$ per tail), dramatically reducing sensitivity to outliers and non-normality while maintaining reasonable power.
Yuen's trimmed mean F-test for one-way ANOVA uses:
$\bar{X}_{t,j}$ = 20%-trimmed mean for group $j$
Where $s^2_{w,j}$ is the Winsorised variance for group $j$ and $h_j = n_j - 2\lfloor \gamma n_j \rfloor$ is the effective sample size after trimming.
The test statistic is compared to an F-distribution with adjusted degrees of freedom.
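The two building blocks, the trimmed mean and the Winsorised variance, are easy to compute; a sketch (helper name is illustrative):

```python
# Building blocks for trimmed-mean methods (hypothetical helper).
import numpy as np
from scipy import stats

def trimmed_stats(x, prop=0.20):
    """20%-trimmed mean and Winsorised variance of one sample."""
    x = np.sort(np.asarray(x, float))
    g = int(np.floor(prop * len(x)))          # observations cut per tail
    tmean = x[g:len(x) - g].mean()
    xw = x.copy()
    xw[:g], xw[len(x) - g:] = x[g], x[len(x) - g - 1]  # Winsorise tails
    return tmean, xw.var(ddof=1)

tmean, wvar = trimmed_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

The ordinary mean of this toy sample is 14.5; the single outlier barely moves the 5.5 trimmed mean, which is the point of the method.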
13.6 Bayesian ANOVA
Bayesian ANOVA (Rouder et al., 2012; implemented via the BayesFactor R package) quantifies evidence for and against each effect using Bayes Factors. For each effect: $BF_{10} = \dfrac{p(\text{data} \mid H_1)}{p(\text{data} \mid H_0)}$
The prior on standardised effect sizes under $H_1$ is typically a Cauchy distribution:
$\delta \sim \text{Cauchy}(0, r)$ (default "medium" prior scale)
Bayesian ANOVA advantages:
- Quantifies evidence for null effects (not just failure to reject ).
- Allows continuous evidence monitoring without inflating Type I error.
- Produces posterior distributions for effect sizes.
- Avoids the all-or-nothing dichotomy of significance testing.
13.7 Reporting ANOVA According to APA 7th Edition
APA Publication Manual (7th ed.) requirements for ANOVA:
- Report the F-statistic, both degrees of freedom, and the exact p-value: $F(df_1, df_2) = $ [value], $p = $ [value]
- Report effect size with 95% CI: $\omega^2 = $ [value] [95% CI: LB, UB] or $\eta^2_p = $ [value] [95% CI: LB, UB]
- Specify which effect size was used (state $\eta^2$ vs. $\omega^2$ vs. $\eta^2_p$ explicitly).
- Report the sphericity correction used (GG or HF) and the $\hat{\epsilon}$ value.
- Report group means and standard deviations.
- Report post-hoc test results with adjusted p-values and effect sizes per comparison.
- Specify whether equal variances were assumed and which SS Type was used (factorial).
14. Worked Examples
Example 1: One-Way Between-Subjects ANOVA — Effect of Therapy Type on Depression
A clinical researcher assigns $n = 30$ participants per group to one of three therapy conditions (CBT, Behavioural Activation, Waitlist Control). Post-treatment depression scores (PHQ-9; lower = less depression) are measured.
Group summary statistics:
| Group | Mean PHQ-9 | SD | |
|---|---|---|---|
| CBT | 30 | ||
| Behavioural Activation (BA) | 30 | ||
| Waitlist Control (WL) | 30 |
$k = 3$, $N = 90$
Grand mean:
Step 1 — Between-groups SS: $SS_B = \sum_j n_j (\bar{X}_j - \bar{X})^2$
Step 2 — Within-groups SS: $SS_W = \sum_j (n_j - 1)\,s_j^2$, summed across the three groups.
Step 3 — Total SS: $SS_T = SS_B + SS_W$
Step 4 — ANOVA source table:
| Source | SS | df | MS | $F$ | $p$ |
|---|---|---|---|---|---|
| Between | | 2 | | | |
| Within | | 87 | | | |
| Total | | 89 | | | |
Step 5 — Levene's test: non-significant ($p > .05$) — homogeneity of variance holds; standard ANOVA is appropriate.
Step 6 — Effect sizes:
95% CI for (via non-central F, , , ):
Non-centrality
95% CI for : (numerical)
Step 7 — Post-hoc tests (Tukey HSD):
$df_W = N - k = 87$; Tukey critical value $q_{.05,\,3,\,87}$
| Comparison | Difference | HSD | Significant? | Cohen's |
|---|---|---|---|---|
| CBT vs. BA | No, | |||
| CBT vs. WL | Yes, | |||
| BA vs. WL | Yes, |
Cohen's $d$ for each pair using $\sqrt{MS_W}$ as the standardiser:
; ;
Summary:
| Statistic | Value |
|---|---|
| (Large) | |
| [95% CI: 0.150, 0.376] (Large) | |
| Cohen's | |
| CBT vs. Control | (Large) |
| BA vs. Control | (Large) |
| CBT vs. BA | (Small; ns) |
APA write-up: "A one-way between-subjects ANOVA revealed a significant effect of therapy type on post-treatment depression, , , [95% CI: 0.150, 0.376], indicating a large effect. Tukey HSD post-hoc tests showed that both CBT (, ) and Behavioural Activation (, ) produced significantly lower depression scores than the Waitlist Control (, ), and respectively (both ). CBT and BA did not differ significantly from each other, , ."
Example 2: Two-Way Factorial ANOVA — Drug × Exercise on Anxiety
A researcher uses a $2 \times 3$ between-subjects design: Drug (Drug A vs. Placebo) × Exercise (None, Moderate, High), with an equal number of participants per cell; DV = anxiety score (lower = less anxious).
Cell means:
| | No Exercise | Moderate | High | Row Mean |
|---|---|---|---|---|
| Drug A | 24.1 | 18.3 | 14.7 | 19.033 |
| Placebo | 27.4 | 23.8 | 22.1 | 24.433 |
| Col Mean | 25.750 | 21.050 | 18.400 | 21.733 |
Grand mean: $\bar{X}_{..} = 21.733$
Step 1 — Compute SS (all cells balanced, ):
Cell means for interaction SS:
Deviations from the additive model, $\bar{X}_{ab} - \bar{X}_{a.} - \bar{X}_{.b} + \bar{X}_{..}$:
| Cell | $\bar{X}_{ab}$ | $\bar{X}_{a.}$ | $\bar{X}_{.b}$ | $\bar{X}_{..}$ | Deviation |
|---|---|---|---|---|---|
| Drug A, None | 24.1 | 19.033 | 25.750 | 21.733 | $+1.050$ |
| Drug A, Mod | 18.3 | 19.033 | 21.050 | 21.733 | $-0.050$ |
| Drug A, High | 14.7 | 19.033 | 18.400 | 21.733 | $-1.000$ |
| Placebo, None | 27.4 | 24.433 | 25.750 | 21.733 | $-1.050$ |
| Placebo, Mod | 23.8 | 24.433 | 21.050 | 21.733 | $+0.050$ |
| Placebo, High | 22.1 | 24.433 | 18.400 | 21.733 | $+1.000$ |
Pooled within-cells error variance ($MS_W$): assume the value is given, so:
Step 2 — ANOVA source table:
| Source | SS | df | MS | $F$ | $p$ |
|---|---|---|---|---|---|
| Drug (D) | |||||
| Exercise (E) | |||||
| D × E | | | | | |
| Within (Error) | |||||
| Total |
Step 3 — Partial omega squared for each effect: $\omega^2_p = \dfrac{df_{\text{eff}}(F - 1)}{df_{\text{eff}}(F - 1) + N}$
Step 4 — Interpretation:
The interaction is not significant ($p > .05$) — the effect of Drug is consistent across all exercise levels. Interpret main effects:
- Drug main effect: Drug A () produces significantly lower anxiety than Placebo (), — large effect.
- Exercise main effect: Higher exercise is associated with lower anxiety. Tukey HSD post-hoc tests would reveal which exercise levels differ.
APA write-up: "A between-subjects ANOVA examined the effects of Drug (Drug A vs. Placebo) and Exercise level (None, Moderate, High) on anxiety scores. The interaction was not significant, , , . There were significant main effects of Drug, , , [95% CI: 0.102, 0.340], and Exercise, , , [95% CI: 0.136, 0.395]. Both effects were large."
Example 3: One-Way Repeated Measures ANOVA — Memory Scores Across Four Time Points
A cognitive psychologist measures word recall at four time points (immediate recall, 5-minute delay, 30-minute delay, 24-hour delay) in participants.
Condition means and SDs:
| Time Point | Mean Recall | SD |
|---|---|---|
| Immediate | ||
| 5 minutes | ||
| 30 minutes | ||
| 24 hours |
ANOVA results (given):
| Source | SS | df | MS | $F$ | $p$ |
|---|---|---|---|---|---|
| Between subjects | |||||
| Time | |||||
| Error | |||||
| Total |
Mauchly's test: , , — sphericity holds; no correction needed.
Effect sizes:
Post-hoc tests (Bonferroni-corrected pairwise comparisons):
6 comparisons; Bonferroni-adjusted $\alpha = .05/6 = .0083$
Using paired t-tests on each pair (or use RM ANOVA contrast framework):
| Comparison | Mean Diff | |||
|---|---|---|---|---|
| Imm vs. 5 min | ||||
| Imm vs. 30 min | ||||
| Imm vs. 24 hr | ||||
| 5 min vs. 30 min | ||||
| 5 min vs. 24 hr | ||||
| 30 min vs. 24 hr |
All pairwise comparisons are significant — recall declines significantly at every delay interval.
APA write-up: "A one-way repeated measures ANOVA examined word recall across four time points. Mauchly's test indicated that the sphericity assumption was met, , . There was a significant effect of time, , , [95% CI: 0.198, 0.421], , indicating a large effect. Bonferroni-corrected pairwise comparisons revealed that recall declined significantly at each subsequent time point (all ), with effect sizes ranging from (immediate vs. 5-min) to (immediate vs. 24-hr)."
Example 4: Kruskal-Wallis Test — Non-Parametric Comparison of Pain Ratings
A pain researcher compares pain ratings (0–10 VAS scale, ordinal) across three acupuncture protocols. Shapiro-Wilk tests indicate non-normality in all groups.
Data:
| Protocol A ($n = 8$) | Protocol B ($n = 7$) | Protocol C ($n = 6$) |
|---|---|---|
| 3, 5, 2, 6, 4, 5, 3, 4 | 7, 8, 6, 9, 7, 8, 7 | 5, 6, 4, 7, 5, 6 |
Step 1 — Combined ranks:
Combine all $N = 21$ observations and sort: 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 9. Assign average ranks to tied values:
| Value | Count | Ranks | Avg Rank |
|---|---|---|---|
| 2 | 1 | 1 | 1.0 |
| 3 | 2 | 2–3 | 2.5 |
| 4 | 3 | 4–6 | 5.0 |
| 5 | 4 | 7–10 | 8.5 |
| 6 | 4 | 11–14 | 12.5 |
| 7 | 4 | 15–18 | 16.5 |
| 8 | 2 | 19–20 | 19.5 |
| 9 | 1 | 21 | 21.0 |
Rank assignments:
- Protocol A: 2.5, 8.5, 1.0, 12.5, 5.0, 8.5, 2.5, 5.0 → $R_A = 45.5$
- Protocol B: 16.5, 19.5, 12.5, 21.0, 16.5, 19.5, 16.5 → $R_B = 122.0$
- Protocol C: 8.5, 12.5, 5.0, 16.5, 8.5, 12.5 → $R_C = 63.5$
Check: $R_A + R_B + R_C = 45.5 + 122.0 + 63.5 = 231 = \frac{N(N+1)}{2}$ ✅
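The ranking step can be double-checked programmatically; `scipy.stats.rankdata` assigns average ranks to ties exactly as in Step 1:

```python
# Verifying the rank sums of this example with scipy.stats.rankdata.
from scipy.stats import rankdata

a = [3, 5, 2, 6, 4, 5, 3, 4]
b = [7, 8, 6, 9, 7, 8, 7]
c = [5, 6, 4, 7, 5, 6]
ranks = rankdata(a + b + c)     # average ranks for tied values
R_a, R_b, R_c = ranks[:8].sum(), ranks[8:15].sum(), ranks[15:].sum()
```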
Step 2 — H statistic:
$H = \frac{12}{21 \times 22}\left(\frac{45.5^2}{8} + \frac{122^2}{7} + \frac{63.5^2}{6}\right) - 3 \times 22 = \frac{12}{462}(258.78 + 2126.29 + 672.04) - 66 = 13.41$
Tie correction factor: $\sum_m (t_m^3 - t_m) = 6 + 24 + 60 + 60 + 60 + 6 = 216$, so $C = 1 - \frac{216}{21^3 - 21} = .977$ and $H_{\text{corr}} = 13.41 / .977 = 13.73$
Step 3 — p-value: comparing $H_{\text{corr}} = 13.73$ to $\chi^2_2$ gives $p = .001$
Step 4 — Effect size: $\eta^2_H = \frac{13.73 - 3 + 1}{21 - 3} = \frac{11.73}{18} = .65$
This is a very large effect — protocol membership explains approximately 65% of the rank variability in pain ratings.
Dunn post-hoc tests (Holm-corrected):
Mean ranks: $\bar{R}_A = 45.5/8 = 5.69$, $\bar{R}_B = 122/7 = 17.43$, $\bar{R}_C = 63.5/6 = 10.58$
$z_{ij} = \dfrac{\bar{R}_i - \bar{R}_j}{\sqrt{\left(\frac{N(N+1)}{12} - \frac{\sum_m (t_m^3 - t_m)}{12(N-1)}\right)\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$, with Holm-adjusted $p$ values capped at 1
| Comparison | $z$ | $p$ (unadjusted) | $p$ (Holm-adj) |
|---|---|---|---|
| A vs. B | 3.70 | .0002 | .0006 |
| A vs. C | 1.48 | .139 | .139 |
| B vs. C | 2.01 | .045 | .089 |
APA write-up: "Due to significant non-normality, a Kruskal-Wallis H test was conducted to compare pain ratings across three acupuncture protocols. The test revealed a significant difference, $H(2) = 13.73$, $p = .001$, $\eta^2_H = .65$, indicating a very large effect of protocol. Holm-corrected Dunn post-hoc tests revealed that Protocol A (Mdn = 4.0) produced significantly lower pain ratings than Protocol B (Mdn = 7.0), $z = 3.70$, $p_{\text{adj}} < .001$. Differences between A and C and between B and C did not survive correction ($p_{\text{adj}} = .139$ and $p_{\text{adj}} = .089$, respectively)."
15. Common Mistakes and How to Avoid Them
Mistake 1: Interpreting the Omnibus F Without Post-Hoc Tests
Problem: Reporting a significant omnibus $F$ and concluding that all groups differ from each other, or that a specific pair of groups differs, without conducting post-hoc tests. The omnibus F tells you only that at least one difference exists.
Solution: Always follow a significant omnibus F with appropriate post-hoc tests or planned contrasts. Specify which test was used and apply the correct FWER correction. Report all pairwise comparisons with adjusted p-values and individual effect sizes.
Mistake 2: Reporting $\eta^2$ as if It Were Unbiased
Problem: Reporting $\eta^2$ and labelling it simply as "effect size" or, worse, confusing it with the less-biased $\omega^2$. $\eta^2$ is consistently biased upward and overestimates the population effect, sometimes substantially in small samples with few groups.
Solution: Always report $\omega^2$ (or $\omega^2_p$ for factorial designs) as the primary effect size, and label all effect sizes precisely. If $\eta^2$ is reported (e.g., for software compatibility), clearly note that it is biased and report $\omega^2$ alongside.
Mistake 3: Confusing $\eta^2$ and $\eta^2_p$ in Factorial Designs
Problem: In factorial ANOVA with two or more factors, $\eta^2_p$ values can sum to more than 1.0 across all effects. Reporting $\eta^2_p$ and describing it as "the proportion of total variance explained" is incorrect — it is the proportion of variance explained after removing the other effects.
Solution: Use $\eta^2$ for total-variance proportions and $\eta^2_p$ for partial proportions, always labelling them distinctly. Preferably, use $\omega^2$ or $\eta^2_G$ and state which was used.
Mistake 4: Ignoring Significant Interactions and Interpreting Main Effects Alone
Problem: When a significant A × B interaction is present, reporting and interpreting main effects as if the interaction did not exist. The main effect of A is the average effect across all levels of B — if the interaction is disordinal (crossover), this average is actively misleading.
Solution: Test for interactions before interpreting main effects. When an interaction is significant, probe it with simple effects analysis and interaction plots. Describe the pattern of the interaction rather than (or in addition to) the main effects.
Mistake 5: Using One-Way ANOVA When Repeated Measures ANOVA is Needed
Problem: Treating pre-post data from the same participants as independent groups and running a between-subjects one-way ANOVA. This inflates the error term with between-person variability, severely reduces power, and violates the independence assumption.
Solution: Identify whether data come from different participants (between-subjects) or the same participants (within-subjects). Use repeated measures ANOVA when each participant contributes more than one score. If in doubt, check whether the data file has one row per participant.
Mistake 6: Not Checking or Correcting for Sphericity Violations
Problem: Running repeated measures ANOVA in SPSS or R and not checking Mauchly's test, or checking it but ignoring a significant result and reporting uncorrected p-values.
Solution: Always report Mauchly's test result. When it is significant, report the Greenhouse-Geisser (or Huynh-Feldt if $\hat{\epsilon} > .75$) corrected results. Report both $\hat{\epsilon}$ and the corrected df alongside the F-statistic.
Mistake 7: Applying Standard ANOVA When Variances Are Unequal
Problem: Using classical ANOVA with unequal group sizes and markedly different group variances (e.g., a largest-to-smallest variance ratio above about 2). This produces inflated Type I error rates and untrustworthy p-values.
Solution: When Levene's test is significant (especially with unequal $n$), use Welch's one-way ANOVA with Games-Howell post-hoc tests. Report Levene's test result in the method section and justify the choice of test.
Mistake 8: Running Multiple Pairwise t-Tests After ANOVA Without Correction
Problem: After a significant F, running all pairwise t-tests without applying a multiple comparisons correction, effectively using $\alpha = .05$ per comparison and inflating the FWER.
Solution: Use a proper post-hoc procedure (Tukey HSD, Games-Howell, Holm-Bonferroni) that controls the FWER. Fisher's LSD (uncorrected pairwise tests) is not appropriate as a standalone post-hoc procedure unless there are only $k = 3$ groups.
Mistake 9: Interpreting Non-Significant Interactions as Absence of Interaction
Problem: Concluding that "there is no interaction" based solely on $p > .05$ for the interaction term. A non-significant interaction test only indicates insufficient evidence for an interaction, not evidence of its absence. Underpowered studies routinely fail to detect real interactions.
Solution: Report the effect size for the interaction ($\eta^2_p$ or $\omega^2_p$) and its 95% CI alongside the p-value. If the CI is wide, acknowledge low precision. Consider equivalence testing for the interaction if absence of interaction is the primary claim.
Mistake 10: Failing to Report Descriptive Statistics and Visualisations for Factorial Designs
Problem: In factorial and repeated measures ANOVA, reporting only the omnibus F- statistics without cell means, standard deviations, and interaction plots. Statistical significance alone is uninterpretable without the pattern of means.
Solution: Always report means and standard deviations (or standard errors) for every cell. For factorial designs, always include an interaction plot. For repeated measures, include a profile plot. Raincloud plots (half violin + box + scatter) are increasingly recommended for transparent reporting of individual data.
16. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| $F$ is small or non-significant | No treatment effect, or large within-group variability | Report as non-significant; consider power; inspect within-group variability |
| $\omega^2$ or $\epsilon^2$ is negative | True effect near zero ($F < 1$) | Report as 0 (convention); increase sample size; note small effect |
| $\eta^2_p$ values sum to $> 1$ | Expected in factorial ANOVA; $\eta^2_p$ is not a total-variance proportion | Switch to $\eta^2$ or $\eta^2_G$ for total-variance interpretation |
| Mauchly's test is significant | Sphericity violated (common in repeated measures) | Apply GG correction (if $\hat{\epsilon} < .75$) or HF (if $\hat{\epsilon} > .75$); report $\hat{\epsilon}$ |
| Levene's test is significant | Heterogeneous variances across groups | Use Welch's ANOVA with Games-Howell post-hoc |
| Interaction is significant but interaction plot looks parallel | Scaling issue on plot axes; small but real interaction | Rescale y-axis to start at true minimum; report $\eta^2_p$ for the interaction |
| Post-hoc tests reveal no significant pairs despite significant $F$ | Effect is driven by small differences across many pairs; no single large pair | Report the omnibus $F$ and note that no individual pair survives correction; reduce FWER burden with planned contrasts |
| Planned contrasts do not sum to zero | Contrast coding error | Re-specify: $\sum_j c_j = 0$ for all contrasts |
| ANOVA gives different result to multiple t-tests | ANOVA uses pooled error term; t-tests use only two groups | Trust ANOVA; the pooled error is more stable |
| Repeated measures ANOVA gives very different $\eta^2_p$ vs. $\eta^2_G$ | Large between-subjects variance | Report both; $\eta^2_G$ is preferred for cross-design comparison |
| Very large $F$ with very small $p$ but a modest effect size | Large $N$; even tiny mean differences are statistically significant | Report effect size — statistical significance does not imply practical significance |
| Cell size is 0 for some factorial cells | Empty cells in design | Empty cells break standard ANOVA; use regression approach or multilevel modelling |
| Significant ANCOVA result changes after adding the covariate × group interaction | Homogeneity of regression slopes violated | Standard ANCOVA is inappropriate; use moderated regression instead |
| Kruskal-Wallis is significant but Dunn tests show no significant pairs | Conservative Bonferroni correction; effect spread across many pairs | Use Holm correction instead; report Dunn tests without correction if all planned |
| Friedman test statistic is 0 | Identical rankings across all participants | Verify data; check for data entry errors or insufficient variability |
17. Quick Reference Cheat Sheet
Core ANOVA Equations
| Formula | Description |
|---|---|
| $SS_B = \sum_j n_j (\bar{X}_j - \bar{X})^2$ | Between-groups SS (one-way) |
| $SS_W = \sum_j \sum_i (X_{ij} - \bar{X}_j)^2$ | Within-groups SS (one-way) |
| $SS_T = SS_B + SS_W$ | Total SS decomposition |
| $MS_B = SS_B / (k - 1)$ | Between-groups mean square |
| $MS_W = SS_W / (N - k)$ | Within-groups mean square (error) |
| $F = MS_B / MS_W$ | F-ratio (one-way ANOVA) |
| $p = P(F_{k-1,\,N-k} \geq F_{\text{obs}})$ | One-way ANOVA p-value |
| $SS_A = nb \sum_a (\bar{X}_{a.} - \bar{X})^2$ | Factor A SS (factorial, balanced) |
| $SS_{AB} = n \sum_a \sum_b (\bar{X}_{ab} - \bar{X}_{a.} - \bar{X}_{.b} + \bar{X})^2$ | Interaction SS |
| $F_A = MS_A / MS_W$ | Factorial F for main effect A |
| $SS_{\text{cond}} = n \sum_j (\bar{X}_{.j} - \bar{X})^2$ | Conditions SS (repeated measures) |
| $SS_{\text{subj}} = k \sum_i (\bar{X}_{i.} - \bar{X})^2$ | Subjects SS (repeated measures) |
| $SS_{\text{error}} = SS_T - SS_{\text{cond}} - SS_{\text{subj}}$ | Error SS (repeated measures) |
Effect Size Formulas
| Formula | Description |
|---|---|
| $\eta^2 = SS_B / SS_T$ | Eta squared (one-way; biased) |
| $\eta^2_p = SS_{\text{effect}} / (SS_{\text{effect}} + SS_{\text{error}})$ | Partial eta squared (factorial) |
| $\eta^2_G = SS_{\text{effect}} / (SS_{\text{effect}} + SS_{\text{subjects}} + \sum SS_{\text{error}})$ | Generalised eta squared (RM/mixed) |
| $\omega^2 = \dfrac{SS_B - (k-1) MS_W}{SS_T + MS_W}$ | Omega squared (one-way; preferred) |
| $\omega^2_p = \dfrac{df_{\text{eff}}(F - 1)}{df_{\text{eff}}(F - 1) + N}$ | Partial omega squared (factorial) |
| $\epsilon^2 = \dfrac{SS_B - (k-1) MS_W}{SS_T}$ | Epsilon squared (one-way) |
| $f = \sqrt{\eta^2 / (1 - \eta^2)}$ | Cohen's $f$ (from $\eta^2$) |
| $f = \sqrt{\omega^2 / (1 - \omega^2)}$ | Cohen's $f$ (from $\omega^2$; preferred) |
| $\eta^2 = \dfrac{F \cdot df_1}{F \cdot df_1 + df_2}$ | $\eta^2$ from $F$-statistic |
| $\omega^2 \approx \dfrac{df_1 (F - 1)}{df_1 (F - 1) + N}$ | $\omega^2$ from $F$-statistic (approx) |
| $d = (\bar{X}_i - \bar{X}_j)/\sqrt{MS_W}$ | Cohen's $d$ for post-hoc pairwise |
Non-Parametric Formulas
| Formula | Description |
|---|---|
| $H = \dfrac{12}{N(N+1)} \sum_j \dfrac{R_j^2}{n_j} - 3(N+1)$ | Kruskal-Wallis $H$ |
| $\eta^2_H = \dfrac{H - k + 1}{N - k}$ | Effect size for Kruskal-Wallis |
| $\chi^2_F = \dfrac{12}{nk(k+1)} \sum_j R_j^2 - 3n(k+1)$ | Friedman $\chi^2_F$ |
| $W = \dfrac{\chi^2_F}{n(k-1)}$ | Kendall's $W$ (Friedman effect size) |
| $z_{ij} = \dfrac{\bar{R}_i - \bar{R}_j}{\sqrt{\frac{N(N+1)}{12}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$ | Dunn's test statistic |
| $r_{ij} = |z_{ij}| / \sqrt{n_i + n_j}$ | Rank-biserial (pairwise) |
| $F_W = \dfrac{\sum_j w_j (\bar{X}_j - \bar{X}_w)^2 / (k-1)}{1 + \frac{2(k-2)}{k^2-1} \sum_j \frac{(1 - w_j/\sum w)^2}{n_j - 1}}$ | Welch's one-way ANOVA $F$ |
Sphericity Corrections
| Formula | Description |
|---|---|
| $\hat{\epsilon}_{GG} = \dfrac{\left(\sum_i \lambda_i\right)^2}{(k-1) \sum_i \lambda_i^2}$ | Greenhouse-Geisser epsilon ($\lambda_i$: eigenvalues of the double-centred covariance matrix) |
| $\tilde{\epsilon}_{HF} = \dfrac{n(k-1)\hat{\epsilon}_{GG} - 2}{(k-1)\left[(n-1) - (k-1)\hat{\epsilon}_{GG}\right]}$ | Huynh-Feldt epsilon |
| $df_1' = \epsilon(k - 1), \quad df_2' = \epsilon(k-1)(n-1)$ | Corrected degrees of freedom |
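The GG epsilon can be computed directly from the condition covariance matrix using the trace form of Box's epsilon; a hedged sketch (function name is illustrative):

```python
# Sketch: Greenhouse-Geisser epsilon-hat from an n x k repeated-measures
# matrix via the double-centred covariance matrix (trace form of Box's
# epsilon; hypothetical helper).
import numpy as np

def gg_epsilon(data):
    data = np.asarray(data, float)
    k = data.shape[1]
    S = np.cov(data, rowvar=False)          # k x k condition covariance
    C = np.eye(k) - np.ones((k, k)) / k     # centering matrix
    D = C @ S @ C                           # double-centred covariance
    return np.trace(D) ** 2 / ((k - 1) * np.trace(D @ D))
```

By construction $\hat{\epsilon}$ lies between $1/(k-1)$ (maximal violation) and 1 (perfect sphericity), and for $k = 2$ it is always exactly 1, which is why sphericity is untestable with only two conditions.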
Post-Hoc Test Selection Guide
| Condition | Recommended Post-Hoc Test |
|---|---|
| Balanced design, equal variances | Tukey HSD |
| Unbalanced design, equal variances | Tukey-Kramer |
| Unequal variances or group sizes | Games-Howell |
| All groups vs. one control | Dunnett's test |
| All possible contrasts (not just pairwise) | Scheffé |
| Any design, conservative | Bonferroni |
| Any design, less conservative than Bonferroni | Holm-Bonferroni |
| Non-parametric (Kruskal-Wallis) | Dunn test with Holm correction |
| Non-parametric (Friedman) | Wilcoxon signed-rank + Holm, or Conover |
APA 7th Edition Reporting Templates
One-Way Between-Subjects ANOVA: "A one-way between-subjects ANOVA revealed [a significant / no significant] effect of [IV] on [DV], $F$([df1], [df2]) = [value], $p$ = [value], $\omega^2$ = [value] [95% CI: LB, UB]. [Post-hoc results if significant.]"
Factorial Between-Subjects ANOVA: "A between-subjects ANOVA was conducted. The [IV × IV] interaction [was / was not] significant, $F$([df1], [df2]) = [value], $p$ = [value], $\eta^2_p$ = [value] [95% CI: LB, UB]. [Describe interaction pattern or, if not significant, main effects:] There was a significant main effect of [IV], $F$([df1], [df2]) = [value], $p$ = [value], $\eta^2_p$ = [value] [95% CI: LB, UB], and [of IV], $F$([df1], [df2]) = [value], $p$ = [value], $\eta^2_p$ = [value] [95% CI: LB, UB]."
One-Way Repeated Measures ANOVA: "A one-way repeated measures ANOVA was conducted. Mauchly's test [indicated / did not indicate] a violation of sphericity, $W$ = [value], $p$ = [value][; consequently, Greenhouse-Geisser / Huynh-Feldt corrected values are reported, $\hat{\epsilon}$ = [value]]. There was a significant effect of [condition], $F$([df1], [df2]) = [value], $p$ = [value], $\eta^2_p$ = [value] [95% CI: LB, UB], $\eta^2_G$ = [value]."
Mixed ANOVA: "A [a]-level (between) × [b]-level (within) mixed ANOVA was conducted. Mauchly's test [was / was not] significant for the within-subjects factor, $W$ = [value], $p$ = [value][; GG correction applied, $\hat{\epsilon}$ = [value]]. The [between × within] interaction [was / was not] significant, $F$([df1], [df2]) = [value], $p$ = [value], $\eta^2_p$ = [value] [95% CI: LB, UB]. [Describe simple effects if significant.]"
Kruskal-Wallis: "A Kruskal-Wallis H test was conducted due to [non-normality / ordinal data]. The test revealed [a significant / no significant] difference across groups, $H$([df]) = [value], $p$ = [value], $\eta^2_H$ = [value]. [Dunn pairwise post-hoc results if significant.]"
Friedman Test: "A Friedman test was conducted. There was [a significant / no significant] difference across conditions, $\chi^2_F$([df]) = [value], $p$ = [value], $W$ = [value]."
Welch's One-Way ANOVA: "Due to significant heterogeneity of variance (Levene's $F$ = [value], $p$ = [value]), Welch's one-way ANOVA was applied. Results indicated [a significant / no significant] effect of [IV] on [DV], $F_W$([df1], [df2]) = [value], $p$ = [value], $\omega^2$ = [value] [95% CI: LB, UB]. Games-Howell post-hoc tests were used."
Required Sample Size — One-Way ANOVA (80% Power, )
| Cohen's $f$ | Label | $k = 3$ | $k = 4$ | $k = 5$ | $k = 6$ |
|---|---|---|---|---|---|
| 0.10 | Small | 322 | 274 | 240 | 215 |
| 0.15 | Small-Med | 144 | 123 | 107 | 96 |
| 0.25 | Medium | 52 | 45 | 39 | 35 |
| 0.35 | Med-Large | 27 | 23 | 21 | 19 |
| 0.40 | Large | 21 | 18 | 16 | 14 |
| 0.50 | Large | 14 | 12 | 11 | 10 |
All values are $n$ per group. Multiply by $k$ for the total $N$.
Cohen's Benchmarks — ANOVA Effect Sizes
| Label | $\eta^2$ / $\omega^2$ | Cohen's $f$ (approx) |
|---|---|---|
| Small | .01 | .10 |
| Medium | .06 | .25 |
| Large | .14 | .40 |
Note: Cohen's benchmarks for $\eta^2$ apply approximately to $\omega^2$ and $\epsilon^2$ as well. Always prioritise domain-specific benchmarks over these generic conventions.
Degrees of Freedom Reference
| Design | Source | df |
|---|---|---|
| One-way between | Between | $k - 1$ |
| | Within | $N - k$ |
| | Total | $N - 1$ |
| Factorial ($a \times b$) | A | $a - 1$ |
| | B | $b - 1$ |
| | A × B | $(a-1)(b-1)$ |
| | Within | $N - ab$ |
| One-way RM | Conditions | $k - 1$ |
| | Subjects | $n - 1$ |
| | Error | $(k-1)(n-1)$ |
| Mixed ($a$ between, $b$ within) | A (between) | $a - 1$ |
| | S(A) (between error) | $N - a$ |
| | B (within) | $b - 1$ |
| | A × B | $(a-1)(b-1)$ |
| | B × S(A) (within error) | $(b-1)(N - a)$ |
Assumption Checks Reference
| Assumption | Test | Action if Violated |
|---|---|---|
| Normality of residuals | Shapiro-Wilk, Q-Q plot | Kruskal-Wallis / Friedman; transform data |
| Homogeneity of variance | Levene's, Brown-Forsythe | Welch's ANOVA + Games-Howell |
| Sphericity (RM designs) | Mauchly's test ($W$, $p$) | GG correction ($\hat{\epsilon} < .75$), HF ($\hat{\epsilon} > .75$) |
| Homogeneity of regression slopes (ANCOVA) | Group × Covariate interaction test | Use moderated regression instead |
| Independence | Design review | Mixed models / multilevel ANOVA |
| Outliers | Boxplots, standardised residuals ($|z| > 3$) | Robust (trimmed-mean) ANOVA; investigate and document outliers |
| Interval scale | Measurement theory | Non-parametric alternatives |
ANOVA Reporting Checklist
| Item | Required |
|---|---|
| $F$-statistic with both df | ✅ Always |
| Exact p-value (or $p < .001$) | ✅ Always |
| $\omega^2$ or $\eta^2_p$ with 95% CI | ✅ Always (preferred over $\eta^2$) |
| Which effect size was reported ($\eta^2$ vs. $\omega^2$, etc.) | ✅ Always |
| Group means and SDs for all groups/cells | ✅ Always |
| Sample sizes per group/cell | ✅ Always |
| Levene's test result (between-subjects) | ✅ For independent designs |
| Mauchly's test and $\hat{\epsilon}$ (within-subjects) | ✅ For RM and mixed designs |
| Which sphericity correction applied (GG or HF) | ✅ When Mauchly's significant |
| $\eta^2_G$ for repeated measures / mixed | ✅ Recommended |
| Post-hoc test name and FWER correction method | ✅ When omnibus F significant |
| Post-hoc pairwise differences with adjusted $p$ and $d$ | ✅ When omnibus F significant |
| Interaction plot for factorial/mixed designs | ✅ When interaction significant |
| Simple effects for significant interactions | ✅ When interaction significant |
| SS Type for unbalanced factorial designs | ✅ For unbalanced factorial |
| Power analysis or sensitivity analysis | ✅ For null results |
| Whether Welch's ANOVA was used | ✅ If variances are unequal |
| Domain-specific benchmark context | ✅ Recommended |
Conversion Formulas
| From | To | Formula |
|---|---|---|
| $F$, $df_1$, $df_2$ | $\eta^2$ | $\eta^2 = \dfrac{F \cdot df_1}{F \cdot df_1 + df_2}$ |
| $F$, $df_1$, $N$ | $\omega^2$ (approx) | $\omega^2 \approx \dfrac{df_1(F - 1)}{df_1(F - 1) + N}$ |
| $\eta^2$ (2 groups) | Cohen's $d$ | $d = 2\sqrt{\dfrac{\eta^2}{1 - \eta^2}}$ |
| Cohen's $d$ (2 groups) | $\eta^2$ | $\eta^2 = \dfrac{d^2}{d^2 + 4}$ |
| $t$ (2 groups) | $F$ | $F = t^2$ |
| $W$ (Kendall's) | $\bar{r}$ (avg pairwise Spearman) | $\bar{r} = \dfrac{nW - 1}{n - 1}$ |
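The most commonly needed conversions are direct transcriptions of these formulas (function names are illustrative):

```python
# Direct transcriptions of the conversion formulas above (names are
# illustrative helpers, not a library API).
def eta2_from_F(F, df1, df2):
    return F * df1 / (F * df1 + df2)

def omega2_from_F(F, df1, N):
    return df1 * (F - 1) / (df1 * (F - 1) + N)

def f_from_eta2(eta2):
    return (eta2 / (1 - eta2)) ** 0.5
```

Note that `f_from_eta2(0.06)` returns roughly 0.25, which is why the "medium" benchmarks for $\eta^2$ and $f$ correspond.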
This tutorial provides a comprehensive foundation for understanding, conducting, and reporting ANOVA and its alternatives within the DataStatPro application. For further reading, consult Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018) for applied coverage, Maxwell, Delaney & Kelley's "Designing Experiments and Analyzing Data" (3rd ed., 2018) for rigorous methodological depth, Wilcox's "Introduction to Robust Estimation and Hypothesis Testing" (4th ed., 2017) for robust alternatives, Olejnik & Algina's (2003) "Generalized Eta and Omega Squared Statistics" (Educational and Psychological Measurement) for effect size recommendations in repeated measures designs, and Lakens's "Calculating and Reporting Effect Sizes to Facilitate Cumulative Science" (Frontiers in Psychology, 2013) for practical effect size guidance. For Bayesian ANOVA, see Rouder et al. (2012) in the Journal of Mathematical Psychology. For feature requests or support, contact the DataStatPro team.