Repeated Measures ANOVA

Comprehensive reference guide for Repeated Measures ANOVA statistical technique.

Repeated Measures ANOVA: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of within-subjects experimental designs all the way through advanced interpretation, reporting, assumption checking, and practical usage within the DataStatPro application. Whether you are encountering repeated measures ANOVA for the first time or deepening your understanding of analysing data where the same participants contribute observations across multiple conditions or time points, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is Repeated Measures ANOVA?
  3. The Mathematics Behind Repeated Measures ANOVA
  4. Assumptions of Repeated Measures ANOVA
  5. Variants of Repeated Measures ANOVA
  6. Using the Repeated Measures ANOVA Calculator Component
  7. Step-by-Step Procedure
  8. Interpreting the Output
  9. Effect Sizes for Repeated Measures ANOVA
  10. Confidence Intervals
  11. Advanced Topics
  12. Worked Examples
  13. Common Mistakes and How to Avoid Them
  14. Troubleshooting
  15. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into repeated measures ANOVA, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 Within-Subjects vs. Between-Subjects Designs

A fundamental distinction in experimental design concerns how participants contribute data across conditions:

  • Between-subjects design: Different participants are assigned to different conditions. Each participant contributes one observation. Variability between participants is indistinguishable from variability between conditions — it becomes part of the error term.

  • Within-subjects design: The same participants are measured under all conditions (or at all time points). Each participant contributes multiple observations. Because we can model each participant's general tendency to score high or low, this individual variability is removed from the error term, substantially increasing statistical power.

Repeated measures ANOVA is the inferential framework for within-subjects designs with three or more conditions or time points.

1.2 The Logic of Variance Partitioning

Analysis of Variance (ANOVA) tests hypotheses by partitioning the total variability in the data ($SS_{Total}$) into meaningful components:

$$SS_{Total} = SS_{Effect} + SS_{Error}$$

The key insight is that what constitutes "error" differs between designs:

  • In a between-subjects one-way ANOVA: $SS_{Total} = SS_{Between} + SS_{Within}$, where $SS_{Within}$ includes both random measurement error and individual differences between participants.

  • In a within-subjects (repeated measures) ANOVA, each participant's overall level can be estimated from their multiple scores, so the same total splits three ways:

    $$SS_{Total} = SS_{Subjects} + SS_{Condition} + SS_{Error}$$

By extracting $SS_{Subjects}$ (the variability attributable to stable individual differences, written $SS_{Between\text{-}Subjects}$ in Section 3.2), the residual error term $SS_{Error}$ is usually much smaller than in a between-subjects design, producing larger F-ratios and greater power.

1.3 The F-Ratio

The F-ratio is the core test statistic of ANOVA:

$$F = \frac{\text{Variance explained by the effect (Mean Square Effect)}}{\text{Unexplained variance (Mean Square Error)}}$$

$$F = \frac{MS_{Effect}}{MS_{Error}}$$

Under $H_0$ (all condition means are equal), $F$ is close to 1 on average. Under $H_1$ (at least one condition mean differs), $F$ tends to exceed 1. The larger the $F$, the stronger the evidence against $H_0$.

1.4 The F-Distribution

The F-distribution is parameterised by two degrees of freedom: $df_1$ (numerator, associated with the effect) and $df_2$ (denominator, associated with the error). It is:

  • Right-skewed, defined only for non-negative values.
  • Indexed by $df_1$ and $df_2$; as $df_2 \to \infty$, $df_1 \cdot F$ approaches a chi-square distribution with $df_1$ degrees of freedom.
  • The p-value is always computed from the right tail: $p = P(F_{df_1, df_2} \geq F_{obs})$.

1.5 The Null and Alternative Hypotheses in ANOVA

For a within-subjects factor with $k$ levels (conditions or time points):

  • $H_0$: All population condition means are equal: $\mu_1 = \mu_2 = \cdots = \mu_k$

  • $H_1$: At least one population condition mean differs from at least one other: $\mu_i \neq \mu_j$ for some $i \neq j$

$H_1$ is omnibus: it does not specify which means differ or in what direction. A significant F-test must therefore be followed by post-hoc tests or planned contrasts to identify the specific pattern of differences.

1.6 The p-Value and Significance Level

As in all hypothesis tests, the p-value is the probability of observing an F-ratio as large or larger than obtained, assuming $H_0$ is true. The significance level $\alpha$ (conventionally $.05$) is the threshold below which we reject $H_0$.

⚠️ A significant omnibus F-test tells you only that the condition means are not all equal. It does not tell you which conditions differ, how large the differences are, or whether the differences are practically meaningful. Always follow up with effect sizes, confidence intervals, and post-hoc comparisons.

1.7 Carryover Effects and Counterbalancing

A unique concern in within-subjects designs is that participating in one condition may influence performance in a subsequent condition:

  • Practice effects: Performance improves with experience across conditions.
  • Fatigue effects: Performance deteriorates across conditions due to tiredness.
  • Contrast effects: The subjective experience of one condition is influenced by the preceding condition.

Counterbalancing — systematically varying the order in which participants complete conditions — distributes these carryover effects evenly across conditions, preventing them from confounding the main effect of interest.
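One simple way to generate counterbalanced orders is a cyclic Latin square, sketched below. The function names and condition labels are illustrative assumptions, not part of DataStatPro. Note that a cyclic square balances the ordinal position of each condition but not which condition precedes which, so it does not by itself balance first-order carryover.

```python
def cyclic_latin_square(conditions):
    """Return k presentation orders; row r is the condition list rotated by r."""
    k = len(conditions)
    return [[conditions[(r + pos) % k] for pos in range(k)] for r in range(k)]


def assign_orders(n_participants, conditions):
    """Cycle through the Latin-square rows so each order is used about equally often."""
    square = cyclic_latin_square(conditions)
    return [square[p % len(square)] for p in range(n_participants)]


# Hypothetical three-condition study
orders = cyclic_latin_square(["A", "B", "C"])
for row in orders:
    print(row)
```

Each condition appears exactly once in every serial position, so practice and fatigue effects tied to position are spread evenly across conditions.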

1.8 Mauchly's Sphericity and Why It Matters

The repeated measures ANOVA relies on an assumption called sphericity: the variances of all pairwise difference scores between condition levels must be equal. This is analogous to the homogeneity of variance assumption in between-subjects ANOVA, but specific to within-subjects designs. Violating sphericity inflates the Type I error rate. Mauchly's test and epsilon ($\varepsilon$) corrections (Greenhouse-Geisser, Huynh-Feldt) are the standard diagnostic and remedial tools; both are covered fully in Section 4.


2. What is Repeated Measures ANOVA?

2.1 The Core Question

Repeated measures ANOVA (also called within-subjects ANOVA) is a parametric inferential test that determines whether the means of a continuous dependent variable differ significantly across three or more levels of a within-subjects factor — conditions, time points, or stimuli to which all participants are exposed.

Unlike the paired t-test (which compares two conditions), or between-subjects ANOVA (which compares independent groups), repeated measures ANOVA is the appropriate framework when:

  • The same participants complete all conditions, OR
  • Participants are measured at multiple time points (longitudinal panel data), OR
  • The same participants are exposed to multiple stimuli or multiple tasks.

2.2 The General Logic

Repeated measures ANOVA exploits the within-person correlation across conditions. By modelling each participant's general response level (their row mean in the data matrix), the test removes stable individual differences from the error term:

$$SS_{Error_{RM}} = SS_{Error_{BS}} - SS_{Subjects}$$

This reduction in error variance means that for the same true effect size, repeated measures ANOVA has substantially greater statistical power than a comparable between-subjects design — particularly when individual differences are large (i.e., when participants consistently differ from one another regardless of condition).

2.3 When to Use Repeated Measures ANOVA

| Condition | Requirement |
|---|---|
| Research design | Same participants measured under all $k \geq 3$ conditions |
| Dependent variable | Continuous (interval or ratio scale) |
| Within-subjects factor | Categorical with $k \geq 3$ levels |
| Observations | Independence between participants (not within) |
| Distribution | Approximately normal within each condition (or $n \geq 30$) |
| Sphericity | Variances of all pairwise difference scores are equal (testable) |

2.4 Real-World Applications

| Field | Research Question | Within-Subjects Factor |
|---|---|---|
| Clinical Psychology | Does anxiety score change across pre-treatment, mid-treatment, and post-treatment? | Time (3 levels) |
| Cognitive Neuroscience | Does reaction time differ across congruent, neutral, and incongruent Stroop conditions? | Congruency (3 levels) |
| Education | Does reading fluency improve across four assessment waves in a school year? | Time (4 levels) |
| Pharmacology | Does blood pressure differ across three drug dosage levels in the same patients? | Dosage (3 levels) |
| Sport Science | Does VO$_2$ max differ across four stages of a progressive exercise protocol? | Exercise Stage (4 levels) |
| Nutrition | Does subjective hunger rating differ across morning, noon, afternoon, and evening? | Time of Day (4 levels) |
| Consumer Psychology | Do preference ratings differ across five product designs evaluated by each participant? | Product Design (5 levels) |
| Neuroimaging | Does BOLD signal differ across five experimental conditions within the same participants? | Condition (5 levels) |

| Situation | Correct Test |
|---|---|
| One within-subjects factor, $k \geq 3$ levels | Repeated measures ANOVA |
| One within-subjects factor, $k = 2$ levels | Paired samples t-test |
| One between-subjects factor, $k \geq 3$ groups | One-way between-subjects ANOVA |
| One within + one between factor | Mixed (split-plot) ANOVA |
| Two or more within-subjects factors | Factorial repeated measures ANOVA |
| Non-normal data, one within factor | Friedman test (non-parametric alternative) |
| Modelling trajectories over time with predictors | Linear mixed-effects model (LMM) |
| Binary or count outcome, repeated measures | Generalised linear mixed model (GLMM) |

3. The Mathematics Behind Repeated Measures ANOVA

3.1 Data Structure

Consider $n$ participants each measured under $k$ conditions. The data form an $n \times k$ matrix of scores $X_{ij}$, where $i = 1, \ldots, n$ indexes participants and $j = 1, \ldots, k$ indexes conditions:

| Participant | Condition 1 | Condition 2 | $\cdots$ | Condition $k$ | Person Mean |
|---|---|---|---|---|---|
| 1 | $X_{11}$ | $X_{12}$ | $\cdots$ | $X_{1k}$ | $\bar{X}_{1\cdot}$ |
| 2 | $X_{21}$ | $X_{22}$ | $\cdots$ | $X_{2k}$ | $\bar{X}_{2\cdot}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $n$ | $X_{n1}$ | $X_{n2}$ | $\cdots$ | $X_{nk}$ | $\bar{X}_{n\cdot}$ |
| Condition Mean | $\bar{X}_{\cdot 1}$ | $\bar{X}_{\cdot 2}$ | $\cdots$ | $\bar{X}_{\cdot k}$ | $\bar{X}_{\cdot\cdot}$ (Grand Mean) |

3.2 Partitioning the Total Sum of Squares

The total sum of squares across all $N = n \times k$ observations is:

$$SS_{Total} = \sum_{i=1}^{n}\sum_{j=1}^{k}(X_{ij} - \bar{X}_{\cdot\cdot})^2$$

In a repeated measures design, this partitions as:

$$SS_{Total} = SS_{Between\text{-}Subjects} + SS_{Within\text{-}Subjects}$$

The between-subjects component reflects how participants differ from each other (averaged across conditions):

$$SS_{Between\text{-}Subjects} = k\sum_{i=1}^{n}(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2$$

The within-subjects component captures how each participant's scores vary across conditions. This is further partitioned into the condition effect and error:

$$SS_{Within\text{-}Subjects} = SS_{Condition} + SS_{Error}$$

Condition sum of squares (systematic variability between condition means):

$$SS_{Condition} = n\sum_{j=1}^{k}(\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2$$

Error sum of squares (residual variability after removing both condition and participant effects):

$$SS_{Error} = SS_{Total} - SS_{Between\text{-}Subjects} - SS_{Condition}$$

Or equivalently:

$$SS_{Error} = \sum_{i=1}^{n}\sum_{j=1}^{k}(X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot})^2$$
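The partition above can be checked numerically. The following is a minimal sketch in plain Python; the function name and the 4 × 3 score matrix are made up for illustration, and the data are assumed to be complete (no missing cells).

```python
def rm_sums_of_squares(scores):
    """Partition SS for an n x k list of lists (rows = participants, columns = conditions)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_between = k * sum((m - grand) ** 2 for m in person_means)
    ss_condition = n * sum((m - grand) ** 2 for m in cond_means)
    # Residual form: X_ij - person mean - condition mean + grand mean
    ss_error = sum(
        (scores[i][j] - person_means[i] - cond_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    return {
        "total": ss_total,
        "between_subjects": ss_between,
        "condition": ss_condition,
        "error": ss_error,
    }


# Hypothetical scores: 4 participants x 3 conditions
scores = [[8, 10, 12], [6, 9, 9], [7, 8, 11], [5, 7, 10]]
ss = rm_sums_of_squares(scores)
print(ss)
```

Computing the error term from the residual form and then confirming that the three components add up to the total is a useful sanity check on any hand calculation.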

3.3 Degrees of Freedom

| Source | Degrees of Freedom |
|---|---|
| Between-Subjects | $n - 1$ |
| Condition | $k - 1$ |
| Error (Condition × Subjects) | $(k-1)(n-1)$ |
| Total | $nk - 1$ |

Verification: $(n-1) + (k-1) + (k-1)(n-1) = n - 1 + k - 1 + nk - n - k + 1 = nk - 1$

3.4 Mean Squares and the F-Ratio

Mean squares are obtained by dividing each sum of squares by its degrees of freedom:

$$MS_{Condition} = \frac{SS_{Condition}}{k-1}$$

$$MS_{Error} = \frac{SS_{Error}}{(k-1)(n-1)}$$

The F-ratio for the within-subjects condition effect:

$$F = \frac{MS_{Condition}}{MS_{Error}}$$

Under $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$, this F-ratio follows an F-distribution with $df_1 = k - 1$ and $df_2 = (k-1)(n-1)$ degrees of freedom.
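As a small illustration of these formulas, the sketch below turns two sums of squares into mean squares and an F-ratio. The function name and the input values ($SS_{Condition} = 32$, $SS_{Error} = 10/3$, $n = 4$, $k = 3$) are hypothetical.

```python
def rm_f_ratio(ss_condition, ss_error, n, k):
    """Mean squares, degrees of freedom and F for a one-way repeated measures ANOVA."""
    df1 = k - 1
    df2 = (k - 1) * (n - 1)
    ms_condition = ss_condition / df1
    ms_error = ss_error / df2
    return {
        "df1": df1,
        "df2": df2,
        "ms_condition": ms_condition,
        "ms_error": ms_error,
        "F": ms_condition / ms_error,
    }


# Hypothetical sums of squares for n = 4 participants and k = 3 conditions
result = rm_f_ratio(ss_condition=32.0, ss_error=10 / 3, n=4, k=3)
print(result)
```

The p-value is the right-tail area of the F-distribution at the observed F; if SciPy is available, `scipy.stats.f.sf(F, df1, df2)` returns it.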

3.5 The ANOVA Source Table

The complete one-way repeated measures ANOVA source table:

| Source | $SS$ | $df$ | $MS$ | $F$ | $p$ |
|---|---|---|---|---|---|
| Between-Subjects | $SS_{BS}$ | $n-1$ | | | |
| Condition (Within) | $SS_{Cond}$ | $k-1$ | $MS_{Cond}$ | $MS_{Cond}/MS_{Error}$ | from $F_{k-1,\,(k-1)(n-1)}$ |
| Error | $SS_{Error}$ | $(k-1)(n-1)$ | $MS_{Error}$ | | |
| Total | $SS_{Total}$ | $nk-1$ | | | |

⚠️ The between-subjects row ($SS_{BS}$, $df = n-1$) is typically not tested with an F-ratio; it represents stable individual differences that are partialled out, not an experimental factor of interest. Some software omits this row entirely.

3.6 Epsilon ($\varepsilon$) Corrections for Sphericity Violations

When the sphericity assumption is violated (see Section 4), the actual sampling distribution of $F$ has heavier tails than the nominal $F_{k-1,\,(k-1)(n-1)}$ distribution, which inflates the Type I error rate. Two corrections adjust the degrees of freedom to match the true distribution more closely.

Greenhouse-Geisser (GG) epsilon:

$$\hat{\varepsilon}_{GG} = \frac{k^2(\bar{s}_{jj} - \bar{s}_{..})^2}{(k-1)\left[\sum_{j}\sum_{j'} s^2_{jj'} - 2k\sum_j \bar{s}^2_{j.} + k^2\bar{s}^2_{..}\right]}$$

Where $s_{jj'}$ are the elements of the $k \times k$ covariance matrix of condition scores, $\bar{s}_{jj}$ is the mean of its diagonal elements (the condition variances), $\bar{s}_{j.}$ is the mean of row $j$, and $\bar{s}_{..}$ is the grand mean of all its elements.

Practically, $\hat{\varepsilon}_{GG}$ ranges from $1/(k-1)$ (maximum violation of sphericity) to $1.0$ (perfect sphericity). GG is known to be conservative: it sometimes overcorrects, especially when the true $\varepsilon$ is close to 1 (roughly above .75) and $n$ is small.

Huynh-Feldt (HF) epsilon:

$$\tilde{\varepsilon}_{HF} = \frac{n(k-1)\hat{\varepsilon}_{GG} - 2}{(k-1)[n - 1 - (k-1)\hat{\varepsilon}_{GG}]}$$

HF epsilon is less conservative than GG and is recommended when $\hat{\varepsilon}_{GG} > .75$. If $\tilde{\varepsilon}_{HF} > 1$, it is set to $1.0$.

Corrected degrees of freedom:

$$df_1^* = (k-1)\hat{\varepsilon}, \qquad df_2^* = (k-1)(n-1)\hat{\varepsilon}$$

The F-statistic itself is unchanged; only the reference distribution is adjusted.
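A minimal sketch of both epsilon estimates follows, computed directly from a covariance matrix of condition scores. The function names and the two example matrices are illustrative assumptions; the Greenhouse-Geisser function uses the mean of the diagonal elements, the row means, and the grand mean of the covariance matrix.

```python
def greenhouse_geisser_epsilon(cov):
    """GG epsilon from a k x k covariance matrix given as a list of lists."""
    k = len(cov)
    mean_diag = sum(cov[j][j] for j in range(k)) / k
    grand_mean = sum(sum(row) for row in cov) / (k * k)
    row_means = [sum(row) / k for row in cov]
    sum_sq = sum(x * x for row in cov for x in row)
    numerator = (k * (mean_diag - grand_mean)) ** 2
    denominator = (k - 1) * (
        sum_sq - 2 * k * sum(m * m for m in row_means) + (k * grand_mean) ** 2
    )
    return numerator / denominator


def huynh_feldt_epsilon(eps_gg, n, k):
    """HF epsilon from the GG estimate, capped at 1.0."""
    eps = (n * (k - 1) * eps_gg - 2) / ((k - 1) * (n - 1 - (k - 1) * eps_gg))
    return min(eps, 1.0)


def corrected_df(eps, n, k):
    """Epsilon-adjusted numerator and denominator degrees of freedom."""
    return (k - 1) * eps, (k - 1) * (n - 1) * eps


# Compound-symmetric matrix (sphericity holds) vs. one very noisy condition
spherical = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]
non_spherical = [[1, 0, 0], [0, 1, 0], [0, 0, 10]]

eps_sph = greenhouse_geisser_epsilon(spherical)
eps_gg = greenhouse_geisser_epsilon(non_spherical)
eps_hf = huynh_feldt_epsilon(eps_gg, n=10, k=3)
print(eps_sph, eps_gg, eps_hf, corrected_df(eps_gg, n=10, k=3))
```

The compound-symmetric matrix gives an epsilon of 1, while the matrix with one inflated variance gives a value between the lower bound $1/(k-1) = 0.5$ and 1.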

Decision rule for epsilon corrections:

| $\hat{\varepsilon}_{GG}$ | Recommended Correction |
|---|---|
| $\approx 1.0$ (no violation) | No correction needed |
| $0.75 < \hat{\varepsilon}_{GG} < 1.0$ | Huynh-Feldt correction |
| $\hat{\varepsilon}_{GG} \leq 0.75$ | Greenhouse-Geisser correction |
| Any value (conservative approach) | Always use GG correction |
| Severe violation, small $n$ | Consider MANOVA approach |

3.7 The Multivariate Approach (MANOVA)

An alternative to the univariate F-test with epsilon corrections is the fully multivariate approach, which makes no sphericity assumption whatsoever. The $k$ repeated conditions are recast as $k-1$ contrast variables and tested with multivariate test statistics:

  • Pillai's Trace: $V = \text{tr}[\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}]$
  • Wilks' Lambda: $\Lambda = |\mathbf{E}|/|\mathbf{H}+\mathbf{E}|$
  • Hotelling-Lawley Trace: $T = \text{tr}[\mathbf{H}\mathbf{E}^{-1}]$
  • Roy's Largest Root: $\theta = \lambda_{max}(\mathbf{H}\mathbf{E}^{-1})$

Where $\mathbf{H}$ is the hypothesis matrix and $\mathbf{E}$ is the error matrix.

The multivariate approach is always valid regardless of sphericity, but requires $n > k$ and loses power relative to the corrected univariate test when sphericity holds approximately. For small $n$ relative to $k$, the univariate approach with epsilon corrections is preferred.

3.8 Effect Size: Eta-Squared ($\eta^2$)

The most straightforward effect size for repeated measures ANOVA is eta-squared:

$$\eta^2 = \frac{SS_{Condition}}{SS_{Total}}$$

$\eta^2$ is the proportion of total variance explained by the condition effect. Because $SS_{Total}$ includes the between-subjects variance ($SS_{BS}$), which the repeated measures F-test removes from its error term, $\eta^2$ is never larger than partial eta-squared for the same data and does not reflect the sensitivity gained from the within-subjects design.

3.9 Effect Size: Partial Eta-Squared ($\eta^2_p$)

Partial eta-squared removes the between-subjects variance from the denominator:

$$\eta^2_p = \frac{SS_{Condition}}{SS_{Condition} + SS_{Error}}$$

$\eta^2_p$ represents the proportion of variance explained by the condition effect after removing individual differences. It is the effect size most commonly reported by statistical software (for example, SPSS and SAS). Because its denominator excludes between-subjects variance, $\eta^2_p$ from a within-subjects design tends to be larger than $\eta^2_p$ for the same effect studied between subjects, so it is not directly comparable across designs.

Relationship to $F$:

$$\eta^2_p = \frac{F \cdot df_1}{F \cdot df_1 + df_2}$$

3.10 Effect Size: Generalised Eta-Squared ($\eta^2_G$)

Generalised eta-squared (Olejnik & Algina, 2003) is designed for comparability across studies with different designs. For a pure within-subjects design:

$$\eta^2_G = \frac{SS_{Condition}}{SS_{Condition} + SS_{Between\text{-}Subjects} + SS_{Error}}$$

$\eta^2_G$ is the recommended effect size for meta-analysis involving repeated measures designs because it is intended to be comparable across between-subjects and within-subjects designs, unlike $\eta^2_p$.
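The three eta-squared variants differ only in their denominators, which the sketch below makes explicit. The function names and the input sums of squares are hypothetical. With the formulas as given, $\eta^2$ and $\eta^2_G$ share the same denominator in a one-way within-subjects design, since $SS_{Total} = SS_{Condition} + SS_{Between\text{-}Subjects} + SS_{Error}$.

```python
def eta_squared_family(ss_condition, ss_between_subjects, ss_error):
    """Eta-squared, partial eta-squared and generalised eta-squared (one-way within design)."""
    ss_total = ss_condition + ss_between_subjects + ss_error
    return {
        "eta_sq": ss_condition / ss_total,
        "partial_eta_sq": ss_condition / (ss_condition + ss_error),
        "generalised_eta_sq": ss_condition
        / (ss_condition + ss_between_subjects + ss_error),
    }


def partial_eta_sq_from_f(f_value, df1, df2):
    """Recover partial eta-squared from a reported F and its degrees of freedom."""
    return f_value * df1 / (f_value * df1 + df2)


# Hypothetical values for n = 4 participants and k = 3 conditions
es = eta_squared_family(ss_condition=32.0, ss_between_subjects=35 / 3, ss_error=10 / 3)
from_f = partial_eta_sq_from_f(f_value=28.8, df1=2, df2=6)
print(es, from_f)
```

The second function is handy when a paper reports only $F$ and its degrees of freedom.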

3.11 Effect Size: Omega-Squared ($\omega^2$) and Partial Omega-Squared ($\omega^2_p$)

Both $\eta^2$ and $\eta^2_p$ are positively biased (they overestimate the population effect, especially in small samples). Omega-squared applies a bias correction:

$$\omega^2 = \frac{SS_{Condition} - (k-1)MS_{Error}}{SS_{Total} + MS_{Between\text{-}Subjects}}$$

Partial omega-squared (preferred for repeated measures):

$$\omega^2_p = \frac{SS_{Condition} - (k-1)MS_{Error}}{SS_{Condition} + (n - k + 1) \cdot MS_{Error}}$$

For large samples, $\omega^2_p \approx \eta^2_p$. For small samples ($n < 30$), $\omega^2_p$ is the recommended effect size to report alongside $\eta^2_p$.

3.12 Cohen's $f$ and Statistical Power

Cohen's $f$ is the standardised effect size for ANOVA, defined as:

$$f = \sqrt{\frac{\eta^2_p}{1 - \eta^2_p}}$$
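The conversion between $\eta^2_p$ and Cohen's $f$ runs in both directions, as in this short sketch (function names are illustrative):

```python
import math


def cohens_f(partial_eta_sq):
    """Cohen's f from partial eta-squared (must be below 1)."""
    return math.sqrt(partial_eta_sq / (1 - partial_eta_sq))


def partial_eta_sq_from_cohens_f(f):
    """Inverse conversion: partial eta-squared implied by a given Cohen's f."""
    return f ** 2 / (1 + f ** 2)


f_value = cohens_f(0.20)
print(f_value, partial_eta_sq_from_cohens_f(0.25))
```

For example, $\eta^2_p = .20$ corresponds to $f = 0.5$, and $f = 0.25$ corresponds to $\eta^2_p \approx .059$.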

Required sample size for desired power $1-\beta$ at significance level $\alpha$ (approximate):

$$n \approx \frac{\lambda}{f^2 \cdot k}$$

Where $\lambda$ is the non-centrality parameter satisfying the power equation.

Approximate required number of participants $n$ for a one-way within-subjects design ($k = 4$, $\alpha = .05$, $\rho = .50$ average intercorrelation):

| Cohen's $f$ | Verbal Label | Power = 0.80 | Power = 0.90 | Power = 0.95 |
|---|---|---|---|---|
| 0.10 | Small | 44 | 58 | 72 |
| 0.25 | Medium | 12 | 16 | 20 |
| 0.40 | Large | 7 | 9 | 11 |
| 0.50 | Large | 5 | 7 | 9 |

⚠️ Power for repeated measures ANOVA depends critically on the average correlation among conditions ($\rho$). Higher $\rho$ gives greater power (more individual variability removed). Always specify $\rho$ in power analyses for within-subjects designs.


4. Assumptions of Repeated Measures ANOVA

4.1 Normality of Residuals (or Condition Scores)

The repeated measures ANOVA assumes that the residual scores (or equivalently, the scores within each condition) are approximately normally distributed in the population.

How to check:

| Method | Details |
|---|---|
| Shapiro-Wilk test | Applied to residuals or condition-level scores; most powerful for $n < 50$ |
| Q-Q plots | One per condition; points should fall along the diagonal |
| Histograms | One per condition; should be approximately bell-shaped |
| Skewness and kurtosis | $\vert z_{skew} \vert < 2$; $\vert z_{kurt} \vert < 7$ suggest acceptable distributions |

Robustness: The F-test is moderately robust to non-normality when $n \geq 30$ (via the Central Limit Theorem) and when the violation is symmetric. Severe skewness with small $n$ is the primary concern.

When violated:

  • Use the Friedman test (non-parametric alternative) for small samples with non-normal data.
  • Consider log or square-root transformation for right-skewed outcome variables.
  • Use a linear mixed-effects model with robust standard errors.

4.2 Sphericity

Sphericity is the assumption that the variances of all pairwise difference scores between conditions are equal. For $k$ conditions, there are $k(k-1)/2$ pairwise differences, each of which must have the same variance.

Formally, if $d_{i,jj'} = X_{ij} - X_{ij'}$ is the difference score between conditions $j$ and $j'$ for participant $i$, then sphericity requires that the variance of these scores across participants is the same for every pair of conditions:

$$\text{Var}(d_{jj'}) = \text{Var}(d_{ll'}) \quad \text{for all pairs } (j, j') \text{ and } (l, l')$$

Compound symmetry (equal variances and equal covariances across all conditions) is a sufficient but not necessary condition for sphericity. Sphericity is a weaker requirement than compound symmetry and is the actual assumption of the F-test.

Mauchly's Test of Sphericity:

$$W = \frac{\prod_{l=1}^{k-1} \lambda_l}{\left(\frac{\sum_{l=1}^{k-1}\lambda_l}{k-1}\right)^{k-1}}$$

Where $\lambda_l$ are the eigenvalues of the transformed covariance matrix (using orthonormal contrasts). The test statistic:

$$\chi^2 = -\left(n - 1 - \frac{2(k-1)^2 + k + 1}{6(k-1)}\right)\ln(W)$$

With $df = k(k-1)/2 - 1$.

$H_0$: Sphericity holds ($W = 1$). If $p \leq .05$, reject sphericity and apply corrections.
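Mauchly's $W$ can be computed without an eigenvalue routine, because the product of the eigenvalues equals the determinant of the transformed covariance matrix and their sum equals its trace. The sketch below relies on that identity with normalised Helmert contrasts. The function names and example matrices are illustrative assumptions, and the p-value step (a chi-square right-tail probability) is left out.

```python
import math


def helmert_contrasts(k):
    """Rows are k-1 orthonormal contrasts (each sums to zero and has unit length)."""
    rows = []
    for r in range(1, k):
        scale = math.sqrt(r * (r + 1))
        rows.append([1 / scale] * r + [-r / scale] + [0.0] * (k - r - 1))
    return rows


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    size, det = len(a), 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for j in range(col, size):
                a[r][j] -= factor * a[col][j]
    return det


def mauchly_test(cov, n):
    """Return (W, chi-square, df) for a k x k covariance matrix and n participants."""
    k = len(cov)
    p = k - 1
    c = helmert_contrasts(k)
    # Transformed covariance matrix M = C S C^T, size (k-1) x (k-1)
    cs = [[sum(c[r][i] * cov[i][j] for i in range(k)) for j in range(k)] for r in range(p)]
    m = [[sum(cs[r][j] * c[s][j] for j in range(k)) for s in range(p)] for r in range(p)]
    trace = sum(m[r][r] for r in range(p))
    w = determinant(m) / (trace / p) ** p
    chi2 = -(n - 1 - (2 * p ** 2 + k + 1) / (6 * p)) * math.log(w)
    df = k * (k - 1) // 2 - 1
    return w, chi2, df


# One condition with a much larger variance than the others (hypothetical)
w, chi2, df = mauchly_test([[1, 0, 0], [0, 1, 0], [0, 0, 10]], n=10)
print(w, chi2, df)
```

A singular transformed matrix would give $W = 0$ and an undefined logarithm, so real data with perfectly collinear conditions need separate handling.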

⚠️ Mauchly's test is sensitive to non-normality and can give misleading results in small samples (underpowered) and large samples (overpowered, detecting trivial violations). Always report $\hat{\varepsilon}$ alongside Mauchly's test result. $\hat{\varepsilon}_{GG} < 0.75$ indicates a practically meaningful violation regardless of the Mauchly p-value.

Epsilon values and their implications:

| $\hat{\varepsilon}_{GG}$ | Interpretation | Action |
|---|---|---|
| $= 1.00$ | Perfect sphericity | No correction needed |
| $0.90$ to $0.99$ | Minimal violation | Huynh-Feldt or no correction |
| $0.75$ to $0.89$ | Moderate violation | Huynh-Feldt correction |
| $0.50$ to $0.74$ | Substantial violation | Greenhouse-Geisser correction |
| $< 0.50$ | Severe violation | GG correction or MANOVA |
| $= 1/(k-1)$ | Maximum violation | MANOVA strongly recommended |

Note: The sphericity assumption is irrelevant when $k = 2$ (only two conditions form exactly one difference score, whose variance always equals itself). This is why the paired t-test needs no sphericity correction.

4.3 Independence of Observations Between Participants

While the design deliberately induces correlation within participants (across conditions), the observations between participants must be independent. Each participant's data must not influence another participant's data.

Common violations:

  • Participants in the same lab session who can observe or influence each other.
  • Family members or partners in the same study.
  • Hierarchically nested data (e.g., students within classrooms all treated as independent) — use linear mixed-effects models instead.

4.4 Interval or Ratio Scale of Measurement

The dependent variable must be continuous and measured on at least an interval scale — equal numerical differences must represent equal psychological or physical differences across the entire scale range.

When violated: If the dependent variable is ordinal (e.g., ranks or Likert ratings treated as ordinal), use the Friedman test instead.

4.5 No Extreme Multivariate Outliers

Outliers in any condition can distort condition means and inflate $SS_{Error}$, potentially masking real effects or creating spurious ones.

How to check:

  • Boxplots for each condition.
  • Standardised scores $|z_i| > 3.29$ within each condition.
  • Mahalanobis distance across conditions: flags participants who are outliers in the multivariate sense (unusual profile across all conditions simultaneously).

When outliers present: Investigate the cause. Report analyses with and without outliers. Consider the Friedman test or trimmed-mean ANOVA as robust alternatives.

4.6 Assumption Summary

| Assumption | How to Check | Remedy if Violated |
|---|---|---|
| Normality | Shapiro-Wilk; Q-Q plots per condition | Friedman test; data transformation; LMM |
| Sphericity | Mauchly's test; inspect $\hat{\varepsilon}_{GG}$ | GG or HF correction; MANOVA |
| Independence between participants | Study design review | Linear mixed-effects model |
| Interval scale | Measurement theory review | Friedman test |
| No extreme outliers | Boxplots; $z$-scores; Mahalanobis $D^2$ | Investigate; robust ANOVA; Friedman |

5. Variants of Repeated Measures ANOVA

5.1 One-Way Repeated Measures ANOVA

The standard form described throughout this tutorial: one within-subjects factor with $k \geq 3$ levels. Tests whether condition means differ significantly.

5.2 Factorial Repeated Measures ANOVA (Two or More Within Factors)

When each participant is measured across all combinations of two or more within-subjects factors, a factorial repeated measures ANOVA is used.

For factors $A$ (with $a$ levels) and $B$ (with $b$ levels), the partition is:

$$SS_{Within} = SS_A + SS_B + SS_{A \times B} + SS_{A \times S} + SS_{B \times S} + SS_{A \times B \times S}$$

Where $S$ denotes subjects. Each main effect and interaction has its own error term (the corresponding subjects-by-factor interaction). This allows each effect to be tested against a different, tailored error term.

5.3 Mixed ANOVA (Split-Plot Design)

The mixed ANOVA (also called split-plot ANOVA) combines:

  • One or more between-subjects factors (different participants per group).
  • One or more within-subjects factors (same participants across conditions).

Example: Comparing three treatment groups (between) measured at pre, mid, and post (within). The interaction between group and time (Group × Time) is typically the focal test — it assesses whether the trajectory of change over time differs across groups.

Variance partitioning:

$$SS_{Total} = SS_{Between\text{-}Subjects} + SS_{Within\text{-}Subjects}$$

$$SS_{Between\text{-}Subjects} = SS_{Group} + SS_{Subjects(Group)}$$

$$SS_{Within\text{-}Subjects} = SS_{Time} + SS_{Group \times Time} + SS_{Time \times Subjects(Group)}$$

Each between-subjects effect uses $MS_{Subjects(Group)}$ as its error term; each within-subjects effect uses $MS_{Time \times Subjects(Group)}$ as its error term.

5.4 Friedman Test (Non-Parametric Alternative)

When the normality assumption is severely violated or the data are ordinal, the Friedman test is the non-parametric equivalent of one-way repeated measures ANOVA.

Procedure:

  1. Rank each participant's scores across the $k$ conditions (1 = lowest, $k$ = highest).
  2. Compute the mean rank for each condition: $\bar{R}_j = \frac{1}{n}\sum_i R_{ij}$.
  3. Compute the Friedman statistic:

$$\chi^2_F = \frac{12n}{k(k+1)}\sum_{j=1}^k \left(\bar{R}_j - \frac{k+1}{2}\right)^2$$

Under $H_0$, $\chi^2_F$ approximately follows a $\chi^2_{k-1}$ distribution for large $n$.

Effect size: Kendall's $W$ (coefficient of concordance):

$$W = \frac{\chi^2_F}{n(k-1)}$$

$W$ ranges from 0 (no agreement in rankings across participants) to 1 (perfect agreement). It converts to the average Spearman correlation between pairs of participants' rankings as $\bar{r} = (nW - 1)/(n - 1)$.
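The ranking procedure and both formulas can be sketched in a few lines of plain Python. The function names and scores are hypothetical. Tied scores receive average ranks here, but the statistic is the uncorrected formula above, so results can differ slightly from software that applies a tie correction.

```python
def average_ranks(values):
    """Rank values from 1 (lowest); tied values share the average of their ranks."""
    return [
        1 + sum(v < x for v in values) + (sum(v == x for v in values) - 1) / 2
        for x in values
    ]


def friedman_test(scores):
    """Return (mean ranks, Friedman chi-square, Kendall's W) for an n x k score matrix."""
    n, k = len(scores), len(scores[0])
    ranks = [average_ranks(row) for row in scores]
    mean_ranks = [sum(r[j] for r in ranks) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * sum((m - (k + 1) / 2) ** 2 for m in mean_ranks)
    kendall_w = chi2 / (n * (k - 1))
    return mean_ranks, chi2, kendall_w


# Hypothetical scores: 4 participants x 3 conditions, no ties within a row
scores = [[10, 12, 15], [9, 14, 11], [8, 10, 13], [12, 11, 16]]
mean_ranks, chi2, kendall_w = friedman_test(scores)
print(mean_ranks, chi2, kendall_w)
```

With these scores the mean ranks are 1.25, 2.0 and 2.75, giving $\chi^2_F = 4.5$ and $W = 0.5625$.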

5.5 Trend Analysis (Polynomial Contrasts)

When the within-subjects factor is quantitative and equally spaced (e.g., time points at regular intervals, dosage levels at equal increments), trend analysis decomposes the condition effect into orthogonal polynomial components:

  • Linear trend: Does the mean increase or decrease at a constant rate across levels?
  • Quadratic trend: Is there a U-shaped or inverted-U-shaped pattern?
  • Cubic trend: Is there an S-shaped or more complex pattern?

Each trend component has $df = 1$ and is tested separately, either against the pooled error mean square or, as in many software packages, against its own contrast-specific error term. Trend analysis is more powerful and more informative than the omnibus F-test when a specific trajectory is hypothesised.
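One common way to test a single trend component is to compute a contrast score per participant and run a one-sample t-test on those scores, with $F = t^2$ on $(1, n-1)$ degrees of freedom. The sketch below takes that route, which gives each contrast its own error term instead of the pooled $MS_{Error}$; that choice, the function name, and the data are assumptions for illustration. The weights are the standard orthogonal polynomial coefficients for equally spaced levels.

```python
import math

# Standard orthogonal polynomial weights for equally spaced levels
POLY_CONTRASTS = {
    3: {"linear": [-1, 0, 1], "quadratic": [1, -2, 1]},
    4: {"linear": [-3, -1, 1, 3], "quadratic": [1, -1, -1, 1], "cubic": [-1, 3, -3, 1]},
}


def trend_test(scores, weights):
    """One-sample t-test on per-participant contrast scores (needs non-zero variance)."""
    n = len(scores)
    contrast_scores = [sum(w * x for w, x in zip(weights, row)) for row in scores]
    mean = sum(contrast_scores) / n
    var = sum((c - mean) ** 2 for c in contrast_scores) / (n - 1)
    t = mean / math.sqrt(var / n)
    return {"mean_contrast": mean, "t": t, "F": t * t, "df": (1, n - 1)}


# Hypothetical scores: 4 participants x 3 equally spaced time points
scores = [[8, 10, 12], [6, 9, 9], [7, 8, 11], [5, 7, 10]]
linear = trend_test(scores, POLY_CONTRASTS[3]["linear"])
quadratic = trend_test(scores, POLY_CONTRASTS[3]["quadratic"])
print(linear, quadratic)
```

In this example the linear component is large while the quadratic contrast scores average exactly zero, which matches condition means that rise in equal steps.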

5.6 Linear Mixed-Effects Models (LMM)

Linear mixed-effects models (also called multilevel models or hierarchical linear models) subsume repeated measures ANOVA as a special case while offering several important generalisations:

  • Handle missing data without excluding participants (ANOVA requires complete data or imputation).
  • Model unequal time intervals between measurements.
  • Allow time-varying covariates as predictors.
  • Specify flexible covariance structures (not restricted to compound symmetry or sphericity).
  • Accommodate both balanced and unbalanced designs.

For complex longitudinal designs, LMMs are generally preferred over repeated measures ANOVA. DataStatPro's repeated measures ANOVA module automatically suggests LMM when missing data are detected.

5.7 Bayesian Repeated Measures ANOVA

The Bayesian approach computes Bayes Factors comparing models with and without the condition effect. Under default priors (Rouder et al., 2012):

$$BF_{10} = \frac{P(\text{data} \mid H_1: \text{condition effect exists})}{P(\text{data} \mid H_0: \text{no condition effect})}$$

Interpreting $BF_{10}$:

| $BF_{10}$ | Evidence |
|---|---|
| $> 100$ | Extreme evidence for $H_1$ |
| $30$ to $100$ | Very strong |
| $10$ to $30$ | Strong |
| $3$ to $10$ | Moderate |
| $1$ to $3$ | Anecdotal |
| $< 1/3$ | Moderate evidence for $H_0$ (no effect) |

6. Using the Repeated Measures ANOVA Calculator Component

The Repeated Measures ANOVA Calculator in DataStatPro provides a comprehensive tool for running, diagnosing, and reporting within-subjects analyses.

Step-by-Step Guide

Step 1 — Select the Test

Navigate to Statistical Tests → ANOVA → Repeated Measures ANOVA.

Step 2 — Input Method

Choose how to provide data:

  • Raw data (wide format): Each row is one participant; each column is one condition. DataStatPro automatically identifies the within-subjects structure.
  • Raw data (long format): Three columns required: participant ID, condition label, and dependent variable value. DataStatPro reshapes to wide format internally.
  • Summary statistics: Enter $n$, condition means ($\bar{X}_{\cdot j}$), standard deviations ($s_j$), and the correlation matrix across conditions. DataStatPro reconstructs the ANOVA source table.
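To make the two raw-data layouts concrete, the sketch below reshapes long-format records into the wide participant-by-condition matrix. It is an illustration only, not DataStatPro's internal routine; the function name, record layout, and labels are assumptions.

```python
def long_to_wide(records, condition_order):
    """Reshape (participant, condition, value) records into an n x k matrix.

    Raises ValueError if any participant lacks one of the listed conditions.
    """
    by_participant = {}
    for participant, condition, value in records:
        by_participant.setdefault(participant, {})[condition] = value

    ids = sorted(by_participant)
    incomplete = [p for p in ids if set(by_participant[p]) != set(condition_order)]
    if incomplete:
        raise ValueError(f"Participants missing conditions: {incomplete}")

    matrix = [[by_participant[p][c] for c in condition_order] for p in ids]
    return ids, matrix


# Hypothetical long-format data: participant ID, condition label, score
records = [
    ("P1", "Pre", 8), ("P1", "Mid", 10), ("P1", "Post", 12),
    ("P2", "Pre", 6), ("P2", "Mid", 9), ("P2", "Post", 9),
]
ids, wide = long_to_wide(records, ["Pre", "Mid", "Post"])
print(ids, wide)
```

Rejecting incomplete participants up front mirrors the requirement that repeated measures ANOVA needs a score in every cell.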

Step 3 — Define the Within-Subjects Factor

  • Specify the factor name (e.g., "Time", "Condition", "Dosage").
  • Label each level (e.g., "Pre", "Mid", "Post").
  • For factorial designs, define additional within-subjects factors and their levels.
  • For mixed designs, specify the between-subjects grouping variable.

Step 4 — Select Post-Hoc Tests and Contrasts

  • Post-hoc tests (exploratory): Bonferroni, Holm, Tukey's HSD (adapted for within-subjects), Šidák.
  • Planned contrasts: Simple (each level vs. first), Helmert (each level vs. mean of preceding), polynomial (linear, quadratic, cubic trend), custom.

Step 5 — Set Significance Level and Confidence Level

Default: $\alpha = .05$, 95% CI. Results at $\alpha = .01$ and $\alpha = .001$ are displayed alongside.

Step 6 — Select Display Options

  • ✅ Full ANOVA source table with $SS$, $df$, $MS$, $F$, exact $p$.
  • ✅ Mauchly's test of sphericity and $\hat{\varepsilon}_{GG}$, $\tilde{\varepsilon}_{HF}$.
  • ✅ Greenhouse-Geisser and Huynh-Feldt corrected results (automatically displayed when sphericity is violated).
  • ✅ Multivariate test statistics (Pillai, Wilks, Hotelling, Roy) as an alternative.
  • ✅ Partial eta-squared ($\eta^2_p$) and omega-squared ($\omega^2_p$) with 95% CI.
  • ✅ Generalised eta-squared ($\eta^2_G$).
  • ✅ Cohen's $f$ for power analysis.
  • ✅ Condition means, standard deviations, and 95% CIs with error bar plots.
  • ✅ Post-hoc comparison table (pairwise $t$-tests with corrections).
  • ✅ Profile plot (means across conditions with individual participant trajectories).
  • ✅ Interaction plot (for factorial and mixed designs).
  • ✅ Residual Q-Q plots and normality test per condition.
  • ✅ Mahalanobis distance outlier detection.
  • ✅ Power analysis: current post-hoc power and required $n$ for 80%, 90%, 95% power.
  • ✅ Bayesian ANOVA (Bayes Factor $BF_{10}$).
  • ✅ APA 7th edition results paragraph (auto-generated).

Step 7 — Run the Analysis

Click "Run Repeated Measures ANOVA". DataStatPro will:

  1. Compute all $SS$, $df$, $MS$, $F$, and exact p-values.
  2. Conduct Mauchly's test; apply GG and HF corrections as appropriate.
  3. Compute multivariate test statistics.
  4. Compute $\eta^2_p$, $\omega^2_p$, $\eta^2_G$, and Cohen's $f$ with 95% CIs.
  5. Run selected post-hoc comparisons or planned contrasts.
  6. Conduct normality and outlier diagnostics.
  7. Estimate post-hoc power.
  8. Output an APA-compliant results paragraph.

7. Step-by-Step Procedure

7.1 Full Manual Procedure

Step 1 — State the Hypotheses

$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ (all condition population means are equal)

$H_1:$ At least one $\mu_j$ differs from at least one other $\mu_{j'}$

Specify the within-subjects factor and the number of levels $k$ based on the research design, before examining the data.

Step 2 — Organise the Data Matrix

Arrange data in an $n \times k$ matrix with one row per participant and one column per condition. Verify that there are no missing values (or plan an imputation or LMM approach).

Step 3 — Check Assumptions

  • Inspect distributions within each condition (histograms, Q-Q plots, Shapiro-Wilk).
  • Investigate multivariate outliers (Mahalanobis distance).
  • Confirm independence of observations between participants (design review).
  • Test sphericity (Mauchly's test and $\hat{\varepsilon}$) once the $SS$ components have been computed (Step 7).

Step 4 — Compute Grand Mean and Condition Means

$$\bar{X}_{\cdot\cdot} = \frac{1}{nk}\sum_{i=1}^n\sum_{j=1}^k X_{ij}$$

$$\bar{X}_{\cdot j} = \frac{1}{n}\sum_{i=1}^n X_{ij} \quad \text{for each condition } j$$

$$\bar{X}_{i\cdot} = \frac{1}{k}\sum_{j=1}^k X_{ij} \quad \text{for each participant } i$$

Step 5 — Compute Sums of Squares

$$SS_{Total} = \sum_{i=1}^n\sum_{j=1}^k (X_{ij} - \bar{X}_{\cdot\cdot})^2$$

$$SS_{Between\text{-}Subjects} = k\sum_{i=1}^n (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2$$

$$SS_{Condition} = n\sum_{j=1}^k (\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2$$

$$SS_{Error} = SS_{Total} - SS_{Between\text{-}Subjects} - SS_{Condition}$$

Step 6 — Compute Degrees of Freedom

$$df_{Condition} = k - 1$$

$$df_{Error} = (k-1)(n-1)$$
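
Steps 4 to 6 translate directly into code. The sketch below is a plain-Python transcription of the formulas above; the function name `rm_anova_table` and the 3 × 3 toy matrix are illustrative assumptions, and the uncorrected F-ratio is included for convenience.

```python
def rm_anova_table(data):
    """One-way repeated measures decomposition for an n x k matrix (rows = participants)."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subj_means)
    ss_condition = n * sum((m - grand) ** 2 for m in cond_means)
    ss_error = ss_total - ss_subjects - ss_condition

    df_condition = k - 1
    df_error = (k - 1) * (n - 1)
    f_ratio = (ss_condition / df_condition) / (ss_error / df_error)
    return {
        "ss_total": ss_total,
        "ss_subjects": ss_subjects,
        "ss_condition": ss_condition,
        "ss_error": ss_error,
        "df_condition": df_condition,
        "df_error": df_error,
        "f": f_ratio,
    }


# Toy matrix: 3 participants x 3 conditions.
toy = [[2, 4, 6], [3, 4, 8], [1, 7, 10]]
for name, value in rm_anova_table(toy).items():
    print(f"{name}: {value:.3f}")
```

For this toy matrix the decomposition is 70 = 6 + 54 + 10, which gives F(2, 4) = 27 / 2.5 = 10.8.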

Step 7 — Conduct Mauchly's Test of Sphericity

Compute the covariance matrix of condition scores $\boldsymbol{\Sigma}$ and obtain $\hat{\varepsilon}_{GG}$ and $\tilde{\varepsilon}_{HF}$. If Mauchly's $p < .05$ or $\hat{\varepsilon}_{GG} < 0.75$, apply the appropriate correction.
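
This step does not spell out how $\hat{\varepsilon}_{GG}$ is obtained from $\boldsymbol{\Sigma}$. One common formulation, assumed here, double-centres the covariance matrix and takes $\hat{\varepsilon}_{GG} = (\operatorname{tr}\mathbf{C})^2 / [(k-1)\sum_{a,b} c_{ab}^2]$. The function names are illustrative, and the sketch is not a description of DataStatPro's implementation.

```python
def covariance_matrix(data):
    """Sample covariance matrix (k x k) of an n x k data matrix."""
    n, k = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    return [
        [
            sum((row[a] - means[a]) * (row[b] - means[b]) for row in data) / (n - 1)
            for b in range(k)
        ]
        for a in range(k)
    ]


def greenhouse_geisser_epsilon(cov):
    """Greenhouse-Geisser epsilon from a symmetric k x k covariance matrix."""
    k = len(cov)
    row_means = [sum(row) / k for row in cov]
    grand_mean = sum(row_means) / k
    # Double-centre: remove row and column means, add back the grand mean.
    centred = [
        [cov[a][b] - row_means[a] - row_means[b] + grand_mean for b in range(k)]
        for a in range(k)
    ]
    trace = sum(centred[a][a] for a in range(k))
    sum_of_squares = sum(value * value for row in centred for value in row)
    return trace ** 2 / ((k - 1) * sum_of_squares)


# Compound symmetry (equal variances, equal covariances) satisfies sphericity.
spherical = [[4.0, 2.0, 2.0], [2.0, 4.0, 2.0], [2.0, 2.0, 4.0]]
print(greenhouse_geisser_epsilon(spherical))
```

The estimate ranges from $1/(k-1)$ (maximal violation) to 1 (sphericity holds), so the compound-symmetric matrix above should return a value of 1 up to floating-point error.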

Step 8 — Compute Mean Squares and the F-Ratio

$MS_{Condition} = \frac{SS_{Condition}}{df^*_{Condition}}$ (use corrected $df^*$ if applicable)

$$MS_{Error} = \frac{SS_{Error}}{df^*_{Error}}$$

$$F = \frac{MS_{Condition}}{MS_{Error}}$$

Because the same $\varepsilon$ multiplies both degrees of freedom, the correction leaves $F$ itself unchanged; only the reference distribution used for the p-value changes.

Step 9 — Compute the p-Value

$$p = P(F_{df^*_1, df^*_2} \geq F_{obs})$$

Reject $H_0$ if $p \leq \alpha$.

Step 10 — Compute Effect Sizes

$$\eta^2_p = \frac{SS_{Condition}}{SS_{Condition} + SS_{Error}}$$

$$\omega^2_p = \frac{SS_{Condition} - (k-1)MS_{Error}}{SS_{Condition} + (n-k+1) \cdot MS_{Error}}$$

$$f = \sqrt{\frac{\eta^2_p}{1 - \eta^2_p}}$$

Step 11 — Conduct Post-Hoc Comparisons (if $H_0$ rejected)

For each pair of conditions $(j, j')$, compute the paired t-test:

$$t_{jj'} = \frac{\bar{X}_{\cdot j} - \bar{X}_{\cdot j'}}{s_{d_{jj'}}/\sqrt{n}}$$

Where $s_{d_{jj'}}$ is the standard deviation of the difference scores $d_i = X_{ij} - X_{ij'}$. Apply Bonferroni, Holm, or Tukey correction for the $k(k-1)/2$ pairwise comparisons.
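
A single pairwise comparison can be sketched as follows. The function name `paired_comparison` and the sample scores are hypothetical; the p-value is left to a statistics library because the Python standard library has no t-distribution.

```python
import math
import statistics


def paired_comparison(x, y):
    """Paired t statistic and standardised mean difference for two conditions."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)      # SD of the difference scores
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return {
        "mean_diff": mean_diff,
        "sd_diff": sd_diff,
        "t": t_stat,
        "df": n - 1,
        "d": mean_diff / sd_diff,          # mean difference in SD-of-difference units
    }


def bonferroni_alpha(alpha, k):
    """Per-comparison alpha for all k(k-1)/2 pairwise comparisons."""
    m = k * (k - 1) // 2
    return alpha / m


condition_a = [10, 12, 15, 11]
condition_b = [8, 9, 11, 10]
print(paired_comparison(condition_a, condition_b))
print(bonferroni_alpha(0.05, 3))
```

Each adjusted test then compares the p-value for $t$ on $n - 1$ degrees of freedom against the per-comparison alpha.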

Step 12 — Interpret and Report

Use the APA reporting template in Section 15. Always report $F$, both $df$ values (with corrections if applicable), $p$, $\eta^2_p$, $\omega^2_p$, and the 95% CI for the effect size. Report Mauchly's test result and the epsilon correction applied.


8. Interpreting the Output

8.1 The F-Statistic

| $F_{obs}$ | Meaning |
|---|---|
| $F \approx 1$ | Condition variance $\approx$ error variance; consistent with $H_0$ |
| $F \gg 1$ | Condition variance substantially exceeds error; evidence against $H_0$ |
| Large $F$ with large $n$ | Can be significant even for very small $\eta^2_p$ |
| Small $F$ with small $n$ | May be non-significant even for large $\eta^2_p$ (low power) |
| $F < 1$ | Condition variance is below error variance; consistent with $H_0$. Values far below 1 are rare and mean the observed means are more similar than chance alone would predict |

8.2 The p-Value

| p-Value | Conventional Interpretation |
|---|---|
| $p > .10$ | No evidence against $H_0$ (equal condition means) |
| $.05 < p \leq .10$ | Marginal evidence of condition differences (trend) |
| $.01 < p \leq .05$ | Significant condition effect at $\alpha = .05$ |
| $.001 < p \leq .01$ | Significant condition effect at $\alpha = .01$ |
| $p \leq .001$ | Significant condition effect at $\alpha = .001$ |

⚠️ A significant F-test is omnibus: it indicates only that at least one pair of condition means differs. It does not identify which conditions differ or how large those differences are. Always follow a significant omnibus test with post-hoc comparisons or planned contrasts, and always report effect sizes.

8.3 Mauchly's Test and Epsilon

| Mauchly's $p$ | $\hat{\varepsilon}_{GG}$ | Recommended Action |
|---|---|---|
| $> .05$ | $\geq .90$ | Report uncorrected results; sphericity holds |
| $> .05$ | $.75 - .89$ | Report HF-corrected results as primary; note uncorrected |
| $\leq .05$ | $.75 - .89$ | Use HF correction |
| $\leq .05$ | $< .75$ | Use GG correction; consider reporting MANOVA |
| Any | $\approx 1/(k-1)$ | Report MANOVA as primary analysis |

⚠️ Always report $\hat{\varepsilon}_{GG}$ regardless of Mauchly's test outcome, since Mauchly's test may lack power in small samples. A reader can use $\hat{\varepsilon}_{GG}$ to judge the severity of any sphericity violation.

8.4 The ANOVA Source Table

When reading the output table, focus on:

  1. $SS_{Condition}$: Total systematic variability across conditions. Larger values indicate greater spread of condition means.
  2. $SS_{Error}$: Residual within-subjects variability after removing condition and participant effects. Smaller values (more tightly correlated conditions) indicate a more sensitive design.
  3. $MS_{Condition}/MS_{Error}$ ratio: The F-ratio, the key test statistic.
  4. $df$ values: Verify these match $(k-1)$ and $(k-1)(n-1)$ (uncorrected) or their epsilon-adjusted equivalents.

8.5 Partial Eta-Squared ($\eta^2_p$) — Magnitude Interpretation

Cohen's (1988) benchmarks for $\eta^2_p$ (and $f$):

| $\eta^2_p$ | Cohen's $f$ | Verbal Label |
|---|---|---|
| $0.01$ | $0.10$ | Small |
| $0.06$ | $0.25$ | Medium |
| $0.14$ | $0.40$ | Large |

Extended benchmarks (Lakens, 2013, contextualised for within-subjects designs):

| $\eta^2_p$ | Verbal Label |
|---|---|
| $< 0.01$ | Negligible |
| $0.01 - 0.05$ | Small |
| $0.06 - 0.13$ | Medium |
| $0.14 - 0.25$ | Large |
| $> 0.25$ | Very large |

⚠️ These benchmarks were developed for between-subjects designs. In within-subjects designs, $\eta^2_p$ values tend to be larger because individual differences are removed from the error term. Do not mechanically apply Cohen's benchmarks without considering the typical effect sizes in your specific research domain.

8.6 Post-Hoc Comparisons

After a significant omnibus F-test, post-hoc pairwise comparisons identify which specific pairs of conditions differ. For $k$ conditions there are $k(k-1)/2$ pairs:

| $k$ | Number of Pairs |
|---|---|
| 3 | 3 |
| 4 | 6 |
| 5 | 10 |
| 6 | 15 |

For each comparison, report the mean difference $(\bar{X}_{\cdot j} - \bar{X}_{\cdot j'})$, the $t$-statistic, the adjusted p-value, and Cohen's $d_{rm}$ for the paired comparison:

$$d_{rm} = \frac{\bar{X}_{\cdot j} - \bar{X}_{\cdot j'}}{s_{d_{jj'}}}$$

8.7 Profile Plots

A profile plot (line chart with conditions on the x-axis and mean outcome on the y-axis) is the primary visualisation for repeated measures ANOVA. Key features to examine:

  • Parallel lines (in mixed designs): Suggests no interaction between between- and within-subjects factors.
  • Crossing or diverging lines: Suggests a Group × Time interaction — the critical test in most mixed designs.
  • Error bars: Should represent 95% within-subjects confidence intervals (not the standard between-subjects SEM), computed using the Cousineau-Morey correction, which removes between-subjects variance from the error.

9. Effect Sizes for Repeated Measures ANOVA

9.1 Partial Eta-Squared ($\eta^2_p$)

$$\eta^2_p = \frac{SS_{Condition}}{SS_{Condition} + SS_{Error}}$$

The proportion of within-subjects variance (after removing participant-level variance) explained by the condition effect. This is the standard effect size reported by SPSS, SAS, and most ANOVA software. Upwardly biased in small samples.

9.2 Generalised Eta-Squared ($\eta^2_G$)

$$\eta^2_G = \frac{SS_{Condition}}{SS_{Condition} + SS_{Between\text{-}Subjects} + SS_{Error}}$$

Designed for comparability across studies that differ in the number of conditions and design type (between vs. within). Recommended for meta-analyses. Note that $\eta^2_G \leq \eta^2_p$ always, since the denominator of $\eta^2_G$ is larger.

9.3 Partial Omega-Squared ($\omega^2_p$) — Bias-Corrected

$$\omega^2_p = \frac{SS_{Condition} - (k-1)MS_{Error}}{SS_{Condition} + (n - k + 1) \cdot MS_{Error}}$$

The unbiased (or less biased) estimate of the true population partial effect size. Preferred over $\eta^2_p$ for small samples ($n < 30$). Note that $\omega^2_p$ can be negative for very small effects; negative values should be reported as $\omega^2_p \approx 0$ (indicating no effect).

9.4 Cohen's $f$

$$f = \sqrt{\frac{\eta^2_p}{1 - \eta^2_p}} = \frac{\sigma_m}{\sigma_\varepsilon}$$

Where $\sigma_m$ is the standard deviation of the $k$ true condition means around the grand mean, and $\sigma_\varepsilon$ is the within-condition population standard deviation (error). Cohen's $f$ is used primarily in power analysis. Benchmarks: $f = 0.10$ (small), $f = 0.25$ (medium), $f = 0.40$ (large).

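9.5 Pairwise Cohen's $d$ for Post-Hoc Comparisons

For each pairwise comparison $(j, j')$ after a significant omnibus test:

$$d_{rm} = \frac{\bar{X}_{\cdot j} - \bar{X}_{\cdot j'}}{s_{d_{jj'}}}$$

Where $s_{d_{jj'}}$ is the standard deviation of the difference scores between conditions $j$ and $j'$. This is the paired Cohen's $d$ and is directly interpretable as the magnitude of the difference between two specific conditions. Because the standardiser is the SD of the difference scores (the quantity Lakens, 2013, labels $d_z$), its size depends on the correlation between the two conditions and should not be compared directly with a between-subjects $d$.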

9.6 Effect Size Summary Table

| Effect Size | Formula | Range | Best Use |
|---|---|---|---|
| $\eta^2$ | $SS_{Cond}/SS_{Total}$ | $[0,1]$ | Rarely recommended; underestimates effect |
| $\eta^2_p$ | $SS_{Cond}/(SS_{Cond}+SS_{Error})$ | $[0,1]$ | Standard reporting; comparisons within the same design |
| $\eta^2_G$ | $SS_{Cond}/(SS_{Cond}+SS_{BS}+SS_{Error})$ | $[0,1]$ | Meta-analysis; cross-design comparisons |
| $\omega^2_p$ | Bias-corrected $\eta^2_p$ | $(-\infty,1]$ | Small samples ($n<30$); unbiased estimate |
| Cohen's $f$ | $\sqrt{\eta^2_p/(1-\eta^2_p)}$ | $[0, \infty)$ | Power analysis |
| Pairwise $d_{rm}$ | $\Delta\bar{X}/s_{diff}$ | $(-\infty, \infty)$ | Pairwise post-hoc comparisons |
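
The ratio-based effect sizes in the table follow directly from the three sums of squares. The sketch below assumes a one-way design; the function name `effect_sizes` and the input values are illustrative only.

```python
import math


def effect_sizes(ss_condition, ss_subjects, ss_error):
    """Partial eta-squared, generalised eta-squared and Cohen's f (one-way design)."""
    eta_p = ss_condition / (ss_condition + ss_error)
    eta_g = ss_condition / (ss_condition + ss_subjects + ss_error)
    cohen_f = math.sqrt(eta_p / (1 - eta_p))   # equals sqrt(SS_condition / SS_error)
    return {"eta_p": eta_p, "eta_g": eta_g, "f": cohen_f}


# Illustrative sums of squares: condition 54, between-subjects 6, error 10.
print(effect_sizes(54.0, 6.0, 10.0))
```

The gap between the two eta-squared values grows with the between-subjects sum of squares, which is why the generalised version is the safer choice when comparing across designs.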

10. Confidence Intervals

10.1 CI for Each Condition Mean

The standard 95% CI for condition $j$:

$$\bar{X}_{\cdot j} \pm t_{\alpha/2,\; n-1} \times \frac{s_j}{\sqrt{n}}$$

Where $s_j$ is the standard deviation within condition $j$. These between-subjects CIs are correct for estimating the true population mean of condition $j$ but are not appropriate for visual inference about within-subjects differences (they are too wide and do not reflect the advantage of the within-subjects design).

10.2 Within-Subjects CIs for Profile Plots (Cousineau-Morey)

For visualising within-subjects mean differences, the Cousineau-Morey within-subjects CI removes between-participants variance before computing the error:

  1. Normalise each participant's scores: $X^*_{ij} = X_{ij} - \bar{X}_{i\cdot} + \bar{X}_{\cdot\cdot}$

  2. Compute the standard error of the normalised scores: $SE^*_j = s^*_j / \sqrt{n}$

  3. Apply the Morey correction factor $\sqrt{k/(k-1)}$: $SE^{WS}_j = \sqrt{\frac{k}{k-1}} \times SE^*_j$

  4. Construct the CI: $\bar{X}_{\cdot j} \pm t_{\alpha/2,\; n-1} \times SE^{WS}_j$

These within-subjects CIs are narrower than standard CIs and correctly represent the precision of within-subjects comparisons. When two such CIs barely overlap, the corresponding pairwise comparison is approximately significant at $\alpha = .05$.
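
The four steps above can be sketched as a short function. The name `cousineau_morey_ci` is hypothetical, and the critical value `t_crit` must be supplied by the caller (for example 4.303 for a 95% CI with $n - 1 = 2$) because the Python standard library has no t-distribution.

```python
import math
import statistics


def cousineau_morey_ci(data, t_crit):
    """Within-subjects CIs for each condition of an n x k matrix (rows = participants)."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)

    # Step 1: remove each participant's own mean, add back the grand mean.
    normalised = [[x - sum(row) / k + grand for x in row] for row in data]

    intervals = []
    for j in range(k):
        column = [row[j] for row in normalised]
        se_star = statistics.stdev(column) / math.sqrt(n)   # Step 2
        se_ws = math.sqrt(k / (k - 1)) * se_star            # Step 3: Morey correction
        mean_j = sum(row[j] for row in data) / n
        half_width = t_crit * se_ws                         # Step 4
        intervals.append((mean_j - half_width, mean_j + half_width, se_ws))
    return intervals


toy = [[2, 4, 6], [3, 4, 8], [1, 7, 10]]
for lower, upper, se in cousineau_morey_ci(toy, t_crit=4.303):
    print(f"[{lower:.2f}, {upper:.2f}]  SE_WS = {se:.4f}")
```

Normalisation leaves the condition means unchanged, so the intervals stay centred on the raw means; only their width changes.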

⚠️ Always label error bars in profile plots explicitly as "95% within-subjects confidence intervals (Morey correction)" to distinguish them from between-subjects CIs. Readers familiar with standard CIs will otherwise overestimate the uncertainty in pairwise differences.

10.3 CI for Partial Eta-Squared

CIs for $\eta^2_p$ are derived from the non-central F-distribution. The non-centrality parameter is:

$$\lambda = F \times df_1$$

Find $\lambda_L$ and $\lambda_U$ such that:

$$P(F_{df_1, df_2}(\lambda_L) \geq F_{obs}) = .025 \quad \text{and} \quad P(F_{df_1, df_2}(\lambda_U) \leq F_{obs}) = .025$$

Then:

$$\eta^2_{p,L} = \frac{\lambda_L}{\lambda_L + df_1 + df_2 + 1}, \qquad \eta^2_{p,U} = \frac{\lambda_U}{\lambda_U + df_1 + df_2 + 1}$$

An approximate 95% CI (adequate for $n \geq 30$):

$$\eta^2_p \pm 1.96 \times SE_{\eta^2_p}, \quad SE_{\eta^2_p} \approx \sqrt{\frac{2\eta^2_p(1-\eta^2_p)^2 (df_1+df_2+1)}{df_1 \cdot (df_1+df_2)^2}}$$

DataStatPro computes exact CIs using numerical inversion of the non-central F-distribution.

10.4 CI for Pairwise Mean Differences

For each pairwise contrast $(j, j')$:

$$(\bar{X}_{\cdot j} - \bar{X}_{\cdot j'}) \pm t_{\alpha'/2,\; n-1} \times \frac{s_{d_{jj'}}}{\sqrt{n}}$$

Where $\alpha' = \alpha/m$ (Bonferroni correction) or the appropriate adjusted critical value, and $m = k(k-1)/2$ is the number of pairwise comparisons.


11. Advanced Topics

11.1 Interaction Contrasts in Factorial Within-Subjects Designs

In a two-way factorial repeated measures design ($A \times B$), a significant interaction indicates that the effect of factor $A$ depends on the level of factor $B$ (or vice versa). Interaction contrasts decompose this interaction into focused comparisons:

For a $2 \times 2$ sub-table of a larger interaction, the interaction contrast compares the simple effect of $A$ at level $b_1$ to the simple effect of $A$ at level $b_2$:

$$\psi = (\mu_{a_1 b_1} - \mu_{a_2 b_1}) - (\mu_{a_1 b_2} - \mu_{a_2 b_2})$$

Each interaction contrast has $df = 1$ and can be tested against a single-df error term derived from the $A \times B \times S$ interaction. The $df$ of all orthogonal interaction contrasts sum to the interaction $df = (a-1)(b-1)$.
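
One way to test such a contrast, sketched below, is to compute $\psi$ for every participant and run a one-sample t-test on those scores; the resulting $t^2$ is the $F(1, n-1)$ for the contrast with a contrast-specific error term. The function name and the cell scores are hypothetical.

```python
import math
import statistics


def interaction_contrast(a1b1, a2b1, a1b2, a2b2):
    """t-test on per-participant scores psi_i = (a1b1 - a2b1) - (a1b2 - a2b2)."""
    psi = [(w - x) - (y - z) for w, x, y, z in zip(a1b1, a2b1, a1b2, a2b2)]
    n = len(psi)
    mean_psi = statistics.mean(psi)
    se_psi = statistics.stdev(psi) / math.sqrt(n)
    return {"psi": mean_psi, "t": mean_psi / se_psi, "df": n - 1}


# Hypothetical scores for four participants in each cell of a 2 x 2 sub-table.
result = interaction_contrast(
    a1b1=[10, 12, 14, 11],
    a2b1=[8, 9, 10, 9],
    a1b2=[9, 10, 12, 10],
    a2b2=[9, 9, 11, 10],
)
print(result)
```

Working with per-participant contrast scores avoids relying on sphericity, because each contrast carries its own error estimate.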

11.2 Handling Missing Data

Standard repeated measures ANOVA requires complete data — every participant must have an observation in every condition. When data are missing:

| Missing Data Mechanism | Recommended Approach |
|---|---|
| Missing Completely at Random (MCAR) | Complete-case analysis acceptable; note reduced $n$ |
| Missing at Random (MAR) | Multiple imputation (MI); linear mixed-effects model |
| Missing Not at Random (MNAR) | Pattern-mixture or selection models; sensitivity analysis |

DataStatPro's LMM module handles MAR missing data using Full Information Maximum Likelihood (FIML), which includes all available data without requiring complete cases.

11.3 Counterbalancing and Order Effects

When the order of conditions may introduce carryover effects, counterbalancing assigns different condition orders to different participants. Complete counterbalancing (all $k!$ orders represented) is only feasible for small $k$ ($k \leq 4$). For larger $k$, Latin square designs provide a systematic partial counterbalancing:

In a Latin square counterbalancing, each condition appears exactly once in each ordinal position, ensuring that order effects are distributed evenly across conditions and do not confound the condition means.
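
A cyclic Latin square is the simplest construction with this property. The sketch below is illustrative (the function name is hypothetical); note that a cyclic square balances ordinal position only, not which condition immediately precedes which.

```python
def cyclic_latin_square(conditions):
    """Return k condition orders in which every condition occupies each position once."""
    k = len(conditions)
    return [[conditions[(row + position) % k] for position in range(k)] for row in range(k)]


orders = cyclic_latin_square(["A", "B", "C", "D", "E"])
for i, order in enumerate(orders, start=1):
    print(f"Order {i}: {' '.join(order)}")
```

Participants are then assigned to the $k$ orders in equal numbers, so $n$ should be a multiple of $k$.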

To formally test for order effects in DataStatPro, add "Order Position" as a covariate in a mixed ANOVA after counterbalancing.

11.4 Multivariate vs. Univariate Approach: When to Choose

| Criterion | Favour Univariate (with $\varepsilon$ correction) | Favour Multivariate (MANOVA) |
|---|---|---|
| Sphericity | Holds ($\hat{\varepsilon} \geq 0.90$) | Violated ($\hat{\varepsilon} < 0.75$) |
| Sample size | Small ($n < 2k$) | Large ($n > 2k$) |
| Number of conditions | $k$ is large | $k$ is small relative to $n$ |
| Focus | Omnibus test of condition effect | Specific multivariate structure |
| Power | Higher when sphericity holds | Higher when sphericity is violated and $n$ is large |

11.5 Effect Size Comparability: $\eta^2_G$ for Cross-Study Comparisons

A critical limitation of $\eta^2_p$ is that its value depends on how many conditions are included in the design. Adding a new condition level increases $SS_{Condition}$ (the numerator) and may change $SS_{Error}$ (the denominator), making $\eta^2_p$ values from different studies with different $k$ incomparable.

Generalised eta-squared ($\eta^2_G$) solves this problem by including the stable between-subjects variance ($SS_{BS}$) in the denominator. This quantity is relatively constant across studies and provides a common denominator for effect size comparisons regardless of the number of conditions.

Recommendation: Always report both $\eta^2_p$ (for direct F-ratio interpretation and local comparison) and $\eta^2_G$ (for meta-analytic and cross-study comparisons).

11.6 Multiple Comparisons in Repeated Measures Designs

When conducting $m = k(k-1)/2$ pairwise post-hoc comparisons, the familywise error rate inflates. For independent tests:

$$FWER = 1 - (1-\alpha)^m$$

For $k = 5$ conditions ($m = 10$ pairs): $FWER = 1 - (0.95)^{10} = .401$.

Correction strategies specific to repeated measures:

| Method | Description | Properties |
|---|---|---|
| Bonferroni | $\alpha' = \alpha/m$ | Simple; overly conservative with many comparisons |
| Holm | Sequential Bonferroni | Less conservative; strongly controls FWER |
| Šidák | $\alpha' = 1 - (1-\alpha)^{1/m}$ | Slightly less conservative than Bonferroni |
| Tukey's HSD | Uses the studentised range distribution | Designed for all pairwise comparisons; relies on sphericity in within-subjects designs |
| Benjamini-Hochberg | Controls FDR rather than FWER | Appropriate for exploratory studies |

For planned contrasts (hypotheses specified before data collection), no correction is required if the contrasts are orthogonal and pre-registered.
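
The familywise error rate and two of the corrections can be checked numerically. This is a minimal sketch with illustrative function names; `holm_adjust` returns adjusted p-values in the original order, and the three raw p-values are made up.

```python
def familywise_error_rate(alpha, m):
    """FWER for m independent tests, each run at level alpha."""
    return 1 - (1 - alpha) ** m


def sidak_alpha(alpha, m):
    """Per-comparison alpha under the Sidak correction."""
    return 1 - (1 - alpha) ** (1 / m)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, candidate)   # keep adjusted values monotone
        adjusted[index] = running_max
    return adjusted


print(round(familywise_error_rate(0.05, 10), 3))
print(sidak_alpha(0.05, 10))
print(holm_adjust([0.01, 0.04, 0.03]))
```

The first line reproduces the .401 figure for ten comparisons. Holm multiplies the smallest p-value by $m$, the next by $m - 1$, and so on, which is why it is never more conservative than Bonferroni.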

11.7 Power Considerations: The Role of Within-Subjects Correlation

The primary driver of power in repeated measures ANOVA is the average correlation among conditions ($\rho$). The error mean square is:

$$MS_{Error} \propto \sigma^2(1 - \rho)$$

As $\rho$ increases toward 1 (perfect consistency of individual ordering across conditions), $MS_{Error}$ approaches 0 and power approaches 1 for any non-zero effect.

Implication for design: Repeated measures designs are most efficient when individual differences are large and consistent. For traits that are highly variable across individuals but stable within individuals (e.g., cognitive ability, personality), the power advantage of within-subjects designs is enormous.

Sample size estimation in DataStatPro requires specifying:

  • The expected effect size $f$ (or equivalently $\eta^2_p$).
  • The number of conditions $k$.
  • The expected average correlation $\rho$ among conditions.
  • The significance level $\alpha$.
  • The desired power $1-\beta$.

11.8 Bayesian Repeated Measures ANOVA

The Bayesian approach provides evidence quantification rather than a binary decision. DataStatPro implements Rouder et al.'s (2012) Bayes Factor for within-subjects designs, comparing:

  • $H_1$: A model including the condition effect.
  • $H_0$: A model including only individual differences (no condition effect).

The Bayes Factor $BF_{10}$ directly quantifies how much more probable the data are under the condition-effect model than under the null. Unlike a non-significant frequentist p-value, $BF_{10} < 1/3$ constitutes positive evidence for the null model relative to the condition-effect model, making the Bayesian approach especially valuable for studies aiming to support a null result (e.g., demonstrating that a new training protocol has no effect on performance).


12. Worked Examples

Example 1: Pain Ratings Across Three Treatment Phases

A clinical researcher measures pain ratings (0–100 VAS scale) in $n = 12$ chronic pain patients at three time points: Pre-treatment (Baseline), Mid-treatment (Week 6), and Post-treatment (Week 12). Do pain ratings change significantly over time?

Data Matrix ($n = 12$, $k = 3$):

| Participant | Baseline ($X_{i1}$) | Week 6 ($X_{i2}$) | Week 12 ($X_{i3}$) | Person Mean ($\bar{X}_{i\cdot}$) |
|---|---|---|---|---|
| 1 | 72 | 58 | 45 | 58.33 |
| 2 | 65 | 50 | 38 | 51.00 |
| 3 | 80 | 62 | 50 | 64.00 |
| 4 | 55 | 44 | 32 | 43.67 |
| 5 | 78 | 60 | 48 | 62.00 |
| 6 | 60 | 48 | 35 | 47.67 |
| 7 | 70 | 54 | 42 | 55.33 |
| 8 | 82 | 65 | 52 | 66.33 |
| 9 | 58 | 46 | 33 | 45.67 |
| 10 | 75 | 57 | 45 | 59.00 |
| 11 | 68 | 52 | 40 | 53.33 |
| 12 | 63 | 49 | 37 | 49.67 |
| Condition Mean | 68.83 | 53.75 | 41.42 | 54.67 |

Step 1 — Hypotheses:

$H_0: \mu_{Baseline} = \mu_{Week6} = \mu_{Week12}$

$H_1:$ At least one time point mean differs.

Step 2 — Sum of Squares:

Intermediate values are carried unrounded, so hand arithmetic on the rounded figures shown can differ in the last digit. From the data matrix, $\sum X_{ij} = 1{,}968$ and $\sum X_{ij}^2 = 113{,}956$.

$$SS_{Total} = \sum_{i}\sum_{j}(X_{ij} - 54.67)^2 = 113{,}956 - \frac{1{,}968^2}{36} = 113{,}956 - 107{,}584 = 6{,}372.00$$

$$SS_{Between\text{-}Subjects} = 3\sum_{i}(\bar{X}_{i\cdot} - 54.67)^2 = 3 \times 600.89 = 1{,}802.67$$

$$SS_{Condition} = 12\left[(68.83-54.67)^2 + (53.75-54.67)^2 + (41.42-54.67)^2\right]$$

$$= 12\left[200.69 + 0.84 + 175.56\right] = 12 \times 377.10 = 4{,}525.17$$

$$SS_{Error} = SS_{Total} - SS_{Between\text{-}Subjects} - SS_{Condition} = 6{,}372.00 - 1{,}802.67 - 4{,}525.17 = 44.17$$

The same value is obtained directly from the residuals, $SS_{Error} = \sum_{i,j}(X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot})^2 = 44.17$. The error term is small because every participant shows nearly the same drop from one time point to the next.

Step 3 — Degrees of Freedom:

$$df_{Condition} = k - 1 = 2$$

$$df_{Error} = (k-1)(n-1) = 2 \times 11 = 22$$

Step 4 — Mauchly's Test:

$W = 0.106$, $\chi^2(2) = 22.43$, $p < .001$

$\hat{\varepsilon}_{GG} = 0.528$, $\tilde{\varepsilon}_{HF} = 0.537$

Sphericity is violated ($p < .001$, $\hat{\varepsilon}_{GG} = 0.528 < 0.75$): the Week 6 vs. Week 12 difference scores vary far less ($s_d = 0.49$) than the two sets of differences involving Baseline ($s_d = 2.50$ and $2.35$). Apply the Greenhouse-Geisser correction: $df^*_{Condition} = 2 \times 0.528 = 1.06$ and $df^*_{Error} = 22 \times 0.528 = 11.62$.

Step 5 — Mean Squares and F-Ratio:

$$MS_{Condition} = 4{,}525.17/2 = 2{,}262.58$$

$$MS_{Error} = 44.17/22 = 2.01$$

$$F = 2{,}262.58/2.008 = 1{,}127.02$$

The correction multiplies both degrees of freedom by the same $\hat{\varepsilon}_{GG}$, so $F$ is unchanged; only the reference distribution changes.

Step 6 — p-Value:

$$p = P(F_{1.06,\,11.62} \geq 1{,}127.02) < .001$$

Step 7 — Effect Sizes:

$$\eta^2_p = 4{,}525.17/(4{,}525.17 + 44.17) = 4{,}525.17/4{,}569.33 = 0.990$$

$$\omega^2_p = \frac{4{,}525.17 - 2(2.008)}{4{,}525.17 + (12-3+1)(2.008)} = \frac{4{,}521.15}{4{,}545.24} = 0.995$$

$f = \sqrt{\eta^2_p/(1-\eta^2_p)} = \sqrt{4{,}525.17/44.17} = \sqrt{102.46} = 10.12$ (very large)

Step 8 — Post-Hoc Comparisons (Bonferroni, $m = 3$ pairs, $\alpha' = .017$):

| Comparison | Mean Diff | $s_d$ | $t(11)$ | $p_{adj}$ (Bonferroni) | $d_{rm}$ |
|---|---|---|---|---|---|
| Baseline vs. Week 6 | $15.08$ | $2.50$ | $20.87$ | $< .001$ | $6.03$ |
| Baseline vs. Week 12 | $27.42$ | $2.35$ | $40.36$ | $< .001$ | $11.65$ |
| Week 6 vs. Week 12 | $12.33$ | $0.49$ | $86.77$ | $< .001$ | $25.05$ |

All pairwise comparisons are significant; pain ratings decreased significantly between every pair of time points. The very large $d_{rm}$ values reflect how uniform the individual changes are (for example, every participant drops by 12 or 13 points from Week 6 to Week 12), not only the size of the mean change.

Summary Table:

| Source | $SS$ | $df$ | $MS$ | $F$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|---|---|
| Between-Subjects | $1{,}802.67$ | $11$ | | | | |
| Time | $4{,}525.17$ | $2$ | $2{,}262.58$ | $1{,}127.02$ | $< .001$ | $.990$ |
| Error | $44.17$ | $22$ | $2.01$ | | | |
| Total | $6{,}372.00$ | $35$ | | | | |

The table shows uncorrected $df$ and $MS$; the Greenhouse-Geisser corrected test uses $df = 1.06, 11.62$.

APA write-up: "A one-way repeated measures ANOVA revealed a significant effect of time on pain ratings. Mauchly's test indicated that the sphericity assumption was violated, $W = 0.106$, $\chi^2(2) = 22.43$, $p < .001$, so Greenhouse-Geisser corrected degrees of freedom are reported ($\hat{\varepsilon}_{GG} = 0.53$): $F(1.06, 11.62) = 1{,}127.02$, $p < .001$, $\eta^2_p = .99$, $\omega^2_p = .99$. Bonferroni-corrected pairwise comparisons revealed that pain ratings decreased significantly from baseline ($M = 68.83$, $SD = 8.88$) to week 6 ($M = 53.75$, $SD = 6.68$; $t(11) = 20.87$, $p < .001$, $d_{rm} = 6.03$), from baseline to week 12 ($M = 41.42$, $SD = 6.67$; $t(11) = 40.36$, $p < .001$, $d_{rm} = 11.65$), and from week 6 to week 12 ($t(11) = 86.77$, $p < .001$, $d_{rm} = 25.05$)."
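
To check the sums of squares for Example 1 against the data matrix, the raw-score (computational) formulas can be applied directly. This is a minimal sketch; the variable names are illustrative, and the p-value is omitted because the Python standard library has no F-distribution.

```python
# Pain ratings from Example 1: rows = participants, columns = Baseline, Week 6, Week 12.
pain = [
    [72, 58, 45], [65, 50, 38], [80, 62, 50], [55, 44, 32],
    [78, 60, 48], [60, 48, 35], [70, 54, 42], [82, 65, 52],
    [58, 46, 33], [75, 57, 45], [68, 52, 40], [63, 49, 37],
]
n, k = len(pain), len(pain[0])

grand_total = sum(sum(row) for row in pain)
correction = grand_total ** 2 / (n * k)          # (sum of all scores)^2 / nk

ss_total = sum(x * x for row in pain for x in row) - correction
ss_subjects = sum(sum(row) ** 2 for row in pain) / k - correction
ss_condition = sum(sum(row[j] for row in pain) ** 2 for j in range(k)) / n - correction
ss_error = ss_total - ss_subjects - ss_condition

ms_condition = ss_condition / (k - 1)
ms_error = ss_error / ((k - 1) * (n - 1))
f_ratio = ms_condition / ms_error

for label, value in [
    ("SS_Total", ss_total), ("SS_Between-Subjects", ss_subjects),
    ("SS_Condition", ss_condition), ("SS_Error", ss_error), ("F", f_ratio),
]:
    print(f"{label}: {value:.2f}")
```

The computational formulas avoid rounding the means before squaring, which is the usual source of last-digit discrepancies in hand calculations.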


Example 2: Stroop Interference Across Three Congruency Conditions

A cognitive psychologist measures reaction time (ms) in $n = 20$ participants across three Stroop conditions: Congruent, Neutral, and Incongruent.

Summary Statistics:

| Condition | $\bar{X}_{\cdot j}$ | $s_j$ |
|---|---|---|
| Congruent | $480$ ms | $55$ ms |
| Neutral | $520$ ms | $58$ ms |
| Incongruent | $620$ ms | $72$ ms |
| Grand Mean | 540 ms | |

Correlation matrix (estimated from pilot data):

| | Congruent | Neutral | Incongruent |
|---|---|---|---|
| Congruent | 1.00 | 0.72 | 0.65 |
| Neutral | 0.72 | 1.00 | 0.78 |
| Incongruent | 0.65 | 0.78 | 1.00 |

Step 1 — Compute SS from Summary Statistics:

$$SS_{Condition} = n\sum_j(\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2$$

$$= 20\left[(480-540)^2 + (520-540)^2 + (620-540)^2\right]$$

$$= 20\left[3{,}600 + 400 + 6{,}400\right] = 20 \times 10{,}400 = 208{,}000$$

Computing $MS_{Error}$ from the covariance matrix (using $s_j^2$ and $r_{jj'}$):

Each pairwise difference variance: $\text{Var}(d_{1,2}) = s_1^2 + s_2^2 - 2r_{12}s_1 s_2 = 3{,}025 + 3{,}364 - 2(0.72)(55)(58) = 6{,}389 - 4{,}593.6 = 1{,}795.4$

$$\text{Var}(d_{1,3}) = 3{,}025 + 5{,}184 - 2(0.65)(55)(72) = 8{,}209 - 5{,}148 = 3{,}061.0$$

$$\text{Var}(d_{2,3}) = 3{,}364 + 5{,}184 - 2(0.78)(58)(72) = 8{,}548 - 6{,}514.6 = 2{,}033.4$$

$MS_{Error}$ equals the mean variance minus the mean covariance, which is half the average pairwise difference variance:

$$MS_{Error} = \frac{1}{2} \times \frac{\text{Var}(d_{1,2}) + \text{Var}(d_{1,3}) + \text{Var}(d_{2,3})}{k(k-1)/2} = \frac{1}{2} \times \frac{1{,}795.4 + 3{,}061.0 + 2{,}033.4}{3} = \frac{2{,}296.6}{2} = 1{,}148.3$$

Note: Given the means, standard deviations, and the full correlation matrix, this reconstruction matches the full-data result up to rounding of the summary statistics.

Step 2 — Degrees of Freedom:

$$df_{Condition} = 2, \quad df_{Error} = (3-1)(20-1) = 38$$

Step 3 — F-Ratio:

$$MS_{Condition} = 208{,}000/2 = 104{,}000$$

$$F = 104{,}000/1{,}148.3 = 90.57$$

Step 4 — Mauchly's Test:

From the covariance matrix: $W = 0.886$, $\chi^2(2) = 2.19$, $p = .335$, $\hat{\varepsilon}_{GG} = 0.897$. Mauchly's test is non-significant and $\hat{\varepsilon}_{GG} \approx .90$, so uncorrected results are reported.

Step 5 — Effect Size:

$$\eta^2_p = 208{,}000/(208{,}000 + 38 \times 1{,}148.3) = 208{,}000/(208{,}000 + 43{,}635.7) = 0.827$$

(very large effect; $f = \sqrt{208{,}000/43{,}635.7} \approx 2.18$)

Post-Hoc Contrasts (a priori hypothesis: Congruent < Neutral < Incongruent):

| Comparison | Mean Diff (ms) | $d_{rm}$ | Bonferroni $p$ |
|---|---|---|---|
| Incongruent vs. Congruent | $+140$ | $2.53$ | $< .001$ |
| Incongruent vs. Neutral | $+100$ | $2.22$ | $< .001$ |
| Neutral vs. Congruent | $+40$ | $0.94$ | $< .01$ |

Each $d_{rm}$ is the mean difference divided by the SD of the difference scores, for example $140/\sqrt{3{,}061.0} = 2.53$.

APA write-up: "A one-way repeated measures ANOVA indicated a significant effect of Stroop congruency on reaction time, $F(2, 38) = 90.57$, $p < .001$, $\eta^2_p = .83$, $\omega^2_p = .90$. Mauchly's test indicated that sphericity was not violated, $W = 0.89$, $\chi^2(2) = 2.19$, $p = .335$. Post-hoc pairwise comparisons (Bonferroni corrected) revealed that incongruent trials ($M = 620$ ms, $SD = 72$ ms) were significantly slower than both neutral ($M = 520$ ms; $\Delta M = 100$ ms, $d_{rm} = 2.22$, $p < .001$) and congruent trials ($M = 480$ ms; $\Delta M = 140$ ms, $d_{rm} = 2.53$, $p < .001$). Neutral trials were also significantly slower than congruent trials ($\Delta M = 40$ ms, $d_{rm} = 0.94$, $p < .01$)."
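
For summary-statistics input, the error term can be rebuilt from the standard deviations and correlations alone. The sketch below assumes the standard identity $MS_{Error} = \bar{s^2} - \overline{\text{cov}}$ (mean variance minus mean covariance) and uses the Example 2 values; the function name is illustrative and this is not a description of DataStatPro's internals.

```python
def ms_error_from_summary(sds, corr):
    """MS_Error = mean condition variance - mean pairwise covariance."""
    k = len(sds)
    mean_variance = sum(s * s for s in sds) / k
    covariances = [
        corr[a][b] * sds[a] * sds[b] for a in range(k) for b in range(a + 1, k)
    ]
    return mean_variance - sum(covariances) / len(covariances)


# Example 2 summary statistics (Congruent, Neutral, Incongruent).
n = 20
means = [480.0, 520.0, 620.0]
sds = [55.0, 58.0, 72.0]
corr = [
    [1.00, 0.72, 0.65],
    [0.72, 1.00, 0.78],
    [0.65, 0.78, 1.00],
]

k = len(means)
grand_mean = sum(means) / k
ss_condition = n * sum((m - grand_mean) ** 2 for m in means)
ms_condition = ss_condition / (k - 1)
ms_error = ms_error_from_summary(sds, corr)
f_ratio = ms_condition / ms_error
eta_p = ss_condition / (ss_condition + ms_error * (k - 1) * (n - 1))

print(f"MS_Error = {ms_error:.1f}, F = {f_ratio:.2f}, partial eta-squared = {eta_p:.3f}")
```

Higher correlations raise the mean covariance and shrink the error term, which is the power advantage of the within-subjects design expressed in numbers.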


Example 3: Mixed ANOVA — Rehabilitation Programme Across Two Groups and Three Time Points

A physiotherapist compares two rehabilitation protocols (Standard vs. Enhanced) on functional mobility scores (higher = better) in $N = 30$ patients ($n = 15$ per group) across three time points: Pre, Post-4wk, and Post-8wk.

Summary Statistics:

| Group | Pre | Post-4wk | Post-8wk |
|---|---|---|---|
| Standard ($n = 15$) | $45.3$ ($s = 7.2$) | $55.8$ ($s = 8.1$) | $61.2$ ($s = 9.0$) |
| Enhanced ($n = 15$) | $44.8$ ($s = 6.9$) | $62.4$ ($s = 7.8$) | $74.6$ ($s = 8.5$) |

Step 1 — Hypotheses:

  • $H_{0,Group}$: Standard and Enhanced groups have equal mean mobility scores (averaged across time).
  • $H_{0,Time}$: Mean mobility scores are equal across Pre, Post-4wk, and Post-8wk (averaged across groups).
  • $H_{0,Group \times Time}$: The time trajectory does not differ between groups (primary hypothesis).

Step 2 — ANOVA Source Table (condensed):

| Source | $F$ | $df_1, df_2$ | $p$ | $\eta^2_p$ |
|---|---|---|---|---|
| Group (Between) | $4.82$ | $1, 28$ | $.037$ | $.147$ |
| Error (Between: Subjects within Groups) | | $28$ | | |
| Time (Within) | $98.45$ | $2, 56$ | $< .001$ | $.779$ |
| Group × Time (Interaction) | $12.37$ | $2, 56$ | $< .001$ | $.307$ |
| Error (Within) | | $56$ | | |

Mauchly's test: $W = 0.943$, $\hat{\varepsilon}_{GG} = 0.948$, $p = .421$. Sphericity holds; uncorrected results reported.

Step 3 — Interpreting the Interaction (Primary Test):

The Group × Time interaction is significant, $F(2, 56) = 12.37$, $p < .001$, $\eta^2_p = .307$. This indicates that the trajectory of improvement over time differs between the Standard and Enhanced groups. Profile plot inspection reveals that both groups improve over time, but the Enhanced group improves at a faster rate: the gap between groups widens from Pre to Post-8wk.

Step 4 — Simple Effects (Follow-Up):

To unpack the interaction, compute the time effect separately within each group:

| Group | $F(2, 28)$ for Time | $p$ | $\eta^2_p$ |
| --- | --- | --- | --- |
| Standard | $42.60$ | $< .001$ | $.753$ |
| Enhanced | $89.13$ | $< .001$ | $.864$ |

Both groups show significant improvement over time. The difference in rate of improvement (interaction) is characterised by the Group × Time contrast.

APA write-up: "A 2 (Group: Standard vs. Enhanced) × 3 (Time: Pre, Post-4wk, Post-8wk) mixed ANOVA was conducted with Group as the between-subjects factor and Time as the within-subjects factor. Mauchly's test confirmed sphericity was not violated, $W = 0.943$, $\chi^2(2) = 0.89$, $p = .421$, $\hat{\varepsilon}_{GG} = 0.948$. The Group × Time interaction was significant, $F(2, 56) = 12.37$, $p < .001$, $\eta^2_p = .31$ [95% CI: .11, .47], indicating that the two rehabilitation groups differed in their rate of improvement over time. Simple effects analyses confirmed that both groups improved significantly across time points (Standard: $F(2, 28) = 42.60$, $p < .001$; Enhanced: $F(2, 28) = 89.13$, $p < .001$), with the Enhanced group demonstrating a steeper improvement trajectory, reaching a mean mobility score of $74.6$ ($SD = 8.5$) at Post-8wk compared to $61.2$ ($SD = 9.0$) for the Standard group. Main effects of Time, $F(2, 56) = 98.45$, $p < .001$, $\eta^2_p = .78$, and Group, $F(1, 28) = 4.82$, $p = .037$, $\eta^2_p = .15$, were both significant but are qualified by the interaction."


13. Common Mistakes and How to Avoid Them

Mistake 1: Ignoring the Sphericity Assumption

Problem: Running repeated measures ANOVA without testing or correcting for sphericity. When sphericity is violated, the nominal F-distribution is incorrect and the Type I error rate is inflated, sometimes substantially (e.g., an actual $\alpha$ of $.15$ when nominal $\alpha = .05$ for $k = 5$ and severe violation).

Solution: Always report Mauchly's test result together with $\hat{\varepsilon}_{GG}$ and $\tilde{\varepsilon}_{HF}$. Apply the HF correction when $\hat{\varepsilon}_{GG} > .75$ and the GG correction when $\hat{\varepsilon}_{GG} \leq .75$. Consider the multivariate approach for severe violations with adequate $n$.


Mistake 2: Treating the Omnibus F-Test as the Final Answer

Problem: Reporting a significant omnibus F and concluding "the conditions differ" without specifying which conditions differ or by how much. The omnibus test provides no information about the pattern of differences.

Solution: Always follow a significant omnibus F with post-hoc pairwise comparisons (for exploratory research) or planned contrasts (for hypothesis-driven research). Report mean differences, confidence intervals, and pairwise effect sizes ($d_{rm}$) for each comparison of interest.


Mistake 3: Using Between-Subjects CIs in Profile Plots

Problem: Plotting standard between-subjects error bars (which include individual difference variance) on a profile plot for repeated measures data. These CIs are too wide and misleadingly suggest that adjacent condition means are not significantly different even when the within-subjects F-test is highly significant.

Solution: Use within-subjects confidence intervals (Cousineau-Morey correction) for all profile plots involving repeated measures. Always label error bars explicitly in the figure caption and state the correction used.
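
As a rough illustration of the Cousineau-Morey approach, the sketch below normalises each participant's scores (subtract the participant mean, add the grand mean) and then applies the Morey factor $\sqrt{k/(k-1)}$ to the resulting standard errors. It assumes complete data in a participants-by-conditions layout; the function name and the toy scores are hypothetical. To turn a standard error into a 95% CI half-width, multiply it by the critical $t$ value with $n - 1$ degrees of freedom, which is not computed here.

```python
from math import sqrt
from statistics import mean, stdev


def cousineau_morey_se(data):
    """Return one within-subjects standard error per condition.

    data: list of participant rows, each holding k condition scores
    (complete data assumed; no missing cells).
    """
    n, k = len(data), len(data[0])
    grand_mean = mean([x for row in data for x in row])
    # Cousineau step: remove each participant's overall level, keep the grand mean.
    normalised = [[x - mean(row) + grand_mean for x in row] for row in data]
    # Morey step: rescale to offset the variance shrinkage caused by normalising.
    morey = sqrt(k / (k - 1))
    return [morey * stdev([row[j] for row in normalised]) / sqrt(n) for j in range(k)]


# Hypothetical scores: participants differ a lot in level, little in pattern.
scores = [[10, 12, 15], [20, 22, 25], [30, 32, 36]]
within_se = cousineau_morey_se(scores)
# Conventional between-subjects standard errors, for comparison only.
between_se = [stdev([row[j] for row in scores]) / sqrt(len(scores)) for j in range(3)]
print("within-subjects SE: ", [round(se, 3) for se in within_se])
print("between-subjects SE:", [round(se, 3) for se in between_se])
```

With data like these, where individual differences dominate, the within-subjects standard errors come out far smaller than the between-subjects ones, which is the discrepancy this mistake describes.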


Mistake 4: Conflating the Time Effect with the Group × Time Interaction in Mixed ANOVA

Problem: In mixed ANOVA, focusing on the significant main effect of Time and concluding that the treatment works, while ignoring that the Time effect averages across groups (including the control group). The relevant test for treatment efficacy is the Group × Time interaction, not the main effect of Time.

Solution: For intervention studies with a control group, the primary hypothesis test is always the Group × Time interaction. The main effects of Time and Group are typically secondary or incidental.


Mistake 5: Reporting Only $\eta^2_p$ Without Context

Problem: Reporting $\eta^2_p = 0.73$ without noting that this is a within-subjects design, or without reporting $\eta^2_G$. Partial eta-squared in repeated measures designs is typically much larger than in between-subjects designs for equivalent true effect sizes because $SS_{Between\text{-}Subjects}$ is excluded from the denominator. This makes $\eta^2_p$ values non-comparable across designs.

Solution: Always report both $\eta^2_p$ and $\eta^2_G$. Note whether the effect size is from a within- or between-subjects design. Report $\omega^2_p$ as the bias-corrected complement to $\eta^2_p$, especially for small $n$.


Mistake 6: Applying Repeated Measures ANOVA to Dependent Groups

Problem: Analysing data where different participants are matched or paired (not the same participants) using the same software procedure as for within-subjects designs. While mathematically the analysis is valid (matched pairs can be treated as if the same participant were measured twice), the interpretation of the between-subjects term changes and must be communicated clearly.

Solution: Clearly state in the methods section whether the design uses the same participants (within-subjects) or matched participants. The statistical procedure is the same, but the language of "the same participants across conditions" vs. "matched pairs" differs and affects interpretation of the between-subjects variance term.


Mistake 7: Conducting Multiple Repeated Measures ANOVAs Without Correction

Problem: Running separate repeated measures ANOVAs on multiple dependent variables (e.g., one for each of 10 outcome scales), each at $\alpha = .05$, without any multiple comparison correction. This inflates the experiment-wise Type I error rate.

Solution: Use MANOVA to jointly test all dependent variables simultaneously if they are theoretically related (and if $n$ is adequate). If separate ANOVAs are necessary, apply Bonferroni or Holm correction to the omnibus p-values. Report all tested ANOVAs, not just significant ones.
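
When separate ANOVAs are unavoidable, the Holm step-down adjustment can be applied to the set of omnibus p-values with a few lines of code. The following is a minimal sketch; the function name `holm_adjust` and the three p-values are hypothetical and are not taken from any example in this tutorial.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values may never decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical omnibus p-values from three separate repeated measures ANOVAs.
omnibus_p = [0.01, 0.04, 0.03]
print(holm_adjust(omnibus_p))
```

Each adjusted value is compared directly against the nominal $\alpha$, so an ANOVA is declared significant only if its Holm-adjusted p-value falls below $.05$.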


Mistake 8: Forgetting Counterbalancing for Order-Sensitive Conditions

Problem: Presenting all participants with conditions in the same fixed order (e.g., always Condition 1, then Condition 2, then Condition 3). Any practice, fatigue, or contrast effects will be confounded with the condition effect, potentially inflating or deflating specific condition means.

Solution: Counterbalance condition order across participants. Use a complete counterbalancing scheme (all $k!$ orders) for small $k$ or a Latin square for larger $k$. Include order position as a covariate in the analysis to check for residual order effects.
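
Both schemes are easy to generate programmatically. The sketch below enumerates all $k!$ orders with `itertools.permutations` and builds a simple cyclic Latin square; the function names and condition labels are illustrative. Note that a cyclic square balances serial position only. It does not balance which condition immediately precedes which, so a balanced (Williams) design would be needed if first-order carryover is a concern.

```python
from itertools import permutations


def complete_counterbalancing(conditions):
    """All k! presentation orders (practical only for small k)."""
    return [list(order) for order in permutations(conditions)]


def cyclic_latin_square(conditions):
    """k orders in which every condition occupies every position exactly once."""
    k = len(conditions)
    return [[conditions[(row + col) % k] for col in range(k)] for row in range(k)]


conditions = ["Congruent", "Neutral", "Incongruent"]
all_orders = complete_counterbalancing(conditions)
square = cyclic_latin_square(conditions)

print(f"{len(all_orders)} complete orders")
for order in square:
    print(order)
```

Participants would then be assigned to rows in rotation (participant 1 to row 1, participant 2 to row 2, and so on), ideally with $n$ a multiple of the number of rows.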


14. Troubleshooting

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| $F < 1$ | Condition means more similar than chance; very noisy data; possible outliers | Check data; verify conditions manipulate the intended construct; inspect outliers |
| Mauchly's $W = 1.00$ | Only $k = 2$ conditions (sphericity trivially satisfied) or perfectly equal difference variances | Report sphericity as trivially satisfied for $k = 2$; verify data for $k > 2$ |
| $\hat{\varepsilon}_{GG}$ is very small ($< 0.50$) | Severe sphericity violation; markedly unequal difference variances | Report GG-corrected results; also report MANOVA; consider LMM |
| GG and HF corrected p-values differ substantially | Moderate sphericity violation near the $0.75$ boundary | Report both corrections; use HF as primary if $\hat{\varepsilon}_{GG} > 0.75$ |
| MANOVA test is significant but univariate F is not | Multivariate structure not captured by univariate mean differences; multivariate test sensitive to profile shape, not just level | Report both; describe the multivariate pattern using discriminant function analysis |
| $\eta^2_p > 0.90$ | Very large effect; possible design confound; or $n$ is small relative to the effect | Verify no confounds (order effects, demand characteristics); replicate with independent sample |
| $\omega^2_p$ is negative | Very small true effect; $F < 1$; small $n$ | Report as $\omega^2_p \approx 0$; do not report negative values as meaningful |
| Post-hoc tests all non-significant despite significant omnibus $F$ | Bonferroni correction too conservative with many comparisons | Use Holm or Tukey correction; consider planned contrasts if hypotheses existed a priori |
| Sphericity test unavailable | Only $k = 2$ conditions (sphericity not testable) or all difference variances are zero | For $k = 2$, sphericity is trivially met; report paired t-test results |
| Missing data in one or more cells | Participant missing a condition | Exclude participant from analysis (report reduced $n$) or switch to LMM for all-inclusive analysis |
| Very wide CIs for $\eta^2_p$ | Small $n$ | Increase $n$; report CIs faithfully (they convey genuine uncertainty); plan an adequately powered replication |
| Profile plot lines cross (mixed ANOVA) | Likely significant Group × Time interaction | Test the interaction formally; if significant, report and interpret simple effects |
| Friedman test disagrees with repeated measures ANOVA | Non-normality causing ANOVA to be unreliable | Use Friedman test results if normality is severely violated; report both and note discrepancy |

15. Quick Reference Cheat Sheet

Core Equations

| Formula | Description |
| --- | --- |
| $SS_{Condition} = n\sum_j(\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2$ | Condition sum of squares |
| $SS_{BS} = k\sum_i(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2$ | Between-subjects sum of squares |
| $SS_{Error} = SS_{Total} - SS_{Condition} - SS_{BS}$ | Error sum of squares |
| $df_{Condition} = k-1$ | Condition degrees of freedom |
| $df_{Error} = (k-1)(n-1)$ | Error degrees of freedom |
| $F = MS_{Condition}/MS_{Error}$ | F-ratio |
| $p = P(F_{df_1, df_2} \geq F_{obs})$ | Right-tail p-value |
| $\eta^2_p = SS_{Cond}/(SS_{Cond}+SS_{Error})$ | Partial eta-squared |
| $\eta^2_G = SS_{Cond}/(SS_{Cond}+SS_{BS}+SS_{Error})$ | Generalised eta-squared |
| $\omega^2_p = (SS_{Cond}-(k-1)MS_E)/(SS_{Cond}+(n-k+1)MS_E)$ | Partial omega-squared (bias-corrected) |
| $f = \sqrt{\eta^2_p/(1-\eta^2_p)}$ | Cohen's $f$ for power analysis |
| $d_{rm} = \Delta\bar{X}/s_{diff}$ | Pairwise Cohen's $d$ for post-hoc comparison |
| $df^*_1 = (k-1)\hat{\varepsilon}$, $df^*_2 = (k-1)(n-1)\hat{\varepsilon}$ | Epsilon-corrected degrees of freedom |
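
To see how the sums of squares, degrees of freedom, F-ratio, and the two eta-squared measures fit together, the sketch below applies these equations to a small data matrix. It is a plain-Python illustration under stated assumptions (complete data, one within-subjects factor, hypothetical scores and function name) and does not reproduce DataStatPro's internal implementation. The p-value is omitted because it needs the F-distribution's upper tail; if SciPy is available, `scipy.stats.f.sf(F, df1, df2)` is one way to obtain it.

```python
from statistics import mean


def rm_anova(data):
    """One-way repeated measures ANOVA from a participants-by-conditions list."""
    n, k = len(data), len(data[0])
    grand = mean([x for row in data for x in row])
    cond_means = [mean([row[j] for row in data]) for j in range(k)]
    subj_means = [mean(row) for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_bs = k * sum((m - grand) ** 2 for m in subj_means)
    ss_error = ss_total - ss_cond - ss_bs

    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f_ratio = (ss_cond / df_cond) / (ss_error / df_error)
    return {
        "ss_cond": ss_cond, "ss_bs": ss_bs, "ss_error": ss_error,
        "df_cond": df_cond, "df_error": df_error, "F": f_ratio,
        "eta_p2": ss_cond / (ss_cond + ss_error),
        "eta_g2": ss_cond / (ss_cond + ss_bs + ss_error),
    }


# Hypothetical scores for 4 participants (rows) in 3 conditions (columns).
scores = [[2, 4, 6], [3, 4, 8], [1, 3, 5], [2, 5, 5]]
result = rm_anova(scores)
for name, value in result.items():
    print(f"{name}: {value:.4g}")
```

Comparing `eta_p2` with `eta_g2` in the output shows directly how much removing $SS_{BS}$ from the denominator inflates the partial measure.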

Epsilon Correction Decision Guide

| $\hat{\varepsilon}_{GG}$ | Correction |
| --- | --- |
| $\geq .90$ | None required |
| $.75$ to $.89$ | Huynh-Feldt |
| $< .75$ | Greenhouse-Geisser |
| Severe ($< .50$) | GG + report MANOVA |
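
The decision guide and the epsilon-corrected degrees of freedom can be combined in a short helper. This is a sketch under the thresholds listed in the table above; the function names are hypothetical, and the treatment of a value of exactly $.75$ follows the table (Huynh-Feldt) rather than any other convention.

```python
def choose_correction(eps_gg: float) -> str:
    """Map a Greenhouse-Geisser epsilon onto the correction in the guide."""
    if eps_gg >= 0.90:
        return "none"
    if eps_gg >= 0.75:
        return "Huynh-Feldt"
    if eps_gg >= 0.50:
        return "Greenhouse-Geisser"
    return "Greenhouse-Geisser + report MANOVA"


def corrected_df(k: int, n: int, epsilon: float):
    """Epsilon-corrected degrees of freedom: both df are scaled by epsilon."""
    return (k - 1) * epsilon, (k - 1) * (n - 1) * epsilon


# Hypothetical case: k = 3 conditions, n = 20 participants, epsilon = 0.80.
print(choose_correction(0.80))
print(corrected_df(3, 20, 0.80))
```

The F-ratio itself is unchanged by the correction; only the reference distribution used for the p-value moves to the smaller, corrected degrees of freedom. When Huynh-Feldt is selected, $\tilde{\varepsilon}_{HF}$ rather than $\hat{\varepsilon}_{GG}$ would be passed to `corrected_df`.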

Effect Size Benchmarks

| $\eta^2_p$ | Cohen's $f$ | Verbal Label |
| --- | --- | --- |
| $.01$ | $0.10$ | Small |
| $.06$ | $0.25$ | Medium |
| $.14$ | $0.40$ | Large |
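
The two columns of the benchmark table are linked by $f = \sqrt{\eta^2_p/(1-\eta^2_p)}$, and the conversion can be inverted as $\eta^2_p = f^2/(1+f^2)$. A minimal sketch with illustrative function names:

```python
from math import sqrt


def cohens_f(eta_p2: float) -> float:
    """Convert partial eta-squared to Cohen's f."""
    return sqrt(eta_p2 / (1.0 - eta_p2))


def eta_p2_from_f(f: float) -> float:
    """Convert Cohen's f back to partial eta-squared."""
    return f ** 2 / (1.0 + f ** 2)


for eta in (0.01, 0.06, 0.14):
    print(f"eta_p^2 = {eta:.2f} -> f = {cohens_f(eta):.2f}")
```

Rounded to two decimals, the three benchmark values of $\eta^2_p$ reproduce the tabled $f$ values of $0.10$, $0.25$, and $0.40$.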

Required Sample Size (One-Way, $k = 3$, $\alpha = .05$, $\rho = .50$)

| Cohen's $f$ | Power = 0.80 | Power = 0.90 |
| --- | --- | --- |
| 0.10 (small) | 52 | 70 |
| 0.25 (medium) | 12 | 16 |
| 0.40 (large) | 7 | 9 |
| 0.50 | 5 | 7 |

Assumes $\alpha = .05$, two-tailed; $\rho = .50$ average inter-condition correlation.

Decision Guide

| Condition | Recommended Test |
| --- | --- |
| Same participants, $k \geq 3$ conditions, normal | Repeated measures ANOVA |
| Same participants, $k = 2$ conditions | Paired samples t-test |
| Same participants, $k \geq 3$, non-normal or ordinal | Friedman test |
| Severe sphericity violation, adequate $n$ | MANOVA approach |
| Missing data, unequal intervals, or covariates | Linear mixed-effects model (LMM) |
| One within + one between factor | Mixed (split-plot) ANOVA |
| Two or more within factors | Factorial repeated measures ANOVA |
| Quantitative, equally-spaced within factor | Polynomial trend analysis |
| Establishing null condition effect | Bayesian RM ANOVA ($BF_{10}$) |

Post-Hoc Correction Comparison

| Method | Controls | Best Used When |
| --- | --- | --- |
| Bonferroni | FWER (strict) | Few comparisons ($m \leq 6$) |
| Holm | FWER (sequential) | Moderate $m$; less conservative than Bonferroni |
| Tukey's HSD | FWER | All pairwise comparisons; equal $n$ |
| Šidák | FWER | Independent comparisons; slightly less conservative than Bonferroni |
| Benjamini-Hochberg | FDR | Exploratory; large $m$ |
| None (planned orthogonal) | Per-comparison | Strictly pre-registered, orthogonal contrasts |
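
To make the differences between these methods concrete, the sketch below computes Bonferroni, Šidák, and Benjamini-Hochberg adjusted p-values for the same set of raw p-values. The function names and the three example p-values are hypothetical; Tukey's HSD is left out because it relies on the studentized range distribution rather than a simple transformation of p-values.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons (capped at 1)."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def sidak(p_values):
    """Sidak adjustment: 1 - (1 - p)^m, exact for independent comparisons."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg FDR-adjusted p-values, in the original order."""
    m = len(p_values)
    # Walk from the largest p-value down, carrying the running minimum.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons.
raw_p = [0.01, 0.04, 0.03]
print("Bonferroni:", bonferroni(raw_p))
print("Sidak:     ", sidak(raw_p))
print("BH (FDR):  ", benjamini_hochberg(raw_p))
```

For the same inputs, Šidák never exceeds Bonferroni, and the FDR-controlling Benjamini-Hochberg values never exceed either, which mirrors the ordering of strictness in the table.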

APA 7th Edition Reporting Templates

One-Way Repeated Measures ANOVA (sphericity met): "A one-way repeated measures ANOVA revealed a significant effect of [Factor] on [Outcome], $F(df_1, df_2) =$ [value], $p =$ [value], $\eta^2_p =$ [value] [95% CI: LB, UB], $\omega^2_p =$ [value]. Mauchly's test indicated that sphericity was not violated, $W =$ [value], $\chi^2(df) =$ [value], $p =$ [value], $\hat{\varepsilon}_{GG} =$ [value]."

One-Way Repeated Measures ANOVA (sphericity violated; GG correction): "A one-way repeated measures ANOVA with Greenhouse-Geisser correction revealed a significant effect of [Factor] on [Outcome], $F(df^*_1, df^*_2) =$ [value], $p =$ [value], $\eta^2_p =$ [value] [95% CI: LB, UB], $\omega^2_p =$ [value]. Mauchly's test indicated that sphericity was violated, $W =$ [value], $\chi^2(df) =$ [value], $p =$ [value], $\hat{\varepsilon}_{GG} =$ [value]."

Mixed ANOVA (Group × Time interaction): "A $a \times k$ mixed ANOVA with [Between-Factor] as the between-subjects factor and [Within-Factor] as the within-subjects factor revealed a significant [Between × Within] interaction, $F(df_1, df_2) =$ [value], $p =$ [value], $\eta^2_p =$ [value] [95% CI: LB, UB]. Simple effects analyses indicated that..."

With post-hoc comparisons: "Post-hoc pairwise comparisons (Bonferroni corrected) indicated that [Condition A] ($M =$ [value], $SD =$ [value]) was significantly [higher/lower] than [Condition B] ($M =$ [value], $SD =$ [value]), $t(n-1) =$ [value], $p_{adj} =$ [value], $d_{rm} =$ [value] [95% CI: LB, UB]."

With Friedman test (non-parametric): "A Friedman test indicated a significant effect of [Factor] on [Outcome], $\chi^2_F(df) =$ [value], $p =$ [value], $W =$ [value]. Post-hoc Wilcoxon signed-rank tests (Bonferroni corrected) indicated that..."
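
Filling these templates by hand invites formatting slips (leading zeros on p-values, inconsistent decimals). A small helper along the following lines can generate the statistical portion of the sentence. It is only a sketch: the function names are hypothetical, the output uses the plain-text label `eta_p^2`, and the rounding choices (two decimals for $F$ and $\eta^2_p$, three for $p$, two for non-integer corrected df) are assumptions about house style rather than requirements stated in this tutorial.

```python
def apa_p(p: float) -> str:
    """Format a p-value APA-style: no leading zero, floor at '< .001'."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")


def apa_df(df: float) -> str:
    """Integer df print as-is; epsilon-corrected df print with two decimals."""
    return str(int(df)) if float(df).is_integer() else f"{df:.2f}"


def apa_f(df1: float, df2: float, f_value: float, p: float, eta_p2: float) -> str:
    """Assemble the 'F(df1, df2) = ..., p = ..., eta_p^2 = ...' fragment."""
    eta = f"{eta_p2:.2f}".lstrip("0")
    return f"F({apa_df(df1)}, {apa_df(df2)}) = {f_value:.2f}, {apa_p(p)}, eta_p^2 = {eta}"


# Values from the Stroop write-up earlier in this tutorial.
print(apa_f(2, 38, 4.76, 0.014, 0.20))
```

The same helper accepts non-integer degrees of freedom, so it can also serve the Greenhouse-Geisser template.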

Reporting Checklist

| Item | Required |
| --- | --- |
| $F$-statistic (uncorrected or corrected) | ✅ Always |
| Both degrees of freedom ($df_1$, $df_2$) | ✅ Always |
| Exact p-value | ✅ Always |
| Mauchly's $W$, $\chi^2$, $df$, $p$ | ✅ Always (when $k \geq 3$) |
| $\hat{\varepsilon}_{GG}$ and $\tilde{\varepsilon}_{HF}$ | ✅ Always (when $k \geq 3$) |
| Statement of which correction was applied | ✅ When sphericity violated |
| Condition means and standard deviations | ✅ Always |
| 95% CIs for condition means (within-subjects) | ✅ Always |
| $\eta^2_p$ with 95% CI | ✅ Always |
| $\omega^2_p$ (bias-corrected) | ✅ When $n < 30$ |
| $\eta^2_G$ | ✅ For meta-analytic or cross-study comparisons |
| Cohen's $f$ | ✅ For power analysis reporting |
| Post-hoc comparisons (means, $t$, $p_{adj}$, $d_{rm}$) | ✅ When omnibus $F$ is significant |
| Planned contrasts (if pre-registered) | ✅ When applicable |
| Normality check (Shapiro-Wilk, Q-Q plots) | ✅ When $n < 30$ |
| Outlier check (Mahalanobis $D^2$) | ✅ Always |
| Profile plot with within-subjects error bars | ✅ Always |
| Power analysis (post-hoc or a priori) | ✅ For non-significant results; underpowered studies |
| Bayes Factor | Recommended for null results |
| Counterbalancing statement | ✅ When within-subjects design with potential carryover |
| Sample size per condition | ✅ Always |

This tutorial provides a comprehensive foundation for understanding, conducting, and reporting repeated measures ANOVA within the DataStatPro application. For further reading, consult Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018), Maxwell, Delaney & Kelley's "Designing Experiments and Analyzing Data" (3rd ed., 2018), Lakens's "Calculating and Reporting Effect Sizes to Facilitate Cumulative Science" (Frontiers in Psychology, 2013), Rouder et al.'s "Default Bayes Factors for ANOVA Designs" (Journal of Mathematical Psychology, 2012), and Olejnik & Algina's "Generalized Eta and Omega Squared Statistics" (Psychological Methods, 2003). For feature requests or support, contact the DataStatPro team.