Repeated Measures ANOVA: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of within-subjects experimental designs all the way through advanced interpretation, reporting, assumption checking, and practical usage within the DataStatPro application. Whether you are encountering repeated measures ANOVA for the first time or deepening your understanding of analysing data where the same participants contribute observations across multiple conditions or time points, this guide builds your knowledge systematically from the ground up.

Prerequisites and Background Concepts
What is Repeated Measures ANOVA?
The Mathematics Behind Repeated Measures ANOVA
Assumptions of Repeated Measures ANOVA
Variants of Repeated Measures ANOVA
Using the Repeated Measures ANOVA Calculator Component
Step-by-Step Procedure
Interpreting the Output
Effect Sizes for Repeated Measures ANOVA
Confidence Intervals
Advanced Topics
Worked Examples
Common Mistakes and How to Avoid Them
Troubleshooting
Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into repeated measures ANOVA, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 Within-Subjects vs. Between-Subjects Designs

A fundamental distinction in experimental design concerns how participants contribute data across conditions:

Between-subjects design: Different participants are assigned to different conditions. Each participant contributes one observation. Stable individual differences between participants cannot be separated from random error, so they become part of the error term.
Within-subjects design: The same participants are measured under all conditions (or at all time points). Each participant contributes multiple observations. Because we can model each participant's general tendency to score high or low, this individual variability is removed from the error term, substantially increasing statistical power.

Repeated measures ANOVA is the inferential framework for within-subjects designs with three or more conditions or time points.

1.2 The Logic of Variance Partitioning

Analysis of Variance (ANOVA) tests hypotheses by partitioning the total variability in the data ( $SS_{Total}$ ) into meaningful components:

$SS_{Total} = SS_{Effect} + SS_{Error}$

The key insight is that what constitutes "error" differs between designs:

In a between-subjects one-way ANOVA: $SS_{Total} = SS_{Between} + SS_{Within}$ , where $SS_{Within}$ includes both random measurement error and individual differences between participants.
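In a within-subjects (repeated measures) ANOVA, each participant's overall level can be estimated from their multiple scores, so the total is split three ways:

$SS_{Total} = SS_{Subjects} + SS_{Condition} + SS_{Error}$

Here $SS_{Subjects}$ is the between-subjects variability, while $SS_{Condition} + SS_{Error}$ together make up the within-subjects variability.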

By extracting $SS_{Subjects}$ (the variability attributable to stable individual differences), the residual error term $SS_{Error}$ is much smaller than in a between-subjects design, producing larger F-ratios and greater power.

1.3 The F-Ratio

The F-ratio is the core test statistic of ANOVA:

$F = \frac{\text{Variance explained by the effect (Mean Square Effect)}}{\text{Unexplained variance (Mean Square Error)}}$

$F = \frac{MS_{Effect}}{MS_{Error}}$

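Under $H_0$ (all condition means are equal), both mean squares estimate the same error variance, so $F$ is expected to be close to 1. Under $H_1$ (at least one condition mean differs), $MS_{Effect}$ is inflated by the condition differences, so $F$ tends to exceed 1. The larger the $F$ , the stronger the evidence against $H_0$ .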

1.4 The F-Distribution

The F-distribution is parameterised by two degrees of freedom: $df_1$ (numerator, associated with the effect) and $df_2$ (denominator, associated with the error). Its key properties:

It is right-skewed and defined only for non-negative values.
Its shape is indexed by $df_1$ and $df_2$ ; as $df_2$ grows very large, $df_1 \cdot F$ approaches a $\chi^2$ distribution with $df_1$ degrees of freedom.
The p-value is always computed from the right tail: $p = P(F_{df_1, df_2} \geq F_{obs})$ .

1.5 The Null and Alternative Hypotheses in ANOVA

For a within-subjects factor with $k$ levels (conditions or time points):

$H_0$ : All population condition means are equal: $\mu_1 = \mu_2 = \cdots = \mu_k$
$H_1$ : At least one population condition mean differs from at least one other: $\mu_i \neq \mu_j$ for some $i \neq j$

$H_1$ is omnibus — it does not specify which means differ or in what direction. A significant F-test must therefore be followed by post-hoc tests or planned contrasts to identify the specific pattern of differences.

1.6 The p-Value and Significance Level

As in all hypothesis tests, the p-value is the probability of observing an F-ratio as large as or larger than the one obtained, assuming $H_0$ is true. The significance level $\alpha$ (conventionally $.05$ ) is the threshold: we reject $H_0$ when $p \leq \alpha$ .

⚠️ A significant omnibus F-test tells you only that the condition means are not all equal. It does not tell you which conditions differ, how large the differences are, or whether the differences are practically meaningful. Always follow up with effect sizes, confidence intervals, and post-hoc comparisons.

1.7 Carryover Effects and Counterbalancing

A unique concern in within-subjects designs is that participating in one condition may influence performance in a subsequent condition:

Practice effects: Performance improves with experience across conditions.
Fatigue effects: Performance deteriorates across conditions due to tiredness.
Contrast effects: The subjective experience of one condition is influenced by the preceding condition.

Counterbalancing — systematically varying the order in which participants complete conditions — distributes these carryover effects evenly across conditions, preventing them from confounding the main effect of interest.
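One common way to counterbalance is a cyclic Latin square, in which every condition appears once in every serial position. The sketch below is only an illustration under that assumption: the function names `cyclic_latin_square` and `assign_orders` and the participant labels are hypothetical. Note that a cyclic square balances position but not which condition immediately precedes which.

```python
def cyclic_latin_square(conditions):
    """Return k presentation orders; row r starts with condition r and wraps around."""
    k = len(conditions)
    return [[conditions[(row + pos) % k] for pos in range(k)] for row in range(k)]


def assign_orders(participant_ids, conditions):
    """Cycle participants through the rows of the square so orders are used evenly."""
    square = cyclic_latin_square(conditions)
    return {pid: square[i % len(square)] for i, pid in enumerate(participant_ids)}


orders = assign_orders(["P1", "P2", "P3", "P4", "P5", "P6"], ["A", "B", "C"])
for pid, order in orders.items():
    print(pid, order)
```

With six participants and three conditions, each of the three orders is used twice, so each condition occupies each position equally often.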

1.8 Mauchly's Sphericity and Why It Matters

The repeated measures ANOVA relies on an assumption called sphericity: the variances of all pairwise difference scores between condition levels must be equal. This is analogous to the homogeneity of variance assumption in between-subjects ANOVA, but specific to within-subjects designs. Violating sphericity inflates the Type I error rate. Mauchly's test and epsilon ( $\varepsilon$ ) corrections (Greenhouse-Geisser, Huynh-Feldt) are the standard diagnostic and remedial tools — covered fully in Section 4.

2. What is Repeated Measures ANOVA?

2.1 The Core Question

Repeated measures ANOVA (also called within-subjects ANOVA) is a parametric inferential test that determines whether the means of a continuous dependent variable differ significantly across three or more levels of a within-subjects factor — conditions, time points, or stimuli to which all participants are exposed.

Unlike the paired t-test (which compares two conditions), or between-subjects ANOVA (which compares independent groups), repeated measures ANOVA is the appropriate framework when:

The same participants complete all conditions, OR
Participants are measured at multiple time points (longitudinal panel data), OR
The same participants are exposed to multiple stimuli or multiple tasks.

2.2 The General Logic

Repeated measures ANOVA exploits the within-person correlation across conditions. By modelling each participant's general response level (their row mean in the data matrix), the test removes stable individual differences from the error term:

$SS_{Error_{RM}} = SS_{Error_{BS}} - SS_{Subjects}$

This reduction in error variance means that for the same true effect size, repeated measures ANOVA has substantially greater statistical power than a comparable between-subjects design — particularly when individual differences are large (i.e., when participants consistently differ from one another regardless of condition).

2.3 When to Use Repeated Measures ANOVA

Condition	Requirement
Research design	Same participants measured under all $k \geq 3$ conditions
Dependent variable	Continuous (interval or ratio scale)
Within-subjects factor	Categorical with $k \geq 3$ levels
Observations	Independence between participants (not within)
Distribution	Approximately normal within each condition (or $n \geq 30$ )
Sphericity	Variances of all pairwise difference scores are equal (testable)

2.4 Real-World Applications

Field	Research Question	Within-Subjects Factor
Clinical Psychology	Does anxiety score change across pre-treatment, mid-treatment, and post-treatment?	Time (3 levels)
Cognitive Neuroscience	Does reaction time differ across congruent, neutral, and incongruent Stroop conditions?	Congruency (3 levels)
Education	Does reading fluency improve across four assessment waves in a school year?	Time (4 levels)
Pharmacology	Does blood pressure differ across three drug dosage levels in the same patients?	Dosage (3 levels)
Sport Science	Does VO $_2$ max differ across four stages of a progressive exercise protocol?	Exercise Stage (4 levels)
Nutrition	Does subjective hunger rating differ across morning, noon, afternoon, and evening?	Time of Day (4 levels)
Consumer Psychology	Do preference ratings differ across five product designs evaluated by each participant?	Product Design (5 levels)
Neuroimaging	Does BOLD signal differ across five experimental conditions within the same participants?	Condition (5 levels)

Choosing between repeated measures ANOVA and related tests:

Situation	Correct Test
One within-subjects factor, $k \geq 3$ levels	Repeated measures ANOVA
One within-subjects factor, $k = 2$ levels	Paired samples t-test
One between-subjects factor, $k \geq 3$ groups	One-way between-subjects ANOVA
One within + one between factor	Mixed (split-plot) ANOVA
Two or more within-subjects factors	Factorial repeated measures ANOVA
Non-normal data, one within factor	Friedman test (non-parametric alternative)
Modelling trajectories over time with predictors	Linear mixed-effects model (LMM)
Binary or count outcome, repeated measures	Generalised linear mixed model (GLMM)

3. The Mathematics Behind Repeated Measures ANOVA

3.1 Data Structure

Consider $n$ participants each measured under $k$ conditions. The data form an $n \times k$ matrix of scores $X_{ij}$ , where $i = 1, \ldots, n$ indexes participants and $j = 1, \ldots, k$ indexes conditions:

Participant	Condition 1	Condition 2	$\cdots$	Condition $k$	Person Mean
1	$X_{11}$	$X_{12}$	$\cdots$	$X_{1k}$	$\bar{X}_{1\cdot}$
2	$X_{21}$	$X_{22}$	$\cdots$	$X_{2k}$	$\bar{X}_{2\cdot}$
$\vdots$	$\vdots$	$\vdots$	$\ddots$	$\vdots$	$\vdots$
$n$	$X_{n1}$	$X_{n2}$	$\cdots$	$X_{nk}$	$\bar{X}_{n\cdot}$
Condition Mean	$\bar{X}_{\cdot 1}$	$\bar{X}_{\cdot 2}$	$\cdots$	$\bar{X}_{\cdot k}$	$\bar{X}_{\cdot\cdot}$ (Grand Mean)

3.2 Partitioning the Total Sum of Squares

The total sum of squares across all $N = n \times k$ observations is:

$SS_{Total} = \sum_{i=1}^{n}\sum_{j=1}^{k}(X_{ij} - \bar{X}_{\cdot\cdot})^2$

In a repeated measures design, this partitions as:

$SS_{Total} = SS_{Between\text{-}Subjects} + SS_{Within\text{-}Subjects}$

The between-subjects component reflects how participants differ from each other (averaged across conditions):

$SS_{Between\text{-}Subjects} = k\sum_{i=1}^{n}(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2$

The within-subjects component captures how each participant's scores vary across conditions. This is further partitioned into the condition effect and error:

$SS_{Within\text{-}Subjects} = SS_{Condition} + SS_{Error}$

Condition sum of squares (systematic variability between condition means):

$SS_{Condition} = n\sum_{j=1}^{k}(\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2$

Error sum of squares (residual variability after removing both condition and participant effects):

$SS_{Error} = SS_{Total} - SS_{Between\text{-}Subjects} - SS_{Condition}$

Or equivalently:

$SS_{Error} = \sum_{i=1}^{n}\sum_{j=1}^{k}\left[(X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot})^2\right]$
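As a check on the partition, the following sketch computes each sum of squares for a small made-up data set (four participants, three conditions; the scores are purely illustrative) and confirms that the two expressions for $SS_{Error}$ agree. The function name `ss_partition` is an assumption of this example.

```python
def ss_partition(data):
    """data: list of rows, one row per participant, one column per condition."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    person_means = [sum(row) / k for row in data]
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_between = k * sum((m - grand) ** 2 for m in person_means)
    ss_condition = n * sum((m - grand) ** 2 for m in cond_means)

    # Error by subtraction, and again from the double-centred residuals.
    ss_error_subtraction = ss_total - ss_between - ss_condition
    ss_error_residuals = sum(
        (data[i][j] - person_means[i] - cond_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    return {
        "total": ss_total,
        "between_subjects": ss_between,
        "condition": ss_condition,
        "error": ss_error_subtraction,
        "error_from_residuals": ss_error_residuals,
    }


scores = [[4, 6, 8], [5, 7, 9], [3, 6, 9], [6, 7, 8]]
partition = ss_partition(scores)
print(partition)
```

For these scores the partition is 39 = 3 + 32 + 4 (total = between-subjects + condition + error).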

3.3 Degrees of Freedom

Source	Degrees of Freedom
Between-Subjects	$n - 1$
Condition	$k - 1$
Error (Condition × Subjects)	$(k-1)(n-1)$
Total	$nk - 1$

Verification: $(n-1) + (k-1) + (k-1)(n-1) = n-1 + k-1 + nk - n - k + 1 = nk - 1$ ✓

3.4 Mean Squares and the F-Ratio

Mean squares are obtained by dividing each sum of squares by its degrees of freedom:

$MS_{Condition} = \frac{SS_{Condition}}{k-1}$

$MS_{Error} = \frac{SS_{Error}}{(k-1)(n-1)}$

The F-ratio for the within-subjects condition effect:

$F = \frac{MS_{Condition}}{MS_{Error}}$

Under $H_0$ : $\mu_1 = \mu_2 = \cdots = \mu_k$ , this F-ratio follows an F-distribution with $df_1 = k - 1$ and $df_2 = (k-1)(n-1)$ degrees of freedom.
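The arithmetic from sums of squares to the F-ratio can be sketched as follows. The input values are hypothetical (they are not taken from any real study), and `f_ratio` is a name invented for this example.

```python
def f_ratio(ss_condition, ss_error, n, k):
    """Return degrees of freedom, mean squares, and F for a one-way within-subjects design."""
    df1 = k - 1                  # condition
    df2 = (k - 1) * (n - 1)      # error (condition x subjects)
    ms_condition = ss_condition / df1
    ms_error = ss_error / df2
    return {
        "df1": df1,
        "df2": df2,
        "ms_condition": ms_condition,
        "ms_error": ms_error,
        "F": ms_condition / ms_error,
    }


result = f_ratio(ss_condition=32.0, ss_error=4.0, n=4, k=3)
print(result)
```

Here $MS_{Condition} = 16$ , $MS_{Error} = 4/6$ , and $F = 24$ on $(2, 6)$ degrees of freedom, up to floating-point rounding. If SciPy is installed, `scipy.stats.f.sf(F, df1, df2)` returns the corresponding right-tail p-value.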

3.5 The ANOVA Source Table

The complete one-way repeated measures ANOVA source table:

Source	$SS$	$df$	$MS$	$F$	$p$
Between-Subjects	$SS_{BS}$	$n-1$	—	—	—
Condition (Within)	$SS_{Cond}$	$k-1$	$MS_{Cond}$	$MS_{Cond}/MS_{Error}$	from $F_{k-1,\,(k-1)(n-1)}$
Error	$SS_{Error}$	$(k-1)(n-1)$	$MS_{Error}$
Total	$SS_{Total}$	$nk-1$

⚠️ The between-subjects row ( $SS_{BS}$ , $df = n-1$ ) is typically not tested with an F-ratio — it represents stable individual differences that are partialled out, not an experimental factor of interest. Some software omits this row entirely.

3.6 Epsilon ( $\varepsilon$ ) Corrections for Sphericity Violations

When the sphericity assumption is violated (see Section 4), the actual sampling distribution of $F$ has heavier tails than the nominal $F_{k-1,\,(k-1)(n-1)}$ distribution — producing inflated Type I error. Two corrections adjust the degrees of freedom to match the true distribution.

Greenhouse-Geisser (GG) epsilon:

$\hat{\varepsilon}_{GG} = \frac{k^2(\bar{s}_{jj} - \bar{s}_{..})^2}{(k-1)\left[\sum_{j}\sum_{j'} s^2_{jj'} - 2k\sum_j \bar{s}^2_{j.} + k^2\bar{s}^2_{..}\right]}$

Where $s_{jj'}$ are the elements of the $k \times k$ covariance matrix of condition scores, $\bar{s}_{jj}$ is the mean of its diagonal elements (the condition variances), $\bar{s}_{j.}$ is the mean of row $j$ , and $\bar{s}_{..}$ is the grand mean of all its elements.

Practically, $\hat{\varepsilon}_{GG}$ ranges from $1/(k-1)$ (maximum violation of sphericity) to $1.0$ (perfect sphericity). GG is known to be conservative: it tends to overcorrect when the true $\varepsilon$ is close to 1 (above roughly .75), especially with small $n$ .

Huynh-Feldt (HF) epsilon:

$\tilde{\varepsilon}_{HF} = \frac{n(k-1)\hat{\varepsilon}_{GG} - 2}{(k-1)[n - 1 - (k-1)\hat{\varepsilon}_{GG}]}$

HF epsilon is less conservative than GG and is recommended when $\hat{\varepsilon}_{GG} > .75$ . If $\tilde{\varepsilon}_{HF} > 1$ , it is set to $1.0$ .

Corrected degrees of freedom:

$df_1^* = (k-1)\hat{\varepsilon}, \qquad df_2^* = (k-1)(n-1)\hat{\varepsilon}$

The F-statistic itself is unchanged; only the reference distribution is adjusted.

Decision rule for epsilon corrections:

$\hat{\varepsilon}_{GG}$	Recommended Correction
$\approx 1.0$ (no violation)	No correction needed
$0.75 < \hat{\varepsilon}_{GG} < 1.0$	Huynh-Feldt correction
$\hat{\varepsilon}_{GG} \leq 0.75$	Greenhouse-Geisser correction
Any value (conservative approach)	Always use GG correction
Severe violation, small $n$	Consider MANOVA approach
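The epsilon estimates and corrected degrees of freedom can be computed directly from the covariance matrix of the condition scores. The sketch below is a minimal pure-Python version under two assumptions: the first numerator term of the GG formula is taken to be the mean of the diagonal entries (the condition variances), and the function names are hypothetical. The two example matrices are artificial, chosen to hit the two ends of the range.

```python
def gg_epsilon(S):
    """Greenhouse-Geisser epsilon from a k x k covariance matrix (list of lists)."""
    k = len(S)
    mean_diag = sum(S[j][j] for j in range(k)) / k      # mean of the variances
    row_means = [sum(row) / k for row in S]
    grand = sum(row_means) / k                          # mean of all entries
    numerator = (k * (mean_diag - grand)) ** 2
    denominator = (k - 1) * (
        sum(s * s for row in S for s in row)
        - 2 * k * sum(m * m for m in row_means)
        + k * k * grand * grand
    )
    return numerator / denominator


def hf_epsilon(eps_gg, n, k):
    """Huynh-Feldt epsilon, capped at 1.0 as described in the text."""
    eps = (n * (k - 1) * eps_gg - 2) / ((k - 1) * (n - 1 - (k - 1) * eps_gg))
    return min(eps, 1.0)


def corrected_df(eps, n, k):
    """Scale both degrees of freedom by epsilon; F itself is untouched."""
    return (k - 1) * eps, (k - 1) * (n - 1) * eps


compound_symmetric = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]   # sphericity holds
one_varying_condition = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]  # extreme violation

print(gg_epsilon(compound_symmetric))
print(gg_epsilon(one_varying_condition))
print(corrected_df(0.5, n=10, k=3))
```

The compound-symmetric matrix gives $\hat{\varepsilon}_{GG} = 1$ and the degenerate matrix gives the lower bound $1/(k-1) = 0.5$ , which halves both degrees of freedom.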

3.7 The Multivariate Approach (MANOVA)

An alternative to the univariate F-test with epsilon corrections is the fully multivariate approach, which makes no sphericity assumption whatsoever. The $k$ repeated conditions are recast as $k-1$ contrast variables and tested with multivariate test statistics:

Pillai's Trace: $V = \text{tr}[\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}]$
Wilks' Lambda: $\Lambda = |\mathbf{E}|/|\mathbf{H}+\mathbf{E}|$
Hotelling-Lawley Trace: $T = \text{tr}[\mathbf{H}\mathbf{E}^{-1}]$
Roy's Largest Root: $\theta = \lambda_{max}(\mathbf{H}\mathbf{E}^{-1})$

Where $\mathbf{H}$ is the hypothesis matrix and $\mathbf{E}$ is the error matrix.

The multivariate approach is always valid regardless of sphericity, but requires $n > k$ and loses power relative to the corrected univariate test when sphericity holds approximately. For small $n$ relative to $k$ , the univariate approach with epsilon corrections is preferred.

3.8 Effect Size — Eta-Squared ( $\eta^2$ )

The most straightforward effect size for repeated measures ANOVA is eta-squared:

$\eta^2 = \frac{SS_{Condition}}{SS_{Total}}$

$\eta^2$ is the proportion of total variance explained by the condition effect. However, in repeated measures designs, the between-subjects variance ( $SS_{BS}$ ) is irreducible and not of interest. This makes $\eta^2$ artificially small compared to a between-subjects design with the same true effect — it is not directly comparable across designs.

3.9 Effect Size — Partial Eta-Squared ( $\eta^2_p$ )

Partial eta-squared removes the between-subjects variance from the denominator:

$\eta^2_p = \frac{SS_{Condition}}{SS_{Condition} + SS_{Error}}$

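$\eta^2_p$ represents the proportion of variance explained by the condition effect after removing individual differences. It is the effect size reported by default in SPSS and many other packages (R's ez package reports generalised eta-squared instead). Because its denominator excludes between-subjects variance, $\eta^2_p$ from a within-subjects design is usually larger than the value a between-subjects design would give for the same effect, so it should not be compared directly across designs.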

Relationship to $F$ :

$\eta^2_p = \frac{F \cdot df_1}{F \cdot df_1 + df_2}$

3.10 Effect Size — Generalised Eta-Squared ( $\eta^2_G$ )

Generalised eta-squared (Olejnik & Algina, 2003) is designed for comparability across studies with different designs. For a pure within-subjects design:

$\eta^2_G = \frac{SS_{Condition}}{SS_{Condition} + SS_{Between\text{-}Subjects} + SS_{Error}}$

$\eta^2_G$ is the recommended effect size for meta-analysis involving repeated measures designs because it keeps individual-difference variance in the denominator, which makes values comparable across between- and within-subjects designs, unlike $\eta^2_p$ .
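To see how the three eta-squared variants relate, the sketch below evaluates them for hypothetical sums of squares (illustrative values only) and checks the identity linking $\eta^2_p$ to $F$ . The function names are assumptions of this example.

```python
def eta_squared_family(ss_condition, ss_between, ss_error):
    """Eta-squared, partial eta-squared, and generalised eta-squared (one within factor)."""
    ss_total = ss_condition + ss_between + ss_error
    return {
        "eta2": ss_condition / ss_total,
        "eta2_partial": ss_condition / (ss_condition + ss_error),
        "eta2_generalised": ss_condition / (ss_condition + ss_between + ss_error),
    }


def eta2_partial_from_f(f_value, df1, df2):
    """Recover partial eta-squared from a reported F and its degrees of freedom."""
    return f_value * df1 / (f_value * df1 + df2)


sizes = eta_squared_family(ss_condition=32.0, ss_between=3.0, ss_error=4.0)
print(sizes)
print(eta2_partial_from_f(24.0, df1=2, df2=6))
```

With a single within-subjects factor and no between-subjects factor, the formulas for $\eta^2$ and $\eta^2_G$ have the same denominator, so the two values coincide; they diverge once additional factors enter the design.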

3.11 Effect Size — Omega-Squared ( $\omega^2$ ) and Partial Omega-Squared ( $\omega^2_p$ )

Both $\eta^2$ and $\eta^2_p$ are positively biased (they overestimate the population effect, especially in small samples). Omega-squared applies a bias correction:

$\omega^2 = \frac{SS_{Condition} - (k-1)MS_{Error}}{SS_{Total} + MS_{Between\text{-}Subjects}}$

Partial omega-squared (preferred for repeated measures):

$\omega^2_p = \frac{SS_{Condition} - (k-1)MS_{Error}}{SS_{Condition} + (n - k + 1) \cdot MS_{Error}}$

For large samples, $\omega^2_p \approx \eta^2_p$ . For small samples ( $n < 30$ ), $\omega^2_p$ is the recommended effect size to report alongside $\eta^2_p$ .

3.12 Cohen's $f$ and Statistical Power

Cohen's $f$ is the standardised effect size for ANOVA, defined as:

$f = \sqrt{\frac{\eta^2_p}{1 - \eta^2_p}}$
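The conversion between $\eta^2_p$ and Cohen's $f$ is a one-liner in either direction. This is a small sketch with hypothetical helper names.

```python
import math


def cohens_f(eta2_partial):
    """Cohen's f from partial eta-squared; defined for 0 <= eta2_partial < 1."""
    if not 0 <= eta2_partial < 1:
        raise ValueError("partial eta-squared must lie in [0, 1)")
    return math.sqrt(eta2_partial / (1 - eta2_partial))


def eta2_partial_from_cohens_f(f):
    """Inverse conversion: partial eta-squared implied by a given f."""
    return f * f / (1 + f * f)


print(cohens_f(0.2))
print(eta2_partial_from_cohens_f(0.25))
```

For example, $\eta^2_p = .20$ corresponds to $f = 0.5$ , and $f = 0.25$ corresponds to $\eta^2_p \approx .059$ .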

Required sample size for desired power $1-\beta$ at significance level $\alpha$ (approximate):

$n \approx \frac{\lambda}{f^2 \cdot k}$

Where $\lambda$ is the non-centrality parameter satisfying the power equation.

Required number of participants $n$ , one-way within-subjects ( $k = 4$ , $\alpha = .05$ , $\rho = .50$ average intercorrelation):

Cohen's $f$	Verbal Label	Power = 0.80	Power = 0.90	Power = 0.95
0.10	Small	44	58	72
0.25	Medium	12	16	20
0.40	Large	7	9	11
0.50	Large	5	7	9

⚠️ Power for repeated measures ANOVA depends critically on the average correlation among conditions ( $\rho$ ). Higher $\rho$ → greater power (more individual variability removed). Always specify $\rho$ in power analyses for within-subjects designs.

4. Assumptions of Repeated Measures ANOVA

4.1 Normality of Residuals (or Condition Scores)

The repeated measures ANOVA assumes that the residual scores (or equivalently, the scores within each condition) are approximately normally distributed in the population.

How to check:

Method	Details
Shapiro-Wilk test	Applied to residuals or condition-level scores; most powerful for $n < 50$
Q-Q plots	One per condition; points should fall along the diagonal
Histograms	One per condition; should be approximately bell-shaped
Skewness and kurtosis	Absolute skewness $< 2$ and absolute kurtosis $< 7$ suggest acceptable distributions

Robustness: The F-test is moderately robust to non-normality when $n \geq 30$ (via the Central Limit Theorem) and when the violation is symmetric. Severe skewness with small $n$ is the primary concern.

When violated:

Use the Friedman test (non-parametric alternative) for small samples with non-normal data.
Consider log or square-root transformation for right-skewed outcome variables.
Use a linear mixed-effects model with robust standard errors.

4.2 Sphericity

Sphericity is the assumption that the variances of all pairwise difference scores between conditions are equal. For $k$ conditions, there are $k(k-1)/2$ pairwise differences, each of which must have the same variance.

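Formally, if $d_{i,jj'} = X_{ij} - X_{ij'}$ is the difference score between conditions $j$ and $j'$ for participant $i$ , then sphericity requires the population variance of these differences to be the same for every pair of conditions:

$\text{Var}(X_j - X_{j'}) = \text{Var}(X_l - X_{l'}) \quad \text{for all pairs } j \neq j' \text{ and } l \neq l'$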

Compound symmetry (equal variances and equal covariances across all conditions) is a sufficient but not necessary condition for sphericity. Sphericity is a weaker requirement than compound symmetry and is the actual assumption of the F-test.
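A quick informal look at sphericity is to compute the sample variance of every pairwise difference score and compare them. The sketch below does this for a small made-up data set (illustrative numbers only); `difference_variances` is a hypothetical helper, and conditions are indexed from 0.

```python
from itertools import combinations
from statistics import variance


def difference_variances(data):
    """Sample variance of X_j - X_l for every pair of conditions (j, l)."""
    k = len(data[0])
    return {
        (j, l): variance([row[j] - row[l] for row in data])
        for j, l in combinations(range(k), 2)
    }


scores = [[4, 6, 8], [5, 7, 9], [3, 6, 9], [6, 7, 8]]
variances = difference_variances(scores)
for pair, value in variances.items():
    print(pair, round(value, 3))
```

Here the difference between the first and third conditions has a variance four times that of the other two pairs, the kind of pattern that sphericity rules out in the population. With only four participants such sample values are very noisy, so this is a descriptive check rather than a test.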

Mauchly's Test of Sphericity:

$W = \frac{\prod_{l=1}^{k-1} \lambda_l}{\left(\frac{\sum_{l=1}^{k-1}\lambda_l}{k-1}\right)^{k-1}}$

Where $\lambda_l$ are the eigenvalues of the transformed covariance matrix (using orthonormal contrasts). The test statistic:

$\chi^2 = -\left(n - 1 - \frac{2(k-1)^2 + k + 1}{6(k-1)}\right)\ln(W)$

With $df = k(k-1)/2 - 1$ .

$H_0$ : Sphericity holds ( $W = 1$ ). $p \leq .05$ → reject sphericity; apply corrections.
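Because the product of the eigenvalues of a matrix is its determinant and their sum is its trace, Mauchly's $W$ can be computed without an eigenvalue routine. The sketch below is a pure-Python illustration under stated assumptions: normalised Helmert contrasts are used as the orthonormal contrasts, all function names are hypothetical, and the covariance matrices are artificial.

```python
import math


def orthonormal_contrasts(k):
    """k-1 normalised Helmert contrasts: mutually orthogonal, each summing to zero."""
    rows = []
    for r in range(1, k):
        row = [1.0] * r + [-float(r)] + [0.0] * (k - r - 1)
        norm = math.sqrt(sum(c * c for c in row))
        rows.append([c / norm for c in row])
    return rows


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def determinant(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    size, det = len(M), 1.0
    for c in range(size):
        pivot = max(range(c, size), key=lambda r: abs(M[r][c]))
        if abs(M[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            det = -det
        det *= M[c][c]
        for r in range(c + 1, size):
            factor = M[r][c] / M[c][c]
            for j in range(c, size):
                M[r][j] -= factor * M[c][j]
    return det


def mauchly(S, n):
    """Return (W, chi-square, df) for a k x k covariance matrix S and n participants."""
    k = len(S)
    C = orthonormal_contrasts(k)
    Ct = [list(col) for col in zip(*C)]
    T = matmul(matmul(C, S), Ct)              # transformed (k-1) x (k-1) covariance
    trace = sum(T[i][i] for i in range(k - 1))
    W = determinant(T) / (trace / (k - 1)) ** (k - 1)
    if W <= 0:
        raise ValueError("transformed covariance matrix is singular")
    chi2 = -(n - 1 - (2 * (k - 1) ** 2 + k + 1) / (6 * (k - 1))) * math.log(W)
    df = k * (k - 1) // 2 - 1
    return W, chi2, df


print(mauchly([[2, 1, 1], [1, 2, 1], [1, 1, 2]], n=10))   # compound symmetry
print(mauchly([[1, 0, 0], [0, 1, 0], [0, 0, 4]], n=10))   # unequal variances
```

The compound-symmetric matrix yields $W = 1$ and a chi-square of zero; the second matrix yields $W = 0.75$ . If SciPy is installed, `scipy.stats.chi2.sf(chi2, df)` gives the p-value.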

⚠️ Mauchly's test is sensitive to non-normality and can give misleading results in small samples (underpowered) and large samples (overpowered — detecting trivial violations). Always report $\hat{\varepsilon}$ alongside Mauchly's test result. $\hat{\varepsilon}_{GG} < 0.75$ indicates a practically meaningful violation regardless of the Mauchly p-value.

Epsilon values and their implications:

$\hat{\varepsilon}_{GG}$	Interpretation	Action
$= 1.00$	Perfect sphericity	No correction needed
$0.90 - 0.99$	Minimal violation	Huynh-Feldt or no correction
$0.75 - 0.89$	Moderate violation	Huynh-Feldt correction
$0.50 - 0.74$	Substantial violation	Greenhouse-Geisser correction
$< 0.50$	Severe violation	GG correction or MANOVA
$= 1/(k-1)$	Maximum violation	MANOVA strongly recommended

Note: The sphericity assumption is irrelevant when $k = 2$ (only two conditions form exactly one difference score, whose variance always equals itself). This is why the paired t-test needs no sphericity correction.

4.3 Independence of Observations Between Participants

While the design deliberately induces correlation within participants (across conditions), the observations between participants must be independent. Each participant's data must not influence another participant's data.

Common violations:

Participants in the same lab session who can observe or influence each other.
Family members or partners in the same study.
Hierarchically nested data (e.g., students within classrooms all treated as independent) — use linear mixed-effects models instead.

4.4 Interval or Ratio Scale of Measurement

The dependent variable must be continuous and measured on at least an interval scale — equal numerical differences must represent equal psychological or physical differences across the entire scale range.

When violated: If the dependent variable is ordinal (e.g., ranks or Likert ratings treated as ordinal), use the Friedman test instead.

4.5 No Extreme Multivariate Outliers

Outliers in any condition can distort condition means and inflate $SS_{Error}$ , potentially masking real effects or creating spurious ones.

How to check:

Boxplots for each condition.
Standardised scores within each condition: values with $|z_i| > 3.29$ are flagged as potential outliers.
Mahalanobis distance across conditions: flags participants who are outliers in the multivariate sense (unusual profile across all conditions simultaneously).

When outliers present: Investigate the cause. Report analyses with and without outliers. Consider the Friedman test or trimmed-mean ANOVA as robust alternatives.

4.6 Assumption Summary

Assumption	How to Check	Remedy if Violated
Normality	Shapiro-Wilk; Q-Q plots per condition	Friedman test; data transformation; LMM
Sphericity	Mauchly's test; inspect $\hat{\varepsilon}_{GG}$	GG or HF correction; MANOVA
Independence between participants	Study design review	Linear mixed-effects model
Interval scale	Measurement theory review	Friedman test
No extreme outliers	Boxplots; $z$ -scores; Mahalanobis $D^2$	Investigate; robust ANOVA; Friedman

5. Variants of Repeated Measures ANOVA

5.1 One-Way Repeated Measures ANOVA

The standard form described throughout this tutorial: one within-subjects factor with $k \geq 3$ levels. Tests whether condition means differ significantly.

5.2 Factorial Repeated Measures ANOVA (Two or More Within Factors)

When each participant is measured across all combinations of two or more within-subjects factors, a factorial repeated measures ANOVA is used.

For factors $A$ (with $a$ levels) and $B$ (with $b$ levels), the partition is:

$SS_{Within} = SS_A + SS_B + SS_{A \times B} + SS_{A \times S} + SS_{B \times S} + SS_{A \times B \times S}$

Where $S$ denotes subjects. Each main effect and interaction has its own error term (the corresponding subjects-by-factor interaction). This allows each effect to be tested against a different, tailored error term.

5.3 Mixed ANOVA (Split-Plot Design)

The mixed ANOVA (also called split-plot ANOVA) combines:

One or more between-subjects factors (different participants per group).
One or more within-subjects factors (same participants across conditions).

Example: Comparing three treatment groups (between) measured at pre, mid, and post (within). The interaction between group and time (Group × Time) is typically the focal test — it assesses whether the trajectory of change over time differs across groups.

Variance partitioning:

$SS_{Total} = SS_{Between\text{-}Subjects} + SS_{Within\text{-}Subjects}$

$SS_{Between\text{-}Subjects} = SS_{Group} + SS_{Subjects(Group)}$

$SS_{Within\text{-}Subjects} = SS_{Time} + SS_{Group \times Time} + SS_{Time \times Subjects(Group)}$

Each between-subjects effect uses $MS_{Subjects(Group)}$ as its error term; each within-subjects effect uses $MS_{Time \times Subjects(Group)}$ as its error term.

5.4 Friedman Test (Non-Parametric Alternative)

When the normality assumption is severely violated or the data are ordinal, the Friedman test is the non-parametric equivalent of one-way repeated measures ANOVA.

Procedure:

Rank each participant's scores across the $k$ conditions (1 = lowest, $k$ = highest).
Compute the mean rank for each condition: $\bar{R}_j = \frac{1}{n}\sum_i R_{ij}$ .
Compute the Friedman statistic:

$\chi^2_F = \frac{12n}{k(k+1)}\sum_{j=1}^k \left(\bar{R}_j - \frac{k+1}{2}\right)^2$

Under $H_0$ , $\chi^2_F \approx \chi^2_{k-1}$ for large $n$ .

Effect size: Kendall's $W$ (coefficient of concordance):

$W = \frac{\chi^2_F}{n(k-1)}$

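$W$ ranges from 0 (no agreement in rankings across participants) to 1 (perfect agreement). It is related to the average Spearman rank correlation between all pairs of participants by $\bar{r}_s = (nW - 1)/(n - 1)$ , which reduces to $2W - 1$ when $n = 2$ .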
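The Friedman procedure above can be sketched in a few lines of pure Python. This is a minimal illustration with made-up scores and hypothetical function names; tied scores receive average ranks, but no tie correction is applied to the statistic, matching the formula as given in this section.

```python
def rank_row(values):
    """Rank one participant's scores (1 = lowest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = average_rank
        pos = end + 1
    return ranks


def friedman(data):
    """Return (mean ranks, Friedman chi-square, Kendall's W) for an n x k data matrix."""
    n, k = len(data), len(data[0])
    ranks = [rank_row(row) for row in data]
    mean_ranks = [sum(r[j] for r in ranks) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * sum((m - (k + 1) / 2) ** 2 for m in mean_ranks)
    kendall_w = chi2 / (n * (k - 1))
    return mean_ranks, chi2, kendall_w


scores = [[4, 6, 8], [5, 7, 9], [3, 6, 9], [6, 7, 8]]
mean_ranks, chi2_f, kendall_w = friedman(scores)
print(mean_ranks, chi2_f, kendall_w)
```

Every participant in this toy data set orders the conditions identically, so the mean ranks are 1, 2, 3, the statistic is 8, and $W = 1$ (perfect agreement).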

5.5 Trend Analysis (Polynomial Contrasts)

When the within-subjects factor is quantitative and equally spaced (e.g., time points at regular intervals, dosage levels at equal increments), trend analysis decomposes the condition effect into orthogonal polynomial components:

Linear trend: Does the mean increase or decrease monotonically across levels?
Quadratic trend: Is there a U-shaped or inverted-U-shaped pattern?
Cubic trend: Is there an S-shaped or more complex pattern?

Each trend component has $df = 1$ and is tested separately against the error mean square. Trend analysis is more powerful and more informative than the omnibus F-test when a specific trajectory is hypothesised.
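For $k = 4$ equally spaced levels, the standard orthogonal polynomial coefficients split the condition sum of squares into linear, quadratic, and cubic parts, each with one degree of freedom. The sketch below illustrates this with hypothetical condition means and $n = 5$ ; the names `POLY_K4` and `trend_ss` are assumptions of this example.

```python
# Orthogonal polynomial coefficients for four equally spaced levels.
POLY_K4 = {
    "linear": [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic": [-1, 3, -3, 1],
}


def trend_ss(cond_means, n, coeffs):
    """Sum of squares (df = 1) for one contrast applied to the condition means."""
    contrast = sum(c * m for c, m in zip(coeffs, cond_means))
    return n * contrast ** 2 / sum(c * c for c in coeffs)


cond_means = [10.0, 12.0, 15.0, 19.0]   # hypothetical means at four time points
n = 5                                   # hypothetical number of participants

grand = sum(cond_means) / len(cond_means)
ss_condition = n * sum((m - grand) ** 2 for m in cond_means)
components = {name: trend_ss(cond_means, n, c) for name, c in POLY_K4.items()}

print(components)
print(ss_condition, sum(components.values()))
```

The three components (225, 5, and 0) add up to $SS_{Condition} = 230$ , showing that almost all of the condition effect in this example is linear. Dividing a component by the error mean square gives the F-ratio for that trend.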

5.6 Linear Mixed-Effects Models (LMM)

Linear mixed-effects models (also called multilevel models or hierarchical linear models) subsume repeated measures ANOVA as a special case while offering several important generalisations:

Handle missing data without excluding participants (ANOVA requires complete data or imputation).
Model unequal time intervals between measurements.
Allow time-varying covariates as predictors.
Specify flexible covariance structures (not restricted to compound symmetry or sphericity).
Accommodate both balanced and unbalanced designs.

For complex longitudinal designs, LMMs are generally preferred over repeated measures ANOVA. DataStatPro's repeated measures ANOVA module automatically suggests LMM when missing data are detected.

5.7 Bayesian Repeated Measures ANOVA

The Bayesian approach computes Bayes Factors comparing models with and without the condition effect. Under default priors (Rouder et al., 2012):

$BF_{10} = \frac{P(\text{data} \mid H_1: \text{condition effect exists})}{P(\text{data} \mid H_0: \text{no condition effect})}$

Interpreting $BF_{10}$ :

$BF_{10}$	Evidence
$> 100$	Extreme evidence for $H_1$
$30 - 100$	Very strong
$10 - 30$	Strong
$3 - 10$	Moderate
$1 - 3$	Anecdotal
$< 1/3$	Moderate evidence for $H_0$ (no effect)

6. Using the Repeated Measures ANOVA Calculator Component

The Repeated Measures ANOVA Calculator in DataStatPro provides a comprehensive tool for running, diagnosing, and reporting within-subjects analyses.

Step-by-Step Guide

Step 1 — Select the Test

Navigate to Statistical Tests → ANOVA → Repeated Measures ANOVA.

Step 2 — Input Method

Choose how to provide data:

Raw data (wide format): Each row is one participant; each column is one condition. DataStatPro automatically identifies the within-subjects structure.
Raw data (long format): Three columns required: participant ID, condition label, and dependent variable value. DataStatPro reshapes to wide format internally.
Summary statistics: Enter $n$ , condition means ( $\bar{X}_{\cdot j}$ ), standard deviations ( $s_j$ ), and the correlation matrix across conditions. DataStatPro reconstructs the ANOVA source table.

Step 3 — Define the Within-Subjects Factor

Specify the factor name (e.g., "Time", "Condition", "Dosage").
Label each level (e.g., "Pre", "Mid", "Post").
For factorial designs, define additional within-subjects factors and their levels.
For mixed designs, specify the between-subjects grouping variable.

Step 4 — Select Post-Hoc Tests and Contrasts

Post-hoc tests (exploratory): Bonferroni, Holm, Tukey's HSD (adapted for within-subjects), Šidák.
Planned contrasts: Simple (each level vs. first), Helmert (each level vs. mean of preceding), polynomial (linear, quadratic, cubic trend), custom.

Step 5 — Set Significance Level and Confidence Level

Default: $\alpha = .05$ , 95% CI. Results at $\alpha = .01$ and $\alpha = .001$ are simultaneously displayed.

Step 6 — Select Display Options

✅ Full ANOVA source table with $SS$ , $df$ , $MS$ , $F$ , exact $p$ .
✅ Mauchly's test of sphericity and $\hat{\varepsilon}_{GG}$ , $\tilde{\varepsilon}_{HF}$ .
✅ Greenhouse-Geisser and Huynh-Feldt corrected results (automatically displayed when sphericity is violated).
✅ Multivariate test statistics (Pillai, Wilks, Hotelling, Roy) as an alternative.
✅ Partial eta-squared ( $\eta^2_p$ ) and partial omega-squared ( $\omega^2_p$ ) with 95% CI.
✅ Generalised eta-squared ( $\eta^2_G$ ).
✅ Cohen's $f$ for power analysis.
✅ Condition means, standard deviations, and 95% CIs with error bar plots.
✅ Post-hoc comparison table (pairwise $t$ -tests with corrections).
✅ Profile plot (means across conditions with individual participant trajectories).
✅ Interaction plot (for factorial and mixed designs).
✅ Residual Q-Q plots and normality test per condition.
✅ Mahalanobis distance outlier detection.
✅ Power analysis: current post-hoc power and required $n$ for 80%, 90%, 95% power.
✅ Bayesian ANOVA (Bayes Factor $BF_{10}$ ).
✅ APA 7th edition results paragraph (auto-generated).

Step 7 — Run the Analysis

Click "Run Repeated Measures ANOVA". DataStatPro will:

Compute all $SS$ , $df$ , $MS$ , $F$ , and exact p-values.
Conduct Mauchly's test; apply GG and HF corrections as appropriate.
Compute multivariate test statistics.
Compute $\eta^2_p$ , $\omega^2_p$ , $\eta^2_G$ , and Cohen's $f$ with 95% CIs.
Run selected post-hoc comparisons or planned contrasts.
Conduct normality and outlier diagnostics.
Estimate post-hoc power.
Output an APA-compliant results paragraph.

7. Step-by-Step Procedure

7.1 Full Manual Procedure

Step 1 — State the Hypotheses

$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ (all condition population means are equal)

$H_1:$ At least one $\mu_j$ differs from at least one other $\mu_{j'}$

Specify the within-subjects factor and the number of levels $k$ based on the research design, before examining the data.

Step 2 — Organise the Data Matrix

Arrange data in an $n \times k$ matrix with one row per participant and one column per condition. Verify that there are no missing values; if there are, plan for imputation or a linear mixed-effects model (LMM) instead.

Step 3 — Check Assumptions

Inspect distributions within each condition (histograms, Q-Q plots, Shapiro-Wilk).
Investigate multivariate outliers (Mahalanobis distance).
Confirm independence of observations between participants (design review).
Test sphericity (Mauchly's test and the $\hat{\varepsilon}$ estimates) after computing the $SS$ components; see Step 7.

Step 4 — Compute Grand Mean and Condition Means

$\bar{X}_{\cdot\cdot} = \frac{1}{nk}\sum_{i=1}^n\sum_{j=1}^k X_{ij}$

$\bar{X}_{\cdot j} = \frac{1}{n}\sum_{i=1}^n X_{ij} \quad \text{for each condition } j$

$\bar{X}_{i\cdot} = \frac{1}{k}\sum_{j=1}^k X_{ij} \quad \text{for each participant } i$

Step 5 — Compute Sums of Squares

$SS_{Total} = \sum_{i=1}^n\sum_{j=1}^k (X_{ij} - \bar{X}_{\cdot\cdot})^2$

$SS_{Between\text{-}Subjects} = k\sum_{i=1}^n (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2$

$SS_{Condition} = n\sum_{j=1}^k (\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2$

$SS_{Error} = SS_{Total} - SS_{Between\text{-}Subjects} - SS_{Condition}$

Step 6 — Compute Degrees of Freedom

$df_{Condition} = k - 1$

$df_{Error} = (k-1)(n-1)$

Step 7 — Conduct Mauchly's Test of Sphericity

Compute the covariance matrix of condition scores $\boldsymbol{\Sigma}$ and obtain $\hat{\varepsilon}_{GG}$ and $\tilde{\varepsilon}_{HF}$ . If Mauchly's $p < .05$ or $\hat{\varepsilon}_{GG} < 0.75$ , apply the appropriate correction.
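
The epsilon estimates in this step can be sketched in a few lines. The block below is a minimal illustration, assuming the $k \times k$ covariance matrix of condition scores is already available as a list of lists; it uses the covariance-matrix form of Box's estimator for $\hat{\varepsilon}_{GG}$ and the single-group Huynh-Feldt adjustment capped at 1. The function names `gg_epsilon` and `hf_epsilon` and the example matrix are hypothetical and are not part of DataStatPro.

```python
def gg_epsilon(cov):
    """Greenhouse-Geisser epsilon from a k x k covariance matrix (list of lists)."""
    k = len(cov)
    mean_diag = sum(cov[j][j] for j in range(k)) / k       # mean variance
    grand_mean = sum(sum(row) for row in cov) / (k * k)    # mean of all entries
    row_means = [sum(row) / k for row in cov]
    sum_sq = sum(v * v for row in cov for v in row)        # sum of squared entries
    numerator = (k * (mean_diag - grand_mean)) ** 2
    denominator = (k - 1) * (
        sum_sq - 2 * k * sum(m * m for m in row_means) + (k * grand_mean) ** 2
    )
    return numerator / denominator


def hf_epsilon(eps_gg, n, k):
    """Huynh-Feldt epsilon for one group of n participants, capped at 1."""
    eps = (n * (k - 1) * eps_gg - 2) / ((k - 1) * (n - 1 - (k - 1) * eps_gg))
    return min(eps, 1.0)


# Hypothetical covariance matrix in which two conditions correlate strongly
# with each other and weakly with the third (sphericity is doubtful).
cov = [[10.0, 9.0, 1.0],
       [9.0, 10.0, 1.0],
       [1.0, 1.0, 10.0]]
eps_gg = gg_epsilon(cov)
eps_hf = hf_epsilon(eps_gg, n=12, k=3)
print(round(eps_gg, 3), round(eps_hf, 3))
```

For a compound-symmetric matrix (equal variances, equal covariances) `gg_epsilon` returns 1, and it cannot fall below $1/(k-1)$ .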

Step 8 — Compute Mean Squares and the F-Ratio

$MS_{Condition} = \frac{SS_{Condition}}{df^*_{Condition}}$ (use corrected $df^*$ if applicable)

$MS_{Error} = \frac{SS_{Error}}{df^*_{Error}}$

$F = \frac{MS_{Condition}}{MS_{Error}}$
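
Steps 4 to 8 can be traced in code. The following is a minimal sketch using only the Python standard library, assuming complete data in an $n \times k$ list of lists and no sphericity correction; the function name `rm_anova` and the small data set are hypothetical and serve only to illustrate the arithmetic.

```python
def rm_anova(data):
    """Uncorrected one-way repeated measures ANOVA for an n x k list of lists."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    person_means = [sum(row) / k for row in data]
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Step 5: partition the total variability
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in person_means)
    ss_condition = n * sum((m - grand) ** 2 for m in cond_means)
    ss_error = ss_total - ss_subjects - ss_condition

    # Step 6: degrees of freedom
    df_condition, df_error = k - 1, (k - 1) * (n - 1)

    # Step 8: mean squares and F-ratio
    ms_condition = ss_condition / df_condition
    ms_error = ss_error / df_error
    f_ratio = ms_condition / ms_error if ms_error > 0 else float("inf")

    return {
        "ss_total": ss_total, "ss_subjects": ss_subjects,
        "ss_condition": ss_condition, "ss_error": ss_error,
        "df_condition": df_condition, "df_error": df_error,
        "ms_condition": ms_condition, "ms_error": ms_error, "F": f_ratio,
    }


# Hypothetical scores: 4 participants (rows) x 3 conditions (columns)
scores = [[5, 7, 9],
          [4, 5, 8],
          [6, 9, 9],
          [3, 5, 7]]
result = rm_anova(scores)
print({name: round(value, 3) for name, value in result.items()})
```

The p-value in Step 9 additionally requires the F-distribution, which is outside the standard library and is therefore omitted here.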

Step 9 — Compute the p-Value

$p = P(F_{df^*_1, df^*_2} \geq F_{obs})$

Reject $H_0$ if $p \leq \alpha$ .

Step 10 — Compute Effect Sizes

$\eta^2_p = \frac{SS_{Condition}}{SS_{Condition} + SS_{Error}}$

$\omega^2_p = \frac{SS_{Condition} - (k-1)MS_{Error}}{SS_{Condition} + (n-k+1) \cdot MS_{Error}}$

$f = \sqrt{\frac{\eta^2_p}{1 - \eta^2_p}}$

Step 11 — Conduct Post-Hoc Comparisons (if $H_0$ rejected)

For each pair of conditions $(j, j')$ , compute the pairwise paired t-test:

$t_{jj'} = \frac{\bar{X}_{\cdot j} - \bar{X}_{\cdot j'}}{s_{d_{jj'}}/\sqrt{n}}$

Where $s_{d_{jj'}}$ is the standard deviation of the difference scores $d_i = X_{ij} - X_{ij'}$ . Apply Bonferroni, Holm, or Tukey correction for the $k(k-1)/2$ pairwise comparisons.
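
The pairwise statistics can be computed directly from the difference scores. This is a minimal sketch under the same assumptions as the manual procedure (complete data, one row per participant); the function name `pairwise_paired_tests` and the data are hypothetical. It returns the $t$ statistic and the standardised difference for every pair together with the Bonferroni-adjusted $\alpha'$ , and leaves the lookup of p-values to a $t$ table or statistics package.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def pairwise_paired_tests(data, alpha=0.05):
    """Paired t statistics for all k(k-1)/2 condition pairs of an n x k matrix."""
    n, k = len(data), len(data[0])
    m = k * (k - 1) // 2                      # number of pairwise comparisons
    results = []
    for j, jp in combinations(range(k), 2):
        diffs = [row[j] - row[jp] for row in data]
        mean_diff = mean(diffs)
        s_d = stdev(diffs)                    # SD of the difference scores
        results.append({
            "pair": (j, jp),
            "mean_diff": mean_diff,
            "s_d": s_d,
            "t": mean_diff / (s_d / math.sqrt(n)),
            "df": n - 1,
            "d": mean_diff / s_d,             # mean difference in units of s_d
        })
    return results, alpha / m                 # Bonferroni-adjusted alpha


# Hypothetical scores: 4 participants x 3 conditions
scores = [[5, 7, 9], [4, 5, 8], [6, 9, 9], [3, 5, 7]]
tests, alpha_adj = pairwise_paired_tests(scores)
for row in tests:
    print(row["pair"], round(row["t"], 2), round(row["d"], 2))
print("Bonferroni alpha:", round(alpha_adj, 4))
```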

Step 12 — Interpret and Report

Use the APA reporting template in Section 15. Always report $F$ , both $df$ values (with corrections if applicable), $p$ , $\eta^2_p$ , $\omega^2_p$ , and 95% CI for the effect size. Report Mauchly's test result and the epsilon correction applied.

8. Interpreting the Output

8.1 The F-Statistic

$F_{obs}$ Pattern	Meaning
$F \approx 1$	Condition variance $\approx$ error variance; consistent with $H_0$
$F \gg 1$	Condition variance substantially exceeds error; evidence against $H_0$
Large $F$ with large $n$	Can be significant even for very small $\eta^2_p$
Small $F$ with small $n$	May be non-significant even for large $\eta^2_p$ (low power)
$F < 1$	Condition means vary no more than expected from error variance alone; consistent with $H_0$

8.2 The p-Value

p-Value	Conventional Interpretation
$p > .10$	No evidence against $H_0$ (equal condition means)
$.05 < p \leq .10$	Marginal evidence of condition differences (trend)
$.01 < p \leq .05$	Significant condition effect at $\alpha = .05$
$.001 < p \leq .01$	Significant condition effect at $\alpha = .01$
$p \leq .001$	Significant condition effect at $\alpha = .001$

⚠️ A significant F-test is omnibus: it indicates only that at least one pair of condition means differs. It does not identify which conditions differ or how large those differences are. Always follow a significant omnibus test with post-hoc comparisons or planned contrasts, and always report effect sizes.

8.3 Mauchly's Test and Epsilon

Mauchly's $p$	$\hat{\varepsilon}_{GG}$	Recommended Action
$> .05$	$\geq .90$	Report uncorrected results; sphericity holds
$> .05$	$.75 - .89$	Report HF-corrected results as primary; note uncorrected
$\leq .05$	$.75 - .89$	Use HF correction
$\leq .05$	$< .75$	Use GG correction; consider reporting MANOVA
Any	$\approx 1/(k-1)$	Report MANOVA as primary analysis

⚠️ Always report $\hat{\varepsilon}_{GG}$ regardless of Mauchly's test outcome, since Mauchly's test may lack power in small samples. A reader can use $\hat{\varepsilon}_{GG}$ to judge the severity of any sphericity violation.

8.4 The ANOVA Source Table

When reading the output table, focus on:

$SS_{Condition}$ : Total systematic variability across conditions — larger values indicate greater spread of condition means.
$SS_{Error}$ : Residual within-subjects variability after removing condition and participant effects — smaller values (more tightly correlated conditions) indicate a more sensitive design.
$MS_{Condition}/MS_{Error}$ ratio: The F-ratio — the key test statistic.
$df$ values: Verify these match $(k-1)$ and $(k-1)(n-1)$ (uncorrected) or their epsilon-adjusted equivalents.

8.5 Partial Eta-Squared ( $\eta^2_p$ ) — Magnitude Interpretation

Cohen's (1988) benchmarks for $\eta^2_p$ (and $f$ ):

$\eta^2_p$	Cohen's $f$	Verbal Label
$0.01$	$0.10$	Small
$0.06$	$0.25$	Medium
$0.14$	$0.40$	Large

Extended benchmarks (Lakens, 2013, contextualised for within-subjects designs):

$\eta^2_p$	Verbal Label
$< 0.01$	Negligible
$0.01 - 0.05$	Small
$0.06 - 0.13$	Medium
$0.14 - 0.25$	Large
$> 0.25$	Very large

⚠️ These benchmarks were developed for between-subjects designs. In within-subjects designs, $\eta^2_p$ values tend to be larger because individual differences are removed from the error term. Do not mechanically apply Cohen's benchmarks without considering the typical effect sizes in your specific research domain.

8.6 Post-Hoc Comparisons

After a significant omnibus F-test, post-hoc pairwise comparisons identify which specific pairs of conditions differ. For $k$ conditions there are $k(k-1)/2$ pairs:

$k$	Number of Pairs
3	3
4	6
5	10
6	15

For each comparison, report the mean difference $(\bar{X}_{\cdot j} - \bar{X}_{\cdot j'})$ , the $t$ -statistic, the adjusted p-value, and Cohen's $d_{rm}$ for the paired comparison:

$d_{rm} = \frac{\bar{X}_{\cdot j} - \bar{X}_{\cdot j'}}{s_{d_{jj'}}}$

8.7 Profile Plots

A profile plot (line chart with conditions on the x-axis and mean outcome on the y-axis) is the primary visualisation for repeated measures ANOVA. Key features to examine:

Parallel lines (in mixed designs): Suggests no interaction between between- and within-subjects factors.
Crossing or diverging lines: Suggests a Group × Time interaction — the critical test in most mixed designs.
Error bars: Should represent 95% within-subjects confidence intervals (not the standard between-subjects SEM), computed using the Cousineau-Morey correction, which removes between-subjects variance from the error.

9. Effect Sizes for Repeated Measures ANOVA

9.1 Partial Eta-Squared ( $\eta^2_p$ )

$\eta^2_p = \frac{SS_{Condition}}{SS_{Condition} + SS_{Error}}$

The proportion of within-subjects variance (after removing participant-level variance) explained by the condition effect. This is the standard effect size reported by SPSS, SAS, and most ANOVA software. Upwardly biased in small samples.

9.2 Generalised Eta-Squared ( $\eta^2_G$ )

$\eta^2_G = \frac{SS_{Condition}}{SS_{Condition} + SS_{Between\text{-}Subjects} + SS_{Error}}$

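Designed for comparability across studies that differ in the number of conditions and design type (between vs. within). Recommended for meta-analyses. Note that $\eta^2_G \leq \eta^2_p$ always, since the denominator of $\eta^2_G$ is larger. In a one-way within-subjects design with no other factors, this denominator equals $SS_{Total}$ , so $\eta^2_G$ coincides with classical $\eta^2$ .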

9.3 Partial Omega-Squared ( $\omega^2_p$ ) — Bias-Corrected

$\omega^2_p = \frac{SS_{Condition} - (k-1)MS_{Error}}{SS_{Condition} + (n - k + 1) \cdot MS_{Error}}$

The unbiased (or less biased) estimate of the true population partial effect size. Preferred over $\eta^2_p$ for small samples ( $n < 30$ ). Note that $\omega^2_p$ can be negative for very small effects — negative values should be reported as $\omega^2_p \approx 0$ (indicating no effect).

9.4 Cohen's $f$

$f = \sqrt{\frac{\eta^2_p}{1 - \eta^2_p}} = \frac{\sigma_m}{\sigma_\varepsilon}$

Where $\sigma_m$ is the standard deviation of the $k$ true condition means around the grand mean, and $\sigma_\varepsilon$ is the within-condition population standard deviation (error). Cohen's $f$ is used primarily in power analysis. Benchmarks: $f = 0.10$ (small), $f = 0.25$ (medium), $f = 0.40$ (large).

9.5 Pairwise Cohen's $d$ for Post-Hoc Comparisons

For each pairwise comparison $(j, j')$ after a significant omnibus test:

$d_{rm} = \frac{\bar{X}_{\cdot j} - \bar{X}_{\cdot j'}}{s_{d_{jj'}}}$

Where $s_{d_{jj'}}$ is the standard deviation of the difference scores between conditions $j$ and $j'$ . This is the paired Cohen's $d$ standardised by the SD of the difference scores (a form often written $d_z$ elsewhere); it expresses the mean difference between two specific conditions in units of the variability of individual change.

9.6 Effect Size Summary Table

Effect Size	Formula	Range	Best Use
$\eta^2$	$SS_{Cond}/SS_{Total}$	$[0,1]$	Rarely recommended; underestimates effect
$\eta^2_p$	$SS_{Cond}/(SS_{Cond}+SS_{Error})$	$[0,1]$	Standard reporting; not comparable across design types
$\eta^2_G$	$SS_{Cond}/(SS_{Cond}+SS_{BS}+SS_{Error})$	$[0,1]$	Meta-analysis; cross-design comparisons
$\omega^2_p$	Bias-corrected $\eta^2_p$	$(-\infty,1]$	Small samples ( $n<30$ ); less biased estimate
Cohen's $f$	$\sqrt{\eta^2_p/(1-\eta^2_p)}$	$[0, \infty)$	Power analysis
Pairwise $d_{rm}$	$\Delta\bar{X}/s_{diff}$	$(-\infty, \infty)$	Pairwise post-hoc comparisons
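
The formulas in Sections 9.1 to 9.4 can be collected into one helper. The sketch below is illustrative only: the function name `effect_sizes` is hypothetical, the partial omega-squared follows the formula given in Section 9.3, and the inputs are the sums of squares reported in Example 1 of Section 12 ( $SS_{Condition} = 4{,}524.00$ , $SS_{Between\text{-}Subjects} = 1{,}846.67$ , $SS_{Error} = 491.33$ , $n = 12$ , $k = 3$ ).

```python
import math


def effect_sizes(ss_condition, ss_subjects, ss_error, n, k):
    """Effect sizes for a one-way within-subjects design (Sections 9.1 to 9.4)."""
    ms_error = ss_error / ((k - 1) * (n - 1))
    eta_p = ss_condition / (ss_condition + ss_error)
    eta_g = ss_condition / (ss_condition + ss_subjects + ss_error)
    omega_p = (ss_condition - (k - 1) * ms_error) / (
        ss_condition + (n - k + 1) * ms_error
    )
    cohen_f = math.sqrt(eta_p / (1 - eta_p))
    return {"eta_p": eta_p, "eta_g": eta_g, "omega_p": omega_p, "f": cohen_f}


# Sums of squares as reported in Example 1 (Section 12)
es = effect_sizes(4524.00, 1846.67, 491.33, n=12, k=3)
print({name: round(value, 3) for name, value in es.items()})
```

Returning all four values together makes it easy to report $\eta^2_p$ and $\eta^2_G$ side by side, as recommended in Section 11.5.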

10. Confidence Intervals

10.1 CI for Each Condition Mean

The standard 95% CI for condition $j$ :

$\bar{X}_{\cdot j} \pm t_{\alpha/2,\; n-1} \times \frac{s_j}{\sqrt{n}}$

Where $s_j$ is the standard deviation within condition $j$ . These between-subjects CIs are correct for estimating the true population mean of condition $j$ but are not appropriate for visual inference about within-subjects differences (they are too wide and do not reflect the advantage of the within-subjects design).

10.2 Within-Subjects CIs for Profile Plots (Cousineau-Morey)

For visualising within-subjects mean differences, the Cousineau-Morey within-subjects CI removes between-participants variance before computing the error:

Normalise each participant's scores: $X^*_{ij} = X_{ij} - \bar{X}_{i\cdot} + \bar{X}_{\cdot\cdot}$
Compute the standard error of the normalised scores: $SE^*_j = s^*_j / \sqrt{n}$
Apply the Morey correction factor $\sqrt{k/(k-1)}$ : $SE^{WS}_j = \sqrt{\frac{k}{k-1}} \times SE^*_j$
Construct the CI: $\bar{X}_{\cdot j} \pm t_{\alpha/2,\; n-1} \times SE^{WS}_j$

These within-subjects CIs are narrower than standard CIs and correctly represent the precision of within-subjects comparisons. When two such CIs barely overlap, the corresponding pairwise comparison is approximately significant at $\alpha = .05$ .
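
The normalisation and correction steps above can be sketched as follows. This is a minimal illustration using the standard library, with the hypothetical function name `cousineau_morey_se`; it returns the corrected standard error per condition, and the CI half-width is obtained by multiplying by the critical value $t_{\alpha/2,\; n-1}$ taken from a table or statistics package.

```python
import math
from statistics import stdev


def cousineau_morey_se(data):
    """Within-subjects standard error per condition for an n x k matrix."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    # Normalise: remove each participant's mean, add back the grand mean
    normalised = [[x - sum(row) / k + grand for x in row] for row in data]
    correction = math.sqrt(k / (k - 1))       # Morey correction factor
    return [
        correction * stdev([row[j] for row in normalised]) / math.sqrt(n)
        for j in range(k)
    ]


# Hypothetical scores: 4 participants x 3 conditions
scores = [[5, 7, 9], [4, 5, 8], [6, 9, 9], [3, 5, 7]]
print([round(se, 3) for se in cousineau_morey_se(scores)])
```

Because participant means are removed first, adding a constant to all of one participant's scores leaves these standard errors unchanged, which is exactly the between-subjects variability the correction is meant to exclude.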

⚠️ Always label error bars in profile plots explicitly as "95% within-subjects confidence intervals (Morey correction)" to distinguish them from between-subjects CIs. Readers familiar with standard CIs will otherwise overestimate the uncertainty in pairwise differences.

10.3 CI for Partial Eta-Squared

CIs for $\eta^2_p$ are derived from the non-central F-distribution. The non-centrality parameter is:

$\lambda = F \times df_1$

Find $\lambda_L$ and $\lambda_U$ such that:

$P(F_{df_1, df_2}(\lambda_L) \geq F_{obs}) = .025 \quad \text{and} \quad P(F_{df_1, df_2}(\lambda_U) \leq F_{obs}) = .025$

Then:

$\eta^2_{p,L} = \frac{\lambda_L}{\lambda_L + df_1 + df_2 + 1}, \qquad \eta^2_{p,U} = \frac{\lambda_U}{\lambda_U + df_1 + df_2 + 1}$

An approximate 95% CI (adequate for $n \geq 30$ ):

$\eta^2_p \pm 1.96 \times SE_{\eta^2_p}, \quad SE_{\eta^2_p} \approx \sqrt{\frac{2\eta^2_p(1-\eta^2_p)^2 (df_1+df_2+1)}{df_1 \cdot (df_1+df_2)^2}}$

DataStatPro computes exact CIs using numerical inversion of the non-central F-distribution.

10.4 CI for Pairwise Mean Differences

For each pairwise contrast $(j, j')$ :

$(\bar{X}_{\cdot j} - \bar{X}_{\cdot j'}) \pm t_{\alpha'/2,\; n-1} \times \frac{s_{d_{jj'}}}{\sqrt{n}}$

Where $\alpha' = \alpha/m$ (Bonferroni correction) or the appropriate adjusted critical value, and $m = k(k-1)/2$ is the number of pairwise comparisons.

11. Advanced Topics

11.1 Interaction Contrasts in Factorial Within-Subjects Designs

In a two-way factorial repeated measures design ( $A \times B$ ), a significant interaction indicates that the effect of factor $A$ depends on the level of factor $B$ (or vice versa). Interaction contrasts decompose this interaction into focused comparisons:

For a $2 \times 2$ sub-table of a larger interaction, the interaction contrast compares the simple effect of $A$ at level $b_1$ to the simple effect of $A$ at level $b_2$ :

$\psi = (\mu_{a_1 b_1} - \mu_{a_2 b_1}) - (\mu_{a_1 b_2} - \mu_{a_2 b_2})$

Each interaction contrast has $df = 1$ and is tested against its own error term, the contrast × participants interaction, which has $n - 1$ degrees of freedom. The $df$ of a complete set of orthogonal interaction contrasts sum to the interaction $df = (a-1)(b-1)$ .
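
In a fully within-subjects design, one simple way to test such a contrast is to compute $\psi$ for every participant and run a one-sample $t$ -test on those scores. The sketch below assumes four lists of scores for the same participants in the same order; the function name `interaction_contrast` and the data are hypothetical.

```python
import math
from statistics import mean, stdev


def interaction_contrast(a1b1, a2b1, a1b2, a2b2):
    """Estimate and t statistic for psi = (a1b1 - a2b1) - (a1b2 - a2b2)."""
    # One contrast score per participant
    psi = [(w - x) - (y - z) for w, x, y, z in zip(a1b1, a2b1, a1b2, a2b2)]
    n = len(psi)
    estimate = mean(psi)
    standard_error = stdev(psi) / math.sqrt(n)
    return estimate, estimate / standard_error, n - 1   # estimate, t, error df


# Hypothetical scores for 4 participants in the four cells
est, t_stat, df = interaction_contrast(
    a1b1=[10, 12, 11, 13],
    a2b1=[8, 9, 9, 10],
    a1b2=[9, 10, 10, 11],
    a2b2=[9, 10, 9, 10],
)
print(est, round(t_stat, 2), df)
```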

11.2 Handling Missing Data

Standard repeated measures ANOVA requires complete data — every participant must have an observation in every condition. When data are missing:

Missing Data Mechanism	Recommended Approach
Missing Completely at Random (MCAR)	Complete-case analysis acceptable; note reduced $n$
Missing at Random (MAR)	Multiple imputation (MI); linear mixed-effects model
Missing Not at Random (MNAR)	Pattern-mixture or selection models; sensitivity analysis

DataStatPro's LMM module handles MAR missing data using Full Information Maximum Likelihood (FIML), which includes all available data without requiring complete cases.

11.3 Counterbalancing and Order Effects

When the order of conditions may introduce carryover effects, counterbalancing assigns different condition orders to different participants. Complete counterbalancing (all $k!$ orders represented) is only feasible for small $k$ ( $k \leq 4$ ). For larger $k$ , Latin square designs provide a systematic partial counterbalancing:

In a Latin square counterbalancing, each condition appears exactly once in each ordinal position, ensuring that order effects are distributed evenly across conditions and do not confound the condition means.
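
A cyclic Latin square is one simple way to generate such orders. The sketch below is an assumption-laden illustration rather than DataStatPro functionality: the function names `cyclic_latin_square` and `order_for_participant` are hypothetical, and participants are assigned to rows in rotation. Note that a cyclic square balances ordinal position only; each condition is always preceded by the same condition, so it does not balance first-order carryover.

```python
def cyclic_latin_square(conditions):
    """Return k orders in which every condition occupies each position once."""
    k = len(conditions)
    return [
        [conditions[(row + position) % k] for position in range(k)]
        for row in range(k)
    ]


def order_for_participant(index, conditions):
    """Assign participants to the rows of the square in rotation."""
    square = cyclic_latin_square(conditions)
    return square[index % len(conditions)]


conditions = ["A", "B", "C", "D"]
for order in cyclic_latin_square(conditions):
    print(order)
print(order_for_participant(5, conditions))
```

Recruiting participants in multiples of $k$ keeps the number of participants per order equal.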

To formally test for order effects in DataStatPro, add "Order Position" as a covariate in a mixed ANOVA after counterbalancing.

11.4 Multivariate vs. Univariate Approach: When to Choose

Criterion	Favour Univariate (with $\varepsilon$ correction)	Favour Multivariate (MANOVA)
Sphericity	Holds ( $\hat{\varepsilon} \geq 0.90$ )	Violated ( $\hat{\varepsilon} < 0.75$ )
Sample size	Small ( $n < 2k$ )	Large ( $n > 2k$ )
Number of conditions	$k$ is large	$k$ is small relative to $n$
Focus	Omnibus test of condition effect	Specific multivariate structure
Power	Higher when sphericity holds	Higher when sphericity is violated and $n$ is large

11.5 Effect Size Comparability: $\eta^2_G$ for Cross-Study Comparisons

A critical limitation of $\eta^2_p$ is that its value depends on how many conditions are included in the design. Adding a new condition level increases $SS_{Condition}$ (the numerator) and may change $SS_{Error}$ (the denominator), making $\eta^2_p$ values from different studies with different $k$ incomparable.

Generalised eta-squared ( $\eta^2_G$ ) solves this problem by including the stable between-subjects variance ( $SS_{BS}$ ) in the denominator. This quantity is relatively constant across studies and provides a common denominator for effect size comparisons regardless of the number of conditions.

Recommendation: Always report both $\eta^2_p$ (for direct F-ratio interpretation and local comparison) and $\eta^2_G$ (for meta-analytic and cross-study comparisons).

11.6 Multiple Comparisons in Repeated Measures Designs

When conducting $m = k(k-1)/2$ pairwise post-hoc comparisons, the familywise error rate inflates:

$FWER = 1 - (1-\alpha)^m$

For $k = 5$ conditions ( $m = 10$ pairs): $FWER = 1 - (0.95)^{10} = .401$ .

Correction strategies specific to repeated measures:

Method	Description	Properties
Bonferroni	$\alpha' = \alpha/m$	Simple; overly conservative with many comparisons
Holm	Sequential Bonferroni	Less conservative; strongly controls FWER
Šidák	$\alpha' = 1 - (1-\alpha)^{1/m}$	Slightly less conservative than Bonferroni
Tukey's HSD	Uses the studentised range distribution	Optimal for all pairwise comparisons
Benjamini-Hochberg	Controls FDR rather than FWER	Appropriate for exploratory studies

For planned contrasts (hypotheses specified before data collection), no correction is required if the contrasts are orthogonal and pre-registered.
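
The familywise error rate and three of the corrections in the table can be expressed compactly. This is a minimal sketch with hypothetical function names; the Holm routine returns adjusted p-values in the original order of the input, and the p-values in the example are invented for illustration.

```python
def familywise_error_rate(alpha, m):
    """FWER for m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    return alpha / m


def sidak_alpha(alpha, m):
    return 1 - (1 - alpha) ** (1 / m)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the smallest p by m, the next by m - 1, and so on,
        # then enforce monotonicity and cap at 1
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


print(round(familywise_error_rate(0.05, 10), 3))   # k = 5 conditions, m = 10 pairs
print(round(bonferroni_alpha(0.05, 10), 4), round(sidak_alpha(0.05, 10), 4))
print(holm_adjust([0.01, 0.04, 0.03]))             # hypothetical raw p-values
```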

11.7 Power Considerations: The Role of Within-Subjects Correlation

The primary driver of power in repeated measures ANOVA is the average correlation among conditions ( $\rho$ ). The error mean square is:

$MS_{Error} \propto \sigma^2(1 - \rho)$

As $\rho$ increases toward 1 (perfect consistency of individual ordering across conditions), $MS_{Error}$ approaches 0 and power approaches 1 for any non-zero effect.
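
The relationship can be checked numerically. In a one-way within-subjects design, $MS_{Error}$ equals the mean condition variance minus the mean covariance between conditions, which reduces to $\sigma^2(1-\rho)$ when all variances and all correlations are equal. The sketch below assumes condition standard deviations and a correlation matrix as inputs; the function name `ms_error_from_summary` and the values are hypothetical.

```python
from itertools import combinations


def ms_error_from_summary(sds, corr):
    """MS_Error as mean condition variance minus mean between-condition covariance."""
    k = len(sds)
    mean_variance = sum(s * s for s in sds) / k
    pairs = list(combinations(range(k), 2))
    mean_covariance = sum(
        corr[j][jp] * sds[j] * sds[jp] for j, jp in pairs
    ) / len(pairs)
    return mean_variance - mean_covariance


def equicorrelated(rho, k=3):
    """Correlation matrix with the same correlation rho between all conditions."""
    return [[1.0 if i == j else rho for j in range(k)] for i in range(k)]


# Hypothetical design: sigma = 10 in every condition, increasing correlation
for rho in (0.0, 0.5, 0.9):
    print(rho, round(ms_error_from_summary([10, 10, 10], equicorrelated(rho)), 2))
```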

Implication for design: Repeated measures designs are most efficient when individual differences are large and consistent. For traits that are highly variable across individuals but stable within individuals (e.g., cognitive ability, personality), the power advantage of within-subjects designs is enormous.

Sample size estimation in DataStatPro requires specifying:

The expected effect size $f$ (or equivalently $\eta^2_p$ ).
The number of conditions $k$ .
The expected average correlation $\rho$ among conditions.
The significance level $\alpha$ .
The desired power $1-\beta$ .

11.8 Bayesian Repeated Measures ANOVA

The Bayesian approach provides evidence quantification rather than a binary decision. DataStatPro implements Rouder et al.'s (2012) Bayes Factor for within-subjects designs, comparing:

$H_1$ : A model including the condition effect.
$H_0$ : A model including only individual differences (no condition effect).

The Bayes Factor $BF_{10}$ directly quantifies how much more probable the data are under the condition-effect model than under the null. Unlike a non-significant frequentist p-value, which cannot support $H_0$ , a value of $BF_{10} < 1/3$ is conventionally read as evidence in favour of the null model. This makes the Bayesian approach especially valuable for studies aiming to support a null result (e.g., demonstrating that a new training protocol has no effect on performance).

12. Worked Examples

Example 1: Pain Ratings Across Three Treatment Phases

A clinical researcher measures pain ratings (0–100 VAS scale) in $n = 12$ chronic pain patients at three time points: Pre-treatment (Baseline), Mid-treatment (Week 6), and Post-treatment (Week 12). Do pain ratings change significantly over time?

Data Matrix ( $n = 12$ , $k = 3$ ):

Participant	Baseline ( $X_{i1}$ )	Week 6 ( $X_{i2}$ )	Week 12 ( $X_{i3}$ )	Person Mean ( $\bar{X}_{i\cdot}$ )
1	72	58	45	58.33
2	65	50	38	51.00
3	80	62	50	64.00
4	55	44	32	43.67
5	78	60	48	62.00
6	60	48	35	47.67
7	70	54	42	55.33
8	82	65	52	66.33
9	58	46	33	45.67
10	75	57	45	59.00
11	68	52	40	53.33
12	63	49	37	49.67
Condition Mean	68.83	53.75	41.42	54.67

Step 1 — Hypotheses:

$H_0: \mu_{Baseline} = \mu_{Week6} = \mu_{Week12}$

$H_1:$ At least one time point mean differs.

Step 2 — Sum of Squares:

$SS_{Total} = \sum_{i}\sum_{j}(X_{ij} - 54.67)^2 = 6{,}862.00$

$SS_{Between\text{-}Subjects} = 3\sum_{i}(\bar{X}_{i\cdot} - 54.67)^2 = 3 \times 615.56 = 1{,}846.67$

$SS_{Condition} = 12\left[(68.83-54.67)^2 + (53.75-54.67)^2 + (41.42-54.67)^2\right]$

$= 12\left[200.53 + 0.85 + 175.62\right] = 12 \times 377.00 = 4{,}524.00$

$SS_{Error} = SS_{Total} - SS_{Between\text{-}Subjects} - SS_{Condition}$

$= 6{,}862.00 - 1{,}846.67 - 4{,}524.00 = 491.33$

Note: $SS_{Error}$ can equivalently be computed directly from the residuals, $SS_{Error} = \sum_{i,j}(X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X}_{\cdot\cdot})^2$ , which gives the same value.

Step 3 — Degrees of Freedom:

$df_{Condition} = k - 1 = 2$

$df_{Error} = (k-1)(n-1) = 2 \times 11 = 22$

Step 4 — Mauchly's Test:

$W = 0.912$ , $\chi^2(2) = 0.93$ , $p = .628$

$\hat{\varepsilon}_{GG} = 0.921$ , $\tilde{\varepsilon}_{HF} = 1.000$

Sphericity is not violated ( $p = .628$ , $\hat{\varepsilon}_{GG} = 0.921 > 0.75$ ); report uncorrected results.

Step 5 — Mean Squares and F-Ratio:

$MS_{Condition} = 4{,}524.00/2 = 2{,}262.00$

$MS_{Error} = 491.33/22 = 22.33$

$F = 2{,}262.00/22.33 = 101.30$

Step 6 — p-Value:

$p = P(F_{2,22} \geq 101.30) < .001$

Step 7 — Effect Sizes:

$\eta^2_p = 4{,}524.00/(4{,}524.00 + 491.33) = 4{,}524.00/5{,}015.33 = 0.902$

$\omega^2_p = \frac{4{,}524.00 - 2(22.33)}{4{,}524.00 + (12-3+1)(22.33)} = \frac{4{,}479.34}{4{,}747.30} = 0.944$

$f = \sqrt{0.902/0.098} = \sqrt{9.20} = 3.03$ (very large)

Step 8 — Post-Hoc Comparisons (Bonferroni, $m = 3$ pairs, $\alpha' = .017$ ):

Comparison	Mean Diff	$s_d$	$t(11)$	$p_{adj}$ (Bonferroni)	$d_{rm}$
Baseline vs. Week 6	$15.08$	$2.35$	$22.24$	$< .001$	$6.41$
Baseline vs. Week 12	$27.42$	$2.81$	$33.79$	$< .001$	$9.76$
Week 6 vs. Week 12	$12.33$	$2.06$	$20.75$	$< .001$	$5.99$

All pairwise comparisons are significant; pain ratings decreased significantly between every pair of time points.

Summary Table:

Source	$SS$	$df$	$MS$	$F$	$p$	$\eta^2_p$
Between-Subjects	$1{,}846.67$	$11$	—	—	—	—
Time	$4{,}524.00$	$2$	$2{,}262.00$	$101.30$	$< .001$	$.902$
Error	$491.33$	$22$	$22.33$
Total	$6{,}862.00$	$35$

APA write-up: "A one-way repeated measures ANOVA revealed a significant effect of time on pain ratings, $F(2, 22) = 101.30$ , $p < .001$ , $\eta^2_p = .90$ , $\omega^2_p = .94$ [95% CI: .86, .96]. Mauchly's test indicated that the sphericity assumption was not violated, $W = 0.912$ , $\chi^2(2) = 0.93$ , $p = .628$ , $\hat{\varepsilon}_{GG} = 0.92$ . Bonferroni-corrected pairwise comparisons revealed that pain ratings decreased significantly from baseline ( $M = 68.83$ , $SD = 8.42$ ) to week 6 ( $M = 53.75$ , $SD = 6.93$ ; $t(11) = 22.24$ , $p < .001$ , $d_{rm} = 6.41$ ), from baseline to week 12 ( $M = 41.42$ , $SD = 6.38$ ; $t(11) = 33.79$ , $p < .001$ , $d_{rm} = 9.76$ ), and from week 6 to week 12 ( $t(11) = 20.75$ , $p < .001$ , $d_{rm} = 5.99$ )."

Example 2: Stroop Interference Across Three Congruency Conditions

A cognitive psychologist measures reaction time (ms) in $n = 20$ participants across three Stroop conditions: Congruent, Neutral, and Incongruent.

Summary Statistics:

Condition	$\bar{X}_{\cdot j}$	$s_j$
Congruent	$480$ ms	$55$ ms
Neutral	$520$ ms	$58$ ms
Incongruent	$620$ ms	$72$ ms
Grand Mean	540 ms

Correlation matrix (estimated from pilot data):

	Congruent	Neutral	Incongruent
Congruent	1.00	0.72	0.65
Neutral	0.72	1.00	0.78
Incongruent	0.65	0.78	1.00

Step 1 — Compute SS from Summary Statistics:

$SS_{Condition} = n\sum_j(\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2$

$= 20\left[(480-540)^2 + (520-540)^2 + (620-540)^2\right]$

$= 20\left[3{,}600 + 400 + 6{,}400\right] = 20 \times 10{,}400 = 208{,}000$

Computing $MS_{Error}$ from the covariance matrix (using $s_j^2$ and $r_{jj'}$ ):

Each pairwise difference variance: $\text{Var}(d_{1,2}) = s_1^2 + s_2^2 - 2r_{12}s_1 s_2 = 3{,}025 + 3{,}364 - 2(0.72)(55)(58) = 6{,}389 - 4{,}593.6 = 1{,}795.4$

$\text{Var}(d_{1,3}) = 3{,}025 + 5{,}184 - 2(0.65)(55)(72) = 8{,}209 - 5{,}148 = 3{,}061$

$\text{Var}(d_{2,3}) = 3{,}364 + 5{,}184 - 2(0.78)(58)(72) = 8{,}548 - 6{,}514.6 = 2{,}033.4$

Because $MS_{Error}$ equals the mean condition variance minus the mean covariance, it is half the mean of the pairwise difference variances:

$MS_{Error} = \frac{1}{2} \times \frac{\text{Var}(d_{1,2}) + \text{Var}(d_{1,3}) + \text{Var}(d_{2,3})}{k(k-1)/2}$

$= \frac{1}{2} \times \frac{1{,}795.4 + 3{,}061.0 + 2{,}033.4}{3} = \frac{2{,}296.6}{2} = 1{,}148.3$

Note: Because the summary statistics are rounded, these values are approximate. DataStatPro uses the full data matrix for precise results.

Step 2 — Degrees of Freedom:

$df_{Condition} = 2, \quad df_{Error} = (3-1)(20-1) = 38$

Step 3 — F-Ratio:

$MS_{Condition} = 208{,}000/2 = 104{,}000$

$F \approx 104{,}000/1{,}148.3 \approx 90.57$ (approximate; the exact value requires the full data)

Step 4 — Mauchly's Test:

From the correlation matrix: $\hat{\varepsilon}_{GG} = 0.928$ , Mauchly $p = .412$ . Sphericity holds; no correction required.

Step 5 — Effect Size:

$SS_{Error} = df_{Error} \times MS_{Error} = 38 \times 1{,}148.3 = 43{,}635$

$\eta^2_p = 208{,}000/(208{,}000 + 43{,}635) = 208{,}000/251{,}635 \approx 0.827$

(very large effect; $f \approx 2.18$ )

Post-Hoc Contrasts (a priori hypothesis: Congruent < Neutral < Incongruent):

Comparison	Mean Diff (ms)	$d_{rm}$	Bonferroni $p$
Incongruent vs. Congruent	$+140$	$2.53$	$< .001$
Incongruent vs. Neutral	$+100$	$2.22$	$< .001$
Neutral vs. Congruent	$+40$	$0.94$	$< .01$

APA write-up: "A one-way repeated measures ANOVA indicated a significant effect of Stroop congruency on reaction time, $F(2, 38) = 90.57$ , $p < .001$ , $\eta^2_p = .83$ , $\omega^2_p = .90$ . Mauchly's test indicated that sphericity was not violated, $W = 0.95$ , $\chi^2(2) = 0.89$ , $p = .41$ . Post-hoc pairwise comparisons (Bonferroni corrected) revealed that incongruent trials ( $M = 620$ ms, $SD = 72$ ms) were significantly slower than both neutral ( $M = 520$ ms; $\Delta M = 100$ ms, $d_{rm} = 2.22$ , $p < .001$ ) and congruent trials ( $M = 480$ ms; $\Delta M = 140$ ms, $d_{rm} = 2.53$ , $p < .001$ ). Neutral trials were also significantly slower than congruent trials ( $\Delta M = 40$ ms, $d_{rm} = 0.94$ , $p < .01$ )."

Example 3: Mixed ANOVA — Rehabilitation Programme Across Two Groups and Three Time Points

A physiotherapist compares two rehabilitation protocols (Standard vs. Enhanced) on functional mobility scores (higher = better) in $N = 30$ patients ( $n = 15$ per group) across three time points: Pre, Post-4wk, and Post-8wk.

Summary Statistics:

Group	Pre	Post-4wk	Post-8wk
Standard ( $n = 15$ )	$45.3$ ( $s = 7.2$ )	$55.8$ ( $s = 8.1$ )	$61.2$ ( $s = 9.0$ )
Enhanced ( $n = 15$ )	$44.8$ ( $s = 6.9$ )	$62.4$ ( $s = 7.8$ )	$74.6$ ( $s = 8.5$ )

Step 1 — Hypotheses:

$H_{0,Group}$ : Standard and Enhanced groups have equal mean mobility scores (averaged across time).
$H_{0,Time}$ : Mean mobility scores are equal across Pre, Post-4wk, and Post-8wk (averaged across groups).
$H_{0,Group \times Time}$ : The time trajectory does not differ between groups (primary hypothesis).

Step 2 — ANOVA Source Table (condensed):

Source	$F$	$df_1, df_2$	$p$	$\eta^2_p$
Group (Between)	$4.82$	$1, 28$	$.037$	$.147$
Error (Between: Subjects within Groups)	—	$28$
Time (Within)	$98.45$	$2, 56$	$< .001$	$.779$
Group × Time (Interaction)	$12.37$	$2, 56$	$< .001$	$.307$
Error (Within)	—	$56$

Mauchly's test: $W = 0.943$ , $\hat{\varepsilon}_{GG} = 0.948$ , $p = .421$ . Sphericity holds; uncorrected results reported.

Step 3 — Interpreting the Interaction (Primary Test):

The Group × Time interaction is significant, $F(2, 56) = 12.37$ , $p < .001$ , $\eta^2_p = .307$ . This indicates that the trajectory of improvement over time differs between the Standard and Enhanced groups. Profile plot inspection reveals that both groups improve over time, but the Enhanced group improves at a faster rate — the gap between groups widens from Pre to Post-8wk.

Step 4 — Simple Effects (Follow-Up):

To unpack the interaction, compute the time effect separately within each group:

Group	$F(2, 28)$ for Time	$p$	$\eta^2_p$
Standard	$42.60$	$< .001$	$.753$
Enhanced	$89.13$	$< .001$	$.864$

Both groups show significant improvement over time. The difference in rate of improvement (interaction) is characterised by the Group × Time contrast.

APA write-up: "A 2 (Group: Standard vs. Enhanced) × 3 (Time: Pre, Post-4wk, Post-8wk) mixed ANOVA was conducted with Group as the between-subjects factor and Time as the within-subjects factor. Mauchly's test confirmed sphericity was not violated, $W = 0.943$ , $\chi^2(2) = 0.89$ , $p = .421$ , $\hat{\varepsilon}_{GG} = 0.948$ . The Group × Time interaction was significant, $F(2, 56) = 12.37$ , $p < .001$ , $\eta^2_p = .31$ [95% CI: .11, .47], indicating that the two rehabilitation groups differed in their rate of improvement over time. Simple effects analyses confirmed that both groups improved significantly across time points (Standard: $F(2, 28) = 42.60$ , $p < .001$ ; Enhanced: $F(2, 28) = 89.13$ , $p < .001$ ), with the Enhanced group demonstrating a steeper improvement trajectory, reaching a mean mobility score of $74.6$ ( $SD = 8.5$ ) at Post-8wk compared to $61.2$ ( $SD = 9.0$ ) for the Standard group. Main effects of Time, $F(2, 56) = 98.45$ , $p < .001$ , $\eta^2_p = .78$ , and Group, $F(1, 28) = 4.82$ , $p = .037$ , $\eta^2_p = .15$ , were both significant but are qualified by the interaction."

13. Common Mistakes and How to Avoid Them

Mistake 1: Ignoring the Sphericity Assumption

Problem: Running repeated measures ANOVA without testing or correcting for sphericity. When sphericity is violated, the nominal F-distribution is incorrect and the Type I error rate is inflated — sometimes substantially (e.g., actual $\alpha$ of $.15$ when nominal $\alpha = .05$ for $k = 5$ and severe violation).

Solution: Always report Mauchly's test result and $\hat{\varepsilon}_{GG}$ and $\tilde{\varepsilon}_{HF}$ . Apply the HF correction when $\hat{\varepsilon}_{GG} > .75$ and the GG correction when $\hat{\varepsilon}_{GG} \leq .75$ . Consider the multivariate approach for severe violations with adequate $n$ .

Mistake 2: Treating the Omnibus F-Test as the Final Answer

Problem: Reporting a significant omnibus F and concluding "the conditions differ" without specifying which conditions differ or by how much. The omnibus test provides no information about the pattern of differences.

Solution: Always follow a significant omnibus F with post-hoc pairwise comparisons (for exploratory research) or planned contrasts (for hypothesis-driven research). Report mean differences, confidence intervals, and pairwise effect sizes ( $d_{rm}$ ) for each comparison of interest.

Mistake 3: Using Between-Subjects CIs in Profile Plots

Problem: Plotting standard between-subjects error bars (which include individual difference variance) on a profile plot for repeated measures data. These CIs are too wide and misleadingly suggest that adjacent condition means are not significantly different even when the within-subjects F-test is highly significant.

Solution: Use within-subjects confidence intervals (Cousineau-Morey correction) for all profile plots involving repeated measures. Always label error bars explicitly in the figure caption and state the correction used.

Mistake 4: Conflating the Time Effect with the Group × Time Interaction in Mixed ANOVA

Problem: In mixed ANOVA, focusing on the significant main effect of Time and concluding that the treatment works, while ignoring that the Time effect averages across groups (including the control group). The relevant test for treatment efficacy is the Group × Time interaction, not the main effect of Time.

Solution: For intervention studies with a control group, the primary hypothesis test is always the Group × Time interaction. The main effects of Time and Group are typically secondary or incidental.

Mistake 5: Reporting Only $\eta^2_p$ Without Context

Problem: Reporting $\eta^2_p = 0.73$ without noting that this is a within-subjects design, or without reporting $\eta^2_G$ . Partial eta-squared in repeated measures designs is typically much larger than in between-subjects designs for equivalent true effect sizes because $SS_{Between\text{-}Subjects}$ is excluded from the denominator. This makes $\eta^2_p$ values non-comparable across designs.

Solution: Always report both $\eta^2_p$ and $\eta^2_G$ . Note whether the effect size is from a within- or between-subjects design. Report $\omega^2_p$ as the bias-corrected complement to $\eta^2_p$ , especially for small $n$ .
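
A short numerical illustration makes the gap concrete. The sums of squares below are invented so that $\eta^2_p$ equals the 0.73 mentioned above; only the two formulas come from the cheat sheet.

```python
# Hypothetical sums of squares from a one-way repeated measures ANOVA
ss_cond = 73.0    # condition effect
ss_error = 27.0   # condition x participant residual
ss_bs = 400.0     # between-subjects (individual differences)

# Partial eta-squared ignores SS_BS in the denominator
eta_p = ss_cond / (ss_cond + ss_error)
# Generalised eta-squared keeps SS_BS in the denominator
eta_g = ss_cond / (ss_cond + ss_bs + ss_error)

print(f"partial eta-squared     = {eta_p:.3f}")
print(f"generalised eta-squared = {eta_g:.3f}")
```

The same effect is described as 0.73 by $\eta^2_p$ and as roughly 0.15 by $\eta^2_G$; the larger the individual differences, the wider this gap becomes.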

Mistake 6: Applying Repeated Measures ANOVA to Dependent Groups

Problem: Analysing data where different participants are matched or paired (not the same participants) using the same software procedure as for within-subjects designs. While mathematically the analysis is valid (matched pairs can be treated as if the same participant were measured twice), the interpretation of the between-subjects term changes and must be communicated clearly.

Solution: Clearly state in the methods section whether the design uses the same participants (within-subjects) or matched participants. The statistical procedure is the same, but the language of "the same participants across conditions" vs. "matched pairs" differs and affects interpretation of the between-subjects variance term.

Mistake 7: Conducting Multiple Repeated Measures ANOVAs Without Correction

Problem: Running separate repeated measures ANOVAs on multiple dependent variables (e.g., one for each of 10 outcome scales), each at $\alpha = .05$ , without any multiple comparison correction. This inflates the experiment-wise Type I error rate.

Solution: Use MANOVA to jointly test all dependent variables simultaneously if they are theoretically related (and if $n$ is adequate). If separate ANOVAs are necessary, apply Bonferroni or Holm correction to the omnibus p-values. Report all tested ANOVAs, not just significant ones.
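
For the separate-ANOVAs route, the Holm step-down adjustment is simple enough to sketch directly. The function name `holm_adjust` and the three p-values are illustrative assumptions; the procedure itself is the standard Holm method (sort ascending, multiply the $i$-th smallest p-value by $m - i + 1$, enforce monotonicity, cap at 1).

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Illustrative omnibus p-values from three separate ANOVAs
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
print(adjusted)
```

Compare each adjusted value with $\alpha = .05$. Bonferroni would instead multiply every p-value by $m$, which is never less conservative than Holm.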

Mistake 8: Forgetting Counterbalancing for Order-Sensitive Conditions

Problem: Presenting all participants with conditions in the same fixed order (e.g., always Condition 1, then Condition 2, then Condition 3). Any practice, fatigue, or contrast effects will be confounded with the condition effect, potentially inflating or deflating specific condition means.

Solution: Counterbalance condition order across participants. Use a complete counterbalancing scheme (all $k!$ orders) for small $k$ or a Latin square for larger $k$ . Include order position as a covariate in the analysis to check for residual order effects.
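
Both schemes can be generated programmatically. The sketch below uses hypothetical function names; note that the simple cyclic Latin square shown here guarantees only that every condition appears once in every serial position, and it does not balance which condition immediately precedes which (a Williams design would be needed for that).

```python
from itertools import permutations

def complete_counterbalancing(conditions):
    """All k! presentation orders; practical only for small k."""
    return [list(order) for order in permutations(conditions)]

def cyclic_latin_square(conditions):
    """k orders; each condition occupies each serial position exactly once."""
    k = len(conditions)
    return [[conditions[(row + col) % k] for col in range(k)] for row in range(k)]

all_orders = complete_counterbalancing(["A", "B", "C"])
square = cyclic_latin_square(["A", "B", "C", "D"])

print(len(all_orders), "orders for k = 3")
for order in square:
    print(order)
```

Participants are then assigned to orders in rotation, so that each order is used equally often.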

14. Troubleshooting

Problem	Likely Cause	Solution
$F < 1$	Condition means more similar than chance; very noisy data; possible outliers	Check data; verify conditions manipulate the intended construct; inspect outliers
Mauchly's $W = 1.00$	Only $k = 2$ conditions (sphericity trivially satisfied) or perfectly equal difference variances	Report sphericity as trivially satisfied for $k = 2$ ; verify data for $k > 2$
$\hat{\varepsilon}_{GG}$ is very small ( $< 0.50$ )	Severe sphericity violation; markedly unequal difference variances	Report GG-corrected results; also report MANOVA; consider LMM
GG and HF corrected p-values differ substantially	Moderate sphericity violation near the $0.75$ boundary	Report both corrections; use HF as primary if $\hat{\varepsilon}_{GG} > 0.75$
MANOVA test is significant but univariate F is not	Multivariate structure not captured by univariate mean differences; multivariate test sensitive to profile shape, not just level	Report both; describe the multivariate pattern using discriminant function analysis
$\eta^2_p > 0.90$	Very large effect; possible design confound; or $n$ is small relative to the effect	Verify no confounds (order effects, demand characteristics); replicate with independent sample
$\omega^2_p$ is negative	Very small true effect; $F < 1$ ; small $n$	Report as $\omega^2_p \approx 0$ ; do not report negative values as meaningful
Post-hoc tests all non-significant despite significant omnibus $F$	Bonferroni correction too conservative with many comparisons	Use Holm or Tukey correction; consider planned contrasts if hypotheses existed a priori
Sphericity test unavailable	Only $k = 2$ conditions (sphericity not testable) or all difference variances are zero	For $k = 2$ , sphericity is trivially met; report paired t-test results
Missing data in one or more cells	Participant missing a condition	Exclude participant from analysis (report reduced $n$ ) or switch to LMM for all-inclusive analysis
Very wide CIs for $\eta^2_p$	Small $n$	Increase $n$ ; report CIs faithfully — they convey genuine uncertainty; plan an adequately powered replication
Profile plot lines cross (mixed ANOVA)	Likely significant Group × Time interaction	Test the interaction formally; if significant, report and interpret simple effects
Friedman test disagrees with repeated measures ANOVA	Non-normality causing ANOVA to be unreliable	Use Friedman test results if normality is severely violated; report both and note discrepancy

15. Quick Reference Cheat Sheet

Core Equations

Formula	Description
$SS_{Condition} = n\sum_j(\bar{X}_{\cdot j} - \bar{X}_{\cdot\cdot})^2$	Condition sum of squares
$SS_{BS} = k\sum_i(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2$	Between-subjects sum of squares
$SS_{Error} = SS_{Total} - SS_{Condition} - SS_{BS}$	Error sum of squares
$df_{Condition} = k-1$	Condition degrees of freedom
$df_{Error} = (k-1)(n-1)$	Error degrees of freedom
$F = MS_{Condition}/MS_{Error}$	F-ratio
$p = P(F_{df_1, df_2} \geq F_{obs})$	Right-tail p-value
$\eta^2_p = SS_{Cond}/(SS_{Cond}+SS_{Error})$	Partial eta-squared
$\eta^2_G = SS_{Cond}/(SS_{Cond}+SS_{BS}+SS_{Error})$	Generalised eta-squared
$\omega^2_p = (SS_{Cond}-(k-1)MS_E)/(SS_{Cond}+(n-k+1)MS_E)$	Partial omega-squared (bias-corrected)
$f = \sqrt{\eta^2_p/(1-\eta^2_p)}$	Cohen's $f$ for power analysis
$d_{rm} = \Delta\bar{X}/s_{diff}$	Pairwise Cohen's $d$ for post-hoc comparison
$df^*_1 = (k-1)\hat{\varepsilon}$ , $df^*_2 = (k-1)(n-1)\hat{\varepsilon}$	Epsilon-corrected degrees of freedom
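
The core equations can be chained together as follows. This is a minimal sketch on a made-up $3 \times 3$ data matrix, with a hypothetical function name; it stops at the F-ratio and the two eta-squared variants, and obtaining the p-value would additionally require an F-distribution routine that the Python standard library does not provide.

```python
def rm_anova(data):
    """One-way repeated measures ANOVA from an n x k matrix (rows = participants)."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_bs = k * sum((m - grand) ** 2 for m in subj_means)
    ss_error = ss_total - ss_cond - ss_bs

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    f_ratio = (ss_cond / df_cond) / (ss_error / df_error)

    return {
        "SS_cond": ss_cond, "SS_BS": ss_bs, "SS_error": ss_error,
        "df": (df_cond, df_error), "F": f_ratio,
        "eta2_p": ss_cond / (ss_cond + ss_error),
        "eta2_G": ss_cond / (ss_cond + ss_bs + ss_error),
    }

# Illustrative data only: 3 participants x 3 conditions
result = rm_anova([[1, 3, 5], [2, 3, 7], [3, 6, 6]])
print(result)
```

For this matrix the grand mean is 4, so $SS_{Condition} = 24$, $SS_{BS} = 6$, $SS_{Error} = 4$, and $F(2, 4) = 12$, which can be checked by hand against the table above.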

Epsilon Correction Decision Guide

$\hat{\varepsilon}_{GG}$	Correction
$\geq .90$	None required
$.75 - .89$	Huynh-Feldt
$< .75$	Greenhouse-Geisser
Severe ( $< .50$ )	GG + report MANOVA
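
The decision guide translates directly into a small helper. The function names are hypothetical, and one boundary choice is an assumption: values between .89 and .90, which the table leaves unassigned, are treated here as Huynh-Feldt.

```python
def choose_correction(eps_gg):
    """Map the Greenhouse-Geisser epsilon estimate to the guide's recommendation."""
    if eps_gg >= 0.90:
        return "none"
    if eps_gg >= 0.75:   # assumption: covers the unassigned .89-.90 gap
        return "Huynh-Feldt"
    if eps_gg >= 0.50:
        return "Greenhouse-Geisser"
    return "Greenhouse-Geisser + report MANOVA"

def corrected_df(k, n, epsilon):
    """Epsilon-corrected degrees of freedom for k conditions and n participants."""
    return (k - 1) * epsilon, (k - 1) * (n - 1) * epsilon

print(choose_correction(0.82))
print(corrected_df(k=4, n=10, epsilon=0.5))
```

Only the degrees of freedom change under the correction; the F-ratio itself is the same as in the uncorrected test.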

Effect Size Benchmarks

$\eta^2_p$	Cohen's $f$	Verbal Label
$.01$	$0.10$	Small
$.06$	$0.25$	Medium
$.14$	$0.40$	Large
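
The two columns of this table are linked by $f = \sqrt{\eta^2_p/(1-\eta^2_p)}$ from the Core Equations. A quick check, using hypothetical helper names, confirms that the benchmark pairs agree to two decimals:

```python
from math import sqrt

def eta2_to_f(eta2):
    """Cohen's f from (partial) eta-squared."""
    return sqrt(eta2 / (1 - eta2))

def f_to_eta2(f):
    """Inverse conversion: eta-squared from Cohen's f."""
    return f ** 2 / (1 + f ** 2)

benchmarks = {"Small": 0.01, "Medium": 0.06, "Large": 0.14}
converted = {label: round(eta2_to_f(eta2), 2) for label, eta2 in benchmarks.items()}
print(converted)
```

The inverse function is useful when a power-analysis tool asks for $f$ but a published study reports only $\eta^2_p$.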

Required Sample Size (One-Way, $k = 3$ , $\alpha = .05$ , $\rho = .50$ )

Cohen's $f$	Power = 0.80	Power = 0.90
0.10 (small)	52	70
0.25 (medium)	12	16
0.40 (large)	7	9
0.50	5	7

Assumes $\alpha = .05$ and an average inter-condition correlation of $\rho = .50$ . The F-test is evaluated in the right tail only, so no one- versus two-tailed distinction applies.

Decision Guide

Condition	Recommended Test
Same participants, $k \geq 3$ conditions, normal	Repeated measures ANOVA
Same participants, $k = 2$ conditions	Paired samples t-test
Same participants, $k \geq 3$ , non-normal or ordinal	Friedman test
Severe sphericity violation, adequate $n$	MANOVA approach
Missing data, unequal intervals, or covariates	Linear mixed-effects model (LMM)
One within + one between factor	Mixed (split-plot) ANOVA
Two or more within factors	Factorial repeated measures ANOVA
Quantitative, equally-spaced within factor	Polynomial trend analysis
Establishing null condition effect	Bayesian RM ANOVA ( $BF_{10}$ )

Post-Hoc Correction Comparison

Method	Controls	Best Used When
Bonferroni	FWER (strict)	Few comparisons ( $m \leq 6$ )
Holm	FWER (sequential)	Moderate $m$ ; less conservative than Bonferroni
Tukey's HSD	FWER	All pairwise comparisons; equal $n$
Šidák	FWER	Independent comparisons; slightly less conservative than Bonferroni
Benjamini-Hochberg	FDR	Exploratory; large $m$
None (planned orthogonal)	Per-comparison	Strictly pre-registered, orthogonal contrasts

APA 7th Edition Reporting Templates

One-Way Repeated Measures ANOVA (sphericity met): "A one-way repeated measures ANOVA revealed a significant effect of [Factor] on [Outcome], $F(df_1, df_2) =$ [value], $p =$ [value], $\eta^2_p =$ [value] [95% CI: LB, UB], $\omega^2_p =$ [value]. Mauchly's test indicated that sphericity was not violated, $W =$ [value], $\chi^2(df) =$ [value], $p =$ [value], $\hat{\varepsilon}_{GG} =$ [value]."

One-Way Repeated Measures ANOVA (sphericity violated; GG correction): "A one-way repeated measures ANOVA with Greenhouse-Geisser correction revealed a significant effect of [Factor] on [Outcome], $F(df^*_1, df^*_2) =$ [value], $p =$ [value], $\eta^2_p =$ [value] [95% CI: LB, UB], $\omega^2_p =$ [value]. Mauchly's test indicated that sphericity was violated, $W =$ [value], $\chi^2(df) =$ [value], $p =$ [value], $\hat{\varepsilon}_{GG} =$ [value]."

Mixed ANOVA (Group × Time interaction): "An $a \times k$ mixed ANOVA with [Between-Factor] as the between-subjects factor and [Within-Factor] as the within-subjects factor revealed a significant [Between × Within] interaction, $F(df_1, df_2) =$ [value], $p =$ [value], $\eta^2_p =$ [value] [95% CI: LB, UB]. Simple effects analyses indicated that..."

With post-hoc comparisons: "Post-hoc pairwise comparisons (Bonferroni corrected) indicated that [Condition A] ( $M =$ [value], $SD =$ [value]) was significantly [higher/lower] than [Condition B] ( $M =$ [value], $SD =$ [value]), $t(n-1) =$ [value], $p_{adj} =$ [value], $d_{rm} =$ [value] [95% CI: LB, UB]."

With Friedman test (non-parametric): "A Friedman test indicated a significant effect of [Factor] on [Outcome], $\chi^2_F(df) =$ [value], $p =$ [value], $W =$ [value]. Post-hoc Wilcoxon signed-rank tests (Bonferroni corrected) indicated that..."

Reporting Checklist

Item	Required
$F$ -statistic (uncorrected or corrected)	✅ Always
Both degrees of freedom ( $df_1$ , $df_2$ )	✅ Always
Exact p-value	✅ Always
Mauchly's $W$ , $\chi^2$ , $df$ , $p$	✅ Always (when $k \geq 3$ )
$\hat{\varepsilon}_{GG}$ and $\tilde{\varepsilon}_{HF}$	✅ Always (when $k \geq 3$ )
Statement of which correction was applied	✅ When sphericity violated
Condition means and standard deviations	✅ Always
95% CIs for condition means (within-subjects)	✅ Always
$\eta^2_p$ with 95% CI	✅ Always
$\omega^2_p$ (bias-corrected)	✅ When $n < 30$
$\eta^2_G$	✅ For meta-analytic or cross-study comparisons
Cohen's $f$	✅ For power analysis reporting
Post-hoc comparisons (means, $t$ , $p_{adj}$ , $d_{rm}$ )	✅ When omnibus $F$ is significant
Planned contrasts (if pre-registered)	✅ When applicable
Normality check (Shapiro-Wilk, Q-Q plots)	✅ When $n < 30$
Outlier check (Mahalanobis $D^2$ )	✅ Always
Profile plot with within-subjects error bars	✅ Always
Power analysis (post-hoc or a priori)	✅ For non-significant results; underpowered studies
Bayes Factor	Recommended for null results
Counterbalancing statement	✅ When within-subjects design with potential carryover
Sample size per condition	✅ Always

This tutorial provides a comprehensive foundation for understanding, conducting, and reporting repeated measures ANOVA within the DataStatPro application. For further reading, consult Field's "Discovering Statistics Using IBM SPSS Statistics" (5th ed., 2018), Maxwell, Delaney & Kelley's "Designing Experiments and Analyzing Data" (3rd ed., 2018), Lakens's "Calculating and Reporting Effect Sizes to Facilitate Cumulative Science" (Frontiers in Psychology, 2013), Rouder et al.'s "Default Bayes Factors for ANOVA Designs" (Journal of Mathematical Psychology, 2012), and Olejnik & Algina's "Generalized Eta and Omega Squared Statistics" (Psychological Methods, 2003). For feature requests or support, contact the DataStatPro team.
