
Effect Size Calculator

Comprehensive reference guide for effect size calculations and interpretation.

Effect Size Calculator: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Effect Sizes all the way through advanced estimation, interpretation, reporting, and practical usage within the DataStatPro application. Whether you are encountering effect sizes for the first time or looking to deepen your understanding of practical significance in research, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is an Effect Size?
  3. The Mathematics Behind Effect Sizes
  4. Assumptions of Effect Size Estimation
  5. Types of Effect Sizes
  6. Using the Effect Size Calculator Component
  7. Effect Sizes for Mean Differences
  8. Effect Sizes for Variance Explained
  9. Effect Sizes for Associations and Categorical Data
  10. Model Fit and Evaluation
  11. Advanced Topics
  12. Worked Examples
  13. Common Mistakes and How to Avoid Them
  14. Troubleshooting
  15. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into effect sizes, it is essential to be comfortable with the following foundational statistical concepts. Each is briefly reviewed below.

1.1 Statistical Significance vs. Practical Significance

A p-value answers the question: "If the null hypothesis were true, how likely is it that we would observe data at least as extreme as what we actually observed?"

A small p-value tells us the result is unlikely under $H_0$ — but it does not tell us how large the effect is or whether it matters in practice.

Consider two studies, both with $p < .001$:

Study A has a highly significant but trivially small effect. Study B has a large, practically meaningful effect. Effect sizes quantify the magnitude of an effect independently of sample size — they answer the question: "How big is the effect?"

1.2 Standard Deviation and Variance

The standard deviation $\sigma$ (population) or $s$ (sample) measures the spread of a distribution:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Most effect sizes for mean differences are standardised by dividing the raw difference by a standard deviation. This makes the effect size unit-free and comparable across studies using different measurement scales.

1.3 The Normal Distribution

Many effect size formulas assume that data come from normally distributed populations. The standard normal distribution $Z \sim \mathcal{N}(0, 1)$ is used to convert effect sizes into probabilities such as the common language effect size and probability of superiority.

The relationship between an effect size $d$ and the area of non-overlap between two normal distributions is fundamental to interpreting effect sizes in terms of real-world probabilities.

1.4 Correlation and Covariance

The Pearson correlation coefficient $r$ is a standardised measure of linear association:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)s_x s_y}$$

It ranges from $-1$ to $+1$ and is itself an effect size for the strength of a linear relationship between two continuous variables.

1.5 Variance Decomposition

Many effect sizes for ANOVA and regression are ratios of variances:

$$\eta^2 = \frac{SS_{effect}}{SS_{total}}$$

Understanding the decomposition of variance into between-group (explained) and within-group (unexplained) components is essential for interpreting these effect sizes.

1.6 Confidence Intervals

A confidence interval (CI) for an effect size gives a range of plausible values for the true population effect, given the sample data. A 95% CI means that if we repeated the study 100 times, approximately 95 of the resulting intervals would contain the true population effect size.

Always report effect sizes with confidence intervals — a point estimate alone is insufficient because it conveys no information about precision or uncertainty.

1.7 The Non-Central Distributions

Effect sizes such as Cohen's $d$ and $\eta^2$ follow non-central distributions in finite samples — their sampling distributions are not symmetric, especially when the true population effect is non-zero.

Confidence intervals for these effect sizes must account for the non-centrality of the sampling distribution, which is why exact CIs require iterative numerical methods rather than simple $\pm z \times SE$ formulas.


2. What is an Effect Size?

2.1 The Core Idea

An effect size is a standardised, scale-free numerical index that quantifies the magnitude of a phenomenon — how large a difference is, how strong an association is, or how much variance is explained. Effect sizes are:

- independent of sample size, so they separate magnitude from statistical significance;
- unit-free, and therefore comparable across studies and measurement scales;
- the raw material for power analysis and meta-analysis.

2.2 Why Effect Sizes Are Essential

The limitations of p-values alone:

  1. With large samples, even trivially small effects produce significant p-values.
  2. With small samples, even large effects may be non-significant.
  3. A p-value carries no information about the size or direction of an effect.
  4. p-values cannot be meaningfully compared across studies with different sample sizes.

What effect sizes add:

- the magnitude of the effect, on a standardised scale;
- the direction of the effect;
- comparability across studies with different samples and measures;
- the inputs needed for power analysis and meta-analysis.

2.3 The Effect Size Framework

Every effect size belongs to one of three broad families:

| Family | What It Measures | Examples |
|---|---|---|
| $d$-family | Standardised mean differences | Cohen's $d$, Hedges' $g$, Glass's $\Delta$ |
| $r$-family | Strength of association | Pearson $r$, $r^2$, $\eta^2$, $\omega^2$, $\varepsilon^2$ |
| Risk/Odds family | Probability-based contrasts | Odds ratio, Risk ratio, NNT, ARD |

2.4 Real-World Applications

| Field | Effect Size Application | Common Measure |
|---|---|---|
| Clinical Psychology | Effectiveness of CBT vs. control on depression | Cohen's $d$, Hedges' $g$ |
| Medicine | Drug vs. placebo on blood pressure | Cohen's $d$, Risk ratio, NNT |
| Education | Effect of tutoring on exam scores | Cohen's $d$, $\eta^2$ |
| Marketing | Brand A vs. B on purchase intent | Cohen's $d$, Cramér's $V$ |
| Neuroscience | Brain region activation between groups | Cohen's $d$, $\eta_p^2$ |
| Genetics | SNP association with disease risk | Odds ratio, $R^2$ |
| Organisational Psychology | Leadership training on productivity | Cohen's $d$, $f^2$ |
| Public Health | Vaccination programme on infection rate | Risk ratio, ARD, NNT |
| Ecology | Species richness across habitats | Cohen's $d$, $\eta^2$ |

2.5 Statistical Significance vs. Effect Size: A Unified View

The relationship between sample size, effect size, and statistical significance can be summarised by the power equation. For a $t$-test:

$$t = d \cdot \sqrt{\frac{n}{2}} \quad \text{(independent samples)}$$

This shows that the $t$-statistic (and therefore the p-value) is a joint function of both the effect size $d$ AND the sample size $n$. A non-significant result could mean:

- the effect is genuinely small, or
- the sample was too small to detect a real effect.

A significant result could mean:

- the effect is large, or
- the sample was large enough to detect even a trivial effect.

Effect sizes disentangle magnitude from sample size.


3. The Mathematics Behind Effect Sizes

3.1 Cohen's $d$ — The Fundamental Standardised Mean Difference

Cohen's $d$ is the cornerstone effect size for comparing two means. It expresses the difference between two means in standard deviation units.

For two independent groups:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$$

Where the pooled standard deviation is:

$$s_{pooled} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

For a one-sample design (comparing a sample mean to a known population value $\mu_0$):

$$d = \frac{\bar{x} - \mu_0}{s}$$

For a paired/repeated-measures design:

$$d_{paired} = \frac{\bar{d}}{s_d}$$

Where $\bar{d}$ is the mean of the difference scores and $s_d$ is the standard deviation of the difference scores.

Interpretation: $d = 1.0$ means the two group means are 1 standard deviation apart — for example, a group with mean 50 and a group with mean 60 differ by $d = 1.0$ if both have $s = 10$.
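The formulas above translate directly into code. The following is a minimal Python sketch (illustrative only, not DataStatPro's implementation; the function names are ours):

```python
import math

def cohens_d_independent(x1, x2):
    """Cohen's d for two independent groups, standardised by the pooled SD."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((x - m1) ** 2 for x in x1) / (n1 - 1)   # sample variance, group 1
    v2 = sum((x - m2) ** 2 for x in x2) / (n2 - 1)   # sample variance, group 2
    s_pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

def cohens_d_paired(diffs):
    """d for paired designs: mean difference score over the SD of differences."""
    n = len(diffs)
    m = sum(diffs) / n
    s_d = math.sqrt(sum((x - m) ** 2 for x in diffs) / (n - 1))
    return m / s_d
```

For example, two groups with means 60 and 50 and a common SD of 10 give $d = 1.0$.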

3.2 Hedges' $g$ — Bias-Corrected Cohen's $d$

Cohen's $d$ is slightly positively biased in small samples — it overestimates the true population effect size. Hedges' $g$ applies a correction factor $J$ to remove this bias:

$$g = d \times J$$

Where the correction factor is:

$$J = 1 - \frac{3}{4\nu - 1}$$

With $\nu = n_1 + n_2 - 2$ degrees of freedom (for independent samples) or $\nu = n - 1$ (for one-sample or paired designs).

A more precise version uses the gamma function:

$$J = \frac{\Gamma(\nu/2)}{\sqrt{\nu/2} \cdot \Gamma((\nu-1)/2)}$$

The bias is negligible for $n > 20$ per group but can be substantial ($> 5\%$) for very small samples ($n < 10$).
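The small-sample correction is one line of arithmetic. A sketch using the simple approximation to $J$ rather than the gamma-function form (illustrative function name):

```python
def hedges_g(d, nu):
    """Hedges' g from Cohen's d, with nu = degrees of freedom.

    Uses the approximation J = 1 - 3/(4*nu - 1).
    """
    return d * (1 - 3 / (4 * nu - 1))
```

With $\nu = 38$ (two groups of 20), the correction shrinks $d$ by about 2%.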

3.3 Glass's $\Delta$ — Using the Control Group SD

When the two groups have different variances (especially in pre-post or treatment-control designs where the treatment may change variability), Glass's $\Delta$ standardises by the control group standard deviation only:

$$\Delta = \frac{\bar{x}_{treatment} - \bar{x}_{control}}{s_{control}}$$

This makes the effect size interpretable as "how many standard deviation units above (or below) the control group distribution is the average treatment participant?"

3.4 Confidence Intervals for $d$ Using the Non-Central $t$-Distribution

The exact 95% CI for Cohen's $d$ uses the non-central $t$-distribution. The observed $t$-statistic has a non-central $t$-distribution with non-centrality parameter:

$$\lambda = d \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

The confidence limits for $\lambda$ are found by solving:

$$P(t_\nu(\lambda_L) \geq t_{obs}) = 0.025 \quad \text{and} \quad P(t_\nu(\lambda_U) \leq t_{obs}) = 0.025$$

Then converting back to $d$:

$$d_{L} = \lambda_L \sqrt{\frac{n_1 + n_2}{n_1 n_2}}, \quad d_{U} = \lambda_U \sqrt{\frac{n_1 + n_2}{n_1 n_2}}$$

This requires numerical iteration (no closed form) and is computed automatically by DataStatPro.

An approximate 95% CI (adequate for $n > 20$ per group) uses:

$$d \pm 1.96 \times SE_d, \quad SE_d \approx \sqrt{\frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2)}}$$
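The approximate interval is easy to compute directly. A sketch under the same large-sample assumption (the exact non-central interval still requires iterative root-finding; the function name is illustrative):

```python
import math

def d_ci_approx(d, n1, n2, z=1.96):
    """Approximate CI for Cohen's d; adequate for n > 20 per group."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se
```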

3.5 Eta Squared ($\eta^2$) — Proportion of Variance Explained

Eta squared is the proportion of total variance in the dependent variable attributable to the independent variable (group membership in ANOVA):

$$\eta^2 = \frac{SS_{effect}}{SS_{total}}$$

For a one-way ANOVA:

$$SS_{total} = SS_{between} + SS_{within}$$

$$\eta^2 = \frac{SS_{between}}{SS_{between} + SS_{within}}$$

Relationship to Cohen's $d$ (two groups only):

$$\eta^2 = \frac{d^2}{d^2 + 4}, \quad d = \frac{2\sqrt{\eta^2}}{\sqrt{1 - \eta^2}}$$

Limitation: $\eta^2$ is biased upward — it overestimates the population effect because it uses the total sum of squares from the sample. It should not be reported for multi-factor ANOVA (use partial $\eta^2$ or $\omega^2$ instead).

3.6 Partial Eta Squared ($\eta_p^2$) — Controlling for Other Effects

In factorial ANOVA (multiple IVs), partial eta squared estimates the proportion of variance explained by one effect after removing the variance attributable to other effects:

$$\eta_p^2 = \frac{SS_{effect}}{SS_{effect} + SS_{error}}$$

Note that in a one-way ANOVA (single IV), $\eta_p^2 = \eta^2$. In multi-factor ANOVA, $\eta_p^2 \geq \eta^2$ for every effect, and the sum of all partial $\eta^2$ values can exceed 1.0.

⚠️ Because partial $\eta^2$ values can sum to more than 1.0 across all effects in a factorial design, they should never be interpreted as the "proportion of total variance explained" — that interpretation applies only to $\eta^2$, not $\eta_p^2$.

3.7 Omega Squared ($\omega^2$) — Unbiased Variance-Explained Effect Size

Omega squared ($\omega^2$) is a bias-corrected version of $\eta^2$ that better estimates the population proportion of variance explained.

For one-way ANOVA:

$$\omega^2 = \frac{SS_{between} - (K-1) \cdot MS_{within}}{SS_{total} + MS_{within}}$$

Where:

- $K$ is the number of groups;
- $MS_{within}$ is the within-group (error) mean square.

Partial omega squared for factorial designs:

$$\omega_p^2 = \frac{SS_{effect} - df_{effect} \cdot MS_{error}}{SS_{total} + MS_{error}}$$

$\omega^2$ is generally preferred over $\eta^2$ because it does not inflate with small samples and provides a less biased estimate of the population effect.

3.8 Epsilon Squared ($\varepsilon^2$) — Another Unbiased Estimate

Epsilon squared is a computationally simpler alternative to omega squared:

$$\varepsilon^2 = \frac{SS_{between} - (K-1) \cdot MS_{within}}{SS_{total}}$$

Like $\omega^2$, $\varepsilon^2$ corrects for positive bias and can be slightly negative in small samples when the true population effect is near zero.
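The three variance-explained indices differ only in their bias corrections, which a short sketch makes concrete (illustrative helper computing all three from one-way ANOVA sums of squares):

```python
def variance_explained(ss_between, ss_within, k, n):
    """eta^2, omega^2 and epsilon^2 for a one-way ANOVA.

    k = number of groups, n = total number of observations.
    """
    ss_total = ss_between + ss_within
    ms_within = ss_within / (n - k)          # error mean square
    eta2 = ss_between / ss_total
    omega2 = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
    eps2 = (ss_between - (k - 1) * ms_within) / ss_total
    return eta2, omega2, eps2
```

For positive effects the ordering $\omega^2 \leq \varepsilon^2 \leq \eta^2$ holds, as stated in Section 8.1.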

3.9 Cohen's $f$ and $f^2$ — Effect Size for ANOVA and Regression

Cohen's $f$ converts variance-explained effect sizes into a ratio suitable for power analysis:

$$f = \sqrt{\frac{\eta^2}{1 - \eta^2}}$$

Or from $\omega^2$:

$$f = \sqrt{\frac{\omega^2}{1 - \omega^2}}$$

Cohen's $f^2$ is used for multiple regression and includes several variants.

Global $f^2$ (overall model fit):

$$f^2_{global} = \frac{R^2}{1 - R^2}$$

Local $f^2$ (effect of a specific predictor or set of predictors, controlling for others):

$$f^2_{local} = \frac{R^2_{full} - R^2_{reduced}}{1 - R^2_{full}} = \frac{\Delta R^2}{1 - R^2_{full}}$$

3.10 Pearson's $r$ and $r^2$ — Correlation and Coefficient of Determination

Pearson's $r$ is the effect size for the linear relationship between two continuous variables:

$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2 \cdot \sum_{i=1}^n(y_i-\bar{y})^2}}$$

$r^2$ (the coefficient of determination) is the proportion of variance in $Y$ explained by $X$:

$$r^2 = 1 - \frac{SS_{residual}}{SS_{total}}$$

Confidence interval for $r$ using Fisher's $z$-transformation:

$$z_r = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) = \operatorname{arctanh}(r)$$

$$SE_{z_r} = \frac{1}{\sqrt{n-3}}$$

95% CI for $z_r$:

$$z_r \pm 1.96 \cdot \frac{1}{\sqrt{n-3}}$$

Converting CI bounds back to $r$:

$$r = \frac{e^{2z_r} - 1}{e^{2z_r} + 1} = \tanh(z_r)$$
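These steps chain into a short routine. A sketch using Python's built-in `math.atanh`/`math.tanh` (the function name is illustrative):

```python
import math

def r_ci_fisher(r, n, z=1.96):
    """CI for Pearson r via Fisher's z-transformation (n = sample size)."""
    zr = math.atanh(r)              # z_r = arctanh(r)
    se = 1 / math.sqrt(n - 3)       # SE of z_r
    return math.tanh(zr - z * se), math.tanh(zr + z * se)
```

The back-transformed interval is asymmetric around $r$, which is exactly what the bounded $[-1, 1]$ scale requires.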

3.11 Odds Ratio, Risk Ratio, and Number Needed to Treat

For binary outcomes (event vs. no event) in two groups, the primary effect sizes are built from the $2 \times 2$ contingency table:

| | Event | No Event | Total |
|---|---|---|---|
| Group 1 (Treatment) | $a$ | $b$ | $a+b$ |
| Group 2 (Control) | $c$ | $d$ | $c+d$ |

Risk (probability of the event) in each group:

$$p_1 = \frac{a}{a+b}, \quad p_2 = \frac{c}{c+d}$$

Absolute Risk Difference (ARD):

$$\text{ARD} = p_1 - p_2$$

Risk Ratio (Relative Risk, RR):

$$\text{RR} = \frac{p_1}{p_2} = \frac{a/(a+b)}{c/(c+d)}$$

Odds Ratio (OR):

$$\text{OR} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{a/b}{c/d} = \frac{ad}{bc}$$

Number Needed to Treat (NNT):

$$\text{NNT} = \frac{1}{\lvert \text{ARD} \rvert} = \frac{1}{\lvert p_1 - p_2 \rvert}$$

NNT is the number of patients who must receive the treatment for one additional patient to benefit (or be harmed, in which case it is expressed as NNH — Number Needed to Harm).

95% CI for the log Odds Ratio:

$$\ln(\widehat{\text{OR}}) \pm 1.96 \times SE_{\ln(OR)}$$

Where:

$$SE_{\ln(OR)} = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

Back-transforming:

$$\text{OR}_{95\%\text{CI}} = \left[e^{\ln(\widehat{\text{OR}}) - 1.96 \cdot SE}, \; e^{\ln(\widehat{\text{OR}}) + 1.96 \cdot SE}\right]$$
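All of the $2 \times 2$ quantities above come from the four cell counts. A compact sketch (illustrative function name; `a, b, c, d` follow the table's layout):

```python
import math

def two_by_two(a, b, c, d, z=1.96):
    """ARD, RR, OR with 95% CI, and NNT from a 2x2 table.

    a/b = events/non-events in group 1; c/d = the same in group 2.
    """
    p1, p2 = a / (a + b), c / (c + d)
    ard = p1 - p2
    rr = p1 / p2
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    ci = (math.exp(math.log(odds_ratio) - z * se_log_or),
          math.exp(math.log(odds_ratio) + z * se_log_or))
    return {"ARD": ard, "RR": rr, "OR": odds_ratio, "OR_CI": ci,
            "NNT": 1 / abs(ard)}
```

For $a = 20$, $b = 80$, $c = 10$, $d = 90$ this gives ARD = 0.10, RR = 2.0, OR = 2.25 and NNT = 10.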

3.12 Effect Sizes for Categorical Association

Phi coefficient ($\phi$) for $2 \times 2$ tables:

$$\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$

Equivalent to Pearson $r$ for two binary variables. Ranges from $-1$ to $+1$.

Cramér's $V$ for $r \times c$ contingency tables ($r$ rows, $c$ columns):

$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}$$

Ranges from 0 (no association) to 1 (perfect association).

Cohen's $w$ for goodness-of-fit and $\chi^2$ tests:

$$w = \sqrt{\sum_{i=1}^{k} \frac{(P_{0i} - P_{1i})^2}{P_{0i}}}$$

Where $P_{0i}$ are the null (expected) proportions and $P_{1i}$ are the alternative (observed/hypothesised) proportions.

3.13 Rank-Biserial Correlation

The rank-biserial correlation ($r_{rb}$) is the effect size for the Mann-Whitney U test (the non-parametric alternative to Cohen's $d$ when normality is not assumed):

$$r_{rb} = 1 - \frac{2U}{n_1 n_2}$$

(with the other group's statistic $U' = n_1 n_2 - U$, the equivalent form is $r_{rb} = \frac{2U'}{n_1 n_2} - 1$). Or equivalently, from mean ranks:

$$r_{rb} = \frac{\bar{R}_1 - \bar{R}_2}{n/2}$$

Where $\bar{R}_1$ and $\bar{R}_2$ are the mean ranks of the two groups, and $n = n_1 + n_2$.

Ranges from $-1$ to $+1$. $r_{rb} = 0.5$ means that 75% of pairwise comparisons favour Group 1 over Group 2.
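Converting a reported Mann-Whitney $U$ into $r_{rb}$ is a one-liner. A sketch (illustrative name; pass the $U$ computed for Group 1):

```python
def rank_biserial_from_u(u, n1, n2):
    """Rank-biserial correlation from the Mann-Whitney U of group 1."""
    return 1 - 2 * u / (n1 * n2)
```

$U = 0$ (every Group 1 observation outranks every Group 2 observation) gives $r_{rb} = 1$; $U = n_1 n_2 / 2$ gives $r_{rb} = 0$.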


4. Assumptions of Effect Size Estimation

4.1 Correct Scale and Direction of Variables

Effect sizes are only meaningful when variables are measured on an appropriate scale and when the direction of differences is clearly defined.

Why it matters: Reversing the direction of scoring (e.g., higher score = worse outcome vs. higher score = better outcome) changes the sign of the effect size. Ambiguous scoring leads to misinterpretation.

How to check: Before computing any effect size, clearly state:

- which direction of scoring represents the "better" (or higher) outcome;
- which group or condition is the reference, so that the sign of the effect is unambiguous.

4.2 Normally Distributed Populations (for the $d$-family)

Cohen's $d$, Hedges' $g$, and Glass's $\Delta$ assume that the observed scores come from normally distributed populations. Violations of normality can distort the pooled standard deviation and produce misleading effect size estimates.

How to check: Inspect histograms and Q-Q plots for each group, and apply a formal normality test (e.g., Shapiro-Wilk) if desired.

When violated: Use the rank-biserial correlation ($r_{rb}$) for the Mann-Whitney test or the common language effect size (CL) as alternatives that do not assume normality.

4.3 Equal or Known Population Variances

Cohen's $d$ uses the pooled standard deviation, which implicitly assumes homogeneity of variance. When population variances differ substantially:

- the pooled SD no longer estimates a single, common population SD;
- the resulting $d$ depends on the relative group sizes and becomes hard to interpret.

Variance ratio rule of thumb: If $s^2_{larger}/s^2_{smaller} > 4$, consider using Glass's $\Delta$ rather than Cohen's $d$.

4.4 Independence of Observations

Effect sizes based on means (Cohen's $d$), correlations ($r$), and variance-explained measures ($\eta^2$) all assume that observations are independent of each other.

When violated: Clustered or nested data (e.g., students within classrooms, repeated measurements within people) break this assumption; effect sizes should then be computed within an appropriate multilevel or repeated-measures framework rather than from pooled raw scores.

4.5 Adequate Sample Size for Stable Estimates

Effect size estimates are very unstable in small samples. The sampling variability of $d$ can be enormous with $n < 10$ per group:

| $n$ per group | $SE_d$ (for true $d = 0.5$) | 95% CI width |
|---|---|---|
| 5 | 0.70 | 2.74 |
| 10 | 0.46 | 1.80 |
| 20 | 0.33 | 1.28 |
| 50 | 0.21 | 0.81 |
| 100 | 0.15 | 0.57 |
| 200 | 0.10 | 0.40 |

This table shows that with only 5 observations per group, the 95% CI for $d = 0.5$ spans nearly 3 standard deviation units — essentially uninformative. Effect sizes require adequate sample sizes to be interpretable.

4.6 No Selective Reporting (Publication Bias)

When effect sizes are extracted from published literature, they are subject to publication bias — studies with larger, significant effects are more likely to be published than those with small, non-significant effects. This means that the average published effect size overestimates the true population effect.

Remedies:

- inspect funnel plots for asymmetry when meta-analysing published effects;
- apply bias-adjustment methods such as trim-and-fill;
- give greater weight to pre-registered studies and registered reports.


5. Types of Effect Sizes

5.1 The Three Families of Effect Sizes

The $d$-Family (Standardised Mean Differences)

These effect sizes express the difference between means in standard deviation units.

| Effect Size | Formula | Standardiser | Best For |
|---|---|---|---|
| Cohen's $d$ | $(\bar{x}_1-\bar{x}_2)/s_{pooled}$ | Pooled SD | Independent samples, equal variances |
| Hedges' $g$ | $d \times J$ | Pooled SD (bias-corrected) | Small samples ($n < 20$ per group) |
| Glass's $\Delta$ | $(\bar{x}_1-\bar{x}_2)/s_{control}$ | Control group SD | Unequal variances; treatment-control |
| $d_{paired}$ | $\bar{d}/s_d$ | SD of differences | Paired/repeated measures |
| $d_{av}$ | $\bar{d}/s_{av}$ | Average of group SDs | Paired; avoids population assumption |
| $d_z$ | $t/\sqrt{n}$ | SD of differences | Paired; directly from the $t$-test |

The $r$-Family (Variance-Explained and Correlation)

These effect sizes express how much of the total variance is explained by the effect.

| Effect Size | Formula | Range | Best For |
|---|---|---|---|
| Pearson $r$ | $\text{Cov}(X,Y)/(s_X s_Y)$ | $[-1, 1]$ | Linear correlation |
| $r^2$ | $SS_{explained}/SS_{total}$ | $[0, 1]$ | Simple regression |
| $R^2$ (multiple) | $1 - SS_{res}/SS_{tot}$ | $[0, 1]$ | Multiple regression |
| $\eta^2$ | $SS_{effect}/SS_{total}$ | $[0, 1]$ | One-way ANOVA |
| $\eta_p^2$ | $SS_{effect}/(SS_{effect}+SS_{error})$ | $[0, 1]$ | Factorial ANOVA |
| $\omega^2$ | Bias-corrected $\eta^2$ | $(-\infty, 1]$ | ANOVA (preferred) |
| $\omega_p^2$ | Bias-corrected $\eta_p^2$ | $(-\infty, 1]$ | Factorial ANOVA (preferred) |
| $\varepsilon^2$ | Alternative bias correction | $(-\infty, 1]$ | One-way ANOVA |
| Cohen's $f$ | $\sqrt{\eta^2/(1-\eta^2)}$ | $[0, \infty)$ | Power analysis for ANOVA |
| Cohen's $f^2$ | $R^2/(1-R^2)$ | $[0, \infty)$ | Power analysis for regression |
| Rank-biserial $r$ | $1 - 2U/(n_1 n_2)$ | $[-1, 1]$ | Mann-Whitney U test |

The Risk/Odds Family

These effect sizes are appropriate for binary outcomes.

| Effect Size | Formula | Range | Best For |
|---|---|---|---|
| Absolute Risk Difference | $p_1 - p_2$ | $[-1, 1]$ | Clinical decision-making |
| Risk Ratio (RR) | $p_1/p_2$ | $(0, \infty)$ | Prospective / cohort studies |
| Odds Ratio (OR) | $(p_1/(1-p_1))/(p_2/(1-p_2))$ | $(0, \infty)$ | Case-control studies |
| Number Needed to Treat | $1/\lvert p_1 - p_2 \rvert$ | $(1, \infty)$ | Clinical applicability |
| Phi ($\phi$) | $\sqrt{\chi^2/n}$ (for $2\times 2$) | $[-1, 1]$ | $2\times 2$ tables |
| Cramér's $V$ | $\sqrt{\chi^2/(n \cdot \min(r-1,c-1))}$ | $[0, 1]$ | $r \times c$ tables |
| Cohen's $w$ | $\sqrt{\sum(P_0-P_1)^2/P_0}$ | $[0, \infty)$ | $\chi^2$ goodness-of-fit |

5.2 Choosing the Right Effect Size

The table below provides a quick reference for selecting the appropriate effect size measure based on your statistical test:

| Statistical Test | Effect Size to Report | Notes |
|---|---|---|
| t-test (independent samples) | Cohen's $d$ or Hedges' $g$ | Hedges' $g$ preferred for small samples ($n < 20$) |
| t-test (one sample or paired) | Cohen's $d_{paired}$ or $d_z$ | Use $d_z$ when comparing to a known parameter |
| One-way ANOVA | $\omega^2$ (preferred), $\eta^2$ (common) | $\omega^2$ is less biased; $\eta^2$ tends to overestimate |
| Factorial ANOVA | $\omega_p^2$ (preferred), $\eta_p^2$ (common) | Use partial versions for factorial designs |
| Multiple regression | $R^2$, adjusted $R^2$, Cohen's $f^2$ | Cohen's $f^2$ for local/effect-size-specific measures |
| Correlation | Pearson $r$, $r^2$ | $r^2$ shows variance explained |
| Chi-squared ($2 \times 2$) | Phi ($\phi$) | Special case of Pearson $r$ for $2 \times 2$ tables |
| Chi-squared ($r \times c$) | Cramér's $V$ | Generalized version of Phi for larger tables |
| Risk comparison (binary, two groups) | Risk Ratio + ARD + NNT | ARD = Absolute Risk Difference; NNT = Number Needed to Treat |
| Case-control study (binary) | Odds Ratio | Standard measure for case-control studies |
| Mann-Whitney U / Wilcoxon | Rank-biserial correlation ($r_{rb}$) | Non-parametric alternative to $r$ |

6. Using the Effect Size Calculator Component

The Effect Size Calculator component in DataStatPro provides a comprehensive tool for computing, visualising, and interpreting effect sizes across all major statistical designs.

Step-by-Step Guide

Step 1 — Select the Effect Size Family

Choose from the "Effect Size Type" dropdown:

- Mean differences ($d$-family)
- Variance explained ($r$-family)
- Associations and categorical data (risk/odds family)

Step 2 — Select the Specific Effect Size

Based on your design, select the specific effect size (for example, Cohen's $d$, Hedges' $g$, $\omega^2$, or the odds ratio).

💡 Recommendation: For ANOVA, always compute and report $\omega^2$ (or $\omega_p^2$ for factorial designs) in addition to or instead of $\eta^2$. Omega squared is less biased and is increasingly required by journals.

Step 3 — Input Method

Choose how to provide the data: raw data, summary statistics, or reported test statistics.

💡 Tip: When computing effect sizes from published papers that only report test statistics, use the "From test statistics" input method. For example, $d = t\sqrt{1/n_1 + 1/n_2}$ and $\eta^2 = F \cdot df_{between} / (F \cdot df_{between} + df_{within})$.

Step 4 — Specify Design Details

Step 5 — Select Confidence Level

Choose the confidence level for intervals (default: 95%). The application computes exact (non-central) intervals where available, alongside large-sample approximations.

Step 6 — Select Benchmarks

Choose the benchmark system for interpreting the magnitude:

- Cohen (1988) conventional benchmarks
- Sawilowsky (2009) extended benchmarks
- Funder & Ozer (2019) benchmarks for social and behavioural research

⚠️ Important: Cohen's benchmarks were intended as rough conventions when no better information is available. Always prioritise domain-specific benchmarks and contextual interpretation over generic small/medium/large labels.

Step 7 — Display Options

Select which outputs and visualisations to display.

Step 8 — Run the Calculation

Click "Calculate Effect Size". The application will:

  1. Compute the requested effect size(s) from the provided data or statistics.
  2. Construct confidence intervals using the appropriate method.
  3. Classify the magnitude using the selected benchmark system.
  4. Generate all selected visualisations.
  5. Provide an interpretation paragraph in plain language.

7. Effect Sizes for Mean Differences

7.1 Cohen's $d$ for Independent Samples — Full Procedure

Step 1 — Compute group means and standard deviations

$$\bar{x}_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} x_{1i}, \quad \bar{x}_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} x_{2i}$$

$$s_1 = \sqrt{\frac{1}{n_1-1}\sum_{i=1}^{n_1}(x_{1i}-\bar{x}_1)^2}, \quad s_2 = \sqrt{\frac{1}{n_2-1}\sum_{i=1}^{n_2}(x_{2i}-\bar{x}_2)^2}$$

Step 2 — Compute the pooled standard deviation

$$s_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$$

Step 3 — Compute Cohen's $d$

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$$

Step 4 — Apply Hedges' correction (especially if $n < 20$ per group)

$$g = d \times \left(1 - \frac{3}{4(n_1 + n_2 - 2) - 1}\right)$$

Step 5 — Compute the 95% CI (exact, via the non-central $t$)

The exact CI is computed numerically. The approximate 95% CI is:

$$d \pm 1.96 \times \sqrt{\frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2 - 2)}}$$

Step 6 — Compute the Common Language Effect Size (CL)

The Common Language Effect Size (McGraw & Wong, 1992) is the probability that a randomly selected person from Group 1 scores higher than a randomly selected person from Group 2:

$$CL = \Phi\left(\frac{d}{\sqrt{2}}\right)$$

Where $\Phi$ is the standard normal CDF.

$CL = 0.50$ means a 50% probability of superiority (no effect); $CL = 0.75$ means that 75% of the time, a person from Group 1 outscores a person from Group 2.

7.2 Cohen's Benchmark Classification for $d$ and $g$

Cohen (1988) proposed the following conventions, intended as rough guides only:

| Cohen's $d$ | Verbal Label | Equivalent $r$ | Overlap (%) |
|---|---|---|---|
| 0.00 | No effect | 0.00 | 100% |
| 0.20 | Small | 0.10 | 85% |
| 0.50 | Medium | 0.24 | 67% |
| 0.80 | Large | 0.37 | 53% |
| 1.20 | Very large | 0.51 | 40% |
| 2.00 | Huge | 0.71 | 18% |

⚠️ Cohen himself warned against mechanical application of these benchmarks. He stated: "The effect size conventions are offered as conventions of last resort, to be used only when no better basis for setting the ES is available." Always contextualise effect sizes within your specific research domain.

Extended benchmarks (Sawilowsky, 2009):

| Label | $d$ |
|---|---|
| Tiny | $< 0.10$ |
| Very small | $0.10 - 0.19$ |
| Small | $0.20 - 0.49$ |
| Medium | $0.50 - 0.79$ |
| Large | $0.80 - 1.19$ |
| Very large | $1.20 - 1.99$ |
| Huge | $\geq 2.00$ |

7.3 Variance Overlap Statistics (Cohen's $U$)

To complement Cohen's $d$, three overlap statistics provide intuitive, probabilistic interpretations of the separation between two normal distributions:

$U_1$: The proportion of the combined distributions that is NOT overlapping:

$$U_1 = \frac{2\Phi(\lvert d \rvert/2) - 1}{\Phi(\lvert d \rvert/2)}$$

$U_2$: The proportion of one distribution that exceeds the same proportion in the other distribution:

$$U_2 = \Phi\left(\frac{\lvert d \rvert}{2}\right)$$

$U_3$ (Cohen's $U_3$): The proportion of the treatment distribution that exceeds the median of the control distribution:

$$U_3 = \Phi(\lvert d \rvert)$$

Example for $d = 0.50$:

$$U_3 = \Phi(0.50) = 0.691 \to 69.1\%$$

Interpretation: 69.1% of the treatment group scores above the median of the control group.

| $d$ | $U_3$ | CL (%) | Overlap (%) |
|---|---|---|---|
| 0.20 | 57.9% | 55.6% | 85.3% |
| 0.50 | 69.1% | 63.8% | 66.9% |
| 0.80 | 78.8% | 71.4% | 52.5% |
| 1.00 | 84.1% | 76.0% | 44.8% |
| 1.50 | 93.3% | 85.6% | 28.1% |
| 2.00 | 97.7% | 92.1% | 16.9% |
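The table's columns can all be derived from $\Phi$ alone. A sketch using `math.erf` for the normal CDF (illustrative function names; overlap is reported as $1 - U_1$):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def overlap_stats(d):
    """Cohen's U1, U2, U3, the common language effect size, and overlap."""
    d = abs(d)
    u2 = norm_cdf(d / 2)
    u1 = (2 * u2 - 1) / u2           # Cohen's U1: proportion of non-overlap
    u3 = norm_cdf(d)                 # proportion above the control median
    cl = norm_cdf(d / math.sqrt(2))  # probability of superiority
    return u1, u2, u3, cl, 1 - u1
```

For $d = 0.50$ this reproduces the table row: $U_3 \approx 69.1\%$, CL $\approx 63.8\%$, overlap $\approx 67\%$.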

7.4 Computing $d$ from Common Test Statistics

When raw data are unavailable, $d$ can be computed from reported test statistics.

From an independent-samples $t$-test:

$$d = t \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = \frac{t\sqrt{n_1 + n_2}}{\sqrt{n_1 n_2}}$$

From a one-sample or paired $t$-test:

$$d_z = \frac{t}{\sqrt{n}}$$

From an $F$-ratio (two-group ANOVA, $df_1 = 1$):

$$d = \sqrt{F \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

From a $\chi^2$ statistic (for $\phi$ or Cramér's $V$):

$$\phi = \sqrt{\frac{\chi^2}{n}}, \quad V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1,c-1)}}$$

Converting between effect size families:

$$r = \frac{d}{\sqrt{d^2 + \frac{(n_1+n_2)^2}{n_1 n_2}}}, \quad d = \frac{2r}{\sqrt{1-r^2}} \quad \text{(the latter for equal group sizes)}$$

$$\eta^2 = \frac{d^2}{d^2+4} \quad \text{(for equal group sizes)}$$
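These recovery formulas are handy when extracting effect sizes from published results. A sketch (illustrative function names):

```python
import math

def d_from_t(t, n1, n2):
    """Cohen's d from an independent-samples t statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)

def eta2_from_f(f, df_between, df_within):
    """eta^2 from an F ratio and its degrees of freedom."""
    return f * df_between / (f * df_between + df_within)
```

For example, $t = 2.5$ with 50 per group gives $d = 0.50$; $F(2, 57) = 4.0$ gives $\eta^2 \approx 0.123$.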


8. Effect Sizes for Variance Explained

8.1 $\eta^2$, $\omega^2$, and $\varepsilon^2$ Comparison

For a one-way ANOVA with $K$ groups and $n$ total observations, given the ANOVA table:

| Source | SS | df | MS |
|---|---|---|---|
| Between (Effect) | $SS_B$ | $K-1$ | $MS_B = SS_B/(K-1)$ |
| Within (Error) | $SS_W$ | $n-K$ | $MS_W = SS_W/(n-K)$ |
| Total | $SS_T$ | $n-1$ | |

Eta squared:

$$\eta^2 = \frac{SS_B}{SS_T}$$

Omega squared (preferred):

$$\omega^2 = \frac{SS_B - (K-1)MS_W}{SS_T + MS_W}$$

Epsilon squared:

$$\varepsilon^2 = \frac{SS_B - (K-1)MS_W}{SS_T}$$

Relationship: $\omega^2 \leq \varepsilon^2 \leq \eta^2$

All three measure the same quantity (proportion of variance explained), but $\omega^2$ and $\varepsilon^2$ are corrected for the positive bias of $\eta^2$ in finite samples.

8.2 Benchmark Interpretations for Variance-Explained Effect Sizes

Cohen (1988) benchmarks:

| Label | $\eta^2$ or $\omega^2$ | $f$ | $f^2$ | $r$ or $R$ |
|---|---|---|---|---|
| Small | 0.01 | 0.10 | 0.02 | 0.10 |
| Medium | 0.06 | 0.25 | 0.15 | 0.30 |
| Large | 0.14 | 0.40 | 0.35 | 0.50 |

Note on $\eta^2$ benchmarks: These were established when $\eta^2$ was the standard report. Since $\omega^2$ and $\varepsilon^2$ are systematically smaller than $\eta^2$, the same verbal benchmarks do not transfer directly. Use the $f$ or $f^2$ conversions for power analysis regardless of which variance-explained index you report.

8.3 Generalised Eta Squared ($\eta_G^2$)

Generalised eta squared ($\eta_G^2$; Olejnik & Algina, 2003) is designed for comparison across studies with different designs by distinguishing between manipulated and measured sources of variance:

$$\eta_G^2 = \frac{SS_{effect}}{SS_{effect} + \sum_{m} SS_{measured} + SS_{error}}$$

Where the summation is over all measured (non-manipulated) variables in the design.

$\eta_G^2$ is more comparable across different experimental designs (between-subjects, within-subjects, mixed) than either $\eta^2$ or $\eta_p^2$, and it is increasingly recommended for factorial and mixed ANOVA designs.

8.4 R2R^2 and Adjusted R2R^2 for Regression

The coefficient of determination R2R^2 is the proportion of variance in YY explained by the regression model:

R2=1SSresidualSStotal=SSregressionSStotalR^2 = 1 - \frac{SS_{residual}}{SS_{total}} = \frac{SS_{regression}}{SS_{total}}

Adjusted R2R^2 corrects for the number of predictors pp in the model:

Radj2=1(1R2)n1np1R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p-1}

Adjusted R2R^2 can be negative when the model fits worse than a horizontal line.

R2R^2 change (ΔR2\Delta R^2) for evaluating the increment from adding predictors:

ΔR2=Rfull2Rreduced2\Delta R^2 = R^2_{full} - R^2_{reduced}

Cohen's f2f^2 for the increment:

flocal2=ΔR21Rfull2f^2_{local} = \frac{\Delta R^2}{1 - R^2_{full}}
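As a quick numeric check of the formulas above, here is a minimal Python sketch; the $R^2$ values and the $n = 100$, $p = 5$ model are hypothetical illustrations, not output from DataStatPro:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def f2_local(r2_full, r2_reduced):
    """Cohen's f^2 for the increment of the full model over the reduced model."""
    return (r2_full - r2_reduced) / (1 - r2_full)

# Hypothetical values: a 5-predictor model with R^2 = .30 in n = 100 cases,
# where dropping one predictor block leaves R^2 = .25.
print(round(adjusted_r2(0.30, 100, 5), 4))  # → 0.2628
print(round(f2_local(0.30, 0.25), 4))       # → 0.0714
```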

8.5 Confidence Intervals for η2\eta^2 and ω2\omega^2

CIs for variance-explained effect sizes use the non-central F-distribution. The observed FF-ratio has a non-central FF distribution:

FF(dfbetween,dfwithin,λ)F \sim F'(df_{between}, df_{within}, \lambda)

Where λ\lambda is the non-centrality parameter:

λ=η2n1η2\lambda = \frac{\eta^2 \cdot n}{1 - \eta^2}

The CI bounds for λ\lambda are found numerically, then converted to η2\eta^2:

ηL2=λLλL+n,ηU2=λUλU+n\eta^2_L = \frac{\lambda_L}{\lambda_L + n}, \quad \eta^2_U = \frac{\lambda_U}{\lambda_U + n}

For ω2\omega^2 and ε2\varepsilon^2, a transformation approach is used: first compute the CI for η2\eta^2 (or ff), then convert to the desired effect size.
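The numerical search for the $\lambda$ bounds can be sketched as follows. This is an illustration only, assuming SciPy is available; it is not the DataStatPro implementation:

```python
from scipy.stats import ncf          # non-central F distribution
from scipy.optimize import brentq    # 1-D root finder

def eta2_ci(f_obs, df1, df2, n, conf=0.95):
    """CI for eta^2 via the non-central F pivot (sketch)."""
    alpha = 1 - conf

    def lam_bound(p):
        # Find lambda such that P(F' <= f_obs | lambda) = p.
        g = lambda lam: ncf.cdf(f_obs, df1, df2, lam) - p
        if g(0) < 0:      # bound pinned at zero (observed F too small)
            return 0.0
        return brentq(g, 0, 10 * n)

    lam_lo = lam_bound(1 - alpha / 2)   # lower bound: f_obs at the 97.5th pct
    lam_hi = lam_bound(alpha / 2)       # upper bound: f_obs at the 2.5th pct
    return lam_lo / (lam_lo + n), lam_hi / (lam_hi + n)

# Example 2 of this guide: F(2, 87) = 8.94, n = 90.
lo, hi = eta2_ci(8.94, 2, 87, 90)
```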


9. Effect Sizes for Associations and Categorical Data

9.1 Pearson rr — Correlation Effect Size

The Pearson correlation rr is simultaneously a descriptive statistic and an effect size. It requires no additional calculation — the correlation coefficient itself IS the standardised effect size for the strength of a linear relationship.

Benchmarks for rr:

| $\lvert r \rvert$ | Cohen (1988) | Funder & Ozer (2019) |
|---|---|---|
| $< 0.10$ | Negligible | Very small (potentially negligible) |
| $0.10 - 0.29$ | Small | Small |
| $0.30 - 0.49$ | Medium | Medium / large |
| $\geq 0.50$ | Large | Very large |

💡 Funder & Ozer (2019) argued that Cohen's benchmarks are too conservative for social/behavioural science, where r=0.30r = 0.30 is actually a large effect in practice. Consider the base rates in your field when applying benchmarks.

9.2 Interpreting the Odds Ratio

The Odds Ratio (OR) is the most common effect size in case-control studies and logistic regression.

| OR | Interpretation |
|---|---|
| 1.0 | No difference in odds between groups |
| $> 1.0$ | Increased odds of the event in Group 1 vs. Group 2 |
| $< 1.0$ | Decreased odds of the event in Group 1 vs. Group 2 |
| 2.0 | Twice the odds |
| 0.5 | Half the odds (equivalent to OR = 2.0 in the opposite direction) |

Benchmark (Chen et al., 2010 for medical research):

| $\lvert \ln(\text{OR}) \rvert$ | OR | Label |
|---|---|---|
| 0.20 | 1.22 | Small |
| 0.50 | 1.65 | Medium |
| 0.80 | 2.23 | Large |

Converting OR to Cohen's dd (for meta-analytic purposes):

d=ln(OR)3πln(OR)×0.5513d = \frac{\ln(OR) \cdot \sqrt{3}}{\pi} \approx \ln(OR) \times 0.5513

Converting OR to rr:

r=dd2+4r = \frac{d}{\sqrt{d^2 + 4}}
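Both conversions can be sketched in a few lines of Python (OR = 3.0 is a hypothetical input chosen for illustration):

```python
import math

def or_to_d(odds_ratio):
    """Convert an odds ratio to Cohen's d via the logit method."""
    return math.log(odds_ratio) * math.sqrt(3) / math.pi

def d_to_r(d):
    """Convert d to r (equal group sizes assumed)."""
    return d / math.sqrt(d * d + 4)

d = or_to_d(3.0)
print(round(d, 3), round(d_to_r(d), 3))  # → 0.606 0.29
```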

9.3 Risk Ratio vs. Odds Ratio — When to Use Which

| Situation | Recommended Effect Size | Why |
|---|---|---|
| Prospective/cohort study | Risk Ratio (RR) | Probabilities are directly estimable |
| Case-control study | Odds Ratio (OR) | Incidence not estimable; OR is invariant |
| Clinical trial (binary outcome) | RR + ARD + NNT | All provide different, complementary information |
| Rare events ($p < 0.10$) | Either (OR $\approx$ RR when rare) | OR approximates RR for rare outcomes |
| Common events ($p > 0.10$) | RR preferred | OR exaggerates the effect vs. RR |
| Logistic regression | OR | Natural output of the logistic model |

⚠️ A common mistake is interpreting an Odds Ratio as a Risk Ratio when the event is common (p>0.10p > 0.10). The OR always exaggerates the RR when the event is common. For example, OR = 3.0 may correspond to RR = 2.0 when the control event rate is 30%. Always report the ARD and NNT alongside OR or RR for clinical interpretability.

9.4 Number Needed to Treat (NNT)

The NNT is one of the most clinically interpretable effect sizes:

NNT=1ptreatmentpcontrol=1ARD\text{NNT} = \frac{1}{\lvert p_{treatment} - p_{control} \rvert} = \frac{1}{\lvert \text{ARD} \rvert}

A treatment with NNT = 5 means that on average, 5 patients must be treated for 1 additional patient to benefit compared to control.

| NNT | Clinical Interpretation |
|---|---|
| 1 | Every treated patient benefits (perfect) |
| 2 – 5 | Excellent; highly effective treatment |
| 5 – 10 | Good; meaningful clinical benefit |
| 10 – 50 | Moderate; benefit for a minority of treated patients |
| > 50 | Small; many patients treated for little benefit |
| $\infty$ | No treatment benefit (ARD = 0) |

95% CI for NNT (Altman method):

NNTCI=1ARD±1.96×SEARD\text{NNT}_{CI} = \frac{1}{\text{ARD} \pm 1.96 \times SE_{ARD}}

Where SEARD=p1(1p1)n1+p2(1p2)n2SE_{ARD} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}.

⚠️ NNT CIs can be awkward when the CI for ARD includes 0, producing a CI that spans from a negative NNT (Needed to Harm, NNH) to a positive NNT through an infinite discontinuity. In this case, report both sides of the CI as NNT and NNH separately.
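The calculation can be sketched in Python, using the vaccine numbers from Example 3 later in this guide (20/1000 vs. 80/1000 events); note the sketch assumes the ARD CI excludes zero:

```python
import math

def nnt_with_ci(p1, n1, p2, n2, z=1.96):
    """ARD, NNT, and an Altman-style 95% CI for the NNT (sketch)."""
    ard = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    lo, hi = ard - z * se, ard + z * se
    nnt = 1 / abs(ard)
    # Only valid when the ARD CI excludes zero; otherwise report NNT and NNH separately.
    nnt_ci = sorted((1 / abs(lo), 1 / abs(hi)))
    return ard, nnt, nnt_ci

# Example 3 of this guide: 2% event rate (treated) vs. 8% (control), n = 1000 each.
ard, nnt, ci = nnt_with_ci(0.02, 1000, 0.08, 1000)
```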

9.5 Cramér's VV for Multi-Way Tables

Cramér's VV is the standard effect size for χ2\chi^2 tests of independence in tables larger than 2×22 \times 2:

V=χ2nmin(r1,c1)V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}

Benchmarks (Cohen, 1988, adjusted by min(r1,c1)\min(r-1,c-1)):

| $\min(r-1, c-1)$ | Small | Medium | Large |
|---|---|---|---|
| 1 ($2 \times 2$) | 0.10 | 0.30 | 0.50 |
| 2 ($3 \times 3$) | 0.07 | 0.21 | 0.35 |
| 3 ($4 \times 4$) | 0.06 | 0.17 | 0.29 |
| 4 ($5 \times 5$) | 0.05 | 0.15 | 0.25 |

Corrected Cramér's $\tilde{V}$ (Bergsma, 2013 correction for small samples and sparse tables):

$\tilde{V} = \sqrt{\frac{\tilde{\phi}^2}{\min(\tilde{r}-1,\ \tilde{c}-1)}}$

Where $\tilde{\phi}^2 = \max\!\left(0,\ \frac{\chi^2}{n} - \frac{(r-1)(c-1)}{n-1}\right)$, $\tilde{r} = r - \frac{(r-1)^2}{n-1}$, and $\tilde{c} = c - \frac{(c-1)^2}{n-1}$.

The correction removes the positive bias of $V$ and is recommended for small samples or sparse tables.
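Both the uncorrected and the Bergsma-corrected statistics can be sketched in Python. The sketch takes $\chi^2$ as given rather than computing it from raw counts, and the corrected variant follows Bergsma's (2013) formulas:

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a chi-squared statistic and table dimensions."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

def cramers_v_corrected(chi2, n, r, c):
    """Bias-corrected Cramér's V (Bergsma, 2013) -- a sketch."""
    phi2_t = max(0.0, chi2 / n - (r - 1) * (c - 1) / (n - 1))
    r_t = r - (r - 1) ** 2 / (n - 1)   # bias-adjusted row dimension
    c_t = c - (c - 1) ** 2 / (n - 1)   # bias-adjusted column dimension
    return math.sqrt(phi2_t / min(r_t - 1, c_t - 1))

# Example 4 of this guide: chi^2(4) = 22.8, n = 300, 3x3 table.
print(round(cramers_v(22.8, 300, 3, 3), 3))  # → 0.195
```

The corrected value for the same inputs is slightly smaller, as expected for a bias correction.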


10. Model Fit and Evaluation

10.1 Evaluating Effect Size Precision — The Confidence Interval

The primary evaluation criterion for an effect size is its confidence interval (CI). The CI communicates both the direction and magnitude of the effect AND the uncertainty around the estimate.

Rules for interpreting effect size CIs:

| CI property | Interpretation |
|---|---|
| CI entirely above zero (or the positive null) | Effect is significantly positive |
| CI entirely below zero (or the negative null) | Effect is significantly negative |
| CI contains zero | Effect is not statistically significant |
| Narrow CI | Precise estimate (large $n$) |
| Wide CI | Imprecise estimate (small $n$); interpret the point estimate cautiously |
| CI lies entirely within the "small" range | Effect is definitely small |
| CI spans from small to large | Effect magnitude is uncertain |

10.2 Precision as a Function of Sample Size

The width of the 95% CI for Cohen's dd decreases as nn increases:

CI Width2×1.96×SEd2×1.96×n1+n2n1n2+d22(n1+n2)\text{CI Width} \approx 2 \times 1.96 \times SE_d \approx 2 \times 1.96 \times \sqrt{\frac{n_1+n_2}{n_1 n_2} + \frac{d^2}{2(n_1+n_2)}}

For equal group sizes (n1=n2=nn_1 = n_2 = n) and d=0.5d = 0.5:

| $n$ per group | Approx. CI width | Interpretation |
|---|---|---|
| 10 | 1.86 | Very imprecise |
| 20 | 1.28 | Imprecise |
| 50 | 0.80 | Moderate precision |
| 100 | 0.57 | Good precision |
| 200 | 0.40 | High precision |
| 500 | 0.25 | Very high precision |

10.3 The Minimal Effect Size of Interest (MESI) — Equivalence Testing

For many applications, researchers are not just interested in whether an effect is non-zero, but in whether it exceeds a minimum meaningful threshold. The minimal effect size of interest (MESI) defines the smallest effect that would be practically or clinically important.

Two One-Sided Tests (TOST) equivalence testing:

Define bounds Δ-\Delta and +Δ+\Delta as the MESI (e.g., Δ=0.20\Delta = 0.20 for a "trivially small" effect). The null hypothesis of the equivalence test is:

$H_0: \lvert d \rvert \geq \Delta$ (the effect is NOT negligible)

$H_1: \lvert d \rvert < \Delta$ (the effect IS negligible)

The equivalence is supported when both one-sided tests reject their respective nulls. Practically, the effect is declared equivalent to zero (negligible) when the 90% CI for dd falls entirely within (Δ,+Δ)(-\Delta, +\Delta).

💡 Equivalence testing is increasingly important for null results. A study that fails to reject H0:d=0H_0: d = 0 does not establish that the effect is zero or negligible — only equivalence testing can establish negligibility.

10.4 Power Analysis Based on Effect Size

Effect sizes are the primary input to a priori power analysis — determining the required sample size before conducting a study.

Required sample size for a two-sample tt-test at power 1β1-\beta and significance α\alpha:

nper  group=2(z1α/2+z1β)2d2n_{per\;group} = \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}

For α=.05\alpha = .05 and power = 0.800.80 (z.975=1.96z_{.975} = 1.96, z.80=0.84z_{.80} = 0.84):

nper  group=2(1.96+0.84)2d2=15.68d2n_{per\;group} = \frac{2(1.96 + 0.84)^2}{d^2} = \frac{15.68}{d^2}

| Cohen's $d$ | $n$ per group (power = 0.80) | $n$ per group (power = 0.90) |
|---|---|---|
| 0.20 (small) | 394 | 527 |
| 0.50 (medium) | 64 | 85 |
| 0.80 (large) | 26 | 34 |
| 1.00 (large) | 17 | 22 |
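The shortcut formula can be sketched in Python using exact normal quantiles. Note that this z-approximation gives slightly smaller $n$ than the exact non-central-t calculation behind the table above (e.g., 63 vs. 64 per group for $d = 0.5$):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group n for a two-sample t-test (z-approximation sketch)."""
    z = NormalDist().inv_cdf
    n = 2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2
    return math.ceil(n)

print(n_per_group(0.5))  # → 63
print(n_per_group(0.2))  # → 393
```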

10.5 Sensitivity Analysis — Detectable Effect for a Given nn

A sensitivity analysis asks: given the sample size already collected, what is the smallest effect size that could be detected with (say) 80% power?

dmin=2(z1α/2+z1β)2nper  groupd_{min} = \sqrt{\frac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{n_{per\;group}}}

For n=30n = 30 per group:

dmin=15.6830=0.523=0.72d_{min} = \sqrt{\frac{15.68}{30}} = \sqrt{0.523} = 0.72

This means a study with n=30n = 30 per group can only reliably detect effects of d0.72d \geq 0.72 (close to Cohen's "large" threshold). Effects smaller than this may exist but will often be missed.
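The same z-approximation can be inverted in one line of Python (a sketch using the shortcut formula above, not DataStatPro output):

```python
import math
from statistics import NormalDist

def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest d detectable at the given power with this per-group n (sketch)."""
    z = NormalDist().inv_cdf
    return math.sqrt(2 * (z(1 - alpha / 2) + z(power)) ** 2 / n_per_group)

print(round(min_detectable_d(30), 2))  # → 0.72
```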

⚠️ Sensitivity analysis should not be confused with post-hoc ("observed") power, which merely restates the p-value and should never be used to "explain" a non-significant result; that reasoning is circular. Sensitivity analysis is valuable for communicating what magnitudes of effect could have been detected, but it does not address whether a true effect exists.

10.6 Comparing Effect Sizes Across Studies

When comparing effect sizes across studies, ensure:

  1. Same family: dd and rr are different families; convert to a common metric.
  2. Same design: Paired dd and independent dd are not directly comparable.
  3. Same sample type: Clinical vs. community samples may have systematically different effect sizes.
  4. Bias correction: Use Hedges' gg (not Cohen's dd) when comparing across studies with different sample sizes.

Converting between families for comparison:

r=dd2+(n1+n2)2n1n2r = \frac{d}{\sqrt{d^2 + \frac{(n_1+n_2)^2}{n_1 n_2}}} (exact)

rdd2+4r \approx \frac{d}{\sqrt{d^2 + 4}} (equal group sizes)

d2r1r2d \approx \frac{2r}{\sqrt{1-r^2}} (equal group sizes)
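These conversions can be sketched in Python; the round-trip below uses hypothetical equal groups of $n = 20$:

```python
import math

def d_to_r_exact(d, n1, n2):
    """Exact d -> r conversion for possibly unequal group sizes."""
    a = (n1 + n2) ** 2 / (n1 * n2)   # reduces to 4 when n1 == n2
    return d / math.sqrt(d * d + a)

def r_to_d(r):
    """Approximate r -> d conversion (equal group sizes)."""
    return 2 * r / math.sqrt(1 - r * r)

r = d_to_r_exact(1.0, 20, 20)            # a = 4, so r = 1/sqrt(5)
print(round(r, 4), round(r_to_d(r), 4))  # → 0.4472 1.0
```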


11. Advanced Topics

11.1 Meta-Analytic Pooling of Effect Sizes

Meta-analysis combines effect sizes from multiple independent studies using weighted averaging. The weight of each study is the inverse of its variance.

Fixed-effects model (assumes all studies estimate the same true effect θ\theta):

θ^FE=i=1kwiθ^ii=1kwi\hat{\theta}_{FE} = \frac{\sum_{i=1}^{k} w_i \hat{\theta}_i}{\sum_{i=1}^{k} w_i}

Where wi=1/viw_i = 1/v_i and vi=SEi2v_i = SE_i^2 is the variance of the effect size in study ii.

Random-effects model (allows true effects to vary across studies, θiN(μ,τ2)\theta_i \sim \mathcal{N}(\mu, \tau^2)):

wi=1vi+τ^2w_i^* = \frac{1}{v_i + \hat{\tau}^2}

μ^RE=i=1kwiθ^ii=1kwi\hat{\mu}_{RE} = \frac{\sum_{i=1}^{k} w_i^* \hat{\theta}_i}{\sum_{i=1}^{k} w_i^*}

Where τ^2\hat{\tau}^2 is the estimated between-study variance (heterogeneity), computed using the DerSimonian-Laird estimator.

Heterogeneity statistics:

Q=i=1kwi(θ^iθ^FE)2χ2(k1)Q = \sum_{i=1}^k w_i(\hat{\theta}_i - \hat{\theta}_{FE})^2 \sim \chi^2(k-1)

I2=max(0,Q(k1)Q)×100%I^2 = \max\left(0, \frac{Q - (k-1)}{Q}\right) \times 100\%

| $I^2$ | Heterogeneity |
|---|---|
| 0 – 25% | Low |
| 25 – 50% | Moderate |
| 50 – 75% | Substantial |
| > 75% | Considerable |
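The fixed-effects, DerSimonian-Laird, and heterogeneity computations above can be sketched together in Python; the two input studies are hypothetical:

```python
def pool_effects(thetas, variances):
    """Fixed- and random-effects (DerSimonian-Laird) pooling -- a sketch."""
    k = len(thetas)
    w = [1 / v for v in variances]                       # inverse-variance weights
    theta_fe = sum(wi * t for wi, t in zip(w, thetas)) / sum(w)
    q = sum(wi * (t - theta_fe) ** 2 for wi, t in zip(w, thetas))
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                   # DL between-study variance
    w_star = [1 / (v + tau2) for v in variances]         # random-effects weights
    theta_re = sum(wi * t for wi, t in zip(w_star, thetas)) / sum(w_star)
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return theta_fe, theta_re, tau2, i2

# Two hypothetical studies: d = 0.5 (v = .01) and d = 0.3 (v = .02).
fe, re, tau2, i2 = pool_effects([0.5, 0.3], [0.01, 0.02])
```

Note how the random-effects estimate sits between the fixed-effects estimate and the unweighted mean, because $\tau^2$ flattens the weights.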

11.2 Effect Sizes for Multilevel and Longitudinal Designs

In multilevel models (e.g., students within schools, patients within hospitals), effect sizes must account for the nested structure.

ICC-based dd for between-cluster effects:

dcluster=μ1μ2σtotal=μ1μ2σbetween2+σwithin2d_{cluster} = \frac{\mu_1 - \mu_2}{\sigma_{total}} = \frac{\mu_1 - \mu_2}{\sqrt{\sigma^2_{between} + \sigma^2_{within}}}

R2R^2 for multilevel models (Nakagawa & Schielzeth, 2013):

Rmarginal2=σf2σf2+σu2+σe2R^2_{marginal} = \frac{\sigma^2_f}{\sigma^2_f + \sigma^2_u + \sigma^2_e} (fixed effects only)

Rconditional2=σf2+σu2σf2+σu2+σe2R^2_{conditional} = \frac{\sigma^2_f + \sigma^2_u}{\sigma^2_f + \sigma^2_u + \sigma^2_e} (fixed + random effects)

Where σf2\sigma^2_f = variance explained by fixed effects, σu2\sigma^2_u = random effects variance, σe2\sigma^2_e = residual variance.

11.3 Standardised vs. Unstandardised Effect Sizes

Not all effect size applications require standardisation. Unstandardised effect sizes (raw mean differences, regression coefficients in original units) are often more informative and actionable than standardised counterparts.

When to use unstandardised effects:

  - The outcome is measured in inherently meaningful units (e.g., days, mm Hg, currency, symptom counts)
  - Communicating results to practitioners, patients, or policymakers

When to use standardised effects:

  - The measurement scale is arbitrary (e.g., questionnaire sum scores)
  - Comparing or pooling effects across studies that used different instruments

The "point of controversy": Some methodologists (Lenth, 2001; Tukey, 1991) argue that standardised effect sizes are frequently misinterpreted and that the denominator (which SD is used) is itself a critical and often overlooked choice.

11.4 Effect Size for Interaction Effects in Factorial ANOVA

Interaction effect sizes in factorial designs require special treatment:

Partial omega squared for interaction:

$\omega^2_{p,A \times B} = \frac{SS_{A \times B} - df_{A \times B} \cdot MS_{error}}{SS_{A \times B} + (N - df_{A \times B}) \cdot MS_{error}}$

Where $N$ is the total sample size.

Generalised eta squared for the interaction:

ηG,AxB2=SSAxBSSAxB+SSerror+mSSmeasured\eta^2_{G,AxB} = \frac{SS_{AxB}}{SS_{AxB} + SS_{error} + \sum_{m}SS_{measured}}

💡 For interaction effects, always compute and report the simple effects (main effects at each level of the other factor) alongside the overall interaction effect size. The interaction effect size alone does not communicate the direction or pattern of the interaction.

11.5 Rank-Based Effect Sizes for Non-Parametric Tests

When parametric assumptions are violated, rank-based effect sizes should be used:

Wilcoxon Signed-Rank test (paired or one-sample):

rW=Znr_W = \frac{Z}{\sqrt{n}}

Where ZZ is the standardised Wilcoxon test statistic.

Kruskal-Wallis test (non-parametric ANOVA equivalent):

ηH2=HK+1nK\eta^2_H = \frac{H - K + 1}{n - K}

Where HH is the Kruskal-Wallis HH statistic, KK is the number of groups, and nn is the total sample size.

Spearman's ρ\rho (non-parametric correlation, itself an effect size):

ρs=16i=1ndi2n(n21)\rho_s = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}

Where did_i is the difference between ranks of the ii-th observation on XX and YY.
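Two of the rank-based statistics above can be sketched directly in Python; the $H$, $K$, $n$ values and the rank vectors are hypothetical illustrations:

```python
def eta2_h(h_stat, k_groups, n_total):
    """Eta-squared effect size for the Kruskal-Wallis H statistic."""
    return (h_stat - k_groups + 1) / (n_total - k_groups)

def spearman_rho(rank_x, rank_y):
    """Spearman's rho from ranks (no-ties formula)."""
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(round(eta2_h(10.0, 3, 60), 3))                   # → 0.14
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # → 0.8
```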

11.6 Reporting Effect Sizes According to APA and Journal Standards

The APA Publication Manual (7th ed.) and major journals increasingly require reporting effect sizes with confidence intervals for all primary analyses. Best practice:

Minimum reporting requirements:

  - The effect size point estimate, with its direction
  - A 95% confidence interval for the effect size
  - The type of effect size and, for standardised mean differences, which SD was used as the standardiser
  - The sample sizes on which the estimate is based

Example APA-compliant report: "The CBT group showed significantly lower depression scores than the control group (t(118)=4.21t(118) = 4.21, p<.001p < .001, d=0.77d = 0.77, 95% CI [0.40, 1.14]), indicating a large treatment effect."


12. Worked Examples

Example 1: Cohen's dd and Hedges' gg — CBT vs. Control for Depression

A clinical trial randomises n1=35n_1 = 35 participants to CBT and n2=35n_2 = 35 to a waitlist control. Depression is measured on the PHQ-9 (0–27 scale, lower = less depression).

Summary statistics:

| Group | $n$ | Mean PHQ-9 | SD |
|---|---|---|---|
| CBT | 35 | 8.4 | 4.2 |
| Control | 35 | 13.1 | 4.8 |

Step 1 — Pooled SD:

spooled=(351)(4.2)2+(351)(4.8)235+352=34(17.64)+34(23.04)68=599.76+783.3668=1383.1268=20.34=4.51s_{pooled} = \sqrt{\frac{(35-1)(4.2)^2 + (35-1)(4.8)^2}{35+35-2}} = \sqrt{\frac{34(17.64) + 34(23.04)}{68}} = \sqrt{\frac{599.76 + 783.36}{68}} = \sqrt{\frac{1383.12}{68}} = \sqrt{20.34} = 4.51

Step 2 — Cohen's dd:

d=8.413.14.51=4.74.51=1.042d = \frac{8.4 - 13.1}{4.51} = \frac{-4.7}{4.51} = -1.042

The negative sign indicates CBT has lower (better) depression scores. By convention, report the absolute value with direction: d=1.042\lvert d \rvert = 1.042.

Step 3 — Hedges' gg (bias correction):

ν=35+352=68\nu = 35 + 35 - 2 = 68

J=134(68)1=13271=10.0111=0.9889J = 1 - \frac{3}{4(68) - 1} = 1 - \frac{3}{271} = 1 - 0.0111 = 0.9889

g=1.042×0.9889=1.031g = 1.042 \times 0.9889 = 1.031

(Minimal correction since nn is moderate.)

Step 4 — Approximate 95% CI:

SEd=35+3535×35+1.04222(35+352)=701225+1.086136=0.0571+0.0080=0.0651=0.255SE_d = \sqrt{\frac{35+35}{35 \times 35} + \frac{1.042^2}{2(35+35-2)}} = \sqrt{\frac{70}{1225} + \frac{1.086}{136}} = \sqrt{0.0571 + 0.0080} = \sqrt{0.0651} = 0.255

95% CI:1.042±1.96(0.255)=[0.542,1.542]95\% \text{ CI}: 1.042 \pm 1.96(0.255) = [0.542, 1.542]

Step 5 — Common Language Effect Size:

CL=Φ(1.0422)=Φ(0.737)=0.770CL = \Phi\left(\frac{1.042}{\sqrt{2}}\right) = \Phi(0.737) = 0.770

77.0% of CBT participants score lower (better) than the average control participant.

Step 6 — U3U_3 Statistic:

U3=Φ(1.042)=0.851U_3 = \Phi(1.042) = 0.851

85.1% of CBT participants have PHQ-9 scores below the mean of the control group.

Summary:

| Statistic | Value | Interpretation |
|---|---|---|
| Cohen's $d$ | 1.042 | Large effect (Cohen's benchmark: large $\geq 0.80$) |
| Hedges' $g$ | 1.031 | Large effect (negligible bias correction) |
| 95% CI for $d$ | [0.542, 1.542] | Entirely above zero; significant effect |
| CL | 77.0% | 77% of CBT patients score better than the average control |
| $U_3$ | 85.1% | 85% of CBT patients below the control mean |

Conclusion: CBT produced a large, statistically significant reduction in depression compared to waitlist control (d=1.04d = 1.04, 95% CI [0.54, 1.54]). The effect size indicates that approximately 85% of CBT participants had depression scores below the average control participant. This is a clinically meaningful and large treatment effect.
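The entire example can be reproduced in a few lines of Python (a verification sketch, not DataStatPro output):

```python
import math
from statistics import NormalDist

# Example 1: CBT (n=35, M=8.4, SD=4.2) vs. control (n=35, M=13.1, SD=4.8).
n1 = n2 = 35
s_pooled = math.sqrt(((n1 - 1) * 4.2**2 + (n2 - 1) * 4.8**2) / (n1 + n2 - 2))
d = abs(8.4 - 13.1) / s_pooled
j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)   # Hedges' small-sample correction factor
g = d * j
phi = NormalDist().cdf
cl = phi(d / math.sqrt(2))            # Common Language effect size
u3 = phi(d)                           # Cohen's U3
print(round(d, 3), round(g, 3), round(cl, 2), round(u3, 2))  # → 1.042 1.031 0.77 0.85
```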


Example 2: ω2\omega^2 and η2\eta^2 — Effect of Teaching Method on Exam Scores

An educational researcher tests three teaching methods (Lecture, Flipped Classroom, Project-Based) on exam performance ($Y$, %). Total sample: $n = 90$ (30 per group).

ANOVA table:

| Source | SS | df | MS | $F$ | $p$ |
|---|---|---|---|---|---|
| Between (Method) | 2840 | 2 | 1420 | 8.94 | < .001 |
| Within (Error) | 13800 | 87 | 158.6 | | |
| Total | 16640 | 89 | | | |

Step 1 — Eta squared:

η2=SSbetweenSStotal=284016640=0.171\eta^2 = \frac{SS_{between}}{SS_{total}} = \frac{2840}{16640} = 0.171

Step 2 — Omega squared (bias-corrected):

ω2=SSbetween(K1)MSwithinSStotal+MSwithin=2840(2)(158.6)16640+158.6=2840317.216798.6=2522.816798.6=0.150\omega^2 = \frac{SS_{between} - (K-1)MS_{within}}{SS_{total} + MS_{within}} = \frac{2840 - (2)(158.6)}{16640 + 158.6} = \frac{2840 - 317.2}{16798.6} = \frac{2522.8}{16798.6} = 0.150

Step 3 — Epsilon squared:

ε2=SSbetween(K1)MSwithinSStotal=2840317.216640=2522.816640=0.152\varepsilon^2 = \frac{SS_{between} - (K-1)MS_{within}}{SS_{total}} = \frac{2840 - 317.2}{16640} = \frac{2522.8}{16640} = 0.152

Step 4 — Cohen's ff:

f=ω21ω2=0.1500.850=0.176=0.420f = \sqrt{\frac{\omega^2}{1-\omega^2}} = \sqrt{\frac{0.150}{0.850}} = \sqrt{0.176} = 0.420

Step 5 — 95% CI for ω2\omega^2 (via non-central FF):

Using the non-central FF approach with Fobs=8.94F_{obs} = 8.94, df1=2df_1 = 2, df2=87df_2 = 87:

Non-centrality parameter λ^=F×df1=8.94×2=17.88\hat{\lambda} = F \times df_1 = 8.94 \times 2 = 17.88

95% CI for λ\lambda: [7.20,32.18][7.20, 32.18] (numerical)

Converting: ωL2=7.20/(7.20+90)=0.074\omega^2_{L} = 7.20/(7.20+90) = 0.074, ωU2=32.18/(32.18+90)=0.263\omega^2_U = 32.18/(32.18+90) = 0.263

ω2=0.150\omega^2 = 0.150, 95% CI [0.074,0.263][0.074, 0.263]

Comparison of estimates:

| Measure | Value | Benchmark (Cohen) | Label |
|---|---|---|---|
| $\eta^2$ | 0.171 | Large ($\geq 0.14$) | Large (biased) |
| $\omega^2$ | 0.150 | Large ($\geq 0.14$) | Large (unbiased) |
| $\varepsilon^2$ | 0.152 | Large ($\geq 0.14$) | Large (unbiased) |
| Cohen's $f$ | 0.420 | Large ($\geq 0.40$) | Large |

Conclusion: Teaching method has a large effect on exam performance (ω2=0.150\omega^2 = 0.150, 95% CI [0.074, 0.263]). The bias-corrected estimate (ω2=0.150\omega^2 = 0.150) is slightly smaller than η2=0.171\eta^2 = 0.171, as expected. Approximately 15% of the variance in exam scores is attributable to teaching method. Note that η2=0.171\eta^2 = 0.171 slightly overestimates the true population effect due to sampling bias, illustrating why ω2\omega^2 is preferred.
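Steps 1–4 can be verified with a short Python sketch driven directly by the ANOVA table:

```python
import math

def anova_effect_sizes(ss_between, ss_total, df_between, ms_within):
    """eta^2, omega^2, epsilon^2, and Cohen's f from a one-way ANOVA table."""
    eta2 = ss_between / ss_total
    omega2 = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    eps2 = (ss_between - df_between * ms_within) / ss_total
    f = math.sqrt(omega2 / (1 - omega2))
    return eta2, omega2, eps2, f

# Example 2 of this guide: SS_between = 2840, SS_total = 16640, df = 2, MS_within = 158.6.
eta2, omega2, eps2, f = anova_effect_sizes(2840, 16640, 2, 158.6)
print(round(eta2, 3), round(omega2, 3), round(eps2, 3), round(f, 3))  # → 0.171 0.15 0.152 0.42
```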


Example 3: Odds Ratio, Risk Ratio, and NNT — Vaccine Effectiveness

A clinical trial evaluates a new vaccine. Among n1=1,000n_1 = 1{,}000 vaccinated participants, a=20a = 20 develop the disease. Among n2=1,000n_2 = 1{,}000 control participants, c=80c = 80 develop the disease.

2×22 \times 2 Table:

| | Disease | No Disease | Total |
|---|---|---|---|
| Vaccinated | 20 | 980 | 1000 |
| Control | 80 | 920 | 1000 |

Step 1 — Risks:

p1=201000=0.020p_1 = \frac{20}{1000} = 0.020 (vaccinated)

p2=801000=0.080p_2 = \frac{80}{1000} = 0.080 (control)

Step 2 — Absolute Risk Difference:

ARD=p1p2=0.0200.080=0.060\text{ARD} = p_1 - p_2 = 0.020 - 0.080 = -0.060

Vaccination reduces disease risk by 6 percentage points.

Step 3 — Risk Ratio:

RR=0.0200.080=0.25\text{RR} = \frac{0.020}{0.080} = 0.25

Vaccinated participants have 25% the risk of unvaccinated (a 75% risk reduction).

Step 4 — Odds Ratio:

OR=20×92080×980=1840078400=0.235\text{OR} = \frac{20 \times 920}{80 \times 980} = \frac{18400}{78400} = 0.235

ln(OR)=ln(0.235)=1.449\ln(\text{OR}) = \ln(0.235) = -1.449

SEln(OR)=120+1980+180+1920=0.0500+0.0010+0.0125+0.0011=0.0646=0.254SE_{\ln(OR)} = \sqrt{\frac{1}{20} + \frac{1}{980} + \frac{1}{80} + \frac{1}{920}} = \sqrt{0.0500 + 0.0010 + 0.0125 + 0.0011} = \sqrt{0.0646} = 0.254

95% CI for OR:

e1.449±1.96(0.254)=e[1.947,0.951]=[0.143,0.387]e^{-1.449 \pm 1.96(0.254)} = e^{[-1.947, -0.951]} = [0.143, 0.387]

Step 5 — NNT:

$\text{NNT} = \frac{1}{\lvert -0.060 \rvert} = \frac{1}{0.060} = 16.7 \approx 17$

Step 6 — Vaccine Effectiveness (VE):

VE=(1RR)×100%=(10.25)×100%=75%\text{VE} = (1 - \text{RR}) \times 100\% = (1 - 0.25) \times 100\% = 75\%

SEARD=0.02(0.98)1000+0.08(0.92)1000=0.0000196+0.0000736=0.0000932=0.00965SE_{\text{ARD}} = \sqrt{\frac{0.02(0.98)}{1000} + \frac{0.08(0.92)}{1000}} = \sqrt{0.0000196 + 0.0000736} = \sqrt{0.0000932} = 0.00965

95% CI for ARD: 0.060±1.96(0.00965)=[0.079,0.041]-0.060 \pm 1.96(0.00965) = [-0.079, -0.041]

95% CI for NNT: [1/0.079,1/0.041]=[12.7,24.4][13,25][1/0.079, 1/0.041] = [12.7, 24.4] \approx [13, 25]

Summary:

| Effect Size | Value | 95% CI | Interpretation |
|---|---|---|---|
| ARD | $-0.060$ | $[-0.079, -0.041]$ | Vaccine reduces risk by 6 percentage points |
| Risk Ratio | 0.250 | $[0.154, 0.405]$ | 75% risk reduction |
| Odds Ratio | 0.235 | $[0.143, 0.387]$ | Significantly protective |
| NNT | 17 | $[13, 25]$ | 17 vaccinated to prevent 1 case |
| Vaccine Effectiveness | 75% | $[60\%, 85\%]$ | High effectiveness |

Conclusion: The vaccine is highly effective, with a risk ratio of 0.25 (75% risk reduction) and an NNT of 17 (13–25). For every 17 people vaccinated, one additional case of disease is prevented compared to no vaccination. All three complementary effect sizes (ARD, RR, NNT) consistently demonstrate a clinically important and statistically significant protective effect of the vaccine.
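Steps 3 and 4 can be reproduced from the 2×2 counts with a short Python sketch (log-scale CI for the OR):

```python
import math

def two_by_two(a, b, c, d, z=1.96):
    """OR, RR, and a log-scale 95% CI for the OR from a 2x2 table
    (rows: exposed/unexposed; columns: event/no event) -- a sketch."""
    odds_ratio = (a * d) / (b * c)
    risk_ratio = (a / (a + b)) / (c / (c + d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    lo = math.exp(math.log(odds_ratio) - z * se)
    hi = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, risk_ratio, (lo, hi)

# Example 3 of this guide: 20/980 vaccinated vs. 80/920 control.
odds_ratio, risk_ratio, ci = two_by_two(20, 980, 80, 920)
print(round(odds_ratio, 3), round(risk_ratio, 2))  # → 0.235 0.25
```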


Example 4: Cramér's VV — Association Between Study Method and Grade

A researcher surveys n=300n = 300 students on their primary study method (Flashcards, Practice Tests, Re-Reading) and their final grade (A/B, C, D/F). The chi-squared test yields χ2(4)=22.8\chi^2(4) = 22.8, p<.001p < .001.

Cramér's VV:

V=χ2nmin(r1,c1)=22.8300×min(31,31)=22.8300×2=22.8600=0.038=0.195V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}} = \sqrt{\frac{22.8}{300 \times \min(3-1, 3-1)}} = \sqrt{\frac{22.8}{300 \times 2}} = \sqrt{\frac{22.8}{600}} = \sqrt{0.038} = 0.195

95% CI for VV (using non-central χ2\chi^2 approach):

Non-centrality parameter λ^χ2=χ2df=22.84=18.8\hat{\lambda}_{\chi^2} = \chi^2 - df = 22.8 - 4 = 18.8

95% CI for λ\lambda: [8.2,33.4][8.2, 33.4] (numerical iteration)

VL=8.2/(300×2)=0.01367=0.117V_L = \sqrt{8.2/(300 \times 2)} = \sqrt{0.01367} = 0.117

VU=33.4/(300×2)=0.05567=0.236V_U = \sqrt{33.4/(300 \times 2)} = \sqrt{0.05567} = 0.236

Benchmark (for min(r1,c1)=2\min(r-1,c-1) = 2): Small = 0.07, Medium = 0.21, Large = 0.35.

V=0.195V = 0.195 falls just below the medium threshold.

Conclusion: There is a small-to-medium association between study method and grade (V=0.195V = 0.195, 95% CI [0.117, 0.236], p<.001p < .001). Study method explains approximately V2=0.038V^2 = 0.038 (3.8%) of the variance in grade outcomes, indicating a modest but statistically significant relationship. Practice testing and flashcard use appear to produce better grade distributions than re-reading, consistent with retrieval practice research.


13. Common Mistakes and How to Avoid Them

Mistake 1: Conflating Statistical Significance with Effect Size

Problem: Concluding that because p<.05p < .05, the effect is "large" or "important." Conversely, concluding that because p>.05p > .05, the effect is "zero" or "negligible." Statistical significance is entirely about the strength of evidence against H0H_0, not about the magnitude of the effect.
Solution: Always report BOTH the p-value AND the effect size with its CI. A significant result with d=0.08d = 0.08 (tiny effect) and a non-significant result with d=0.60d = 0.60 (large effect, underpowered study) tell very different stories.

Mistake 2: Using η2\eta^2 Instead of ω2\omega^2 for ANOVA Effect Sizes

Problem: η2\eta^2 is systematically biased upward — it always overestimates the true population effect size, especially in small samples with few groups. Many researchers report η2\eta^2 simply because it is the default output of SPSS.
Solution: Always report ω2\omega^2 (or ωp2\omega_p^2 for factorial designs) as the primary ANOVA effect size. Report η2\eta^2 only if explicitly required by a journal, and clearly label it as biased. In many cases, the difference is small but the correct labelling matters.

Mistake 3: Using Cohen's dd When Glass's Δ\Delta is Appropriate

Problem: When group variances differ substantially (variance ratio >4> 4), pooling the standard deviations to compute Cohen's dd produces a denominator that reflects neither group well and leads to a misleading effect size.
Solution: When s1/s2>2s_1/s_2 > 2 (or <0.5< 0.5), report Glass's Δ\Delta (standardising by the control group SD) alongside Cohen's dd. Clearly state which SD was used as the standardiser.

Mistake 4: Reporting Effect Sizes Without Confidence Intervals

Problem: A point estimate of d=0.50d = 0.50 from a study of n=20n = 20 per group has a 95% CI of approximately [0.12, 0.88] — a range spanning from small to large. Reporting only d=0.50d = 0.50 without the CI gives a false sense of precision.
Solution: Always report the 95% CI alongside every effect size. DataStatPro automatically computes exact CIs for all effect sizes using non-central distributions. This is increasingly required by APA and major journals.

Mistake 5: Applying Cohen's Benchmarks Without Context

Problem: Mechanically classifying d=0.21d = 0.21 as "small" based on Cohen's benchmarks regardless of the research context. In some fields (e.g., cognitive neuroscience or social psychology in field settings), d=0.21d = 0.21 is a large, practically important effect.
Solution: Use Cohen's benchmarks only as a last resort. Prioritise domain-specific benchmarks, compare to average effect sizes in your field (e.g., from meta-analyses), and consider the practical or clinical implications of the effect size given the context.

Mistake 6: Interpreting the OR as the RR

Problem: When the event is common (p>0.10p > 0.10), the Odds Ratio is numerically larger (more extreme) than the Risk Ratio. For example, if p1=0.40p_1 = 0.40 and p2=0.20p_2 = 0.20, then RR = 2.0 but OR = 2.67. Reporting "the odds of the event are 2.67 times higher" and implying that "the risk is 2.67 times higher" substantially overstates the effect.
Solution: Always report the RR (not OR) when the outcome is common (p>0.10p > 0.10) and absolute probabilities are estimable (prospective study). Clearly distinguish between "odds" (OR) and "risk" (RR) in all reporting. Always accompany OR with the ARD for context.

Mistake 7: Computing Paired dd as Independent Samples dd

Problem: Using the independent samples formula (with pooled SD) for paired or repeated measures data ignores the correlation between the two measurements, dramatically underestimating the true within-person effect size (because $s_{pooled}$ includes between-person variability, whereas $s_d$ does not).
Solution: For paired designs, always use dpaired=dˉ/sdd_{paired} = \bar{d}/s_d where dˉ\bar{d} is the mean of the difference scores and sds_d is the SD of the difference scores. The paired dd will typically be larger than the independent dd for the same data when the pre-post correlation is positive.

Mistake 8: Reporting ηp2\eta_p^2 Values as Proportions of "Total Variance"

Problem: Partial eta squared (ηp2\eta_p^2) in factorial ANOVA is NOT the proportion of total variance. In a 2×22 \times 2 ANOVA with interaction, the values of ηp2\eta_p^2 for the two main effects and interaction can sum to well over 1.0 — clearly impossible if they were proportions of total variance.
Solution: When reporting ηp2\eta_p^2, state explicitly that it is "the proportion of variance in the DV attributable to this effect after removing variance associated with other effects." Use η2\eta^2 (not ηp2\eta_p^2) if you want to convey what fraction of total variance each effect explains.

Mistake 9: Using the Wrong nn for Computing dd from tt

Problem: When computing dd from a reported tt-statistic, researchers sometimes use the total NN instead of the per-group nn in the formula d=t1/n1+1/n2d = t\sqrt{1/n_1 + 1/n_2}, or confuse the sample sizes when groups are unequal.
Solution: For independent samples: d=t(n1+n2)/(n1n2)d = t\sqrt{(n_1+n_2)/(n_1 n_2)}. For paired or one-sample: dz=t/nd_z = t/\sqrt{n}. Always double-check which tt-test formula was used by the original authors.
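Both recovery formulas as a Python sketch; the $n = 60$ per group below is an assumption chosen only to match $df = 118$ in the APA example earlier in this guide:

```python
import math

def d_from_t_independent(t, n1, n2):
    """Cohen's d recovered from an independent-samples t statistic."""
    return t * math.sqrt((n1 + n2) / (n1 * n2))

def dz_from_t_paired(t, n):
    """d_z recovered from a paired or one-sample t statistic."""
    return t / math.sqrt(n)

# e.g. t(118) = 4.21, assuming equal groups of n = 60:
print(round(d_from_t_independent(4.21, 60, 60), 2))  # → 0.77
```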

Mistake 10: Reporting NNT Without Specifying the Time Horizon and Base Rate

Problem: An NNT of 20 means very different things depending on whether the outcome is "prevent a heart attack over 5 years" vs. "cure a headache in 2 hours." Without specifying the comparison condition (vs. what?), the time horizon, the baseline event rate, and the population, NNT is not interpretable.
Solution: Always specify the NNT with: (1) the comparison condition (treatment vs. placebo/control), (2) the outcome, (3) the time horizon, and (4) the baseline event rate (control group risk). Example: "NNT = 17 (95% CI: 13–25) to prevent one case of disease in vaccinated vs. unvaccinated adults over 12 months, given a baseline risk of 8%."


14. Troubleshooting

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| $d$ is extremely large ($> 3.0$) | Data entry error; outlier dominating; very small SD | Check raw data for errors; screen for outliers; verify SD calculation |
| $\omega^2$ or $\varepsilon^2$ is negative | True effect is near zero; small sample; $MS_{between} < MS_{error}$ | Report negative $\omega^2$ as 0 (convention); increase sample size |
| $\eta_p^2$ values sum to more than 1.0 | Expected in factorial ANOVA; $\eta_p^2$ is not a proportion of total variance | Switch to $\eta^2$ or $\omega^2$ if total-variance proportions are needed |
| OR and RR give very different conclusions | Common event (base rate $> 10\%$): OR exaggerates relative to RR | Report RR (or ARD + NNT) for common outcomes; OR is appropriate for case-control |
| 95% CI for NNT includes infinity | ARD CI includes zero (non-significant result) | Report NNT from each bound of the ARD CI separately; report the negative bound as NNH |
| $d$ from test statistic differs from $d$ from summary statistics | Different formula used; unequal group sizes | For unequal $n$: use $d = t\sqrt{(n_1+n_2)/(n_1 n_2)}$; verify which formula applies |
| CL (Common Language) effect size close to 0.50 despite large $d$ | Likely a formula error; CL near 0.50 implies $d$ near 0 | Use CL $= \Phi(d/\sqrt{2})$, not $\Phi(d)$; CL of 0.50 corresponds to $d = 0$ |
| Fisher's $z$ CI for $r$ extends beyond $[-1, 1]$ | Very small $n$ or $r$ close to $\pm 1$ | Check that $n \geq 4$; for $r = \pm 1$ the CI is degenerate; consider a Bayesian credible interval |
| Cramér's $V$ is larger than expected for a sparse table | Small-sample bias in $V$ | Use bias-corrected Cramér's $\tilde{V}$ (Bergsma, 2013) |
| Paired $d$ is larger than independent-samples $d$ for the same data | Expected: paired $d$ removes between-person variance | Both are correct but measure different things; report paired $d$ for paired designs |
| Power calculation requires larger $n$ than resources allow | Effect size is small or the power requirement is high | Accept lower power (state this as a limitation); use a one-tailed test if directional; consider a sequential design |
| $r$ and $d$ conversions give inconsistent results | Unequal group sizes affecting the conversion formula | Use the exact formula $r = d/\sqrt{d^2 + (n_1+n_2)^2/(n_1 n_2)}$, not the equal-$n$ approximation |
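Two of the fixes above can be made concrete in code: clamping a negative one-way $\omega^2$ to zero, and recovering $d$ from an independent-samples $t$ with unequal group sizes. This is an illustrative sketch in Python (function names are ours, not DataStatPro's API):

```python
from math import sqrt

def omega_squared(ss_between, ss_total, ms_within, k):
    """One-way omega^2 = (SS_B - (K-1)*MS_W) / (SS_T + MS_W).
    Negative estimates (which occur when MS_between < MS_error)
    are reported as 0 by convention."""
    est = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
    return max(0.0, est)

def d_from_t(t, n1, n2):
    """Cohen's d from an independent-samples t statistic,
    valid for unequal group sizes: d = t * sqrt((n1+n2)/(n1*n2))."""
    return t * sqrt((n1 + n2) / (n1 * n2))
```

With `ss_between = 1.0`, `ss_total = 100.0`, `ms_within = 2.0`, `k = 3`, the raw estimate is negative, so the function returns 0.0 rather than a meaningless negative proportion of variance.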

15. Quick Reference Cheat Sheet

Core Equations

| Formula | Description |
| --- | --- |
| $d = (\bar{x}_1 - \bar{x}_2)/s_{pooled}$ | Cohen's $d$ (independent samples) |
| $s_{pooled} = \sqrt{[(n_1-1)s_1^2+(n_2-1)s_2^2]/(n_1+n_2-2)}$ | Pooled standard deviation |
| $g = d \times (1 - 3/(4\nu-1))$ | Hedges' $g$ (bias-corrected $d$; $\nu = n_1+n_2-2$) |
| $\Delta = (\bar{x}_1 - \bar{x}_2)/s_{control}$ | Glass's $\Delta$ |
| $d_{paired} = \bar{d}/s_d$ | Cohen's $d$ for paired designs |
| $SE_d \approx \sqrt{(n_1+n_2)/(n_1 n_2) + d^2/(2(n_1+n_2))}$ | SE of Cohen's $d$ |
| $CL = \Phi(d/\sqrt{2})$ | Common Language Effect Size |
| $U_3 = \Phi(\lvert d \rvert)$ | Cohen's $U_3$ |
| $\eta^2 = SS_{effect}/SS_{total}$ | Eta squared |
| $\eta_p^2 = SS_{effect}/(SS_{effect}+SS_{error})$ | Partial eta squared |
| $\omega^2 = (SS_B - (K-1)MS_W)/(SS_T + MS_W)$ | Omega squared (one-way) |
| $\omega_p^2 = (SS_{eff} - df_{eff} \cdot MS_{err})/(SS_T + MS_{err})$ | Partial omega squared |
| $f = \sqrt{\eta^2/(1-\eta^2)}$ | Cohen's $f$ |
| $f^2 = R^2/(1-R^2)$ | Cohen's $f^2$ (global) |
| $f^2_{local} = \Delta R^2/(1-R^2_{full})$ | Cohen's $f^2$ (local/incremental) |
| $z_r = \text{arctanh}(r)$, $SE_{z_r} = 1/\sqrt{n-3}$ | Fisher's $z$ for $r$ CI |
| $\text{OR} = (p_1/(1-p_1))/(p_2/(1-p_2)) = ad/bc$ | Odds Ratio |
| $\text{RR} = p_1/p_2$ | Risk Ratio |
| $\text{NNT} = 1/\lvert p_1-p_2 \rvert$ | Number Needed to Treat |
| $V = \sqrt{\chi^2/(n \cdot \min(r-1,\,c-1))}$ | Cramér's $V$ |
| $r_{rb} = 1 - 2U/(n_1 n_2)$ | Rank-biserial correlation |
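The $d$-family equations above translate directly into code. The sketch below is illustrative (these are our helper names, not DataStatPro's implementation), using only the Python standard library:

```python
from math import sqrt
from statistics import NormalDist

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d for independent samples, using the pooled SD."""
    s_pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

def hedges_g(d, n1, n2):
    """Small-sample bias correction: g = d * (1 - 3/(4*nu - 1)), nu = n1+n2-2."""
    nu = n1 + n2 - 2
    return d * (1 - 3 / (4 * nu - 1))

def se_d(d, n1, n2):
    """Approximate standard error of Cohen's d."""
    return sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))

def cl_effect_size(d):
    """Common Language effect size: CL = Phi(d / sqrt(2))."""
    return NormalDist().cdf(d / sqrt(2))
```

For example, two groups of $n = 30$ with means 25 and 20 and equal SDs of 10 give $d = 0.5$, which the bias correction shrinks to $g \approx 0.494$.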

Effect Size Family Selection Guide

| Test | Effect Size | Notes |
| --- | --- | --- |
| One-sample $t$-test | $d = (\bar{x}-\mu_0)/s$ | Compared to known value |
| Independent $t$-test | Cohen's $d$ or Hedges' $g$ | $g$ for small $n$ |
| Paired $t$-test | $d_{paired} = \bar{d}/s_d$ | Uses difference scores |
| One-way ANOVA | $\omega^2$ or $\varepsilon^2$ | NOT $\eta^2$ (biased) |
| Factorial ANOVA | $\omega_p^2$ or $\eta_p^2$ | Partial versions |
| ANCOVA | $\omega_p^2$ (adjusted) | After covariate removal |
| Simple regression | $r$, $r^2$ | Both informative |
| Multiple regression | $R^2$, $f^2_{global/local}$ | Report adjusted $R^2$ |
| $\chi^2$ (2×2) | $\phi$ | Same as $r$ for binary |
| $\chi^2$ ($r \times c$) | Cramér's $V$ | Use corrected $V$ if small $n$ |
| Binary, prospective | ARD, RR, NNT | All three recommended |
| Binary, case-control | OR | RR not estimable |
| Mann-Whitney $U$ | $r_{rb}$ | Non-parametric |
| Wilcoxon signed-rank | $r_W = Z/\sqrt{n}$ | Non-parametric |
| Kruskal-Wallis | $\eta^2_H$ | Non-parametric ANOVA |
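The two non-parametric entries in the guide reduce to one-line formulas. As a quick illustrative sketch (helper names are ours):

```python
from math import sqrt

def rank_biserial(u, n1, n2):
    """Rank-biserial correlation from Mann-Whitney U: r_rb = 1 - 2U/(n1*n2)."""
    return 1 - 2 * u / (n1 * n2)

def wilcoxon_r(z, n):
    """Effect size for the Wilcoxon signed-rank test: r_W = Z / sqrt(n)."""
    return z / sqrt(n)
```

Note that $U = n_1 n_2 / 2$ (the null expectation) yields $r_{rb} = 0$, as it should.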

Cohen's Benchmarks (1988) — All Families

| Label | $d$ | $r$ | $\eta^2/\omega^2$ | $f$ | $f^2$ | $V$ | OR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Small | 0.20 | 0.10 | 0.01 | 0.10 | 0.02 | 0.10 | 1.22 |
| Medium | 0.50 | 0.30 | 0.06 | 0.25 | 0.15 | 0.30 | 1.65 |
| Large | 0.80 | 0.50 | 0.14 | 0.40 | 0.35 | 0.50 | 2.23 |

Conversion Formulas

| From | To | Formula |
| --- | --- | --- |
| $d$ (equal $n$) | $r$ | $r = d/\sqrt{d^2+4}$ |
| $r$ (equal $n$) | $d$ | $d = 2r/\sqrt{1-r^2}$ |
| $d$ (unequal $n$) | $r$ | $r = d/\sqrt{d^2 + (n_1+n_2)^2/(n_1 n_2)}$ |
| $d$ | $\eta^2$ (2 groups) | $\eta^2 = d^2/(d^2+4)$ |
| $\eta^2$ | $d$ (2 groups) | $d = 2\sqrt{\eta^2/(1-\eta^2)}$ |
| $\eta^2$ | $f$ | $f = \sqrt{\eta^2/(1-\eta^2)}$ |
| $f$ | $\eta^2$ | $\eta^2 = f^2/(1+f^2)$ |
| OR | $d$ | $d = \ln(\text{OR}) \times \sqrt{3}/\pi \approx \ln(\text{OR}) \times 0.5513$ |
| $d$ | OR | $\text{OR} = e^{d\pi/\sqrt{3}}$ |
| $t$ (independent) | $d$ | $d = t\sqrt{(n_1+n_2)/(n_1 n_2)}$ |
| $t$ (paired/one-sample) | $d$ | $d_z = t/\sqrt{n}$ |
| $F$ (2 groups, $df_1=1$) | $d$ | $d = \sqrt{F(1/n_1+1/n_2)}$ |
| $F$ | $\eta^2$ | $\eta^2 = F \cdot df_1/(F \cdot df_1 + df_2)$ |
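A useful sanity check on these conversions is that each pair should round-trip: converting $d \to r \to d$ (or $d \to \text{OR} \to d$) must recover the original value. An illustrative Python sketch (our function names, not a library API):

```python
from math import sqrt, log, exp, pi

def d_to_r(d, n1=None, n2=None):
    """d -> r. Uses the exact unequal-n formula when group sizes are
    supplied; otherwise the equal-n approximation r = d/sqrt(d^2 + 4)."""
    if n1 is None or n2 is None:
        return d / sqrt(d**2 + 4)
    a = (n1 + n2)**2 / (n1 * n2)   # reduces to 4 when n1 == n2
    return d / sqrt(d**2 + a)

def r_to_d(r):
    """r -> d (equal-n inverse): d = 2r / sqrt(1 - r^2)."""
    return 2 * r / sqrt(1 - r**2)

def d_to_or(d):
    """d -> OR via the logistic scaling ln(OR) = d * pi / sqrt(3)."""
    return exp(d * pi / sqrt(3))

def or_to_d(odds_ratio):
    """OR -> d: d = ln(OR) * sqrt(3) / pi (approx. ln(OR) * 0.5513)."""
    return log(odds_ratio) * sqrt(3) / pi
```

Because the equal-$n$ term $(n_1+n_2)^2/(n_1 n_2)$ equals exactly 4 when $n_1 = n_2$, `d_to_r(d, n, n)` and `d_to_r(d)` agree for balanced designs.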

NNT Interpretation Guide

| NNT | Clinical Impact |
| --- | --- |
| 1–2 | Extraordinary benefit |
| 3–5 | Excellent |
| 6–10 | Good |
| 11–50 | Moderate |
| 51–100 | Small |
| $> 100$ | Minimal |
| $\infty$ | No benefit |
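Computing the NNT from two event rates is a one-liner plus two conventions: NNT is undefined (infinite) when the absolute risk difference is zero, and it is conventionally rounded up to the next whole patient. An illustrative sketch:

```python
from math import ceil, inf

def nnt(p_control, p_treatment):
    """NNT = 1/|ARD|, where ARD = |p_control - p_treatment|.
    Returns infinity when the risks are equal; otherwise rounds up,
    by convention, to a whole number of patients."""
    ard = abs(p_control - p_treatment)
    return inf if ard == 0 else ceil(1 / ard)
```

For a control-group risk of 8% and a treated-group risk of 2%, the ARD is 0.06 and NNT $= \lceil 1/0.06 \rceil = 17$, matching the vaccination example earlier in this section.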

Required Sample Size for 80% Power (Two-Sided α=.05\alpha = .05)

| $d$ | $n$ per group | $r$ | $n$ total | $\omega^2$ | $n$ total (3 groups) |
| --- | --- | --- | --- | --- | --- |
| 0.20 | 394 | 0.10 | 783 | 0.01 | 969 |
| 0.35 | 130 | 0.20 | 193 | 0.04 | 279 |
| 0.50 | 64 | 0.30 | 84 | 0.06 | 159 |
| 0.65 | 38 | 0.40 | 46 | 0.10 | 90 |
| 0.80 | 26 | 0.50 | 29 | 0.14 | 66 |
| 1.00 | 17 | 0.60 | 19 | 0.25 | 36 |
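The $d$ column of this table can be approximated with the normal-approximation formula $n \approx 2(z_{\alpha/2} + z_{\beta})^2/d^2$ per group. This sketch (our code, not DataStatPro's power module) slightly underestimates the exact $t$-based values tabulated above, typically by one or two participants per group:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided independent-samples t-test,
    via the normal approximation n = 2 * (z_{alpha/2} + z_beta)^2 / d^2.
    Slightly below the exact t-based requirement."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided critical value
    z_beta = z(power)            # power quantile
    return ceil(2 * (z_alpha + z_beta)**2 / d**2)
```

For $d = 0.50$ this gives 63 per group versus the tabulated 64; for $d = 0.20$ it gives 393 versus 394. Use exact software (e.g., a noncentral-$t$ routine) for final planning.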

Effect Size Reporting Checklist

| Item | Required |
| --- | --- |
| Point estimate of effect size | ✅ Always |
| 95% CI for effect size | ✅ Always |
| Which specific effect size (e.g., $\omega^2$, not just "effect size") | ✅ Always |
| Which benchmark system was used | ✅ Always |
| Sample sizes for each group/condition | ✅ Always |
| Direction of effect (which group is higher) | ✅ Always |
| Whether bias correction was applied ($g$ vs. $d$) | ✅ When $n < 30$ |
| ARD + NNT for binary outcomes | ✅ For clinical/applied work |
| Power analysis or sensitivity analysis | ✅ For null results |
| Domain-specific context for the benchmark | ✅ Recommended |

This tutorial provides a comprehensive foundation for understanding, computing, and interpreting Effect Sizes using the DataStatPro application. For further reading, consult Cohen's "Statistical Power Analysis for the Behavioral Sciences" (2nd ed., 1988), Ellis's "The Essential Guide to Effect Sizes" (2010), Cumming's "Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis" (2012), and Lakens's "Calculating and Reporting Effect Sizes to Facilitate Cumulative Science: A Practical Primer for t-Tests and ANOVAs" (Frontiers in Psychology, 2013). For feature requests or support, contact the DataStatPro team.