Reliability Analysis: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of Reliability Analysis all the way through advanced estimation, evaluation, item analysis, and practical usage within the DataStatPro application. Whether you are encountering reliability analysis for the first time or looking to deepen your understanding of measurement quality, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is Reliability Analysis?
- The Mathematics Behind Reliability
- Assumptions of Reliability Analysis
- Types of Reliability
- Using the Reliability Analysis Component
- Cronbach's Alpha
- McDonald's Omega
- Split-Half Reliability
- Inter-Rater Reliability
- Item Analysis
- Model Fit and Evaluation
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into reliability analysis, it is helpful to understand several foundational statistical and psychometric concepts. Each is briefly reviewed below.
1.1 Measurement and Scales
Measurement is the process of assigning numbers to objects or events according to rules. In social and behavioural sciences, we frequently measure latent constructs — unobservable psychological or social attributes such as intelligence, anxiety, depression, or customer satisfaction.
These constructs are measured indirectly through observable indicators (items or questions on a questionnaire). The quality of this measurement process is what reliability analysis evaluates.
Scales of measurement:
| Scale | Properties | Examples |
|---|---|---|
| Nominal | Categories only; no order | Gender, blood type, nationality |
| Ordinal | Ordered categories; unequal intervals | Likert scale responses (1–5), satisfaction ratings |
| Interval | Equal intervals; arbitrary zero | Temperature (°C), IQ scores |
| Ratio | Equal intervals; true zero | Height, weight, reaction time |
Most psychological questionnaires use ordinal scales (Likert items), though they are often treated as approximately interval for the purpose of reliability analysis.
1.2 Variance and Covariance
The variance of a variable $X$ quantifies how spread out its values are around the mean:

$$\operatorname{Var}(X) = \sigma_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The covariance between two variables $X$ and $Y$ measures how they vary together:

$$\operatorname{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

The correlation is the standardised covariance:

$$r_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\,\sigma_Y}$$
Reliability analysis is fundamentally about analysing the covariance structure of a set of items — how strongly they co-vary with each other determines how reliably they measure the underlying construct.
1.3 The Inter-Item Correlation Matrix
Given $k$ items, the inter-item correlation matrix is a symmetric $k \times k$ matrix where:
- Diagonal elements = 1.0 (each item correlates perfectly with itself).
- Off-diagonal element $r_{ij}$ = Pearson correlation between items $i$ and $j$.
High inter-item correlations (typically $r > 0.30$) indicate that the items are measuring a common construct. Items with very low correlations ($r < 0.20$) with all others may be measuring something different and should be reviewed.
1.4 Composite Scores
A composite score (also called a total score or scale score) is the sum or average of responses across multiple items:

$$T = \sum_{i=1}^{k} X_i \qquad \text{or} \qquad \bar{T} = \frac{1}{k}\sum_{i=1}^{k} X_i$$

The reliability of a composite score depends on:
- The reliability of each individual item.
- The number of items ($k$) — more items generally means higher reliability.
- The average inter-item correlation — higher correlations mean higher reliability.
1.5 The Signal-to-Noise Analogy
Reliability can be thought of in terms of signal and noise:
- Signal = true score variance (the construct you actually want to measure).
- Noise = error variance (random fluctuations due to item wording, context, mood, etc.).
A perfectly reliable measure would have all signal and no noise ($\rho_{XX'} = 1$). In practice, all measurement has some noise, and reliability values above 0.70 are generally considered acceptable for research purposes.
1.6 True Score Theory (Brief Preview)
The concept of true scores comes from Classical Test Theory (CTT), which is the mathematical framework underlying most reliability analyses. In CTT, every observed score $X$ is composed of a true score $T$ and a random error $E$:

$$X = T + E$$
This deceptively simple equation is the foundation for all reliability estimation methods covered in this tutorial.
2. What is Reliability Analysis?
2.1 The Core Question
Reliability analysis evaluates whether a measurement instrument (a questionnaire, scale, test, or rating system) produces consistent, stable, and reproducible results. It answers the fundamental question:
"If we measure the same thing again under the same conditions, will we get the same result?"
A reliable instrument gives similar scores across:
- Items (internal consistency): Do all items in the scale measure the same thing?
- Time (test-retest reliability): Do scores remain stable when measured again?
- Raters (inter-rater reliability): Do different raters give the same scores?
- Parallel forms (alternate-form reliability): Do different versions of the test agree?
2.2 Reliability vs. Validity
Reliability and validity are the two cornerstones of measurement quality, but they are distinct:
| Property | Definition | Question Asked |
|---|---|---|
| Reliability | Consistency of measurement | "Does it measure consistently?" |
| Validity | Accuracy of measurement | "Does it measure what it claims to?" |
The critical relationship is:
Reliability is a necessary but not sufficient condition for validity.
A measure can be highly reliable but invalid (consistently measuring the wrong thing — like a miscalibrated scale that consistently reads 5 kg too heavy). A valid measure, by definition, must also be reliable (you cannot accurately measure something if the results are random).
- Perfect Reliability + High Validity = Excellent measurement (target)
- Perfect Reliability + Low Validity = Consistent but wrong (systematic bias)
- Low Reliability + Any Validity = Impossible (random measurement cannot be valid)
2.3 The Role of Reliability in Research
Reliability affects research in critical ways:
Statistical power: Unreliable measures attenuate (reduce) observed correlations and effect sizes. The observed correlation between constructs $X$ and $Y$ is related to the true correlation by:

$$r_{XY}^{\text{observed}} = r_{XY}^{\text{true}}\,\sqrt{\rho_{XX'}\,\rho_{YY'}}$$

Where $\rho_{XX'}$ and $\rho_{YY'}$ are the reliabilities of the two measures. Low reliability reduces the observed correlation below the true value; this is called attenuation.
Precision of measurement: The Standard Error of Measurement (SEM) quantifies the uncertainty in an individual's score:

$$\text{SEM} = \sigma_X\,\sqrt{1 - \rho_{XX'}}$$
Lower reliability → larger SEM → less precise measurement of individual scores.
Sample size requirements: Studies using unreliable measures require larger samples to achieve the same statistical power, because unreliability adds noise to all effect size estimates.
2.4 Real-World Applications
- Psychology: Evaluating whether a depression scale (e.g., PHQ-9), anxiety inventory, or personality questionnaire measures the intended construct consistently.
- Education: Assessing whether exam items reliably distinguish between high- and low-knowledge students.
- Medicine: Verifying that clinicians agree on disease severity ratings or diagnostic classifications (inter-rater reliability).
- Marketing Research: Confirming that customer satisfaction items consistently reflect the same underlying attitude.
- Human Resources: Evaluating the reliability of performance appraisal systems or structured interview ratings.
- Engineering / Quality Control: Assessing the consistency of measurement instruments, inspectors, or testing procedures (gauge R&R studies).
3. The Mathematics Behind Reliability
3.1 Classical Test Theory (CTT)
Classical Test Theory (CTT) is the dominant framework for reliability analysis in the social sciences. Its foundation is the true score model:

$$X_i = T_i + E_i$$

Where:
- $X_i$ is the observed score of person $i$ (what we actually measure).
- $T_i$ is the true score of person $i$ (the score they would obtain with perfect, error-free measurement).
- $E_i$ is the measurement error for person $i$ (random, unpredictable fluctuation).
3.2 Assumptions of the True Score Model
The CTT model rests on four key mathematical assumptions:
- Linearity: $X = T + E$ (observed = true + error, additively).
- Zero mean error: $E(E) = 0$ — measurement errors average to zero across many measurements.
- Uncorrelated error and true score: $\operatorname{Cov}(T, E) = 0$ — error is unrelated to the person's true score.
- Uncorrelated errors across items: $\operatorname{Cov}(E_i, E_j) = 0$ for $i \neq j$ — errors on different items are independent.
3.3 Variance Decomposition
Under the CTT assumptions, the variance of the observed score decomposes into two parts:

$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2$$

Where:
- $\sigma_X^2$ = total observed score variance.
- $\sigma_T^2$ = true score variance (signal).
- $\sigma_E^2$ = error variance (noise).
3.4 The Reliability Coefficient
The reliability coefficient $\rho_{XX'}$ is defined as the ratio of true score variance to total observed score variance:

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}$$

This is the population reliability — it ranges from 0 (completely unreliable) to 1 (perfectly reliable). In practice, reliability is estimated from sample data using methods such as Cronbach's alpha, McDonald's omega, or split-half reliability.
Key properties of $\rho_{XX'}$:
- $\rho_{XX'} = 0$: All variance is error; the measure is completely random.
- $\rho_{XX'} = 1$: All variance is true score variance; perfect measurement.
- $\rho_{XX'} = \operatorname{corr}(X, X')$: Reliability equals the correlation between two parallel forms of the measure.
3.5 Standard Error of Measurement
The Standard Error of Measurement (SEM) quantifies the average error in an individual's observed score:

$$\text{SEM} = \sigma_X\,\sqrt{1 - \rho_{XX'}}$$

The SEM defines a confidence interval around a person's observed score:

$$X \pm 1.96 \times \text{SEM} \quad \text{(95\% CI)}$$

For example, with illustrative values $\sigma_X = 10$ and $\rho_{XX'} = 0.90$:

$$\text{SEM} = 10\sqrt{1 - 0.90} \approx 3.16$$

A person scoring 65 has a 95% CI for their true score of $65 \pm 1.96 \times 3.16 \approx [58.8,\ 71.2]$.
3.6 The Spearman-Brown Prophecy Formula
The Spearman-Brown Prophecy Formula predicts the reliability of a lengthened (or shortened) test. If the current test has reliability $\rho$ and is changed to $m$ times its current length:

$$\rho_{\text{new}} = \frac{m\,\rho}{1 + (m-1)\,\rho}$$

Where $m$ is the multiplication factor (e.g., $m = 2$ means doubling the number of items).
Inverse formula — how many times longer must the test be to reach a target reliability $\rho_{\text{target}}$?

$$m = \frac{\rho_{\text{target}}\,(1 - \rho)}{\rho\,(1 - \rho_{\text{target}})}$$

Example: Current reliability = 0.60. How many times longer must the test be to reach 0.80?

$$m = \frac{0.80 \times (1 - 0.60)}{0.60 \times (1 - 0.80)} = \frac{0.32}{0.12} \approx 2.67$$

The test must be approximately 2.67 times longer (about 167% more items).
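For quick what-if calculations, both Spearman-Brown formulas are easy to script. Below is a minimal Python sketch (the function names are our own, not part of DataStatPro):

```python
def spearman_brown(r_current: float, m: float) -> float:
    """Predicted reliability of a test changed to m times its current length."""
    return (m * r_current) / (1 + (m - 1) * r_current)

def required_length_factor(r_current: float, r_target: float) -> float:
    """Length multiplier needed to reach a target reliability."""
    return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

print(spearman_brown(0.60, 2))             # doubling the test: 0.75
print(required_length_factor(0.60, 0.80))  # ~2.67, as in the example above
```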
3.7 Attenuation and Correction for Attenuation
The correction for attenuation formula estimates the true (disattenuated) correlation between two constructs from their observed correlation, correcting for the attenuation caused by measurement unreliability:

$$r_{XY}^{\text{true}} = \frac{r_{XY}}{\sqrt{\rho_{XX'}\,\rho_{YY'}}}$$

Where:
- $r_{XY}$ = observed correlation between measures $X$ and $Y$.
- $\rho_{XX'}$, $\rho_{YY'}$ = reliabilities of measures $X$ and $Y$.
- $r_{XY}^{\text{true}}$ = estimated true correlation (disattenuated).
⚠️ The disattenuated correlation can exceed 1.0 if both reliabilities are low and the observed correlation is moderate or high. Values above 1.0 are inadmissible and indicate that the reliabilities or correlation are inaccurate. Treat disattenuated correlations with caution.
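Because inadmissible values are easy to miss, it can help to wrap the correction in a helper that checks for them. A minimal Python sketch (hypothetical function name):

```python
def disattenuated_r(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for unreliability in both measures."""
    r_true = r_xy / (rel_x * rel_y) ** 0.5
    if r_true > 1.0:
        # Inadmissible: the reliabilities and/or r_xy are mutually inconsistent.
        raise ValueError(f"Disattenuated r = {r_true:.3f} exceeds 1.0")
    return r_true

print(disattenuated_r(0.42, 0.70, 0.80))  # ~0.561
```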
3.8 The Covariance Matrix of a Scale
For a $k$-item scale, the covariance matrix $\mathbf{\Sigma}$ is $k \times k$ with:
- Diagonal elements: item variances $\sigma_i^2$.
- Off-diagonal elements: inter-item covariances $\sigma_{ij}$.
The total scale variance is the sum of all elements:

$$\sigma_X^2 = \sum_{i=1}^{k}\sum_{j=1}^{k}\sigma_{ij}$$

Or in matrix notation:

$$\sigma_X^2 = \mathbf{1}'\,\mathbf{\Sigma}\,\mathbf{1}$$

Where $\mathbf{1}$ is a $k \times 1$ vector of ones.
This covariance matrix is the fundamental input to Cronbach's alpha and McDonald's omega.
4. Assumptions of Reliability Analysis
4.1 Unidimensionality
The most critical assumption for most reliability coefficients (especially Cronbach's alpha) is that all items in the scale measure a single underlying construct (unidimensionality).
Why it matters: Cronbach's alpha is a measure of internal consistency, not unidimensionality. A scale with multiple distinct dimensions (e.g., a scale with "anxiety" and "depression" items combined) can still produce a high alpha, but the alpha in that case is misleading — it does not reflect a single coherent construct.
How to check:
- Run an Exploratory Factor Analysis (EFA) before reliability analysis.
- Inspect the scree plot and parallel analysis.
- If more than one factor is extracted, consider running reliability analysis separately for each subscale.
- Check the ratio of the first to second eigenvalue: a ratio greater than about 4 suggests approximate unidimensionality.
4.2 Tau-Equivalence (for Cronbach's Alpha)
Cronbach's alpha is theoretically justified only when items are tau-equivalent — meaning all items have:
- Equal true score variances (all items are equally good at measuring the construct), AND
- Possibly different error variances (items may differ in their error variance and hence their reliability).
In practice, most scales are congeneric — items have different factor loadings (not tau-equivalent) — which means Cronbach's alpha underestimates the true reliability. McDonald's omega (Section 8) does not require tau-equivalence and is therefore more appropriate for congeneric scales.
4.3 Uncorrelated Errors
CTT assumes that measurement errors on different items are uncorrelated — the error on item $i$ is independent of the error on item $j$. Correlated errors arise when:
- Two items use very similar wording (e.g., "I feel anxious" and "I feel worried").
- Two items share a reverse-score format (systematic method effect).
- Two items are administered sequentially and participants remember their previous response (carry-over effects).
Correlated errors inflate Cronbach's alpha and can produce reliability estimates that exceed the true reliability. When correlated errors are suspected, use a model-based approach (CFA) to explicitly model and account for them.
4.4 Continuous or Approximately Continuous Items
Standard reliability formulas assume that item scores are continuous (or at minimum, ordinal with many ordered categories treated as approximately continuous). For binary items (0/1 responses), use the Kuder-Richardson Formula 20 (KR-20), which is a special case of Cronbach's alpha. For ordinal items with 3–4 categories, use polychoric correlations as input (ordinal alpha or ordinal omega).
4.5 Adequate Sample Size
Reliability estimation requires a sufficient sample size for stable results:
| Reliability Statistic | Minimum $n$ | Recommended $n$ |
|---|---|---|
| Cronbach's alpha | 50 | ≥ 200 |
| McDonald's omega (CFA-based) | 100 | ≥ 250 |
| ICC (inter-rater) | 30 subjects | ≥ 50 subjects |
| Test-retest (Pearson/ICC) | 30 | ≥ 50 |
| Split-half | 50 | ≥ 100 |
⚠️ With small samples, reliability estimates are highly unstable and have wide confidence intervals. Always report a confidence interval alongside the point estimate.
4.6 No Extreme Outliers
Outliers in item responses can substantially distort covariances and variances, leading to inaccurate reliability estimates. Screen data for:
- Straight-line responding (participant gives the same response to all items).
- Acquiescence bias (tendency to agree with all items regardless of content).
- Extreme response styles (always using only the highest or lowest scale point).
5. Types of Reliability
5.1 Internal Consistency Reliability
Internal consistency measures whether the items in a scale consistently measure the same construct. It is assessed using a single administration of the scale.
| Method | Basis | When to Use |
|---|---|---|
| Cronbach's Alpha ($\alpha$) | Average inter-item covariance | Default for multi-item scales; assumes tau-equivalence |
| McDonald's Omega ($\omega$) | Factor model (CFA-based) | Better for congeneric items; most recommended |
| Ordinal Alpha / Omega | Polychoric correlations | Ordinal items with < 5 categories |
| KR-20 / KR-21 | Binary items | Dichotomous response scales (correct/incorrect) |
| Split-Half (Spearman-Brown) | Two halves of the scale | Quick estimate; now largely superseded |
| Greatest Lower Bound (GLB) | Maximisation over all splittings | Highest of the lower-bound estimates of internal consistency |
5.2 Test-Retest Reliability (Stability)
Test-retest reliability assesses the temporal stability of a measure — whether the same people get the same scores when measured at two different time points.
$$r_{\text{test-retest}} = \operatorname{corr}(X_1, X_2)$$

Where $X_1$ and $X_2$ are scores from the same measure at Time 1 and Time 2.
Key considerations:
- The time interval between assessments must be chosen carefully:
- Too short (days): Carryover/memory effects inflate reliability.
- Too long (months/years): True change in the construct deflates reliability.
- Optimal: 2–4 weeks for stable psychological traits.
- Use Intraclass Correlation Coefficient (ICC) rather than Pearson for test-retest reliability, as ICC also accounts for systematic mean shifts between time points.
5.3 Inter-Rater Reliability (Agreement)
Inter-rater reliability assesses the degree to which different raters (judges, coders, or observers) agree in their assessments of the same subjects.
| Method | Data Type | When to Use |
|---|---|---|
| Percent Agreement | Nominal / Ordinal | Simple; ignores chance agreement |
| Cohen's Kappa ($\kappa$) | Nominal | Two raters; categorical ratings |
| Weighted Kappa ($\kappa_w$) | Ordinal | Two raters; ordered categories |
| Fleiss' Kappa | Nominal | Three or more raters |
| Intraclass Correlation (ICC) | Continuous / Interval | Two or more raters; continuous ratings |
| Krippendorff's Alpha | Any scale | Multiple raters; any data type |
5.4 Parallel-Forms Reliability (Alternate Forms)
Parallel-forms reliability (also called alternate-form or equivalent-form reliability) assesses agreement between two different versions of the same test administered at the same time:

$$r_{\text{parallel}} = \operatorname{corr}(\text{Form A}, \text{Form B})$$
When used:
- High-stakes testing where test security is a concern (using multiple test forms).
- Longitudinal studies where learning effects would contaminate test-retest assessment.
This type is less common in everyday practice because developing two truly parallel forms is resource-intensive.
5.5 Summary: Choosing the Right Reliability Type
| Research Scenario | Recommended Type |
|---|---|
| Single-administration questionnaire (Likert items) | Internal consistency (alpha/omega) |
| Binary scored test (correct/incorrect) | KR-20 |
| Longitudinal measurement (same scale, two time points) | Test-retest ICC |
| Observational coding scheme (two coders) | Cohen's Kappa or ICC |
| Observational coding scheme (three+ coders) | Fleiss' Kappa or ICC |
| Continuous ratings by multiple raters | ICC |
| Two test forms used interchangeably | Parallel-forms |
| Ordinal items (< 5 categories) | Ordinal alpha or omega |
6. Using the Reliability Analysis Component
The Reliability Analysis component in DataStatPro provides a complete workflow for evaluating the reliability of multi-item scales and rating systems.
Step-by-Step Guide
Step 1 — Select Dataset
Choose the dataset from the "Dataset" dropdown. Ensure:
- Item responses are in separate columns (one column per item).
- The dataset contains only the items you wish to include in the scale.
- All items are coded in the same direction (or reverse-code negatively-worded items first).
💡 Tip: Use the DataStatPro data editor to reverse-code negatively worded items before running reliability analysis. For a 5-point scale, reverse coding transforms responses as $X_{\text{rev}} = 6 - X$ (so 1 ↔ 5, 2 ↔ 4, and 3 stays 3).
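If you prepare data outside DataStatPro, the same rule is a one-liner in pandas. A minimal sketch with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"item1": [1, 4, 5, 2], "item2_neg": [5, 2, 1, 4]})

# For a scale running from MIN to MAX: reversed score = (MIN + MAX) - X.
MIN, MAX = 1, 5
df["item2_rev"] = (MIN + MAX) - df["item2_neg"]  # 5->1, 4->2, 3->3, 2->4, 1->5
print(df)
```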
Step 2 — Select Scale Items
Select all variables (columns) that constitute the scale from the "Scale Items" dropdown. You can select multiple items. All selected items should:
- Be numeric.
- Use the same response scale (e.g., all 1–5 Likert).
- Theoretically measure the same underlying construct.
⚠️ Important: Do not mix items from different subscales in a single reliability analysis. Run separate analyses for each subscale.
Step 3 — Select Reliability Method
Choose from the "Method" dropdown:
- Cronbach's Alpha: Default, most widely used. Suitable for scales with many items on the same response scale. Assumes tau-equivalence.
- McDonald's Omega: Preferred method. Does not assume tau-equivalence. Uses a factor model internally. Recommended for congeneric scales.
- Split-Half (Spearman-Brown): Splits the scale into two halves and correlates them. Quick estimate, now largely superseded by alpha and omega.
- Inter-Rater Reliability (ICC / Kappa): For assessing agreement between raters.
Step 4 — Select Model (for ICC)
If using ICC for inter-rater reliability, select the appropriate ICC model:
- One-Way Random: Single raters randomly selected; raters differ across subjects.
- Two-Way Random: Raters are a random sample; interested in generalisability to all raters.
- Two-Way Mixed: Raters are fixed (same raters rate all subjects); generalise to these specific raters only.
Step 5 — Select ICC Type (for ICC)
- Consistency: Ignores systematic differences in rater means (appropriate when raters use different scale ranges habitually).
- Absolute Agreement: Accounts for systematic mean differences between raters (required when interchangeability of raters matters).
Step 6 — Select Confidence Level
Choose the confidence level for confidence intervals (default: 95%).
Step 7 — Display Options
Select which outputs to display:
- ✅ Overall Reliability Coefficient (with 95% CI)
- ✅ Item Statistics Table (mean, SD, skewness, kurtosis per item)
- ✅ Inter-Item Correlation Matrix
- ✅ Item-Total Statistics Table (corrected item-total correlation, alpha-if-deleted)
- ✅ Scale Statistics Summary
- ✅ Omega Hierarchical and Omega Total (if McDonald's Omega selected)
- ✅ Scree Plot / Factor Loadings (for omega)
- ✅ ICC Confidence Intervals and F-test (for inter-rater)
- ✅ Kappa Table with per-category agreement (for Kappa)
Step 8 — Run the Analysis
Click "Run Reliability Analysis". The application will:
- Compute item descriptive statistics.
- Calculate the inter-item correlation matrix.
- Estimate the chosen reliability coefficient(s) and confidence intervals.
- Produce item-total statistics (corrected correlations, alpha/omega-if-item-deleted).
- Display all selected outputs and visualisations.
7. Cronbach's Alpha
7.1 Definition and Formula
Cronbach's alpha ($\alpha$) is the most widely used measure of internal consistency reliability. It estimates the proportion of total scale variance attributable to the common factor shared by all items.
For a scale with $k$ items and covariance matrix $\mathbf{\Sigma}$:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right)$$

Where:
- $k$ = number of items.
- $\sigma_i^2$ = variance of item $i$.
- $\sigma_X^2$ = total scale score variance.
Equivalent form using the average inter-item covariance $\bar{c}$ and average item variance $\bar{v}$:

$$\alpha = \frac{k\,\bar{c}}{\bar{v} + (k-1)\,\bar{c}}$$

This form clearly shows that alpha depends on two quantities:
- The average inter-item covariance $\bar{c}$: how strongly items co-vary.
- The number of items $k$: more items → higher alpha (Spearman-Brown effect).
7.2 Standardised Alpha
When items are measured on different response scales (e.g., some items 1–5, others 1–7), it is appropriate to use the standardised alpha, which is based on the correlation matrix rather than the covariance matrix:

$$\alpha_{\text{std}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}$$

Where $\bar{r}$ is the average inter-item correlation.
This formula, the Spearman-Brown formula applied to the average inter-item correlation, shows that standardised alpha is determined solely by the number of items and the average correlation between them.
7.3 Cronbach's Alpha as a Lower Bound
Cronbach's alpha is a lower bound on the true reliability — it underestimates the true reliability under most conditions:

$$\alpha \leq \rho_{XX'}$$

with equality holding only when items are tau-equivalent (all true score variances are equal, i.e., all factor loadings are equal). Because psychological items are almost never tau-equivalent in practice, alpha typically underestimates the true reliability.
This is why McDonald's omega (Section 8) is theoretically preferred — it is an exact estimate of reliability under the congeneric model.
7.4 The Confidence Interval for Alpha
The sampling distribution of Cronbach's alpha is complex. An approximate confidence interval can be obtained via a Fisher-type transformation, but the more widely used Feldt (1965) method transforms alpha through:

$$\frac{1 - \alpha}{1 - \hat{\alpha}} \sim F\big(n-1,\ (n-1)(k-1)\big)$$

Where $\alpha$ is the population value, which follows an F-distribution with degrees of freedom $n-1$ and $(n-1)(k-1)$. The 95% CI bounds are:

$$\text{Lower} = 1 - (1 - \hat{\alpha})\,F_{0.975}, \qquad \text{Upper} = 1 - (1 - \hat{\alpha})\,F_{0.025}$$

Where $n$ is the sample size and $k$ is the number of items.
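A minimal Python sketch of the Feldt interval, assuming SciPy is available (the function name is our own):

```python
from scipy.stats import f

def feldt_ci(alpha_hat: float, n: int, k: int, conf: float = 0.95):
    """Feldt (1965) confidence interval for Cronbach's alpha.
    n = sample size, k = number of items."""
    df1, df2 = n - 1, (n - 1) * (k - 1)
    tail = (1 - conf) / 2
    lower = 1 - (1 - alpha_hat) * f.ppf(1 - tail, df1, df2)
    upper = 1 - (1 - alpha_hat) * f.ppf(tail, df1, df2)
    return lower, upper

print(feldt_ci(0.89, n=300, k=6))  # roughly (0.87, 0.91), as in Example 1 below
```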
💡 Always report the confidence interval alongside the point estimate of alpha. With small samples ($n < 100$), the confidence interval can be very wide (e.g., $\hat{\alpha} = 0.80$ with $n = 50$ might give a 95% CI of roughly [0.68, 0.89]), signalling considerable uncertainty in the estimate.
7.5 Interpreting Cronbach's Alpha
| Value | Interpretation | Typical Use |
|---|---|---|
| $\alpha \geq 0.95$ | Excellent — but may indicate redundancy | High-stakes clinical decisions |
| $0.90 \leq \alpha < 0.95$ | Excellent | High-stakes clinical decisions |
| $0.80 \leq \alpha < 0.90$ | Good | Most research applications |
| $0.70 \leq \alpha < 0.80$ | Acceptable | Exploratory research |
| $0.60 \leq \alpha < 0.70$ | Questionable — scale needs revision | Pilot studies only |
| $0.50 \leq \alpha < 0.60$ | Poor — major revision needed | Unacceptable for most purposes |
| $\alpha < 0.50$ | Unacceptable | Should not be used as a scale |
⚠️ Very high alpha ($\alpha > 0.95$) is not always desirable. It can indicate item redundancy — that items are so similar in wording that they add no unique information. Aim for $0.80 \leq \alpha \leq 0.90$ for most psychological scales, with not too many highly redundant items.
7.6 Alpha for Binary Items: KR-20 and KR-21
When all items are dichotomous (scored 0 or 1, as in knowledge tests), Cronbach's alpha reduces to the Kuder-Richardson Formula 20 (KR-20):

$$KR\text{-}20 = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i\,q_i}{\sigma_X^2}\right)$$

Where $p_i$ is the proportion of respondents answering item $i$ correctly, $q_i = 1 - p_i$, and $\sigma_X^2$ is the total test score variance.
When item difficulties are assumed equal ($p_i = \bar{p}$ for all items), a simpler formula is the Kuder-Richardson Formula 21 (KR-21):

$$KR\text{-}21 = \frac{k}{k-1}\left(1 - \frac{k\,\bar{p}\,\bar{q}}{\sigma_X^2}\right)$$

KR-21 is always ≤ KR-20. KR-20 produces exactly the same estimate as Cronbach's alpha applied to binary data; KR-21 is only an approximation unless item difficulties truly are equal.
7.7 Item-Total Statistics and Alpha-if-Item-Deleted
The item-total statistics table provides four critical pieces of information for each item:
| Statistic | Description | Use |
|---|---|---|
| Scale Mean if Item Deleted | Mean of total score if this item is removed | Identifies items that skew the scale |
| Scale Variance if Item Deleted | Variance of total score if this item is removed | Identifies items that affect scale spread |
| Corrected Item-Total Correlation | Correlation between item score and total minus that item | Primary item quality indicator |
| Alpha if Item Deleted | Alpha of the remaining items if this item is removed | Shows whether removing the item improves alpha |
Corrected Item-Total Correlation (CITC):

$$r_{i(t-i)} = \operatorname{corr}\big(X_i,\ T_{-i}\big)$$

where the subscript $-i$ means the total score with item $i$ removed ($T_{-i} = T - X_i$). This correction prevents the artificial inflation that would result from correlating an item with a total that includes the item itself.
Interpretation of CITC:
| CITC | Interpretation |
|---|---|
| $\geq 0.50$ | Excellent discriminator — strongly related to the construct |
| $0.30$ – $0.49$ | Good discriminator — acceptable item |
| $0.20$ – $0.29$ | Marginal — consider revision or removal |
| $< 0.20$ | Poor — item should be revised or removed |
| Negative | Item is negatively related to the scale — likely needs reverse coding or removal |
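A minimal Python sketch of the CITC computation (our own helper, not the DataStatPro implementation):

```python
import numpy as np

def corrected_item_total_correlations(data: np.ndarray) -> np.ndarray:
    """CITC for each item in a respondents-by-items matrix."""
    total = data.sum(axis=1)
    citc = []
    for i in range(data.shape[1]):
        rest = total - data[:, i]  # total score excluding item i
        citc.append(np.corrcoef(data[:, i], rest)[0, 1])
    return np.array(citc)
```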
7.8 Worked Manual Calculation of Cronbach's Alpha
Suppose we have a 3-item scale with the following covariance matrix (illustrative values):

$$\mathbf{\Sigma} = \begin{pmatrix} 1.20 & 0.60 & 0.50 \\ 0.60 & 1.00 & 0.55 \\ 0.50 & 0.55 & 1.10 \end{pmatrix}$$

Step 1 — Sum of item variances:

$$\sum_i \sigma_i^2 = 1.20 + 1.00 + 1.10 = 3.30$$

Step 2 — Total scale variance (sum of all elements):

$$\sigma_X^2 = 3.30 + 2\,(0.60 + 0.50 + 0.55) = 6.60$$

Step 3 — Apply Cronbach's formula:

$$\alpha = \frac{3}{2}\left(1 - \frac{3.30}{6.60}\right) = 1.5 \times 0.5 = 0.75$$

Interpretation: $\alpha = 0.75$ — acceptable internal consistency for a 3-item scale.
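The same arithmetic is easy to verify in code. A minimal Python sketch (the helper names are our own) that works from either a covariance matrix or raw data:

```python
import numpy as np

def alpha_from_cov(S: np.ndarray) -> float:
    """Cronbach's alpha from a k x k covariance matrix."""
    k = S.shape[0]
    return (k / (k - 1)) * (1 - np.trace(S) / S.sum())

def alpha_from_data(data: np.ndarray) -> float:
    """Cronbach's alpha from a respondents-by-items data matrix."""
    k = data.shape[1]
    item_vars = data.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = data.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1 - item_vars / total_var)

S = np.array([[1.20, 0.60, 0.50],
              [0.60, 1.00, 0.55],
              [0.50, 0.55, 1.10]])
print(alpha_from_cov(S))  # 0.75, matching the manual calculation
```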
8. McDonald's Omega
8.1 Limitations of Cronbach's Alpha and the Case for Omega
While Cronbach's alpha is the most widely reported reliability index, it has several well-documented limitations:
- Assumes tau-equivalence — all items must have equal factor loadings. Violated in virtually all real scales.
- Underestimates true reliability when items are congeneric (unequal loadings).
- Can be artificially inflated by correlated errors or multidimensionality.
- Does not distinguish between general factor variance and group factor variance in multidimensional scales.
McDonald's omega ($\omega$) overcomes these limitations by explicitly modelling the factor structure of the scale. It is now recommended by major methodologists as the preferred reliability index over Cronbach's alpha.
8.2 The Congeneric Model
McDonald's omega is based on the congeneric measurement model — a single-factor CFA where items can have different factor loadings (unlike the tau-equivalent model assumed by alpha):

$$X_i = \lambda_i\,F + \varepsilon_i$$

Where:
- $\lambda_i$ = factor loading of item $i$ (not constrained to be equal).
- $F$ = the common latent factor (standardised: $\operatorname{Var}(F) = 1$).
- $\varepsilon_i$ = unique factor for item $i$, with variance $\theta_i$.
The model-implied covariance matrix is:

$$\mathbf{\Sigma} = \boldsymbol{\lambda}\boldsymbol{\lambda}' + \mathbf{\Theta}$$

Where $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_k)'$ and $\mathbf{\Theta} = \operatorname{diag}(\theta_1, \ldots, \theta_k)$.
8.3 Omega Total ($\omega_t$)
Omega total ($\omega_t$) is the reliability of the total composite score from a congeneric single-factor model. It equals the squared correlation between the true score and the observed total score:

$$\omega_t = \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^2}{\left(\sum_{i=1}^{k}\lambda_i\right)^2 + \sum_{i=1}^{k}\theta_i}$$

Where $\lambda_i$ are the standardised factor loadings and $\theta_i$ are the unique variances (uniquenesses).
This formula has a clear interpretation:
- Numerator: The variance attributable to the common factor (the signal).
- Denominator: The total variance (signal + noise).
💡 Omega total is equivalent to Cronbach's alpha when items are tau-equivalent, and exceeds alpha when items are congeneric (unequal loadings). In practice, omega is usually somewhat higher than alpha.
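A minimal Python sketch of the omega-total formula from standardised CFA estimates (the helper name is our own; in practice the loadings would come from a fitted single-factor CFA). The numbers reproduce the worked example in Section 14:

```python
import numpy as np

def omega_total(loadings, uniquenesses) -> float:
    """Omega total from standardised single-factor CFA estimates."""
    lam = np.asarray(loadings)
    theta = np.asarray(uniquenesses)
    common = lam.sum() ** 2  # (sum of loadings)^2 = common-factor variance
    return common / (common + theta.sum())

lam = [0.82, 0.79, 0.71, 0.68, 0.85, 0.74, 0.63, 0.77]
theta = [0.33, 0.38, 0.50, 0.54, 0.28, 0.45, 0.60, 0.41]
print(omega_total(lam, theta))  # ~0.911
```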
8.4 Omega Hierarchical ($\omega_h$)
For multidimensional scales with both a general factor and group-specific factors (bifactor structure), omega hierarchical ($\omega_h$) quantifies the proportion of total score variance attributable to the general factor alone:

$$\omega_h = \frac{\left(\sum_{i=1}^{k}\lambda_{gi}\right)^2}{\sigma_X^2}$$

Where $\lambda_{gi}$ is the loading of item $i$ on the general factor (from a bifactor model), and $\sigma_X^2$ is the total scale variance.
Omega hierarchical subscale ($\omega_{hs}$) is the proportion of variance in a subscale's total score attributable to the group-specific factor:

$$\omega_{hs} = \frac{\left(\sum_{i \in s}\lambda_{si}\right)^2}{\sigma_{X_s}^2}$$

Where the sum is over items in subscale $s$ and $\lambda_{si}$ is the loading on the group-specific factor $s$.
8.5 Comparison of Alpha and Omega
| Property | Cronbach's Alpha | McDonald's Omega (Total) |
|---|---|---|
| Assumes tau-equivalence | Yes | No |
| Appropriate for congeneric items | No (underestimates) | Yes (correct estimate) |
| Requires factor analysis | No | Yes (single-factor CFA) |
| Sensitive to multidimensionality | Yes (can inflate) | Partially |
| Can separate general/group factors | No | Yes ($\omega_h$ vs. $\omega_t$) |
| Current methodological recommendation | Legacy default | Increasingly preferred |
| Sensitivity to correlated errors | Inflated | Can model explicitly |
General rule: When items are tau-equivalent → alpha ≈ omega. When items are congeneric (different loadings) → omega > alpha. The difference between omega and alpha is larger when loadings vary more across items.
8.6 Interpreting Omega Values
The same benchmarks as Cronbach's alpha apply to omega:
| $\omega_t$ | Interpretation |
|---|---|
| $\geq 0.90$ | Excellent reliability |
| $0.80$ – $0.89$ | Good reliability |
| $0.70$ – $0.79$ | Acceptable reliability |
| $0.60$ – $0.69$ | Questionable — revision needed |
| $< 0.60$ | Poor — major revision required |
For omega hierarchical ($\omega_h$), which represents the reliability attributable only to the general factor:
| $\omega_h$ | Interpretation |
|---|---|
| $\geq 0.80$ | Strong general factor; composite score is justified |
| $0.65$ – $0.79$ | Moderate general factor; composite score is defensible |
| $0.50$ – $0.64$ | Weak general factor; subscale scores may be preferable |
| $< 0.50$ | Very weak general factor; total score is not recommended |
8.7 The $\omega_h/\omega_t$ Ratio (Explained Common Variance)
The ratio of omega hierarchical to omega total is sometimes called the ECV (Explained Common Variance) and quantifies how much of the reliable variance is attributable to the general factor vs. group factors:

$$\text{ECV} = \frac{\omega_h}{\omega_t}$$

- $\omega_h/\omega_t > 0.80$: Almost all reliable variance is due to the general factor — the scale is essentially unidimensional and a total score is well-justified.
- $\omega_h/\omega_t < 0.60$: Substantial group factor variance — subscale scores carry unique meaning beyond the total score.
9. Split-Half Reliability
9.1 The Split-Half Method
Split-half reliability estimates reliability by dividing the scale into two halves, computing the total score for each half, and correlating the two half-scores. This provides an estimate based on a single test administration (unlike test-retest reliability).
The Pearson correlation between the two half-scores ($A$ and $B$) is:

$$r_{AB} = \operatorname{corr}(A, B)$$

However, this correlation estimates the reliability of a half-length test, not the full test. The Spearman-Brown correction is applied to estimate the reliability of the full test:

$$r_{\text{SB}} = \frac{2\,r_{AB}}{1 + r_{AB}}$$
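A minimal Python sketch of an odd-even split with the Spearman-Brown correction (our own helper, not the DataStatPro implementation):

```python
import numpy as np

def split_half_reliability(data: np.ndarray) -> float:
    """Odd-even split-half reliability for a respondents-by-items matrix."""
    half_a = data[:, 0::2].sum(axis=1)  # items 1, 3, 5, ... (0-indexed even columns)
    half_b = data[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r_ab = np.corrcoef(half_a, half_b)[0, 1]
    return 2 * r_ab / (1 + r_ab)        # Spearman-Brown correction to full length
```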
9.2 Methods for Splitting the Scale
| Method | How | Issue |
|---|---|---|
| Odd-Even Split | Odd-numbered items → Half A; even-numbered → Half B | Assumes order of items does not matter |
| First-Last Split | First $k/2$ items → Half A; last $k/2$ → Half B | Favours scales where item order is random |
| Random Split | Items randomly assigned to halves | More reproducible with many iterations |
| Matched-Random Split | Items matched on difficulty/content then split | Best for heterogeneous item sets |
⚠️ The split-half method gives different results depending on how the scale is split. This is a major weakness. Cronbach's alpha can be interpreted as the average of all possible split-half reliabilities — making it a more stable and preferred estimate. Split-half is primarily of historical interest today.
9.3 The Guttman Lambda Coefficients
The Guttman (1945) lambda coefficients are a family of reliability lower bounds. The most useful are:
Lambda 2 ($\lambda_2$): The tightest simple lower bound computable without factor analysis:

$$\lambda_2 = 1 - \frac{\sum_i \sigma_i^2}{\sigma_X^2} + \frac{\sqrt{\dfrac{k}{k-1}\sum_{i \neq j}\sigma_{ij}^2}}{\sigma_X^2}$$

Lambda 4 ($\lambda_4$): The maximum split-half reliability over all possible splits (the greatest split-half). It equals Cronbach's alpha when items are tau-equivalent, and typically exceeds alpha for congeneric items.
Lambda 6 ($\lambda_6$): Based on the squared multiple correlations of each item with all others:

$$\lambda_6 = 1 - \frac{\sum_{i=1}^{k}\sigma_i^2\,(1 - R_i^2)}{\sigma_X^2}$$

Where $R_i^2$ is the $R^2$ from regressing item $i$ on all other items.
9.4 The Greatest Lower Bound (GLB)
The Greatest Lower Bound (GLB) is the maximum possible reliability lower bound, computed by maximising the total error variance over all admissible decompositions of the covariance matrix:

$$\text{GLB} = 1 - \frac{\max_{\mathbf{\Theta}}\ \sum_{i=1}^{k}\theta_i}{\sigma_X^2}$$

Subject to the constraint that $\mathbf{\Sigma}_T$ is positive semidefinite, where $\mathbf{\Sigma}_T = \mathbf{\Sigma} - \mathbf{\Theta}$ and $\mathbf{\Theta} = \operatorname{diag}(\theta_1, \ldots, \theta_k)$.
GLB $\geq \alpha$ — the GLB is never less than alpha or any other lower bound. However, the GLB can be severely positively biased in small samples and may overestimate reliability more than omega. Use with caution when $n < 1000$.
10. Inter-Rater Reliability
10.1 Why Inter-Rater Reliability Matters
When data collection relies on human judgment — observations, coding of qualitative data, clinical assessments, interview ratings — different raters may disagree. Inter-rater reliability (IRR) quantifies the degree of agreement between raters and determines whether ratings can be trusted as objective.
Low IRR suggests:
- Ambiguous category definitions or coding rules.
- Insufficient rater training.
- The phenomenon being rated is inherently subjective.
- The rating scale is poorly designed.
10.2 Percent Agreement
The simplest IRR measure is percent agreement — the proportion of ratings on which all raters agree:

$$\%\ \text{agreement} = \frac{\text{number of cases with agreement}}{\text{total number of cases}} \times 100$$
Critical limitation: Percent agreement does not correct for the level of agreement expected purely by chance. Two raters randomly assigning ratings to binary categories (50/50 split) would agree about 50% of the time by chance alone.
10.3 Cohen's Kappa ($\kappa$)
Cohen's Kappa corrects for chance agreement. For two raters assigning $n$ subjects to $q$ categories:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Where:
- $p_o$ = observed proportion of agreement.
- $p_e$ = expected proportion of agreement by chance.
Computing $p_e$: For a contingency table with row proportions $p_{i\cdot}$ and column proportions $p_{\cdot i}$:

$$p_e = \sum_{i=1}^{q} p_{i\cdot}\,p_{\cdot i}$$
Example for a 2-category rating (agreement / disagreement):
Suppose two raters classify 100 subjects as "Case" or "Non-Case":
| Rater B: Case | Rater B: Non-Case | Row Total | |
|---|---|---|---|
| Rater A: Case | 45 | 10 | 55 |
| Rater A: Non-Case | 5 | 40 | 45 |
| Column Total | 50 | 50 | 100 |
Computing kappa from the table:

$$p_o = \frac{45 + 40}{100} = 0.85, \qquad p_e = (0.55)(0.50) + (0.45)(0.50) = 0.50$$

$$\kappa = \frac{0.85 - 0.50}{1 - 0.50} = 0.70$$

Standard Error of Kappa (approximate):

$$SE(\kappa) = \sqrt{\frac{p_o\,(1 - p_o)}{n\,(1 - p_e)^2}} = \sqrt{\frac{0.85 \times 0.15}{100 \times 0.25}} \approx 0.071$$

95% Confidence Interval:

$$\kappa \pm 1.96 \times SE(\kappa) = 0.70 \pm 0.14 = [0.56,\ 0.84]$$
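The same computation in a short Python sketch (our own helper name), using the contingency table above:

```python
import numpy as np

def cohens_kappa(table: np.ndarray):
    """Kappa, approximate SE, and 95% CI from a two-rater contingency table."""
    n = table.sum()
    p = table / n
    p_o = np.trace(p)                            # observed agreement
    p_e = (p.sum(axis=1) * p.sum(axis=0)).sum()  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    se = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, se, (kappa - 1.96 * se, kappa + 1.96 * se)

# Rows = Rater A, columns = Rater B.
print(cohens_kappa(np.array([[45, 10], [5, 40]])))  # kappa = 0.70
```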
10.4 Interpreting Cohen's Kappa
| $\kappa$ | Strength of Agreement |
|---|---|
| $< 0$ | Less than chance agreement (worse than random) |
| $0.00$ – $0.20$ | Slight |
| $0.21$ – $0.40$ | Fair |
| $0.41$ – $0.60$ | Moderate |
| $0.61$ – $0.80$ | Substantial |
| $0.81$ – $1.00$ | Almost Perfect |
(Landis & Koch, 1977 benchmarks — widely used but not universally accepted)
⚠️ Kappa is sensitive to the prevalence (base rate) of each category. When one category is very rare, even high percent agreement can yield a very low kappa. Always report percent agreement alongside kappa.
10.5 Weighted Kappa ($\kappa_w$)
For ordinal rating scales (where disagreements of different magnitudes are not equally serious), weighted kappa assigns weights based on the severity of disagreement:

$$\kappa_w = \frac{\sum_{i,j} w_{ij}\,p_{ij} - \sum_{i,j} w_{ij}\,e_{ij}}{1 - \sum_{i,j} w_{ij}\,e_{ij}}$$

Where $w_{ij}$ are the agreement weights (1 on the diagonal, decreasing with the size of the disagreement), $p_{ij}$ are the observed proportions, and $e_{ij}$ are the expected proportions under independence.
Common weighting schemes (for categories $i, j = 1, \ldots, q$):
| Weight Type | Formula | Suitable For |
|---|---|---|
| Linear weights | $w_{ij} = 1 - \lvert i - j \rvert / (q - 1)$ | Disagreement cost grows linearly with distance |
| Quadratic weights | $w_{ij} = 1 - \big((i - j)/(q - 1)\big)^2$ | Disagreement cost grows with squared distance |
Note: Weighted kappa with quadratic weights is mathematically equivalent to the ICC(2,1) model (see Section 10.6).
10.6 Intraclass Correlation Coefficient (ICC)
The Intraclass Correlation Coefficient (ICC) is the most versatile measure of inter-rater reliability for continuous or interval-scale ratings. Unlike Cohen's Kappa, the ICC can handle:
- More than two raters.
- Continuous ratings.
- Both consistency (relative agreement) and absolute agreement.
The ICC is based on a one-way or two-way ANOVA decomposition of the total variance in ratings into:
- Between-subjects variance ($\sigma_s^2$): True differences between subjects.
- Between-raters variance ($\sigma_r^2$): Systematic rater differences (in two-way models).
- Error / residual variance ($\sigma_e^2$): Random inconsistency.
Six standard ICC models (Shrout & Fleiss, 1979; McGraw & Wong, 1996):
| ICC Model | Notation | Rater Design | Measures |
|---|---|---|---|
| One-Way Random, Single | ICC(1,1) | Each subject rated by a different random rater | Consistency |
| One-Way Random, Mean of $k$ | ICC(1,$k$) | Each subject rated by $k$ different raters; average used | Consistency |
| Two-Way Random, Single | ICC(2,1) | Same raters rate all subjects; raters random | Absolute agreement |
| Two-Way Random, Mean of $k$ | ICC(2,$k$) | Same raters rate all; raters random; average used | Absolute agreement |
| Two-Way Mixed, Single | ICC(3,1) | Same fixed raters; single rating used | Consistency |
| Two-Way Mixed, Mean of $k$ | ICC(3,$k$) | Same fixed raters; average of $k$ ratings used | Consistency |
ICC Formulas (Two-Way Models):
For $n$ subjects and $k$ raters, with mean squares from a two-way ANOVA ($MS_B$ = between-subjects, $MS_R$ = between-raters, $MS_E$ = residual):
ICC(3,1) — Consistency:

$$\text{ICC}(3,1) = \frac{MS_B - MS_E}{MS_B + (k-1)\,MS_E}$$

ICC(2,1) — Absolute Agreement:

$$\text{ICC}(2,1) = \frac{MS_B - MS_E}{MS_B + (k-1)\,MS_E + \dfrac{k\,(MS_R - MS_E)}{n}}$$

For averaged ratings — ICC(3,$k$) — Consistency:

$$\text{ICC}(3,k) = \frac{MS_B - MS_E}{MS_B}$$
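A minimal Python sketch computing these ICCs directly from the ANOVA mean squares (the helper name is our own):

```python
def icc_from_anova(ms_b: float, ms_r: float, ms_e: float, n: int, k: int):
    """ICC(3,1), ICC(2,1), ICC(3,k) from two-way ANOVA mean squares.
    ms_b = between-subjects, ms_r = between-raters, ms_e = residual;
    n = subjects, k = raters."""
    icc_3_1 = (ms_b - ms_e) / (ms_b + (k - 1) * ms_e)
    icc_2_1 = (ms_b - ms_e) / (ms_b + (k - 1) * ms_e + k * (ms_r - ms_e) / n)
    icc_3_k = (ms_b - ms_e) / ms_b
    return icc_3_1, icc_2_1, icc_3_k

# Mean squares from Example 3 (Section 14): two raters, 40 patients.
print(icc_from_anova(10.585, 8.100, 1.497, n=40, k=2))
# -> approximately (0.752, 0.732, 0.859)
```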
10.7 Confidence Intervals for ICC
The 95% CI for ICC(3,1) is computed using the F-distribution (the CI for ICC(2,1) has a more complex form):

$$F_L = \frac{F_{\text{obs}}}{F_{0.975}(df_1, df_2)}, \qquad F_U = F_{\text{obs}} \times F_{0.975}(df_2, df_1)$$

$$\text{CI} = \left[\frac{F_L - 1}{F_L + k - 1},\ \frac{F_U - 1}{F_U + k - 1}\right]$$

With $df_1 = n - 1$ and $df_2 = (n-1)(k-1)$.
Where $F_{\text{obs}} = MS_B / MS_E$.
10.8 Interpreting ICC Values
| ICC | Reliability Quality |
|---|---|
| $< 0.50$ | Poor |
| $0.50$ – $0.75$ | Moderate |
| $0.75$ – $0.90$ | Good |
| $> 0.90$ | Excellent |
(Koo & Li, 2016 benchmarks — widely used in clinical research)
10.9 Fleiss' Kappa (Multiple Raters, Nominal Scale)
When more than two raters independently classify $N$ subjects into $q$ categories, Fleiss' Kappa generalises Cohen's Kappa:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

Where:

$$\bar{P} = \frac{1}{N}\sum_{i=1}^{N}P_i, \qquad P_i = \frac{1}{m(m-1)}\left(\sum_{j=1}^{q} n_{ij}^2 - m\right), \qquad \bar{P}_e = \sum_{j=1}^{q} p_j^2, \qquad p_j = \frac{1}{Nm}\sum_{i=1}^{N} n_{ij}$$

With $N$ = number of subjects, $m$ = number of raters, $n_{ij}$ = number of raters assigning subject $i$ to category $j$.
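A compact Python sketch of Fleiss' kappa from a subjects-by-categories count matrix (our own helper; it assumes every subject is rated by the same number of raters):

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of raters assigning subject i to category j;
    every row must sum to the same number of raters m."""
    N = counts.shape[0]
    m = counts[0].sum()
    p_j = counts.sum(axis=0) / (N * m)                         # category proportions
    P_i = (np.square(counts).sum(axis=1) - m) / (m * (m - 1))  # per-subject agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)
```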
10.10 Krippendorff's Alpha
Krippendorff's alpha ($\alpha_K$) is a versatile agreement measure that:
- Handles any number of raters.
- Works for any scale of measurement (nominal, ordinal, interval, ratio).
- Handles missing data (not all subjects need to be rated by all raters).
$$\alpha_K = 1 - \frac{D_o}{D_e}$$

Where:
- $D_o$ = observed disagreement (average metric disagreement between raters).
- $D_e$ = expected disagreement under chance agreement.
For ordinal data, the metric function $\delta^2$ is the squared difference between category ranks. For interval data, $\delta^2(x, y) = (x - y)^2$. For nominal data, $\delta(x, y) = 0$ if $x = y$ and $1$ otherwise.
Krippendorff recommends $\alpha_K \geq 0.80$ for reliable conclusions, with $0.667 \leq \alpha_K < 0.80$ allowing only tentative conclusions.
11. Item Analysis
11.1 What is Item Analysis?
Item analysis is the process of evaluating the statistical properties of individual items to determine which items contribute positively to the scale's reliability and validity, and which should be revised or removed.
Item analysis is typically performed as part of reliability analysis and is crucial during:
- Initial scale development (identifying weak items).
- Scale revision (improving problematic items).
- Test construction (selecting the best items from a larger item pool).
11.2 Item Difficulty (for Knowledge Tests)
For knowledge tests with correct/incorrect scoring, item difficulty ($p$-value) is the proportion of respondents who answer the item correctly:

$$p_i = \frac{\text{number answering item } i \text{ correctly}}{N}$$
| Difficulty ($p$) | Interpretation |
|---|---|
| $< 0.20$ | Very difficult — too hard for most |
| $0.20$ – $0.39$ | Difficult |
| $0.40$ – $0.60$ | Moderate — optimal for discrimination |
| $0.61$ – $0.80$ | Easy |
| $> 0.80$ | Very easy — too easy for most |
Items with $p \approx 0.50$ provide the most information about differences between individuals (maximum variance). However, items at extremes ($p < 0.10$ or $p > 0.90$) have low variance and contribute little to reliability.
11.3 Item Discrimination Index
The item discrimination index ($D$) measures how well an item differentiates between high-scoring and low-scoring respondents. It is computed using the extreme groups method:
- Divide respondents into the top 27% (High group, $H$) and bottom 27% (Low group, $L$) based on total score.
- Compute the proportion correct in each group: $p_H$ and $p_L$.
- Compute:

$$D = p_H - p_L$$
| $D$ | Interpretation |
|---|---|
| $\geq 0.40$ | Excellent discriminator |
| $0.30$ – $0.39$ | Good discriminator |
| $0.20$ – $0.29$ | Marginal — consider revision |
| $< 0.20$ | Poor — revise or remove |
| Negative | Perverse — high scorers do worse (review carefully) |
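A minimal Python sketch computing both difficulty and the extreme-groups discrimination index for binary-scored data (the helper name is our own):

```python
import numpy as np

def difficulty_and_discrimination(scores: np.ndarray, item: int):
    """scores: respondents-by-items 0/1 matrix. Returns (p, D) for one item."""
    total = scores.sum(axis=1)
    order = np.argsort(total)
    n_group = max(1, int(round(0.27 * len(total))))  # 27% extreme groups
    low, high = order[:n_group], order[-n_group:]
    p = scores[:, item].mean()                                # difficulty
    d = scores[high, item].mean() - scores[low, item].mean()  # discrimination
    return p, d
```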
11.4 Item-Rest Correlation (Corrected Item-Total Correlation)
As introduced in Section 7.7, the corrected item-total correlation (CITC) is the primary item quality indicator for Likert-type scales. It is equivalent to the item discrimination index for continuous scales and should be:
- $\geq 0.30$: Item is a satisfactory indicator of the construct.
- $0.20$ – $0.29$: Item is marginal; consider revision.
- $< 0.20$: Item is a poor indicator; strong candidate for removal.
- Negative: Item is inversely related to the construct (check for reverse coding).
11.5 Inter-Item Correlation Analysis
Beyond item-total correlations, examining the inter-item correlation matrix reveals:
Items that are too highly correlated ($r > 0.80$): May be redundant — they are essentially asking the same question twice and add little unique information. One should be removed or both should be revised to be more distinct.
Items that are too weakly correlated ($r < 0.20$ with most other items): Likely measuring a different construct. These items should be examined theoretically and may need to be placed in a different subscale or removed.
Average inter-item correlation: Values of $0.15$ to $0.50$ are typically considered optimal. Very high average correlations ($\bar{r} > 0.60$) with many items indicate excessive redundancy.
11.6 Floor and Ceiling Effects
Floor effects occur when most respondents score near the minimum possible score. Ceiling effects occur when most respondents score near the maximum possible score.
Both effects:
- Reduce variance (compress the distribution).
- Attenuate correlations with other variables.
- Reduce reliability.
- Make it difficult to detect differences or changes.
Check for floor/ceiling effects by inspecting:
- The distribution of item responses (histograms).
- Skewness: Extreme skewness ($|\text{skewness}| > 2$) signals potential floor/ceiling issues.
- Proportion of respondents at the minimum or maximum: more than 15% suggests an issue.
11.7 Item Response Curves
For knowledge tests (binary items), the item response curve (IRC) or item characteristic curve (ICC) plots the probability of a correct response as a function of total test score.
A well-functioning item should show a monotonically increasing S-shaped curve — the probability of a correct answer should consistently increase with the total score. Items that show a non-monotonic curve (e.g., high scorers are less likely to answer correctly than medium scorers) are flagged as problematic discriminators.
11.8 The Item Analysis Decision Framework
For each item, apply the following checks in order:
1. Is CITC < 0.20? Yes → flag for removal or revision. No → continue.
2. Is any item-item correlation > 0.80 with another item? Yes → flag for redundancy; remove one of the pair. No → continue.
3. Does alpha-if-deleted substantially exceed the current alpha (by > 0.05)? Yes → strong candidate for removal. No → continue.
4. Is skewness > |2| or kurtosis > |7| (floor/ceiling effects)? Yes → consider item revision or transformation. No → retain the item with confidence.
12. Model Fit and Evaluation
12.1 Reporting Reliability: Minimum Requirements
At minimum, a reliability report should include:
- The reliability coefficient (alpha, omega, ICC, kappa, etc.).
- The 95% confidence interval around the coefficient.
- The number of items included in the analysis.
- The sample size ().
- The method used (Cronbach's alpha, McDonald's omega, ICC model, etc.).
- Item-level statistics (means, SDs, corrected item-total correlations).
Example APA-style reporting:
"Internal consistency of the 10-item Emotional Regulation Scale was evaluated using McDonald's omega (), as items were expected to have unequal factor loadings (congeneric model). Omega total was (95% CI [0.84, 0.90]), indicating good internal consistency. Omega hierarchical was , suggesting that the majority of reliable variance was attributable to the general factor. Corrected item-total correlations ranged from 0.41 to 0.68, with all items exceeding the acceptable threshold of 0.30."
12.2 Scale-Level Statistics
Beyond the reliability coefficient, the following scale-level statistics should be computed and reported:
| Statistic | Formula | Interpretation |
|---|---|---|
| Scale Mean | $\bar{T} = \sum_i \bar{X}_i$ | Average composite score |
| Scale Variance | $\sigma_X^2 = \mathbf{1}'\mathbf{\Sigma}\mathbf{1}$ | Spread of composite scores |
| Scale SD | $\sigma_X = \sqrt{\sigma_X^2}$ | SD of composite scores |
| SEM | $\sigma_X\sqrt{1 - \rho_{XX'}}$ | Average error in individual scores |
| Range | Max − Min | Spread of composite scores observed |
| Skewness & Kurtosis | Standard formulas | Check normality of composite |
12.3 Assessing the Factor Structure Before Reliability Analysis
Before running reliability analysis, it is best practice to verify the factor structure:
Step 1 — Exploratory Factor Analysis (EFA):
- Run EFA with parallel analysis to determine the number of factors.
- If a single dominant factor is confirmed (parallel analysis retains 1 factor), the scale is approximately unidimensional → proceed with alpha or omega.
- If 2+ factors are retained → split into subscales and analyse each separately.
Step 2 — Confirmatory Factor Analysis (CFA):
- Specify a single-factor CFA model.
- Evaluate fit (CFI ≥ 0.95, RMSEA ≤ 0.06, SRMR ≤ 0.08).
- If fit is good → use omega total from the CFA solution.
- If fit is poor → consider multidimensional model; use omega hierarchical from bifactor model.
12.4 Evaluating Convergent and Discriminant Validity
Convergent validity: The scale should correlate strongly with other measures of the same or similar constructs (theoretically related measures). Typically evaluated using Pearson or Spearman correlations.
Discriminant validity: The scale should correlate weakly with measures of theoretically unrelated constructs.
Using reliability information, the disattenuated correlation (Section 3.7) provides the best estimate of the true relationship between constructs, corrected for measurement error.
12.5 Minimum Acceptable Reliability by Context
The required level of reliability depends on the stakes and purpose of measurement:
| Context | Minimum Acceptable | Preferred |
|---|---|---|
| Group-level research (comparing means) | 0.70 | ≥ 0.80 |
| Individual-level decisions (clinical) | 0.90 | ≥ 0.95 |
| High-stakes testing (licensure) | 0.90 | ≥ 0.95 |
| Pilot / exploratory research | 0.60 | ≥ 0.70 |
| Inter-rater agreement (research) | 0.70 ICC | ≥ 0.80 ICC |
| Inter-rater agreement (clinical) | 0.90 ICC | ≥ 0.95 ICC |
13. Advanced Topics
13.1 Ordinal Reliability: Polychoric Correlations
When scale items use fewer than 5 ordinal categories (e.g., a 3-point or 4-point Likert scale), treating Likert responses as continuous can distort covariances and underestimate reliability. A more appropriate approach uses polychoric correlations as the input matrix.
The polychoric correlation between two ordinal items and estimates the correlation between the underlying continuous latent variables that generate the observed ordinal responses. It is estimated by maximum likelihood, assuming bivariate normality of the latent variables.
Ordinal alpha is Cronbach's alpha computed on the polychoric correlation matrix:

$$\alpha_{\text{ordinal}} = \frac{k\,\bar{r}_{\text{pc}}}{1 + (k-1)\,\bar{r}_{\text{pc}}}$$

Where $\bar{r}_{\text{pc}}$ is the average polychoric correlation.
Ordinal omega is McDonald's omega estimated from a factor model fit to the polychoric correlation matrix (using WLSMV or similar ordinal estimator in CFA).
Ordinal alpha and omega are typically higher than their Pearson-based counterparts for coarsely-rated Likert items, because polychoric correlations are less attenuated by the coarse ordinal scaling.
13.2 Reliability in Generalisability Theory (G-Theory)
Generalisability Theory (G-Theory) extends CTT by recognising that measurement error can have multiple sources (facets). In a rating study, error might come from:
- Items (some items are harder or easier than others).
- Raters (some raters are more lenient than others).
- Occasions (scores vary across testing sessions).
- Interactions (some raters are harsher on certain items).
A G-study uses a fully crossed (or nested) ANOVA to partition the total variance into components corresponding to each facet and their interactions:

$$\sigma_X^2 = \sigma_p^2 + \sigma_i^2 + \sigma_r^2 + \sigma_{pi}^2 + \sigma_{pr}^2 + \sigma_{ir}^2 + \sigma_{pir,e}^2$$

Where $p$ = persons, $i$ = items, $r$ = raters.
The Generalisability Coefficient (G-coefficient) is analogous to reliability:

$$E\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_\delta^2}$$

Where $\sigma_\delta^2$ is the error variance appropriate to the measurement design.
A D-study uses the G-study variance components to predict how reliability would change if the number of items, raters, or occasions were varied — similar to the Spearman-Brown formula but for multiple facets simultaneously.
13.3 Reliability of Difference Scores
When researchers compute difference scores (e.g., post-treatment score minus pre-treatment score, or the difference between two subscales), the reliability of the difference $D = X - Y$ is typically lower than the reliability of either component:

$$\rho_{DD'} = \frac{\sigma_X^2\,\rho_{XX'} + \sigma_Y^2\,\rho_{YY'} - 2\,\sigma_X\sigma_Y\,r_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\,\sigma_X\sigma_Y\,r_{XY}}$$

Where:
- $\rho_{XX'}$, $\rho_{YY'}$ = reliabilities of $X$ and $Y$.
- $\sigma_X^2$, $\sigma_Y^2$ = variances of $X$ and $Y$.
- $r_{XY}$ = observed correlation between $X$ and $Y$.
For parallel measures ($\sigma_X = \sigma_Y$ and $\rho_{XX'} = \rho_{YY'} = \rho$):

$$\rho_{DD'} = \frac{\rho - r_{XY}}{1 - r_{XY}}$$

This shows that when $X$ and $Y$ are highly correlated (as expected when both are pre/post measures of the same construct), the reliability of the difference score can be very low.
Example: $\rho_{XX'} = \rho_{YY'} = 0.80$, $r_{XY} = 0.70$:

$$\rho_{DD'} = \frac{0.80 - 0.70}{1 - 0.70} = \frac{0.10}{0.30} \approx 0.33$$

Even though each measure has reliability 0.80, their difference has reliability of only 0.33! This is why difference scores are generally discouraged and residualised change scores or ANCOVA are preferred for measuring change.
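A small Python sketch of the difference-score formula (the helper name is our own), reproducing the example above:

```python
def difference_score_reliability(rel_x, rel_y, var_x, var_y, r_xy):
    """Reliability of D = X - Y from component reliabilities and variances."""
    sd_x, sd_y = var_x ** 0.5, var_y ** 0.5
    num = var_x * rel_x + var_y * rel_y - 2 * sd_x * sd_y * r_xy
    den = var_x + var_y - 2 * sd_x * sd_y * r_xy
    return num / den

# Parallel measures, each with reliability 0.80, correlated at 0.70:
print(difference_score_reliability(0.80, 0.80, 1.0, 1.0, 0.70))  # ~0.333
```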
13.4 Reliability and the Attenuation-Correction Decision
When planning a study, the researcher must decide whether to:
- Accept observed correlations (with attenuation from unreliability), or
- Correct for attenuation to estimate the true relationship.
Arguments for correcting:
- Shows the theoretical true relationship between constructs.
- Allows comparison across studies using instruments of different quality.
- Better informs theory testing.
Arguments against correcting:
- The corrected estimate is population-level and not applicable to individual predictions.
- Relies on accurate reliability estimates (which carry their own uncertainty).
- Can produce $r > 1.0$, which is inadmissible.
Best practice: Report both the observed and disattenuated correlations, and always report the reliability estimates used for correction.
13.5 Reliability of Composite Scores from Multiple Subscales
When a total score is formed by combining items from multiple subscales, reliability cannot be computed by treating all items as a single scale (which would violate the unidimensionality assumption). Instead, use Mosier's formula for the reliability of a composite:

$$\rho_C = 1 - \frac{\sum_{j=1}^{J} w_j^2\,\sigma_j^2\,(1 - \rho_j)}{\sigma_C^2}$$

Where:
- $J$ = number of subscales.
- $w_j$ = weight of subscale $j$ in the composite (1 for unweighted sum).
- $\sigma_j^2$ = variance of subscale $j$.
- $\rho_j$ = reliability of subscale $j$.
- $\sigma_C^2$ = total variance of the composite score.
This formula partitions total composite variance into reliable variance (from true scores) and error variance (from subscale measurement errors), providing an accurate estimate of the composite's reliability.
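A minimal Python sketch of Mosier's formula (our own helper; `cov` is the covariance matrix of the subscale scores computed from your data):

```python
import numpy as np

def mosier_composite_reliability(weights, variances, reliabilities, cov) -> float:
    """Reliability of a weighted composite of subscales (Mosier's formula)."""
    w = np.asarray(weights, dtype=float)
    var = np.asarray(variances, dtype=float)
    rel = np.asarray(reliabilities, dtype=float)
    composite_var = w @ np.asarray(cov, dtype=float) @ w  # total composite variance
    error_var = (w ** 2 * var * (1 - rel)).sum()          # summed subscale error variance
    return 1 - error_var / composite_var
```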
13.6 Item Response Theory (IRT) and Marginal Reliability
Item Response Theory (IRT) provides a framework for reliability that is more flexible than CTT. In IRT, the precision of measurement is not constant across the score range — it is highest where the test has the most information.
The Test Information Function quantifies how much information the test provides at each level of the latent trait $\theta$:

$$I(\theta) = \sum_{i=1}^{k} I_i(\theta)$$

Where $I_i(\theta)$ is the item information function for item $i$.
The conditional standard error of measurement at trait level $\theta$ is:

$$SE(\theta) = \frac{1}{\sqrt{I(\theta)}}$$

The marginal reliability of the test (averaging the error variance over the population distribution of $\theta$, with trait variance $\sigma_\theta^2$ and density $g(\theta)$):

$$\bar{\rho} = \frac{\sigma_\theta^2 - \int SE^2(\theta)\,g(\theta)\,d\theta}{\sigma_\theta^2}$$
IRT marginal reliability is a more informative reliability measure than Cronbach's alpha because it shows that reliability can be high for some test-takers and low for others — traditional reliability statistics only provide an average.
14. Worked Examples
Example 1: Cronbach's Alpha — 6-Item Burnout Scale
A researcher develops a 6-item work burnout scale with items rated 1 (Never) to 5 (Always). Data are collected from $n = 300$ employees.
Items:
- B1: I feel emotionally exhausted from my work.
- B2: I feel used up at the end of the working day.
- B3: I feel fatigued when I get up in the morning and have to face another day on the job.
- B4: Working with people all day is really a strain for me.
- B5: I feel burned out from my work.
- B6: I feel frustrated by my job.
Item Statistics:
| Item | Mean | SD | Skewness | CITC | $\alpha$ if Deleted |
|---|---|---|---|---|---|
| B1 | 3.21 | 1.08 | -0.22 | 0.72 | 0.86 |
| B2 | 3.08 | 1.12 | -0.15 | 0.69 | 0.87 |
| B3 | 2.95 | 1.15 | 0.10 | 0.61 | 0.88 |
| B4 | 2.78 | 1.20 | 0.18 | 0.55 | 0.89 |
| B5 | 3.31 | 1.05 | -0.30 | 0.75 | 0.86 |
| B6 | 3.05 | 1.18 | -0.08 | 0.68 | 0.87 |
Inter-Item Correlation Matrix:
| B1 | B2 | B3 | B4 | B5 | B6 | |
|---|---|---|---|---|---|---|
| B1 | 1.00 | 0.68 | 0.55 | 0.44 | 0.74 | 0.61 |
| B2 | 1.00 | 0.60 | 0.42 | 0.69 | 0.58 | |
| B3 | 1.00 | 0.48 | 0.58 | 0.52 | ||
| B4 | 1.00 | 0.49 | 0.54 | |||
| B5 | 1.00 | 0.66 | ||||
| B6 | 1.00 |
Average inter-item correlation (sum of the 15 off-diagonal correlations divided by 15):

$$\bar{r} = \frac{8.58}{15} = 0.572$$

Cronbach's Alpha Computation (standardised form, since all items share the same 1–5 response scale):

$$\alpha = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}} = \frac{6 \times 0.572}{1 + 5 \times 0.572} = \frac{3.432}{3.860} \approx 0.89$$
95% Confidence Interval (Feldt method): [0.870, 0.907]
Scale Statistics:
- Scale Mean: $3.21 + 3.08 + 2.95 + 2.78 + 3.31 + 3.05 = 18.38$
- Scale SD: $\approx 5.43$
- SEM: $5.43\sqrt{1 - 0.89} \approx 1.80$
Item Analysis Decision:
| Item | CITC | $\alpha$-if-Deleted | Action |
|---|---|---|---|
| B1 | 0.72 | 0.86 | ✅ Retain — strong indicator |
| B2 | 0.69 | 0.87 | ✅ Retain — good indicator |
| B3 | 0.61 | 0.88 | ✅ Retain — acceptable |
| B4 | 0.55 | 0.89 | ✅ Retain — but weakest item |
| B5 | 0.75 | 0.86 | ✅ Retain — strongest indicator |
| B6 | 0.68 | 0.87 | ✅ Retain — good indicator |
Conclusion: All six items are retained. Cronbach's alpha of $\alpha = 0.89$ (95% CI: 0.870, 0.907) indicates good internal consistency. All corrected item-total correlations exceed 0.50, and no single item appreciably improves alpha when deleted. The scale is internally consistent and all items contribute positively to the burnout construct.
Example 2: McDonald's Omega — 8-Item Anxiety Scale
A researcher administers an 8-item anxiety scale and runs a CFA-based reliability analysis using McDonald's omega, because item loadings are expected to differ.
Single-Factor CFA Results:
| Item | Standardised Loading ($\lambda_i$) | Uniqueness ($\theta_i$) | Communality ($\lambda_i^2$) |
|---|---|---|---|
| A1 | 0.82 | 0.33 | 0.67 |
| A2 | 0.79 | 0.38 | 0.62 |
| A3 | 0.71 | 0.50 | 0.50 |
| A4 | 0.68 | 0.54 | 0.46 |
| A5 | 0.85 | 0.28 | 0.72 |
| A6 | 0.74 | 0.45 | 0.55 |
| A7 | 0.63 | 0.60 | 0.40 |
| A8 | 0.77 | 0.41 | 0.59 |
CFA Fit: CFI = 0.976, TLI = 0.968, RMSEA = 0.047, SRMR = 0.041 → Good fit
Omega Total Computation:

$$\sum \lambda_i = 5.99, \qquad \left(\sum \lambda_i\right)^2 = 35.88, \qquad \sum \theta_i = 3.49$$

$$\omega_t = \frac{35.88}{35.88 + 3.49} = \frac{35.88}{39.37} \approx 0.911$$

Cronbach's Alpha (for comparison): $\alpha = 0.892$
Comparison:
| Statistic | Value | Interpretation |
|---|---|---|
| Cronbach's $\alpha$ | 0.892 | Good — but underestimates true reliability |
| McDonald's $\omega_t$ | 0.911 | Excellent — accurate estimate for congeneric items |
| Difference ($\omega_t - \alpha$) | 0.019 | Alpha underestimates by 1.9 percentage points |
Conclusion: The 8-item anxiety scale demonstrates excellent internal consistency. Omega total ($\omega_t = 0.911$) is the preferred and more accurate estimate because the items have unequal factor loadings (ranging from 0.63 to 0.85), confirming the congeneric model. Cronbach's alpha ($\alpha = 0.892$) slightly underestimates the true reliability, as expected for a congeneric scale.
Example 3: ICC — Two Clinical Raters Assessing Pain Intensity
Two physiotherapists independently rate pain intensity on a 0–10 numeric scale for $n = 40$ patients. The researcher wants to assess whether the two raters can be used interchangeably (absolute agreement ICC).
ANOVA Table:
| Source | SS | df | MS |
|---|---|---|---|
| Between Patients | 412.8 | 39 | 10.585 |
| Between Raters | 8.1 | 1 | 8.100 |
| Residual (Error) | 58.4 | 39 | 1.497 |
| Total | 479.3 | 79 |
ICC(2,1) — Two-Way Random, Absolute Agreement:

$$\text{ICC}(2,1) = \frac{10.585 - 1.497}{10.585 + (2-1)(1.497) + \dfrac{2\,(8.100 - 1.497)}{40}} = \frac{9.088}{12.412} \approx 0.732$$

ICC(3,1) — Two-Way Mixed, Consistency:

$$\text{ICC}(3,1) = \frac{10.585 - 1.497}{10.585 + (2-1)(1.497)} = \frac{9.088}{12.082} \approx 0.752$$

F-test for significance:

$$F = \frac{MS_B}{MS_E} = \frac{10.585}{1.497} \approx 7.07, \quad df = (39,\ 39), \quad p < 0.001$$

95% Confidence Interval for ICC(2,1):
Using the Shrout-Fleiss CI formula: $[0.562,\ 0.849]$
Interpretation:
| Statistic | Value | Interpretation |
|---|---|---|
| ICC(2,1) — Absolute Agreement | 0.732 [0.562, 0.849] | Moderate-Good agreement |
| ICC(3,1) — Consistency | 0.752 | Good consistency |
| Difference (abs. vs. consistency) | 0.020 | Small systematic rater mean difference |
Interpretation: The ICC for absolute agreement is 0.732 (95% CI: 0.562, 0.849), indicating moderate-to-good inter-rater reliability. The slightly higher consistency ICC (0.752) suggests a small systematic difference in how the two physiotherapists use the rating scale (Rater A rates slightly higher/lower than Rater B on average). For clinical interchangeability of the two raters, the absolute agreement ICC of 0.732 is adequate for research purposes but falls short of the 0.90 threshold recommended for high-stakes clinical decision-making. Additional rater training is recommended to improve agreement.
Example 4: Spearman-Brown Prophecy — Lengthening a Short Scale
A researcher has a 5-item resilience scale with $\alpha = 0.68$ and wants to improve reliability to at least $\alpha = 0.80$ by adding parallel items. How many more items are needed?

Step 1 — Compute $m$ (multiplication factor):

$$m = \frac{\rho_{\text{target}}(1 - \rho_{\text{current}})}{\rho_{\text{current}}(1 - \rho_{\text{target}})} = \frac{0.80 \times 0.32}{0.68 \times 0.20} = \frac{0.256}{0.136} = 1.88$$

Step 2 — Compute required number of items:

New total items = $5 \times 1.88 = 9.4 \rightarrow 10$ items (rounding up), i.e. 5 additional items.

Step 3 — Verify with Spearman-Brown ($m = 10/5 = 2$):

$$\rho_{\text{new}} = \frac{m\rho}{1 + (m-1)\rho} = \frac{2 \times 0.68}{1 + 0.68} = \frac{1.36}{1.68} = 0.81$$

Conclusion: Adding 5 more parallel items (total 10 items) is predicted to raise the reliability from $\alpha = 0.68$ to approximately $0.81$, exceeding the target of 0.80. This assumes that the new items have the same average inter-item correlation as the original 5.
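The prophecy calculation generalises to any target. A minimal sketch reproducing Example 4 (the function names are illustrative):

```python
import math

def spearman_brown(rel, m):
    """Predicted reliability when scale length is multiplied by factor m."""
    return m * rel / (1 + (m - 1) * rel)

def length_factor(rel_current, rel_target):
    """Length multiplier m required to reach a target reliability."""
    return rel_target * (1 - rel_current) / (rel_current * (1 - rel_target))

m = length_factor(0.68, 0.80)            # approx. 1.88
total_items = math.ceil(5 * m)           # 10 items in total
print(total_items, round(spearman_brown(0.68, total_items / 5), 2))  # 10 0.81
```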
15. Common Mistakes and How to Avoid Them
Mistake 1: Reporting Alpha Without a Confidence Interval
Problem: Cronbach's alpha is a sample statistic with substantial sampling variability,
especially in small samples. Reporting only the point estimate gives a false sense of precision.
A point estimate of $\alpha = 0.80$ from a small sample could have a 95% CI as wide as [0.63, 0.89].
Solution: Always report the 95% confidence interval for all reliability coefficients.
Use the Feldt method for alpha or bootstrap CIs for omega.
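As a sketch of how such an interval can be computed, here is one common formulation of the Feldt interval, which treats $(1-\alpha)/(1-\hat{\alpha})$ as $F$-distributed with $n-1$ and $(n-1)(k-1)$ degrees of freedom; this is an assumed formulation, so verify it against your reference before reporting:

```python
from scipy.stats import f

def feldt_ci(alpha_hat, n, k, conf=0.95):
    """Feldt-style confidence interval for Cronbach's alpha.

    alpha_hat: sample alpha; n: number of respondents; k: number of items.
    """
    df1, df2 = n - 1, (n - 1) * (k - 1)
    tail = (1 - conf) / 2
    lower = 1 - (1 - alpha_hat) * f.ppf(1 - tail, df1, df2)
    upper = 1 - (1 - alpha_hat) * f.ppf(tail, df1, df2)
    return lower, upper

print(feldt_ci(0.80, n=20, k=4))   # noticeably wide in a small sample
```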
Mistake 2: Using Cronbach's Alpha as a Measure of Unidimensionality
Problem: Alpha measures internal consistency (how strongly items co-vary), not
unidimensionality (whether items measure a single construct). A multidimensional scale with
two positively correlated subscales can produce high alpha, even though it clearly violates
unidimensionality.
Solution: Always conduct an EFA or CFA to assess dimensionality before computing
reliability. Report alpha/omega separately for each unidimensional subscale.
Mistake 3: Blindly Deleting Items to Maximise Alpha
Problem: Removing items solely because deletion increases alpha capitalises on sampling
variability and can produce a shorter scale that performs worse in new samples. The alpha
gain observed in the current sample may not replicate, so the apparent improvement in reliability can be spurious.
Solution: Use a principled decision framework: only remove an item if (a) the CITC is
below 0.20, (b) the item has poor theoretical alignment with the construct, AND (c) the
item does not reduce content validity. Validate the revised scale in a new sample.
Mistake 4: Not Checking for Reverse-Coded Items
Problem: Including negatively-worded items without reverse coding them will produce
negative inter-item correlations and severely deflate alpha. A very low or even negative alpha is
often a sign that one or more items have not been reverse coded.
Solution: Before running reliability analysis, identify all negatively-worded items and
reverse-code them: $x_{\text{rev}} = (\text{scale max} + \text{scale min}) - x$ (e.g., $6 - x$ for 1–5 Likert items).
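A minimal sketch of reverse coding in pandas, assuming 1–5 Likert items and illustrative column names:

```python
import pandas as pd

def reverse_code(df, cols, scale_min=1, scale_max=5):
    """Reverse-code negatively worded items: x_rev = (max + min) - x."""
    out = df.copy()
    out[cols] = (scale_max + scale_min) - out[cols]
    return out

df = pd.DataFrame({"Q1": [1, 4, 5], "Q2": [5, 2, 1]})  # Q2 is negatively worded
df = reverse_code(df, cols=["Q2"])                     # Q2 becomes 1, 4, 5
```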
Mistake 5: Reporting Alpha for a Multidimensional Scale as a Whole
Problem: Computing a single alpha for a multidimensional questionnaire (e.g., a measure
with anxiety, depression, and stress subscales combined) is theoretically inappropriate and
can produce misleading reliability estimates.
Solution: Compute reliability separately for each subscale. If a composite total score
is used, estimate its reliability using Mosier's composite reliability formula (Section 13.5).
Mistake 6: Ignoring the Number of Items When Interpreting Alpha
Problem: Alpha increases automatically with more items (the Spearman-Brown effect). A 30-item
scale built from weak items can reach a respectable alpha (e.g., $\alpha \approx 0.85$ with $\bar{r} \approx 0.16$), while a
4-item scale with strong items might produce a lower one (e.g., $\alpha \approx 0.75$ with $\bar{r} \approx 0.43$). The 4-item scale may actually be more efficient and have better items.
Solution: Consider the average inter-item correlation ($\bar{r}$) alongside alpha.
Compare $\bar{r}$ across scales of different lengths, as it is not affected by scale length.
An ideal range is roughly $\bar{r} = 0.15$ to $0.50$.
Mistake 7: Confusing Inter-Rater Agreement With Inter-Rater Reliability
Problem: These two concepts are related but distinct:
- Reliability (ICC, consistency): Do raters rank subjects in the same order?
- Agreement (ICC absolute, kappa): Do raters give the same actual values?
A rater who always scores 2 points higher than another has perfect reliability (same ordering) but poor agreement.
Solution: Select the appropriate ICC type based on the research question. Use absolute agreement ICC when raters must be interchangeable. Use consistency ICC when only the ranking matters and systematic differences between raters are acceptable.
Mistake 8: Using Percent Agreement Instead of Cohen's Kappa
Problem: Percent agreement does not correct for chance agreement. With a binary rating
where 90% of cases fall in one category, two raters randomly agreeing with base rates would
achieve 82% agreement by chance, making 85% agreement seem impressive when it is barely
above chance.
Solution: Always report Cohen's Kappa (or Fleiss' Kappa for multiple raters) alongside
percent agreement. Never interpret percent agreement alone.
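The chance-agreement arithmetic from the example above, as a short sketch (a two-category Cohen's kappa computed from agreement proportions; in practice you could also use `sklearn.metrics.cohen_kappa_score` on the raw ratings):

```python
def cohens_kappa(p_observed, p_chance):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    return (p_observed - p_chance) / (1 - p_chance)

# Both raters use a 90/10 base rate, so chance agreement = 0.9^2 + 0.1^2 = 0.82
p_e = 0.9 ** 2 + 0.1 ** 2
print(round(cohens_kappa(0.85, p_e), 2))  # 0.17: slight agreement despite 85% raw agreement
```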
Mistake 9: Applying Cronbach's Alpha to Subscale Scores (Not Item-Level Data)
Problem: Computing alpha using subscale total scores (rather than individual item scores)
as the input produces a composite alpha estimate that is not the same as the reliability
of the total scale and is not interpretable as a standard reliability coefficient.
Solution: Always compute reliability from item-level data (each item in a separate
column), not from subscale totals.
Mistake 10: Interpreting Alpha of 0.95 as "Better" Than Alpha of 0.85
Problem: Very high alpha ($\alpha > 0.95$) is often a sign of item redundancy — items are so
similar that they provide almost no unique measurement information. This wastes respondent
time without improving construct coverage.
Solution: Target $\alpha$ in the 0.80–0.90 range for most research scales. If alpha exceeds 0.95 with
many items, consider reducing scale length by removing the most redundant items (lowest
unique information) while maintaining acceptable reliability.
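Redundant pairs can be flagged directly from the inter-item correlation matrix. A minimal sketch, assuming item-level data in a pandas DataFrame and an illustrative $r > 0.80$ cutoff:

```python
import pandas as pd

def redundant_pairs(items: pd.DataFrame, cutoff: float = 0.80):
    """List item pairs whose inter-item correlation exceeds the cutoff."""
    corr = items.corr()
    cols = list(corr.columns)
    return [
        (cols[i], cols[j], round(corr.iloc[i, j], 2))
        for i in range(len(cols))
        for j in range(i + 1, len(cols))   # upper triangle only
        if corr.iloc[i, j] > cutoff
    ]
```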
16. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| Cronbach's alpha is very low (e.g., $\alpha < 0.5$) | Reverse-coded items not recoded; items from different constructs mixed; items are too heterogeneous | Check and reverse-code negatively worded items; separate subscales; check for construct coherence |
| Alpha is negative | At least one item is very negatively correlated with others; reverse coding error | Examine inter-item correlations; check for items that need reverse coding |
| Alpha is very high ($> 0.95$) with many items | Item redundancy — too many items with near-identical wording | Inspect item pairs with $r > 0.80$; remove the most redundant items |
| Alpha exceeds omega | Correlated errors inflating alpha; model misspecification | Run CFA to check for correlated errors; use omega from properly specified model |
| One item's alpha-if-deleted greatly exceeds overall alpha | Item measures a different construct; possible reverse-coding error; item is ambiguous | Examine item content; check reverse coding; consider removing from scale |
| All corrected item-total correlations are near zero | Items are unrelated to each other; multidimensional scale being treated as unidimensional | Run EFA; split into subscales; reconsider construct definition |
| Negative corrected item-total correlation | Item is negatively related to the construct; reverse coding needed | Reverse-code the item and re-run |
| ICC very low ($< 0.50$) despite a large F-ratio | Raters highly inconsistent; training issue | Re-train raters; clarify rating criteria; pilot coding manual |
| ICC consistency much higher than absolute agreement | Systematic rater bias (one rater consistently rates higher/lower) | Identify the biased rater; re-calibrate; consider rater re-training |
| Cohen's Kappa is very low despite high percent agreement | High base-rate of one category (prevalence paradox) | Report both statistics; use PABAK (prevalence and bias adjusted kappa) |
| CFA does not converge for omega | Very small sample; near-perfect correlations; Heywood case | Increase sample size; reduce number of items; check for duplicate items |
| SEM is very large | Low reliability and/or high scale variance | Improve reliability; report SEM explicitly in all clinical applications |
| Omega hierarchical approaches zero | Essentially no general factor; scale is fully multidimensional | Use subscale scores rather than total; report subscale-specific omega |
| Spearman-Brown predicts a very large number of items needed | Baseline reliability is very low; items are poor indicators | Redesign items; collect new pilot data; consider different item format |
17. Quick Reference Cheat Sheet
Core Equations
| Formula | Description |
|---|---|
| $X = T + E$ | Classical Test Theory model |
| $\rho_{XX'} = \sigma_T^2 / \sigma_X^2$ | Reliability coefficient (population) |
| $SEM = \sigma_X \sqrt{1 - \rho_{XX'}}$ | Standard Error of Measurement |
| $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_X^2}\right)$ | Cronbach's alpha |
| $\alpha_{std} = \frac{k\bar{r}}{1 + (k-1)\bar{r}}$ | Standardised alpha |
| $\rho_{new} = \frac{m\rho}{1 + (m-1)\rho}$ | Spearman-Brown prophecy |
| $m = \frac{\rho_{target}(1 - \rho_{current})}{\rho_{current}(1 - \rho_{target})}$ | Items needed for target reliability |
| $r_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX'} r_{YY'}}}$ | Correction for attenuation |
| $\omega_t = \frac{(\sum \lambda_i)^2}{(\sum \lambda_i)^2 + \sum \theta_i}$ | McDonald's omega total |
| $\omega_h = \frac{(\sum \lambda_{g,i})^2}{(\sum \lambda_{g,i})^2 + \sum_s (\sum_{i \in s} \lambda_{s,i})^2 + \sum \theta_i}$ | Omega hierarchical |
| $r_{SB} = \frac{2r_{hh}}{1 + r_{hh}}$ | Split-half (Spearman-Brown corrected) |
| $\kappa = \frac{p_o - p_e}{1 - p_e}$ | Cohen's Kappa |
| $ICC(3,1) = \frac{MS_R - MS_E}{MS_R + (k-1)MS_E}$ | ICC consistency (two-way mixed) |
| $ICC(2,1) = \frac{MS_R - MS_E}{MS_R + (k-1)MS_E + \frac{k}{n}(MS_C - MS_E)}$ | ICC absolute agreement (two-way random) |
Reliability Benchmarks
| Coefficient | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Alpha / Omega | $< 0.70$ | $0.70$–$0.79$ | $0.80$–$0.89$ | $\geq 0.90$ |
| ICC (research) | $< 0.50$ | $0.50$–$0.74$ | $0.75$–$0.89$ | $\geq 0.90$ |
| Cohen's Kappa | $< 0.40$ | $0.40$–$0.59$ | $0.60$–$0.79$ | $\geq 0.80$ |
| CITC | $< 0.20$ | $0.20$–$0.29$ | $0.30$–$0.49$ | $\geq 0.50$ |
Reliability Type Selection Guide
| Scenario | Recommended Method |
|---|---|
| Multi-item Likert scale, unidimensional | McDonald's omega (preferred) or Cronbach's alpha |
| Multi-item scale, multidimensional | Omega hierarchical (bifactor) + omega subscale |
| Binary scored items (correct/incorrect) | KR-20 |
| Ordinal scale (fewer than 5 response categories) | Ordinal alpha or ordinal omega (polychoric) |
| Two raters, nominal categories | Cohen's Kappa |
| Two raters, ordered categories | Weighted Kappa (linear or quadratic) |
| Three+ raters, nominal categories | Fleiss' Kappa |
| Two+ raters, continuous ratings | ICC (specify model and type) |
| Test-retest, continuous | ICC (two-way mixed, absolute agreement) |
| Multiple sources of error | Generalisability Theory (G-coefficient) |
Item Analysis Decision Rules
| Statistic | Threshold | Action |
|---|---|---|
| CITC | $< 0.20$ | Flag for removal or revision |
| CITC | Negative | Check reverse coding; flag for review |
| Alpha-if-deleted | Noticeably above overall $\alpha$ | Strong candidate for removal |
| Inter-item $r$ | $> 0.80$ | Redundancy — remove one of the pair |
| Item skewness | $|z| > 3.29$ | Severely skewed response distribution — review item |
| Item difficulty ($p$, binary) | $p < 0.20$ or $p > 0.80$ | Item too hard or too easy |
| Item discrimination ($D$) | $< 0.20$ | Poor discriminator — revise |
Minimum Reliability by Context
| Context | Minimum | Preferred |
|---|---|---|
| Exploratory / pilot research | 0.60 | 0.70 |
| Group-level research | 0.70 | 0.80 |
| Individual research decisions | 0.80 | 0.90 |
| Clinical / high-stakes decisions | 0.90 | 0.95 |
ICC Model Selection Guide
| Design | Raters | Measure | Recommended ICC |
|---|---|---|---|
| Each subject rated by a different random set of raters | Random | Single | ICC(1,1) |
| Each subject rated by a different random set of raters | Random | Mean of $k$ | ICC(1,$k$) |
| Same raters rate all subjects; generalise to all raters | Random | Single | ICC(2,1) |
| Same raters rate all subjects; generalise to all raters | Random | Mean of $k$ | ICC(2,$k$) |
| Same fixed raters; generalise to these raters only | Fixed | Single | ICC(3,1) |
| Same fixed raters; generalise to these raters only | Fixed | Mean of $k$ | ICC(3,$k$) |
This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Reliability Analysis using the DataStatPro application. For further reading, consult Revelle & Zinbarg's "Coefficients Alpha, Beta, Omega, and the glb" (2009), McDonald's "Test Theory: A Unified Treatment" (1999), Koo & Li's "A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research" (2016), or Shrout & Fleiss's "Intraclass Correlations: Uses in Assessing Rater Reliability" (1979). For feature requests or support, contact the DataStatPro team.