
Reliability Analysis

Comprehensive reference guide for measurement reliability and consistency assessment.

Reliability Analysis: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Reliability Analysis all the way through advanced estimation, evaluation, item analysis, and practical usage within the DataStatPro application. Whether you are encountering reliability analysis for the first time or looking to deepen your understanding of measurement quality, this guide builds your knowledge systematically from the ground up.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What is Reliability Analysis?
  3. The Mathematics Behind Reliability
  4. Assumptions of Reliability Analysis
  5. Types of Reliability
  6. Using the Reliability Analysis Component
  7. Cronbach's Alpha
  8. McDonald's Omega
  9. Split-Half Reliability
  10. Inter-Rater Reliability
  11. Item Analysis
  12. Model Fit and Evaluation
  13. Advanced Topics
  14. Worked Examples
  15. Common Mistakes and How to Avoid Them
  16. Troubleshooting
  17. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into reliability analysis, it is helpful to understand several foundational statistical and psychometric concepts. Each is briefly reviewed below.

1.1 Measurement and Scales

Measurement is the process of assigning numbers to objects or events according to rules. In social and behavioural sciences, we frequently measure latent constructs — unobservable psychological or social attributes such as intelligence, anxiety, depression, or customer satisfaction.

These constructs are measured indirectly through observable indicators (items or questions on a questionnaire). The quality of this measurement process is what reliability analysis evaluates.

Scales of measurement:

| Scale | Properties | Examples |
|---|---|---|
| Nominal | Categories only; no order | Gender, blood type, nationality |
| Ordinal | Ordered categories; unequal intervals | Likert scale responses (1–5), satisfaction ratings |
| Interval | Equal intervals; arbitrary zero | Temperature (°C), IQ scores |
| Ratio | Equal intervals; true zero | Height, weight, reaction time |

Most psychological questionnaires use ordinal scales (Likert items), though they are often treated as approximately interval for the purpose of reliability analysis.

1.2 Variance and Covariance

The variance of a variable X quantifies how spread out its values are around the mean:

\sigma^2_X = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2

The covariance between two variables X and Y measures how they vary together:

\sigma_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})

The correlation is the standardised covariance:

r_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}

Reliability analysis is fundamentally about analysing the covariance structure of a set of items — how strongly they co-vary with each other determines how reliably they measure the underlying construct.
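These three quantities are easy to compute directly. The following sketch (with made-up scores for two hypothetical items) verifies that the correlation is simply the covariance divided by the product of the standard deviations:

```python
import numpy as np

# Hypothetical scores for two items
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

var_x = x.var(ddof=1)                            # sample variance (n - 1 denominator)
cov_xy = np.cov(x, y)[0, 1]                      # sample covariance
r_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))  # standardised covariance

# Same value as the built-in correlation
assert np.isclose(r_xy, np.corrcoef(x, y)[0, 1])
```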

1.3 The Inter-Item Correlation Matrix

Given p items, the inter-item correlation matrix R is a p × p symmetric matrix with ones on the diagonal and the pairwise item correlations r_jk in the off-diagonal cells.

High inter-item correlations (typically r > 0.30) indicate that the items are measuring a common construct. Items with very low correlations (r < 0.10) with all others may be measuring something different and should be reviewed.

1.4 Composite Scores

A composite score (also called a total score or scale score) is the sum or average of responses across multiple items:

X_{\text{total}} = \sum_{j=1}^{p} X_j \quad \text{or} \quad \bar{X} = \frac{1}{p}\sum_{j=1}^{p} X_j

The reliability of a composite score depends on:

  1. The reliability of each individual item.
  2. The number of items (p) — more items generally means higher reliability.
  3. The average inter-item correlation — higher correlations mean higher reliability.

1.5 The Signal-to-Noise Analogy

Reliability can be thought of in terms of signal and noise:

\text{Reliability} = \frac{\text{Signal}}{\text{Signal} + \text{Noise}} = \frac{\sigma^2_T}{\sigma^2_X}

A perfectly reliable measure would have all signal and no noise (R = 1.0). In practice, all measurement has some noise, and reliability values above 0.70 are generally considered acceptable for research purposes.

1.6 True Score Theory (Brief Preview)

The concept of true scores comes from Classical Test Theory (CTT), which is the mathematical framework underlying most reliability analyses. In CTT, every observed score X_i is composed of a true score T_i and a random error E_i:

X_i = T_i + E_i

This deceptively simple equation is the foundation for all reliability estimation methods covered in this tutorial.
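The true score model is easy to see in simulation. The sketch below (illustrative numbers only) generates true scores and errors with known variances and recovers the reliability as the signal-to-noise ratio defined above:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Chosen so the population reliability is 9 / (9 + 4) ≈ 0.692
sigma_T, sigma_E = 3.0, 2.0
T = rng.normal(50, sigma_T, n)   # true scores
E = rng.normal(0, sigma_E, n)    # random measurement error
X = T + E                        # observed scores

reliability = T.var(ddof=1) / X.var(ddof=1)   # ≈ var(T) / var(X)
```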


2. What is Reliability Analysis?

2.1 The Core Question

Reliability analysis evaluates whether a measurement instrument (a questionnaire, scale, test, or rating system) produces consistent, stable, and reproducible results. It answers the fundamental question:

"If we measure the same thing again under the same conditions, will we get the same result?"

A reliable instrument gives similar scores across repeated administrations, different raters, and alternate sets of items measuring the same construct.

2.2 Reliability vs. Validity

Reliability and validity are the two cornerstones of measurement quality, but they are distinct:

| Property | Definition | Question Asked |
|---|---|---|
| Reliability | Consistency of measurement | "Does it measure consistently?" |
| Validity | Accuracy of measurement | "Does it measure what it claims to?" |

The critical relationship is:

Reliability is a necessary but not sufficient condition for validity.

A measure can be highly reliable but invalid (consistently measuring the wrong thing — like a miscalibrated scale that consistently reads 5 kg too heavy). A valid measure, by definition, must also be reliable (you cannot accurately measure something if the results are random).

Perfect Reliability + High Validity = Excellent measurement (target)
Perfect Reliability + Low Validity = Consistent but wrong (systematic bias)
Low Reliability + Any Validity = Impossible (random measurement cannot be valid)

2.3 The Role of Reliability in Research

Reliability affects research in critical ways:

Statistical power: Unreliable measures attenuate (reduce) observed correlations and effect sizes. The true correlation r_XY between constructs X and Y is related to the observed correlation r_obs by:

r_{obs} = r_{XY} \cdot \sqrt{R_{XX} \cdot R_{YY}}

Where R_XX and R_YY are the reliabilities of the two measures. Low reliability pulls the observed correlation below its true value, an effect called attenuation.

Precision of measurement: The Standard Error of Measurement (SEM) quantifies the uncertainty in an individual's score:

\text{SEM} = \sigma_X \sqrt{1 - R_{XX}}

Lower reliability → larger SEM → less precise measurement of individual scores.

Sample size requirements: Studies using unreliable measures require larger samples to achieve the same statistical power, because unreliability adds noise to all effect size estimates.

2.4 Real-World Applications


3. The Mathematics Behind Reliability

3.1 Classical Test Theory (CTT)

Classical Test Theory (CTT) is the dominant framework for reliability analysis in the social sciences. Its foundation is the true score model:

X_i = T_i + E_i

Where X_i is the observed score for person i, T_i is the person's true score, and E_i is random measurement error.

3.2 Assumptions of the True Score Model

The CTT model rests on four key mathematical assumptions:

  1. Linearity: X_i = T_i + E_i (observed = true + error, additively).
  2. Zero mean error: E(E_i) = 0 — measurement errors average to zero across many measurements.
  3. Uncorrelated error and true score: Cov(T_i, E_i) = 0 — error is unrelated to the person's true score.
  4. Uncorrelated errors across items: Cov(E_j, E_k) = 0 for j ≠ k — errors on different items are independent.

3.3 Variance Decomposition

Under the CTT assumptions, the variance of the observed score decomposes into two parts:

\sigma^2_X = \sigma^2_T + \sigma^2_E

Where σ²_T is the true score variance (the signal) and σ²_E is the error variance (the noise).

3.4 The Reliability Coefficient

The reliability coefficient ρ_XX is defined as the ratio of true score variance to total observed score variance:

\rho_{XX} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}

This is the population reliability — it ranges from 0 (completely unreliable) to 1 (perfectly reliable). In practice, reliability is estimated from sample data using methods such as Cronbach's alpha, McDonald's omega, or split-half reliability.

Key properties of ρ_XX: it equals the squared correlation between observed and true scores, it equals the correlation between two parallel forms of the test, and it is a property of scores in a particular population rather than of the test itself.

3.5 Standard Error of Measurement

The Standard Error of Measurement (SEM) quantifies the average error in an individual's observed score:

\text{SEM} = \sigma_X \sqrt{1 - \rho_{XX}}

The SEM defines a confidence interval around a person's observed score:

\text{True Score CI}_{95\%}: X_i \pm 1.96 \cdot \text{SEM}

For example, if σ_X = 10 and ρ_XX = 0.84:

\text{SEM} = 10\sqrt{1 - 0.84} = 10\sqrt{0.16} = 10 \times 0.40 = 4.0

A person scoring 65 has a 95% CI for their true score of 65 ± 1.96(4) = [57.2, 72.8].
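The SEM calculation is two lines of code. This sketch (the helper names are ours, not part of DataStatPro) reproduces the worked example:

```python
import math

def sem(sd, reliability):
    """Standard Error of Measurement: SD × sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def true_score_ci(observed, sd, reliability, z=1.96):
    """Approximate 95% confidence interval around an observed score."""
    half_width = z * sem(sd, reliability)
    return observed - half_width, observed + half_width

s = sem(10, 0.84)                      # 10 × sqrt(0.16) = 4.0
lo, hi = true_score_ci(65, 10, 0.84)   # (57.16, 72.84), i.e. [57.2, 72.8] rounded
```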

3.6 The Spearman-Brown Prophecy Formula

The Spearman-Brown Prophecy Formula predicts the reliability of a lengthened (or shortened) test. If the current test has reliability ρ_XX and is changed to n times its current length:

\rho_{XX}^{(n)} = \frac{n \cdot \rho_{XX}}{1 + (n-1) \cdot \rho_{XX}}

Where n is the multiplication factor (e.g., n = 2 means doubling the number of items).

Inverse formula — how many times longer must the test be to reach a target reliability ρ*?

n = \frac{\rho^*(1 - \rho_{XX})}{\rho_{XX}(1 - \rho^*)}

Example: Current reliability = 0.60. How many times longer must the test be to reach 0.80?

n = \frac{0.80(1 - 0.60)}{0.60(1 - 0.80)} = \frac{0.80 \times 0.40}{0.60 \times 0.20} = \frac{0.32}{0.12} = 2.67

The test must be approximately 2.67 times longer (about 167% more items).
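Both directions of the prophecy formula are easy to check numerically. The sketch below (function names are illustrative) reproduces the example and confirms the two formulas are inverses:

```python
def spearman_brown(rel, n):
    """Predicted reliability of a test n times its current length."""
    return n * rel / (1 + (n - 1) * rel)

def required_length_factor(rel, target):
    """How many times longer the test must be to reach the target reliability."""
    return target * (1 - rel) / (rel * (1 - target))

n = required_length_factor(0.60, 0.80)   # 2.67, as in the worked example
check = spearman_brown(0.60, n)          # lands back on 0.80
```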

3.7 Attenuation and Correction for Attenuation

The correction for attenuation formula estimates the true (disattenuated) correlation between two constructs from their observed correlation, correcting for the attenuation caused by measurement unreliability:

r^*_{XY} = \frac{r_{XY}}{\sqrt{\rho_{XX} \cdot \rho_{YY}}}

Where r_XY is the observed correlation, ρ_XX and ρ_YY are the reliabilities of the two measures, and r*_XY is the estimated correlation between the underlying constructs.

⚠️ The disattenuated correlation can exceed 1.0 if both reliabilities are low and the observed correlation is moderate or high. Values > 1.0 are inadmissible and indicate that the reliabilities or correlation are inaccurate. Treat disattenuated correlations with caution.
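A small sketch with hypothetical numbers, including a guard for the inadmissible case the warning describes:

```python
def disattenuate(r_obs, rel_x, rel_y):
    """Estimate the true correlation between constructs, correcting for unreliability."""
    r_true = r_obs / (rel_x * rel_y) ** 0.5
    if r_true > 1.0:
        # Inadmissible result: the reliabilities and/or correlation are suspect
        raise ValueError(f"disattenuated correlation {r_true:.3f} exceeds 1.0")
    return r_true

# Illustrative numbers: observed r = .42, reliabilities .80 and .70
r_true = disattenuate(0.42, 0.80, 0.70)   # 0.42 / sqrt(0.56) ≈ 0.561
```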

3.8 The Covariance Matrix of a Scale

For a p-item scale, the covariance matrix Σ is p × p, with the item variances σ²_j on the diagonal and the inter-item covariances σ_jk in the off-diagonal cells.

The total scale variance is the sum of all elements:

\sigma^2_X = \sum_{j=1}^{p}\sigma^2_j + 2\sum_{j < k}\sigma_{jk}

Or in matrix notation:

\sigma^2_X = \mathbf{1}^T \boldsymbol{\Sigma} \mathbf{1}

Where \mathbf{1} is a p × 1 vector of ones.

This covariance matrix is the fundamental input to Cronbach's alpha and McDonald's omega.
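In code, the matrix identity and the summation form give the same total scale variance. The sketch below uses a hypothetical 3-item covariance matrix:

```python
import numpy as np

# Hypothetical 3-item covariance matrix (variances on the diagonal)
Sigma = np.array([[1.44, 0.72, 0.60],
                  [0.72, 1.21, 0.55],
                  [0.60, 0.55, 1.00]])

ones = np.ones(Sigma.shape[0])
total_var_matrix = ones @ Sigma @ ones                       # 1' Σ 1
total_var_sum = Sigma.trace() + 2 * np.triu(Sigma, 1).sum()  # Σ variances + 2 Σ covariances
```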


4. Assumptions of Reliability Analysis

4.1 Unidimensionality

The most critical assumption for most reliability coefficients (especially Cronbach's alpha) is that all items in the scale measure a single underlying construct (unidimensionality).

Why it matters: Cronbach's alpha is a measure of internal consistency, not unidimensionality. A scale with multiple distinct dimensions (e.g., a scale with "anxiety" and "depression" items combined) can still produce a high alpha, but the alpha in that case is misleading — it does not reflect a single coherent construct.

How to check: run an exploratory factor analysis or parallel analysis on the items, or fit a one-factor confirmatory model and examine its fit, before interpreting a single reliability coefficient.

4.2 Tau-Equivalence (for Cronbach's Alpha)

Cronbach's alpha is theoretically justified only when items are tau-equivalent — meaning all items have equal factor loadings (equal true score variance), although their error variances may differ.

In practice, most scales are congeneric — items have different factor loadings (not tau-equivalent) — which means Cronbach's alpha underestimates the true reliability. McDonald's omega (Section 8) does not require tau-equivalence and is therefore more appropriate for congeneric scales.

4.3 Uncorrelated Errors

CTT assumes that measurement errors on different items are uncorrelated — the error on item j is independent of the error on item k. Correlated errors arise when items share near-identical wording, appear adjacent to one another, or share a common method (for example, a block of reverse-coded items).

Correlated errors inflate Cronbach's alpha and can produce reliability estimates that exceed the true reliability. When correlated errors are suspected, use a model-based approach (CFA) to explicitly model and account for them.

4.4 Continuous or Approximately Continuous Items

Standard reliability formulas assume that item scores are continuous (or at minimum, ordinal with many ordered categories treated as approximately continuous). For binary items (0/1 responses), use the Kuder-Richardson Formula 20 (KR-20), which is a special case of Cronbach's alpha. For ordinal items with 3–4 categories, use polychoric correlations as input (ordinal alpha or ordinal omega).

4.5 Adequate Sample Size

Reliability estimation requires a sufficient sample size for stable results:

| Reliability Statistic | Minimum n | Recommended n |
|---|---|---|
| Cronbach's alpha | 50 | ≥ 200 |
| McDonald's omega (CFA-based) | 100 | ≥ 200 |
| ICC (inter-rater) | 30 subjects | ≥ 50 subjects |
| Test-retest (Pearson/ICC) | 30 | ≥ 50 |
| Split-half | 50 | ≥ 100 |

⚠️ With small samples, reliability estimates are highly unstable and have wide confidence intervals. Always report a confidence interval alongside the point estimate.

4.6 No Extreme Outliers

Outliers in item responses can substantially distort covariances and variances, leading to inaccurate reliability estimates. Screen data for out-of-range values, extreme univariate responses, and multivariate outliers (for example, via Mahalanobis distance) before estimating reliability.


5. Types of Reliability

5.1 Internal Consistency Reliability

Internal consistency measures whether the items in a scale consistently measure the same construct. It is assessed using a single administration of the scale.

| Method | Basis | When to Use |
|---|---|---|
| Cronbach's Alpha (α) | Average inter-item covariance | Default for multi-item scales; assumes tau-equivalence |
| McDonald's Omega (ω) | Factor model (CFA-based) | Better for congeneric items; most recommended |
| Ordinal Alpha / Omega | Polychoric correlations | Ordinal items with ≤ 5 categories |
| KR-20 / KR-21 | Binary items | Dichotomous response scales (correct/incorrect) |
| Split-Half (Spearman-Brown) | Two halves of the scale | Quick estimate; now largely superseded |
| Greatest Lower Bound (GLB) | Maximisation over all splittings | Largest lower bound on internal consistency (see Section 9.4) |

5.2 Test-Retest Reliability (Stability)

Test-retest reliability assesses the temporal stability of a measure — whether the same people get the same scores when measured at two different time points.

\rho_{\text{test-retest}} = r(X_1, X_2)

Where X_1 and X_2 are scores from the same measure at Time 1 and Time 2.

Key considerations: the retest interval should be long enough to limit memory and practice effects but short enough that the construct itself has not changed, and the coefficient is only meaningful for constructs expected to be stable over the interval.

5.3 Inter-Rater Reliability (Agreement)

Inter-rater reliability assesses the degree to which different raters (judges, coders, or observers) agree in their assessments of the same subjects.

| Method | Data Type | When to Use |
|---|---|---|
| Percent Agreement | Nominal / Ordinal | Simple; ignores chance agreement |
| Cohen's Kappa (κ) | Nominal | Two raters; categorical ratings |
| Weighted Kappa (κ_w) | Ordinal | Two raters; ordered categories |
| Fleiss' Kappa | Nominal | Three or more raters |
| Intraclass Correlation (ICC) | Continuous / Interval | Two or more raters; continuous ratings |
| Krippendorff's Alpha | Any scale | Multiple raters; any data type |

5.4 Parallel-Forms Reliability (Alternate Forms)

Parallel-forms reliability (also called alternate-form or equivalent-form reliability) assesses agreement between two different versions of the same test administered at the same time:

\rho_{\text{parallel}} = r(X_{\text{Form A}}, X_{\text{Form B}})

When used:

This type is less common in everyday practice because developing two truly parallel forms is resource-intensive.

5.5 Summary: Choosing the Right Reliability Type

| Research Scenario | Recommended Type |
|---|---|
| Single-administration questionnaire (Likert items) | Internal consistency (alpha/omega) |
| Binary scored test (correct/incorrect) | KR-20 |
| Longitudinal measurement (same scale, two time points) | Test-retest ICC |
| Observational coding scheme (two coders) | Cohen's Kappa or ICC |
| Observational coding scheme (three+ coders) | Fleiss' Kappa or ICC |
| Continuous ratings by multiple raters | ICC |
| Two test forms used interchangeably | Parallel-forms r |
| Ordinal items (≤ 5 categories) | Ordinal alpha or omega |

6. Using the Reliability Analysis Component

The Reliability Analysis component in DataStatPro provides a complete workflow for evaluating the reliability of multi-item scales and rating systems.

Step-by-Step Guide

Step 1 — Select Dataset

Choose the dataset from the "Dataset" dropdown. Ensure:

💡 Tip: Use the DataStatPro data editor to reverse-code negatively worded items before running reliability analysis. For a 5-point scale, reverse coding transforms responses as X_reversed = (max + min) − X_original = 6 − X.

Step 2 — Select Scale Items

Select all variables (columns) that constitute the scale from the "Scale Items" dropdown. You can select multiple items. All selected items should measure the same construct, belong to the same subscale, and be scored in the same direction (reverse-code first where necessary).

⚠️ Important: Do not mix items from different subscales in a single reliability analysis. Run separate analyses for each subscale.

Step 3 — Select Reliability Method

Choose from the "Method" dropdown:

Step 4 — Select Model (for ICC)

If using ICC for inter-rater reliability, select the appropriate ICC model:

Step 5 — Select ICC Type (for ICC)

Step 6 — Select Confidence Level

Choose the confidence level for confidence intervals (default: 95%).

Step 7 — Display Options

Select which outputs to display:

Step 8 — Run the Analysis

Click "Run Reliability Analysis". The application will:

  1. Compute item descriptive statistics.
  2. Calculate the inter-item correlation matrix.
  3. Estimate the chosen reliability coefficient(s) and confidence intervals.
  4. Produce item-total statistics (corrected correlations, alpha/omega-if-item-deleted).
  5. Display all selected outputs and visualisations.

7. Cronbach's Alpha

7.1 Definition and Formula

Cronbach's alpha (α\alpha) is the most widely used measure of internal consistency reliability. It estimates the proportion of total scale variance attributable to the common factor shared by all items.

For a scale with p items and covariance matrix Σ:

\alpha = \frac{p}{p-1}\left(1 - \frac{\sum_{j=1}^{p}\sigma^2_j}{\sigma^2_X}\right)

Where σ²_j is the variance of item j, σ²_X is the variance of the total scale score, and p is the number of items.

Equivalent form using the average inter-item covariance c̄ and average item variance v̄:

\alpha = \frac{p \cdot \bar{c}}{\bar{v} + (p-1)\bar{c}}

This form clearly shows that alpha depends on two quantities:

  1. The average inter-item covariance c̄: how strongly items co-vary.
  2. The number of items p: more items → higher alpha (Spearman-Brown effect).

7.2 Standardised Alpha

When items are measured on different response scales (e.g., some items 1–5, others 1–7), it is appropriate to use the standardised alpha, which is based on the correlation matrix R rather than the covariance matrix:

\alpha_{\text{std}} = \frac{p \cdot \bar{r}}{1 + (p-1)\bar{r}}

Where r̄ is the average inter-item correlation.

This formula, known as the Spearman-Brown formula applied to the average inter-item correlation, shows that standardised alpha is solely determined by the number of items and the average correlation between them.

7.3 Cronbach's Alpha as a Lower Bound

Cronbach's alpha is a lower bound on the true reliability — it underestimates the true reliability under most conditions:

\alpha \leq \rho_{XX}

with equality holding only when items are tau-equivalent (all true score variances are equal, i.e., all factor loadings are equal). Because psychological items are almost never tau-equivalent in practice, alpha typically underestimates the true reliability.

This is why McDonald's omega (Section 8) is theoretically preferred — it is an exact estimate of reliability under the congeneric model.

7.4 The Confidence Interval for Alpha

The sampling distribution of Cronbach's alpha is complex, so confidence intervals are based on distributional approximations rather than a simple normal interval.

The Feldt (1965) method uses the ratio

F = \frac{1 - \hat{\alpha}}{1 - \alpha^*}

where \hat{\alpha} is the sample estimate and \alpha^* is the population value. This ratio follows an F-distribution with degrees of freedom df_1 = n - 1 and df_2 = (n-1)(p-1). The CI bounds are:

\alpha_L = 1 - (1 - \alpha) \cdot F_{1-\alpha/2,\, df_1,\, df_2}

\alpha_U = 1 - (1 - \alpha) \cdot F_{\alpha/2,\, df_1,\, df_2}

Where n is the sample size and p is the number of items.

💡 Always report the confidence interval alongside the point estimate of alpha. With small samples (n < 100), the confidence interval can be very wide (e.g., α = 0.80, 95% CI: [0.68, 0.89]), signalling considerable uncertainty in the estimate.
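The Feldt interval can be computed with any F-quantile function. A sketch using SciPy (assuming `scipy` is available; the inputs are illustrative):

```python
from scipy.stats import f

def feldt_ci(alpha_hat, n, p, level=0.95):
    """Feldt confidence interval for Cronbach's alpha (n respondents, p items)."""
    df1, df2 = n - 1, (n - 1) * (p - 1)
    tail = (1 - level) / 2
    lower = 1 - (1 - alpha_hat) * f.ppf(1 - tail, df1, df2)
    upper = 1 - (1 - alpha_hat) * f.ppf(tail, df1, df2)
    return lower, upper

# Illustrative: alpha = .80 from 100 respondents on a 10-item scale
lo, hi = feldt_ci(0.80, n=100, p=10)
```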

7.5 Interpreting Cronbach's Alpha

| α Value | Interpretation | Typical Use |
|---|---|---|
| ≥ 0.95 | Excellent — but may indicate redundancy | High-stakes clinical decisions |
| 0.90 – 0.94 | Excellent | High-stakes clinical decisions |
| 0.80 – 0.89 | Good | Most research applications |
| 0.70 – 0.79 | Acceptable | Exploratory research |
| 0.60 – 0.69 | Questionable — scale needs revision | Pilot studies only |
| 0.50 – 0.59 | Poor — major revision needed | Unacceptable for most purposes |
| < 0.50 | Unacceptable | Should not be used as a scale |

⚠️ Very high alpha (> 0.95) is not always desirable. It can indicate item redundancy — that items are so similar in wording that they add no unique information. Aim for 0.80–0.90 for most psychological scales, with not too many highly redundant items.

7.6 Alpha for Binary Items: KR-20 and KR-21

When all items are dichotomous (scored 0 or 1, as in knowledge tests), Cronbach's alpha reduces to the Kuder-Richardson Formula 20 (KR-20):

\text{KR-20} = \frac{p}{p-1}\left(1 - \frac{\sum_{j=1}^{p} p_j q_j}{\sigma^2_X}\right)

Where p_j is the proportion of respondents answering item j correctly, q_j = 1 − p_j, and σ²_X is the total test score variance.

When item difficulties are assumed equal (p_j = p̄ for all items), a simpler formula is the Kuder-Richardson Formula 21 (KR-21):

\text{KR-21} = \frac{p}{p-1}\left(1 - \frac{\bar{p}(1-\bar{p}) \cdot p}{\sigma^2_X}\right)

KR-21 is always ≤ KR-20. KR-20 is identical to Cronbach's alpha applied to binary data; KR-21 matches it only when all item difficulties are equal.
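A minimal KR-20 sketch for a respondents-by-items matrix of 0/1 scores (toy data). Note that p_j q_j is the item variance with an n denominator, so the total-score variance is computed with the same convention:

```python
import numpy as np

def kr20(scores):
    """KR-20 for an n-by-p matrix of 0/1 item scores."""
    scores = np.asarray(scores, dtype=float)
    n, p = scores.shape
    pj = scores.mean(axis=0)                      # proportion "correct" per item
    total_var = scores.sum(axis=1).var(ddof=0)    # total-score variance (n denominator)
    return (p / (p - 1)) * (1 - (pj * (1 - pj)).sum() / total_var)

# Toy data: 4 respondents, 3 binary items (illustrative)
X = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0],
     [0, 0, 0]]
kr = kr20(X)   # 0.6 for this toy matrix
```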

7.7 Item-Total Statistics and Alpha-if-Item-Deleted

The item-total statistics table provides four critical pieces of information for each item:

| Statistic | Description | Use |
|---|---|---|
| Scale Mean if Item Deleted | Mean of total score if this item is removed | Identifies items that skew the scale |
| Scale Variance if Item Deleted | Variance of total score if this item is removed | Identifies items that affect scale spread |
| Corrected Item-Total Correlation | Correlation between item score and total minus that item | Primary item quality indicator |
| Alpha if Item Deleted | Alpha of the remaining items if this item is removed | Shows whether removing the item improves alpha |

Corrected Item-Total Correlation (CITC):

r_{j(X-j)} = \frac{\text{Cov}(X_j, X_{\text{total}} - X_j)}{\sigma_j \cdot \sigma_{X-j}}

where the subscript (X−j) means the total score with item j removed. This correction prevents the artificial inflation that would result from correlating an item with a total that includes the item itself.

Interpretation of CITC:

| CITC | Interpretation |
|---|---|
| ≥ 0.50 | Excellent discriminator — strongly related to the construct |
| 0.30 – 0.49 | Good discriminator — acceptable item |
| 0.20 – 0.29 | Marginal — consider revision or removal |
| < 0.20 | Poor — item should be revised or removed |
| Negative | Item is negatively related to the scale — likely needs reverse coding or removal |

7.8 Worked Manual Calculation of Cronbach's Alpha

Suppose we have a 3-item scale with the following covariance matrix from n = 100 respondents:

\boldsymbol{\Sigma} = \begin{pmatrix} 1.44 & 0.72 & 0.60 \\ 0.72 & 1.21 & 0.55 \\ 0.60 & 0.55 & 1.00 \end{pmatrix}

Step 1 — Sum of item variances:

\sum_{j=1}^{3}\sigma^2_j = 1.44 + 1.21 + 1.00 = 3.65

Step 2 — Total scale variance:

\sigma^2_X = 1.44 + 1.21 + 1.00 + 2(0.72 + 0.60 + 0.55) = 3.65 + 2(1.87) = 3.65 + 3.74 = 7.39

Step 3 — Apply Cronbach's formula:

\alpha = \frac{3}{3-1}\left(1 - \frac{3.65}{7.39}\right) = \frac{3}{2}\left(1 - 0.494\right) = 1.5 \times 0.506 = 0.759

Interpretation: α = 0.759 — acceptable internal consistency for a 3-item scale.
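The three steps translate directly to code. This sketch reproduces the manual calculation:

```python
import numpy as np

# Covariance matrix from the worked example
Sigma = np.array([[1.44, 0.72, 0.60],
                  [0.72, 1.21, 0.55],
                  [0.60, 0.55, 1.00]])

p = Sigma.shape[0]
item_var_sum = Sigma.trace()    # Step 1: sum of item variances = 3.65
total_var = Sigma.sum()         # Step 2: total scale variance 1' Σ 1 = 7.39
alpha = (p / (p - 1)) * (1 - item_var_sum / total_var)   # Step 3: ≈ 0.759
```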


8. McDonald's Omega

8.1 Limitations of Cronbach's Alpha and the Case for Omega

While Cronbach's alpha is the most widely reported reliability index, it has several well-documented limitations:

  1. Assumes tau-equivalence — all items must have equal factor loadings. Violated in virtually all real scales.
  2. Underestimates true reliability when items are congeneric (unequal loadings).
  3. Can be artificially inflated by correlated errors or multidimensionality.
  4. Does not distinguish between general factor variance and group factor variance in multidimensional scales.

McDonald's omega (ω) overcomes these limitations by explicitly modelling the factor structure of the scale. It is now recommended by major methodologists as the preferred reliability index over Cronbach's alpha.

8.2 The Congeneric Model

McDonald's omega is based on the congeneric measurement model — a single-factor CFA where items can have different factor loadings (unlike the tau-equivalent model assumed by alpha):

X_j = \lambda_j F + \epsilon_j, \quad j = 1, 2, \dots, p

Where X_j is the score on item j, λ_j is item j's factor loading, F is the common factor (the latent construct), and ε_j is item j's unique error with variance θ_j.

The model-implied covariance matrix is:

\boldsymbol{\Sigma} = \boldsymbol{\lambda}\boldsymbol{\lambda}^T + \boldsymbol{\Theta}

Where \boldsymbol{\lambda} = (\lambda_1, \lambda_2, \dots, \lambda_p)^T and \boldsymbol{\Theta} = \text{diag}(\theta_1, \theta_2, \dots, \theta_p).

8.3 Omega Total (ω_t)

Omega total (ω_t) is the reliability of the total composite score from a congeneric single-factor model. It equals the squared correlation between the true score and the observed total score:

\omega_t = \frac{(\sum_{j=1}^{p}\lambda_j)^2}{(\sum_{j=1}^{p}\lambda_j)^2 + \sum_{j=1}^{p}\theta_j}

Where λ_j are the standardised factor loadings and θ_j = 1 − λ_j² are the unique variances (uniquenesses).

This formula has a clear interpretation: the numerator is the variance the items share through the common factor (the signal), the denominator adds the summed unique variances (the noise), so ω_t is the proportion of total score variance attributable to the common factor.

💡 Omega total is equivalent to Cronbach's alpha when items are tau-equivalent, and exceeds alpha when items are congeneric (unequal loadings). In practice, omega is usually somewhat higher than alpha.
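Given standardised loadings from a one-factor CFA, omega total is a one-liner. The loadings below are made up for illustration:

```python
def omega_total(loadings):
    """Omega total from standardised loadings (uniqueness = 1 - λ²)."""
    lam_sum = sum(loadings)
    theta_sum = sum(1 - l ** 2 for l in loadings)
    return lam_sum ** 2 / (lam_sum ** 2 + theta_sum)

# Hypothetical standardised loadings for a 4-item scale
omega = omega_total([0.8, 0.7, 0.6, 0.5])   # 6.76 / (6.76 + 2.26) ≈ 0.749
```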

8.4 Omega Hierarchical (ω_h)

For multidimensional scales with both a general factor and group-specific factors (bifactor structure), omega hierarchical (ω_h) quantifies the proportion of total score variance attributable to the general factor alone:

\omega_h = \frac{(\sum_{j=1}^{p}\lambda_{jg})^2}{\sigma^2_X}

Where λ_jg is the loading of item j on the general factor g (from a bifactor model), and σ²_X is the total scale variance (the sum of all elements of Σ).

Omega hierarchical subscale (ω_hs) is the proportion of variance in a subscale's total score attributable to the group-specific factor:

\omega_{hs} = \frac{(\sum_{j \in S}\lambda_{js})^2}{\sigma^2_{X_S}}

Where the sum is over items j in subscale S and λ_js is the loading on the group-specific factor s.

8.5 Comparison of Alpha and Omega

| Property | Cronbach's Alpha | McDonald's Omega (Total) |
|---|---|---|
| Assumes tau-equivalence | Yes | No |
| Appropriate for congeneric items | No (underestimates) | Yes (correct estimate) |
| Requires factor analysis | No | Yes (single-factor CFA) |
| Sensitive to multidimensionality | Yes (can inflate) | Partially |
| Can separate general/group factors | No | Yes (ω_h vs. ω_t) |
| Currently recommended by APA | Increasingly | Increasingly |
| Sensitivity to correlated errors | Inflated | Can model explicitly |

General rule: When items are tau-equivalent → alpha ≈ omega. When items are congeneric (different loadings) → omega > alpha. The difference between omega and alpha is larger when loadings vary more across items.

8.6 Interpreting Omega Values

The same benchmarks as Cronbach's alpha apply to omega:

| ω_t | Interpretation |
|---|---|
| ≥ 0.90 | Excellent reliability |
| 0.80 – 0.89 | Good reliability |
| 0.70 – 0.79 | Acceptable reliability |
| 0.60 – 0.69 | Questionable — revision needed |
| < 0.60 | Poor — major revision required |

For omega hierarchical (ω_h), which represents the reliability attributable only to the general factor:

| ω_h | Interpretation |
|---|---|
| ≥ 0.80 | Strong general factor; composite score is justified |
| 0.65 – 0.79 | Moderate general factor; composite score is defensible |
| 0.50 – 0.64 | Weak general factor; subscale scores may be preferable |
| < 0.50 | Very weak general factor; total score is not recommended |

8.7 The Ratio ω_h / ω_t (Explained Common Variance)

The ratio of omega hierarchical to omega total is sometimes called the ECV (Explained Common Variance) and quantifies how much of the reliable variance is attributable to the general factor vs. group factors:

\text{ECV} = \frac{\omega_h}{\omega_t}


9. Split-Half Reliability

9.1 The Split-Half Method

Split-half reliability estimates reliability by dividing the scale into two halves, computing the total score for each half, and correlating the two half-scores. This provides an estimate based on a single test administration (unlike test-retest reliability).

The Pearson correlation between the two half-scores (X_A and X_B) is:

r_{AB} = r(X_A, X_B)

However, this correlation estimates the reliability of a half-length test, not the full test. The Spearman-Brown correction is applied to estimate the reliability of the full test:

\rho_{XX}^{SB} = \frac{2r_{AB}}{1 + r_{AB}}
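A sketch of an odd-even split with the Spearman-Brown correction, applied to simulated one-factor data (all numbers illustrative):

```python
import numpy as np

def split_half(X, half_a, half_b):
    """Half-score correlation with the Spearman-Brown correction applied."""
    X = np.asarray(X, dtype=float)
    a = X[:, half_a].sum(axis=1)     # total score for Half A
    b = X[:, half_b].sum(axis=1)     # total score for Half B
    r_ab = np.corrcoef(a, b)[0, 1]   # reliability of a half-length test
    return 2 * r_ab / (1 + r_ab)     # full-length estimate

# Simulated one-factor data: 500 respondents, 6 items
rng = np.random.default_rng(1)
F = rng.normal(size=(500, 1))
X = F + 0.8 * rng.normal(size=(500, 6))

rel = split_half(X, [0, 2, 4], [1, 3, 5])   # odd-even split
```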

9.2 Methods for Splitting the Scale

| Method | How | Issue |
|---|---|---|
| Odd-Even Split | Odd-numbered items → Half A; even-numbered → Half B | Assumes order of items does not matter |
| First-Last Split | First p/2 items → Half A; last p/2 → Half B | Favours scales where item order is random |
| Random Split | Items randomly assigned to halves | More reproducible with many iterations |
| Matched-Random Split | Items matched on difficulty/content then split | Best for heterogeneous item sets |

⚠️ The split-half method gives different results depending on how the scale is split. This is a major weakness. Cronbach's alpha can be interpreted as the average of all possible split-half reliabilities — making it a more stable and preferred estimate. Split-half is primarily of historical interest today.

9.3 The Guttman Lambda Coefficients

The Guttman (1945) lambda coefficients are a family of reliability lower bounds. The most useful are:

Lambda 2 (λ_2): The tightest lower bound computable without factor analysis. With λ_1 = 1 - \sum_{j}\sigma^2_j / \sigma^2_X:

\lambda_2 = \lambda_1 + \frac{\sqrt{\frac{p}{p-1}\sum_{j \neq k}\sigma^2_{jk}}}{\sigma^2_X}

Lambda 4 (λ_4): The maximum split-half reliability over all possible splits (the greatest split-half). It equals Cronbach's alpha when items are tau-equivalent, and typically exceeds alpha for congeneric items.

Lambda 6 (\lambda_6): Based on the squared multiple correlations of each item with all others:

\lambda_6 = 1 - \frac{\sum_{j=1}^{p}(1 - R^2_j)}{\sigma^2_X}

Where R^2_j is the R^2 from regressing item j on all other items.
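The lambda coefficients can be computed directly from an item covariance matrix. A minimal Python sketch, using a hypothetical 3-item matrix with unit variances and equal covariances (a tau-equivalent case, so \lambda_2 coincides with alpha):

```python
# Guttman's lambda-1 and lambda-2 from an item covariance matrix.
import math

cov = [[1.0, 0.5, 0.5],   # hypothetical 3-item covariance matrix
       [0.5, 1.0, 0.5],
       [0.5, 0.5, 1.0]]

p = len(cov)
total_var = sum(cov[j][k] for j in range(p) for k in range(p))   # sigma^2_X
lambda1 = 1 - sum(cov[j][j] for j in range(p)) / total_var
c2 = sum(cov[j][k] ** 2 for j in range(p) for k in range(p) if j != k)
lambda2 = lambda1 + math.sqrt(p / (p - 1) * c2) / total_var

print(round(lambda1, 3), round(lambda2, 3))  # -> 0.5 0.75
```

For this matrix, standardized alpha is also 0.75, illustrating that \lambda_2 = \alpha under tau-equivalence; with unequal covariances \lambda_2 would exceed alpha.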

9.4 The Greatest Lower Bound (GLB)

The Greatest Lower Bound (GLB) is the largest possible lower bound on reliability, obtained by maximising the total error variance consistent with the observed covariance matrix:

\text{GLB} = 1 - \frac{\max\left(\sum_{j}\psi_j\right)}{\sigma^2_X}

Subject to the constraint that \boldsymbol{\Sigma} - \boldsymbol{\Psi} is positive semidefinite, where \boldsymbol{\Psi} = \text{diag}(\psi_1, \dots, \psi_p).

\text{GLB} \geq \lambda_4 \geq \lambda_2 \geq \alpha — the GLB is never less than alpha or any other lower bound. However, the GLB can be severely positively biased in small samples and may overestimate reliability more than omega. Use with caution when n < 500.


10. Inter-Rater Reliability

10.1 Why Inter-Rater Reliability Matters

When data collection relies on human judgment — observations, coding of qualitative data, clinical assessments, interview ratings — different raters may disagree. Inter-rater reliability (IRR) quantifies the degree of agreement between raters and determines whether ratings can be trusted as objective.

Low IRR suggests:

- Ambiguous or poorly defined rating criteria
- Insufficient rater training or calibration
- Excessive subjectivity in the judgment task

10.2 Percent Agreement

The simplest IRR measure is percent agreement — the proportion of ratings on which all raters agree:

PA = \frac{\text{Number of agreements}}{N} \times 100\%

Critical limitation: Percent agreement does not correct for the level of agreement expected purely by chance. Two raters randomly assigning ratings to binary categories (50/50 split) would agree about 50% of the time by chance alone.

10.3 Cohen's Kappa (\kappa)

Cohen's Kappa corrects for chance agreement. For two raters assigning subjects to k categories:

\kappa = \frac{P_o - P_e}{1 - P_e}

Where:

- P_o = observed proportion of agreement
- P_e = proportion of agreement expected by chance

Computing P_e: For a k \times k contingency table with row proportions p_{i+} and column proportions p_{+j}:

P_e = \sum_{i=1}^{k} p_{i+} \cdot p_{+i}

Example for a 2-category rating (agreement / disagreement):

Suppose two raters classify 100 subjects as "Case" or "Non-Case":

|  | Rater B: Case | Rater B: Non-Case | Row Total |
| --- | --- | --- | --- |
| Rater A: Case | 45 | 10 | 55 |
| Rater A: Non-Case | 5 | 40 | 45 |
| Column Total | 50 | 50 | 100 |

P_o = \frac{45 + 40}{100} = 0.85

P_e = \frac{55}{100} \times \frac{50}{100} + \frac{45}{100} \times \frac{50}{100} = 0.275 + 0.225 = 0.50

\kappa = \frac{0.85 - 0.50}{1 - 0.50} = \frac{0.35}{0.50} = 0.70

Standard Error of Kappa:

SE(\kappa) \approx \sqrt{\frac{P_o(1 - P_o)}{N(1 - P_e)^2}}

95% Confidence Interval:

\kappa \pm 1.96 \cdot SE(\kappa)
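Putting the pieces together, a short Python sketch computes kappa, its approximate SE, and the 95% CI for the worked 2×2 table above:

```python
# Cohen's kappa, approximate SE, and 95% CI for a 2x2 rating table.
import math

table = [[45, 10],   # Rater A: Case     -> Rater B: Case / Non-Case
         [5, 40]]    # Rater A: Non-Case -> Rater B: Case / Non-Case

n = sum(sum(row) for row in table)
p_o = sum(table[i][i] for i in range(2)) / n          # observed agreement
row = [sum(r) / n for r in table]                      # row proportions
col = [sum(table[i][j] for i in range(2)) / n for j in range(2)]
p_e = sum(row[i] * col[i] for i in range(2))           # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
ci = (kappa - 1.96 * se, kappa + 1.96 * se)
print(round(kappa, 2))  # -> 0.7
```

The CI works out to roughly [0.56, 0.84], matching the hand computation in the text.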

10.4 Interpreting Cohen's Kappa

| \kappa | Strength of Agreement |
| --- | --- |
| < 0 | Less than chance agreement (worse than random) |
| 0.00 - 0.20 | Slight |
| 0.21 - 0.40 | Fair |
| 0.41 - 0.60 | Moderate |
| 0.61 - 0.80 | Substantial |
| 0.81 - 1.00 | Almost Perfect |

(Landis & Koch, 1977 benchmarks — widely used but not universally accepted)

⚠️ Kappa is sensitive to the prevalence (base rate) of each category. When one category is very rare, even high percent agreement can yield a very low kappa. Always report percent agreement alongside kappa.

10.5 Weighted Kappa (\kappa_w)

For ordinal rating scales (where disagreements of different magnitudes are not equally serious), weighted kappa assigns weights based on the severity of disagreement:

\kappa_w = \frac{\sum_{i}\sum_{j} w_{ij}(p_{ij} - p_{ij}^e)}{\sum_{i}\sum_{j} w_{ij}(p_{ij}^m - p_{ij}^e)}

Where w_{ij} are the weights, p_{ij} are the observed proportions, and p_{ij}^e are the expected proportions under independence.

Common weighting schemes:

| Weight Type | w_{ij} Formula | Suitable For |
| --- | --- | --- |
| Linear weights | 1 - \frac{\lvert i-j\rvert}{k-1} | Each step of disagreement equally serious |
| Quadratic weights | 1 - \left(\frac{i-j}{k-1}\right)^2 | Larger disagreements disproportionately serious |

Note: Weighted kappa with quadratic weights is asymptotically equivalent to the intraclass correlation coefficient (Fleiss & Cohen, 1973; see Section 10.6).

10.6 Intraclass Correlation Coefficient (ICC)

The Intraclass Correlation Coefficient (ICC) is the most versatile measure of inter-rater reliability for continuous or interval-scale ratings. Unlike Cohen's Kappa, the ICC can handle:

- Two or more raters
- Continuous (not just categorical) ratings
- Both consistency and absolute-agreement definitions of reliability

The ICC is based on a one-way or two-way ANOVA decomposition of the total variance in ratings into:

- Between-subject variance (true differences among subjects)
- Rater variance (systematic differences between raters)
- Residual error variance

Six standard ICC models (Shrout & Fleiss, 1979; McGraw & Wong, 1996):

| ICC Model | Notation | Rater Design | Measures |
| --- | --- | --- | --- |
| One-Way Random, Single | ICC(1,1) | Each subject rated by a different random rater | Consistency |
| One-Way Random, Mean of k | ICC(1,k) | Each subject rated by different raters; average used | Consistency |
| Two-Way Random, Single | ICC(2,1) | Same raters rate all subjects; raters random | Absolute agreement |
| Two-Way Random, Mean of k | ICC(2,k) | Same raters rate all; raters random; average used | Absolute agreement |
| Two-Way Mixed, Single | ICC(3,1) | Same fixed raters; single rating used | Consistency |
| Two-Way Mixed, Mean of k | ICC(3,k) | Same fixed raters; average of k ratings used | Consistency |

ICC Formulas (Two-Way Mixed Model):

For n subjects and k raters, with mean squares from a two-way ANOVA:

\text{MS}_{\text{between}} = \frac{\text{SS}_{\text{between}}}{n-1}

\text{MS}_{\text{within}} = \frac{\text{SS}_{\text{within}}}{n(k-1)}

\text{MS}_{\text{rater}} = \frac{\text{SS}_{\text{rater}}}{k-1}

\text{MS}_{\text{error}} = \frac{\text{SS}_{\text{error}}}{(n-1)(k-1)}

ICC(3,1) — Consistency:

\text{ICC}(3,1) = \frac{\text{MS}_{\text{between}} - \text{MS}_{\text{error}}}{\text{MS}_{\text{between}} + (k-1)\text{MS}_{\text{error}}}

ICC(2,1) — Absolute Agreement:

\text{ICC}(2,1) = \frac{\text{MS}_{\text{between}} - \text{MS}_{\text{error}}}{\text{MS}_{\text{between}} + (k-1)\text{MS}_{\text{error}} + \frac{k}{n}(\text{MS}_{\text{rater}} - \text{MS}_{\text{error}})}

For k averaged ratings — ICC(3,k) — Consistency:

\text{ICC}(3,k) = \frac{\text{MS}_{\text{between}} - \text{MS}_{\text{error}}}{\text{MS}_{\text{between}}}
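These ICC formulas translate directly into code. A minimal sketch using hypothetical mean squares (n = 20 subjects, k = 2 raters — not data from this tutorial):

```python
# ICC variants from two-way ANOVA mean squares (formulas above).

def icc_from_ms(ms_between, ms_error, ms_rater, n, k):
    icc31 = (ms_between - ms_error) / (ms_between + (k - 1) * ms_error)
    icc21 = (ms_between - ms_error) / (
        ms_between + (k - 1) * ms_error + k / n * (ms_rater - ms_error))
    icc3k = (ms_between - ms_error) / ms_between
    return icc31, icc21, icc3k

# Hypothetical mean squares:
icc31, icc21, icc3k = icc_from_ms(10.0, 2.0, 4.0, n=20, k=2)
print(round(icc31, 3), round(icc21, 3), round(icc3k, 3))  # -> 0.667 0.656 0.8
```

As expected, the absolute-agreement ICC(2,1) is slightly below the consistency ICC(3,1) whenever MS_rater exceeds MS_error, and averaging over raters — ICC(3,k) — is the highest of the three.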

10.7 Confidence Intervals for ICC

The 95% CI for ICC is computed using the F-distribution:

F_L = F_{\alpha/2, df_1, df_2}, \quad F_U = F_{1-\alpha/2, df_1, df_2}

With df_1 = n - 1 and df_2 = (n-1)(k-1).

\text{ICC}_L = \frac{F_{\text{obs}}/F_U - 1}{F_{\text{obs}}/F_U + k - 1}

\text{ICC}_U = \frac{F_{\text{obs}}/F_L - 1}{F_{\text{obs}}/F_L + k - 1}

Where F_{\text{obs}} = \text{MS}_{\text{between}} / \text{MS}_{\text{error}}.

10.8 Interpreting ICC Values

| ICC | Reliability Quality |
| --- | --- |
| < 0.50 | Poor |
| 0.50 - 0.74 | Moderate |
| 0.75 - 0.90 | Good |
| > 0.90 | Excellent |

(Koo & Li, 2016 benchmarks — widely used in clinical research)

10.9 Fleiss' Kappa (Multiple Raters, Nominal Scale)

When more than two raters independently classify subjects into k categories, Fleiss' Kappa generalises Cohen's Kappa:

\kappa_F = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}

Where:

\bar{P} = \frac{1}{n \cdot r(r-1)} \sum_{i=1}^{n} \sum_{j=1}^{k} n_{ij}(n_{ij} - 1)

\bar{P}_e = \sum_{j=1}^{k} \left(\frac{\sum_{i=1}^{n} n_{ij}}{n \cdot r}\right)^2

With n = number of subjects, r = number of raters, n_{ij} = number of raters assigning subject i to category j.
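The two sums can be sketched compactly in Python. The input is a subject-by-category count matrix; the 3-subject, 3-rater, 2-category data below are hypothetical:

```python
# Fleiss' kappa for multiple raters, nominal categories.
# counts[i][j] = number of raters assigning subject i to category j.

def fleiss_kappa(counts):
    n = len(counts)                 # subjects
    r = sum(counts[0])              # raters per subject
    k = len(counts[0])              # categories
    p_bar = sum(c * (c - 1) for row in counts for c in row) / (n * r * (r - 1))
    p_j = [sum(row[j] for row in counts) / (n * r) for j in range(k)]
    p_e = sum(p ** 2 for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical: 3 subjects, 3 raters, 2 categories.
print(round(fleiss_kappa([[3, 0], [0, 3], [2, 1]]), 2))  # -> 0.55
```

Each row must sum to the same number of raters; with one disagreeing rater on the third subject, kappa drops from 1.0 to 0.55.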

10.10 Krippendorff's Alpha

Krippendorff's alpha (\alpha_K) is a versatile agreement measure that:

- Accommodates any number of raters
- Handles nominal, ordinal, interval, and ratio data
- Tolerates missing ratings

\alpha_K = 1 - \frac{D_o}{D_e}

Where:

- D_o = observed disagreement among ratings
- D_e = disagreement expected by chance

For ordinal data, the metric function d^2_{ck} is the squared difference between category ranks. For interval data, d^2_{ck} = (c - k)^2. For nominal data, d^2_{ck} = \mathbf{1}[c \neq k] (0 if equal, 1 if different).

Krippendorff recommends \alpha_K \geq 0.80 for reliable conclusions, with 0.67 \leq \alpha_K < 0.80 allowing only tentative conclusions.


11. Item Analysis

11.1 What is Item Analysis?

Item analysis is the process of evaluating the statistical properties of individual items to determine which items contribute positively to the scale's reliability and validity, and which should be revised or removed.

Item analysis is typically performed as part of reliability analysis and is crucial during:

- Initial scale development and pilot testing
- Scale refinement and shortening
- Validation of the scale in new populations

11.2 Item Difficulty (for Knowledge Tests)

For knowledge tests with correct/incorrect scoring, item difficulty (the p-value) is the proportion of respondents who answer the item correctly:

p_j = \frac{\text{Number correct on item } j}{N}

| Difficulty (p_j) | Interpretation |
| --- | --- |
| < 0.20 | Very difficult — too hard for most |
| 0.20 - 0.39 | Difficult |
| 0.40 - 0.60 | Moderate — optimal for discrimination |
| 0.61 - 0.80 | Easy |
| > 0.80 | Very easy — too easy for most |

Items with p_j \approx 0.50 provide the most information about differences between individuals (maximum variance). However, items at extremes (p_j < 0.20 or p_j > 0.80) have low variance and contribute little to reliability.

11.3 Item Discrimination Index

The item discrimination index (D) measures how well an item differentiates between high-scoring and low-scoring respondents. It is computed using the extreme groups method:

  1. Divide respondents into the top 27% (High group, H) and bottom 27% (Low group, L) based on total score.
  2. Compute the proportion correct in each group: p_H and p_L.
  3. Compute: D = p_H - p_L

| D | Interpretation |
| --- | --- |
| \geq 0.40 | Excellent discriminator |
| 0.30 - 0.39 | Good discriminator |
| 0.20 - 0.29 | Marginal — consider revision |
| < 0.20 | Poor — revise or remove |
| Negative | Perverse — high scorers do worse (review carefully) |
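The three steps above can be sketched as follows. The 10-respondent data are hypothetical; with n = 10 the 27% extreme groups round to 3 respondents each:

```python
# Item difficulty and the extreme-groups discrimination index D.
# `item` holds 0/1 scores on one item; `total` the respondents' total scores.

def difficulty(item):
    return sum(item) / len(item)

def discrimination(item, total, frac=0.27):
    g = max(1, round(frac * len(total)))                 # extreme-group size
    order = sorted(range(len(total)), key=lambda i: total[i])
    low, high = order[:g], order[-g:]                    # bottom / top scorers
    p_low = sum(item[i] for i in low) / g
    p_high = sum(item[i] for i in high) / g
    return p_high - p_low

# Hypothetical data for 10 respondents:
item  = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
total = [28, 12, 25, 30, 10, 15, 27, 22, 14, 29]
print(round(difficulty(item), 2), discrimination(item, total))  # -> 0.6 1.0
```

Here the item is moderately difficult (p = 0.60) and a perfect discriminator in this tiny sample: all top scorers answer correctly and no bottom scorer does.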

11.4 Item-Rest Correlation (Corrected Item-Total Correlation)

As introduced in Section 7.7, the corrected item-total correlation (CITC) is the primary item quality indicator for Likert-type scales. It is equivalent to the item discrimination index for continuous scales and should be:

- At least 0.30 (items below 0.20 are candidates for removal)
- Ideally in the 0.40 – 0.70 range — high enough to cohere with the scale, not so high as to be redundant

11.5 Inter-Item Correlation Analysis

Beyond item-total correlations, examining the inter-item correlation matrix reveals:

- Redundant item pairs (correlations that are too high)
- Items that do not belong to the construct (correlations that are too low)
- The average inter-item correlation, a length-free index of item homogeneity

Items that are too highly correlated (r > 0.80): May be redundant — they are essentially asking the same question twice and add little unique information. One should be removed or both should be revised to be more distinct.

Items that are too weakly correlated (r < 0.10 with most other items): Likely measuring a different construct. These items should be examined theoretically and may need to be placed in a different subscale or removed.

Average inter-item correlation: Values of \bar{r} = 0.20 to 0.40 are typically considered optimal. Very high average correlations (\bar{r} > 0.60) with many items indicate excessive redundancy.

11.6 Floor and Ceiling Effects

Floor effects occur when most respondents score near the minimum possible score. Ceiling effects occur when most respondents score near the maximum possible score.

Both effects:

- Reduce item and scale variance
- Attenuate correlations with other measures
- Limit the scale's ability to detect change over time

Check for floor/ceiling effects by inspecting:

- The percentage of respondents at the minimum or maximum score (more than about 15% is a common warning threshold)
- Item and scale histograms
- Skewness statistics

11.7 Item Response Curves

For knowledge tests (binary items), the item response curve (IRC) or item characteristic curve (ICC) plots the probability of a correct response as a function of total test score.

A well-functioning item should show a monotonically increasing S-shaped curve — the probability of a correct answer should consistently increase with the total score. Items that show a non-monotonic curve (e.g., high scorers are less likely to answer correctly than medium scorers) are flagged as problematic discriminators.

11.8 The Item Analysis Decision Framework

For each item, apply the following checks in order:

  1. Is CITC < 0.20? Yes → flag for removal or revision. No → continue.
  2. Is any inter-item correlation > 0.80 with another item? Yes → flag for redundancy; remove one of the pair. No → continue.
  3. Does alpha-if-deleted substantially exceed the current alpha (by > 0.05)? Yes → strong candidate for removal. No → retain item.
  4. Is skewness > |2| or kurtosis > |7| (floor/ceiling effects)? Yes → consider item revision or transformation. No → retain item with confidence.


12. Model Fit and Evaluation

12.1 Reporting Reliability: Minimum Requirements

At minimum, a reliability report should include:

  1. The reliability coefficient (alpha, omega, ICC, kappa, etc.).
  2. The 95% confidence interval around the coefficient.
  3. The number of items included in the analysis.
  4. The sample size (n).
  5. The method used (Cronbach's alpha, McDonald's omega, ICC model, etc.).
  6. Item-level statistics (means, SDs, corrected item-total correlations).

Example APA-style reporting:

"Internal consistency of the 10-item Emotional Regulation Scale was evaluated using McDonald's omega (ω\omega), as items were expected to have unequal factor loadings (congeneric model). Omega total was ωt=0.87\omega_t = 0.87 (95% CI [0.84, 0.90]), indicating good internal consistency. Omega hierarchical was ωh=0.74\omega_h = 0.74, suggesting that the majority of reliable variance was attributable to the general factor. Corrected item-total correlations ranged from 0.41 to 0.68, with all items exceeding the acceptable threshold of 0.30."

12.2 Scale-Level Statistics

Beyond the reliability coefficient, the following scale-level statistics should be computed and reported:

| Statistic | Formula | Interpretation |
| --- | --- | --- |
| Scale Mean | \bar{X} = \frac{1}{n}\sum_i X_i^{\text{total}} | Average composite score |
| Scale Variance | \sigma^2_X | Spread of composite scores |
| Scale SD | \sigma_X | SD of composite scores |
| SEM | \sigma_X\sqrt{1-\rho_{XX}} | Average error in individual scores |
| Range | Max − Min | Spread of composite scores observed |
| Skewness & Kurtosis | Standard formulas | Check normality of composite |

12.3 Assessing the Factor Structure Before Reliability Analysis

Before running reliability analysis, it is best practice to verify the factor structure:

Step 1 — Exploratory Factor Analysis (EFA):

- Examine eigenvalues, the scree plot, and (preferably) parallel analysis to determine the number of factors.
- A single dominant factor supports treating the scale as unidimensional.

Step 2 — Confirmatory Factor Analysis (CFA):

- Fit a one-factor model and check fit indices (e.g., CFI/TLI \geq 0.95, RMSEA \leq 0.06, SRMR \leq 0.08).
- Use the CFA loadings to compute McDonald's omega.

12.4 Evaluating Convergent and Discriminant Validity

Convergent validity: The scale should correlate strongly with other measures of the same or similar constructs (theoretically related measures). Typically evaluated using Pearson or Spearman correlations.

Discriminant validity: The scale should correlate weakly with measures of theoretically unrelated constructs.

Using reliability information, the disattenuated correlation (Section 3.7) provides the best estimate of the true relationship between constructs, corrected for measurement error.

12.5 Minimum Acceptable Reliability by Context

The required level of reliability depends on the stakes and purpose of measurement:

| Context | Minimum Acceptable \rho_{XX} | Preferred |
| --- | --- | --- |
| Group-level research (comparing means) | 0.70 | \geq 0.80 |
| Individual-level decisions (clinical) | 0.90 | \geq 0.95 |
| High-stakes testing (licensure) | 0.90 | \geq 0.95 |
| Pilot / exploratory research | 0.60 | \geq 0.70 |
| Inter-rater agreement (research) | 0.70 ICC | \geq 0.80 ICC |
| Inter-rater agreement (clinical) | 0.90 ICC | \geq 0.95 ICC |

13. Advanced Topics

13.1 Ordinal Reliability: Polychoric Correlations

When scale items use fewer than 5 ordinal categories (e.g., a 3-point or 4-point Likert scale), treating Likert responses as continuous can distort covariances and underestimate reliability. A more appropriate approach uses polychoric correlations as the input matrix.

The polychoric correlation between two ordinal items j and k estimates the correlation between the underlying continuous latent variables that generate the observed ordinal responses. It is estimated by maximum likelihood, assuming bivariate normality of the latent variables.

Ordinal alpha is Cronbach's alpha computed on the polychoric correlation matrix:

\alpha_{\text{ordinal}} = \frac{p \cdot \bar{r}_{\text{poly}}}{1 + (p-1)\bar{r}_{\text{poly}}}

Ordinal omega is McDonald's omega estimated from a factor model fit to the polychoric correlation matrix (using WLSMV or similar ordinal estimator in CFA).

Ordinal alpha and omega are typically higher than their Pearson-based counterparts for coarsely-rated Likert items, because polychoric correlations are less attenuated by the coarse ordinal scaling.
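Given a polychoric correlation matrix, ordinal alpha follows the standardised-alpha formula. The matrix below is an assumed, hypothetical estimate for four items — in practice it would come from an ML polychoric routine:

```python
# Ordinal alpha from a (hypothetical) polychoric correlation matrix.

poly = [[1.0, 0.5, 0.5, 0.5],
        [0.5, 1.0, 0.5, 0.5],
        [0.5, 0.5, 1.0, 0.5],
        [0.5, 0.5, 0.5, 1.0]]

p = len(poly)
off = [poly[j][k] for j in range(p) for k in range(p) if j != k]
r_bar = sum(off) / len(off)                     # mean polychoric correlation
alpha_ordinal = p * r_bar / (1 + (p - 1) * r_bar)
print(round(alpha_ordinal, 2))  # -> 0.8
```

Because polychoric correlations are typically larger than the corresponding Pearson correlations on coarse Likert items, the same code applied to the Pearson matrix would generally yield a lower alpha.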

13.2 Reliability in Generalisability Theory (G-Theory)

Generalisability Theory (G-Theory) extends CTT by recognising that measurement error can have multiple sources (facets). In a rating study, error might come from:

- The particular items used
- The particular raters
- The occasions of measurement

A G-study uses a fully crossed (or nested) ANOVA to partition the total variance into components corresponding to each facet and their interactions:

\sigma^2_{X_{ijk}} = \sigma^2_p + \sigma^2_i + \sigma^2_r + \sigma^2_{pi} + \sigma^2_{pr} + \sigma^2_{ir} + \sigma^2_{pir}

Where p = persons, i = items, r = raters.

The Generalisation Coefficient (G-coefficient) is analogous to reliability:

G = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{\Delta}}

Where \sigma^2_{\Delta} is the error variance appropriate to the measurement design.

A D-study uses the G-study variance components to predict how reliability would change if the number of items, raters, or occasions were varied — similar to the Spearman-Brown formula but for multiple facets simultaneously.

13.3 Reliability of Difference Scores

When researchers compute difference scores (e.g., post-treatment score minus pre-treatment score, or the difference between two subscales), the reliability of the difference is typically lower than the reliability of either component:

\rho_{D} = \frac{\rho_{XX}\sigma^2_X + \rho_{YY}\sigma^2_Y - 2r_{XY}\sigma_X\sigma_Y}{\sigma^2_X + \sigma^2_Y - 2r_{XY}\sigma_X\sigma_Y}

Where:

- \rho_{XX}, \rho_{YY} = the reliabilities of X and Y
- \sigma_X, \sigma_Y = the standard deviations of X and Y
- r_{XY} = the correlation between X and Y

For parallel measures (\rho_{XX} = \rho_{YY} = \rho and \sigma_X = \sigma_Y = \sigma):

\rho_D = \frac{\rho - r_{XY}}{1 - r_{XY}}

This shows that when X and Y are highly correlated (as expected when both are pre/post measures of the same construct), the reliability of the difference score can be very low.

Example: \rho = 0.80, r_{XY} = 0.70:

\rho_D = \frac{0.80 - 0.70}{1 - 0.70} = \frac{0.10}{0.30} = 0.33

Even though each measure has reliability 0.80, their difference has reliability of only 0.33! This is why difference scores are generally discouraged and residualised change scores or ANCOVA are preferred for measuring change.
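The two formulas above can be sketched and cross-checked in a few lines:

```python
# Reliability of a difference score (formulas above).

def diff_reliability(rho, r_xy):
    """Parallel-measures special case."""
    return (rho - r_xy) / (1 - r_xy)

def diff_reliability_general(rxx, ryy, sx, sy, rxy):
    """General two-component formula."""
    num = rxx * sx**2 + ryy * sy**2 - 2 * rxy * sx * sy
    den = sx**2 + sy**2 - 2 * rxy * sx * sy
    return num / den

print(round(diff_reliability(0.80, 0.70), 2))  # -> 0.33
```

With equal reliabilities and SDs, the general formula reduces exactly to the parallel-measures case, reproducing the 0.33 in the worked example.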

13.4 Reliability and the Attenuation-Correction Decision

When planning a study, the researcher must decide whether to:

  1. Accept observed correlations (with attenuation from unreliability), or
  2. Correct for attenuation to estimate the true relationship.

Arguments for correcting:

Arguments against correcting:

Best practice: Report both the observed and disattenuated correlations, and always report the reliability estimates used for correction.

13.5 Reliability of Composite Scores from Multiple Subscales

When a total score is formed by combining items from multiple subscales, reliability cannot be computed by treating all items as a single scale (which would violate the unidimensionality assumption). Instead, use Mosier's formula for the reliability of a composite:

\rho_{XX}^{\text{composite}} = \frac{\sigma^2_{\text{composite}} - \sum_{k=1}^{K} w_k^2 \sigma^2_{X_k}(1 - \rho_{kk})}{\sigma^2_{\text{composite}}}

Where:

- w_k = the weight applied to subscale k
- \sigma^2_{X_k} = the variance of subscale k
- \rho_{kk} = the reliability of subscale k
- K = the number of subscales

This formula partitions total composite variance into reliable variance (from true scores) and error variance (from subscale measurement errors), providing an accurate estimate of the composite's reliability.

13.6 Item Response Theory (IRT) and Marginal Reliability

Item Response Theory (IRT) provides a framework for reliability that is more flexible than CTT. In IRT, the precision of measurement is not constant across the score range — it is highest where the test has the most information.

The Test Information Function I(\theta) quantifies how much information the test provides at each level \theta of the latent trait:

I(\theta) = \sum_{j=1}^{p} I_j(\theta)

Where I_j(\theta) is the item information function for item j.

The conditional standard error of measurement at trait level \theta is:

\text{SE}(\theta) = \frac{1}{\sqrt{I(\theta)}}

The marginal reliability of the test (averaging over the population distribution of \theta):

\rho_{\text{marginal}} = 1 - \frac{E[1/I(\theta)]}{\text{Var}(\theta) + E[1/I(\theta)]}

IRT marginal reliability is a more informative reliability measure than Cronbach's alpha because it shows that reliability can be high for some test-takers and low for others — traditional reliability statistics only provide an average.
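A back-of-envelope sketch shows how marginal reliability falls out of the formula; the trait variance of 1.0 and average error variance E[1/I(θ)] = 0.25 are assumed, hypothetical values:

```python
# IRT marginal reliability (formula above), with assumed values.

var_theta = 1.0       # Var(theta): latent trait variance
mean_err = 0.25       # E[1 / I(theta)]: mean squared SE over the population
rho_marginal = 1 - mean_err / (var_theta + mean_err)
print(rho_marginal)  # -> 0.8
```

In a real application, mean_err would be obtained by averaging 1/I(θ) over the estimated trait distribution rather than plugging in a constant.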


14. Worked Examples

Example 1: Cronbach's Alpha — 6-Item Burnout Scale

A researcher develops a 6-item work burnout scale with items rated 1 (Never) to 5 (Always). Data are collected from n = 250 employees.

Items:

Item Statistics:

| Item | Mean | SD | Skewness | CITC | \alpha if Deleted |
| --- | --- | --- | --- | --- | --- |
| B1 | 3.21 | 1.08 | -0.22 | 0.72 | 0.86 |
| B2 | 3.08 | 1.12 | -0.15 | 0.69 | 0.87 |
| B3 | 2.95 | 1.15 | 0.10 | 0.61 | 0.88 |
| B4 | 2.78 | 1.20 | 0.18 | 0.55 | 0.89 |
| B5 | 3.31 | 1.05 | -0.30 | 0.75 | 0.86 |
| B6 | 3.05 | 1.18 | -0.08 | 0.68 | 0.87 |

Inter-Item Correlation Matrix:

|  | B1 | B2 | B3 | B4 | B5 | B6 |
| --- | --- | --- | --- | --- | --- | --- |
| B1 | 1.00 | 0.68 | 0.55 | 0.44 | 0.74 | 0.61 |
| B2 |  | 1.00 | 0.60 | 0.42 | 0.69 | 0.58 |
| B3 |  |  | 1.00 | 0.48 | 0.58 | 0.52 |
| B4 |  |  |  | 1.00 | 0.49 | 0.54 |
| B5 |  |  |  |  | 1.00 | 0.66 |
| B6 |  |  |  |  |  | 1.00 |

Average inter-item correlation: \bar{r} = 0.573

Cronbach's Alpha Computation:

\alpha_{\text{std}} = \frac{6 \times 0.573}{1 + (6-1) \times 0.573} = \frac{3.438}{1 + 2.865} = \frac{3.438}{3.865} = 0.890

95% Confidence Interval (Feldt method): \alpha = 0.890, 95% CI [0.870, 0.907]

Scale Statistics:

Item Analysis Decision:

| Item | CITC | \alpha-if-Deleted | Action |
| --- | --- | --- | --- |
| B1 | 0.72 | 0.86 | ✅ Retain — strong indicator |
| B2 | 0.69 | 0.87 | ✅ Retain — good indicator |
| B3 | 0.61 | 0.88 | ✅ Retain — acceptable |
| B4 | 0.55 | 0.89 | ✅ Retain — but weakest item |
| B5 | 0.75 | 0.86 | ✅ Retain — strongest indicator |
| B6 | 0.68 | 0.87 | ✅ Retain — good indicator |

Conclusion: All six items are retained. Cronbach's alpha of 0.890 (95% CI: 0.870, 0.907) indicates good internal consistency. All corrected item-total correlations exceed 0.50, and no single item appreciably improves alpha when deleted. The scale is internally consistent and all items contribute positively to the burnout construct.
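The alpha computation in this example can be double-checked in two lines (values taken from the example):

```python
# Verifying Example 1: standardized alpha from the average inter-item correlation.

p, r_bar = 6, 0.573
alpha_std = p * r_bar / (1 + (p - 1) * r_bar)
print(f"{alpha_std:.3f}")  # -> 0.890
```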


Example 2: McDonald's Omega — 8-Item Anxiety Scale

A researcher administers an 8-item anxiety scale to n = 320 participants and runs a CFA-based reliability analysis using McDonald's omega because item loadings are expected to differ.

Single-Factor CFA Results:

| Item | Standardised Loading (\lambda_j) | Uniqueness (\theta_j = 1 - \lambda_j^2) | R^2 |
| --- | --- | --- | --- |
| A1 | 0.82 | 0.33 | 0.67 |
| A2 | 0.79 | 0.38 | 0.62 |
| A3 | 0.71 | 0.50 | 0.50 |
| A4 | 0.68 | 0.54 | 0.46 |
| A5 | 0.85 | 0.28 | 0.72 |
| A6 | 0.74 | 0.45 | 0.55 |
| A7 | 0.63 | 0.60 | 0.40 |
| A8 | 0.77 | 0.41 | 0.59 |

CFA Fit: CFI = 0.976, TLI = 0.968, RMSEA = 0.047, SRMR = 0.041 → Good fit

Omega Total Computation:

\sum_{j=1}^{8}\lambda_j = 0.82 + 0.79 + 0.71 + 0.68 + 0.85 + 0.74 + 0.63 + 0.77 = 5.99

\left(\sum_{j=1}^{8}\lambda_j\right)^2 = (5.99)^2 = 35.88

\sum_{j=1}^{8}\theta_j = 0.33 + 0.38 + 0.50 + 0.54 + 0.28 + 0.45 + 0.60 + 0.41 = 3.49

\omega_t = \frac{35.88}{35.88 + 3.49} = \frac{35.88}{39.37} = 0.911

Cronbach's Alpha (for comparison):

\alpha = 0.892

Comparison:

| Statistic | Value | Interpretation |
| --- | --- | --- |
| Cronbach's \alpha | 0.892 | Good — but underestimates true reliability |
| McDonald's \omega_t | 0.911 | Excellent — accurate estimate for congeneric items |
| Difference (\omega_t - \alpha) | 0.019 | Alpha underestimates by 1.9 percentage points |

Conclusion: The 8-item anxiety scale demonstrates excellent internal consistency. Omega total (\omega_t = 0.911) is the preferred and more accurate estimate because the items have unequal factor loadings (ranging from 0.63 to 0.85), confirming the congeneric model. Cronbach's alpha (0.892) slightly underestimates the true reliability, as expected for a congeneric scale.
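The omega computation can be verified directly from the loadings table (using the table's rounded uniquenesses, as in the hand calculation):

```python
# Verifying Example 2: omega total from the standardised loadings.

loadings = [0.82, 0.79, 0.71, 0.68, 0.85, 0.74, 0.63, 0.77]
theta = [0.33, 0.38, 0.50, 0.54, 0.28, 0.45, 0.60, 0.41]  # table uniquenesses
omega_t = sum(loadings) ** 2 / (sum(loadings) ** 2 + sum(theta))
print(f"{omega_t:.3f}")  # -> 0.911
```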


Example 3: ICC — Two Clinical Raters Assessing Pain Intensity

Two physiotherapists independently rate pain intensity on a 0–10 numeric scale for n = 40 patients. The researcher wants to assess whether the two raters can be used interchangeably (absolute agreement ICC).

ANOVA Table:

| Source | SS | df | MS |
| --- | --- | --- | --- |
| Between Patients | 412.8 | 39 | 10.585 |
| Between Raters | 8.1 | 1 | 8.100 |
| Residual (Error) | 58.4 | 39 | 1.497 |
| Total | 479.3 | 79 |  |

ICC(2,1) — Two-Way Random, Absolute Agreement:

\text{ICC}(2,1) = \frac{10.585 - 1.497}{10.585 + (2-1)(1.497) + \frac{2}{40}(8.100 - 1.497)}

= \frac{9.088}{10.585 + 1.497 + 0.330} = \frac{9.088}{12.412} = 0.732

ICC(3,1) — Two-Way Mixed, Consistency:

\text{ICC}(3,1) = \frac{10.585 - 1.497}{10.585 + (2-1)(1.497)} = \frac{9.088}{12.082} = 0.752

F-test for significance:

F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{error}}} = \frac{10.585}{1.497} = 7.07, \quad p < 0.001

95% Confidence Interval for ICC(2,1):

F_L = F_{0.025, 39, 39} = 0.529, \quad F_U = F_{0.975, 39, 39} = 1.895

Using the Shrout-Fleiss CI formula: \text{ICC}_{95\%} = [0.562, 0.849]

Interpretation:

| Statistic | Value | Interpretation |
| --- | --- | --- |
| ICC(2,1) — Absolute Agreement | 0.732 [0.562, 0.849] | Moderate-good agreement |
| ICC(3,1) — Consistency | 0.752 | Good consistency |
| Difference (abs. vs. consistency) | 0.020 | Small systematic rater mean difference |

Interpretation: The ICC for absolute agreement is 0.732 (95% CI: 0.562, 0.849), indicating moderate-to-good inter-rater reliability. The slightly higher consistency ICC (0.752) suggests a small systematic difference in how the two physiotherapists use the rating scale (Rater A rates slightly higher/lower than Rater B on average). For clinical interchangeability of the two raters, the absolute agreement ICC of 0.732 is adequate for research purposes but falls short of the 0.90 threshold recommended for high-stakes clinical decision-making. Additional rater training is recommended to improve agreement.


Example 4: Spearman-Brown Prophecy — Lengthening a Short Scale

A researcher has a 5-item resilience scale with \alpha = 0.68 and wants to improve reliability to at least \alpha = 0.80 by adding parallel items. How many more items are needed?

Step 1 — Compute n (the lengthening factor):

n = \frac{\rho^*(1 - \rho_{XX})}{\rho_{XX}(1 - \rho^*)} = \frac{0.80(1 - 0.68)}{0.68(1 - 0.80)} = \frac{0.80 \times 0.32}{0.68 \times 0.20} = \frac{0.256}{0.136} = 1.882

Step 2 — Compute required number of items:

New total items = n \times 5 = 1.882 \times 5 = 9.41 \approx 10 items

Step 3 — Verify with Spearman-Brown (doubling the scale, n = 2):

\rho_{XX}^{(10)} = \frac{2 \times 0.68}{1 + (2-1) \times 0.68} = \frac{1.36}{1.68} = 0.810

Conclusion: Adding 5 more parallel items (total 10 items) is predicted to raise the reliability from \alpha = 0.68 to approximately \alpha = 0.81, exceeding the target of 0.80. This assumes that the new items have the same average inter-item correlation as the original 5.
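The three steps of this prophecy calculation can be scripted and verified directly:

```python
# Verifying Example 4: lengthening factor and projected reliability.
import math

rho, target, items = 0.68, 0.80, 5
n = target * (1 - rho) / (rho * (1 - target))      # lengthening factor
new_items = math.ceil(n * items)                    # round up to whole items
projected = (2 * rho) / (1 + (2 - 1) * rho)         # doubling the 5-item scale
print(round(n, 3), new_items, f"{projected:.3f}")  # -> 1.882 10 0.810
```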


15. Common Mistakes and How to Avoid Them

Mistake 1: Reporting Alpha Without a Confidence Interval

Problem: Cronbach's alpha is a sample statistic with substantial sampling variability, especially in small samples. Reporting only the point estimate gives a false sense of precision. A value of \alpha = 0.78 with n = 50 could have a 95% CI as wide as [0.63, 0.89].
Solution: Always report the 95% confidence interval for all reliability coefficients. Use the Feldt method for alpha or bootstrap CIs for omega.

Mistake 2: Using Cronbach's Alpha as a Measure of Unidimensionality

Problem: Alpha measures internal consistency (how strongly items co-vary), not unidimensionality (whether items measure a single construct). A multidimensional scale with two positively correlated subscales can produce high alpha, even though it clearly violates unidimensionality.
Solution: Always conduct an EFA or CFA to assess dimensionality before computing reliability. Report alpha/omega separately for each unidimensional subscale.

Mistake 3: Blindly Deleting Items to Maximise Alpha

Problem: Removing items purely because they increase alpha capitalises on sampling variability and can produce a shorter scale that performs worse in new samples. Alpha increases simply by removing poor items, but the gain in reliability may be spurious.
Solution: Use a principled decision framework: only remove an item if (a) the CITC is below 0.20, (b) the item has poor theoretical alignment with the construct, AND (c) the item does not reduce content validity. Validate the revised scale in a new sample.

Mistake 4: Not Checking for Reverse-Coded Items

Problem: Including negatively-worded items without reverse coding them will produce negative inter-item correlations and severely deflate alpha. A value of \alpha = 0.10 is often a sign that one or more items have not been reverse coded.
Solution: Before running reliability analysis, identify all negatively-worded items and reverse-code them: X_{\text{rev}} = (\text{max} + \text{min}) - X.

Mistake 5: Reporting Alpha for a Multidimensional Scale as a Whole

Problem: Computing a single alpha for a multidimensional questionnaire (e.g., a measure with anxiety, depression, and stress subscales combined) is theoretically inappropriate and can produce misleading reliability estimates.
Solution: Compute reliability separately for each subscale. If a composite total score is used, estimate its reliability using Mosier's composite reliability formula (Section 13.5).

Mistake 6: Ignoring the Number of Items When Interpreting Alpha

Problem: Alpha increases automatically with more items (Spearman-Brown effect). A 30-item scale with weak items can produce \alpha = 0.90, while a 4-item scale with strong items might produce \alpha = 0.75. The 4-item scale may actually be more efficient and have better items.
Solution: Consider the average inter-item correlation (\bar{r}) alongside alpha. Compare \bar{r} across scales of different lengths, as it is not affected by scale length. Ideal \bar{r} = 0.20 to 0.40.

Mistake 7: Confusing Inter-Rater Agreement With Inter-Rater Reliability

Problem: These two concepts are related but distinct:

- Agreement asks whether raters assign the same absolute scores (e.g., both give a 7).
- Reliability (consistency) asks whether raters rank-order subjects the same way, even if one rater is systematically more lenient.

A pair of raters can show high consistency but poor absolute agreement when one is uniformly stricter than the other.
Solution: Choose the statistic that matches the research question — an absolute-agreement ICC (e.g., ICC(2,1)) when raters must be interchangeable, a consistency ICC (e.g., ICC(3,1)) when only relative ordering matters — and state which definition was used.
Mistake 8: Using Percent Agreement Instead of Cohen's Kappa

Problem: Percent agreement does not correct for chance agreement. With a binary rating where 90% of cases fall in one category, two raters randomly agreeing with base rates would achieve 82% agreement by chance, making 85% agreement seem impressive when it is barely above chance.
Solution: Always report Cohen's Kappa (or Fleiss' Kappa for multiple raters) alongside percent agreement. Never interpret percent agreement alone.

Mistake 9: Applying Cronbach's Alpha to Subscale Scores (Not Item-Level Data)

Problem: Computing alpha using subscale total scores (rather than individual item scores) as the input produces a composite alpha estimate that is not the same as the reliability of the total scale and is not interpretable as a standard reliability coefficient.
Solution: Always compute reliability from item-level data (each item in a separate column), not from subscale totals.

Mistake 10: Interpreting Alpha of 0.95 as "Better" Than Alpha of 0.85

Problem: Very high alpha (> 0.95) is often a sign of item redundancy — items are so similar that they provide almost no unique measurement information. This wastes respondent time without improving construct coverage.
Solution: Target 0.80 - 0.90 for most research scales. If alpha exceeds 0.95 with many items, consider reducing scale length by removing the most redundant items (lowest unique information) while maintaining acceptable reliability.


16. Troubleshooting

| Problem | Likely Cause | Solution |
|---|---|---|
| Cronbach's alpha is very low ($< 0.50$) | Reverse-coded items not recoded; items from different constructs mixed; items are too heterogeneous | Check and reverse-code negatively worded items; separate subscales; check for construct coherence |
| Alpha is negative | At least one item is very negatively correlated with others; reverse coding error | Examine inter-item correlations; check for items that need reverse coding |
| Alpha is very high ($> 0.95$) with many items | Item redundancy — too many items with near-identical wording | Inspect item pairs with $r > 0.80$; remove the most redundant items |
| Alpha exceeds omega | Correlated errors inflating alpha; model misspecification | Run CFA to check for correlated errors; use omega from a properly specified model |
| One item's alpha-if-deleted greatly exceeds overall alpha | Item measures a different construct; possible reverse-coding error; item is ambiguous | Examine item content; check reverse coding; consider removing from scale |
| All corrected item-total correlations are near zero | Items are unrelated to each other; multidimensional scale being treated as unidimensional | Run EFA; split into subscales; reconsider construct definition |
| Negative corrected item-total correlation | Item is negatively related to the construct; reverse coding needed | Reverse-code the item and re-run |
| ICC very low ($< 0.50$) with large F-ratio | Raters highly inconsistent; training issue | Re-train raters; clarify rating criteria; pilot the coding manual |
| ICC consistency much higher than absolute agreement | Systematic rater bias (one rater consistently rates higher/lower) | Identify the biased rater; re-calibrate; consider rater re-training |
| Cohen's Kappa is very low despite high percent agreement | High base rate of one category (prevalence paradox) | Report both statistics; use PABAK (prevalence- and bias-adjusted kappa) |
| CFA does not converge for omega | Very small sample; near-perfect correlations; Heywood case | Increase sample size; reduce number of items; check for duplicate items |
| SEM is very large | Low reliability and/or high scale variance | Improve reliability; report SEM explicitly in all clinical applications |
| Omega hierarchical approaches zero | Essentially no general factor; scale is fully multidimensional | Use subscale scores rather than a total; report subscale-specific omega |
| Spearman-Brown predicts a very large number of items needed | Baseline reliability is very low; items are poor indicators | Redesign items; collect new pilot data; consider a different item format |

17. Quick Reference Cheat Sheet

Core Equations

| Formula | Description |
|---|---|
| $X_i = T_i + E_i$ | Classical Test Theory model |
| $\rho_{XX} = \sigma^2_T / \sigma^2_X$ | Reliability coefficient (population) |
| $\text{SEM} = \sigma_X\sqrt{1-\rho_{XX}}$ | Standard Error of Measurement |
| $\alpha = \frac{p}{p-1}\left(1 - \frac{\sum\sigma^2_j}{\sigma^2_X}\right)$ | Cronbach's alpha |
| $\alpha_{\text{std}} = \frac{p\bar{r}}{1+(p-1)\bar{r}}$ | Standardised alpha |
| $\rho_{XX}^{(n)} = \frac{n\rho_{XX}}{1+(n-1)\rho_{XX}}$ | Spearman-Brown prophecy |
| $n = \frac{\rho^*(1-\rho_{XX})}{\rho_{XX}(1-\rho^*)}$ | Lengthening factor $n$ needed for target reliability $\rho^*$ |
| $r^*_{XY} = \frac{r_{XY}}{\sqrt{\rho_{XX}\rho_{YY}}}$ | Correction for attenuation |
| $\omega_t = \frac{(\sum\lambda_j)^2}{(\sum\lambda_j)^2 + \sum\theta_j}$ | McDonald's omega total |
| $\omega_h = \frac{(\sum\lambda_{jg})^2}{\sigma^2_X}$ | Omega hierarchical |
| $\rho_{AB}^{SB} = \frac{2r_{AB}}{1+r_{AB}}$ | Split-half (Spearman-Brown corrected) |
| $\kappa = \frac{P_o - P_e}{1-P_e}$ | Cohen's Kappa |
| $\text{ICC}(3,1) = \frac{\text{MS}_B - \text{MS}_E}{\text{MS}_B + (k-1)\text{MS}_E}$ | ICC consistency (two-way mixed) |
| $\text{ICC}(2,1) = \frac{\text{MS}_B - \text{MS}_E}{\text{MS}_B + (k-1)\text{MS}_E + \frac{k}{n}(\text{MS}_R - \text{MS}_E)}$ | ICC absolute agreement (two-way random) |
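Several of these equations translate directly into code. A self-contained Python sketch of Cronbach's alpha, the SEM, and the Spearman-Brown lengthening factor (an illustration of the formulas above, not the DataStatPro implementation):

```python
import math
import statistics

def cronbach_alpha(items):
    """alpha = p/(p-1) * (1 - sum(item variances) / total-score variance).
    `items` is a list of item columns (one list of scores per item)."""
    p = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(statistics.variance(col) for col in items)
    return (p / (p - 1)) * (1 - sum_item_var / statistics.variance(totals))

def sem(sd_x, reliability):
    """Standard Error of Measurement: SEM = sd * sqrt(1 - reliability)."""
    return sd_x * math.sqrt(1 - reliability)

def lengthening_factor(reliability, target):
    """Factor by which a scale must be lengthened to reach `target`."""
    return target * (1 - reliability) / (reliability * (1 - target))

# e.g. a 10-item scale with reliability 0.50 needs a factor of about 4
# (roughly 40 items) to reach 0.80: lengthening_factor(0.50, 0.80)
```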

Reliability Benchmarks

| Coefficient | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Alpha / Omega | $< 0.60$ | $0.60 - 0.69$ | $0.70 - 0.89$ | $\geq 0.90$ |
| ICC (research) | $< 0.50$ | $0.50 - 0.74$ | $0.75 - 0.90$ | $> 0.90$ |
| Cohen's Kappa | $< 0.21$ | $0.21 - 0.40$ | $0.41 - 0.60$ | $> 0.80$ |
| CITC | $< 0.20$ | $0.20 - 0.29$ | $0.30 - 0.49$ | $\geq 0.50$ |
| $\omega_h / \omega_t$ | $< 0.50$ | $0.50 - 0.64$ | $0.65 - 0.79$ | $\geq 0.80$ |

Reliability Type Selection Guide

| Scenario | Recommended Method |
|---|---|
| Multi-item Likert scale, unidimensional | McDonald's omega (preferred) or Cronbach's alpha |
| Multi-item scale, multidimensional | Omega hierarchical (bifactor) + omega subscale |
| Binary scored items (correct/incorrect) | KR-20 |
| Ordinal scale ($\leq 5$ categories) | Ordinal alpha or ordinal omega (polychoric) |
| Two raters, nominal categories | Cohen's Kappa |
| Two raters, ordered categories | Weighted Kappa (linear or quadratic) |
| Three or more raters, nominal categories | Fleiss' Kappa |
| Two or more raters, continuous ratings | ICC (specify model and type) |
| Test-retest, continuous | ICC (two-way mixed, absolute agreement) |
| Multiple sources of error | Generalisability Theory (G-coefficient) |

Item Analysis Decision Rules

| Statistic | Threshold | Action |
|---|---|---|
| CITC | $< 0.20$ | Flag for removal or revision |
| CITC | Negative | Check reverse coding; flag for review |
| Alpha-if-deleted | $> \alpha + 0.05$ | Strong candidate for removal |
| Inter-item $r$ | $> 0.80$ | Redundancy — remove one of the pair |
| Item skewness | $z$ | |
| Item difficulty ($p_j$, binary) | $< 0.20$ or $> 0.80$ | Item too hard or too easy |
| Item discrimination ($D$) | $< 0.20$ | Poor discriminator — revise |
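The decision rules above can be applied mechanically to per-item statistics. A hypothetical sketch (statistic names and flag wording are illustrative, not DataStatPro output):

```python
def flag_item(citc, alpha_if_deleted, scale_alpha, max_interitem_r):
    """Return the decision-rule flags that apply to one item, given its
    corrected item-total correlation, alpha-if-deleted, the full-scale
    alpha, and its largest correlation with any other item."""
    flags = []
    if citc < 0:
        flags.append("negative CITC: check reverse coding")
    elif citc < 0.20:
        flags.append("CITC below 0.20: flag for removal or revision")
    if alpha_if_deleted > scale_alpha + 0.05:
        flags.append("alpha-if-deleted exceeds alpha + 0.05: removal candidate")
    if max_interitem_r > 0.80:
        flags.append("inter-item r above 0.80: redundancy")
    return flags
```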

Minimum Reliability by Context

| Context | Minimum | Preferred |
|---|---|---|
| Exploratory / pilot research | 0.60 | $\geq 0.70$ |
| Group-level research | 0.70 | $\geq 0.80$ |
| Individual research decisions | 0.80 | $\geq 0.90$ |
| Clinical / high-stakes decisions | 0.90 | $\geq 0.95$ |

ICC Model Selection Guide

| Design | Raters | Measure | Recommended ICC |
|---|---|---|---|
| Each subject rated by different raters | Random | Single | ICC(1,1) |
| Each subject rated by different raters | Random | Mean of $k$ | ICC(1,$k$) |
| Same raters rate all; generalise to all raters | Random | Single | ICC(2,1) |
| Same raters rate all; generalise to all raters | Random | Mean of $k$ | ICC(2,$k$) |
| Same fixed raters; generalise to these raters | Fixed | Single | ICC(3,1) |
| Same fixed raters; generalise to these raters | Fixed | Mean of $k$ | ICC(3,$k$) |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Reliability Analysis using the DataStatPro application. For further reading, consult Revelle & Zinbarg's "Coefficients Alpha, Beta, Omega, and the glb" (2009), McDonald's "Test Theory: A Unified Treatment" (1999), Koo & Li's "A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research" (2016), or Shrout & Fleiss's "Intraclass Correlations: Uses in Assessing Rater Reliability" (1979). For feature requests or support, contact the DataStatPro team.