Reliability Analysis: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of Reliability Analysis all the way through advanced estimation, evaluation, item analysis, and practical usage within the DataStatPro application. Whether you are encountering reliability analysis for the first time or looking to deepen your understanding of measurement quality, this guide builds your knowledge systematically from the ground up.
Table of Contents
- Prerequisites and Background Concepts
- What is Reliability Analysis?
- The Mathematics Behind Reliability
- Assumptions of Reliability Analysis
- Types of Reliability
- Using the Reliability Analysis Component
- Cronbach's Alpha
- McDonald's Omega
- Split-Half Reliability
- Inter-Rater Reliability
- Item Analysis
- Model Fit and Evaluation
- Advanced Topics
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into reliability analysis, it is helpful to understand several foundational statistical and psychometric concepts. Each is briefly reviewed below.
1.1 Measurement and Scales
Measurement is the process of assigning numbers to objects or events according to rules. In social and behavioural sciences, we frequently measure latent constructs — unobservable psychological or social attributes such as intelligence, anxiety, depression, or customer satisfaction.
These constructs are measured indirectly through observable indicators (items or questions on a questionnaire). The quality of this measurement process is what reliability analysis evaluates.
Scales of measurement:
| Scale | Properties | Examples |
|---|---|---|
| Nominal | Categories only; no order | Gender, blood type, nationality |
| Ordinal | Ordered categories; unequal intervals | Likert scale responses (1–5), satisfaction ratings |
| Interval | Equal intervals; arbitrary zero | Temperature (°C), IQ scores |
| Ratio | Equal intervals; true zero | Height, weight, reaction time |
Most psychological questionnaires use ordinal scales (Likert items), though they are often treated as approximately interval for the purpose of reliability analysis.
1.2 Variance and Covariance
The variance of a variable $X$ quantifies how spread out its values are around the mean:

$$\operatorname{Var}(X) = \sigma_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The covariance between two variables $X$ and $Y$ measures how they vary together:

$$\operatorname{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

The correlation is the standardised covariance:

$$r_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\,\sigma_Y}$$
Reliability analysis is fundamentally about analysing the covariance structure of a set of items — how strongly they co-vary with each other determines how reliably they measure the underlying construct.
1.3 The Inter-Item Correlation Matrix
Given $k$ items, the inter-item correlation matrix is a symmetric $k \times k$ matrix where:
- Diagonal elements = 1.0 (each item correlates perfectly with itself).
- Off-diagonal element $r_{ij}$ = Pearson correlation between items $i$ and $j$.
High inter-item correlations (typically $r > 0.30$) indicate that the items are measuring a common construct. Items with very low correlations ($r < 0.20$) with all others may be measuring something different and should be reviewed.
1.4 Composite Scores
A composite score (also called a total score or scale score) is the sum or average of responses across multiple items:

$$T = \sum_{i=1}^{k} X_i \qquad \text{or} \qquad \bar{T} = \frac{1}{k}\sum_{i=1}^{k} X_i$$

The reliability of a composite score depends on:
- The reliability of each individual item.
- The number of items ($k$) — more items generally means higher reliability.
- The average inter-item correlation — higher correlations mean higher reliability.
1.5 The Signal-to-Noise Analogy
Reliability can be thought of in terms of signal and noise:
- Signal = true score variance (the construct you actually want to measure).
- Noise = error variance (random fluctuations due to item wording, context, mood, etc.).
A perfectly reliable measure would have all signal and no noise ($\rho_{XX'} = 1$). In practice, all measurement has some noise, and reliability values above 0.70 are generally considered acceptable for research purposes.
1.6 True Score Theory (Brief Preview)
The concept of true scores comes from Classical Test Theory (CTT), which is the mathematical framework underlying most reliability analyses. In CTT, every observed score $X$ is composed of a true score $T$ and a random error $E$:

$$X = T + E$$
This deceptively simple equation is the foundation for all reliability estimation methods covered in this tutorial.
2. What is Reliability Analysis?
2.1 The Core Question
Reliability analysis evaluates whether a measurement instrument (a questionnaire, scale, test, or rating system) produces consistent, stable, and reproducible results. It answers the fundamental question:
"If we measure the same thing again under the same conditions, will we get the same result?"
A reliable instrument gives similar scores across:
- Items (internal consistency): Do all items in the scale measure the same thing?
- Time (test-retest reliability): Do scores remain stable when measured again?
- Raters (inter-rater reliability): Do different raters give the same scores?
- Parallel forms (alternate-form reliability): Do different versions of the test agree?
2.2 Reliability vs. Validity
Reliability and validity are the two cornerstones of measurement quality, but they are distinct:
| Property | Definition | Question Asked |
|---|---|---|
| Reliability | Consistency of measurement | "Does it measure consistently?" |
| Validity | Accuracy of measurement | "Does it measure what it claims to?" |
The critical relationship is:
Reliability is a necessary but not sufficient condition for validity.
A measure can be highly reliable but invalid (consistently measuring the wrong thing — like a miscalibrated scale that consistently reads 5 kg too heavy). A valid measure, by definition, must also be reliable (you cannot accurately measure something if the results are random).
- Perfect Reliability + High Validity = Excellent measurement (target)
- Perfect Reliability + Low Validity = Consistent but wrong (systematic bias)
- Low Reliability + Any Validity = Impossible (random measurement cannot be valid)
2.3 The Role of Reliability in Research
Reliability affects research in critical ways:
Statistical power: Unreliable measures attenuate (reduce) observed correlations and effect sizes. The observed correlation between constructs $X$ and $Y$ is related to the true correlation by:

$$r_{XY}^{\text{observed}} = r_{XY}^{\text{true}}\,\sqrt{\rho_{XX'}\,\rho_{YY'}}$$

Where $\rho_{XX'}$ and $\rho_{YY'}$ are the reliabilities of the two measures. Low reliability reduces the observed correlation below the true value; this is called attenuation.
Precision of measurement: The Standard Error of Measurement (SEM) quantifies the uncertainty in an individual's score:

$$\text{SEM} = \sigma_X\,\sqrt{1 - \rho_{XX'}}$$
Lower reliability → larger SEM → less precise measurement of individual scores.
Sample size requirements: Studies using unreliable measures require larger samples to achieve the same statistical power, because unreliability adds noise to all effect size estimates.
2.4 Real-World Applications
- Psychology: Evaluating whether a depression scale (e.g., PHQ-9), anxiety inventory, or personality questionnaire measures the intended construct consistently.
- Education: Assessing whether exam items reliably distinguish between high- and low-knowledge students.
- Medicine: Verifying that clinicians agree on disease severity ratings or diagnostic classifications (inter-rater reliability).
- Marketing Research: Confirming that customer satisfaction items consistently reflect the same underlying attitude.
- Human Resources: Evaluating the reliability of performance appraisal systems or structured interview ratings.
- Engineering / Quality Control: Assessing the consistency of measurement instruments, inspectors, or testing procedures (gauge R&R studies).
3. The Mathematics Behind Reliability
3.1 Classical Test Theory (CTT)
Classical Test Theory (CTT) is the dominant framework for reliability analysis in the social sciences. Its foundation is the true score model:

$$X_i = T_i + E_i$$

Where:
- $X_i$ is the observed score of person $i$ (what we actually measure).
- $T_i$ is the true score of person $i$ (the score they would obtain with perfect, error-free measurement).
- $E_i$ is the measurement error for person $i$ (random, unpredictable fluctuation).
3.2 Assumptions of the True Score Model
The CTT model rests on four key mathematical assumptions:
- Linearity: $X = T + E$ (observed = true + error, additively).
- Zero mean error: $E(E) = 0$ — measurement errors average to zero across many measurements.
- Uncorrelated error and true score: $\operatorname{Cov}(T, E) = 0$ — error is unrelated to the person's true score.
- Uncorrelated errors across items: $\operatorname{Cov}(E_i, E_j) = 0$ for $i \neq j$ — errors on different items are independent.
3.3 Variance Decomposition
Under the CTT assumptions, the variance of the observed score decomposes into two parts:

$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2$$

Where:
- $\sigma_X^2$ = total observed score variance.
- $\sigma_T^2$ = true score variance (signal).
- $\sigma_E^2$ = error variance (noise).
3.4 The Reliability Coefficient
The reliability coefficient $\rho_{XX'}$ is defined as the ratio of true score variance to total observed score variance:

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}$$

This is the population reliability — it ranges from 0 (completely unreliable) to 1 (perfectly reliable). In practice, reliability is estimated from sample data using methods such as Cronbach's alpha, McDonald's omega, or split-half reliability.
Key properties of $\rho_{XX'}$:
- $\rho_{XX'} = 0$: All variance is error; the measure is completely random.
- $\rho_{XX'} = 1$: All variance is true score variance; perfect measurement.
- $\rho_{XX'} = \operatorname{corr}(X, X')$: Reliability equals the correlation between two parallel forms of the measure.
3.5 Standard Error of Measurement
The Standard Error of Measurement (SEM) quantifies the average error in an individual's observed score:

$$\text{SEM} = \sigma_X\,\sqrt{1 - \rho_{XX'}}$$

The SEM defines a confidence interval around a person's observed score:

$$X \pm 1.96 \times \text{SEM} \quad \text{(95\% CI)}$$

For example, with illustrative values $\sigma_X = 10$ and $\rho_{XX'} = 0.90$:

$$\text{SEM} = 10\sqrt{1 - 0.90} \approx 3.16$$

A person scoring 65 has a 95% CI for their true score of $65 \pm 1.96 \times 3.16 \approx [58.8,\ 71.2]$.
3.6 The Spearman-Brown Prophecy Formula
The Spearman-Brown Prophecy Formula predicts the reliability of a lengthened (or shortened) test. If the current test has reliability $\rho$ and is changed to $m$ times its current length:

$$\rho_{\text{new}} = \frac{m\,\rho}{1 + (m-1)\,\rho}$$

Where $m$ is the multiplication factor (e.g., $m = 2$ means doubling the number of items).
Inverse formula — how many times longer must the test be to reach a target reliability $\rho_{\text{target}}$?

$$m = \frac{\rho_{\text{target}}\,(1 - \rho)}{\rho\,(1 - \rho_{\text{target}})}$$

Example: Current reliability = 0.60. How many times longer must the test be to reach 0.80?

$$m = \frac{0.80 \times (1 - 0.60)}{0.60 \times (1 - 0.80)} = \frac{0.32}{0.12} \approx 2.67$$

The test must be approximately 2.67 times longer (about 167% more items).
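For quick what-if calculations, both Spearman-Brown formulas are easy to script. Below is a minimal Python sketch (the function names are our own, not part of DataStatPro):

```python
def spearman_brown(r_current: float, m: float) -> float:
    """Predicted reliability of a test changed to m times its current length."""
    return (m * r_current) / (1 + (m - 1) * r_current)

def required_length_factor(r_current: float, r_target: float) -> float:
    """Length multiplier needed to reach a target reliability."""
    return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

print(spearman_brown(0.60, 2))             # doubling the test: 0.75
print(required_length_factor(0.60, 0.80))  # ~2.67, as in the example above
```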
3.7 Attenuation and Correction for Attenuation
The correction for attenuation formula estimates the true (disattenuated) correlation between two constructs from their observed correlation, correcting for the attenuation caused by measurement unreliability:

$$r_{XY}^{\text{true}} = \frac{r_{XY}}{\sqrt{\rho_{XX'}\,\rho_{YY'}}}$$

Where:
- $r_{XY}$ = observed correlation between measures $X$ and $Y$.
- $\rho_{XX'}$, $\rho_{YY'}$ = reliabilities of measures $X$ and $Y$.
- $r_{XY}^{\text{true}}$ = estimated true correlation (disattenuated).
⚠️ The disattenuated correlation can exceed 1.0 if both reliabilities are low and the observed correlation is moderate or high. Values above 1.0 are inadmissible and indicate that the reliabilities or correlation are inaccurate. Treat disattenuated correlations with caution.
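Because inadmissible values are easy to miss, it can help to wrap the correction in a helper that checks for them. A minimal Python sketch (hypothetical function name):

```python
def disattenuated_r(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for unreliability in both measures."""
    r_true = r_xy / (rel_x * rel_y) ** 0.5
    if r_true > 1.0:
        # Inadmissible: the reliabilities and/or r_xy are mutually inconsistent.
        raise ValueError(f"Disattenuated r = {r_true:.3f} exceeds 1.0")
    return r_true

print(disattenuated_r(0.42, 0.70, 0.80))  # ~0.561
```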
3.8 The Covariance Matrix of a Scale
For a $k$-item scale, the covariance matrix $\mathbf{\Sigma}$ is $k \times k$ with:
- Diagonal elements: item variances $\sigma_i^2$.
- Off-diagonal elements: inter-item covariances $\sigma_{ij}$.
The total scale variance is the sum of all elements:

$$\sigma_X^2 = \sum_{i=1}^{k}\sum_{j=1}^{k}\sigma_{ij}$$

Or in matrix notation:

$$\sigma_X^2 = \mathbf{1}'\,\mathbf{\Sigma}\,\mathbf{1}$$

Where $\mathbf{1}$ is a $k \times 1$ vector of ones.
This covariance matrix is the fundamental input to Cronbach's alpha and McDonald's omega.
4. Assumptions of Reliability Analysis
4.1 Unidimensionality
The most critical assumption for most reliability coefficients (especially Cronbach's alpha) is that all items in the scale measure a single underlying construct (unidimensionality).
Why it matters: Cronbach's alpha is a measure of internal consistency, not unidimensionality. A scale with multiple distinct dimensions (e.g., a scale with "anxiety" and "depression" items combined) can still produce a high alpha, but the alpha in that case is misleading — it does not reflect a single coherent construct.
How to check:
- Run an Exploratory Factor Analysis (EFA) before reliability analysis.
- Inspect the scree plot and parallel analysis.
- If more than one factor is extracted, consider running reliability analysis separately for each subscale.
- Check the ratio of the first to second eigenvalue: a ratio greater than about 4 suggests approximate unidimensionality.
4.2 Tau-Equivalence (for Cronbach's Alpha)
Cronbach's alpha is theoretically justified only when items are tau-equivalent — meaning all items have:
- Equal true score variances (all items are equally good at measuring the construct), AND
- Possibly different error variances (items may differ in their error variance and hence their reliability).
In practice, most scales are congeneric — items have different factor loadings (not tau-equivalent) — which means Cronbach's alpha underestimates the true reliability. McDonald's omega (Section 8) does not require tau-equivalence and is therefore more appropriate for congeneric scales.
4.3 Uncorrelated Errors
CTT assumes that measurement errors on different items are uncorrelated — the error on item $i$ is independent of the error on item $j$. Correlated errors arise when:
- Two items use very similar wording (e.g., "I feel anxious" and "I feel worried").
- Two items share a reverse-score format (systematic method effect).
- Two items are administered sequentially and participants remember their previous response (carry-over effects).
Correlated errors inflate Cronbach's alpha and can produce reliability estimates that exceed the true reliability. When correlated errors are suspected, use a model-based approach (CFA) to explicitly model and account for them.
4.4 Continuous or Approximately Continuous Items
Standard reliability formulas assume that item scores are continuous (or at minimum, ordinal with many ordered categories treated as approximately continuous). For binary items (0/1 responses), use the Kuder-Richardson Formula 20 (KR-20), which is a special case of Cronbach's alpha. For ordinal items with 3–4 categories, use polychoric correlations as input (ordinal alpha or ordinal omega).
4.5 Adequate Sample Size
Reliability estimation requires a sufficient sample size for stable results:
| Reliability Statistic | Minimum $n$ | Recommended $n$ |
|---|---|---|
| Cronbach's alpha | 50 | ≥ 200 |
| McDonald's omega (CFA-based) | 100 | ≥ 250 |
| ICC (inter-rater) | 30 subjects | ≥ 50 subjects |
| Test-retest (Pearson/ICC) | 30 | ≥ 50 |
| Split-half | 50 | ≥ 100 |
⚠️ With small samples, reliability estimates are highly unstable and have wide confidence intervals. Always report a confidence interval alongside the point estimate.
4.6 No Extreme Outliers
Outliers in item responses can substantially distort covariances and variances, leading to inaccurate reliability estimates. Screen data for:
- Straight-line responding (participant gives the same response to all items).
- Acquiescence bias (tendency to agree with all items regardless of content).
- Extreme response styles (always using only the highest or lowest scale point).
5. Types of Reliability
5.1 Internal Consistency Reliability
Internal consistency measures whether the items in a scale consistently measure the same construct. It is assessed using a single administration of the scale.
| Method | Basis | When to Use |
|---|---|---|
| Cronbach's Alpha ($\alpha$) | Average inter-item covariance | Default for multi-item scales; assumes tau-equivalence |
| McDonald's Omega ($\omega$) | Factor model (CFA-based) | Better for congeneric items; most recommended |
| Ordinal Alpha / Omega | Polychoric correlations | Ordinal items with < 5 categories |
| KR-20 / KR-21 | Binary items | Dichotomous response scales (correct/incorrect) |
| Split-Half (Spearman-Brown) | Two halves of the scale | Quick estimate; now largely superseded |
| Greatest Lower Bound (GLB) | Maximisation over all splittings | Highest of the lower-bound estimates of internal consistency |
5.2 Test-Retest Reliability (Stability)
Test-retest reliability assesses the temporal stability of a measure — whether the same people get the same scores when measured at two different time points.
$$r_{\text{test-retest}} = \operatorname{corr}(X_1, X_2)$$

Where $X_1$ and $X_2$ are scores from the same measure at Time 1 and Time 2.
Key considerations:
- The time interval between assessments must be chosen carefully:
- Too short (days): Carryover/memory effects inflate reliability.
- Too long (months/years): True change in the construct deflates reliability.
- Optimal: 2–4 weeks for stable psychological traits.
- Use Intraclass Correlation Coefficient (ICC) rather than Pearson for test-retest reliability, as ICC also accounts for systematic mean shifts between time points.
5.3 Inter-Rater Reliability (Agreement)
Inter-rater reliability assesses the degree to which different raters (judges, coders, or observers) agree in their assessments of the same subjects.
| Method | Data Type | When to Use |
|---|---|---|
| Percent Agreement | Nominal / Ordinal | Simple; ignores chance agreement |
| Cohen's Kappa ($\kappa$) | Nominal | Two raters; categorical ratings |
| Weighted Kappa ($\kappa_w$) | Ordinal | Two raters; ordered categories |
| Fleiss' Kappa | Nominal | Three or more raters |
| Intraclass Correlation (ICC) | Continuous / Interval | Two or more raters; continuous ratings |
| Krippendorff's Alpha | Any scale | Multiple raters; any data type |
5.4 Parallel-Forms Reliability (Alternate Forms)
Parallel-forms reliability (also called alternate-form or equivalent-form reliability) assesses agreement between two different versions of the same test administered at the same time:

$$r_{\text{parallel}} = \operatorname{corr}(\text{Form A}, \text{Form B})$$
When used:
- High-stakes testing where test security is a concern (using multiple test forms).
- Longitudinal studies where learning effects would contaminate test-retest assessment.
This type is less common in everyday practice because developing two truly parallel forms is resource-intensive.
5.5 Summary: Choosing the Right Reliability Type
| Research Scenario | Recommended Type |
|---|---|
| Single-administration questionnaire (Likert items) | Internal consistency (alpha/omega) |
| Binary scored test (correct/incorrect) | KR-20 |
| Longitudinal measurement (same scale, two time points) | Test-retest ICC |
| Observational coding scheme (two coders) | Cohen's Kappa or ICC |
| Observational coding scheme (three+ coders) | Fleiss' Kappa or ICC |
| Continuous ratings by multiple raters | ICC |
| Two test forms used interchangeably | Parallel-forms |
| Ordinal items (< 5 categories) | Ordinal alpha or omega |
6. Using the Reliability Analysis Component
The Reliability Analysis component in DataStatPro provides a complete workflow for evaluating the reliability of multi-item scales and rating systems.
Step-by-Step Guide
Step 1 — Select Dataset
Choose the dataset from the "Dataset" dropdown. Ensure:
- Item responses are in separate columns (one column per item).
- The dataset contains only the items you wish to include in the scale.
- All items are coded in the same direction (or reverse-code negatively-worded items first).
💡 Tip: Use the DataStatPro data editor to reverse-code negatively worded items before running reliability analysis. For a 5-point scale, reverse coding transforms responses as $X_{\text{rev}} = 6 - X$ (so 1 ↔ 5, 2 ↔ 4, and 3 stays 3).
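If you prepare data outside DataStatPro, the same rule is a one-liner in pandas. A minimal sketch with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"item1": [1, 4, 5, 2], "item2_neg": [5, 2, 1, 4]})

# For a scale running from MIN to MAX: reversed score = (MIN + MAX) - X.
MIN, MAX = 1, 5
df["item2_rev"] = (MIN + MAX) - df["item2_neg"]  # 5->1, 4->2, 3->3, 2->4, 1->5
print(df)
```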
Step 2 — Select Scale Items
Select all variables (columns) that constitute the scale from the "Scale Items" dropdown. You can select multiple items. All selected items should:
- Be numeric.
- Use the same response scale (e.g., all 1–5 Likert).
- Theoretically measure the same underlying construct.
⚠️ Important: Do not mix items from different subscales in a single reliability analysis. Run separate analyses for each subscale.
Step 3 — Select Reliability Method
Choose from the "Method" dropdown:
- Cronbach's Alpha: Default, most widely used. Suitable for scales with many items on the same response scale. Assumes tau-equivalence.
- McDonald's Omega: Preferred method. Does not assume tau-equivalence. Uses a factor model internally. Recommended for congeneric scales.
- Split-Half (Spearman-Brown): Splits the scale into two halves and correlates them. Quick estimate, now largely superseded by alpha and omega.
- Inter-Rater Reliability (ICC / Kappa): For assessing agreement between raters.
Step 4 — Select Model (for ICC)
If using ICC for inter-rater reliability, select the appropriate ICC model:
- One-Way Random: Single raters randomly selected; raters differ across subjects.
- Two-Way Random: Raters are a random sample; interested in generalisability to all raters.
- Two-Way Mixed: Raters are fixed (same raters rate all subjects); generalise to these specific raters only.
Step 5 — Select ICC Type (for ICC)
- Consistency: Ignores systematic differences in rater means (appropriate when raters use different scale ranges habitually).
- Absolute Agreement: Accounts for systematic mean differences between raters (required when interchangeability of raters matters).
Step 6 — Select Confidence Level
Choose the confidence level for confidence intervals (default: 95%).
Step 7 — Display Options
Select which outputs to display:
- ✅ Overall Reliability Coefficient (with 95% CI)
- ✅ Item Statistics Table (mean, SD, skewness, kurtosis per item)
- ✅ Inter-Item Correlation Matrix
- ✅ Item-Total Statistics Table (corrected item-total correlation, alpha-if-deleted)
- ✅ Scale Statistics Summary
- ✅ Omega Hierarchical and Omega Total (if McDonald's Omega selected)
- ✅ Scree Plot / Factor Loadings (for omega)
- ✅ ICC Confidence Intervals and F-test (for inter-rater)
- ✅ Kappa Table with per-category agreement (for Kappa)
Step 8 — Run the Analysis
Click "Run Reliability Analysis". The application will:
- Compute item descriptive statistics.
- Calculate the inter-item correlation matrix.
- Estimate the chosen reliability coefficient(s) and confidence intervals.
- Produce item-total statistics (corrected correlations, alpha/omega-if-item-deleted).
- Display all selected outputs and visualisations.
7. Cronbach's Alpha
7.1 Definition and Formula
Cronbach's alpha ($\alpha$) is the most widely used measure of internal consistency reliability. It estimates the proportion of total scale variance attributable to the common factor shared by all items.
For a scale with $k$ items and covariance matrix $\mathbf{\Sigma}$:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right)$$

Where:
- $k$ = number of items.
- $\sigma_i^2$ = variance of item $i$.
- $\sigma_X^2$ = total scale score variance.
Equivalent form using the average inter-item covariance $\bar{c}$ and average item variance $\bar{v}$:

$$\alpha = \frac{k\,\bar{c}}{\bar{v} + (k-1)\,\bar{c}}$$

This form clearly shows that alpha depends on two quantities:
- The average inter-item covariance $\bar{c}$: how strongly items co-vary.
- The number of items $k$: more items → higher alpha (Spearman-Brown effect).
7.2 Standardised Alpha
When items are measured on different response scales (e.g., some items 1–5, others 1–7), it is appropriate to use the standardised alpha, which is based on the correlation matrix rather than the covariance matrix:

$$\alpha_{\text{std}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}$$

Where $\bar{r}$ is the average inter-item correlation.
This formula, the Spearman-Brown formula applied to the average inter-item correlation, shows that standardised alpha is determined solely by the number of items and the average correlation between them.
7.3 Cronbach's Alpha as a Lower Bound
Cronbach's alpha is a lower bound on the true reliability — it underestimates the true reliability under most conditions:

$$\alpha \leq \rho_{XX'}$$

with equality holding only when items are tau-equivalent (all true score variances are equal, i.e., all factor loadings are equal). Because psychological items are almost never tau-equivalent in practice, alpha typically underestimates the true reliability.
This is why McDonald's omega (Section 8) is theoretically preferred — it is an exact estimate of reliability under the congeneric model.
7.4 The Confidence Interval for Alpha
The sampling distribution of Cronbach's alpha is complex. An approximate confidence interval can be obtained via a Fisher-type transformation, but the more widely used Feldt (1965) method transforms alpha through:

$$\frac{1 - \alpha}{1 - \hat{\alpha}} \sim F\big(n-1,\ (n-1)(k-1)\big)$$

Where $\alpha$ is the population value, which follows an F-distribution with degrees of freedom $n-1$ and $(n-1)(k-1)$. The 95% CI bounds are:

$$\text{Lower} = 1 - (1 - \hat{\alpha})\,F_{0.975}, \qquad \text{Upper} = 1 - (1 - \hat{\alpha})\,F_{0.025}$$

Where $n$ is the sample size and $k$ is the number of items.
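A minimal Python sketch of the Feldt interval, assuming SciPy is available (the function name is our own):

```python
from scipy.stats import f

def feldt_ci(alpha_hat: float, n: int, k: int, conf: float = 0.95):
    """Feldt (1965) confidence interval for Cronbach's alpha.
    n = sample size, k = number of items."""
    df1, df2 = n - 1, (n - 1) * (k - 1)
    tail = (1 - conf) / 2
    lower = 1 - (1 - alpha_hat) * f.ppf(1 - tail, df1, df2)
    upper = 1 - (1 - alpha_hat) * f.ppf(tail, df1, df2)
    return lower, upper

print(feldt_ci(0.89, n=300, k=6))  # roughly (0.87, 0.91), as in Example 1 below
```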
💡 Always report the confidence interval alongside the point estimate of alpha. With small samples ($n < 100$), the confidence interval can be very wide (e.g., $\hat{\alpha} = 0.80$ with $n = 50$ might give a 95% CI of roughly [0.68, 0.89]), signalling considerable uncertainty in the estimate.
7.5 Interpreting Cronbach's Alpha
| Value | Interpretation | Typical Use |
|---|---|---|
| $\alpha \geq 0.95$ | Excellent — but may indicate redundancy | High-stakes clinical decisions |
| $0.90 \leq \alpha < 0.95$ | Excellent | High-stakes clinical decisions |
| $0.80 \leq \alpha < 0.90$ | Good | Most research applications |
| $0.70 \leq \alpha < 0.80$ | Acceptable | Exploratory research |
| $0.60 \leq \alpha < 0.70$ | Questionable — scale needs revision | Pilot studies only |
| $0.50 \leq \alpha < 0.60$ | Poor — major revision needed | Unacceptable for most purposes |
| $\alpha < 0.50$ | Unacceptable | Should not be used as a scale |
⚠️ Very high alpha ($\alpha > 0.95$) is not always desirable. It can indicate item redundancy — that items are so similar in wording that they add no unique information. Aim for $0.80 \leq \alpha \leq 0.90$ for most psychological scales, with not too many highly redundant items.
7.6 Alpha for Binary Items: KR-20 and KR-21
When all items are dichotomous (scored 0 or 1, as in knowledge tests), Cronbach's alpha reduces to the Kuder-Richardson Formula 20 (KR-20):

$$KR\text{-}20 = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i\,q_i}{\sigma_X^2}\right)$$

Where $p_i$ is the proportion of respondents answering item $i$ correctly, $q_i = 1 - p_i$, and $\sigma_X^2$ is the total test score variance.
When item difficulties are assumed equal ($p_i = \bar{p}$ for all items), a simpler formula is the Kuder-Richardson Formula 21 (KR-21):

$$KR\text{-}21 = \frac{k}{k-1}\left(1 - \frac{k\,\bar{p}\,\bar{q}}{\sigma_X^2}\right)$$

KR-21 is always ≤ KR-20. KR-20 produces exactly the same estimate as Cronbach's alpha applied to binary data; KR-21 is only an approximation unless item difficulties truly are equal.
7.7 Item-Total Statistics and Alpha-if-Item-Deleted
The item-total statistics table provides four critical pieces of information for each item:
| Statistic | Description | Use |
|---|---|---|
| Scale Mean if Item Deleted | Mean of total score if this item is removed | Identifies items that skew the scale |
| Scale Variance if Item Deleted | Variance of total score if this item is removed | Identifies items that affect scale spread |
| Corrected Item-Total Correlation | Correlation between item score and total minus that item | Primary item quality indicator |
| Alpha if Item Deleted | Alpha of the remaining items if this item is removed | Shows whether removing the item improves alpha |
Corrected Item-Total Correlation (CITC):

$$r_{i(t-i)} = \operatorname{corr}\big(X_i,\ T_{-i}\big)$$

where the subscript $-i$ means the total score with item $i$ removed ($T_{-i} = T - X_i$). This correction prevents the artificial inflation that would result from correlating an item with a total that includes the item itself.
Interpretation of CITC:
| CITC | Interpretation |
|---|---|
| $\geq 0.50$ | Excellent discriminator — strongly related to the construct |
| $0.30$ – $0.49$ | Good discriminator — acceptable item |
| $0.20$ – $0.29$ | Marginal — consider revision or removal |
| $< 0.20$ | Poor — item should be revised or removed |
| Negative | Item is negatively related to the scale — likely needs reverse coding or removal |
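A minimal Python sketch of the CITC computation (our own helper, not the DataStatPro implementation):

```python
import numpy as np

def corrected_item_total_correlations(data: np.ndarray) -> np.ndarray:
    """CITC for each item in a respondents-by-items matrix."""
    total = data.sum(axis=1)
    citc = []
    for i in range(data.shape[1]):
        rest = total - data[:, i]  # total score excluding item i
        citc.append(np.corrcoef(data[:, i], rest)[0, 1])
    return np.array(citc)
```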
7.8 Worked Manual Calculation of Cronbach's Alpha
Suppose we have a 3-item scale with the following covariance matrix (illustrative values):

$$\mathbf{\Sigma} = \begin{pmatrix} 1.20 & 0.60 & 0.50 \\ 0.60 & 1.00 & 0.55 \\ 0.50 & 0.55 & 1.10 \end{pmatrix}$$

Step 1 — Sum of item variances:

$$\sum_i \sigma_i^2 = 1.20 + 1.00 + 1.10 = 3.30$$

Step 2 — Total scale variance (sum of all elements):

$$\sigma_X^2 = 3.30 + 2\,(0.60 + 0.50 + 0.55) = 6.60$$

Step 3 — Apply Cronbach's formula:

$$\alpha = \frac{3}{2}\left(1 - \frac{3.30}{6.60}\right) = 1.5 \times 0.5 = 0.75$$

Interpretation: $\alpha = 0.75$ — acceptable internal consistency for a 3-item scale.
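The same arithmetic is easy to verify in code. A minimal Python sketch (the helper names are our own) that works from either a covariance matrix or raw data:

```python
import numpy as np

def alpha_from_cov(S: np.ndarray) -> float:
    """Cronbach's alpha from a k x k covariance matrix."""
    k = S.shape[0]
    return (k / (k - 1)) * (1 - np.trace(S) / S.sum())

def alpha_from_data(data: np.ndarray) -> float:
    """Cronbach's alpha from a respondents-by-items data matrix."""
    k = data.shape[1]
    item_vars = data.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = data.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1 - item_vars / total_var)

S = np.array([[1.20, 0.60, 0.50],
              [0.60, 1.00, 0.55],
              [0.50, 0.55, 1.10]])
print(alpha_from_cov(S))  # 0.75, matching the manual calculation
```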
8. McDonald's Omega
8.1 Limitations of Cronbach's Alpha and the Case for Omega
While Cronbach's alpha is the most widely reported reliability index, it has several well-documented limitations:
- Assumes tau-equivalence — all items must have equal factor loadings. Violated in virtually all real scales.
- Underestimates true reliability when items are congeneric (unequal loadings).
- Can be artificially inflated by correlated errors or multidimensionality.
- Does not distinguish between general factor variance and group factor variance in multidimensional scales.
McDonald's omega ($\omega$) overcomes these limitations by explicitly modelling the factor structure of the scale. It is now recommended by major methodologists as the preferred reliability index over Cronbach's alpha.
8.2 The Congeneric Model
McDonald's omega is based on the congeneric measurement model — a single-factor CFA where items can have different factor loadings (unlike the tau-equivalent model assumed by alpha):

$$X_i = \lambda_i\,F + \varepsilon_i$$

Where:
- $\lambda_i$ = factor loading of item $i$ (not constrained to be equal).
- $F$ = the common latent factor (standardised: $\operatorname{Var}(F) = 1$).
- $\varepsilon_i$ = unique factor for item $i$, with variance $\theta_i$.
The model-implied covariance matrix is:

$$\mathbf{\Sigma} = \boldsymbol{\lambda}\boldsymbol{\lambda}' + \mathbf{\Theta}$$

Where $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_k)'$ and $\mathbf{\Theta} = \operatorname{diag}(\theta_1, \ldots, \theta_k)$.
8.3 Omega Total ($\omega_t$)
Omega total ($\omega_t$) is the reliability of the total composite score from a congeneric single-factor model. It equals the squared correlation between the true score and the observed total score:

$$\omega_t = \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^2}{\left(\sum_{i=1}^{k}\lambda_i\right)^2 + \sum_{i=1}^{k}\theta_i}$$

Where $\lambda_i$ are the standardised factor loadings and $\theta_i$ are the unique variances (uniquenesses).
This formula has a clear interpretation:
- Numerator: The variance attributable to the common factor (the signal).
- Denominator: The total variance (signal + noise).
💡 Omega total is equivalent to Cronbach's alpha when items are tau-equivalent, and exceeds alpha when items are congeneric (unequal loadings). In practice, omega is usually somewhat higher than alpha.
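A minimal Python sketch of the omega-total formula from standardised CFA estimates (the helper name is our own; in practice the loadings would come from a fitted single-factor CFA). The numbers reproduce the worked example in Section 14:

```python
import numpy as np

def omega_total(loadings, uniquenesses) -> float:
    """Omega total from standardised single-factor CFA estimates."""
    lam = np.asarray(loadings)
    theta = np.asarray(uniquenesses)
    common = lam.sum() ** 2  # (sum of loadings)^2 = common-factor variance
    return common / (common + theta.sum())

lam = [0.82, 0.79, 0.71, 0.68, 0.85, 0.74, 0.63, 0.77]
theta = [0.33, 0.38, 0.50, 0.54, 0.28, 0.45, 0.60, 0.41]
print(omega_total(lam, theta))  # ~0.911
```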
8.4 Omega Hierarchical ($\omega_h$)
For multidimensional scales with both a general factor and group-specific factors (bifactor structure), omega hierarchical ($\omega_h$) quantifies the proportion of total score variance attributable to the general factor alone:

$$\omega_h = \frac{\left(\sum_{i=1}^{k}\lambda_{gi}\right)^2}{\sigma_X^2}$$

Where $\lambda_{gi}$ is the loading of item $i$ on the general factor (from a bifactor model), and $\sigma_X^2$ is the total scale variance.
Omega hierarchical subscale ($\omega_{hs}$) is the proportion of variance in a subscale's total score attributable to the group-specific factor:

$$\omega_{hs} = \frac{\left(\sum_{i \in s}\lambda_{si}\right)^2}{\sigma_{X_s}^2}$$

Where the sum is over items in subscale $s$ and $\lambda_{si}$ is the loading on the group-specific factor $s$.
8.5 Comparison of Alpha and Omega
| Property | Cronbach's Alpha | McDonald's Omega (Total) |
|---|---|---|
| Assumes tau-equivalence | Yes | No |
| Appropriate for congeneric items | No (underestimates) | Yes (correct estimate) |
| Requires factor analysis | No | Yes (single-factor CFA) |
| Sensitive to multidimensionality | Yes (can inflate) | Partially |
| Can separate general/group factors | No | Yes ($\omega_h$ vs. $\omega_t$) |
| Current methodological recommendation | Legacy default | Increasingly preferred |
| Sensitivity to correlated errors | Inflated | Can model explicitly |
General rule: When items are tau-equivalent → alpha ≈ omega. When items are congeneric (different loadings) → omega > alpha. The difference between omega and alpha is larger when loadings vary more across items.
8.6 Interpreting Omega Values
The same benchmarks as Cronbach's alpha apply to omega:
| $\omega_t$ | Interpretation |
|---|---|
| $\geq 0.90$ | Excellent reliability |
| $0.80$ – $0.89$ | Good reliability |
| $0.70$ – $0.79$ | Acceptable reliability |
| $0.60$ – $0.69$ | Questionable — revision needed |
| $< 0.60$ | Poor — major revision required |
For omega hierarchical ($\omega_h$), which represents the reliability attributable only to the general factor:
| $\omega_h$ | Interpretation |
|---|---|
| $\geq 0.80$ | Strong general factor; composite score is justified |
| $0.65$ – $0.79$ | Moderate general factor; composite score is defensible |
| $0.50$ – $0.64$ | Weak general factor; subscale scores may be preferable |
| $< 0.50$ | Very weak general factor; total score is not recommended |
8.7 The $\omega_h/\omega_t$ Ratio (Explained Common Variance)
The ratio of omega hierarchical to omega total is sometimes called the ECV (Explained Common Variance) and quantifies how much of the reliable variance is attributable to the general factor vs. group factors:

$$\text{ECV} = \frac{\omega_h}{\omega_t}$$

- $\omega_h/\omega_t > 0.80$: Almost all reliable variance is due to the general factor — the scale is essentially unidimensional and a total score is well-justified.
- $\omega_h/\omega_t < 0.60$: Substantial group factor variance — subscale scores carry unique meaning beyond the total score.
9. Split-Half Reliability
9.1 The Split-Half Method
Split-half reliability estimates reliability by dividing the scale into two halves, computing the total score for each half, and correlating the two half-scores. This provides an estimate based on a single test administration (unlike test-retest reliability).
The Pearson correlation between the two half-scores ($A$ and $B$) is:

$$r_{AB} = \operatorname{corr}(A, B)$$

However, this correlation estimates the reliability of a half-length test, not the full test. The Spearman-Brown correction is applied to estimate the reliability of the full test:

$$r_{\text{SB}} = \frac{2\,r_{AB}}{1 + r_{AB}}$$
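A minimal Python sketch of an odd-even split with the Spearman-Brown correction (our own helper, not the DataStatPro implementation):

```python
import numpy as np

def split_half_reliability(data: np.ndarray) -> float:
    """Odd-even split-half reliability for a respondents-by-items matrix."""
    half_a = data[:, 0::2].sum(axis=1)  # items 1, 3, 5, ... (0-indexed even columns)
    half_b = data[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r_ab = np.corrcoef(half_a, half_b)[0, 1]
    return 2 * r_ab / (1 + r_ab)        # Spearman-Brown correction to full length
```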
9.2 Methods for Splitting the Scale
| Method | How | Issue |
|---|---|---|
| Odd-Even Split | Odd-numbered items → Half A; even-numbered → Half B | Assumes order of items does not matter |
| First-Last Split | First $k/2$ items → Half A; last $k/2$ → Half B | Favours scales where item order is random |
| Random Split | Items randomly assigned to halves | More reproducible with many iterations |
| Matched-Random Split | Items matched on difficulty/content then split | Best for heterogeneous item sets |
⚠️ The split-half method gives different results depending on how the scale is split. This is a major weakness. Cronbach's alpha can be interpreted as the average of all possible split-half reliabilities — making it a more stable and preferred estimate. Split-half is primarily of historical interest today.
9.3 The Guttman Lambda Coefficients
The Guttman (1945) lambda coefficients are a family of reliability lower bounds. The most useful are:
Lambda 2 ($\lambda_2$): The tightest simple lower bound computable without factor analysis:

$$\lambda_2 = 1 - \frac{\sum_i \sigma_i^2}{\sigma_X^2} + \frac{\sqrt{\dfrac{k}{k-1}\sum_{i \neq j}\sigma_{ij}^2}}{\sigma_X^2}$$

Lambda 4 ($\lambda_4$): The maximum split-half reliability over all possible splits (the greatest split-half). It equals Cronbach's alpha when items are tau-equivalent, and typically exceeds alpha for congeneric items.
Lambda 6 ($\lambda_6$): Based on the squared multiple correlations of each item with all others:

$$\lambda_6 = 1 - \frac{\sum_{i=1}^{k}\sigma_i^2\,(1 - R_i^2)}{\sigma_X^2}$$

Where $R_i^2$ is the $R^2$ from regressing item $i$ on all other items.
9.4 The Greatest Lower Bound (GLB)
The Greatest Lower Bound (GLB) is the maximum possible reliability lower bound, computed by maximising the total error variance over all admissible decompositions of the covariance matrix:

$$\text{GLB} = 1 - \frac{\max_{\mathbf{\Theta}}\ \sum_{i=1}^{k}\theta_i}{\sigma_X^2}$$

Subject to the constraint that $\mathbf{\Sigma}_T$ is positive semidefinite, where $\mathbf{\Sigma}_T = \mathbf{\Sigma} - \mathbf{\Theta}$ and $\mathbf{\Theta} = \operatorname{diag}(\theta_1, \ldots, \theta_k)$.
GLB $\geq \alpha$ — the GLB is never less than alpha or any other lower bound. However, the GLB can be severely positively biased in small samples and may overestimate reliability more than omega. Use with caution when $n < 1000$.
10. Inter-Rater Reliability
10.1 Why Inter-Rater Reliability Matters
When data collection relies on human judgment — observations, coding of qualitative data, clinical assessments, interview ratings — different raters may disagree. Inter-rater reliability (IRR) quantifies the degree of agreement between raters and determines whether ratings can be trusted as objective.
Low IRR suggests:
- Ambiguous category definitions or coding rules.
- Insufficient rater training.
- The phenomenon being rated is inherently subjective.
- The rating scale is poorly designed.
10.2 Percent Agreement
The simplest IRR measure is percent agreement — the proportion of ratings on which all raters agree:

$$\%\ \text{agreement} = \frac{\text{number of cases with agreement}}{\text{total number of cases}} \times 100$$
Critical limitation: Percent agreement does not correct for the level of agreement expected purely by chance. Two raters randomly assigning ratings to binary categories (50/50 split) would agree about 50% of the time by chance alone.
10.3 Cohen's Kappa ($\kappa$)
Cohen's Kappa corrects for chance agreement. For two raters assigning $n$ subjects to $q$ categories:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Where:
- $p_o$ = observed proportion of agreement.
- $p_e$ = expected proportion of agreement by chance.
Computing $p_e$: For a contingency table with row proportions $p_{i\cdot}$ and column proportions $p_{\cdot i}$:

$$p_e = \sum_{i=1}^{q} p_{i\cdot}\,p_{\cdot i}$$
Example for a 2-category rating (agreement / disagreement):
Suppose two raters classify 100 subjects as "Case" or "Non-Case":
| Rater B: Case | Rater B: Non-Case | Row Total | |
|---|---|---|---|
| Rater A: Case | 45 | 10 | 55 |
| Rater A: Non-Case | 5 | 40 | 45 |
| Column Total | 50 | 50 | 100 |
Computing kappa from the table:

$$p_o = \frac{45 + 40}{100} = 0.85, \qquad p_e = (0.55)(0.50) + (0.45)(0.50) = 0.50$$

$$\kappa = \frac{0.85 - 0.50}{1 - 0.50} = 0.70$$

Standard Error of Kappa (approximate):

$$SE(\kappa) = \sqrt{\frac{p_o\,(1 - p_o)}{n\,(1 - p_e)^2}} = \sqrt{\frac{0.85 \times 0.15}{100 \times 0.25}} \approx 0.071$$

95% Confidence Interval:

$$\kappa \pm 1.96 \times SE(\kappa) = 0.70 \pm 0.14 = [0.56,\ 0.84]$$
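The same computation in a short Python sketch (our own helper name), using the contingency table above:

```python
import numpy as np

def cohens_kappa(table: np.ndarray):
    """Kappa, approximate SE, and 95% CI from a two-rater contingency table."""
    n = table.sum()
    p = table / n
    p_o = np.trace(p)                            # observed agreement
    p_e = (p.sum(axis=1) * p.sum(axis=0)).sum()  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    se = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, se, (kappa - 1.96 * se, kappa + 1.96 * se)

# Rows = Rater A, columns = Rater B.
print(cohens_kappa(np.array([[45, 10], [5, 40]])))  # kappa = 0.70
```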
10.4 Interpreting Cohen's Kappa
| $\kappa$ | Strength of Agreement |
|---|---|
| $< 0$ | Less than chance agreement (worse than random) |
| $0.00$ – $0.20$ | Slight |
| $0.21$ – $0.40$ | Fair |
| $0.41$ – $0.60$ | Moderate |
| $0.61$ – $0.80$ | Substantial |
| $0.81$ – $1.00$ | Almost Perfect |
(Landis & Koch, 1977 benchmarks — widely used but not universally accepted)
⚠️ Kappa is sensitive to the prevalence (base rate) of each category. When one category is very rare, even high percent agreement can yield a very low kappa. Always report percent agreement alongside kappa.
10.5 Weighted Kappa ($\kappa_w$)
For ordinal rating scales (where disagreements of different magnitudes are not equally serious), weighted kappa assigns weights based on the severity of disagreement:

$$\kappa_w = \frac{\sum_{i,j} w_{ij}\,p_{ij} - \sum_{i,j} w_{ij}\,e_{ij}}{1 - \sum_{i,j} w_{ij}\,e_{ij}}$$

Where $w_{ij}$ are the agreement weights (1 on the diagonal, decreasing with the size of the disagreement), $p_{ij}$ are the observed proportions, and $e_{ij}$ are the expected proportions under independence.
Common weighting schemes (for categories $i, j = 1, \ldots, q$):
| Weight Type | Formula | Suitable For |
|---|---|---|
| Linear weights | $w_{ij} = 1 - \lvert i - j \rvert / (q - 1)$ | Disagreement cost grows linearly with distance |
| Quadratic weights | $w_{ij} = 1 - \big((i - j)/(q - 1)\big)^2$ | Disagreement cost grows with squared distance |
Note: Weighted kappa with quadratic weights is mathematically equivalent to the ICC(2,1) model (see Section 10.6).
10.6 Intraclass Correlation Coefficient (ICC)
The Intraclass Correlation Coefficient (ICC) is the most versatile measure of inter-rater reliability for continuous or interval-scale ratings. Unlike Cohen's Kappa, the ICC can handle:
- More than two raters.
- Continuous ratings.
- Both consistency (relative agreement) and absolute agreement.
The ICC is based on a one-way or two-way ANOVA decomposition of the total variance in ratings into:
- Between-subjects variance ($\sigma_s^2$): True differences between subjects.
- Between-raters variance ($\sigma_r^2$): Systematic rater differences (in two-way models).
- Error / residual variance ($\sigma_e^2$): Random inconsistency.
Six standard ICC models (Shrout & Fleiss, 1979; McGraw & Wong, 1996):
| ICC Model | Notation | Rater Design | Measures |
|---|---|---|---|
| One-Way Random, Single | ICC(1,1) | Each subject rated by a different random rater | Consistency |
| One-Way Random, Mean of $k$ | ICC(1,$k$) | Each subject rated by $k$ different raters; average used | Consistency |
| Two-Way Random, Single | ICC(2,1) | Same raters rate all subjects; raters random | Absolute agreement |
| Two-Way Random, Mean of $k$ | ICC(2,$k$) | Same raters rate all; raters random; average used | Absolute agreement |
| Two-Way Mixed, Single | ICC(3,1) | Same fixed raters; single rating used | Consistency |
| Two-Way Mixed, Mean of $k$ | ICC(3,$k$) | Same fixed raters; average of $k$ ratings used | Consistency |
ICC Formulas (Two-Way Models):
For $n$ subjects and $k$ raters, with mean squares from a two-way ANOVA ($MS_B$ = between-subjects, $MS_R$ = between-raters, $MS_E$ = residual):
ICC(3,1) — Consistency:

$$\text{ICC}(3,1) = \frac{MS_B - MS_E}{MS_B + (k-1)\,MS_E}$$

ICC(2,1) — Absolute Agreement:

$$\text{ICC}(2,1) = \frac{MS_B - MS_E}{MS_B + (k-1)\,MS_E + \dfrac{k\,(MS_R - MS_E)}{n}}$$

For averaged ratings — ICC(3,$k$) — Consistency:

$$\text{ICC}(3,k) = \frac{MS_B - MS_E}{MS_B}$$
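A minimal Python sketch computing these ICCs directly from the ANOVA mean squares (the helper name is our own):

```python
def icc_from_anova(ms_b: float, ms_r: float, ms_e: float, n: int, k: int):
    """ICC(3,1), ICC(2,1), ICC(3,k) from two-way ANOVA mean squares.
    ms_b = between-subjects, ms_r = between-raters, ms_e = residual;
    n = subjects, k = raters."""
    icc_3_1 = (ms_b - ms_e) / (ms_b + (k - 1) * ms_e)
    icc_2_1 = (ms_b - ms_e) / (ms_b + (k - 1) * ms_e + k * (ms_r - ms_e) / n)
    icc_3_k = (ms_b - ms_e) / ms_b
    return icc_3_1, icc_2_1, icc_3_k

# Mean squares from Example 3 (Section 14): two raters, 40 patients.
print(icc_from_anova(10.585, 8.100, 1.497, n=40, k=2))
# -> approximately (0.752, 0.732, 0.859)
```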
10.7 Confidence Intervals for ICC
The 95% CI for ICC(3,1) is computed using the F-distribution (the CI for ICC(2,1) has a more complex form):

$$F_L = \frac{F_{\text{obs}}}{F_{0.975}(df_1, df_2)}, \qquad F_U = F_{\text{obs}} \times F_{0.975}(df_2, df_1)$$

$$\text{CI} = \left[\frac{F_L - 1}{F_L + k - 1},\ \frac{F_U - 1}{F_U + k - 1}\right]$$

With $df_1 = n - 1$ and $df_2 = (n-1)(k-1)$.
Where $F_{\text{obs}} = MS_B / MS_E$.
10.8 Interpreting ICC Values
| ICC | Reliability Quality |
|---|---|
| $< 0.50$ | Poor |
| $0.50$ – $0.75$ | Moderate |
| $0.75$ – $0.90$ | Good |
| $> 0.90$ | Excellent |
(Koo & Li, 2016 benchmarks — widely used in clinical research)
10.9 Fleiss' Kappa (Multiple Raters, Nominal Scale)
When more than two raters independently classify $N$ subjects into $q$ categories, Fleiss' Kappa generalises Cohen's Kappa:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

Where:

$$\bar{P} = \frac{1}{N}\sum_{i=1}^{N}P_i, \qquad P_i = \frac{1}{m(m-1)}\left(\sum_{j=1}^{q} n_{ij}^2 - m\right), \qquad \bar{P}_e = \sum_{j=1}^{q} p_j^2, \qquad p_j = \frac{1}{Nm}\sum_{i=1}^{N} n_{ij}$$

With $N$ = number of subjects, $m$ = number of raters, $n_{ij}$ = number of raters assigning subject $i$ to category $j$.
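A compact Python sketch of Fleiss' kappa from a subjects-by-categories count matrix (our own helper; it assumes every subject is rated by the same number of raters):

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of raters assigning subject i to category j;
    every row must sum to the same number of raters m."""
    N = counts.shape[0]
    m = counts[0].sum()
    p_j = counts.sum(axis=0) / (N * m)                         # category proportions
    P_i = (np.square(counts).sum(axis=1) - m) / (m * (m - 1))  # per-subject agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)
```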
10.10 Krippendorff's Alpha
Krippendorff's alpha ($\alpha_K$) is a versatile agreement measure that:
- Handles any number of raters.
- Works for any scale of measurement (nominal, ordinal, interval, ratio).
- Handles missing data (not all subjects need to be rated by all raters).
$$\alpha_K = 1 - \frac{D_o}{D_e}$$

Where:
- $D_o$ = observed disagreement (average metric disagreement between raters).
- $D_e$ = expected disagreement under chance agreement.
For ordinal data, the metric function $\delta^2$ is the squared difference between category ranks. For interval data, $\delta^2(x, y) = (x - y)^2$. For nominal data, $\delta(x, y) = 0$ if $x = y$ and $1$ otherwise.
Krippendorff recommends $\alpha_K \geq 0.80$ for reliable conclusions, with $0.667 \leq \alpha_K < 0.80$ allowing only tentative conclusions.
11. Item Analysis
11.1 What is Item Analysis?
Item analysis is the process of evaluating the statistical properties of individual items to determine which items contribute positively to the scale's reliability and validity, and which should be revised or removed.
Item analysis is typically performed as part of reliability analysis and is crucial during:
- Initial scale development (identifying weak items).
- Scale revision (improving problematic items).
- Test construction (selecting the best items from a larger item pool).
11.2 Item Difficulty (for Knowledge Tests)
For knowledge tests with correct/incorrect scoring, item difficulty ($p$-value) is the proportion of respondents who answer the item correctly:

$$p_i = \frac{\text{number answering item } i \text{ correctly}}{N}$$
| Difficulty ($p$) | Interpretation |
|---|---|
| $< 0.20$ | Very difficult — too hard for most |
| $0.20$ – $0.39$ | Difficult |
| $0.40$ – $0.60$ | Moderate — optimal for discrimination |
| $0.61$ – $0.80$ | Easy |
| $> 0.80$ | Very easy — too easy for most |
Items with $p \approx 0.50$ provide the most information about differences between individuals (maximum variance). However, items at extremes ($p < 0.10$ or $p > 0.90$) have low variance and contribute little to reliability.
11.3 Item Discrimination Index
The item discrimination index ($D$) measures how well an item differentiates between high-scoring and low-scoring respondents. It is computed using the extreme groups method:
- Divide respondents into the top 27% (High group, $H$) and bottom 27% (Low group, $L$) based on total score.
- Compute the proportion correct in each group: $p_H$ and $p_L$.
- Compute:

$$D = p_H - p_L$$
| $D$ | Interpretation |
|---|---|
| $\geq 0.40$ | Excellent discriminator |
| $0.30$ – $0.39$ | Good discriminator |
| $0.20$ – $0.29$ | Marginal — consider revision |
| $< 0.20$ | Poor — revise or remove |
| Negative | Perverse — high scorers do worse (review carefully) |
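A minimal Python sketch computing both difficulty and the extreme-groups discrimination index for binary-scored data (the helper name is our own):

```python
import numpy as np

def difficulty_and_discrimination(scores: np.ndarray, item: int):
    """scores: respondents-by-items 0/1 matrix. Returns (p, D) for one item."""
    total = scores.sum(axis=1)
    order = np.argsort(total)
    n_group = max(1, int(round(0.27 * len(total))))  # 27% extreme groups
    low, high = order[:n_group], order[-n_group:]
    p = scores[:, item].mean()                                # difficulty
    d = scores[high, item].mean() - scores[low, item].mean()  # discrimination
    return p, d
```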
11.4 Item-Rest Correlation (Corrected Item-Total Correlation)
As introduced in Section 7.7, the corrected item-total correlation (CITC) is the primary item quality indicator for Likert-type scales. It is equivalent to the item discrimination index for continuous scales and should be:
- $\geq 0.30$: Item is a satisfactory indicator of the construct.
- $0.20$ – $0.29$: Item is marginal; consider revision.
- $< 0.20$: Item is a poor indicator; strong candidate for removal.
- Negative: Item is inversely related to the construct (check for reverse coding).
11.5 Inter-Item Correlation Analysis
Beyond item-total correlations, examining the inter-item correlation matrix reveals:
Items that are too highly correlated ($r > 0.80$): May be redundant — they are essentially asking the same question twice and add little unique information. One should be removed or both should be revised to be more distinct.
Items that are too weakly correlated ($r < 0.20$ with most other items): Likely measuring a different construct. These items should be examined theoretically and may need to be placed in a different subscale or removed.
Average inter-item correlation: Values of $0.15$ to $0.50$ are typically considered optimal. Very high average correlations ($\bar{r} > 0.60$) with many items indicate excessive redundancy.
11.6 Floor and Ceiling Effects
Floor effects occur when most respondents score near the minimum possible score. Ceiling effects occur when most respondents score near the maximum possible score.
Both effects:
- Reduce variance (compress the distribution).
- Attenuate correlations with other variables.
- Reduce reliability.
- Make it difficult to detect differences or changes.
Check for floor/ceiling effects by inspecting:
- The distribution of item responses (histograms).
- Skewness: Extreme skewness ($|\text{skewness}| > 2$) signals potential floor/ceiling issues.
- Proportion of respondents at the minimum or maximum: more than 15% suggests an issue.
11.7 Item Response Curves
For knowledge tests (binary items), the item response curve (IRC) or item characteristic curve (ICC) plots the probability of a correct response as a function of total test score.
A well-functioning item should show a monotonically increasing S-shaped curve — the probability of a correct answer should consistently increase with the total score. Items that show a non-monotonic curve (e.g., high scorers are less likely to answer correctly than medium scorers) are flagged as problematic discriminators.
11.8 The Item Analysis Decision Framework
For each item, apply the following checks in order:
1. Is CITC < 0.20? Yes → flag for removal or revision. No → continue.
2. Is any item-item correlation > 0.80 with another item? Yes → flag for redundancy; remove one of the pair. No → continue.
3. Does alpha-if-deleted substantially exceed the current alpha (by > 0.05)? Yes → strong candidate for removal. No → continue.
4. Is skewness > |2| or kurtosis > |7| (floor/ceiling effects)? Yes → consider item revision or transformation. No → retain the item with confidence.
12. Model Fit and Evaluation
12.1 Reporting Reliability: Minimum Requirements
At minimum, a reliability report should include:
- The reliability coefficient (alpha, omega, ICC, kappa, etc.).
- The 95% confidence interval around the coefficient.
- The number of items included in the analysis.
- The sample size ().
- The method used (Cronbach's alpha, McDonald's omega, ICC model, etc.).
- Item-level statistics (means, SDs, corrected item-total correlations).
Example APA-style reporting:
"Internal consistency of the 10-item Emotional Regulation Scale was evaluated using McDonald's omega (), as items were expected to have unequal factor loadings (congeneric model). Omega total was (95% CI [0.84, 0.90]), indicating good internal consistency. Omega hierarchical was , suggesting that the majority of reliable variance was attributable to the general factor. Corrected item-total correlations ranged from 0.41 to 0.68, with all items exceeding the acceptable threshold of 0.30."
12.2 Scale-Level Statistics
Beyond the reliability coefficient, the following scale-level statistics should be computed and reported:
| Statistic | Formula | Interpretation |
|---|---|---|
| Scale Mean | $\bar{T} = \sum_i \bar{X}_i$ | Average composite score |
| Scale Variance | $\sigma_X^2 = \mathbf{1}'\mathbf{\Sigma}\mathbf{1}$ | Spread of composite scores |
| Scale SD | $\sigma_X = \sqrt{\sigma_X^2}$ | SD of composite scores |
| SEM | $\sigma_X\sqrt{1 - \rho_{XX'}}$ | Average error in individual scores |
| Range | Max − Min | Spread of composite scores observed |
| Skewness & Kurtosis | Standard formulas | Check normality of composite |
12.3 Assessing the Factor Structure Before Reliability Analysis
Before running reliability analysis, it is best practice to verify the factor structure:
Step 1 — Exploratory Factor Analysis (EFA):
- Run EFA with parallel analysis to determine the number of factors.
- If a single dominant factor is confirmed (parallel analysis retains 1 factor), the scale is approximately unidimensional → proceed with alpha or omega.
- If 2+ factors are retained → split into subscales and analyse each separately.
Step 2 — Confirmatory Factor Analysis (CFA):
- Specify a single-factor CFA model.
- Evaluate fit (CFI ≥ 0.95, RMSEA ≤ 0.06, SRMR ≤ 0.08).
- If fit is good → use omega total from the CFA solution.
- If fit is poor → consider multidimensional model; use omega hierarchical from bifactor model.
12.4 Evaluating Convergent and Discriminant Validity
Convergent validity: The scale should correlate strongly with other measures of the same or similar constructs (theoretically related measures). Typically evaluated using Pearson or Spearman correlations.
Discriminant validity: The scale should correlate weakly with measures of theoretically unrelated constructs.
Using reliability information, the disattenuated correlation (Section 3.7) provides the best estimate of the true relationship between constructs, corrected for measurement error.
12.5 Minimum Acceptable Reliability by Context
The required level of reliability depends on the stakes and purpose of measurement:
| Context | Minimum Acceptable | Preferred |
|---|---|---|
| Group-level research (comparing means) | 0.70 | ≥ 0.80 |
| Individual-level decisions (clinical) | 0.90 | ≥ 0.95 |
| High-stakes testing (licensure) | 0.90 | ≥ 0.95 |
| Pilot / exploratory research | 0.60 | ≥ 0.70 |
| Inter-rater agreement (research) | 0.70 ICC | ≥ 0.80 ICC |
| Inter-rater agreement (clinical) | 0.90 ICC | ≥ 0.95 ICC |
13. Advanced Topics
13.1 Ordinal Reliability: Polychoric Correlations
When scale items use fewer than 5 ordinal categories (e.g., a 3-point or 4-point Likert scale), treating Likert responses as continuous can distort covariances and underestimate reliability. A more appropriate approach uses polychoric correlations as the input matrix.
The polychoric correlation between two ordinal items and estimates the correlation between the underlying continuous latent variables that generate the observed ordinal responses. It is estimated by maximum likelihood, assuming bivariate normality of the latent variables.
Ordinal alpha is Cronbach's alpha computed on the polychoric correlation matrix:

$$\alpha_{\text{ordinal}} = \frac{k\,\bar{r}_{\text{pc}}}{1 + (k-1)\,\bar{r}_{\text{pc}}}$$

Where $\bar{r}_{\text{pc}}$ is the average polychoric correlation.
Ordinal omega is McDonald's omega estimated from a factor model fit to the polychoric correlation matrix (using WLSMV or similar ordinal estimator in CFA).
Ordinal alpha and omega are typically higher than their Pearson-based counterparts for coarsely-rated Likert items, because polychoric correlations are less attenuated by the coarse ordinal scaling.
13.2 Reliability in Generalisability Theory (G-Theory)
Generalisability Theory (G-Theory) extends CTT by recognising that measurement error can have multiple sources (facets). In a rating study, error might come from:
- Items (some items are harder or easier than others).
- Raters (some raters are more lenient than others).
- Occasions (scores vary across testing sessions).
- Interactions (some raters are harsher on certain items).
A G-study uses a fully crossed (or nested) ANOVA to partition the total variance into components corresponding to each facet and their interactions:

$$\sigma_X^2 = \sigma_p^2 + \sigma_i^2 + \sigma_r^2 + \sigma_{pi}^2 + \sigma_{pr}^2 + \sigma_{ir}^2 + \sigma_{pir,e}^2$$

Where $p$ = persons, $i$ = items, $r$ = raters.
The Generalisability Coefficient (G-coefficient) is analogous to reliability:

$$E\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_\delta^2}$$

Where $\sigma_\delta^2$ is the error variance appropriate to the measurement design.
A D-study uses the G-study variance components to predict how reliability would change if the number of items, raters, or occasions were varied — similar to the Spearman-Brown formula but for multiple facets simultaneously.
13.3 Reliability of Difference Scores
When researchers compute difference scores (e.g., post-treatment score minus pre-treatment score, or the difference between two subscales), the reliability of the difference $D = X - Y$ is typically lower than the reliability of either component:

$$\rho_{DD'} = \frac{\sigma_X^2\,\rho_{XX'} + \sigma_Y^2\,\rho_{YY'} - 2\,\sigma_X\sigma_Y\,r_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\,\sigma_X\sigma_Y\,r_{XY}}$$

Where:
- $\rho_{XX'}$, $\rho_{YY'}$ = reliabilities of $X$ and $Y$.
- $\sigma_X^2$, $\sigma_Y^2$ = variances of $X$ and $Y$.
- $r_{XY}$ = observed correlation between $X$ and $Y$.
For parallel measures ($\sigma_X = \sigma_Y$ and $\rho_{XX'} = \rho_{YY'} = \rho$):

$$\rho_{DD'} = \frac{\rho - r_{XY}}{1 - r_{XY}}$$

This shows that when $X$ and $Y$ are highly correlated (as expected when both are pre/post measures of the same construct), the reliability of the difference score can be very low.
Example: $\rho_{XX'} = \rho_{YY'} = 0.80$, $r_{XY} = 0.70$:

$$\rho_{DD'} = \frac{0.80 - 0.70}{1 - 0.70} = \frac{0.10}{0.30} \approx 0.33$$

Even though each measure has reliability 0.80, their difference has reliability of only 0.33! This is why difference scores are generally discouraged and residualised change scores or ANCOVA are preferred for measuring change.
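A small Python sketch of the difference-score formula (the helper name is our own), reproducing the example above:

```python
def difference_score_reliability(rel_x, rel_y, var_x, var_y, r_xy):
    """Reliability of D = X - Y from component reliabilities and variances."""
    sd_x, sd_y = var_x ** 0.5, var_y ** 0.5
    num = var_x * rel_x + var_y * rel_y - 2 * sd_x * sd_y * r_xy
    den = var_x + var_y - 2 * sd_x * sd_y * r_xy
    return num / den

# Parallel measures, each with reliability 0.80, correlated at 0.70:
print(difference_score_reliability(0.80, 0.80, 1.0, 1.0, 0.70))  # ~0.333
```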
13.4 Reliability and the Attenuation-Correction Decision
When planning a study, the researcher must decide whether to:
- Accept observed correlations (with attenuation from unreliability), or
- Correct for attenuation to estimate the true relationship.
Arguments for correcting:
- Shows the theoretical true relationship between constructs.
- Allows comparison across studies using instruments of different quality.
- Better informs theory testing.
Arguments against correcting:
- The corrected estimate is population-level and not applicable to individual predictions.
- Relies on accurate reliability estimates (which carry their own uncertainty).
- Can produce $r > 1.0$, which is inadmissible.
Best practice: Report both the observed and disattenuated correlations, and always report the reliability estimates used for correction.
13.5 Reliability of Composite Scores from Multiple Subscales
When a total score is formed by combining items from multiple subscales, reliability cannot be computed by treating all items as a single scale (which would violate the unidimensionality assumption). Instead, use Mosier's formula for the reliability of a composite:

$$\rho_C = 1 - \frac{\sum_{j=1}^{J} w_j^2\,\sigma_j^2\,(1 - \rho_j)}{\sigma_C^2}$$

Where:
- $J$ = number of subscales.
- $w_j$ = weight of subscale $j$ in the composite (1 for unweighted sum).
- $\sigma_j^2$ = variance of subscale $j$.
- $\rho_j$ = reliability of subscale $j$.
- $\sigma_C^2$ = total variance of the composite score.
This formula partitions total composite variance into reliable variance (from true scores) and error variance (from subscale measurement errors), providing an accurate estimate of the composite's reliability.
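A minimal Python sketch of Mosier's formula (our own helper; `cov` is the covariance matrix of the subscale scores computed from your data):

```python
import numpy as np

def mosier_composite_reliability(weights, variances, reliabilities, cov) -> float:
    """Reliability of a weighted composite of subscales (Mosier's formula)."""
    w = np.asarray(weights, dtype=float)
    var = np.asarray(variances, dtype=float)
    rel = np.asarray(reliabilities, dtype=float)
    composite_var = w @ np.asarray(cov, dtype=float) @ w  # total composite variance
    error_var = (w ** 2 * var * (1 - rel)).sum()          # summed subscale error variance
    return 1 - error_var / composite_var
```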
13.6 Item Response Theory (IRT) and Marginal Reliability
Item Response Theory (IRT) provides a framework for reliability that is more flexible than CTT. In IRT, the precision of measurement is not constant across the score range — it is highest where the test has the most information.
The Test Information Function quantifies how much information the test provides at each level of the latent trait $\theta$:

$$I(\theta) = \sum_{i=1}^{k} I_i(\theta)$$

Where $I_i(\theta)$ is the item information function for item $i$.
The conditional standard error of measurement at trait level $\theta$ is:

$$SE(\theta) = \frac{1}{\sqrt{I(\theta)}}$$

The marginal reliability of the test (averaging the error variance over the population distribution of $\theta$, with trait variance $\sigma_\theta^2$ and density $g(\theta)$):

$$\bar{\rho} = \frac{\sigma_\theta^2 - \int SE^2(\theta)\,g(\theta)\,d\theta}{\sigma_\theta^2}$$
IRT marginal reliability is a more informative reliability measure than Cronbach's alpha because it shows that reliability can be high for some test-takers and low for others — traditional reliability statistics only provide an average.
14. Worked Examples
Example 1: Cronbach's Alpha — 6-Item Burnout Scale
A researcher develops a 6-item work burnout scale with items rated 1 (Never) to 5 (Always). Data are collected from $n = 300$ employees.
Items:
- B1: I feel emotionally exhausted from my work.
- B2: I feel used up at the end of the working day.
- B3: I feel fatigued when I get up in the morning and have to face another day on the job.
- B4: Working with people all day is really a strain for me.
- B5: I feel burned out from my work.
- B6: I feel frustrated by my job.
Item Statistics:
| Item | Mean | SD | Skewness | CITC | $\alpha$ if Deleted |
|---|---|---|---|---|---|
| B1 | 3.21 | 1.08 | -0.22 | 0.72 | 0.86 |
| B2 | 3.08 | 1.12 | -0.15 | 0.69 | 0.87 |
| B3 | 2.95 | 1.15 | 0.10 | 0.61 | 0.88 |
| B4 | 2.78 | 1.20 | 0.18 | 0.55 | 0.89 |
| B5 | 3.31 | 1.05 | -0.30 | 0.75 | 0.86 |
| B6 | 3.05 | 1.18 | -0.08 | 0.68 | 0.87 |
Inter-Item Correlation Matrix:
| B1 | B2 | B3 | B4 | B5 | B6 | |
|---|---|---|---|---|---|---|
| B1 | 1.00 | 0.68 | 0.55 | 0.44 | 0.74 | 0.61 |
| B2 | 1.00 | 0.60 | 0.42 | 0.69 | 0.58 | |
| B3 | 1.00 | 0.48 | 0.58 | 0.52 | ||
| B4 | 1.00 | 0.49 | 0.54 | |||
| B5 | 1.00 | 0.66 | ||||
| B6 | 1.00 |
Average inter-item correlation (sum of the 15 off-diagonal correlations divided by 15):

$$\bar{r} = \frac{8.58}{15} = 0.572$$

Cronbach's Alpha Computation (standardised form, since all items share the same 1–5 response scale):

$$\alpha = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}} = \frac{6 \times 0.572}{1 + 5 \times 0.572} = \frac{3.432}{3.860} \approx 0.89$$
95% Confidence Interval (Feldt method): [0.870, 0.907]
Scale Statistics:
- Scale Mean: $3.21 + 3.08 + 2.95 + 2.78 + 3.31 + 3.05 = 18.38$
- Scale SD: $\approx 5.43$
- SEM: $5.43\sqrt{1 - 0.89} \approx 1.80$
Item Analysis Decision:
| Item | CITC | $\alpha$-if-Deleted | Action |
|---|---|---|---|
| B1 | 0.72 | 0.86 | ✅ Retain — strong indicator |
| B2 | 0.69 | 0.87 | ✅ Retain — good indicator |
| B3 | 0.61 | 0.88 | ✅ Retain — acceptable |
| B4 | 0.55 | 0.89 | ✅ Retain — but weakest item |
| B5 | 0.75 | 0.86 | ✅ Retain — strongest indicator |
| B6 | 0.68 | 0.87 | ✅ Retain — good indicator |
Conclusion: All six items are retained. Cronbach's alpha of $\alpha = 0.89$ (95% CI: 0.870, 0.907) indicates good internal consistency. All corrected item-total correlations exceed 0.50, and no single item appreciably improves alpha when deleted. The scale is internally consistent and all items contribute positively to the burnout construct.
Example 2: McDonald's Omega — 8-Item Anxiety Scale
A researcher administers an 8-item anxiety scale and runs a CFA-based reliability analysis using McDonald's omega, because item loadings are expected to differ.
Single-Factor CFA Results:
| Item | Standardised Loading ($\lambda_i$) | Uniqueness ($\theta_i$) | Communality ($\lambda_i^2$) |
|---|---|---|---|
| A1 | 0.82 | 0.33 | 0.67 |
| A2 | 0.79 | 0.38 | 0.62 |
| A3 | 0.71 | 0.50 | 0.50 |
| A4 | 0.68 | 0.54 | 0.46 |
| A5 | 0.85 | 0.28 | 0.72 |
| A6 | 0.74 | 0.45 | 0.55 |
| A7 | 0.63 | 0.60 | 0.40 |
| A8 | 0.77 | 0.41 | 0.59 |
CFA Fit: CFI = 0.976, TLI = 0.968, RMSEA = 0.047, SRMR = 0.041 → Good fit
Omega Total Computation:

$$\sum \lambda_i = 5.99, \qquad \left(\sum \lambda_i\right)^2 = 35.88, \qquad \sum \theta_i = 3.49$$

$$\omega_t = \frac{35.88}{35.88 + 3.49} = \frac{35.88}{39.37} \approx 0.911$$

Cronbach's Alpha (for comparison): $\alpha = 0.892$
Comparison:
| Statistic | Value | Interpretation |
|---|---|---|
| Cronbach's $\alpha$ | 0.892 | Good — but underestimates true reliability |
| McDonald's $\omega_t$ | 0.911 | Excellent — accurate estimate for congeneric items |
| Difference ($\omega_t - \alpha$) | 0.019 | Alpha underestimates by 1.9 percentage points |
Conclusion: The 8-item anxiety scale demonstrates excellent internal consistency. Omega total ($\omega_t = 0.911$) is the preferred and more accurate estimate because the items have unequal factor loadings (ranging from 0.63 to 0.85), confirming the congeneric model. Cronbach's alpha ($\alpha = 0.892$) slightly underestimates the true reliability, as expected for a congeneric scale.
Example 3: ICC — Two Clinical Raters Assessing Pain Intensity
Two physiotherapists independently rate pain intensity on a 0–10 numeric scale for $n = 40$ patients. The researcher wants to assess whether the two raters can be used interchangeably (absolute agreement ICC).
ANOVA Table:
| Source | SS | df | MS |
|---|---|---|---|
| Between Patients | 412.8 | 39 | 10.585 |
| Between Raters | 8.1 | 1 | 8.100 |
| Residual (Error) | 58.4 | 39 | 1.497 |
| Total | 479.3 | 79 |
ICC(2,1) — Two-Way Random, Absolute Agreement:

$$\text{ICC}(2,1) = \frac{10.585 - 1.497}{10.585 + (2-1)(1.497) + \dfrac{2\,(8.100 - 1.497)}{40}} = \frac{9.088}{12.412} \approx 0.732$$

ICC(3,1) — Two-Way Mixed, Consistency:

$$\text{ICC}(3,1) = \frac{10.585 - 1.497}{10.585 + (2-1)(1.497)} = \frac{9.088}{12.082} \approx 0.752$$

F-test for significance:

$$F = \frac{MS_B}{MS_E} = \frac{10.585}{1.497} \approx 7.07, \quad df = (39,\ 39), \quad p < 0.001$$

95% Confidence Interval for ICC(2,1):
Using the Shrout-Fleiss CI formula: $[0.562,\ 0.849]$
Interpretation:
| Statistic | Value | Interpretation |
|---|---|---|
| ICC(2,1) — Absolute Agreement | 0.732 [0.562, 0.849] | Moderate-Good agreement |
| ICC(3,1) — Consistency | 0.752 | Good consistency |
| Difference (abs. vs. consistency) | 0.020 | Small systematic rater mean difference |
Interpretation: The ICC for absolute agreement is 0.732 (95% CI: 0.562, 0.849), indicating moderate-to-good inter-rater reliability. The slightly higher consistency ICC (0.752) suggests a small systematic difference in how the two physiotherapists use the rating scale (Rater A rates slightly higher/lower than Rater B on average). For clinical interchangeability of the two raters, the absolute agreement ICC of 0.732 is adequate for research purposes but falls short of the 0.90 threshold recommended for high-stakes clinical decision-making. Additional rater training is recommended to improve agreement.
Example 4: Spearman-Brown Prophecy — Lengthening a Short Scale
A researcher has a 5-item resilience scale with $\alpha = 0.68$ and wants to improve reliability to at least $\alpha = 0.80$ by adding parallel items. How many more items are needed?

Step 1 — Compute $m$ (multiplication factor):

$$m = \frac{\rho_{\text{target}}(1 - \rho_{\text{current}})}{\rho_{\text{current}}(1 - \rho_{\text{target}})} = \frac{0.80 \times 0.32}{0.68 \times 0.20} = \frac{0.256}{0.136} = 1.88$$

Step 2 — Compute required number of items:

New total items = $5 \times 1.88 = 9.4 \rightarrow 10$ items (rounding up), i.e. 5 additional items.

Step 3 — Verify with Spearman-Brown ($m = 10/5 = 2$):

$$\rho_{\text{new}} = \frac{m\rho}{1 + (m-1)\rho} = \frac{2 \times 0.68}{1 + 0.68} = \frac{1.36}{1.68} = 0.81$$

Conclusion: Adding 5 more parallel items (total 10 items) is predicted to raise the reliability from $\alpha = 0.68$ to approximately $0.81$, exceeding the target of 0.80. This assumes that the new items have the same average inter-item correlation as the original 5.
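The prophecy calculation generalises to any target. A minimal sketch reproducing Example 4 (the function names are illustrative):

```python
import math

def spearman_brown(rel, m):
    """Predicted reliability when scale length is multiplied by factor m."""
    return m * rel / (1 + (m - 1) * rel)

def length_factor(rel_current, rel_target):
    """Length multiplier m required to reach a target reliability."""
    return rel_target * (1 - rel_current) / (rel_current * (1 - rel_target))

m = length_factor(0.68, 0.80)            # approx. 1.88
total_items = math.ceil(5 * m)           # 10 items in total
print(total_items, round(spearman_brown(0.68, total_items / 5), 2))  # 10 0.81
```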
15. Common Mistakes and How to Avoid Them
Mistake 1: Reporting Alpha Without a Confidence Interval
Problem: Cronbach's alpha is a sample statistic with substantial sampling variability,
especially in small samples. Reporting only the point estimate gives a false sense of precision.
A point estimate of $\alpha = 0.80$ from a small sample could have a 95% CI as wide as [0.63, 0.89].
Solution: Always report the 95% confidence interval for all reliability coefficients.
Use the Feldt method for alpha or bootstrap CIs for omega.
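As a sketch of how such an interval can be computed, here is one common formulation of the Feldt interval, which treats $(1-\alpha)/(1-\hat{\alpha})$ as $F$-distributed with $n-1$ and $(n-1)(k-1)$ degrees of freedom; this is an assumed formulation, so verify it against your reference before reporting:

```python
from scipy.stats import f

def feldt_ci(alpha_hat, n, k, conf=0.95):
    """Feldt-style confidence interval for Cronbach's alpha.

    alpha_hat: sample alpha; n: number of respondents; k: number of items.
    """
    df1, df2 = n - 1, (n - 1) * (k - 1)
    tail = (1 - conf) / 2
    lower = 1 - (1 - alpha_hat) * f.ppf(1 - tail, df1, df2)
    upper = 1 - (1 - alpha_hat) * f.ppf(tail, df1, df2)
    return lower, upper

print(feldt_ci(0.80, n=20, k=4))   # noticeably wide in a small sample
```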
Mistake 2: Using Cronbach's Alpha as a Measure of Unidimensionality
Problem: Alpha measures internal consistency (how strongly items co-vary), not
unidimensionality (whether items measure a single construct). A multidimensional scale with
two positively correlated subscales can produce high alpha, even though it clearly violates
unidimensionality.
Solution: Always conduct an EFA or CFA to assess dimensionality before computing
reliability. Report alpha/omega separately for each unidimensional subscale.
Mistake 3: Blindly Deleting Items to Maximise Alpha
Problem: Removing items solely because deletion increases alpha capitalises on sampling
variability and can produce a shorter scale that performs worse in new samples. The alpha
gain observed in the current sample may not replicate, so the apparent improvement in reliability can be spurious.
Solution: Use a principled decision framework: only remove an item if (a) the CITC is
below 0.20, (b) the item has poor theoretical alignment with the construct, AND (c) the
item does not reduce content validity. Validate the revised scale in a new sample.
Mistake 4: Not Checking for Reverse-Coded Items
Problem: Including negatively-worded items without reverse coding them will produce
negative inter-item correlations and severely deflate alpha. A very low or even negative alpha is
often a sign that one or more items have not been reverse coded.
Solution: Before running reliability analysis, identify all negatively-worded items and
reverse-code them: $x_{\text{rev}} = (\text{scale max} + \text{scale min}) - x$ (e.g., $6 - x$ for 1–5 Likert items).
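A minimal sketch of reverse coding in pandas, assuming 1–5 Likert items and illustrative column names:

```python
import pandas as pd

def reverse_code(df, cols, scale_min=1, scale_max=5):
    """Reverse-code negatively worded items: x_rev = (max + min) - x."""
    out = df.copy()
    out[cols] = (scale_max + scale_min) - out[cols]
    return out

df = pd.DataFrame({"Q1": [1, 4, 5], "Q2": [5, 2, 1]})  # Q2 is negatively worded
df = reverse_code(df, cols=["Q2"])                     # Q2 becomes 1, 4, 5
```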
Mistake 5: Reporting Alpha for a Multidimensional Scale as a Whole
Problem: Computing a single alpha for a multidimensional questionnaire (e.g., a measure
with anxiety, depression, and stress subscales combined) is theoretically inappropriate and
can produce misleading reliability estimates.
Solution: Compute reliability separately for each subscale. If a composite total score
is used, estimate its reliability using Mosier's composite reliability formula (Section 13.5).
Mistake 6: Ignoring the Number of Items When Interpreting Alpha
Problem: Alpha increases automatically with more items (the Spearman-Brown effect). A 30-item
scale built from weak items can reach a respectable alpha (e.g., $\alpha \approx 0.85$ with $\bar{r} \approx 0.16$), while a
4-item scale with strong items might produce a lower one (e.g., $\alpha \approx 0.75$ with $\bar{r} \approx 0.43$). The 4-item scale may actually be more efficient and have better items.
Solution: Consider the average inter-item correlation ($\bar{r}$) alongside alpha.
Compare $\bar{r}$ across scales of different lengths, as it is not affected by scale length.
An ideal range is roughly $\bar{r} = 0.15$ to $0.50$.
Mistake 7: Confusing Inter-Rater Agreement With Inter-Rater Reliability
Problem: These two concepts are related but distinct:
- Reliability (ICC, consistency): Do raters rank subjects in the same order?
- Agreement (ICC absolute, kappa): Do raters give the same actual values?
A rater who always scores 2 points higher than another has perfect reliability (same ordering) but poor agreement.
Solution: Select the appropriate ICC type based on the research question. Use absolute agreement ICC when raters must be interchangeable. Use consistency ICC when only the ranking matters and systematic differences between raters are acceptable.
Mistake 8: Using Percent Agreement Instead of Cohen's Kappa
Problem: Percent agreement does not correct for chance agreement. With a binary rating
where 90% of cases fall in one category, two raters randomly agreeing with base rates would
achieve 82% agreement by chance, making 85% agreement seem impressive when it is barely
above chance.
Solution: Always report Cohen's Kappa (or Fleiss' Kappa for multiple raters) alongside
percent agreement. Never interpret percent agreement alone.
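The chance-agreement arithmetic from the example above, as a short sketch (a two-category Cohen's kappa computed from agreement proportions; in practice you could also use `sklearn.metrics.cohen_kappa_score` on the raw ratings):

```python
def cohens_kappa(p_observed, p_chance):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    return (p_observed - p_chance) / (1 - p_chance)

# Both raters use a 90/10 base rate, so chance agreement = 0.9^2 + 0.1^2 = 0.82
p_e = 0.9 ** 2 + 0.1 ** 2
print(round(cohens_kappa(0.85, p_e), 2))  # 0.17: slight agreement despite 85% raw agreement
```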
Mistake 9: Applying Cronbach's Alpha to Subscale Scores (Not Item-Level Data)
Problem: Computing alpha using subscale total scores (rather than individual item scores)
as the input produces a composite alpha estimate that is not the same as the reliability
of the total scale and is not interpretable as a standard reliability coefficient.
Solution: Always compute reliability from item-level data (each item in a separate
column), not from subscale totals.
Mistake 10: Interpreting Alpha of 0.95 as "Better" Than Alpha of 0.85
Problem: Very high alpha ($\alpha > 0.95$) is often a sign of item redundancy — items are so
similar that they provide almost no unique measurement information. This wastes respondent
time without improving construct coverage.
Solution: Target $\alpha$ in the 0.80–0.90 range for most research scales. If alpha exceeds 0.95 with
many items, consider reducing scale length by removing the most redundant items (lowest
unique information) while maintaining acceptable reliability.
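Redundant pairs can be flagged directly from the inter-item correlation matrix. A minimal sketch, assuming item-level data in a pandas DataFrame and an illustrative $r > 0.80$ cutoff:

```python
import pandas as pd

def redundant_pairs(items: pd.DataFrame, cutoff: float = 0.80):
    """List item pairs whose inter-item correlation exceeds the cutoff."""
    corr = items.corr()
    cols = list(corr.columns)
    return [
        (cols[i], cols[j], round(corr.iloc[i, j], 2))
        for i in range(len(cols))
        for j in range(i + 1, len(cols))   # upper triangle only
        if corr.iloc[i, j] > cutoff
    ]
```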
16. Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| Cronbach's alpha is very low (e.g., $\alpha < 0.5$) | Reverse-coded items not recoded; items from different constructs mixed; items are too heterogeneous | Check and reverse-code negatively worded items; separate subscales; check for construct coherence |
| Alpha is negative | At least one item is very negatively correlated with others; reverse coding error | Examine inter-item correlations; check for items that need reverse coding |
| Alpha is very high ($> 0.95$) with many items | Item redundancy — too many items with near-identical wording | Inspect item pairs with $r > 0.80$; remove the most redundant items |
| Alpha exceeds omega | Correlated errors inflating alpha; model misspecification | Run CFA to check for correlated errors; use omega from properly specified model |
| One item's alpha-if-deleted greatly exceeds overall alpha | Item measures a different construct; possible reverse-coding error; item is ambiguous | Examine item content; check reverse coding; consider removing from scale |
| All corrected item-total correlations are near zero | Items are unrelated to each other; multidimensional scale being treated as unidimensional | Run EFA; split into subscales; reconsider construct definition |
| Negative corrected item-total correlation | Item is negatively related to the construct; reverse coding needed | Reverse-code the item and re-run |
| ICC very low ($< 0.50$) despite a large F-ratio | Raters highly inconsistent; training issue | Re-train raters; clarify rating criteria; pilot coding manual |
| ICC consistency much higher than absolute agreement | Systematic rater bias (one rater consistently rates higher/lower) | Identify the biased rater; re-calibrate; consider rater re-training |
| Cohen's Kappa is very low despite high percent agreement | High base-rate of one category (prevalence paradox) | Report both statistics; use PABAK (prevalence and bias adjusted kappa) |
| CFA does not converge for omega | Very small sample; near-perfect correlations; Heywood case | Increase sample size; reduce number of items; check for duplicate items |
| SEM is very large | Low reliability and/or high scale variance | Improve reliability; report SEM explicitly in all clinical applications |
| Omega hierarchical approaches zero | Essentially no general factor; scale is fully multidimensional | Use subscale scores rather than total; report subscale-specific omega |
| Spearman-Brown predicts a very large number of items needed | Baseline reliability is very low; items are poor indicators | Redesign items; collect new pilot data; consider different item format |
17. Quick Reference Cheat Sheet
Core Equations
| Formula | Description |
|---|---|
| $X = T + E$ | Classical Test Theory model |
| $\rho_{XX'} = \sigma_T^2 / \sigma_X^2$ | Reliability coefficient (population) |
| $SEM = \sigma_X \sqrt{1 - \rho_{XX'}}$ | Standard Error of Measurement |
| $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_X^2}\right)$ | Cronbach's alpha |
| $\alpha_{std} = \frac{k\bar{r}}{1 + (k-1)\bar{r}}$ | Standardised alpha |
| $\rho_{new} = \frac{m\rho}{1 + (m-1)\rho}$ | Spearman-Brown prophecy |
| $m = \frac{\rho_{target}(1 - \rho_{current})}{\rho_{current}(1 - \rho_{target})}$ | Items needed for target reliability |
| $r_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX'} r_{YY'}}}$ | Correction for attenuation |
| $\omega_t = \frac{(\sum \lambda_i)^2}{(\sum \lambda_i)^2 + \sum \theta_i}$ | McDonald's omega total |
| $\omega_h = \frac{(\sum \lambda_{g,i})^2}{(\sum \lambda_{g,i})^2 + \sum_s (\sum_{i \in s} \lambda_{s,i})^2 + \sum \theta_i}$ | Omega hierarchical |
| $r_{SB} = \frac{2r_{hh}}{1 + r_{hh}}$ | Split-half (Spearman-Brown corrected) |
| $\kappa = \frac{p_o - p_e}{1 - p_e}$ | Cohen's Kappa |
| $ICC(3,1) = \frac{MS_R - MS_E}{MS_R + (k-1)MS_E}$ | ICC consistency (two-way mixed) |
| $ICC(2,1) = \frac{MS_R - MS_E}{MS_R + (k-1)MS_E + \frac{k}{n}(MS_C - MS_E)}$ | ICC absolute agreement (two-way random) |
Reliability Benchmarks
| Coefficient | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Alpha / Omega | $< 0.70$ | $0.70$–$0.79$ | $0.80$–$0.89$ | $\geq 0.90$ |
| ICC (research) | $< 0.50$ | $0.50$–$0.74$ | $0.75$–$0.89$ | $\geq 0.90$ |
| Cohen's Kappa | $< 0.40$ | $0.40$–$0.59$ | $0.60$–$0.79$ | $\geq 0.80$ |
| CITC | $< 0.20$ | $0.20$–$0.29$ | $0.30$–$0.49$ | $\geq 0.50$ |
Reliability Type Selection Guide
| Scenario | Recommended Method |
|---|---|
| Multi-item Likert scale, unidimensional | McDonald's omega (preferred) or Cronbach's alpha |
| Multi-item scale, multidimensional | Omega hierarchical (bifactor) + omega subscale |
| Binary scored items (correct/incorrect) | KR-20 |
| Ordinal scale (fewer than 5 response categories) | Ordinal alpha or ordinal omega (polychoric) |
| Two raters, nominal categories | Cohen's Kappa |
| Two raters, ordered categories | Weighted Kappa (linear or quadratic) |
| Three+ raters, nominal categories | Fleiss' Kappa |
| Two+ raters, continuous ratings | ICC (specify model and type) |
| Test-retest, continuous | ICC (two-way mixed, absolute agreement) |
| Multiple sources of error | Generalisability Theory (G-coefficient) |
Item Analysis Decision Rules
| Statistic | Threshold | Action |
|---|---|---|
| CITC | $< 0.20$ | Flag for removal or revision |
| CITC | Negative | Check reverse coding; flag for review |
| Alpha-if-deleted | Noticeably above overall $\alpha$ | Strong candidate for removal |
| Inter-item $r$ | $> 0.80$ | Redundancy — remove one of the pair |
| Item skewness | $|z| > 3.29$ | Severely skewed response distribution — review item |
| Item difficulty ($p$, binary) | $p < 0.20$ or $p > 0.80$ | Item too hard or too easy |
| Item discrimination ($D$) | $< 0.20$ | Poor discriminator — revise |
Minimum Reliability by Context
| Context | Minimum | Preferred |
|---|---|---|
| Exploratory / pilot research | 0.60 | 0.70 |
| Group-level research | 0.70 | 0.80 |
| Individual research decisions | 0.80 | 0.90 |
| Clinical / high-stakes decisions | 0.90 | 0.95 |
ICC Model Selection Guide
| Design | Raters | Measure | Recommended ICC |
|---|---|---|---|
| Each subject rated by a different random set of raters | Random | Single | ICC(1,1) |
| Each subject rated by a different random set of raters | Random | Mean of $k$ | ICC(1,$k$) |
| Same raters rate all subjects; generalise to all raters | Random | Single | ICC(2,1) |
| Same raters rate all subjects; generalise to all raters | Random | Mean of $k$ | ICC(2,$k$) |
| Same fixed raters; generalise to these raters only | Fixed | Single | ICC(3,1) |
| Same fixed raters; generalise to these raters only | Fixed | Mean of $k$ | ICC(3,$k$) |
This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Reliability Analysis using the DataStatPro application. For further reading, consult Revelle & Zinbarg's "Coefficients Alpha, Beta, Omega, and the glb" (2009), McDonald's "Test Theory: A Unified Treatment" (1999), Koo & Li's "A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research" (2016), or Shrout & Fleiss's "Intraclass Correlations: Uses in Assessing Rater Reliability" (1979). For feature requests or support, contact the DataStatPro team.