Generalized Linear Models: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of Generalized Linear Models (GLMs) all the way through advanced model specification, estimation, diagnostics, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.
Table of Contents
- Prerequisites and Background Concepts
- What are Generalized Linear Models?
- The Mathematical Framework of GLMs
- The Exponential Family of Distributions
- Link Functions
- GLM Distributions and Their Applications
- Assumptions of GLMs
- Parameter Estimation: Maximum Likelihood and IRLS
- Model Fit and Evaluation
- Hypothesis Testing and Inference
- Model Diagnostics and Residuals
- Model Selection and Variable Selection
- Overdispersion and Underdispersion
- Using the GLM Component
- Computational and Formula Details
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into Generalized Linear Models, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.
1.1 Probability Distributions
A probability distribution describes how the values of a random variable are distributed. Key distributions used in GLMs:
- Normal (Gaussian): Continuous, symmetric, unbounded. Characterised by mean $\mu$ and variance $\sigma^2$.
- Binomial: Discrete, counts successes in $n$ independent Bernoulli trials. Characterised by $n$ and success probability $p$.
- Poisson: Discrete, counts of events in a fixed interval. Characterised by rate $\lambda$, with $E[Y] = \mathrm{Var}(Y) = \lambda$.
- Gamma: Continuous, positive-valued, right-skewed. Characterised by shape $\alpha$ and rate $\beta$.
- Inverse Gaussian: Continuous, positive-valued, highly right-skewed. Models first-passage times.
- Negative Binomial: Discrete, counts with overdispersion relative to Poisson.
1.2 The Likelihood Function
The likelihood function measures how probable the observed data are, given a parameter vector $\boldsymbol\theta$. For $n$ independent observations:

$$L(\boldsymbol\theta) = \prod_{i=1}^{n} f(y_i; \boldsymbol\theta)$$

The log-likelihood is more convenient to work with:

$$\ell(\boldsymbol\theta) = \log L(\boldsymbol\theta) = \sum_{i=1}^{n} \log f(y_i; \boldsymbol\theta)$$

Maximum Likelihood Estimation (MLE) finds the parameter values that maximise $\ell(\boldsymbol\theta)$.
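To make the MLE idea concrete: for i.i.d. Poisson counts the log-likelihood in the rate $\lambda$ peaks at the sample mean, which even a crude grid search recovers. A minimal standard-library sketch (the data are illustrative):

```python
import math

y = [2, 3, 1, 4, 2]  # hypothetical Poisson counts

def loglik(lam):
    # Poisson log-likelihood: sum_i [y_i*log(lam) - lam - log(y_i!)]
    return sum(yi * math.log(lam) - lam - math.lgamma(yi + 1) for yi in y)

# crude grid search over lambda in (0, 10]
grid = [0.01 * k for k in range(1, 1001)]
lam_hat = max(grid, key=loglik)
# lam_hat coincides with the sample mean, ybar = 12/5 = 2.4
```

In practice the maximiser is found analytically or by Newton-type iterations rather than a grid, but the principle is the same: pick the parameter value under which the observed data are most probable.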
1.3 The Linear Predictor
A linear predictor is a weighted linear combination of predictor variables:

$$\eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} = \mathbf{x}_i^\top \boldsymbol\beta$$

This is the core structure inherited from linear regression. In GLMs, $\eta_i$ is not the outcome itself but is transformed through a link function to relate to the mean $\mu_i$ of the response distribution.
1.4 Ordinary Linear Regression Recap
In ordinary linear regression (OLS):

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

This model has three implicit components:
- A distribution for the response: $y_i \sim N(\mu_i, \sigma^2)$.
- A linear predictor: $\eta_i = \mathbf{x}_i^\top \boldsymbol\beta$.
- A link function connecting $\mu_i$ to $\eta_i$: $\mu_i = \eta_i$ (the identity link).
GLMs generalise this framework by allowing different distributions and link functions.
1.5 The Score Equations and the Information Matrix
The score vector is the gradient of the log-likelihood with respect to the parameters:

$$U(\boldsymbol\beta) = \frac{\partial \ell(\boldsymbol\beta)}{\partial \boldsymbol\beta}$$

Setting $U(\boldsymbol\beta) = \mathbf{0}$ gives the MLE. The Fisher information matrix is:

$$\mathcal{I}(\boldsymbol\beta) = -E\!\left[\frac{\partial^2 \ell(\boldsymbol\beta)}{\partial \boldsymbol\beta \, \partial \boldsymbol\beta^\top}\right]$$
Its inverse gives the asymptotic covariance matrix of the MLE, used to compute standard errors and confidence intervals.
2. What are Generalized Linear Models?
Generalized Linear Models (GLMs) are a unified class of regression models that extend ordinary linear regression to accommodate response variables with non-normal distributions. Introduced by Nelder and Wedderburn (1972), GLMs provide a coherent framework for modelling a wide variety of outcome types — counts, proportions, binary outcomes, continuous positive values, and more — using a single, elegant mathematical structure.
2.1 The Central Idea
Ordinary linear regression assumes the response is normally distributed and that the mean equals the linear predictor directly: $\mu_i = \eta_i$. GLMs relax both restrictions:
- The distribution of $y_i$ can be any member of the exponential family (Normal, Binomial, Poisson, Gamma, Inverse Gaussian, etc.).
- The link function $g$ can be any monotone, differentiable function that maps the mean $\mu$ (constrained to its natural range) to the real line $\mathbb{R}$.
This two-step generalisation unlocks an enormous range of practical modelling scenarios while preserving the interpretability of regression coefficients.
2.2 Real-World Applications
GLMs are among the most widely used statistical models in applied science, business, and policy:
- Insurance & Actuarial Science: Modelling claim counts (Poisson/Negative Binomial), claim severity (Gamma), and pure premiums (Tweedie).
- Public Health & Epidemiology: Modelling disease incidence rates (Poisson with offset), binary disease outcomes (Binomial/logistic), survival and time-to-event data.
- Ecology: Modelling species counts (Poisson, Negative Binomial), presence/absence (Binomial), and biomass (Gamma).
- Economics & Finance: Modelling discrete choices (Binomial), income (Gamma), financial durations (Inverse Gaussian).
- Marketing: Modelling purchase counts (Poisson), click-through rates (Binomial), and customer lifetime value (Gamma/Tweedie).
- Clinical Trials: Modelling adverse event counts (Poisson), binary treatment outcomes (Binomial), and length of stay (Gamma).
- Manufacturing & Quality Control: Modelling defect counts (Poisson), product lifetimes (Gamma/Inverse Gaussian).
- Social Sciences: Modelling ordered survey responses (Ordinal), multinomial choices (Multinomial), and event rates.
2.3 The Three Components of a GLM
Every GLM is fully specified by three components:
| Component | Symbol | Description | Example (Logistic Regression) |
|---|---|---|---|
| Random Component | $y_i$ | Distribution of $y_i$ from the exponential family | $y_i \sim \text{Binomial}(n_i, \pi_i)$ |
| Systematic Component | $\eta_i$ | Linear predictor $\eta_i = \mathbf{x}_i^\top \boldsymbol\beta$ | $\eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$ |
| Link Function | $g(\cdot)$ | Connects $\mu_i$ to $\eta_i$: $g(\mu_i) = \eta_i$ | Logit: $\log\dfrac{\pi_i}{1 - \pi_i} = \eta_i$ |
2.4 How GLMs Generalise Linear Regression
| Feature | Linear Regression | GLM |
|---|---|---|
| Response distribution | Normal only | Any exponential family |
| Link function | Identity ($\mu = \eta$) | Any valid link $g(\mu) = \eta$ |
| Variance | Constant ($\sigma^2$) | Function of the mean: $\phi\, V(\mu)$ |
| Estimation | OLS (closed form) | MLE via IRLS (iterative) |
| Goodness of fit | $R^2$, $F$-test | Deviance, AIC, likelihood ratio tests |
| Residuals | Raw, standardised | Pearson, deviance, Anscombe |
3. The Mathematical Framework of GLMs
3.1 The Three-Component Structure in Detail
A GLM specifies that the $i$-th response $y_i$ has:

Random Component:

$$y_i \sim \text{an exponential family distribution}$$

With mean $E[y_i] = \mu_i$ and variance $\mathrm{Var}(y_i) = \phi\, V(\mu_i)$, where:
- $\phi$ is the dispersion parameter (estimated or known).
- $V(\mu)$ is the variance function — a function of the mean that characterises the distribution.

Systematic Component:

$$\eta_i = \mathbf{x}_i^\top \boldsymbol\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$$

Link Function:

$$g(\mu_i) = \eta_i$$

So the mean is related to the predictors through:

$$\mu_i = g^{-1}\!\left(\mathbf{x}_i^\top \boldsymbol\beta\right)$$

Where $g^{-1}$ is the inverse link function (also called the mean function or response function).
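The mean function $\mu = g^{-1}(\eta)$ for a few common links can be sketched as a small helper (the function name is illustrative, not a DataStatPro API):

```python
import math

def inverse_link(eta, link):
    # map the linear predictor eta back to the mean scale mu = g^{-1}(eta)
    if link == "identity":
        return eta                            # mu = eta
    if link == "log":
        return math.exp(eta)                  # mu = e^eta > 0
    if link == "logit":
        return 1.0 / (1.0 + math.exp(-eta))   # mu in (0, 1)
    if link == "inverse":
        return 1.0 / eta                      # mu = 1/eta
    raise ValueError(f"unknown link: {link}")
```

Note how each inverse link automatically respects the natural range of the mean: `log` returns positive values and `logit` returns probabilities in $(0, 1)$, whatever the value of $\eta$.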
3.2 The Variance Function
The variance function $V(\mu)$ characterises how the variance of the response depends on its mean. Each exponential family distribution has a specific variance function:

| Distribution | $V(\mu)$ | Interpretation |
|---|---|---|
| Normal | $1$ | Variance is constant (homoscedastic) |
| Binomial | $\mu(1 - \mu)$ | Variance is bell-shaped, maximum at $\mu = 0.5$ |
| Poisson | $\mu$ | Variance equals the mean |
| Gamma | $\mu^2$ | Variance is proportional to the square of the mean |
| Inverse Gaussian | $\mu^3$ | Variance grows as the cube of the mean |
| Negative Binomial | $\mu + \mu^2/\theta$ | Variance exceeds the mean (overdispersion) |
| Tweedie | $\mu^p$ | Power variance function; $1 < p < 2$ for compound Poisson-Gamma |
3.3 The Dispersion Parameter
The full variance of $y_i$ is:

$$\mathrm{Var}(y_i) = \frac{\phi\, V(\mu_i)}{w_i}$$

Where $w_i$ is a known prior weight (e.g., $w_i = n_i$ for binomial proportions). The dispersion parameter $\phi$:
- Is known for Poisson ($\phi = 1$) and Binomial ($\phi = 1$).
- Is estimated for Normal, Gamma, and Inverse Gaussian.
- Can be estimated for Poisson and Binomial to account for overdispersion (quasi-GLM).
3.4 The Canonical Link
For each distribution, there is a canonical link function that arises naturally from the mathematical structure of the exponential family. Using the canonical link has desirable statistical properties (sufficient statistics, simpler score equations):
| Distribution | Canonical Link | $g(\mu)$ |
|---|---|---|
| Normal | Identity | $g(\mu) = \mu$ |
| Binomial | Logit | $g(\mu) = \log\dfrac{\mu}{1 - \mu}$ |
| Poisson | Log | $g(\mu) = \log \mu$ |
| Gamma | Inverse | $g(\mu) = 1/\mu$ |
| Inverse Gaussian | Inverse squared | $g(\mu) = 1/\mu^2$ |
Non-canonical links can also be used and may be more interpretable in certain applications. The canonical link is the default in most GLM software but is not obligatory.
3.5 The Offset
An offset is a term added to the linear predictor with a fixed coefficient of 1:

$$\eta_i = \mathbf{x}_i^\top \boldsymbol\beta + \text{offset}_i$$

Offsets are used when the response is a rate and observations have different exposure times or population sizes. The offset is known (not estimated) and is included on the linear predictor scale.

Example: Modelling disease incidence counts $y_i$ for regions with different population sizes $N_i$. The rate per person is $\mu_i / N_i$. Using a Poisson model with log link:

$$\log \mu_i = \log N_i + \mathbf{x}_i^\top \boldsymbol\beta$$

Where $\log N_i$ is the offset. The model then estimates the log rate $\log(\mu_i / N_i) = \mathbf{x}_i^\top \boldsymbol\beta$.
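A quick numeric illustration of the offset mechanics (all numbers are hypothetical):

```python
import math

# Hypothetical region: population N, one covariate value x,
# and coefficients b0, b1 on the log-rate scale
N, b0, b1, x = 50_000, -7.0, 0.3, 1.2

eta = math.log(N) + b0 + b1 * x   # log(N) enters with coefficient fixed at 1
mu = math.exp(eta)                # expected count for this region
rate = mu / N                     # per-person rate; equals exp(b0 + b1*x)
```

Because $\log N$ carries a fixed coefficient of 1, dividing the expected count by the population recovers a per-person rate that depends only on the covariates, which is exactly what the model is meant to estimate.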
4. The Exponential Family of Distributions
The exponential family is a broad class of distributions that share a common mathematical form, which is the foundation of the GLM framework.
4.1 The Exponential Family Form
A distribution belongs to the exponential family if its probability density (or mass) function can be written as:

$$f(y; \theta, \phi) = \exp\!\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$$

Where:
- $\theta$ = natural (canonical) parameter — a function of the mean $\mu$.
- $\phi$ = dispersion parameter.
- $b(\theta)$ = cumulant function (log-normalising constant).
- $a(\phi)$ = dispersion function (typically $a(\phi) = \phi / w$ for prior weight $w$).
- $c(y, \phi)$ = a function of the data and dispersion only (not $\theta$).

Key relationships derived from $b(\theta)$:

$$E[Y] = \mu = b'(\theta), \qquad \mathrm{Var}(Y) = a(\phi)\, b''(\theta) = a(\phi)\, V(\mu)$$

This elegant structure means that all moments of the distribution follow automatically from the cumulant function $b(\theta)$.
4.2 Major Exponential Family Distributions for GLMs
4.2.1 Normal Distribution
- Natural parameter: $\theta = \mu$
- Cumulant function: $b(\theta) = \theta^2 / 2$
- Variance function: $V(\mu) = 1$
- Dispersion: $\phi = \sigma^2$
- Support: $y \in (-\infty, \infty)$
4.2.2 Binomial Distribution
- Natural parameter: $\theta = \log\dfrac{\pi}{1 - \pi}$ (log odds)
- Cumulant function: $b(\theta) = \log(1 + e^\theta)$
- Variance function: $V(\mu) = \mu(1 - \mu)$ (where $\mu = \pi$)
- Dispersion: $\phi = 1$ (known)
- Support: $y \in \{0, 1, \dots, n\}$
4.2.3 Poisson Distribution
- Natural parameter: $\theta = \log \mu$
- Cumulant function: $b(\theta) = e^\theta$
- Variance function: $V(\mu) = \mu$
- Dispersion: $\phi = 1$ (known)
- Support: $y \in \{0, 1, 2, \dots\}$
4.2.4 Gamma Distribution
- Natural parameter: $\theta = -1/\mu$ (negative inverse of the mean)
- Cumulant function: $b(\theta) = -\log(-\theta)$
- Variance function: $V(\mu) = \mu^2$
- Dispersion: $\phi = 1/\nu$ (inverse of the shape $\nu$)
- Support: $y \in (0, \infty)$
4.2.5 Inverse Gaussian Distribution
- Natural parameter: $\theta = -1/(2\mu^2)$
- Cumulant function: $b(\theta) = -\sqrt{-2\theta}$
- Variance function: $V(\mu) = \mu^3$
- Dispersion: $\phi = \sigma^2$
- Support: $y \in (0, \infty)$
4.3 The Negative Binomial Distribution
While the Negative Binomial is not a member of the exponential family in its most general form, it can be treated as a quasi-exponential family or as a Poisson-Gamma mixture:
- Mean: $E[Y] = \mu$
- Variance: $\mathrm{Var}(Y) = \mu + \mu^2/\theta$
- Overdispersion parameter: $\theta > 0$ (smaller $\theta$ → more overdispersion; as $\theta \to \infty$, Negative Binomial → Poisson)
- Support: $y \in \{0, 1, 2, \dots\}$
4.4 The Tweedie Distribution
The Tweedie distribution is a special case of the exponential dispersion model with power variance function $V(\mu) = \mu^p$:

| Power $p$ | Distribution |
|---|---|
| $p = 0$ | Normal |
| $p = 1$ | Poisson |
| $1 < p < 2$ | Compound Poisson-Gamma (supports exact zeros + positive values) |
| $p = 2$ | Gamma |
| $p = 3$ | Inverse Gaussian |

The Tweedie distribution with $1 < p < 2$ is particularly valuable in insurance (pure premium modelling) and ecology (biomass data) because it naturally accommodates data with a mass at zero and a continuous positive distribution for non-zero values.
5. Link Functions
The link function is the bridge between the mean of the response distribution and the linear predictor. Choosing an appropriate link function is a key modelling decision.
5.1 Requirements for a Valid Link Function
A valid link function must be:
- Monotone: Strictly increasing or decreasing.
- Differentiable: $g'(\mu)$ must exist and be non-zero.
- Range-compatible: The domain of $g$ must match the natural range of the mean $\mu$ (e.g., $(0, 1)$ for probabilities), mapping it onto the real line.
5.2 Commonly Used Link Functions
| Link Name | $g(\mu)$ | $g^{-1}(\eta)$ | Range of $\mu$ | Canonical For |
|---|---|---|---|---|
| Identity | $\mu$ | $\eta$ | $(-\infty, \infty)$ | Normal |
| Log | $\log \mu$ | $e^\eta$ | $(0, \infty)$ | Poisson |
| Logit | $\log\dfrac{\mu}{1 - \mu}$ | $\dfrac{e^\eta}{1 + e^\eta}$ | $(0, 1)$ | Binomial |
| Probit | $\Phi^{-1}(\mu)$ | $\Phi(\eta)$ | $(0, 1)$ | Binomial (alt) |
| Complementary log-log (cloglog) | $\log(-\log(1 - \mu))$ | $1 - \exp(-e^\eta)$ | $(0, 1)$ | Binomial (alt) |
| Inverse | $1/\mu$ | $1/\eta$ | $(0, \infty)$ | Gamma |
| Inverse squared | $1/\mu^2$ | $\eta^{-1/2}$ | $(0, \infty)$ | Inverse Gaussian |
| Square root | $\sqrt{\mu}$ | $\eta^2$ | $(0, \infty)$ | Poisson (alt) |
| Negative log | $-\log \mu$ | $e^{-\eta}$ | $(0, \infty)$ | — |
| Log-log | $-\log(-\log \mu)$ | $\exp(-e^{-\eta})$ | $(0, 1)$ | Binomial (alt) |
5.3 Logit Link (Binomial GLM)

$$g(\pi) = \log\frac{\pi}{1 - \pi}, \qquad \pi = \frac{e^\eta}{1 + e^\eta}$$

- Maps probabilities $\pi \in (0, 1)$ to the real line $(-\infty, \infty)$.
- Produces odds ratio interpretations: $e^{\beta_j}$ is the multiplicative change in odds per unit increase in $x_j$.
- Symmetric around $\pi = 0.5$.
5.4 Probit Link (Binomial GLM)

$$g(\pi) = \Phi^{-1}(\pi)$$

Where $\Phi^{-1}$ is the quantile function of the standard normal distribution.
- Maps probabilities to the real line via the normal distribution.
- Produces probit (z-score) interpretations.
- Very similar to logit but with lighter tails; differs mainly for extreme probabilities.
5.5 Complementary Log-Log Link (Binomial GLM)

$$g(\pi) = \log(-\log(1 - \pi))$$

- Asymmetric: Approaches 1 faster than it approaches 0.
- Appropriate when the probability approaches 1 quickly but approaches 0 slowly — common in survival/hazard models.
- Produces a proportional hazards interpretation: $e^{\beta_j}$ is the multiplicative change in the hazard.
5.6 Log Link (Poisson, Negative Binomial, Gamma GLM)

$$g(\mu) = \log \mu, \qquad \mu = e^\eta$$

- Ensures $\mu = e^\eta > 0$ (positivity constraint satisfied automatically).
- Produces multiplicative interpretations: $e^{\beta_j}$ is the multiplicative change in the mean per unit increase in $x_j$.
- Most commonly used link for count and continuous positive data.
5.7 Inverse Link (Gamma GLM)

$$g(\mu) = \frac{1}{\mu}$$

- The canonical link for the Gamma distribution.
- Less commonly used than the log link because the inverse parameterisation is less interpretable.
- Coefficients represent changes in the reciprocal of the mean.
5.8 Choosing the Link Function
| Scenario | Recommended Link |
|---|---|
| Binomial: symmetric probability, easy odds interpretation | Logit |
| Binomial: latent normal model assumed | Probit |
| Binomial: rare events, extreme probabilities | Complementary log-log |
| Binomial: log-linear probability model needed | Log (with care: fitted probabilities can exceed 1) |
| Poisson / Negative Binomial / Gamma: multiplicative effects | Log |
| Gamma: when inverse relationships are natural | Inverse |
| Normal / continuous unbounded | Identity |
| Positive continuous: when multiplicative effects expected | Log |
6. GLM Distributions and Their Applications
6.1 Binomial GLM (Logistic, Probit, Cloglog Regression)
Use when: Response is a binary outcome (0/1, yes/no, success/failure) or a proportion where both $y_i$ (successes) and $n_i$ (trials) are known.
Model:

$$y_i \sim \text{Binomial}(n_i, \pi_i), \qquad \log\frac{\pi_i}{1 - \pi_i} = \mathbf{x}_i^\top \boldsymbol\beta$$

Default link: Logit.
Interpretation (logit link): $e^{\beta_j}$ is the odds ratio — the multiplicative change in the odds of success for a one-unit increase in $x_j$.
Special cases:
- $n_i = 1$ for all $i$: Binary logistic regression.
- $n_i > 1$: Grouped binomial (proportions) regression.
Applications: Disease diagnosis, credit default, customer churn, election outcomes, clinical trial response rates.
6.2 Poisson GLM (Poisson Regression)
Use when: Response is a count of events that could in principle be any non-negative integer, arising from a process with a constant rate.
Model:

$$y_i \sim \text{Poisson}(\mu_i), \qquad \log \mu_i = \mathbf{x}_i^\top \boldsymbol\beta$$

Default link: Log.
Interpretation (log link): $e^{\beta_j}$ is the rate ratio (or incidence rate ratio) — the multiplicative change in the expected count for a one-unit increase in $x_j$.
Key assumption: $E[y_i] = \mathrm{Var}(y_i) = \mu_i$ (equidispersion). Violations lead to overdispersion (Section 13).
Applications: Number of accidents, hospital admissions, species counts, web page visits, insurance claims frequency.
6.3 Negative Binomial GLM
Use when: Response is a count variable with overdispersion (variance exceeds the mean) — the most common departure from Poisson assumptions.
Model:

$$y_i \sim \text{NB}(\mu_i, \theta), \qquad \log \mu_i = \mathbf{x}_i^\top \boldsymbol\beta, \qquad \mathrm{Var}(y_i) = \mu_i + \mu_i^2/\theta$$

Interpretation: Same as Poisson (log link, rate ratios), but with an additional overdispersion parameter $\theta$ estimated from the data.
Applications: Same as Poisson but when the Poisson assumption of equidispersion is violated — common in ecology (species abundance), healthcare (hospitalisation counts), and insurance.
6.4 Gamma GLM
Use when: Response is continuous and strictly positive, with variance that increases proportionally to the square of the mean (coefficient of variation is roughly constant).
Model:

$$y_i \sim \text{Gamma}(\mu_i, \nu), \qquad \log \mu_i = \mathbf{x}_i^\top \boldsymbol\beta$$

Common links: Log (most interpretable), inverse (canonical), identity.
Interpretation (log link): $e^{\beta_j}$ is the multiplicative change in the mean response per unit increase in $x_j$.
Applications: Insurance claim severity (cost per claim), income, hospital costs, reaction times, survival times (without censoring), environmental concentrations.
6.5 Inverse Gaussian GLM
Use when: Response is continuous, strictly positive, and highly right-skewed, with variance increasing as the cube of the mean — more extreme than Gamma.
Model:

$$y_i \sim \text{Inverse Gaussian}(\mu_i, \sigma^2), \qquad g(\mu_i) = \mathbf{x}_i^\top \boldsymbol\beta$$
Common links: Inverse squared (canonical), log, inverse.
Applications: First-passage times, repair times, extreme claim sizes, some types of survival data.
6.6 Gaussian GLM (Standard Linear Regression)
Use when: Response is continuous, unbounded, approximately normally distributed, with constant variance.
Model:

$$y_i \sim N(\mu_i, \sigma^2), \qquad \mu_i = \mathbf{x}_i^\top \boldsymbol\beta$$
This is identical to OLS with the identity link. Including it in the GLM framework confirms that linear regression is a special case of GLMs.
Applications: All classical linear regression applications.
6.7 Tweedie GLM
Use when: Response contains exact zeros mixed with positive continuous values (a "zero-inflated" continuous distribution), or when the appropriate power variance is uncertain.
Model:

$$E[y_i] = \mu_i, \qquad \mathrm{Var}(y_i) = \phi\, \mu_i^p, \qquad \log \mu_i = \mathbf{x}_i^\top \boldsymbol\beta$$

The power $p$ is estimated from the data or set by domain knowledge.
Applications: Insurance pure premium (frequency × severity), rainfall amounts, ecological biomass, fisheries catch data.
6.8 Quasi-GLMs
When the distributional assumption is uncertain or violated, quasi-GLMs relax the full distributional assumption and specify only the mean and variance function:

$$E[y_i] = \mu_i = g^{-1}\!\left(\mathbf{x}_i^\top \boldsymbol\beta\right), \qquad \mathrm{Var}(y_i) = \phi\, V(\mu_i)$$

The dispersion parameter $\phi$ is estimated from the data (not fixed at 1), providing valid inference even when the count data are overdispersed or underdispersed.
Common quasi-GLMs:
- Quasi-Poisson: Poisson mean function with estimated $\phi$.
- Quasi-Binomial: Binomial mean function with estimated $\phi$.
⚠️ Quasi-GLMs do not have a full likelihood, so AIC/BIC cannot be computed. Use deviance and F-tests for model comparison instead.
6.9 Summary of GLM Distributions
| Distribution | Response Type | Variance | Default Link | Dispersion |
|---|---|---|---|---|
| Gaussian | Continuous, unbounded | $\phi$ | Identity | Estimated |
| Binomial | Binary / Proportions | $\mu(1 - \mu)/n$ | Logit | Known ($\phi = 1$) |
| Poisson | Counts (integer $\geq 0$) | $\mu$ | Log | Known ($\phi = 1$) |
| Negative Binomial | Counts (overdispersed) | $\mu + \mu^2/\theta$ | Log | Estimated ($\theta$) |
| Gamma | Continuous, positive | $\phi\,\mu^2$ | Log / Inverse | Estimated |
| Inverse Gaussian | Continuous, positive, skewed | $\phi\,\mu^3$ | Inv. squared / Log | Estimated |
| Tweedie | Zero-inflated positive | $\phi\,\mu^p$ | Log | Estimated ($\phi$, $p$) |
| Quasi-Poisson | Counts (overdispersed) | $\phi\,\mu$ | Log | Estimated |
| Quasi-Binomial | Proportions (overdispersed) | $\phi\,\mu(1 - \mu)/n$ | Logit | Estimated |
7. Assumptions of GLMs
7.1 Correct Distributional Family
The chosen distribution must be appropriate for the type of response variable. For example:
- Using Gaussian for count data ignores the non-negativity and discreteness.
- Using Poisson for overdispersed counts ignores the excess variance, leading to underestimated standard errors.
How to check: Inspect the response variable's distribution (histogram, range), consider the data-generating process, and verify using residual diagnostics and goodness-of-fit tests.
7.2 Correct Link Function
The link function must be appropriate for the chosen distribution and the expected relationship between predictors and the mean.
How to check: Inspect residual plots; compare alternative link functions using AIC; use added-variable plots for the link function.
7.3 Linearity on the Link Scale
GLMs assume a linear relationship between the predictors and the transformed mean $g(\mu)$:

$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$$

This means the relationship between each $x_j$ and $g(\mu)$ must be linear, even if the relationship between $x_j$ and $\mu$ itself is non-linear.
How to check: Partial residual plots (component-plus-residual plots); LOESS-smoothed plots of residuals vs. each predictor.
7.4 Independence of Observations
Observations must be independent of each other. Clustered, longitudinal, or spatial data may have within-group correlations that violate this assumption.
How to check: Consider the study design. For clustered data, use Generalised Estimating Equations (GEE) or mixed models (GLMM) instead.
7.5 Correct Specification of the Variance Function
The variance function must correctly describe how variability changes with the mean. Misspecification leads to:
- Incorrect standard errors (too small if variance is underestimated).
- Invalid hypothesis tests and confidence intervals.
How to check: Residual vs. fitted value plots; scale-location plots; Pearson / deviance tests for dispersion.
7.6 No Perfect Multicollinearity
As in OLS, perfect multicollinearity (one predictor is a perfect linear function of others) prevents estimation. Near-multicollinearity inflates standard errors.
How to check: Variance Inflation Factor (VIF); condition number of the design matrix.
7.7 No Complete Separation (for Binomial GLMs)
For logistic regression and other binomial GLMs, complete separation (a predictor or combination perfectly predicts the outcome) causes the MLE to diverge to $\pm\infty$.
How to check: Warning messages from the fitting algorithm; extremely large coefficient estimates with huge standard errors.
7.8 Sufficient Sample Size
GLM inference is based on asymptotic (large-sample) theory. The adequacy of asymptotic approximations depends on:
- Total sample size .
- For Binomial: Expected counts in each cell.
- For Poisson: Expected counts (preferably $\geq 5$) in most cells.
Small expected counts reduce the reliability of likelihood ratio tests, Wald tests, and residual diagnostics.
8. Parameter Estimation: Maximum Likelihood and IRLS
8.1 The Log-Likelihood for GLMs
For $n$ independent observations, the log-likelihood is:

$$\ell(\boldsymbol\beta, \phi) = \sum_{i=1}^{n} \left[ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right]$$

Where $\theta_i$ and $\mu_i$ depend on the regression coefficients $\boldsymbol\beta$ through the link function.
8.2 The Score Equations
Setting the gradient of the log-likelihood to zero gives the score equations (MLE conditions):

$$U_j(\boldsymbol\beta) = \sum_{i=1}^{n} \frac{(y_i - \mu_i)\, x_{ij}}{\phi\, V(\mu_i)\, g'(\mu_i)} = 0, \qquad j = 0, 1, \dots, p$$

In matrix form:

$$X^\top W G\, (\mathbf{y} - \boldsymbol\mu) = \mathbf{0}$$

Where $W = \mathrm{diag}\!\left\{ \left[ V(\mu_i)\, g'(\mu_i)^2 \right]^{-1} \right\}$ and $G = \mathrm{diag}\{ g'(\mu_i) \}$.
These equations are generally non-linear in $\boldsymbol\beta$ and require iterative solution.
8.3 Iteratively Reweighted Least Squares (IRLS)
The standard algorithm for fitting GLMs is Iteratively Reweighted Least Squares (IRLS), a Fisher scoring optimisation (Newton-Raphson with the expected information) applied to the log-likelihood. For canonical links the two coincide.
At each iteration $t$:
Step 1: Compute the adjusted dependent variable (working response):

$$z_i^{(t)} = \eta_i^{(t)} + \left(y_i - \mu_i^{(t)}\right) g'\!\left(\mu_i^{(t)}\right)$$

Step 2: Compute the working weights:

$$w_i^{(t)} = \frac{1}{V\!\left(\mu_i^{(t)}\right) \left[ g'\!\left(\mu_i^{(t)}\right) \right]^2}$$

Step 3: Solve the weighted least squares problem:

$$\boldsymbol\beta^{(t+1)} = \left(X^\top W^{(t)} X\right)^{-1} X^\top W^{(t)} \mathbf{z}^{(t)}$$

Convergence: Repeat until $\lVert \boldsymbol\beta^{(t+1)} - \boldsymbol\beta^{(t)} \rVert < \epsilon$ (e.g., $\epsilon = 10^{-8}$) or the change in deviance is negligible.
Starting values: Typically $\mu_i^{(0)} = y_i + c$ (small constant $c$ to avoid boundary issues) or the overall mean $\bar{y}$.
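The three steps can be sketched end to end for a Poisson GLM with log link and a single predictor, where $V(\mu) = \mu$ and $g'(\mu) = 1/\mu$, so the working weights reduce to $w_i = \mu_i$. This is an illustrative standard-library sketch, not the DataStatPro implementation (which uses matrix algebra and additional safeguards):

```python
import math

def irls_poisson(x, y, tol=1e-10, max_iter=100):
    # starting values: mu_i = y_i + 0.5 to avoid log(0)
    mu = [yi + 0.5 for yi in y]
    eta = [math.log(mi) for mi in mu]
    b0 = b1 = 0.0
    for _ in range(max_iter):
        w = mu  # log link: w_i = 1 / (V(mu)*g'(mu)^2) = mu_i
        # working response z_i = eta_i + (y_i - mu_i) * g'(mu_i)
        z = [ei + (yi - mi) / mi for ei, yi, mi in zip(eta, y, mu)]
        # weighted least squares of z on (1, x): 2x2 normal equations
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        nb0 = (swxx * swz - swx * swxz) / det
        nb1 = (sw * swxz - swx * swz) / det
        done = abs(nb0 - b0) + abs(nb1 - b1) < tol
        b0, b1 = nb0, nb1
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(ei) for ei in eta]
        if done:
            break
    return b0, b1
```

With the canonical log link, the converged fit satisfies the intercept score equation $\sum_i (y_i - \hat\mu_i) = 0$, so the fitted means reproduce the observed total count — a handy sanity check on any IRLS implementation.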
8.4 The Fisher Information Matrix and Standard Errors
At convergence, the Fisher information matrix is:

$$\mathcal{I}(\hat{\boldsymbol\beta}) = \frac{1}{\phi}\, X^\top \hat{W} X$$

Where $\hat{W}$ is the weight matrix evaluated at $\hat{\boldsymbol\beta}$. The asymptotic covariance matrix of $\hat{\boldsymbol\beta}$ is:

$$\widehat{\mathrm{Cov}}(\hat{\boldsymbol\beta}) = \hat\phi \left(X^\top \hat{W} X\right)^{-1}$$

The standard error of $\hat\beta_j$:

$$\mathrm{SE}(\hat\beta_j) = \sqrt{\left[ \hat\phi \left(X^\top \hat{W} X\right)^{-1} \right]_{jj}}$$

For known-dispersion models (Poisson, Binomial with $\phi = 1$):

$$\widehat{\mathrm{Cov}}(\hat{\boldsymbol\beta}) = \left(X^\top \hat{W} X\right)^{-1}$$
8.5 Estimating the Dispersion Parameter
For distributions with estimated dispersion (Normal, Gamma, Inverse Gaussian), $\phi$ is estimated after obtaining $\hat{\boldsymbol\beta}$:
Method of Moments (Pearson $X^2$):

$$\hat\phi = \frac{1}{n - p} \sum_{i=1}^{n} \frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)}$$

Maximum Likelihood / Deviance Estimator:

$$\hat\phi = \frac{D}{n - p}$$

Where $D$ is the residual deviance (see Section 9).
💡 The Pearson estimator of $\phi$ is generally preferred for its robustness. For Poisson and Binomial, $\phi = 1$ is known; if the Pearson estimator gives $\hat\phi$ substantially greater than 1, overdispersion is present (Section 13).
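The Pearson estimator of $\phi$ fits in a few lines; `V` is passed in as the variance function and `p` is the number of estimated regression parameters (names are illustrative, not a DataStatPro API):

```python
def pearson_dispersion(y, mu, V, p):
    # phi_hat = X^2 / (n - p), with X^2 = sum_i (y_i - mu_i)^2 / V(mu_i)
    x2 = sum((yi - mi) ** 2 / V(mi) for yi, mi in zip(y, mu))
    return x2 / (len(y) - p)
```

For a Poisson fit you would pass `V=lambda m: m`; a result well above 1 signals overdispersion, and a result near 1 is consistent with the Poisson assumption.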
9. Model Fit and Evaluation
9.1 The Deviance
The deviance is the primary goodness-of-fit measure for GLMs. It is defined as twice the log-likelihood difference between the saturated model (perfect fit, one parameter per observation) and the fitted model:

$$D = 2\left[\ell_{\text{sat}} - \ell(\hat{\boldsymbol\beta})\right] = \sum_{i=1}^{n} d_i$$

Where the deviance contribution $d_i$ of observation $i$ is twice its contribution to the log-likelihood difference.
Deviance contribution for each distribution:

| Distribution | Deviance Contribution $d_i$ |
|---|---|
| Gaussian | $(y_i - \hat\mu_i)^2$ |
| Binomial | $2\left[ y_i \log\dfrac{y_i}{\hat\mu_i} + (n_i - y_i) \log\dfrac{n_i - y_i}{n_i - \hat\mu_i} \right]$ |
| Poisson | $2\left[ y_i \log\dfrac{y_i}{\hat\mu_i} - (y_i - \hat\mu_i) \right]$ |
| Gamma | $2\left[ -\log\dfrac{y_i}{\hat\mu_i} + \dfrac{y_i - \hat\mu_i}{\hat\mu_i} \right]$ |
| Inverse Gaussian | $\dfrac{(y_i - \hat\mu_i)^2}{\hat\mu_i^2\, y_i}$ |
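The Poisson deviance contribution translates directly into a small function; the conventional $y \log(y/\mu) = 0$ at $y = 0$ handles zero counts. A minimal sketch (function name illustrative):

```python
import math

def poisson_deviance(y, mu):
    # D = sum_i 2*[y_i*log(y_i/mu_i) - (y_i - mu_i)], with the
    # convention y*log(y/mu) = 0 when y == 0
    d = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        d += 2.0 * (term - (yi - mi))
    return d
```

A perfect fit (every $\hat\mu_i = y_i$) gives a deviance of exactly zero, and any lack of fit makes it positive, which is why the deviance plays the role of the residual sum of squares in GLMs.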
9.2 Null Deviance and Residual Deviance
The null deviance is the deviance of the null model (intercept only):

$$D_0 = D(\text{intercept-only model})$$

The residual deviance is the deviance of the fitted model:

$$D_{\text{res}} = D(\text{fitted model})$$

The difference $D_0 - D_{\text{res}}$ measures how much the predictors have reduced the unexplained deviance — analogous to the regression sum of squares in linear regression.
For known-dispersion models (Poisson, Binomial), the residual deviance approximately follows $\chi^2_{n-p}$ when the model is correct. A residual deviance much larger than $n - p$ suggests poor fit or overdispersion.
9.3 The Pearson Statistic
The Pearson $X^2$ statistic provides an alternative goodness-of-fit measure:

$$X^2 = \sum_{i=1}^{n} \frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)}$$

Under the correct model with known dispersion, $X^2 \approx \chi^2_{n-p}$ for large samples. The Pearson dispersion estimate is $\hat\phi = X^2 / (n - p)$.
9.4 Pseudo R² Measures
Since ordinary $R^2$ is not directly meaningful for non-Gaussian GLMs, several pseudo R² measures have been developed:
McFadden's Pseudo R²:

$$R^2_{\text{McF}} = 1 - \frac{\ell(\hat{\boldsymbol\beta})}{\ell(\text{null})}$$

Simplified using deviances (when the saturated log-likelihood is zero, as in binary logistic regression):

$$R^2_{\text{McF}} = 1 - \frac{D_{\text{res}}}{D_0}$$

Cox-Snell Pseudo R²:

$$R^2_{\text{CS}} = 1 - \exp\!\left( \frac{2\left[\ell(\text{null}) - \ell(\hat{\boldsymbol\beta})\right]}{n} \right)$$

Nagelkerke Pseudo R² (scaled to reach maximum of 1):

$$R^2_{\text{N}} = \frac{R^2_{\text{CS}}}{1 - \exp\!\left( 2\,\ell(\text{null}) / n \right)}$$

Deviance R² (common in GLM literature):

$$R^2_{D} = 1 - \frac{D_{\text{res}}}{D_0}$$
Interpretation of McFadden's $R^2_{\text{McF}}$ for GLMs:

| $R^2_{\text{McF}}$ | Interpretation |
|---|---|
| $< 0.1$ | Poor fit |
| $0.1$–$0.2$ | Acceptable fit |
| $0.2$–$0.3$ | Good fit |
| $0.3$–$0.4$ | Very good fit |
| $> 0.4$ | Excellent fit |
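The McFadden and Nagelkerke formulas above can be sketched as small helpers (function names illustrative, not a DataStatPro API; both take log-likelihoods, which are negative for discrete data):

```python
import math

def mcfadden_r2(ll_model, ll_null):
    # R^2_McF = 1 - l(model) / l(null)
    return 1.0 - ll_model / ll_null

def nagelkerke_r2(ll_model, ll_null, n):
    # Cox-Snell R^2, rescaled by its maximum attainable value
    cs = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cs = 1.0 - math.exp(2.0 * ll_null / n)
    return cs / max_cs
```

Note the boundary behaviour: a model no better than the null gives 0 for both measures, and a model reaching the saturated log-likelihood of 0 gives a Nagelkerke value of exactly 1, which is the point of the rescaling.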
9.5 AIC and BIC
Akaike Information Criterion:

$$\text{AIC} = -2\,\ell(\hat{\boldsymbol\beta}) + 2k$$

Bayesian Information Criterion:

$$\text{BIC} = -2\,\ell(\hat{\boldsymbol\beta}) + k \log n$$

Where $k$ is the number of estimated regression parameters (including the intercept). For models where $\phi$ is estimated, include it as an additional parameter.
Lower AIC/BIC indicates a better model (adjusted for complexity). AIC favours predictive accuracy; BIC imposes a stronger penalty for complexity and prefers parsimonious models.
⚠️ AIC and BIC require a proper likelihood. They cannot be computed for quasi-GLMs, which use a pseudo-likelihood. For quasi-models, use the F-test for model comparison.
10. Hypothesis Testing and Inference
10.1 Wald Tests for Individual Coefficients
For each coefficient $\beta_j$, the Wald test tests $H_0: \beta_j = 0$:

$$z_j = \frac{\hat\beta_j}{\mathrm{SE}(\hat\beta_j)}$$

Two-sided p-value:

$$p = 2\left[1 - \Phi(|z_j|)\right]$$

A Wald $100(1-\alpha)\%$ confidence interval for $\beta_j$:

$$\hat\beta_j \pm z_{1-\alpha/2}\, \mathrm{SE}(\hat\beta_j)$$

For the effect on the original response scale, exponentiate:
- Log link: $e^{\hat\beta_j}$ is the rate ratio or mean ratio (Poisson/Gamma).
- Logit link: $e^{\hat\beta_j}$ is the odds ratio (Binomial).
Confidence interval on the original scale: $\exp\!\left[\hat\beta_j \pm z_{1-\alpha/2}\, \mathrm{SE}(\hat\beta_j)\right]$.
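The whole Wald pipeline (z statistic, two-sided p-value, CI on the link scale, exponentiated CI) fits in a few lines of standard-library Python using `statistics.NormalDist`; the function name is illustrative:

```python
from statistics import NormalDist
import math

def wald_summary(beta_hat, se, alpha=0.05):
    # returns (z, p, (lo, hi) on the link scale, (exp(lo), exp(hi)))
    nd = NormalDist()
    z = beta_hat / se
    p = 2.0 * (1.0 - nd.cdf(abs(z)))          # two-sided p-value
    zc = nd.inv_cdf(1.0 - alpha / 2.0)         # e.g. 1.96 for alpha = 0.05
    lo, hi = beta_hat - zc * se, beta_hat + zc * se
    return z, p, (lo, hi), (math.exp(lo), math.exp(hi))
```

For a logit-link fit the last pair is the confidence interval for the odds ratio; for a log-link fit it is the interval for the rate or mean ratio.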
10.2 Likelihood Ratio Test (LRT)
The likelihood ratio test compares two nested models: a smaller (restricted) model $M_0$ and a larger (full) model $M_1$:

$$\Lambda = 2\left[\ell(M_1) - \ell(M_0)\right] = D_0 - D_1$$

Under $H_0$ (the restrictions hold), $\Lambda \sim \chi^2_q$, where $q$ is the difference in the number of parameters between $M_0$ and $M_1$.
For testing a single coefficient ($q = 1$), $\Lambda \sim \chi^2_1$. For testing a group of coefficients jointly, $\Lambda \sim \chi^2_q$.
💡 The LRT is generally preferred over the Wald test for GLMs because it is more accurate in small samples and avoids the Wald test's known deficiencies (e.g., the Hauck-Donner effect, where the Wald statistic can shrink — and its p-value grow — as the true effect becomes very large).
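For a single-coefficient test ($q = 1$) the $\chi^2_1$ tail probability needs no statistics library, since $P(\chi^2_1 \geq x) = \operatorname{erfc}(\sqrt{x/2})$. A minimal sketch (function name illustrative):

```python
import math

def lrt_pvalue_df1(ll_null, ll_full):
    # Lambda = 2*(l_full - l_null); for q = 1,
    # P(chi2_1 >= Lambda) = erfc(sqrt(Lambda / 2))
    lam = 2.0 * (ll_full - ll_null)
    return math.erfc(math.sqrt(lam / 2.0))
```

The identity holds because a $\chi^2_1$ variable is the square of a standard normal; for $q > 1$ you would need a chi-square CDF from a statistics library.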
10.3 Score Test (Rao Test)
The score test evaluates whether the gradient of the log-likelihood (the score) is significantly different from zero at the restricted (null) parameter values:

$$S = U(\boldsymbol\beta_0)^\top\, \mathcal{I}(\boldsymbol\beta_0)^{-1}\, U(\boldsymbol\beta_0) \sim \chi^2_q \text{ under } H_0$$
The score test only requires fitting the null model (not the full model), making it computationally convenient when the null model is much simpler.
10.4 Analysis of Deviance Table
The analysis of deviance is the GLM analogue of the ANOVA table in linear regression. It sequentially adds predictors and reports the reduction in deviance:

| Source | Df | Deviance | Residual Df | Residual Deviance | $p$-value |
|---|---|---|---|---|---|
| Null model | — | — | $n - 1$ | $D_0$ | — |
| $x_1$ | 1 | $D_0 - D_1$ | $n - 2$ | $D_1$ | $P(\chi^2_1 \geq D_0 - D_1)$ |
| $x_2$ | 1 | $D_1 - D_2$ | $n - 3$ | $D_2$ | $P(\chi^2_1 \geq D_1 - D_2)$ |
| $x_3$ | 1 | $D_2 - D_3$ | $n - 4$ | $D_3$ | $P(\chi^2_1 \geq D_2 - D_3)$ |

For overdispersed models (quasi-GLMs), use an F-test instead of the $\chi^2$ test, dividing the deviance change by the estimated dispersion $\hat\phi$:

$$F = \frac{(D_{\text{reduced}} - D_{\text{full}})/q}{\hat\phi}$$
10.5 Confidence Intervals for the Mean Response
A confidence interval for the mean response at a new predictor vector $\mathbf{x}_0$ is constructed on the linear predictor scale (where the asymptotic normality applies) and back-transformed:
Linear predictor and its SE:

$$\hat\eta_0 = \mathbf{x}_0^\top \hat{\boldsymbol\beta}, \qquad \mathrm{SE}(\hat\eta_0) = \sqrt{\mathbf{x}_0^\top\, \widehat{\mathrm{Cov}}(\hat{\boldsymbol\beta})\, \mathbf{x}_0}$$

Confidence interval on the $\eta$ scale:

$$\hat\eta_0 \pm z_{1-\alpha/2}\, \mathrm{SE}(\hat\eta_0)$$

Back-transform to the $\mu$ scale using $g^{-1}$:

$$\left[\, g^{-1}\!\left(\hat\eta_0 - z_{1-\alpha/2}\, \mathrm{SE}(\hat\eta_0)\right),\; g^{-1}\!\left(\hat\eta_0 + z_{1-\alpha/2}\, \mathrm{SE}(\hat\eta_0)\right) \right]$$

💡 Constructing confidence intervals on the link scale and back-transforming (rather than constructing them directly on the $\mu$ scale) ensures the bounds respect the natural constraints of $\mu$ (e.g., positivity for Poisson/Gamma, $(0, 1)$ for Binomial).
10.6 Profile Likelihood Confidence Intervals
Profile likelihood confidence intervals are more accurate than Wald intervals, especially in small samples or when the likelihood is asymmetric:

$$\left\{ \beta_j : 2\left[\ell(\hat{\boldsymbol\beta}) - \ell_p(\beta_j)\right] \leq \chi^2_{1,\,1-\alpha} \right\}$$

Where $\ell_p(\beta_j)$ is the profile log-likelihood — the log-likelihood maximised over all other parameters with $\beta_j$ fixed at a test value. The DataStatPro application computes both Wald and profile likelihood CIs.
11. Model Diagnostics and Residuals
11.1 Types of GLM Residuals
Unlike linear regression, which has a single natural residual $y_i - \hat{y}_i$, GLMs have several types of residuals, each useful for different diagnostic purposes.
11.1.1 Raw (Response) Residuals

$$r_i = y_i - \hat\mu_i$$

Simple but not standardised — observations with larger fitted variance $V(\hat\mu_i)$ tend to produce larger raw residuals even if the fit is equally good.
11.1.2 Pearson Residuals

$$r_i^{P} = \frac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}}$$

Standardised by the expected standard deviation under the model. Pearson residuals should be approximately $N(0, 1)$ for large samples if the model is correct.
Standardised Pearson residuals (adjusted for leverage):

$$r_i^{SP} = \frac{r_i^{P}}{\sqrt{\hat\phi\,(1 - h_{ii})}}$$

Where $h_{ii}$ is the leverage (hat matrix diagonal). Values $|r_i^{SP}| > 2$ warrant investigation.
11.1.3 Deviance Residuals

$$r_i^{D} = \operatorname{sign}(y_i - \hat\mu_i)\, \sqrt{d_i}$$

Where $d_i$ is the deviance contribution of observation $i$ (see Section 9.1). The sum of squared deviance residuals equals the total deviance: $\sum_i \left(r_i^{D}\right)^2 = D$.
Deviance residuals are generally preferred for normality assessments because they are closer to normally distributed than Pearson residuals in many GLMs.
Standardised deviance residuals:

$$r_i^{SD} = \frac{r_i^{D}}{\sqrt{\hat\phi\,(1 - h_{ii})}}$$
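The Pearson and deviance residual formulas, specialised to the Poisson case ($V(\mu) = \mu$), can be sketched as follows (function names illustrative, not a DataStatPro API):

```python
import math

def poisson_pearson_resid(y, mu):
    # r_P = (y - mu) / sqrt(V(mu)), with V(mu) = mu for the Poisson
    return [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]

def poisson_deviance_resid(y, mu):
    # r_D = sign(y - mu) * sqrt(d_i), d_i the Poisson deviance contribution
    out = []
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        d = 2.0 * (term - (yi - mi))
        out.append(math.copysign(math.sqrt(max(d, 0.0)), yi - mi))
    return out
```

Either set of residuals can then be fed into the usual diagnostic plots; the deviance residuals have the extra property that their squares sum to the model's total deviance.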
11.1.4 Anscombe Residuals
Anscombe residuals are constructed using a variance-stabilising transformation $A(\cdot)$ chosen so that the residuals are approximately normally distributed:

$$r_i^{A} = \frac{A(y_i) - A(\hat\mu_i)}{A'(\hat\mu_i)\, \sqrt{V(\hat\mu_i)}}$$

The Anscombe transformation for each distribution:

| Distribution | $A(y)$ |
|---|---|
| Normal | $y$ |
| Poisson | $\tfrac{3}{2}\, y^{2/3}$ |
| Binomial | $\displaystyle\int_0^{y} t^{-1/3} (1 - t)^{-1/3}\, dt$ (approximately) |
| Gamma | $3\, y^{1/3}$ |
| Inverse Gaussian | $\log y$ |
11.1.5 Quantile (Randomised) Residuals
Quantile residuals (Dunn & Smyth, 1996) are defined as:

$$r_i^{Q} = \Phi^{-1}\!\left( F(y_i;\, \hat\mu_i, \hat\phi) \right)$$

Where $F(y_i;\, \hat\mu_i, \hat\phi)$ is the cumulative probability of the observed value under the fitted model. For discrete distributions, the cumulative probability is drawn uniformly from the interval $\left( F(y_i - 1),\, F(y_i) \right]$ (randomised).
Quantile residuals are exactly normally distributed (by construction) when the model is correct, making them the gold standard for GLM diagnostics. They are particularly useful for discrete distributions (Poisson, Binomial, Negative Binomial) where other residuals are not well-approximated by a normal distribution.
11.2 Leverage, Influence, and Cook's Distance
Hat matrix (leverage): For GLMs, the hat matrix is:

$$H = W^{1/2} X \left(X^\top W X\right)^{-1} X^\top W^{1/2}$$

The diagonal elements $h_{ii}$ are the leverages — the influence of observation $i$ on its own fitted value. High leverage ($h_{ii} > 2p/n$) indicates an observation with unusual predictor values.
Cook's Distance: Measures the influence of observation $i$ on all fitted values:

$$D_i = \frac{\left(r_i^{SP}\right)^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

Values $D_i > 1$ (or $D_i > 4/n$) suggest influential observations.
DFBETA: Change in coefficient estimates when observation $i$ is excluded:

$$\text{DFBETA}_{i} = \hat{\boldsymbol\beta} - \hat{\boldsymbol\beta}_{(-i)}$$
11.3 Diagnostic Plots
A comprehensive GLM diagnostic assessment includes the following plots:
| Plot | What to Look For |
|---|---|
| Residuals vs. Fitted values | No pattern; random scatter around zero |
| Scale-Location (√|residuals| vs. Fitted) | Horizontal band; no trend (homoscedasticity) |
| Normal Q-Q of residuals | Points near the diagonal line (normality) |
| Residuals vs. Leverage | No high-leverage + high-residual points |
| Cook's Distance | No observations with $D_i > 1$ (or $D_i > 4/n$) |
| Added Variable Plots | Linear relationship on the link scale |
| Partial Residual Plots | Detect non-linearity in individual predictors |
| Index Plot of Deviance Residuals | Identify outliers by observation number |
11.4 Goodness-of-Fit Tests
Hosmer-Lemeshow Test (for Binomial GLM): Groups observations into $G$ deciles of fitted probabilities and tests observed vs. expected event counts:

$$H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g \left(1 - E_g / n_g\right)} \sim \chi^2_{G-2}$$

A non-significant result ($p > 0.05$) indicates adequate calibration.
Deviance Goodness-of-Fit Test (for Poisson/Binomial):

$$D \sim \chi^2_{n-p} \text{ (approximately, under the correct model)}$$

A significant deviance ($p < 0.05$) may indicate lack of fit, overdispersion, or missing covariates.
Pearson Goodness-of-Fit Test:

$$X^2 \sim \chi^2_{n-p} \text{ (approximately)}$$

Similar interpretation to the deviance test. For sparse data, $X^2$ may be more reliable than $D$.
12. Model Selection and Variable Selection
12.1 Nested Model Comparison via LRT
To compare two nested models $M_0 \subset M_1$:

$$\Lambda = D_0 - D_1 \sim \chi^2_q \text{ under } H_0$$

For quasi-GLMs (estimated dispersion):

$$F = \frac{(D_0 - D_1)/q}{\hat\phi} \sim F_{q,\, n - p_1} \text{ under } H_0$$
12.2 AIC-Based Model Selection
For non-nested models or exploratory model building, use AIC:

$$\text{AIC} = -2\,\ell(\hat{\boldsymbol\beta}) + 2k$$

Select the model with the lowest AIC. A difference $\Delta\text{AIC} > 2$ is considered meaningful; $\Delta\text{AIC} > 10$ is strong evidence for the lower-AIC model.
12.3 Stepwise Variable Selection
Forward selection: Start with the null model; add the variable that most reduces AIC at each step; stop when no addition improves AIC.
Backward elimination: Start with the full model; at each step remove the variable whose removal most reduces AIC; stop when no removal improves AIC.
Bidirectional stepwise: Combine forward and backward; at each step, consider both additions and removals.
⚠️ Stepwise selection using p-values suffers from multiple testing inflation and instability. AIC-based stepwise is preferred. Neither should be used as a substitute for theory-driven model building.
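To make the procedure concrete, here is a minimal AIC-based forward-selection sketch for the Gaussian/identity case. The helpers `gaussian_aic` and `forward_select` are illustrative names, not DataStatPro APIs; for other families the AIC would come from the fitted GLM's likelihood rather than a least-squares fit:

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC (up to an additive constant) for a Gaussian/identity GLM."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1              # coefficients + the variance parameter
    return n * np.log(rss / n) + 2 * k

def forward_select(candidates, y, n):
    """Greedy forward selection by AIC. `candidates` maps name -> 1-D column."""
    selected, current = [], np.ones((n, 1))      # start from intercept-only
    best_aic = gaussian_aic(current, y)
    improved = True
    while improved and len(selected) < len(candidates):
        improved = False
        trials = {name: gaussian_aic(np.column_stack([current, col]), y)
                  for name, col in candidates.items() if name not in selected}
        name = min(trials, key=trials.get)
        if trials[name] < best_aic:              # add only if AIC drops
            best_aic = trials[name]
            current = np.column_stack([current, candidates[name]])
            selected.append(name)
            improved = True
    return selected
```

Because each step only compares models on the same data, the AIC values are directly comparable, as Section 17 (Mistake 9) requires.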
12.4 Handling Categorical Predictors
Categorical predictors with k categories are encoded as k - 1 dummy variables using a reference category. For a categorical variable "Group" with categories A (reference), B, and C:
The model becomes: g(μ) = β₀ + β₁·I(Group = B) + β₂·I(Group = C) + …
exp(β₁) (for log link) is the ratio of the mean for group B relative to the reference group A.
Testing all categories jointly (LRT):
Drop all k - 1 dummy variables simultaneously and compare to the full model: LR = D_reduced - D_full ~ χ²(k - 1)
12.5 Interaction Terms
Interactions model situations where the effect of one predictor on the response depends on the value of another predictor:
g(μ) = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂
In a log-link model: exp(β₃) is the multiplicative modification to the rate ratio of x₁ for each unit increase in x₂.
Test whether the interaction term is needed using the LRT with df = 1 (or more for categorical interactions).
12.6 Polynomial and Spline Terms
For non-linear relationships on the link scale, include polynomial or spline terms:
Polynomial: g(μ) = β₀ + β₁x + β₂x² (+ β₃x³, …)
Natural Cubic Spline: Replaces x with a set of basis functions that allow flexible non-linear fitting while remaining linear at the extremes. The number of knots controls flexibility.
LOESS Smoothed Partial Residual Plot: Helps identify non-linearity — if the LOESS curve departs substantially from a straight line, a polynomial or spline term may be needed.
13. Overdispersion and Underdispersion
13.1 What is Overdispersion?
Overdispersion occurs when the observed variance in the data exceeds the variance predicted by the model. It is most commonly encountered with Poisson and Binomial GLMs.
For Poisson: overdispersion means Var(Y) > μ. For Binomial: overdispersion means Var(Y) > nπ(1 - π).
Consequences of ignoring overdispersion:
- Standard errors are underestimated (too small).
- z-statistics and χ² statistics are inflated.
- p-values are too small → spurious significance.
- Confidence intervals are too narrow.
13.2 Detecting Overdispersion
Informal check: Compute the ratio:
φ̂ = X² / (n - p)  (or D / (n - p)), with p the number of estimated parameters
- φ̂ ≈ 1: No overdispersion (Poisson/Binomial assumption holds).
- φ̂ > 1: Overdispersion. Values φ̂ > 2 are a clear concern.
- φ̂ < 1: Underdispersion (rarer but possible).
Formal test: Test H₀: φ = 1 by comparing the Pearson X² statistic to χ²(n - p). A significant result (p < 0.05) confirms overdispersion.
Cameron-Trivedi test: Regresses ((y_i - μ̂_i)² - y_i)/μ̂_i on μ̂_i (for Poisson) and tests whether the slope is significantly different from zero.
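The informal check is easy to compute by hand from the fitted values. A minimal sketch (the helper name `dispersion_estimate` is ours, not a DataStatPro API):

```python
import numpy as np

def dispersion_estimate(y, mu, n_params, family="poisson"):
    """Pearson dispersion estimate: phi_hat = X^2 / (n - p).

    Variance function: V(mu) = mu for Poisson, mu*(1 - mu) for Bernoulli.
    """
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    v = mu if family == "poisson" else mu * (1.0 - mu)
    pearson_x2 = np.sum((y - mu) ** 2 / v)
    return pearson_x2 / (len(y) - n_params)
```

A value of φ̂ near 1 supports the Poisson/Binomial variance assumption; values well above 1 call for the remedies in Section 13.4.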
13.3 Causes of Overdispersion
| Cause | Description |
|---|---|
| Unobserved heterogeneity | Unmeasured variables cause variation in the true rate across observations |
| Clustering / correlation | Observations within groups are not independent |
| Zero inflation | More zeros than expected under Poisson/Binomial (see Section 13.4.3) |
| Contagion | One event increases the probability of subsequent events (positive feedback) |
| Model misspecification | Wrong distributional family, missing covariates, wrong link function |
| Outliers | One or a few extreme observations inflate the apparent variance |
13.4 Solutions for Overdispersion
13.4.1 Quasi-GLM (Quasi-Poisson / Quasi-Binomial)
The simplest fix: Estimate φ̂ = X²/(n - p) from the data and use it to inflate all standard errors:
SE_quasi = √φ̂ · SE
The coefficient estimates are identical to the standard GLM; only the standard errors, test statistics, and confidence intervals change. Use F-tests instead of χ² tests for model comparison.
When to use: When overdispersion is mild to moderate and no specific mechanism is known.
13.4.2 Negative Binomial Regression
Models overdispersion via an additional parameter α (the dispersion parameter):
Var(Y) = μ + αμ²
As α → 0, the Negative Binomial → Poisson. A significant improvement in fit over Poisson (LRT for H₀: α = 0) confirms overdispersion.
When to use: When counts are overdispersed and overdispersion follows a Gamma-mixture structure (i.e., unobserved heterogeneity).
13.4.3 Zero-Inflated Models
When excess zeros are the source of overdispersion, zero-inflated models combine:
- A binary model for whether the count is structurally zero (e.g., logistic regression).
- A count model (e.g., Poisson or Negative Binomial) for observations that are not structural zeros; this component can still produce ordinary sampling zeros.
Zero-Inflated Poisson (ZIP):
P(Y = 0) = π + (1 - π) e^(-μ);  P(Y = y) = (1 - π) e^(-μ) μ^y / y!  for y = 1, 2, …
Where:
- π = probability of a structural zero (modelled as a logistic regression).
- μ = Poisson mean for non-structural counts (modelled via log link).
Zero-Inflated Negative Binomial (ZINB): Combines structural zeros with a Negative Binomial count process.
Hurdle Models: Similar to zero-inflated models but use a different two-part structure — a binary process for zero vs. positive, and a truncated count model for positive values.
Vuong Test: Compares a standard Poisson/NB model against its zero-inflated counterpart. A significant positive test statistic favours the zero-inflated model.
13.4.4 Mixed Models (GLMM)
When overdispersion arises from clustered or hierarchical data (e.g., patients within hospitals, students within schools), Generalised Linear Mixed Models (GLMMs) include random effects to account for within-group correlation:
g(μ_ij) = x_ij′β + u_j,  u_j ~ N(0, σ²_u)
Where u_j is a random effect for cluster j.
13.5 Underdispersion
Underdispersion (variance less than expected) is less common but can occur when:
- Counts are bounded (e.g., maximum possible count is small).
- There is negative contagion (one event inhibits subsequent events).
- Data are from a very controlled process.
Solutions include Conway-Maxwell-Poisson (CMP) regression, which handles both over- and underdispersion via an additional dispersion parameter ν.
14. Using the GLM Component
The GLM component in the DataStatPro application provides a full end-to-end workflow for fitting, evaluating, and interpreting Generalized Linear Models.
Step-by-Step Guide
Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should contain one response variable and one or more predictor variables.
Step 2 — Select Distribution Family Choose the distribution appropriate for your response variable:
- Gaussian (Normal): Continuous, unbounded.
- Binomial: Binary (0/1) or proportions (y successes out of n trials).
- Poisson: Non-negative integer counts.
- Negative Binomial: Overdispersed counts.
- Gamma: Continuous, strictly positive.
- Inverse Gaussian: Continuous, positive, highly skewed.
- Tweedie: Zero-inflated positive continuous.
- Quasi-Poisson: Poisson with estimated dispersion.
- Quasi-Binomial: Binomial with estimated dispersion.
Step 3 — Select Link Function Choose the link function. The default is the canonical link for each distribution:
- Gaussian → Identity
- Binomial → Logit (alternatives: Probit, Cloglog, Log)
- Poisson → Log (alternatives: Square Root, Identity)
- Negative Binomial → Log
- Gamma → Log (alternatives: Inverse, Identity)
- Inverse Gaussian → Inverse Squared (alternatives: Log, Inverse)
- Tweedie → Log
Step 4 — Select Response Variable (Y) Select the response variable from the "Response Variable (Y)" dropdown. For Binomial with proportions, you will be prompted to also select the trials variable (total counts n_i).
Step 5 — Select Predictor Variables (X) Select one or more predictor variables from the "Predictor Variables (X)" dropdown. These can be:
- Numeric (continuous or ordinal).
- Categorical (the application automatically creates dummy variables; you will be prompted to select the reference category).
Step 6 — Configure Offset (Optional) If the response is a rate, select the offset variable from the "Offset" dropdown. The application will include log(offset) in the linear predictor automatically.
Step 7 — Configure Interactions (Optional) Specify interaction terms by selecting pairs (or groups) of variables. The application will create and include the product terms.
Step 8 — Select Confidence Level Choose the confidence level for confidence intervals and prediction intervals (default: 95%).
Step 9 — Configure Dispersion For Quasi-Poisson and Quasi-Binomial, select the dispersion estimation method:
- Pearson (recommended default)
- Deviance
For Negative Binomial, choose the method for estimating α:
- MLE (recommended)
- Method of Moments
Step 10 — Select Display Options Choose which outputs to display:
- ✅ Coefficient Table (estimates, SEs, z-values, p-values, CIs, exp(β))
- ✅ Analysis of Deviance Table
- ✅ Model Fit Statistics (Null Deviance, Residual Deviance, Pearson X², AIC, BIC, Pseudo R²)
- ✅ Predicted vs. Observed Plot
- ✅ Residuals vs. Fitted Plot
- ✅ Normal Q-Q Plot of Residuals
- ✅ Scale-Location Plot
- ✅ Cook's Distance Plot
- ✅ Residuals vs. Leverage Plot
- ✅ Marginal Effects Plot
- ✅ Hosmer-Lemeshow Test (Binomial only)
- ✅ Overdispersion Test (Poisson/Binomial)
- ✅ Prediction Tool with CI
Step 11 — Run the Analysis Click "Run GLM". The application will:
- Encode categorical variables using dummy coding.
- Fit the GLM using IRLS.
- Compute coefficients, SEs, z-values, p-values, and CIs (Wald and profile likelihood).
- Compute deviance, Pearson X², AIC, BIC, and pseudo R².
- Estimate the dispersion parameter (if applicable).
- Compute all residual types and diagnostic statistics.
- Generate all selected diagnostic plots.
- Run goodness-of-fit tests.
15. Computational and Formula Details
15.1 IRLS Algorithm: Full Step-by-Step
Inputs: Response y, design matrix X (n × p), prior weights ω (default: all 1), link function g, variance function V(μ).
Step 0: Initialise μ⁽⁰⁾ (e.g., μ⁽⁰⁾ = y, adjusted away from boundary values) and η⁽⁰⁾ = g(μ⁽⁰⁾).
For iteration t = 0, 1, 2, … until convergence:
Step 1: Compute working response z:
z_i = η_i + (y_i - μ_i) g′(μ_i)
Step 2: Compute working weights W = diag(w_i):
w_i = ω_i / [g′(μ_i)² V(μ_i)]
Step 3: Solve weighted least squares:
β⁽ᵗ⁺¹⁾ = (XᵀWX)⁻¹ XᵀWz
Step 4: Update linear predictor and mean:
η = Xβ⁽ᵗ⁺¹⁾,  μ = g⁻¹(η)
Step 5: Check convergence:
‖β⁽ᵗ⁺¹⁾ - β⁽ᵗ⁾‖ / ‖β⁽ᵗ⁾‖ < ε
Or equivalently, check the change in deviance.
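Putting the five steps together for a Poisson GLM with log link — where g′(μ) = 1/μ and V(μ) = μ, so the working weight simplifies to w_i = μ_i — gives the following minimal sketch (not the DataStatPro implementation):

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=50):
    """IRLS for a Poisson GLM with log link (minimal sketch)."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    mu = y + 0.5                       # Step 0: initialise away from zero
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        z = eta + (y - mu) / mu        # Step 1: working response (g'(mu) = 1/mu)
        w = mu                         # Step 2: working weights simplify to mu
        WX = X * w[:, None]            # Step 3: weighted least squares
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        eta = X @ beta_new             # Step 4: update eta and mu
        mu = np.exp(eta)
        if np.max(np.abs(beta_new - beta)) < tol:   # Step 5: convergence
            beta = beta_new
            break
        beta = beta_new
    return beta, mu
```

As a sanity check, for an intercept-only model the algorithm should reproduce the closed-form MLE β̂₀ = log(ȳ).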
15.2 Link Function Derivatives
The working response and working weights require g′(μ):
| Link | g(μ) | g′(μ) | g⁻¹(η) |
|---|---|---|---|
| Identity | μ | 1 | η |
| Log | log μ | 1/μ | e^η |
| Logit | log[μ/(1-μ)] | 1/[μ(1-μ)] | 1/(1+e^(-η)) |
| Probit | Φ⁻¹(μ) | 1/φ(Φ⁻¹(μ)) | Φ(η) |
| Cloglog | log[-log(1-μ)] | -1/[(1-μ) log(1-μ)] | 1 - exp(-e^η) |
| Inverse | 1/μ | -1/μ² | 1/η |
| Inv. Squared | 1/μ² | -2/μ³ | 1/√η |
| Square root | √μ | 1/(2√μ) | η² |
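The table above can be transcribed directly into code. Here is a sketch of the links as (g, g′, g⁻¹) triples — exactly the three pieces IRLS needs — which also lets each derivative be cross-checked numerically:

```python
import numpy as np
from scipy import stats

# Each link maps to (g, g_prime, g_inverse); a sketch, not a library API.
LINKS = {
    "identity":   (lambda m: m, lambda m: np.ones_like(m), lambda e: e),
    "log":        (np.log, lambda m: 1.0 / m, np.exp),
    "logit":      (lambda m: np.log(m / (1 - m)),
                   lambda m: 1.0 / (m * (1 - m)),
                   lambda e: 1.0 / (1.0 + np.exp(-e))),
    "probit":     (stats.norm.ppf,
                   lambda m: 1.0 / stats.norm.pdf(stats.norm.ppf(m)),
                   stats.norm.cdf),
    "cloglog":    (lambda m: np.log(-np.log(1 - m)),
                   lambda m: -1.0 / ((1 - m) * np.log(1 - m)),
                   lambda e: 1.0 - np.exp(-np.exp(e))),
    "inverse":    (lambda m: 1.0 / m, lambda m: -1.0 / m**2, lambda e: 1.0 / e),
    "invsquared": (lambda m: 1.0 / m**2, lambda m: -2.0 / m**3,
                   lambda e: 1.0 / np.sqrt(e)),
    "sqrt":       (np.sqrt, lambda m: 0.5 / np.sqrt(m), lambda e: e**2),
}
```

Each g⁻¹ undoes its g, and each g′ can be verified against a central-difference numerical derivative.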
15.3 Deviance Formulas by Distribution
| Distribution | Total Deviance D |
|---|---|
| Gaussian | Σ (y_i - μ̂_i)² |
| Binomial | 2 Σ [y_i log(y_i/μ̂_i) + (n_i - y_i) log((n_i - y_i)/(n_i - μ̂_i))] |
| Poisson | 2 Σ [y_i log(y_i/μ̂_i) - (y_i - μ̂_i)] |
| Gamma | 2 Σ [-log(y_i/μ̂_i) + (y_i - μ̂_i)/μ̂_i] |
| Inv. Gaussian | Σ (y_i - μ̂_i)² / (μ̂_i² y_i) |
| Neg. Binomial | 2 Σ [y_i log(y_i/μ̂_i) - (y_i + 1/α) log((y_i + 1/α)/(μ̂_i + 1/α))] |
15.4 Marginal Effects
For models with non-identity link functions, the coefficient β_j describes the effect of x_j on the link scale — not directly on the response scale. Marginal effects translate coefficients to the response scale.
Average Marginal Effect (AME):
AME_j = (1/n) Σ_i ∂μ̂_i/∂x_j
For the log link: ∂μ/∂x_j = β_j μ, so AME_j = β_j · (1/n) Σ_i μ̂_i.
For the logit link: ∂p/∂x_j = β_j p(1 - p), so AME_j = β_j · (1/n) Σ_i p̂_i(1 - p̂_i).
Marginal Effect at the Mean (MEM): MEM_j = ∂μ/∂x_j evaluated at the sample means x̄.
AME is generally preferred over MEM as it averages over the actual distribution of observations rather than evaluating at the mean (which may not be a representative point).
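For the logit case the AME formula reduces to one line of code. A minimal sketch (the helper name `ame_logit` is ours):

```python
import numpy as np

def ame_logit(X, beta):
    """Average marginal effects for a logit model: AME_j = beta_j * mean(p*(1-p))."""
    beta = np.asarray(beta, float)
    eta = X @ beta                        # linear predictor
    p = 1.0 / (1.0 + np.exp(-eta))        # fitted probabilities
    scale = np.mean(p * (1.0 - p))        # average derivative factor
    return beta * scale
```

Note the returned vector includes an entry for the intercept column; only the slope entries are usually reported.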
16. Worked Examples
Example 1: Poisson GLM — Modelling Insurance Claim Counts
Research Question: What factors predict the number of insurance claims filed by policyholders? Does age, vehicle type, and driving experience affect claim frequency?
Data: a portfolio of policyholders; response = number of claims in one year; exposure = years of coverage (offset); predictors: Age (years), VehicleType (Car/Van/Truck; reference = Car), Experience (years of driving).
Step 1: Check Response Distribution
The mean number of claims per year is low, and the histogram shows right-skewed counts with many zeros. A Poisson GLM with log link and a log(exposure) offset is appropriate.
Step 2: Fit Poisson GLM
Step 3: Coefficient Table
| Parameter | β̂ | SE | z-value | p-value | exp(β̂) (Rate Ratio) | 95% CI for Rate Ratio |
|---|---|---|---|---|---|---|
| Intercept | -2.183 | 0.241 | -9.06 | < 0.001 | 0.113 | [0.070, 0.181] |
| Age | -0.018 | 0.006 | -2.87 | 0.004 | 0.982 | [0.970, 0.994] |
| Van | 0.421 | 0.112 | 3.76 | < 0.001 | 1.524 | [1.224, 1.897] |
| Truck | 0.683 | 0.148 | 4.61 | < 0.001 | 1.980 | [1.481, 2.645] |
| Experience | -0.031 | 0.009 | -3.44 | 0.001 | 0.969 | [0.951, 0.988] |
Step 4: Interpretation
- Age: For each additional year of age, the expected claim rate is multiplied by exp(-0.018) = 0.982 — a 1.8% decrease per year, holding other variables constant (p = 0.004).
- Van vs. Car: Van drivers have a 52.4% higher claim rate than car drivers (rate ratio = 1.524, p < 0.001).
- Truck vs. Car: Truck drivers have a 98.0% higher claim rate (nearly double) compared to car drivers (rate ratio = 1.980, p < 0.001).
- Experience: Each additional year of driving experience reduces the expected claim rate by 3.1% (rate ratio = 0.969, p = 0.001).
Step 5: Model Fit Statistics
Step 6: Check for Overdispersion
Mild overdispersion (φ̂ slightly above 1). Refit with Quasi-Poisson:
Quasi-Poisson multiplies all SEs by √φ̂. Conclusions are largely unchanged but confidence intervals are slightly wider.
Prediction for new policyholder: Age = 35, Van, Experience = 10, Exposure = 1 year:
η̂ = -2.183 - 0.018(35) + 0.421 - 0.031(10) + log(1) = -2.702, so μ̂ = e^(-2.702) ≈ 0.067 expected claims per year.
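The prediction can be reproduced from the coefficient table above (a sketch; a standard error for the prediction would additionally require the fitted covariance matrix):

```python
import numpy as np

# Coefficients taken from the Poisson coefficient table above
coef = {"intercept": -2.183, "age": -0.018, "van": 0.421,
        "experience": -0.031}

# New policyholder: Age 35, drives a Van, 10 years' experience, 1 year exposure
eta = (coef["intercept"] + coef["age"] * 35 + coef["van"]
       + coef["experience"] * 10 + np.log(1.0))   # offset: log(exposure)
rate = float(np.exp(eta))                          # expected claims per year
```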
Example 2: Gamma GLM — Modelling Healthcare Costs
Research Question: What patient characteristics predict the total annual healthcare cost?
Data: a cohort of patients; response = total annual healthcare cost (USD, > 0); predictors: Age (years), ChronicConditions (count), Smoker (0/1), BMI.
Step 1: Assess Distribution
Healthcare costs are strictly positive with a right-skewed distribution and variance proportional to μ² (coefficient of variation approximately constant). A Gamma GLM with log link is appropriate.
Step 2: Fit Gamma GLM
Step 3: Coefficient Table
| Parameter | β̂ | SE | z-value | p-value | exp(β̂) (Cost Ratio) | 95% CI |
|---|---|---|---|---|---|---|
| Intercept | 6.421 | 0.382 | 16.81 | < 0.001 | 614.3 | [290.5, 1298.0] |
| Age | 0.028 | 0.007 | 3.89 | < 0.001 | 1.028 | [1.014, 1.043] |
| Chronic | 0.341 | 0.042 | 8.12 | < 0.001 | 1.406 | [1.295, 1.527] |
| Smoker | 0.287 | 0.098 | 2.93 | 0.003 | 1.332 | [1.099, 1.615] |
| BMI | 0.019 | 0.008 | 2.38 | 0.017 | 1.019 | [1.003, 1.036] |
Step 4: Interpretation (log link → cost ratios)
- Age: Each additional year of age increases expected costs by 2.8% (cost ratio = 1.028, p < 0.001).
- Chronic Conditions: Each additional chronic condition increases expected costs by 40.6% (cost ratio = 1.406, p < 0.001).
- Smoking: Smokers have 33.2% higher expected costs than non-smokers (cost ratio = 1.332, p = 0.003).
- BMI: Each unit increase in BMI increases expected costs by 1.9% (cost ratio = 1.019, p = 0.017).
Step 5: Model Fit
Step 6: Predicted Cost for New Patient
Age = 55, Chronic = 3, Smoker = 1, BMI = 28:
η̂ = 6.421 + 0.028(55) + 0.341(3) + 0.287 + 0.019(28) = 9.803, so μ̂ = e^(9.803) ≈ $18,090.
95% CI for μ̂: Computed on the η scale and back-transformed: [$14,210, $23,020].
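The point prediction can be reproduced from the coefficient table (a sketch; the interval itself requires the fitted covariance matrix):

```python
import numpy as np

# Coefficients taken from the Gamma coefficient table above
coef = {"intercept": 6.421, "age": 0.028, "chronic": 0.341,
        "smoker": 0.287, "bmi": 0.019}

# New patient: Age 55, 3 chronic conditions, smoker, BMI 28
eta = (coef["intercept"] + coef["age"] * 55 + coef["chronic"] * 3
       + coef["smoker"] * 1 + coef["bmi"] * 28)
cost = float(np.exp(eta))   # predicted annual cost in USD
```

The point estimate lands near the geometric centre of the reported 95% CI, as expected when a symmetric interval on the log scale is back-transformed.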
Example 3: Negative Binomial GLM — Modelling Overdispersed Species Counts
Research Question: What environmental variables predict the abundance of a bird species across survey sites?
Data: a set of survey sites; response = bird count; predictors: Altitude (m), ForestCover (%), Distance to Water (km), Temperature (°C).
Step 1: Fit Poisson GLM and Check Overdispersion
Poisson GLM fit: the Pearson dispersion statistic is well above 1 — substantial overdispersion (φ̂ ≫ 1). Switch to Negative Binomial.
Step 2: Fit Negative Binomial GLM
Estimated overdispersion parameter: α̂ (SE = 0.48).
LRT for overdispersion vs. Poisson is highly significant → Negative Binomial is strongly preferred.
Step 3: Coefficient Table
| Parameter | β̂ | SE | z-value | p-value | Rate Ratio |
|---|---|---|---|---|---|
| Intercept | 1.842 | 0.412 | 4.47 | < 0.001 | 6.31 |
| Altitude (per 100m) | -0.124 | 0.038 | -3.26 | 0.001 | 0.883 |
| Forest Cover (per 10%) | 0.218 | 0.061 | 3.57 | < 0.001 | 1.244 |
| Distance to Water | -0.083 | 0.024 | -3.46 | 0.001 | 0.920 |
| Temperature | 0.041 | 0.019 | 2.16 | 0.031 | 1.042 |
Step 4: Interpretation
- Altitude: For each 100 m increase in altitude, the expected count is multiplied by 0.883 — an 11.7% decrease (p = 0.001).
- Forest Cover: For each 10% increase in forest cover, the expected count increases by 24.4% (p < 0.001).
- Distance to Water: Each additional km from water reduces the expected count by 8.0% (p = 0.001).
- Temperature: Each degree increase in temperature increases the expected count by 4.2% (p = 0.031).
Step 5: Fit Statistics
Example 4: Binomial GLM with Probit Link — Predicting Product Failure
Research Question: What material and design factors predict whether a component will fail a stress test?
Data: a batch of tested components; response = failure (0 = pass, 1 = fail); predictors: Thickness (mm), Temperature (°C), MaterialGrade (A/B/C; reference = A).
Step 1: Fit Binomial GLM with Probit Link
Step 2: Coefficient Table
| Parameter | β̂ | SE | z-value | p-value |
|---|---|---|---|---|
| Intercept | -2.841 | 0.531 | -5.35 | < 0.001 |
| Thickness | -0.384 | 0.092 | -4.17 | < 0.001 |
| Temperature | 0.041 | 0.012 | 3.42 | 0.001 |
| Grade B | 0.612 | 0.214 | 2.86 | 0.004 |
| Grade C | 1.183 | 0.241 | 4.91 | < 0.001 |
Step 3: Interpretation (probit scale)
- Thickness: Each mm increase in thickness decreases the probit of failure by 0.384 — the component is less likely to fail.
- Temperature: Each degree increase increases the probit of failure by 0.041.
- Grade B vs. A: Grade B components have a probit of failure that is 0.612 higher (more likely to fail) than Grade A.
Average Marginal Effect of Thickness:
AME = β̂_Thickness × (1/n) Σ_i φ(η̂_i) ≈ -0.120
On average, each mm increase in thickness reduces the probability of failure by 12.0 percentage points.
Step 4: Predicted Probability for New Component
Thickness = 4.5 mm, Temperature = 80°C, Grade B:
η̂ = -2.841 - 0.384(4.5) + 0.041(80) + 0.612 = -0.677, so p̂ = Φ(-0.677) ≈ 0.249.
Predicted probability of failure: 24.9%.
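The probit prediction can be reproduced from the coefficient table using the standard normal CDF (a sketch):

```python
import numpy as np
from scipy.stats import norm

# Coefficients taken from the probit coefficient table above
coef = {"intercept": -2.841, "thickness": -0.384,
        "temperature": 0.041, "grade_b": 0.612}

# New component: Thickness 4.5 mm, Temperature 80 C, Grade B
eta = (coef["intercept"] + coef["thickness"] * 4.5
       + coef["temperature"] * 80 + coef["grade_b"])
p_fail = float(norm.cdf(eta))   # probit inverse link: p = Phi(eta)
```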
17. Common Mistakes and How to Avoid Them
Mistake 1: Using Gaussian GLM for Non-Normal Response Variables
Problem: Applying ordinary linear regression (Gaussian GLM) to count, proportion, or positive continuous data, which violates distributional assumptions, produces predictions outside valid ranges (e.g., negative counts or probabilities > 1), and leads to invalid inference.
Solution: Match the distribution to the response type: Poisson/Negative Binomial for counts; Binomial for proportions; Gamma for positive continuous. Always check the range and distribution of Y before selecting a family.
Mistake 2: Ignoring Overdispersion in Poisson/Binomial Models
Problem: Fitting a Poisson or Binomial GLM to overdispersed data (φ̂ > 1) without correction. Standard errors are underestimated, leading to spuriously small p-values and narrow confidence intervals.
Solution: Always compute φ̂ after fitting. If φ̂ is substantially greater than 1, use Quasi-Poisson, Quasi-Binomial, or Negative Binomial as appropriate. For severe overdispersion or excess zeros, consider zero-inflated models.
Mistake 3: Interpreting Coefficients on the Wrong Scale
Problem: Interpreting a log-link coefficient of β̂ = 0.35 as "a 0.35 unit increase in the mean" when it actually represents a multiplicative change: the mean is multiplied by exp(0.35) ≈ 1.42 (a 42% increase).
Solution: Always interpret GLM coefficients on the appropriate scale. For log links, report and interpret exp(β̂) (rate ratio, cost ratio, etc.). For logit links, report exp(β̂) as an odds ratio. Always state clearly which scale is being used.
Mistake 4: Choosing the Wrong Link Function
Problem: Using an inappropriate link function (e.g., identity link for a Poisson model), which can produce predicted values outside valid ranges and poor model fit.
Solution: Use the canonical link as the default. Consider alternative links when domain knowledge suggests a specific functional form. Check the fit of alternative link functions using AIC and residual plots.
Mistake 5: Forgetting the Offset in Rate Models
Problem: Modelling count data without including an offset for different exposure periods or population sizes, attributing variation in counts entirely to predictors when it is partly due to different exposures.
Solution: Always include log(exposure) as an offset when modelling rates from count data. Verify that the offset variable is on the log scale (for log-link models) and has a fixed coefficient of 1.
Mistake 6: Treating the Deviance as an Absolute Goodness-of-Fit Test for All Distributions
Problem: Using the residual deviance compared to χ²(n - p) to test model fit for Poisson or Binomial models with small expected counts, where the χ² approximation is unreliable.
Solution: The deviance goodness-of-fit test is only reliable when the expected counts are reasonably large (a common rule of thumb is at least 5). For sparse data, use the Pearson X² test, collapse categories, or use simulation-based tests. For Binomial with individual binary responses, use the Hosmer-Lemeshow test instead.
Mistake 7: Not Checking for Complete Separation in Binomial Models
Problem: A predictor or combination of predictors perfectly separates successes from failures, causing IRLS to fail to converge and producing extremely large coefficient estimates with huge standard errors.
Solution: Look for IRLS convergence warnings and inspect coefficient estimates. If separation is detected, use Firth's bias-reduced logistic regression, exact logistic regression, or regularised estimation. Remove or merge categories that cause separation.
Mistake 8: Applying GLMs to Dependent Observations
Problem: Using a standard GLM for longitudinal, clustered, or spatially correlated data, where observations within groups are correlated, violating the independence assumption and leading to underestimated standard errors.
Solution: Use Generalised Estimating Equations (GEE) for marginal (population-averaged) inference, or Generalised Linear Mixed Models (GLMM) for subject-specific inference. Always consider the study design before choosing a model.
Mistake 9: Comparing Models Across Different Datasets Using AIC
Problem: Comparing AIC values between models fit to different subsets of data (e.g., after listwise deletion of missing values reduces the dataset differently for different models), leading to invalid comparisons.
Solution: AIC is only comparable between models fit to exactly the same observations. Ensure all candidate models use the same dataset. Handle missing data before model selection, not during.
Mistake 10: Over-Interpreting Pseudo R² Values
Problem: Comparing a GLM's pseudo R² directly to the R² from linear regression and concluding the GLM fits poorly because the pseudo R² is "only 0.20."
Solution: Pseudo R² values for GLMs are not directly comparable to OLS R². A McFadden's pseudo R² in the 0.2–0.4 range already represents a good fit in many GLM applications. Always interpret pseudo R² relative to the scale typical for that type of model and outcome, and supplement with deviance, AIC, and residual diagnostics.
18. Troubleshooting
| Issue | Likely Cause | Solution |
|---|---|---|
| IRLS fails to converge | Complete separation (Binomial); extreme predictor values; poor starting values | Check for separation; standardise predictors; use robust starting values; reduce model complexity |
| Very large coefficient estimates (e.g., β̂ beyond ±10) | Complete or quasi-complete separation (Binomial); collinearity | Inspect data for perfect predictors; check VIF; use Firth regression or regularisation |
| Very large standard errors | Multicollinearity; separation; too few events per variable | Check VIF; remove correlated variables; collect more data; use penalised estimation |
| Residual deviance much larger than residual df | Overdispersion; model misspecification; influential outliers | Compute φ̂; switch to Quasi/NB model; check residual plots for outliers and non-linearity |
| Negative predicted values (log/Poisson model) | Should not occur with log link; check for identity link used accidentally | Verify link function specification; refit with correct link |
| Predicted probabilities at 0 or 1 exactly | Complete separation; very extreme linear predictor values | Check for separation; use Firth regression; inspect extreme observations |
| AIC is not reported | Quasi-GLM selected (no proper likelihood) | Use F-tests and deviance for model comparison; note AIC is unavailable for quasi-models |
| φ̂ < 1 (underdispersion) | Counts are bounded; negative contagion; over-specified model | Consider Conway-Maxwell-Poisson; check model specification; verify data are correct |
| Hosmer-Lemeshow test significant (p < 0.05) | Poor calibration; missing covariates; wrong link; non-linearity | Add missing predictors; try alternative link; add polynomial terms; inspect residual plots |
| All Pearson residuals similar in magnitude | Normal/Gaussian family used on count data (constant variance) | Switch to Poisson or Negative Binomial with appropriate variance function |
| Cook's distance very large for one observation | Extreme influential observation; data entry error | Investigate observation; verify data accuracy; refit with and without it to assess influence |
| Profile likelihood CI very asymmetric vs. Wald CI | Strong non-linearity of likelihood; small sample | Report profile likelihood CI; note asymmetry as evidence of non-normality of MLE distribution |
| Dispersion estimate varies wildly across subgroups | Heteroscedasticity; model misspecification | Consider separate models per subgroup; add interaction terms; use heteroscedasticity-robust SEs |
19. Quick Reference Cheat Sheet
Core GLM Formulas
| Formula | Description |
|---|---|
| g(μ) = Xβ = η | GLM specification |
| Var(Y) = φ V(μ) | GLM variance structure |
| β⁽ᵗ⁺¹⁾ = (XᵀWX)⁻¹ XᵀWz | IRLS update |
| w_i = 1/[g′(μ_i)² V(μ_i)] | IRLS working weight |
| z_i = η_i + (y_i - μ_i) g′(μ_i) | IRLS working response |
| Cov(β̂) = φ̂ (XᵀWX)⁻¹ | Covariance of MLE |
| z = β̂ / SE(β̂) | Wald z-statistic |
| LR = D₀ - D₁ ~ χ²(Δdf) | Likelihood ratio test |
| D = 2[ℓ_sat - ℓ_model] | Residual deviance |
| R²_McF = 1 - ℓ_model/ℓ_null | McFadden's pseudo R² |
| AIC = -2ℓ + 2p | AIC |
| φ̂ = X²/(n - p) | Pearson dispersion estimate |
Distribution and Link Function Selection
| Response Type | Distribution | Default Link | Interpretation |
|---|---|---|---|
| Binary (0/1) | Binomial | Logit | Odds ratio |
| Binary (0/1), latent normal | Binomial | Probit | Change in probit |
| Binary (0/1), rare event / hazard | Binomial | Cloglog | Hazard ratio |
| Proportion (y/n from n trials) | Binomial | Logit | Odds ratio |
| Count, equidispersed | Poisson | Log | Rate ratio |
| Count, overdispersed | Neg. Binomial | Log | Rate ratio |
| Count, overdispersed (mild) | Quasi-Poisson | Log | Rate ratio (corrected SEs) |
| Count, excess zeros | ZIP / ZINB | Log | Rate ratio (count component) |
| Continuous, positive, constant CV (Var ∝ μ²) | Gamma | Log | Cost/mean ratio |
| Continuous, positive, high skew | Inverse Gaussian | Log | Mean ratio |
| Zero-inflated positive continuous | Tweedie | Log | Mean ratio |
| Continuous, unbounded | Gaussian | Identity | Additive change in mean |
Residual Types Summary
| Residual | Formula | Best For |
|---|---|---|
| Raw | y_i - μ̂_i | Simple inspection |
| Pearson | (y_i - μ̂_i)/√V(μ̂_i) | Dispersion assessment |
| Deviance | sign(y_i - μ̂_i) · √d_i | General diagnostics |
| Quantile | Φ⁻¹(F(y_i; μ̂_i)) | Best normality approximation; discrete data |
| Anscombe | Variance-stabilising transform of y and μ̂ | Normality plots |
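The raw, Pearson, and deviance residuals in the table can be computed directly from the fit. A Poisson-family sketch (the helper name is ours):

```python
import numpy as np

def poisson_residuals(y, mu):
    """Raw, Pearson, and deviance residuals for a Poisson GLM (a sketch)."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    raw = y - mu
    pearson = raw / np.sqrt(mu)                      # V(mu) = mu for Poisson
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    d_i = 2.0 * (term - raw)                         # unit deviances
    deviance = np.sign(raw) * np.sqrt(np.maximum(d_i, 0.0))
    return raw, pearson, deviance
```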
Overdispersion Decision Tree
Fit standard Poisson/Binomial GLM
↓
Compute φ̂ = X²/(n-p-1)
↓
φ̂ ≈ 1? ──Yes──→ Model is adequate
↓ No
φ̂ > 1? ──Yes──→ Overdispersion
↓ ↓
φ̂ < 1 ←─── Excess zeros?
(Underdispersion) ↓ Yes ↓ No
Conway-Maxwell ZIP/ZINB φ̂ < 2?
Poisson (CMP) ↓Yes ↓No
Quasi-GLM Negative Binomial
or GLMM (if clustered)
Model Comparison Guide
| Scenario | Method | Statistic |
|---|---|---|
| Two nested models (proper likelihood) | LRT | χ² = D₀ - D₁ |
| Two nested quasi-GLM models | F-test | F = [(D₀ - D₁)/Δdf] / φ̂ |
| Non-nested models | AIC / BIC | Lower is better |
| Overall model significance | Analysis of deviance | χ² vs. null model |
| Single coefficient | Wald test | z = β̂/SE(β̂) |
| Group of coefficients | LRT or Wald | χ²(k) |
| Small samples / asymmetric likelihood | Profile LRT | Profile likelihood CI |
Pseudo R² Benchmarks (McFadden)
| Model Fit | R²_McF |
|---|---|
| Poor | < 0.10 |
| Acceptable | 0.10 – 0.20 |
| Good | 0.20 – 0.30 |
| Very good | 0.30 – 0.40 |
| Excellent | > 0.40 |
Key Diagnostic Thresholds
| Diagnostic | Threshold | Action |
|---|---|---|
| φ̂ (dispersion) | > 1.5 | Investigate overdispersion |
| Standardised residual | magnitude > 2 (flag), > 3 (outlier) | Investigate observation |
| Leverage h_ii | > 2p/n | High leverage; check predictor values |
| Cook's distance | > 1 or > 4/n | Influential observation; refit without it |
| VIF | > 5 (concern), > 10 (serious) | Multicollinearity; consider variable removal |
| Hosmer-Lemeshow p | < 0.05 | Poor calibration (Binomial models) |
| LRT for NB vs. Poisson | p < 0.05 | Use Negative Binomial |
GLM vs. Related Models
| Model | Extension of GLM | Key Addition | When to Use |
|---|---|---|---|
| GLMM | Yes | Random effects | Clustered / hierarchical data |
| GEE | Marginal GLM | Working correlation | Longitudinal / repeated measures |
| Zero-Inflated GLM | Yes | Structural zeros component | Excess zeros in counts |
| Hurdle Model | Yes | Two-part: binary + truncated | Zeros arise from a distinct process |
| Ordinal GLM | Yes | Cumulative link | Ordered categorical response |
| Multinomial GLM | Yes | Multiple linear predictors | Nominal categorical response (> 2 classes) |
| Survival GLM | Yes | Censoring mechanism | Time-to-event data |
| Quasi-GLM | Yes | Estimated dispersion | Overdispersion without full distribution |
| GAMLSS | Yes | All parameters modelled | Distributional regression |
This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Generalized Linear Models using the DataStatPro application. For further reading, consult McCullagh & Nelder's "Generalized Linear Models" (2nd ed., Chapman & Hall, 1989), Dobson & Barnett's "An Introduction to Generalized Linear Models" (4th ed., CRC Press, 2018), or Agresti's "Foundations of Linear and Generalized Linear Models" (Wiley, 2015). For feature requests or support, contact the DataStatPro team.