
Generalized Linear Models

Comprehensive reference guide for Generalized Linear Models (GLM) analysis.

Generalized Linear Models: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Generalized Linear Models (GLMs) all the way through advanced model specification, estimation, diagnostics, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What are Generalized Linear Models?
  3. The Mathematical Framework of GLMs
  4. The Exponential Family of Distributions
  5. Link Functions
  6. GLM Distributions and Their Applications
  7. Assumptions of GLMs
  8. Parameter Estimation: Maximum Likelihood and IRLS
  9. Model Fit and Evaluation
  10. Hypothesis Testing and Inference
  11. Model Diagnostics and Residuals
  12. Model Selection and Variable Selection
  13. Overdispersion and Underdispersion
  14. Using the GLM Component
  15. Computational and Formula Details
  16. Worked Examples
  17. Common Mistakes and How to Avoid Them
  18. Troubleshooting
  19. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into Generalized Linear Models, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Probability Distributions

A probability distribution describes how the values of a random variable are distributed. The key distributions used in GLMs — Normal, Binomial, Poisson, Gamma, Inverse Gaussian, Negative Binomial, and Tweedie — are each covered in detail in Section 4.

1.2 The Likelihood Function

The likelihood function $L(\boldsymbol{\theta}; \mathbf{y})$ measures how probable the observed data $\mathbf{y}$ are, given a parameter vector $\boldsymbol{\theta}$. For independent observations:

$$L(\boldsymbol{\theta}; \mathbf{y}) = \prod_{i=1}^n f(y_i; \boldsymbol{\theta})$$

The log-likelihood is more convenient to work with:

$$\ell(\boldsymbol{\theta}; \mathbf{y}) = \sum_{i=1}^n \ln f(y_i; \boldsymbol{\theta})$$

Maximum Likelihood Estimation (MLE) finds the parameter values that maximise $\ell(\boldsymbol{\theta}; \mathbf{y})$.
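As a concrete sketch (the toy counts and the search grid below are my own, assuming nothing beyond the Poisson formula), the Poisson log-likelihood is maximised at the sample mean, and a brute-force grid search recovers the same maximiser:

```python
import math

def poisson_loglik(lam, ys):
    """Log-likelihood of i.i.d. Poisson observations at rate lam."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in ys)

ys = [2, 3, 1, 4, 2]              # toy count data
mle = sum(ys) / len(ys)           # the analytic Poisson MLE is the sample mean

# A grid search over candidate rates recovers the same maximiser:
grid = [0.5 + 0.1 * k for k in range(50)]
best = max(grid, key=lambda lam: poisson_loglik(lam, ys))
print(best, mle)                  # both ≈ 2.4
```

`math.lgamma(y + 1)` supplies the $\ln y!$ term, so the values are true log-likelihoods rather than just proportional to them.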

1.3 The Linear Predictor

A linear predictor $\eta$ is a weighted linear combination of predictor variables:

$$\eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = \mathbf{x}^T \boldsymbol{\beta}$$

This is the core structure inherited from linear regression. In GLMs, $\eta$ is not the outcome itself but is transformed through a link function to relate to the mean of the response distribution.

1.4 Ordinary Linear Regression Recap

In ordinary linear regression (OLS):

$$Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

This model has three implicit components:

  1. A distribution for the response: $Y_i \sim \mathcal{N}(\mu_i, \sigma^2)$.
  2. A linear predictor: $\eta_i = \mathbf{x}_i^T \boldsymbol{\beta}$.
  3. A link function connecting $\mu_i$ to $\eta_i$: $\eta_i = \mu_i$ (the identity link).

GLMs generalise this framework by allowing different distributions and link functions.

1.5 The Score Equations and the Information Matrix

The score vector is the gradient of the log-likelihood with respect to the parameters:

$$\mathbf{s}(\boldsymbol{\beta}) = \frac{\partial \ell}{\partial \boldsymbol{\beta}}$$

Setting $\mathbf{s}(\boldsymbol{\beta}) = \mathbf{0}$ gives the MLE. The Fisher information matrix is:

$$\mathcal{I}(\boldsymbol{\beta}) = -E\left[\frac{\partial^2 \ell}{\partial \boldsymbol{\beta} \, \partial \boldsymbol{\beta}^T}\right]$$

Its inverse $\mathcal{I}^{-1}(\boldsymbol{\beta})$ gives the asymptotic covariance matrix of the MLE, used to compute standard errors and confidence intervals.
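Continuing the one-parameter Poisson example from Section 1.2 (toy data assumed), a numerical second derivative of the log-likelihood reproduces the analytic Fisher information $n/\lambda$, and its inverse square root gives the standard error:

```python
import math

def loglik(lam, ys):
    """Poisson log-likelihood for i.i.d. observations."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in ys)

ys = [2, 3, 1, 4, 2]
lam_hat = sum(ys) / len(ys)                     # MLE = 2.4

# Observed information = minus the second derivative, by central differences:
h = 1e-4
info = -(loglik(lam_hat + h, ys) - 2 * loglik(lam_hat, ys)
         + loglik(lam_hat - h, ys)) / h ** 2

se = math.sqrt(1 / info)                        # asymptotic standard error
print(info, len(ys) / lam_hat)                  # both ≈ 2.083
```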


2. What are Generalized Linear Models?

Generalized Linear Models (GLMs) are a unified class of regression models that extend ordinary linear regression to accommodate response variables with non-normal distributions. Introduced by Nelder and Wedderburn (1972), GLMs provide a coherent framework for modelling a wide variety of outcome types — counts, proportions, binary outcomes, continuous positive values, and more — using a single, elegant mathematical structure.

2.1 The Central Idea

Ordinary linear regression assumes the response $Y$ is normally distributed and that the mean $\mu$ equals the linear predictor directly: $\mu = \eta$. GLMs relax both restrictions:

  1. The distribution of $Y$ can be any member of the exponential family (Normal, Binomial, Poisson, Gamma, Inverse Gaussian, etc.).
  2. The link function $g(\mu) = \eta$ can be any monotone, differentiable function that maps the mean $\mu$ (constrained to its natural range) to the real line $(-\infty, +\infty)$.

This two-step generalisation unlocks an enormous range of practical modelling scenarios while preserving the interpretability of regression coefficients.

2.2 Real-World Applications

GLMs are among the most widely used statistical models in applied science, business, and policy; concrete applications for each distribution are listed in Section 6.

2.3 The Three Components of a GLM

Every GLM is fully specified by three components:

| Component | Symbol | Description | Example (Logistic Regression) |
|---|---|---|---|
| Random Component | $f(y; \theta, \phi)$ | Distribution of $Y$ from the exponential family | Binomial$(n, p)$ |
| Systematic Component | $\eta = \mathbf{x}^T\boldsymbol{\beta}$ | Linear predictor | $\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$ |
| Link Function | $g(\mu) = \eta$ | Connects $E[Y] = \mu$ to $\eta$ | Logit: $\ln(p/(1-p))$ |

2.4 How GLMs Generalise Linear Regression

| Feature | Linear Regression | GLM |
|---|---|---|
| Response distribution | Normal only | Any exponential family |
| Link function | Identity ($\mu = \eta$) | Any valid link $g(\mu) = \eta$ |
| Variance | Constant $\sigma^2$ | Function of $\mu$: $\text{Var}(Y) = \phi \cdot V(\mu)$ |
| Estimation | OLS (closed form) | MLE via IRLS (iterative) |
| Goodness of fit | $R^2$, $F$-test | Deviance, AIC, likelihood ratio tests |
| Residuals | Raw, standardised | Pearson, deviance, Anscombe |

3. The Mathematical Framework of GLMs

3.1 The Three-Component Structure in Detail

A GLM specifies that the $i$-th response $Y_i$ has:

Random Component: $Y_i \sim f(y_i; \theta_i, \phi)$, an exponential family distribution,

with mean $E[Y_i] = \mu_i$ and variance $\text{Var}(Y_i) = \phi \cdot V(\mu_i)$, where $V(\mu)$ is the variance function (Section 3.2) and $\phi$ is the dispersion parameter (Section 3.3).

Systematic Component: $\eta_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} = \mathbf{x}_i^T \boldsymbol{\beta}$

Link Function: $g(\mu_i) = \eta_i$

So the mean is related to the predictors through:

$$\mu_i = g^{-1}(\eta_i) = g^{-1}(\mathbf{x}_i^T \boldsymbol{\beta})$$

Where $g^{-1}$ is the inverse link function (also called the mean function or response function).

3.2 The Variance Function $V(\mu)$

The variance function characterises how the variance of the response depends on its mean. Each exponential family distribution has a specific variance function:

| Distribution | $V(\mu)$ | Interpretation |
|---|---|---|
| Normal | $1$ | Variance is constant (homoscedastic) |
| Binomial | $\mu(1-\mu)$ | Variance is bell-shaped, maximum at $\mu = 0.5$ |
| Poisson | $\mu$ | Variance equals the mean |
| Gamma | $\mu^2$ | Variance is proportional to the square of the mean |
| Inverse Gaussian | $\mu^3$ | Variance grows as the cube of the mean |
| Negative Binomial | $\mu + \mu^2/k$ | Variance exceeds the mean (overdispersion) |
| Tweedie | $\mu^p$ | Power variance function; $p \in (1,2)$ for compound Poisson-Gamma |

3.3 The Dispersion Parameter $\phi$

The full variance of $Y_i$ is:

$$\text{Var}(Y_i) = \frac{\phi \cdot V(\mu_i)}{w_i}$$

Where $w_i$ is a known prior weight (e.g., $w_i = n_i$ for binomial proportions). The dispersion parameter $\phi$ scales the variance: it is fixed at $1$ for the Binomial and Poisson distributions, and estimated from the data for the Gaussian, Gamma, and Inverse Gaussian distributions (see Section 8.5).

3.4 The Canonical Link

For each distribution, there is a canonical link function that arises naturally from the mathematical structure of the exponential family. Using the canonical link has desirable statistical properties (sufficient statistics, simpler score equations):

| Distribution | Canonical Link | $g(\mu)$ |
|---|---|---|
| Normal | Identity | $\mu$ |
| Binomial | Logit | $\ln\left(\frac{\mu}{1-\mu}\right)$ |
| Poisson | Log | $\ln(\mu)$ |
| Gamma | Inverse | $\frac{1}{\mu}$ |
| Inverse Gaussian | Inverse squared | $\frac{1}{\mu^2}$ |

Non-canonical links can also be used and may be more interpretable in certain applications. The canonical link is the default in most GLM software but is not obligatory.

3.5 The Offset

An offset is a term added to the linear predictor with a fixed coefficient of 1:

$$\eta_i = \mathbf{x}_i^T \boldsymbol{\beta} + \text{offset}_i$$

Offsets are used when the response is a rate and observations have different exposure times or population sizes. The offset is known (not estimated) and is included on the linear predictor scale.

Example: Modelling disease incidence counts $Y_i$ for regions with different population sizes $N_i$. The rate per person is $\lambda_i = \mu_i / N_i$. Using a Poisson model with log link:

$$\ln(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta} + \ln(N_i)$$

Where $\ln(N_i)$ is the offset. The model then estimates the log rate $\ln(\lambda_i) = \mathbf{x}_i^T \boldsymbol{\beta}$.
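A small sketch of the offset mechanics (the coefficients below are invented for illustration, not from a fitted model): because the offset enters the linear predictor with coefficient 1, the expected count scales exactly with the exposure.

```python
import math

# A hypothetical fitted Poisson rate model with log link and population offset:
# log(mu_i) = b0 + b1 * x_i + log(N_i).  Coefficients are invented.
b0, b1 = -7.0, 0.3

def expected_count(x, population):
    return math.exp(b0 + b1 * x + math.log(population))

# Doubling the population doubles the expected count at the same covariate value:
c1 = expected_count(x=2.0, population=10_000)
c2 = expected_count(x=2.0, population=20_000)
print(c2 / c1)   # ≈ 2.0
```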


4. The Exponential Family of Distributions

The exponential family is a broad class of distributions that share a common mathematical form, which is the foundation of the GLM framework.

4.1 The Exponential Family Form

A distribution belongs to the exponential family if its probability density (or mass) function can be written as:

$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$$

Where:

  - $\theta$ is the natural (canonical) parameter,
  - $b(\theta)$ is the cumulant function,
  - $a(\phi)$ is the dispersion function, typically $a(\phi) = \phi / w$ for a known weight $w$,
  - $c(y, \phi)$ is a normalising term that does not depend on $\theta$.

Key relationships derived from $b(\theta)$:

$$\mu = E[Y] = b'(\theta) \quad \text{(first derivative of } b\text{)}$$

$$\text{Var}(Y) = a(\phi) \cdot b''(\theta) = \phi \cdot V(\mu) \quad \text{(second derivative of } b\text{)}$$

This elegant structure means that all moments of the distribution follow automatically from the cumulant function $b(\theta)$.
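These relationships can be checked numerically. For the Poisson family, $\theta = \ln(\mu)$ and $b(\theta) = e^\theta$, so both $b'(\theta)$ and $b''(\theta)$ should return the mean, confirming $V(\mu) = \mu$ (a quick finite-difference sketch):

```python
import math

# Poisson in exponential-family form: theta = log(mu), b(theta) = exp(theta).
b = math.exp                       # cumulant function
theta = math.log(3.0)              # natural parameter for mu = 3

h = 1e-4
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)               # ≈ mu
b_double = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2  # ≈ V(mu)

print(b_prime, b_double)           # both ≈ 3.0, since V(mu) = mu for Poisson
```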

4.2 Major Exponential Family Distributions for GLMs

4.2.1 Normal Distribution

$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}$$

4.2.2 Binomial Distribution

$$f(y; n, p) = \binom{n}{y} p^y (1-p)^{n-y}, \quad y \in \{0, 1, \dots, n\}$$

4.2.3 Poisson Distribution

$$f(y; \lambda) = \frac{e^{-\lambda} \lambda^y}{y!}, \quad y \in \{0, 1, 2, \dots\}$$

4.2.4 Gamma Distribution

$$f(y; \alpha, \beta) = \frac{y^{\alpha-1} e^{-y\beta} \beta^\alpha}{\Gamma(\alpha)}, \quad y > 0$$

4.2.5 Inverse Gaussian Distribution

$$f(y; \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi y^3}} \exp\left\{-\frac{\lambda(y-\mu)^2}{2\mu^2 y}\right\}, \quad y > 0$$

4.3 The Negative Binomial Distribution

While the Negative Binomial is not a member of the exponential family in its most general form, it can be treated as a quasi-exponential family or as a Poisson-Gamma mixture:

$$f(y; \mu, k) = \frac{\Gamma(y+k)}{\Gamma(k)\, y!} \left(\frac{k}{k+\mu}\right)^k \left(\frac{\mu}{k+\mu}\right)^y, \quad y \in \{0, 1, 2, \dots\}$$

4.4 The Tweedie Distribution

The Tweedie distribution is a special case of the exponential dispersion model with power variance function $V(\mu) = \mu^p$:

| Power $p$ | Distribution |
|---|---|
| $p = 0$ | Normal |
| $p = 1$ | Poisson |
| $1 < p < 2$ | Compound Poisson-Gamma (supports exact zeros + positive values) |
| $p = 2$ | Gamma |
| $p = 3$ | Inverse Gaussian |

The Tweedie distribution with $1 < p < 2$ is particularly valuable in insurance (pure premium modelling) and ecology (biomass data) because it naturally accommodates data with a mass at zero and a continuous positive distribution for non-zero values.


5. Link Functions

The link function $g(\mu) = \eta$ is the bridge between the mean of the response distribution and the linear predictor. Choosing an appropriate link function is a key modelling decision.

5.1 Requirements for a Valid Link Function

A valid link function must be:

  1. Monotone: Strictly increasing or decreasing.
  2. Differentiable: $g'(\mu)$ must exist and be non-zero.
  3. Range-compatible: The range of $g^{-1}(\eta)$ must match the support of the mean $\mu$.

5.2 Commonly Used Link Functions

| Link Name | $g(\mu) = \eta$ | $\mu = g^{-1}(\eta)$ | Range of $\mu$ | Canonical For |
|---|---|---|---|---|
| Identity | $\mu$ | $\eta$ | $(-\infty, +\infty)$ | Normal |
| Log | $\ln(\mu)$ | $e^\eta$ | $(0, +\infty)$ | Poisson |
| Logit | $\ln\left(\frac{\mu}{1-\mu}\right)$ | $\frac{e^\eta}{1+e^\eta}$ | $(0, 1)$ | Binomial |
| Probit | $\Phi^{-1}(\mu)$ | $\Phi(\eta)$ | $(0, 1)$ | Binomial (alt) |
| Complementary log-log (cloglog) | $\ln(-\ln(1-\mu))$ | $1 - e^{-e^\eta}$ | $(0, 1)$ | Binomial (alt) |
| Inverse | $1/\mu$ | $1/\eta$ | $(0, +\infty)$ | Gamma |
| Inverse squared | $1/\mu^2$ | $1/\sqrt{\eta}$ | $(0, +\infty)$ | Inverse Gaussian |
| Square root | $\sqrt{\mu}$ | $\eta^2$ | $(0, +\infty)$ | Poisson (alt) |
| Negative log | $-\ln(\mu)$ | $e^{-\eta}$ | $(0, +\infty)$ | — |
| Log-log | $-\ln(-\ln(\mu))$ | $e^{-e^{-\eta}}$ | $(0, 1)$ | Binomial (alt) |

5.3 Logit Link (Binomial GLM)

$$g(\mu) = \text{logit}(\mu) = \ln\left(\frac{\mu}{1-\mu}\right)$$

5.4 Probit Link (Binomial GLM)

$$g(\mu) = \Phi^{-1}(\mu)$$

Where $\Phi^{-1}$ is the quantile function of the standard normal distribution.

5.5 Complementary Log-Log Link (Binomial GLM)

$$g(\mu) = \ln(-\ln(1-\mu))$$

5.6 Log Link (Poisson, Negative Binomial, Gamma GLM)

$$g(\mu) = \ln(\mu)$$

5.7 Inverse Link (Gamma GLM)

$$g(\mu) = \frac{1}{\mu}$$

5.8 Choosing the Link Function

| Scenario | Recommended Link |
|---|---|
| Binomial: symmetric probability, easy odds interpretation | Logit |
| Binomial: latent normal model assumed | Probit |
| Binomial: rare events, extreme probabilities | Complementary log-log |
| Binomial: log-linear probability model needed | Log (with care about the $\mu > 1$ constraint) |
| Poisson / Negative Binomial / Gamma: multiplicative effects | Log |
| Gamma: when inverse relationships are natural | Inverse |
| Normal / continuous unbounded | Identity |
| Positive continuous: when multiplicative effects expected | Log |

6. GLM Distributions and Their Applications

6.1 Binomial GLM (Logistic, Probit, Cloglog Regression)

Use when: Response is a binary outcome (0/1, yes/no, success/failure) or a proportion $y/n$ where both $y$ (successes) and $n$ (trials) are known.

Model:

$$Y_i \sim \text{Binomial}(n_i, \mu_i), \quad g(\mu_i) = \eta_i = \mathbf{x}_i^T \boldsymbol{\beta}$$

Default link: Logit.

Interpretation (logit link): $e^{\beta_j}$ is the odds ratio — the multiplicative change in the odds of success for a one-unit increase in $X_j$.

Special cases:

  - Bernoulli (binary) regression: $n_i = 1$ for every observation (standard logistic regression).
  - Grouped binomial regression: $y_i$ successes out of $n_i > 1$ trials per row.

Applications: Disease diagnosis, credit default, customer churn, election outcomes, clinical trial response rates.
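To make the odds-ratio interpretation concrete, here is a tiny sketch with an invented fitted model (the coefficients are assumptions for illustration, not output from any real fit):

```python
import math

# Hypothetical fitted logistic model: logit(p) = -1.0 + 0.8 * x (invented)
b0, b1 = -1.0, 0.8

def prob(x):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(x):
    p = prob(x)
    return p / (1 - p)

# The odds ratio for a one-unit increase in x is exp(b1), at any baseline x:
print(odds(1.0) / odds(0.0), math.exp(b1))   # both ≈ 2.2255
```

Note that the odds ratio is constant in $x$, while the change in probability itself is not; that is exactly what the logit link buys.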

6.2 Poisson GLM (Poisson Regression)

Use when: Response is a count of events that could in principle be any non-negative integer, arising from a process with a constant rate.

Model:

$$Y_i \sim \text{Poisson}(\mu_i), \quad \ln(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta} + \text{offset}_i$$

Default link: Log.

Interpretation (log link): $e^{\beta_j}$ is the rate ratio (or incidence rate ratio) — the multiplicative change in the expected count for a one-unit increase in $X_j$.

Key assumption: $E[Y_i] = \text{Var}(Y_i) = \mu_i$ (equidispersion). Violations lead to overdispersion (Section 13).

Applications: Number of accidents, hospital admissions, species counts, web page visits, insurance claims frequency.

6.3 Negative Binomial GLM

Use when: Response is a count variable with overdispersion (variance exceeds the mean) — the most common departure from Poisson assumptions.

Model:

$$Y_i \sim \text{NegBin}(\mu_i, k), \quad \ln(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta}$$

$$\text{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{k}$$

Interpretation: Same as Poisson (log link, rate ratios), but with an additional overdispersion parameter $k$ estimated from the data.

Applications: Same as Poisson but when the Poisson assumption of equidispersion is violated — common in ecology (species abundance), healthcare (hospitalisation counts), and insurance.

6.4 Gamma GLM

Use when: Response is continuous and strictly positive, with variance that increases proportionally to the square of the mean (coefficient of variation is roughly constant).

Model:

$$Y_i \sim \text{Gamma}(\mu_i, \phi), \quad g(\mu_i) = \eta_i$$

Common links: Log (most interpretable), inverse (canonical), identity.

Interpretation (log link): $e^{\beta_j}$ is the multiplicative change in the mean response per unit increase in $X_j$.

Applications: Insurance claim severity (cost per claim), income, hospital costs, reaction times, survival times (without censoring), environmental concentrations.

6.5 Inverse Gaussian GLM

Use when: Response is continuous, strictly positive, and highly right-skewed, with variance increasing as the cube of the mean — more extreme than Gamma.

Model:

$$Y_i \sim \text{InverseGaussian}(\mu_i, \phi), \quad g(\mu_i) = \eta_i$$

Common links: Inverse squared (canonical), log, inverse.

Applications: First-passage times, repair times, extreme claim sizes, some types of survival data.

6.6 Gaussian GLM (Standard Linear Regression)

Use when: Response is continuous, unbounded, approximately normally distributed, with constant variance.

Model:

$$Y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \mathbf{x}_i^T \boldsymbol{\beta}$$

This is identical to OLS with the identity link. Including it in the GLM framework confirms that linear regression is a special case of GLMs.

Applications: All classical linear regression applications.

6.7 Tweedie GLM

Use when: Response contains exact zeros mixed with positive continuous values (a "zero-inflated" continuous distribution), or when the appropriate power variance is uncertain.

Model:

$$Y_i \sim \text{Tweedie}(\mu_i, \phi, p), \quad \ln(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta}$$

The power $p \in (1, 2)$ is estimated from the data or set by domain knowledge.

Applications: Insurance pure premium (frequency × severity), rainfall amounts, ecological biomass, fisheries catch data.

6.8 Quasi-GLMs

When the distributional assumption is uncertain or violated, quasi-GLMs relax the full distributional assumption and specify only the mean and variance function:

$$E[Y_i] = \mu_i, \quad \text{Var}(Y_i) = \phi \cdot V(\mu_i)$$

The dispersion parameter $\phi$ is estimated from the data (not fixed at 1), providing valid inference even when the count data are overdispersed or underdispersed.

Common quasi-GLMs:

  - Quasi-Poisson: $V(\mu) = \mu$ with estimated $\phi$, for overdispersed counts.
  - Quasi-Binomial: $V(\mu) = \mu(1-\mu)$ with estimated $\phi$, for overdispersed proportions.

⚠️ Quasi-GLMs do not have a full likelihood, so AIC/BIC cannot be computed. Use deviance and F-tests for model comparison instead.

6.9 Summary of GLM Distributions

| Distribution | Response Type | Variance $V(\mu)$ | Default Link | Dispersion $\phi$ |
|---|---|---|---|---|
| Gaussian | Continuous, unbounded | $1$ | Identity | Estimated |
| Binomial | Binary / Proportions | $\mu(1-\mu)$ | Logit | Known ($=1$) |
| Poisson | Counts (integer $\geq 0$) | $\mu$ | Log | Known ($=1$) |
| Negative Binomial | Counts (overdispersed) | $\mu + \mu^2/k$ | Log | Estimated ($k$) |
| Gamma | Continuous, positive | $\mu^2$ | Log / Inverse | Estimated |
| Inverse Gaussian | Continuous, positive, skewed | $\mu^3$ | Inv. squared / Log | Estimated |
| Tweedie | Zero-inflated positive | $\mu^p$ | Log | Estimated ($p$, $\phi$) |
| Quasi-Poisson | Counts (overdispersed) | $\mu$ | Log | Estimated |
| Quasi-Binomial | Proportions (overdispersed) | $\mu(1-\mu)$ | Logit | Estimated |

7. Assumptions of GLMs

7.1 Correct Distributional Family

The chosen distribution must be appropriate for the type of response variable. For example:

  - Binary outcomes or proportions → Binomial.
  - Non-negative integer counts → Poisson or Negative Binomial.
  - Strictly positive continuous values → Gamma or Inverse Gaussian.
  - Unbounded continuous values → Gaussian.

How to check: Inspect the response variable's distribution (histogram, range), consider the data-generating process, and verify using residual diagnostics and goodness-of-fit tests.

7.2 Correct Link Function

The link function must be appropriate for the chosen distribution and the expected relationship between predictors and the mean.

How to check: Inspect residual plots; compare alternative link functions using AIC; use added-variable plots for the link function.

7.3 Linearity on the Link Scale

GLMs assume a linear relationship between the predictors $\mathbf{x}_i$ and the transformed mean $g(\mu_i)$:

$$g(\mu_i) = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip}$$

This means the relationship between each $X_j$ and $g(\mu)$ must be linear, even if the relationship between $X_j$ and $\mu$ itself is non-linear.

How to check: Partial residual plots (component-plus-residual plots); LOESS-smoothed plots of residuals vs. each predictor.

7.4 Independence of Observations

Observations must be independent of each other. Clustered, longitudinal, or spatial data may have within-group correlations that violate this assumption.

How to check: Consider the study design. For clustered data, use Generalised Estimating Equations (GEE) or mixed models (GLMM) instead.

7.5 Correct Specification of the Variance Function

The variance function must correctly describe how variability changes with the mean. Misspecification leads to:

  - Incorrect standard errors (typically too small when overdispersion is ignored).
  - Misleading $p$-values and confidence intervals with the wrong coverage.

How to check: Residual vs. fitted value plots; scale-location plots; Pearson $\chi^2$ / deviance tests for dispersion.

7.6 No Perfect Multicollinearity

As in OLS, perfect multicollinearity (one predictor is a perfect linear function of others) prevents estimation. Near-multicollinearity inflates standard errors.

How to check: Variance Inflation Factor (VIF); condition number of the design matrix.

7.7 No Complete Separation (for Binomial GLMs)

For logistic regression and other binomial GLMs, complete separation (a predictor or combination perfectly predicts the outcome) causes the MLE to diverge to $\pm\infty$.

How to check: Warning messages from the fitting algorithm; extremely large coefficient estimates with huge standard errors.

7.8 Sufficient Sample Size

GLM inference is based on asymptotic (large-sample) theory. The adequacy of asymptotic approximations depends on:

  - The total sample size relative to the number of parameters.
  - The expected counts (Poisson) or expected numbers of successes and failures (Binomial) per observation or group.

Small expected counts reduce the reliability of likelihood ratio tests, Wald tests, and residual diagnostics.


8. Parameter Estimation: Maximum Likelihood and IRLS

8.1 The Log-Likelihood for GLMs

For independent observations, the log-likelihood is:

$$\ell(\boldsymbol{\beta}; \mathbf{y}) = \sum_{i=1}^n \ell_i(\boldsymbol{\beta}; y_i) = \sum_{i=1}^n \left[\frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right]$$

Where $\theta_i = \theta(\mu_i)$ and $\mu_i = g^{-1}(\mathbf{x}_i^T \boldsymbol{\beta})$ depend on the regression coefficients $\boldsymbol{\beta}$ through the link function.

8.2 The Score Equations

Setting the gradient of the log-likelihood to zero gives the score equations (MLE conditions):

$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^n \frac{(y_i - \mu_i)}{a(\phi)\, V(\mu_i)} \frac{\partial \mu_i}{\partial \eta_i} x_{ij} = 0, \quad j = 0, 1, \dots, p$$

In matrix form:

$$\mathbf{X}^T \mathbf{W} \mathbf{G} (\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$$

Where $\mathbf{G} = \text{diag}(g'(\mu_i))$ and $\mathbf{W} = \text{diag}\left(w_i / \left(V(\mu_i)\, [g'(\mu_i)]^2\right)\right)$, so that $\mathbf{W}\mathbf{G}$ has diagonal entries $w_i (\partial \mu_i / \partial \eta_i) / V(\mu_i)$, matching the score equation above.

These equations are generally non-linear in $\boldsymbol{\beta}$ and require iterative solution.

8.3 Iteratively Reweighted Least Squares (IRLS)

The standard algorithm for fitting GLMs is Iteratively Reweighted Least Squares (IRLS), a Fisher scoring (Newton-Raphson-type) optimisation of the log-likelihood.

At each iteration $t$:

Step 1: Compute the adjusted dependent variable (working response):

$$z_i^{(t)} = \hat{\eta}_i^{(t)} + (y_i - \hat{\mu}_i^{(t)}) \left.\frac{d\eta_i}{d\mu_i}\right|_{\hat{\mu}_i^{(t)}} = \hat{\eta}_i^{(t)} + (y_i - \hat{\mu}_i^{(t)})\, g'(\hat{\mu}_i^{(t)})$$

Step 2: Compute the working weights:

$$w_i^{(t)} = \frac{w_i}{V(\hat{\mu}_i^{(t)}) \left[g'(\hat{\mu}_i^{(t)})\right]^2}$$

Step 3: Solve the weighted least squares problem:

$$\boldsymbol{\beta}^{(t+1)} = \left(\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}$$

Convergence: Repeat until $\|\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)}\| < \epsilon$ (e.g., $\epsilon = 10^{-8}$) or the change in deviance is negligible.

Starting values: Typically $\hat{\mu}_i^{(0)} = y_i + \delta$ (a small constant to avoid boundary issues) or the overall mean $\bar{y}$.
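The three steps above can be sketched end-to-end for the simplest interesting case: a Poisson GLM with log link and one predictor, where the weighted least-squares step reduces to a hand-solved 2x2 normal-equation system. The data are toys and this is a sketch, not production code:

```python
import math

def irls_poisson(xs, ys, tol=1e-10, max_iter=50):
    """IRLS for a Poisson GLM with log link: intercept + one slope."""
    b0, b1 = 0.0, 0.0                        # start at eta = 0, i.e. mu = 1
    for _ in range(max_iter):
        eta = [b0 + b1 * x for x in xs]
        mu = [math.exp(e) for e in eta]
        # Log link: g'(mu) = 1/mu and V(mu) = mu, so the working weight is mu
        w = mu
        # Working response: z = eta + (y - mu) * g'(mu)
        z = [e + (y - m) / m for e, m, y in zip(eta, mu, ys)]

        # Weighted least squares of z on (1, x): 2x2 normal equations
        sw   = sum(w)
        swx  = sum(wi * x for wi, x in zip(w, xs))
        swxx = sum(wi * x * x for wi, x in zip(w, xs))
        swz  = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * x * zi for wi, x, zi in zip(w, xs, z))
        det = sw * swxx - swx * swx
        nb0 = (swxx * swz - swx * swxz) / det
        nb1 = (sw * swxz - swx * swz) / det

        if abs(nb0 - b0) + abs(nb1 - b1) < tol:
            return nb0, nb1
        b0, b1 = nb0, nb1
    return b0, b1

xs = [0, 1, 2, 3, 4, 5]          # toy predictor
ys = [2, 2, 3, 4, 5, 8]          # toy counts (invented)
b0, b1 = irls_poisson(xs, ys)
mu_hat = [math.exp(b0 + b1 * x) for x in xs]
# At convergence the canonical-link score equations hold: sum(y - mu) ≈ 0
print(b0, b1, sum(y - m for y, m in zip(ys, mu_hat)))
```

The check at the end is the canonical-link property from Section 8.2: at the MLE, the weighted residuals are orthogonal to every column of the design matrix.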

8.4 The Fisher Information Matrix and Standard Errors

At convergence, the Fisher information matrix evaluated at $\hat{\boldsymbol{\beta}}$ is:

$$\mathcal{I}(\hat{\boldsymbol{\beta}}) = \mathbf{X}^T \hat{\mathbf{W}} \mathbf{X}$$

Where $\hat{\mathbf{W}}$ is the weight matrix evaluated at $\hat{\boldsymbol{\beta}}$. The asymptotic covariance matrix of $\hat{\boldsymbol{\beta}}$ is:

$$\text{Cov}(\hat{\boldsymbol{\beta}}) = \phi \left(\mathbf{X}^T \hat{\mathbf{W}} \mathbf{X}\right)^{-1}$$

The standard error of $\hat{\beta}_j$:

$$SE(\hat{\beta}_j) = \sqrt{\hat{\phi} \left[\left(\mathbf{X}^T \hat{\mathbf{W}} \mathbf{X}\right)^{-1}\right]_{jj}}$$

For known-dispersion models (Poisson, Binomial with $\phi = 1$):

$$SE(\hat{\beta}_j) = \sqrt{\left[\left(\mathbf{X}^T \hat{\mathbf{W}} \mathbf{X}\right)^{-1}\right]_{jj}}$$

8.5 Estimating the Dispersion Parameter

For distributions with estimated dispersion (Normal, Gamma, Inverse Gaussian), $\phi$ is estimated after obtaining $\hat{\boldsymbol{\beta}}$:

Method of Moments (Pearson $\chi^2$):

$$\hat{\phi}_{\text{Pearson}} = \frac{1}{n - p - 1} \sum_{i=1}^n \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}$$

Maximum Likelihood / Deviance Estimator:

$$\hat{\phi}_{\text{deviance}} = \frac{D(\mathbf{y}, \hat{\boldsymbol{\mu}})}{n - p - 1}$$

Where $D(\mathbf{y}, \hat{\boldsymbol{\mu}})$ is the residual deviance (see Section 9).

💡 The Pearson $\chi^2$ estimator of $\phi$ is generally preferred for its robustness. For Poisson and Binomial, $\phi = 1$ is known; if the Pearson estimator gives $\hat{\phi} \gg 1$, overdispersion is present (Section 13).
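A minimal dispersion check for a Poisson fit (the fitted means below are invented for illustration): compute the Pearson estimate and compare it with 1.

```python
# Pearson dispersion estimate; mu_hat values are assumed toy fitted means.
ys     = [2, 2, 3, 4, 5, 8]
mu_hat = [1.9, 2.4, 3.1, 4.0, 5.1, 6.6]
n, p = len(ys), 1                          # one slope plus an intercept

phi = sum((y - m) ** 2 / m for y, m in zip(ys, mu_hat)) / (n - p - 1)
print(phi)        # values well above 1 would signal overdispersion
```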


9. Model Fit and Evaluation

9.1 The Deviance

The deviance $D(\mathbf{y}, \hat{\boldsymbol{\mu}})$ is the primary goodness-of-fit measure for GLMs. It is defined as twice the log-likelihood difference between the saturated model (perfect fit, one parameter per observation) and the fitted model:

$$D(\mathbf{y}, \hat{\boldsymbol{\mu}}) = 2\left[\ell(\text{saturated}) - \ell(\hat{\boldsymbol{\mu}})\right] = \sum_{i=1}^n d_i$$

Where the deviance contribution $d_i$ of observation $i$ is twice its contribution to the log-likelihood difference; for Poisson:

$$d_i = 2\left[y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i)\right]$$

Deviance for each distribution:

| Distribution | Deviance Contribution $d_i$ |
|---|---|
| Gaussian | $(y_i - \hat{\mu}_i)^2$ |
| Binomial | $2\left[y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) + (n_i-y_i)\ln\left(\frac{n_i-y_i}{n_i-\hat{\mu}_i}\right)\right]$ |
| Poisson | $2\left[y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i)\right]$ |
| Gamma | $2\left[-\ln\left(\frac{y_i}{\hat{\mu}_i}\right) + \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i}\right]$ |
| Inverse Gaussian | $\frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i^2 y_i}$ |

9.2 Null Deviance and Residual Deviance

The null deviance $D_0$ is the deviance of the null model (intercept only):

$$D_0 = D(\mathbf{y}, \hat{\mu}_0) \quad \text{where } \hat{\mu}_0 = \bar{y}$$

The residual deviance $D_r$ is the deviance of the fitted model:

$$D_r = D(\mathbf{y}, \hat{\boldsymbol{\mu}})$$

The difference $D_0 - D_r$ measures how much the predictors have reduced the unexplained deviance — analogous to the regression sum of squares in linear regression.

For known-dispersion models (Poisson, Binomial), the residual deviance approximately follows $\chi^2_{n-p-1}$ when the model is correct. A residual deviance much larger than $n - p - 1$ suggests poor fit or overdispersion.
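The Poisson deviance formula can be computed directly; a small sketch (toy data) showing that the saturated fit has zero deviance while a cruder constant fit does not:

```python
import math

def poisson_deviance(ys, mus):
    """Residual deviance for a Poisson fit (0 * log(0) treated as 0)."""
    d = 0.0
    for y, m in zip(ys, mus):
        term = y * math.log(y / m) if y > 0 else 0.0
        d += 2 * (term - (y - m))
    return d

ys = [2.0, 3.0, 5.0]
print(poisson_deviance(ys, ys))              # saturated fit: deviance 0
print(poisson_deviance(ys, [3.0, 3.0, 3.0])) # a cruder constant fit: > 0
```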

9.3 The Pearson χ2\chi^2 Statistic

The Pearson χ2\chi^2 statistic provides an alternative goodness-of-fit measure:

X2=i=1n(yiμ^i)2V(μ^i)/wiX^2 = \sum_{i=1}^n \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)/w_i}

Under the correct model with known dispersion, X2χnp12X^2 \approx \chi^2_{n-p-1} for large samples. The Pearson dispersion estimate is ϕ^=X2/(np1)\hat{\phi} = X^2 / (n - p - 1).

9.4 Pseudo R² Measures

Since ordinary $R^2$ is not directly meaningful for non-Gaussian GLMs, several pseudo R² measures have been developed:

McFadden's Pseudo R²:

$$R^2_{\text{McFadden}} = 1 - \frac{\ell(\hat{\boldsymbol{\mu}})}{\ell(\hat{\mu}_0)}$$

Cox-Snell Pseudo R²:

$$R^2_{\text{CS}} = 1 - \left(\frac{L_0}{L_{\hat{\boldsymbol{\mu}}}}\right)^{2/n}$$

Nagelkerke Pseudo R² (scaled to reach a maximum of 1):

$$R^2_{\text{N}} = \frac{R^2_{\text{CS}}}{1 - L_0^{2/n}}$$

Deviance R² (common in the GLM literature):

$$R^2_D = 1 - \frac{D_r}{D_0}$$

The deviance R² coincides with McFadden's R² whenever the saturated log-likelihood is zero (e.g., ungrouped binary logistic regression); in general the two differ slightly.

Interpretation of McFadden's $R^2$ for GLMs:

| $R^2_{\text{McFadden}}$ | Interpretation |
|---|---|
| $0.00 - 0.10$ | Poor fit |
| $0.10 - 0.20$ | Acceptable fit |
| $0.20 - 0.30$ | Good fit |
| $0.30 - 0.40$ | Very good fit |
| $> 0.40$ | Excellent fit |
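With the null and residual deviances in hand, the deviance R² is a one-liner (the numbers below are invented):

```python
# Deviance (McFadden-style) pseudo R^2 from toy null and residual deviances:
D0, Dr = 120.0, 84.0
r2 = 1 - Dr / D0
print(r2)   # ≈ 0.30: "very good fit" by the rule-of-thumb table above
```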

9.5 AIC and BIC

Akaike Information Criterion:

$$AIC = -2\ell(\hat{\boldsymbol{\mu}}) + 2(p+1)$$

Bayesian Information Criterion:

$$BIC = -2\ell(\hat{\boldsymbol{\mu}}) + (p+1)\ln(n)$$

Where $p + 1$ is the number of estimated regression parameters (including the intercept). For models where $\phi$ is estimated, include it as an additional parameter.

Lower AIC/BIC indicates a better model (adjusted for complexity). AIC favours predictive accuracy; BIC imposes a stronger penalty for complexity and prefers parsimonious models.

⚠️ AIC and BIC require a proper likelihood. They cannot be computed for quasi-GLMs, which use a pseudo-likelihood. For quasi-models, use the F-test for model comparison.
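Both criteria are direct formulas; a quick sketch with an invented log-likelihood and parameter count:

```python
import math

# AIC and BIC from a log-likelihood and a parameter count (invented numbers):
def aic(loglik, n_params):
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n_obs):
    return -2 * loglik + n_params * math.log(n_obs)

ll, k, n = -104.2, 4, 200
print(aic(ll, k), bic(ll, k, n))
# BIC's penalty k*ln(n) exceeds AIC's 2k whenever n is above e^2 ≈ 7.4.
```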


10. Hypothesis Testing and Inference

10.1 Wald Tests for Individual Coefficients

For each coefficient $\beta_j$, the Wald test tests $H_0: \beta_j = 0$:

$$z_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim \mathcal{N}(0, 1) \quad \text{(approximately, for large } n\text{)}$$

Two-sided p-value:

$$p\text{-value} = 2 \times P(Z > |z_j|) = 2 \times (1 - \Phi(|z_j|))$$

A $(1-\alpha) \times 100\%$ Wald confidence interval for $\beta_j$:

$$\hat{\beta}_j \pm z_{\alpha/2} \times SE(\hat{\beta}_j)$$

For the effect on the original response scale (with a log or logit link), exponentiate the endpoints:

$$\left[e^{\hat{\beta}_j - z_{\alpha/2} SE(\hat{\beta}_j)}, \; e^{\hat{\beta}_j + z_{\alpha/2} SE(\hat{\beta}_j)}\right]$$
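A sketch of the Wald interval computation (the estimate and standard error are invented for illustration), including the exponentiated odds-ratio-scale interval:

```python
import math

# 95% Wald interval for a coefficient, then exponentiated (invented inputs):
beta_hat, se = 0.8, 0.25
z = 1.959964                     # standard-normal 0.975 quantile

lo, hi = beta_hat - z * se, beta_hat + z * se
or_lo, or_hi = math.exp(lo), math.exp(hi)
print((lo, hi), (or_lo, or_hi))
# The odds-ratio interval excludes 1, consistent with Wald z = 0.8/0.25 = 3.2
```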

10.2 Likelihood Ratio Test (LRT)

The likelihood ratio test compares two nested models: a smaller (restricted) model M0M_0 and a larger (full) model M1M_1:

Λ=2[(M1)(M0)]=D(M0)D(M1)\Lambda = 2\left[\ell(M_1) - \ell(M_0)\right] = D(M_0) - D(M_1)

Under H0H_0 (the restrictions hold), Λχdf2\Lambda \sim \chi^2_{df} where dfdf is the difference in the number of parameters between M1M_1 and M0M_0.

For testing a single coefficient (H0:βj=0H_0: \beta_j = 0), df=1df = 1. For testing a group of qq coefficients jointly, df=qdf = q.

💡 The LRT is generally preferred over the Wald test for GLMs because it is more accurate in small samples and avoids the Wald test's known deficiencies (e.g., the Hauck-Donner effect, where Wald zz-values can decrease for very large effects).
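Since the LRT statistic is simply the drop in deviance between the two fits, the computation is one line of arithmetic; this sketch assumes `scipy` is available for the $\chi^2$ tail probability:

```python
from scipy.stats import chi2

def lrt(deviance_restricted: float, deviance_full: float, df: int):
    """Likelihood ratio test from two nested-model deviances.

    The statistic is the drop in deviance; df is the number of
    restrictions (extra parameters in the full model)."""
    stat = deviance_restricted - deviance_full
    return stat, chi2.sf(stat, df)

# Illustrative: null deviance 641.3 vs. residual deviance 572.4 with 4 predictors
stat, p = lrt(641.3, 572.4, df=4)
```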

10.3 Score Test (Rao Test)

The score test evaluates whether the gradient of the log-likelihood (the score) is significantly different from zero at the restricted (null) parameter values:

S=s(β^0)TI1(β^0)s(β^0)χdf2S = \mathbf{s}(\hat{\boldsymbol{\beta}}_0)^T \mathcal{I}^{-1}(\hat{\boldsymbol{\beta}}_0) \mathbf{s}(\hat{\boldsymbol{\beta}}_0) \sim \chi^2_{df}

The score test only requires fitting the null model (not the full model), making it computationally convenient when the null model is much simpler.

10.4 Analysis of Deviance Table

The analysis of deviance is the GLM analogue of the ANOVA table in linear regression. It sequentially adds predictors and reports the reduction in deviance:

| Source | Df | Deviance | Residual Df | Residual Deviance | $p$-value |
|---|---|---|---|---|---|
| Null model | — | — | $n-1$ | $D_0$ | — |
| $X_1$ | 1 | $D_0 - D_1$ | $n-2$ | $D_1$ | $P(\chi^2_1 > D_0 - D_1)$ |
| $X_2 \mid X_1$ | 1 | $D_1 - D_2$ | $n-3$ | $D_2$ | $P(\chi^2_1 > D_1 - D_2)$ |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| $X_p \mid \text{rest}$ | 1 | $D_{p-1} - D_r$ | $n-p-1$ | $D_r$ | $P(\chi^2_1 > D_{p-1} - D_r)$ |

For overdispersed models (quasi-GLMs), use an F-test instead of the χ2\chi^2 test, dividing the deviance change by the estimated dispersion ϕ^\hat{\phi}:

F=(D0Dr)/pϕ^Fp,np1F = \frac{(D_0 - D_r)/p}{\hat{\phi}} \sim F_{p, n-p-1}

10.5 Confidence Intervals for the Mean Response

A confidence interval for the mean response μnew\mu_{new} at a new predictor vector xnew\mathbf{x}_{new} is constructed on the linear predictor scale (where the asymptotic normality applies) and back-transformed:

Linear predictor and its SE:

η^new=xnewTβ^,SE(η^new)=ϕ^xnewT(XTW^X)1xnew\hat{\eta}_{new} = \mathbf{x}_{new}^T \hat{\boldsymbol{\beta}}, \quad SE(\hat{\eta}_{new}) = \sqrt{\hat{\phi} \cdot \mathbf{x}_{new}^T (\mathbf{X}^T \hat{\mathbf{W}} \mathbf{X})^{-1} \mathbf{x}_{new}}

Confidence interval on the η\eta scale:

η^new±zα/2×SE(η^new)\hat{\eta}_{new} \pm z_{\alpha/2} \times SE(\hat{\eta}_{new})

Back-transform to the μ\mu scale using g1g^{-1}:

[g1(η^newzα/2×SE(η^new)),  g1(η^new+zα/2×SE(η^new))]\left[g^{-1}\left(\hat{\eta}_{new} - z_{\alpha/2} \times SE(\hat{\eta}_{new})\right), \; g^{-1}\left(\hat{\eta}_{new} + z_{\alpha/2} \times SE(\hat{\eta}_{new})\right)\right]

💡 Constructing confidence intervals on the link scale and back-transforming (rather than constructing them directly on the μ\mu scale) ensures the bounds respect the natural constraints of μ\mu (e.g., positivity for Poisson/Gamma, [0,1][0,1] for Binomial).
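A minimal sketch of this back-transformation for a log-link model (the linear-predictor value and its standard error below are illustrative inputs, and `mean_response_ci` is a hypothetical helper name):

```python
import math

def mean_response_ci(eta_hat: float, se_eta: float, inv_link, z_crit: float = 1.96):
    """CI for the mean response: build the interval on the linear-predictor
    scale, then map both endpoints through the inverse link g^{-1}."""
    lo = inv_link(eta_hat - z_crit * se_eta)
    hi = inv_link(eta_hat + z_crit * se_eta)
    return lo, hi

# Log link: both endpoints are guaranteed positive because exp(.) > 0.
lo, hi = mean_response_ci(-2.702, 0.25, math.exp)
```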

10.6 Profile Likelihood Confidence Intervals

Profile likelihood confidence intervals are more accurate than Wald intervals, especially in small samples or when the likelihood is asymmetric:

CIprofile={βj:2[(β^)(β^(βj))]χ1,α2}CI_{profile} = \left\{\beta_j : 2[\ell(\hat{\boldsymbol{\beta}}) - \ell(\hat{\boldsymbol{\beta}}_{(\beta_j)})] \leq \chi^2_{1, \alpha}\right\}

Where β^(βj)\hat{\boldsymbol{\beta}}_{(\beta_j)} is the MLE with βj\beta_j fixed at a test value. The DataStatPro application computes both Wald and profile likelihood CIs.


11. Model Diagnostics and Residuals

11.1 Types of GLM Residuals

Unlike linear regression, which has a single natural residual yiμ^iy_i - \hat{\mu}_i, GLMs have several types of residuals, each useful for different diagnostic purposes.

11.1.1 Raw (Response) Residuals

riraw=yiμ^ir_i^{raw} = y_i - \hat{\mu}_i

Simple but not standardised — larger values of μi\mu_i tend to produce larger raw residuals even if the fit is equally good.

11.1.2 Pearson Residuals

riP=yiμ^iV(μ^i)/wir_i^P = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)/w_i}}

Standardised by the expected standard deviation under the model. Pearson residuals should be approximately N(0,1)\mathcal{N}(0, 1) for large samples if the model is correct.

Standardised Pearson residuals (adjusted for leverage):

riPS=riPϕ^(1hii)r_i^{PS} = \frac{r_i^P}{\sqrt{\hat{\phi}(1 - h_{ii})}}

Where hiih_{ii} is the leverage (hat matrix diagonal). Values riPS>2|r_i^{PS}| > 2 warrant investigation.

11.1.3 Deviance Residuals

riD=sign(yiμ^i)dir_i^D = \text{sign}(y_i - \hat{\mu}_i) \sqrt{d_i}

Where did_i is the deviance contribution of observation ii (see Section 9.1). The sum of squared deviance residuals equals the total deviance: i(riD)2=Dr\sum_i (r_i^D)^2 = D_r.

Deviance residuals are generally preferred for normality assessments because they are closer to normally distributed than Pearson residuals in many GLMs.

Standardised deviance residuals:

riDS=riDϕ^(1hii)r_i^{DS} = \frac{r_i^D}{\sqrt{\hat{\phi}(1 - h_{ii})}}

11.1.4 Anscombe Residuals

Anscombe residuals are constructed using a variance-stabilising transformation A(μ)A(\mu) chosen so that residuals are approximately normally distributed:

riA=A(yi)A(μ^i)A(μ^i)V(μ^i)/wir_i^A = \frac{A(y_i) - A(\hat{\mu}_i)}{A'(\hat{\mu}_i)\sqrt{V(\hat{\mu}_i)/w_i}}

The Anscombe transformation for each distribution:

| Distribution | $A(\mu)$ |
|---|---|
| Normal | $\mu$ |
| Poisson | $\frac{3}{2}\mu^{2/3}$ |
| Binomial | $\arcsin\left(\sqrt{y/n}\right)$ (approximately) |
| Gamma | $3\mu^{1/3}$ |
| Inverse Gaussian | $\ln\mu$ |

11.1.5 Quantile (Randomised) Residuals

Quantile residuals (Dunn & Smyth, 1996) are defined as:

riQ=Φ1(ui)r_i^Q = \Phi^{-1}(u_i)

Where ui=F(yi;μ^i,ϕ^)u_i = F(y_i; \hat{\mu}_i, \hat{\phi}) is the cumulative probability of the observed value under the fitted model. For discrete distributions, uiu_i is drawn uniformly from [F(yi;μ^i),F(yi;μ^i)][F(y_i^{-}; \hat{\mu}_i), F(y_i; \hat{\mu}_i)] (randomised).

Quantile residuals are exactly normally distributed (by construction) when the model is correct, making them the gold standard for GLM diagnostics. They are particularly useful for discrete distributions (Poisson, Binomial, Negative Binomial) where other residuals are not well-approximated by a normal distribution.
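A sketch of randomised quantile residuals for a Poisson fit, assuming `numpy` and `scipy` are available (`quantile_residuals_poisson` is an illustrative helper name, not an API of the application):

```python
import numpy as np
from scipy.stats import poisson, norm

def quantile_residuals_poisson(y, mu, rng=None):
    """Randomised quantile residuals for a Poisson fit (Dunn & Smyth, 1996).

    For each observation, draw u uniformly between F(y - 1) and F(y)
    under the fitted Poisson, then map through the normal quantile."""
    rng = np.random.default_rng() if rng is None else rng
    y, mu = np.asarray(y), np.asarray(mu)
    lower = poisson.cdf(y - 1, mu)   # F(y^-): CDF just below the observed value
    upper = poisson.cdf(y, mu)       # F(y)
    u = rng.uniform(lower, upper)    # randomisation for the discrete jump
    return norm.ppf(u)

rng = np.random.default_rng(42)
y = rng.poisson(3.0, size=500)
r = quantile_residuals_poisson(y, np.full(500, 3.0), rng)
# When the model is correct (as here), the residuals are exactly N(0, 1).
```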

11.2 Leverage, Influence, and Cook's Distance

Hat matrix (leverage): For GLMs, the hat matrix is:

H=W^1/2X(XTW^X)1XTW^1/2\mathbf{H} = \hat{\mathbf{W}}^{1/2} \mathbf{X} (\mathbf{X}^T \hat{\mathbf{W}} \mathbf{X})^{-1} \mathbf{X}^T \hat{\mathbf{W}}^{1/2}

The diagonal elements hiih_{ii} are the leverages — the influence of observation ii on its own fitted value. High leverage (hii>2(p+1)/nh_{ii} > 2(p+1)/n) indicates an observation with unusual predictor values.

Cook's Distance: Measures the influence of observation ii on all fitted values:

Ci=(riPS)2hii(p+1)(1hii)C_i = \frac{(r_i^{PS})^2 h_{ii}}{(p+1)(1 - h_{ii})}

Values Ci>1C_i > 1 (or Ci>4/nC_i > 4/n) suggest influential observations.

DFBETA: Change in coefficient estimates when observation ii is excluded:

DFBETAij=β^jβ^j(i)DFBETA_{ij} = \hat{\beta}_j - \hat{\beta}_j^{(-i)}
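These influence measures follow directly from the weighted hat matrix; a numpy sketch (with `glm_influence` as a hypothetical helper, and random normals standing in for the Pearson residuals of a real fit):

```python
import numpy as np

def glm_influence(X, W, pearson_resid, phi=1.0):
    """Leverages and Cook's distances from the GLM hat matrix
    H = W^{1/2} X (X' W X)^{-1} X' W^{1/2}, with W the working weights."""
    Wsqrt = np.sqrt(W)
    Xw = Wsqrt[:, None] * X
    H = Xw @ np.linalg.inv(X.T @ (W[:, None] * X)) @ Xw.T
    h = np.diag(H)                                      # leverages h_ii
    p1 = X.shape[1]                                     # number of parameters p + 1
    r_std = pearson_resid / np.sqrt(phi * (1.0 - h))    # standardised Pearson residuals
    cooks = r_std**2 * h / (p1 * (1.0 - h))             # Cook's distances
    return h, cooks

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
resid = rng.normal(size=50)                             # stand-in residuals for demonstration
h, cooks = glm_influence(X, np.ones(50), resid)
# The leverages always sum to the number of fitted parameters.
```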

11.3 Diagnostic Plots

A comprehensive GLM diagnostic assessment includes the following plots:

| Plot | What to Look For |
|---|---|
| Residuals vs. Fitted values | No pattern; random scatter around zero |
| Scale-Location (√\|residuals\| vs. Fitted) | Horizontal band; no trend (homoscedasticity) |
| Normal Q-Q of residuals | Points near the diagonal line (normality) |
| Residuals vs. Leverage | No high-leverage + high-residual points |
| Cook's Distance | No observations with $C_i > 1$ |
| Added Variable Plots | Linear relationship on the link scale |
| Partial Residual Plots | Detect non-linearity in individual predictors |
| Index Plot of Deviance Residuals | Identify outliers by observation number |

11.4 Goodness-of-Fit Tests

Hosmer-Lemeshow Test (for Binomial GLM): Groups observations into G=10G = 10 deciles of fitted probabilities and tests observed vs. expected event counts:

χHL2=g=1G(OgEg)2Eg(1Eg/ng)χG22\chi^2_{HL} = \sum_{g=1}^G \frac{(O_g - E_g)^2}{E_g(1 - E_g/n_g)} \sim \chi^2_{G-2}

A non-significant result (p>0.05p > 0.05) indicates adequate calibration.

Deviance Goodness-of-Fit Test (for Poisson/Binomial):

Drχnp12(approximately, for large expected counts)D_r \sim \chi^2_{n-p-1} \quad \text{(approximately, for large expected counts)}

A significant deviance (p<0.05p < 0.05) may indicate lack of fit, overdispersion, or missing covariates.

Pearson χ2\chi^2 Goodness-of-Fit Test:

X2χnp12X^2 \sim \chi^2_{n-p-1}

Similar interpretation to the deviance test. For sparse data, X2X^2 may be more reliable than DrD_r.


12. Model Selection and Variable Selection

12.1 Nested Model Comparison via LRT

To compare two nested models M0M1M_0 \subset M_1:

Λ=D(M0)D(M1)χdfM0dfM12\Lambda = D(M_0) - D(M_1) \sim \chi^2_{df_{M_0} - df_{M_1}}

For quasi-GLMs (estimated dispersion):

F = \frac{(D(M_0) - D(M_1))/(df_{M_0} - df_{M_1})}{\hat{\phi}} \sim F_{df_{M_0} - df_{M_1},\, n-p_{M_1}-1}

12.2 AIC-Based Model Selection

For non-nested models or exploratory model building, use AIC:

AIC=2(μ^)+2kAIC = -2\ell(\hat{\boldsymbol{\mu}}) + 2k

Select the model with the lowest AIC. A difference ΔAIC>2\Delta AIC > 2 is considered meaningful; ΔAIC>10\Delta AIC > 10 is strong evidence for the lower-AIC model.

12.3 Stepwise Variable Selection

Forward selection: Start with the null model; add the variable that most reduces AIC at each step; stop when no addition improves AIC.

Backward elimination: Start with the full model; remove the variable that least increases AIC (i.e., most reduces AIC) at each step; stop when no removal improves AIC.

Bidirectional stepwise: Combine forward and backward; at each step, consider both additions and removals.

⚠️ Stepwise selection using p-values suffers from multiple testing inflation and instability. AIC-based stepwise is preferred. Neither should be used as a substitute for theory-driven model building.
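Forward selection by AIC is easy to express generically. In the sketch below, `aic_of` is a hypothetical callback that fits a GLM on a given variable set and returns its AIC — here stubbed with a toy lookup table so the control flow can be followed end to end:

```python
def forward_select(candidates, aic_of):
    """Greedy forward selection: at each step add the variable that most
    reduces AIC; stop when no addition improves on the current best."""
    selected, best_aic = [], aic_of([])
    while True:
        trials = [(aic_of(selected + [v]), v) for v in candidates if v not in selected]
        if not trials:
            break
        aic, var = min(trials)
        if aic >= best_aic:          # no addition improves AIC: stop
            break
        selected.append(var)
        best_aic = aic
    return selected, best_aic

# Toy AIC surface: x1 and x2 improve the model, x3 does not.
toy = {(): 100.0, ("x1",): 90.0, ("x2",): 95.0, ("x3",): 101.0,
       ("x1", "x2"): 85.0, ("x1", "x3"): 91.0, ("x2", "x3"): 96.0,
       ("x1", "x2", "x3"): 86.0}
sel, aic = forward_select(["x1", "x2", "x3"], lambda vs: toy[tuple(sorted(vs))])
```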

12.4 Handling Categorical Predictors

Categorical predictors with kk categories are encoded as k1k-1 dummy variables using a reference category. For a categorical variable "Group" with categories A (reference), B, and C:

DB=1[Group=B],DC=1[Group=C]D_B = \mathbf{1}[\text{Group} = B], \quad D_C = \mathbf{1}[\text{Group} = C]

The model becomes:

g(μi)=β0+βBDB,i+βCDC,i+g(\mu_i) = \beta_0 + \beta_B D_{B,i} + \beta_C D_{C,i} + \dots

eβBe^{\beta_B} (for log link) is the ratio of the mean for group B relative to the reference group A.

Testing all categories jointly (LRT):

Drop all k1k-1 dummy variables simultaneously and compare to the full model:

Λ=D(Mgroup)D(Mfull)χk12\Lambda = D(M_{-group}) - D(M_{full}) \sim \chi^2_{k-1}

12.5 Interaction Terms

Interactions model situations where the effect of one predictor on the response depends on the value of another predictor:

g(μi)=β0+β1Xi1+β2Xi2+β12Xi1Xi2g(\mu_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_{12} X_{i1} X_{i2}

In a log-link model: eβ12e^{\beta_{12}} is the multiplicative modification to the rate ratio of X1X_1 for each unit increase in X2X_2.

Test whether the interaction term is needed using the LRT with df=1df = 1 (or more for categorical interactions).

12.6 Polynomial and Spline Terms

For non-linear relationships on the link scale, include polynomial or spline terms:

Polynomial: g(μi)=β0+β1Xi+β2Xi2+g(\mu_i) = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \dots

Natural Cubic Spline: Replaces XX with a set of basis functions that allow flexible non-linear fitting while remaining linear at the extremes. The number of knots controls flexibility.

LOESS Smoothed Partial Residual Plot: Helps identify non-linearity — if the LOESS curve departs substantially from a straight line, a polynomial or spline term may be needed.


13. Overdispersion and Underdispersion

13.1 What is Overdispersion?

Overdispersion occurs when the observed variance in the data exceeds the variance predicted by the model. It is most commonly encountered with Poisson and Binomial GLMs.

For Poisson: overdispersion means Var(Yi)>E[Yi]=μi\text{Var}(Y_i) > E[Y_i] = \mu_i. For Binomial: overdispersion means Var(Yi)>niμi(1μi)\text{Var}(Y_i) > n_i\mu_i(1-\mu_i).

Consequences of ignoring overdispersion:

13.2 Detecting Overdispersion

Informal check: Compute the ratio:

ϕ^=X2np1=i(yiμ^i)2/V(μ^i)np1\hat{\phi} = \frac{X^2}{n - p - 1} = \frac{\sum_i (y_i - \hat{\mu}_i)^2 / V(\hat{\mu}_i)}{n - p - 1}

Formal test: Test H0:ϕ=1H_0: \phi = 1 using:

χ2=X2χnp12under H0\chi^2 = X^2 \sim \chi^2_{n-p-1} \quad \text{under } H_0

A significant result (p<0.05p < 0.05) confirms overdispersion.

Cameron-Trivedi test: Regresses (yiμ^i)2yi(y_i - \hat{\mu}_i)^2 - y_i on μ^i2\hat{\mu}_i^2 (for Poisson) and tests whether the slope is significantly different from zero.
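The informal dispersion check, and the quasi-GLM standard-error inflation it motivates (Section 13.4.1), amount to two lines of arithmetic; the Pearson statistic and coefficient SE below are illustrative inputs:

```python
def dispersion_estimate(pearson_chi2: float, n: int, p: int) -> float:
    """Pearson-based dispersion estimate: X^2 / (n - p - 1).

    Values well above 1 signal overdispersion; a quasi-GLM inflates
    every standard error by sqrt(phi)."""
    return pearson_chi2 / (n - p - 1)

# Illustrative: X^2 = 591.3 with n = 500 observations and p = 4 predictors
phi = dispersion_estimate(591.3, 500, 4)
se_quasi = phi ** 0.5 * 0.112   # quasi-GLM SE for a coefficient with standard SE 0.112
```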

13.3 Causes of Overdispersion

| Cause | Description |
|---|---|
| Unobserved heterogeneity | Unmeasured variables cause variation in the true rate across observations |
| Clustering / correlation | Observations within groups are not independent |
| Zero inflation | More zeros than expected under Poisson/Binomial (see Section 13.5) |
| Contagion | One event increases the probability of subsequent events (positive feedback) |
| Model misspecification | Wrong distributional family, missing covariates, wrong link function |
| Outliers | One or a few extreme observations inflate the apparent variance |

13.4 Solutions for Overdispersion

13.4.1 Quasi-GLM (Quasi-Poisson / Quasi-Binomial)

The simplest fix: Estimate ϕ\phi from the data and use it to inflate all standard errors:

SEquasi(β^j)=ϕ^×SEstandard(β^j)SE_{quasi}(\hat{\beta}_j) = \sqrt{\hat{\phi}} \times SE_{standard}(\hat{\beta}_j)

The coefficient estimates are identical to the standard GLM; only the standard errors, test statistics, and confidence intervals change. Use FF-tests instead of χ2\chi^2 tests for model comparison.

When to use: When overdispersion is mild to moderate and no specific mechanism is known.

13.4.2 Negative Binomial Regression

Models overdispersion via an additional parameter kk (the dispersion parameter):

Var(Yi)=μi+μi2k\text{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{k}

As kk \to \infty, the Negative Binomial → Poisson. A significant improvement in fit over Poisson (LRT test for kk) confirms overdispersion.

When to use: When counts are overdispersed and overdispersion follows a Gamma-mixture structure (i.e., unobserved heterogeneity).

13.4.3 Zero-Inflated Models

When excess zeros are the source of overdispersion, zero-inflated models combine:

  1. A binary model for whether the count is structurally zero (e.g., logistic regression).
  2. A count model for the actual count given it is non-zero (e.g., Poisson or Negative Binomial).

Zero-Inflated Poisson (ZIP):

P(Yi=0)=πi+(1πi)eμiP(Y_i = 0) = \pi_i + (1-\pi_i)e^{-\mu_i} P(Yi=y)=(1πi)eμiμiyy!,y>0P(Y_i = y) = (1-\pi_i) \frac{e^{-\mu_i}\mu_i^y}{y!}, \quad y > 0

Where:

Zero-Inflated Negative Binomial (ZINB): Combines structural zeros with a Negative Binomial count process.

Hurdle Models: Similar to zero-inflated models but use a different two-part structure — a binary process for zero vs. positive, and a truncated count model for positive values.

Vuong Test: Compares a standard Poisson/NB model against its zero-inflated counterpart. A significant positive test statistic favours the zero-inflated model.
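The ZIP probability mass function above translates directly into code; a small sketch (with illustrative parameter values) that also confirms the probabilities sum to one:

```python
import math

def zip_pmf(y: int, mu: float, pi: float) -> float:
    """Zero-inflated Poisson pmf: a structural-zero probability pi
    mixed with an ordinary Poisson(mu) count process."""
    poisson_term = math.exp(-mu) * mu ** y / math.factorial(y)
    if y == 0:
        return pi + (1.0 - pi) * poisson_term   # structural + sampling zeros
    return (1.0 - pi) * poisson_term

# Illustrative parameters: 30% structural zeros on top of Poisson(2.5)
total = sum(zip_pmf(y, mu=2.5, pi=0.3) for y in range(60))
```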

13.4.4 Mixed Models (GLMM)

When overdispersion arises from clustered or hierarchical data (e.g., patients within hospitals, students within schools), Generalised Linear Mixed Models (GLMMs) include random effects to account for within-group correlation:

g(μij)=xijTβ+bi,biN(0,σb2)g(\mu_{ij}) = \mathbf{x}_{ij}^T \boldsymbol{\beta} + b_i, \quad b_i \sim \mathcal{N}(0, \sigma^2_b)

Where bib_i is a random effect for cluster ii.

13.5 Underdispersion

Underdispersion (variance less than expected) is less common but can occur when:

Solutions include Conway-Maxwell-Poisson (CMP) regression, which handles both over- and underdispersion via an additional parameter.


14. Using the GLM Component

The GLM component in the DataStatPro application provides a full end-to-end workflow for fitting, evaluating, and interpreting Generalized Linear Models.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should contain one response variable and one or more predictor variables.

Step 2 — Select Distribution Family Choose the distribution appropriate for your response variable:

Step 3 — Select Link Function Choose the link function. The default is the canonical link for each distribution:

Step 4 — Select Response Variable (Y) Select the response variable from the "Response Variable (Y)" dropdown. For Binomial with proportions, you will be prompted to also select the trials variable (total counts nin_i).

Step 5 — Select Predictor Variables (X) Select one or more predictor variables from the "Predictor Variables (X)" dropdown. These can be:

Step 6 — Configure Offset (Optional) If the response is a rate, select the offset variable from the "Offset" dropdown. The application will include ln(offset)\ln(\text{offset}) in the linear predictor automatically.

Step 7 — Configure Interactions (Optional) Specify interaction terms by selecting pairs (or groups) of variables. The application will create and include the product terms.

Step 8 — Select Confidence Level Choose the confidence level for confidence intervals and prediction intervals (default: 95%).

Step 9 — Configure Dispersion For Quasi-Poisson and Quasi-Binomial, select the dispersion estimation method:

For Negative Binomial, choose the method for estimating kk:

Step 10 — Select Display Options Choose which outputs to display:

Step 11 — Run the Analysis Click "Run GLM". The application will:

  1. Encode categorical variables using dummy coding.
  2. Fit the GLM using IRLS.
  3. Compute coefficients, SEs, zz-values, p-values, and CIs (Wald and profile likelihood).
  4. Compute deviance, Pearson χ2\chi^2, AIC, BIC, and pseudo R².
  5. Estimate the dispersion parameter (if applicable).
  6. Compute all residual types and diagnostic statistics.
  7. Generate all selected diagnostic plots.
  8. Run goodness-of-fit tests.

15. Computational and Formula Details

15.1 IRLS Algorithm: Full Step-by-Step

Inputs: Response y\mathbf{y}, design matrix X\mathbf{X} (n×(p+1)n \times (p+1)), prior weights w\mathbf{w} (default: all 1), link function gg, variance function VV.

Step 0: Initialise μ^(0)\hat{\boldsymbol{\mu}}^{(0)}

μ^i(0)=yi+δ(e.g., δ=0.1 for Poisson/Gamma),η^i(0)=g(μ^i(0))\hat{\mu}_i^{(0)} = y_i + \delta \quad (\text{e.g., } \delta = 0.1 \text{ for Poisson/Gamma}), \quad \hat{\eta}_i^{(0)} = g(\hat{\mu}_i^{(0)})

For iteration t=0,1,2,t = 0, 1, 2, \dots until convergence:

Step 1: Compute working response z(t)\mathbf{z}^{(t)}:

zi(t)=η^i(t)+(yiμ^i(t))g(μ^i(t))z_i^{(t)} = \hat{\eta}_i^{(t)} + (y_i - \hat{\mu}_i^{(t)}) \cdot g'(\hat{\mu}_i^{(t)})

Step 2: Compute working weights W(t)\mathbf{W}^{(t)}:

Wii(t)=wiV(μ^i(t))[g(μ^i(t))]2W_{ii}^{(t)} = \frac{w_i}{V(\hat{\mu}_i^{(t)}) \cdot \left[g'(\hat{\mu}_i^{(t)})\right]^2}

Step 3: Solve weighted least squares:

β^(t+1)=(XTW(t)X)1XTW(t)z(t)\hat{\boldsymbol{\beta}}^{(t+1)} = \left(\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}

Step 4: Update linear predictor and mean:

η^(t+1)=Xβ^(t+1),μ^(t+1)=g1(η^(t+1))\hat{\boldsymbol{\eta}}^{(t+1)} = \mathbf{X} \hat{\boldsymbol{\beta}}^{(t+1)}, \quad \hat{\boldsymbol{\mu}}^{(t+1)} = g^{-1}(\hat{\boldsymbol{\eta}}^{(t+1)})

Step 5: Check convergence:

β^(t+1)β^(t)β^(t)+1010<ϵ(e.g., ϵ=108)\frac{\|\hat{\boldsymbol{\beta}}^{(t+1)} - \hat{\boldsymbol{\beta}}^{(t)}\|}{\|\hat{\boldsymbol{\beta}}^{(t)}\| + 10^{-10}} < \epsilon \quad \text{(e.g., } \epsilon = 10^{-8}\text{)}

Or equivalently, check the change in deviance.
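The steps above can be sketched concretely for a Poisson GLM with log link, where $V(\mu) = \mu$ and $g'(\mu) = 1/\mu$, so the working weights reduce to $\mu$. This is an illustrative numpy implementation, not the application's internal code:

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=100):
    """IRLS for a Poisson GLM with log link, following Steps 0-5 above."""
    mu = y + 0.1                          # Step 0: initialise away from zero
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        z = eta + (y - mu) / mu           # Step 1: working response (g'(mu) = 1/mu)
        W = mu                            # Step 2: weights 1/(V * g'^2) = mu
        XtW = X.T * W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # Step 3: weighted least squares
        eta = X @ beta_new                # Step 4: update linear predictor and mean
        mu = np.exp(eta)
        if np.linalg.norm(beta_new - beta) < tol * (np.linalg.norm(beta) + 1e-10):
            beta = beta_new               # Step 5: converged
            break
        beta = beta_new
    return beta, mu

rng = np.random.default_rng(1)
x = rng.normal(size=400)
X = np.column_stack([np.ones(400), x])
y = rng.poisson(np.exp(0.5 + 0.3 * x)).astype(float)
beta, mu = irls_poisson(X, y)
```

For the canonical log link, IRLS coincides with Newton-Raphson, and at convergence the score equations $\mathbf{X}^T(\mathbf{y} - \hat{\boldsymbol{\mu}}) = \mathbf{0}$ hold to numerical precision.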

15.2 Link Function Derivatives

The working response and working weights require g(μ)=dη/dμ=dg/dμg'(\mu) = d\eta/d\mu = dg/d\mu:

| Link | $g(\mu)$ | $g'(\mu) = dg/d\mu$ | $g^{-1}(\eta)$ |
|---|---|---|---|
| Identity | $\mu$ | $1$ | $\eta$ |
| Log | $\ln\mu$ | $1/\mu$ | $e^\eta$ |
| Logit | $\ln(\mu/(1-\mu))$ | $1/(\mu(1-\mu))$ | $e^\eta/(1+e^\eta)$ |
| Probit | $\Phi^{-1}(\mu)$ | $1/\phi(\Phi^{-1}(\mu))$ | $\Phi(\eta)$ |
| Cloglog | $\ln(-\ln(1-\mu))$ | $-1/((1-\mu)\ln(1-\mu))$ | $1-e^{-e^\eta}$ |
| Inverse | $1/\mu$ | $-1/\mu^2$ | $1/\eta$ |
| Inv. Squared | $1/\mu^2$ | $-2/\mu^3$ | $1/\sqrt{\eta}$ |
| Square root | $\sqrt{\mu}$ | $1/(2\sqrt{\mu})$ | $\eta^2$ |

15.3 Deviance Formulas by Distribution

| Distribution | Total Deviance $D(\mathbf{y}, \hat{\boldsymbol{\mu}})$ |
|---|---|
| Gaussian | $\sum_i w_i(y_i - \hat{\mu}_i)^2$ |
| Binomial | $2\sum_i \left[y_i \ln(y_i/\hat{\mu}_i) + (n_i-y_i)\ln((n_i-y_i)/(n_i-\hat{\mu}_i))\right]$ |
| Poisson | $2\sum_i \left[y_i \ln(y_i/\hat{\mu}_i) - (y_i-\hat{\mu}_i)\right]$ |
| Gamma | $2\sum_i \left[-\ln(y_i/\hat{\mu}_i) + (y_i-\hat{\mu}_i)/\hat{\mu}_i\right]$ |
| Inv. Gaussian | $\sum_i (y_i-\hat{\mu}_i)^2/(y_i\hat{\mu}_i^2)$ |
| Neg. Binomial | $2\sum_i \left[y_i\ln(y_i/\hat{\mu}_i) - (y_i+k)\ln((y_i+k)/(\hat{\mu}_i+k))\right]$ |

15.4 Marginal Effects

For models with non-identity link functions, the coefficient βj\beta_j describes the effect of XjX_j on the link scale — not directly on the response scale. Marginal effects translate coefficients to the response scale.

Average Marginal Effect (AME):

AMEj=1ni=1nμ^iXij=1ni=1nβ^j(g1)(η^i)AME_j = \frac{1}{n}\sum_{i=1}^n \frac{\partial \hat{\mu}_i}{\partial X_{ij}} = \frac{1}{n}\sum_{i=1}^n \hat{\beta}_j \cdot (g^{-1})'(\hat{\eta}_i)

For the log link: μ^iXij=β^jμ^i\frac{\partial \hat{\mu}_i}{\partial X_{ij}} = \hat{\beta}_j \hat{\mu}_i, so AMEj=β^jμ^ˉAME_j = \hat{\beta}_j \bar{\hat{\mu}}.

For the logit link: μ^iXij=β^jμ^i(1μ^i)\frac{\partial \hat{\mu}_i}{\partial X_{ij}} = \hat{\beta}_j \hat{\mu}_i(1-\hat{\mu}_i), so AMEj=β^jμ^(1μ^)AME_j = \hat{\beta}_j \overline{\hat{\mu}(1-\hat{\mu})}.

Marginal Effect at the Mean (MEM):

MEMj=β^j(g1)(η^(xˉ))MEM_j = \hat{\beta}_j \cdot (g^{-1})'(\hat{\eta}(\bar{\mathbf{x}}))

AME is generally preferred over MEM as it averages over the actual distribution of observations rather than evaluating at the mean (which may not be a representative point).
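The AME formulas for the two common links translate directly into code (a numpy sketch with illustrative fitted means, using hypothetical helper names):

```python
import numpy as np

def ame_logit(beta_j, mu_hat):
    """Average marginal effect under a logit link:
    AME_j = beta_j * mean(mu * (1 - mu))."""
    mu_hat = np.asarray(mu_hat)
    return beta_j * np.mean(mu_hat * (1.0 - mu_hat))

def ame_log(beta_j, mu_hat):
    """Average marginal effect under a log link: AME_j = beta_j * mean(mu)."""
    return beta_j * np.mean(np.asarray(mu_hat))

# Illustrative fitted probabilities from a logistic model
mu = np.array([0.2, 0.5, 0.8])
effect = ame_logit(0.9, mu)
```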


16. Worked Examples

Example 1: Poisson GLM — Modelling Insurance Claim Counts

Research Question: What factors predict the number of insurance claims filed by policyholders? Do age, vehicle type, and driving experience affect claim frequency?

Data: n=500n = 500 policyholders; response YY = number of claims in one year; exposure EE = years of coverage (offset); predictors: Age (years), VehicleType (Car/Van/Truck; reference = Car), Experience (years of driving).

Step 1: Check Response Distribution

Mean claims per year: yˉ/Eˉ=0.18\bar{y}/\bar{E} = 0.18. Histogram shows right-skewed counts with many zeros. Poisson GLM with log link and log(exposure) offset is appropriate.

Step 2: Fit Poisson GLM

ln(μi)=β0+β1Agei+β2DVan,i+β3DTruck,i+β4Experiencei+ln(Ei)\ln(\mu_i) = \beta_0 + \beta_1 \text{Age}_i + \beta_2 D_{Van,i} + \beta_3 D_{Truck,i} + \beta_4 \text{Experience}_i + \ln(E_i)

Step 3: Coefficient Table

| Parameter | $\hat{\beta}$ | SE | $z$-value | $p$-value | $e^{\hat{\beta}}$ (Rate Ratio) | 95% CI for Rate Ratio |
|---|---|---|---|---|---|---|
| Intercept | -2.183 | 0.241 | -9.06 | < 0.001 | 0.113 | [0.070, 0.181] |
| Age | -0.018 | 0.006 | -2.87 | 0.004 | 0.982 | [0.970, 0.994] |
| Van | 0.421 | 0.112 | 3.76 | < 0.001 | 1.524 | [1.224, 1.897] |
| Truck | 0.683 | 0.148 | 4.61 | < 0.001 | 1.980 | [1.481, 2.645] |
| Experience | -0.031 | 0.009 | -3.44 | 0.001 | 0.969 | [0.951, 0.988] |

Step 4: Interpretation

Step 5: Model Fit Statistics

D0=641.3 (df=499),Dr=572.4 (df=495),LRT χ42=68.9, p<0.001D_0 = 641.3\ (df = 499), \quad D_r = 572.4\ (df = 495), \quad \text{LRT } \chi^2_4 = 68.9,\ p < 0.001

RMcFadden2=1572.4/641.3=0.107,AIC=1184.2R^2_{McFadden} = 1 - 572.4/641.3 = 0.107, \quad AIC = 1184.2

Step 6: Check for Overdispersion

ϕ^=X2/(np1)=591.3/495=1.194\hat{\phi} = X^2/(n-p-1) = 591.3/495 = 1.194

Mild overdispersion (ϕ^>1\hat{\phi} > 1). Refit with Quasi-Poisson:

Quasi-Poisson multiplies all SEs by 1.194=1.093\sqrt{1.194} = 1.093. Conclusions are largely unchanged but confidence intervals are slightly wider.

Prediction for new policyholder: Age = 35, Van, Experience = 10, Exposure = 1 year:

η^=2.183+(0.018)(35)+0.421(1)+(0.031)(10)+ln(1)=2.1830.630+0.4210.310+0=2.702\hat{\eta} = -2.183 + (-0.018)(35) + 0.421(1) + (-0.031)(10) + \ln(1) = -2.183 - 0.630 + 0.421 - 0.310 + 0 = -2.702

μ^=e2.702=0.067 claims per year\hat{\mu} = e^{-2.702} = 0.067 \text{ claims per year}
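The prediction arithmetic can be verified directly from the coefficient table:

```python
import math

# Coefficients from the fitted Poisson model above
coef = {"intercept": -2.183, "age": -0.018, "van": 0.421,
        "truck": 0.683, "experience": -0.031}

# New policyholder: Age 35, Van, Experience 10, Exposure 1 year
eta = (coef["intercept"] + coef["age"] * 35 + coef["van"] * 1
       + coef["truck"] * 0 + coef["experience"] * 10
       + math.log(1.0))        # offset ln(E), zero for 1 year of exposure
mu = math.exp(eta)             # expected claims per year
```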


Example 2: Gamma GLM — Modelling Healthcare Costs

Research Question: What patient characteristics predict the total annual healthcare cost?

Data: n=300n = 300 patients; response YY = total annual healthcare cost (USD > 0); predictors: Age (years), ChronicConditions (count), Smoker (0/1), BMI.

Step 1: Assess Distribution

Healthcare costs are strictly positive with a right-skewed distribution and variance proportional to μ2\mu^2 (coefficient of variation approximately constant). Gamma GLM with log link is appropriate.

Step 2: Fit Gamma GLM

ln(μi)=β0+β1Agei+β2Chronici+β3Smokeri+β4BMIi\ln(\mu_i) = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Chronic}_i + \beta_3 \text{Smoker}_i + \beta_4 \text{BMI}_i

Step 3: Coefficient Table

| Parameter | $\hat{\beta}$ | SE | $z$-value | $p$-value | $e^{\hat{\beta}}$ (Cost Ratio) | 95% CI |
|---|---|---|---|---|---|---|
| Intercept | 6.421 | 0.382 | 16.81 | < 0.001 | 614.3 | [290.5, 1298.0] |
| Age | 0.028 | 0.007 | 3.89 | < 0.001 | 1.028 | [1.014, 1.043] |
| Chronic | 0.341 | 0.042 | 8.12 | < 0.001 | 1.406 | [1.295, 1.527] |
| Smoker | 0.287 | 0.098 | 2.93 | 0.003 | 1.332 | [1.099, 1.615] |
| BMI | 0.019 | 0.008 | 2.38 | 0.017 | 1.019 | [1.003, 1.036] |

Step 4: Interpretation (log link → cost ratios)

Step 5: Model Fit

ϕ^Pearson=X2/295=308.2/295=1.045(no overdispersion concern)\hat{\phi}_{Pearson} = X^2/295 = 308.2/295 = 1.045 \quad \text{(no overdispersion concern)}

D0=411.2,Dr=289.7,RMcFadden2=0.295,AIC=4821.3D_0 = 411.2, \quad D_r = 289.7, \quad R^2_{McFadden} = 0.295, \quad AIC = 4821.3

Step 6: Predicted Cost for New Patient

Age = 55, Chronic = 3, Smoker = 1, BMI = 28:

η^=6.421+0.028(55)+0.341(3)+0.287(1)+0.019(28)=6.421+1.540+1.023+0.287+0.532=9.803\hat{\eta} = 6.421 + 0.028(55) + 0.341(3) + 0.287(1) + 0.019(28) = 6.421 + 1.540 + 1.023 + 0.287 + 0.532 = 9.803

μ^=e9.803=$18,090\hat{\mu} = e^{9.803} = \$18{,}090

95% CI for $\mu$: computed on the $\eta$ scale and back-transformed: $[\$14{,}210, \; \$23{,}020]$.


Example 3: Negative Binomial GLM — Modelling Overdispersed Species Counts

Research Question: What environmental variables predict the abundance of a bird species across survey sites?

Data: n=150n = 150 survey sites; YY = bird count; predictors: Altitude (m), ForestCover (%), Distance to Water (km), Temperature (°C).

Step 1: Fit Poisson GLM and Check Overdispersion

Poisson GLM fit: ϕ^=X2/145=421.3/145=2.906\hat{\phi} = X^2/145 = 421.3/145 = 2.906 — substantial overdispersion (ϕ^3\hat{\phi} \approx 3). Switch to Negative Binomial.

Step 2: Fit Negative Binomial GLM

ln(μi)=β0+β1Altitudei+β2Foresti+β3Distancei+β4Tempi\ln(\mu_i) = \beta_0 + \beta_1 \text{Altitude}_i + \beta_2 \text{Forest}_i + \beta_3 \text{Distance}_i + \beta_4 \text{Temp}_i

Estimated overdispersion parameter: k^=2.14\hat{k} = 2.14 (SE = 0.48).

LRT for overdispersion vs. Poisson: χ12=84.3\chi^2_1 = 84.3, p<0.001p < 0.001 → Negative Binomial is strongly preferred.

Step 3: Coefficient Table

| Parameter | $\hat{\beta}$ | SE | $z$-value | $p$-value | Rate Ratio |
|---|---|---|---|---|---|
| Intercept | 1.842 | 0.412 | 4.47 | < 0.001 | 6.31 |
| Altitude (per 100 m) | -0.124 | 0.038 | -3.26 | 0.001 | 0.883 |
| Forest Cover (per 10%) | 0.218 | 0.061 | 3.57 | < 0.001 | 1.244 |
| Distance to Water | -0.083 | 0.024 | -3.46 | 0.001 | 0.920 |
| Temperature | 0.041 | 0.019 | 2.16 | 0.031 | 1.042 |

Step 4: Interpretation

Step 5: Fit Statistics

AICNB=812.4 vs. AICPoisson=893.1ΔAIC=80.7Strong support for NBAIC_{NB} = 812.4 \text{ vs. } AIC_{Poisson} = 893.1 \quad \Delta AIC = 80.7 \to \text{Strong support for NB}

RMcFadden2=0.218R^2_{McFadden} = 0.218


Example 4: Binomial GLM with Probit Link — Predicting Product Failure

Research Question: What material and design factors predict whether a component will fail a stress test?

Data: n=200n = 200 components; Y{0,1}Y \in \{0, 1\} (0 = pass, 1 = fail); predictors: Thickness (mm), Temperature (°C), MaterialGrade (A/B/C; reference = A).

Step 1: Fit Binomial GLM with Probit Link

Φ1(μi)=β0+β1Thicknessi+β2Tempi+β3DB,i+β4DC,i\Phi^{-1}(\mu_i) = \beta_0 + \beta_1 \text{Thickness}_i + \beta_2 \text{Temp}_i + \beta_3 D_{B,i} + \beta_4 D_{C,i}

Step 2: Coefficient Table

| Parameter | $\hat{\beta}$ | SE | $z$-value | $p$-value |
|---|---|---|---|---|
| Intercept | -2.841 | 0.531 | -5.35 | < 0.001 |
| Thickness | -0.384 | 0.092 | -4.17 | < 0.001 |
| Temperature | 0.041 | 0.012 | 3.42 | 0.001 |
| Grade B | 0.612 | 0.214 | 2.86 | 0.004 |
| Grade C | 1.183 | 0.241 | 4.91 | < 0.001 |

Step 3: Interpretation (probit scale)

Average Marginal Effect of Thickness:

AME=β^1×1niϕ(η^i)=0.384×0.312=0.120AME = \hat{\beta}_1 \times \frac{1}{n}\sum_i \phi(\hat{\eta}_i) = -0.384 \times 0.312 = -0.120

On average, each mm increase in thickness reduces the probability of failure by 12.0 percentage points.

Step 4: Predicted Probability for New Component

Thickness = 4.5mm, Temperature = 80°C, Grade B:

η^=2.841+(0.384)(4.5)+(0.041)(80)+0.612=2.8411.728+3.280+0.612=0.677\hat{\eta} = -2.841 + (-0.384)(4.5) + (0.041)(80) + 0.612 = -2.841 - 1.728 + 3.280 + 0.612 = -0.677

μ^=Φ(0.677)=0.249\hat{\mu} = \Phi(-0.677) = 0.249

Predicted probability of failure: 24.9%.

95% CI: Φ(0.677±1.96×0.143)=Φ[0.957,0.397]=[0.169,0.346]\text{95\% CI: } \Phi(-0.677 \pm 1.96 \times 0.143) = \Phi[-0.957, -0.397] = [0.169, 0.346]
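This probit prediction and its interval can be reproduced with the standard library alone, since $\Phi$ is available through `math.erf`:

```python
import math

def probit_inv(eta: float) -> float:
    # Inverse probit link: the standard normal CDF.
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

# New component: Thickness 4.5 mm, Temperature 80 C, Grade B
eta = -2.841 + (-0.384) * 4.5 + 0.041 * 80 + 0.612
p_fail = probit_inv(eta)
# 95% CI: shift eta by +/- 1.96 SE, then back-transform both endpoints
lo, hi = probit_inv(eta - 1.96 * 0.143), probit_inv(eta + 1.96 * 0.143)
```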


17. Common Mistakes and How to Avoid Them

Mistake 1: Using Gaussian GLM for Non-Normal Response Variables

Problem: Applying ordinary linear regression (Gaussian GLM) to count, proportion, or positive continuous data, which violates distributional assumptions, produces predictions outside valid ranges (e.g., negative counts or probabilities > 1), and leads to invalid inference.
Solution: Match the distribution to the response type: Poisson/Negative Binomial for counts; Binomial for proportions; Gamma for positive continuous. Always check the range and distribution of YY before selecting a family.

Mistake 2: Ignoring Overdispersion in Poisson/Binomial Models

Problem: Fitting a Poisson or Binomial GLM to overdispersed data (ϕ^>1\hat{\phi} > 1) without correction. Standard errors are underestimated, leading to spuriously small p-values and narrow confidence intervals.
Solution: Always compute ϕ^=X2/(np1)\hat{\phi} = X^2/(n-p-1) after fitting. If ϕ^>1.2\hat{\phi} > 1.2, use Quasi-Poisson, Quasi-Binomial, or Negative Binomial as appropriate. For severe overdispersion or excess zeros, consider zero-inflated models.

Mistake 3: Interpreting Coefficients on the Wrong Scale

Problem: Interpreting a log-link coefficient β^j=0.35\hat{\beta}_j = 0.35 as "a 0.35 unit increase in the mean" when it actually represents a multiplicative change: the mean is multiplied by e0.351.42e^{0.35} \approx 1.42 (a 42% increase).
Solution: Always interpret GLM coefficients on the appropriate scale. For log links, report and interpret eβ^je^{\hat{\beta}_j} (rate ratio, cost ratio, etc.). For logit links, report eβ^je^{\hat{\beta}_j} as an odds ratio. Always state clearly which scale is being used.

Mistake 4: Choosing the Wrong Link Function

Problem: Using an inappropriate link function (e.g., identity link for a Poisson model), which can produce predicted values outside valid ranges and poor model fit.
Solution: Use the canonical link as the default. Consider alternative links when domain knowledge suggests a specific functional form. Check the fit of alternative link functions using AIC and residual plots.

Mistake 5: Forgetting the Offset in Rate Models

Problem: Modelling count data without including an offset for different exposure periods or population sizes, attributing variation in counts entirely to predictors when it is partly due to different exposures.
Solution: Always include ln(exposure)\ln(\text{exposure}) as an offset when modelling rates from count data. Verify that the offset variable is on the log scale (for log-link models) and has a fixed coefficient of 1.

Mistake 6: Treating the Deviance as an Absolute Goodness-of-Fit Test for All Distributions

Problem: Using Drχnp12D_r \sim \chi^2_{n-p-1} to test model fit for Poisson or Binomial models with small expected counts, where the χ2\chi^2 approximation is unreliable.
Solution: The deviance goodness-of-fit test is only reliable when all expected counts μ^i5\hat{\mu}_i \geq 5. For sparse data, use the Pearson χ2\chi^2 test, collapse categories, or use simulation-based tests. For Binomial with individual binary responses, use the Hosmer-Lemeshow test instead.

Mistake 7: Not Checking for Complete Separation in Binomial Models

Problem: A predictor or combination of predictors perfectly separates successes from failures, causing IRLS to fail to converge and producing extremely large coefficient estimates with huge standard errors.
Solution: Look for IRLS convergence warnings and inspect coefficient estimates. If separation is detected, use Firth's bias-reduced logistic regression, exact logistic regression, or regularised estimation. Remove or merge categories that cause separation.
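A quick single-predictor screen for complete separation can be sketched as follows (an illustrative diagnostic, not a DataStatPro function; separation by a combination of predictors additionally requires checking linear combinations):

```python
def perfectly_separates(x, y):
    """Flags complete separation by a single predictor: every x value in the
    y=1 group lies strictly above (or strictly below) every x value in the
    y=0 group. Illustrative diagnostic, not a DataStatPro function."""
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    return min(x1) > max(x0) or max(x1) < min(x0)

print(perfectly_separates([1, 2, 3, 10, 11], [0, 0, 0, 1, 1]))  # True
print(perfectly_separates([1, 4, 2, 3, 5], [0, 1, 0, 1, 0]))    # False
```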

Mistake 8: Applying GLMs to Dependent Observations

Problem: Using a standard GLM for longitudinal, clustered, or spatially correlated data, where observations within groups are correlated, violating the independence assumption and leading to underestimated standard errors.
Solution: Use Generalised Estimating Equations (GEE) for marginal (population-averaged) inference, or Generalised Linear Mixed Models (GLMM) for subject-specific inference. Always consider the study design before choosing a model.

Mistake 9: Comparing Models Across Different Datasets Using AIC

Problem: Comparing AIC values between models fit to different subsets of data (e.g., after listwise deletion of missing values reduces the dataset differently for different models), leading to invalid comparisons.
Solution: AIC is only comparable between models fit to exactly the same observations. Ensure all candidate models use the same dataset. Handle missing data before model selection, not during.

Mistake 10: Over-Interpreting Pseudo R² Values

Problem: Comparing a GLM's pseudo R² directly to the R² from linear regression and concluding the GLM fits poorly because the pseudo R² is "only 0.20."
Solution: Pseudo R² values for GLMs are not directly comparable to OLS R². McFadden's R2=0.20R^2 = 0.20 represents a good fit in many GLM applications. Always interpret pseudo R² relative to the scale typical for that type of model and outcome, and supplement with deviance, AIC, and residual diagnostics.


18. Troubleshooting

| Issue | Likely Cause | Solution |
|---|---|---|
| IRLS fails to converge | Complete separation (Binomial); extreme predictor values; poor starting values | Check for separation; standardise predictors; use robust starting values; reduce model complexity |
| Very large coefficient estimates ($\vert\hat{\beta}\vert > 10$) | Complete or quasi-complete separation (Binomial); collinearity | Inspect data for perfect predictors; check VIF; use Firth regression or regularisation |
| Very large standard errors | Multicollinearity; separation; too few events per variable | Check VIF; remove correlated variables; collect more data; use penalised estimation |
| Residual deviance $\gg n-p-1$ | Overdispersion; model misspecification; influential outliers | Compute $\hat{\phi}$; switch to Quasi/NB model; check residual plots for outliers and non-linearity |
| Negative predicted values (Poisson model) | Identity link used accidentally (cannot occur with the log link) | Verify the link function specification; refit with the log link |
| Predicted probabilities at exactly 0 or 1 | Complete separation; very extreme linear predictor values | Check for separation; use Firth regression; inspect extreme observations |
| AIC is not reported | Quasi-GLM selected (no proper likelihood) | Use F-tests and deviance for model comparison; AIC is unavailable for quasi-models |
| $\hat{\phi} < 1$ (underdispersion) | Counts are bounded; negative contagion; over-specified model | Consider Conway-Maxwell-Poisson; check model specification; verify data are correct |
| Hosmer-Lemeshow test significant ($p < 0.05$) | Poor calibration; missing covariates; wrong link; non-linearity | Add missing predictors; try an alternative link; add polynomial terms; inspect residual plots |
| All Pearson residuals similar in magnitude | Normal/Gaussian family used on count data (constant variance) | Switch to Poisson or Negative Binomial with the appropriate variance function |
| Cook's distance very large for one observation | Extreme influential observation; data entry error | Investigate the observation; verify data accuracy; refit with and without it to assess influence |
| Profile likelihood CI very asymmetric vs. Wald CI | Strong non-linearity of the likelihood; small sample | Report the profile likelihood CI; note the asymmetry as evidence of non-normality of the MLE distribution |
| Dispersion estimate $\hat{\phi}$ varies wildly across subgroups | Heteroscedasticity; model misspecification | Consider separate models per subgroup; add interaction terms; use heteroscedasticity-robust SEs |

19. Quick Reference Cheat Sheet

Core GLM Formulas

| Formula | Description |
|---|---|
| $g(\mu_i) = \eta_i = \mathbf{x}_i^T \boldsymbol{\beta}$ | GLM specification |
| $\text{Var}(Y_i) = \phi \cdot V(\mu_i) / w_i$ | GLM variance structure |
| $\hat{\boldsymbol{\beta}}^{(t+1)} = (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{z}^{(t)}$ | IRLS update |
| $W_{ii} = w_i / [V(\hat{\mu}_i)(g'(\hat{\mu}_i))^2]$ | IRLS working weight |
| $z_i = \hat{\eta}_i + (y_i - \hat{\mu}_i)g'(\hat{\mu}_i)$ | IRLS working response |
| $\text{Cov}(\hat{\boldsymbol{\beta}}) = \hat{\phi}(\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X})^{-1}$ | Covariance of the MLE |
| $z_j = \hat{\beta}_j / SE(\hat{\beta}_j)$ | Wald z-statistic |
| $\Lambda = D(M_0) - D(M_1) \sim \chi^2_{df}$ | Likelihood ratio test |
| $D_r = 2[\ell_{sat} - \ell(\hat{\boldsymbol{\mu}})]$ | Residual deviance |
| $R^2_{McFadden} = 1 - D_r/D_0$ | McFadden's pseudo R² |
| $AIC = -2\ell(\hat{\boldsymbol{\mu}}) + 2k$ | AIC |
| $\hat{\phi} = X^2/(n-p-1)$ | Pearson dispersion estimate |
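For the simplest non-trivial case, a Poisson GLM with log link, intercept, and one covariate, the IRLS rows above reduce to $W_{ii} = \mu_i$ and $z_i = \eta_i + (y_i - \mu_i)/\mu_i$. A pure-Python sketch of the iteration (illustrative only, not the DataStatPro implementation):

```python
import math

def irls_poisson(x, y, iters=25):
    """Minimal IRLS for a Poisson GLM with log link, intercept b0 and
    slope b1, following beta <- (X'WX)^{-1} X'Wz with W_ii = mu_i and
    z_i = eta_i + (y_i - mu_i) / mu_i.
    Illustrative sketch, not the DataStatPro implementation."""
    b0, b1 = math.log(max(sum(y) / len(y), 1e-8)), 0.0  # crude start
    for _ in range(iters):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        w = mu                                            # Poisson, log link
        z = [e + (yi - mi) / mi for e, yi, mi in zip(eta, y, mu)]
        # Solve the 2x2 weighted normal equations X'WX beta = X'Wz:
        s0 = sum(w)
        s1 = sum(wi * xi for wi, xi in zip(w, x))
        s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
        t0 = sum(wi * zi for wi, zi in zip(w, z))
        t1 = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s0 * s2 - s1 * s1
        b0, b1 = (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det
    return b0, b1

# Recover known coefficients from exact Poisson means mu = exp(0.5 + 0.3 x):
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.exp(0.5 + 0.3 * xi) for xi in xs]
b0, b1 = irls_poisson(xs, ys)
```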

Distribution and Link Function Selection

| Response Type | Distribution | Default Link | $e^{\hat{\beta}}$ Interpretation |
|---|---|---|---|
| Binary (0/1) | Binomial | Logit | Odds ratio |
| Binary (0/1), latent normal | Binomial | Probit | Change in probit |
| Binary (0/1), rare event / hazard | Binomial | Cloglog | Hazard ratio |
| Proportion ($y/n$) | Binomial | Logit | Odds ratio |
| Count, equidispersed | Poisson | Log | Rate ratio |
| Count, overdispersed | Neg. Binomial | Log | Rate ratio |
| Count, overdispersed (mild) | Quasi-Poisson | Log | Rate ratio (corrected SEs) |
| Count, excess zeros | ZIP / ZINB | Log | Rate ratio (count component) |
| Continuous, positive, $CV \approx$ const | Gamma | Log | Cost/mean ratio |
| Continuous, positive, high skew | Inverse Gaussian | Log | Mean ratio |
| Zero-inflated positive continuous | Tweedie | Log | Mean ratio |
| Continuous, unbounded | Gaussian | Identity | Additive change in mean |

Residual Types Summary

| Residual | Formula | Best For |
|---|---|---|
| Raw | $y_i - \hat{\mu}_i$ | Simple inspection |
| Pearson | $(y_i-\hat{\mu}_i)/\sqrt{V(\hat{\mu}_i)}$ | Dispersion assessment |
| Deviance | $\text{sign}(y_i-\hat{\mu}_i)\sqrt{d_i}$ | General diagnostics |
| Quantile | $\Phi^{-1}(F(y_i;\hat{\mu}_i))$ | Best normality approximation; discrete data |
| Anscombe | Variance-stabilised residual | Normality plots |
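For a discrete response, the quantile residual is randomised within the CDF step (Dunn and Smyth's randomised quantile residuals). A minimal Poisson sketch using only the standard library (function names are illustrative, not a DataStatPro API):

```python
import math
import random
from statistics import NormalDist

def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu), by direct summation."""
    return sum(math.exp(-mu) * mu ** j / math.factorial(j) for j in range(k + 1))

def quantile_residual(y, mu, rng):
    """Randomised quantile residual for a Poisson response: draw u uniformly
    on (F(y-1), F(y)), then map through the standard normal inverse CDF."""
    lo = poisson_cdf(y - 1, mu) if y > 0 else 0.0
    hi = poisson_cdf(y, mu)
    u = rng.uniform(lo, hi)
    return NormalDist().inv_cdf(u)

rng = random.Random(0)               # fixed seed for reproducibility
r = quantile_residual(3, 3.0, rng)   # well-fitting observation: small residual
```

Under a correctly specified model these residuals are exactly standard normal, which is why they are preferred for count data diagnostics.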

Overdispersion Decision Tree

```
Fit standard Poisson/Binomial GLM
              ↓
Compute φ̂ = X²/(n-p-1)
              ↓
        φ̂ ≈ 1? ──Yes──→ Model is adequate
              │ No
   ┌──────────┴──────────┐
 φ̂ < 1                  φ̂ > 1
(Underdispersion)     (Overdispersion)
   ↓                      ↓
Conway-Maxwell-      Excess zeros?
Poisson (CMP)        ├─ Yes → ZIP / ZINB
                     └─ No  → φ̂ < 2?
                              ├─ Yes → Quasi-GLM
                              └─ No  → Negative Binomial
                                       (or GLMM if clustered)
```
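The decision tree can be encoded directly as a small helper. The cut-offs below mirror the tree's $\hat{\phi} < 1$ and $\hat{\phi} < 2$ branches plus the earlier 1.2 rule of thumb; the function itself is an illustrative sketch, not a DataStatPro API:

```python
def dispersion_advice(phi_hat, excess_zeros=False, clustered=False):
    """Walks the overdispersion decision tree. Thresholds follow the tree
    and the phi-hat > 1.2 rule of thumb. Illustrative sketch only."""
    if phi_hat < 1.0:
        return "underdispersion: consider Conway-Maxwell-Poisson (CMP)"
    if phi_hat <= 1.2:
        return "model is adequate"
    if excess_zeros:
        return "overdispersion with excess zeros: ZIP/ZINB"
    if phi_hat < 2.0:
        return "mild overdispersion: Quasi-GLM"
    return "overdispersion: Negative Binomial" + (" or GLMM" if clustered else "")

print(dispersion_advice(4.25))  # overdispersion: Negative Binomial
```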

Model Comparison Guide

| Scenario | Method | Statistic |
|---|---|---|
| Two nested models (proper likelihood) | LRT | $\chi^2_{df}$ |
| Two nested quasi-GLM models | F-test | $F_{df_1, df_2}$ |
| Non-nested models | AIC / BIC | Lower is better |
| Overall model significance | Analysis of deviance | $\chi^2_{p}$ |
| Single coefficient | Wald test | $z \sim \mathcal{N}(0,1)$ |
| Group of $q$ coefficients | LRT or Wald | $\chi^2_q$ |
| Small samples / asymmetric likelihood | Profile LRT | $\chi^2_1$ |
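For nested models differing by a single parameter, the LRT row reduces to one chi-square tail probability, computable from the standard library via $P(\chi^2_1 > \Lambda) = \mathrm{erfc}(\sqrt{\Lambda/2})$ (a sketch, not a DataStatPro API):

```python
import math

def lrt_pvalue_df1(dev0, dev1):
    """p-value for Lambda = D(M0) - D(M1) ~ chi^2_1 (nested models, 1 df),
    using P(chi^2_1 > L) = erfc(sqrt(L / 2)). Illustrative sketch."""
    lam = dev0 - dev1
    if lam <= 0:
        return 1.0
    return math.erfc(math.sqrt(lam / 2.0))

# Deviance drop of 3.8415 sits exactly at the 5% critical value of chi^2_1:
print(round(lrt_pvalue_df1(110.0, 106.1585), 3))  # ≈ 0.05
```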

Pseudo R² Benchmarks (McFadden)

| $R^2_{McFadden}$ | Model Fit |
|---|---|
| $< 0.10$ | Poor |
| $0.10 - 0.20$ | Acceptable |
| $0.20 - 0.30$ | Good |
| $0.30 - 0.40$ | Very good |
| $> 0.40$ | Excellent |

Key Diagnostic Thresholds

| Diagnostic | Threshold | Action |
|---|---|---|
| $\hat{\phi}$ (dispersion) | $> 1.2$ | Investigate overdispersion |
| Standardised residual $\vert r^{DS}_i \vert$ | $> 2$ (flag), $> 3$ (outlier) | Investigate observation |
| Leverage $h_{ii}$ | $> 2(p+1)/n$ | High leverage; check predictor values |
| Cook's distance $C_i$ | $> 1$ or $> 4/n$ | Influential observation; refit without it |
| VIF | $> 5$ (concern), $> 10$ (serious) | Multicollinearity; consider variable removal |
| Hosmer-Lemeshow $p$ | $< 0.05$ | Poor calibration (Binomial models) |
| LRT for NB vs. Poisson | $p < 0.05$ | Use Negative Binomial |

GLM vs. Related Models

| Model | Extension of GLM | Key Addition | When to Use |
|---|---|---|---|
| GLMM | Yes | Random effects | Clustered / hierarchical data |
| GEE | Marginal GLM | Working correlation | Longitudinal / repeated measures |
| Zero-Inflated GLM | Yes | Structural zeros component | Excess zeros in counts |
| Hurdle Model | Yes | Two-part: binary + truncated | Zeros arise from a distinct process |
| Ordinal GLM | Yes | Cumulative link | Ordered categorical response |
| Multinomial GLM | Yes | Multiple linear predictors | Nominal categorical response (> 2 classes) |
| Survival GLM | Yes | Censoring mechanism | Time-to-event data |
| Quasi-GLM | Yes | Estimated dispersion | Overdispersion without a full distribution |
| GAMLSS | Yes | All parameters modelled | Distributional regression |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Generalized Linear Models using the DataStatPro application. For further reading, consult McCullagh & Nelder's "Generalized Linear Models" (2nd ed., Chapman & Hall, 1989), Dobson & Barnett's "An Introduction to Generalized Linear Models" (4th ed., CRC Press, 2018), or Agresti's "Foundations of Linear and Generalized Linear Models" (Wiley, 2015). For feature requests or support, contact the DataStatPro team.