
Generalized Linear Models

Comprehensive reference guide for Generalized Linear Models (GLM) analysis.

Generalized Linear Models: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Generalized Linear Models (GLMs) all the way through advanced model specification, estimation, diagnostics, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What are Generalized Linear Models?
  3. The Mathematical Framework of GLMs
  4. The Exponential Family of Distributions
  5. Link Functions
  6. GLM Distributions and Their Applications
  7. Assumptions of GLMs
  8. Parameter Estimation: Maximum Likelihood and IRLS
  9. Model Fit and Evaluation
  10. Hypothesis Testing and Inference
  11. Model Diagnostics and Residuals
  12. Model Selection and Variable Selection
  13. Overdispersion and Underdispersion
  14. Using the GLM Component
  15. Computational and Formula Details
  16. Worked Examples
  17. Common Mistakes and How to Avoid Them
  18. Troubleshooting
  19. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into Generalized Linear Models, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Probability Distributions

A probability distribution describes how the values of a random variable are distributed. The key distributions used in GLMs — Normal, Binomial, Poisson, Gamma, Inverse Gaussian, Negative Binomial, and Tweedie — are each covered in detail in Section 4.

1.2 The Likelihood Function

The likelihood function $L(\boldsymbol{\theta}; \mathbf{y})$ measures how probable the observed data $\mathbf{y}$ are, given a parameter vector $\boldsymbol{\theta}$. For independent observations:

$$L(\boldsymbol{\theta}; \mathbf{y}) = \prod_{i=1}^n f(y_i; \boldsymbol{\theta})$$

The log-likelihood is more convenient to work with:

$$\ell(\boldsymbol{\theta}; \mathbf{y}) = \sum_{i=1}^n \ln f(y_i; \boldsymbol{\theta})$$

Maximum Likelihood Estimation (MLE) finds the parameter values that maximise $\ell(\boldsymbol{\theta}; \mathbf{y})$.
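As a concrete sketch (the toy counts and the search grid below are my own, assuming nothing beyond the Poisson formula), the Poisson log-likelihood is maximised at the sample mean, and a brute-force grid search recovers the same maximiser:

```python
import math

def poisson_loglik(lam, ys):
    """Log-likelihood of i.i.d. Poisson observations at rate lam."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in ys)

ys = [2, 3, 1, 4, 2]              # toy count data
mle = sum(ys) / len(ys)           # the analytic Poisson MLE is the sample mean

# A grid search over candidate rates recovers the same maximiser:
grid = [0.5 + 0.1 * k for k in range(50)]
best = max(grid, key=lambda lam: poisson_loglik(lam, ys))
print(best, mle)                  # both ≈ 2.4
```

`math.lgamma(y + 1)` supplies the $\ln y!$ term, so the values are true log-likelihoods rather than just proportional to them.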

1.3 The Linear Predictor

A linear predictor $\eta$ is a weighted linear combination of predictor variables:

$$\eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = \mathbf{x}^T \boldsymbol{\beta}$$

This is the core structure inherited from linear regression. In GLMs, $\eta$ is not the outcome itself but is transformed through a link function to relate to the mean of the response distribution.

1.4 Ordinary Linear Regression Recap

In ordinary linear regression (OLS):

$$Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

This model has three implicit components:

  1. A distribution for the response: $Y_i \sim \mathcal{N}(\mu_i, \sigma^2)$.
  2. A linear predictor: $\eta_i = \mathbf{x}_i^T \boldsymbol{\beta}$.
  3. A link function connecting $\mu_i$ to $\eta_i$: $\eta_i = \mu_i$ (the identity link).

GLMs generalise this framework by allowing different distributions and link functions.

1.5 The Score Equations and the Information Matrix

The score vector is the gradient of the log-likelihood with respect to the parameters:

$$\mathbf{s}(\boldsymbol{\beta}) = \frac{\partial \ell}{\partial \boldsymbol{\beta}}$$

Setting $\mathbf{s}(\boldsymbol{\beta}) = \mathbf{0}$ gives the MLE. The Fisher information matrix is:

$$\mathcal{I}(\boldsymbol{\beta}) = -E\left[\frac{\partial^2 \ell}{\partial \boldsymbol{\beta} \, \partial \boldsymbol{\beta}^T}\right]$$

Its inverse $\mathcal{I}^{-1}(\boldsymbol{\beta})$ gives the asymptotic covariance matrix of the MLE, used to compute standard errors and confidence intervals.
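Continuing the one-parameter Poisson example from Section 1.2 (toy data assumed), a numerical second derivative of the log-likelihood reproduces the analytic Fisher information $n/\lambda$, and its inverse square root gives the standard error:

```python
import math

def loglik(lam, ys):
    """Poisson log-likelihood for i.i.d. observations."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in ys)

ys = [2, 3, 1, 4, 2]
lam_hat = sum(ys) / len(ys)                     # MLE = 2.4

# Observed information = minus the second derivative, by central differences:
h = 1e-4
info = -(loglik(lam_hat + h, ys) - 2 * loglik(lam_hat, ys)
         + loglik(lam_hat - h, ys)) / h ** 2

se = math.sqrt(1 / info)                        # asymptotic standard error
print(info, len(ys) / lam_hat)                  # both ≈ 2.083
```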


2. What are Generalized Linear Models?

Generalized Linear Models (GLMs) are a unified class of regression models that extend ordinary linear regression to accommodate response variables with non-normal distributions. Introduced by Nelder and Wedderburn (1972), GLMs provide a coherent framework for modelling a wide variety of outcome types — counts, proportions, binary outcomes, continuous positive values, and more — using a single, elegant mathematical structure.

2.1 The Central Idea

Ordinary linear regression assumes the response $Y$ is normally distributed and that the mean $\mu$ equals the linear predictor directly: $\mu = \eta$. GLMs relax both restrictions:

  1. The distribution of $Y$ can be any member of the exponential family (Normal, Binomial, Poisson, Gamma, Inverse Gaussian, etc.).
  2. The link function $g(\mu) = \eta$ can be any monotone, differentiable function that maps the mean $\mu$ (constrained to its natural range) to the real line $(-\infty, +\infty)$.

This two-step generalisation unlocks an enormous range of practical modelling scenarios while preserving the interpretability of regression coefficients.

2.2 Real-World Applications

GLMs are among the most widely used statistical models in applied science, business, and policy; concrete applications for each distribution are listed in Section 6.

2.3 The Three Components of a GLM

Every GLM is fully specified by three components:

| Component | Symbol | Description | Example (Logistic Regression) |
|---|---|---|---|
| Random Component | $f(y; \theta, \phi)$ | Distribution of $Y$ from the exponential family | Binomial$(n, p)$ |
| Systematic Component | $\eta = \mathbf{x}^T\boldsymbol{\beta}$ | Linear predictor | $\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$ |
| Link Function | $g(\mu) = \eta$ | Connects $E[Y] = \mu$ to $\eta$ | Logit: $\ln(p/(1-p))$ |

2.4 How GLMs Generalise Linear Regression

| Feature | Linear Regression | GLM |
|---|---|---|
| Response distribution | Normal only | Any exponential family |
| Link function | Identity ($\mu = \eta$) | Any valid link $g(\mu) = \eta$ |
| Variance | Constant $\sigma^2$ | Function of $\mu$: $\text{Var}(Y) = \phi \cdot V(\mu)$ |
| Estimation | OLS (closed form) | MLE via IRLS (iterative) |
| Goodness of fit | $R^2$, $F$-test | Deviance, AIC, likelihood ratio tests |
| Residuals | Raw, standardised | Pearson, deviance, Anscombe |

3. The Mathematical Framework of GLMs

3.1 The Three-Component Structure in Detail

A GLM specifies that the $i$-th response $Y_i$ has:

Random Component: $Y_i \sim f(y_i; \theta_i, \phi)$, an exponential family distribution,

with mean $E[Y_i] = \mu_i$ and variance $\text{Var}(Y_i) = \phi \cdot V(\mu_i)$, where $V(\mu)$ is the variance function (Section 3.2) and $\phi$ is the dispersion parameter (Section 3.3).

Systematic Component: $\eta_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} = \mathbf{x}_i^T \boldsymbol{\beta}$

Link Function: $g(\mu_i) = \eta_i$

So the mean is related to the predictors through:

$$\mu_i = g^{-1}(\eta_i) = g^{-1}(\mathbf{x}_i^T \boldsymbol{\beta})$$

Where $g^{-1}$ is the inverse link function (also called the mean function or response function).

3.2 The Variance Function $V(\mu)$

The variance function characterises how the variance of the response depends on its mean. Each exponential family distribution has a specific variance function:

| Distribution | $V(\mu)$ | Interpretation |
|---|---|---|
| Normal | $1$ | Variance is constant (homoscedastic) |
| Binomial | $\mu(1-\mu)$ | Variance is bell-shaped, maximum at $\mu = 0.5$ |
| Poisson | $\mu$ | Variance equals the mean |
| Gamma | $\mu^2$ | Variance is proportional to the square of the mean |
| Inverse Gaussian | $\mu^3$ | Variance grows as the cube of the mean |
| Negative Binomial | $\mu + \mu^2/k$ | Variance exceeds the mean (overdispersion) |
| Tweedie | $\mu^p$ | Power variance function; $p \in (1,2)$ for compound Poisson-Gamma |

3.3 The Dispersion Parameter $\phi$

The full variance of $Y_i$ is:

$$\text{Var}(Y_i) = \frac{\phi \cdot V(\mu_i)}{w_i}$$

Where $w_i$ is a known prior weight (e.g., $w_i = n_i$ for binomial proportions). The dispersion parameter $\phi$ scales the variance: it is fixed at $1$ for the Binomial and Poisson distributions, and estimated from the data for the Gaussian, Gamma, and Inverse Gaussian distributions (see Section 8.5).

3.4 The Canonical Link

For each distribution, there is a canonical link function that arises naturally from the mathematical structure of the exponential family. Using the canonical link has desirable statistical properties (sufficient statistics, simpler score equations):

| Distribution | Canonical Link | $g(\mu)$ |
|---|---|---|
| Normal | Identity | $\mu$ |
| Binomial | Logit | $\ln\left(\frac{\mu}{1-\mu}\right)$ |
| Poisson | Log | $\ln(\mu)$ |
| Gamma | Inverse | $\frac{1}{\mu}$ |
| Inverse Gaussian | Inverse squared | $\frac{1}{\mu^2}$ |

Non-canonical links can also be used and may be more interpretable in certain applications. The canonical link is the default in most GLM software but is not obligatory.

3.5 The Offset

An offset is a term added to the linear predictor with a fixed coefficient of 1:

$$\eta_i = \mathbf{x}_i^T \boldsymbol{\beta} + \text{offset}_i$$

Offsets are used when the response is a rate and observations have different exposure times or population sizes. The offset is known (not estimated) and is included on the linear predictor scale.

Example: Modelling disease incidence counts $Y_i$ for regions with different population sizes $N_i$. The rate per person is $\lambda_i = \mu_i / N_i$. Using a Poisson model with log link:

$$\ln(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta} + \ln(N_i)$$

Where $\ln(N_i)$ is the offset. The model then estimates the log rate $\ln(\lambda_i) = \mathbf{x}_i^T \boldsymbol{\beta}$.
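A small sketch of the offset mechanics (the coefficients below are invented for illustration, not from a fitted model): because the offset enters the linear predictor with coefficient 1, the expected count scales exactly with the exposure.

```python
import math

# A hypothetical fitted Poisson rate model with log link and population offset:
# log(mu_i) = b0 + b1 * x_i + log(N_i).  Coefficients are invented.
b0, b1 = -7.0, 0.3

def expected_count(x, population):
    return math.exp(b0 + b1 * x + math.log(population))

# Doubling the population doubles the expected count at the same covariate value:
c1 = expected_count(x=2.0, population=10_000)
c2 = expected_count(x=2.0, population=20_000)
print(c2 / c1)   # ≈ 2.0
```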


4. The Exponential Family of Distributions

The exponential family is a broad class of distributions that share a common mathematical form, which is the foundation of the GLM framework.

4.1 The Exponential Family Form

A distribution belongs to the exponential family if its probability density (or mass) function can be written as:

$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$$

Where:

  - $\theta$ is the natural (canonical) parameter,
  - $b(\theta)$ is the cumulant function,
  - $a(\phi)$ is the dispersion function, typically $a(\phi) = \phi / w$ for a known weight $w$,
  - $c(y, \phi)$ is a normalising term that does not depend on $\theta$.

Key relationships derived from $b(\theta)$:

$$\mu = E[Y] = b'(\theta) \quad \text{(first derivative of } b\text{)}$$

$$\text{Var}(Y) = a(\phi) \cdot b''(\theta) = \phi \cdot V(\mu) \quad \text{(second derivative of } b\text{)}$$

This elegant structure means that all moments of the distribution follow automatically from the cumulant function $b(\theta)$.
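These relationships can be checked numerically. For the Poisson family, $\theta = \ln(\mu)$ and $b(\theta) = e^\theta$, so both $b'(\theta)$ and $b''(\theta)$ should return the mean, confirming $V(\mu) = \mu$ (a quick finite-difference sketch):

```python
import math

# Poisson in exponential-family form: theta = log(mu), b(theta) = exp(theta).
b = math.exp                       # cumulant function
theta = math.log(3.0)              # natural parameter for mu = 3

h = 1e-4
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)               # ≈ mu
b_double = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2  # ≈ V(mu)

print(b_prime, b_double)           # both ≈ 3.0, since V(mu) = mu for Poisson
```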

4.2 Major Exponential Family Distributions for GLMs

4.2.1 Normal Distribution

$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}$$

4.2.2 Binomial Distribution

$$f(y; n, p) = \binom{n}{y} p^y (1-p)^{n-y}, \quad y \in \{0, 1, \dots, n\}$$

4.2.3 Poisson Distribution

$$f(y; \lambda) = \frac{e^{-\lambda} \lambda^y}{y!}, \quad y \in \{0, 1, 2, \dots\}$$

4.2.4 Gamma Distribution

$$f(y; \alpha, \beta) = \frac{y^{\alpha-1} e^{-y\beta} \beta^\alpha}{\Gamma(\alpha)}, \quad y > 0$$

4.2.5 Inverse Gaussian Distribution

$$f(y; \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi y^3}} \exp\left\{-\frac{\lambda(y-\mu)^2}{2\mu^2 y}\right\}, \quad y > 0$$

4.3 The Negative Binomial Distribution

While the Negative Binomial is not a member of the exponential family in its most general form, it can be treated as a quasi-exponential family or as a Poisson-Gamma mixture:

$$f(y; \mu, k) = \frac{\Gamma(y+k)}{\Gamma(k)\, y!} \left(\frac{k}{k+\mu}\right)^k \left(\frac{\mu}{k+\mu}\right)^y, \quad y \in \{0, 1, 2, \dots\}$$

4.4 The Tweedie Distribution

The Tweedie distribution is a special case of the exponential dispersion model with power variance function $V(\mu) = \mu^p$:

| Power $p$ | Distribution |
|---|---|
| $p = 0$ | Normal |
| $p = 1$ | Poisson |
| $1 < p < 2$ | Compound Poisson-Gamma (supports exact zeros + positive values) |
| $p = 2$ | Gamma |
| $p = 3$ | Inverse Gaussian |

The Tweedie distribution with $1 < p < 2$ is particularly valuable in insurance (pure premium modelling) and ecology (biomass data) because it naturally accommodates data with a mass at zero and a continuous positive distribution for non-zero values.


5. Link Functions

The link function $g(\mu) = \eta$ is the bridge between the mean of the response distribution and the linear predictor. Choosing an appropriate link function is a key modelling decision.

5.1 Requirements for a Valid Link Function

A valid link function must be:

  1. Monotone: Strictly increasing or decreasing.
  2. Differentiable: $g'(\mu)$ must exist and be non-zero.
  3. Range-compatible: The range of $g^{-1}(\eta)$ must match the support of the mean $\mu$.

5.2 Commonly Used Link Functions

| Link Name | $g(\mu) = \eta$ | $\mu = g^{-1}(\eta)$ | Range of $\mu$ | Canonical For |
|---|---|---|---|---|
| Identity | $\mu$ | $\eta$ | $(-\infty, +\infty)$ | Normal |
| Log | $\ln(\mu)$ | $e^\eta$ | $(0, +\infty)$ | Poisson |
| Logit | $\ln\left(\frac{\mu}{1-\mu}\right)$ | $\frac{e^\eta}{1+e^\eta}$ | $(0, 1)$ | Binomial |
| Probit | $\Phi^{-1}(\mu)$ | $\Phi(\eta)$ | $(0, 1)$ | Binomial (alt) |
| Complementary log-log (cloglog) | $\ln(-\ln(1-\mu))$ | $1 - e^{-e^\eta}$ | $(0, 1)$ | Binomial (alt) |
| Inverse | $1/\mu$ | $1/\eta$ | $(0, +\infty)$ | Gamma |
| Inverse squared | $1/\mu^2$ | $1/\sqrt{\eta}$ | $(0, +\infty)$ | Inverse Gaussian |
| Square root | $\sqrt{\mu}$ | $\eta^2$ | $(0, +\infty)$ | Poisson (alt) |
| Negative log | $-\ln(\mu)$ | $e^{-\eta}$ | $(0, +\infty)$ | — |
| Log-log | $-\ln(-\ln(\mu))$ | $e^{-e^{-\eta}}$ | $(0, 1)$ | Binomial (alt) |

5.3 Logit Link (Binomial GLM)

$$g(\mu) = \text{logit}(\mu) = \ln\left(\frac{\mu}{1-\mu}\right)$$

5.4 Probit Link (Binomial GLM)

$$g(\mu) = \Phi^{-1}(\mu)$$

Where $\Phi^{-1}$ is the quantile function of the standard normal distribution.

5.5 Complementary Log-Log Link (Binomial GLM)

$$g(\mu) = \ln(-\ln(1-\mu))$$

5.6 Log Link (Poisson, Negative Binomial, Gamma GLM)

$$g(\mu) = \ln(\mu)$$

5.7 Inverse Link (Gamma GLM)

$$g(\mu) = \frac{1}{\mu}$$

5.8 Choosing the Link Function

| Scenario | Recommended Link |
|---|---|
| Binomial: symmetric probability, easy odds interpretation | Logit |
| Binomial: latent normal model assumed | Probit |
| Binomial: rare events, extreme probabilities | Complementary log-log |
| Binomial: log-linear probability model needed | Log (with care about the $\mu > 1$ constraint) |
| Poisson / Negative Binomial / Gamma: multiplicative effects | Log |
| Gamma: when inverse relationships are natural | Inverse |
| Normal / continuous unbounded | Identity |
| Positive continuous: when multiplicative effects expected | Log |

6. GLM Distributions and Their Applications

6.1 Binomial GLM (Logistic, Probit, Cloglog Regression)

Use when: Response is a binary outcome (0/1, yes/no, success/failure) or a proportion $y/n$ where both $y$ (successes) and $n$ (trials) are known.

Model:

$$Y_i \sim \text{Binomial}(n_i, \mu_i), \quad g(\mu_i) = \eta_i = \mathbf{x}_i^T \boldsymbol{\beta}$$

Default link: Logit.

Interpretation (logit link): $e^{\beta_j}$ is the odds ratio — the multiplicative change in the odds of success for a one-unit increase in $X_j$.

Special cases:

  - Bernoulli (binary) regression: $n_i = 1$ for every observation (standard logistic regression).
  - Grouped binomial regression: $y_i$ successes out of $n_i > 1$ trials per row.

Applications: Disease diagnosis, credit default, customer churn, election outcomes, clinical trial response rates.
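To make the odds-ratio interpretation concrete, here is a tiny sketch with an invented fitted model (the coefficients are assumptions for illustration, not output from any real fit):

```python
import math

# Hypothetical fitted logistic model: logit(p) = -1.0 + 0.8 * x (invented)
b0, b1 = -1.0, 0.8

def prob(x):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(x):
    p = prob(x)
    return p / (1 - p)

# The odds ratio for a one-unit increase in x is exp(b1), at any baseline x:
print(odds(1.0) / odds(0.0), math.exp(b1))   # both ≈ 2.2255
```

Note that the odds ratio is constant in $x$, while the change in probability itself is not; that is exactly what the logit link buys.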

6.2 Poisson GLM (Poisson Regression)

Use when: Response is a count of events that could in principle be any non-negative integer, arising from a process with a constant rate.

Model:

$$Y_i \sim \text{Poisson}(\mu_i), \quad \ln(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta} + \text{offset}_i$$

Default link: Log.

Interpretation (log link): $e^{\beta_j}$ is the rate ratio (or incidence rate ratio) — the multiplicative change in the expected count for a one-unit increase in $X_j$.

Key assumption: $E[Y_i] = \text{Var}(Y_i) = \mu_i$ (equidispersion). Violations lead to overdispersion (Section 13).

Applications: Number of accidents, hospital admissions, species counts, web page visits, insurance claims frequency.

6.3 Negative Binomial GLM

Use when: Response is a count variable with overdispersion (variance exceeds the mean) — the most common departure from Poisson assumptions.

Model:

$$Y_i \sim \text{NegBin}(\mu_i, k), \quad \ln(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta}$$

$$\text{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{k}$$

Interpretation: Same as Poisson (log link, rate ratios), but with an additional overdispersion parameter $k$ estimated from the data.

Applications: Same as Poisson but when the Poisson assumption of equidispersion is violated — common in ecology (species abundance), healthcare (hospitalisation counts), and insurance.

6.4 Gamma GLM

Use when: Response is continuous and strictly positive, with variance that increases proportionally to the square of the mean (coefficient of variation is roughly constant).

Model:

$$Y_i \sim \text{Gamma}(\mu_i, \phi), \quad g(\mu_i) = \eta_i$$

Common links: Log (most interpretable), inverse (canonical), identity.

Interpretation (log link): $e^{\beta_j}$ is the multiplicative change in the mean response per unit increase in $X_j$.

Applications: Insurance claim severity (cost per claim), income, hospital costs, reaction times, survival times (without censoring), environmental concentrations.

6.5 Inverse Gaussian GLM

Use when: Response is continuous, strictly positive, and highly right-skewed, with variance increasing as the cube of the mean — more extreme than Gamma.

Model:

$$Y_i \sim \text{InverseGaussian}(\mu_i, \phi), \quad g(\mu_i) = \eta_i$$

Common links: Inverse squared (canonical), log, inverse.

Applications: First-passage times, repair times, extreme claim sizes, some types of survival data.

6.6 Gaussian GLM (Standard Linear Regression)

Use when: Response is continuous, unbounded, approximately normally distributed, with constant variance.

Model:

$$Y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \mathbf{x}_i^T \boldsymbol{\beta}$$

This is identical to OLS with the identity link. Including it in the GLM framework confirms that linear regression is a special case of GLMs.

Applications: All classical linear regression applications.

6.7 Tweedie GLM

Use when: Response contains exact zeros mixed with positive continuous values (a "zero-inflated" continuous distribution), or when the appropriate power variance is uncertain.

Model:

$$Y_i \sim \text{Tweedie}(\mu_i, \phi, p), \quad \ln(\mu_i) = \mathbf{x}_i^T \boldsymbol{\beta}$$

The power $p \in (1, 2)$ is estimated from the data or set by domain knowledge.

Applications: Insurance pure premium (frequency × severity), rainfall amounts, ecological biomass, fisheries catch data.

6.8 Quasi-GLMs

When the distributional assumption is uncertain or violated, quasi-GLMs relax the full distributional assumption and specify only the mean and variance function:

$$E[Y_i] = \mu_i, \quad \text{Var}(Y_i) = \phi \cdot V(\mu_i)$$

The dispersion parameter $\phi$ is estimated from the data (not fixed at 1), providing valid inference even when the count data are overdispersed or underdispersed.

Common quasi-GLMs:

  - Quasi-Poisson: $V(\mu) = \mu$ with estimated $\phi$, for overdispersed counts.
  - Quasi-Binomial: $V(\mu) = \mu(1-\mu)$ with estimated $\phi$, for overdispersed proportions.

⚠️ Quasi-GLMs do not have a full likelihood, so AIC/BIC cannot be computed. Use deviance and F-tests for model comparison instead.

6.9 Summary of GLM Distributions

| Distribution | Response Type | Variance $V(\mu)$ | Default Link | Dispersion $\phi$ |
|---|---|---|---|---|
| Gaussian | Continuous, unbounded | $1$ | Identity | Estimated |
| Binomial | Binary / Proportions | $\mu(1-\mu)$ | Logit | Known ($=1$) |
| Poisson | Counts (integer $\geq 0$) | $\mu$ | Log | Known ($=1$) |
| Negative Binomial | Counts (overdispersed) | $\mu + \mu^2/k$ | Log | Estimated ($k$) |
| Gamma | Continuous, positive | $\mu^2$ | Log / Inverse | Estimated |
| Inverse Gaussian | Continuous, positive, skewed | $\mu^3$ | Inv. squared / Log | Estimated |
| Tweedie | Zero-inflated positive | $\mu^p$ | Log | Estimated ($p$, $\phi$) |
| Quasi-Poisson | Counts (overdispersed) | $\mu$ | Log | Estimated |
| Quasi-Binomial | Proportions (overdispersed) | $\mu(1-\mu)$ | Logit | Estimated |

7. Assumptions of GLMs

7.1 Correct Distributional Family

The chosen distribution must be appropriate for the type of response variable. For example:

  - Binary outcomes or proportions → Binomial.
  - Non-negative integer counts → Poisson or Negative Binomial.
  - Strictly positive continuous values → Gamma or Inverse Gaussian.
  - Unbounded continuous values → Gaussian.

How to check: Inspect the response variable's distribution (histogram, range), consider the data-generating process, and verify using residual diagnostics and goodness-of-fit tests.

7.2 Correct Link Function

The link function must be appropriate for the chosen distribution and the expected relationship between predictors and the mean.

How to check: Inspect residual plots; compare alternative link functions using AIC; use added-variable plots for the link function.

7.3 Linearity on the Link Scale

GLMs assume a linear relationship between the predictors $\mathbf{x}_i$ and the transformed mean $g(\mu_i)$:

$$g(\mu_i) = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip}$$

This means the relationship between each $X_j$ and $g(\mu)$ must be linear, even if the relationship between $X_j$ and $\mu$ itself is non-linear.

How to check: Partial residual plots (component-plus-residual plots); LOESS-smoothed plots of residuals vs. each predictor.

7.4 Independence of Observations

Observations must be independent of each other. Clustered, longitudinal, or spatial data may have within-group correlations that violate this assumption.

How to check: Consider the study design. For clustered data, use Generalised Estimating Equations (GEE) or mixed models (GLMM) instead.

7.5 Correct Specification of the Variance Function

The variance function must correctly describe how variability changes with the mean. Misspecification leads to:

  - Incorrect standard errors (typically too small when overdispersion is ignored).
  - Misleading $p$-values and confidence intervals with the wrong coverage.

How to check: Residual vs. fitted value plots; scale-location plots; Pearson $\chi^2$ / deviance tests for dispersion.

7.6 No Perfect Multicollinearity

As in OLS, perfect multicollinearity (one predictor is a perfect linear function of others) prevents estimation. Near-multicollinearity inflates standard errors.

How to check: Variance Inflation Factor (VIF); condition number of the design matrix.

7.7 No Complete Separation (for Binomial GLMs)

For logistic regression and other binomial GLMs, complete separation (a predictor or combination perfectly predicts the outcome) causes the MLE to diverge to $\pm\infty$.

How to check: Warning messages from the fitting algorithm; extremely large coefficient estimates with huge standard errors.

7.8 Sufficient Sample Size

GLM inference is based on asymptotic (large-sample) theory. The adequacy of asymptotic approximations depends on:

  - The total sample size relative to the number of parameters.
  - The expected counts (Poisson) or expected numbers of successes and failures (Binomial) per observation or group.

Small expected counts reduce the reliability of likelihood ratio tests, Wald tests, and residual diagnostics.


8. Parameter Estimation: Maximum Likelihood and IRLS

8.1 The Log-Likelihood for GLMs

For independent observations, the log-likelihood is:

$$\ell(\boldsymbol{\beta}; \mathbf{y}) = \sum_{i=1}^n \ell_i(\boldsymbol{\beta}; y_i) = \sum_{i=1}^n \left[\frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right]$$

Where $\theta_i = \theta(\mu_i)$ and $\mu_i = g^{-1}(\mathbf{x}_i^T \boldsymbol{\beta})$ depend on the regression coefficients $\boldsymbol{\beta}$ through the link function.

8.2 The Score Equations

Setting the gradient of the log-likelihood to zero gives the score equations (MLE conditions):

$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^n \frac{(y_i - \mu_i)}{a(\phi)\, V(\mu_i)} \frac{\partial \mu_i}{\partial \eta_i} x_{ij} = 0, \quad j = 0, 1, \dots, p$$

In matrix form:

$$\mathbf{X}^T \mathbf{W} \mathbf{G} (\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$$

Where $\mathbf{G} = \text{diag}(g'(\mu_i))$ and $\mathbf{W} = \text{diag}\left(w_i / \left(V(\mu_i)\, [g'(\mu_i)]^2\right)\right)$, so that $\mathbf{W}\mathbf{G}$ has diagonal entries $w_i (\partial \mu_i / \partial \eta_i) / V(\mu_i)$, matching the score equation above.

These equations are generally non-linear in $\boldsymbol{\beta}$ and require iterative solution.

8.3 Iteratively Reweighted Least Squares (IRLS)

The standard algorithm for fitting GLMs is Iteratively Reweighted Least Squares (IRLS), a Fisher scoring (Newton-Raphson-type) optimisation of the log-likelihood.

At each iteration $t$:

Step 1: Compute the adjusted dependent variable (working response):

$$z_i^{(t)} = \hat{\eta}_i^{(t)} + (y_i - \hat{\mu}_i^{(t)}) \left.\frac{d\eta_i}{d\mu_i}\right|_{\hat{\mu}_i^{(t)}} = \hat{\eta}_i^{(t)} + (y_i - \hat{\mu}_i^{(t)})\, g'(\hat{\mu}_i^{(t)})$$

Step 2: Compute the working weights:

$$w_i^{(t)} = \frac{w_i}{V(\hat{\mu}_i^{(t)}) \left[g'(\hat{\mu}_i^{(t)})\right]^2}$$

Step 3: Solve the weighted least squares problem:

$$\boldsymbol{\beta}^{(t+1)} = \left(\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}$$

Convergence: Repeat until $\|\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)}\| < \epsilon$ (e.g., $\epsilon = 10^{-8}$) or the change in deviance is negligible.

Starting values: Typically $\hat{\mu}_i^{(0)} = y_i + \delta$ (a small constant to avoid boundary issues) or the overall mean $\bar{y}$.
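The three steps above can be sketched end-to-end for the simplest interesting case: a Poisson GLM with log link and one predictor, where the weighted least-squares step reduces to a hand-solved 2x2 normal-equation system. The data are toys and this is a sketch, not production code:

```python
import math

def irls_poisson(xs, ys, tol=1e-10, max_iter=50):
    """IRLS for a Poisson GLM with log link: intercept + one slope."""
    b0, b1 = 0.0, 0.0                        # start at eta = 0, i.e. mu = 1
    for _ in range(max_iter):
        eta = [b0 + b1 * x for x in xs]
        mu = [math.exp(e) for e in eta]
        # Log link: g'(mu) = 1/mu and V(mu) = mu, so the working weight is mu
        w = mu
        # Working response: z = eta + (y - mu) * g'(mu)
        z = [e + (y - m) / m for e, m, y in zip(eta, mu, ys)]

        # Weighted least squares of z on (1, x): 2x2 normal equations
        sw   = sum(w)
        swx  = sum(wi * x for wi, x in zip(w, xs))
        swxx = sum(wi * x * x for wi, x in zip(w, xs))
        swz  = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * x * zi for wi, x, zi in zip(w, xs, z))
        det = sw * swxx - swx * swx
        nb0 = (swxx * swz - swx * swxz) / det
        nb1 = (sw * swxz - swx * swz) / det

        if abs(nb0 - b0) + abs(nb1 - b1) < tol:
            return nb0, nb1
        b0, b1 = nb0, nb1
    return b0, b1

xs = [0, 1, 2, 3, 4, 5]          # toy predictor
ys = [2, 2, 3, 4, 5, 8]          # toy counts (invented)
b0, b1 = irls_poisson(xs, ys)
mu_hat = [math.exp(b0 + b1 * x) for x in xs]
# At convergence the canonical-link score equations hold: sum(y - mu) ≈ 0
print(b0, b1, sum(y - m for y, m in zip(ys, mu_hat)))
```

The check at the end is the canonical-link property from Section 8.2: at the MLE, the weighted residuals are orthogonal to every column of the design matrix.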

8.4 The Fisher Information Matrix and Standard Errors

At convergence, the Fisher information matrix evaluated at $\hat{\boldsymbol{\beta}}$ is:

$$\mathcal{I}(\hat{\boldsymbol{\beta}}) = \mathbf{X}^T \hat{\mathbf{W}} \mathbf{X}$$

Where $\hat{\mathbf{W}}$ is the weight matrix evaluated at $\hat{\boldsymbol{\beta}}$. The asymptotic covariance matrix of $\hat{\boldsymbol{\beta}}$ is:

$$\text{Cov}(\hat{\boldsymbol{\beta}}) = \phi \left(\mathbf{X}^T \hat{\mathbf{W}} \mathbf{X}\right)^{-1}$$

The standard error of $\hat{\beta}_j$:

$$SE(\hat{\beta}_j) = \sqrt{\hat{\phi} \left[\left(\mathbf{X}^T \hat{\mathbf{W}} \mathbf{X}\right)^{-1}\right]_{jj}}$$

For known-dispersion models (Poisson, Binomial with $\phi = 1$):

$$SE(\hat{\beta}_j) = \sqrt{\left[\left(\mathbf{X}^T \hat{\mathbf{W}} \mathbf{X}\right)^{-1}\right]_{jj}}$$

8.5 Estimating the Dispersion Parameter

For distributions with estimated dispersion (Normal, Gamma, Inverse Gaussian), $\phi$ is estimated after obtaining $\hat{\boldsymbol{\beta}}$:

Method of Moments (Pearson $\chi^2$):

$$\hat{\phi}_{\text{Pearson}} = \frac{1}{n - p - 1} \sum_{i=1}^n \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}$$

Maximum Likelihood / Deviance Estimator:

$$\hat{\phi}_{\text{deviance}} = \frac{D(\mathbf{y}, \hat{\boldsymbol{\mu}})}{n - p - 1}$$

Where $D(\mathbf{y}, \hat{\boldsymbol{\mu}})$ is the residual deviance (see Section 9).

💡 The Pearson $\chi^2$ estimator of $\phi$ is generally preferred for its robustness. For Poisson and Binomial, $\phi = 1$ is known; if the Pearson estimator gives $\hat{\phi} \gg 1$, overdispersion is present (Section 13).
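A minimal dispersion check for a Poisson fit (the fitted means below are invented for illustration): compute the Pearson estimate and compare it with 1.

```python
# Pearson dispersion estimate; mu_hat values are assumed toy fitted means.
ys     = [2, 2, 3, 4, 5, 8]
mu_hat = [1.9, 2.4, 3.1, 4.0, 5.1, 6.6]
n, p = len(ys), 1                          # one slope plus an intercept

phi = sum((y - m) ** 2 / m for y, m in zip(ys, mu_hat)) / (n - p - 1)
print(phi)        # values well above 1 would signal overdispersion
```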


9. Model Fit and Evaluation

9.1 The Deviance

The deviance $D(\mathbf{y}, \hat{\boldsymbol{\mu}})$ is the primary goodness-of-fit measure for GLMs. It is defined as twice the log-likelihood difference between the saturated model (perfect fit, one parameter per observation) and the fitted model:

$$D(\mathbf{y}, \hat{\boldsymbol{\mu}}) = 2\left[\ell(\text{saturated}) - \ell(\hat{\boldsymbol{\mu}})\right] = \sum_{i=1}^n d_i$$

Where the deviance contribution $d_i$ of observation $i$ is twice its contribution to the log-likelihood difference; for Poisson:

$$d_i = 2\left[y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i)\right]$$

Deviance for each distribution:

| Distribution | Deviance Contribution $d_i$ |
|---|---|
| Gaussian | $(y_i - \hat{\mu}_i)^2$ |
| Binomial | $2\left[y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) + (n_i-y_i)\ln\left(\frac{n_i-y_i}{n_i-\hat{\mu}_i}\right)\right]$ |
| Poisson | $2\left[y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i)\right]$ |
| Gamma | $2\left[-\ln\left(\frac{y_i}{\hat{\mu}_i}\right) + \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i}\right]$ |
| Inverse Gaussian | $\frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i^2 y_i}$ |

9.2 Null Deviance and Residual Deviance

The null deviance $D_0$ is the deviance of the null model (intercept only):

$$D_0 = D(\mathbf{y}, \hat{\mu}_0) \quad \text{where } \hat{\mu}_0 = \bar{y}$$

The residual deviance $D_r$ is the deviance of the fitted model:

$$D_r = D(\mathbf{y}, \hat{\boldsymbol{\mu}})$$

The difference $D_0 - D_r$ measures how much the predictors have reduced the unexplained deviance — analogous to the regression sum of squares in linear regression.

For known-dispersion models (Poisson, Binomial), the residual deviance approximately follows $\chi^2_{n-p-1}$ when the model is correct. A residual deviance much larger than $n - p - 1$ suggests poor fit or overdispersion.
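The Poisson deviance formula can be computed directly; a small sketch (toy data) showing that the saturated fit has zero deviance while a cruder constant fit does not:

```python
import math

def poisson_deviance(ys, mus):
    """Residual deviance for a Poisson fit (0 * log(0) treated as 0)."""
    d = 0.0
    for y, m in zip(ys, mus):
        term = y * math.log(y / m) if y > 0 else 0.0
        d += 2 * (term - (y - m))
    return d

ys = [2.0, 3.0, 5.0]
print(poisson_deviance(ys, ys))              # saturated fit: deviance 0
print(poisson_deviance(ys, [3.0, 3.0, 3.0])) # a cruder constant fit: > 0
```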

9.3 The Pearson χ2\chi^2 Statistic

The Pearson χ2\chi^2 statistic provides an alternative goodness-of-fit measure:

X2=i=1n(yiμ^i)2V(μ^i)/wiX^2 = \sum_{i=1}^n \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)/w_i}

Under the correct model with known dispersion, X2χnp12X^2 \approx \chi^2_{n-p-1} for large samples. The Pearson dispersion estimate is ϕ^=X2/(np1)\hat{\phi} = X^2 / (n - p - 1).

9.4 Pseudo R² Measures

Since ordinary $R^2$ is not directly meaningful for non-Gaussian GLMs, several pseudo R² measures have been developed:

McFadden's Pseudo R²:

$$R^2_{\text{McFadden}} = 1 - \frac{\ell(\hat{\boldsymbol{\mu}})}{\ell(\hat{\mu}_0)}$$

Cox-Snell Pseudo R²:

$$R^2_{\text{CS}} = 1 - \left(\frac{L_0}{L_{\hat{\boldsymbol{\mu}}}}\right)^{2/n}$$

Nagelkerke Pseudo R² (scaled to reach a maximum of 1):

$$R^2_{\text{N}} = \frac{R^2_{\text{CS}}}{1 - L_0^{2/n}}$$

Deviance R² (common in the GLM literature):

$$R^2_D = 1 - \frac{D_r}{D_0}$$

The deviance R² coincides with McFadden's R² whenever the saturated log-likelihood is zero (e.g., ungrouped binary logistic regression); in general the two differ slightly.

Interpretation of McFadden's $R^2$ for GLMs:

| $R^2_{\text{McFadden}}$ | Interpretation |
|---|---|
| $0.00 - 0.10$ | Poor fit |
| $0.10 - 0.20$ | Acceptable fit |
| $0.20 - 0.30$ | Good fit |
| $0.30 - 0.40$ | Very good fit |
| $> 0.40$ | Excellent fit |
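With the null and residual deviances in hand, the deviance R² is a one-liner (the numbers below are invented):

```python
# Deviance (McFadden-style) pseudo R^2 from toy null and residual deviances:
D0, Dr = 120.0, 84.0
r2 = 1 - Dr / D0
print(r2)   # ≈ 0.30: "very good fit" by the rule-of-thumb table above
```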

9.5 AIC and BIC

Akaike Information Criterion:

$$AIC = -2\ell(\hat{\boldsymbol{\mu}}) + 2(p+1)$$

Bayesian Information Criterion:

$$BIC = -2\ell(\hat{\boldsymbol{\mu}}) + (p+1)\ln(n)$$

Where $p + 1$ is the number of estimated regression parameters (including the intercept). For models where $\phi$ is estimated, include it as an additional parameter.

Lower AIC/BIC indicates a better model (adjusted for complexity). AIC favours predictive accuracy; BIC imposes a stronger penalty for complexity and prefers parsimonious models.

⚠️ AIC and BIC require a proper likelihood. They cannot be computed for quasi-GLMs, which use a pseudo-likelihood. For quasi-models, use the F-test for model comparison.
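Both criteria are direct formulas; a quick sketch with an invented log-likelihood and parameter count:

```python
import math

# AIC and BIC from a log-likelihood and a parameter count (invented numbers):
def aic(loglik, n_params):
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n_obs):
    return -2 * loglik + n_params * math.log(n_obs)

ll, k, n = -104.2, 4, 200
print(aic(ll, k), bic(ll, k, n))
# BIC's penalty k*ln(n) exceeds AIC's 2k whenever n is above e^2 ≈ 7.4.
```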


10. Hypothesis Testing and Inference

10.1 Wald Tests for Individual Coefficients

For each coefficient $\beta_j$, the Wald test tests $H_0: \beta_j = 0$:

$$z_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim \mathcal{N}(0, 1) \quad \text{(approximately, for large } n\text{)}$$

Two-sided p-value:

$$p\text{-value} = 2 \times P(Z > |z_j|) = 2 \times (1 - \Phi(|z_j|))$$

A $(1-\alpha) \times 100\%$ Wald confidence interval for $\beta_j$:

$$\hat{\beta}_j \pm z_{\alpha/2} \times SE(\hat{\beta}_j)$$

For the effect on the original response scale (with a log or logit link), exponentiate the endpoints:

$$\left[e^{\hat{\beta}_j - z_{\alpha/2} SE(\hat{\beta}_j)}, \; e^{\hat{\beta}_j + z_{\alpha/2} SE(\hat{\beta}_j)}\right]$$
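A sketch of the Wald interval computation (the estimate and standard error are invented for illustration), including the exponentiated odds-ratio-scale interval:

```python
import math

# 95% Wald interval for a coefficient, then exponentiated (invented inputs):
beta_hat, se = 0.8, 0.25
z = 1.959964                     # standard-normal 0.975 quantile

lo, hi = beta_hat - z * se, beta_hat + z * se
or_lo, or_hi = math.exp(lo), math.exp(hi)
print((lo, hi), (or_lo, or_hi))
# The odds-ratio interval excludes 1, consistent with Wald z = 0.8/0.25 = 3.2
```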

10.2 Likelihood Ratio Test (LRT)

The likelihood ratio test compares two nested models: a smaller (restricted) model M0M_0 and a larger (full) model M1M_1:

Λ=2[(M1)(M0)]=D(M0)D(M1)\Lambda = 2\left[\ell(M_1) - \ell(M_0)\right] = D(M_0) - D(M_1)

Under H0H_0 (the restrictions hold), Λχdf2\Lambda \sim \chi^2_{df} where dfdf is the difference in the number of parameters between M1M_1 and M0M_0.

For testing a single coefficient (H0:βj=0H_0: \beta_j = 0), df=1df = 1. For testing a group of qq coefficients jointly, df=qdf = q.

💡 The LRT is generally preferred over the Wald test for GLMs because it is more accurate in small samples and avoids the Wald test's known deficiencies (e.g., the Hauck-Donner effect, where Wald zz-values can decrease for very large effects).
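Since the LRT statistic is simply the drop in deviance between the two fits, the computation is one line of arithmetic; this sketch assumes `scipy` is available for the $\chi^2$ tail probability:

```python
from scipy.stats import chi2

def lrt(deviance_restricted: float, deviance_full: float, df: int):
    """Likelihood ratio test from two nested-model deviances.

    The statistic is the drop in deviance; df is the number of
    restrictions (extra parameters in the full model)."""
    stat = deviance_restricted - deviance_full
    return stat, chi2.sf(stat, df)

# Illustrative: null deviance 641.3 vs. residual deviance 572.4 with 4 predictors
stat, p = lrt(641.3, 572.4, df=4)
```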

10.3 Score Test (Rao Test)

The score test evaluates whether the gradient of the log-likelihood (the score) is significantly different from zero at the restricted (null) parameter values:

S=s(β^0)TI1(β^0)s(β^0)χdf2S = \mathbf{s}(\hat{\boldsymbol{\beta}}_0)^T \mathcal{I}^{-1}(\hat{\boldsymbol{\beta}}_0) \mathbf{s}(\hat{\boldsymbol{\beta}}_0) \sim \chi^2_{df}

The score test only requires fitting the null model (not the full model), making it computationally convenient when the null model is much simpler.

10.4 Analysis of Deviance Table

The analysis of deviance is the GLM analogue of the ANOVA table in linear regression. It sequentially adds predictors and reports the reduction in deviance:

| Source | Df | Deviance | Residual Df | Residual Deviance | $p$-value |
|---|---|---|---|---|---|
| Null model | — | — | $n-1$ | $D_0$ | — |
| $X_1$ | 1 | $D_0 - D_1$ | $n-2$ | $D_1$ | $P(\chi^2_1 > D_0 - D_1)$ |
| $X_2 \mid X_1$ | 1 | $D_1 - D_2$ | $n-3$ | $D_2$ | $P(\chi^2_1 > D_1 - D_2)$ |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| $X_p \mid \text{rest}$ | 1 | $D_{p-1} - D_r$ | $n-p-1$ | $D_r$ | $P(\chi^2_1 > D_{p-1} - D_r)$ |

For overdispersed models (quasi-GLMs), use an F-test instead of the χ2\chi^2 test, dividing the deviance change by the estimated dispersion ϕ^\hat{\phi}:

F=(D0Dr)/pϕ^Fp,np1F = \frac{(D_0 - D_r)/p}{\hat{\phi}} \sim F_{p, n-p-1}

10.5 Confidence Intervals for the Mean Response

A confidence interval for the mean response μnew\mu_{new} at a new predictor vector xnew\mathbf{x}_{new} is constructed on the linear predictor scale (where the asymptotic normality applies) and back-transformed:

Linear predictor and its SE:

η^new=xnewTβ^,SE(η^new)=ϕ^xnewT(XTW^X)1xnew\hat{\eta}_{new} = \mathbf{x}_{new}^T \hat{\boldsymbol{\beta}}, \quad SE(\hat{\eta}_{new}) = \sqrt{\hat{\phi} \cdot \mathbf{x}_{new}^T (\mathbf{X}^T \hat{\mathbf{W}} \mathbf{X})^{-1} \mathbf{x}_{new}}

Confidence interval on the η\eta scale:

η^new±zα/2×SE(η^new)\hat{\eta}_{new} \pm z_{\alpha/2} \times SE(\hat{\eta}_{new})

Back-transform to the μ\mu scale using g1g^{-1}:

[g1(η^newzα/2×SE(η^new)),  g1(η^new+zα/2×SE(η^new))]\left[g^{-1}\left(\hat{\eta}_{new} - z_{\alpha/2} \times SE(\hat{\eta}_{new})\right), \; g^{-1}\left(\hat{\eta}_{new} + z_{\alpha/2} \times SE(\hat{\eta}_{new})\right)\right]

💡 Constructing confidence intervals on the link scale and back-transforming (rather than constructing them directly on the μ\mu scale) ensures the bounds respect the natural constraints of μ\mu (e.g., positivity for Poisson/Gamma, [0,1][0,1] for Binomial).
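A minimal sketch of this back-transformation for a log-link model (the linear-predictor value and its standard error below are illustrative inputs, and `mean_response_ci` is a hypothetical helper name):

```python
import math

def mean_response_ci(eta_hat: float, se_eta: float, inv_link, z_crit: float = 1.96):
    """CI for the mean response: build the interval on the linear-predictor
    scale, then map both endpoints through the inverse link g^{-1}."""
    lo = inv_link(eta_hat - z_crit * se_eta)
    hi = inv_link(eta_hat + z_crit * se_eta)
    return lo, hi

# Log link: both endpoints are guaranteed positive because exp(.) > 0.
lo, hi = mean_response_ci(-2.702, 0.25, math.exp)
```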

10.6 Profile Likelihood Confidence Intervals

Profile likelihood confidence intervals are more accurate than Wald intervals, especially in small samples or when the likelihood is asymmetric:

CIprofile={βj:2[(β^)(β^(βj))]χ1,α2}CI_{profile} = \left\{\beta_j : 2[\ell(\hat{\boldsymbol{\beta}}) - \ell(\hat{\boldsymbol{\beta}}_{(\beta_j)})] \leq \chi^2_{1, \alpha}\right\}

Where β^(βj)\hat{\boldsymbol{\beta}}_{(\beta_j)} is the MLE with βj\beta_j fixed at a test value. The DataStatPro application computes both Wald and profile likelihood CIs.


11. Model Diagnostics and Residuals

11.1 Types of GLM Residuals

Unlike linear regression, which has a single natural residual yiμ^iy_i - \hat{\mu}_i, GLMs have several types of residuals, each useful for different diagnostic purposes.

11.1.1 Raw (Response) Residuals

riraw=yiμ^ir_i^{raw} = y_i - \hat{\mu}_i

Simple but not standardised — larger values of μi\mu_i tend to produce larger raw residuals even if the fit is equally good.

11.1.2 Pearson Residuals

riP=yiμ^iV(μ^i)/wir_i^P = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)/w_i}}

Standardised by the expected standard deviation under the model. Pearson residuals should be approximately N(0,1)\mathcal{N}(0, 1) for large samples if the model is correct.

Standardised Pearson residuals (adjusted for leverage):

riPS=riPϕ^(1hii)r_i^{PS} = \frac{r_i^P}{\sqrt{\hat{\phi}(1 - h_{ii})}}

Where hiih_{ii} is the leverage (hat matrix diagonal). Values riPS>2|r_i^{PS}| > 2 warrant investigation.

11.1.3 Deviance Residuals

riD=sign(yiμ^i)dir_i^D = \text{sign}(y_i - \hat{\mu}_i) \sqrt{d_i}

Where did_i is the deviance contribution of observation ii (see Section 9.1). The sum of squared deviance residuals equals the total deviance: i(riD)2=Dr\sum_i (r_i^D)^2 = D_r.

Deviance residuals are generally preferred for normality assessments because they are closer to normally distributed than Pearson residuals in many GLMs.

Standardised deviance residuals:

riDS=riDϕ^(1hii)r_i^{DS} = \frac{r_i^D}{\sqrt{\hat{\phi}(1 - h_{ii})}}

11.1.4 Anscombe Residuals

Anscombe residuals are constructed using a variance-stabilising transformation A(μ)A(\mu) chosen so that residuals are approximately normally distributed:

riA=A(yi)A(μ^i)A(μ^i)V(μ^i)/wir_i^A = \frac{A(y_i) - A(\hat{\mu}_i)}{A'(\hat{\mu}_i)\sqrt{V(\hat{\mu}_i)/w_i}}

The Anscombe transformation for each distribution:

| Distribution | $A(\mu)$ |
|---|---|
| Normal | $\mu$ |
| Poisson | $\frac{3}{2}\mu^{2/3}$ |
| Binomial | $\arcsin\left(\sqrt{y/n}\right)$ (approximately) |
| Gamma | $3\mu^{1/3}$ |
| Inverse Gaussian | $\ln\mu$ |

11.1.5 Quantile (Randomised) Residuals

Quantile residuals (Dunn & Smyth, 1996) are defined as:

riQ=Φ1(ui)r_i^Q = \Phi^{-1}(u_i)

Where ui=F(yi;μ^i,ϕ^)u_i = F(y_i; \hat{\mu}_i, \hat{\phi}) is the cumulative probability of the observed value under the fitted model. For discrete distributions, uiu_i is drawn uniformly from [F(yi;μ^i),F(yi;μ^i)][F(y_i^{-}; \hat{\mu}_i), F(y_i; \hat{\mu}_i)] (randomised).

Quantile residuals are exactly normally distributed (by construction) when the model is correct, making them the gold standard for GLM diagnostics. They are particularly useful for discrete distributions (Poisson, Binomial, Negative Binomial) where other residuals are not well-approximated by a normal distribution.
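A sketch of randomised quantile residuals for a Poisson fit, assuming `numpy` and `scipy` are available (`quantile_residuals_poisson` is an illustrative helper name, not an API of the application):

```python
import numpy as np
from scipy.stats import poisson, norm

def quantile_residuals_poisson(y, mu, rng=None):
    """Randomised quantile residuals for a Poisson fit (Dunn & Smyth, 1996).

    For each observation, draw u uniformly between F(y - 1) and F(y)
    under the fitted Poisson, then map through the normal quantile."""
    rng = np.random.default_rng() if rng is None else rng
    y, mu = np.asarray(y), np.asarray(mu)
    lower = poisson.cdf(y - 1, mu)   # F(y^-): CDF just below the observed value
    upper = poisson.cdf(y, mu)       # F(y)
    u = rng.uniform(lower, upper)    # randomisation for the discrete jump
    return norm.ppf(u)

rng = np.random.default_rng(42)
y = rng.poisson(3.0, size=500)
r = quantile_residuals_poisson(y, np.full(500, 3.0), rng)
# When the model is correct (as here), the residuals are exactly N(0, 1).
```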

11.2 Leverage, Influence, and Cook's Distance

Hat matrix (leverage): For GLMs, the hat matrix is:

H=W^1/2X(XTW^X)1XTW^1/2\mathbf{H} = \hat{\mathbf{W}}^{1/2} \mathbf{X} (\mathbf{X}^T \hat{\mathbf{W}} \mathbf{X})^{-1} \mathbf{X}^T \hat{\mathbf{W}}^{1/2}

The diagonal elements hiih_{ii} are the leverages — the influence of observation ii on its own fitted value. High leverage (hii>2(p+1)/nh_{ii} > 2(p+1)/n) indicates an observation with unusual predictor values.

Cook's Distance: Measures the influence of observation ii on all fitted values:

Ci=(riPS)2hii(p+1)(1hii)C_i = \frac{(r_i^{PS})^2 h_{ii}}{(p+1)(1 - h_{ii})}

Values Ci>1C_i > 1 (or Ci>4/nC_i > 4/n) suggest influential observations.

DFBETA: Change in coefficient estimates when observation ii is excluded:

DFBETAij=β^jβ^j(i)DFBETA_{ij} = \hat{\beta}_j - \hat{\beta}_j^{(-i)}
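These influence measures follow directly from the weighted hat matrix; a numpy sketch (with `glm_influence` as a hypothetical helper, and random normals standing in for the Pearson residuals of a real fit):

```python
import numpy as np

def glm_influence(X, W, pearson_resid, phi=1.0):
    """Leverages and Cook's distances from the GLM hat matrix
    H = W^{1/2} X (X' W X)^{-1} X' W^{1/2}, with W the working weights."""
    Wsqrt = np.sqrt(W)
    Xw = Wsqrt[:, None] * X
    H = Xw @ np.linalg.inv(X.T @ (W[:, None] * X)) @ Xw.T
    h = np.diag(H)                                      # leverages h_ii
    p1 = X.shape[1]                                     # number of parameters p + 1
    r_std = pearson_resid / np.sqrt(phi * (1.0 - h))    # standardised Pearson residuals
    cooks = r_std**2 * h / (p1 * (1.0 - h))             # Cook's distances
    return h, cooks

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
resid = rng.normal(size=50)                             # stand-in residuals for demonstration
h, cooks = glm_influence(X, np.ones(50), resid)
# The leverages always sum to the number of fitted parameters.
```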

11.3 Diagnostic Plots

A comprehensive GLM diagnostic assessment includes the following plots:

| Plot | What to Look For |
|---|---|
| Residuals vs. Fitted values | No pattern; random scatter around zero |
| Scale-Location (√\|residuals\| vs. Fitted) | Horizontal band; no trend (homoscedasticity) |
| Normal Q-Q of residuals | Points near the diagonal line (normality) |
| Residuals vs. Leverage | No high-leverage + high-residual points |
| Cook's Distance | No observations with $C_i > 1$ |
| Added Variable Plots | Linear relationship on the link scale |
| Partial Residual Plots | Detect non-linearity in individual predictors |
| Index Plot of Deviance Residuals | Identify outliers by observation number |

11.4 Goodness-of-Fit Tests

Hosmer-Lemeshow Test (for Binomial GLM): Groups observations into G=10G = 10 deciles of fitted probabilities and tests observed vs. expected event counts:

χHL2=g=1G(OgEg)2Eg(1Eg/ng)χG22\chi^2_{HL} = \sum_{g=1}^G \frac{(O_g - E_g)^2}{E_g(1 - E_g/n_g)} \sim \chi^2_{G-2}

A non-significant result (p>0.05p > 0.05) indicates adequate calibration.

Deviance Goodness-of-Fit Test (for Poisson/Binomial):

Drχnp12(approximately, for large expected counts)D_r \sim \chi^2_{n-p-1} \quad \text{(approximately, for large expected counts)}

A significant deviance (p<0.05p < 0.05) may indicate lack of fit, overdispersion, or missing covariates.

Pearson χ2\chi^2 Goodness-of-Fit Test:

X2χnp12X^2 \sim \chi^2_{n-p-1}

Similar interpretation to the deviance test. For sparse data, X2X^2 may be more reliable than DrD_r.


12. Model Selection and Variable Selection

12.1 Nested Model Comparison via LRT

To compare two nested models M0M1M_0 \subset M_1:

Λ=D(M0)D(M1)χdfM0dfM12\Lambda = D(M_0) - D(M_1) \sim \chi^2_{df_{M_0} - df_{M_1}}

For quasi-GLMs (estimated dispersion):

F = \frac{(D(M_0) - D(M_1))/(df_{M_0} - df_{M_1})}{\hat{\phi}} \sim F_{df_{M_0} - df_{M_1},\, n-p_{M_1}-1}

12.2 AIC-Based Model Selection

For non-nested models or exploratory model building, use AIC:

AIC=2(μ^)+2kAIC = -2\ell(\hat{\boldsymbol{\mu}}) + 2k

Select the model with the lowest AIC. A difference ΔAIC>2\Delta AIC > 2 is considered meaningful; ΔAIC>10\Delta AIC > 10 is strong evidence for the lower-AIC model.

12.3 Stepwise Variable Selection

Forward selection: Start with the null model; add the variable that most reduces AIC at each step; stop when no addition improves AIC.

Backward elimination: Start with the full model; remove the variable that least increases AIC (i.e., most reduces AIC) at each step; stop when no removal improves AIC.

Bidirectional stepwise: Combine forward and backward; at each step, consider both additions and removals.

⚠️ Stepwise selection using p-values suffers from multiple testing inflation and instability. AIC-based stepwise is preferred. Neither should be used as a substitute for theory-driven model building.
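Forward selection by AIC is easy to express generically. In the sketch below, `aic_of` is a hypothetical callback that fits a GLM on a given variable set and returns its AIC — here stubbed with a toy lookup table so the control flow can be followed end to end:

```python
def forward_select(candidates, aic_of):
    """Greedy forward selection: at each step add the variable that most
    reduces AIC; stop when no addition improves on the current best."""
    selected, best_aic = [], aic_of([])
    while True:
        trials = [(aic_of(selected + [v]), v) for v in candidates if v not in selected]
        if not trials:
            break
        aic, var = min(trials)
        if aic >= best_aic:          # no addition improves AIC: stop
            break
        selected.append(var)
        best_aic = aic
    return selected, best_aic

# Toy AIC surface: x1 and x2 improve the model, x3 does not.
toy = {(): 100.0, ("x1",): 90.0, ("x2",): 95.0, ("x3",): 101.0,
       ("x1", "x2"): 85.0, ("x1", "x3"): 91.0, ("x2", "x3"): 96.0,
       ("x1", "x2", "x3"): 86.0}
sel, aic = forward_select(["x1", "x2", "x3"], lambda vs: toy[tuple(sorted(vs))])
```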

12.4 Handling Categorical Predictors

Categorical predictors with kk categories are encoded as k1k-1 dummy variables using a reference category. For a categorical variable "Group" with categories A (reference), B, and C:

DB=1[Group=B],DC=1[Group=C]D_B = \mathbf{1}[\text{Group} = B], \quad D_C = \mathbf{1}[\text{Group} = C]

The model becomes:

g(μi)=β0+βBDB,i+βCDC,i+g(\mu_i) = \beta_0 + \beta_B D_{B,i} + \beta_C D_{C,i} + \dots

eβBe^{\beta_B} (for log link) is the ratio of the mean for group B relative to the reference group A.

Testing all categories jointly (LRT):

Drop all k1k-1 dummy variables simultaneously and compare to the full model:

Λ=D(Mgroup)D(Mfull)χk12\Lambda = D(M_{-group}) - D(M_{full}) \sim \chi^2_{k-1}

12.5 Interaction Terms

Interactions model situations where the effect of one predictor on the response depends on the value of another predictor:

g(μi)=β0+β1Xi1+β2Xi2+β12Xi1Xi2g(\mu_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_{12} X_{i1} X_{i2}

In a log-link model: eβ12e^{\beta_{12}} is the multiplicative modification to the rate ratio of X1X_1 for each unit increase in X2X_2.

Test whether the interaction term is needed using the LRT with df=1df = 1 (or more for categorical interactions).

12.6 Polynomial and Spline Terms

For non-linear relationships on the link scale, include polynomial or spline terms:

Polynomial: g(μi)=β0+β1Xi+β2Xi2+g(\mu_i) = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \dots

Natural Cubic Spline: Replaces XX with a set of basis functions that allow flexible non-linear fitting while remaining linear at the extremes. The number of knots controls flexibility.

LOESS Smoothed Partial Residual Plot: Helps identify non-linearity — if the LOESS curve departs substantially from a straight line, a polynomial or spline term may be needed.


13. Overdispersion and Underdispersion

13.1 What is Overdispersion?

Overdispersion occurs when the observed variance in the data exceeds the variance predicted by the model. It is most commonly encountered with Poisson and Binomial GLMs.

For Poisson: overdispersion means Var(Yi)>E[Yi]=μi\text{Var}(Y_i) > E[Y_i] = \mu_i. For Binomial: overdispersion means Var(Yi)>niμi(1μi)\text{Var}(Y_i) > n_i\mu_i(1-\mu_i).

Consequences of ignoring overdispersion:

13.2 Detecting Overdispersion

Informal check: Compute the ratio:

ϕ^=X2np1=i(yiμ^i)2/V(μ^i)np1\hat{\phi} = \frac{X^2}{n - p - 1} = \frac{\sum_i (y_i - \hat{\mu}_i)^2 / V(\hat{\mu}_i)}{n - p - 1}

Formal test: Test H0:ϕ=1H_0: \phi = 1 using:

χ2=X2χnp12under H0\chi^2 = X^2 \sim \chi^2_{n-p-1} \quad \text{under } H_0

A significant result (p<0.05p < 0.05) confirms overdispersion.

Cameron-Trivedi test: Regresses (yiμ^i)2yi(y_i - \hat{\mu}_i)^2 - y_i on μ^i2\hat{\mu}_i^2 (for Poisson) and tests whether the slope is significantly different from zero.
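The informal dispersion check, and the quasi-GLM standard-error inflation it motivates (Section 13.4.1), amount to two lines of arithmetic; the Pearson statistic and coefficient SE below are illustrative inputs:

```python
def dispersion_estimate(pearson_chi2: float, n: int, p: int) -> float:
    """Pearson-based dispersion estimate: X^2 / (n - p - 1).

    Values well above 1 signal overdispersion; a quasi-GLM inflates
    every standard error by sqrt(phi)."""
    return pearson_chi2 / (n - p - 1)

# Illustrative: X^2 = 591.3 with n = 500 observations and p = 4 predictors
phi = dispersion_estimate(591.3, 500, 4)
se_quasi = phi ** 0.5 * 0.112   # quasi-GLM SE for a coefficient with standard SE 0.112
```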

13.3 Causes of Overdispersion

| Cause | Description |
|---|---|
| Unobserved heterogeneity | Unmeasured variables cause variation in the true rate across observations |
| Clustering / correlation | Observations within groups are not independent |
| Zero inflation | More zeros than expected under Poisson/Binomial (see Section 13.5) |
| Contagion | One event increases the probability of subsequent events (positive feedback) |
| Model misspecification | Wrong distributional family, missing covariates, wrong link function |
| Outliers | One or a few extreme observations inflate the apparent variance |

13.4 Solutions for Overdispersion

13.4.1 Quasi-GLM (Quasi-Poisson / Quasi-Binomial)

The simplest fix: Estimate ϕ\phi from the data and use it to inflate all standard errors:

SEquasi(β^j)=ϕ^×SEstandard(β^j)SE_{quasi}(\hat{\beta}_j) = \sqrt{\hat{\phi}} \times SE_{standard}(\hat{\beta}_j)

The coefficient estimates are identical to the standard GLM; only the standard errors, test statistics, and confidence intervals change. Use FF-tests instead of χ2\chi^2 tests for model comparison.

When to use: When overdispersion is mild to moderate and no specific mechanism is known.

13.4.2 Negative Binomial Regression

Models overdispersion via an additional parameter kk (the dispersion parameter):

Var(Yi)=μi+μi2k\text{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{k}

As kk \to \infty, the Negative Binomial → Poisson. A significant improvement in fit over Poisson (LRT test for kk) confirms overdispersion.

When to use: When counts are overdispersed and overdispersion follows a Gamma-mixture structure (i.e., unobserved heterogeneity).

13.4.3 Zero-Inflated Models

When excess zeros are the source of overdispersion, zero-inflated models combine:

  1. A binary model for whether the count is structurally zero (e.g., logistic regression).
  2. A count model for the actual count given it is non-zero (e.g., Poisson or Negative Binomial).

Zero-Inflated Poisson (ZIP):

P(Yi=0)=πi+(1πi)eμiP(Y_i = 0) = \pi_i + (1-\pi_i)e^{-\mu_i} P(Yi=y)=(1πi)eμiμiyy!,y>0P(Y_i = y) = (1-\pi_i) \frac{e^{-\mu_i}\mu_i^y}{y!}, \quad y > 0

Where:

Zero-Inflated Negative Binomial (ZINB): Combines structural zeros with a Negative Binomial count process.

Hurdle Models: Similar to zero-inflated models but use a different two-part structure — a binary process for zero vs. positive, and a truncated count model for positive values.

Vuong Test: Compares a standard Poisson/NB model against its zero-inflated counterpart. A significant positive test statistic favours the zero-inflated model.
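The ZIP probability mass function above translates directly into code; a small sketch (with illustrative parameter values) that also confirms the probabilities sum to one:

```python
import math

def zip_pmf(y: int, mu: float, pi: float) -> float:
    """Zero-inflated Poisson pmf: a structural-zero probability pi
    mixed with an ordinary Poisson(mu) count process."""
    poisson_term = math.exp(-mu) * mu ** y / math.factorial(y)
    if y == 0:
        return pi + (1.0 - pi) * poisson_term   # structural + sampling zeros
    return (1.0 - pi) * poisson_term

# Illustrative parameters: 30% structural zeros on top of Poisson(2.5)
total = sum(zip_pmf(y, mu=2.5, pi=0.3) for y in range(60))
```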

13.4.4 Mixed Models (GLMM)

When overdispersion arises from clustered or hierarchical data (e.g., patients within hospitals, students within schools), Generalised Linear Mixed Models (GLMMs) include random effects to account for within-group correlation:

g(μij)=xijTβ+bi,biN(0,σb2)g(\mu_{ij}) = \mathbf{x}_{ij}^T \boldsymbol{\beta} + b_i, \quad b_i \sim \mathcal{N}(0, \sigma^2_b)

Where bib_i is a random effect for cluster ii.

13.5 Underdispersion

Underdispersion (variance less than expected) is less common but can occur when:

Solutions include Conway-Maxwell-Poisson (CMP) regression, which handles both over- and underdispersion via an additional parameter.


14. Using the GLM Component

The GLM component in the DataStatPro application provides a full end-to-end workflow for fitting, evaluating, and interpreting Generalized Linear Models.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should contain one response variable and one or more predictor variables.

Step 2 — Select Distribution Family Choose the distribution appropriate for your response variable:

Step 3 — Select Link Function Choose the link function. The default is the canonical link for each distribution:

Step 4 — Select Response Variable (Y) Select the response variable from the "Response Variable (Y)" dropdown. For Binomial with proportions, you will be prompted to also select the trials variable (total counts nin_i).

Step 5 — Select Predictor Variables (X) Select one or more predictor variables from the "Predictor Variables (X)" dropdown. These can be:

Step 6 — Configure Offset (Optional) If the response is a rate, select the offset variable from the "Offset" dropdown. The application will include ln(offset)\ln(\text{offset}) in the linear predictor automatically.

Step 7 — Configure Interactions (Optional) Specify interaction terms by selecting pairs (or groups) of variables. The application will create and include the product terms.

Step 8 — Select Confidence Level Choose the confidence level for confidence intervals and prediction intervals (default: 95%).

Step 9 — Configure Dispersion For Quasi-Poisson and Quasi-Binomial, select the dispersion estimation method:

For Negative Binomial, choose the method for estimating kk:

Step 10 — Select Display Options Choose which outputs to display:

Step 11 — Run the Analysis Click "Run GLM". The application will:

  1. Encode categorical variables using dummy coding.
  2. Fit the GLM using IRLS.
  3. Compute coefficients, SEs, zz-values, p-values, and CIs (Wald and profile likelihood).
  4. Compute deviance, Pearson χ2\chi^2, AIC, BIC, and pseudo R².
  5. Estimate the dispersion parameter (if applicable).
  6. Compute all residual types and diagnostic statistics.
  7. Generate all selected diagnostic plots.
  8. Run goodness-of-fit tests.

15. Computational and Formula Details

15.1 IRLS Algorithm: Full Step-by-Step

Inputs: Response y\mathbf{y}, design matrix X\mathbf{X} (n×(p+1)n \times (p+1)), prior weights w\mathbf{w} (default: all 1), link function gg, variance function VV.

Step 0: Initialise μ^(0)\hat{\boldsymbol{\mu}}^{(0)}

μ^i(0)=yi+δ(e.g., δ=0.1 for Poisson/Gamma),η^i(0)=g(μ^i(0))\hat{\mu}_i^{(0)} = y_i + \delta \quad (\text{e.g., } \delta = 0.1 \text{ for Poisson/Gamma}), \quad \hat{\eta}_i^{(0)} = g(\hat{\mu}_i^{(0)})

For iteration t=0,1,2,t = 0, 1, 2, \dots until convergence:

Step 1: Compute working response z(t)\mathbf{z}^{(t)}:

zi(t)=η^i(t)+(yiμ^i(t))g(μ^i(t))z_i^{(t)} = \hat{\eta}_i^{(t)} + (y_i - \hat{\mu}_i^{(t)}) \cdot g'(\hat{\mu}_i^{(t)})

Step 2: Compute working weights W(t)\mathbf{W}^{(t)}:

Wii(t)=wiV(μ^i(t))[g(μ^i(t))]2W_{ii}^{(t)} = \frac{w_i}{V(\hat{\mu}_i^{(t)}) \cdot \left[g'(\hat{\mu}_i^{(t)})\right]^2}

Step 3: Solve weighted least squares:

β^(t+1)=(XTW(t)X)1XTW(t)z(t)\hat{\boldsymbol{\beta}}^{(t+1)} = \left(\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}

Step 4: Update linear predictor and mean:

η^(t+1)=Xβ^(t+1),μ^(t+1)=g1(η^(t+1))\hat{\boldsymbol{\eta}}^{(t+1)} = \mathbf{X} \hat{\boldsymbol{\beta}}^{(t+1)}, \quad \hat{\boldsymbol{\mu}}^{(t+1)} = g^{-1}(\hat{\boldsymbol{\eta}}^{(t+1)})

Step 5: Check convergence:

β^(t+1)β^(t)β^(t)+1010<ϵ(e.g., ϵ=108)\frac{\|\hat{\boldsymbol{\beta}}^{(t+1)} - \hat{\boldsymbol{\beta}}^{(t)}\|}{\|\hat{\boldsymbol{\beta}}^{(t)}\| + 10^{-10}} < \epsilon \quad \text{(e.g., } \epsilon = 10^{-8}\text{)}

Or equivalently, check the change in deviance.
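The steps above can be sketched concretely for a Poisson GLM with log link, where $V(\mu) = \mu$ and $g'(\mu) = 1/\mu$, so the working weights reduce to $\mu$. This is an illustrative numpy implementation, not the application's internal code:

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=100):
    """IRLS for a Poisson GLM with log link, following Steps 0-5 above."""
    mu = y + 0.1                          # Step 0: initialise away from zero
    eta = np.log(mu)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        z = eta + (y - mu) / mu           # Step 1: working response (g'(mu) = 1/mu)
        W = mu                            # Step 2: weights 1/(V * g'^2) = mu
        XtW = X.T * W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # Step 3: weighted least squares
        eta = X @ beta_new                # Step 4: update linear predictor and mean
        mu = np.exp(eta)
        if np.linalg.norm(beta_new - beta) < tol * (np.linalg.norm(beta) + 1e-10):
            beta = beta_new               # Step 5: converged
            break
        beta = beta_new
    return beta, mu

rng = np.random.default_rng(1)
x = rng.normal(size=400)
X = np.column_stack([np.ones(400), x])
y = rng.poisson(np.exp(0.5 + 0.3 * x)).astype(float)
beta, mu = irls_poisson(X, y)
```

For the canonical log link, IRLS coincides with Newton-Raphson, and at convergence the score equations $\mathbf{X}^T(\mathbf{y} - \hat{\boldsymbol{\mu}}) = \mathbf{0}$ hold to numerical precision.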

15.2 Link Function Derivatives

The working response and working weights require g(μ)=dη/dμ=dg/dμg'(\mu) = d\eta/d\mu = dg/d\mu:

| Link | $g(\mu)$ | $g'(\mu) = dg/d\mu$ | $g^{-1}(\eta)$ |
|---|---|---|---|
| Identity | $\mu$ | $1$ | $\eta$ |
| Log | $\ln\mu$ | $1/\mu$ | $e^\eta$ |
| Logit | $\ln(\mu/(1-\mu))$ | $1/(\mu(1-\mu))$ | $e^\eta/(1+e^\eta)$ |
| Probit | $\Phi^{-1}(\mu)$ | $1/\phi(\Phi^{-1}(\mu))$ | $\Phi(\eta)$ |
| Cloglog | $\ln(-\ln(1-\mu))$ | $-1/((1-\mu)\ln(1-\mu))$ | $1-e^{-e^\eta}$ |
| Inverse | $1/\mu$ | $-1/\mu^2$ | $1/\eta$ |
| Inv. Squared | $1/\mu^2$ | $-2/\mu^3$ | $1/\sqrt{\eta}$ |
| Square root | $\sqrt{\mu}$ | $1/(2\sqrt{\mu})$ | $\eta^2$ |

15.3 Deviance Formulas by Distribution

| Distribution | Total Deviance $D(\mathbf{y}, \hat{\boldsymbol{\mu}})$ |
|---|---|
| Gaussian | $\sum_i w_i(y_i - \hat{\mu}_i)^2$ |
| Binomial | $2\sum_i \left[y_i \ln(y_i/\hat{\mu}_i) + (n_i-y_i)\ln((n_i-y_i)/(n_i-\hat{\mu}_i))\right]$ |
| Poisson | $2\sum_i \left[y_i \ln(y_i/\hat{\mu}_i) - (y_i-\hat{\mu}_i)\right]$ |
| Gamma | $2\sum_i \left[-\ln(y_i/\hat{\mu}_i) + (y_i-\hat{\mu}_i)/\hat{\mu}_i\right]$ |
| Inv. Gaussian | $\sum_i (y_i-\hat{\mu}_i)^2/(y_i\hat{\mu}_i^2)$ |
| Neg. Binomial | $2\sum_i \left[y_i\ln(y_i/\hat{\mu}_i) - (y_i+k)\ln((y_i+k)/(\hat{\mu}_i+k))\right]$ |

15.4 Marginal Effects

For models with non-identity link functions, the coefficient βj\beta_j describes the effect of XjX_j on the link scale — not directly on the response scale. Marginal effects translate coefficients to the response scale.

Average Marginal Effect (AME):

AMEj=1ni=1nμ^iXij=1ni=1nβ^j(g1)(η^i)AME_j = \frac{1}{n}\sum_{i=1}^n \frac{\partial \hat{\mu}_i}{\partial X_{ij}} = \frac{1}{n}\sum_{i=1}^n \hat{\beta}_j \cdot (g^{-1})'(\hat{\eta}_i)

For the log link: μ^iXij=β^jμ^i\frac{\partial \hat{\mu}_i}{\partial X_{ij}} = \hat{\beta}_j \hat{\mu}_i, so AMEj=β^jμ^ˉAME_j = \hat{\beta}_j \bar{\hat{\mu}}.

For the logit link: μ^iXij=β^jμ^i(1μ^i)\frac{\partial \hat{\mu}_i}{\partial X_{ij}} = \hat{\beta}_j \hat{\mu}_i(1-\hat{\mu}_i), so AMEj=β^jμ^(1μ^)AME_j = \hat{\beta}_j \overline{\hat{\mu}(1-\hat{\mu})}.

Marginal Effect at the Mean (MEM):

MEMj=β^j(g1)(η^(xˉ))MEM_j = \hat{\beta}_j \cdot (g^{-1})'(\hat{\eta}(\bar{\mathbf{x}}))

AME is generally preferred over MEM as it averages over the actual distribution of observations rather than evaluating at the mean (which may not be a representative point).
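The AME formulas for the two common links translate directly into code (a numpy sketch with illustrative fitted means, using hypothetical helper names):

```python
import numpy as np

def ame_logit(beta_j, mu_hat):
    """Average marginal effect under a logit link:
    AME_j = beta_j * mean(mu * (1 - mu))."""
    mu_hat = np.asarray(mu_hat)
    return beta_j * np.mean(mu_hat * (1.0 - mu_hat))

def ame_log(beta_j, mu_hat):
    """Average marginal effect under a log link: AME_j = beta_j * mean(mu)."""
    return beta_j * np.mean(np.asarray(mu_hat))

# Illustrative fitted probabilities from a logistic model
mu = np.array([0.2, 0.5, 0.8])
effect = ame_logit(0.9, mu)
```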


16. Worked Examples

Example 1: Poisson GLM — Modelling Insurance Claim Counts

Research Question: What factors predict the number of insurance claims filed by policyholders? Do age, vehicle type, and driving experience affect claim frequency?

Data: n=500n = 500 policyholders; response YY = number of claims in one year; exposure EE = years of coverage (offset); predictors: Age (years), VehicleType (Car/Van/Truck; reference = Car), Experience (years of driving).

Step 1: Check Response Distribution

Mean claims per year: yˉ/Eˉ=0.18\bar{y}/\bar{E} = 0.18. Histogram shows right-skewed counts with many zeros. Poisson GLM with log link and log(exposure) offset is appropriate.

Step 2: Fit Poisson GLM

ln(μi)=β0+β1Agei+β2DVan,i+β3DTruck,i+β4Experiencei+ln(Ei)\ln(\mu_i) = \beta_0 + \beta_1 \text{Age}_i + \beta_2 D_{Van,i} + \beta_3 D_{Truck,i} + \beta_4 \text{Experience}_i + \ln(E_i)

Step 3: Coefficient Table

| Parameter | $\hat{\beta}$ | SE | $z$-value | $p$-value | $e^{\hat{\beta}}$ (Rate Ratio) | 95% CI for Rate Ratio |
|---|---|---|---|---|---|---|
| Intercept | -2.183 | 0.241 | -9.06 | < 0.001 | 0.113 | [0.070, 0.181] |
| Age | -0.018 | 0.006 | -2.87 | 0.004 | 0.982 | [0.970, 0.994] |
| Van | 0.421 | 0.112 | 3.76 | < 0.001 | 1.524 | [1.224, 1.897] |
| Truck | 0.683 | 0.148 | 4.61 | < 0.001 | 1.980 | [1.481, 2.645] |
| Experience | -0.031 | 0.009 | -3.44 | 0.001 | 0.969 | [0.951, 0.988] |

Step 4: Interpretation

Step 5: Model Fit Statistics

D0=641.3 (df=499),Dr=572.4 (df=495),LRT χ42=68.9, p<0.001D_0 = 641.3\ (df = 499), \quad D_r = 572.4\ (df = 495), \quad \text{LRT } \chi^2_4 = 68.9,\ p < 0.001

RMcFadden2=1572.4/641.3=0.107,AIC=1184.2R^2_{McFadden} = 1 - 572.4/641.3 = 0.107, \quad AIC = 1184.2

Step 6: Check for Overdispersion

ϕ^=X2/(np1)=591.3/495=1.194\hat{\phi} = X^2/(n-p-1) = 591.3/495 = 1.194

Mild overdispersion (ϕ^>1\hat{\phi} > 1). Refit with Quasi-Poisson:

Quasi-Poisson multiplies all SEs by 1.194=1.093\sqrt{1.194} = 1.093. Conclusions are largely unchanged but confidence intervals are slightly wider.

Prediction for new policyholder: Age = 35, Van, Experience = 10, Exposure = 1 year:

η^=2.183+(0.018)(35)+0.421(1)+(0.031)(10)+ln(1)=2.1830.630+0.4210.310+0=2.702\hat{\eta} = -2.183 + (-0.018)(35) + 0.421(1) + (-0.031)(10) + \ln(1) = -2.183 - 0.630 + 0.421 - 0.310 + 0 = -2.702

μ^=e2.702=0.067 claims per year\hat{\mu} = e^{-2.702} = 0.067 \text{ claims per year}
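The prediction arithmetic can be verified directly from the coefficient table:

```python
import math

# Coefficients from the fitted Poisson model above
coef = {"intercept": -2.183, "age": -0.018, "van": 0.421,
        "truck": 0.683, "experience": -0.031}

# New policyholder: Age 35, Van, Experience 10, Exposure 1 year
eta = (coef["intercept"] + coef["age"] * 35 + coef["van"] * 1
       + coef["truck"] * 0 + coef["experience"] * 10
       + math.log(1.0))        # offset ln(E), zero for 1 year of exposure
mu = math.exp(eta)             # expected claims per year
```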


Example 2: Gamma GLM — Modelling Healthcare Costs

Research Question: What patient characteristics predict the total annual healthcare cost?

Data: n=300n = 300 patients; response YY = total annual healthcare cost (USD > 0); predictors: Age (years), ChronicConditions (count), Smoker (0/1), BMI.

Step 1: Assess Distribution

Healthcare costs are strictly positive with a right-skewed distribution and variance proportional to μ2\mu^2 (coefficient of variation approximately constant). Gamma GLM with log link is appropriate.

Step 2: Fit Gamma GLM

ln(μi)=β0+β1Agei+β2Chronici+β3Smokeri+β4BMIi\ln(\mu_i) = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Chronic}_i + \beta_3 \text{Smoker}_i + \beta_4 \text{BMI}_i

Step 3: Coefficient Table

| Parameter | $\hat{\beta}$ | SE | $z$-value | $p$-value | $e^{\hat{\beta}}$ (Cost Ratio) | 95% CI |
|---|---|---|---|---|---|---|
| Intercept | 6.421 | 0.382 | 16.81 | < 0.001 | 614.3 | [290.5, 1298.0] |
| Age | 0.028 | 0.007 | 3.89 | < 0.001 | 1.028 | [1.014, 1.043] |
| Chronic | 0.341 | 0.042 | 8.12 | < 0.001 | 1.406 | [1.295, 1.527] |
| Smoker | 0.287 | 0.098 | 2.93 | 0.003 | 1.332 | [1.099, 1.615] |
| BMI | 0.019 | 0.008 | 2.38 | 0.017 | 1.019 | [1.003, 1.036] |

Step 4: Interpretation (log link → cost ratios)

Step 5: Model Fit

ϕ^Pearson=X2/295=308.2/295=1.045(no overdispersion concern)\hat{\phi}_{Pearson} = X^2/295 = 308.2/295 = 1.045 \quad \text{(no overdispersion concern)}

D0=411.2,Dr=289.7,RMcFadden2=0.295,AIC=4821.3D_0 = 411.2, \quad D_r = 289.7, \quad R^2_{McFadden} = 0.295, \quad AIC = 4821.3

Step 6: Predicted Cost for New Patient

Age = 55, Chronic = 3, Smoker = 1, BMI = 28:

η^=6.421+0.028(55)+0.341(3)+0.287(1)+0.019(28)=6.421+1.540+1.023+0.287+0.532=9.803\hat{\eta} = 6.421 + 0.028(55) + 0.341(3) + 0.287(1) + 0.019(28) = 6.421 + 1.540 + 1.023 + 0.287 + 0.532 = 9.803

μ^=e9.803=$18,090\hat{\mu} = e^{9.803} = \$18{,}090

95% CI for $\mu$: computed on the $\eta$ scale and back-transformed: $[\$14{,}210, \; \$23{,}020]$.


Example 3: Negative Binomial GLM — Modelling Overdispersed Species Counts

Research Question: What environmental variables predict the abundance of a bird species across survey sites?

Data: n=150n = 150 survey sites; YY = bird count; predictors: Altitude (m), ForestCover (%), Distance to Water (km), Temperature (°C).

Step 1: Fit Poisson GLM and Check Overdispersion

Poisson GLM fit: ϕ^=X2/145=421.3/145=2.906\hat{\phi} = X^2/145 = 421.3/145 = 2.906 — substantial overdispersion (ϕ^3\hat{\phi} \approx 3). Switch to Negative Binomial.

Step 2: Fit Negative Binomial GLM

ln(μi)=β0+β1Altitudei+β2Foresti+β3Distancei+β4Tempi\ln(\mu_i) = \beta_0 + \beta_1 \text{Altitude}_i + \beta_2 \text{Forest}_i + \beta_3 \text{Distance}_i + \beta_4 \text{Temp}_i

Estimated overdispersion parameter: k^=2.14\hat{k} = 2.14 (SE = 0.48).

LRT for overdispersion vs. Poisson: χ12=84.3\chi^2_1 = 84.3, p<0.001p < 0.001 → Negative Binomial is strongly preferred.

Step 3: Coefficient Table

| Parameter | $\hat{\beta}$ | SE | $z$-value | $p$-value | Rate Ratio |
|---|---|---|---|---|---|
| Intercept | 1.842 | 0.412 | 4.47 | < 0.001 | 6.31 |
| Altitude (per 100 m) | -0.124 | 0.038 | -3.26 | 0.001 | 0.883 |
| Forest Cover (per 10%) | 0.218 | 0.061 | 3.57 | < 0.001 | 1.244 |
| Distance to Water | -0.083 | 0.024 | -3.46 | 0.001 | 0.920 |
| Temperature | 0.041 | 0.019 | 2.16 | 0.031 | 1.042 |

Step 4: Interpretation

Step 5: Fit Statistics

AICNB=812.4 vs. AICPoisson=893.1ΔAIC=80.7Strong support for NBAIC_{NB} = 812.4 \text{ vs. } AIC_{Poisson} = 893.1 \quad \Delta AIC = 80.7 \to \text{Strong support for NB}

RMcFadden2=0.218R^2_{McFadden} = 0.218


Example 4: Binomial GLM with Probit Link — Predicting Product Failure

Research Question: What material and design factors predict whether a component will fail a stress test?

Data: n=200n = 200 components; Y{0,1}Y \in \{0, 1\} (0 = pass, 1 = fail); predictors: Thickness (mm), Temperature (°C), MaterialGrade (A/B/C; reference = A).

Step 1: Fit Binomial GLM with Probit Link

Φ1(μi)=β0+β1Thicknessi+β2Tempi+β3DB,i+β4DC,i\Phi^{-1}(\mu_i) = \beta_0 + \beta_1 \text{Thickness}_i + \beta_2 \text{Temp}_i + \beta_3 D_{B,i} + \beta_4 D_{C,i}

Step 2: Coefficient Table

| Parameter | $\hat{\beta}$ | SE | $z$-value | $p$-value |
|---|---|---|---|---|
| Intercept | -2.841 | 0.531 | -5.35 | < 0.001 |
| Thickness | -0.384 | 0.092 | -4.17 | < 0.001 |
| Temperature | 0.041 | 0.012 | 3.42 | 0.001 |
| Grade B | 0.612 | 0.214 | 2.86 | 0.004 |
| Grade C | 1.183 | 0.241 | 4.91 | < 0.001 |

Step 3: Interpretation (probit scale)

Average Marginal Effect of Thickness:

AME=β^1×1niϕ(η^i)=0.384×0.312=0.120AME = \hat{\beta}_1 \times \frac{1}{n}\sum_i \phi(\hat{\eta}_i) = -0.384 \times 0.312 = -0.120

On average, each mm increase in thickness reduces the probability of failure by 12.0 percentage points.

Step 4: Predicted Probability for New Component

Thickness = 4.5mm, Temperature = 80°C, Grade B:

η^=2.841+(0.384)(4.5)+(0.041)(80)+0.612=2.8411.728+3.280+0.612=0.677\hat{\eta} = -2.841 + (-0.384)(4.5) + (0.041)(80) + 0.612 = -2.841 - 1.728 + 3.280 + 0.612 = -0.677

μ^=Φ(0.677)=0.249\hat{\mu} = \Phi(-0.677) = 0.249

Predicted probability of failure: 24.9%.

95% CI: Φ(0.677±1.96×0.143)=Φ[0.957,0.397]=[0.169,0.346]\text{95\% CI: } \Phi(-0.677 \pm 1.96 \times 0.143) = \Phi[-0.957, -0.397] = [0.169, 0.346]
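This probit prediction and its interval can be reproduced with the standard library alone, since $\Phi$ is available through `math.erf`:

```python
import math

def probit_inv(eta: float) -> float:
    # Inverse probit link: the standard normal CDF.
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

# New component: Thickness 4.5 mm, Temperature 80 C, Grade B
eta = -2.841 + (-0.384) * 4.5 + 0.041 * 80 + 0.612
p_fail = probit_inv(eta)
# 95% CI: shift eta by +/- 1.96 SE, then back-transform both endpoints
lo, hi = probit_inv(eta - 1.96 * 0.143), probit_inv(eta + 1.96 * 0.143)
```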


17. Common Mistakes and How to Avoid Them

Mistake 1: Using Gaussian GLM for Non-Normal Response Variables

Problem: Applying ordinary linear regression (Gaussian GLM) to count, proportion, or positive continuous data, which violates distributional assumptions, produces predictions outside valid ranges (e.g., negative counts or probabilities > 1), and leads to invalid inference.
Solution: Match the distribution to the response type: Poisson/Negative Binomial for counts; Binomial for proportions; Gamma for positive continuous. Always check the range and distribution of YY before selecting a family.

Mistake 2: Ignoring Overdispersion in Poisson/Binomial Models

Problem: Fitting a Poisson or Binomial GLM to overdispersed data (ϕ^>1\hat{\phi} > 1) without correction. Standard errors are underestimated, leading to spuriously small p-values and narrow confidence intervals.
Solution: Always compute ϕ^=X2/(np1)\hat{\phi} = X^2/(n-p-1) after fitting. If ϕ^>1.2\hat{\phi} > 1.2, use Quasi-Poisson, Quasi-Binomial, or Negative Binomial as appropriate. For severe overdispersion or excess zeros, consider zero-inflated models.

Mistake 3: Interpreting Coefficients on the Wrong Scale

Problem: Interpreting a log-link coefficient β^j=0.35\hat{\beta}_j = 0.35 as "a 0.35 unit increase in the mean" when it actually represents a multiplicative change: the mean is multiplied by e0.351.42e^{0.35} \approx 1.42 (a 42% increase).
Solution: Always interpret GLM coefficients on the appropriate scale. For log links, report and interpret eβ^je^{\hat{\beta}_j} (rate ratio, cost ratio, etc.). For logit links, report eβ^je^{\hat{\beta}_j} as an odds ratio. Always state clearly which scale is being used.

Mistake 4: Choosing the Wrong Link Function

Problem: Using an inappropriate link function (e.g., identity link for a Poisson model), which can produce predicted values outside valid ranges and poor model fit.
Solution: Use the canonical link as the default. Consider alternative links when domain knowledge suggests a specific functional form. Check the fit of alternative link functions using AIC and residual plots.

Mistake 5: Forgetting the Offset in Rate Models

Problem: Modelling count data without including an offset for different exposure periods or population sizes, attributing variation in counts entirely to predictors when it is partly due to different exposures.
Solution: Always include ln(exposure)\ln(\text{exposure}) as an offset when modelling rates from count data. Verify that the offset variable is on the log scale (for log-link models) and has a fixed coefficient of 1.

Mistake 6: Treating the Deviance as an Absolute Goodness-of-Fit Test for All Distributions

Problem: Using Drχnp12D_r \sim \chi^2_{n-p-1} to test model fit for Poisson or Binomial models with small expected counts, where the χ2\chi^2 approximation is unreliable.
Solution: The deviance goodness-of-fit test is only reliable when all expected counts μ^i5\hat{\mu}_i \geq 5. For sparse data, use the Pearson χ2\chi^2 test, collapse categories, or use simulation-based tests. For Binomial with individual binary responses, use the Hosmer-Lemeshow test instead.

Mistake 7: Not Checking for Complete Separation in Binomial Models

Problem: A predictor or combination of predictors perfectly separates successes from failures, causing IRLS to fail to converge and producing extremely large coefficient estimates with huge standard errors.
Solution: Look for IRLS convergence warnings and inspect coefficient estimates. If separation is detected, use Firth's bias-reduced logistic regression, exact logistic regression, or regularised estimation. Remove or merge categories that cause separation.
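A quick single-predictor screen for complete separation can be sketched as follows (an illustrative diagnostic, not a DataStatPro function; separation by a combination of predictors additionally requires checking linear combinations):

```python
def perfectly_separates(x, y):
    """Flags complete separation by a single predictor: every x value in the
    y=1 group lies strictly above (or strictly below) every x value in the
    y=0 group. Illustrative diagnostic, not a DataStatPro function."""
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    return min(x1) > max(x0) or max(x1) < min(x0)

print(perfectly_separates([1, 2, 3, 10, 11], [0, 0, 0, 1, 1]))  # True
print(perfectly_separates([1, 4, 2, 3, 5], [0, 1, 0, 1, 0]))    # False
```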

Mistake 8: Applying GLMs to Dependent Observations

Problem: Using a standard GLM for longitudinal, clustered, or spatially correlated data, where observations within groups are correlated, violating the independence assumption and leading to underestimated standard errors.
Solution: Use Generalised Estimating Equations (GEE) for marginal (population-averaged) inference, or Generalised Linear Mixed Models (GLMM) for subject-specific inference. Always consider the study design before choosing a model.

Mistake 9: Comparing Models Across Different Datasets Using AIC

Problem: Comparing AIC values between models fit to different subsets of data (e.g., after listwise deletion of missing values reduces the dataset differently for different models), leading to invalid comparisons.
Solution: AIC is only comparable between models fit to exactly the same observations. Ensure all candidate models use the same dataset. Handle missing data before model selection, not during.

Mistake 10: Over-Interpreting Pseudo R² Values

Problem: Comparing a GLM's pseudo R² directly to the R² from linear regression and concluding the GLM fits poorly because the pseudo R² is "only 0.20."
Solution: Pseudo R² values for GLMs are not directly comparable to OLS R². McFadden's R2=0.20R^2 = 0.20 represents a good fit in many GLM applications. Always interpret pseudo R² relative to the scale typical for that type of model and outcome, and supplement with deviance, AIC, and residual diagnostics.


18. Troubleshooting

| Issue | Likely Cause | Solution |
|---|---|---|
| IRLS fails to converge | Complete separation (Binomial); extreme predictor values; poor starting values | Check for separation; standardise predictors; use robust starting values; reduce model complexity |
| Very large coefficient estimates ($\vert\hat{\beta}\vert > 10$) | Complete or quasi-complete separation (Binomial); collinearity | Inspect data for perfect predictors; check VIF; use Firth regression or regularisation |
| Very large standard errors | Multicollinearity; separation; too few events per variable | Check VIF; remove correlated variables; collect more data; use penalised estimation |
| Residual deviance $\gg n-p-1$ | Overdispersion; model misspecification; influential outliers | Compute $\hat{\phi}$; switch to Quasi/NB model; check residual plots for outliers and non-linearity |
| Negative predicted values (Poisson model) | Identity link used accidentally (cannot occur with the log link) | Verify the link function specification; refit with the log link |
| Predicted probabilities at exactly 0 or 1 | Complete separation; very extreme linear predictor values | Check for separation; use Firth regression; inspect extreme observations |
| AIC is not reported | Quasi-GLM selected (no proper likelihood) | Use F-tests and deviance for model comparison; AIC is unavailable for quasi-models |
| $\hat{\phi} < 1$ (underdispersion) | Counts are bounded; negative contagion; over-specified model | Consider Conway-Maxwell-Poisson; check model specification; verify data are correct |
| Hosmer-Lemeshow test significant ($p < 0.05$) | Poor calibration; missing covariates; wrong link; non-linearity | Add missing predictors; try an alternative link; add polynomial terms; inspect residual plots |
| All Pearson residuals similar in magnitude | Normal/Gaussian family used on count data (constant variance) | Switch to Poisson or Negative Binomial with the appropriate variance function |
| Cook's distance very large for one observation | Extreme influential observation; data entry error | Investigate the observation; verify data accuracy; refit with and without it to assess influence |
| Profile likelihood CI very asymmetric vs. Wald CI | Strong non-linearity of the likelihood; small sample | Report the profile likelihood CI; note the asymmetry as evidence of non-normality of the MLE distribution |
| Dispersion estimate $\hat{\phi}$ varies wildly across subgroups | Heteroscedasticity; model misspecification | Consider separate models per subgroup; add interaction terms; use heteroscedasticity-robust SEs |

19. Quick Reference Cheat Sheet

Core GLM Formulas

| Formula | Description |
|---|---|
| $g(\mu_i) = \eta_i = \mathbf{x}_i^T \boldsymbol{\beta}$ | GLM specification |
| $\text{Var}(Y_i) = \phi \cdot V(\mu_i) / w_i$ | GLM variance structure |
| $\hat{\boldsymbol{\beta}}^{(t+1)} = (\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{z}^{(t)}$ | IRLS update |
| $W_{ii} = w_i / [V(\hat{\mu}_i)(g'(\hat{\mu}_i))^2]$ | IRLS working weight |
| $z_i = \hat{\eta}_i + (y_i - \hat{\mu}_i)g'(\hat{\mu}_i)$ | IRLS working response |
| $\text{Cov}(\hat{\boldsymbol{\beta}}) = \hat{\phi}(\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X})^{-1}$ | Covariance of the MLE |
| $z_j = \hat{\beta}_j / SE(\hat{\beta}_j)$ | Wald z-statistic |
| $\Lambda = D(M_0) - D(M_1) \sim \chi^2_{df}$ | Likelihood ratio test |
| $D_r = 2[\ell_{sat} - \ell(\hat{\boldsymbol{\mu}})]$ | Residual deviance |
| $R^2_{McFadden} = 1 - D_r/D_0$ | McFadden's pseudo R² |
| $AIC = -2\ell(\hat{\boldsymbol{\mu}}) + 2k$ | AIC |
| $\hat{\phi} = X^2/(n-p-1)$ | Pearson dispersion estimate |
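For the simplest non-trivial case, a Poisson GLM with log link, intercept, and one covariate, the IRLS rows above reduce to $W_{ii} = \mu_i$ and $z_i = \eta_i + (y_i - \mu_i)/\mu_i$. A pure-Python sketch of the iteration (illustrative only, not the DataStatPro implementation):

```python
import math

def irls_poisson(x, y, iters=25):
    """Minimal IRLS for a Poisson GLM with log link, intercept b0 and
    slope b1, following beta <- (X'WX)^{-1} X'Wz with W_ii = mu_i and
    z_i = eta_i + (y_i - mu_i) / mu_i.
    Illustrative sketch, not the DataStatPro implementation."""
    b0, b1 = math.log(max(sum(y) / len(y), 1e-8)), 0.0  # crude start
    for _ in range(iters):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        w = mu                                            # Poisson, log link
        z = [e + (yi - mi) / mi for e, yi, mi in zip(eta, y, mu)]
        # Solve the 2x2 weighted normal equations X'WX beta = X'Wz:
        s0 = sum(w)
        s1 = sum(wi * xi for wi, xi in zip(w, x))
        s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
        t0 = sum(wi * zi for wi, zi in zip(w, z))
        t1 = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s0 * s2 - s1 * s1
        b0, b1 = (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det
    return b0, b1

# Recover known coefficients from exact Poisson means mu = exp(0.5 + 0.3 x):
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.exp(0.5 + 0.3 * xi) for xi in xs]
b0, b1 = irls_poisson(xs, ys)
```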

Distribution and Link Function Selection

| Response Type | Distribution | Default Link | $e^{\hat{\beta}}$ Interpretation |
|---|---|---|---|
| Binary (0/1) | Binomial | Logit | Odds ratio |
| Binary (0/1), latent normal | Binomial | Probit | Change in probit |
| Binary (0/1), rare event / hazard | Binomial | Cloglog | Hazard ratio |
| Proportion ($y/n$) | Binomial | Logit | Odds ratio |
| Count, equidispersed | Poisson | Log | Rate ratio |
| Count, overdispersed | Neg. Binomial | Log | Rate ratio |
| Count, overdispersed (mild) | Quasi-Poisson | Log | Rate ratio (corrected SEs) |
| Count, excess zeros | ZIP / ZINB | Log | Rate ratio (count component) |
| Continuous, positive, $CV \approx$ const | Gamma | Log | Cost/mean ratio |
| Continuous, positive, high skew | Inverse Gaussian | Log | Mean ratio |
| Zero-inflated positive continuous | Tweedie | Log | Mean ratio |
| Continuous, unbounded | Gaussian | Identity | Additive change in mean |

Residual Types Summary

| Residual | Formula | Best For |
|---|---|---|
| Raw | $y_i - \hat{\mu}_i$ | Simple inspection |
| Pearson | $(y_i-\hat{\mu}_i)/\sqrt{V(\hat{\mu}_i)}$ | Dispersion assessment |
| Deviance | $\text{sign}(y_i-\hat{\mu}_i)\sqrt{d_i}$ | General diagnostics |
| Quantile | $\Phi^{-1}(F(y_i;\hat{\mu}_i))$ | Best normality approximation; discrete data |
| Anscombe | Variance-stabilised residual | Normality plots |
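For a discrete response, the quantile residual is randomised within the CDF step (Dunn and Smyth's randomised quantile residuals). A minimal Poisson sketch using only the standard library (function names are illustrative, not a DataStatPro API):

```python
import math
import random
from statistics import NormalDist

def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu), by direct summation."""
    return sum(math.exp(-mu) * mu ** j / math.factorial(j) for j in range(k + 1))

def quantile_residual(y, mu, rng):
    """Randomised quantile residual for a Poisson response: draw u uniformly
    on (F(y-1), F(y)), then map through the standard normal inverse CDF."""
    lo = poisson_cdf(y - 1, mu) if y > 0 else 0.0
    hi = poisson_cdf(y, mu)
    u = rng.uniform(lo, hi)
    return NormalDist().inv_cdf(u)

rng = random.Random(0)               # fixed seed for reproducibility
r = quantile_residual(3, 3.0, rng)   # well-fitting observation: small residual
```

Under a correctly specified model these residuals are exactly standard normal, which is why they are preferred for count data diagnostics.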

Overdispersion Decision Tree

```
Fit standard Poisson/Binomial GLM
              ↓
Compute φ̂ = X²/(n-p-1)
              ↓
        φ̂ ≈ 1? ──Yes──→ Model is adequate
              │ No
   ┌──────────┴──────────┐
 φ̂ < 1                  φ̂ > 1
(Underdispersion)     (Overdispersion)
   ↓                      ↓
Conway-Maxwell-      Excess zeros?
Poisson (CMP)        ├─ Yes → ZIP / ZINB
                     └─ No  → φ̂ < 2?
                              ├─ Yes → Quasi-GLM
                              └─ No  → Negative Binomial
                                       (or GLMM if clustered)
```
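The decision tree can be encoded directly as a small helper. The cut-offs below mirror the tree's $\hat{\phi} < 1$ and $\hat{\phi} < 2$ branches plus the earlier 1.2 rule of thumb; the function itself is an illustrative sketch, not a DataStatPro API:

```python
def dispersion_advice(phi_hat, excess_zeros=False, clustered=False):
    """Walks the overdispersion decision tree. Thresholds follow the tree
    and the phi-hat > 1.2 rule of thumb. Illustrative sketch only."""
    if phi_hat < 1.0:
        return "underdispersion: consider Conway-Maxwell-Poisson (CMP)"
    if phi_hat <= 1.2:
        return "model is adequate"
    if excess_zeros:
        return "overdispersion with excess zeros: ZIP/ZINB"
    if phi_hat < 2.0:
        return "mild overdispersion: Quasi-GLM"
    return "overdispersion: Negative Binomial" + (" or GLMM" if clustered else "")

print(dispersion_advice(4.25))  # overdispersion: Negative Binomial
```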

Model Comparison Guide

| Scenario | Method | Statistic |
|---|---|---|
| Two nested models (proper likelihood) | LRT | $\chi^2_{df}$ |
| Two nested quasi-GLM models | F-test | $F_{df_1, df_2}$ |
| Non-nested models | AIC / BIC | Lower is better |
| Overall model significance | Analysis of deviance | $\chi^2_{p}$ |
| Single coefficient | Wald test | $z \sim \mathcal{N}(0,1)$ |
| Group of $q$ coefficients | LRT or Wald | $\chi^2_q$ |
| Small samples / asymmetric likelihood | Profile LRT | $\chi^2_1$ |
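For nested models differing by a single parameter, the LRT row reduces to one chi-square tail probability, computable from the standard library via $P(\chi^2_1 > \Lambda) = \mathrm{erfc}(\sqrt{\Lambda/2})$ (a sketch, not a DataStatPro API):

```python
import math

def lrt_pvalue_df1(dev0, dev1):
    """p-value for Lambda = D(M0) - D(M1) ~ chi^2_1 (nested models, 1 df),
    using P(chi^2_1 > L) = erfc(sqrt(L / 2)). Illustrative sketch."""
    lam = dev0 - dev1
    if lam <= 0:
        return 1.0
    return math.erfc(math.sqrt(lam / 2.0))

# Deviance drop of 3.8415 sits exactly at the 5% critical value of chi^2_1:
print(round(lrt_pvalue_df1(110.0, 106.1585), 3))  # ≈ 0.05
```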

Pseudo R² Benchmarks (McFadden)

| $R^2_{McFadden}$ | Model Fit |
|---|---|
| $< 0.10$ | Poor |
| $0.10 - 0.20$ | Acceptable |
| $0.20 - 0.30$ | Good |
| $0.30 - 0.40$ | Very good |
| $> 0.40$ | Excellent |

Key Diagnostic Thresholds

| Diagnostic | Threshold | Action |
|---|---|---|
| $\hat{\phi}$ (dispersion) | $> 1.2$ | Investigate overdispersion |
| Standardised residual $\vert r^{DS}_i \vert$ | $> 2$ (flag), $> 3$ (outlier) | Investigate observation |
| Leverage $h_{ii}$ | $> 2(p+1)/n$ | High leverage; check predictor values |
| Cook's distance $C_i$ | $> 1$ or $> 4/n$ | Influential observation; refit without it |
| VIF | $> 5$ (concern), $> 10$ (serious) | Multicollinearity; consider variable removal |
| Hosmer-Lemeshow $p$ | $< 0.05$ | Poor calibration (Binomial models) |
| LRT for NB vs. Poisson | $p < 0.05$ | Use Negative Binomial |

GLM vs. Related Models

| Model | Extension of GLM | Key Addition | When to Use |
|---|---|---|---|
| GLMM | Yes | Random effects | Clustered / hierarchical data |
| GEE | Marginal GLM | Working correlation | Longitudinal / repeated measures |
| Zero-Inflated GLM | Yes | Structural zeros component | Excess zeros in counts |
| Hurdle Model | Yes | Two-part: binary + truncated | Zeros arise from a distinct process |
| Ordinal GLM | Yes | Cumulative link | Ordered categorical response |
| Multinomial GLM | Yes | Multiple linear predictors | Nominal categorical response (> 2 classes) |
| Survival GLM | Yes | Censoring mechanism | Time-to-event data |
| Quasi-GLM | Yes | Estimated dispersion | Overdispersion without a full distribution |
| GAMLSS | Yes | All parameters modelled | Distributional regression |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Generalized Linear Models using the DataStatPro application. For further reading, consult McCullagh & Nelder's "Generalized Linear Models" (2nd ed., Chapman & Hall, 1989), Dobson & Barnett's "An Introduction to Generalized Linear Models" (4th ed., CRC Press, 2018), or Agresti's "Foundations of Linear and Generalized Linear Models" (Wiley, 2015). For feature requests or support, contact the DataStatPro team.