
Discrete Choice Models: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Discrete Choice modelling all the way through advanced extensions, assumption testing, heterogeneity analysis, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What are Discrete Choice Models?
  3. The Mathematical Framework
  4. Key Assumptions
  5. Identification and Causal Inference
  6. Binary Choice Models: Logit and Probit
  7. Hypothesis Testing and Inference
  8. Effect Size Measures
  9. Model Fit and Evaluation
  10. Diagnostics and Assumption Testing
  11. Extensions: Multinomial and Conditional Logit
  12. Extensions: Ordered Choice Models
  13. Extensions: Nested Logit and Mixed Logit
  14. Extensions: Panel Data Discrete Choice
  15. Using the Discrete Choice Component
  16. Computational and Formula Details
  17. Worked Examples
  18. Common Mistakes and How to Avoid Them
  19. Troubleshooting
  20. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into Discrete Choice Models, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Random Variables and Probability Distributions

A random variable $Y$ is a variable whose value is determined by a random process. In discrete choice modelling, the outcome variable $Y$ takes on a finite set of values representing alternative choices (e.g., $Y \in \{0, 1\}$ for binary outcomes, or $Y \in \{1, 2, \dots, J\}$ for multiple alternatives).

Key distributions used in discrete choice models:

- Bernoulli — the distribution of a binary outcome $Y \in \{0,1\}$.
- Logistic — the error distribution underlying the logit model.
- Standard normal — the error distribution underlying the probit model.
- Type I Extreme Value (Gumbel) — the error distribution underlying multinomial and conditional logit.

1.2 Likelihood and Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is the primary estimation method for discrete choice models. For a sample of $n$ independent observations, the likelihood function is:

$$\mathcal{L}(\boldsymbol{\theta}; \mathbf{y}, \mathbf{X}) = \prod_{i=1}^n f(y_i \mid \mathbf{x}_i; \boldsymbol{\theta})$$

The log-likelihood (more convenient for optimisation) is:

$$\ell(\boldsymbol{\theta}) = \ln \mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^n \ln f(y_i \mid \mathbf{x}_i; \boldsymbol{\theta})$$

The MLE $\hat{\boldsymbol{\theta}}$ maximises $\ell(\boldsymbol{\theta})$:

$$\hat{\boldsymbol{\theta}}_{MLE} = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})$$

MLE has attractive properties under regularity conditions: it is consistent, asymptotically normal, and asymptotically efficient.
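As a concrete illustration, the MLE can be computed numerically. The minimal sketch below (simulated data with hypothetical true coefficients) maximises the binary-logit log-likelihood with `scipy.optimize.minimize`:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Simulated data (hypothetical DGP): true beta = [0.5, -1.0]
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -1.0])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)

def neg_loglik(beta):
    eta = X @ beta
    # log L = sum_i [ y_i * eta_i - ln(1 + e^{eta_i}) ]  (logit form)
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(res.x)  # estimates close to the true [0.5, -1.0]
```

Because the logit log-likelihood is globally concave, any reasonable starting point converges to the same maximum.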

1.3 Latent Variable Models

Many discrete choice models are derived from an underlying latent variable (unobserved continuous variable). Define a latent utility or propensity:

$$Y_i^* = \mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i$$

The observed discrete outcome is a deterministic function of the latent variable:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0 \\ 0 & \text{if } Y_i^* \leq 0 \end{cases}$$

The distribution assumed for $\epsilon_i$ determines the model family:

- Logistic errors yield the logit model.
- Standard normal errors yield the probit model.

1.4 Random Utility Maximisation (RUM)

The Random Utility Model (McFadden, 1974) provides the economic foundation for discrete choice. Each decision-maker $i$ assigns a utility $U_{ij}$ to each alternative $j$:

$$U_{ij} = V_{ij} + \epsilon_{ij}$$

Where:

- $V_{ij}$ is the systematic (deterministic) component of utility, a function of observed characteristics and attributes.
- $\epsilon_{ij}$ is the random component, capturing unobserved tastes and attributes.

The decision-maker chooses the alternative that maximises utility:

$$Y_i = j \iff U_{ij} > U_{ik} \quad \forall k \neq j$$

Different distributional assumptions for $\epsilon_{ij}$ yield different discrete choice model families.
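The RUM-to-logit link can be verified by simulation. The sketch below (hypothetical systematic utilities) draws Gumbel errors, applies utility maximisation, and compares the empirical choice shares to the logit formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# Systematic utilities for J = 3 alternatives (hypothetical values)
V = np.array([1.0, 0.5, 0.0])
n_draws = 200_000

# Type I Extreme Value (Gumbel) errors => choices follow the logit formula
eps = rng.gumbel(size=(n_draws, 3))
choices = np.argmax(V + eps, axis=1)
empirical = np.bincount(choices, minlength=3) / n_draws

logit_probs = np.exp(V) / np.exp(V).sum()
print(empirical, logit_probs)  # the two agree to ~2 decimal places
```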

1.5 Ordinary Least Squares and Its Limitations for Discrete Outcomes

OLS regression applied to a binary outcome ($Y \in \{0,1\}$) produces the Linear Probability Model (LPM):

$$E[Y_i \mid \mathbf{x}_i] = \mathbf{x}_i^T\boldsymbol{\beta}$$

The LPM has well-known limitations:

1. Predicted probabilities can fall outside $[0,1]$.
2. The error term is heteroskedastic by construction, since its variance depends on $\mathbf{x}_i$.
3. Marginal effects are constant everywhere, which is implausible near the probability bounds.

Discrete choice models address all three limitations by modelling probabilities through a monotone transformation that maps the real line to $(0,1)$.
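The out-of-range prediction problem is easy to demonstrate. A deterministic sketch (a binary outcome that steps at $x = 0$) shows OLS fitted values escaping $[0,1]$:

```python
import numpy as np

# Deterministic illustration: binary outcome driven by a step in x
x = np.linspace(-3, 3, 61)
y = (x > 0).astype(float)

# OLS fit of the Linear Probability Model: y = a + b*x
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef

print(fitted.min(), fitted.max())  # predictions escape [0, 1] at the extremes
```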

1.6 Multinomial Outcomes and Ordinal Data

Multinomial outcomes have more than two unordered categories: $Y \in \{1, 2, \dots, J\}$ where no natural ordering exists (e.g., mode of transport: car, bus, train, bicycle).

Ordinal outcomes have more than two categories with a natural ordering: $Y \in \{1, 2, \dots, J\}$ where $1 < 2 < \dots < J$ (e.g., satisfaction: low, medium, high).

Different model families are designed for each type of outcome.


2. What are Discrete Choice Models?

2.1 The Core Idea

Discrete Choice Models (DCMs) are statistical models designed to explain and predict the choices made by individuals (or firms, households, or other decision-making units) when they face a finite set of mutually exclusive alternatives.

The core modelling challenge: the outcome is not a continuous variable but a discrete category. Standard regression is misspecified for such outcomes. DCMs model the probability of choosing each alternative as a function of:

- characteristics of the decision-maker (e.g., income, age), and
- attributes of the alternatives themselves (e.g., price, travel time).

The general DCM probability structure:

$$P(Y_i = j \mid \mathbf{x}_i) = F_j(\mathbf{x}_i, \boldsymbol{\theta})$$

Where $F_j(\cdot)$ is a function mapping covariates and parameters to probabilities, with:

$$\sum_{j=1}^J P(Y_i = j \mid \mathbf{x}_i) = 1 \quad \text{and} \quad P(Y_i = j \mid \mathbf{x}_i) \in (0,1)$$

2.2 A Taxonomy of Discrete Choice Models

| Model | Outcome Type | Alternatives | Key Assumption |
|---|---|---|---|
| Binary Logit | Binary ($J=2$) | 2 unordered | Logistic errors |
| Binary Probit | Binary ($J=2$) | 2 unordered | Normal errors |
| Multinomial Logit (MNL) | Nominal ($J>2$) | $J$ unordered | IID Gumbel errors (IIA) |
| Conditional Logit | Nominal ($J>2$) | $J$ with attributes | IID Gumbel errors (IIA) |
| Nested Logit | Nominal ($J>2$) | Hierarchical structure | Correlated within nests |
| Mixed Logit | Nominal ($J>2$) | $J$ with random coefficients | Flexible error structure |
| Ordered Logit (Proportional Odds) | Ordinal ($J>2$) | Ordered categories | Proportional odds |
| Ordered Probit | Ordinal ($J>2$) | Ordered categories | Normal latent variable |
| Multinomial Probit | Nominal ($J>2$) | $J$ unordered | Multivariate normal errors |

2.3 Real-World Applications

Discrete choice models are applied across virtually every field involving individual decision-making:

- Transport: travel mode and route choice (car, bus, rail, cycling).
- Marketing: brand, product, and channel choice.
- Health economics: insurance plan and treatment choice.
- Labour economics: occupational choice and labour force participation.
- Environmental economics: stated preference and contingent valuation studies.

2.4 Discrete Choice Models vs. Other Regression Methods

| Method | Outcome | Key Use Case | Key Limitation |
|---|---|---|---|
| OLS / LPM | Continuous (or binary) | Simple benchmark; DiD with binary $Y$ | Predicted probs outside $[0,1]$ |
| Logit / Probit | Binary | Binary classification; probability estimation | Marginal effects non-constant |
| Multinomial Logit | Nominal ($J>2$) | Unordered multi-category choices | IIA assumption restrictive |
| Nested Logit | Nominal ($J>2$, grouped) | Hierarchical choice structures | Tree structure pre-specified |
| Mixed Logit | Nominal (any $J$) | Preference heterogeneity; relaxes IIA | Computationally intensive |
| Ordered Logit/Probit | Ordinal | Ranked categories, Likert scales | Proportional odds assumption |
| Count Models (Poisson/NB) | Count data | Number of events | Not a DCM; counts not choices |
| Survival/Duration | Time to event | Time until discrete event | Different modelling paradigm |

3. The Mathematical Framework

3.1 The Binary Logit Model

The logit model specifies the probability of the outcome $Y_i = 1$ as:

$$P(Y_i = 1 \mid \mathbf{x}_i) = \Lambda(\mathbf{x}_i^T\boldsymbol{\beta}) = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}} = \frac{1}{1 + e^{-\mathbf{x}_i^T\boldsymbol{\beta}}}$$

And:

$$P(Y_i = 0 \mid \mathbf{x}_i) = 1 - \Lambda(\mathbf{x}_i^T\boldsymbol{\beta}) = \frac{1}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}$$

The log-odds (logit) transformation linearises the model:

$$\ln\left(\frac{P(Y_i=1)}{P(Y_i=0)}\right) = \ln\left(\frac{P_i}{1-P_i}\right) = \mathbf{x}_i^T\boldsymbol{\beta}$$

This is the log-odds ratio (or logit), and $\boldsymbol{\beta}$ are the log-odds coefficients.

3.2 The Binary Probit Model

The probit model specifies:

$$P(Y_i = 1 \mid \mathbf{x}_i) = \Phi(\mathbf{x}_i^T\boldsymbol{\beta})$$

Where $\Phi(\cdot)$ is the standard normal CDF. From the latent variable representation:

$$Y_i^* = \mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0,1)$$

$$P(Y_i = 1) = P(Y_i^* > 0) = P(\epsilon_i > -\mathbf{x}_i^T\boldsymbol{\beta}) = 1 - \Phi(-\mathbf{x}_i^T\boldsymbol{\beta}) = \Phi(\mathbf{x}_i^T\boldsymbol{\beta})$$

3.3 The Logit vs. Probit Comparison

The logistic and standard normal CDFs are very similar in shape. Key differences:

| Property | Logit | Probit |
|---|---|---|
| Link function | $\ln[p/(1-p)] = \mathbf{x}^T\boldsymbol{\beta}$ | $\Phi^{-1}(p) = \mathbf{x}^T\boldsymbol{\beta}$ |
| Error distribution | Logistic (heavier tails) | Standard normal |
| Scale normalisation | $\text{Var}(\epsilon) = \pi^2/3$ | $\text{Var}(\epsilon) = 1$ |
| Coefficient scaling | $\hat{\beta}_{logit} \approx 1.6\text{–}1.8 \times \hat{\beta}_{probit}$ | — |
| Closed-form probabilities | ✅ | ❌ (requires numerical integration) |
| Interpretability | Log-odds directly interpretable | Requires transformation |
| Tail behaviour | Heavier tails | Thinner tails |

Rules of thumb for converting coefficients: matching error variances gives $\hat{\beta}_{logit} \approx \frac{\pi}{\sqrt{3}} \hat{\beta}_{probit} \approx 1.81 \hat{\beta}_{probit}$; the smaller factor of 1.6 is also widely used because it better matches the two CDFs near $p = 0.5$.
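The quality of either scaling factor can be checked numerically. The sketch below compares the logistic CDF $\Lambda(z)$ with the rescaled normal CDF $\Phi(z/c)$ for both conversion factors:

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-6, 6, 1201)
logistic_cdf = 1.0 / (1.0 + np.exp(-z))

# Maximum absolute gap between Lambda(z) and Phi(z/c) for each scale factor
gaps = {}
for c in (np.pi / np.sqrt(3), 1.6):
    gaps[round(c, 3)] = np.max(np.abs(logistic_cdf - norm.cdf(z / c)))
print(gaps)  # both rescalings track the logistic CDF to within a few percent
```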

3.4 The Multinomial Logit Model

For $J > 2$ unordered alternatives, the Multinomial Logit (MNL) specifies, for alternative $j \in \{1, \dots, J\}$ with reference category $j = 1$:

$$P(Y_i = j \mid \mathbf{x}_i) = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}_j}}{\sum_{k=1}^J e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}}$$

With the normalisation $\boldsymbol{\beta}_1 = \mathbf{0}$ (reference category), so:

$$P(Y_i = j \mid \mathbf{x}_i) = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}_j}}{1 + \sum_{k=2}^J e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}}, \quad j = 2, \dots, J$$

$$P(Y_i = 1 \mid \mathbf{x}_i) = \frac{1}{1 + \sum_{k=2}^J e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}}$$

The log-odds ratio relative to the reference category:

$$\ln\left(\frac{P(Y_i = j)}{P(Y_i = 1)}\right) = \mathbf{x}_i^T\boldsymbol{\beta}_j$$
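The MNL probabilities above are a softmax over linear indices with the reference utility fixed at zero. A minimal sketch (hypothetical coefficient values):

```python
import numpy as np

def mnl_probs(x, betas):
    """Multinomial logit probabilities with category 1 as reference.

    x     : (k,) covariate vector
    betas : (J-1, k) coefficients for categories 2..J (beta_1 = 0)
    """
    eta = np.concatenate([[0.0], betas @ x])  # linear indices, reference = 0
    eta -= eta.max()                          # subtract max for numerical stability
    expeta = np.exp(eta)
    return expeta / expeta.sum()

# Hypothetical example: J = 3 categories, intercept plus one regressor
betas = np.array([[0.2, 0.5],
                  [-0.3, 1.0]])
p = mnl_probs(np.array([1.0, 0.4]), betas)
print(p, p.sum())  # three probabilities summing to 1
```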

3.5 The Conditional Logit Model

The Conditional Logit (CL) model (McFadden, 1974) allows attributes to vary across alternatives. The utility of alternative $j$ for individual $i$:

$$U_{ij} = \mathbf{z}_{ij}^T\boldsymbol{\gamma} + \mathbf{x}_i^T\boldsymbol{\beta}_j + \epsilon_{ij}$$

Where $\mathbf{z}_{ij}$ are alternative-specific attributes (e.g., price of alternative $j$, travel time of option $j$) with a common coefficient $\boldsymbol{\gamma}$, and $\mathbf{x}_i$ are individual-specific characteristics with alternative-specific coefficients $\boldsymbol{\beta}_j$.

The choice probability:

$$P(Y_i = j \mid \mathbf{Z}_i) = \frac{e^{\mathbf{z}_{ij}^T\boldsymbol{\gamma} + \mathbf{x}_i^T\boldsymbol{\beta}_j}}{\sum_{k=1}^J e^{\mathbf{z}_{ik}^T\boldsymbol{\gamma} + \mathbf{x}_i^T\boldsymbol{\beta}_k}}$$

3.6 The Ordered Logit Model

For an ordinal outcome $Y_i \in \{1, 2, \dots, J\}$, the Ordered Logit (Proportional Odds) model uses a single latent variable:

$$Y_i^* = \mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim \text{Logistic}(0,1)$$

With $J-1$ threshold (cut-point) parameters $\tau_1 < \tau_2 < \dots < \tau_{J-1}$:

$$Y_i = j \iff \tau_{j-1} < Y_i^* \leq \tau_j$$

Where $\tau_0 = -\infty$ and $\tau_J = +\infty$. The choice probabilities are:

$$P(Y_i = j \mid \mathbf{x}_i) = \Lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta}) - \Lambda(\tau_{j-1} - \mathbf{x}_i^T\boldsymbol{\beta})$$

$$P(Y_i \leq j \mid \mathbf{x}_i) = \Lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta})$$
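The category probabilities are differences of adjacent cumulative logistic CDF values. A minimal sketch (hypothetical thresholds for $J = 4$ categories):

```python
import numpy as np

def ordered_logit_probs(xb, cuts):
    """Category probabilities for ordered logit.

    xb   : scalar linear index x_i' beta
    cuts : increasing thresholds tau_1 < ... < tau_{J-1}
    """
    Lambda = lambda t: 1.0 / (1.0 + np.exp(-t))
    # Pad the cumulative probabilities with 0 (tau_0 = -inf) and 1 (tau_J = +inf)
    cdf = np.concatenate([[0.0], Lambda(np.asarray(cuts) - xb), [1.0]])
    return np.diff(cdf)  # P(Y = j) = Lambda(tau_j - xb) - Lambda(tau_{j-1} - xb)

# Hypothetical thresholds and linear index
p = ordered_logit_probs(xb=0.3, cuts=[-1.0, 0.5, 1.8])
print(p, p.sum())  # four category probabilities summing to 1
```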

3.7 The Nested Logit Model

The Nested Logit partitions the $J$ alternatives into $M$ mutually exclusive nests $B_1, B_2, \dots, B_M$. The choice probability for alternative $j$ in nest $m$:

$$P(Y_i = j) = P(\text{nest } m) \times P(j \mid \text{nest } m)$$

$$P(j \mid \text{nest } m) = \frac{e^{V_{ij}/\lambda_m}}{\sum_{k \in B_m} e^{V_{ik}/\lambda_m}}$$

$$P(\text{nest } m) = \frac{e^{W_{im} + \lambda_m I_{im}}}{\sum_{l=1}^M e^{W_{il} + \lambda_l I_{il}}}$$

Where $I_{im} = \ln\sum_{k \in B_m} e^{V_{ik}/\lambda_m}$ is the inclusive value (log-sum), $\lambda_m \in (0,1]$ is the dissimilarity parameter for nest $m$, and $W_{im}$ contains nest-level attributes.

3.8 The Mixed Logit Model

The Mixed Logit (also called the Random Parameters Logit) allows coefficients to vary across individuals:

$$P(Y_i = j \mid \mathbf{x}_i, \boldsymbol{\beta}_i) = \frac{e^{\mathbf{x}_{ij}^T\boldsymbol{\beta}_i}}{\sum_{k=1}^J e^{\mathbf{x}_{ik}^T\boldsymbol{\beta}_i}}$$

Where $\boldsymbol{\beta}_i \sim f(\boldsymbol{\beta} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ — individual-specific random coefficients drawn from a mixing distribution (typically normal or log-normal).

The unconditional choice probability integrates over the random coefficient distribution:

$$P(Y_i = j \mid \mathbf{x}_i) = \int \frac{e^{\mathbf{x}_{ij}^T\boldsymbol{\beta}}}{\sum_{k=1}^J e^{\mathbf{x}_{ik}^T\boldsymbol{\beta}}} f(\boldsymbol{\beta} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, d\boldsymbol{\beta}$$

This integral has no closed form and is evaluated by simulation (see Section 16.7).
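A minimal simulation sketch of that integral, assuming independent normal mixing with hypothetical means and standard deviations, averages the logit formula over random coefficient draws:

```python
import numpy as np

rng = np.random.default_rng(1)

def mixed_logit_prob(x, mu, sigma, n_draws=5000):
    """Simulated mixed-logit probabilities: average the logit formula
    over random coefficients beta_r ~ N(mu, diag(sigma^2)).

    x : (J, k) attribute matrix, one row per alternative
    """
    k = len(mu)
    betas = mu + sigma * rng.standard_normal((n_draws, k))  # (R, k) draws
    eta = betas @ x.T                                       # (R, J) utilities
    eta -= eta.max(axis=1, keepdims=True)                   # numerical stability
    p = np.exp(eta)
    p /= p.sum(axis=1, keepdims=True)                       # logit probs per draw
    return p.mean(axis=0)                                   # average over draws

# Hypothetical: 3 alternatives, 2 attributes, random coefficients
x = np.array([[1.0, 0.5], [0.2, 1.0], [0.0, 0.0]])
p = mixed_logit_prob(x, mu=np.array([0.8, -0.4]), sigma=np.array([0.5, 0.3]))
print(p, p.sum())  # simulated choice probabilities, summing to 1
```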


4. Key Assumptions

4.1 Independence of Irrelevant Alternatives (IIA)

The most important and controversial assumption in multinomial logit models is the Independence of Irrelevant Alternatives (IIA):

The ratio of probabilities for any two alternatives jj and kk is independent of all other alternatives in the choice set.

Formally, for the MNL model:

$$\frac{P(Y_i = j)}{P(Y_i = k)} = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}_j}}{e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}} = e^{\mathbf{x}_i^T(\boldsymbol{\beta}_j - \boldsymbol{\beta}_k)}$$

This ratio depends only on $j$ and $k$, not on any other alternative $l$.

The Red Bus / Blue Bus Problem: A classic IIA failure. Suppose individuals choose between Car and Red Bus with equal probability (50/50). If a Blue Bus (identical to Red Bus except colour) is added, IIA predicts all three have 1/3 probability each. But intuitively, the split should be 50% car and 50% bus (25% red + 25% blue). IIA allocates "competition" uniformly across all alternatives rather than within similar alternatives.
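The red bus / blue bus arithmetic is a two-line computation under the MNL formula:

```python
import numpy as np

def mnl_shares(V):
    """MNL choice shares from systematic utilities V."""
    e = np.exp(V - np.max(V))
    return e / e.sum()

# Car vs Red Bus with equal systematic utility: a 50/50 split
two_alt = mnl_shares(np.array([0.0, 0.0]))

# Add a Blue Bus identical to the Red Bus: IIA forces 1/3 each,
# although intuitively Car should keep ~50% and the buses split the rest
three_alt = mnl_shares(np.array([0.0, 0.0, 0.0]))
print(two_alt, three_alt)  # [0.5 0.5] and [1/3 1/3 1/3]
```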

IIA is implied by: The Type I Extreme Value (Gumbel) distributional assumption on $\epsilon_{ij}$ and the independence across alternatives.

IIA fails when: Alternatives are correlated substitutes — i.e., some alternatives are more similar to each other than to others. In such cases, the error terms $\epsilon_{ij}$ are correlated across alternatives.

4.2 The Proportional Odds Assumption

For the Ordered Logit model, the proportional odds assumption (also called parallel lines assumption) requires that the effect of covariates on the log-odds is constant across all thresholds:

$$\ln\left(\frac{P(Y_i > j)}{P(Y_i \leq j)}\right) = \mathbf{x}_i^T\boldsymbol{\beta} - \tau_j \quad \forall j$$

The coefficient vector $\boldsymbol{\beta}$ is the same for all outcome categories — only the intercept $\tau_j$ changes. This is a strong assumption that should be explicitly tested (see Section 10.3).

4.3 Random Utility Consistency

For the RUM foundation to be valid:

  1. Completeness: Decision-makers can rank all alternatives.
  2. Transitivity: Preferences are transitive (if $A \succ B$ and $B \succ C$, then $A \succ C$).
  3. Utility maximisation: Decision-makers always choose the alternative with the highest utility.
  4. Stable preferences: Preferences do not change during the observation period.

4.4 Independence of Observations

Standard discrete choice models assume independent observations across individuals. In panel data (Section 14), this assumption is relaxed by allowing within-individual correlation across repeated choices.

4.5 Correct Specification of the Choice Set

The model assumes:

- The choice set contains all relevant alternatives available to the decision-maker.
- Alternatives are mutually exclusive and collectively exhaustive.
- Every decision-maker actually faces the choice set as specified; unavailable alternatives are excluded.


5. Identification and Causal Inference

5.1 What Discrete Choice Models Identify

Identification in discrete choice models means the ability to recover the structural parameters $\boldsymbol{\beta}$ from the data. Key identification conditions:

- Scale normalisation: only $\boldsymbol{\beta}/\sigma$ is identified, so the error variance is fixed ($\pi^2/3$ for logit, $1$ for probit).
- Location normalisation: in multinomial models, one alternative's coefficients are set to zero (the reference category).
- No perfect collinearity among covariates, and sufficient variation in $\mathbf{x}_i$ across observations.

5.2 Endogeneity in Discrete Choice Models

Endogeneity arises when a regressor $X_{ij}$ is correlated with the unobserved utility component $\epsilon_{ij}$. Common sources:

- Omitted attributes: unobserved product or alternative quality correlated with included regressors such as price.
- Simultaneity: attributes (e.g., prices) set in response to demand.
- Measurement error in covariates.

Consequences: MLE estimates are inconsistent under endogeneity — standard corrections are required.

Remedies:

- Instrumental variables combined with control function approaches.
- Alternative-specific constants that absorb unobserved alternative-level attributes.
- Richer covariates that proxy for the omitted factors.

5.3 Average Partial Effects vs. Structural Parameters

In discrete choice models, the raw coefficients $\boldsymbol{\beta}$ are not directly interpretable as marginal effects. The Average Partial Effect (APE) of continuous covariate $X_k$ on $P(Y_i = j)$ is:

$$APE_k = \frac{1}{n}\sum_{i=1}^n \frac{\partial P(Y_i = j \mid \mathbf{x}_i)}{\partial X_{ik}}$$

For binary logit:

$$APE_k^{logit} = \frac{1}{n}\sum_{i=1}^n \hat{p}_i(1-\hat{p}_i)\hat{\beta}_k$$

For binary probit:

$$APE_k^{probit} = \frac{1}{n}\sum_{i=1}^n \phi(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})\hat{\beta}_k$$

The APE (also called the Average Marginal Effect, AME) is the primary reported quantity of interest in discrete choice models — analogous to the regression coefficient in OLS.
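The logit APE formula above is a one-liner once the fitted probabilities are in hand. A minimal sketch (hypothetical fitted coefficients on simulated covariates):

```python
import numpy as np

def logit_ape(X, beta, k):
    """Average partial effect of continuous covariate k in a binary logit:
    APE_k = mean_i [ p_i (1 - p_i) ] * beta_k."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.mean(p * (1.0 - p)) * beta[k]

# Hypothetical fitted model: intercept and one regressor
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta_hat = np.array([0.2, 0.9])
ape = logit_ape(X, beta_hat, k=1)
print(ape)  # average change in P(Y=1) per unit increase in x
```

Since $p(1-p) \leq 0.25$, the APE is always bounded by $0.25\,\hat{\beta}_k$ in absolute value.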

5.4 Partial Effects at the Mean (PEM) and Partial Effects at Representative Values

Partial Effect at the Mean (PEM): Evaluate the marginal effect at the sample mean $\bar{\mathbf{x}}$:

$$PEM_k = \frac{\partial P(Y_i = j \mid \mathbf{x}_i = \bar{\mathbf{x}})}{\partial X_k}$$

For binary logit: $PEM_k = \Lambda(\bar{\mathbf{x}}^T\hat{\boldsymbol{\beta}})[1 - \Lambda(\bar{\mathbf{x}}^T\hat{\boldsymbol{\beta}})]\hat{\beta}_k$

⚠️ The PEM evaluates the marginal effect at a potentially non-existent "average individual." The APE is generally preferred because it averages the marginal effect across actual observations, accounting for the non-linearity of the model.

5.5 Willingness to Pay (WTP) in Choice Models

In models with a cost or price attribute (e.g., transport cost, product price), the Willingness to Pay (WTP) for a change in attribute $k$ is:

$$WTP_k = -\frac{\hat{\gamma}_k}{\hat{\gamma}_{cost}}$$

Where $\hat{\gamma}_k$ is the coefficient on attribute $k$ and $\hat{\gamma}_{cost}$ is the coefficient on cost. This ratio gives the marginal rate of substitution between attribute $k$ and money — a central output of stated preference and transport choice studies.


6. Binary Choice Models: Logit and Probit

6.1 The Log-Likelihood for Binary Models

For a binary outcome $Y_i \in \{0, 1\}$, the log-likelihood is:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[Y_i \ln P_i + (1 - Y_i) \ln(1 - P_i)\right]$$

Where $P_i = P(Y_i = 1 \mid \mathbf{x}_i)$.

For logit: $P_i = \Lambda(\mathbf{x}_i^T\boldsymbol{\beta})$, giving:

$$\ell_{logit}(\boldsymbol{\beta}) = \sum_{i=1}^n \left[Y_i \mathbf{x}_i^T\boldsymbol{\beta} - \ln(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}})\right]$$

For probit: $P_i = \Phi(\mathbf{x}_i^T\boldsymbol{\beta})$, giving:

$$\ell_{probit}(\boldsymbol{\beta}) = \sum_{i=1}^n \left[Y_i \ln\Phi(\mathbf{x}_i^T\boldsymbol{\beta}) + (1-Y_i)\ln\Phi(-\mathbf{x}_i^T\boldsymbol{\beta})\right]$$

Both log-likelihoods are globally concave, ensuring a unique maximum.

6.2 The Score and Hessian

The score vector (gradient of the log-likelihood) takes a particularly simple form for the logit model:

$$\mathbf{s}(\boldsymbol{\beta}) = \frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_{i=1}^n \left(Y_i - P_i\right)\mathbf{x}_i$$

This simplification occurs because for the logit link $\partial P_i / \partial \eta_i = P_i(1-P_i)$, where $\eta_i = \mathbf{x}_i^T\boldsymbol{\beta}$, which cancels the $P_i(1-P_i)$ denominator of the general score. For probit, the score is $\sum_{i=1}^n \frac{(Y_i - P_i)\,\phi(\eta_i)}{P_i(1-P_i)}\mathbf{x}_i$.

The Hessian matrix (second derivative):

$$\mathbf{H}(\boldsymbol{\beta}) = \frac{\partial^2 \ell}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = -\sum_{i=1}^n w_i \mathbf{x}_i\mathbf{x}_i^T$$

Where $w_i = P_i(1-P_i)$ for logit and $w_i = \phi(\mathbf{x}_i^T\boldsymbol{\beta})^2 / [P_i(1-P_i)]$ for probit. The negative Hessian is positive definite, confirming concavity.

6.3 Newton-Raphson and IRLS Estimation

The MLE is obtained iteratively using Newton-Raphson (or equivalently, Iteratively Reweighted Least Squares — IRLS):

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \left[\mathbf{H}(\boldsymbol{\beta}^{(t)})\right]^{-1}\mathbf{s}(\boldsymbol{\beta}^{(t)})$$

IRLS Interpretation: At each iteration, solve a weighted OLS problem:

$$\boldsymbol{\beta}^{(t+1)} = \left(\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{z}^{(t)}$$

Where $\mathbf{W}^{(t)} = \text{diag}(w_1^{(t)}, \dots, w_n^{(t)})$ is a diagonal weight matrix and $\mathbf{z}^{(t)} = \boldsymbol{\eta}^{(t)} + (\mathbf{W}^{(t)})^{-1}(\mathbf{y} - \hat{\mathbf{p}}^{(t)})$ is the adjusted dependent variable.
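The IRLS recursion can be implemented in a few lines. A minimal numpy sketch for the binary logit (checked against a simulated dataset with hypothetical true coefficients):

```python
import numpy as np

def logit_irls(X, y, tol=1e-10, max_iter=50):
    """Fit a binary logit by IRLS: repeated weighted least squares on the
    adjusted dependent variable z = eta + (y - p)/w, with weights w = p(1-p)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        z = eta + (y - p) / w
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated check against hypothetical true coefficients [-0.5, 1.2]
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(5000), rng.normal(size=5000)])
beta_true = np.array([-0.5, 1.2])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
beta_hat = logit_irls(X, y)
print(beta_hat)  # close to [-0.5, 1.2]
```

Starting from zeros, the recursion typically converges in under ten iterations for well-conditioned data.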

6.4 Asymptotic Properties of MLE

Under regularity conditions, the MLE is asymptotically normal:

$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \mathcal{I}(\boldsymbol{\beta}_0)^{-1})$$

Where the Fisher information matrix is:

$$\mathcal{I}(\boldsymbol{\beta}) = -E\left[\frac{\partial^2 \ell_i}{\partial \boldsymbol{\beta}\partial\boldsymbol{\beta}^T}\right] = \sum_{i=1}^n w_i \mathbf{x}_i\mathbf{x}_i^T$$

The variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ is estimated by the inverse of the observed information matrix:

$$\hat{\mathbf{V}}(\hat{\boldsymbol{\beta}}) = \left[-\mathbf{H}(\hat{\boldsymbol{\beta}})\right]^{-1} = \left(\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X}\right)^{-1}$$

6.5 Interpreting Logit Coefficients as Odds Ratios

For the logit model, exponentiating the coefficient gives the odds ratio:

$$OR_k = e^{\hat{\beta}_k}$$

Interpretation: A one-unit increase in $X_k$ multiplies the odds of $Y=1$ by $e^{\hat{\beta}_k}$:

$$\frac{P(Y=1 \mid X_k + 1) / P(Y=0 \mid X_k + 1)}{P(Y=1 \mid X_k) / P(Y=0 \mid X_k)} = e^{\hat{\beta}_k}$$

⚠️ Odds ratios are not the same as probability ratios (relative risks). Do not interpret the odds ratio as "X% more likely." Convert to marginal probabilities via the APE for clearer communication.

6.6 Marginal Effects in Binary Models

For a continuous covariate $X_k$, the marginal effect of $X_k$ on $P(Y=1)$ for individual $i$:

Logit: $\frac{\partial P_i}{\partial X_{ik}} = \hat{p}_i(1-\hat{p}_i)\hat{\beta}_k = \Lambda(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})[1-\Lambda(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})]\hat{\beta}_k$

Probit: $\frac{\partial P_i}{\partial X_{ik}} = \phi(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})\hat{\beta}_k$

For a discrete/binary covariate $X_k \in \{0,1\}$, the marginal effect is the discrete change in predicted probability:

$$\Delta P_i = P(Y_i=1 \mid X_{ik}=1, \mathbf{x}_{i,-k}) - P(Y_i=1 \mid X_{ik}=0, \mathbf{x}_{i,-k})$$

6.7 Standard Errors for Average Partial Effects (Delta Method)

The APE is a nonlinear function of $\hat{\boldsymbol{\beta}}$. Standard errors are obtained via the delta method:

$$\widehat{SE}(APE_k) = \sqrt{\mathbf{g}_k^T \hat{\mathbf{V}}(\hat{\boldsymbol{\beta}}) \mathbf{g}_k}$$

Where $\mathbf{g}_k = \partial \widehat{APE}_k / \partial \hat{\boldsymbol{\beta}}$ is the gradient of the APE with respect to the coefficient vector. Alternatively, use the bootstrap for more reliable inference with small samples.


7. Hypothesis Testing and Inference

7.1 The Wald Test

The Wald test for $H_0: \beta_k = 0$ uses the asymptotic normality of the MLE:

$$z = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)} \sim \mathcal{N}(0,1)$$

Or equivalently, $z^2 \sim \chi^2_1$. For a vector of $q$ restrictions $H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}$:

$$W = (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})^T\left[\mathbf{R}\hat{\mathbf{V}}(\hat{\boldsymbol{\beta}})\mathbf{R}^T\right]^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r}) \sim \chi^2_q$$

7.2 The Likelihood Ratio Test

The Likelihood Ratio (LR) test compares a restricted model (imposing $H_0$) to an unrestricted model:

$$LR = -2\left[\ell(\hat{\boldsymbol{\beta}}_{restricted}) - \ell(\hat{\boldsymbol{\beta}}_{unrestricted})\right] \sim \chi^2_q$$

Where $q$ is the number of restrictions. The LR test is generally preferred over the Wald test because it is invariant to reparameterisation and often has better finite-sample properties.

Special case: The LR test comparing a model with covariates to an intercept-only model:

$$LR = -2[\ell_0 - \ell_{full}] \sim \chi^2_k$$

Where $\ell_0 = n[p_0\ln p_0 + (1-p_0)\ln(1-p_0)]$ and $p_0 = \bar{Y}$ is the sample proportion.
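In practice the LR statistic is computed directly from the two fitted log-likelihoods. A minimal sketch (hypothetical log-likelihood values) using `scipy.stats.chi2`:

```python
from scipy.stats import chi2

def lr_test(ll_restricted, ll_unrestricted, q):
    """Likelihood ratio test: LR = -2 (ll_r - ll_u) ~ chi2(q) under H0."""
    lr = -2.0 * (ll_restricted - ll_unrestricted)
    return lr, chi2.sf(lr, df=q)

# Hypothetical log-likelihoods: intercept-only vs model with 3 covariates
lr, pval = lr_test(ll_restricted=-520.4, ll_unrestricted=-505.1, q=3)
print(f"LR = {lr:.2f}, p = {pval:.4g}")  # LR = 30.60, strongly significant
```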

7.3 The Score (Lagrange Multiplier) Test

The Score test (Rao test) only requires estimating the restricted model:

$$S = \mathbf{s}(\hat{\boldsymbol{\beta}}_{restricted})^T \hat{\mathcal{I}}(\hat{\boldsymbol{\beta}}_{restricted})^{-1} \mathbf{s}(\hat{\boldsymbol{\beta}}_{restricted}) \sim \chi^2_q$$

Useful when the unrestricted model is computationally expensive to estimate.

7.4 Test Equivalences and Recommendations

| Test | Requires | Best For | Invariant to Reparameterisation? |
|---|---|---|---|
| Wald | Unrestricted model only | Single coefficient tests | ❌ |
| Likelihood Ratio | Both models | Nested model comparison | ✅ |
| Score (LM) | Restricted model only | Adding variables to a model | ✅ |

The three tests are asymptotically equivalent but differ in finite samples. The LR test is generally most reliable.

7.5 Confidence Intervals

A $(1-\alpha)\times 100\%$ Wald confidence interval for $\beta_k$:

$$\hat{\beta}_k \pm z_{\alpha/2} \times SE(\hat{\beta}_k)$$

A profile likelihood confidence interval (more reliable for small samples):

$$CI_{PL} = \left\{\beta_k : -2[\ell(\hat{\boldsymbol{\beta}}_{-k}, \beta_k) - \ell(\hat{\boldsymbol{\beta}})] \leq \chi^2_{\alpha,1}\right\}$$

7.6 Testing IIA with the Hausman-McFadden Test

The Hausman-McFadden test for IIA in the MNL compares the full-sample MNL estimates to estimates obtained after removing one alternative from the choice set:

$$H_{IIA} = (\hat{\boldsymbol{\beta}}_s - \hat{\boldsymbol{\beta}}_f)^T\left[\hat{\mathbf{V}}_s - \hat{\mathbf{V}}_f\right]^{-1}(\hat{\boldsymbol{\beta}}_s - \hat{\boldsymbol{\beta}}_f) \sim \chi^2_k$$

Where $\hat{\boldsymbol{\beta}}_s$ are estimates from the restricted choice set and $\hat{\boldsymbol{\beta}}_f$ are estimates from the full choice set. Rejection suggests IIA violation.

⚠️ The Hausman-McFadden test has poor finite-sample properties and can produce negative test statistics. The Small-Hsiao test offers an alternative. Neither test is definitive. Subject-matter knowledge about alternative similarity remains essential.

7.7 Testing the Proportional Odds Assumption

The Brant test for the proportional odds assumption in ordered logit estimates a separate binary logit for each cumulative split and tests whether the coefficients are equal across splits:

$$H_0: \boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(2)} = \dots = \boldsymbol{\beta}^{(J-1)}$$

A chi-squared test statistic is formed from the sum of squared differences in estimates across cumulative splits, weighted by their precision. Rejection indicates the proportional odds assumption is violated, and a generalised ordered logit or multinomial logit should be considered.

7.8 Robust Standard Errors in Discrete Choice Models

While MLE standard errors are derived from the information matrix, misspecification-robust (sandwich) standard errors are available:

$$\hat{\mathbf{V}}_{sandwich}(\hat{\boldsymbol{\beta}}) = \hat{\mathbf{H}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{H}}^{-1}$$

Where $\hat{\mathbf{H}} = -\sum_i \partial^2\ell_i/\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^T$ and $\hat{\mathbf{B}} = \sum_i (\partial\ell_i/\partial\boldsymbol{\beta})(\partial\ell_i/\partial\boldsymbol{\beta})^T$ (the outer product of scores).


8. Effect Size Measures

8.1 Average Partial Effects (APE / AME)

The primary effect size in discrete choice models is the Average Partial Effect (APE), also called the Average Marginal Effect (AME):

$$APE_k = \frac{1}{n}\sum_{i=1}^n \frac{\partial P(Y_i = j \mid \mathbf{x}_i)}{\partial X_{ik}}$$

Interpretation: The average change in the probability of outcome $j$ associated with a one-unit increase in $X_k$, averaging over all individuals in the sample.

For a binary covariate $D_i \in \{0,1\}$:

$$APE_k = \frac{1}{n}\sum_{i=1}^n \left[P(Y_i=1 \mid D_i=1, \mathbf{x}_{i,-k}) - P(Y_i=1 \mid D_i=0, \mathbf{x}_{i,-k})\right]$$

8.2 Odds Ratios and Relative Risk

| Measure | Formula | Interpretation |
|---|---|---|
| Odds Ratio (OR) | $e^{\hat{\beta}_k}$ | Multiplicative change in odds per unit increase in $X_k$ |
| Relative Risk (RR) | $P(Y=1 \mid X_k+1) / P(Y=1 \mid X_k)$ | Ratio of probabilities; computed at representative values |
| Absolute Risk Reduction (ARR) | $P(Y=1 \mid X_k=1) - P(Y=1 \mid X_k=0)$ | Difference in probabilities for binary $X_k$ |
| Number Needed to Treat (NNT) | $1 / \lvert ARR \rvert$ | Number of individuals treated to yield one additional outcome |

8.3 Predicted Probability Changes

For practical communication, report predicted probabilities at meaningful covariate values:

$$\Delta\hat{P} = \hat{P}(Y_i=1 \mid \mathbf{x}_{high}) - \hat{P}(Y_i=1 \mid \mathbf{x}_{low})$$

Where $\mathbf{x}_{high}$ and $\mathbf{x}_{low}$ represent two substantively meaningful covariate profiles (e.g., high-income vs. low-income; treated vs. untreated).

8.4 Standardised Coefficients in Discrete Choice Models

To compare the relative importance of different covariates, standardise the APE by the standard deviation of the outcome:

$$\beta^*_k = APE_k \times \frac{s_{X_k}}{s_Y}$$

Where $s_{X_k}$ is the standard deviation of $X_k$ and $s_Y = \sqrt{\bar{p}(1-\bar{p})}$ for binary outcomes. This produces an effect size interpretable as the change in probability (in units of the outcome SD) per SD change in $X_k$.

8.5 McFadden's Pseudo-$R^2$ as Effect Size

McFadden's pseudo-$R^2$ measures the proportional improvement in log-likelihood:

$$\rho^2 = 1 - \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell_0}$$

Where $\ell_0$ is the log-likelihood of the intercept-only model. While not a pure effect size, it provides a scale for comparing model fit improvement:

| $\rho^2$ | Interpretation |
|---|---|
| 0.00–0.10 | Poor fit |
| 0.10–0.20 | Acceptable fit |
| 0.20–0.40 | Good fit |
| 0.40+ | Very good fit |

8.6 Willingness to Pay (WTP) as Effect Size in Choice Experiments

In stated or revealed preference studies, WTP contextualises effect sizes economically:

WTPk=APEkAPEcost=γ^kγ^costWTP_k = -\frac{APE_k}{APE_{cost}} = -\frac{\hat{\gamma}_k}{\hat{\gamma}_{cost}}

Report WTP with confidence intervals obtained via the delta method or Krinsky-Robb simulation.


9. Model Fit and Evaluation

9.1 Goodness-of-Fit Statistics

| Statistic | Formula | Description |
|---|---|---|
| Log-likelihood at convergence | $\ell(\hat{\boldsymbol{\beta}})$ | Higher (less negative) is better |
| Null log-likelihood | $\ell_0$ | Baseline (intercept-only) |
| LR chi-squared | $-2(\ell_0 - \ell)$ | Overall model fit test |
| McFadden's $\rho^2$ | $1 - \ell/\ell_0$ | Proportional LL improvement |
| Adjusted McFadden's $\rho^2$ | $1 - (\ell - k)/\ell_0$ | Penalised for parameters |
| AIC | $-2\ell + 2k$ | Lower is better |
| BIC | $-2\ell + k\ln(n)$ | Lower is better; penalises more |
| Count $R^2$ | Correctly classified / $n$ | Naive classification accuracy |
| Hit rate (vs. base) | Count $R^2$ vs. $\max(\bar{p}, 1-\bar{p})$ | Improvement over naive classifier |

9.2 Pseudo-R² Measures

Multiple pseudo-R2R^2 measures exist; they capture different aspects of fit:

McFadden (1974): ρMcFadden2=1(β^)0\rho^2_{McFadden} = 1 - \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell_0}

Cox-Snell: RCS2=1(L0L(β^))2/nR^2_{CS} = 1 - \left(\frac{\mathcal{L}_0}{\mathcal{L}(\hat{\boldsymbol{\beta}})}\right)^{2/n}

Nagelkerke (normalised Cox-Snell): RN2=RCS21L02/nR^2_{N} = \frac{R^2_{CS}}{1 - \mathcal{L}_0^{2/n}}

⚠️ No single pseudo-R2R^2 is universally "correct." Report multiple, and always prefer out-of-sample predictive performance metrics (AUC, Brier score) for evaluating predictive models.

9.3 Classification Metrics for Binary Models

For binary models, at a threshold cc (default c=0.5c = 0.5):

Y^i=1[p^ic]\hat{Y}_i = \mathbf{1}[\hat{p}_i \geq c]

| Metric | Formula | Description |
|---|---|---|
| Accuracy | $(TP + TN)/(TP + TN + FP + FN)$ | Overall correct classification rate |
| Sensitivity (Recall) | $TP/(TP + FN)$ | True positive rate |
| Specificity | $TN/(TN + FP)$ | True negative rate |
| Precision (PPV) | $TP/(TP + FP)$ | Positive predictive value |
| F1 Score | $2 \cdot (\text{Precision} \times \text{Recall})/(\text{Precision} + \text{Recall})$ | Harmonic mean of precision and recall |
| AUC-ROC | Area under ROC curve | Discrimination across all thresholds |

The Receiver Operating Characteristic (ROC) curve plots sensitivity vs. (1specificity)(1-\text{specificity}) across all classification thresholds c[0,1]c \in [0,1]. The Area Under the Curve (AUC) summarises discrimination:

| AUC | Interpretation |
|---|---|
| $0.50$ | No discrimination (random) |
| $0.70 - 0.80$ | Acceptable discrimination |
| $0.80 - 0.90$ | Excellent discrimination |
| $0.90+$ | Outstanding discrimination |
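The AUC can be computed directly from its rank interpretation: it is the probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties counting one half). A small numpy sketch of this Mann-Whitney formulation:

```python
import numpy as np

def auc_mann_whitney(y, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    y = np.asarray(y)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # pairwise comparisons
    ties = (pos[:, None] == neg[None, :]).sum()     # ties count one half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

auc = auc_mann_whitney([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise comparison is $O(n^2)$ and fine for moderate samples; production code would use a sorting-based $O(n \log n)$ variant.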

9.4 Calibration

Calibration assesses whether predicted probabilities match observed outcome rates.

Hosmer-Lemeshow test: Partition observations into GG (typically 10) quantile groups by predicted probability. Compare observed and expected counts in each group:

H=g=1G(Og1Eg1)2Eg1+(Og0Eg0)2Eg0χG22H = \sum_{g=1}^G \frac{(O_g^1 - E_g^1)^2}{E_g^1} + \frac{(O_g^0 - E_g^0)^2}{E_g^0} \sim \chi^2_{G-2}

Where Og1O_g^1 and Eg1E_g^1 are observed and expected counts of Y=1Y=1 in group gg. Rejection suggests poor calibration.

Calibration plot: Plot mean predicted probability vs. observed proportion in each decile group. A well-calibrated model lies along the 45° diagonal.
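A minimal numpy/scipy sketch of the Hosmer-Lemeshow statistic using equal-sized quantile groups (the function name is our own, not a library API; real implementations differ slightly in how they break ties at group boundaries):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow chi-squared statistic over quantile groups of p_hat."""
    y = np.asarray(y)
    p = np.asarray(p_hat, dtype=float)
    order = np.argsort(p)
    y, p = y[order], p[order]
    H = 0.0
    for g in np.array_split(np.arange(len(p)), groups):  # equal-sized groups
        obs1, exp1 = y[g].sum(), p[g].sum()
        obs0, exp0 = len(g) - obs1, len(g) - exp1
        H += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return H, chi2.sf(H, df=groups - 2)

# Simulated outcomes that are well calibrated by construction
rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=500)
y_sim = rng.binomial(1, p_true)
H_stat, p_val = hosmer_lemeshow(y_sim, p_true)
```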

9.5 Information Criteria for Model Comparison

When comparing non-nested models (e.g., logit vs. probit; different covariate sets):

AIC=2(β^)+2kAIC = -2\ell(\hat{\boldsymbol{\beta}}) + 2k BIC=2(β^)+kln(n)BIC = -2\ell(\hat{\boldsymbol{\beta}}) + k\ln(n)

Lower values indicate better fit. BIC imposes a heavier penalty on model complexity, favouring parsimony. AIC and BIC are only directly comparable for models fitted to the same dataset with the same outcome variable.

9.6 Out-of-Sample Validation

For predictive models, always assess performance on held-out data:

- Split the sample into training and test sets (e.g., 70/30); fit on the training set and evaluate AUC, Brier score, and calibration on the test set.
- With smaller samples, use $k$-fold cross-validation (typically $k = 5$ or $10$) and average the metrics across folds.
- Report out-of-sample rather than in-sample classification metrics whenever the model's purpose is prediction.


10. Diagnostics and Assumption Testing

10.1 Residuals in Discrete Choice Models

Unlike OLS, residuals in discrete choice models require careful definition.

Pearson residuals: riP=Yip^ip^i(1p^i)r_i^P = \frac{Y_i - \hat{p}_i}{\sqrt{\hat{p}_i(1-\hat{p}_i)}}

Deviance residuals: riD=sign(Yip^i)2[Yilnp^i+(1Yi)ln(1p^i)]r_i^D = \text{sign}(Y_i - \hat{p}_i)\sqrt{-2\left[Y_i\ln\hat{p}_i + (1-Y_i)\ln(1-\hat{p}_i)\right]}

The deviance (sum of squared deviance residuals) is:

D=i=1n(riD)2=2(β^)D = \sum_{i=1}^n (r_i^D)^2 = -2\ell(\hat{\boldsymbol{\beta}})

Standardised residuals for outlier detection: ristd=riD/1hiir_i^{std} = r_i^D / \sqrt{1 - h_{ii}}, where hiih_{ii} is the hat-value (leverage).
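The definitions above are easy to verify numerically. The sketch below computes both residual types on simulated data and confirms the deviance identity $\sum_i (r_i^D)^2 = -2\ell(\hat{\boldsymbol{\beta}})$:

```python
import numpy as np

def logit_residuals(y, p_hat):
    """Pearson and deviance residuals for a fitted binary choice model."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p_hat, dtype=float)
    pearson = (y - p) / np.sqrt(p * (1 - p))
    deviance = np.sign(y - p) * np.sqrt(
        -2 * (y * np.log(p) + (1 - y) * np.log(1 - p))
    )
    return pearson, deviance

rng = np.random.default_rng(1)
p = rng.uniform(0.1, 0.9, 300)
y = rng.binomial(1, p).astype(float)
r_p, r_d = logit_residuals(y, p)
loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
# sum of squared deviance residuals should equal -2 * loglik
```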

10.2 Influence and Leverage

Leverage in logit/probit: hii=w^ixiT(XTW^X)1xih_{ii} = \hat{w}_i \mathbf{x}_i^T(\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X})^{-1}\mathbf{x}_i

Cook's distance analogue: CDi=(riP)2hiik(1hii)2CD_i = \frac{(r_i^P)^2 h_{ii}}{k(1-h_{ii})^2}

DFFITS and DFBETAS analogues are available for identifying influential observations. Flag observations with CDi>4/n|CD_i| > 4/n or ristd>2|r_i^{std}| > 2 for inspection.

10.3 Testing the Proportional Odds Assumption (Ordered Logit)

Brant (1990) test: Estimates a binary logit for each of the J1J-1 cumulative dichotomisations and tests whether coefficients are equal. Available both as a global test (all covariates) and variable-specific tests:

χBrant2=j=1J2(β^jβ^J1)T[V^j+V^J1]1(β^jβ^J1)\chi^2_{Brant} = \sum_{j=1}^{J-2} (\hat{\boldsymbol{\beta}}_j - \hat{\boldsymbol{\beta}}_{J-1})^T \left[\hat{\mathbf{V}}_j + \hat{\mathbf{V}}_{J-1}\right]^{-1} (\hat{\boldsymbol{\beta}}_j - \hat{\boldsymbol{\beta}}_{J-1})

Graphical check: Plot ordered logit coefficients estimated separately for each binary cumulative split. Coefficients that vary substantially suggest a violation.

Remedy if violated:

- Fit the Generalised Ordered Logit or the partial proportional odds model (Section 12.3), relaxing the equal-coefficients constraint only for the offending covariates.
- As a last resort, fit a multinomial logit, at the cost of discarding the ordering information.

10.4 Testing IIA

Multiple tests for IIA are available, each with limitations:

| Test | Method | Reference | Limitations |
|---|---|---|---|
| Hausman-McFadden | Compare restricted vs. full estimates | Hausman & McFadden (1984) | Can yield negative test statistic |
| Small-Hsiao | Random sample split + comparison | Small & Hsiao (1985) | Sample-split dependent |
| Swait-Louviere | Scaling test across datasets | Swait & Louviere (1993) | Requires two datasets |

Remedy if IIA fails:

- Fit a Nested Logit, grouping close substitutes into nests (Section 13.1).
- Fit a Mixed Logit, which allows flexible correlation across alternatives (Section 13.3).

10.5 Checking for Complete Separation

Complete separation occurs when a covariate or linear combination of covariates perfectly predicts the outcome — the MLE does not exist (the log-likelihood has no finite maximum):

Detection: MLE algorithm fails to converge; extremely large coefficient estimates with very large standard errors; implausible predicted probabilities near 0 or 1.

Remedies:

- Firth's penalised likelihood (bias-reduced logistic regression), which always yields finite estimates.
- Remove or recode the offending covariate (e.g., collapse sparse categories).
- Exact logistic regression for small samples, or Bayesian estimation with weakly informative priors.

10.6 Heteroscedasticity in Probit (Heteroscedastic Probit)

In the standard probit, Var(ϵi)=1\text{Var}(\epsilon_i) = 1 for all ii. If the true error variance is heteroscedastic:

Var(ϵi)=σi2=[ehiTδ]2\text{Var}(\epsilon_i) = \sigma_i^2 = [e^{\mathbf{h}_i^T\boldsymbol{\delta}}]^2

The heteroscedastic probit models:

P(Yi=1xi,hi)=Φ(xiTβehiTδ)P(Y_i = 1 \mid \mathbf{x}_i, \mathbf{h}_i) = \Phi\left(\frac{\mathbf{x}_i^T\boldsymbol{\beta}}{e^{\mathbf{h}_i^T\boldsymbol{\delta}}}\right)

Standard probit estimates are inconsistent under heteroscedasticity (unlike OLS which remains consistent, only losing efficiency). The linktest (adding the squared predicted index as a covariate) checks for systematic misspecification.

10.7 Goodness-of-Link Tests

The linktest (Pregibon, 1980) adds η^i2=(xiTβ^)2\hat{\eta}_i^2 = (\mathbf{x}_i^T\hat{\boldsymbol{\beta}})^2 as an additional regressor to the fitted model:

P(Yi=1)=F(η^iδ1+η^i2δ2)P(Y_i=1) = F(\hat{\eta}_i \cdot \delta_1 + \hat{\eta}_i^2 \cdot \delta_2)

Under correct specification, δ^2=0\hat{\delta}_2 = 0 (the squared term should not be significant). A significant δ^2\hat{\delta}_2 indicates link function misspecification or omitted non-linear terms.


11. Extensions: Multinomial and Conditional Logit

11.1 MNL Log-Likelihood

For outcome Yi{1,,J}Y_i \in \{1, \dots, J\} with reference category j=1j=1, the MNL log-likelihood:

({βj}j=2J)=i=1nj=1J1[Yi=j]lnP(Yi=jxi)\ell(\{\boldsymbol{\beta}_j\}_{j=2}^J) = \sum_{i=1}^n \sum_{j=1}^J \mathbf{1}[Y_i=j] \ln P(Y_i=j \mid \mathbf{x}_i)

=i=1n[j=2J1[Yi=j]xiTβjln(1+k=2JexiTβk)]= \sum_{i=1}^n \left[\sum_{j=2}^J \mathbf{1}[Y_i=j]\mathbf{x}_i^T\boldsymbol{\beta}_j - \ln\left(1 + \sum_{k=2}^J e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}\right)\right]

11.2 Marginal Effects in MNL

For the MNL, the marginal effect of XkX_k on P(Yi=j)P(Y_i = j):

P(Yi=j)Xk=P(Yi=j)[βjkl=1JP(Yi=l)βlk]\frac{\partial P(Y_i = j)}{\partial X_k} = P(Y_i=j)\left[\beta_{jk} - \sum_{l=1}^J P(Y_i=l)\beta_{lk}\right]

Note: Cross-effects — the effect of a covariate on a different category's probability — may be positive or negative, depending on model parameters.

Average Partial Effect: APEjk=1ni=1nP^ij[β^jkl=1JP^ilβ^lk]APE_{jk} = \frac{1}{n}\sum_{i=1}^n \hat{P}_{ij}\left[\hat{\beta}_{jk} - \sum_{l=1}^J \hat{P}_{il}\hat{\beta}_{lk}\right]
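A vectorised numpy sketch of the APE formula above, given a matrix of fitted category probabilities and one covariate's coefficients (the probabilities below are synthetic stand-ins for fitted MNL output):

```python
import numpy as np

def mnl_ape(P, beta_k):
    """Average partial effects of one covariate on each category probability.

    P      : (n, J) fitted category probabilities
    beta_k : (J,) coefficient of the covariate in each category's equation
             (0 for the reference category)."""
    weighted = P @ beta_k                              # sum_l P_il * beta_lk
    effects = P * (beta_k[None, :] - weighted[:, None])  # observation-level effects
    return effects.mean(axis=0)

rng = np.random.default_rng(2)
V = rng.normal(size=(200, 3))
V[:, 0] = 0.0                                          # reference category utility 0
P = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)
ape = mnl_ape(P, np.array([0.0, 0.4, -0.2]))
```

Because probabilities sum to one, the APEs across categories sum to zero, which is a useful internal check on any implementation.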

11.3 The Conditional Logit and Mixed-Effects Specification

The full Mixed Logit specification that includes both individual-varying and alternative-varying attributes:

Vij=zijTγalternative attributes+xiTβjindividual chars. × alt. FEV_{ij} = \underbrace{\mathbf{z}_{ij}^T\boldsymbol{\gamma}}_{\text{alternative attributes}} + \underbrace{\mathbf{x}_i^T\boldsymbol{\beta}_j}_{\text{individual chars. × alt. FE}}

Where:

- $\mathbf{z}_{ij}$ are alternative-varying attributes (e.g., price, travel time), entering with a common coefficient vector $\boldsymbol{\gamma}$;
- $\mathbf{x}_i$ are individual characteristics, entering with alternative-specific coefficients $\boldsymbol{\beta}_j$ (normalised to zero for the reference alternative).

11.4 Marginal Effects on Log-Odds (MNL)

The log-odds of choosing jj vs. reference category 11:

lnP(Yi=j)P(Yi=1)=xiTβj\ln\frac{P(Y_i = j)}{P(Y_i = 1)} = \mathbf{x}_i^T\boldsymbol{\beta}_j

XklnP(Yi=j)P(Yi=1)=βjk\frac{\partial}{\partial X_k}\ln\frac{P(Y_i=j)}{P(Y_i=1)} = \beta_{jk}

This is the most directly interpretable quantity from the MNL regression output: βjk\beta_{jk} is the effect of XkX_k on the log-odds of jj vs. reference.

11.5 Substitution Patterns and the IIA Implication

Under IIA, the own-price elasticity and cross-price elasticity have rigid implications:

Own elasticity: εjjk=PjzjkzjkPj=γkzjk(1Pj)\varepsilon_{jj}^k = \frac{\partial P_j}{\partial z_{jk}}\frac{z_{jk}}{P_j} = \gamma_k z_{jk}(1 - P_j)

Cross elasticity (between alternatives jj and ll): εjlk=PjzlkzlkPj=γkzlkPl\varepsilon_{jl}^k = \frac{\partial P_j}{\partial z_{lk}}\frac{z_{lk}}{P_j} = -\gamma_k z_{lk} P_l

Under IIA, the cross elasticity is the same for all jlj \neq l — a strong and often unrealistic restriction. The cross elasticity depends only on the attribute level and share of the alternative being changed, not on the similarity between alternatives jj and ll.


12. Extensions: Ordered Choice Models

12.1 The Ordered Logit (Proportional Odds Model)

Recall from Section 3.6 the latent variable:

Yi=xiTβ+ϵi,ϵiLogisticY_i^* = \mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim \text{Logistic}

The ordered logit log-likelihood:

(β,τ)=i=1nj=1J1[Yi=j]ln[Λ(τjxiTβ)Λ(τj1xiTβ)]\ell(\boldsymbol{\beta}, \boldsymbol{\tau}) = \sum_{i=1}^n \sum_{j=1}^J \mathbf{1}[Y_i=j] \ln\left[\Lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta}) - \Lambda(\tau_{j-1} - \mathbf{x}_i^T\boldsymbol{\beta})\right]

Subject to τ0=\tau_0 = -\infty, τJ=+\tau_J = +\infty, and τ1<τ2<<τJ1\tau_1 < \tau_2 < \dots < \tau_{J-1}.

The coefficient vector β\boldsymbol{\beta} and thresholds τ\boldsymbol{\tau} are estimated jointly.

12.2 Marginal Effects in Ordered Models

For a continuous covariate XkX_k, the marginal effect on P(Yi=j)P(Y_i = j):

P(Yi=j)Xk=β^k[λ(τ^jxiTβ^)λ(τ^j1xiTβ^)]\frac{\partial P(Y_i=j)}{\partial X_k} = -\hat{\beta}_k\left[\lambda(\hat{\tau}_j - \mathbf{x}_i^T\hat{\boldsymbol{\beta}}) - \lambda(\hat{\tau}_{j-1} - \mathbf{x}_i^T\hat{\boldsymbol{\beta}})\right]

Where λ()=Λ()[1Λ()]\lambda(\cdot) = \Lambda(\cdot)[1-\Lambda(\cdot)] is the logistic PDF.

Key observation: For the highest category (j=Jj=J) and lowest category (j=1j=1), the signs are: P(Yi=J)Xk=β^kλ(τ^J1xiTβ^)>0 if β^k>0\frac{\partial P(Y_i = J)}{\partial X_k} = \hat{\beta}_k \lambda(\hat{\tau}_{J-1} - \mathbf{x}_i^T\hat{\boldsymbol{\beta}}) > 0 \text{ if } \hat{\beta}_k > 0 P(Yi=1)Xk=β^kλ(τ^1xiTβ^)<0 if β^k>0\frac{\partial P(Y_i = 1)}{\partial X_k} = -\hat{\beta}_k \lambda(\hat{\tau}_1 - \mathbf{x}_i^T\hat{\boldsymbol{\beta}}) < 0 \text{ if } \hat{\beta}_k > 0

For middle categories, the sign depends on parameter values — effects on middle categories can go either way even when the overall latent variable effect is unambiguous.

12.3 Generalised Ordered Logit

When the proportional odds assumption is violated, the Generalised Ordered Logit allows β\boldsymbol{\beta} to vary across thresholds:

P(Yi>jxi)=Λ(xiTβjτj),j=1,,J1P(Y_i > j \mid \mathbf{x}_i) = \Lambda(\mathbf{x}_i^T\boldsymbol{\beta}_j - \tau_j), \quad j = 1, \dots, J-1

The partial proportional odds model constrains some coefficients to be equal across thresholds (for covariates satisfying PO) and allows others to vary:

P(Yi>j)=Λ(x1iTβ+x2iTγjτj)P(Y_i > j) = \Lambda(\mathbf{x}_{1i}^T\boldsymbol{\beta} + \mathbf{x}_{2i}^T\boldsymbol{\gamma}_j - \tau_j)

Where β\boldsymbol{\beta} is common across thresholds and γj\boldsymbol{\gamma}_j varies.

12.4 Ordered Probit

The Ordered Probit replaces the logistic with the normal CDF:

P(Yijxi)=Φ(τjxiTβ)P(Y_i \leq j \mid \mathbf{x}_i) = \Phi(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta})

Marginal effects are analogous, replacing λ\lambda with ϕ\phi (the standard normal PDF):

P(Yi=j)Xk=β^k[ϕ(τ^jxiTβ^)ϕ(τ^j1xiTβ^)]\frac{\partial P(Y_i = j)}{\partial X_k} = -\hat{\beta}_k\left[\phi(\hat{\tau}_j - \mathbf{x}_i^T\hat{\boldsymbol{\beta}}) - \phi(\hat{\tau}_{j-1} - \mathbf{x}_i^T\hat{\boldsymbol{\beta}})\right]
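A short sketch of the ordered probit probabilities and marginal effects, exploiting the telescoping structure of the thresholds (the index value, thresholds, and coefficient are illustrative):

```python
import numpy as np
from scipy.stats import norm

def ordered_probit_effects(x_beta, thresholds, beta_k):
    """Category probabilities and marginal effects of one covariate at a
    given value of the fitted index x_beta (ordered probit)."""
    tau = np.concatenate(([-np.inf], thresholds, [np.inf]))
    cdf = norm.cdf(tau - x_beta)        # Phi(tau_j - eta), j = 0, ..., J
    pdf = norm.pdf(tau - x_beta)        # phi(...); 0 at +/- infinity
    probs = np.diff(cdf)                # P(Y = j)
    marg = -beta_k * np.diff(pdf)       # -beta_k [phi(tau_j - eta) - phi(tau_{j-1} - eta)]
    return probs, marg

probs, marg = ordered_probit_effects(x_beta=0.3, thresholds=[-1.0, 0.5, 1.8], beta_k=0.6)
```

With $\beta_k > 0$ the effect on the lowest category is negative and on the highest positive, matching the sign results in Section 12.2; the effects also sum to zero across categories.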


13. Extensions: Nested Logit and Mixed Logit

13.1 The Nested Logit: Addressing IIA

The Nested Logit relaxes IIA by grouping alternatives into nests within which alternatives are correlated substitutes. The choice probability decomposes into:

$P(Y_i = j) = \underbrace{P(j \mid \text{nest } m)}_{\text{within-nest choice}} \times \underbrace{P(\text{nest } m)}_{\text{nest choice}}$

The dissimilarity parameter $\lambda_m \in (0,1]$ governs the correlation within nest $m$:

- $\lambda_m = 1$: no within-nest correlation; the model collapses to the standard MNL.
- $\lambda_m \to 0$: within-nest correlation approaches one; alternatives in the nest become near-perfect substitutes.

The inclusive value Iim=lnjBmeVij/λmI_{im} = \ln\sum_{j \in B_m}e^{V_{ij}/\lambda_m} summarises the attractiveness of nest mm, allowing it to influence the nest-level choice.

Utility consistency: The Nested Logit is RUM-consistent if and only if λm(0,1]\lambda_m \in (0,1] for all nests. If λ^m>1\hat{\lambda}_m > 1, it signals misspecification or incorrect nesting structure.

13.2 Estimation of the Nested Logit

Sequential (limited information) estimation:

  1. Estimate the within-nest model parameters by fitting a conditional logit within each nest.
  2. Compute the inclusive values I^im\hat{I}_{im}.
  3. Estimate the nest-level model using I^im\hat{I}_{im} as a covariate.

Full information MLE: Simultaneously maximise the full nested logit log-likelihood:

=i=1nlnP(Yi=ji)=i=1n[lnP(jiBmi)+lnP(Bmi)]\ell = \sum_{i=1}^n \ln P(Y_i = j_i) = \sum_{i=1}^n \left[\ln P(j_i \mid B_{m_i}) + \ln P(B_{m_i})\right]

Full information MLE is preferred because it is asymptotically efficient. Sequential estimation is easier to implement, but its second-step standard errors must be corrected for the estimated inclusive values, and the resulting estimates, while consistent, are less efficient.

13.3 The Mixed Logit: Flexible Preferences

The Mixed Logit approximates virtually any random utility model by allowing random coefficients:

βi=μ+Lηi,ηiN(0,I)\boldsymbol{\beta}_i = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{\eta}_i, \quad \boldsymbol{\eta}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})

Where μ\boldsymbol{\mu} are mean preferences and LLT=Σ\mathbf{L}\mathbf{L}^T = \boldsymbol{\Sigma} captures preference heterogeneity and cross-alternative error correlation.

Key advantages over MNL:

  1. No IIA: Correlation across alternatives via Σ\boldsymbol{\Sigma}.
  2. Preference heterogeneity: Estimates the distribution of preferences, not just the mean.
  3. Panel data: Handles repeated choices by the same individual via the mixing distribution.
  4. Flexible substitution: Allows realistic substitution patterns.

13.4 Simulation-Based Estimation for Mixed Logit

Since P(Yi=j)=Lij(β)f(β)dβP(Y_i = j) = \int L_{ij}(\boldsymbol{\beta})f(\boldsymbol{\beta})d\boldsymbol{\beta} has no closed form, use simulation:

Simulated Maximum Likelihood (SML):

P~(Yi=j)=1Rr=1RexijTβ(r)k=1JexikTβ(r)\tilde{P}(Y_i = j) = \frac{1}{R}\sum_{r=1}^R \frac{e^{\mathbf{x}_{ij}^T\boldsymbol{\beta}^{(r)}}}{\sum_{k=1}^J e^{\mathbf{x}_{ik}^T\boldsymbol{\beta}^{(r)}}}

Where β(r)\boldsymbol{\beta}^{(r)} are draws from the assumed mixing distribution. The SML estimator maximises:

~(μ,Σ)=i=1nlnP~(Yi=ji)\tilde{\ell}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{i=1}^n \ln \tilde{P}(Y_i = j_i)

Quasi-Monte Carlo (Halton sequences): Replace pseudo-random draws with Halton sequences — low-discrepancy sequences that cover the integration domain more uniformly, typically reducing simulation variance by a factor of 10–100 compared to random sampling, requiring far fewer draws (typically R=100500R = 100-500 is sufficient).

Bayesian MCMC: An alternative to SML, using Markov Chain Monte Carlo to sample from the posterior distribution of parameters and individual-specific coefficients simultaneously.

13.5 Recovering Individual-Level Preferences

A key advantage of the Mixed Logit is the ability to estimate individual-specific coefficients using Bayes' theorem:

f(βiYi,Xi)=P(YiXi,βi)f(βiμ,Σ)P(YiXi,β)f(βμ,Σ)dβf(\boldsymbol{\beta}_i \mid \mathbf{Y}_i, \mathbf{X}_i) = \frac{P(\mathbf{Y}_i \mid \mathbf{X}_i, \boldsymbol{\beta}_i)f(\boldsymbol{\beta}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\int P(\mathbf{Y}_i \mid \mathbf{X}_i, \boldsymbol{\beta})f(\boldsymbol{\beta} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})d\boldsymbol{\beta}}

The posterior mean:

β~i=E[βiYi,Xi]rβ(r)tLijt(β(r))rtLijt(β(r))\tilde{\boldsymbol{\beta}}_i = E[\boldsymbol{\beta}_i \mid \mathbf{Y}_i, \mathbf{X}_i] \approx \frac{\sum_r \boldsymbol{\beta}^{(r)}\prod_t L_{ij_t}(\boldsymbol{\beta}^{(r)})}{\sum_r \prod_t L_{ij_t}(\boldsymbol{\beta}^{(r)})}

These conditional means reveal individual-level taste heterogeneity and are used for market segmentation and personalised prediction.


14. Extensions: Panel Data Discrete Choice

14.1 The Challenge: Incidental Parameters Problem

In panel data, each individual i=1,,ni = 1, \dots, n makes choices across TT time periods. The natural extension of binary logit to panel data with fixed effects:

P(Yit=1xit,αi)=Λ(αi+xitTβ)P(Y_{it}=1 \mid \mathbf{x}_{it}, \alpha_i) = \Lambda(\alpha_i + \mathbf{x}_{it}^T\boldsymbol{\beta})

Where $\alpha_i$ is an individual fixed effect. The incidental parameters problem arises because:

- The number of nuisance parameters $\alpha_1, \dots, \alpha_n$ grows with the sample size $n$.
- With fixed $T$, each $\alpha_i$ is estimated from only $T$ observations, so the $\hat{\alpha}_i$ are inconsistent.
- Unlike the linear FE model, this inconsistency contaminates $\hat{\boldsymbol{\beta}}$: for $T = 2$, the unconditional FE logit MLE converges to $2\boldsymbol{\beta}$ rather than $\boldsymbol{\beta}$.

14.2 Conditional Fixed Effects Logit (Chamberlain, 1980)

Chamberlain's conditional logit solves the incidental parameters problem for binary logit by conditioning on the sufficient statistic for αi\alpha_i — the individual's total number of successes Si=tYitS_i = \sum_t Y_{it}:

P(Yi1,,YiTSi,Xi)=exp(tYitxitTβ)dC(Si)exp(tdtxitTβ)P(Y_{i1}, \dots, Y_{iT} \mid S_i, \mathbf{X}_i) = \frac{\exp\left(\sum_t Y_{it}\mathbf{x}_{it}^T\boldsymbol{\beta}\right)}{\sum_{\mathbf{d} \in \mathcal{C}(S_i)}\exp\left(\sum_t d_t\mathbf{x}_{it}^T\boldsymbol{\beta}\right)}

Where C(Si)\mathcal{C}(S_i) is the set of all binary sequences with SiS_i ones (the conditioning set).

Key properties:

- Consistent for $\boldsymbol{\beta}$ as $n \to \infty$ with $T$ fixed.
- Individuals with $S_i = 0$ or $S_i = T$ (no within-person variation in $Y$) contribute no information and drop out.
- $\alpha_i$ is never estimated, so predicted probabilities and average partial effects are not identified.
- Effects of time-invariant covariates cannot be estimated; they cancel in the conditioning.

14.3 Random Effects Probit

When the fixed effects approach is too restrictive (e.g., with time-invariant covariates), the random effects probit assumes:

αiN(0,σα2)\alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2)

P(Yit=1xit,αi)=Φ(αi+xitTβ)P(Y_{it}=1 \mid \mathbf{x}_{it}, \alpha_i) = \Phi(\alpha_i + \mathbf{x}_{it}^T\boldsymbol{\beta})

The marginal log-likelihood integrates out αi\alpha_i:

$P(Y_{i1}, \dots, Y_{iT} \mid \mathbf{X}_i) = \int \prod_{t=1}^T \Phi\left(\alpha_i + \mathbf{x}_{it}^T\boldsymbol{\beta}\right)^{Y_{it}}\Phi\left(-(\alpha_i + \mathbf{x}_{it}^T\boldsymbol{\beta})\right)^{1-Y_{it}} \phi\left(\frac{\alpha_i}{\sigma_\alpha}\right)\frac{d\alpha_i}{\sigma_\alpha}$

This integral is computed via Gauss-Hermite quadrature.
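A sketch of one individual's likelihood contribution under Gauss-Hermite quadrature, using the change of variables $\alpha_i = \sqrt{2}\,\sigma_\alpha u$ so the weight function matches $e^{-u^2}$ (the outcome sequence and covariates below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def re_probit_lik_i(y_it, X_it, beta, sigma_alpha, nodes=20):
    """Individual likelihood for the random effects probit, integrating
    out alpha_i ~ N(0, sigma_alpha^2) by Gauss-Hermite quadrature."""
    u, w = np.polynomial.hermite.hermgauss(nodes)   # nodes/weights for e^{-u^2}
    alpha = np.sqrt(2.0) * sigma_alpha * u          # change of variables
    eta = X_it @ beta                               # (T,) fitted indices
    z = eta[None, :] + alpha[:, None]               # (nodes, T)
    per_t = np.where(y_it[None, :] == 1, norm.cdf(z), norm.cdf(-z))
    return np.sum(w * per_t.prod(axis=1)) / np.sqrt(np.pi)

y_it = np.array([1, 0, 1, 1])
X_it = np.column_stack([np.ones(4), [0.2, -0.1, 0.4, 0.3]])
lik = re_probit_lik_i(y_it, X_it, beta=np.array([0.1, 0.8]), sigma_alpha=1.0)
```

Setting $\sigma_\alpha = 0$ collapses the integral to the plain product of probit probabilities, which is a convenient correctness check.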

Mundlak-Chamberlain (Correlated RE): Relax the random effects independence assumption by including individual-level means of time-varying covariates:

αi=xˉiTδ+vi,viN(0,σv2)\alpha_i = \bar{\mathbf{x}}_i^T\boldsymbol{\delta} + v_i, \quad v_i \sim \mathcal{N}(0,\sigma_v^2)

This allows correlation between αi\alpha_i and xit\mathbf{x}_{it}, approximating the FE estimator while retaining the ability to estimate effects of time-invariant variables.

14.4 Dynamic Panel Discrete Choice

State dependence refers to the direct causal effect of past choices on current choices:

P(Yit=1xit,Yi,t1,αi)=Λ(αi+xitTβ+ρYi,t1)P(Y_{it}=1 \mid \mathbf{x}_{it}, Y_{i,t-1}, \alpha_i) = \Lambda(\alpha_i + \mathbf{x}_{it}^T\boldsymbol{\beta} + \rho Y_{i,t-1})

Where ρ\rho captures structural state dependence (e.g., habit formation, switching costs).

The initial conditions problem (Heckman, 1981): The initial observation Yi1Y_{i1} is correlated with αi\alpha_i because it depends on the pre-sample history. Ignoring this causes inconsistency.

Wooldridge (2005) solution: Specify the distribution of the unobserved effect conditional on the initial condition and the covariate history:

$\alpha_i = \psi_0 + \psi_1 Y_{i1} + \bar{\mathbf{x}}_i^T\boldsymbol{\psi}_2 + v_i, \quad v_i \sim \mathcal{N}(0, \sigma_v^2)$

Estimation then proceeds as a standard random effects model over $t = 2, \dots, T$, with $Y_{i1}$ and the Mundlak-Chamberlain means $\bar{\mathbf{x}}_i$ entering as additional regressors.


15. Using the Discrete Choice Component

The Discrete Choice Models component in the DataStatPro application provides a comprehensive workflow for specification, estimation, testing, and visualisation of all major discrete choice model families.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should contain:

Step 2 — Select Model Family Choose the discrete choice model specification:

Step 3 — Select Variables Map the required variables from your dataset:

Step 4 — Specify Reference Categories For multinomial and conditional logit, set the reference alternative (default: first category in alphabetical order). For ordered models, verify the ordering of categories.

Step 5 — Configure Nesting Structure (Nested Logit) Assign each alternative to a nest:

Step 6 — Configure Random Parameters (Mixed Logit) For each covariate, specify whether the coefficient is:

Set the number of Halton draws (default: 500) and whether to use antithetic draws for variance reduction.

Step 7 — Configure Fixed Effects (Panel Models)

Step 8 — Configure Standard Errors

Step 9 — Configure Marginal Effects Select which partial effects to report:

Step 10 — Select Display Options Choose which outputs to display:

Step 11 — Run the Analysis Click "Run Discrete Choice Model". The application will:

  1. Validate data format and variable types; convert to appropriate structure if needed.
  2. Initialise parameters (using linear probability model or random starting values).
  3. Maximise the log-likelihood using Newton-Raphson / BFGS / IRLS.
  4. Compute variance-covariance matrix (information matrix or sandwich).
  5. Compute all selected marginal effects with delta method SEs.
  6. Run specified diagnostic tests (linktest, Brant, IIA Hausman).
  7. Generate all selected visualisations and tables.

16. Computational and Formula Details

16.1 Binary Logit MLE: Step-by-Step

Step 1: Initialise parameters

β(0)=0k×1(or OLS estimates as warm start)\boldsymbol{\beta}^{(0)} = \mathbf{0}_{k\times 1} \quad \text{(or OLS estimates as warm start)}

Step 2: Compute fitted probabilities

p^i(t)=Λ(xiTβ(t))=exiTβ(t)1+exiTβ(t)\hat{p}_i^{(t)} = \Lambda(\mathbf{x}_i^T\boldsymbol{\beta}^{(t)}) = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}^{(t)}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}^{(t)}}}

Step 3: Compute score and Hessian

s(t)=i=1n(Yip^i(t))xi=XT(yp^(t))\mathbf{s}^{(t)} = \sum_{i=1}^n (Y_i - \hat{p}_i^{(t)})\mathbf{x}_i = \mathbf{X}^T(\mathbf{y} - \hat{\mathbf{p}}^{(t)})

H(t)=i=1np^i(t)(1p^i(t))xixiT=XTW^(t)X\mathbf{H}^{(t)} = -\sum_{i=1}^n \hat{p}_i^{(t)}(1-\hat{p}_i^{(t)})\mathbf{x}_i\mathbf{x}_i^T = -\mathbf{X}^T\hat{\mathbf{W}}^{(t)}\mathbf{X}

Step 4: Newton-Raphson update

β(t+1)=β(t)+(XTW^(t)X)1XT(yp^(t))\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \left(\mathbf{X}^T\hat{\mathbf{W}}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{p}}^{(t)})

Step 5: Check convergence

β(t+1)β(t)2<εtol(default: εtol=108)\|\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)}\|_2 < \varepsilon_{tol} \quad \text{(default: } \varepsilon_{tol} = 10^{-8}\text{)}

Step 6: Compute variance-covariance matrix

V^(β^)=(XTW^X)1\hat{\mathbf{V}}(\hat{\boldsymbol{\beta}}) = \left(\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X}\right)^{-1}
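Steps 1-6 translate directly into a few lines of numpy. This is a sketch of the algorithm above on simulated data, not the application's internal implementation:

```python
import numpy as np

def logit_newton(X, y, tol=1e-8, max_iter=100):
    """Binary logit MLE by Newton-Raphson, following Steps 1-6."""
    beta = np.zeros(X.shape[1])                     # Step 1: initialise
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))         # Step 2: fitted probabilities
        W = p * (1 - p)                             # Step 3: IRLS weights
        score = X.T @ (y - p)
        H = (X * W[:, None]).T @ X                  # X' W X (negative of Hessian)
        step = np.linalg.solve(H, score)            # Step 4: Newton update
        beta = beta + step
        if np.linalg.norm(step) < tol:              # Step 5: convergence check
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1 - p)
    vcov = np.linalg.inv((X * W[:, None]).T @ X)    # Step 6: (X' W X)^{-1}
    return beta, vcov

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_hat, vcov = logit_newton(X, y)
```

Because the logit log-likelihood is globally concave, plain Newton-Raphson converges in a handful of iterations from $\boldsymbol{\beta}^{(0)} = \mathbf{0}$ on well-behaved data.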

16.2 Average Partial Effects: Full Computation

For binary logit, continuous covariate XkX_k:

APE^k=1ni=1np^i(1p^i)β^k\widehat{APE}_k = \frac{1}{n}\sum_{i=1}^n \hat{p}_i(1-\hat{p}_i)\hat{\beta}_k

Gradient for delta method SE:

APE^kβ^l=1ni=1np^i(1p^i)[1[l=k]+β^k(12p^i)xil]\frac{\partial \widehat{APE}_k}{\partial \hat{\beta}_l} = \frac{1}{n}\sum_{i=1}^n \hat{p}_i(1-\hat{p}_i)\left[\mathbf{1}[l=k] + \hat{\beta}_k(1-2\hat{p}_i)x_{il}\right]

SE^(APE^k)=gkTV^(β^)gk\widehat{SE}(\widehat{APE}_k) = \sqrt{\mathbf{g}_k^T\hat{\mathbf{V}}(\hat{\boldsymbol{\beta}})\mathbf{g}_k}

For binary logit, binary covariate DkD_k:

APE^k=1ni=1n[Λ(η^i+(1dik)β^k)Λ(η^idikβ^k)]\widehat{APE}_k = \frac{1}{n}\sum_{i=1}^n \left[\Lambda(\hat{\eta}_i + (1-d_{ik})\hat{\beta}_k) - \Lambda(\hat{\eta}_i - d_{ik}\hat{\beta}_k)\right]

Where η^i=xiTβ^\hat{\eta}_i = \mathbf{x}_i^T\hat{\boldsymbol{\beta}} is the fitted index and dikd_{ik} is the observed value of DkD_k for individual ii.
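A numpy sketch combining the APE and delta-method gradient formulas above; the coefficient vector and covariance matrix are illustrative placeholders for fitted-model output:

```python
import numpy as np

def ape_continuous(X, beta, vcov, k):
    """APE of continuous covariate k in a binary logit, with delta-method SE."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    lam = p * (1 - p)                           # logistic density at the index
    ape = np.mean(lam) * beta[k]
    K = X.shape[1]
    grad = np.empty(K)                          # gradient of APE w.r.t. each beta_l
    for l in range(K):
        ind = 1.0 if l == k else 0.0
        grad[l] = np.mean(lam * (ind + beta[k] * (1 - 2 * p) * X[:, l]))
    se = np.sqrt(grad @ vcov @ grad)
    return ape, se

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta = np.array([-0.3, 0.7])
vcov = np.array([[0.010, 0.001], [0.001, 0.015]])
ape, se = ape_continuous(X, beta, vcov, k=1)
```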

16.3 Multinomial Logit: Score and Hessian

For the MNL with JJ alternatives and reference j=1j=1:

Score for category jj (j2j \geq 2):

βj=i=1n(1[Yi=j]P^ij)xi\frac{\partial \ell}{\partial \boldsymbol{\beta}_j} = \sum_{i=1}^n \left(\mathbf{1}[Y_i=j] - \hat{P}_{ij}\right)\mathbf{x}_i

Hessian blocks:

2βjβjT=i=1nP^ij(1P^ij)xixiT\frac{\partial^2 \ell}{\partial \boldsymbol{\beta}_j\partial\boldsymbol{\beta}_j^T} = -\sum_{i=1}^n \hat{P}_{ij}(1-\hat{P}_{ij})\mathbf{x}_i\mathbf{x}_i^T

2βjβkT=i=1nP^ijP^ikxixiT(jk)\frac{\partial^2 \ell}{\partial \boldsymbol{\beta}_j\partial\boldsymbol{\beta}_k^T} = \sum_{i=1}^n \hat{P}_{ij}\hat{P}_{ik}\mathbf{x}_i\mathbf{x}_i^T \quad (j \neq k)

The full Hessian is block-structured and negative definite, ensuring global concavity of the MNL log-likelihood.

16.4 Ordered Logit: Score and Threshold Constraints

Score for β\boldsymbol{\beta}:

$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_{i=1}^n \sum_{j=1}^J \mathbf{1}[Y_i=j] \left[\frac{\lambda(\tau_{j-1} - \mathbf{x}_i^T\boldsymbol{\beta}) - \lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta})}{P(Y_i=j)}\right]\mathbf{x}_i$

Score for threshold τj\tau_j:

τj=i=1n[1[Yi=j]λ(τjxiTβ)P(Yi=j)1[Yi=j+1]λ(τjxiTβ)P(Yi=j+1)]\frac{\partial \ell}{\partial \tau_j} = \sum_{i=1}^n \left[\frac{\mathbf{1}[Y_i=j]\lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta})}{P(Y_i=j)} - \frac{\mathbf{1}[Y_i=j+1]\lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta})}{P(Y_i=j+1)}\right]

Thresholds are constrained to be strictly ordered. In practice, use the unconstrained re-parameterisation $\tau_j = \tau_1 + \sum_{l=2}^j e^{\delta_l}$, with each $\delta_l \in \mathbb{R}$ freely estimated; the exponentials are strictly positive, so $\tau_1 < \tau_2 < \dots < \tau_{J-1}$ holds automatically.

16.5 Nested Logit: Full Information MLE

The nested logit log-likelihood for individual ii choosing alternative jj^* in nest mm^*:

i=lnP(jm)+lnP(m)=Vijλmln(kBmeVik/λm)+λmIim+Wimln(leλlIil+Wil)\ell_i = \ln P(j^* \mid m^*) + \ln P(m^*) = \frac{V_{ij^*}}{\lambda_{m^*}} - \ln\left(\sum_{k \in B_{m^*}} e^{V_{ik}/\lambda_{m^*}}\right) + \lambda_{m^*} I_{im^*} + W_{im^*} - \ln\left(\sum_l e^{\lambda_l I_{il} + W_{il}}\right)

The gradient with respect to λm\lambda_m requires the chain rule through the inclusive value IimI_{im} and involves kBm(Vik/λm2)(PikmPikm2)\sum_{k \in B_m} (V_{ik}/\lambda_m^2)(P_{ik|m} - P_{ik|m}^2).

16.6 Conditional Fixed Effects Logit: Computation

For individual ii with Si=tYitS_i = \sum_t Y_{it} successes across TT periods, the conditional log-likelihood contribution is:

iCL=tYitxitTβln(dC(Si)etdtxitTβ)\ell_i^{CL} = \sum_t Y_{it}\mathbf{x}_{it}^T\boldsymbol{\beta} - \ln\left(\sum_{\mathbf{d} \in \mathcal{C}(S_i)} e^{\sum_t d_t\mathbf{x}_{it}^T\boldsymbol{\beta}}\right)

For T=2T=2 and Si=1S_i=1, this simplifies to:

iCL=Yi1(xi1xi2)Tβln(1+e(xi1xi2)Tβ)\ell_i^{CL} = Y_{i1}(\mathbf{x}_{i1} - \mathbf{x}_{i2})^T\boldsymbol{\beta} - \ln\left(1 + e^{(\mathbf{x}_{i1}-\mathbf{x}_{i2})^T\boldsymbol{\beta}}\right)

Which is equivalent to a standard logit with first-differenced covariates — the panel FE analogue of the first-differences estimator in linear models.

For T>2T > 2, the summation over C(Si)\mathcal{C}(S_i) grows combinatorially ((TSi)\binom{T}{S_i} terms) and is computed efficiently using the Breslow algorithm (analogous to the Cox partial likelihood).
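The $T=2$ equivalence is easy to check numerically: enumerating $\mathcal{C}(S_i)$ gives exactly the same log-likelihood contribution as the differenced logit. A small self-contained sketch:

```python
import numpy as np
from itertools import combinations

def cond_logit_ll_i(y, X, beta):
    """Chamberlain conditional log-likelihood contribution for one individual,
    enumerating all sequences d with sum(d) = sum(y)."""
    T, S = len(y), int(y.sum())
    num = y @ (X @ beta)
    denom = 0.0
    for ones in combinations(range(T), S):      # all binary sequences with S ones
        d = np.zeros(T)
        d[list(ones)] = 1.0
        denom += np.exp(d @ (X @ beta))
    return num - np.log(denom)

# T = 2, S_i = 1: equals a logit on first-differenced covariates
beta = np.array([0.5, -0.3])
X = np.array([[1.0, 2.0], [0.4, 1.1]])          # rows: t = 1, 2
y = np.array([1.0, 0.0])
ll_cond = cond_logit_ll_i(y, X, beta)
dx = (X[0] - X[1]) @ beta
ll_diff = dx - np.log(1 + np.exp(dx))           # differenced-logit contribution
```

The brute-force enumeration is only for illustration; as noted above, production code uses the Breslow-style recursion for larger $T$.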

16.7 Mixed Logit: Halton Sequences and Simulation

Halton sequence for prime base bb: Generate RR draws from the quasi-random sequence:

hr(b)=k=0Kak(r)b(k+1)h_r^{(b)} = \sum_{k=0}^K a_k(r) b^{-(k+1)}

Where rr in base bb is r=kak(r)bkr = \sum_k a_k(r) b^k. Halton sequences for different primes {b1,b2,,bK}\{b_1, b_2, \dots, b_K\} are used for different dimensions of integration.

Simulated log-likelihood:

~(μ,Σ)=i=1nln(1Rr=1RexijiTβ(r)k=1JexikTβ(r))\tilde{\ell}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{i=1}^n \ln\left(\frac{1}{R}\sum_{r=1}^R \frac{e^{\mathbf{x}_{ij_i}^T\boldsymbol{\beta}^{(r)}}}{\sum_{k=1}^J e^{\mathbf{x}_{ik}^T\boldsymbol{\beta}^{(r)}}}\right)

Where β(r)=μ+Lh(r)\boldsymbol{\beta}^{(r)} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{h}^{(r)} and h(r)\boldsymbol{h}^{(r)} are Halton draws transformed to standard normal variates via Φ1(hr(b))\Phi^{-1}(h_r^{(b)}).

Antithetic draws: For each draw h(r)\boldsymbol{h}^{(r)}, include its mirror h(r)-\boldsymbol{h}^{(r)} to reduce simulation variance.
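A sketch of the radical-inverse construction above, followed by the transform to standard normal variates and antithetic mirroring (self-contained for illustration, not the application's generator):

```python
import numpy as np
from scipy.stats import norm

def halton(n, base):
    """First n Halton draws for a prime base via the radical-inverse formula."""
    seq = np.empty(n)
    for r in range(1, n + 1):
        h, f, rr = 0.0, 1.0 / base, r
        while rr > 0:
            rr, a = divmod(rr, base)            # digits of r in the given base
            h += a * f
            f /= base
        seq[r - 1] = h
    return seq

draws = halton(1000, base=2)                    # quasi-random points in (0, 1)
normal_draws = norm.ppf(draws)                  # transform to N(0,1) variates
antithetic = np.concatenate([normal_draws, -normal_draws])
```

In multiple dimensions, one sequence per prime base is used; discarding the first few draws and skipping large primes avoids the well-known correlation problems of high-dimensional Halton sequences.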

16.8 WTP Computation and Krinsky-Robb Confidence Intervals

Point estimate:

WTP^k=γ^kγ^cost\widehat{WTP}_k = -\frac{\hat{\gamma}_k}{\hat{\gamma}_{cost}}

Delta method SE:

SE^(WTPk)=1γ^costVar(γ^k)+WTPk2Var(γ^cost)+2WTPkCov(γ^k,γ^cost)\widehat{SE}(WTP_k) = \frac{1}{|\hat{\gamma}_{cost}|}\sqrt{Var(\hat{\gamma}_k) + WTP_k^2 \cdot Var(\hat{\gamma}_{cost}) + 2 WTP_k \cdot Cov(\hat{\gamma}_k, \hat{\gamma}_{cost})}

Krinsky-Robb (1986) simulation:

  1. Draw R=10,000R = 10{,}000 parameter vectors from N(γ^,V^(γ^))\mathcal{N}(\hat{\boldsymbol{\gamma}}, \hat{\mathbf{V}}(\hat{\boldsymbol{\gamma}})).
  2. Compute WTPk(r)=γk(r)/γcost(r)WTP_k^{(r)} = -\gamma_k^{(r)}/\gamma_{cost}^{(r)} for each draw.
  3. Report the 2.5th and 97.5th percentiles of {WTPk(r)}\{WTP_k^{(r)}\} as the 95% CI.

The Krinsky-Robb CI is preferred over the delta method when γ^cost\hat{\gamma}_{cost} is close to zero (since the ratio is highly non-linear near zero).
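A numpy sketch of the Krinsky-Robb procedure; the coefficient estimates and covariance matrix below are illustrative, not output from any fitted model:

```python
import numpy as np

def krinsky_robb_wtp(gamma_hat, vcov, k, cost, R=10_000, seed=0):
    """Krinsky-Robb 95% CI for WTP_k = -gamma_k / gamma_cost."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(gamma_hat, vcov, size=R)   # parameter draws
    wtp = -draws[:, k] / draws[:, cost]                        # WTP per draw
    return np.percentile(wtp, [2.5, 97.5])

gamma_hat = np.array([0.8, -0.04])              # [attribute, cost] coefficients
vcov = np.array([[0.010, 0.0],
                 [0.0, 0.00004]])
lo, hi = krinsky_robb_wtp(gamma_hat, vcov, k=0, cost=1)
point = -gamma_hat[0] / gamma_hat[1]            # point estimate of WTP
```

Note that the simulated CI is generally asymmetric around the point estimate, which is exactly the behaviour the delta method cannot capture when the cost coefficient is imprecisely estimated.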


17. Worked Examples

Example 1: Binary Logit — Probability of Health Insurance Take-Up

Research Question: What factors predict whether an individual aged 19–64 has health insurance coverage?

Data: Cross-sectional survey of n=8,246n = 8{,}246 working-age adults; outcome: Yi=1Y_i = 1 if insured, 00 if uninsured.

Model: $\ln\frac{P(\text{insured})}{1-P(\text{insured})} = \beta_0 + \beta_1\text{Income}_i + \beta_2\text{College}_i + \beta_3\text{Age}_i + \beta_4\text{Employed}_i$

(No additive error term appears in the log-odds: randomness enters through the Bernoulli outcome, not a residual.)

Step 1: Results Table

| Variable | $\hat{\beta}$ | SE | $z$ | $p$ | $e^{\hat{\beta}}$ (OR) | APE |
|---|---|---|---|---|---|---|
| Intercept | -3.412 | 0.241 | -14.15 | <0.001 | | |
| Income (per $10k USD) | 0.318 | 0.031 | 10.26 | <0.001 | 1.374 | +0.048 |
| College degree | 0.841 | 0.094 | 8.95 | <0.001 | 2.318 | +0.127 |
| Age (years) | 0.042 | 0.007 | 6.00 | <0.001 | 1.043 | +0.006 |
| Employed full-time | 1.283 | 0.108 | 11.88 | <0.001 | 3.607 | +0.193 |

APEs are on the probability scale (e.g., +0.048 corresponds to 4.8 percentage points per $10k of income).

(β^)=3,841.2\ell(\hat{\boldsymbol{\beta}}) = -3{,}841.2, 0=4,512.7\ell_0 = -4{,}512.7, McFadden ρ2=0.149\rho^2 = 0.149. AUC = 0.813, Hosmer-Lemeshow χ82=9.41\chi^2_{8} = 9.41 (p=0.308p = 0.308, good calibration).

Step 2: Interpretation

Step 3: Predicted Probability Profiles

| Income | College | Age | Employed | $\hat{P}(\text{insured})$ |
|---|---|---|---|---|
| $25k | No | 30 | No | 0.312 |
| $50k | No | 40 | Yes | 0.741 |
| $75k | Yes | 50 | Yes | 0.941 |
| $25k | Yes | 30 | No | 0.507 |

Step 4: Model Diagnostics

AUC = 0.813 indicates good discrimination, and the Hosmer-Lemeshow test ($p = 0.308$) gives no evidence of miscalibration. No coefficient or standard error is implausibly large, so complete separation is not a concern here.


Example 2: Multinomial Logit — Occupational Choice

Research Question: What individual characteristics predict whether a worker is employed in (1) Professional/Managerial, (2) Technical/Clerical, or (3) Service/Manual occupations?

Data: $n = 3{,}812$ workers; reference category: Service/Manual (category 3).

Step 1: MNL Coefficient Table (Reference: Service/Manual)

| Variable | Professional (vs. Service) $\hat{\beta}$ | SE | Technical (vs. Service) $\hat{\beta}$ | SE |
|---|---|---|---|---|
| Intercept | -2.841 | 0.312 | -1.523 | 0.241 |
| Education (years) | 0.412 | 0.041 | 0.218 | 0.033 |
| Experience (years) | 0.083 | 0.018 | 0.061 | 0.015 |
| Female | -0.391 | 0.112 | 0.284 | 0.098 |
| Urban | 0.521 | 0.134 | 0.312 | 0.118 |

$\ell(\hat{\boldsymbol{\beta}}) = -3{,}412.8$, McFadden $\rho^2 = 0.187$; LR $\chi^2_{8} = 1{,}578.4$ ($p < 0.001$).
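For intuition, MNL choice probabilities are a softmax over category-specific linear indices, with the reference category's coefficients normalised to zero. A minimal numpy sketch (the coefficients below are placeholders for illustration, not the fitted values above):

```python
import numpy as np

def mnl_probs(X, betas):
    """MNL choice probabilities.

    X     : (n, k) covariate matrix (including a constant)
    betas : (J, k) coefficients, with the reference row set to zeros
    """
    V = X @ betas.T                      # (n, J) linear indices
    V -= V.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    expV = np.exp(V)
    return expV / expV.sum(axis=1, keepdims=True)

# Placeholder coefficients: 3 categories, reference (row of zeros) last,
# covariates = [constant, years of education]
betas = np.array([[-2.0, 0.30],   # category 1 vs. reference
                  [-1.0, 0.15],   # category 2 vs. reference
                  [ 0.0, 0.00]])  # reference category
X = np.array([[1.0, 12.0],
              [1.0, 16.0]])
P = mnl_probs(X, betas)
```

Each row of `P` sums to one, and raising education shifts probability toward the category with the larger education coefficient.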

Step 2: Average Partial Effects on Category Probabilities

| Variable | $\Delta\hat{P}$: Professional | $\Delta\hat{P}$: Technical | $\Delta\hat{P}$: Service |
|---|---|---|---|
| Education (+1 year) | +0.041 | +0.009 | -0.050 |
| Experience (+1 year) | +0.007 | +0.003 | -0.010 |
| Female | -0.048 | +0.062 | -0.014 |
| Urban | +0.056 | +0.021 | -0.077 |

Note that effects sum to zero across categories (probability constraint). Being female reduces the probability of professional occupation by 4.8 pp but increases the probability of technical occupation by 6.2 pp.

Step 3: IIA Test

Hausman-McFadden test excluding "Technical" category: $\chi^2_{4} = 3.21$, $p = 0.523$ → IIA not rejected. Excluding "Professional": $\chi^2_{4} = 4.87$, $p = 0.301$ → IIA not rejected. The MNL is appropriate for this application.

Step 4: Predicted Category Probabilities for Representative Profiles

| Profile | Professional | Technical | Service |
|---|---|---|---|
| 12 yrs education, 5 yrs exp., male, rural | 0.214 | 0.281 | 0.505 |
| 16 yrs education, 10 yrs exp., female, urban | 0.412 | 0.394 | 0.194 |
| 18 yrs education, 20 yrs exp., male, urban | 0.631 | 0.248 | 0.121 |

Example 3: Ordered Logit — Customer Satisfaction

Research Question: What factors predict customer satisfaction with a bank, rated on a 5-point scale (1 = Very Dissatisfied, ..., 5 = Very Satisfied)?

Data: $n = 4{,}521$ bank customers; outcome: satisfaction rating $Y_i \in \{1,2,3,4,5\}$.

Step 1: Ordered Logit Results

| Variable | $\hat{\beta}$ | SE | $z$ | $p$ |
|---|---|---|---|---|
| Account Tenure (years) | 0.182 | 0.023 | 7.91 | <0.001 |
| Branch Wait Time (minutes) | -0.241 | 0.038 | -6.34 | <0.001 |
| Mobile App User | 0.612 | 0.084 | 7.29 | <0.001 |
| Complaint (last 12 mo.) | -1.143 | 0.112 | -10.21 | <0.001 |
| Premium Account | 0.831 | 0.098 | 8.48 | <0.001 |

Estimated Thresholds:

| Threshold | Estimate | SE |
|---|---|---|
| $\hat{\tau}_1$ (1\|2) | -3.412 | 0.181 |
| $\hat{\tau}_2$ (2\|3) | -1.841 | 0.143 |
| $\hat{\tau}_3$ (3\|4) | 0.321 | 0.121 |
| $\hat{\tau}_4$ (4\|5) | 2.184 | 0.152 |
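The category probabilities implied by these thresholds follow the difference-of-CDFs formula $P(Y=j) = \Lambda(\tau_j - \mathbf{x}^T\boldsymbol{\beta}) - \Lambda(\tau_{j-1} - \mathbf{x}^T\boldsymbol{\beta})$. A small numpy sketch using the estimated thresholds above (the two linear-index values are hypothetical, chosen only to contrast a low- and a high-index customer):

```python
import numpy as np

def ordered_logit_probs(xb, cuts):
    """Category probabilities for an ordered logit with J = len(cuts)+1 levels."""
    lam = lambda z: 1.0 / (1.0 + np.exp(-z))        # logistic CDF
    # CDF evaluated at each threshold, padded with 0 and 1 at the extremes
    cdf = np.concatenate(([0.0], lam(np.asarray(cuts) - xb), [1.0]))
    return np.diff(cdf)                             # P(Y=1), ..., P(Y=J)

cuts = [-3.412, -1.841, 0.321, 2.184]               # thresholds from the table
p_low = ordered_logit_probs(xb=-1.0, cuts=cuts)     # hypothetical low index
p_high = ordered_logit_probs(xb=2.0, cuts=cuts)     # hypothetical high index
```

Both probability vectors sum to one, and the customer with the higher linear index puts more mass on the top ("Very Satisfied") category.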

Step 2: Brant Test for Proportional Odds

| Variable | $\chi^2_3$ | $p$ | PO Violated? |
|---|---|---|---|
| Account Tenure | 2.14 | 0.543 | No |
| Branch Wait Time | 3.91 | 0.271 | No |
| Mobile App User | 4.21 | 0.240 | No |
| Complaint | 18.41 | <0.001 | Yes |
| Premium Account | 3.12 | 0.373 | No |
| Global test | 32.18 | 0.009 | Yes |

The complaint variable violates proportional odds → estimate a Generalised Ordered Logit allowing the complaint coefficient to vary across thresholds.

Step 3: Average Partial Effects on $P(Y = 5)$ (Very Satisfied)

| Variable | APE | SE | $p$ |
|---|---|---|---|
| Tenure (+1 year) | +0.024 | 0.003 | <0.001 |
| Wait Time (+1 min) | -0.031 | 0.005 | <0.001 |
| Mobile App User | +0.082 | 0.011 | <0.001 |
| Complaint (yes vs. no) | -0.183 | 0.018 | <0.001 |
| Premium Account | +0.111 | 0.013 | <0.001 |

(APEs are on the probability scale: -0.183 = -18.3 percentage points.)

Having a complaint in the last 12 months reduces the probability of being Very Satisfied by 18.3 pp — by far the largest effect.


Example 4: Mixed Logit — Transportation Mode Choice

Research Question: How do travellers' preferences for cost, time, and comfort vary across individuals when choosing among Car, Bus, Train, and Bicycle?

Data: Stated preference survey; $n = 2{,}412$ respondents, each evaluating 8 hypothetical choice scenarios ($N = 19{,}296$ choice situations, stacked in long format with one row per alternative per situation); $J = 4$ alternatives with attributes: cost ($), travel time (min.), comfort rating (1-5).

Step 1: Mixed Logit Specification

| Attribute | Distribution |
|---|---|
| Cost ($) | Fixed (negative) |
| Travel Time (min.) | Normal: $\mathcal{N}(\mu_t, \sigma_t^2)$ |
| Comfort Rating | Normal: $\mathcal{N}(\mu_c, \sigma_c^2)$ |
| ASC: Car | Fixed |
| ASC: Train | Fixed |
| ASC: Bus | Fixed |
| ASC: Bicycle | Reference (normalised to zero) |

$R = 500$ Halton draws were used.
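Halton draws are built from the radical-inverse (van der Corput) sequence in coprime bases, one base per random coefficient; the uniforms are then mapped through $\Phi^{-1}$ to obtain normal draws. Production packages add scrambling and a burn-in, which this minimal sketch omits:

```python
import numpy as np

def halton(n, base):
    """First n points of the van der Corput sequence in a given base."""
    seq = np.zeros(n)
    for i in range(n):
        f, k = 1.0, i + 1          # start at index 1 to skip the degenerate 0
        while k > 0:
            f /= base
            seq[i] += f * (k % base)   # next digit of k in this base
            k //= base
    return seq

# Two-dimensional quasi-random draws on (0,1)^2 using coprime bases 2 and 3
u = np.column_stack([halton(500, 2), halton(500, 3)])
```

The base-2 sequence begins 0.5, 0.25, 0.75, 0.125, ..., filling the unit interval far more evenly than pseudo-random uniforms, which is why a few hundred Halton draws can match thousands of plain Monte Carlo draws in simulated likelihood work.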

Step 2: Results

| Parameter | Estimate | SE | $z$ | $p$ |
|---|---|---|---|---|
| Cost ($\hat{\gamma}_{cost}$) | -0.0412 | 0.006 | -6.87 | <0.001 |
| Time mean ($\hat{\mu}_t$) | -0.0841 | 0.012 | -7.01 | <0.001 |
| Time SD ($\hat{\sigma}_t$) | 0.0412 | 0.008 | 5.15 | <0.001 |
| Comfort mean ($\hat{\mu}_c$) | 0.3121 | 0.041 | 7.61 | <0.001 |
| Comfort SD ($\hat{\sigma}_c$) | 0.1843 | 0.029 | 6.35 | <0.001 |
| ASC: Car | 1.241 | 0.182 | 6.82 | <0.001 |
| ASC: Train | 0.814 | 0.151 | 5.39 | <0.001 |
| ASC: Bus | -0.312 | 0.141 | -2.21 | 0.027 |

Simulated $\ell = -18{,}412.3$; McFadden $\rho^2 = 0.341$ (vs. $\rho^2 = 0.298$ for standard MNL).

Step 3: WTP Calculations (Krinsky-Robb 95% CI)

| Attribute | WTP Estimate | 95% CI |
|---|---|---|
| Travel time (per minute saved) | $2.04/min | [$1.61, $2.51] |
| Comfort (per unit increase) | $7.58/unit | [$5.91, $9.31] |

Travellers are willing to pay $2.04 per minute of travel time savings — a Value of Travel Time (VTT) estimate consistent with the transport economics literature.

Step 4: Preference Heterogeneity

The significant $\hat{\sigma}_t = 0.0412$ (Time SD) indicates substantial preference heterogeneity: 95% of the population has a time coefficient in the range $-0.0841 \pm 1.96 \times 0.0412 = [-0.165, -0.003]$ (all negative, i.e., essentially everyone dislikes travel time). In contrast, comfort has $\hat{\sigma}_c = 0.1843$, implying that some travellers actually have negative comfort coefficients, possibly capturing high-income travellers valuing solitude.
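The population shares behind these statements follow directly from the normal mixing distributions: the share with a negative coefficient is $\Phi((0-\hat{\mu})/\hat{\sigma})$. A quick standard-library check, using the estimates from the results table above:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function (no scipy needed)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Estimated normal mixing distributions from the mixed logit results above
mu_t, sd_t = -0.0841, 0.0412   # travel-time coefficient
mu_c, sd_c = 0.3121, 0.1843    # comfort coefficient

# Share of the population whose coefficient is negative: Phi((0 - mu) / sd)
share_time_negative = norm_cdf((0.0 - mu_t) / sd_t)
share_comfort_negative = norm_cdf((0.0 - mu_c) / sd_c)
```

This gives roughly 98% of travellers with a negative time coefficient and about 4.5% with a negative comfort coefficient, matching the qualitative reading above.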


Example 5: Conditional Fixed Effects Logit — Panel Adoption Decision

Research Question: Does a reduction in technology cost (logged) increase the probability that a firm adopts a new production technology, controlling for all time-invariant firm characteristics?

Data: Annual panel of $n = 1{,}421$ manufacturing firms, $T = 8$ years; outcome: $Y_{it} = 1$ if firm adopts the technology in year $t$; 214 firms (15%) adopt during the panel.

Model: Conditional fixed effects logit (Chamberlain), conditioning on $\sum_t Y_{it}$.

| Variable | $\hat{\beta}$ | SE | $z$ | $p$ | APE |
|---|---|---|---|---|---|
| Log(Technology Cost) | -0.841 | 0.121 | -6.95 | <0.001 | -0.062 |
| Government Subsidy ($) | 0.0312 | 0.008 | 3.90 | <0.001 | +0.023 |
| Competitor Adoption Rate | 1.412 | 0.218 | 6.48 | <0.001 | +0.104 |
| Time Trend | 0.184 | 0.041 | 4.49 | <0.001 | +0.014 |

(APEs are on the probability scale.)

Number of firms contributing information: 214 (firms adopting at least once). Firms never adopting: 1,207 (dropped by the conditioning). $\ell^{CL} = -1{,}241.8$.

Interpretation: A 10% increase in technology cost reduces the probability of adoption by approximately $0.062 \times \ln(1.10) \approx 0.006$, i.e. 0.6 pp per year, controlling for all time-invariant firm heterogeneity. Competitive pressure (competitor adoption rate) has the largest effect: a 10 pp increase in competitor adoption rates raises a firm's own adoption probability by roughly 1.0 pp.
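Why conditioning on $\sum_t Y_{it}$ removes the fixed effect can be verified numerically: for $T = 2$ and exactly one adoption, the conditional probability reduces to a logit on the within-firm difference $(x_{i2} - x_{i1})\beta$, whatever the value of $\alpha_i$. A sketch with hypothetical parameter values:

```python
import numpy as np

def cond_prob_switch(alpha, beta, x1, x2):
    """P(Y1=0, Y2=1 | Y1+Y2=1) under a logit with individual effect alpha."""
    p1 = 1 / (1 + np.exp(-(alpha + beta * x1)))
    p2 = 1 / (1 + np.exp(-(alpha + beta * x2)))
    num = (1 - p1) * p2                       # prob. of the (0, 1) sequence
    return num / (num + p1 * (1 - p2))        # conditional on exactly one Y=1

# Hypothetical values: log cost falls between the two periods
beta, x1, x2 = -0.8, 1.0, 0.4
direct = 1 / (1 + np.exp(-beta * (x2 - x1)))  # logit on the within difference
# The conditional probability is identical for any fixed effect alpha_i
probs = [cond_prob_switch(a, beta, x1, x2) for a in (-3.0, 0.0, 3.0)]
```

All three conditional probabilities coincide with the difference-logit value, illustrating that $\alpha_i$ cancels from the conditional likelihood.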


18. Common Mistakes and How to Avoid Them

Mistake 1: Interpreting Raw Logit/Probit Coefficients as Marginal Effects

Problem: Reporting the raw $\hat{\beta}$ from a logit regression as "a one-unit increase in $X$ increases the probability of $Y=1$ by $\hat{\beta}$." This is only correct for the Linear Probability Model. In logit and probit, $\hat{\beta}$ is the change in the log-odds (logit) or the latent index (probit), not the probability.
Solution: Always compute and report Average Partial Effects (APE) using the delta method for standard errors. For communication to non-technical audiences, report predicted probabilities for representative covariate profiles.
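The logit APE formula, $APE_k = n^{-1}\sum_i \hat{p}_i(1-\hat{p}_i)\hat{\beta}_k$, is a one-liner; the probit version swaps in the normal pdf. The sketch below uses hypothetical coefficients and covariates and shows that the APE is well below the raw coefficient:

```python
import numpy as np

def logit_ape(beta, X):
    """Average Partial Effects for a logit: mean_i p_i(1-p_i) * beta_k."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return np.mean(p * (1.0 - p)) * beta

# Hypothetical fitted coefficients: [intercept, slope on x1]
beta = np.array([0.0, 0.5])
X = np.column_stack([np.ones(1000), np.linspace(-2, 2, 1000)])
ape = logit_ape(beta, X)   # ape[1] is the APE of x1 on P(Y=1)
```

Here the scaling factor $\bar{p(1-p)}$ is at most 0.25, so the APE of `x1` is less than a quarter of the raw coefficient: exactly the gap that Mistake 1 warns about.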

Mistake 2: Applying Multinomial Logit When IIA is Violated

Problem: Using the MNL for alternatives that are close substitutes (e.g., different bus routes, similar brand variants), leading to unrealistic cross-substitution patterns predicted by the model.
Solution: Test IIA with the Hausman-McFadden or Small-Hsiao test. If IIA is suspect based on subject-matter knowledge (similar alternatives exist), use Nested Logit (if the nesting structure is clear) or Mixed Logit (for flexible substitution patterns). Report robustness across model specifications.

Mistake 3: Ignoring the Proportional Odds Assumption in Ordered Logit

Problem: Estimating an ordered logit without testing the proportional odds assumption, and reporting a single coefficient for each variable as if it applies uniformly across all thresholds. When the assumption is violated, the estimated coefficient is an unreliable average.
Solution: Always run the Brant test (global and variable-specific). If violated for one or more variables, use the Generalised Ordered Logit (partial proportional odds) or report category-specific marginal effects. Never report ordered logit results without proportional odds diagnostics.

Mistake 4: Using Standard Fixed Effects Logit Instead of Conditional Logit for Panel Data

Problem: Estimating a logit with individual dummy variables (LSDV approach) for panel data. Due to the incidental parameters problem, $\hat{\boldsymbol{\beta}}$ is inconsistent for fixed $T$. With $T = 2$, the bias is approximately 100%; with $T = 5$, roughly 20%.
Solution: Use Chamberlain's conditional fixed effects logit for panel binary outcomes (Stata: xtlogit, fe; R: clogit). For probit, use the Mundlak-Chamberlain correlated random effects approach. Remember that identification comes from within-individual variation only.

Mistake 5: Reporting Odds Ratios as Relative Risks (Risk Ratios)

Problem: Interpreting $e^{\hat{\beta}} = 2.0$ as "twice as likely." This is the odds ratio, not the relative risk (risk ratio). For common outcomes ($P > 10\%$), the odds ratio substantially overestimates the relative risk.
Solution: Be explicit about reporting odds ratios (from logit) vs. relative risks. For common outcomes, report Average Partial Effects (absolute probability changes) which are clearer. If relative risk is needed, use Poisson regression with a log link or compute predicted probability ratios directly.

Mistake 6: Using Only In-Sample Fit Statistics for Model Selection

Problem: Selecting a model (e.g., choosing logit over probit, or choosing a particular set of covariates) based solely on in-sample pseudo-$R^2$ or log-likelihood, without accounting for overfitting.
Solution: Use AIC/BIC for comparing models with different covariate sets. For predictive models, use out-of-sample AUC or Brier score from cross-validation. Always check calibration via Hosmer-Lemeshow. Distinguish between models for prediction vs. structural inference.

Mistake 7: Not Checking for Complete Separation

Problem: Running logit/probit on small samples or with many binary predictors without checking for complete separation. The MLE does not exist, but many software packages produce output with extremely large (meaningless) coefficients and standard errors without warning the user.
Solution: Check for separation before relying on MLE estimates. Warning signs: coefficients $|\hat{\beta}| > 10$, SEs $> 5$, predicted probabilities exactly at 0 or 1. Use Firth penalised logit or Bayesian logit (weakly informative priors) as robust alternatives to standard MLE in small or sparse samples.
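A quick single-variable screen for separation can be run before estimation. Note that it only detects separation on one predictor at a time; multivariate separation still requires watching for exploding coefficients (or a linear-programming check). The data below are toy values:

```python
import numpy as np

def separating_predictors(X, y, names):
    """Flag predictors whose ranges for y=0 and y=1 do not overlap,
    i.e. complete or quasi-complete separation on a single variable."""
    flagged = []
    for j, name in enumerate(names):
        x0, x1 = X[y == 0, j], X[y == 1, j]
        if x0.max() <= x1.min() or x1.max() <= x0.min():
            flagged.append(name)
    return flagged

# Toy data: x1 separates the outcome perfectly, x2 overlaps across classes
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([
    np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),   # separating
    np.array([2.0, 5.0, 1.0, 4.0, 0.0, 3.0]),   # overlapping
])
flags = separating_predictors(X, y, ["x1", "x2"])
```

Running MLE with `x1` included would send its coefficient toward infinity, which is exactly the exploding-coefficient symptom described above.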

Mistake 8: Including Irrelevant Alternatives in the Choice Set

Problem: Defining the choice set too broadly (e.g., including alternatives that are not actually available to the decision-maker) or too narrowly (excluding relevant alternatives). Both distort the estimated choice probabilities.
Solution: Carefully define the choice set based on availability. For alternative-specific choice sets (where different individuals face different options), specify the availability matrix in the model. Report the sensitivity of results to alternative choice set definitions.

Mistake 9: Failing to Account for Preference Heterogeneity

Problem: Estimating a standard MNL or conditional logit that assumes homogeneous preferences across all individuals, missing important heterogeneity in price sensitivity, taste, or value of time. This leads to biased substitution patterns and misleading policy simulations.
Solution: Test for heterogeneity by including interaction terms with demographic variables. For more flexible heterogeneity, estimate a Mixed Logit with normally distributed random coefficients. Report the distribution of individual-level preferences, not just the mean.

Mistake 10: Using the Wrong Data Format for Conditional Logit

Problem: Estimating a conditional logit with individual-specific data in wide format (one row per person, multiple columns for different alternatives' attributes). This causes data errors and incorrect likelihood contributions.
Solution: Convert data to long format: one row per alternative per individual. The dataset should have $n \times J$ rows ($n$ individuals, $J$ alternatives). Verify the choice indicator is coded as 1 for the chosen alternative and 0 for all others, within each individual's choice set.
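With pandas the wide-to-long conversion is a single call; the column names and values below are hypothetical:

```python
import pandas as pd

# Wide format: one row per person, alternative attributes in separate columns
wide = pd.DataFrame({
    "id": [1, 2],
    "chosen": ["car", "bus"],                     # chosen alternative
    "cost_car": [4.0, 3.5], "cost_bus": [2.0, 1.8],
    "time_car": [20, 25],   "time_bus": [40, 35],
})

# Long format: one row per (person, alternative) pair -> n * J rows
long = (pd.wide_to_long(wide, stubnames=["cost", "time"],
                        i="id", j="alt", sep="_", suffix=r"\w+")
          .reset_index())
# Choice indicator: 1 for the chosen alternative, 0 otherwise
long["y"] = (long["alt"] == long["chosen"]).astype(int)
```

The result has $n \times J = 4$ rows, and each individual's `y` sums to exactly one, which is the sanity check worth running before any conditional logit estimation.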


19. Troubleshooting

| Issue | Likely Cause | Solution |
|---|---|---|
| MLE does not converge | Poor starting values; very flat likelihood; complete separation | Use OLS/LPM as starting values; rescale variables; check for separation; try Firth logit |
| Extremely large coefficients or SEs ($> 5$) | Complete or quasi-complete separation; collinearity | Check pairwise correlations; VIF analysis; merge categories; use penalised estimation (Firth) |
| Hessian not negative definite at convergence | Local optimum; non-concave model extension | Try multiple starting values; use a different optimizer (BFGS vs. Newton-Raphson); check model specification |
| Predicted probabilities exactly 0 or 1 | Complete separation; extreme covariate values | Identify the separating combination; drop/transform the variable; use Firth logit; check for data errors |
| APE has wrong sign compared to coefficient | Cross-effects in MNL; non-linear interaction effects | In MNL the APE can have the opposite sign to the log-odds coefficient; this is expected, so report both |
| IIA Hausman test gives negative statistic | Small sample; numerical imprecision in Hessian estimation | Use the Small-Hsiao test instead; check that the restricted model is nested in the full model; try more alternatives |
| Brant test significant (proportional odds violated) | Heterogeneous covariate effects across thresholds | Fit Generalised Ordered Logit or Multinomial Logit; report variable-specific Brant results to identify culprits |
| Mixed logit does not converge | Too many random parameters; too few draws; poor scaling | Increase draws (to 1000+); scale attributes to similar magnitude; treat some parameters as fixed; simplify the model |
| Conditional logit: no observations after conditioning | All individuals have $S_i = 0$ or $S_i = T$ | Verify panel structure; ensure within-individual variation in $Y_{it}$; check treatment coding |
| WTP confidence interval is extremely wide or includes infinity | Cost coefficient close to zero; poor precision | Report Krinsky-Robb CI instead of delta method; increase sample size; consider fixing the cost coefficient |
| Hosmer-Lemeshow test rejects calibration | Model does not predict outcome rates accurately in some regions | Inspect the calibration plot decile by decile; add polynomial terms for continuous variables; check for important omitted variables |
| AUC is high but calibration is poor | Model discriminates well but predicted probabilities are poorly scaled | Apply Platt scaling or isotonic regression to recalibrate the predicted probabilities |
| Panel random effects probit: very slow convergence | Many quadrature points needed; complex likelihood surface | Reduce quadrature points (12-20 are usually sufficient); use adaptive quadrature; use the Mundlak-Chamberlain approach with standard probit |
| Nested logit: $\hat{\lambda}_m > 1$ | Incorrect nesting structure; misspecified model | The model is not RUM-consistent; rethink the nesting structure; try Mixed Logit as an alternative |
| MNL: predictions dominated by one category | Class imbalance; misspecified alternative | Check class proportions; verify the reference category; consider alternative-specific constants |
| Interaction terms insignificant despite theoretical expectation | Insufficient statistical power; multicollinearity | Check VIF for the interaction; report effect sizes with CIs regardless of significance; consider a power analysis |

20. Quick Reference Cheat Sheet

Core Probability Formulas

| Model | $P(Y_i = 1 \mid \mathbf{x}_i)$ | Link Function |
|---|---|---|
| Logit | $\Lambda(\mathbf{x}_i^T\boldsymbol{\beta}) = \dfrac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1+e^{\mathbf{x}_i^T\boldsymbol{\beta}}}$ | Logit: $\ln[p/(1-p)]$ |
| Probit | $\Phi(\mathbf{x}_i^T\boldsymbol{\beta})$ | Probit: $\Phi^{-1}(p)$ |
| LPM | $\mathbf{x}_i^T\boldsymbol{\beta}$ | Identity |
| MNL | $\dfrac{e^{\mathbf{x}_i^T\boldsymbol{\beta}_j}}{\sum_k e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}}$ | Log relative odds |
| Ordered Logit | $\Lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta}) - \Lambda(\tau_{j-1} - \mathbf{x}_i^T\boldsymbol{\beta})$ | Proportional odds |

Key Formulas

| Formula | Description |
|---|---|
| $\ell(\boldsymbol{\beta}) = \sum_i [Y_i\ln P_i + (1-Y_i)\ln(1-P_i)]$ | Binary logit/probit log-likelihood |
| $APE_k = n^{-1}\sum_i \partial P_i / \partial X_{ik}$ | Average Partial Effect (continuous $X_k$) |
| $APE_k^{logit} = n^{-1}\sum_i \hat{p}_i(1-\hat{p}_i)\hat{\beta}_k$ | APE for logit |
| $APE_k^{probit} = n^{-1}\sum_i \phi(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})\hat{\beta}_k$ | APE for probit |
| $OR_k = e^{\hat{\beta}_k}$ | Odds ratio from logit |
| $WTP_k = -\hat{\gamma}_k / \hat{\gamma}_{cost}$ | Willingness to pay |
| $\rho^2 = 1 - \ell(\hat{\boldsymbol{\beta}})/\ell_0$ | McFadden's pseudo-$R^2$ |
| $LR = -2(\ell_{restricted} - \ell_{unrestricted}) \sim \chi^2_q$ | Likelihood Ratio test |
| $P(Y_i = j \mid \text{nest } m) = e^{V_{ij}/\lambda_m}/\sum_{k\in B_m}e^{V_{ik}/\lambda_m}$ | Nested logit conditional probability |
| $\hat{\boldsymbol{\beta}}^{(t+1)} = \hat{\boldsymbol{\beta}}^{(t)} + (\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\hat{\mathbf{p}})$ | Newton-Raphson / IRLS update |
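The Newton-Raphson / IRLS update can be iterated to convergence in a few lines of numpy; the data below are simulated purely for illustration:

```python
import numpy as np

def logit_irls(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson / IRLS for binary logit, per the update formula above."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)                       # diagonal of W-hat
        grad = X.T @ (y - p)                    # score X'(y - p)
        H = X.T @ (X * W[:, None])              # X' W X
        step = np.linalg.solve(H, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # converged
            break
    return beta

# Simulated data with known coefficients (illustration only)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(5000), rng.normal(size=5000)])
true_beta = np.array([-0.5, 1.0])
y = (rng.uniform(size=5000) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)
beta_hat = logit_irls(X, y)
```

At the solution the score $X'(y - \hat{p})$ is numerically zero, which is the first-order condition of the log-likelihood in the table above.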

Model Selection Guide

| Outcome Type | $J$ | Alternatives | Recommended Model |
|---|---|---|---|
| Binary | 2 | | Logit (default) or Probit |
| Nominal | $J \geq 3$ | No attributes | Multinomial Logit |
| Nominal | $J \geq 3$ | With attributes | Conditional Logit |
| Nominal (correlated) | $J \geq 3$ | Nested groups | Nested Logit |
| Nominal (heterogeneous) | $J \geq 3$ | Random preferences | Mixed Logit |
| Ordinal | $J \geq 3$ | PO holds | Ordered Logit |
| Ordinal | $J \geq 3$ | PO violated | Generalised Ordered Logit |
| Binary, panel FE | 2 | | Conditional FE Logit |
| Binary, panel RE | 2 | | Random Effects Probit (Mundlak) |

Assumption Checklist

| Assumption | Model | How to Test | If Violated |
|---|---|---|---|
| Correct link function | Logit/Probit | Linktest; Box-Tidwell | Try alternative link; add polynomial terms |
| No complete separation | All binary | Check for large SEs; predicted probs = 0/1 | Firth penalised MLE; Bayesian logit |
| IIA | MNL, CL | Hausman-McFadden; Small-Hsiao | Nested Logit; Mixed Logit |
| Proportional odds | Ordered Logit | Brant test; parallel lines graph | Generalised Ordered Logit; MNL |
| No heteroscedasticity | Probit | Linktest; heteroscedastic probit | Heteroscedastic probit; robust SEs |
| No perfect multicollinearity | All | VIF; condition number | Drop/combine variables; regularise |
| RUM consistency | Nested Logit | $\hat{\lambda}_m \in (0,1]$ | Respecify nesting; Mixed Logit |
| No endogeneity | All | Hausman test vs. IV estimator | Control function; IV logit/probit |

Marginal Effects: Type and Context

| Context | Measure | Formula |
|---|---|---|
| Average effect (standard) | APE / AME | $n^{-1}\sum_i \partial P_i/\partial X_k$ |
| Effect at average person | PEM | $\partial P / \partial X_k \mid_{\mathbf{x}=\bar{\mathbf{x}}}$ |
| Effect for specific profile | Marginal effect at representative values | $\partial P / \partial X_k \mid_{\mathbf{x}=\mathbf{x}_0}$ |
| Binary covariate | Discrete change | $P(Y=1\mid X_k=1) - P(Y=1\mid X_k=0)$ |
| Log-odds scale | Raw coefficient | $\hat{\beta}_k$ |
| Multiplicative odds | Odds ratio | $e^{\hat{\beta}_k}$ |
| Money metric | Willingness to Pay | $-\hat{\gamma}_k/\hat{\gamma}_{cost}$ |

Standard Error Selection

| Setting | Recommended SE | Rationale |
|---|---|---|
| IID observations, correct spec. | MLE information matrix SEs | Efficient; standard |
| Potential misspecification | Sandwich (robust) SEs | Robust to distributional misspecification |
| Clustered data (firms, regions) | Cluster-robust SEs | Within-cluster correlation |
| Small samples | Bootstrap SEs | More reliable finite-sample inference |
| Marginal effects | Delta method SEs | Propagates uncertainty from $\hat{\boldsymbol{\beta}}$ |
| WTP | Krinsky-Robb simulation | Better for ratios of estimates |

Fit Statistics at a Glance

| Statistic | Formula | Best for |
|---|---|---|
| McFadden $\rho^2$ | $1 - \ell/\ell_0$ | Overall model fit |
| AIC | $-2\ell + 2k$ | Model comparison (prediction) |
| BIC | $-2\ell + k\ln n$ | Model comparison (parsimony) |
| AUC-ROC | Area under ROC curve | Binary discrimination |
| Brier Score | $n^{-1}\sum(\hat{p}_i-Y_i)^2$ | Probability calibration |
| Count $R^2$ | Pct. correctly classified | Naive classification |
| Hosmer-Lemeshow | $\chi^2_{G-2}$ | Calibration across prediction deciles |
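AUC and the Brier score can both be computed directly from predicted probabilities. A numpy sketch with toy values (the rank formula below assumes no tied predictions):

```python
import numpy as np

def auc_rank(y, p):
    """AUC via the rank (Mann-Whitney) formula; assumes no tied predictions."""
    order = np.argsort(p)
    ranks = np.empty(len(p))
    ranks[order] = np.arange(1, len(p) + 1)      # rank of each prediction
    n1, n0 = y.sum(), (1 - y).sum()
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

# Toy example: every positive outranks every negative, so AUC = 1
y = np.array([0, 0, 1, 0, 1, 1])
p = np.array([0.1, 0.3, 0.4, 0.2, 0.8, 0.9])
auc = auc_rank(y, p)
brier = np.mean((p - y) ** 2)                    # mean squared probability error
```

Note the two statistics answer different questions, as the table above stresses: AUC measures ranking (discrimination) only, while the Brier score also penalises poorly scaled probabilities.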

Panel Discrete Choice: Key Properties

| Estimator | Consistent ($n\to\infty$, fixed $T$)? | Time-Invariant Variables? | Dynamic State Dependence? | Key Reference |
|---|---|---|---|---|
| LSDV Logit (incidental params.) | No | No | Limited | |
| Conditional FE Logit | Yes | No (by default) | No | Chamberlain (1980) |
| RE Probit (standard) | Yes (if $\alpha_i \perp \mathbf{x}_{it}$) | Yes | No | |
| Correlated RE Probit (Mundlak) | Yes (approximately) | Yes | No | Mundlak (1978) |
| Dynamic Logit (Wooldridge) | Yes | Limited | Yes | Wooldridge (2005) |
| Mixed Logit (panel) | Yes | Yes | Via serial correlation | McFadden & Train (2000) |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Discrete Choice Models using the DataStatPro application. For further reading, consult McFadden's "Conditional Logit Analysis of Qualitative Choice Behavior" (1974), Train's "Discrete Choice Methods with Simulation" (Cambridge University Press, 2009), Greene's "Econometric Analysis" (8th ed., 2018), Long's "Regression Models for Categorical and Limited Dependent Variables" (Sage, 1997), or Wooldridge's "Econometric Analysis of Cross Section and Panel Data" (MIT Press, 2010). For feature requests or support, contact the DataStatPro team.