
Discrete Choice Models: Zero to Hero Tutorial

This comprehensive tutorial takes you from the foundational concepts of Discrete Choice modelling all the way through advanced extensions, assumption testing, heterogeneity analysis, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.


Table of Contents

  1. Prerequisites and Background Concepts
  2. What are Discrete Choice Models?
  3. The Mathematical Framework
  4. Key Assumptions
  5. Identification and Causal Inference
  6. Binary Choice Models: Logit and Probit
  7. Hypothesis Testing and Inference
  8. Effect Size Measures
  9. Model Fit and Evaluation
  10. Diagnostics and Assumption Testing
  11. Extensions: Multinomial and Conditional Logit
  12. Extensions: Ordered Choice Models
  13. Extensions: Nested Logit and Mixed Logit
  14. Extensions: Panel Data Discrete Choice
  15. Using the Discrete Choice Component
  16. Computational and Formula Details
  17. Worked Examples
  18. Common Mistakes and How to Avoid Them
  19. Troubleshooting
  20. Quick Reference Cheat Sheet

1. Prerequisites and Background Concepts

Before diving into Discrete Choice Models, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.

1.1 Random Variables and Probability Distributions

A random variable $Y$ is a variable whose value is determined by a random process. In discrete choice modelling, the outcome variable $Y$ takes on a finite set of values representing alternative choices (e.g., $Y \in \{0, 1\}$ for binary outcomes, or $Y \in \{1, 2, \dots, J\}$ for multiple alternatives).

Key distributions used in discrete choice models:

- Bernoulli — the distribution of a binary outcome $Y \in \{0,1\}$.
- Logistic — the error distribution underlying the logit model.
- Standard normal — the error distribution underlying the probit model.
- Type I Extreme Value (Gumbel) — the error distribution underlying multinomial and conditional logit.

1.2 Likelihood and Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is the primary estimation method for discrete choice models. For a sample of $n$ independent observations, the likelihood function is:

$$\mathcal{L}(\boldsymbol{\theta}; \mathbf{y}, \mathbf{X}) = \prod_{i=1}^n f(y_i \mid \mathbf{x}_i; \boldsymbol{\theta})$$

The log-likelihood (more convenient for optimisation) is:

$$\ell(\boldsymbol{\theta}) = \ln \mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^n \ln f(y_i \mid \mathbf{x}_i; \boldsymbol{\theta})$$

The MLE $\hat{\boldsymbol{\theta}}$ maximises $\ell(\boldsymbol{\theta})$:

$$\hat{\boldsymbol{\theta}}_{MLE} = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})$$

MLE has attractive properties under regularity conditions: it is consistent, asymptotically normal, and asymptotically efficient.
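As a concrete illustration, the MLE can be computed numerically. The minimal sketch below (simulated data with hypothetical true coefficients) maximises the binary-logit log-likelihood with `scipy.optimize.minimize`:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Simulated data (hypothetical DGP): true beta = [0.5, -1.0]
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -1.0])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)

def neg_loglik(beta):
    eta = X @ beta
    # log L = sum_i [ y_i * eta_i - ln(1 + e^{eta_i}) ]  (logit form)
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

res = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(res.x)  # estimates close to the true [0.5, -1.0]
```

Because the logit log-likelihood is globally concave, any reasonable starting point converges to the same maximum.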

1.3 Latent Variable Models

Many discrete choice models are derived from an underlying latent variable (unobserved continuous variable). Define a latent utility or propensity:

$$Y_i^* = \mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i$$

The observed discrete outcome is a deterministic function of the latent variable:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0 \\ 0 & \text{if } Y_i^* \leq 0 \end{cases}$$

The distribution assumed for $\epsilon_i$ determines the model family:

- Logistic errors yield the logit model.
- Standard normal errors yield the probit model.

1.4 Random Utility Maximisation (RUM)

The Random Utility Model (McFadden, 1974) provides the economic foundation for discrete choice. Each decision-maker $i$ assigns a utility $U_{ij}$ to each alternative $j$:

$$U_{ij} = V_{ij} + \epsilon_{ij}$$

Where:

- $V_{ij}$ is the systematic (deterministic) component of utility, a function of observed characteristics and attributes.
- $\epsilon_{ij}$ is the random component, capturing unobserved tastes and attributes.

The decision-maker chooses the alternative that maximises utility:

$$Y_i = j \iff U_{ij} > U_{ik} \quad \forall k \neq j$$

Different distributional assumptions for $\epsilon_{ij}$ yield different discrete choice model families.
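The RUM-to-logit link can be verified by simulation. The sketch below (hypothetical systematic utilities) draws Gumbel errors, applies utility maximisation, and compares the empirical choice shares to the logit formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# Systematic utilities for J = 3 alternatives (hypothetical values)
V = np.array([1.0, 0.5, 0.0])
n_draws = 200_000

# Type I Extreme Value (Gumbel) errors => choices follow the logit formula
eps = rng.gumbel(size=(n_draws, 3))
choices = np.argmax(V + eps, axis=1)
empirical = np.bincount(choices, minlength=3) / n_draws

logit_probs = np.exp(V) / np.exp(V).sum()
print(empirical, logit_probs)  # the two agree to ~2 decimal places
```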

1.5 Ordinary Least Squares and Its Limitations for Discrete Outcomes

OLS regression applied to a binary outcome ($Y \in \{0,1\}$) produces the Linear Probability Model (LPM):

$$E[Y_i \mid \mathbf{x}_i] = \mathbf{x}_i^T\boldsymbol{\beta}$$

The LPM has well-known limitations:

1. Predicted probabilities can fall outside $[0,1]$.
2. The error term is heteroskedastic by construction, since its variance depends on $\mathbf{x}_i$.
3. Marginal effects are constant everywhere, which is implausible near the probability bounds.

Discrete choice models address all three limitations by modelling probabilities through a monotone transformation that maps the real line to $(0,1)$.
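The out-of-range prediction problem is easy to demonstrate. A deterministic sketch (a binary outcome that steps at $x = 0$) shows OLS fitted values escaping $[0,1]$:

```python
import numpy as np

# Deterministic illustration: binary outcome driven by a step in x
x = np.linspace(-3, 3, 61)
y = (x > 0).astype(float)

# OLS fit of the Linear Probability Model: y = a + b*x
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef

print(fitted.min(), fitted.max())  # predictions escape [0, 1] at the extremes
```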

1.6 Multinomial Outcomes and Ordinal Data

Multinomial outcomes have more than two unordered categories: $Y \in \{1, 2, \dots, J\}$ where no natural ordering exists (e.g., mode of transport: car, bus, train, bicycle).

Ordinal outcomes have more than two categories with a natural ordering: $Y \in \{1, 2, \dots, J\}$ where $1 < 2 < \dots < J$ (e.g., satisfaction: low, medium, high).

Different model families are designed for each type of outcome.


2. What are Discrete Choice Models?

2.1 The Core Idea

Discrete Choice Models (DCMs) are statistical models designed to explain and predict the choices made by individuals (or firms, households, or other decision-making units) when they face a finite set of mutually exclusive alternatives.

The core modelling challenge: the outcome is not a continuous variable but a discrete category. Standard regression is misspecified for such outcomes. DCMs model the probability of choosing each alternative as a function of:

- characteristics of the decision-maker (e.g., income, age), and
- attributes of the alternatives themselves (e.g., price, travel time).

The general DCM probability structure:

$$P(Y_i = j \mid \mathbf{x}_i) = F_j(\mathbf{x}_i, \boldsymbol{\theta})$$

Where $F_j(\cdot)$ is a function mapping covariates and parameters to probabilities, with:

$$\sum_{j=1}^J P(Y_i = j \mid \mathbf{x}_i) = 1 \quad \text{and} \quad P(Y_i = j \mid \mathbf{x}_i) \in (0,1)$$

2.2 A Taxonomy of Discrete Choice Models

| Model | Outcome Type | Alternatives | Key Assumption |
|---|---|---|---|
| Binary Logit | Binary ($J=2$) | 2 unordered | Logistic errors |
| Binary Probit | Binary ($J=2$) | 2 unordered | Normal errors |
| Multinomial Logit (MNL) | Nominal ($J>2$) | $J$ unordered | IID Gumbel errors (IIA) |
| Conditional Logit | Nominal ($J>2$) | $J$ with attributes | IID Gumbel errors (IIA) |
| Nested Logit | Nominal ($J>2$) | Hierarchical structure | Correlated within nests |
| Mixed Logit | Nominal ($J>2$) | $J$ with random coefficients | Flexible error structure |
| Ordered Logit (Proportional Odds) | Ordinal ($J>2$) | Ordered categories | Proportional odds |
| Ordered Probit | Ordinal ($J>2$) | Ordered categories | Normal latent variable |
| Multinomial Probit | Nominal ($J>2$) | $J$ unordered | Multivariate normal errors |

2.3 Real-World Applications

Discrete choice models are applied across virtually every field involving individual decision-making:

- Transport: travel mode and route choice (car, bus, rail, cycling).
- Marketing: brand, product, and channel choice.
- Health economics: insurance plan and treatment choice.
- Labour economics: occupational choice and labour force participation.
- Environmental economics: stated preference and contingent valuation studies.

2.4 Discrete Choice Models vs. Other Regression Methods

| Method | Outcome | Key Use Case | Key Limitation |
|---|---|---|---|
| OLS / LPM | Continuous (or binary) | Simple benchmark; DiD with binary $Y$ | Predicted probs outside $[0,1]$ |
| Logit / Probit | Binary | Binary classification; probability estimation | Marginal effects non-constant |
| Multinomial Logit | Nominal ($J>2$) | Unordered multi-category choices | IIA assumption restrictive |
| Nested Logit | Nominal ($J>2$, grouped) | Hierarchical choice structures | Tree structure pre-specified |
| Mixed Logit | Nominal (any $J$) | Preference heterogeneity; relaxes IIA | Computationally intensive |
| Ordered Logit/Probit | Ordinal | Ranked categories, Likert scales | Proportional odds assumption |
| Count Models (Poisson/NB) | Count data | Number of events | Not a DCM; counts not choices |
| Survival/Duration | Time to event | Time until discrete event | Different modelling paradigm |

3. The Mathematical Framework

3.1 The Binary Logit Model

The logit model specifies the probability of the outcome $Y_i = 1$ as:

$$P(Y_i = 1 \mid \mathbf{x}_i) = \Lambda(\mathbf{x}_i^T\boldsymbol{\beta}) = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}} = \frac{1}{1 + e^{-\mathbf{x}_i^T\boldsymbol{\beta}}}$$

And:

$$P(Y_i = 0 \mid \mathbf{x}_i) = 1 - \Lambda(\mathbf{x}_i^T\boldsymbol{\beta}) = \frac{1}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}$$

The log-odds (logit) transformation linearises the model:

$$\ln\left(\frac{P(Y_i=1)}{P(Y_i=0)}\right) = \ln\left(\frac{P_i}{1-P_i}\right) = \mathbf{x}_i^T\boldsymbol{\beta}$$

This is the log-odds ratio (or logit), and $\boldsymbol{\beta}$ are the log-odds coefficients.

3.2 The Binary Probit Model

The probit model specifies:

$$P(Y_i = 1 \mid \mathbf{x}_i) = \Phi(\mathbf{x}_i^T\boldsymbol{\beta})$$

Where $\Phi(\cdot)$ is the standard normal CDF. From the latent variable representation:

$$Y_i^* = \mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0,1)$$

$$P(Y_i = 1) = P(Y_i^* > 0) = P(\epsilon_i > -\mathbf{x}_i^T\boldsymbol{\beta}) = 1 - \Phi(-\mathbf{x}_i^T\boldsymbol{\beta}) = \Phi(\mathbf{x}_i^T\boldsymbol{\beta})$$

3.3 The Logit vs. Probit Comparison

The logistic and standard normal CDFs are very similar in shape. Key differences:

| Property | Logit | Probit |
|---|---|---|
| Link function | $\ln[p/(1-p)] = \mathbf{x}^T\boldsymbol{\beta}$ | $\Phi^{-1}(p) = \mathbf{x}^T\boldsymbol{\beta}$ |
| Error distribution | Logistic (heavier tails) | Standard normal |
| Scale normalisation | $\text{Var}(\epsilon) = \pi^2/3$ | $\text{Var}(\epsilon) = 1$ |
| Coefficient scaling | $\hat{\beta}_{logit} \approx 1.6\text{–}1.8 \times \hat{\beta}_{probit}$ | — |
| Closed-form probabilities | ✅ | ❌ (requires numerical integration) |
| Interpretability | Log-odds directly interpretable | Requires transformation |
| Tail behaviour | Heavier tails | Thinner tails |

Rules of thumb for converting coefficients: matching error variances gives $\hat{\beta}_{logit} \approx \frac{\pi}{\sqrt{3}} \hat{\beta}_{probit} \approx 1.81 \hat{\beta}_{probit}$; the smaller factor of 1.6 is also widely used because it better matches the two CDFs near $p = 0.5$.
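The quality of either scaling factor can be checked numerically. The sketch below compares the logistic CDF $\Lambda(z)$ with the rescaled normal CDF $\Phi(z/c)$ for both conversion factors:

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-6, 6, 1201)
logistic_cdf = 1.0 / (1.0 + np.exp(-z))

# Maximum absolute gap between Lambda(z) and Phi(z/c) for each scale factor
gaps = {}
for c in (np.pi / np.sqrt(3), 1.6):
    gaps[round(c, 3)] = np.max(np.abs(logistic_cdf - norm.cdf(z / c)))
print(gaps)  # both rescalings track the logistic CDF to within a few percent
```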

3.4 The Multinomial Logit Model

For $J > 2$ unordered alternatives, the Multinomial Logit (MNL) specifies, for alternative $j \in \{1, \dots, J\}$ with reference category $j = 1$:

$$P(Y_i = j \mid \mathbf{x}_i) = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}_j}}{\sum_{k=1}^J e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}}$$

With the normalisation $\boldsymbol{\beta}_1 = \mathbf{0}$ (reference category), so:

$$P(Y_i = j \mid \mathbf{x}_i) = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}_j}}{1 + \sum_{k=2}^J e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}}, \quad j = 2, \dots, J$$

$$P(Y_i = 1 \mid \mathbf{x}_i) = \frac{1}{1 + \sum_{k=2}^J e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}}$$

The log-odds ratio relative to the reference category:

$$\ln\left(\frac{P(Y_i = j)}{P(Y_i = 1)}\right) = \mathbf{x}_i^T\boldsymbol{\beta}_j$$
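The MNL probabilities above are a softmax over linear indices with the reference utility fixed at zero. A minimal sketch (hypothetical coefficient values):

```python
import numpy as np

def mnl_probs(x, betas):
    """Multinomial logit probabilities with category 1 as reference.

    x     : (k,) covariate vector
    betas : (J-1, k) coefficients for categories 2..J (beta_1 = 0)
    """
    eta = np.concatenate([[0.0], betas @ x])  # linear indices, reference = 0
    eta -= eta.max()                          # subtract max for numerical stability
    expeta = np.exp(eta)
    return expeta / expeta.sum()

# Hypothetical example: J = 3 categories, intercept plus one regressor
betas = np.array([[0.2, 0.5],
                  [-0.3, 1.0]])
p = mnl_probs(np.array([1.0, 0.4]), betas)
print(p, p.sum())  # three probabilities summing to 1
```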

3.5 The Conditional Logit Model

The Conditional Logit (CL) model (McFadden, 1974) allows attributes to vary across alternatives. The utility of alternative $j$ for individual $i$:

$$U_{ij} = \mathbf{z}_{ij}^T\boldsymbol{\gamma} + \mathbf{x}_i^T\boldsymbol{\beta}_j + \epsilon_{ij}$$

Where $\mathbf{z}_{ij}$ are alternative-specific attributes (e.g., price of alternative $j$, travel time of option $j$) with a common coefficient $\boldsymbol{\gamma}$, and $\mathbf{x}_i$ are individual-specific characteristics with alternative-specific coefficients $\boldsymbol{\beta}_j$.

The choice probability:

$$P(Y_i = j \mid \mathbf{Z}_i) = \frac{e^{\mathbf{z}_{ij}^T\boldsymbol{\gamma} + \mathbf{x}_i^T\boldsymbol{\beta}_j}}{\sum_{k=1}^J e^{\mathbf{z}_{ik}^T\boldsymbol{\gamma} + \mathbf{x}_i^T\boldsymbol{\beta}_k}}$$

3.6 The Ordered Logit Model

For an ordinal outcome $Y_i \in \{1, 2, \dots, J\}$, the Ordered Logit (Proportional Odds) model uses a single latent variable:

$$Y_i^* = \mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim \text{Logistic}(0,1)$$

With $J-1$ threshold (cut-point) parameters $\tau_1 < \tau_2 < \dots < \tau_{J-1}$:

$$Y_i = j \iff \tau_{j-1} < Y_i^* \leq \tau_j$$

Where $\tau_0 = -\infty$ and $\tau_J = +\infty$. The choice probabilities are:

$$P(Y_i = j \mid \mathbf{x}_i) = \Lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta}) - \Lambda(\tau_{j-1} - \mathbf{x}_i^T\boldsymbol{\beta})$$

$$P(Y_i \leq j \mid \mathbf{x}_i) = \Lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta})$$
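The category probabilities are differences of adjacent cumulative logistic CDF values. A minimal sketch (hypothetical thresholds for $J = 4$ categories):

```python
import numpy as np

def ordered_logit_probs(xb, cuts):
    """Category probabilities for ordered logit.

    xb   : scalar linear index x_i' beta
    cuts : increasing thresholds tau_1 < ... < tau_{J-1}
    """
    Lambda = lambda t: 1.0 / (1.0 + np.exp(-t))
    # Pad the cumulative probabilities with 0 (tau_0 = -inf) and 1 (tau_J = +inf)
    cdf = np.concatenate([[0.0], Lambda(np.asarray(cuts) - xb), [1.0]])
    return np.diff(cdf)  # P(Y = j) = Lambda(tau_j - xb) - Lambda(tau_{j-1} - xb)

# Hypothetical thresholds and linear index
p = ordered_logit_probs(xb=0.3, cuts=[-1.0, 0.5, 1.8])
print(p, p.sum())  # four category probabilities summing to 1
```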

3.7 The Nested Logit Model

The Nested Logit partitions the $J$ alternatives into $M$ mutually exclusive nests $B_1, B_2, \dots, B_M$. The choice probability for alternative $j$ in nest $m$:

$$P(Y_i = j) = P(\text{nest } m) \times P(j \mid \text{nest } m)$$

$$P(j \mid \text{nest } m) = \frac{e^{V_{ij}/\lambda_m}}{\sum_{k \in B_m} e^{V_{ik}/\lambda_m}}$$

$$P(\text{nest } m) = \frac{e^{W_{im} + \lambda_m I_{im}}}{\sum_{l=1}^M e^{W_{il} + \lambda_l I_{il}}}$$

Where $I_{im} = \ln\sum_{k \in B_m} e^{V_{ik}/\lambda_m}$ is the inclusive value (log-sum), $\lambda_m \in (0,1]$ is the dissimilarity parameter for nest $m$, and $W_{im}$ contains nest-level attributes.

3.8 The Mixed Logit Model

The Mixed Logit (also called the Random Parameters Logit) allows coefficients to vary across individuals:

$$P(Y_i = j \mid \mathbf{x}_i, \boldsymbol{\beta}_i) = \frac{e^{\mathbf{x}_{ij}^T\boldsymbol{\beta}_i}}{\sum_{k=1}^J e^{\mathbf{x}_{ik}^T\boldsymbol{\beta}_i}}$$

Where $\boldsymbol{\beta}_i \sim f(\boldsymbol{\beta} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ — individual-specific random coefficients drawn from a mixing distribution (typically normal or log-normal).

The unconditional choice probability integrates over the random coefficient distribution:

$$P(Y_i = j \mid \mathbf{x}_i) = \int \frac{e^{\mathbf{x}_{ij}^T\boldsymbol{\beta}}}{\sum_{k=1}^J e^{\mathbf{x}_{ik}^T\boldsymbol{\beta}}} f(\boldsymbol{\beta} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, d\boldsymbol{\beta}$$

This integral has no closed form and is evaluated by simulation (see Section 16.7).
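A minimal simulation sketch of that integral, assuming independent normal mixing with hypothetical means and standard deviations, averages the logit formula over random coefficient draws:

```python
import numpy as np

rng = np.random.default_rng(1)

def mixed_logit_prob(x, mu, sigma, n_draws=5000):
    """Simulated mixed-logit probabilities: average the logit formula
    over random coefficients beta_r ~ N(mu, diag(sigma^2)).

    x : (J, k) attribute matrix, one row per alternative
    """
    k = len(mu)
    betas = mu + sigma * rng.standard_normal((n_draws, k))  # (R, k) draws
    eta = betas @ x.T                                       # (R, J) utilities
    eta -= eta.max(axis=1, keepdims=True)                   # numerical stability
    p = np.exp(eta)
    p /= p.sum(axis=1, keepdims=True)                       # logit probs per draw
    return p.mean(axis=0)                                   # average over draws

# Hypothetical: 3 alternatives, 2 attributes, random coefficients
x = np.array([[1.0, 0.5], [0.2, 1.0], [0.0, 0.0]])
p = mixed_logit_prob(x, mu=np.array([0.8, -0.4]), sigma=np.array([0.5, 0.3]))
print(p, p.sum())  # simulated choice probabilities, summing to 1
```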


4. Key Assumptions

4.1 Independence of Irrelevant Alternatives (IIA)

The most important and controversial assumption in multinomial logit models is the Independence of Irrelevant Alternatives (IIA):

The ratio of probabilities for any two alternatives jj and kk is independent of all other alternatives in the choice set.

Formally, for the MNL model:

$$\frac{P(Y_i = j)}{P(Y_i = k)} = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}_j}}{e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}} = e^{\mathbf{x}_i^T(\boldsymbol{\beta}_j - \boldsymbol{\beta}_k)}$$

This ratio depends only on $j$ and $k$, not on any other alternative $l$.

The Red Bus / Blue Bus Problem: A classic IIA failure. Suppose individuals choose between Car and Red Bus with equal probability (50/50). If a Blue Bus (identical to Red Bus except colour) is added, IIA predicts all three have 1/3 probability each. But intuitively, the split should be 50% car and 50% bus (25% red + 25% blue). IIA allocates "competition" uniformly across all alternatives rather than within similar alternatives.
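The red bus / blue bus arithmetic is a two-line computation under the MNL formula:

```python
import numpy as np

def mnl_shares(V):
    """MNL choice shares from systematic utilities V."""
    e = np.exp(V - np.max(V))
    return e / e.sum()

# Car vs Red Bus with equal systematic utility: a 50/50 split
two_alt = mnl_shares(np.array([0.0, 0.0]))

# Add a Blue Bus identical to the Red Bus: IIA forces 1/3 each,
# although intuitively Car should keep ~50% and the buses split the rest
three_alt = mnl_shares(np.array([0.0, 0.0, 0.0]))
print(two_alt, three_alt)  # [0.5 0.5] and [1/3 1/3 1/3]
```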

IIA is implied by: The Type I Extreme Value (Gumbel) distributional assumption on $\epsilon_{ij}$ and the independence across alternatives.

IIA fails when: Alternatives are correlated substitutes — i.e., some alternatives are more similar to each other than to others. In such cases, the error terms $\epsilon_{ij}$ are correlated across alternatives.

4.2 The Proportional Odds Assumption

For the Ordered Logit model, the proportional odds assumption (also called parallel lines assumption) requires that the effect of covariates on the log-odds is constant across all thresholds:

$$\ln\left(\frac{P(Y_i > j)}{P(Y_i \leq j)}\right) = \mathbf{x}_i^T\boldsymbol{\beta} - \tau_j \quad \forall j$$

The coefficient vector $\boldsymbol{\beta}$ is the same for all outcome categories — only the intercept $\tau_j$ changes. This is a strong assumption that should be explicitly tested (see Section 10.3).

4.3 Random Utility Consistency

For the RUM foundation to be valid:

  1. Completeness: Decision-makers can rank all alternatives.
  2. Transitivity: Preferences are transitive (if $A \succ B$ and $B \succ C$, then $A \succ C$).
  3. Utility maximisation: Decision-makers always choose the alternative with the highest utility.
  4. Stable preferences: Preferences do not change during the observation period.

4.4 Independence of Observations

Standard discrete choice models assume independent observations across individuals. In panel data (Section 14), this assumption is relaxed by allowing within-individual correlation across repeated choices.

4.5 Correct Specification of the Choice Set

The model assumes:

- The choice set contains all relevant alternatives available to the decision-maker.
- Alternatives are mutually exclusive and collectively exhaustive.
- Every decision-maker actually faces the choice set as specified; unavailable alternatives are excluded.


5. Identification and Causal Inference

5.1 What Discrete Choice Models Identify

Identification in discrete choice models means the ability to recover the structural parameters $\boldsymbol{\beta}$ from the data. Key identification conditions:

- Scale normalisation: only $\boldsymbol{\beta}/\sigma$ is identified, so the error variance is fixed ($\pi^2/3$ for logit, $1$ for probit).
- Location normalisation: in multinomial models, one alternative's coefficients are set to zero (the reference category).
- No perfect collinearity among covariates, and sufficient variation in $\mathbf{x}_i$ across observations.

5.2 Endogeneity in Discrete Choice Models

Endogeneity arises when a regressor $X_{ij}$ is correlated with the unobserved utility component $\epsilon_{ij}$. Common sources:

- Omitted attributes: unobserved product or alternative quality correlated with included regressors such as price.
- Simultaneity: attributes (e.g., prices) set in response to demand.
- Measurement error in covariates.

Consequences: MLE estimates are inconsistent under endogeneity — standard corrections are required.

Remedies:

- Instrumental variables combined with control function approaches.
- Alternative-specific constants that absorb unobserved alternative-level attributes.
- Richer covariates that proxy for the omitted factors.

5.3 Average Partial Effects vs. Structural Parameters

In discrete choice models, the raw coefficients $\boldsymbol{\beta}$ are not directly interpretable as marginal effects. The Average Partial Effect (APE) of continuous covariate $X_k$ on $P(Y_i = j)$ is:

$$APE_k = \frac{1}{n}\sum_{i=1}^n \frac{\partial P(Y_i = j \mid \mathbf{x}_i)}{\partial X_{ik}}$$

For binary logit:

$$APE_k^{logit} = \frac{1}{n}\sum_{i=1}^n \hat{p}_i(1-\hat{p}_i)\hat{\beta}_k$$

For binary probit:

$$APE_k^{probit} = \frac{1}{n}\sum_{i=1}^n \phi(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})\hat{\beta}_k$$

The APE (also called the Average Marginal Effect, AME) is the primary reported quantity of interest in discrete choice models — analogous to the regression coefficient in OLS.
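The logit APE formula above is a one-liner once the fitted probabilities are in hand. A minimal sketch (hypothetical fitted coefficients on simulated covariates):

```python
import numpy as np

def logit_ape(X, beta, k):
    """Average partial effect of continuous covariate k in a binary logit:
    APE_k = mean_i [ p_i (1 - p_i) ] * beta_k."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.mean(p * (1.0 - p)) * beta[k]

# Hypothetical fitted model: intercept and one regressor
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta_hat = np.array([0.2, 0.9])
ape = logit_ape(X, beta_hat, k=1)
print(ape)  # average change in P(Y=1) per unit increase in x
```

Since $p(1-p) \leq 0.25$, the APE is always bounded by $0.25\,\hat{\beta}_k$ in absolute value.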

5.4 Partial Effects at the Mean (PEM) and Partial Effects at Representative Values

Partial Effect at the Mean (PEM): Evaluate the marginal effect at the sample mean $\bar{\mathbf{x}}$:

$$PEM_k = \frac{\partial P(Y_i = j \mid \mathbf{x}_i = \bar{\mathbf{x}})}{\partial X_k}$$

For binary logit: $PEM_k = \Lambda(\bar{\mathbf{x}}^T\hat{\boldsymbol{\beta}})[1 - \Lambda(\bar{\mathbf{x}}^T\hat{\boldsymbol{\beta}})]\hat{\beta}_k$

⚠️ The PEM evaluates the marginal effect at a potentially non-existent "average individual." The APE is generally preferred because it averages the marginal effect across actual observations, accounting for the non-linearity of the model.

5.5 Willingness to Pay (WTP) in Choice Models

In models with a cost or price attribute (e.g., transport cost, product price), the Willingness to Pay (WTP) for a change in attribute $k$ is:

$$WTP_k = -\frac{\hat{\gamma}_k}{\hat{\gamma}_{cost}}$$

Where $\hat{\gamma}_k$ is the coefficient on attribute $k$ and $\hat{\gamma}_{cost}$ is the coefficient on cost. This ratio gives the marginal rate of substitution between attribute $k$ and money — a central output of stated preference and transport choice studies.


6. Binary Choice Models: Logit and Probit

6.1 The Log-Likelihood for Binary Models

For a binary outcome $Y_i \in \{0, 1\}$, the log-likelihood is:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^n \left[Y_i \ln P_i + (1 - Y_i) \ln(1 - P_i)\right]$$

Where $P_i = P(Y_i = 1 \mid \mathbf{x}_i)$.

For logit: $P_i = \Lambda(\mathbf{x}_i^T\boldsymbol{\beta})$, giving:

$$\ell_{logit}(\boldsymbol{\beta}) = \sum_{i=1}^n \left[Y_i \mathbf{x}_i^T\boldsymbol{\beta} - \ln(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}})\right]$$

For probit: $P_i = \Phi(\mathbf{x}_i^T\boldsymbol{\beta})$, giving:

$$\ell_{probit}(\boldsymbol{\beta}) = \sum_{i=1}^n \left[Y_i \ln\Phi(\mathbf{x}_i^T\boldsymbol{\beta}) + (1-Y_i)\ln\Phi(-\mathbf{x}_i^T\boldsymbol{\beta})\right]$$

Both log-likelihoods are globally concave, ensuring a unique maximum.

6.2 The Score and Hessian

The score vector (gradient of the log-likelihood) takes a particularly simple form for the logit model:

$$\mathbf{s}(\boldsymbol{\beta}) = \frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_{i=1}^n \left(Y_i - P_i\right)\mathbf{x}_i$$

This simplification occurs because for the logit link $\partial P_i / \partial \eta_i = P_i(1-P_i)$, where $\eta_i = \mathbf{x}_i^T\boldsymbol{\beta}$, which cancels the $P_i(1-P_i)$ denominator of the general score. For probit, the score is $\sum_{i=1}^n \frac{(Y_i - P_i)\,\phi(\eta_i)}{P_i(1-P_i)}\mathbf{x}_i$.

The Hessian matrix (second derivative):

$$\mathbf{H}(\boldsymbol{\beta}) = \frac{\partial^2 \ell}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^T} = -\sum_{i=1}^n w_i \mathbf{x}_i\mathbf{x}_i^T$$

Where $w_i = P_i(1-P_i)$ for logit and $w_i = \phi(\mathbf{x}_i^T\boldsymbol{\beta})^2 / [P_i(1-P_i)]$ for probit. The negative Hessian is positive definite, confirming concavity.

6.3 Newton-Raphson and IRLS Estimation

The MLE is obtained iteratively using Newton-Raphson (or equivalently, Iteratively Reweighted Least Squares — IRLS):

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \left[\mathbf{H}(\boldsymbol{\beta}^{(t)})\right]^{-1}\mathbf{s}(\boldsymbol{\beta}^{(t)})$$

IRLS Interpretation: At each iteration, solve a weighted OLS problem:

$$\boldsymbol{\beta}^{(t+1)} = \left(\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{z}^{(t)}$$

Where $\mathbf{W}^{(t)} = \text{diag}(w_1^{(t)}, \dots, w_n^{(t)})$ is a diagonal weight matrix and $\mathbf{z}^{(t)} = \boldsymbol{\eta}^{(t)} + (\mathbf{W}^{(t)})^{-1}(\mathbf{y} - \hat{\mathbf{p}}^{(t)})$ is the adjusted dependent variable.
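The IRLS recursion can be implemented in a few lines. A minimal numpy sketch for the binary logit (checked against a simulated dataset with hypothetical true coefficients):

```python
import numpy as np

def logit_irls(X, y, tol=1e-10, max_iter=50):
    """Fit a binary logit by IRLS: repeated weighted least squares on the
    adjusted dependent variable z = eta + (y - p)/w, with weights w = p(1-p)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        z = eta + (y - p) / w
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated check against hypothetical true coefficients [-0.5, 1.2]
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(5000), rng.normal(size=5000)])
beta_true = np.array([-0.5, 1.2])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
beta_hat = logit_irls(X, y)
print(beta_hat)  # close to [-0.5, 1.2]
```

Starting from zeros, the recursion typically converges in under ten iterations for well-conditioned data.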

6.4 Asymptotic Properties of MLE

Under regularity conditions, the MLE is asymptotically normal:

$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \mathcal{I}(\boldsymbol{\beta}_0)^{-1})$$

Where the Fisher information matrix is:

$$\mathcal{I}(\boldsymbol{\beta}) = -E\left[\frac{\partial^2 \ell_i}{\partial \boldsymbol{\beta}\partial\boldsymbol{\beta}^T}\right] = \sum_{i=1}^n w_i \mathbf{x}_i\mathbf{x}_i^T$$

The variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ is estimated by the inverse of the observed information matrix:

$$\hat{\mathbf{V}}(\hat{\boldsymbol{\beta}}) = \left[-\mathbf{H}(\hat{\boldsymbol{\beta}})\right]^{-1} = \left(\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X}\right)^{-1}$$

6.5 Interpreting Logit Coefficients as Odds Ratios

For the logit model, exponentiating the coefficient gives the odds ratio:

$$OR_k = e^{\hat{\beta}_k}$$

Interpretation: A one-unit increase in $X_k$ multiplies the odds of $Y=1$ by $e^{\hat{\beta}_k}$:

$$\frac{P(Y=1 \mid X_k + 1) / P(Y=0 \mid X_k + 1)}{P(Y=1 \mid X_k) / P(Y=0 \mid X_k)} = e^{\hat{\beta}_k}$$

⚠️ Odds ratios are not the same as probability ratios (relative risks). Do not interpret the odds ratio as "X% more likely." Convert to marginal probabilities via the APE for clearer communication.

6.6 Marginal Effects in Binary Models

For a continuous covariate $X_k$, the marginal effect of $X_k$ on $P(Y=1)$ for individual $i$:

Logit: $\frac{\partial P_i}{\partial X_{ik}} = \hat{p}_i(1-\hat{p}_i)\hat{\beta}_k = \Lambda(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})[1-\Lambda(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})]\hat{\beta}_k$

Probit: $\frac{\partial P_i}{\partial X_{ik}} = \phi(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})\hat{\beta}_k$

For a discrete/binary covariate $X_k \in \{0,1\}$, the marginal effect is the discrete change in predicted probability:

$$\Delta P_i = P(Y_i=1 \mid X_{ik}=1, \mathbf{x}_{i,-k}) - P(Y_i=1 \mid X_{ik}=0, \mathbf{x}_{i,-k})$$

6.7 Standard Errors for Average Partial Effects (Delta Method)

The APE is a nonlinear function of $\hat{\boldsymbol{\beta}}$. Standard errors are obtained via the delta method:

$$\widehat{SE}(APE_k) = \sqrt{\mathbf{g}_k^T \hat{\mathbf{V}}(\hat{\boldsymbol{\beta}}) \mathbf{g}_k}$$

Where $\mathbf{g}_k = \partial \widehat{APE}_k / \partial \hat{\boldsymbol{\beta}}$ is the gradient of the APE with respect to the coefficient vector. Alternatively, use the bootstrap for more reliable inference with small samples.


7. Hypothesis Testing and Inference

7.1 The Wald Test

The Wald test for $H_0: \beta_k = 0$ uses the asymptotic normality of the MLE:

$$z = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)} \sim \mathcal{N}(0,1)$$

Or equivalently, $z^2 \sim \chi^2_1$. For a vector of $q$ restrictions $H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}$:

$$W = (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})^T\left[\mathbf{R}\hat{\mathbf{V}}(\hat{\boldsymbol{\beta}})\mathbf{R}^T\right]^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r}) \sim \chi^2_q$$

7.2 The Likelihood Ratio Test

The Likelihood Ratio (LR) test compares a restricted model (imposing $H_0$) to an unrestricted model:

$$LR = -2\left[\ell(\hat{\boldsymbol{\beta}}_{restricted}) - \ell(\hat{\boldsymbol{\beta}}_{unrestricted})\right] \sim \chi^2_q$$

Where $q$ is the number of restrictions. The LR test is generally preferred over the Wald test because it is invariant to reparameterisation and often has better finite-sample properties.

Special case: The LR test comparing a model with covariates to an intercept-only model:

$$LR = -2[\ell_0 - \ell_{full}] \sim \chi^2_k$$

Where $\ell_0 = n[p_0\ln p_0 + (1-p_0)\ln(1-p_0)]$ and $p_0 = \bar{Y}$ is the sample proportion.
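In practice the LR statistic is computed directly from the two fitted log-likelihoods. A minimal sketch (hypothetical log-likelihood values) using `scipy.stats.chi2`:

```python
from scipy.stats import chi2

def lr_test(ll_restricted, ll_unrestricted, q):
    """Likelihood ratio test: LR = -2 (ll_r - ll_u) ~ chi2(q) under H0."""
    lr = -2.0 * (ll_restricted - ll_unrestricted)
    return lr, chi2.sf(lr, df=q)

# Hypothetical log-likelihoods: intercept-only vs model with 3 covariates
lr, pval = lr_test(ll_restricted=-520.4, ll_unrestricted=-505.1, q=3)
print(f"LR = {lr:.2f}, p = {pval:.4g}")  # LR = 30.60, strongly significant
```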

7.3 The Score (Lagrange Multiplier) Test

The Score test (Rao test) only requires estimating the restricted model:

$$S = \mathbf{s}(\hat{\boldsymbol{\beta}}_{restricted})^T \hat{\mathcal{I}}(\hat{\boldsymbol{\beta}}_{restricted})^{-1} \mathbf{s}(\hat{\boldsymbol{\beta}}_{restricted}) \sim \chi^2_q$$

Useful when the unrestricted model is computationally expensive to estimate.

7.4 Test Equivalences and Recommendations

| Test | Requires | Best For | Invariant to Reparameterisation? |
|---|---|---|---|
| Wald | Unrestricted model only | Single coefficient tests | ❌ |
| Likelihood Ratio | Both models | Nested model comparison | ✅ |
| Score (LM) | Restricted model only | Adding variables to a model | ✅ |

The three tests are asymptotically equivalent but differ in finite samples. The LR test is generally most reliable.

7.5 Confidence Intervals

A $(1-\alpha)\times 100\%$ Wald confidence interval for $\beta_k$:

$$\hat{\beta}_k \pm z_{\alpha/2} \times SE(\hat{\beta}_k)$$

A profile likelihood confidence interval (more reliable for small samples):

$$CI_{PL} = \left\{\beta_k : -2[\ell(\hat{\boldsymbol{\beta}}_{-k}, \beta_k) - \ell(\hat{\boldsymbol{\beta}})] \leq \chi^2_{\alpha,1}\right\}$$

7.6 Testing IIA with the Hausman-McFadden Test

The Hausman-McFadden test for IIA in the MNL compares the full-sample MNL estimates to estimates obtained after removing one alternative from the choice set:

$$H_{IIA} = (\hat{\boldsymbol{\beta}}_s - \hat{\boldsymbol{\beta}}_f)^T\left[\hat{\mathbf{V}}_s - \hat{\mathbf{V}}_f\right]^{-1}(\hat{\boldsymbol{\beta}}_s - \hat{\boldsymbol{\beta}}_f) \sim \chi^2_k$$

Where $\hat{\boldsymbol{\beta}}_s$ are estimates from the restricted choice set and $\hat{\boldsymbol{\beta}}_f$ are estimates from the full choice set. Rejection suggests IIA violation.

⚠️ The Hausman-McFadden test has poor finite-sample properties and can produce negative test statistics. The Small-Hsiao test offers an alternative. Neither test is definitive. Subject-matter knowledge about alternative similarity remains essential.

7.7 Testing the Proportional Odds Assumption

The Brant test for the proportional odds assumption in ordered logit estimates a separate binary logit for each cumulative split and tests whether the coefficients are equal across splits:

$$H_0: \boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(2)} = \dots = \boldsymbol{\beta}^{(J-1)}$$

A chi-squared test statistic is formed from the sum of squared differences in estimates across cumulative splits, weighted by their precision. Rejection indicates the proportional odds assumption is violated, and a generalised ordered logit or multinomial logit should be considered.

7.8 Robust Standard Errors in Discrete Choice Models

While MLE standard errors are derived from the information matrix, misspecification-robust (sandwich) standard errors are available:

$$\hat{\mathbf{V}}_{sandwich}(\hat{\boldsymbol{\beta}}) = \hat{\mathbf{H}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{H}}^{-1}$$

Where $\hat{\mathbf{H}} = -\sum_i \partial^2\ell_i/\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^T$ and $\hat{\mathbf{B}} = \sum_i (\partial\ell_i/\partial\boldsymbol{\beta})(\partial\ell_i/\partial\boldsymbol{\beta})^T$ (the outer product of scores).


8. Effect Size Measures

8.1 Average Partial Effects (APE / AME)

The primary effect size in discrete choice models is the Average Partial Effect (APE), also called the Average Marginal Effect (AME):

$$APE_k = \frac{1}{n}\sum_{i=1}^n \frac{\partial P(Y_i = j \mid \mathbf{x}_i)}{\partial X_{ik}}$$

Interpretation: The average change in the probability of outcome $j$ associated with a one-unit increase in $X_k$, averaging over all individuals in the sample.

For a binary covariate $D_i \in \{0,1\}$:

$$APE_k = \frac{1}{n}\sum_{i=1}^n \left[P(Y_i=1 \mid D_i=1, \mathbf{x}_{i,-k}) - P(Y_i=1 \mid D_i=0, \mathbf{x}_{i,-k})\right]$$

8.2 Odds Ratios and Relative Risk

| Measure | Formula | Interpretation |
|---|---|---|
| Odds Ratio (OR) | $e^{\hat{\beta}_k}$ | Multiplicative change in odds per unit increase in $X_k$ |
| Relative Risk (RR) | $P(Y=1 \mid X_k+1) / P(Y=1 \mid X_k)$ | Ratio of probabilities; computed at representative values |
| Absolute Risk Reduction (ARR) | $P(Y=1 \mid X_k=1) - P(Y=1 \mid X_k=0)$ | Difference in probabilities for binary $X_k$ |
| Number Needed to Treat (NNT) | $1 / \lvert ARR \rvert$ | Number of individuals treated to yield one additional outcome |

8.3 Predicted Probability Changes

For practical communication, report predicted probabilities at meaningful covariate values:

$$\Delta\hat{P} = \hat{P}(Y_i=1 \mid \mathbf{x}_{high}) - \hat{P}(Y_i=1 \mid \mathbf{x}_{low})$$

Where $\mathbf{x}_{high}$ and $\mathbf{x}_{low}$ represent two substantively meaningful covariate profiles (e.g., high-income vs. low-income; treated vs. untreated).

8.4 Standardised Coefficients in Discrete Choice Models

To compare the relative importance of different covariates, standardise the APE by the standard deviation of the outcome:

$$\beta^*_k = APE_k \times \frac{s_{X_k}}{s_Y}$$

Where $s_{X_k}$ is the standard deviation of $X_k$ and $s_Y = \sqrt{\bar{p}(1-\bar{p})}$ for binary outcomes. This produces an effect size interpretable as the change in probability (in units of the outcome SD) per SD change in $X_k$.

8.5 McFadden's Pseudo-$R^2$ as Effect Size

McFadden's pseudo-$R^2$ measures the proportional improvement in log-likelihood:

$$\rho^2 = 1 - \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell_0}$$

Where $\ell_0$ is the log-likelihood of the intercept-only model. While not a pure effect size, it provides a scale for comparing model fit improvement:

| $\rho^2$ | Interpretation |
|---|---|
| 0.00–0.10 | Poor fit |
| 0.10–0.20 | Acceptable fit |
| 0.20–0.40 | Good fit |
| 0.40+ | Very good fit |

8.6 Willingness to Pay (WTP) as Effect Size in Choice Experiments

In stated or revealed preference studies, WTP contextualises effect sizes economically:

WTPk=APEkAPEcost=γ^kγ^costWTP_k = -\frac{APE_k}{APE_{cost}} = -\frac{\hat{\gamma}_k}{\hat{\gamma}_{cost}}

Report WTP with confidence intervals obtained via the delta method or Krinsky-Robb simulation.


9. Model Fit and Evaluation

9.1 Goodness-of-Fit Statistics

| Statistic | Formula | Description |
|---|---|---|
| Log-likelihood at convergence | $\ell(\hat{\boldsymbol{\beta}})$ | Higher (less negative) is better |
| Null log-likelihood | $\ell_0$ | Baseline (intercept-only) |
| LR chi-squared | $-2(\ell_0 - \ell)$ | Overall model fit test |
| McFadden's $\rho^2$ | $1 - \ell/\ell_0$ | Proportional LL improvement |
| Adjusted McFadden's $\rho^2$ | $1 - (\ell - k)/\ell_0$ | Penalised for parameters |
| AIC | $-2\ell + 2k$ | Lower is better |
| BIC | $-2\ell + k\ln(n)$ | Lower is better; penalises more |
| Count $R^2$ | Correctly classified / $n$ | Naive classification accuracy |
| Hit rate (vs. base) | Count $R^2$ vs. $\max(\bar{p}, 1-\bar{p})$ | Improvement over naive classifier |

9.2 Pseudo-R² Measures

Multiple pseudo-R2R^2 measures exist; they capture different aspects of fit:

McFadden (1974): ρMcFadden2=1(β^)0\rho^2_{McFadden} = 1 - \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell_0}

Cox-Snell: RCS2=1(L0L(β^))2/nR^2_{CS} = 1 - \left(\frac{\mathcal{L}_0}{\mathcal{L}(\hat{\boldsymbol{\beta}})}\right)^{2/n}

Nagelkerke (normalised Cox-Snell): RN2=RCS21L02/nR^2_{N} = \frac{R^2_{CS}}{1 - \mathcal{L}_0^{2/n}}

⚠️ No single pseudo-R2R^2 is universally "correct." Report multiple, and always prefer out-of-sample predictive performance metrics (AUC, Brier score) for evaluating predictive models.

9.3 Classification Metrics for Binary Models

For binary models, at a threshold cc (default c=0.5c = 0.5):

Y^i=1[p^ic]\hat{Y}_i = \mathbf{1}[\hat{p}_i \geq c]

| Metric | Formula | Description |
|---|---|---|
| Accuracy | $(TP + TN)/(TP + TN + FP + FN)$ | Overall correct classification rate |
| Sensitivity (Recall) | $TP/(TP + FN)$ | True positive rate |
| Specificity | $TN/(TN + FP)$ | True negative rate |
| Precision (PPV) | $TP/(TP + FP)$ | Positive predictive value |
| F1 Score | $2 \cdot (\text{Precision} \times \text{Recall})/(\text{Precision} + \text{Recall})$ | Harmonic mean of precision and recall |
| AUC-ROC | Area under ROC curve | Discrimination across all thresholds |

The Receiver Operating Characteristic (ROC) curve plots sensitivity vs. (1specificity)(1-\text{specificity}) across all classification thresholds c[0,1]c \in [0,1]. The Area Under the Curve (AUC) summarises discrimination:

| AUC | Interpretation |
|---|---|
| $0.50$ | No discrimination (random) |
| $0.70 - 0.80$ | Acceptable discrimination |
| $0.80 - 0.90$ | Excellent discrimination |
| $0.90+$ | Outstanding discrimination |
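The AUC can be computed directly from its rank interpretation: it is the probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties counting one half). A small numpy sketch of this Mann-Whitney formulation:

```python
import numpy as np

def auc_mann_whitney(y, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    y = np.asarray(y)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # pairwise comparisons
    ties = (pos[:, None] == neg[None, :]).sum()     # ties count one half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

auc = auc_mann_whitney([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise comparison is $O(n^2)$ and fine for moderate samples; production code would use a sorting-based $O(n \log n)$ variant.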

9.4 Calibration

Calibration assesses whether predicted probabilities match observed outcome rates.

Hosmer-Lemeshow test: Partition observations into GG (typically 10) quantile groups by predicted probability. Compare observed and expected counts in each group:

H=g=1G(Og1Eg1)2Eg1+(Og0Eg0)2Eg0χG22H = \sum_{g=1}^G \frac{(O_g^1 - E_g^1)^2}{E_g^1} + \frac{(O_g^0 - E_g^0)^2}{E_g^0} \sim \chi^2_{G-2}

Where Og1O_g^1 and Eg1E_g^1 are observed and expected counts of Y=1Y=1 in group gg. Rejection suggests poor calibration.

Calibration plot: Plot mean predicted probability vs. observed proportion in each decile group. A well-calibrated model lies along the 45° diagonal.
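A minimal numpy/scipy sketch of the Hosmer-Lemeshow statistic using equal-sized quantile groups (the function name is our own, not a library API; real implementations differ slightly in how they break ties at group boundaries):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow chi-squared statistic over quantile groups of p_hat."""
    y = np.asarray(y)
    p = np.asarray(p_hat, dtype=float)
    order = np.argsort(p)
    y, p = y[order], p[order]
    H = 0.0
    for g in np.array_split(np.arange(len(p)), groups):  # equal-sized groups
        obs1, exp1 = y[g].sum(), p[g].sum()
        obs0, exp0 = len(g) - obs1, len(g) - exp1
        H += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return H, chi2.sf(H, df=groups - 2)

# Simulated outcomes that are well calibrated by construction
rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=500)
y_sim = rng.binomial(1, p_true)
H_stat, p_val = hosmer_lemeshow(y_sim, p_true)
```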

9.5 Information Criteria for Model Comparison

When comparing non-nested models (e.g., logit vs. probit; different covariate sets):

AIC=2(β^)+2kAIC = -2\ell(\hat{\boldsymbol{\beta}}) + 2k BIC=2(β^)+kln(n)BIC = -2\ell(\hat{\boldsymbol{\beta}}) + k\ln(n)

Lower values indicate better fit. BIC imposes a heavier penalty on model complexity, favouring parsimony. AIC and BIC are only directly comparable for models fitted to the same dataset with the same outcome variable.

9.6 Out-of-Sample Validation

For predictive models, always assess performance on held-out data:

- Split the sample into training and test sets (e.g., 70/30); fit on the training set and evaluate AUC, Brier score, and calibration on the test set.
- With smaller samples, use $k$-fold cross-validation (typically $k = 5$ or $10$) and average the metrics across folds.
- Report out-of-sample rather than in-sample classification metrics whenever the model's purpose is prediction.


10. Diagnostics and Assumption Testing

10.1 Residuals in Discrete Choice Models

Unlike OLS, residuals in discrete choice models require careful definition.

Pearson residuals: riP=Yip^ip^i(1p^i)r_i^P = \frac{Y_i - \hat{p}_i}{\sqrt{\hat{p}_i(1-\hat{p}_i)}}

Deviance residuals: riD=sign(Yip^i)2[Yilnp^i+(1Yi)ln(1p^i)]r_i^D = \text{sign}(Y_i - \hat{p}_i)\sqrt{-2\left[Y_i\ln\hat{p}_i + (1-Y_i)\ln(1-\hat{p}_i)\right]}

The deviance (sum of squared deviance residuals) is:

D=i=1n(riD)2=2(β^)D = \sum_{i=1}^n (r_i^D)^2 = -2\ell(\hat{\boldsymbol{\beta}})

Standardised residuals for outlier detection: ristd=riD/1hiir_i^{std} = r_i^D / \sqrt{1 - h_{ii}}, where hiih_{ii} is the hat-value (leverage).
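The definitions above are easy to verify numerically. The sketch below computes both residual types on simulated data and confirms the deviance identity $\sum_i (r_i^D)^2 = -2\ell(\hat{\boldsymbol{\beta}})$:

```python
import numpy as np

def logit_residuals(y, p_hat):
    """Pearson and deviance residuals for a fitted binary choice model."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p_hat, dtype=float)
    pearson = (y - p) / np.sqrt(p * (1 - p))
    deviance = np.sign(y - p) * np.sqrt(
        -2 * (y * np.log(p) + (1 - y) * np.log(1 - p))
    )
    return pearson, deviance

rng = np.random.default_rng(1)
p = rng.uniform(0.1, 0.9, 300)
y = rng.binomial(1, p).astype(float)
r_p, r_d = logit_residuals(y, p)
loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
# sum of squared deviance residuals should equal -2 * loglik
```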

10.2 Influence and Leverage

Leverage in logit/probit: hii=w^ixiT(XTW^X)1xih_{ii} = \hat{w}_i \mathbf{x}_i^T(\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X})^{-1}\mathbf{x}_i

Cook's distance analogue: CDi=(riP)2hiik(1hii)2CD_i = \frac{(r_i^P)^2 h_{ii}}{k(1-h_{ii})^2}

DFFITS and DFBETAS analogues are available for identifying influential observations. Flag observations with CDi>4/n|CD_i| > 4/n or ristd>2|r_i^{std}| > 2 for inspection.

10.3 Testing the Proportional Odds Assumption (Ordered Logit)

Brant (1990) test: Estimates a binary logit for each of the J1J-1 cumulative dichotomisations and tests whether coefficients are equal. Available both as a global test (all covariates) and variable-specific tests:

χBrant2=j=1J2(β^jβ^J1)T[V^j+V^J1]1(β^jβ^J1)\chi^2_{Brant} = \sum_{j=1}^{J-2} (\hat{\boldsymbol{\beta}}_j - \hat{\boldsymbol{\beta}}_{J-1})^T \left[\hat{\mathbf{V}}_j + \hat{\mathbf{V}}_{J-1}\right]^{-1} (\hat{\boldsymbol{\beta}}_j - \hat{\boldsymbol{\beta}}_{J-1})

Graphical check: Plot ordered logit coefficients estimated separately for each binary cumulative split. Coefficients that vary substantially suggest a violation.

Remedy if violated:

- Fit the Generalised Ordered Logit or the partial proportional odds model (Section 12.3), relaxing the equal-coefficients constraint only for the offending covariates.
- As a last resort, fit a multinomial logit, at the cost of discarding the ordering information.

10.4 Testing IIA

Multiple tests for IIA are available, each with limitations:

| Test | Method | Reference | Limitations |
|---|---|---|---|
| Hausman-McFadden | Compare restricted vs. full estimates | Hausman & McFadden (1984) | Can yield negative test statistic |
| Small-Hsiao | Random sample split + comparison | Small & Hsiao (1985) | Sample-split dependent |
| Swait-Louviere | Scaling test across datasets | Swait & Louviere (1993) | Requires two datasets |

Remedy if IIA fails:

- Fit a Nested Logit, grouping close substitutes into nests (Section 13.1).
- Fit a Mixed Logit, which allows flexible correlation across alternatives (Section 13.3).

10.5 Checking for Complete Separation

Complete separation occurs when a covariate or linear combination of covariates perfectly predicts the outcome — the MLE does not exist (the log-likelihood has no finite maximum):

Detection: MLE algorithm fails to converge; extremely large coefficient estimates with very large standard errors; implausible predicted probabilities near 0 or 1.

Remedies:

- Firth's penalised likelihood (bias-reduced logistic regression), which always yields finite estimates.
- Remove or recode the offending covariate (e.g., collapse sparse categories).
- Exact logistic regression for small samples, or Bayesian estimation with weakly informative priors.

10.6 Heteroscedasticity in Probit (Heteroscedastic Probit)

In the standard probit, Var(ϵi)=1\text{Var}(\epsilon_i) = 1 for all ii. If the true error variance is heteroscedastic:

Var(ϵi)=σi2=[ehiTδ]2\text{Var}(\epsilon_i) = \sigma_i^2 = [e^{\mathbf{h}_i^T\boldsymbol{\delta}}]^2

The heteroscedastic probit models:

P(Yi=1xi,hi)=Φ(xiTβehiTδ)P(Y_i = 1 \mid \mathbf{x}_i, \mathbf{h}_i) = \Phi\left(\frac{\mathbf{x}_i^T\boldsymbol{\beta}}{e^{\mathbf{h}_i^T\boldsymbol{\delta}}}\right)

Standard probit estimates are inconsistent under heteroscedasticity (unlike OLS which remains consistent, only losing efficiency). The linktest (adding the squared predicted index as a covariate) checks for systematic misspecification.

10.7 Goodness-of-Link Tests

The linktest (Pregibon, 1980) adds η^i2=(xiTβ^)2\hat{\eta}_i^2 = (\mathbf{x}_i^T\hat{\boldsymbol{\beta}})^2 as an additional regressor to the fitted model:

P(Yi=1)=F(η^iδ1+η^i2δ2)P(Y_i=1) = F(\hat{\eta}_i \cdot \delta_1 + \hat{\eta}_i^2 \cdot \delta_2)

Under correct specification, δ^2=0\hat{\delta}_2 = 0 (the squared term should not be significant). A significant δ^2\hat{\delta}_2 indicates link function misspecification or omitted non-linear terms.


11. Extensions: Multinomial and Conditional Logit

11.1 MNL Log-Likelihood

For outcome Yi{1,,J}Y_i \in \{1, \dots, J\} with reference category j=1j=1, the MNL log-likelihood:

({βj}j=2J)=i=1nj=1J1[Yi=j]lnP(Yi=jxi)\ell(\{\boldsymbol{\beta}_j\}_{j=2}^J) = \sum_{i=1}^n \sum_{j=1}^J \mathbf{1}[Y_i=j] \ln P(Y_i=j \mid \mathbf{x}_i)

=i=1n[j=2J1[Yi=j]xiTβjln(1+k=2JexiTβk)]= \sum_{i=1}^n \left[\sum_{j=2}^J \mathbf{1}[Y_i=j]\mathbf{x}_i^T\boldsymbol{\beta}_j - \ln\left(1 + \sum_{k=2}^J e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}\right)\right]

11.2 Marginal Effects in MNL

For the MNL, the marginal effect of XkX_k on P(Yi=j)P(Y_i = j):

P(Yi=j)Xk=P(Yi=j)[βjkl=1JP(Yi=l)βlk]\frac{\partial P(Y_i = j)}{\partial X_k} = P(Y_i=j)\left[\beta_{jk} - \sum_{l=1}^J P(Y_i=l)\beta_{lk}\right]

Note: Cross-effects — the effect of a covariate on a different category's probability — may be positive or negative, depending on model parameters.

Average Partial Effect: APEjk=1ni=1nP^ij[β^jkl=1JP^ilβ^lk]APE_{jk} = \frac{1}{n}\sum_{i=1}^n \hat{P}_{ij}\left[\hat{\beta}_{jk} - \sum_{l=1}^J \hat{P}_{il}\hat{\beta}_{lk}\right]
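A vectorised numpy sketch of the APE formula above, given a matrix of fitted category probabilities and one covariate's coefficients (the probabilities below are synthetic stand-ins for fitted MNL output):

```python
import numpy as np

def mnl_ape(P, beta_k):
    """Average partial effects of one covariate on each category probability.

    P      : (n, J) fitted category probabilities
    beta_k : (J,) coefficient of the covariate in each category's equation
             (0 for the reference category)."""
    weighted = P @ beta_k                              # sum_l P_il * beta_lk
    effects = P * (beta_k[None, :] - weighted[:, None])  # observation-level effects
    return effects.mean(axis=0)

rng = np.random.default_rng(2)
V = rng.normal(size=(200, 3))
V[:, 0] = 0.0                                          # reference category utility 0
P = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)
ape = mnl_ape(P, np.array([0.0, 0.4, -0.2]))
```

Because probabilities sum to one, the APEs across categories sum to zero, which is a useful internal check on any implementation.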

11.3 The Conditional Logit and Mixed-Effects Specification

The full Mixed Logit specification that includes both individual-varying and alternative-varying attributes:

Vij=zijTγalternative attributes+xiTβjindividual chars. × alt. FEV_{ij} = \underbrace{\mathbf{z}_{ij}^T\boldsymbol{\gamma}}_{\text{alternative attributes}} + \underbrace{\mathbf{x}_i^T\boldsymbol{\beta}_j}_{\text{individual chars. × alt. FE}}

Where:

- $\mathbf{z}_{ij}$ are alternative-varying attributes (e.g., price, travel time), entering with a common coefficient vector $\boldsymbol{\gamma}$;
- $\mathbf{x}_i$ are individual characteristics, entering with alternative-specific coefficients $\boldsymbol{\beta}_j$ (normalised to zero for the reference alternative).

11.4 Marginal Effects on Log-Odds (MNL)

The log-odds of choosing jj vs. reference category 11:

lnP(Yi=j)P(Yi=1)=xiTβj\ln\frac{P(Y_i = j)}{P(Y_i = 1)} = \mathbf{x}_i^T\boldsymbol{\beta}_j

XklnP(Yi=j)P(Yi=1)=βjk\frac{\partial}{\partial X_k}\ln\frac{P(Y_i=j)}{P(Y_i=1)} = \beta_{jk}

This is the most directly interpretable quantity from the MNL regression output: βjk\beta_{jk} is the effect of XkX_k on the log-odds of jj vs. reference.

11.5 Substitution Patterns and the IIA Implication

Under IIA, the own-price elasticity and cross-price elasticity have rigid implications:

Own elasticity: εjjk=PjzjkzjkPj=γkzjk(1Pj)\varepsilon_{jj}^k = \frac{\partial P_j}{\partial z_{jk}}\frac{z_{jk}}{P_j} = \gamma_k z_{jk}(1 - P_j)

Cross elasticity (between alternatives jj and ll): εjlk=PjzlkzlkPj=γkzlkPl\varepsilon_{jl}^k = \frac{\partial P_j}{\partial z_{lk}}\frac{z_{lk}}{P_j} = -\gamma_k z_{lk} P_l

Under IIA, the cross elasticity is the same for all jlj \neq l — a strong and often unrealistic restriction. The cross elasticity depends only on the attribute level and share of the alternative being changed, not on the similarity between alternatives jj and ll.


12. Extensions: Ordered Choice Models

12.1 The Ordered Logit (Proportional Odds Model)

Recall from Section 3.6 the latent variable:

Yi=xiTβ+ϵi,ϵiLogisticY_i^* = \mathbf{x}_i^T\boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim \text{Logistic}

The ordered logit log-likelihood:

(β,τ)=i=1nj=1J1[Yi=j]ln[Λ(τjxiTβ)Λ(τj1xiTβ)]\ell(\boldsymbol{\beta}, \boldsymbol{\tau}) = \sum_{i=1}^n \sum_{j=1}^J \mathbf{1}[Y_i=j] \ln\left[\Lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta}) - \Lambda(\tau_{j-1} - \mathbf{x}_i^T\boldsymbol{\beta})\right]

Subject to τ0=\tau_0 = -\infty, τJ=+\tau_J = +\infty, and τ1<τ2<<τJ1\tau_1 < \tau_2 < \dots < \tau_{J-1}.

The coefficient vector β\boldsymbol{\beta} and thresholds τ\boldsymbol{\tau} are estimated jointly.

12.2 Marginal Effects in Ordered Models

For a continuous covariate XkX_k, the marginal effect on P(Yi=j)P(Y_i = j):

P(Yi=j)Xk=β^k[λ(τ^jxiTβ^)λ(τ^j1xiTβ^)]\frac{\partial P(Y_i=j)}{\partial X_k} = -\hat{\beta}_k\left[\lambda(\hat{\tau}_j - \mathbf{x}_i^T\hat{\boldsymbol{\beta}}) - \lambda(\hat{\tau}_{j-1} - \mathbf{x}_i^T\hat{\boldsymbol{\beta}})\right]

Where λ()=Λ()[1Λ()]\lambda(\cdot) = \Lambda(\cdot)[1-\Lambda(\cdot)] is the logistic PDF.

Key observation: For the highest category (j=Jj=J) and lowest category (j=1j=1), the signs are: P(Yi=J)Xk=β^kλ(τ^J1xiTβ^)>0 if β^k>0\frac{\partial P(Y_i = J)}{\partial X_k} = \hat{\beta}_k \lambda(\hat{\tau}_{J-1} - \mathbf{x}_i^T\hat{\boldsymbol{\beta}}) > 0 \text{ if } \hat{\beta}_k > 0 P(Yi=1)Xk=β^kλ(τ^1xiTβ^)<0 if β^k>0\frac{\partial P(Y_i = 1)}{\partial X_k} = -\hat{\beta}_k \lambda(\hat{\tau}_1 - \mathbf{x}_i^T\hat{\boldsymbol{\beta}}) < 0 \text{ if } \hat{\beta}_k > 0

For middle categories, the sign depends on parameter values — effects on middle categories can go either way even when the overall latent variable effect is unambiguous.

12.3 Generalised Ordered Logit

When the proportional odds assumption is violated, the Generalised Ordered Logit allows β\boldsymbol{\beta} to vary across thresholds:

P(Yi>jxi)=Λ(xiTβjτj),j=1,,J1P(Y_i > j \mid \mathbf{x}_i) = \Lambda(\mathbf{x}_i^T\boldsymbol{\beta}_j - \tau_j), \quad j = 1, \dots, J-1

The partial proportional odds model constrains some coefficients to be equal across thresholds (for covariates satisfying PO) and allows others to vary:

P(Yi>j)=Λ(x1iTβ+x2iTγjτj)P(Y_i > j) = \Lambda(\mathbf{x}_{1i}^T\boldsymbol{\beta} + \mathbf{x}_{2i}^T\boldsymbol{\gamma}_j - \tau_j)

Where β\boldsymbol{\beta} is common across thresholds and γj\boldsymbol{\gamma}_j varies.

12.4 Ordered Probit

The Ordered Probit replaces the logistic with the normal CDF:

P(Yijxi)=Φ(τjxiTβ)P(Y_i \leq j \mid \mathbf{x}_i) = \Phi(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta})

Marginal effects are analogous, replacing λ\lambda with ϕ\phi (the standard normal PDF):

P(Yi=j)Xk=β^k[ϕ(τ^jxiTβ^)ϕ(τ^j1xiTβ^)]\frac{\partial P(Y_i = j)}{\partial X_k} = -\hat{\beta}_k\left[\phi(\hat{\tau}_j - \mathbf{x}_i^T\hat{\boldsymbol{\beta}}) - \phi(\hat{\tau}_{j-1} - \mathbf{x}_i^T\hat{\boldsymbol{\beta}})\right]
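A short sketch of the ordered probit probabilities and marginal effects, exploiting the telescoping structure of the thresholds (the index value, thresholds, and coefficient are illustrative):

```python
import numpy as np
from scipy.stats import norm

def ordered_probit_effects(x_beta, thresholds, beta_k):
    """Category probabilities and marginal effects of one covariate at a
    given value of the fitted index x_beta (ordered probit)."""
    tau = np.concatenate(([-np.inf], thresholds, [np.inf]))
    cdf = norm.cdf(tau - x_beta)        # Phi(tau_j - eta), j = 0, ..., J
    pdf = norm.pdf(tau - x_beta)        # phi(...); 0 at +/- infinity
    probs = np.diff(cdf)                # P(Y = j)
    marg = -beta_k * np.diff(pdf)       # -beta_k [phi(tau_j - eta) - phi(tau_{j-1} - eta)]
    return probs, marg

probs, marg = ordered_probit_effects(x_beta=0.3, thresholds=[-1.0, 0.5, 1.8], beta_k=0.6)
```

With $\beta_k > 0$ the effect on the lowest category is negative and on the highest positive, matching the sign results in Section 12.2; the effects also sum to zero across categories.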


13. Extensions: Nested Logit and Mixed Logit

13.1 The Nested Logit: Addressing IIA

The Nested Logit relaxes IIA by grouping alternatives into nests within which alternatives are correlated substitutes. The choice probability decomposes into:

$P(Y_i = j) = \underbrace{P(j \mid \text{nest } m)}_{\text{within-nest choice}} \times \underbrace{P(\text{nest } m)}_{\text{nest choice}}$

The dissimilarity parameter $\lambda_m \in (0,1]$ governs the correlation within nest $m$:

- $\lambda_m = 1$: no within-nest correlation; the model collapses to the standard MNL.
- $\lambda_m \to 0$: within-nest correlation approaches one; alternatives in the nest become near-perfect substitutes.

The inclusive value Iim=lnjBmeVij/λmI_{im} = \ln\sum_{j \in B_m}e^{V_{ij}/\lambda_m} summarises the attractiveness of nest mm, allowing it to influence the nest-level choice.

Utility consistency: The Nested Logit is RUM-consistent if and only if λm(0,1]\lambda_m \in (0,1] for all nests. If λ^m>1\hat{\lambda}_m > 1, it signals misspecification or incorrect nesting structure.

13.2 Estimation of the Nested Logit

Sequential (limited information) estimation:

  1. Estimate the within-nest model parameters by fitting a conditional logit within each nest.
  2. Compute the inclusive values I^im\hat{I}_{im}.
  3. Estimate the nest-level model using I^im\hat{I}_{im} as a covariate.

Full information MLE: Simultaneously maximise the full nested logit log-likelihood:

=i=1nlnP(Yi=ji)=i=1n[lnP(jiBmi)+lnP(Bmi)]\ell = \sum_{i=1}^n \ln P(Y_i = j_i) = \sum_{i=1}^n \left[\ln P(j_i \mid B_{m_i}) + \ln P(B_{m_i})\right]

Full information MLE is preferred because it is asymptotically efficient. Sequential estimation is easier to implement, but its second-step standard errors must be corrected for the estimated inclusive values, and the resulting estimates, while consistent, are less efficient.

13.3 The Mixed Logit: Flexible Preferences

The Mixed Logit approximates virtually any random utility model by allowing random coefficients:

βi=μ+Lηi,ηiN(0,I)\boldsymbol{\beta}_i = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{\eta}_i, \quad \boldsymbol{\eta}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})

Where μ\boldsymbol{\mu} are mean preferences and LLT=Σ\mathbf{L}\mathbf{L}^T = \boldsymbol{\Sigma} captures preference heterogeneity and cross-alternative error correlation.

Key advantages over MNL:

  1. No IIA: Correlation across alternatives via Σ\boldsymbol{\Sigma}.
  2. Preference heterogeneity: Estimates the distribution of preferences, not just the mean.
  3. Panel data: Handles repeated choices by the same individual via the mixing distribution.
  4. Flexible substitution: Allows realistic substitution patterns.

13.4 Simulation-Based Estimation for Mixed Logit

Since P(Yi=j)=Lij(β)f(β)dβP(Y_i = j) = \int L_{ij}(\boldsymbol{\beta})f(\boldsymbol{\beta})d\boldsymbol{\beta} has no closed form, use simulation:

Simulated Maximum Likelihood (SML):

P~(Yi=j)=1Rr=1RexijTβ(r)k=1JexikTβ(r)\tilde{P}(Y_i = j) = \frac{1}{R}\sum_{r=1}^R \frac{e^{\mathbf{x}_{ij}^T\boldsymbol{\beta}^{(r)}}}{\sum_{k=1}^J e^{\mathbf{x}_{ik}^T\boldsymbol{\beta}^{(r)}}}

Where β(r)\boldsymbol{\beta}^{(r)} are draws from the assumed mixing distribution. The SML estimator maximises:

~(μ,Σ)=i=1nlnP~(Yi=ji)\tilde{\ell}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{i=1}^n \ln \tilde{P}(Y_i = j_i)

Quasi-Monte Carlo (Halton sequences): Replace pseudo-random draws with Halton sequences — low-discrepancy sequences that cover the integration domain more uniformly, typically reducing simulation variance by a factor of 10–100 compared to random sampling, requiring far fewer draws (typically R=100500R = 100-500 is sufficient).

Bayesian MCMC: An alternative to SML, using Markov Chain Monte Carlo to sample from the posterior distribution of parameters and individual-specific coefficients simultaneously.

13.5 Recovering Individual-Level Preferences

A key advantage of the Mixed Logit is the ability to estimate individual-specific coefficients using Bayes' theorem:

f(βiYi,Xi)=P(YiXi,βi)f(βiμ,Σ)P(YiXi,β)f(βμ,Σ)dβf(\boldsymbol{\beta}_i \mid \mathbf{Y}_i, \mathbf{X}_i) = \frac{P(\mathbf{Y}_i \mid \mathbf{X}_i, \boldsymbol{\beta}_i)f(\boldsymbol{\beta}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\int P(\mathbf{Y}_i \mid \mathbf{X}_i, \boldsymbol{\beta})f(\boldsymbol{\beta} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})d\boldsymbol{\beta}}

The posterior mean:

β~i=E[βiYi,Xi]rβ(r)tLijt(β(r))rtLijt(β(r))\tilde{\boldsymbol{\beta}}_i = E[\boldsymbol{\beta}_i \mid \mathbf{Y}_i, \mathbf{X}_i] \approx \frac{\sum_r \boldsymbol{\beta}^{(r)}\prod_t L_{ij_t}(\boldsymbol{\beta}^{(r)})}{\sum_r \prod_t L_{ij_t}(\boldsymbol{\beta}^{(r)})}

These conditional means reveal individual-level taste heterogeneity and are used for market segmentation and personalised prediction.


14. Extensions: Panel Data Discrete Choice

14.1 The Challenge: Incidental Parameters Problem

In panel data, each individual i=1,,ni = 1, \dots, n makes choices across TT time periods. The natural extension of binary logit to panel data with fixed effects:

P(Yit=1xit,αi)=Λ(αi+xitTβ)P(Y_{it}=1 \mid \mathbf{x}_{it}, \alpha_i) = \Lambda(\alpha_i + \mathbf{x}_{it}^T\boldsymbol{\beta})

Where $\alpha_i$ is an individual fixed effect. The incidental parameters problem arises because:

- The number of nuisance parameters $\alpha_1, \dots, \alpha_n$ grows with the sample size $n$.
- With fixed $T$, each $\alpha_i$ is estimated from only $T$ observations, so the $\hat{\alpha}_i$ are inconsistent.
- Unlike the linear FE model, this inconsistency contaminates $\hat{\boldsymbol{\beta}}$: for $T = 2$, the unconditional FE logit MLE converges to $2\boldsymbol{\beta}$ rather than $\boldsymbol{\beta}$.

14.2 Conditional Fixed Effects Logit (Chamberlain, 1980)

Chamberlain's conditional logit solves the incidental parameters problem for binary logit by conditioning on the sufficient statistic for αi\alpha_i — the individual's total number of successes Si=tYitS_i = \sum_t Y_{it}:

P(Yi1,,YiTSi,Xi)=exp(tYitxitTβ)dC(Si)exp(tdtxitTβ)P(Y_{i1}, \dots, Y_{iT} \mid S_i, \mathbf{X}_i) = \frac{\exp\left(\sum_t Y_{it}\mathbf{x}_{it}^T\boldsymbol{\beta}\right)}{\sum_{\mathbf{d} \in \mathcal{C}(S_i)}\exp\left(\sum_t d_t\mathbf{x}_{it}^T\boldsymbol{\beta}\right)}

Where C(Si)\mathcal{C}(S_i) is the set of all binary sequences with SiS_i ones (the conditioning set).

Key properties:

- Consistent for $\boldsymbol{\beta}$ as $n \to \infty$ with $T$ fixed.
- Individuals with $S_i = 0$ or $S_i = T$ (no within-person variation in $Y$) contribute no information and drop out.
- $\alpha_i$ is never estimated, so predicted probabilities and average partial effects are not identified.
- Effects of time-invariant covariates cannot be estimated; they cancel in the conditioning.

14.3 Random Effects Probit

When the fixed effects approach is too restrictive (e.g., with time-invariant covariates), the random effects probit assumes:

αiN(0,σα2)\alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2)

P(Yit=1xit,αi)=Φ(αi+xitTβ)P(Y_{it}=1 \mid \mathbf{x}_{it}, \alpha_i) = \Phi(\alpha_i + \mathbf{x}_{it}^T\boldsymbol{\beta})

The marginal log-likelihood integrates out αi\alpha_i:

$P(Y_{i1}, \dots, Y_{iT} \mid \mathbf{X}_i) = \int \prod_{t=1}^T \Phi\left(\alpha_i + \mathbf{x}_{it}^T\boldsymbol{\beta}\right)^{Y_{it}}\Phi\left(-(\alpha_i + \mathbf{x}_{it}^T\boldsymbol{\beta})\right)^{1-Y_{it}} \phi\left(\frac{\alpha_i}{\sigma_\alpha}\right)\frac{d\alpha_i}{\sigma_\alpha}$

This integral is computed via Gauss-Hermite quadrature.
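A sketch of one individual's likelihood contribution under Gauss-Hermite quadrature, using the change of variables $\alpha_i = \sqrt{2}\,\sigma_\alpha u$ so the weight function matches $e^{-u^2}$ (the outcome sequence and covariates below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def re_probit_lik_i(y_it, X_it, beta, sigma_alpha, nodes=20):
    """Individual likelihood for the random effects probit, integrating
    out alpha_i ~ N(0, sigma_alpha^2) by Gauss-Hermite quadrature."""
    u, w = np.polynomial.hermite.hermgauss(nodes)   # nodes/weights for e^{-u^2}
    alpha = np.sqrt(2.0) * sigma_alpha * u          # change of variables
    eta = X_it @ beta                               # (T,) fitted indices
    z = eta[None, :] + alpha[:, None]               # (nodes, T)
    per_t = np.where(y_it[None, :] == 1, norm.cdf(z), norm.cdf(-z))
    return np.sum(w * per_t.prod(axis=1)) / np.sqrt(np.pi)

y_it = np.array([1, 0, 1, 1])
X_it = np.column_stack([np.ones(4), [0.2, -0.1, 0.4, 0.3]])
lik = re_probit_lik_i(y_it, X_it, beta=np.array([0.1, 0.8]), sigma_alpha=1.0)
```

Setting $\sigma_\alpha = 0$ collapses the integral to the plain product of probit probabilities, which is a convenient correctness check.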

Mundlak-Chamberlain (Correlated RE): Relax the random effects independence assumption by including individual-level means of time-varying covariates:

αi=xˉiTδ+vi,viN(0,σv2)\alpha_i = \bar{\mathbf{x}}_i^T\boldsymbol{\delta} + v_i, \quad v_i \sim \mathcal{N}(0,\sigma_v^2)

This allows correlation between αi\alpha_i and xit\mathbf{x}_{it}, approximating the FE estimator while retaining the ability to estimate effects of time-invariant variables.

14.4 Dynamic Panel Discrete Choice

State dependence refers to the direct causal effect of past choices on current choices:

P(Yit=1xit,Yi,t1,αi)=Λ(αi+xitTβ+ρYi,t1)P(Y_{it}=1 \mid \mathbf{x}_{it}, Y_{i,t-1}, \alpha_i) = \Lambda(\alpha_i + \mathbf{x}_{it}^T\boldsymbol{\beta} + \rho Y_{i,t-1})

Where ρ\rho captures structural state dependence (e.g., habit formation, switching costs).

The initial conditions problem (Heckman, 1981): The initial observation Yi1Y_{i1} is correlated with αi\alpha_i because it depends on the pre-sample history. Ignoring this causes inconsistency.

Wooldridge (2005) solution: Specify the distribution of the unobserved effect conditional on the initial condition and the covariate history:

$\alpha_i = \psi_0 + \psi_1 Y_{i1} + \bar{\mathbf{x}}_i^T\boldsymbol{\psi}_2 + v_i, \quad v_i \sim \mathcal{N}(0, \sigma_v^2)$

Estimation then proceeds as a standard random effects model over $t = 2, \dots, T$, with $Y_{i1}$ and the Mundlak-Chamberlain means $\bar{\mathbf{x}}_i$ entering as additional regressors.


15. Using the Discrete Choice Component

The Discrete Choice Models component in the DataStatPro application provides a comprehensive workflow for specification, estimation, testing, and visualisation of all major discrete choice model families.

Step-by-Step Guide

Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should contain:

Step 2 — Select Model Family Choose the discrete choice model specification:

Step 3 — Select Variables Map the required variables from your dataset:

Step 4 — Specify Reference Categories For multinomial and conditional logit, set the reference alternative (default: first category in alphabetical order). For ordered models, verify the ordering of categories.

Step 5 — Configure Nesting Structure (Nested Logit) Assign each alternative to a nest:

Step 6 — Configure Random Parameters (Mixed Logit) For each covariate, specify whether the coefficient is:

Set the number of Halton draws (default: 500) and whether to use antithetic draws for variance reduction.

Step 7 — Configure Fixed Effects (Panel Models)

Step 8 — Configure Standard Errors

Step 9 — Configure Marginal Effects Select which partial effects to report:

Step 10 — Select Display Options Choose which outputs to display:

Step 11 — Run the Analysis Click "Run Discrete Choice Model". The application will:

  1. Validate data format and variable types; convert to appropriate structure if needed.
  2. Initialise parameters (using linear probability model or random starting values).
  3. Maximise the log-likelihood using Newton-Raphson / BFGS / IRLS.
  4. Compute variance-covariance matrix (information matrix or sandwich).
  5. Compute all selected marginal effects with delta method SEs.
  6. Run specified diagnostic tests (linktest, Brant, IIA Hausman).
  7. Generate all selected visualisations and tables.

16. Computational and Formula Details

16.1 Binary Logit MLE: Step-by-Step

Step 1: Initialise parameters

β(0)=0k×1(or OLS estimates as warm start)\boldsymbol{\beta}^{(0)} = \mathbf{0}_{k\times 1} \quad \text{(or OLS estimates as warm start)}

Step 2: Compute fitted probabilities

p^i(t)=Λ(xiTβ(t))=exiTβ(t)1+exiTβ(t)\hat{p}_i^{(t)} = \Lambda(\mathbf{x}_i^T\boldsymbol{\beta}^{(t)}) = \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}^{(t)}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}^{(t)}}}

Step 3: Compute score and Hessian

s(t)=i=1n(Yip^i(t))xi=XT(yp^(t))\mathbf{s}^{(t)} = \sum_{i=1}^n (Y_i - \hat{p}_i^{(t)})\mathbf{x}_i = \mathbf{X}^T(\mathbf{y} - \hat{\mathbf{p}}^{(t)})

H(t)=i=1np^i(t)(1p^i(t))xixiT=XTW^(t)X\mathbf{H}^{(t)} = -\sum_{i=1}^n \hat{p}_i^{(t)}(1-\hat{p}_i^{(t)})\mathbf{x}_i\mathbf{x}_i^T = -\mathbf{X}^T\hat{\mathbf{W}}^{(t)}\mathbf{X}

Step 4: Newton-Raphson update

β(t+1)=β(t)+(XTW^(t)X)1XT(yp^(t))\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \left(\mathbf{X}^T\hat{\mathbf{W}}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{p}}^{(t)})

Step 5: Check convergence

β(t+1)β(t)2<εtol(default: εtol=108)\|\boldsymbol{\beta}^{(t+1)} - \boldsymbol{\beta}^{(t)}\|_2 < \varepsilon_{tol} \quad \text{(default: } \varepsilon_{tol} = 10^{-8}\text{)}

Step 6: Compute variance-covariance matrix

V^(β^)=(XTW^X)1\hat{\mathbf{V}}(\hat{\boldsymbol{\beta}}) = \left(\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X}\right)^{-1}
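Steps 1-6 translate directly into a few lines of numpy. This is a sketch of the algorithm above on simulated data, not the application's internal implementation:

```python
import numpy as np

def logit_newton(X, y, tol=1e-8, max_iter=100):
    """Binary logit MLE by Newton-Raphson, following Steps 1-6."""
    beta = np.zeros(X.shape[1])                     # Step 1: initialise
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))         # Step 2: fitted probabilities
        W = p * (1 - p)                             # Step 3: IRLS weights
        score = X.T @ (y - p)
        H = (X * W[:, None]).T @ X                  # X' W X (negative of Hessian)
        step = np.linalg.solve(H, score)            # Step 4: Newton update
        beta = beta + step
        if np.linalg.norm(step) < tol:              # Step 5: convergence check
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1 - p)
    vcov = np.linalg.inv((X * W[:, None]).T @ X)    # Step 6: (X' W X)^{-1}
    return beta, vcov

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_hat, vcov = logit_newton(X, y)
```

Because the logit log-likelihood is globally concave, plain Newton-Raphson converges in a handful of iterations from $\boldsymbol{\beta}^{(0)} = \mathbf{0}$ on well-behaved data.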

16.2 Average Partial Effects: Full Computation

For binary logit, continuous covariate XkX_k:

APE^k=1ni=1np^i(1p^i)β^k\widehat{APE}_k = \frac{1}{n}\sum_{i=1}^n \hat{p}_i(1-\hat{p}_i)\hat{\beta}_k

Gradient for delta method SE:

APE^kβ^l=1ni=1np^i(1p^i)[1[l=k]+β^k(12p^i)xil]\frac{\partial \widehat{APE}_k}{\partial \hat{\beta}_l} = \frac{1}{n}\sum_{i=1}^n \hat{p}_i(1-\hat{p}_i)\left[\mathbf{1}[l=k] + \hat{\beta}_k(1-2\hat{p}_i)x_{il}\right]

SE^(APE^k)=gkTV^(β^)gk\widehat{SE}(\widehat{APE}_k) = \sqrt{\mathbf{g}_k^T\hat{\mathbf{V}}(\hat{\boldsymbol{\beta}})\mathbf{g}_k}

For binary logit, binary covariate DkD_k:

APE^k=1ni=1n[Λ(η^i+(1dik)β^k)Λ(η^idikβ^k)]\widehat{APE}_k = \frac{1}{n}\sum_{i=1}^n \left[\Lambda(\hat{\eta}_i + (1-d_{ik})\hat{\beta}_k) - \Lambda(\hat{\eta}_i - d_{ik}\hat{\beta}_k)\right]

Where η^i=xiTβ^\hat{\eta}_i = \mathbf{x}_i^T\hat{\boldsymbol{\beta}} is the fitted index and dikd_{ik} is the observed value of DkD_k for individual ii.
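A numpy sketch combining the APE and delta-method gradient formulas above; the coefficient vector and covariance matrix are illustrative placeholders for fitted-model output:

```python
import numpy as np

def ape_continuous(X, beta, vcov, k):
    """APE of continuous covariate k in a binary logit, with delta-method SE."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    lam = p * (1 - p)                           # logistic density at the index
    ape = np.mean(lam) * beta[k]
    K = X.shape[1]
    grad = np.empty(K)                          # gradient of APE w.r.t. each beta_l
    for l in range(K):
        ind = 1.0 if l == k else 0.0
        grad[l] = np.mean(lam * (ind + beta[k] * (1 - 2 * p) * X[:, l]))
    se = np.sqrt(grad @ vcov @ grad)
    return ape, se

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta = np.array([-0.3, 0.7])
vcov = np.array([[0.010, 0.001], [0.001, 0.015]])
ape, se = ape_continuous(X, beta, vcov, k=1)
```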

16.3 Multinomial Logit: Score and Hessian

For the MNL with JJ alternatives and reference j=1j=1:

Score for category jj (j2j \geq 2):

βj=i=1n(1[Yi=j]P^ij)xi\frac{\partial \ell}{\partial \boldsymbol{\beta}_j} = \sum_{i=1}^n \left(\mathbf{1}[Y_i=j] - \hat{P}_{ij}\right)\mathbf{x}_i

Hessian blocks:

2βjβjT=i=1nP^ij(1P^ij)xixiT\frac{\partial^2 \ell}{\partial \boldsymbol{\beta}_j\partial\boldsymbol{\beta}_j^T} = -\sum_{i=1}^n \hat{P}_{ij}(1-\hat{P}_{ij})\mathbf{x}_i\mathbf{x}_i^T

2βjβkT=i=1nP^ijP^ikxixiT(jk)\frac{\partial^2 \ell}{\partial \boldsymbol{\beta}_j\partial\boldsymbol{\beta}_k^T} = \sum_{i=1}^n \hat{P}_{ij}\hat{P}_{ik}\mathbf{x}_i\mathbf{x}_i^T \quad (j \neq k)

The full Hessian is block-structured and negative definite, ensuring global concavity of the MNL log-likelihood.

16.4 Ordered Logit: Score and Threshold Constraints

Score for β\boldsymbol{\beta}:

$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_{i=1}^n \sum_{j=1}^J \mathbf{1}[Y_i=j] \left[\frac{\lambda(\tau_{j-1} - \mathbf{x}_i^T\boldsymbol{\beta}) - \lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta})}{P(Y_i=j)}\right]\mathbf{x}_i$

Score for threshold τj\tau_j:

τj=i=1n[1[Yi=j]λ(τjxiTβ)P(Yi=j)1[Yi=j+1]λ(τjxiTβ)P(Yi=j+1)]\frac{\partial \ell}{\partial \tau_j} = \sum_{i=1}^n \left[\frac{\mathbf{1}[Y_i=j]\lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta})}{P(Y_i=j)} - \frac{\mathbf{1}[Y_i=j+1]\lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta})}{P(Y_i=j+1)}\right]

Thresholds are constrained to be strictly ordered. In practice, use the unconstrained re-parameterisation $\tau_j = \tau_1 + \sum_{l=2}^j e^{\delta_l}$, with each $\delta_l \in \mathbb{R}$ freely estimated; the exponentials are strictly positive, so $\tau_1 < \tau_2 < \dots < \tau_{J-1}$ holds automatically.

16.5 Nested Logit: Full Information MLE

The nested logit log-likelihood for individual ii choosing alternative jj^* in nest mm^*:

i=lnP(jm)+lnP(m)=Vijλmln(kBmeVik/λm)+λmIim+Wimln(leλlIil+Wil)\ell_i = \ln P(j^* \mid m^*) + \ln P(m^*) = \frac{V_{ij^*}}{\lambda_{m^*}} - \ln\left(\sum_{k \in B_{m^*}} e^{V_{ik}/\lambda_{m^*}}\right) + \lambda_{m^*} I_{im^*} + W_{im^*} - \ln\left(\sum_l e^{\lambda_l I_{il} + W_{il}}\right)

The gradient with respect to λm\lambda_m requires the chain rule through the inclusive value IimI_{im} and involves kBm(Vik/λm2)(PikmPikm2)\sum_{k \in B_m} (V_{ik}/\lambda_m^2)(P_{ik|m} - P_{ik|m}^2).

16.6 Conditional Fixed Effects Logit: Computation

For individual ii with Si=tYitS_i = \sum_t Y_{it} successes across TT periods, the conditional log-likelihood contribution is:

iCL=tYitxitTβln(dC(Si)etdtxitTβ)\ell_i^{CL} = \sum_t Y_{it}\mathbf{x}_{it}^T\boldsymbol{\beta} - \ln\left(\sum_{\mathbf{d} \in \mathcal{C}(S_i)} e^{\sum_t d_t\mathbf{x}_{it}^T\boldsymbol{\beta}}\right)

For T=2T=2 and Si=1S_i=1, this simplifies to:

iCL=Yi1(xi1xi2)Tβln(1+e(xi1xi2)Tβ)\ell_i^{CL} = Y_{i1}(\mathbf{x}_{i1} - \mathbf{x}_{i2})^T\boldsymbol{\beta} - \ln\left(1 + e^{(\mathbf{x}_{i1}-\mathbf{x}_{i2})^T\boldsymbol{\beta}}\right)

Which is equivalent to a standard logit with first-differenced covariates — the panel FE analogue of the first-differences estimator in linear models.

For T>2T > 2, the summation over C(Si)\mathcal{C}(S_i) grows combinatorially ((TSi)\binom{T}{S_i} terms) and is computed efficiently using the Breslow algorithm (analogous to the Cox partial likelihood).
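The $T=2$ equivalence is easy to check numerically: enumerating $\mathcal{C}(S_i)$ gives exactly the same log-likelihood contribution as the differenced logit. A small self-contained sketch:

```python
import numpy as np
from itertools import combinations

def cond_logit_ll_i(y, X, beta):
    """Chamberlain conditional log-likelihood contribution for one individual,
    enumerating all sequences d with sum(d) = sum(y)."""
    T, S = len(y), int(y.sum())
    num = y @ (X @ beta)
    denom = 0.0
    for ones in combinations(range(T), S):      # all binary sequences with S ones
        d = np.zeros(T)
        d[list(ones)] = 1.0
        denom += np.exp(d @ (X @ beta))
    return num - np.log(denom)

# T = 2, S_i = 1: equals a logit on first-differenced covariates
beta = np.array([0.5, -0.3])
X = np.array([[1.0, 2.0], [0.4, 1.1]])          # rows: t = 1, 2
y = np.array([1.0, 0.0])
ll_cond = cond_logit_ll_i(y, X, beta)
dx = (X[0] - X[1]) @ beta
ll_diff = dx - np.log(1 + np.exp(dx))           # differenced-logit contribution
```

The brute-force enumeration is only for illustration; as noted above, production code uses the Breslow-style recursion for larger $T$.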

16.7 Mixed Logit: Halton Sequences and Simulation

Halton sequence for prime base bb: Generate RR draws from the quasi-random sequence:

hr(b)=k=0Kak(r)b(k+1)h_r^{(b)} = \sum_{k=0}^K a_k(r) b^{-(k+1)}

Where rr in base bb is r=kak(r)bkr = \sum_k a_k(r) b^k. Halton sequences for different primes {b1,b2,,bK}\{b_1, b_2, \dots, b_K\} are used for different dimensions of integration.

Simulated log-likelihood:

~(μ,Σ)=i=1nln(1Rr=1RexijiTβ(r)k=1JexikTβ(r))\tilde{\ell}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{i=1}^n \ln\left(\frac{1}{R}\sum_{r=1}^R \frac{e^{\mathbf{x}_{ij_i}^T\boldsymbol{\beta}^{(r)}}}{\sum_{k=1}^J e^{\mathbf{x}_{ik}^T\boldsymbol{\beta}^{(r)}}}\right)

Where β(r)=μ+Lh(r)\boldsymbol{\beta}^{(r)} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{h}^{(r)} and h(r)\boldsymbol{h}^{(r)} are Halton draws transformed to standard normal variates via Φ1(hr(b))\Phi^{-1}(h_r^{(b)}).

Antithetic draws: For each draw h(r)\boldsymbol{h}^{(r)}, include its mirror h(r)-\boldsymbol{h}^{(r)} to reduce simulation variance.
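A sketch of the radical-inverse construction above, followed by the transform to standard normal variates and antithetic mirroring (self-contained for illustration, not the application's generator):

```python
import numpy as np
from scipy.stats import norm

def halton(n, base):
    """First n Halton draws for a prime base via the radical-inverse formula."""
    seq = np.empty(n)
    for r in range(1, n + 1):
        h, f, rr = 0.0, 1.0 / base, r
        while rr > 0:
            rr, a = divmod(rr, base)            # digits of r in the given base
            h += a * f
            f /= base
        seq[r - 1] = h
    return seq

draws = halton(1000, base=2)                    # quasi-random points in (0, 1)
normal_draws = norm.ppf(draws)                  # transform to N(0,1) variates
antithetic = np.concatenate([normal_draws, -normal_draws])
```

In multiple dimensions, one sequence per prime base is used; discarding the first few draws and skipping large primes avoids the well-known correlation problems of high-dimensional Halton sequences.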

16.8 WTP Computation and Krinsky-Robb Confidence Intervals

Point estimate:

WTP^k=γ^kγ^cost\widehat{WTP}_k = -\frac{\hat{\gamma}_k}{\hat{\gamma}_{cost}}

Delta method SE:

SE^(WTPk)=1γ^costVar(γ^k)+WTPk2Var(γ^cost)+2WTPkCov(γ^k,γ^cost)\widehat{SE}(WTP_k) = \frac{1}{|\hat{\gamma}_{cost}|}\sqrt{Var(\hat{\gamma}_k) + WTP_k^2 \cdot Var(\hat{\gamma}_{cost}) + 2 WTP_k \cdot Cov(\hat{\gamma}_k, \hat{\gamma}_{cost})}

Krinsky-Robb (1986) simulation:

  1. Draw R=10,000R = 10{,}000 parameter vectors from N(γ^,V^(γ^))\mathcal{N}(\hat{\boldsymbol{\gamma}}, \hat{\mathbf{V}}(\hat{\boldsymbol{\gamma}})).
  2. Compute WTPk(r)=γk(r)/γcost(r)WTP_k^{(r)} = -\gamma_k^{(r)}/\gamma_{cost}^{(r)} for each draw.
  3. Report the 2.5th and 97.5th percentiles of {WTPk(r)}\{WTP_k^{(r)}\} as the 95% CI.

The Krinsky-Robb CI is preferred over the delta method when γ^cost\hat{\gamma}_{cost} is close to zero (since the ratio is highly non-linear near zero).
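A numpy sketch of the Krinsky-Robb procedure; the coefficient estimates and covariance matrix below are illustrative, not output from any fitted model:

```python
import numpy as np

def krinsky_robb_wtp(gamma_hat, vcov, k, cost, R=10_000, seed=0):
    """Krinsky-Robb 95% CI for WTP_k = -gamma_k / gamma_cost."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(gamma_hat, vcov, size=R)   # parameter draws
    wtp = -draws[:, k] / draws[:, cost]                        # WTP per draw
    return np.percentile(wtp, [2.5, 97.5])

gamma_hat = np.array([0.8, -0.04])              # [attribute, cost] coefficients
vcov = np.array([[0.010, 0.0],
                 [0.0, 0.00004]])
lo, hi = krinsky_robb_wtp(gamma_hat, vcov, k=0, cost=1)
point = -gamma_hat[0] / gamma_hat[1]            # point estimate of WTP
```

Note that the simulated CI is generally asymmetric around the point estimate, which is exactly the behaviour the delta method cannot capture when the cost coefficient is imprecisely estimated.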


17. Worked Examples

Example 1: Binary Logit — Probability of Health Insurance Take-Up

Research Question: What factors predict whether an individual aged 19–64 has health insurance coverage?

Data: Cross-sectional survey of n=8,246n = 8{,}246 working-age adults; outcome: Yi=1Y_i = 1 if insured, 00 if uninsured.

Model: $\ln\frac{P(\text{insured})}{1-P(\text{insured})} = \beta_0 + \beta_1\text{Income}_i + \beta_2\text{College}_i + \beta_3\text{Age}_i + \beta_4\text{Employed}_i$

(No additive error term appears in the log-odds: randomness enters through the Bernoulli outcome, not a residual.)

Step 1: Results Table

| Variable | $\hat{\beta}$ | SE | $z$ | $p$ | $e^{\hat{\beta}}$ (OR) | APE |
|---|---|---|---|---|---|---|
| Intercept | -3.412 | 0.241 | -14.15 | <0.001 | | |
| Income (per $10k USD) | 0.318 | 0.031 | 10.26 | <0.001 | 1.374 | +0.048 |
| College degree | 0.841 | 0.094 | 8.95 | <0.001 | 2.318 | +0.127 |
| Age (years) | 0.042 | 0.007 | 6.00 | <0.001 | 1.043 | +0.006 |
| Employed full-time | 1.283 | 0.108 | 11.88 | <0.001 | 3.607 | +0.193 |

APEs are on the probability scale (e.g., +0.048 corresponds to 4.8 percentage points per $10k of income).

(β^)=3,841.2\ell(\hat{\boldsymbol{\beta}}) = -3{,}841.2, 0=4,512.7\ell_0 = -4{,}512.7, McFadden ρ2=0.149\rho^2 = 0.149. AUC = 0.813, Hosmer-Lemeshow χ82=9.41\chi^2_{8} = 9.41 (p=0.308p = 0.308, good calibration).

Step 2: Interpretation

Step 3: Predicted Probability Profiles

| Income | College | Age | Employed | $\hat{P}(\text{insured})$ |
|---|---|---|---|---|
| $25k | No | 30 | No | 0.312 |
| $50k | No | 40 | Yes | 0.741 |
| $75k | Yes | 50 | Yes | 0.941 |
| $25k | Yes | 30 | No | 0.507 |

Step 4: Model Diagnostics

AUC = 0.813 indicates good discrimination, and the Hosmer-Lemeshow test ($p = 0.308$) gives no evidence of miscalibration. No coefficient or standard error is implausibly large, so complete separation is not a concern here.


Example 2: Multinomial Logit — Occupational Choice

Research Question: What individual characteristics predict whether a worker is employed in (1) Professional/Managerial, (2) Technical/Clerical, or (3) Service/Manual occupations?

Data: $n = 3{,}812$ workers; reference category: Service/Manual (category 3).

Step 1: MNL Coefficient Table (Reference: Service/Manual)

| Variable | Professional (vs. Service) $\hat{\beta}$ | SE | Technical (vs. Service) $\hat{\beta}$ | SE |
|---|---|---|---|---|
| Intercept | -2.841 | 0.312 | -1.523 | 0.241 |
| Education (years) | 0.412 | 0.041 | 0.218 | 0.033 |
| Experience (years) | 0.083 | 0.018 | 0.061 | 0.015 |
| Female | -0.391 | 0.112 | 0.284 | 0.098 |
| Urban | 0.521 | 0.134 | 0.312 | 0.118 |

$\ell(\hat{\boldsymbol{\beta}}) = -3{,}412.8$, McFadden $\rho^2 = 0.187$; LR $\chi^2_{8} = 1{,}578.4$ ($p < 0.001$).
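For intuition, MNL choice probabilities are a softmax over category-specific linear indices, with the reference category's coefficients normalised to zero. A minimal numpy sketch (the coefficients below are placeholders for illustration, not the fitted values above):

```python
import numpy as np

def mnl_probs(X, betas):
    """MNL choice probabilities.

    X     : (n, k) covariate matrix (including a constant)
    betas : (J, k) coefficients, with the reference row set to zeros
    """
    V = X @ betas.T                      # (n, J) linear indices
    V -= V.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    expV = np.exp(V)
    return expV / expV.sum(axis=1, keepdims=True)

# Placeholder coefficients: 3 categories, reference (row of zeros) last,
# covariates = [constant, years of education]
betas = np.array([[-2.0, 0.30],   # category 1 vs. reference
                  [-1.0, 0.15],   # category 2 vs. reference
                  [ 0.0, 0.00]])  # reference category
X = np.array([[1.0, 12.0],
              [1.0, 16.0]])
P = mnl_probs(X, betas)
```

Each row of `P` sums to one, and raising education shifts probability toward the category with the larger education coefficient.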

Step 2: Average Partial Effects on Category Probabilities

| Variable | $\Delta\hat{P}$: Professional | $\Delta\hat{P}$: Technical | $\Delta\hat{P}$: Service |
|---|---|---|---|
| Education (+1 year) | +0.041 | +0.009 | -0.050 |
| Experience (+1 year) | +0.007 | +0.003 | -0.010 |
| Female | -0.048 | +0.062 | -0.014 |
| Urban | +0.056 | +0.021 | -0.077 |

Note that effects sum to zero across categories (probability constraint). Being female reduces the probability of professional occupation by 4.8 pp but increases the probability of technical occupation by 6.2 pp.

Step 3: IIA Test

Hausman-McFadden test excluding "Technical" category: $\chi^2_{4} = 3.21$, $p = 0.523$ → IIA not rejected. Excluding "Professional": $\chi^2_{4} = 4.87$, $p = 0.301$ → IIA not rejected. The MNL is appropriate for this application.

Step 4: Predicted Category Probabilities for Representative Profiles

| Profile | Professional | Technical | Service |
|---|---|---|---|
| 12 yrs education, 5 yrs exp., male, rural | 0.214 | 0.281 | 0.505 |
| 16 yrs education, 10 yrs exp., female, urban | 0.412 | 0.394 | 0.194 |
| 18 yrs education, 20 yrs exp., male, urban | 0.631 | 0.248 | 0.121 |

Example 3: Ordered Logit — Customer Satisfaction

Research Question: What factors predict customer satisfaction with a bank, rated on a 5-point scale (1 = Very Dissatisfied, ..., 5 = Very Satisfied)?

Data: $n = 4{,}521$ bank customers; outcome: satisfaction rating $Y_i \in \{1,2,3,4,5\}$.

Step 1: Ordered Logit Results

| Variable | $\hat{\beta}$ | SE | $z$ | $p$ |
|---|---|---|---|---|
| Account Tenure (years) | 0.182 | 0.023 | 7.91 | <0.001 |
| Branch Wait Time (minutes) | -0.241 | 0.038 | -6.34 | <0.001 |
| Mobile App User | 0.612 | 0.084 | 7.29 | <0.001 |
| Complaint (last 12 mo.) | -1.143 | 0.112 | -10.21 | <0.001 |
| Premium Account | 0.831 | 0.098 | 8.48 | <0.001 |

Estimated Thresholds:

| Threshold | Estimate | SE |
|---|---|---|
| $\hat{\tau}_1$ (1\|2) | -3.412 | 0.181 |
| $\hat{\tau}_2$ (2\|3) | -1.841 | 0.143 |
| $\hat{\tau}_3$ (3\|4) | 0.321 | 0.121 |
| $\hat{\tau}_4$ (4\|5) | 2.184 | 0.152 |
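The category probabilities implied by these thresholds follow the difference-of-CDFs formula $P(Y=j) = \Lambda(\tau_j - \mathbf{x}^T\boldsymbol{\beta}) - \Lambda(\tau_{j-1} - \mathbf{x}^T\boldsymbol{\beta})$. A small numpy sketch using the estimated thresholds above (the two linear-index values are hypothetical, chosen only to contrast a low- and a high-index customer):

```python
import numpy as np

def ordered_logit_probs(xb, cuts):
    """Category probabilities for an ordered logit with J = len(cuts)+1 levels."""
    lam = lambda z: 1.0 / (1.0 + np.exp(-z))        # logistic CDF
    # CDF evaluated at each threshold, padded with 0 and 1 at the extremes
    cdf = np.concatenate(([0.0], lam(np.asarray(cuts) - xb), [1.0]))
    return np.diff(cdf)                             # P(Y=1), ..., P(Y=J)

cuts = [-3.412, -1.841, 0.321, 2.184]               # thresholds from the table
p_low = ordered_logit_probs(xb=-1.0, cuts=cuts)     # hypothetical low index
p_high = ordered_logit_probs(xb=2.0, cuts=cuts)     # hypothetical high index
```

Both probability vectors sum to one, and the customer with the higher linear index puts more mass on the top ("Very Satisfied") category.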

Step 2: Brant Test for Proportional Odds

| Variable | $\chi^2_3$ | $p$ | PO Violated? |
|---|---|---|---|
| Account Tenure | 2.14 | 0.543 | No |
| Branch Wait Time | 3.91 | 0.271 | No |
| Mobile App User | 4.21 | 0.240 | No |
| Complaint | 18.41 | <0.001 | Yes |
| Premium Account | 3.12 | 0.373 | No |
| Global test | 32.18 | 0.009 | Yes |

The complaint variable violates proportional odds → estimate a Generalised Ordered Logit allowing the complaint coefficient to vary across thresholds.

Step 3: Average Partial Effects on $P(Y = 5)$ (Very Satisfied)

| Variable | APE | SE | $p$ |
|---|---|---|---|
| Tenure (+1 year) | +0.024 | 0.003 | <0.001 |
| Wait Time (+1 min) | -0.031 | 0.005 | <0.001 |
| Mobile App User | +0.082 | 0.011 | <0.001 |
| Complaint (yes vs. no) | -0.183 | 0.018 | <0.001 |
| Premium Account | +0.111 | 0.013 | <0.001 |

(APEs are on the probability scale: -0.183 = -18.3 percentage points.)

Having a complaint in the last 12 months reduces the probability of being Very Satisfied by 18.3 pp — by far the largest effect.


Example 4: Mixed Logit — Transportation Mode Choice

Research Question: How do travellers' preferences for cost, time, and comfort vary across individuals when choosing among Car, Bus, Train, and Bicycle?

Data: Stated preference survey; $n = 2{,}412$ respondents, each evaluating 8 hypothetical choice scenarios ($N = 19{,}296$ choice situations, stacked in long format with one row per alternative per situation); $J = 4$ alternatives with attributes: cost ($), travel time (min.), comfort rating (1-5).

Step 1: Mixed Logit Specification

| Attribute | Distribution |
|---|---|
| Cost ($) | Fixed (negative) |
| Travel Time (min.) | Normal: $\mathcal{N}(\mu_t, \sigma_t^2)$ |
| Comfort Rating | Normal: $\mathcal{N}(\mu_c, \sigma_c^2)$ |
| ASC: Car | Fixed |
| ASC: Train | Fixed |
| ASC: Bus | Fixed |
| ASC: Bicycle | Reference (normalised to zero) |

$R = 500$ Halton draws were used.
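Halton draws are built from the radical-inverse (van der Corput) sequence in coprime bases, one base per random coefficient; the uniforms are then mapped through $\Phi^{-1}$ to obtain normal draws. Production packages add scrambling and a burn-in, which this minimal sketch omits:

```python
import numpy as np

def halton(n, base):
    """First n points of the van der Corput sequence in a given base."""
    seq = np.zeros(n)
    for i in range(n):
        f, k = 1.0, i + 1          # start at index 1 to skip the degenerate 0
        while k > 0:
            f /= base
            seq[i] += f * (k % base)   # next digit of k in this base
            k //= base
    return seq

# Two-dimensional quasi-random draws on (0,1)^2 using coprime bases 2 and 3
u = np.column_stack([halton(500, 2), halton(500, 3)])
```

The base-2 sequence begins 0.5, 0.25, 0.75, 0.125, ..., filling the unit interval far more evenly than pseudo-random uniforms, which is why a few hundred Halton draws can match thousands of plain Monte Carlo draws in simulated likelihood work.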

Step 2: Results

| Parameter | Estimate | SE | $z$ | $p$ |
|---|---|---|---|---|
| Cost ($\hat{\gamma}_{cost}$) | -0.0412 | 0.006 | -6.87 | <0.001 |
| Time mean ($\hat{\mu}_t$) | -0.0841 | 0.012 | -7.01 | <0.001 |
| Time SD ($\hat{\sigma}_t$) | 0.0412 | 0.008 | 5.15 | <0.001 |
| Comfort mean ($\hat{\mu}_c$) | 0.3121 | 0.041 | 7.61 | <0.001 |
| Comfort SD ($\hat{\sigma}_c$) | 0.1843 | 0.029 | 6.35 | <0.001 |
| ASC: Car | 1.241 | 0.182 | 6.82 | <0.001 |
| ASC: Train | 0.814 | 0.151 | 5.39 | <0.001 |
| ASC: Bus | -0.312 | 0.141 | -2.21 | 0.027 |

Simulated $\ell = -18{,}412.3$; McFadden $\rho^2 = 0.341$ (vs. $\rho^2 = 0.298$ for standard MNL).

Step 3: WTP Calculations (Krinsky-Robb 95% CI)

| Attribute | WTP Estimate | 95% CI |
|---|---|---|
| Travel time (per minute saved) | $2.04/min | [$1.61, $2.51] |
| Comfort (per unit increase) | $7.58/unit | [$5.91, $9.31] |

Travellers are willing to pay $2.04 per minute of travel time savings — a Value of Travel Time (VTT) estimate consistent with the transport economics literature.

Step 4: Preference Heterogeneity

The significant $\hat{\sigma}_t = 0.0412$ (Time SD) indicates substantial preference heterogeneity: 95% of the population has a time coefficient in the range $-0.0841 \pm 1.96 \times 0.0412 = [-0.165, -0.003]$ (all negative, i.e., essentially everyone dislikes travel time). In contrast, comfort has $\hat{\sigma}_c = 0.1843$, implying that some travellers actually have negative comfort coefficients, possibly capturing high-income travellers valuing solitude.
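The population shares behind these statements follow directly from the normal mixing distributions: the share with a negative coefficient is $\Phi((0-\hat{\mu})/\hat{\sigma})$. A quick standard-library check, using the estimates from the results table above:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function (no scipy needed)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Estimated normal mixing distributions from the mixed logit results above
mu_t, sd_t = -0.0841, 0.0412   # travel-time coefficient
mu_c, sd_c = 0.3121, 0.1843    # comfort coefficient

# Share of the population whose coefficient is negative: Phi((0 - mu) / sd)
share_time_negative = norm_cdf((0.0 - mu_t) / sd_t)
share_comfort_negative = norm_cdf((0.0 - mu_c) / sd_c)
```

This gives roughly 98% of travellers with a negative time coefficient and about 4.5% with a negative comfort coefficient, matching the qualitative reading above.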


Example 5: Conditional Fixed Effects Logit — Panel Adoption Decision

Research Question: Does a reduction in technology cost (logged) increase the probability that a firm adopts a new production technology, controlling for all time-invariant firm characteristics?

Data: Annual panel of $n = 1{,}421$ manufacturing firms, $T = 8$ years; outcome: $Y_{it} = 1$ if firm adopts the technology in year $t$; 214 firms (15%) adopt during the panel.

Model: Conditional fixed effects logit (Chamberlain), conditioning on $\sum_t Y_{it}$.

| Variable | $\hat{\beta}$ | SE | $z$ | $p$ | APE |
|---|---|---|---|---|---|
| Log(Technology Cost) | -0.841 | 0.121 | -6.95 | <0.001 | -0.062 |
| Government Subsidy ($) | 0.0312 | 0.008 | 3.90 | <0.001 | +0.023 |
| Competitor Adoption Rate | 1.412 | 0.218 | 6.48 | <0.001 | +0.104 |
| Time Trend | 0.184 | 0.041 | 4.49 | <0.001 | +0.014 |

(APEs are on the probability scale.)

Number of firms contributing information: 214 (firms adopting at least once). Firms never adopting: 1,207 (dropped by the conditioning). $\ell^{CL} = -1{,}241.8$.

Interpretation: A 10% increase in technology cost reduces the probability of adoption by approximately $0.062 \times \ln(1.10) \approx 0.006$, i.e. 0.6 pp per year, controlling for all time-invariant firm heterogeneity. Competitive pressure (competitor adoption rate) has the largest effect: a 10 pp increase in competitor adoption rates raises a firm's own adoption probability by roughly 1.0 pp.
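Why conditioning on $\sum_t Y_{it}$ removes the fixed effect can be verified numerically: for $T = 2$ and exactly one adoption, the conditional probability reduces to a logit on the within-firm difference $(x_{i2} - x_{i1})\beta$, whatever the value of $\alpha_i$. A sketch with hypothetical parameter values:

```python
import numpy as np

def cond_prob_switch(alpha, beta, x1, x2):
    """P(Y1=0, Y2=1 | Y1+Y2=1) under a logit with individual effect alpha."""
    p1 = 1 / (1 + np.exp(-(alpha + beta * x1)))
    p2 = 1 / (1 + np.exp(-(alpha + beta * x2)))
    num = (1 - p1) * p2                       # prob. of the (0, 1) sequence
    return num / (num + p1 * (1 - p2))        # conditional on exactly one Y=1

# Hypothetical values: log cost falls between the two periods
beta, x1, x2 = -0.8, 1.0, 0.4
direct = 1 / (1 + np.exp(-beta * (x2 - x1)))  # logit on the within difference
# The conditional probability is identical for any fixed effect alpha_i
probs = [cond_prob_switch(a, beta, x1, x2) for a in (-3.0, 0.0, 3.0)]
```

All three conditional probabilities coincide with the difference-logit value, illustrating that $\alpha_i$ cancels from the conditional likelihood.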


18. Common Mistakes and How to Avoid Them

Mistake 1: Interpreting Raw Logit/Probit Coefficients as Marginal Effects

Problem: Reporting the raw $\hat{\beta}$ from a logit regression as "a one-unit increase in $X$ increases the probability of $Y=1$ by $\hat{\beta}$." This is only correct for the Linear Probability Model. In logit and probit, $\hat{\beta}$ is the change in the log-odds (logit) or the latent index (probit), not the probability.
Solution: Always compute and report Average Partial Effects (APE) using the delta method for standard errors. For communication to non-technical audiences, report predicted probabilities for representative covariate profiles.
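The logit APE formula, $APE_k = n^{-1}\sum_i \hat{p}_i(1-\hat{p}_i)\hat{\beta}_k$, is a one-liner; the probit version swaps in the normal pdf. The sketch below uses hypothetical coefficients and covariates and shows that the APE is well below the raw coefficient:

```python
import numpy as np

def logit_ape(beta, X):
    """Average Partial Effects for a logit: mean_i p_i(1-p_i) * beta_k."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return np.mean(p * (1.0 - p)) * beta

# Hypothetical fitted coefficients: [intercept, slope on x1]
beta = np.array([0.0, 0.5])
X = np.column_stack([np.ones(1000), np.linspace(-2, 2, 1000)])
ape = logit_ape(beta, X)   # ape[1] is the APE of x1 on P(Y=1)
```

Here the scaling factor $\bar{p(1-p)}$ is at most 0.25, so the APE of `x1` is less than a quarter of the raw coefficient: exactly the gap that Mistake 1 warns about.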

Mistake 2: Applying Multinomial Logit When IIA is Violated

Problem: Using the MNL for alternatives that are close substitutes (e.g., different bus routes, similar brand variants), leading to unrealistic cross-substitution patterns predicted by the model.
Solution: Test IIA with the Hausman-McFadden or Small-Hsiao test. If IIA is suspect based on subject-matter knowledge (similar alternatives exist), use Nested Logit (if the nesting structure is clear) or Mixed Logit (for flexible substitution patterns). Report robustness across model specifications.

Mistake 3: Ignoring the Proportional Odds Assumption in Ordered Logit

Problem: Estimating an ordered logit without testing the proportional odds assumption, and reporting a single coefficient for each variable as if it applies uniformly across all thresholds. When the assumption is violated, the estimated coefficient is an unreliable average.
Solution: Always run the Brant test (global and variable-specific). If violated for one or more variables, use the Generalised Ordered Logit (partial proportional odds) or report category-specific marginal effects. Never report ordered logit results without proportional odds diagnostics.

Mistake 4: Using Standard Fixed Effects Logit Instead of Conditional Logit for Panel Data

Problem: Estimating a logit with individual dummy variables (LSDV approach) for panel data. Due to the incidental parameters problem, $\hat{\boldsymbol{\beta}}$ is inconsistent for fixed $T$. With $T = 2$, the bias is approximately 100%; with $T = 5$, roughly 20%.
Solution: Use Chamberlain's conditional fixed effects logit for panel binary outcomes (Stata: xtlogit, fe; R: clogit). For probit, use the Mundlak-Chamberlain correlated random effects approach. Remember that identification comes from within-individual variation only.

Mistake 5: Reporting Odds Ratios as Relative Risks (Risk Ratios)

Problem: Interpreting $e^{\hat{\beta}} = 2.0$ as "twice as likely." This is the odds ratio, not the relative risk (risk ratio). For common outcomes ($P > 10\%$), the odds ratio substantially overestimates the relative risk.
Solution: Be explicit about reporting odds ratios (from logit) vs. relative risks. For common outcomes, report Average Partial Effects (absolute probability changes) which are clearer. If relative risk is needed, use Poisson regression with a log link or compute predicted probability ratios directly.

Mistake 6: Using Only In-Sample Fit Statistics for Model Selection

Problem: Selecting a model (e.g., choosing logit over probit, or choosing a particular set of covariates) based solely on in-sample pseudo-$R^2$ or log-likelihood, without accounting for overfitting.
Solution: Use AIC/BIC for comparing models with different covariate sets. For predictive models, use out-of-sample AUC or Brier score from cross-validation. Always check calibration via Hosmer-Lemeshow. Distinguish between models for prediction vs. structural inference.

Mistake 7: Not Checking for Complete Separation

Problem: Running logit/probit on small samples or with many binary predictors without checking for complete separation. The MLE does not exist, but many software packages produce output with extremely large (meaningless) coefficients and standard errors without warning the user.
Solution: Check for separation before relying on MLE estimates. Warning signs: coefficients $|\hat{\beta}| > 10$, SEs $> 5$, predicted probabilities exactly at 0 or 1. Use Firth penalised logit or Bayesian logit (weakly informative priors) as robust alternatives to standard MLE in small or sparse samples.
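A quick single-variable screen for separation can be run before estimation. Note that it only detects separation on one predictor at a time; multivariate separation still requires watching for exploding coefficients (or a linear-programming check). The data below are toy values:

```python
import numpy as np

def separating_predictors(X, y, names):
    """Flag predictors whose ranges for y=0 and y=1 do not overlap,
    i.e. complete or quasi-complete separation on a single variable."""
    flagged = []
    for j, name in enumerate(names):
        x0, x1 = X[y == 0, j], X[y == 1, j]
        if x0.max() <= x1.min() or x1.max() <= x0.min():
            flagged.append(name)
    return flagged

# Toy data: x1 separates the outcome perfectly, x2 overlaps across classes
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([
    np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),   # separating
    np.array([2.0, 5.0, 1.0, 4.0, 0.0, 3.0]),   # overlapping
])
flags = separating_predictors(X, y, ["x1", "x2"])
```

Running MLE with `x1` included would send its coefficient toward infinity, which is exactly the exploding-coefficient symptom described above.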

Mistake 8: Including Irrelevant Alternatives in the Choice Set

Problem: Defining the choice set too broadly (e.g., including alternatives that are not actually available to the decision-maker) or too narrowly (excluding relevant alternatives). Both distort the estimated choice probabilities.
Solution: Carefully define the choice set based on availability. For alternative-specific choice sets (where different individuals face different options), specify the availability matrix in the model. Report the sensitivity of results to alternative choice set definitions.

Mistake 9: Failing to Account for Preference Heterogeneity

Problem: Estimating a standard MNL or conditional logit that assumes homogeneous preferences across all individuals, missing important heterogeneity in price sensitivity, taste, or value of time. This leads to biased substitution patterns and misleading policy simulations.
Solution: Test for heterogeneity by including interaction terms with demographic variables. For more flexible heterogeneity, estimate a Mixed Logit with normally distributed random coefficients. Report the distribution of individual-level preferences, not just the mean.

Mistake 10: Using the Wrong Data Format for Conditional Logit

Problem: Estimating a conditional logit with individual-specific data in wide format (one row per person, multiple columns for different alternatives' attributes). This causes data errors and incorrect likelihood contributions.
Solution: Convert data to long format: one row per alternative per individual. The dataset should have $n \times J$ rows ($n$ individuals, $J$ alternatives). Verify the choice indicator is coded as 1 for the chosen alternative and 0 for all others, within each individual's choice set.
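With pandas the wide-to-long conversion is a single call; the column names and values below are hypothetical:

```python
import pandas as pd

# Wide format: one row per person, alternative attributes in separate columns
wide = pd.DataFrame({
    "id": [1, 2],
    "chosen": ["car", "bus"],                     # chosen alternative
    "cost_car": [4.0, 3.5], "cost_bus": [2.0, 1.8],
    "time_car": [20, 25],   "time_bus": [40, 35],
})

# Long format: one row per (person, alternative) pair -> n * J rows
long = (pd.wide_to_long(wide, stubnames=["cost", "time"],
                        i="id", j="alt", sep="_", suffix=r"\w+")
          .reset_index())
# Choice indicator: 1 for the chosen alternative, 0 otherwise
long["y"] = (long["alt"] == long["chosen"]).astype(int)
```

The result has $n \times J = 4$ rows, and each individual's `y` sums to exactly one, which is the sanity check worth running before any conditional logit estimation.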


19. Troubleshooting

| Issue | Likely Cause | Solution |
|---|---|---|
| MLE does not converge | Poor starting values; very flat likelihood; complete separation | Use OLS/LPM as starting values; rescale variables; check for separation; try Firth logit |
| Extremely large coefficients or SEs ($> 5$) | Complete or quasi-complete separation; collinearity | Check pairwise correlations; VIF analysis; merge categories; use penalised estimation (Firth) |
| Hessian not negative definite at convergence | Local optimum; non-concave model extension | Try multiple starting values; use a different optimizer (BFGS vs. Newton-Raphson); check model specification |
| Predicted probabilities exactly 0 or 1 | Complete separation; extreme covariate values | Identify the separating combination; drop/transform the variable; use Firth logit; check for data errors |
| APE has wrong sign compared to coefficient | Cross-effects in MNL; non-linear interaction effects | In MNL the APE can have the opposite sign to the log-odds coefficient; this is expected, so report both |
| IIA Hausman test gives negative statistic | Small sample; numerical imprecision in Hessian estimation | Use the Small-Hsiao test instead; check that the restricted model is nested in the full model; try more alternatives |
| Brant test significant (proportional odds violated) | Heterogeneous covariate effects across thresholds | Fit Generalised Ordered Logit or Multinomial Logit; report variable-specific Brant results to identify culprits |
| Mixed logit does not converge | Too many random parameters; too few draws; poor scaling | Increase draws (to 1000+); scale attributes to similar magnitude; treat some parameters as fixed; simplify the model |
| Conditional logit: no observations after conditioning | All individuals have $S_i = 0$ or $S_i = T$ | Verify panel structure; ensure within-individual variation in $Y_{it}$; check treatment coding |
| WTP confidence interval is extremely wide or includes infinity | Cost coefficient close to zero; poor precision | Report Krinsky-Robb CI instead of delta method; increase sample size; consider fixing the cost coefficient |
| Hosmer-Lemeshow test rejects calibration | Model does not predict outcome rates accurately in some regions | Inspect the calibration plot decile by decile; add polynomial terms for continuous variables; check for important omitted variables |
| AUC is high but calibration is poor | Model discriminates well but predicted probabilities are poorly scaled | Apply Platt scaling or isotonic regression to recalibrate the predicted probabilities |
| Panel random effects probit: very slow convergence | Many quadrature points needed; complex likelihood surface | Reduce quadrature points (12-20 are usually sufficient); use adaptive quadrature; use the Mundlak-Chamberlain approach with standard probit |
| Nested logit: $\hat{\lambda}_m > 1$ | Incorrect nesting structure; misspecified model | The model is not RUM-consistent; rethink the nesting structure; try Mixed Logit as an alternative |
| MNL: predictions dominated by one category | Class imbalance; misspecified alternative | Check class proportions; verify the reference category; consider alternative-specific constants |
| Interaction terms insignificant despite theoretical expectation | Insufficient statistical power; multicollinearity | Check VIF for the interaction; report effect sizes with CIs regardless of significance; consider a power analysis |

20. Quick Reference Cheat Sheet

Core Probability Formulas

| Model | $P(Y_i = 1 \mid \mathbf{x}_i)$ | Link Function |
|---|---|---|
| Logit | $\Lambda(\mathbf{x}_i^T\boldsymbol{\beta}) = \dfrac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1+e^{\mathbf{x}_i^T\boldsymbol{\beta}}}$ | Logit: $\ln[p/(1-p)]$ |
| Probit | $\Phi(\mathbf{x}_i^T\boldsymbol{\beta})$ | Probit: $\Phi^{-1}(p)$ |
| LPM | $\mathbf{x}_i^T\boldsymbol{\beta}$ | Identity |
| MNL | $\dfrac{e^{\mathbf{x}_i^T\boldsymbol{\beta}_j}}{\sum_k e^{\mathbf{x}_i^T\boldsymbol{\beta}_k}}$ | Log relative odds |
| Ordered Logit | $\Lambda(\tau_j - \mathbf{x}_i^T\boldsymbol{\beta}) - \Lambda(\tau_{j-1} - \mathbf{x}_i^T\boldsymbol{\beta})$ | Proportional odds |

Key Formulas

| Formula | Description |
|---|---|
| $\ell(\boldsymbol{\beta}) = \sum_i [Y_i\ln P_i + (1-Y_i)\ln(1-P_i)]$ | Binary logit/probit log-likelihood |
| $APE_k = n^{-1}\sum_i \partial P_i / \partial X_{ik}$ | Average Partial Effect (continuous $X_k$) |
| $APE_k^{logit} = n^{-1}\sum_i \hat{p}_i(1-\hat{p}_i)\hat{\beta}_k$ | APE for logit |
| $APE_k^{probit} = n^{-1}\sum_i \phi(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})\hat{\beta}_k$ | APE for probit |
| $OR_k = e^{\hat{\beta}_k}$ | Odds ratio from logit |
| $WTP_k = -\hat{\gamma}_k / \hat{\gamma}_{cost}$ | Willingness to pay |
| $\rho^2 = 1 - \ell(\hat{\boldsymbol{\beta}})/\ell_0$ | McFadden's pseudo-$R^2$ |
| $LR = -2(\ell_{restricted} - \ell_{unrestricted}) \sim \chi^2_q$ | Likelihood Ratio test |
| $P(Y_i = j \mid \text{nest } m) = e^{V_{ij}/\lambda_m}/\sum_{k\in B_m}e^{V_{ik}/\lambda_m}$ | Nested logit conditional probability |
| $\hat{\boldsymbol{\beta}}^{(t+1)} = \hat{\boldsymbol{\beta}}^{(t)} + (\mathbf{X}^T\hat{\mathbf{W}}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\hat{\mathbf{p}})$ | Newton-Raphson / IRLS update |
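The Newton-Raphson / IRLS update can be iterated to convergence in a few lines of numpy; the data below are simulated purely for illustration:

```python
import numpy as np

def logit_irls(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson / IRLS for binary logit, per the update formula above."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)                       # diagonal of W-hat
        grad = X.T @ (y - p)                    # score X'(y - p)
        H = X.T @ (X * W[:, None])              # X' W X
        step = np.linalg.solve(H, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # converged
            break
    return beta

# Simulated data with known coefficients (illustration only)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(5000), rng.normal(size=5000)])
true_beta = np.array([-0.5, 1.0])
y = (rng.uniform(size=5000) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)
beta_hat = logit_irls(X, y)
```

At the solution the score $X'(y - \hat{p})$ is numerically zero, which is the first-order condition of the log-likelihood in the table above.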

Model Selection Guide

| Outcome Type | $J$ | Alternatives | Recommended Model |
|---|---|---|---|
| Binary | 2 | | Logit (default) or Probit |
| Nominal | $J \geq 3$ | No attributes | Multinomial Logit |
| Nominal | $J \geq 3$ | With attributes | Conditional Logit |
| Nominal (correlated) | $J \geq 3$ | Nested groups | Nested Logit |
| Nominal (heterogeneous) | $J \geq 3$ | Random preferences | Mixed Logit |
| Ordinal | $J \geq 3$ | PO holds | Ordered Logit |
| Ordinal | $J \geq 3$ | PO violated | Generalised Ordered Logit |
| Binary, panel FE | 2 | | Conditional FE Logit |
| Binary, panel RE | 2 | | Random Effects Probit (Mundlak) |

Assumption Checklist

| Assumption | Model | How to Test | If Violated |
|---|---|---|---|
| Correct link function | Logit/Probit | Linktest; Box-Tidwell | Try alternative link; add polynomial terms |
| No complete separation | All binary | Check for large SEs; predicted probs = 0/1 | Firth penalised MLE; Bayesian logit |
| IIA | MNL, CL | Hausman-McFadden; Small-Hsiao | Nested Logit; Mixed Logit |
| Proportional odds | Ordered Logit | Brant test; parallel lines graph | Generalised Ordered Logit; MNL |
| No heteroscedasticity | Probit | Linktest; heteroscedastic probit | Heteroscedastic probit; robust SEs |
| No perfect multicollinearity | All | VIF; condition number | Drop/combine variables; regularise |
| RUM consistency | Nested Logit | $\hat{\lambda}_m \in (0,1]$ | Respecify nesting; Mixed Logit |
| No endogeneity | All | Hausman test vs. IV estimator | Control function; IV logit/probit |

Marginal Effects: Type and Context

| Context | Measure | Formula |
|---|---|---|
| Average effect (standard) | APE / AME | $n^{-1}\sum_i \partial P_i/\partial X_k$ |
| Effect at average person | PEM | $\partial P / \partial X_k \mid_{\mathbf{x}=\bar{\mathbf{x}}}$ |
| Effect for specific profile | Marginal effect at representative values | $\partial P / \partial X_k \mid_{\mathbf{x}=\mathbf{x}_0}$ |
| Binary covariate | Discrete change | $P(Y=1\mid X_k=1) - P(Y=1\mid X_k=0)$ |
| Log-odds scale | Raw coefficient | $\hat{\beta}_k$ |
| Multiplicative odds | Odds ratio | $e^{\hat{\beta}_k}$ |
| Money metric | Willingness to Pay | $-\hat{\gamma}_k/\hat{\gamma}_{cost}$ |

Standard Error Selection

| Setting | Recommended SE | Rationale |
|---|---|---|
| IID observations, correct spec. | MLE information matrix SEs | Efficient; standard |
| Potential misspecification | Sandwich (robust) SEs | Robust to distributional misspecification |
| Clustered data (firms, regions) | Cluster-robust SEs | Within-cluster correlation |
| Small samples | Bootstrap SEs | More reliable finite-sample inference |
| Marginal effects | Delta method SEs | Propagates uncertainty from $\hat{\boldsymbol{\beta}}$ |
| WTP | Krinsky-Robb simulation | Better for ratios of estimates |

Fit Statistics at a Glance

| Statistic | Formula | Best for |
|---|---|---|
| McFadden $\rho^2$ | $1 - \ell/\ell_0$ | Overall model fit |
| AIC | $-2\ell + 2k$ | Model comparison (prediction) |
| BIC | $-2\ell + k\ln n$ | Model comparison (parsimony) |
| AUC-ROC | Area under ROC curve | Binary discrimination |
| Brier Score | $n^{-1}\sum(\hat{p}_i-Y_i)^2$ | Probability calibration |
| Count $R^2$ | Pct. correctly classified | Naive classification |
| Hosmer-Lemeshow | $\chi^2_{G-2}$ | Calibration across prediction deciles |
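AUC and the Brier score can both be computed directly from predicted probabilities. A numpy sketch with toy values (the rank formula below assumes no tied predictions):

```python
import numpy as np

def auc_rank(y, p):
    """AUC via the rank (Mann-Whitney) formula; assumes no tied predictions."""
    order = np.argsort(p)
    ranks = np.empty(len(p))
    ranks[order] = np.arange(1, len(p) + 1)      # rank of each prediction
    n1, n0 = y.sum(), (1 - y).sum()
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

# Toy example: every positive outranks every negative, so AUC = 1
y = np.array([0, 0, 1, 0, 1, 1])
p = np.array([0.1, 0.3, 0.4, 0.2, 0.8, 0.9])
auc = auc_rank(y, p)
brier = np.mean((p - y) ** 2)                    # mean squared probability error
```

Note the two statistics answer different questions, as the table above stresses: AUC measures ranking (discrimination) only, while the Brier score also penalises poorly scaled probabilities.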

Panel Discrete Choice: Key Properties

| Estimator | Consistent ($n\to\infty$, fixed $T$)? | Time-Invariant Variables? | Dynamic State Dependence? | Key Reference |
|---|---|---|---|---|
| LSDV Logit (incidental params.) | No | No | Limited | |
| Conditional FE Logit | Yes | No (by default) | No | Chamberlain (1980) |
| RE Probit (standard) | Yes (if $\alpha_i \perp \mathbf{x}_{it}$) | Yes | No | |
| Correlated RE Probit (Mundlak) | Yes (approximately) | Yes | No | Mundlak (1978) |
| Dynamic Logit (Wooldridge) | Yes | Limited | Yes | Wooldridge (2005) |
| Mixed Logit (panel) | Yes | Yes | Via serial correlation | McFadden & Train (2000) |

This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Discrete Choice Models using the DataStatPro application. For further reading, consult McFadden's "Conditional Logit Analysis of Qualitative Choice Behavior" (1974), Train's "Discrete Choice Methods with Simulation" (Cambridge University Press, 2009), Greene's "Econometric Analysis" (8th ed., 2018), Long's "Regression Models for Categorical and Limited Dependent Variables" (Sage, 1997), or Wooldridge's "Econometric Analysis of Cross Section and Panel Data" (MIT Press, 2010). For feature requests or support, contact the DataStatPro team.