Discrete Choice Models: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of Discrete Choice modelling all the way through advanced extensions, assumption testing, heterogeneity analysis, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.
Table of Contents
- Prerequisites and Background Concepts
- What are Discrete Choice Models?
- The Mathematical Framework
- Key Assumptions
- Identification and Causal Inference
- Binary Choice Models: Logit and Probit
- Hypothesis Testing and Inference
- Effect Size Measures
- Model Fit and Evaluation
- Diagnostics and Assumption Testing
- Extensions: Multinomial and Conditional Logit
- Extensions: Ordered Choice Models
- Extensions: Nested Logit and Mixed Logit
- Extensions: Panel Data Discrete Choice
- Using the Discrete Choice Component
- Computational and Formula Details
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into Discrete Choice Models, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.
1.1 Random Variables and Probability Distributions
A random variable is a variable whose value is determined by a random process. In discrete choice modelling, the outcome variable $Y$ takes on a finite set of values representing alternative choices (e.g., $Y \in \{0, 1\}$ for binary outcomes, or $Y \in \{1, 2, \dots, J\}$ for multiple alternatives).
Key distributions used in discrete choice models:
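- Bernoulli distribution: $Y \sim \text{Bernoulli}(p)$, where $P(Y = 1) = p$ and $P(Y = 0) = 1 - p$. Used for binary outcomes.
- Logistic distribution: The basis of the logit model. It arises as the distribution of the difference between two independent Gumbel variables. The standard logistic CDF is $\Lambda(z) = \dfrac{e^{z}}{1 + e^{z}}$.
- Standard Normal distribution: The basis of the probit model. The CDF is $\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^{2}/2}\, dt$.
- Type I Extreme Value (Gumbel) distribution: The basis of the multinomial logit model. Its CDF is $F(\varepsilon) = \exp(-e^{-\varepsilon})$.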
1.2 Likelihood and Maximum Likelihood Estimation
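Maximum Likelihood Estimation (MLE) is the primary estimation method for discrete choice models. For a sample of $n$ independent observations, the likelihood function is:
$$L(\theta) = \prod_{i=1}^{n} P(Y_i = y_i \mid x_i; \theta)$$
The log-likelihood (more convenient for optimisation) is:
$$\ell(\theta) = \sum_{i=1}^{n} \ln P(Y_i = y_i \mid x_i; \theta)$$
The MLE maximises $\ell(\theta)$:
$$\hat{\theta}_{MLE} = \arg\max_{\theta}\, \ell(\theta)$$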
MLE has attractive properties under regularity conditions: it is consistent, asymptotically normal, and asymptotically efficient.
1.3 Latent Variable Models
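Many discrete choice models are derived from an underlying latent variable (unobserved continuous variable). Define a latent utility or propensity:
$$Y_i^{*} = x_i'\beta + \varepsilon_i$$
The observed discrete outcome is a deterministic function of the latent variable:
$$Y_i = \mathbb{1}[Y_i^{*} > 0]$$
The distribution assumed for $\varepsilon_i$ determines the model family:
- $\varepsilon_i \sim \text{Logistic}(0, 1)$ → Logit model
- $\varepsilon_i \sim N(0, 1)$ → Probit model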
1.4 Random Utility Maximisation (RUM)
The Random Utility Model (McFadden, 1974) provides the economic foundation for discrete choice. Each decision-maker $i$ assigns a utility $U_{ij}$ to each alternative $j$:
$$U_{ij} = V_{ij} + \varepsilon_{ij}$$
Where:
- $V_{ij}$ is the systematic (deterministic) utility component.
- $\varepsilon_{ij}$ is the random utility component: unobserved factors affecting utility.
The decision-maker chooses the alternative that maximises utility:
$$Y_i = \arg\max_{j}\, U_{ij}$$
Different distributional assumptions for $\varepsilon_{ij}$ yield different discrete choice model families.
1.5 Ordinary Least Squares and Its Limitations for Discrete Outcomes
OLS regression applied to a binary outcome ($Y_i \in \{0, 1\}$) produces the Linear Probability Model (LPM):
$$Y_i = x_i'\beta + u_i, \qquad P(Y_i = 1 \mid x_i) = x_i'\beta$$
The LPM has well-known limitations:
- Predicted probabilities outside $[0, 1]$: OLS can predict probabilities less than 0 or greater than 1.
- Heteroscedastic errors: $\text{Var}(u_i \mid x_i) = x_i'\beta\,(1 - x_i'\beta)$ varies with $x_i$, violating the OLS homoscedasticity assumption.
- Non-constant marginal effects: The true relationship between covariates and $P(Y = 1)$ is typically non-linear (sigmoid-shaped), not linear.
Discrete choice models address all three limitations by modelling probabilities through a monotone transformation that maps the real line to $(0, 1)$.
1.6 Multinomial Outcomes and Ordinal Data
Multinomial outcomes have more than two unordered categories: $Y \in \{1, 2, \dots, J\}$ where no natural ordering exists (e.g., mode of transport: car, bus, train, bicycle).
Ordinal outcomes have more than two categories with a natural ordering: $Y \in \{1, 2, \dots, J\}$ where $1 < 2 < \dots < J$ (e.g., satisfaction: low, medium, high).
Different model families are designed for each type of outcome.
2. What are Discrete Choice Models?
2.1 The Core Idea
Discrete Choice Models (DCMs) are statistical models designed to explain and predict the choices made by individuals (or firms, households, or other decision-making units) when they face a finite set of mutually exclusive alternatives.
The core modelling challenge: the outcome is not a continuous variable but a discrete category. Standard regression is misspecified for such outcomes. DCMs model the probability of choosing each alternative as a function of:
- Decision-maker characteristics: Age, income, gender, education of the chooser.
- Alternative-specific attributes: Price, quality, travel time, distance of each option.
- Contextual factors: Market conditions, constraints, availability.
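The general DCM probability structure:
$$P(Y_i = j \mid x_i) = F_j(x_i; \theta), \qquad j = 1, \dots, J$$
Where $F_j(\cdot)$ is a function mapping covariates and parameters to probabilities, with:
$$0 \le F_j(x_i; \theta) \le 1, \qquad \sum_{j=1}^{J} F_j(x_i; \theta) = 1$$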
2.2 A Taxonomy of Discrete Choice Models
| Model | Outcome Type | Alternatives | Key Assumption |
|---|---|---|---|
| Binary Logit | Binary ($J = 2$) | 2 unordered | Logistic errors |
| Binary Probit | Binary ($J = 2$) | 2 unordered | Normal errors |
| Multinomial Logit (MNL) | Nominal ($J \ge 3$) | $J$ unordered | IID Gumbel errors (IIA) |
| Conditional Logit | Nominal ($J \ge 3$) | $J$ with attributes | IID Gumbel errors (IIA) |
| Nested Logit | Nominal ($J \ge 3$) | Hierarchical structure | Correlated within nests |
| Mixed Logit | Nominal ($J \ge 3$) | $J$ with random coefficients | Flexible error structure |
| Ordered Logit (Proportional Odds) | Ordinal ($J \ge 3$) | Ordered categories | Proportional odds |
| Ordered Probit | Ordinal ($J \ge 3$) | Ordered categories | Normal latent variable |
| Multinomial Probit | Nominal ($J \ge 3$) | $J$ unordered | Multivariate normal errors |
2.3 Real-World Applications
Discrete choice models are applied across virtually every field involving individual decision-making:
- Transportation: Modal choice (car vs. bus vs. train vs. bicycle), route choice, departure time choice.
- Health Economics: Insurance plan choice, treatment adoption, healthcare provider selection, smoking/drinking behaviour.
- Marketing: Brand choice, product adoption, willingness to pay for product attributes.
- Labour Economics: Occupational choice, labour force participation, migration decisions, unionisation.
- Environmental Economics: Willingness to pay for environmental goods, habitat choice, recreational site choice.
- Political Science: Voting behaviour, party affiliation, referendum outcomes.
- Housing: Residential location choice, tenure choice (rent vs. own), housing type choice.
- Finance: Portfolio allocation categories, credit/default behaviour, investment vehicle choice.
- Education: School choice, field of study selection, dropout decisions.
2.4 Discrete Choice Models vs. Other Regression Methods
| Method | Outcome | Key Use Case | Key Limitation |
|---|---|---|---|
| OLS / LPM | Continuous (or binary) | Simple benchmark; DiD with binary | Predicted probs outside $[0, 1]$ |
| Logit / Probit | Binary | Binary classification; probability estimation | Marginal effects non-constant |
| Multinomial Logit | Nominal ($J \ge 3$) | Unordered multi-category choices | IIA assumption restrictive |
| Nested Logit | Nominal ($J \ge 3$, grouped) | Hierarchical choice structures | Tree structure pre-specified |
| Mixed Logit | Nominal (any $J$) | Preference heterogeneity, relaxes IIA | Computationally intensive |
| Ordered Logit/Probit | Ordinal | Ranked categories, Likert scales | Proportional odds assumption |
| Count Models (Poisson/NB) | Count data | Number of events | Not a DCM; counts not choices |
| Survival/Duration | Time to event | Time until discrete event | Different modelling paradigm |
3. The Mathematical Framework
3.1 The Binary Logit Model
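The logit model specifies the probability of the outcome $Y_i = 1$ as:
$$P(Y_i = 1 \mid x_i) = \Lambda(x_i'\beta) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$$
And:
$$P(Y_i = 0 \mid x_i) = 1 - \Lambda(x_i'\beta) = \frac{1}{1 + \exp(x_i'\beta)}$$
The log-odds (logit) transformation linearises the model:
$$\ln\left[\frac{P(Y_i = 1 \mid x_i)}{1 - P(Y_i = 1 \mid x_i)}\right] = x_i'\beta$$
The left-hand side is the log-odds (or logit) of the outcome, and the elements of $\beta$ are the log-odds coefficients.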
3.2 The Binary Probit Model
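The probit model specifies:
$$P(Y_i = 1 \mid x_i) = \Phi(x_i'\beta)$$
Where $\Phi(\cdot)$ is the standard normal CDF. From the latent variable representation:
$$P(Y_i = 1 \mid x_i) = P(Y_i^{*} > 0) = P(\varepsilon_i > -x_i'\beta) = \Phi(x_i'\beta)$$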
3.3 The Logit vs. Probit Comparison
The logistic and standard normal CDFs are very similar in shape. Key differences:
| Property | Logit | Probit |
|---|---|---|
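| Link function | $\ln\left(\frac{p}{1-p}\right)$ | $\Phi^{-1}(p)$ |
| Error distribution | Logistic (heavier tails) | Standard Normal |
| Scale normalisation | $\text{Var}(\varepsilon) = \pi^{2}/3$ | $\text{Var}(\varepsilon) = 1$ |
| Coefficient scaling | $\beta_{\text{logit}} \approx 1.6\,\beta_{\text{probit}}$ | $\beta_{\text{probit}} \approx \beta_{\text{logit}} / 1.6$ |
| Closed-form probabilities | ✅ | ❌ (requires numerical integration) |
| Interpretability | Log-odds directly interpretable | Requires transformation |
| Tail behaviour | Heavier tails | Thinner tails |
The rule of thumb for converting coefficients: $\hat{\beta}_{\text{logit}} \approx 1.6\, \hat{\beta}_{\text{probit}}$.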
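The practical meaning of the conversion rule can be checked numerically. The sketch below assumes the common factor of 1.6 and compares the standard normal CDF with a rescaled logistic CDF over a grid of index values; the function names and the grid are illustrative choices, not part of DataStatPro.

```python
import math

def logistic_cdf(z: float) -> float:
    """Standard logistic CDF: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

SCALE = 1.6  # assumed rule-of-thumb ratio of logit to probit coefficients

# Largest gap between Phi(z) and Lambda(1.6 * z) on a grid over [-4, 4]
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(normal_cdf(z) - logistic_cdf(SCALE * z)) for z in grid)
print(f"max |Phi(z) - Lambda({SCALE} z)| on [-4, 4]: {max_gap:.4f}")
```

The gap stays below two percentage points everywhere on the grid, which is why logit and probit usually give very similar predicted probabilities once coefficients are rescaled.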
3.4 The Multinomial Logit Model
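For $J$ unordered alternatives, the Multinomial Logit (MNL) specifies, for alternative $j$ with reference category $J$:
$$P(Y_i = j \mid x_i) = \frac{\exp(x_i'\beta_j)}{\sum_{k=1}^{J} \exp(x_i'\beta_k)}$$
With the normalisation $\beta_J = 0$ (reference category), so:
$$P(Y_i = J \mid x_i) = \frac{1}{1 + \sum_{k=1}^{J-1} \exp(x_i'\beta_k)}$$
The log-odds relative to the reference category:
$$\ln\left[\frac{P(Y_i = j \mid x_i)}{P(Y_i = J \mid x_i)}\right] = x_i'\beta_j$$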
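As a minimal sketch of the MNL probabilities, the function below evaluates the choice probabilities for one individual, with the reference alternative's coefficients fixed at zero and placed last. The function name, coefficient values, and covariates are hypothetical.

```python
import math

def mnl_probabilities(x, betas):
    """MNL choice probabilities for one individual.

    x:     covariate list (include 1.0 for an intercept)
    betas: coefficient vectors for the non-reference alternatives;
           the reference alternative (utility 0) is appended last.
    """
    utilities = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    utilities.append(0.0)  # reference category: coefficients normalised to zero
    m = max(utilities)     # subtract the maximum for numerical stability
    expu = [math.exp(u - m) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]

x_i = [1.0, 2.0]                      # hypothetical: intercept and one covariate
betas = [[0.5, -0.2], [-1.0, 0.4]]    # hypothetical coefficients, alternatives 1 and 2
probs = mnl_probabilities(x_i, betas)
print(probs, sum(probs))
```

The log of `probs[0] / probs[2]` equals the linear index of the first alternative, which is the log-odds interpretation of the MNL coefficients.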
3.5 The Conditional Logit Model
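The Conditional Logit (CL) model (McFadden, 1974) allows attributes to vary across alternatives. The utility of alternative $j$ for individual $i$:
$$U_{ij} = z_{ij}'\gamma + x_i'\beta_j + \varepsilon_{ij}$$
Where $z_{ij}$ are alternative-specific attributes (e.g., price of alternative $j$, travel time of option $j$) with a common coefficient $\gamma$, and $x_i$ are individual-specific characteristics with alternative-specific coefficients $\beta_j$.
The choice probability:
$$P(Y_i = j) = \frac{\exp(z_{ij}'\gamma + x_i'\beta_j)}{\sum_{k=1}^{J} \exp(z_{ik}'\gamma + x_i'\beta_k)}$$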
3.6 The Ordered Logit Model
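For an ordinal outcome $Y_i \in \{1, 2, \dots, J\}$, the Ordered Logit (Proportional Odds) model uses a single latent variable:
$$Y_i^{*} = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim \text{Logistic}(0, 1)$$
With threshold (cut-point) parameters $\kappa_1 < \kappa_2 < \dots < \kappa_{J-1}$:
$$Y_i = j \quad \text{if} \quad \kappa_{j-1} < Y_i^{*} \le \kappa_j$$
Where $\kappa_0 = -\infty$ and $\kappa_J = +\infty$. The choice probabilities are:
$$P(Y_i = j \mid x_i) = \Lambda(\kappa_j - x_i'\beta) - \Lambda(\kappa_{j-1} - x_i'\beta)$$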
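The category probabilities are differences of adjacent cumulative logistic probabilities. A minimal sketch, assuming a scalar linear index and sorted cut-points (all numbers hypothetical):

```python
import math

def logistic_cdf(z):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(index, cutpoints):
    """P(Y = j), j = 1..J, given the index x'beta and sorted cut-points (J - 1 of them)."""
    # Cumulative probabilities P(Y <= j), padded with P(Y <= 0) = 0 and P(Y <= J) = 1
    cum = [0.0] + [logistic_cdf(k - index) for k in cutpoints] + [1.0]
    return [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical three-category example (low, medium, high)
probs = ordered_logit_probs(index=0.3, cutpoints=[-1.0, 1.0])
print(probs)
```

Raising the index shifts probability mass from the lowest category towards the highest, while the cut-points stay fixed.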
3.7 The Nested Logit Model
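The Nested Logit partitions the $J$ alternatives into $M$ mutually exclusive nests $B_1, \dots, B_M$. The choice probability for alternative $j$ in nest $B_m$:
$$P_{ij} = P_{ij \mid B_m} \cdot P_{iB_m} = \frac{\exp(V_{ij}/\lambda_m)}{\sum_{k \in B_m} \exp(V_{ik}/\lambda_m)} \cdot \frac{\exp(W_{im} + \lambda_m I_{im})}{\sum_{l=1}^{M} \exp(W_{il} + \lambda_l I_{il})}$$
Where $I_{im} = \ln \sum_{k \in B_m} \exp(V_{ik}/\lambda_m)$ is the inclusive value (log-sum), $\lambda_m$ is the dissimilarity parameter for nest $B_m$, and $W_{im}$ contains nest-level attributes.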
3.8 The Mixed Logit Model
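The Mixed Logit (also called the Random Parameters Logit) allows coefficients to vary across individuals:
$$U_{ij} = x_{ij}'\beta_i + \varepsilon_{ij}$$
Where $\beta_i \sim f(\beta \mid \theta)$ are individual-specific random coefficients drawn from a mixing distribution (typically normal or log-normal).
The unconditional choice probability integrates over the random coefficient distribution:
$$P_{ij} = \int \frac{\exp(x_{ij}'\beta)}{\sum_{k=1}^{J} \exp(x_{ik}'\beta)}\, f(\beta \mid \theta)\, d\beta$$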
This integral has no closed form and is evaluated by simulation (see Section 16.7).
4. Key Assumptions
4.1 Independence of Irrelevant Alternatives (IIA)
The most important and controversial assumption in multinomial logit models is the Independence of Irrelevant Alternatives (IIA):
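The ratio of probabilities for any two alternatives $j$ and $k$ is independent of all other alternatives in the choice set.
Formally, for the MNL model:
$$\frac{P(Y_i = j \mid x_i)}{P(Y_i = k \mid x_i)} = \frac{\exp(x_i'\beta_j)}{\exp(x_i'\beta_k)} = \exp\left(x_i'(\beta_j - \beta_k)\right)$$
This ratio depends only on $\beta_j$ and $\beta_k$, not on any other alternative $l \ne j, k$.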
The Red Bus / Blue Bus Problem: A classic IIA failure. Suppose individuals choose between Car and Red Bus with equal probability (50/50). If a Blue Bus (identical to Red Bus except colour) is added, IIA predicts all three have 1/3 probability each. But intuitively, the split should be 50% car and 50% bus (25% red + 25% blue). IIA allocates "competition" uniformly across all alternatives rather than within similar alternatives.
IIA is implied by: The Type I Extreme Value (Gumbel) distributional assumption on $\varepsilon_{ij}$ and the independence of $\varepsilon_{ij}$ across alternatives.
IIA fails when: Alternatives are correlated substitutes — i.e., some alternatives are more similar to each other than to others. In such cases, the error terms are correlated across alternatives.
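The Red Bus / Blue Bus arithmetic can be reproduced with a few lines of code. This is a minimal sketch that assumes all systematic utilities are equal (set to zero); the helper name `mnl_shares` is illustrative.

```python
import math

def mnl_shares(v):
    """MNL choice shares from a dict mapping alternative name to systematic utility."""
    expv = [math.exp(u) for u in v.values()]
    total = sum(expv)
    return {name: e / total for name, e in zip(v, expv)}

before = mnl_shares({"car": 0.0, "red_bus": 0.0})
after = mnl_shares({"car": 0.0, "red_bus": 0.0, "blue_bus": 0.0})

# IIA: the car / red-bus probability ratio is unchanged by adding the blue bus
ratio_before = before["car"] / before["red_bus"]
ratio_after = after["car"] / after["red_bus"]
print(before, after, ratio_before, ratio_after)
```

The car share falls from one half to one third even though the new alternative is a perfect substitute for the existing bus, which is exactly the implausible prediction described above.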
4.2 The Proportional Odds Assumption
For the Ordered Logit model, the proportional odds assumption (also called parallel lines assumption) requires that the effect of covariates on the log-odds is constant across all thresholds:
$$\ln\left[\frac{P(Y_i \le j \mid x_i)}{P(Y_i > j \mid x_i)}\right] = \kappa_j - x_i'\beta, \qquad j = 1, \dots, J-1$$
The coefficient vector $\beta$ is the same for all outcome categories; only the intercept (threshold) $\kappa_j$ changes. This is a strong assumption that should be explicitly tested (see Section 10.3).
4.3 Random Utility Consistency
For the RUM foundation to be valid:
- Completeness: Decision-makers can rank all alternatives.
- Transitivity: Preferences are transitive (if $A \succ B$ and $B \succ C$, then $A \succ C$).
- Utility maximisation: Decision-makers always choose the alternative with the highest utility.
- Stable preferences: Preferences do not change during the observation period.
4.4 Independence of Observations
Standard discrete choice models assume independent observations across individuals. In panel data (Section 14), this assumption is relaxed by allowing within-individual correlation across repeated choices.
4.5 Correct Specification of the Choice Set
The model assumes:
- All relevant alternatives are included in the choice set.
- The choice set is the same for all individuals (or, in some extensions, individual-specific choice sets are correctly specified).
- No irrelevant alternatives contaminate the model.
5. Identification and Causal Inference
5.1 What Discrete Choice Models Identify
Identification in discrete choice models means the ability to recover the structural parameters from the data. Key identification conditions:
- Scale normalisation: The scale of the latent utility is unidentified. In the binary probit, $\text{Var}(\varepsilon_i) = 1$ is imposed; in logit, $\text{Var}(\varepsilon_i) = \pi^{2}/3$ is imposed.
- Location normalisation: Only utility differences are identified, not absolute utility levels. In the MNL, one alternative's coefficient vector is normalised to zero.
- No perfect multicollinearity: Covariates must not be perfectly linearly dependent.
- Exclusion of constants: In models with alternative-specific attributes only, a constant cannot be separately identified from the normalisation.
5.2 Endogeneity in Discrete Choice Models
Endogeneity arises when a regressor is correlated with the unobserved utility component $\varepsilon_{ij}$. Common sources:
- Omitted variables: Unobserved factors affecting both the covariate and the choice.
- Simultaneity: The choice affects the covariate (e.g., price determined by anticipated demand).
- Measurement error: Classical measurement error attenuates estimates toward zero.
Consequences: MLE estimates are inconsistent under endogeneity — standard corrections are required.
Remedies:
- Control Function Approach: Add the residuals from an auxiliary regression of the endogenous variable on instruments as an additional regressor in the discrete choice model.
- IV Probit / IV Logit: Two-stage estimation using valid instruments.
- Berry-Levinsohn-Pakes (BLP): For market-level discrete choice with endogenous prices, uses product-level instrumental variables.
5.3 Average Partial Effects vs. Structural Parameters
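In discrete choice models, the raw coefficients are not directly interpretable as marginal effects. The Average Partial Effect (APE) of continuous covariate $x_k$ on $P(Y = 1)$ is:
$$APE_k = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial P(Y_i = 1 \mid x_i)}{\partial x_{ik}}$$
For binary logit:
$$APE_k = \hat{\beta}_k \cdot \frac{1}{n} \sum_{i=1}^{n} \Lambda(x_i'\hat{\beta})\left[1 - \Lambda(x_i'\hat{\beta})\right]$$
For binary probit:
$$APE_k = \hat{\beta}_k \cdot \frac{1}{n} \sum_{i=1}^{n} \phi(x_i'\hat{\beta})$$
The APE (also called the Average Marginal Effect, AME) is the primary reported quantity of interest in discrete choice models, analogous to the regression coefficient in OLS.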
5.4 Partial Effects at the Mean (PEM) and Partial Effects at Representative Values
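Partial Effect at the Mean (PEM): Evaluate the marginal effect at the sample mean $\bar{x}$:
$$PEM_k = \left.\frac{\partial P(Y = 1 \mid x)}{\partial x_k}\right|_{x = \bar{x}}$$
For binary logit:
$$PEM_k = \hat{\beta}_k\, \Lambda(\bar{x}'\hat{\beta})\left[1 - \Lambda(\bar{x}'\hat{\beta})\right]$$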
⚠️ The PEM evaluates the marginal effect at a potentially non-existent "average individual." The APE is generally preferred because it averages the marginal effect across actual observations, accounting for the non-linearity of the model.
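The difference between the APE and the PEM is easy to see numerically. The sketch below assumes a logit with an intercept and one continuous covariate; the coefficient values and the covariate sample are hypothetical.

```python
import math

def logistic_cdf(z):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

beta0, beta1 = -2.0, 1.0           # hypothetical logit coefficients
x = [0.0, 1.0, 2.0, 3.0, 4.0]      # hypothetical covariate values

def marginal_effect(xi):
    """Logit marginal effect of x at the value xi: beta1 * p * (1 - p)."""
    p = logistic_cdf(beta0 + beta1 * xi)
    return beta1 * p * (1.0 - p)

ape = sum(marginal_effect(xi) for xi in x) / len(x)   # average of individual effects
x_bar = sum(x) / len(x)
pem = marginal_effect(x_bar)                          # effect at the sample mean
print(f"APE = {ape:.4f}, PEM = {pem:.4f}")
```

Here the mean of `x` sits exactly where the logistic curve is steepest, so the PEM overstates the effect experienced by a typical observation, while the APE averages over the flatter regions as well.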
5.5 Willingness to Pay (WTP) in Choice Models
In models with a cost or price attribute (e.g., transport cost, product price), the Willingness to Pay (WTP) for a change in attribute $k$ is:
$$WTP_k = -\frac{\beta_k}{\beta_{\text{cost}}}$$
Where $\beta_k$ is the coefficient on attribute $k$ and $\beta_{\text{cost}}$ is the coefficient on cost. This ratio gives the marginal rate of substitution between attribute $k$ and money, a central output of stated preference and transport choice studies.
6. Binary Choice Models: Logit and Probit
6.1 The Log-Likelihood for Binary Models
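For a binary outcome $Y_i \in \{0, 1\}$, the log-likelihood is:
$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \ln F(x_i'\beta) + (1 - y_i) \ln\left(1 - F(x_i'\beta)\right) \right]$$
Where $F(x_i'\beta) = P(Y_i = 1 \mid x_i)$.
For logit: $F = \Lambda$, giving:
$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i\, x_i'\beta - \ln\left(1 + \exp(x_i'\beta)\right) \right]$$
For probit: $F = \Phi$, giving:
$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \ln \Phi(x_i'\beta) + (1 - y_i) \ln\left(1 - \Phi(x_i'\beta)\right) \right]$$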
Both log-likelihoods are globally concave, ensuring a unique maximum.
6.2 The Score and Hessian
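The score vector (gradient of the log-likelihood), writing $F_i = F(x_i'\beta)$ and $f_i = F'(x_i'\beta)$:
$$s(\beta) = \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{n} \frac{(y_i - F_i)\, f_i}{F_i (1 - F_i)}\, x_i$$
For logit, this simplifies elegantly because $f_i = \Lambda_i (1 - \Lambda_i)$, where $\Lambda_i = \Lambda(x_i'\beta)$, so that $s(\beta) = \sum_{i} (y_i - \Lambda_i)\, x_i$.
The Hessian matrix (second derivative):
$$H(\beta) = \frac{\partial^{2} \ell}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^{n} w_i\, x_i x_i'$$
Where $w_i = \Lambda_i (1 - \Lambda_i)$ for logit and $w_i = \phi_i^{2} / [\Phi_i (1 - \Phi_i)]$ for probit (for probit this is the expected Hessian, as used in Fisher scoring). With a full-rank covariate matrix, the negative Hessian is positive definite, confirming concavity.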
6.3 Newton-Raphson and IRLS Estimation
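The MLE is obtained iteratively using Newton-Raphson (for the logit this is equivalent to Iteratively Reweighted Least Squares, IRLS; for the probit, IRLS corresponds to Fisher scoring):
$$\beta^{(t+1)} = \beta^{(t)} - \left[H(\beta^{(t)})\right]^{-1} s(\beta^{(t)})$$
IRLS Interpretation: At each iteration, solve a weighted OLS problem:
$$\beta^{(t+1)} = \left(X' W^{(t)} X\right)^{-1} X' W^{(t)} z^{(t)}$$
Where $W^{(t)} = \text{diag}(w_i^{(t)})$ is a diagonal weight matrix and $z_i^{(t)} = x_i'\beta^{(t)} + (y_i - F_i^{(t)}) / f_i^{(t)}$ is the adjusted dependent variable.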
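The Newton-Raphson update can be sketched for the simplest case, a logit with an intercept and a single covariate, where the 2×2 system is solved by hand. The function name, data, and tolerance are illustrative assumptions, not the estimation routine used by DataStatPro.

```python
import math

def fit_logit_newton(x, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for a logit with an intercept and one slope."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Score s = sum (y_i - p_i) x_i; negative Hessian A = sum w_i x_i x_i'
        s0 = s1 = a00 = a01 = a11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            s0 += yi - p
            s1 += (yi - p) * xi
            a00 += w
            a01 += w * xi
            a11 += w * xi * xi
        det = a00 * a11 - a01 * a01
        # Newton step: beta_new = beta + A^{-1} s (because H = -A)
        d0 = (a11 * s0 - a01 * s1) / det
        d1 = (-a01 * s0 + a00 * s1) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical data with overlapping outcomes (no separation)
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logit_newton(x, y)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

At convergence the score is numerically zero, so the fitted probabilities sum to the number of observed ones, a property specific to the logit with an intercept.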
6.4 Asymptotic Properties of MLE
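Under regularity conditions, the MLE is consistent and asymptotically normal:
$$\sqrt{n}\,(\hat{\beta} - \beta_0) \xrightarrow{d} N\left(0,\; I(\beta_0)^{-1}\right)$$
Where the Fisher information matrix (per observation) is:
$$I(\beta_0) = -E\left[\frac{\partial^{2} \ln P(Y_i = y_i \mid x_i; \beta)}{\partial \beta\, \partial \beta'}\right]_{\beta = \beta_0}$$
The variance-covariance matrix of $\hat{\beta}$ is estimated by the inverse of the observed information matrix:
$$\widehat{\text{Var}}(\hat{\beta}) = \left[-H(\hat{\beta})\right]^{-1}$$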
6.5 Interpreting Logit Coefficients as Odds Ratios
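For the logit model, exponentiating the coefficient gives the odds ratio:
$$OR_k = \exp(\beta_k)$$
Interpretation: A one-unit increase in $x_k$ multiplies the odds of $Y = 1$ by $\exp(\beta_k)$:
- If $OR_k > 1$ ($\beta_k > 0$): $x_k$ increases the odds of $Y = 1$.
- If $OR_k < 1$ ($\beta_k < 0$): $x_k$ decreases the odds of $Y = 1$.
- If $OR_k = 1$ ($\beta_k = 0$): No effect on odds.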
⚠️ Odds ratios are not the same as probability ratios (relative risks). Do not interpret the odds ratio as "X% more likely." Convert to marginal probabilities via the APE for clearer communication.
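A short numerical sketch makes the warning concrete. It assumes a logit with a binary covariate and hypothetical coefficients, then compares the odds ratio with the ratio of probabilities.

```python
import math

def logistic_cdf(z):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

beta0, beta1 = -1.0, 0.7   # hypothetical intercept and coefficient on a binary covariate

p0 = logistic_cdf(beta0)            # P(Y = 1) when the covariate is 0
p1 = logistic_cdf(beta0 + beta1)    # P(Y = 1) when the covariate is 1

odds_ratio = (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))
relative_risk = p1 / p0
risk_difference = p1 - p0
print(f"OR = {odds_ratio:.3f}, RR = {relative_risk:.3f}, difference = {risk_difference:.3f}")
```

The odds ratio equals `exp(beta1)` regardless of the baseline, whereas the relative risk and the probability difference both depend on where on the curve the comparison is made.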
6.6 Marginal Effects in Binary Models
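For a continuous covariate $x_k$, the marginal effect of $x_k$ on $P(Y_i = 1)$ for individual $i$:
Logit:
$$\frac{\partial P(Y_i = 1 \mid x_i)}{\partial x_{ik}} = \beta_k\, \Lambda(x_i'\beta)\left[1 - \Lambda(x_i'\beta)\right]$$
Probit:
$$\frac{\partial P(Y_i = 1 \mid x_i)}{\partial x_{ik}} = \beta_k\, \phi(x_i'\beta)$$
For a discrete/binary covariate $d$, the marginal effect is the discrete change in predicted probability:
$$\Delta P_i = P(Y_i = 1 \mid x_i, d_i = 1) - P(Y_i = 1 \mid x_i, d_i = 0)$$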
6.7 Standard Errors for Average Partial Effects (Delta Method)
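The APE is a nonlinear function of $\hat{\beta}$. Standard errors are obtained via the delta method:
$$\widehat{\text{Var}}(\widehat{APE}_k) \approx G_k\, \widehat{\text{Var}}(\hat{\beta})\, G_k'$$
Where $G_k = \partial APE_k / \partial \beta'$ is the gradient of the APE with respect to the coefficient vector. Alternatively, use the bootstrap for more reliable inference with small samples.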
7. Hypothesis Testing and Inference
7.1 The Wald Test
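The Wald test for $H_0: \beta_k = 0$ uses the asymptotic normality of the MLE:
$$z = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)} \overset{a}{\sim} N(0, 1)$$
Or equivalently, $W = z^{2} \overset{a}{\sim} \chi^{2}_{1}$. For a vector of $q$ restrictions $H_0: R\beta = r$:
$$W = (R\hat{\beta} - r)' \left[R\, \widehat{\text{Var}}(\hat{\beta})\, R'\right]^{-1} (R\hat{\beta} - r) \overset{a}{\sim} \chi^{2}_{q}$$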
7.2 The Likelihood Ratio Test
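The Likelihood Ratio (LR) test compares a restricted model (imposing $H_0$) to an unrestricted model:
$$LR = -2\left[\ell(\hat{\beta}_R) - \ell(\hat{\beta}_U)\right] \overset{a}{\sim} \chi^{2}_{q}$$
Where $q$ is the number of restrictions. The LR test is generally preferred over the Wald test because it is invariant to reparameterisation and often has better finite-sample properties.
Special case: The LR test comparing a model with covariates to an intercept-only model:
$$LR = -2\left[\ell_0 - \ell(\hat{\beta})\right]$$
Where $\ell_0 = n\left[\bar{y} \ln \bar{y} + (1 - \bar{y}) \ln(1 - \bar{y})\right]$ and $\bar{y}$ is the sample proportion.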
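The LR statistic against the intercept-only model can be illustrated with grouped data. The sketch relies on the fact that a logit with an intercept and a single dummy reproduces each group's observed proportion, so no optimiser is needed; the counts are hypothetical, and the p-value formula applies only to one restriction.

```python
import math

def binary_loglik(ones, size, p):
    """Log-likelihood of `size` Bernoulli trials with `ones` successes at probability p."""
    return ones * math.log(p) + (size - ones) * math.log(1.0 - p)

# Hypothetical data: (number of ones, group size) for dummy d = 0 and d = 1
groups = {0: (2, 10), 1: (7, 10)}
n = sum(size for _, size in groups.values())
y_bar = sum(ones for ones, _ in groups.values()) / n

# Null log-likelihood from the sample proportion
ll_null = n * (y_bar * math.log(y_bar) + (1.0 - y_bar) * math.log(1.0 - y_bar))
# Intercept + dummy logit: fitted probabilities equal the group proportions
ll_model = sum(binary_loglik(ones, size, ones / size) for ones, size in groups.values())

lr_stat = -2.0 * (ll_null - ll_model)
# Upper tail of chi-squared with 1 degree of freedom: P(Z^2 > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2.0))
print(f"LR = {lr_stat:.3f}, p = {p_value:.4f}")
```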
7.3 The Score (Lagrange Multiplier) Test
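The Score test (Rao test) only requires estimating the restricted model. With $\tilde{\beta}$ denoting the restricted MLE:
$$LM = s(\tilde{\beta})' \left[-H(\tilde{\beta})\right]^{-1} s(\tilde{\beta}) \overset{a}{\sim} \chi^{2}_{q}$$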
Useful when the unrestricted model is computationally expensive to estimate.
7.4 Test Equivalences and Recommendations
| Test | Requires | Best For | Invariant to Reparameterisation? |
|---|---|---|---|
| Wald | Unrestricted model only | Single coefficient tests | ❌ |
| Likelihood Ratio | Both models | Nested model comparison | ✅ |
| Score (LM) | Restricted model only | Adding variables to a model | ✅ |
The three tests are asymptotically equivalent but differ in finite samples. The LR test is generally most reliable.
7.5 Confidence Intervals
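A Wald $100(1 - \alpha)\%$ confidence interval for $\beta_k$:
$$\hat{\beta}_k \pm z_{1 - \alpha/2} \cdot SE(\hat{\beta}_k)$$
A profile likelihood confidence interval (more reliable for small samples), where $\ell_p(\beta_k)$ is the log-likelihood maximised over all other parameters with $\beta_k$ held fixed:
$$\left\{ \beta_k : 2\left[\ell(\hat{\beta}) - \ell_p(\beta_k)\right] \le \chi^{2}_{1,\, 1 - \alpha} \right\}$$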
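The Wald statistic, its two-sided p-value, and the Wald interval follow directly from an estimate and its standard error. A minimal sketch using only the standard library; the function name and the numeric inputs are hypothetical.

```python
from statistics import NormalDist

def wald_summary(beta_hat, se, alpha=0.05):
    """Wald z statistic, two-sided p-value, and (1 - alpha) confidence interval."""
    std_normal = NormalDist()
    z = beta_hat / se
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return z, p_value, (beta_hat - crit * se, beta_hat + crit * se)

# Hypothetical coefficient estimate and standard error
z, p_value, ci = wald_summary(beta_hat=0.5, se=0.25)
print(f"z = {z:.2f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

For an odds-ratio interval, exponentiate both endpoints rather than applying the delta method to the exponentiated coefficient; this keeps the interval inside the positive range.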
7.6 Testing IIA with the Hausman-McFadden Test
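The Hausman-McFadden test for IIA in the MNL compares the full-sample MNL estimates to estimates obtained after removing one alternative from the choice set:
$$HM = (\hat{\beta}_R - \hat{\beta}_F)' \left[\hat{V}_R - \hat{V}_F\right]^{-1} (\hat{\beta}_R - \hat{\beta}_F) \overset{a}{\sim} \chi^{2}_{q}$$
Where $\hat{\beta}_R$ (with covariance $\hat{V}_R$) are estimates from the restricted choice set, $\hat{\beta}_F$ (with covariance $\hat{V}_F$) are the corresponding estimates from the full choice set, and $q$ is the number of coefficients compared. Rejection suggests IIA violation.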
⚠️ The Hausman-McFadden test has poor finite-sample properties and can produce negative test statistics. The Small-Hsiao test offers an alternative. Neither test is definitive. Subject-matter knowledge about alternative similarity remains essential.
7.7 Testing the Proportional Odds Assumption
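The Brant test for the proportional odds assumption in ordered logit estimates a separate binary logit for each cumulative split and tests whether the coefficients are equal across splits. With $\beta^{(j)}$ denoting the coefficient vector from the $j$-th split:
$$H_0: \beta^{(1)} = \beta^{(2)} = \dots = \beta^{(J-1)}$$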
A chi-squared test statistic is formed from the sum of squared differences in estimates across cumulative splits, weighted by their precision. Rejection indicates the proportional odds assumption is violated, and a generalised ordered logit or multinomial logit should be considered.
7.8 Robust Standard Errors in Discrete Choice Models
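While MLE standard errors are derived from the information matrix, misspecification-robust (sandwich) standard errors are available:
$$\hat{V}_{\text{robust}} = \hat{A}^{-1} \hat{B} \hat{A}^{-1}$$
Where $\hat{A} = -H(\hat{\beta})$ and $\hat{B} = \sum_{i=1}^{n} s_i(\hat{\beta})\, s_i(\hat{\beta})'$ (the outer product of scores).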
- Use heteroscedasticity-robust (sandwich) SEs when the distributional assumption may be misspecified.
- Use cluster-robust SEs when observations are grouped (e.g., individuals within firms or households within regions).
8. Effect Size Measures
8.1 Average Partial Effects (APE / AME)
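The primary effect size in discrete choice models is the Average Partial Effect (APE), also called the Average Marginal Effect (AME):
$$APE_k = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial P(Y_i = j \mid x_i)}{\partial x_{ik}}$$
Interpretation: The average change in the probability of outcome $j$ associated with a one-unit increase in $x_k$, averaging over all individuals in the sample.
For a binary covariate $d$:
$$APE_d = \frac{1}{n} \sum_{i=1}^{n} \left[ P(Y_i = j \mid x_i, d_i = 1) - P(Y_i = j \mid x_i, d_i = 0) \right]$$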
8.2 Odds Ratios and Relative Risk
| Measure | Formula | Interpretation |
|---|---|---|
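| Odds Ratio (OR) | $\exp(\beta_k)$ | Multiplicative change in odds per unit increase in $x_k$ |
| Relative Risk (RR) | $P(Y = 1 \mid x^{(1)}) / P(Y = 1 \mid x^{(0)})$ | Ratio of probabilities; computed at representative values |
| Absolute Risk Reduction (ARR) | $P(Y = 1 \mid d = 0) - P(Y = 1 \mid d = 1)$ | Difference in probabilities for binary $d$ |
| Number Needed to Treat | $1 / \lvert ARR \rvert$ | Number of individuals who must be treated to change one outcome |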
8.3 Predicted Probability Changes
For practical communication, report predicted probabilities at meaningful covariate values:
$$\Delta \hat{P} = \hat{P}(Y = 1 \mid x^{(1)}) - \hat{P}(Y = 1 \mid x^{(0)})$$
Where $x^{(1)}$ and $x^{(0)}$ represent two substantively meaningful covariate profiles (e.g., high-income vs. low-income; treated vs. untreated).
8.4 Standardised Coefficients in Discrete Choice Models
To compare the relative importance of different covariates, rescale the APE by the standard deviations of the covariate and the outcome:
$$APE_k^{\text{std}} = APE_k \cdot \frac{\sigma_{x_k}}{\sigma_Y}$$
Where $\sigma_{x_k}$ is the standard deviation of $x_k$ and $\sigma_Y = \sqrt{\bar{y}(1 - \bar{y})}$ for binary outcomes. This produces an effect size interpretable as the change in probability (in units of the outcome SD) per SD change in $x_k$.
8.5 McFadden's Pseudo-$R^2$ as Effect Size
McFadden's pseudo-$R^2$ measures the proportional improvement in log-likelihood:
$$R^{2}_{McF} = 1 - \frac{\ell(\hat{\beta})}{\ell_0}$$
Where $\ell_0$ is the log-likelihood of the intercept-only model. While not a pure effect size, it provides a scale for comparing model fit improvement:
| Interpretation | |
|---|---|
| Poor fit | |
| Acceptable fit | |
| Good fit | |
| Very good fit | |
8.6 Willingness to Pay (WTP) as Effect Size in Choice Experiments
In stated or revealed preference studies, WTP contextualises effect sizes economically:
$$WTP_k = -\frac{\hat{\beta}_k}{\hat{\beta}_{\text{cost}}}$$
Report WTP with confidence intervals obtained via the delta method or Krinsky-Robb simulation.
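A delta-method standard error for WTP needs only the two coefficients and their covariance entries. The sketch below assumes WTP is defined as the negative ratio of the attribute coefficient to the cost coefficient; the function name and all numeric inputs are hypothetical.

```python
import math

def wtp_delta(b_attr, b_cost, var_attr, var_cost, cov):
    """WTP = -b_attr / b_cost with a delta-method standard error."""
    wtp = -b_attr / b_cost
    # Gradient of WTP with respect to (b_attr, b_cost)
    g_attr = -1.0 / b_cost
    g_cost = b_attr / b_cost ** 2
    var = g_attr ** 2 * var_attr + g_cost ** 2 * var_cost + 2.0 * g_attr * g_cost * cov
    return wtp, math.sqrt(var)

# Hypothetical estimates: comfort attribute and cost coefficient
wtp, se = wtp_delta(b_attr=0.6, b_cost=-0.2, var_attr=0.01, var_cost=0.0025, cov=0.0)
print(f"WTP = {wtp:.2f}, SE = {se:.3f}, 95% CI = ({wtp - 1.96 * se:.2f}, {wtp + 1.96 * se:.2f})")
```

When the cost coefficient is imprecisely estimated the ratio can be far from normal, which is one reason simulation-based intervals such as Krinsky-Robb are often reported alongside the delta method.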
9. Model Fit and Evaluation
9.1 Goodness-of-Fit Statistics
| Statistic | Formula | Description |
|---|---|---|
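| Log-likelihood at convergence | $\ell(\hat{\beta})$ | Higher (less negative) is better |
| Null log-likelihood | $\ell_0$ | Baseline (intercept-only) |
| LR chi-squared | $-2[\ell_0 - \ell(\hat{\beta})]$ | Overall model fit test |
| McFadden's $R^2$ | $1 - \ell(\hat{\beta}) / \ell_0$ | Proportional LL improvement |
| Adjusted McFadden's $R^2$ | $1 - (\ell(\hat{\beta}) - K) / \ell_0$ | Penalised for the number of parameters $K$ |
| AIC | $-2\ell(\hat{\beta}) + 2K$ | Lower is better |
| BIC | $-2\ell(\hat{\beta}) + K \ln n$ | Lower is better; penalises more |
| Count $R^2$ | Correctly classified / $n$ | Naive classification accuracy |
| Hit rate (vs. base) | Count $R^2$ vs. $\max(\bar{y}, 1 - \bar{y})$ | Improvement over naive classifier |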
9.2 Pseudo-R² Measures
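Multiple pseudo-$R^2$ measures exist; they capture different aspects of fit:
McFadden (1974):
$$R^{2}_{McF} = 1 - \frac{\ell(\hat{\beta})}{\ell_0}$$
Cox-Snell:
$$R^{2}_{CS} = 1 - \exp\left(\frac{2}{n}\left[\ell_0 - \ell(\hat{\beta})\right]\right)$$
Nagelkerke (normalised Cox-Snell):
$$R^{2}_{N} = \frac{R^{2}_{CS}}{1 - \exp(2\ell_0 / n)}$$
⚠️ No single pseudo-$R^2$ is universally "correct." Report multiple, and always prefer out-of-sample predictive performance metrics (AUC, Brier score) for evaluating predictive models.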
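The three measures can be computed from the model and null log-likelihoods alone. A minimal sketch, assuming the standard McFadden, Cox-Snell, and Nagelkerke definitions; the function name and the log-likelihood values are hypothetical.

```python
import math

def pseudo_r2(ll_model, ll_null, n):
    """McFadden, Cox-Snell, and Nagelkerke pseudo-R-squared from log-likelihoods."""
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Nagelkerke rescales Cox-Snell by its maximum attainable value
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / n))
    return {"mcfadden": mcfadden, "cox_snell": cox_snell, "nagelkerke": nagelkerke}

# Hypothetical fitted and intercept-only log-likelihoods for n = 20 observations
r2 = pseudo_r2(ll_model=-11.11, ll_null=-13.76, n=20)
print({name: round(value, 4) for name, value in r2.items()})
```

The same model yields three noticeably different numbers, which is why the measure being reported should always be named.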
9.3 Classification Metrics for Binary Models
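For binary models, at a threshold $c$ (default $c = 0.5$), with TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives:
| Metric | Formula | Description |
|---|---|---|
| Accuracy | $(TP + TN) / n$ | Overall correct classification rate |
| Sensitivity (Recall) | $TP / (TP + FN)$ | True positive rate |
| Specificity | $TN / (TN + FP)$ | True negative rate |
| Precision (PPV) | $TP / (TP + FP)$ | Positive predictive value |
| F1 Score | $2 \cdot \text{Precision} \cdot \text{Recall} / (\text{Precision} + \text{Recall})$ | Harmonic mean of precision and recall |
| AUC-ROC | Area under ROC curve | Discrimination across all thresholds |
The Receiver Operating Characteristic (ROC) curve plots sensitivity vs. $1 - \text{specificity}$ across all classification thresholds $c \in [0, 1]$. The Area Under the Curve (AUC) summarises discrimination: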
| AUC | Interpretation |
|---|---|
| $0.5$ | No discrimination (random) |
| $0.7$ to $0.8$ | Acceptable discrimination |
| $0.8$ to $0.9$ | Excellent discrimination |
| $> 0.9$ | Outstanding discrimination |
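The AUC has a direct probabilistic reading: it is the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative one. The sketch below computes it by brute-force pairwise comparison, which is fine for small samples; the function name and data are illustrative.

```python
def auc_pairwise(y, p):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and predicted probabilities
y = [0, 0, 1, 1]
p_hat = [0.1, 0.4, 0.35, 0.8]
auc = auc_pairwise(y, p_hat)
print(auc)  # 3 of the 4 positive-negative pairs are ordered correctly: 0.75
```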
9.4 Calibration
Calibration assesses whether predicted probabilities match observed outcome rates.
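Hosmer-Lemeshow test: Partition observations into $G$ (typically 10) quantile groups by predicted probability. Compare observed and expected counts in each group:
$$\hat{C} = \sum_{g=1}^{G} \frac{(O_g - E_g)^{2}}{E_g\,(1 - E_g / n_g)} \overset{a}{\sim} \chi^{2}_{G-2}$$
Where $O_g$ and $E_g$ are observed and expected counts of $Y = 1$ in group $g$, and $n_g$ is the number of observations in the group. Rejection suggests poor calibration.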
Calibration plot: Plot mean predicted probability vs. observed proportion in each decile group. A well-calibrated model lies along the 45° diagonal.
9.5 Information Criteria for Model Comparison
When comparing non-nested models (e.g., logit vs. probit; different covariate sets), with $K$ estimated parameters:
$$AIC = -2\ell(\hat{\beta}) + 2K, \qquad BIC = -2\ell(\hat{\beta}) + K \ln n$$
Lower values indicate better fit. BIC imposes a heavier penalty on model complexity, favouring parsimony. AIC and BIC are only directly comparable for models fitted to the same dataset with the same outcome variable.
9.6 Out-of-Sample Validation
For predictive models, always assess performance on held-out data:
- $k$-fold cross-validation: Partition data into $k$ folds; train on $k - 1$ and test on 1; rotate and average performance metrics.
- Train-test split: Randomly assign 70-80% to training and 20-30% to test.
- Temporal split: For time-indexed data, train on earlier periods and test on later periods.
- Brier score: Mean squared error for probability predictions: $BS = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^{2}$.
10. Diagnostics and Assumption Testing
10.1 Residuals in Discrete Choice Models
Unlike OLS, residuals in discrete choice models require careful definition.
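Pearson residuals:
$$r_i^{P} = \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i (1 - \hat{p}_i)}}$$
Deviance residuals:
$$r_i^{D} = \text{sign}(y_i - \hat{p}_i) \sqrt{-2\left[ y_i \ln \hat{p}_i + (1 - y_i) \ln(1 - \hat{p}_i) \right]}$$
The deviance (sum of squared deviance residuals) is:
$$D = \sum_{i=1}^{n} (r_i^{D})^{2} = -2\,\ell(\hat{\beta})$$
Standardised residuals for outlier detection: $r_i^{S} = r_i^{P} / \sqrt{1 - h_{ii}}$, where $h_{ii}$ is the hat-value (leverage).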
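Both residual types can be computed directly from the outcomes and fitted probabilities. The sketch below uses hypothetical fitted values and checks the identity that, for ungrouped binary data, the squared deviance residuals sum to minus twice the log-likelihood.

```python
import math

def pearson_residual(y, p):
    """(y - p) scaled by the Bernoulli standard deviation."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y, p):
    """Signed square root of the observation's contribution to the deviance."""
    ll_i = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return math.copysign(math.sqrt(-2.0 * ll_i), y - p)

# Hypothetical outcomes and fitted probabilities
y = [1, 0, 1, 0]
p_hat = [0.8, 0.3, 0.4, 0.1]

pearson = [pearson_residual(yi, pi) for yi, pi in zip(y, p_hat)]
dev = [deviance_residual(yi, pi) for yi, pi in zip(y, p_hat)]
loglik = sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi) for yi, pi in zip(y, p_hat))
deviance = sum(d ** 2 for d in dev)
print(pearson, dev, deviance, -2.0 * loglik)
```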
10.2 Influence and Leverage
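Leverage in logit/probit, with $\hat{W} = \text{diag}(\hat{w}_i)$ the estimated weight matrix:
$$h_{ii} = \hat{w}_i\, x_i' (X' \hat{W} X)^{-1} x_i$$
Cook's distance analogue, with $K$ the number of estimated parameters:
$$C_i = \frac{(r_i^{P})^{2}\, h_{ii}}{K\,(1 - h_{ii})^{2}}$$
DFFITS and DFBETAS analogues are available for identifying influential observations. Flag observations with unusually high leverage or Cook's distance for inspection; commonly used rules of thumb are $h_{ii} > 2K/n$ or $C_i > 4/n$.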
10.3 Testing the Proportional Odds Assumption (Ordered Logit)
Brant (1990) test: Estimates a binary logit for each of the $J - 1$ cumulative dichotomisations and tests whether coefficients are equal. Available both as a global test (all covariates) and variable-specific tests:
$$H_0: \beta^{(1)} = \beta^{(2)} = \dots = \beta^{(J-1)}$$
Graphical check: Plot ordered logit coefficients estimated separately for each binary cumulative split. Coefficients that vary substantially suggest a violation.
Remedy if violated:
- Generalised Ordered Logit (partial proportional odds): Allow some (but not all) coefficients to vary across thresholds.
- Multinomial Logit: Drop the ordinal structure entirely; less efficient but unrestricted.
- Stereotype Logit (reduced-rank MNL): Intermediate model allowing partial ordering.
10.4 Testing IIA
Multiple tests for IIA are available, each with limitations:
| Test | Method | Reference | Limitations |
|---|---|---|---|
| Hausman-McFadden | Compare restricted vs. full estimates | Hausman & McFadden (1984) | Can yield negative test statistic |
| Small-Hsiao | Random sample split + comparison | Small & Hsiao (1985) | Sample-split dependent |
| Swait-Louviere | Scaling test across datasets | Swait & Louviere (1993) | Requires two datasets |
Remedy if IIA fails:
- Nested Logit: Group correlated alternatives into nests.
- Mixed Logit: Allow correlation across alternatives through random coefficients.
- Multinomial Probit: Directly models correlated errors; computationally intensive.
10.5 Checking for Complete Separation
Complete separation occurs when a covariate or linear combination of covariates perfectly predicts the outcome. In that case the MLE does not exist (the log-likelihood has no finite maximum):
- Perfect separation: there is a vector $b$ such that $x_i'b > 0$ whenever $y_i = 1$ and $x_i'b < 0$ whenever $y_i = 0$.
- Quasi-complete separation: The outcome is perfectly predicted for a subset of observations.
Detection: MLE algorithm fails to converge; extremely large coefficient estimates with very large standard errors; implausible predicted probabilities near 0 or 1.
Remedies:
- Firth penalised MLE: Modifies the score equations by a Jeffreys prior penalty — reduces bias in small samples and resolves separation.
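- Ridge-penalised logit: $\hat{\beta} = \arg\max_{\beta} \left[ \ell(\beta) - \lambda \lVert \beta \rVert_2^{2} \right]$ for a penalty weight $\lambda > 0$.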
- Bayesian logit/probit: Place weakly informative priors on coefficients.
- Drop/combine categories: Merge sparse categories causing separation.
10.6 Heteroscedasticity in Probit (Heteroscedastic Probit)
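In the standard probit, $\text{Var}(\varepsilon_i) = 1$ for all $i$. If the true error variance is heteroscedastic:
$$\text{Var}(\varepsilon_i \mid z_i) = \sigma_i^{2} = \left[\exp(z_i'\gamma)\right]^{2}$$
The heteroscedastic probit models:
$$P(Y_i = 1 \mid x_i, z_i) = \Phi\left(\frac{x_i'\beta}{\exp(z_i'\gamma)}\right)$$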
Standard probit estimates are inconsistent under heteroscedasticity (unlike OLS which remains consistent, only losing efficiency). The linktest (adding the squared predicted index as a covariate) checks for systematic misspecification.
10.7 Goodness-of-Link Tests
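The linktest (Pregibon, 1980) adds the squared fitted index $\hat{\eta}_i^{2}$, where $\hat{\eta}_i = x_i'\hat{\beta}$, as an additional regressor to the fitted model:
$$P(Y_i = 1) = F\left(\delta_0 + \delta_1 \hat{\eta}_i + \delta_2 \hat{\eta}_i^{2}\right)$$
Under correct specification, $\delta_2 = 0$ (the squared term should not be significant). A significant $\hat{\delta}_2$ indicates link function misspecification or omitted non-linear terms.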
11. Extensions: Multinomial and Conditional Logit
11.1 MNL Log-Likelihood
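For outcome $Y_i \in \{1, \dots, J\}$ with reference category $J$ (so $\beta_J = 0$), the MNL log-likelihood:
$$\ell(\beta_1, \dots, \beta_{J-1}) = \sum_{i=1}^{n} \sum_{j=1}^{J} \mathbb{1}[y_i = j]\, \ln P(Y_i = j \mid x_i)$$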
11.2 Marginal Effects in MNL
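For the MNL, the marginal effect of $x_k$ on $P_{ij} = P(Y_i = j \mid x_i)$:
$$\frac{\partial P_{ij}}{\partial x_{ik}} = P_{ij} \left[ \beta_{jk} - \sum_{l=1}^{J} P_{il}\, \beta_{lk} \right]$$
Note: Cross-effects (the effect of a covariate on a different category's probability) may be positive or negative, depending on model parameters.
Average Partial Effect:
$$APE_{jk} = \frac{1}{n} \sum_{i=1}^{n} P_{ij} \left[ \beta_{jk} - \sum_{l=1}^{J} P_{il}\, \beta_{lk} \right]$$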
11.3 The Conditional Logit and Mixed-Effects Specification
The full ("mixed") conditional logit specification combines alternative-varying attributes with individual characteristics; it should not be confused with the random-coefficients Mixed Logit of Section 3.8:
$$U_{ij} = z_{ij}'\gamma + x_i'\beta_j + \varepsilon_{ij}$$
Where:
- $z_{ij}$: Attributes that vary across both individuals and alternatives (e.g., travel time from $i$'s origin to alternative $j$).
- $x_i$: Characteristics of the individual (e.g., income), interacted with alternative-specific dummy variables to allow the effect to vary across alternatives.
11.4 Marginal Effects on Log-Odds (MNL)
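The log-odds of choosing $j$ vs. reference category $J$:
$$\ln\left[\frac{P(Y_i = j \mid x_i)}{P(Y_i = J \mid x_i)}\right] = x_i'\beta_j$$
This is the most directly interpretable quantity from the MNL regression output: $\beta_{jk}$ is the effect of $x_k$ on the log-odds of $j$ vs. reference.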
11.5 Substitution Patterns and the IIA Implication
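Under IIA, the own-price elasticity and cross-price elasticity have rigid implications. For an attribute $z_{ijk}$ with coefficient $\gamma_k$:
Own elasticity:
$$E_{jj} = \frac{\partial \ln P_{ij}}{\partial \ln z_{ijk}} = \gamma_k\, z_{ijk}\, (1 - P_{ij})$$
Cross elasticity (between alternatives $j$ and $l$):
$$E_{jl} = \frac{\partial \ln P_{ij}}{\partial \ln z_{ilk}} = -\gamma_k\, z_{ilk}\, P_{il}$$
Under IIA, the cross elasticity is the same for all $j \ne l$, a strong and often unrealistic restriction. The cross elasticity depends only on the attribute level and share of the alternative being changed, not on the similarity between alternatives $j$ and $l$.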
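The uniform cross-elasticity can be verified by perturbing one alternative's cost and measuring the proportional change in every other share. The sketch assumes a conditional logit with a single cost attribute; the coefficient and the costs are hypothetical.

```python
import math

GAMMA_COST = -0.05                                   # hypothetical cost coefficient
costs = {"car": 40.0, "bus": 20.0, "train": 30.0}    # hypothetical cost attribute

def shares(c):
    """Conditional logit shares with utility GAMMA_COST * cost."""
    expv = {j: math.exp(GAMMA_COST * cj) for j, cj in c.items()}
    total = sum(expv.values())
    return {j: e / total for j, e in expv.items()}

base = shares(costs)
h = 1e-6  # proportional bump applied to the train cost
bumped = shares({**costs, "train": costs["train"] * (1.0 + h)})

# Numerical elasticity of each share with respect to the train cost
numeric = {j: (bumped[j] - base[j]) / base[j] / h for j in costs}
analytic_cross = -GAMMA_COST * costs["train"] * base["train"]
analytic_own = GAMMA_COST * costs["train"] * (1.0 - base["train"])
print(numeric, analytic_cross, analytic_own)
```

Car and bus respond to the train cost with exactly the same elasticity, regardless of how close a substitute each is for the train.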
12. Extensions: Ordered Choice Models
12.1 The Ordered Logit (Proportional Odds Model)
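Recall from Section 3.6 the latent variable:
$$Y_i^{*} = x_i'\beta + \varepsilon_i$$
The ordered logit log-likelihood, with thresholds $\kappa_j$:
$$\ell(\beta, \kappa) = \sum_{i=1}^{n} \sum_{j=1}^{J} \mathbb{1}[y_i = j]\, \ln\left[ \Lambda(\kappa_j - x_i'\beta) - \Lambda(\kappa_{j-1} - x_i'\beta) \right]$$
Subject to $\kappa_0 = -\infty$, $\kappa_J = +\infty$, and $\kappa_1 < \kappa_2 < \dots < \kappa_{J-1}$.
The coefficient vector $\beta$ and thresholds $\kappa_1, \dots, \kappa_{J-1}$ are estimated jointly.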
12.2 Marginal Effects in Ordered Models
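For a continuous covariate $x_k$, the marginal effect on $P(Y_i = j)$, with thresholds $\kappa_j$:
$$\frac{\partial P(Y_i = j \mid x_i)}{\partial x_{ik}} = \beta_k \left[ \lambda(\kappa_{j-1} - x_i'\beta) - \lambda(\kappa_j - x_i'\beta) \right]$$
Where $\lambda(\cdot)$ is the logistic PDF.
Key observation: For the highest category ($j = J$) and lowest category ($j = 1$), the signs are:
$$\text{sign}\left(\frac{\partial P(Y_i = J)}{\partial x_{ik}}\right) = \text{sign}(\beta_k), \qquad \text{sign}\left(\frac{\partial P(Y_i = 1)}{\partial x_{ik}}\right) = -\text{sign}(\beta_k)$$
For middle categories, the sign depends on parameter values: effects on middle categories can go either way even when the overall latent variable effect is unambiguous.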
12.3 Generalised Ordered Logit
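When the proportional odds assumption is violated, the Generalised Ordered Logit allows $\beta$ to vary across thresholds:
$$P(Y_i > j \mid x_i) = \Lambda(x_i'\beta_j - \kappa_j), \qquad j = 1, \dots, J-1$$
The partial proportional odds model constrains some coefficients to be equal across thresholds (for covariates satisfying PO) and allows others to vary:
$$P(Y_i > j \mid x_i) = \Lambda(x_{1i}'\beta + x_{2i}'\gamma_j - \kappa_j)$$
Where $\beta$ is common across thresholds and $\gamma_j$ varies.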
12.4 Ordered Probit
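The Ordered Probit replaces the logistic with the normal CDF:
$$P(Y_i = j \mid x_i) = \Phi(\kappa_j - x_i'\beta) - \Phi(\kappa_{j-1} - x_i'\beta)$$
Marginal effects are analogous, replacing the logistic PDF $\lambda(\cdot)$ with $\phi(\cdot)$ (the standard normal PDF):
$$\frac{\partial P(Y_i = j \mid x_i)}{\partial x_{ik}} = \beta_k \left[ \phi(\kappa_{j-1} - x_i'\beta) - \phi(\kappa_j - x_i'\beta) \right]$$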
13. Extensions: Nested Logit and Mixed Logit
13.1 The Nested Logit: Addressing IIA
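The Nested Logit relaxes IIA by grouping alternatives into nests $B_1, \dots, B_K$ within which alternatives are correlated substitutes. The choice probability decomposes into:
$$P_{ij} = P_{ij \mid k}\,P_{ik}, \qquad P_{ij \mid k} = \frac{\exp(V_{ij}/\tau_k)}{\sum_{m \in B_k}\exp(V_{im}/\tau_k)}, \qquad P_{ik} = \frac{\exp(W_{ik} + \tau_k IV_{ik})}{\sum_{l=1}^{K}\exp(W_{il} + \tau_l IV_{il})}$$
The dissimilarity parameter $\tau_k$ governs the correlation within nest $k$:
- $\tau_k = 1$: No within-nest correlation (reduces to MNL).
- $\tau_k \to 0$: Perfect within-nest correlation (nest collapses to a single alternative).
The inclusive value $IV_{ik} = \ln\sum_{m \in B_k}\exp(V_{im}/\tau_k)$ summarises the attractiveness of nest $k$, allowing it to influence the nest-level choice.
Utility consistency: The Nested Logit is RUM-consistent for all covariate values if and only if $0 < \tau_k \le 1$ for all nests. If $\hat\tau_k > 1$, it signals misspecification or incorrect nesting structure.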
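The two-level decomposition can be sketched in a few lines. This is a minimal illustration, not the application's implementation: the utilities, nest names, and dissimilarity values are hypothetical, the symbol `tau` for the dissimilarity parameter is an assumed notation, and nest-level covariates are omitted (set to zero).

```python
import math

def nested_logit_probs(v, nests, tau):
    """Nested logit probabilities.

    v:     {alternative: systematic utility}
    nests: {nest: [alternatives in the nest]}
    tau:   {nest: dissimilarity parameter}
    """
    # Inclusive value of each nest: log-sum of scaled utilities
    inclusive = {k: math.log(sum(math.exp(v[j] / tau[k]) for j in alts))
                 for k, alts in nests.items()}
    denom = sum(math.exp(tau[k] * inclusive[k]) for k in nests)
    probs = {}
    for k, alts in nests.items():
        p_nest = math.exp(tau[k] * inclusive[k]) / denom       # nest-level choice
        for j in alts:
            p_within = math.exp(v[j] / tau[k] - inclusive[k])  # choice within nest
            probs[j] = p_within * p_nest
    return probs

v = {"car": 0.5, "bus": 0.0, "train": 0.2}
nests = {"private": ["car"], "transit": ["bus", "train"]}

mnl_like = nested_logit_probs(v, nests, {"private": 1.0, "transit": 1.0})
nested = nested_logit_probs(v, nests, {"private": 1.0, "transit": 0.5})

# Improve the train: under nesting, bus loses proportionally more than car
v_better_train = dict(v, train=1.2)
nested_after = nested_logit_probs(v_better_train, nests, {"private": 1.0, "transit": 0.5})
```

Setting every dissimilarity parameter to 1 reproduces plain MNL shares, while a value below 1 in the transit nest makes the bus absorb more of the train's gain than the car does.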
13.2 Estimation of the Nested Logit
Sequential (limited information) estimation:
- Estimate the within-nest model parameters by fitting a conditional logit within each nest.
- Compute the inclusive values .
- Estimate the nest-level model using as a covariate.
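Full information MLE: Simultaneously maximise the full nested logit log-likelihood:
$$\ell(\theta) = \sum_{i=1}^{N}\sum_{k=1}^{K}\sum_{j \in B_k} d_{ij}\left[\ln P_{ij \mid k} + \ln P_{ik}\right]$$
Where $d_{ij} = 1$ if individual $i$ chooses alternative $j$ and 0 otherwise. Full MLE is preferred because it is more efficient; sequential estimation is easier to implement, but its second-stage standard errors ignore the fact that the inclusive values are themselves estimated.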
13.3 The Mixed Logit: Flexible Preferences
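The Mixed Logit can approximate any random utility model arbitrarily closely (McFadden & Train, 2000) by allowing random coefficients:
$$U_{ij} = x_{ij}'\beta_i + \varepsilon_{ij}, \qquad \beta_i = b + \eta_i, \quad \eta_i \sim N(0, \Sigma), \quad \varepsilon_{ij} \sim \text{iid Type I extreme value}$$
Where $b$ are mean preferences and $\Sigma$ captures preference heterogeneity and cross-alternative error correlation.
Key advantages over MNL:
- No IIA: Correlation across alternatives via $\Sigma$.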
- Preference heterogeneity: Estimates the distribution of preferences, not just the mean.
- Panel data: Handles repeated choices by the same individual via the mixing distribution.
- Flexible substitution: Allows realistic substitution patterns.
13.4 Simulation-Based Estimation for Mixed Logit
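Since the mixed logit probability $P_{ij} = \int L_{ij}(\beta)\,f(\beta \mid b, \Sigma)\,d\beta$ has no closed form, use simulation:
Simulated Maximum Likelihood (SML):
$$\check P_{ij} = \frac{1}{R}\sum_{r=1}^{R} L_{ij}(\beta^{(r)}), \qquad L_{ij}(\beta) = \frac{\exp(x_{ij}'\beta)}{\sum_{m}\exp(x_{im}'\beta)}$$
Where $\beta^{(r)}$ are draws from the assumed mixing distribution. The SML estimator maximises:
$$SLL(b, \Sigma) = \sum_{i=1}^{N}\ln\check P_{i,y_i}$$
Quasi-Monte Carlo (Halton sequences): Replace pseudo-random draws with Halton sequences, low-discrepancy sequences that cover the integration domain more uniformly. This often reduces simulation variance by a factor of 10–100 compared to random sampling, so that far fewer draws are typically sufficient.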
Bayesian MCMC: An alternative to SML, using Markov Chain Monte Carlo to sample from the posterior distribution of parameters and individual-specific coefficients simultaneously.
13.5 Recovering Individual-Level Preferences
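A key advantage of the Mixed Logit is the ability to estimate individual-specific coefficients using Bayes' theorem:
$$h(\beta \mid y_i, x_i) = \frac{P(y_i \mid x_i, \beta)\,f(\beta \mid b, \Sigma)}{\int P(y_i \mid x_i, \beta)\,f(\beta \mid b, \Sigma)\,d\beta}$$
The posterior mean:
$$\hat\beta_i = E[\beta \mid y_i, x_i] \approx \frac{\sum_{r=1}^{R}\beta^{(r)}\,P(y_i \mid x_i, \beta^{(r)})}{\sum_{r=1}^{R}P(y_i \mid x_i, \beta^{(r)})}$$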
These conditional means reveal individual-level taste heterogeneity and are used for market segmentation and personalised prediction.
14. Extensions: Panel Data Discrete Choice
14.1 The Challenge: Incidental Parameters Problem
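In panel data, each individual $i = 1, \dots, N$ makes choices across $t = 1, \dots, T$ time periods. The natural extension of binary logit to panel data with fixed effects:
$$P(y_{it} = 1 \mid x_{it}, \alpha_i) = \Lambda(\alpha_i + x_{it}'\beta)$$
Where $\alpha_i$ is an individual fixed effect. The incidental parameters problem arises because:
- Estimating $N$ nuisance parameters $\alpha_i$ alongside $\beta$ in a nonlinear model causes inconsistency of $\hat\beta$ even as $N \to \infty$.
- The unconditional (dummy-variable) fixed effects logit and probit estimators are inconsistent for fixed $T$.
- The bias is of order $1/T$ and can be severe for small $T$ (e.g., $T = 2$ leads to approximately 100% upward bias in binary logit coefficients, since $\text{plim}\,\hat\beta = 2\beta$).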
14.2 Conditional Fixed Effects Logit (Chamberlain, 1980)
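Chamberlain's conditional logit solves the incidental parameters problem for binary logit by conditioning on the sufficient statistic for $\alpha_i$, the individual's total number of successes $s_i = \sum_{t} y_{it}$:
$$P(y_{i1}, \dots, y_{iT} \mid s_i, x_i) = \frac{\exp\left(\sum_{t} y_{it}\,x_{it}'\beta\right)}{\sum_{d \in B_i}\exp\left(\sum_{t} d_t\,x_{it}'\beta\right)}$$
Where $B_i$ is the set of all binary sequences $d = (d_1, \dots, d_T)$ with $s_i$ ones (the conditioning set).
Key properties:
- $\hat\beta$ is consistent as $N \to \infty$ for fixed $T$.
- Individuals with $s_i = 0$ (never treated) or $s_i = T$ (always treated) contribute no information and are dropped; identification comes only from within-individual variation.
- Cannot estimate effects of time-invariant variables (all absorbed by $\alpha_i$).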
14.3 Random Effects Probit
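When the fixed effects approach is too restrictive (e.g., with time-invariant covariates), the random effects probit assumes:
$$y_{it} = 1[x_{it}'\beta + \alpha_i + \varepsilon_{it} > 0], \qquad \varepsilon_{it} \sim N(0, 1), \quad \alpha_i \mid x_i \sim N(0, \sigma_\alpha^2)$$
The marginal log-likelihood integrates out $\alpha_i$:
$$\ell = \sum_{i=1}^{N}\ln\int\prod_{t=1}^{T}\Phi\left[(2y_{it} - 1)(x_{it}'\beta + \alpha)\right]\frac{1}{\sigma_\alpha}\phi\left(\frac{\alpha}{\sigma_\alpha}\right)d\alpha$$
This integral is computed via Gauss-Hermite quadrature.
Mundlak-Chamberlain (Correlated RE): Relax the random effects independence assumption by including individual-level means of time-varying covariates:
$$\alpha_i = \bar x_i'\gamma + a_i, \qquad a_i \mid x_i \sim N(0, \sigma_a^2)$$
This allows correlation between $\alpha_i$ and $x_{it}$, approximating the FE estimator while retaining the ability to estimate effects of time-invariant variables.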
14.4 Dynamic Panel Discrete Choice
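State dependence refers to the direct causal effect of past choices on current choices:
$$P(y_{it} = 1 \mid y_{i,t-1}, x_{it}, \alpha_i) = F(\rho\,y_{i,t-1} + x_{it}'\beta + \alpha_i)$$
Where $F$ is the logit or probit link and $\rho$ captures structural state dependence (e.g., habit formation, switching costs).
The initial conditions problem (Heckman, 1981): The initial observation $y_{i0}$ is correlated with $\alpha_i$ because it depends on the pre-sample history. Ignoring this causes inconsistency.
Wooldridge (2005) solution: Model the individual effect conditional on the initial observation and the covariates:
$$\alpha_i = \alpha_0 + \alpha_1 y_{i0} + \bar x_i'\gamma + a_i, \qquad a_i \sim N(0, \sigma_a^2)$$
This extends the Mundlak-Chamberlain device for the individual effect distribution by adding $y_{i0}$ as a regressor.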
15. Using the Discrete Choice Component
The Discrete Choice Models component in the DataStatPro application provides a comprehensive workflow for specification, estimation, testing, and visualisation of all major discrete choice model families.
Step-by-Step Guide
Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should contain:
- A unit identifier column (individual, household, firm).
- An outcome variable (binary, multinomial, or ordinal).
- Covariates: Individual characteristics and/or alternative-specific attributes.
- For panel/repeated choices: a time or choice occasion identifier.
- For conditional logit: data in long format (one row per alternative per individual).
Step 2 — Select Model Family Choose the discrete choice model specification:
- Binary Logit (binary outcome, logistic link)
- Binary Probit (binary outcome, normal link)
- Linear Probability Model (binary outcome, OLS — for comparison purposes)
- Multinomial Logit (nominal outcome, $J > 2$ alternatives)
- Conditional Logit (alternative-specific attributes)
- Mixed Logit (random parameters logit)
- Nested Logit (hierarchical alternatives)
- Ordered Logit (ordinal outcome)
- Ordered Probit (ordinal outcome, normal latent variable)
- Generalised Ordered Logit (relaxes proportional odds)
- Panel Conditional Logit (fixed effects binary logit for panel data)
- Random Effects Probit (panel probit with random effects)
Step 3 — Select Variables Map the required variables from your dataset:
- Unit ID: Unique identifier for each decision-maker.
- Choice Occasion ID: (for panel/repeated choices) The time or choice occasion identifier.
- Outcome ($y$): The discrete choice variable.
- Individual Covariates ($x_i$): Characteristics of the decision-maker.
- Alternative Attributes ($x_{ij}$): (for conditional/mixed logit) Attributes varying across alternatives and individuals.
- Alternative ID: (for long-format data) Which alternative is described by each row.
Step 4 — Specify Reference Categories For multinomial and conditional logit, set the reference alternative (default: first category in alphabetical order). For ordered models, verify the ordering of categories.
Step 5 — Configure Nesting Structure (Nested Logit) Assign each alternative to a nest:
- Drag-and-drop alternatives into nests in the nesting panel.
- Specify whether dissimilarity parameters ($\tau_k$) are free or constrained.
- Choose between sequential and full MLE estimation.
Step 6 — Configure Random Parameters (Mixed Logit) For each covariate, specify whether the coefficient is:
- Fixed (common to all individuals)
- Random — Normal
- Random — Log-Normal (for constrained-sign effects like cost)
- Random — Triangular (bounded support)
- Correlated (estimate full covariance $\Sigma$, not just diagonal)
Set the number of Halton draws (default: 500) and whether to use antithetic draws for variance reduction.
Step 7 — Configure Fixed Effects (Panel Models)
- None (pooled model)
- Unit Fixed Effects (conditional logit or dummy-variable approach)
- Random Effects (Mundlak-Chamberlain or standard RE)
- Correlated Random Effects (include individual means of time-varying covariates)
Step 8 — Configure Standard Errors
- Standard MLE (information matrix SEs)
- Robust (Sandwich) — recommended for potential misspecification
- Cluster-Robust — specify clustering variable (e.g., household, region, market)
- Bootstrap — specify replications (default: 999)
- Delta Method — for marginal effect SEs (always active)
Step 9 — Configure Marginal Effects Select which partial effects to report:
- Average Partial Effects (APE) (default and recommended)
- Partial Effects at the Mean (PEM)
- Partial Effects at Representative Values (specify covariate values manually)
- Odds Ratios / Risk Ratios (binary and MNL only)
- Willingness to Pay (requires specification of a cost/price variable)
Step 10 — Select Display Options Choose which outputs to display:
- ✅ Coefficient Table (with SEs, z-stats, p-values, CIs)
- ✅ Marginal Effects Table (APE with SEs and CIs)
- ✅ Odds Ratios / Risk Ratios Plot
- ✅ Predicted Probability Plot (over covariate range)
- ✅ ROC Curve and AUC (binary models)
- ✅ Calibration Plot (Hosmer-Lemeshow)
- ✅ Goodness-of-Fit Statistics Table
- ✅ Pre-Trends / Marginal Effect Profile (over subgroups)
- ✅ Linktest Residuals
- ✅ Influence Diagnostics Plot
- ✅ Brant Test Results (ordered models)
- ✅ IIA Hausman Test Results (MNL)
- ✅ Nested Logit Tree Diagram
- ✅ Random Parameter Distributions (Mixed Logit)
- ✅ Confusion Matrix (binary classification)
- ✅ WTP Confidence Intervals
Step 11 — Run the Analysis Click "Run Discrete Choice Model". The application will:
- Validate data format and variable types; convert to appropriate structure if needed.
- Initialise parameters (using linear probability model or random starting values).
- Maximise the log-likelihood using Newton-Raphson / BFGS / IRLS.
- Compute variance-covariance matrix (information matrix or sandwich).
- Compute all selected marginal effects with delta method SEs.
- Run specified diagnostic tests (linktest, Brant, IIA Hausman).
- Generate all selected visualisations and tables.
16. Computational and Formula Details
16.1 Binary Logit MLE: Step-by-Step
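Step 1: Initialise parameters
$$\beta^{(0)} = 0 \quad \text{(or the linear probability model estimates)}$$
Step 2: Compute fitted probabilities
$$\hat p_i^{(t)} = \Lambda(x_i'\beta^{(t)}) = \frac{1}{1 + \exp(-x_i'\beta^{(t)})}$$
Step 3: Compute score and Hessian
$$s^{(t)} = X'(y - \hat p^{(t)}), \qquad H^{(t)} = -X'W^{(t)}X, \quad W^{(t)} = \text{diag}\left\{\hat p_i^{(t)}(1 - \hat p_i^{(t)})\right\}$$
Step 4: Newton-Raphson update
$$\beta^{(t+1)} = \beta^{(t)} - [H^{(t)}]^{-1}s^{(t)} = \beta^{(t)} + (X'W^{(t)}X)^{-1}X'(y - \hat p^{(t)})$$
Step 5: Check convergence
$$\max_k\left|\beta_k^{(t+1)} - \beta_k^{(t)}\right| < \epsilon \quad \text{for a small tolerance } \epsilon$$
Step 6: Compute variance-covariance matrix
$$\hat V(\hat\beta) = (X'\hat W X)^{-1}$$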
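The six steps can be sketched in plain Python. This is a teaching sketch under stated assumptions, not the optimiser used by the application: the data are a small hypothetical sample, starting values are zero, and the linear system is solved with a hand-written Gaussian elimination to keep the example dependency-free.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def fitted(X, beta):
    """Step 2: logistic fitted probabilities."""
    return [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row)))) for row in X]

def information(X, p):
    """X'WX, the negative Hessian of the logit log-likelihood."""
    k = len(X[0])
    return [[sum(pi * (1 - pi) * row[a] * row[b] for pi, row in zip(p, X))
             for b in range(k)] for a in range(k)]

def fit_logit(X, y, tol=1e-10, max_iter=50):
    k = len(X[0])
    beta = [0.0] * k                                        # Step 1
    for _ in range(max_iter):
        p = fitted(X, beta)                                 # Step 2
        score = [sum((yi - pi) * row[j] for yi, pi, row in zip(y, p, X))
                 for j in range(k)]                         # Step 3
        step = solve(information(X, p), score)              # Step 4
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:                 # Step 5
            break
    info = information(X, fitted(X, beta))                  # Step 6
    var_diag = [solve(info, [1.0 if i == j else 0.0 for i in range(k)])[j]
                for j in range(k)]
    return beta, [math.sqrt(v) for v in var_diag]

# Hypothetical data: intercept plus one covariate, outcomes overlap (no separation)
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
X = [[1.0, xi] for xi in x]

beta_hat, se_hat = fit_logit(X, y)
p_hat = fitted(X, beta_hat)
```

At the solution the score is zero, so with an intercept in the model the fitted probabilities sum to the number of observed successes.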
16.2 Average Partial Effects: Full Computation
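For binary logit, continuous covariate $x_k$:
$$\widehat{APE}_k = \hat\beta_k\cdot\frac{1}{N}\sum_{i=1}^{N}\Lambda(\hat\eta_i)\left[1 - \Lambda(\hat\eta_i)\right]$$
Gradient for delta method SE:
$$g_k = \frac{\partial\widehat{APE}_k}{\partial\beta} = \frac{1}{N}\sum_{i=1}^{N}\Lambda(\hat\eta_i)\left[1 - \Lambda(\hat\eta_i)\right]\left\{e_k + \hat\beta_k\left[1 - 2\Lambda(\hat\eta_i)\right]x_i\right\}, \qquad \widehat{SE}(\widehat{APE}_k) = \sqrt{g_k'\hat V g_k}$$
For binary logit, binary covariate $x_k$:
$$\widehat{APE}_k = \frac{1}{N}\sum_{i=1}^{N}\left[\Lambda\left(\hat\eta_i + \hat\beta_k(1 - x_{ik})\right) - \Lambda\left(\hat\eta_i - \hat\beta_k x_{ik}\right)\right]$$
Where $\hat\eta_i = x_i'\hat\beta$ is the fitted index, $e_k$ is the $k$-th unit vector, and $x_{ik}$ is the observed value of $x_k$ for individual $i$.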
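A compact sketch of both APE calculations for a binary logit is given below. The coefficient vector and the six-row design matrix are hypothetical illustrations (the income and college coefficients echo Example 1, the intercept does not), and standard errors are left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def index(row, beta):
    """Fitted index x_i' beta."""
    return sum(b * v for b, v in zip(beta, row))

def ape_continuous(X, beta, k):
    """Average of beta_k * Lambda(eta)(1 - Lambda(eta)) over the sample."""
    dens = [sigmoid(index(r, beta)) * (1.0 - sigmoid(index(r, beta))) for r in X]
    return beta[k] * sum(dens) / len(X)

def ape_binary(X, beta, k):
    """Average discrete change: switch x_k from 0 to 1 for every observation."""
    total = 0.0
    for r in X:
        eta = index(r, beta)
        eta_one = eta + beta[k] * (1.0 - r[k])   # index with x_k set to 1
        eta_zero = eta - beta[k] * r[k]          # index with x_k set to 0
        total += sigmoid(eta_one) - sigmoid(eta_zero)
    return total / len(X)

# Columns: intercept, income (per $10k), college degree (0/1); all values hypothetical
X = [[1.0, 2.5, 0.0], [1.0, 5.0, 0.0], [1.0, 7.5, 1.0],
     [1.0, 2.5, 1.0], [1.0, 4.0, 0.0], [1.0, 6.0, 1.0]]
beta = [-1.0, 0.318, 0.841]

ape_income = ape_continuous(X, beta, 1)
ape_college = ape_binary(X, beta, 2)
```

Because the logistic density never exceeds 0.25, the continuous APE is always at most a quarter of the coefficient in absolute value.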
16.3 Multinomial Logit: Score and Hessian
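For the MNL with $J$ alternatives and reference $J$ (so $\beta_J = 0$):
Score for category $j$ ($j = 1, \dots, J-1$):
$$\frac{\partial\ell}{\partial\beta_j} = \sum_{i=1}^{N}(d_{ij} - P_{ij})\,x_i$$
Hessian blocks:
$$\frac{\partial^2\ell}{\partial\beta_j\,\partial\beta_m'} = -\sum_{i=1}^{N}P_{ij}\left(1[j = m] - P_{im}\right)x_i x_i'$$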
The full Hessian is block-structured and negative definite, ensuring global concavity of the MNL log-likelihood.
16.4 Ordered Logit: Score and Threshold Constraints
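Score for $\beta$:
$$\frac{\partial\ell}{\partial\beta} = \sum_{i=1}^{N}\frac{\lambda(\kappa_{y_i - 1} - x_i'\beta) - \lambda(\kappa_{y_i} - x_i'\beta)}{\Lambda(\kappa_{y_i} - x_i'\beta) - \Lambda(\kappa_{y_i - 1} - x_i'\beta)}\,x_i$$
Score for threshold $\kappa_j$:
$$\frac{\partial\ell}{\partial\kappa_j} = \sum_{i=1}^{N}\lambda(\kappa_j - x_i'\beta)\left[\frac{1[y_i = j]}{P_{ij}} - \frac{1[y_i = j+1]}{P_{i,j+1}}\right]$$
Thresholds are constrained to be strictly ordered. In practice, use the unconstrained re-parameterisation $\kappa_1 = \delta_1$, $\kappa_j = \kappa_{j-1} + \exp(\delta_j)$ for $j \ge 2$ ($\delta_j$ freely estimated).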
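One common way to enforce the ordering is to let the optimiser work on unconstrained values and build each threshold as the previous one plus an exponentiated increment. The sketch below assumes that particular mapping; the names `delta` and `thresholds_from_unconstrained` are illustrative, not taken from the application.

```python
import math

def thresholds_from_unconstrained(delta):
    """Map free parameters to strictly increasing thresholds.

    The first threshold equals delta[0]; each later threshold adds
    exp(delta[j]) > 0 to the previous one, so ordering always holds.
    """
    cuts = [delta[0]]
    for d in delta[1:]:
        cuts.append(cuts[-1] + math.exp(d))
    return cuts

def unconstrained_from_thresholds(cuts):
    """Inverse mapping, useful for turning starting thresholds into free parameters."""
    delta = [cuts[0]]
    for prev, cur in zip(cuts, cuts[1:]):
        delta.append(math.log(cur - prev))
    return delta

# Any real-valued vector, even a decreasing one, gives ordered thresholds
example_cuts = thresholds_from_unconstrained([0.3, -2.0, -5.0, 1.0])
round_trip = thresholds_from_unconstrained(
    unconstrained_from_thresholds([-3.412, -1.841, 0.321, 2.184]))
```

Standard errors for the thresholds themselves then follow from the delta method applied to this mapping.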
16.5 Nested Logit: Full Information MLE
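The nested logit log-likelihood for individual $i$ choosing alternative $j$ in nest $k$:
$$\ln P_{ij} = \frac{V_{ij}}{\tau_k} - IV_{ik} + W_{ik} + \tau_k IV_{ik} - \ln\sum_{l=1}^{K}\exp(W_{il} + \tau_l IV_{il})$$
The gradient with respect to $\tau_k$ requires the chain rule through the inclusive value and involves $\partial IV_{ik}/\partial\tau_k = -\tau_k^{-2}\sum_{m \in B_k}P_{im \mid k}\,V_{im}$.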
16.6 Conditional Fixed Effects Logit: Computation
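For individual $i$ with $s_i$ successes across $T$ periods, the conditional log-likelihood contribution is:
$$\ell_i^c(\beta) = \sum_{t=1}^{T}y_{it}\,x_{it}'\beta - \ln\sum_{d \in B_i}\exp\left(\sum_{t=1}^{T}d_t\,x_{it}'\beta\right)$$
For $T = 2$ and $s_i = 1$, this simplifies to:
$$P(y_{i2} = 1 \mid s_i = 1) = \Lambda\left((x_{i2} - x_{i1})'\beta\right)$$
Which is equivalent to a standard logit with first-differenced covariates, the panel FE analogue of the first-differences estimator in linear models.
For $T > 2$, the summation over $B_i$ grows combinatorially ($\binom{T}{s_i}$ terms) and is computed efficiently with a recursive algorithm; the conditional likelihood has the same form as the exact Cox partial likelihood for tied event times.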
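For small panels the conditional contribution can be evaluated by brute-force enumeration of the conditioning set, which makes the structure easy to verify. The sketch below does exactly that; it is not the efficient recursion used for long panels, and the index values are hypothetical.

```python
import math
from itertools import combinations

def conditional_loglik_i(xb, y):
    """Conditional logit log-likelihood contribution for one individual.

    xb: list of x_it' beta for t = 1..T (the fixed effect cancels, so it is omitted)
    y:  list of 0/1 outcomes for t = 1..T
    """
    T, s = len(y), sum(y)
    numerator = sum(v for v, yi in zip(xb, y) if yi == 1)
    # Every way of placing s successes among T periods
    terms = [sum(xb[t] for t in ones) for ones in combinations(range(T), s)]
    top = max(terms)
    log_denominator = top + math.log(sum(math.exp(v - top) for v in terms))
    return numerator - log_denominator

# T = 2 with one success reduces to a logit in first differences
ll_two_period = conditional_loglik_i([0.3, 1.1], [0, 1])
logit_of_difference = math.log(1.0 / (1.0 + math.exp(-(1.1 - 0.3))))

# Individuals with no variation in the outcome contribute nothing
ll_never = conditional_loglik_i([0.3, 1.1, -0.4], [0, 0, 0])
ll_always = conditional_loglik_i([0.3, 1.1, -0.4], [1, 1, 1])
```

Both `ll_never` and `ll_always` equal zero for any parameter value, which is why such individuals drop out of the estimation sample.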
16.7 Mixed Logit: Halton Sequences and Simulation
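Halton sequence for prime base $p$: Generate draws from the quasi-random sequence:
$$h_p(n) = \sum_{l=0}^{L}d_l\,p^{-(l+1)}$$
Where $n$ in base $p$ is $n = \sum_{l=0}^{L}d_l\,p^{l}$ with digits $d_l \in \{0, \dots, p-1\}$. Halton sequences for different primes are used for different dimensions of integration.
Simulated log-likelihood:
$$SLL = \sum_{i=1}^{N}\ln\left[\frac{1}{R}\sum_{r=1}^{R}\prod_{t}L_{i,y_{it},t}\left(\beta_i^{(r)}\right)\right]$$
Where $\beta_i^{(r)} = b + C\eta_i^{(r)}$ with $CC' = \Sigma$, and $\eta_i^{(r)}$ are Halton draws transformed to standard normal variates via $\Phi^{-1}(\cdot)$.
Antithetic draws: For each uniform draw $u^{(r)}$, include its mirror $1 - u^{(r)}$ (equivalently $-\eta^{(r)}$ after the normal transform) to reduce simulation variance.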
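The radical-inverse construction behind a Halton sequence is short enough to write out. The sketch below uses only the standard library; the helper names are illustrative, and practical refinements such as discarding initial elements or scrambling are deliberately left out.

```python
from statistics import NormalDist

def halton(index, base):
    """Radical inverse of a positive integer `index` in the given prime base."""
    value, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        value += digit * fraction
        fraction /= base
    return value

def normal_draws(n, base):
    """First n Halton points in `base`, mapped to standard normal variates."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(halton(i, base)) for i in range(1, n + 1)]

def with_antithetic(draws):
    """Append the mirror image of every normal draw."""
    return list(draws) + [-d for d in draws]

first_base2 = [halton(i, 2) for i in range(1, 4)]   # 1/2, 1/4, 3/4
first_base3 = [halton(i, 3) for i in range(1, 4)]   # 1/3, 2/3, 1/9
eta = with_antithetic(normal_draws(8, 3))
```

A different prime base would be used for each random coefficient, so a model with two random parameters might use bases 2 and 3.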
16.8 WTP Computation and Krinsky-Robb Confidence Intervals
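Point estimate:
$$\widehat{WTP}_k = -\frac{\hat\beta_k}{\hat\beta_{cost}}$$
Delta method SE:
$$\widehat{SE}(\widehat{WTP}_k) = \sqrt{\frac{\widehat{Var}(\hat\beta_k)}{\hat\beta_{cost}^2} + \frac{\hat\beta_k^2\,\widehat{Var}(\hat\beta_{cost})}{\hat\beta_{cost}^4} - \frac{2\hat\beta_k\,\widehat{Cov}(\hat\beta_k, \hat\beta_{cost})}{\hat\beta_{cost}^3}}$$
Krinsky-Robb (1986) simulation:
- Draw $R$ parameter vectors $\beta^{(r)}$ from $N(\hat\beta, \hat V)$.
- Compute $WTP^{(r)} = -\beta_k^{(r)}/\beta_{cost}^{(r)}$ for each draw.
- Report the 2.5th and 97.5th percentiles of $\{WTP^{(r)}\}$ as the 95% CI.
The Krinsky-Robb CI is preferred over the delta method when $\hat\beta_{cost}$ is close to zero (since the ratio is highly non-linear near zero).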
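Both interval methods can be sketched for a single attribute. The point estimates and standard errors below are the time and cost values reported in Example 4; the covariance between the two estimates is not reported there, so it is set to zero here purely as an assumption, and the resulting interval is not expected to match the example's table.

```python
import math
import random

def wtp_point(beta_k, beta_cost):
    return -beta_k / beta_cost

def wtp_delta_se(beta_k, beta_cost, var_k, var_cost, cov):
    """Delta-method SE of -beta_k / beta_cost."""
    g_k = -1.0 / beta_cost               # derivative w.r.t. beta_k
    g_c = beta_k / beta_cost ** 2        # derivative w.r.t. beta_cost
    return math.sqrt(g_k ** 2 * var_k + g_c ** 2 * var_cost + 2.0 * g_k * g_c * cov)

def krinsky_robb_ci(beta_k, beta_cost, var_k, var_cost, cov, draws=5000, seed=0):
    """Percentile interval from bivariate normal parameter draws."""
    rng = random.Random(seed)
    # Cholesky factor of the 2x2 covariance matrix
    l11 = math.sqrt(var_k)
    l21 = cov / l11
    l22 = math.sqrt(var_cost - l21 ** 2)
    sims = []
    for _ in range(draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        b_k = beta_k + l11 * z1
        b_c = beta_cost + l21 * z1 + l22 * z2
        sims.append(-b_k / b_c)
    sims.sort()
    return sims[int(0.025 * draws)], sims[int(0.975 * draws) - 1]

beta_time, se_time = -0.0841, 0.012
beta_cost, se_cost = -0.0412, 0.006
cov_assumed = 0.0   # assumption: covariance is not reported in the example

# With a negative time coefficient, -beta_time/beta_cost is the WTP for one extra
# minute (negative); its absolute value is the WTP per minute saved.
wtp_per_minute_saved = abs(wtp_point(beta_time, beta_cost))
se_delta = wtp_delta_se(beta_time, beta_cost, se_time ** 2, se_cost ** 2, cov_assumed)
lo, hi = krinsky_robb_ci(-beta_time, beta_cost, se_time ** 2, se_cost ** 2, cov_assumed)
```

The simulated interval is typically asymmetric around the point estimate, which the symmetric delta-method interval cannot reproduce.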
17. Worked Examples
Example 1: Binary Logit — Probability of Health Insurance Take-Up
Research Question: What factors predict whether an individual aged 19–64 has health insurance coverage?
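Data: Cross-sectional survey of working-age adults; outcome: $Y_i = 1$ if insured, $Y_i = 0$ if uninsured.
Model: $P(Y_i = 1 \mid x_i) = \Lambda(\beta_0 + \beta_1\,\text{Income}_i + \beta_2\,\text{College}_i + \beta_3\,\text{Age}_i + \beta_4\,\text{FullTime}_i)$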
Step 1: Results Table
| Variable | $\hat\beta$ | SE | $z$ | $p$ | $e^{\hat\beta}$ (OR) | APE |
|---|---|---|---|---|---|---|
| Intercept | -3.412 | 0.241 | -14.15 | <0.001 | — | — |
| Income (per $10k USD) | 0.318 | 0.031 | 10.26 | <0.001 | 1.374 | +0.048 |
| College degree | 0.841 | 0.094 | 8.95 | <0.001 | 2.318 | +0.127 |
| Age (years) | 0.042 | 0.007 | 6.00 | <0.001 | 1.043 | +0.006 |
| Employed full-time | 1.283 | 0.108 | 11.88 | <0.001 | 3.607 | +0.193 |
, , McFadden . AUC = 0.813, Hosmer-Lemeshow (, good calibration).
Step 2: Interpretation
- Income: A $10,000 increase in annual income increases the odds of coverage by a factor of 1.374 (37.4% higher odds). On average, this corresponds to a 4.8 percentage point increase in the probability of being insured (APE).
- College degree: Having a college degree more than doubles the odds of insurance (OR = 2.318). The APE is 12.7 pp, the second-largest marginal effect in the model.
- Full-time employment: Full-time employment (likely with employer-sponsored insurance) multiplies the odds by 3.607 (APE = 19.3 pp — the largest effect overall).
Step 3: Predicted Probability Profiles
| Income | College | Age | Employed | $\hat P(\text{insured})$ |
|---|---|---|---|---|
| $25k | No | 30 | No | 0.312 |
| $50k | No | 40 | Yes | 0.741 |
| $75k | Yes | 50 | Yes | 0.941 |
| $25k | Yes | 30 | No | 0.507 |
Step 4: Model Diagnostics
- Linktest: () — no evidence of systematic misspecification.
- Cook's distance: 12 observations with identified; re-estimation without these changes key coefficients by less than 3% → robust.
Example 2: Multinomial Logit — Occupational Choice
Research Question: What individual characteristics predict whether a worker is employed in (1) Professional/Managerial, (2) Technical/Clerical, or (3) Service/Manual occupations?
Data: workers; reference category: Service/Manual (category 3).
Step 1: MNL Coefficient Table (Reference: Service/Manual)
| Variable | Professional (vs. Service): $\hat\beta$ | Professional: SE | Technical (vs. Service): $\hat\beta$ | Technical: SE |
|---|---|---|---|---|
| Intercept | -2.841 | 0.312 | -1.523 | 0.241 |
| Education (years) | 0.412 | 0.041 | 0.218 | 0.033 |
| Experience (years) | 0.083 | 0.018 | 0.061 | 0.015 |
| Female | -0.391 | 0.112 | 0.284 | 0.098 |
| Urban | 0.521 | 0.134 | 0.312 | 0.118 |
, McFadden ; LR ().
Step 2: Average Partial Effects on Category Probabilities
| Variable | $P(Y = 1)$: Professional | $P(Y = 2)$: Technical | $P(Y = 3)$: Service |
|---|---|---|---|
| Education (+1 year) | +0.041 | +0.009 | -0.050 |
| Experience (+1 year) | +0.007 | +0.003 | -0.010 |
| Female | -0.048 | +0.062 | -0.014 |
| Urban | +0.056 | +0.021 | -0.077 |
Note that effects sum to zero across categories (probability constraint). Being female reduces the probability of professional occupation by 4.8 pp but increases the probability of technical occupation by 6.2 pp.
Step 3: IIA Test
Hausman-McFadden test excluding "Technical" category: , → IIA not rejected. Excluding "Professional": , → IIA not rejected. The MNL is appropriate for this application.
Step 4: Predicted Category Probabilities for Representative Profiles
| Profile | Professional | Technical | Service |
|---|---|---|---|
| 12 yrs education, 5 yrs exp., male, rural | 0.214 | 0.281 | 0.505 |
| 16 yrs education, 10 yrs exp., female, urban | 0.412 | 0.394 | 0.194 |
| 18 yrs education, 20 yrs exp., male, urban | 0.631 | 0.248 | 0.121 |
Example 3: Ordered Logit — Customer Satisfaction
Research Question: What factors predict customer satisfaction with a bank, rated on a 5-point scale (1 = Very Dissatisfied, ..., 5 = Very Satisfied)?
Data: bank customers; outcome: satisfaction rating .
Step 1: Ordered Logit Results
| Variable | $\hat\beta$ | SE | $z$ | $p$ |
|---|---|---|---|---|
| Account Tenure (years) | 0.182 | 0.023 | 7.91 | <0.001 |
| Branch Wait Time (minutes) | -0.241 | 0.038 | -6.34 | <0.001 |
| Mobile App User | 0.612 | 0.084 | 7.29 | <0.001 |
| Complaint (last 12 mo.) | -1.143 | 0.112 | -10.21 | <0.001 |
| Premium Account | 0.831 | 0.098 | 8.48 | <0.001 |
Estimated Thresholds:
| Threshold | Estimate | SE |
|---|---|---|
| $\kappa_1$ (1\|2) | -3.412 | 0.181 |
| $\kappa_2$ (2\|3) | -1.841 | 0.143 |
| $\kappa_3$ (3\|4) | 0.321 | 0.121 |
| $\kappa_4$ (4\|5) | 2.184 | 0.152 |
Step 2: Brant Test for Proportional Odds
| Variable | $\chi^2$ | $p$ | PO Violated? |
|---|---|---|---|
| Account Tenure | 2.14 | 0.543 | No |
| Branch Wait Time | 3.91 | 0.271 | No |
| Mobile App User | 4.21 | 0.240 | No |
| Complaint | 18.41 | <0.001 | Yes |
| Premium Account | 3.12 | 0.373 | No |
| Global test | 32.18 | 0.009 | Yes |
The complaint variable violates proportional odds → estimate a Generalised Ordered Logit allowing the complaint coefficient to vary across thresholds.
Step 3: Average Partial Effects on P(Y = 5: Very Satisfied)
| Variable | APE (change in probability) | SE | $p$ |
|---|---|---|---|
| Tenure (+1 year) | +0.024 | 0.003 | <0.001 |
| Wait Time (+1 min) | -0.031 | 0.005 | <0.001 |
| Mobile App User | +0.082 | 0.011 | <0.001 |
| Complaint (yes vs. no) | -0.183 | 0.018 | <0.001 |
| Premium Account | +0.111 | 0.013 | <0.001 |
Having a complaint in the last 12 months reduces the probability of being Very Satisfied by 18.3 pp — by far the largest effect.
Example 4: Mixed Logit — Transportation Mode Choice
Research Question: How do travellers' preferences for cost, time, and comfort vary across individuals when choosing among Car, Bus, Train, and Bicycle?
Data: Stated preference survey; respondents, each evaluating 8 hypothetical choice scenarios (long format, rows); alternatives with attributes: cost ($), travel time (min.), comfort rating (1-5).
Step 1: Mixed Logit Specification
| Attribute | Distribution |
|---|---|
| Cost ($) | Fixed (negative) |
| Travel Time (min.) | Normal: $N(\mu_{time}, \sigma_{time}^2)$ |
| Comfort Rating | Normal: $N(\mu_{comfort}, \sigma_{comfort}^2)$ |
| ASC: Car | Fixed |
| ASC: Train | Fixed |
| ASC: Bus | Fixed |
| (Bicycle = Reference ASC) | — |
Halton draws used.
Step 2: Results
| Parameter | Estimate | SE | $z$ | $p$ |
|---|---|---|---|---|
| Cost ($\beta_{cost}$) | -0.0412 | 0.006 | -6.87 | <0.001 |
| Time mean ($\mu_{time}$) | -0.0841 | 0.012 | -7.01 | <0.001 |
| Time SD ($\sigma_{time}$) | 0.0412 | 0.008 | 5.15 | <0.001 |
| Comfort mean ($\mu_{comfort}$) | 0.3121 | 0.041 | 7.61 | <0.001 |
| Comfort SD ($\sigma_{comfort}$) | 0.1843 | 0.029 | 6.35 | <0.001 |
| ASC: Car | 1.241 | 0.182 | 6.82 | <0.001 |
| ASC: Train | 0.814 | 0.151 | 5.39 | <0.001 |
| ASC: Bus | -0.312 | 0.141 | -2.21 | 0.027 |
Simulated ; McFadden (vs. for standard MNL).
Step 3: WTP Calculations (Krinsky-Robb 95% CI)
| Attribute | WTP Estimate | 95% CI |
|---|---|---|
| Travel time (per minute saved) | $2.04/min | [$1.61, $2.51] |
| Comfort (per unit increase) | $7.58/unit | [$5.91, $9.31] |
Travellers are willing to pay $2.04 per minute of travel time savings — a Value of Travel Time (VTT) estimate consistent with the transport economics literature.
Step 4: Preference Heterogeneity
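The significant $\sigma_{time}$ (Time SD) indicates substantial preference heterogeneity: 95% of the population has time sensitivity in the range $[-0.165, -0.003]$ (that is, $\mu_{time} \pm 1.96\,\sigma_{time}$), which lies entirely below zero, so nearly all travellers dislike travel time. In contrast, comfort has $\sigma_{comfort}/\mu_{comfort} \approx 0.59$, implying that a small share of travellers (about 4.5% under the normal assumption) have negative comfort preferences, possibly capturing high-income travellers valuing solitude.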
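Population shares implied by a normally distributed coefficient follow directly from the normal CDF. The short sketch below plugs in the means and standard deviations from the Step 2 table; it takes the normal mixing distribution at face value and ignores sampling uncertainty in the estimates.

```python
from statistics import NormalDist

mu_time, sd_time = -0.0841, 0.0412
mu_comfort, sd_comfort = 0.3121, 0.1843

z975 = NormalDist().inv_cdf(0.975)   # about 1.96

# Central 95% range of the time coefficient across travellers
time_interval = (mu_time - z975 * sd_time, mu_time + z975 * sd_time)

# Shares of travellers on the "unexpected" side of zero
share_negative_comfort = NormalDist(mu_comfort, sd_comfort).cdf(0.0)
share_positive_time = 1.0 - NormalDist(mu_time, sd_time).cdf(0.0)
```

Because the normal distribution has unbounded support, some mass always falls on the wrong side of zero; a log-normal coefficient avoids this when the sign is known in advance.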
Example 5: Conditional Fixed Effects Logit — Panel Adoption Decision
Research Question: Does a reduction in technology cost (logged) increase the probability that a firm adopts a new production technology, controlling for all time-invariant firm characteristics?
Data: Annual panel of $N = 1{,}421$ manufacturing firms over $T$ years; outcome: $y_{it} = 1$ if firm $i$ adopts technology in year $t$; 214 firms (15%) adopt during the panel.
Model: Conditional fixed effects logit (Chamberlain), conditioning on $s_i = \sum_t y_{it}$.
| Variable | $\hat\beta$ | SE | $z$ | $p$ | APE |
|---|---|---|---|---|---|
| Log(Technology Cost) | -0.841 | 0.121 | -6.95 | <0.001 | -0.062 |
| Government Subsidy ($) | 0.0312 | 0.008 | 3.90 | <0.001 | +0.023 |
| Competitor Adoption Rate | 1.412 | 0.218 | 6.48 | <0.001 | +0.104 |
| Time Trend | 0.184 | 0.041 | 4.49 | <0.001 | +0.014 |
Number of firms contributing information: 214 (firms adopting at least once). Firms never adopting: 1,207 (dropped by conditioning). .
Interpretation: A 10% increase in technology cost reduces the probability of adoption by approximately $0.062 \times \ln(1.10) \approx 0.006$, i.e. about 0.6 pp per year, controlling for all time-invariant firm heterogeneity. Competitive pressure (competitor adoption rate) has the largest coefficient: if the rate is measured as a proportion between 0 and 1, a 10 pp increase in competitor adoption raises a firm's own adoption probability by about $0.104 \times 0.10 \approx 1.0$ pp.
18. Common Mistakes and How to Avoid Them
Mistake 1: Interpreting Raw Logit/Probit Coefficients as Marginal Effects
Problem: Reporting the raw $\hat\beta$ from a logit regression as "a one-unit increase in $X$
increases the probability of $Y = 1$ by $\hat\beta$." This is only correct for the Linear
Probability Model. In logit and probit, $\hat\beta$ is the change in the log-odds (logit) or
the latent index (probit), not the probability.
Solution: Always compute and report Average Partial Effects (APE) using the delta method
for standard errors. For communication to non-technical audiences, report predicted probabilities
for representative covariate profiles.
Mistake 2: Applying Multinomial Logit When IIA is Violated
Problem: Using the MNL for alternatives that are close substitutes (e.g., different bus routes,
similar brand variants), leading to unrealistic cross-substitution patterns predicted by the model.
Solution: Test IIA with the Hausman-McFadden or Small-Hsiao test. If IIA is suspect based on
subject-matter knowledge (similar alternatives exist), use Nested Logit (if the nesting
structure is clear) or Mixed Logit (for flexible substitution patterns). Report robustness
across model specifications.
Mistake 3: Ignoring the Proportional Odds Assumption in Ordered Logit
Problem: Estimating an ordered logit without testing the proportional odds assumption, and
reporting a single coefficient for each variable as if it applies uniformly across all thresholds.
When the assumption is violated, the estimated coefficient is an unreliable average.
Solution: Always run the Brant test (global and variable-specific). If violated for one or
more variables, use the Generalised Ordered Logit (partial proportional odds) or report
category-specific marginal effects. Never report ordered logit results without proportional odds
diagnostics.
Mistake 4: Using Standard Fixed Effects Logit Instead of Conditional Logit for Panel Data
Problem: Estimating a logit with individual dummy variables (LSDV approach) for panel data.
Due to the incidental parameters problem, $\hat\beta$ is inconsistent for fixed $T$.
With , the bias is approximately 100%; with , roughly 20%.
Solution: Use Chamberlain's conditional fixed effects logit for panel binary outcomes
(Stata: xtlogit, fe; R: clogit). For probit, use the Mundlak-Chamberlain correlated random
effects approach. Report within-individual variation only.
Mistake 5: Reporting Odds Ratios as Relative Risks (Risk Ratios)
Problem: Interpreting $OR = 2$ as "twice as likely." This is the odds ratio,
not the relative risk (risk ratio). For common outcomes (baseline risk above roughly 10%), the odds ratio
substantially overstates the relative risk.
Solution: Be explicit about reporting odds ratios (from logit) vs. relative risks. For
common outcomes, report Average Partial Effects (absolute probability changes) which are
clearer. If relative risk is needed, use Poisson regression with a log link and robust
standard errors, or compute predicted probability ratios directly.
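The gap between the two measures depends on the baseline risk, which a one-line conversion makes concrete. The sketch below uses the algebraic identity linking an odds ratio and a risk ratio for a simple two-group comparison; the baseline risks are hypothetical, and the function name is illustrative.

```python
def risk_ratio_from_odds_ratio(odds_ratio, baseline_risk):
    """Risk ratio implied by an odds ratio at a given baseline (unexposed) risk."""
    return odds_ratio / (1.0 - baseline_risk + baseline_risk * odds_ratio)

def exposed_risk(odds_ratio, baseline_risk):
    """Risk in the exposed group implied by the odds ratio."""
    odds = odds_ratio * baseline_risk / (1.0 - baseline_risk)
    return odds / (1.0 + odds)

rr_rare = risk_ratio_from_odds_ratio(2.0, 0.01)     # rare outcome
rr_common = risk_ratio_from_odds_ratio(2.0, 0.50)   # common outcome
```

With an odds ratio of 2, the risk ratio is close to 2 when the baseline risk is 1% but only about 1.33 when the baseline risk is 50%.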
Mistake 6: Using Only In-Sample Fit Statistics for Model Selection
Problem: Selecting a model (e.g., choosing logit over probit, or choosing a particular set of
covariates) based solely on in-sample pseudo-$R^2$ or log-likelihood, without accounting for
overfitting.
Solution: Use AIC/BIC for comparing models with different covariate sets. For predictive
models, use out-of-sample AUC or Brier score from cross-validation. Always check
calibration via Hosmer-Lemeshow. Distinguish between models for prediction vs. structural
inference.
Mistake 7: Not Checking for Complete Separation
Problem: Running logit/probit on small samples or with many binary predictors without checking
for complete separation. The MLE does not exist, but many software packages produce output
with extremely large (meaningless) coefficients and standard errors without warning the user.
Solution: Check for separation before relying on MLE estimates. Warning signs: coefficients
, SEs , predicted probabilities exactly at 0 or 1. Use Firth penalised
logit or Bayesian logit (weakly informative priors) as robust alternatives to standard MLE
in small or sparse samples.
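A quick screening check for a single predictor can be written in a few lines. This sketch only detects separation by one variable at a time; separation by a linear combination of several predictors needs a more general method and is not covered here. The function name and the example data are illustrative.

```python
def separation_status(x, y):
    """Classify separation of a binary outcome y by a single predictor x."""
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        return "no variation in outcome"
    # Complete: the two outcome groups do not overlap at all
    if max(x0) < min(x1) or max(x1) < min(x0):
        return "complete"
    # Quasi-complete: the groups touch only at a boundary value
    if max(x0) <= min(x1) or max(x1) <= min(x0):
        return "quasi-complete"
    return "none"

status_complete = separation_status([1, 2, 3, 4], [0, 0, 1, 1])
status_quasi = separation_status([1, 2, 2, 3], [0, 0, 1, 1])
status_none = separation_status([1, 3, 2, 4], [0, 0, 1, 1])
```

For a binary predictor the same idea amounts to checking the 2×2 cross-tabulation for an empty cell.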
Mistake 8: Including Irrelevant Alternatives in the Choice Set
Problem: Defining the choice set too broadly (e.g., including alternatives that are not actually
available to the decision-maker) or too narrowly (excluding relevant alternatives). Both distort
the estimated choice probabilities.
Solution: Carefully define the choice set based on availability. For alternative-specific
choice sets (where different individuals face different options), specify the availability matrix
in the model. Report the sensitivity of results to alternative choice set definitions.
Mistake 9: Failing to Account for Preference Heterogeneity
Problem: Estimating a standard MNL or conditional logit that assumes homogeneous preferences
across all individuals, missing important heterogeneity in price sensitivity, taste, or value of
time. This leads to biased substitution patterns and misleading policy simulations.
Solution: Test for heterogeneity by including interaction terms with demographic variables.
For more flexible heterogeneity, estimate a Mixed Logit with normally distributed random
coefficients. Report the distribution of individual-level preferences, not just the mean.
Mistake 10: Using the Wrong Data Format for Conditional Logit
Problem: Estimating a conditional logit with individual-specific data in wide format (one
row per person, multiple columns for different alternatives' attributes). This causes data errors
and incorrect likelihood contributions.
Solution: Convert data to long format: one row per alternative per individual. The dataset
should have $N \times J$ rows ($N$ individuals, $J$ alternatives). Verify the choice indicator
is coded as 1 for the chosen alternative and 0 for all others, within each individual's choice set.
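The reshaping step can be sketched without any libraries. The column naming scheme (`cost_car`, `time_bus`, and so on), the `id` and `choice` fields, and the two example respondents are all hypothetical; real data will need its own mapping.

```python
def wide_to_long(rows, alternatives, attributes):
    """Expand one row per person into one row per person-alternative pair."""
    long_rows = []
    for row in rows:
        for alt in alternatives:
            record = {
                "id": row["id"],
                "alt": alt,
                "chosen": 1 if row["choice"] == alt else 0,
            }
            for attr in attributes:
                record[attr] = row[f"{attr}_{alt}"]   # e.g. cost_car -> cost
            long_rows.append(record)
    return long_rows

wide = [
    {"id": 1, "choice": "car", "cost_car": 5.0, "cost_bus": 2.0,
     "time_car": 20, "time_bus": 35},
    {"id": 2, "choice": "bus", "cost_car": 6.0, "cost_bus": 2.5,
     "time_car": 25, "time_bus": 30},
]
long_data = wide_to_long(wide, alternatives=["car", "bus"], attributes=["cost", "time"])

# Sanity check: exactly one chosen alternative per individual
chosen_per_id = {}
for rec in long_data:
    chosen_per_id[rec["id"]] = chosen_per_id.get(rec["id"], 0) + rec["chosen"]
```

Running the sanity check after every reshape catches miscoded choice labels before they reach the likelihood.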
19. Troubleshooting
| Issue | Likely Cause | Solution |
|---|---|---|
| MLE does not converge | Poor starting values; very flat likelihood; complete separation | Use OLS/LPM as starting values; rescale variables; check for separation; try Firth logit |
| Extremely large coefficients or SEs () | Complete or quasi-complete separation; collinearity | Check pairwise correlations; VIF analysis; merge categories; use penalised estimation (Firth) |
| Hessian not negative definite at convergence | Local optimum or saddle point; non-concave model extension | Try multiple starting values; use a different optimizer (BFGS vs. Newton-Raphson); check model specification |
| Predicted probabilities exactly 0 or 1 | Complete separation; extreme covariate values | Identify separating combination; drop/transform variable; use Firth logit; check for data errors |
| APE has wrong sign compared to coefficient | Non-linear interaction effects; cross-effects in MNL | APE in MNL can have opposite sign to log-odds coefficient — this is expected; report both |
| IIA Hausman test gives negative statistic | Small sample; numerical imprecision in Hessian estimation | Use Small-Hsiao test instead; check if restricted model is nested in full model; try more alternatives |
| Brant test significant (proportional odds violated) | Heterogeneous covariate effects across thresholds | Fit Generalised Ordered Logit; or Multinomial Logit; report variable-specific Brant results to identify culprits |
| Mixed logit does not converge | Too many random parameters; too few draws; poor scaling | Increase draws (to 1000+); scale attributes to similar magnitude; fix some parameters as fixed; simplify model |
| Conditional logit: no observations after conditioning | All individuals have $s_i = 0$ or $s_i = T$ | Verify panel structure; ensure within-individual variation in $y_{it}$; check treatment coding |
| WTP confidence interval is extremely wide or includes infinity | Cost coefficient close to zero; poor precision | Report Krinsky-Robb CI instead of delta method; increase sample size; consider fixing cost coefficient |
| Hosmer-Lemeshow test rejects calibration | Model does not predict outcome rates accurately in some regions | Inspect calibration plot decile by decile; add polynomial terms for continuous variables; check for important omitted variables |
| AUC is high but calibration is poor | Model discriminates well but predicted probabilities are poorly scaled | Apply Platt scaling or isotonic regression for calibration correction; consider calibrated probability estimation |
| Panel random effects probit: very slow convergence | Many quadrature points needed; complex likelihood surface | Reduce quadrature points (e.g., 12-20 are usually sufficient); use adaptive quadrature; use Mundlak-Chamberlain approach with standard probit |
| Nested logit: $\hat\tau_k > 1$ | Incorrect nesting structure; misspecified model | The model is not RUM-consistent; rethink nesting structure; try Mixed Logit as alternative |
| MNL: predictions dominated by one category | Class imbalance; misspecified alternative | Check class proportions; verify reference category; consider alternative-specific constants |
| Interaction terms insignificant despite theoretical expectation | Insufficient statistical power; multicollinearity | Check VIF for interaction; report effect sizes with CIs regardless of significance; consider power analysis |
20. Quick Reference Cheat Sheet
Core Probability Formulas
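| Model | Link Function | Probability |
|---|---|---|
| Logit | Logit: $\ln\frac{p}{1-p} = x'\beta$ | $P(Y = 1 \mid x) = \Lambda(x'\beta) = \frac{e^{x'\beta}}{1 + e^{x'\beta}}$ |
| Probit | Probit: $\Phi^{-1}(p) = x'\beta$ | $P(Y = 1 \mid x) = \Phi(x'\beta)$ |
| LPM | Identity | $P(Y = 1 \mid x) = x'\beta$ |
| MNL | Log relative odds | $P(Y = j \mid x) = \frac{e^{x'\beta_j}}{\sum_{m=1}^{J} e^{x'\beta_m}}$, with $\beta_J = 0$ |
| Ordered Logit | Proportional odds | $P(Y \le j \mid x) = \Lambda(\kappa_j - x'\beta)$ |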
Key Formulas
| Formula | Description |
|---|---|
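| $\ell(\beta) = \sum_i \left[y_i \ln F(x_i'\beta) + (1 - y_i)\ln(1 - F(x_i'\beta))\right]$ | Binary logit/probit log-likelihood |
| $APE_k = \frac{1}{N}\sum_i \frac{\partial P(y_i = 1 \mid x_i)}{\partial x_{ik}}$ | Average Partial Effect (continuous $x_k$) |
| $\hat\beta_k \cdot \frac{1}{N}\sum_i \Lambda(x_i'\hat\beta)[1 - \Lambda(x_i'\hat\beta)]$ | APE for logit |
| $\hat\beta_k \cdot \frac{1}{N}\sum_i \phi(x_i'\hat\beta)$ | APE for probit |
| $OR_k = e^{\hat\beta_k}$ | Odds ratio from logit |
| $WTP_k = -\hat\beta_k / \hat\beta_{cost}$ | Willingness to pay |
| $R^2_{McF} = 1 - \ln\hat L / \ln L_0$ | McFadden's pseudo-$R^2$ |
| $LR = 2(\ln L_U - \ln L_R) \sim \chi^2_q$ | Likelihood Ratio test |
| $P_{ij \mid k} = \frac{\exp(V_{ij}/\tau_k)}{\sum_{m \in B_k}\exp(V_{im}/\tau_k)}$ | Nested logit conditional probability |
| $\beta^{(t+1)} = \beta^{(t)} + (X'WX)^{-1}X'(y - \hat p)$ | Newton-Raphson / IRLS update |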
Model Selection Guide
| Outcome Type | $J$ | Alternatives | Recommended Model |
|---|---|---|---|
| Binary | 2 | — | Logit (default) or Probit |
| Nominal | $J > 2$ | No attributes | Multinomial Logit |
| Nominal | $J > 2$ | With attributes | Conditional Logit |
| Nominal (correlated) | $J > 2$ | Nested groups | Nested Logit |
| Nominal (heterogeneous) | $J > 2$ | Random preferences | Mixed Logit |
| Ordinal | $J > 2$ | PO holds | Ordered Logit |
| Ordinal | $J > 2$ | PO violated | Generalised Ordered Logit |
| Binary, panel FE | 2 | — | Conditional FE Logit |
| Binary, panel RE | 2 | — | Random Effects Probit (Mundlak) |
Assumption Checklist
| Assumption | Model | How to Test | If Violated |
|---|---|---|---|
| Correct link function | Logit/Probit | Linktest; Box-Tidwell | Try alternative link; add polynomial terms |
| No complete separation | All binary | Check large SEs; predicted probs = 0/1 | Firth penalised MLE; Bayesian logit |
| IIA | MNL, CL | Hausman-McFadden; Small-Hsiao | Nested Logit; Mixed Logit |
| Proportional odds | Ordered Logit | Brant test; parallel lines graph | Generalised Ordered Logit; MNL |
| No heteroscedasticity | Probit | Linktest; heteroscedastic probit | Heteroscedastic probit; robust SEs |
| No perfect multicollinearity | All | VIF; condition number | Drop/combine variables; regularise |
| RUM consistency | Nested Logit | Check that $0 < \hat\tau_k \le 1$ for every nest | Respecify nesting; Mixed Logit |
| No endogeneity | All | Hausman test vs. IV estimator | Control function; IV logit/probit |
Marginal Effects: Type and Context
| Context | Measure | Formula |
|---|---|---|
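| Average effect (standard) | APE / AME | $\frac{1}{n} \sum_{i=1}^{n} f(\mathbf{x}_i'\hat{\boldsymbol{\beta}})\,\hat{\beta}_k$ |
| Effect at average person | PEM | $f(\bar{\mathbf{x}}'\hat{\boldsymbol{\beta}})\,\hat{\beta}_k$ |
| Effect for specific profile | Marginal effect at representative value | $f(\mathbf{x}^{*\prime}\hat{\boldsymbol{\beta}})\,\hat{\beta}_k$ |
| Binary covariate | Discrete change | $\frac{1}{n} \sum_{i=1}^{n} \left[ F(\mathbf{x}_i'\hat{\boldsymbol{\beta}} \mid x_{ik} = 1) - F(\mathbf{x}_i'\hat{\boldsymbol{\beta}} \mid x_{ik} = 0) \right]$ |
| Log-odds scale | Raw coefficient | $\hat{\beta}_k$ |
| Multiplicative odds | Odds ratio | $e^{\hat{\beta}_k}$ |
| Money metric | Willingness to Pay | $-\hat{\beta}_k / \hat{\beta}_{\text{cost}}$ |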
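
The measures in this table differ mainly in where the density is evaluated. The sketch below shows that difference for a logit model, using plain Python lists. All function names are hypothetical, and each row of `X` is assumed to start with a 1 for the intercept so that it lines up with `beta`.

```python
import math

def logistic(z):
    """Numerically stable logistic CDF."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear_index(row, beta):
    """Inner product x'beta for one observation."""
    return sum(b * v for b, v in zip(beta, row))

def logit_ape(X, beta, k):
    """Average partial effect: average the individual effects over the sample."""
    effects = []
    for row in X:
        p = logistic(linear_index(row, beta))
        effects.append(p * (1.0 - p) * beta[k])  # logistic density times beta_k
    return sum(effects) / len(effects)

def logit_pem(X, beta, k):
    """Partial effect at the mean: evaluate the effect once, at the average x."""
    n = len(X)
    xbar = [sum(row[j] for row in X) / n for j in range(len(beta))]
    p = logistic(linear_index(xbar, beta))
    return p * (1.0 - p) * beta[k]

def logit_discrete_change(X, beta, k):
    """Average change in P(y=1) when binary covariate k switches from 0 to 1."""
    diffs = []
    for row in X:
        on, off = list(row), list(row)
        on[k], off[k] = 1.0, 0.0
        diffs.append(logistic(linear_index(on, beta)) - logistic(linear_index(off, beta)))
    return sum(diffs) / len(diffs)

def odds_ratio(beta_k):
    """Multiplicative change in the odds for a one-unit increase in x_k."""
    return math.exp(beta_k)

# Hypothetical design matrix (intercept, x1) and coefficients.
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 2.0]]
beta = [0.2, 0.8]
print(logit_ape(X, beta, 1), logit_pem(X, beta, 1), odds_ratio(beta[1]))
```

The APE and PEM generally differ because the logistic density is nonlinear in the index; they coincide only in special cases such as a single observation.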
Standard Error Selection
| Setting | Recommended SE | Rationale |
|---|---|---|
| IID observations, correct spec. | MLE information matrix SEs | Efficient; standard |
| Potential misspecification | Sandwich (robust) SEs | Robust to distributional misspecification |
| Clustered data (firms, regions) | Cluster-robust SEs | Within-cluster correlation |
| Small samples | Bootstrap SEs | More reliable finite-sample inference |
| Marginal effects | Delta method SEs | Propagates uncertainty from $\hat{\boldsymbol{\beta}}$ |
| WTP | Krinsky-Robb simulation | Better for ratios of estimates |
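
The Krinsky-Robb row can be illustrated with a short simulation: draw the two coefficients from their estimated joint normal distribution, form the ratio for each draw, and read off percentiles. This is a sketch under stated assumptions. The function name, its arguments, the number of draws, and the example estimates are hypothetical, and only the 2x2 block of the covariance matrix for the attribute and cost coefficients is used.

```python
import math
import random

def krinsky_robb_wtp(b_attr, b_cost, var_attr, var_cost, cov,
                     n_draws=10000, seed=0):
    """Simulated 95% interval for WTP = -b_attr / b_cost."""
    rng = random.Random(seed)
    # Cholesky factor of the 2x2 covariance matrix of (b_attr, b_cost)
    l11 = math.sqrt(var_attr)
    l21 = cov / l11
    l22 = math.sqrt(var_cost - l21 ** 2)
    draws = []
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        attr_draw = b_attr + l11 * z1
        cost_draw = b_cost + l21 * z1 + l22 * z2
        draws.append(-attr_draw / cost_draw)
    draws.sort()
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws) - 1]
    return -b_attr / b_cost, lower, upper

# Hypothetical estimates: attribute coefficient 1.0, cost coefficient -0.5.
point, lower, upper = krinsky_robb_wtp(1.0, -0.5, 0.01, 0.0025, 0.0)
print(point, lower, upper)
```

Because the interval comes from percentiles of a simulated ratio, it need not be symmetric around the point estimate, which is the main reason this approach is often preferred to the delta method for WTP. The sketch is only sensible when the cost coefficient is estimated well away from zero.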
Fit Statistics at a Glance
| Statistic | Formula | Best for |
|---|---|---|
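| McFadden $R^2$ | $1 - \ln \hat{L} / \ln \hat{L}_0$ | Overall model fit |
| AIC | $-2 \ln \hat{L} + 2k$ | Model comparison (prediction) |
| BIC | $-2 \ln \hat{L} + k \ln n$ | Model comparison (parsimony) |
| AUC-ROC | Area under ROC curve | Binary discrimination |
| Brier Score | $\frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2$ | Probability calibration |
| Count $R^2$ | Pct. correctly classified | Naive classification |
| Hosmer-Lemeshow | $\hat{C} = \sum_{g=1}^{G} \frac{(O_g - n_g \bar{p}_g)^2}{n_g \bar{p}_g (1 - \bar{p}_g)}$ | Calibration across prediction deciles |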
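
Most of the statistics above can be computed directly from the observed outcomes and the fitted probabilities. The following sketch assumes a binary outcome coded 0/1, a 0.5 classification threshold for the count measure, and an intercept-only null model. The function names and the example data are hypothetical.

```python
import math

def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total

def fit_statistics(y, p, n_params):
    """McFadden R^2, AIC, BIC, Brier score, and count R^2 for a binary model."""
    n = len(y)
    ll = log_likelihood(y, p)
    ybar = sum(y) / n
    ll_null = log_likelihood(y, [ybar] * n)  # intercept-only model
    return {
        "mcfadden_r2": 1.0 - ll / ll_null,
        "aic": -2.0 * ll + 2.0 * n_params,
        "bic": -2.0 * ll + n_params * math.log(n),
        "brier": sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / n,
        "count_r2": sum((pi >= 0.5) == (yi == 1) for yi, pi in zip(y, p)) / n,
    }

def auc(y, p):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and fitted probabilities.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(fit_statistics(y, p, n_params=2))
print(auc(y, p))
```

The pairwise AUC loop is quadratic in the sample size, which is fine for illustration but slow for large datasets, where a rank-based formula is the usual choice.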
Panel Discrete Choice: Key Properties
| Estimator | Consistent ($N \to \infty$, fixed $T$)? | Time-Invariant Variables? | Dynamic State Dependence? | Key Reference |
|---|---|---|---|---|
| LSDV Logit (incidental params.) | ❌ | ❌ | Limited | — |
| Conditional FE Logit | ✅ | ❌ | No (by default) | Chamberlain (1980) |
| RE Probit (standard) | ✅ (if the individual effects are uncorrelated with the regressors) | ✅ | No | — |
| Correlated RE Probit (Mundlak) | ✅ (approximately) | ✅ | No | Mundlak (1978) |
| Dynamic Logit (Wooldridge) | ✅ | Limited | ✅ | Wooldridge (2005) |
| Mixed Logit (panel) | ✅ | ✅ | Via serial correlation | McFadden & Train (2000) |
This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Discrete Choice Models using the DataStatPro application. For further reading, consult McFadden's "Conditional Logit Analysis of Qualitative Choice Behavior" (1974), Train's "Discrete Choice Methods with Simulation" (Cambridge University Press, 2009), Greene's "Econometric Analysis" (8th ed., 2018), Long's "Regression Models for Categorical and Limited Dependent Variables" (Sage, 1997), or Wooldridge's "Econometric Analysis of Cross Section and Panel Data" (MIT Press, 2010). For feature requests or support, contact the DataStatPro team.