Difference-in-Differences Models: Zero to Hero Tutorial
This comprehensive tutorial takes you from the foundational concepts of Difference-in-Differences (DiD) estimation all the way through advanced extensions, assumption testing, heterogeneity analysis, and practical usage within the DataStatPro application. Whether you are a complete beginner or an experienced analyst, this guide is structured to build your understanding step by step.
Table of Contents
- Prerequisites and Background Concepts
- What is the Difference-in-Differences Design?
- The Mathematical Framework
- The Parallel Trends Assumption
- Identification and Causal Inference
- Standard DiD Estimation
- Hypothesis Testing and Inference
- Effect Size Measures
- Model Fit and Evaluation
- Diagnostics and Assumption Testing
- Extensions: Staggered DiD and Multiple Time Periods
- Extensions: Heterogeneous Treatment Effects
- Extensions: Continuous and Fuzzy Treatment
- Covariates and Controls in DiD
- Using the DiD Component
- Computational and Formula Details
- Worked Examples
- Common Mistakes and How to Avoid Them
- Troubleshooting
- Quick Reference Cheat Sheet
1. Prerequisites and Background Concepts
Before diving into Difference-in-Differences, it is helpful to be familiar with the following foundational concepts. Do not worry if you are not — each concept is briefly explained here.
1.1 Counterfactuals and the Potential Outcomes Framework
The potential outcomes framework (Rubin Causal Model) is the conceptual foundation of causal inference. For each unit $i$ and time period $t$, define:
- $Y_{it}(1)$: The potential outcome that would occur if unit $i$ received treatment at time $t$.
- $Y_{it}(0)$: The potential outcome that would occur if unit $i$ did not receive treatment at time $t$.
The individual treatment effect for unit $i$ at time $t$ is:
$$\tau_{it} = Y_{it}(1) - Y_{it}(0)$$
The fundamental problem of causal inference: We can never observe both $Y_{it}(1)$ and $Y_{it}(0)$ for the same unit at the same time. We observe only one — the realised outcome. The unobserved outcome is called the counterfactual.
The Average Treatment Effect on the Treated (ATT) is:
$$\text{ATT} = E[Y_{it}(1) - Y_{it}(0) \mid D_i = 1]$$
DiD is one of the most widely used methods for estimating the ATT using a comparison group to approximate the unobserved counterfactual.
1.2 Selection Bias
Selection bias arises when the assignment to treatment is not random — treated and control units differ systematically in ways that also affect the outcome. A naïve comparison of treated vs. untreated units confounds the treatment effect with pre-existing differences:
$$E[Y \mid D = 1] - E[Y \mid D = 0] = \underbrace{E[Y(1) - Y(0) \mid D = 1]}_{\text{ATT}} + \underbrace{E[Y(0) \mid D = 1] - E[Y(0) \mid D = 0]}_{\text{selection bias}}$$
DiD removes selection bias due to time-invariant unobserved differences between groups.
1.3 Panel Data
Panel data (also called longitudinal data) consists of observations on the same units (individuals, firms, regions, countries) over multiple time periods. It has two dimensions: a cross-sectional dimension ($N$ units) and a time dimension ($T$ periods).
Panel data are written as $\{Y_{it}\}$ for unit $i = 1, \dots, N$ and period $t = 1, \dots, T$.
DiD most naturally arises in a panel data context, though it can also be implemented with repeated cross-sections.
1.4 Fixed Effects
A unit fixed effect $\alpha_i$ captures all time-invariant characteristics of unit $i$ — both observed and unobserved — that affect the outcome. By including fixed effects in a regression, we effectively compare each unit to itself over time, removing all time-invariant confounders.
A time fixed effect $\gamma_t$ captures factors that affect all units equally at time $t$ — common macroeconomic conditions, seasonal patterns, or universal policy changes.
1.5 Ordinary Least Squares (OLS) Regression
OLS regression finds the linear relationship between predictors and outcome by minimising the sum of squared residuals:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i} \left( Y_i - X_i'\beta \right)^2$$
DiD is typically implemented as an OLS regression with specific interaction terms and fixed effects, so familiarity with OLS is essential.
1.6 Treatment Assignment and Natural Experiments
A natural experiment is a situation in which the assignment of units to treatment and control conditions is determined by some external, exogenous factor — rather than by the researcher or by the units themselves. Natural experiments approximate the conditions of a randomised controlled trial. Common examples:
- A policy reform that applies to some regions but not others.
- A law change that takes effect at a specific date.
- Geographic boundaries that determine eligibility for a programme.
- Lotteries or other chance-based assignment mechanisms.
DiD is the workhorse estimator for natural experiments with a pre-period and a post-period.
2. What is the Difference-in-Differences Design?
2.1 The Core Idea
Difference-in-Differences (DiD) is a quasi-experimental research design that estimates the causal effect of a treatment or policy by comparing the change over time in the outcome for a treated group to the change over time in the outcome for an untreated (control) group.
The intuition is straightforward:
- The first difference (within the treated group, over time) removes time-invariant differences between treated units and the rest of the world.
- The second difference (between treated and control, within the same time period) removes common time trends that affect both groups equally.
By taking the difference of these two differences, DiD isolates the treatment effect from:
- Pre-existing level differences between treated and control groups.
- Common time trends affecting both groups equally.
2.2 The 2×2 DiD Setup
The simplest (canonical) DiD design has:
- Two groups: A treated group ($D_i = 1$) and a control group ($D_i = 0$).
- Two time periods: A pre-treatment period ($P_t = 0$) and a post-treatment period ($P_t = 1$).
- One treatment: Applied only to the treated group, only in the post-treatment period.
The canonical 2×2 DiD table of group-period means:
| | Pre-Period ($P = 0$) | Post-Period ($P = 1$) | Difference (Post − Pre) |
|---|---|---|---|
| Treated ($D = 1$) | $\bar{Y}_{T,\text{pre}}$ | $\bar{Y}_{T,\text{post}}$ | $\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}$ |
| Control ($D = 0$) | $\bar{Y}_{C,\text{pre}}$ | $\bar{Y}_{C,\text{post}}$ | $\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}}$ |
| Difference (Treated − Control) | $\bar{Y}_{T,\text{pre}} - \bar{Y}_{C,\text{pre}}$ | $\bar{Y}_{T,\text{post}} - \bar{Y}_{C,\text{post}}$ | $\hat{\delta}_{DiD}$ |
The DiD estimate:
$$\hat{\delta}_{DiD} = (\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}) - (\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}})$$
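The arithmetic of the 2×2 table can be checked directly. A minimal Python sketch with hypothetical cell means (all numbers are illustrative only):

```python
# 2x2 DiD computed from the four group-period means (hypothetical numbers).
y_treat_pre, y_treat_post = 20.0, 26.0   # treated group: pre and post means
y_ctrl_pre, y_ctrl_post = 18.0, 21.0     # control group: pre and post means

diff_treated = y_treat_post - y_treat_pre   # first difference, treated group: 6.0
diff_control = y_ctrl_post - y_ctrl_pre     # first difference, control group: 3.0
did = diff_treated - diff_control           # difference-in-differences: 3.0
print(did)  # 3.0
```

The treated group improved by 6 units and the control group by 3, so the DiD estimate attributes 3 units of the change to treatment.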
2.3 Real-World Applications
DiD is one of the most widely applied methods in empirical social science, economics, public health, and policy evaluation:
- Labour Economics: Card & Krueger (1994) — Effect of New Jersey's minimum wage increase on fast-food employment, using Pennsylvania as the control group.
- Health Policy: Effect of the Affordable Care Act (ACA) Medicaid expansion on health insurance coverage and health outcomes, comparing expansion to non-expansion states.
- Education Policy: Effect of class size reductions (STAR experiment) or school voucher programmes on student achievement.
- Environmental Economics: Effect of environmental regulations (e.g., Clean Air Act) on air pollution and health outcomes.
- Finance: Effect of financial crises, banking regulations, or central bank interventions on lending and economic activity.
- Criminology: Effect of policing policies, incarceration changes, or gun laws on crime rates.
- Public Health: Effect of vaccination campaigns, smoking bans, or lockdown policies on health outcomes.
- Development Economics: Effect of microcredit programmes, cash transfers, or infrastructure investments on household welfare.
2.4 DiD vs. Other Quasi-Experimental Methods
| Method | Key Assumption | When to Use |
|---|---|---|
| DiD | Parallel trends in absence of treatment | Panel data or repeated cross-sections; policy timing varies |
| Regression Discontinuity (RD) | No manipulation around the cutoff | Assignment determined by a continuous threshold |
| Instrumental Variables (IV) | Instrument relevance and exclusion | Endogenous treatment with a valid instrument |
| Synthetic Control | Weighted average of controls matches treated pre-trend | Single treated unit; many potential controls |
| Event Study | No pre-trends; clean identification window | Multiple time periods around treatment timing |
| Propensity Score Matching | Conditional independence (selection on observables) | Rich covariate data; no unobservable confounders |
3. The Mathematical Framework
3.1 The Canonical 2×2 DiD Model
The standard regression formulation of the 2×2 DiD model is:
$$Y_{it} = \beta_0 + \beta_1 D_i + \beta_2 P_t + \delta \,(D_i \times P_t) + \varepsilon_{it}$$
Where:
- $Y_{it}$ = outcome for unit $i$ at time $t$.
- $D_i$ = indicator for whether unit $i$ belongs to the treated group (time-invariant).
- $P_t$ = indicator for whether time period $t$ is the post-treatment period (unit-invariant).
- $D_i \times P_t$ = the DiD interaction term.
- $\beta_0$ = baseline mean for the control group in the pre-period.
- $\beta_1$ = pre-treatment difference in levels between treated and control groups (selection bias term).
- $\beta_2$ = common time trend from pre to post period (time effect for the control group).
- $\delta$ = the DiD estimator — the causal effect of the treatment.
- $\varepsilon_{it}$ = idiosyncratic error term.
Predicted cell means from the regression:
| | Pre ($P = 0$) | Post ($P = 1$) | Difference |
|---|---|---|---|
| Control ($D = 0$) | $\beta_0$ | $\beta_0 + \beta_2$ | $\beta_2$ |
| Treated ($D = 1$) | $\beta_0 + \beta_1$ | $\beta_0 + \beta_1 + \beta_2 + \delta$ | $\beta_2 + \delta$ |
| Difference (T − C) | $\beta_1$ | $\beta_1 + \delta$ | $\delta$ |
3.2 The Two-Way Fixed Effects (TWFE) Model
The most general and widely used DiD regression extends the canonical model to panel data with unit fixed effects and time fixed effects:
$$Y_{it} = \alpha_i + \gamma_t + \delta D_{it} + X_{it}'\beta + \varepsilon_{it}$$
Where:
- $\alpha_i$ = unit (entity) fixed effect — absorbs all time-invariant unit characteristics.
- $\gamma_t$ = time fixed effect — absorbs all period-specific shocks common to all units.
- $D_{it}$ = treatment indicator (= 1 if unit $i$ is treated at time $t$).
- $X_{it}$ = vector of time-varying controls.
- $\delta$ = the DiD estimator of the Average Treatment Effect on the Treated (ATT).
Key insight: The TWFE model is the natural extension of the 2×2 DiD to multiple units and multiple time periods. The coefficient on the treatment indicator is the DiD estimate once unit and time fixed effects are included.
3.3 The Within-Estimator Interpretation
The TWFE estimator is equivalent to the within estimator (demeaning). Define:
$$\ddot{Y}_{it} = Y_{it} - \bar{Y}_i - \bar{Y}_t + \bar{Y}$$
Where $\bar{Y}_i$ (unit mean), $\bar{Y}_t$ (time mean), and $\bar{Y}$ (grand mean). Similarly define $\ddot{D}_{it}$ and $\ddot{X}_{it}$.
The TWFE estimator (without covariates) is:
$$\hat{\delta} = \frac{\sum_{i,t} \ddot{D}_{it} \ddot{Y}_{it}}{\sum_{i,t} \ddot{D}_{it}^2}$$
This identifies $\delta$ from within-unit, within-time variation in treatment status.
3.4 Potential Outcomes Representation
In the potential outcomes framework, the DiD estimand is:
$$\delta = E[Y_{i1}(1) - Y_{i1}(0) \mid D_i = 1]$$
This is the ATT in the post-treatment period — the average effect on the treated units of receiving treatment.
The DiD identification strategy replaces the unobserved counterfactual $E[Y_{i1}(0) \mid D_i = 1]$ with the observable:
$$E[Y_{i0}(0) \mid D_i = 1] + \big( E[Y_{i1}(0) \mid D_i = 0] - E[Y_{i0}(0) \mid D_i = 0] \big)$$
This is exactly the parallel trends assumption — the counterfactual trend for the treated group equals the observed trend for the control group.
4. The Parallel Trends Assumption
4.1 The Assumption Stated
The parallel trends assumption (also called the common trends or parallel paths assumption) is the key identifying assumption of DiD:
In the absence of treatment, the average outcome for the treated group would have followed the same trend as the average outcome for the control group.
Formally:
$$E[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 1] = E[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 0]$$
Crucially: This assumption is about the counterfactual — what would have happened to the treated group had it not been treated. It is fundamentally untestable with post-treatment data, but can be supported with:
- Pre-treatment trend evidence (parallel pre-trends test).
- Institutional knowledge about why the groups are similar in their trends.
- Placebo tests.
4.2 Visualising Parallel Trends
The canonical DiD diagram plots the outcome over time for both groups:
Outcome
  |                                ● Treated (actual)
  |                               /
  |                              /    ↕ δ = Treatment Effect
  |                 ●···········○    ← Treated counterfactual (unobserved)
  |                /
  |               /
  |              ●
  |                 ●···········●
  |                /               Control (observed; its trend serves
  |               ●                as the counterfactual trend)
  |
  +-------------+---------------→ Time
        Pre           Post
                ↑
           Treatment
             begins
The treatment effect is the vertical distance between the actual treated outcome and the counterfactual treated outcome in the post-period. The control group's observed trajectory is the counterfactual trend.
4.3 When is Parallel Trends Plausible?
Parallel trends is more plausible when:
- Treatment and control groups are similar in observed characteristics and pre-treatment trends.
- Treatment is determined by a sharp, exogenous rule (geographic, legislative, administrative).
- The treatment and control groups come from the same broad population (e.g., neighbouring counties, similar industries, adjacent cohorts).
- There are no other contemporaneous changes that differentially affect treated and control groups.
Parallel trends is less plausible when:
- Groups are systematically different in ways related to the outcome trajectory (e.g., high-income vs. low-income countries).
- Treatment is self-selected based on anticipated trends (e.g., firms that chose to adopt a technology because they expected growth).
- There are anticipation effects — units change behaviour before the treatment officially starts.
4.4 Parallel Trends in Different Functional Forms
The parallel trends assumption is not scale-invariant. It may hold on the levels scale but not on the logarithmic scale (or vice versa):
- Levels scale: $E[\Delta Y(0) \mid D = 1] = E[\Delta Y(0) \mid D = 0]$ (additive parallel trends).
- Log scale: $E[\Delta \log Y(0) \mid D = 1] = E[\Delta \log Y(0) \mid D = 0]$ (multiplicative/proportional parallel trends).
The choice of outcome transformation (levels, logs, rates) should be guided by theory about the nature of the treatment effect and the plausibility of parallel trends.
4.5 Conditional Parallel Trends
The parallel trends assumption may only hold conditional on observable covariates $X_i$:
$$E[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 1, X_i] = E[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 0, X_i]$$
When unconditional parallel trends is implausible, including covariates (Section 14) can restore the assumption by controlling for observable differences in time trends between groups.
5. Identification and Causal Inference
5.1 What DiD Identifies
Under the parallel trends assumption, the DiD regression coefficient identifies the Average Treatment Effect on the Treated (ATT) in the post-treatment period:
$$\delta = E[Y_{it}(1) - Y_{it}(0) \mid D_i = 1, \, t \geq t_0]$$
Where $t_0$ is the treatment onset period.
Not identified by DiD:
- The Average Treatment Effect (ATE) — the effect averaged over both treated and control units.
- The effect on the control group had it been treated.
- The long-run effect if treatment effects change over time (addressed in event study designs).
5.2 The No Anticipation Assumption
A supplementary assumption is no anticipation: treated units do not change their behaviour in the pre-treatment period in anticipation of receiving treatment.
Formally, for all pre-treatment periods $t < t_0$, treated units' observed outcomes equal their untreated potential outcomes: $Y_{it} = Y_{it}(0)$.
Why it matters: If treated units begin changing before the treatment officially starts (e.g., firms start investing as soon as a subsidy is announced), the pre-period outcome already reflects anticipatory responses. This violates the parallel trends assumption in the pre-period and biases the DiD estimator.
How to check: Pre-treatment placebo tests (event study coefficients for pre-period leads should be near zero).
5.3 The Stable Unit Treatment Value Assumption (SUTVA)
SUTVA has two components:
1. No interference: The treatment status of unit $j$ does not affect the potential outcomes of unit $i$ (no spillovers, general equilibrium effects, or cross-unit contamination).
2. No hidden versions of treatment: There is only one version of the treatment; all treated units receive the same treatment.
Violations: Spillovers arise when treatment of some units affects control units (e.g., a local employment policy in one area displaces workers to other areas, affecting those areas' outcomes). SUTVA violations bias the DiD estimator.
5.4 Exogeneity of Treatment Timing
In staggered DiD designs (Section 11), a key requirement is that the timing of treatment adoption is exogenous — not determined by pre-existing trends or anticipation of future outcomes. If units that were doing well adopt treatment earlier, the DiD estimator is biased.
5.5 DiD as a Special Case of the Fixed Effects Estimator
The 2×2 DiD estimator is numerically equivalent to the first-differences estimator in a two-period panel:
$$\Delta Y_i = Y_{i1} - Y_{i0} = \beta_2 + \delta D_i + \Delta\varepsilon_i$$
Where $D_i = 1$ for treated units and $D_i = 0$ for control units. OLS on this first-differenced equation produces $\hat{\delta} = \overline{\Delta Y}_{\text{treated}} - \overline{\Delta Y}_{\text{control}}$ — exactly the DiD formula.
6. Standard DiD Estimation
6.1 OLS Estimation of the Canonical DiD
The 2×2 DiD regression:
$$Y_{it} = \beta_0 + \beta_1 D_i + \beta_2 P_t + \delta \,(D_i \times P_t) + \varepsilon_{it}$$
is estimated by OLS. The DiD coefficient:
$$\hat{\delta} = (\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}) - (\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}})$$
This can be written in matrix form as:
$$\hat{\beta} = (X'X)^{-1} X'Y$$
Where $X$ contains the constant, $D_i$, $P_t$, and their interaction $D_i \times P_t$.
6.2 TWFE Estimation with Panel Data
The TWFE estimator is obtained by including unit and time dummies (or using the within-transformation):
Using dummy variables:
$$Y_{it} = \sum_{j} \alpha_j \,\mathbb{1}[i = j] + \sum_{s} \gamma_s \,\mathbb{1}[t = s] + \delta D_{it} + \varepsilon_{it}$$
Using the within (demeaning) transformation:
$$\ddot{Y}_{it} = \delta \ddot{D}_{it} + \ddot{\varepsilon}_{it}$$
Where $\ddot{Y}_{it} = Y_{it} - \bar{Y}_i - \bar{Y}_t + \bar{Y}$ and similarly for other variables.
The TWFE estimator is consistent under the parallel trends assumption and strict exogeneity of treatment given fixed effects.
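To make the TWFE mechanics concrete, here is a minimal numpy-only sketch on simulated panel data (all parameter values are illustrative assumptions; a real analysis would typically use a dedicated fixed-effects library):

```python
import numpy as np

# Simulate a balanced panel: 50 units, 6 periods, treatment from period 3
# for the first half of units, with a true effect of 2.0 (all values assumed).
rng = np.random.default_rng(0)
N, T, true_effect = 50, 6, 2.0
units = np.repeat(np.arange(N), T)
periods = np.tile(np.arange(T), N)
treated_unit = units < N // 2
post = periods >= 3
D = (treated_unit & post).astype(float)       # treatment indicator D_it

alpha = rng.normal(0, 1, N)                   # unit fixed effects
gamma = np.linspace(0, 1, T)                  # time fixed effects (common trend)
y = alpha[units] + gamma[periods] + true_effect * D + rng.normal(0, 0.5, N * T)

# TWFE via dummy variables: unit dummies, time dummies (period 0 omitted), and D
unit_dum = (units[:, None] == np.arange(N)[None, :]).astype(float)
time_dum = (periods[:, None] == np.arange(1, T)[None, :]).astype(float)
X = np.column_stack([unit_dum, time_dum, D])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
delta_hat = beta[-1]                          # TWFE DiD estimate, close to 2.0
print(round(delta_hat, 2))
```

The estimate recovers the true effect up to sampling noise because the unit dummies absorb $\alpha_i$ and the time dummies absorb $\gamma_t$, leaving only treatment variation.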
6.3 First Differences Estimator
An alternative to demeaning is first differencing, which subtracts the previous period's observation:
$$\Delta Y_{it} = Y_{it} - Y_{i,t-1} = \Delta\gamma_t + \delta \,\Delta D_{it} + \Delta\varepsilon_{it}$$
In a two-period model, first differences and within estimation are identical. For $T > 2$, they differ in efficiency — first differences is more efficient when $\varepsilon_{it}$ follows a random walk; within estimation is more efficient when $\varepsilon_{it}$ is serially uncorrelated.
6.4 Weighted DiD
When the groups have unequal sizes or when reweighting is needed to improve comparability, a weighted DiD uses weights $w_i$:
$$\hat{\beta}_w = (X'WX)^{-1} X'WY, \qquad W = \text{diag}(w_1, \dots, w_n)$$
Common weighting schemes:
- Population weights: Weight by group size.
- Propensity score weights: Reweight control units to match the distribution of pre-treatment characteristics in the treated group (augmented inverse probability weighting — AIPW).
- Variance weights: Inverse of estimated error variance for each unit.
6.5 DiD with Repeated Cross-Sections
When panel data (the same units followed over time) are unavailable, DiD can be implemented with repeated cross-sections — independent samples drawn from the same population at each time period. The DiD regression:
$$Y_{igt} = \beta_0 + \beta_1 T_g + \beta_2 P_t + \delta \,(T_g \times P_t) + \varepsilon_{igt}$$
Where:
- $T_g$ = treated group indicator for individual $i$ in group $g$.
- $P_t$ = post-treatment period indicator.
- $\delta$ = DiD estimator (interpreted as a change in group-period means).
The DiD estimator is valid under the assumption that the cross-sectional samples are representative of the same underlying population in each period, even though different individuals are observed.
7. Hypothesis Testing and Inference
7.1 Standard Error Choices
The choice of standard errors is critical in DiD analyses. Several options are available, with different assumptions:
7.1.1 OLS Standard Errors
Valid only under homoscedasticity and no serial correlation. Almost never appropriate for DiD — treated group observations are typically serially correlated.
7.1.2 Heteroscedasticity-Robust (HC) Standard Errors
$$\widehat{\text{Var}}_{HC}(\hat{\beta}) = (X'X)^{-1} \left( \sum_i \hat{\varepsilon}_i^2 \, x_i x_i' \right) (X'X)^{-1}$$
Where $\hat{\varepsilon}_i$ is the residual for observation $i$. This accounts for heteroscedasticity but not serial correlation.
7.1.3 Cluster-Robust Standard Errors
The most recommended standard errors for DiD. Clustering at the group level (e.g., state, firm, country) allows for arbitrary heteroscedasticity and serial correlation within clusters:
$$\widehat{\text{Var}}_{CR}(\hat{\beta}) = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \right) (X'X)^{-1}$$
Where $g$ indexes clusters, and $X_g$ and $\hat{u}_g$ are the design matrix and residuals for cluster $g$.
Critical recommendation (Bertrand, Duflo & Mullainathan, 2004): Always cluster at the level of treatment assignment (e.g., state-level policy → cluster at state level). Failure to do so leads to severely underestimated standard errors and spurious significance.
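The sandwich formula above can be sketched directly in numpy. The simulated state-level data below are hypothetical, and the helper name `cluster_robust_se` is ours:

```python
import numpy as np

def cluster_robust_se(X, resid, clusters):
    """CRVE: (X'X)^-1 [ sum_g X_g' u_g u_g' X_g ] (X'X)^-1, then sqrt of diagonal."""
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        Xg = X[clusters == g]
        ug = resid[clusters == g]
        sg = Xg.T @ ug                        # score sum within cluster g
        meat += np.outer(sg, sg)
    V = XtX_inv @ meat @ XtX_inv
    return np.sqrt(np.diag(V))

# Hypothetical 2x2 DiD with 20 states; state-level shocks induce
# within-cluster correlation that plain OLS standard errors would miss.
rng = np.random.default_rng(1)
n_states, n_per = 20, 30
state = np.repeat(np.arange(n_states), n_per)
treat = (state < 10).astype(float)            # states 0-9 are treated
post = rng.integers(0, 2, n_states * n_per).astype(float)
shock = rng.normal(0, 1, n_states)            # common shock per state
y = 1.0 + 0.5*treat + 0.3*post + 2.0*treat*post + shock[state] \
    + rng.normal(0, 1, n_states * n_per)

X = np.column_stack([np.ones_like(y), treat, post, treat*post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
se = cluster_robust_se(X, resid, state)       # cluster at the state level
print(round(beta[-1], 2), round(se[-1], 3))   # DiD coefficient and its clustered SE
```

Clustering is at the state level because treatment is assigned at the state level, per the Bertrand, Duflo & Mullainathan recommendation.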
7.1.4 Wild Cluster Bootstrap
When the number of clusters is small (roughly $G < 30$–$50$), cluster-robust standard errors based on asymptotic approximations can be unreliable. The wild cluster bootstrap (Cameron, Gelbach & Miller, 2008) provides more reliable inference:
- Estimate the model and obtain residuals $\hat{u}_{ig}$.
- For each bootstrap replication $b = 1, \dots, B$:
  - Draw $w_g \in \{-1, +1\}$ with equal probability for each cluster $g$ (Rademacher weights).
  - Construct bootstrap residuals $u_{ig}^* = w_g \hat{u}_{ig}$.
  - Form the bootstrap outcome: $Y_{ig}^* = X_{ig}'\hat{\beta} + u_{ig}^*$.
  - Re-estimate $\hat{\delta}^{(b)}$.
- Use the distribution of $\{\hat{\delta}^{(b)}\}_{b=1}^{B}$ to compute p-values and confidence intervals.
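A minimal numpy sketch of these steps, on hypothetical data with only 12 clusters. For brevity it imposes the null (restricted residuals) and compares bootstrap coefficients rather than t-statistics; production implementations typically bootstrap the t-statistic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical clustered 2x2 DiD data with few clusters (G = 12)
G, n_per = 12, 25
cluster = np.repeat(np.arange(G), n_per)
treat = (cluster < 6).astype(float)
post = rng.integers(0, 2, G * n_per).astype(float)
shock = rng.normal(0, 1, G)
y = 1.0 + 0.4*treat + 0.2*post + 1.5*treat*post + shock[cluster] \
    + rng.normal(0, 1, G * n_per)

X = np.column_stack([np.ones_like(y), treat, post, treat*post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
delta_hat = beta[-1]

# Restricted fit imposing H0: delta = 0 (drop the interaction column)
Xr = X[:, :3]
beta_r, *_ = np.linalg.lstsq(Xr, y, rcond=None)
fit_r = Xr @ beta_r
resid_r = y - fit_r

B, exceed = 499, 0
for _ in range(B):
    w = rng.choice([-1.0, 1.0], size=G)        # one Rademacher weight per cluster
    y_star = fit_r + w[cluster] * resid_r      # flip residual signs cluster-wise
    b_star, *_ = np.linalg.lstsq(X, y_star, rcond=None)
    if abs(b_star[-1]) >= abs(delta_hat):
        exceed += 1
p_value = (exceed + 1) / (B + 1)
print(round(delta_hat, 2), round(p_value, 3))
```

Because the weights are drawn per cluster, the bootstrap preserves within-cluster dependence while breaking the link between treatment and the error term.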
| Standard Error Type | When to Use | Key Assumption |
|---|---|---|
| OLS | Never (DiD context) | IID errors |
| HC (Robust) | Cross-sectional data; no within-cluster correlation | Heteroscedastic, no serial corr. |
| Cluster-Robust | Standard recommendation | Within-cluster correlation allowed |
| Wild Cluster Bootstrap | Few clusters ($G \lesssim 30$) | More reliable with few clusters |
| Block Bootstrap | Panel data, spatial correlation | Resamples entire clusters |
7.2 The Wald Test for the DiD Coefficient
The Wald test for the DiD effect tests $H_0: \delta = 0$:
$$t = \frac{\hat{\delta}}{\text{SE}(\hat{\delta})} \sim t_{df}$$
Where $df$ is the residual degrees of freedom. With cluster-robust standard errors, use the $t$-distribution with $G - 1$ degrees of freedom (where $G$ is the number of clusters):
$$t \sim t_{G-1}$$
A $(1 - \alpha)$ confidence interval for $\delta$:
$$\hat{\delta} \pm t_{1-\alpha/2} \,\text{SE}(\hat{\delta})$$
7.3 F-Test for Joint Significance
To jointly test whether a vector of DiD coefficients is zero (e.g., in a model with multiple treatment indicators), use:
$$F = \frac{(R\hat{\beta} - r)' \left[ R \,\widehat{\text{Var}}(\hat{\beta})\, R' \right]^{-1} (R\hat{\beta} - r)}{q}$$
Where $q$ is the number of restrictions (rows of $R$).
7.4 Inference with Few Treated Units
A common challenge is when only a few units receive treatment (e.g., one state, two firms). In such cases:
- Cluster-robust standard errors have very few clusters → unreliable asymptotic approximations.
- Randomisation inference (permutation tests): Repeatedly re-assign treatment to randomly selected units and re-estimate $\hat{\delta}$. The p-value is the fraction of placebo estimates at least as large in magnitude as the observed estimate.
- Synthetic control methods: Construct a weighted control unit that matches the treated unit's pre-period trajectory.
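Randomisation inference for a single treated unit can be sketched in a few lines. The panel below is simulated and purely illustrative (one treated state out of 20, true effect 3.0):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical two-period panel: 20 states, one truly treated (index 0)
n_states = 20
pre = rng.normal(10, 1, n_states)
post = pre + 0.5 + rng.normal(0, 0.5, n_states)   # common trend of 0.5
treated = 0
post[treated] += 3.0                              # true effect on the treated state

def did(treated_idx):
    """2x2 DiD treating `treated_idx` as the (possibly placebo) treated state."""
    ctrl = np.ones(n_states, dtype=bool)
    ctrl[treated_idx] = False
    return (post[treated_idx] - pre[treated_idx]) \
        - (post[ctrl].mean() - pre[ctrl].mean())

observed = did(treated)
# Permute the treatment label over every state and recompute the placebo DiD
placebo = np.array([did(j) for j in range(n_states)])
p_value = np.mean(np.abs(placebo) >= abs(observed))
print(round(observed, 2), round(p_value, 3))
```

With 20 states the smallest attainable p-value is 1/20 = 0.05, which illustrates why randomisation inference is honest but coarse when few placebo assignments exist.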
8. Effect Size Measures
8.1 The DiD Coefficient as Effect Size
The primary effect size in DiD is the DiD coefficient itself. Its interpretation depends on the model specification:
- Levels regression ($Y$ in original units): $\delta$ is the absolute change in the outcome caused by treatment (e.g., 3.2 percentage points, 500 USD, 2.1 hours).
- Log outcome ($\log Y$): $\delta \approx$ proportional change for small values; more precisely, a $100 \times (e^{\delta} - 1)\%$ change.
- Standardised outcome (mean 0, SD 1): $\delta$ is in standard deviation units — directly comparable to Cohen's $d$.
8.2 Percent Change Effect
When the outcome is in levels, the percent change caused by treatment is:
$$\%\Delta = 100 \times \frac{\hat{\delta}}{\bar{Y}_{T,\text{pre}}}$$
Where $\bar{Y}_{T,\text{pre}}$ is the pre-treatment mean of the treated group. This contextualises the absolute effect size relative to the pre-treatment baseline.
8.3 Standardised Effect Size (Cohen's d Analogue)
Standardise the DiD estimate by the pre-treatment standard deviation of the outcome:
$$d = \frac{\hat{\delta}}{s_{\text{pre}}}$$
Where $s_{\text{pre}}$ is the pooled pre-treatment standard deviation across treated and control groups. Benchmarks follow Cohen (1988):

| $\|d\|$ | Effect Size |
| :--- | :--- |
| 0.2 | Small |
| 0.5 | Medium |
| 0.8 | Large |
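The two effect-size calculations above can be combined in one short sketch; the DiD estimate and pre-treatment samples below are hypothetical:

```python
import numpy as np

# Hypothetical DiD estimate and pre-treatment outcome samples
delta_hat = 1.8                                   # DiD estimate in outcome units
pre_treated = np.array([9.5, 10.2, 11.0, 10.4])   # pre-period outcomes, treated
pre_control = np.array([8.9, 9.7, 10.1, 9.3, 9.8])

# Pooled pre-treatment SD across both groups
n1, n2 = len(pre_treated), len(pre_control)
s1, s2 = pre_treated.std(ddof=1), pre_control.std(ddof=1)
sd_pool = np.sqrt(((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / (n1 + n2 - 2))

d = delta_hat / sd_pool                           # Cohen's d analogue
pct = 100 * delta_hat / pre_treated.mean()        # percent change vs. treated baseline
print(round(d, 2), round(pct, 1))
```

Reporting both measures is useful: the percent change speaks to policy relevance, while $d$ allows comparison across studies with different outcome scales.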
8.4 Relative Reduction/Increase
For outcomes where the baseline level matters (e.g., crime rates, disease incidence), report the relative effect:
$$\text{Relative effect} = \frac{\hat{\delta}}{\bar{Y}_{T,\text{pre}}}$$
Or the relative DiD, which scales the effect by the estimated counterfactual level in the post-period:
$$\frac{\hat{\delta}}{\bar{Y}_{T,\text{pre}} + (\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}})}$$
8.5 Number Needed to Treat (NNT)
For binary outcomes (e.g., employed/unemployed, insured/uninsured):
$$\text{NNT} = \frac{1}{|\hat{\delta}|}$$
Where $\hat{\delta}$ is the DiD estimate of the change in probability.
The NNT represents the number of units that need to be treated to produce one additional success (or prevented failure), contextualising the policy significance of the effect.
8.6 $R^2$ and Explained Variance
While not a primary effect size for DiD, the within-$R^2$ (after partialling out fixed effects) conveys how much treatment variation explains the residual variation in the outcome. Report both the overall $R^2$ and the within $R^2$ for TWFE models.
9. Model Fit and Evaluation
9.1 Goodness-of-Fit Statistics
Standard regression fit statistics apply to the DiD regression:
| Statistic | Formula | Description |
|---|---|---|
| $R^2$ | $1 - \text{SSR}/\text{SST}$ | Overall variance explained |
| Within $R^2$ | $R^2$ after demeaning by fixed effects | Variance explained within unit × time cells |
| Between $R^2$ | Based on group-time means | Variance explained between group-time cells |
| Adjusted $R^2$ | $1 - (1 - R^2)\frac{n-1}{n-k-1}$ | $R^2$ penalised for parameters |
| RMSE | $\sqrt{\text{SSR}/n}$ | Root mean squared error |
| AIC | $2k - 2\ln\hat{L}$ | Penalised fit (lower is better) |
| BIC | $k\ln n - 2\ln\hat{L}$ | Strongly penalised fit (lower is better) |
9.2 Fit of the Counterfactual
A key model evaluation step is assessing how well the control group serves as a counterfactual for the treated group's pre-treatment trajectory. Visually:
- Plot the raw pre-treatment trends for treated and control groups.
- Assess whether the trends are parallel (or conditionally parallel after covariate adjustment).
Quantitatively: Compute the pre-treatment DiD — the difference in trends during the pre-period. If this is near zero and statistically insignificant, the parallel trends assumption is supported.
9.3 Information Criteria for Model Comparison
When comparing DiD specifications (e.g., different control sets, different functional forms, different clustering levels), use AIC and BIC:
$$\text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln n - 2\ln\hat{L}$$
Where $k$ is the number of parameters and $\hat{L}$ is the maximised likelihood.
Note: AIC and BIC comparisons are only valid for models fitted to the same sample using the same outcome variable (e.g., levels vs. logs are not comparable on these criteria).
9.4 Assessing Balance on Pre-Treatment Covariates
A critical validity check is whether treated and control groups have similar pre-treatment characteristics:
- Report a balance table of pre-treatment mean differences in key covariates between groups.
- Test statistical differences using two-sample $t$-tests or non-parametric tests.
- Report standardised mean differences (SMD = mean difference / pooled SD) as effect sizes for balance.
- An absolute SMD below 0.1 is commonly used as a threshold for acceptable balance.
10. Diagnostics and Assumption Testing
10.1 Pre-Trends Test (Event Study)
The event study (dynamic DiD) is the primary tool for testing the parallel trends assumption using pre-treatment data. It estimates a separate DiD coefficient for each time period relative to treatment:
$$Y_{it} = \alpha_i + \gamma_t + \sum_{k \neq -1} \delta_k \,\mathbb{1}[t - G_i = k] + \varepsilon_{it}$$
Where:
- $G_i$ = treatment onset period for unit $i$.
- $k = t - G_i$ indexes periods relative to treatment onset (negative = pre-treatment, positive = post-treatment).
- Period $k = -1$ is the omitted base period (normalised to zero for identification).
- The event-time indicators are defined for treated units only.
Interpretation:
- Pre-treatment coefficients ($\delta_k$, $k < -1$): Should be statistically indistinguishable from zero. Significant pre-treatment coefficients indicate pre-existing differences in trends — a violation of parallel trends.
- Post-treatment coefficients ($\delta_k$, $k \geq 0$): Estimate the dynamic treatment effect at each horizon after treatment. Increasing post-treatment effects may indicate treatment ramp-up; decreasing effects may indicate decay.
Formal pre-trends test: Joint $F$-test (or Wald test) that all pre-treatment coefficients are jointly zero:
$$H_0: \delta_{-K} = \delta_{-K+1} = \dots = \delta_{-2} = 0$$
A non-significant result supports the parallel trends assumption; a significant result casts doubt.
⚠️ Passing the pre-trends test does not prove parallel trends hold in the post-period — trends may diverge after treatment for reasons unrelated to treatment. The pre-trends test is a necessary but not sufficient condition for identification.
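An event-study regression can be sketched with dummies in plain numpy. The simulated panel below (treatment at period 4, effect ramping up by 0.8 per period) is entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, t0 = 40, 8, 4                       # treatment begins at period 4
units = np.repeat(np.arange(N), T)
periods = np.tile(np.arange(T), N)
treated = units < N // 2
rel = periods - t0                        # event time relative to onset

# True dynamics: zero before treatment, ramping up by 0.8 each period after
effect = np.where(treated & (rel >= 0), 0.8 * (rel + 1), 0.0)
alpha = rng.normal(0, 1, N)
gamma = rng.normal(0, 0.3, T)
y = alpha[units] + gamma[periods] + effect + rng.normal(0, 0.4, N * T)

# Design: unit FE, time FE (period 0 omitted), lead/lag dummies (base k = -1)
ev_times = [k for k in range(-t0, T - t0) if k != -1]
unit_dum = (units[:, None] == np.arange(N)[None, :]).astype(float)
time_dum = (periods[:, None] == np.arange(1, T)[None, :]).astype(float)
ev_dum = np.column_stack([(treated & (rel == k)).astype(float) for k in ev_times])
X = np.column_stack([unit_dum, time_dum, ev_dum])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
coefs = dict(zip(ev_times, beta[-len(ev_times):]))

pre = [coefs[k] for k in ev_times if k < 0]       # leads: should be near zero
post_fx = [coefs[k] for k in ev_times if k >= 0]  # lags: dynamic effects
print([round(c, 2) for c in pre], [round(c, 2) for c in post_fx])
```

In this simulation the leads hover around zero (supporting parallel pre-trends) while the lags trace out the ramp-up, which is exactly the pattern an event-study plot visualises.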
10.2 Placebo Tests
Placebo tests assess whether the estimated DiD effect could have arisen by chance or due to confounding:
10.2.1 Placebo Time Periods
Estimate the DiD using only pre-treatment data, assigning a false treatment date (e.g., 2 years before the true treatment) and estimating a placebo coefficient $\hat{\delta}_{\text{placebo}}$.
A significant $\hat{\delta}_{\text{placebo}}$ when treatment has not yet occurred suggests confounding or a violation of parallel trends.
10.2.2 Placebo Treatment Groups
Assign treatment to groups that were not actually treated and estimate the DiD. If the "treatment effect" is significant for these falsely treated groups, the design has poor identification.
10.2.3 Outcome Placebo Tests
Estimate the DiD using outcomes that should not be affected by the treatment. A null result ($\hat{\delta} \approx 0$) for these placebo outcomes increases confidence that the design is not picking up spurious effects.
10.3 Sensitivity to Parallel Trends Violations
Rambachan and Roth's sensitivity analysis (2023) provides a formal framework for assessing how large a violation of parallel trends would need to be to reverse the conclusion. The key parameter $\bar{M}$ bounds the maximum allowable deviation from parallel trends (e.g., how much the post-treatment differential trend may deviate relative to the pre-treatment deviations).
Report breakdown values of $\bar{M}$ — the maximum deviation consistent with the estimated effect remaining statistically significant or of the correct sign.
10.4 Testing for Anticipation Effects
Estimate the event study with a coefficient for period $k = -1$ (the period immediately before treatment), using an earlier period (e.g., $k = -2$) as the omitted base.
If $\hat{\delta}_{-1}$ is significantly different from zero, anticipation effects may be present.
10.5 Checking for Compositional Changes
In DiD with repeated cross-sections, the composition of the treated or control groups may change between periods. If the treatment induces sample selection (e.g., a health policy causes sick people to enter/exit the workforce), the DiD estimator may be biased.
How to check:
- Compare the distribution of pre-treatment characteristics across groups and periods.
- Test for treatment effects on selection-related outcomes (e.g., sample size, attrition rates).
- Use balanced panel data where possible to avoid compositional issues.
10.6 Residual Diagnostics
Standard regression diagnostics apply to the DiD residuals:
- Serial correlation test (Wooldridge, 2002): Tests whether the residuals from the first-differenced equation are serially correlated. Under H₀ (no serial correlation in levels), the first-differenced residuals have first-order autocorrelation $-0.5$. Significant deviation from $-0.5$ suggests serial correlation.
- Heteroscedasticity: Breusch-Pagan or White test; motivates cluster-robust standard errors.
- Normality of residuals: Jarque-Bera test; Q-Q plot (less critical with large ).
- Outlier detection: Cook's distance; leverage ($h_{ii}$); DFFITS.
11. Extensions: Staggered DiD and Multiple Time Periods
11.1 Staggered Treatment Adoption
In many real-world settings, different units adopt treatment at different points in time — this is called staggered (or differential timing) DiD. For example, different US states adopt a policy in different years.
The TWFE regression in this context:
$$Y_{it} = \alpha_i + \gamma_t + \delta^{\text{TWFE}} D_{it} + \varepsilon_{it}$$
The TWFE estimator is a weighted average of all possible 2×2 DiD comparisons between groups that adopt treatment at different times — what Goodman-Bacon (2021) calls the Bacon decomposition.
11.2 The Bacon Decomposition
Goodman-Bacon (2021) shows that the TWFE estimator in a staggered design decomposes as:
$$\hat{\delta}^{\text{TWFE}} = \sum_{k \neq l} s_{kl} \,\hat{\delta}_{kl}^{2\times2}$$
Where $\hat{\delta}_{kl}^{2\times2}$ is the 2×2 DiD comparing early adopters (treatment at time $k$) vs. late adopters (treatment at time $l$), and $s_{kl}$ are weights summing to 1.
The problem of "forbidden comparisons": Some of these 2×2 DiDs compare a late adopter group in the post-period against an early adopter group that has already been treated — using already-treated units as a control group. If treatment effects are heterogeneous and dynamic (treatment effects change over time), this produces negative weights that can lead to:
- A significant $\hat{\delta}^{\text{TWFE}}$ of the wrong sign even when all individual group-time ATTs are positive.
- A misleading averaged effect that conceals substantial heterogeneity.
11.3 Robust Staggered DiD Estimators
Several robust estimators have been developed to address the staggered DiD problem:
11.3.1 Callaway & Sant'Anna (2021) — Cohort-Specific ATTs
Define a cohort as the set of units that first receive treatment at the same calendar time $g$. The cohort-average treatment effect on the treated is:
$$ATT(g, t) = E[Y_t(g) - Y_t(0) \mid G_i = g]$$
Where $Y_t(g)$ is the potential outcome at time $t$ if first treated at time $g$, and $Y_t(0)$ is the never-treated potential outcome.
Aggregation: Individual cohort-time ATTs are aggregated to form:
- Simple average: the average of $ATT(g, t)$ over all post-treatment $(g, t)$ cells.
- Calendar time aggregation: $\theta_C(t)$ = average ATT at calendar time $t$ across all treated cohorts.
- Event time aggregation: $\theta_E(e)$ = average ATT at event time $e = t - g$ across all cohorts.
11.3.2 Sun & Abraham (2021) — Interaction-Weighted Estimator
Decompose the TWFE estimate using cohort × period interactions:
$$Y_{it} = \alpha_i + \gamma_t + \sum_{g} \sum_{k \neq -1} \delta_{g,k} \,\mathbb{1}[G_i = g]\,\mathbb{1}[t - g = k] + \varepsilon_{it}$$
The interaction-weighted (IW) estimator aggregates using the share of each cohort in each period as weights, producing a heterogeneity-robust estimate of the average effect.
11.3.3 de Chaisemartin & D'Haultfœuille (2020) — $DID_M$
The $DID_M$ estimator uses only "clean" comparisons — periods in which treatment status changes — to form the estimate. It averages, over switching cells, the change in outcome for switchers minus the change for units whose treatment status is stable, where unit $i$ is a switcher at $t$ if it moves from untreated to treated between $t-1$ and $t$, with weights chosen to ensure comparability across cells.
11.3.4 Borusyak, Jaravel & Spiess (2024) — Imputation Estimator
Imputes the counterfactual from the two-way fixed effects model for untreated outcomes, $Y_{it}(0) = \alpha_i + \gamma_t + \varepsilon_{it}$:
- Estimate $\hat{\alpha}_i$ and $\hat{\gamma}_t$ using untreated observations only.
- Impute the counterfactual: $\hat{Y}_{it}(0) = \hat{\alpha}_i + \hat{\gamma}_t$ for treated observations.
- Estimate ATTs: $\hat{\tau}_{it} = Y_{it} - \hat{Y}_{it}(0)$, then average over treated observations.
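The three imputation steps can be sketched in numpy on a simulated staggered-free example (one treated cohort; all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, t0, tau = 30, 6, 3, 2.0
units = np.repeat(np.arange(N), T)
periods = np.tile(np.arange(T), N)
treated_unit = units < N // 2
D = treated_unit & (periods >= t0)            # treated observations

alpha = rng.normal(0, 1, N)
gamma = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
y = alpha[units] + gamma[periods] + tau * D + rng.normal(0, 0.3, N * T)

# Step 1: estimate unit and time effects on UNTREATED observations only
u = ~D
unit_dum = (units[:, None] == np.arange(N)[None, :]).astype(float)
time_dum = (periods[:, None] == np.arange(1, T)[None, :]).astype(float)
X = np.column_stack([unit_dum, time_dum])
beta, *_ = np.linalg.lstsq(X[u], y[u], rcond=None)

# Step 2: impute the untreated counterfactual for every observation
y0_hat = X @ beta

# Step 3: average Y - Y0_hat over treated observations to get the ATT
att_hat = np.mean(y[D] - y0_hat[D])
print(round(att_hat, 2))
```

Because the fixed effects are fit only on untreated cells, already-treated observations never contaminate the counterfactual, which is the key difference from naive TWFE in staggered settings.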
11.4 Choosing Among Staggered DiD Estimators
| Estimator | Robust to Effect Heterogeneity | Multiple Controls | Covariates | Key Reference |
|---|---|---|---|---|
| TWFE | ❌ (negative weights possible) | ✅ | ✅ | — |
| Callaway-Sant'Anna | ✅ | ✅ | ✅ | Callaway & Sant'Anna (2021) |
| Sun-Abraham | ✅ | ✅ | Limited | Sun & Abraham (2021) |
| de Chaisemartin-D'Haultfœuille | ✅ | Limited | Limited | dCH & DH (2020) |
| Borusyak-Jaravel-Spiess | ✅ | ✅ | ✅ | BJS (2024) |
💡 For staggered designs, always report the Bacon decomposition to diagnose the extent of potentially problematic comparisons, and supplement TWFE with at least one robust estimator.
12. Extensions: Heterogeneous Treatment Effects
12.1 Why Treatment Effects May Be Heterogeneous
The standard DiD model estimates a single average treatment effect (ATT). In reality, treatment effects often vary across:
- Units: Different firms, regions, or individuals respond differently to the same treatment.
- Time: Treatment effects may grow, decay, or oscillate over time after treatment onset.
- Subgroups: Effects may differ by gender, income, size, geography, or other characteristics.
- Treatment intensity: Larger doses may produce larger effects (see Section 13).
12.2 Subgroup DiD Analysis
To examine how treatment effects vary across a categorical moderator $Z_i$, interact the DiD term with the moderator:
- $\delta_0$ = treatment effect for the reference subgroup ($Z_i = 0$).
- $\delta_z$ = differential treatment effect — how much the effect differs for subgroup $z$ relative to the reference.
- Total effect for subgroup $z$: $\delta_0 + \delta_z$.
Test of effect heterogeneity: $H_0: \delta_z = 0$ for all $z$ (no heterogeneity). Use cluster-robust standard errors.
12.3 Dynamic Treatment Effects (Event Study)
The event study design (Section 10.1) directly estimates dynamic treatment effects:
For event times $e$ (periods since treatment onset), plot $\hat{\delta}_e$ with confidence intervals to visualise:
- Immediate effects ($e = 0$): Effect in the year/period treatment begins.
- Ramp-up: Effects growing over time (learning, diffusion, cumulative investment).
- Decay: Effects diminishing over time (adaptation, spillovers, fading).
- Persistence: Effects stable over time (structural change).
12.4 Heterogeneity-Robust Aggregation
The robust staggered DiD estimators (Section 11.3) produce cohort-specific ATTs that can be aggregated in various ways:
- Overall average: $\theta$ = average of $ATT(g,t)$ across all cohorts and post-periods.
- Event-time average: $\theta_e(e)$ — ATT as a function of time since treatment, $e = t - g$.
- Calendar-time average: $\theta_c(t)$ — ATT in each calendar year $t$.
13. Extensions: Continuous and Fuzzy Treatment
13.1 Continuous Treatment Intensity (Dose-Response DiD)
When the treatment variable is continuous (e.g., amount of subsidy, level of minimum wage increase, exposure to a policy) rather than binary, the DiD model becomes:
Where $D_{it}$ is now a continuous variable representing the intensity of treatment. The coefficient $\delta$ represents the effect of a one-unit increase in treatment intensity on the outcome.
Dose-response curve: Plot the predicted outcome as a function of treatment intensity at different time periods to visualise the dose-response relationship.
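A toy sketch of dose-response DiD on exact, hypothetical data (the continuous dose plays the role of the group dimension; unit fixed effects are omitted for brevity):

```python
import numpy as np

# Hypothetical dose-response data. True model:
# y = 1.0 + 0.5*dose + 0.2*post + 0.8*(dose*post), so delta = 0.8.
dose = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
post = np.array([0,   0,   0,   1,   1,   1  ])
y = 1.0 + 0.5 * dose + 0.2 * post + 0.8 * dose * post

# OLS on [1, dose, post, dose*post]; the interaction coefficient is
# the effect of a one-unit increase in treatment intensity.
X = np.column_stack([np.ones_like(dose), dose, post, dose * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta = coef[3]
```

Evaluating the fitted line at several dose values in the post period traces out the dose-response curve described above.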
13.2 Fuzzy DiD (Instrumental Variables DiD)
In a fuzzy DiD design, the policy change shifts the probability of treatment but does not deterministically assign treatment. For example:
- A subsidy makes adoption cheaper but does not mandate it.
- An eligibility rule changes, but not all eligible units comply.
The binary treatment variable $D_{it}$ measures actual take-up; the policy indicator $Z_{it}$ is the instrument (a discontinuous change in eligibility or incentives).
First stage: Treatment as a function of the instrument: $D_{it} = \pi_0 + \pi_1 Z_{it} + \alpha_i + \lambda_t + \nu_{it}$
Second stage: Outcome as a function of predicted treatment: $Y_{it} = \alpha_i + \lambda_t + \delta_{IV} \hat{D}_{it} + \varepsilon_{it}$
The IV-DiD estimator estimates the Local Average Treatment Effect (LATE) — the effect on compliers (units that switch treatment status in response to the policy change).
Fuzzy DiD (Wald-DiD) estimator: $\hat{\delta}_{fuzzy} = \hat{\delta}^{Y}_{DiD} / \hat{\delta}^{D}_{DiD}$ — the reduced-form DiD on the outcome divided by the first-stage DiD on take-up.
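As a sketch, a two-period, group-means version of the fuzzy DiD divides the reduced-form outcome DiD by the first-stage take-up DiD (illustrative helper names):

```python
def did_2x2(pre_c, post_c, pre_t, post_t):
    """2x2 DiD from group-period means: treated change minus control change."""
    return (post_t - pre_t) - (post_c - pre_c)

def fuzzy_did(y_means, d_means):
    """Wald-DiD sketch: reduced-form outcome DiD divided by the
    first-stage take-up DiD. Each argument is a tuple of means:
    (pre_control, post_control, pre_treated, post_treated)."""
    return did_2x2(*y_means) / did_2x2(*d_means)
```

With an outcome DiD of 2.0 and a take-up DiD of 0.5, the implied LATE for compliers is 4.0.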
13.3 Triple Differences (DDD)
Triple differences (DDD) adds a third source of variation to further control for confounders. The idea is to difference out group-specific time trends that are common to all individuals within a group:
Where $E_i$ is an additional eligibility dimension (e.g., age group, income group) that determines eligibility within the treated group.
DDD estimator: $\hat{\delta}_{DDD} = \hat{\delta}^{eligible}_{DiD} - \hat{\delta}^{ineligible}_{DiD}$ — the DiD among the eligible group minus the DiD among the ineligible group.
DDD is valuable when the comparison across regions includes contamination from regional trends that differentially affect all groups in treated regions.
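A sketch of the DDD computation from group-period means (illustrative helper; each tuple holds pre/post means for the control and treated regions within one eligibility group):

```python
def ddd(eligible_means, ineligible_means):
    """Triple difference: the 2x2 DiD among the eligible group minus
    the 2x2 DiD among the ineligible group. Each argument is
    (pre_control, post_control, pre_treated, post_treated)."""
    def did(pre_c, post_c, pre_t, post_t):
        return (post_t - pre_t) - (post_c - pre_c)
    return did(*eligible_means) - did(*ineligible_means)
```

Subtracting the ineligible-group DiD removes region-specific shocks common to both eligibility groups.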
14. Covariates and Controls in DiD
14.1 Why Include Covariates?
Adding covariates to the DiD model serves two distinct purposes:
- Improving efficiency (precision): Covariates that predict the outcome reduce residual variance, shrinking standard errors and narrowing the confidence intervals.
- Restoring conditional parallel trends: If unconditional parallel trends is implausible but trends are parallel after conditioning on observable characteristics, including covariates removes the confounding and restores identification.
14.2 Time-Invariant Covariates
In the TWFE model, time-invariant covariates (e.g., gender, ethnicity, geographic characteristics) are absorbed by the unit fixed effect and cannot be estimated separately. However, they can be included as interactions with the treatment or time variables to allow their effect to vary:
14.3 Time-Varying Covariates
Time-varying covariates can be included directly in the TWFE model:
⚠️ Including time-varying covariates that are themselves affected by the treatment (i.e., "bad controls" or "mediators") is a common mistake. Including such variables absorbs part of the treatment effect, leading to underestimation of . Only include covariates that are determined before treatment or that are plausibly unaffected by treatment.
14.4 Regression Adjustment (Outcome Regression)
The regression-adjusted DiD uses the control group's pre-to-post relationship between covariates and the outcome to construct an improved counterfactual:
Where the control group's covariate-outcome relationship supplies the predicted counterfactual change for treated units. This improves efficiency and removes covariate-related bias.
14.5 Doubly Robust DiD (DR-DiD)
The doubly robust estimator combines propensity score weighting and outcome regression. It is consistent if either the propensity score model or the outcome regression model is correctly specified:
Where the propensity-score terms reweight control units to resemble the treated group, and the outcome-regression term supplies the regression-adjusted counterfactual change.
14.6 Controlling for Pre-Treatment Trends (Linear Trend Adjustment)
When parallel trends is violated by unit-specific linear time trends, include unit-specific trend terms:
Where $\gamma_i \cdot t$ is a unit-specific linear time trend. This allows each unit to have its own pre-treatment trajectory, controlling for heterogeneous trends.
⚠️ Including unit-specific trends is a strong assumption (units would have continued on their pre-treatment trend indefinitely) and can overcontrol. Use only when unit-specific trends are well-established in the pre-period and when there are sufficient pre-period observations to estimate them.
15. Using the DiD Component
The Difference-in-Differences component in the DataStatPro application provides a comprehensive workflow for DiD estimation, testing, and visualisation.
Step-by-Step Guide
Step 1 — Select Dataset Choose the dataset from the "Dataset" dropdown. The dataset should have:
- A unit identifier column (individual, firm, region, country).
- A time period column (year, quarter, month).
- An outcome variable (continuous or binary).
- A treatment indicator (binary: 0 = control, 1 = treated).
Step 2 — Select DiD Design Choose the DiD specification:
- 2×2 DiD (two groups, two periods — canonical design)
- Two-Way Fixed Effects (TWFE) (panel data, multiple periods)
- Staggered DiD (multiple treatment timing groups)
- Event Study / Dynamic DiD (multiple pre- and post-periods)
- Triple Differences (DDD) (three sources of variation)
- Fuzzy DiD / IV-DiD (non-compliance with treatment)
Step 3 — Select Variables Map the required variables from your dataset:
- Unit ID: The unique identifier for each unit (individual, firm, state).
- Time ID: The time period variable.
- Outcome (Y): The continuous or binary dependent variable.
- Treatment Group ($G_i$): Binary indicator for the treated group (time-invariant for 2×2).
- Treatment Indicator ($D_{it}$): Binary indicator for treatment status (may vary by time for TWFE/staggered).
- Covariates: Optional time-varying or time-invariant controls.
Step 4 — Specify Treatment Timing
- 2×2 DiD: Specify the pre- and post-period labels.
- TWFE/Staggered: The application detects treatment timing from the treatment indicator automatically. Review the detected cohort structure.
- Event Study: Specify the base period (omitted category, default: $e = -1$) and the number of leads and lags to include.
Step 5 — Configure Fixed Effects Select fixed effects to include:
- ✅ Unit Fixed Effects ($\alpha_i$) — strongly recommended for panel data.
- ✅ Time Fixed Effects ($\lambda_t$) — strongly recommended.
- Unit-Specific Linear Trends — optional; use when pre-trend concerns exist.
Step 6 — Configure Standard Errors Select the standard error type:
- Cluster-Robust (default and recommended) — specify the clustering variable (typically the treatment assignment unit).
- HC (Heteroscedasticity-Robust) — use when clusters are very small or unavailable.
- Wild Cluster Bootstrap — specify for few clusters (roughly $G < 30$).
- Block Bootstrap — for more general panel dependence.
Step 7 — Select Staggered DiD Estimator (if applicable) For staggered designs, choose the robust estimator:
- TWFE (standard but potentially biased with heterogeneous effects)
- Callaway-Sant'Anna
- Sun-Abraham
- Bacon Decomposition (diagnostic)
- de Chaisemartin-D'Haultfœuille
Step 8 — Configure Inference Options
- Confidence level: Default 95%.
- Bootstrap replications: For wild bootstrap (default: 999).
- Permutation replications: For randomisation inference (default: 999).
Step 9 — Select Display Options Choose which outputs to display:
- ✅ DiD Coefficient Table (estimate, SE, t, p, CI)
- ✅ Pre-Treatment Trends Plot
- ✅ Event Study Plot (with CIs)
- ✅ 2×2 DiD Table (group-period means)
- ✅ Parallel Trends Test (joint F-test on pre-period coefficients)
- ✅ Counterfactual Plot
- ✅ Placebo Test Results
- ✅ Bacon Decomposition Plot (staggered designs)
- ✅ Balance Table (pre-treatment covariate balance)
- ✅ Residual Diagnostics
- ✅ Coefficient Profile Plot across Subgroups
- ✅ Dynamic Effects Plot
Step 10 — Run the Analysis Click "Run DiD Model". The application will:
- Construct the design matrix with appropriate interaction terms and fixed effects.
- Estimate the DiD coefficient(s) via OLS with specified standard errors.
- Compute the event study / dynamic effects coefficients.
- Run the pre-trends test and parallel trends diagnostics.
- Execute placebo tests (if requested).
- Compute the Bacon decomposition (for staggered designs).
- Generate all selected visualisations and tables.
16. Computational and Formula Details
16.1 The 2×2 DiD Estimator: Step-by-Step
Step 1: Compute group-period means
For groups $g \in \{0, 1\}$ (control, treated) and periods $p \in \{0, 1\}$ (pre, post), compute the cell means $\bar{Y}_{gp}$.
Step 2: First differences
Step 3: DiD estimate
Step 4: Standard error (homoscedastic OLS)
With $n_{gp}$ observations per cell: $SE(\hat{\delta}) = \hat{\sigma}\sqrt{\tfrac{1}{n_{00}} + \tfrac{1}{n_{01}} + \tfrac{1}{n_{10}} + \tfrac{1}{n_{11}}}$
Where $\hat{\sigma}^2$ is the pooled residual variance.
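The 2×2 steps above can be sketched numerically; `did_2x2_with_se` is an illustrative helper that takes cell means and sizes plus a pooled residual-variance estimate (homoscedastic case only):

```python
import math

def did_2x2_with_se(cells, sigma2):
    """2x2 DiD and its homoscedastic SE.
    cells: dict {(g, p): (mean, n)} with g in {0, 1} (control/treated)
    and p in {0, 1} (pre/post); sigma2: pooled residual variance."""
    est = ((cells[(1, 1)][0] - cells[(1, 0)][0])
           - (cells[(0, 1)][0] - cells[(0, 0)][0]))
    se = math.sqrt(sigma2 * sum(1.0 / n for _, n in cells.values()))
    return est, se

# Hypothetical cells (25 observations each):
cells = {(0, 0): (74.1, 25), (0, 1): (73.2, 25),
         (1, 0): (72.4, 25), (1, 1): (70.1, 25)}
est, se = did_2x2_with_se(cells, sigma2=1.0)
```

In practice one would use cluster-robust standard errors (Section 16.3) rather than this homoscedastic formula.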
16.2 TWFE Estimation: The Demeaning Procedure
Step 1: Compute unit means, time means, and grand mean
Step 2: Demean all variables
Step 3: Estimate TWFE by OLS on demeaned variables
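The demeaning steps can be sketched as follows (illustrative helper names; the within-transformation is exact for balanced panels):

```python
import numpy as np

def two_way_demean(x, unit, time):
    """Within-transformation: x_it - unit mean - time mean + grand mean."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    for ids in (np.asarray(unit), np.asarray(time)):
        for v in np.unique(ids):
            out[ids == v] -= x[ids == v].mean()  # subtract group mean
    return out + x.mean()  # add the grand mean back once

def twfe_delta(y, d, unit, time):
    """TWFE DiD coefficient: OLS of demeaned y on demeaned d."""
    yd = two_way_demean(y, unit, time)
    dd = two_way_demean(d, unit, time)
    return float(dd @ yd / (dd @ dd))
```

On a noiseless 2-unit, 2-period panel with a treatment effect of 3, `twfe_delta` recovers 3 exactly, and demeaning a purely additive $\alpha_i + \lambda_t$ series returns zeros.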
16.3 Cluster-Robust Variance Estimator
With clusters and the TWFE estimator:
Where $c = \frac{G}{G-1} \cdot \frac{N-1}{N-K}$ is a small-sample correction, and $\ddot{D}_g$ and $\hat{u}_g$ are the demeaned treatment vector and residuals for cluster $g$.
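A minimal sketch of the cluster-robust SE for a single demeaned regressor with the usual small-sample correction (illustrative, not DataStatPro's implementation):

```python
import numpy as np

def cluster_robust_se(x, resid, cluster, n_params=1):
    """Cluster-robust SE for one (demeaned) regressor x:
    sqrt(c * sum_g (x_g' u_g)^2 / (x'x)^2),
    with correction c = G/(G-1) * (N-1)/(N-K)."""
    x = np.asarray(x, float)
    u = np.asarray(resid, float)
    cl = np.asarray(cluster)
    G, N = len(np.unique(cl)), len(x)
    c = (G / (G - 1)) * ((N - 1) / (N - n_params))
    meat = sum((x[cl == g] @ u[cl == g]) ** 2 for g in np.unique(cl))
    bread = x @ x
    return float(np.sqrt(c * meat / bread ** 2))
```

Summing residuals within clusters before squaring is what makes the estimator robust to arbitrary within-cluster correlation.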
16.4 Event Study Regression: Full Specification
For a unit with treatment onset at period $g_i$ and a balanced panel from $t = 1$ to $T$:
Define event-time dummies: $D_{it}^{e} = \mathbf{1}[t - g_i = e]$
For $e = -K, \dots, -2, 0, 1, \dots, L$ (omitting $e = -1$ as the reference).
Stack the regression:
Estimate by TWFE (adding unit and time FE as dummy variables or using within-transformation).
Confidence bands: For each $e$, compute $\hat{\delta}_e \pm 1.96 \cdot SE_{cl}(\hat{\delta}_e)$.
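Constructing the event-time dummies can be sketched as (hypothetical helper; the `leads`/`lags` window is an assumption):

```python
def event_time_dummies(t, g, leads=3, lags=5):
    """Event-time dummies D^e = 1[t - g == e] for one observation,
    omitting e = -1 as the reference category."""
    e = t - g
    return {k: int(e == k) for k in range(-leads, lags + 1) if k != -1}
```

Stacking these dictionaries across observations (plus unit and time dummies) yields the design matrix for the event study regression.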
16.5 The Bacon Decomposition
For a staggered design with cohorts (groups adopting treatment at different times), the TWFE estimator decomposes as:
Where:
- $\hat{\delta}_{kl}$ = 2×2 DiD comparing early-adopting cohort $k$ vs. late-adopting cohort $l$ (including comparisons that use already-treated units as controls).
- $\hat{\delta}_{kU}$ = 2×2 DiD comparing cohort $k$ vs. never-treated units.
- Weights are proportional to cell sizes and treatment variance.
The decomposition reveals how much of the TWFE estimate comes from each pairwise comparison, and which comparisons use already-treated units as controls (potentially problematic).
16.6 Pre-Trend Test Statistic
Joint F-test for pre-treatment event study coefficients:
$F = \frac{1}{q}\,(R\hat{\delta})' (R \hat{V} R')^{-1} (R\hat{\delta})$
Where $R$ selects the pre-treatment coefficients ($e \le -2$), $\hat{\delta}$ is the vector of event study estimates, $\hat{V}$ is their variance-covariance matrix (cluster-robust), and $q$ is the number of pre-period restrictions.
Under $H_0$ (parallel trends in pre-period): $F \sim F_{q,\,G-1}$ approximately.
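The joint pre-trend statistic can be sketched given a vector of pre-period estimates and their covariance matrix (illustrative; p-values would come from the $F$ or $\chi^2$ reference distribution):

```python
import numpy as np

def pretrend_wald(delta_pre, V_pre):
    """Joint Wald statistic W = d' V^{-1} d for the pre-period
    event-study coefficients; returns (W, F) with F = W / q."""
    d = np.asarray(delta_pre, float)
    V = np.asarray(V_pre, float)
    W = float(d @ np.linalg.solve(V, d))  # solve avoids explicit inverse
    return W, W / len(d)
```

With two pre-period estimates of 1 and 2 and identity covariance, $W = 1^2 + 2^2 = 5$ and $F = 2.5$.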
16.7 DiD with Binary Outcomes
For binary outcomes ($Y_{it} \in \{0, 1\}$), the linear probability model (LPM) DiD remains valid and interpretable:
$\hat{\delta}$ estimates the probability change (in percentage points) caused by treatment. While predicted probabilities may fall outside $[0, 1]$, the LPM DiD estimator of $\delta$ is unbiased under parallel trends.
For probits or logits, the DiD interpretation is more complex and non-linear. The average marginal effect from a nonlinear DiD:
Where $F(\cdot)$ is the estimated CDF (probit or logistic) and $X_{it}'\hat{\beta}$ is the linear predictor. Note: the LPM is generally preferred for DiD with binary outcomes due to tractability.
17. Worked Examples
Example 1: 2×2 DiD — Effect of Minimum Wage on Employment
Research Question: Did a 20% increase in the minimum wage in State A in 2019 affect the fast-food employment rate, using State B (which had no minimum wage change) as the control?
Data: Monthly employment rates for fast-food workers in State A (treated) and State B (control), 2017–2021. For simplicity, we use 2018 as pre-period and 2019 onward as post-period.
Step 1: Group-Period Mean Table
| | Pre-2019 Mean | Post-2019 Mean | Change |
|---|---|---|---|
| State A (Treated, $G = 1$) | 72.4% | 70.1% | -2.3 pp |
| State B (Control, $G = 0$) | 74.1% | 73.2% | -0.9 pp |
| DiD | | | -2.3 − (−0.9) = −1.4 pp |
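The DiD in the table can be reproduced in a few lines (numbers taken from the group-period means above):

```python
# Group-period means from Example 1 (percentage points).
means = {
    ("A", "pre"): 72.4, ("A", "post"): 70.1,  # treated state
    ("B", "pre"): 74.1, ("B", "post"): 73.2,  # control state
}
change_treated = means[("A", "post")] - means[("A", "pre")]  # treated change
change_control = means[("B", "post")] - means[("B", "pre")]  # control change
did = change_treated - change_control  # difference-in-differences
```

The same number falls out of the interaction coefficient in the OLS regression below.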
Step 2: OLS Regression
| Coefficient | Estimate | Cluster-Robust SE | $t$ | $p$ | 95% CI |
|---|---|---|---|---|---|
| Intercept ($\beta_0$) | 74.1 | 0.42 | 176.4 | < 0.001 | [73.2, 75.0] |
| State A ($\beta_1$) | -1.7 | 0.61 | -2.79 | 0.023 | [-3.1, -0.3] |
| Post ($\beta_2$) | -0.9 | 0.31 | -2.90 | 0.018 | [-1.6, -0.2] |
| DiD ($\hat{\delta}$) | -1.4 | 0.54 | -2.59 | 0.031 | [-2.6, -0.2] |
Estimated on monthly observations pooled across both states.
Step 3: Interpretation
The minimum wage increase in State A reduced fast-food employment by an estimated 1.4 percentage points ($p = 0.031$, 95% CI: $[-2.6, -0.2]$ pp). Relative to the pre-treatment baseline of 72.4%, this is a reduction of roughly 1.9%.
Step 4: Pre-Trends Check (Using 2017–2018 data)
Using quarterly 2017–2018 data, estimate a placebo DiD treating 2018 Q1 as "post":
The placebo DiD is small and statistically indistinguishable from zero (SE = 0.48) → No significant pre-treatment difference in trends. The parallel trends assumption is supported.
Step 5: Visualisation Summary
Pre-period trends for both states are approximately parallel (both declining slightly). Post-2019, State A's employment declines more sharply than State B's, consistent with the minimum wage effect.
Example 2: TWFE — Effect of Broadband Access on Business Formation
Research Question: Did broadband internet access (treated when broadband penetration > 50%) increase the rate of new business formation across US counties, 2000–2010?
Data: Annual panel of 3,142 counties over 11 years, 2000–2010 ($N = 34{,}562$ county-year observations); outcome: log business formation rate; treatment: broadband penetration indicator.
Step 1: TWFE Regression
Step 2: Results
| Variable | Coefficient | Cluster-Robust SE (County) | ||
|---|---|---|---|---|
| Broadband (DiD) | 0.0841 | 0.0214 | 3.93 | < 0.001 |
| ln(Population) | 0.1243 | 0.0381 | 3.26 | 0.001 |
| Unemployment | -0.0182 | 0.0051 | -3.57 | < 0.001 |
| County FE | ✅ (3,142 dummies) | — | — | — |
| Year FE | ✅ (11 dummies) | — | — | — |
$N = 34{,}562$ county-year observations.
Step 3: Interpretation
Broadband internet access increases the log business formation rate by 0.0841, corresponding to an $(e^{0.0841} - 1) \approx 8.8\%$ increase in business formation. The effect is highly significant ($p < 0.001$) after controlling for county and year fixed effects and time-varying population and unemployment controls.
Standardised effect: dividing $\hat{\delta} = 0.0841$ by the standard deviation of the outcome yields a small-to-medium standardised effect.
Step 4: Event Study
Estimating event study coefficients for 3 years before and 5 years after treatment adoption:
| Period ($e$) | $\hat{\delta}_e$ | SE | $p$ | Significant? |
|---|---|---|---|---|
| $-3$ | 0.011 | 0.018 | 0.543 | No |
| $-2$ | -0.008 | 0.015 | 0.591 | No |
| $-1$ | (reference = 0) | — | — | — |
| $0$ | 0.041 | 0.019 | 0.031 | Yes |
| $+1$ | 0.072 | 0.023 | 0.002 | Yes |
| $+2$ | 0.084 | 0.024 | < 0.001 | Yes |
| $+3$ | 0.091 | 0.026 | < 0.001 | Yes |
| $+4$ | 0.088 | 0.027 | 0.001 | Yes |
| $+5$ | 0.083 | 0.029 | 0.004 | Yes |
Pre-treatment coefficients: the joint pre-trends test is statistically insignificant → No pre-trends. Post-treatment effects ramp up over 2–3 years and then stabilise — consistent with gradual adoption and cumulative business formation.
Example 3: Staggered DiD — Effect of Paid Family Leave Policies on Female Labour Force Participation
Research Question: Did the adoption of state-level paid family leave (PFL) policies affect female labour force participation (FLFP) across US states, with different states adopting at different times (2004–2016)?
Data: Annual panel of 50 states, 2000–2020; outcome: FLFP rate (%); treatment: PFL adoption indicator. 12 states adopt PFL at different times; 38 states never adopt (control).
Step 1: TWFE Estimate (Standard)
$\hat{\delta}_{TWFE} = 1.82$ pp, SE = 0.74 (cluster-robust, state level).
Step 2: Bacon Decomposition
| Comparison Type | Weight | 2×2 DiD Estimate |
|---|---|---|
| Early adopters vs. Never treated | 0.41 | 2.31 |
| Late adopters vs. Never treated | 0.28 | 2.04 |
| Early vs. Late (early as treated) | 0.22 | 1.41 |
| Late vs. Early (late as treated) | 0.09 | 0.63 |
The decomposition reveals that 31% of the TWFE weight (0.22 + 0.09) comes from comparisons between early and late adopters — using already-treated states as controls. The "Late vs. Early" comparison (0.63 pp) is notably smaller, suggesting heterogeneous treatment effects across adoption cohorts.
Step 3: Callaway-Sant'Anna Robust Estimator
Computing $ATT(g, t)$ for each adoption cohort $g$ and time $t$:
| Cohort (First Treated Year $g$) | States | Average Post-Treatment ATT |
|---|---|---|
| 2004 (California) | 1 | 3.21 pp |
| 2008 (New Jersey) | 1 | 2.84 pp |
| 2013 (Rhode Island) | 1 | 2.41 pp |
| 2016 (New York) | 1 | 1.98 pp |
| Other early adopters (2004–2008) | 4 | 2.68 pp |
| Other late adopters (2009–2016) | 5 | 1.72 pp |
Aggregated average ATT (Callaway-Sant'Anna): 2.31 pp (SE = 0.82, ).
Observation: The Callaway-Sant'Anna estimate (2.31 pp) is larger than the TWFE estimate (1.82 pp), and the discrepancy is explained by the downward-biasing effect of using already-treated states as controls in the "Late vs. Early" TWFE comparison.
Step 4: Dynamic Effects (Event Study)
| Event Time | CS Estimate | SE | 95% CI |
|---|---|---|---|
| $-3$ | -0.12 | 0.28 | [-0.67, 0.43] |
| $-2$ | 0.08 | 0.24 | [-0.39, 0.55] |
| $-1$ | (reference) | — | — |
| $0$ | 1.21 | 0.41 | [0.41, 2.01] |
| $+1$ | 2.18 | 0.54 | [1.12, 3.24] |
| $+2$ | 2.41 | 0.61 | [1.21, 3.61] |
| $+3$ | 2.58 | 0.68 | [1.25, 3.91] |
Pre-trends test: jointly insignificant → No pre-trends. Effects increase over the first two years post-adoption and stabilise, suggesting that businesses and workers gradually adjust to the new policy.
Conclusion: PFL policies increase female labour force participation by approximately 2.3 percentage points on average (Callaway-Sant'Anna robust estimate). The TWFE estimate is downward biased by about 0.5 pp due to heterogeneous treatment effects across adoption cohorts. Effects materialise immediately and grow slightly over the first two years.
Example 4: Triple Differences — Effect of Health Insurance Expansion on Hospital Admissions
Research Question: Did Medicaid expansion under the ACA increase hospital admissions for low-income adults (the eligible group) compared to higher-income adults (the ineligible group), in expansion vs. non-expansion states?
Data: State-year panel; outcome: hospitalisation rate per 1,000 adults; three dimensions: State (expansion vs. non-expansion), Year (pre-2014 vs. post-2014), Income group (low-income eligible vs. higher-income ineligible).
Triple Differences Regression:
$Y_{sgt} = \alpha_{sg} + \alpha_{st} + \alpha_{gt} + \delta\,(\text{Expansion}_s \times \text{Post}_t \times \text{LowInc}_g) + \varepsilon_{sgt}$
Where all two-way interactions are absorbed by the appropriate two-way fixed effects ($\alpha_{sg}$, $\alpha_{st}$, $\alpha_{gt}$).
| Coefficient | Estimate | SE | $p$ |
|---|---|---|---|
| DDD ($\hat{\delta}$) | 12.4 | 3.21 | < 0.001 |
The DDD estimate suggests that Medicaid expansion increased hospitalisation rates for low-income adults (who became eligible) by 12.4 per 1,000 relative to higher-income adults in expansion states, and relative to the same income groups in non-expansion states. This controls for any general time trends in hospitalisation, any state-specific income disparities, and any common shifts in healthcare utilisation across income groups.
18. Common Mistakes and How to Avoid Them
Mistake 1: Failing to Use Cluster-Robust Standard Errors
Problem: Using OLS standard errors (or even heteroscedasticity-robust HC standard errors) in a DiD design, resulting in severely underestimated standard errors and spuriously small p-values. DiD residuals are almost always serially correlated within units across time, and within groups (e.g., states) across individuals.
Solution: Always cluster standard errors at the level of treatment assignment. For state-level policies, cluster at the state level. If the number of clusters is small (roughly $G < 30$), use the wild cluster bootstrap instead of asymptotic cluster-robust SEs.
Mistake 2: Ignoring Pre-Trends
Problem: Reporting a significant DiD estimate without testing or reporting the parallel trends assumption, leaving the identification assumption entirely unvalidated and the results unconvincing.
Solution: Always conduct and report the event study pre-trends test. Plot the event study coefficients with confidence bands for at least 2–3 pre-treatment periods. Report the joint -test for pre-period coefficients and discuss any concerning patterns, even if not statistically significant.
Mistake 3: Including Bad Controls (Post-Treatment Outcomes)
Problem: Including time-varying covariates that are themselves caused by the treatment (mediators or "bad controls") — e.g., including health insurance status as a covariate when the treatment is a health policy that affects insurance. This absorbs part of the treatment effect through the covariate, biasing $\hat{\delta}$ toward zero.
Solution: Only include covariates that are predetermined (determined before treatment) or plausibly unaffected by the treatment. When in doubt, report the model with and without the covariate and check sensitivity.
Mistake 4: Applying Standard TWFE to Staggered Designs Without Checking
Problem: Using the standard TWFE estimator in a staggered design with heterogeneous treatment effects, obtaining a potentially biased or sign-reversed DiD estimate without investigating its composition.
Solution: Always run the Bacon decomposition for staggered designs to understand what the TWFE estimate represents. If heterogeneous treatment effects are plausible, supplement with (or switch to) a robust estimator: Callaway-Sant'Anna, Sun-Abraham, or Borusyak-Jaravel-Spiess. Report both for transparency.
Mistake 5: Confusing the ATT with the ATE
Problem: Interpreting the DiD estimate as the Average Treatment Effect (ATE) — the effect averaged over all units — when DiD actually identifies the ATT (Average Treatment Effect on the Treated).
Solution: Be precise in reporting: DiD estimates the ATT — the effect for the treated units specifically. The ATE (which would include the effect on control units) is not identified by DiD unless additional assumptions (or different estimators) are invoked.
Mistake 6: Treating the Pre-Trends Test as Definitive
Problem: Passing the pre-trends test (no significant pre-treatment coefficients) and concluding that parallel trends definitely holds in the post-period. This overstates the confidence in identification.
Solution: The pre-trends test is supportive evidence, not proof. Pre-treatment parallel trends do not guarantee post-treatment parallel trends. Supplement with theoretical arguments for why the groups would have trended similarly, with placebo tests, and with Rambachan-Roth sensitivity analysis. Be honest about residual uncertainty.
Mistake 7: Selecting the Control Group Retrospectively Based on Pre-Trends
Problem: Searching across many potential control groups and selecting the one that shows the most parallel pre-trends with the treated group. This "pre-trend matching" leads to data mining, overfitting, and inflated confidence in parallel trends.
Solution: Specify the control group a priori based on institutional knowledge and theoretical comparability, not post-hoc based on pre-trend patterns. If multiple control groups are plausible, report results for all of them and assess robustness.
Mistake 8: Ignoring Anticipation Effects
Problem: Treating the period immediately before the official treatment start as a clean pre-period, when in fact treated units may have already begun responding in anticipation of treatment.
Solution: Test for anticipation effects by examining whether the period just before treatment ($e = -1$) shows a significant coefficient in the event study. Consider extending the "pre-period" to exclude periods potentially affected by anticipation. If anticipation is present, adjust the model (e.g., redefine the treatment as starting earlier).
Mistake 9: Using Levels When Parallel Trends Holds Only in Logs
Problem: Estimating the DiD in levels when treated and control groups are growing at the same rate (multiplicative parallel trends), rather than by the same amount (additive parallel trends). This produces spurious pre-trends on the levels scale.
Solution: Inspect raw pre-treatment trends in both levels and logs. If trends are parallel in logs but not levels, use the log outcome. Report the pre-trends test for the chosen specification and note the functional form assumption.
Mistake 10: Not Reporting the Full Event Study
Problem: Reporting only the single pooled DiD coefficient when the treatment has dynamic effects over multiple time periods, losing information about the timing, ramp-up, and persistence of the effect.
Solution: Always produce and report the full event study figure with pre- and post-treatment coefficients and confidence bands. The event study provides far more information than a single coefficient and is essential for assessing both identification (pre-trends) and the nature of the effect (dynamic patterns).
19. Troubleshooting
| Issue | Likely Cause | Solution |
|---|---|---|
| DiD coefficient is unexpected sign | Parallel trends violation; wrong treatment assignment; data coding error | Check treatment coding; inspect raw trends plots; run event study for pre-trend evidence |
| Very large standard errors | Too few clusters (small $G$); very small treatment group; collinearity | Use wild cluster bootstrap; report exact p-value (randomisation inference); check VIF |
| Significant pre-trends (parallel trends violated) | Systematic pre-existing trend differences; selection into treatment based on trends | Include unit-specific linear trends; use conditional parallel trends (covariates); consider synthetic control or matching |
| Event study coefficients show large dip at $e = -1$ | Anticipation effects; treatment actually starts earlier than recorded | Extend pre-period; redefine treatment onset; check institutional knowledge |
| TWFE gives negative estimate despite positive raw DiD | Heterogeneous treatment effects in staggered design (Bacon negative weights) | Run Bacon decomposition; use Callaway-Sant'Anna or other robust staggered estimator |
| Never-treated group is very small | Limited comparison group; potential control group contamination | Expand the never-treated control; use only timing-variation comparisons; consider synthetic control |
| Pre-trends test passes but Rambachan-Roth bounds are wide | Weak pre-trend evidence; short pre-period; noisy outcome | Collect more pre-periods; use better outcome measure; report sensitivity analysis bounds prominently |
| Fixed effects absorb all treatment variation | Treatment is perfectly collinear with unit or time FE; no within-unit variation in $D_{it}$ | Check whether treatment is time-varying; verify panel structure; ensure $D_{it}$ varies within units |
| Coefficient estimates change dramatically with different control sets | Model is sensitive to covariate specification; potential bad controls included | Report all specifications; identify and exclude bad controls (post-treatment variables); use doubly robust estimator |
| Wild cluster bootstrap gives -value of 1 | Asymmetric bootstrap distribution; very few clusters; extreme outlier cluster | Increase bootstrap replications; use refinement bootstrap; investigate influential clusters; consider randomisation inference |
| Log outcome produces extreme predictions | Zeros in the outcome variable (log undefined) | Use inverse hyperbolic sine (IHS) transformation: $\operatorname{arcsinh}(y) = \ln(y + \sqrt{y^2 + 1})$; or use Poisson TWFE for count outcomes |
| DDD estimate is implausibly large | Interaction with ineligible group not clean; compositional changes | Verify eligibility classification; test for treatment effects on ineligible group (should be zero); check for spillovers |
| Staggered design: Callaway-Sant'Anna gives very wide CIs | Small cohort sizes; few pre-periods for some cohorts; sparse data | Report cohort-level ATTs separately; aggregate with caution; increase sample |
| Placebo treatment group shows significant effect | SUTVA violation (spillovers); contamination of control group | Investigate potential spillover mechanisms; redefine control group to exclude exposed units; report sensitivity to control group definition |
20. Quick Reference Cheat Sheet
Core DiD Formulas
| Formula | Description |
|---|---|
| $\hat{\delta} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})$ | 2×2 DiD estimator |
| $Y_{it} = \beta_0 + \beta_1 G_i + \beta_2 P_t + \delta (G_i \times P_t) + \varepsilon_{it}$ | 2×2 DiD regression |
| $Y_{it} = \alpha_i + \lambda_t + \delta D_{it} + \varepsilon_{it}$ | TWFE DiD |
| $\ddot{Y}_{it} = Y_{it} - \bar{Y}_i - \bar{Y}_t + \bar{Y}$ | Within-transformation (demeaning) |
| $\hat{\delta} = \sum \ddot{D}_{it} \ddot{Y}_{it} / \sum \ddot{D}_{it}^2$ | TWFE estimator |
| $Y_{it} = \alpha_i + \lambda_t + \sum_{e \neq -1} \delta_e D_{it}^{e} + \varepsilon_{it}$ | Event study regression |
| $t = \hat{\delta} / SE_{cl}(\hat{\delta})$ | Test statistic (clustered) |
| $\hat{\delta} \pm t_{G-1,\,0.975} \cdot SE_{cl}(\hat{\delta})$ | Confidence interval |
| $d = \hat{\delta} / SD(Y)$ | Standardised effect size |
| $\%\Delta = 100\,(e^{\hat{\delta}} - 1)$ | Percent change effect |
2×2 DiD Table Template
| | Pre ($P = 0$) | Post ($P = 1$) | Difference |
|---|---|---|---|
| Treated ($G = 1$) | $\bar{Y}_{T,pre}$ | $\bar{Y}_{T,post}$ | $\Delta_T = \bar{Y}_{T,post} - \bar{Y}_{T,pre}$ |
| Control ($G = 0$) | $\bar{Y}_{C,pre}$ | $\bar{Y}_{C,post}$ | $\Delta_C = \bar{Y}_{C,post} - \bar{Y}_{C,pre}$ |
| Difference | | | $\hat{\delta}_{DiD} = \Delta_T - \Delta_C$ |
Standard Error Selection Guide
| Setting | Recommended SE | When |
|---|---|---|
| Large ($G \geq 50$) | Cluster-robust (HC1) | Standard panel DiD |
| Moderate ($30 \leq G < 50$) | Cluster-robust with bias correction | Borderline case |
| Small ($G < 30$) | Wild cluster bootstrap | Few clusters |
| Very few ($G < 10$) | Randomisation inference | Only a few treated clusters |
| Cross-section with groups | Cluster at group level | Group-level treatment |
Assumption Checklist
| Assumption | How to Test | If Violated |
|---|---|---|
| Parallel trends | Pre-trends test (event study); placebo tests | Add covariates; unit-specific trends; Rambachan-Roth bounds |
| No anticipation | Check the $e = -1$ event study coefficient | Redefine treatment timing; extend pre-period |
| SUTVA (no spillovers) | Placebo treatment on nearby controls; spillover tests | Redefine control group; use exclusion zones |
| No compositional change | Check covariate balance across periods | Use balanced panel; control for composition |
| Exogenous treatment timing | Institutional knowledge; pre-trends test | Instrument for timing; use conditional parallel trends |
| No interference | Study design; geographic checks | Spatial correlation SEs; define clean control zones |
DiD Estimator Comparison for Staggered Designs
| Estimator | Robust to Heterogeneity | Easy to Implement | Software |
|---|---|---|---|
| TWFE | ❌ | ✅ | Any regression software |
| Bacon Decomposition | Diagnostic only | ✅ | DataStatPro, bacondecomp (R/Stata) |
| Callaway-Sant'Anna | ✅ | Moderate | DataStatPro, did (R), csdid (Stata) |
| Sun-Abraham | ✅ | Moderate | DataStatPro, sunab (Stata) |
| de Chaisemartin-DH | ✅ | Moderate | DataStatPro, did_multiplegt (Stata) |
| Borusyak-Jaravel-Spiess | ✅ | Moderate | DataStatPro, did_imputation (Stata) |
Effect Size Interpretation
| Measure | Formula | Unit | Interpretation |
|---|---|---|---|
| DiD coefficient ($\hat{\delta}$) | Direct estimate | Same as $Y$ | Absolute change in $Y$ |
| Log DiD | $100\,(e^{\hat{\delta}} - 1)$ | % change | Percentage change in $Y$ |
| Percent change | $100\,\hat{\delta} / \bar{Y}_{pre}$ | % | Change relative to baseline |
| Cohen's $d$ | $\hat{\delta} / SD(Y)$ | SD units | Standardised effect |
| NNT | $1 / \lvert\hat{\delta}\rvert$ | Persons | Number needed to treat (binary outcomes) |
Model Specification Checklist
| Feature | Recommendation | Notes |
|---|---|---|
| Unit fixed effects | ✅ Always include | Removes time-invariant confounders |
| Time fixed effects | ✅ Always include | Removes common time shocks |
| Cluster-robust SEs | ✅ Always use | Cluster at treatment assignment level |
| Pre-trends test | ✅ Always report | Event study with pre-periods |
| Bacon decomposition | ✅ For staggered | Diagnose TWFE composition |
| Robust staggered estimator | ✅ For staggered | At least one robust estimator |
| Placebo tests | ✅ Report | Placebo time, group, or outcome |
| Covariate balance table | ✅ Report | Pre-treatment balance check |
| Unit-specific trends | ⚠️ Use with caution | Only if strong pre-trend concern |
| Binary outcome LPM | ✅ Preferred | More tractable than probit DiD |
| Log outcome | ✅ If proportional trends | Check functional form |
Key Identification Assumptions
| Assumption | Formal Statement | Testable? | Diagnostic |
|---|---|---|---|
| Parallel trends | $E[\Delta Y_{it}(0) \mid G_i = 1] = E[\Delta Y_{it}(0) \mid G_i = 0]$ | Partially (pre-period only) | Event study pre-trends |
| No anticipation | $Y_{it} = Y_{it}(0)$ for $t < g_i$ | Yes (pre-period $e = -1$) | Check $\hat{\delta}_{-1}$ |
| SUTVA | No spillovers; one version of treatment | Partially | Geographic placebo; excluded-zone test |
| Exogenous timing | $g_i$ not determined by anticipated outcomes | Partially | Pre-trend test; institutional knowledge |
| Stable composition | Sample composition unchanged by treatment | Yes | Covariate balance across periods |
This tutorial provides a comprehensive foundation for understanding, applying, and interpreting Difference-in-Differences Models using the DataStatPro application. For further reading, consult Angrist & Pischke's "Mostly Harmless Econometrics" (Princeton University Press, 2009), Callaway & Sant'Anna's "Difference-in-Differences with Multiple Time Periods" (Journal of Econometrics, 2021), Roth et al.'s "What's Trending in Difference-in-Differences?" (Journal of Econometrics, 2023), or Goodman-Bacon's "Difference-in-Differences with Variation in Treatment Timing" (Journal of Econometrics, 2021). For feature requests or support, contact the DataStatPro team.