The TWFE default
The two-way fixed effects (TWFE) regression is the traditional workhorse for difference-in-differences (DiD) estimation:
$$ Y_{it} = \alpha_i + \lambda_t + \tau D_{it} + \varepsilon_{it}, $$
where $\alpha_i$ are unit fixed effects, $\lambda_t$ are time fixed effects, and $D_{it}$ is the treatment indicator, equal to 1 when unit $i$ is treated in period $t$ and 0 otherwise. The coefficient $\tau$ is interpreted as the average treatment effect. The regression is intuitive, easy to implement, and computationally fast. Under staggered adoption with heterogeneous treatment effects, however, it can be misleading.
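As a minimal sketch of how this regression can be computed, the snippet below estimates $\tau$ on a small simulated balanced panel using the two-way within transformation (subtract unit means and period means, add back the grand mean). The adoption dates, fixed-effect values, and the helper name `twfe_coefficient` are illustrative assumptions, and the data are noise-free with a homogeneous effect.

```python
import numpy as np

def twfe_coefficient(y, d):
    """TWFE slope on d for a balanced panel; y and d have shape (units, periods)."""
    def within(x):
        # Two-way demeaning: remove unit means and period means, add back the grand mean.
        return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()
    y_w, d_w = within(y), within(d)
    return float((d_w * y_w).sum() / (d_w ** 2).sum())

rng = np.random.default_rng(0)
periods = np.arange(1, 7)
# Hypothetical adoption periods; np.inf marks never-treated units.
adopt = np.array([2, 2, 4, 4, np.inf, np.inf])
d = (periods[None, :] >= adopt[:, None]).astype(float)

alpha = rng.normal(size=adopt.size)    # unit fixed effects
lam = rng.normal(size=periods.size)    # time fixed effects
true_tau = 2.0                         # homogeneous effect (assumption)
y = alpha[:, None] + lam[None, :] + true_tau * d

tau_hat = twfe_coefficient(y, d)
print(round(tau_hat, 6))
```

With a homogeneous effect and no noise, the estimate matches `true_tau` up to floating-point error, even though adoption is staggered.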
The promise: why TWFE worked in the canonical case
In the canonical 2×2 design (two groups, two periods, one group treated in period 2), the TWFE coefficient coincides with the simple difference-in-differences of group means and identifies the average treatment effect on the treated (ATT) under parallel trends. Unit fixed effects absorb time-invariant differences between treated and control units; time fixed effects absorb common shocks; what remains in the treatment coefficient is the causal effect. This made TWFE the default estimator for decades.
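The 2×2 equivalence can be checked numerically. The sketch below simulates a noisy two-group, two-period panel (group sizes, effect size, and noise scale are arbitrary assumptions) and compares the TWFE coefficient with the difference in mean outcome changes between the groups.

```python
import numpy as np

rng = np.random.default_rng(1)
n_treated, n_control = 30, 50                       # hypothetical group sizes
group = np.r_[np.ones(n_treated), np.zeros(n_control)]
post = np.array([0.0, 1.0])                         # period 1, period 2

d = group[:, None] * post[None, :]                  # treated group, period 2 only
y = (rng.normal(size=group.size)[:, None]           # unit effects
     + np.array([0.0, 1.5])[None, :]                # common period shock
     + 3.0 * d                                      # true effect (assumption)
     + rng.normal(scale=0.5, size=d.shape))         # idiosyncratic noise

def within(x):
    # Two-way demeaning for a balanced panel.
    return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()

d_w, y_w = within(d), within(y)
twfe = (d_w * y_w).sum() / (d_w ** 2).sum()

# Difference-in-differences of group means.
change = y[:, 1] - y[:, 0]
did = change[group == 1].mean() - change[group == 0].mean()

print(np.isclose(twfe, did))
```

The two numbers agree to machine precision regardless of the noise draw, because in a balanced 2×2 panel the regression is algebraically the same calculation.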
With multiple cohorts adopting at different times, TWFE appears to generalise naturally: unit fixed effects absorb cohort-specific levels, time fixed effects absorb calendar shocks, and the treatment coefficient captures the average effect across all treated unit-periods. The regression pools information efficiently and standard errors are straightforward to compute with clustering.
The problem: three implicit comparisons
Staggered adoption breaks the clean 2×2 structure. TWFE implicitly performs three types of comparisons:
- Newly treated units vs not-yet-treated units: valid under parallel trends.
- Newly treated units vs never-treated units: valid under parallel trends.
- Newly treated units vs already-treated units: problematic when treatment effects are heterogeneous, in particular when they change over time.
The third comparison, often called a forbidden comparison, is the source of the problem. When TWFE compares a newly treated unit to an already-treated unit, it uses the change in the already-treated unit’s post-treatment outcome as the counterfactual trend. But that change reflects any evolution in the already-treated unit’s own treatment effect, not only the trend it would have followed without treatment. If treatment effects evolve over time, this comparison is contaminated; if they also differ across cohorts, the weights TWFE places on each cohort need not match any target average.
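The contamination can be written out for a single comparison in which cohort B is newly treated and cohort A is already treated in both windows. The notation here is introduced for illustration only: $\bar{\tau}_A^{pre}$ and $\bar{\tau}_A^{post}$ are cohort A’s average effects before and after B adopts, $\bar{\tau}_B^{post}$ is B’s average effect after adoption, and $\Delta$ is the common change in untreated outcomes implied by parallel trends. The difference in outcome changes is then
$$ \underbrace{(\Delta + \bar{\tau}_B^{post})}_{\text{change for B}} - \underbrace{(\Delta + \bar{\tau}_A^{post} - \bar{\tau}_A^{pre})}_{\text{change for A}} = \bar{\tau}_B^{post} - (\bar{\tau}_A^{post} - \bar{\tau}_A^{pre}). $$
The comparison recovers B’s effect only if A’s effect is the same in both windows; any growth in A’s effect is subtracted from the estimate.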
Negative weights and sign reversal
de Chaisemartin and d’Haultfœuille (2020) and Goodman-Bacon (2021) formalised this. de Chaisemartin and d’Haultfœuille show that, under parallel trends, the TWFE coefficient is in expectation a weighted sum of cohort-time effects, and that some of the weights can be negative:
$$ E[\hat{\tau}^{TWFE}] = \sum_g \sum_{t \ge g} w_{gt} \tau(g,t), $$
where $\tau(g,t)$ is the average effect in period $t$ for the cohort first treated in period $g$, and the weights $w_{gt}$ sum to one but are not guaranteed to be non-negative. In extreme cases, TWFE can produce an estimate with the opposite sign from all underlying $\tau(g,t)$, a phenomenon known as sign reversal.
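A rough sketch of how such weights can be computed in a balanced panel: residualise the treatment indicator on unit and period fixed effects, keep the residuals of treated cells, and normalise them to sum to one. The design below (one unit per cohort, adoption in periods 2 and 4, six periods, no never-treated units) and the helper name `twfe_weights` are assumptions made for illustration.

```python
import numpy as np

def twfe_weights(adopt, n_periods):
    """Weights on treated unit-period cells implied by TWFE in a balanced panel."""
    periods = np.arange(1, n_periods + 1)
    adopt = np.asarray(adopt, dtype=float)
    d = (periods[None, :] >= adopt[:, None]).astype(float)
    # Residual of D after removing unit and period fixed effects.
    resid = d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()
    treated_resid = np.where(d == 1, resid, 0.0)   # untreated cells get zero weight
    return treated_resid / treated_resid.sum()

# Hypothetical design: cohort A adopts in period 2, cohort B in period 4.
w = twfe_weights([2, 4], n_periods=6)
print(np.round(w, 3))
```

In this configuration the weights sum to one, and cohort A’s cells in periods 4 to 6, when both cohorts are treated, receive negative weight.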
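Example (Sign Reversal): A loyalty programme rolls out to two cohorts. Cohort A adopts in period 2 and experiences an effect of $+10$ in all post-treatment periods. Cohort B adopts in period 4 and experiences an effect of $+5$. Both effects are positive. TWFE assigns negative weight to Cohort A’s later periods, because Cohort A serves as a “control” for Cohort B’s treatment. With effects that are constant over time within each cohort, those negative weights are offset by positive weights on Cohort A’s earlier periods, so the estimate stays between $+5$ and $+10$, although it can fall below the ATT. If instead Cohort A’s effect grows over time, so that its largest effects land in the negatively weighted later periods, the TWFE estimate can be attenuated toward zero or even turn negative.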
The intuition: TWFE treats already-treated units as if they were untreated when constructing comparisons for later-adopting cohorts. If early adopters’ effects are still growing when later cohorts adopt, that growth is mistaken for a common trend and subtracted from the later adopters’ outcome changes, biasing estimates downward.
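Sign reversal can be reproduced in a few lines. The sketch below assumes a hypothetical two-cohort panel with no never-treated units, no noise, and an early-cohort effect that jumps from 1 to 10 once the late cohort adopts; every treated cell has a strictly positive effect. The numbers and the helper name `twfe_coefficient` are illustrative, not taken from any dataset.

```python
import numpy as np

def twfe_coefficient(y, d):
    """TWFE slope on d for a balanced panel; y and d have shape (units, periods)."""
    def within(x):
        return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()
    y_w, d_w = within(y), within(d)
    return float((d_w * y_w).sum() / (d_w ** 2).sum())

periods = np.arange(1, 7)
adopt = np.array([2, 4])                       # cohort A, cohort B
d = (periods[None, :] >= adopt[:, None]).astype(float)

# Hypothetical effects, all positive in treated cells.
tau = np.array([[0, 1, 1, 10, 10, 10],         # cohort A: effect grows after period 3
                [0, 0, 0,  1,  1,  1]],        # cohort B: constant effect
               dtype=float)

alpha = np.array([3.0, 1.0])                   # unit fixed effects
lam = 0.5 * periods                            # common time trend
y = alpha[:, None] + lam[None, :] + tau * d    # noise-free outcomes

tau_hat = twfe_coefficient(y, d)
print(tau_hat)
```

Under these assumptions the coefficient comes out negative (about $-5.75$) even though no treated cell has a negative effect, because cohort A’s large late effects sit exactly where its weights are negative.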
When does TWFE fail most severely?
TWFE problems are most severe when:
- Treatment effects are heterogeneous across cohorts — early adopters experience different effects than late adopters, and comparisons between them are contaminated.
- Effects evolve over time — using already-treated units as controls conflates treatment dynamics with the counterfactual.
- Many cohorts, staggered timing — more opportunities for forbidden comparisons.
- Never-treated units are scarce or absent — TWFE is forced to rely heavily on already-treated comparisons.
TWFE is less problematic when:
- Effects are homogeneous across cohorts and event times — forbidden comparisons are not biased.
- The design is close to canonical 2×2 — few cohorts and a clear pre/post distinction.
- Abundant never-treated units dominate the sample and provide most of the identifying variation.
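The role of never-treated units can be illustrated by recomputing the implied TWFE weights as never-treated units are added to a fixed two-cohort design. This is a sketch under simple assumptions (one unit per treated cohort, adoption in periods 2 and 4, six periods); the helper name `negative_weight_sum` is hypothetical.

```python
import numpy as np

def negative_weight_sum(n_never, adopt=(2, 4), n_periods=6):
    """Sum of negative TWFE weights on treated cells, given n_never never-treated units."""
    periods = np.arange(1, n_periods + 1)
    adopt_all = np.array(list(adopt) + [np.inf] * n_never, dtype=float)
    d = (periods[None, :] >= adopt_all[:, None]).astype(float)
    # Residual of D after removing unit and period fixed effects.
    resid = d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()
    w = resid[d == 1] / resid[d == 1].sum()    # normalised weights on treated cells
    return float(w[w < -1e-12].sum())          # tolerance guards against rounding noise

for n in (0, 1, 4):
    print(n, round(negative_weight_sum(n), 3))
```

In this toy design the negative weight mass shrinks as never-treated units are added and disappears once they are numerous enough, which is consistent with the last point in each list above.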
Diagnosing TWFE problems
Run diagnostics before abandoning TWFE.
Bacon decomposition (Goodman-Bacon 2021): breaks the TWFE estimator into its component 2×2 comparisons (treated vs never-treated, treated vs not-yet-treated, and treated vs already-treated) and reports the weight on each. If the already-treated comparisons carry a large share of the weight, TWFE is unreliable when effects change over time. Implemented in bacondecomp, which is available for both Stata and R.
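For intuition, the decomposition can be done by hand in the simplest staggered case: two timing cohorts, no never-treated units, no noise. The sketch below is not the `bacondecomp` implementation. It uses a two-cohort simplification of the decomposition weights (an assumption, in which each weight depends only on the share of periods each cohort spends treated) and checks the result against the TWFE coefficient. All data values are hypothetical.

```python
import numpy as np

def twfe_coefficient(y, d):
    def within(x):
        return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()
    y_w, d_w = within(y), within(d)
    return float((d_w * y_w).sum() / (d_w ** 2).sum())

def did_2x2(y, treated, control, pre, post):
    """Mean outcome change of the treated row minus that of the control row."""
    change = lambda i: y[i, post].mean() - y[i, pre].mean()
    return change(treated) - change(control)

periods = np.arange(1, 7)
g_early, g_late = 2, 4
d = (periods[None, :] >= np.array([g_early, g_late])[:, None]).astype(float)
tau = np.array([[0, 1, 1, 10, 10, 10],       # early cohort: effect grows over time
                [0, 0, 0,  1,  1,  1]], dtype=float)
y = np.array([3.0, 1.0])[:, None] + 0.5 * periods[None, :] + tau * d

middle = (periods >= g_early) & (periods < g_late)
# Early cohort treated, late cohort still untreated (periods before g_late).
early_vs_late = did_2x2(y, 0, 1, pre=periods < g_early, post=middle)
# Late cohort treated, early cohort already treated (periods from g_early on).
late_vs_early = did_2x2(y, 1, 0, pre=middle, post=periods >= g_late)

share_early, share_late = d[0].mean(), d[1].mean()   # share of periods treated
# Two-cohort simplification of the weights (assumption, checked against TWFE below).
w_early = (1 - share_early) / ((1 - share_early) + share_late)
w_late = 1 - w_early

decomposed = w_early * early_vs_late + w_late * late_vs_early
twfe = twfe_coefficient(y, d)
print(early_vs_late, late_vs_early, w_late, np.isclose(decomposed, twfe))
```

Here the already-treated comparison carries three quarters of the weight and is itself negative, which is what drives the overall coefficient below zero.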
Weight diagnostics (de Chaisemartin and d’Haultfœuille 2020): compute the weights $w_{gt}$ and identify which cohort-time effects receive negative weight. If many weights are negative or large negative weights attach to important cells, TWFE is suspect. Implemented in twowayfeweights (Stata) and TwoWayFEWeights (R).
TWFE vs modern estimator comparison: run both TWFE and a modern estimator (Callaway–Sant’Anna, Sun–Abraham, or Borusyak–Jaravel–Spiess). If estimates agree, TWFE may be acceptable despite its theoretical problems. If they diverge substantially, the divergence reveals the magnitude of the bias from forbidden comparisons.
Worked example: TWFE vs modern estimators
Three cohorts adopt a loyalty programme in periods 2, 4, and 6, observed through period 8. True effects are heterogeneous:
| Cohort | $\tau(g,t)$ for all $t \ge g$ |
|---|---|
| $g=2$ | 10 |
| $g=4$ | 5 |
| $g=6$ | 2 |
All true effects are positive. TWFE estimates a single coefficient $\hat{\tau}^{TWFE}$. Because cohort $g=2$ serves as a control for cohorts $g=4$ and $g=6$ in some comparisons, its later treated periods receive negative weight, which shrinks the total weight on the cohort with the largest effect, and TWFE underestimates the average effect. The Bacon decomposition reveals that a substantial fraction of the TWFE weight comes from already-treated comparisons, and the de Chaisemartin–d’Haultfœuille weights are negative on some cohort-time cells.
Callaway–Sant’Anna estimates $\hat{\tau}(g,t)$ for each cohort-time pair using only never-treated or not-yet-treated controls, then aggregates with non-negative weights to produce an overall ATT that correctly reflects the positive effects across all cohorts. The discrepancy between TWFE and CS quantifies the bias from forbidden comparisons.
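The comparison can be sketched numerically. The code below is not the Callaway–Sant’Anna implementation from the `did` package; it is a simplified stand-in that estimates each cohort-time effect against a never-treated unit and averages over treated cells. It assumes one unit per cohort, one never-treated unit (not specified in the example above), periods 1 to 8, and noise-free outcomes with arbitrary fixed effects.

```python
import numpy as np

periods = np.arange(1, 9)
adopt = np.array([2, 4, 6, np.inf])          # last unit never treated (assumption)
effect = np.array([10.0, 5.0, 2.0, 0.0])     # cohort effects from the table
d = (periods[None, :] >= adopt[:, None]).astype(float)

alpha = np.array([1.0, 2.0, 3.0, 4.0])       # unit fixed effects (arbitrary)
lam = 0.3 * periods                          # common trend (arbitrary)
y = alpha[:, None] + lam[None, :] + effect[:, None] * d

def within(x):
    return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()

d_w, y_w = within(d), within(y)
twfe = float((d_w * y_w).sum() / (d_w ** 2).sum())

# Cohort-time effects: change since period g-1, minus the never-treated unit's change.
never = 3
atts = []
for i, g in enumerate(adopt[:3]):
    base = int(g) - 2                        # column index of period g-1
    for t in range(int(g) - 1, periods.size):
        atts.append((y[i, t] - y[i, base]) - (y[never, t] - y[never, base]))
overall_att = float(np.mean(atts))           # equal weight on every treated cell

print(round(twfe, 3), round(overall_att, 3))
```

Under these assumptions the cell-weighted ATT is $101/15 \approx 6.73$, while the TWFE coefficient is roughly 4.56, so the gap between the two gives a sense of how far the TWFE weighting can move the estimate in this design.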
When is TWFE still useful?
Despite its problems, TWFE retains value:
- As a benchmark: report TWFE alongside modern estimators to show the magnitude of the bias correction. Agreement gives readers confidence that heterogeneity is not severe.
- When effects are plausibly homogeneous: forbidden comparisons do not introduce bias, and TWFE is an efficient estimator.
- Computational speed: for very large panels (millions of observations), TWFE is fast while some modern estimators are slow. Use TWFE for exploratory analysis, then confirm with modern estimators on subsamples or with computational optimisations. The Borusyak–Jaravel–Spiess imputation estimator achieves comparable speed while avoiding negative weights.
- Abundant never-treated units: forbidden comparisons contribute little, and bias is small.
The key discipline: diagnose before deciding. Run the Bacon decomposition, check the weights, compare to a modern estimator. If TWFE passes these checks, use it. If it fails, use modern estimators and report the discrepancy.
Takeaway
TWFE is not wrong — it is an estimator of a well-defined quantity. The problem is that the quantity it estimates is a weighted average of $\tau(g,t)$ with potentially negative weights, which can diverge from any economically meaningful treatment effect parameter when effects are heterogeneous. The solution is not to discard TWFE but to understand when its weights are benign and when they are not, and to use modern estimators that enforce non-negative weights whenever TWFE diagnostics raise concern.
References
- Shaw, C. (2025). Causal Inference in Marketing: Panel Data and Machine Learning Methods (Community Review Edition), Section 4.4.
- Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254–277.
- de Chaisemartin, C., and d’Haultfœuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9), 2964–2996.
- Callaway, B., and Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200–230.