The shared insight
Modern heterogeneity-robust estimators address TWFE’s pathologies by enforcing one rule: never use already-treated units as controls. For each cohort $g$ in each post-treatment period $t \ge g$, the estimator compares outcomes for cohort $g$ to a comparison group drawn from never-treated units ($G_i = \infty$) or not-yet-treated units ($G_i > t$). This guarantees that the comparison group provides a valid counterfactual under parallel trends.
Using not-yet-treated units as controls requires a timing assumption: untreated potential outcomes $Y_{it}(\infty)$ must evolve similarly for cohorts with $G_i = g$ and $G_i > t$ (staggered parallel trends, Section 4.3). If early adopters would have grown faster than late adopters absent treatment, not-yet-treated units are not valid controls. This caveat applies to all modern estimators that use not-yet-treated comparisons.
Unified setup
Observe a panel with units $i = 1, \ldots, N$ and periods $t = 1, \ldots, T$. Let $G_i = \min\{t : D_{it} = 1\}$ be the adoption date ($G_i = \infty$ for never-treated). Treatment is absorbing. Define:
- Treated cohort: $\mathcal{T}_g = \{i : G_i = g\}$
- Never-treated comparison: $C_t^{NT} = \{i : G_i = \infty\}$
- Not-yet-treated comparison: $C_{g,t}^{NYT} = \{i : G_i > t\}$
Every modern estimator builds on the same cohort-time DiD contrast:
$$ \hat{\tau}(g, t \mid C_{g,t}) = \frac{1}{|\mathcal{T}_g|} \sum_{i \in \mathcal{T}_g} (Y_{it} - Y_{i,g-1}) - \frac{1}{|C_{g,t}|} \sum_{i \in C_{g,t}} (Y_{it} - Y_{i,g-1}). $$Under staggered parallel trends this is unbiased for $\tau(g, t)$. All modern estimators then aggregate:
$$ \hat{\tau}^{(\text{est})} = \sum_g \sum_{t \ge g} w_{g,t}^{(\text{est})} \hat{\tau}(g, t \mid C_{g,t}), \quad \sum_{g,t} w_{g,t}^{(\text{est})} = 1. $$The estimators differ in their choice of comparison set $C_{g,t}$, their weighting scheme $w_{g,t}$, and how they incorporate covariates or factor structures.
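Under the assumptions above, the cohort-time contrast can be computed directly from a long-format panel. A minimal pandas sketch (the function and column names `unit`, `t`, `g`, `y` are illustrative, not any package's API; never-treated units carry $G_i = \infty$):

```python
import numpy as np
import pandas as pd

def att_gt(df, g, t, comparison="nyt"):
    """tau_hat(g, t | C): mean long difference Y_t - Y_{g-1} for cohort g
    minus the same long difference for the comparison group C."""
    base = df.loc[df["t"] == g - 1].set_index("unit")["y"]
    now = df.loc[df["t"] == t].set_index("unit")["y"]
    diff = now - base                       # per-unit long difference
    cohort = df.groupby("unit")["g"].first()
    treated = cohort.index[cohort == g]
    if comparison == "nt":                  # never-treated controls: G = infinity
        ctrl = cohort.index[np.isinf(cohort)]
    else:                                   # not-yet-treated controls: G > t
        ctrl = cohort.index[cohort > t]
    return diff.loc[treated].mean() - diff.loc[ctrl].mean()

# Toy panel with exact parallel trends and a constant post-treatment effect of 2
units = [(1, 0.0, 2.0), (2, 1.0, 2.0), (3, 5.0, np.inf), (4, -1.0, np.inf)]
panel = pd.DataFrame(
    [{"unit": u, "t": t, "g": g, "y": a + t + 2.0 * (t >= g)}
     for u, a, g in units for t in (1, 2, 3)]
)
print(att_gt(panel, g=2, t=2))  # 2.0: the toy DGP satisfies parallel trends exactly
```

The estimators below all reduce to weighted averages of contrasts like this one; real implementations (did, fixest) add covariate adjustment and valid inference on top.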
Callaway and Sant’Anna (CS)
CS estimates $\hat{\tau}(g,t)$ for each cohort-time pair directly:
$$ \hat{\tau}(g,t) = \mathbb{E}_n[Y_t - Y_{g-1} \mid G = g] - \mathbb{E}_n[Y_t - Y_{g-1} \mid C], $$where the comparison group $C$ consists exclusively of never-treated units ($G = \infty$) or exclusively of not-yet-treated units ($G > t$). These cohort-time estimates are then aggregated with pre-specified weights to produce an overall $ATT_{agg}$, event-time effects $\theta_k$, cohort-specific effects $\tau_g$, or calendar-time effects $\tau_t$.
CS supports conditional parallel trends by conditioning on pre-treatment covariates $X_i$, adjusting for imbalances between treated and comparison units via outcome regression, inverse probability weighting (IPW) based on an estimated propensity score $\hat{e}(X_i)$, or doubly robust methods that combine the two.
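The IPW adjustment can be sketched as reweighting comparison-group long differences by the propensity-score odds so the comparison group mimics the treated cohort's covariate distribution. A minimal sketch in that spirit (the helper name and inputs are hypothetical; `dy_treat`/`dy_ctrl` are long differences $Y_t - Y_{g-1}$ and `e_ctrl` the estimated propensity scores of comparison units — not the exact CS estimand):

```python
import numpy as np

def ipw_did(dy_treat, dy_ctrl, e_ctrl):
    """IPW DiD contrast: upweight comparison units whose covariates make
    treatment likely (odds e / (1 - e)), then difference the means."""
    w = e_ctrl / (1.0 - e_ctrl)
    w = w / w.sum()                       # normalise weights to sum to 1
    return dy_treat.mean() - np.sum(w * dy_ctrl)

dy_treat = np.array([3.0, 3.0])
dy_ctrl = np.array([1.0, 2.0, 3.0])
e_ctrl = np.array([0.5, 0.5, 0.5])
print(ipw_did(dy_treat, dy_ctrl, e_ctrl))  # flat scores reduce it to a simple DiD
```

With non-flat propensity scores, comparison units that resemble the treated cohort count for more, which is what "adjusting for imbalances" means operationally.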
Strengths: flexible aggregations and diagnostic plots; rich covariate support; handles most marketing panel settings.
Limitations: computation slows with many cohort-time pairs; standard errors widen when cohorts are small; the analyst must choose between never-treated and not-yet-treated controls, and this choice can affect results when the two groups differ systematically.
Software: att_gt() + aggte() in the did R package; csdid in Stata.
Sun and Abraham (SA)
SA takes an interaction-weighted event-study approach, estimating a dynamic regression with cohort-by-event-time interactions:
$$ Y_{it} = \alpha_i + \lambda_t + \sum_g \sum_{k \ne -1} \delta_{g,k} \mathbf{1}\{G_i = g\} \mathbf{1}\{t - G_i = k\} + \varepsilon_{it}. $$The coefficients $\delta_{g,k}$ consistently estimate the cohort-specific event-time effects $\tau(g, g+k)$. They are then aggregated using cohort-share weights:
$$ \hat{\theta}_k = \sum_g \frac{N_g}{N_{\text{treated}}} \hat{\delta}_{g,k}. $$By excluding already-treated units from the implicit comparison group, SA avoids the negative-weighting problem of TWFE. It produces event-study plots that are transparent and easy to interpret.
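The aggregation step is just a cohort-share-weighted mean of the interaction coefficients at a fixed event time $k$. A minimal sketch (hypothetical helper; `delta` holds the $\hat{\delta}_{g,k}$ and `n_g` the cohort sizes $N_g$):

```python
import numpy as np

def sa_event_time(delta, n_g):
    """theta_hat_k: weight each cohort's event-time-k coefficient by its
    share of treated units, N_g / N_treated."""
    d = np.asarray(delta, dtype=float)
    n = np.asarray(n_g, dtype=float)
    return float(np.sum((n / n.sum()) * d))

# Two cohorts: effects 1.0 and 3.0 at event time k, sizes 1 and 3
print(sa_event_time([1.0, 3.0], [1, 3]))  # 2.5, pulled toward the larger cohort
```

This makes the limitation noted below concrete: a dominant cohort can drive $\hat{\theta}_k$ even when smaller cohorts experience very different effects.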
Strengths: natural choice when event-time dynamics $\theta_k$ are the primary estimand; integrates cleanly with standard regression workflows; results are easy to visualise.
Limitations: requires specifying a reference period (typically $k=-1$); sensitive to this normalisation choice; cohort-share weights can be dominated by large cohorts; requires sufficient variation in event-time exposure across cohorts.
Software: sunab() in the fixest R package; eventstudyinteract in Stata.
de Chaisemartin and d’Haultfœuille (dCdH)
dCdH decomposes the TWFE estimator to identify which comparisons are “forbidden” (already-treated vs newly treated) and which are valid. It reports the fraction of the sample contributing to forbidden comparisons and proposes alternative weighting schemes that exclude or downweight them. It also provides diagnostics for whether the sign of the treatment effect can be inferred despite heterogeneity.
Best used as a diagnostic layer around TWFE, not as a standalone estimator. A practical rule of thumb: if more than ~20% of TWFE weight comes from forbidden comparisons, treat TWFE with suspicion and prefer CS or SA.
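To build intuition for the diagnostic, one can count, period by period, how many cross-cohort contrasts pit a newly treated cohort against an already-treated one. This is a hypothetical back-of-the-envelope proxy, not the actual dCdH weight decomposition (which accounts for cell sizes and regression weights):

```python
def forbidden_share(cohorts, periods):
    """Crude proxy (NOT the exact dCdH weights): among all cross-cohort
    contrasts available in each period, the share whose 'control' cohort
    is already treated. Never-treated cohorts use float('inf')."""
    cs = sorted(set(cohorts))
    total = forbidden = 0
    for t in periods:
        treated = [g for g in cs if g <= t]
        untreated = [g for g in cs if g > t]   # includes never-treated
        for g in treated:
            clean = len(untreated)             # valid comparisons
            dirty = len(treated) - 1           # forbidden comparisons
            total += clean + dirty
            forbidden += dirty
    return forbidden / total if total else 0.0

# Two staggered cohorts with no never-treated group: mostly forbidden contrasts
print(forbidden_share([2, 3], periods=[1, 2, 3, 4]))  # 0.8
print(forbidden_share([2, float("inf")], periods=[1, 2, 3]))  # 0.0
```

Even this crude count shows why designs with no never-treated group and long post-periods are where TWFE is most suspect.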
Limitations: the decomposition can be difficult to interpret with many cohorts and periods; assumes treatment is binary and absorbing.
Software: DIDmultiplegt in R; did_multiplegt in Stata.
Borusyak, Jaravel, and Spiess (BJS)
BJS takes an imputation-based approach:
- Fit an outcome model for untreated potential outcomes $Y_{it}(\infty)$ using only untreated observations (pre-treatment periods for treated units; all periods for never-treated units). The model can be unit + time fixed effects, interactive fixed effects (IFE), or factor models.
- Impute counterfactuals $\hat{Y}_{it}(\infty)$ for treated observations.
- Compute cell-level treatment effects $\hat{\tau}_{it} = Y_{it} - \hat{Y}_{it}(\infty)$ for treated observations, and average within cohort-time cells to obtain $\hat{\tau}(g,t)$.
- Aggregate with sample weights.
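The four steps above can be sketched with the simplest outcome model, unit plus time fixed effects fit by least squares on untreated cells only. This is a toy illustration of the imputation logic (hypothetical helper and column names), not the BJS implementation:

```python
import numpy as np
import pandas as pd

def imputation_att(df):
    """Fit unit + time FE on untreated cells (t < g), impute Y(inf) for
    treated cells (t >= g), and average Y - Y_hat over treated cells."""
    units = sorted(df["unit"].unique())
    times = sorted(df["t"].unique())
    ui = {u: i for i, u in enumerate(units)}
    ti = {t: i for i, t in enumerate(times)}

    def design(sub):
        # unit dummies plus time dummies (first period dropped for identification)
        X = np.zeros((len(sub), len(units) + len(times) - 1))
        for r, (u, t) in enumerate(zip(sub["unit"], sub["t"])):
            X[r, ui[u]] = 1.0
            if ti[t] > 0:
                X[r, len(units) + ti[t] - 1] = 1.0
        return X

    untreated = df[df["t"] < df["g"]]          # pre-periods + never-treated
    beta, *_ = np.linalg.lstsq(design(untreated),
                               untreated["y"].to_numpy(), rcond=None)
    treated = df[df["t"] >= df["g"]]
    tau_cell = treated["y"].to_numpy() - design(treated) @ beta
    return float(tau_cell.mean())

# Toy panel: additive unit/time effects plus a constant effect of 2 after adoption
toy = pd.DataFrame(
    [{"unit": u, "t": t, "g": g, "y": a + t + 2.0 * (t >= g)}
     for u, a, g in [(1, 0.0, 3.0), (2, 1.0, np.inf), (3, 2.0, np.inf)]
     for t in (1, 2, 3, 4)]
)
print(imputation_att(toy))  # recovers the effect exactly because the FE model is the truth
```

Swapping the FE model for an interactive fixed-effects or factor model changes only the first step; the impute-and-difference logic is unchanged.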
By pooling information across all cohort-time cells when fitting the outcome model, BJS is more efficient than CS when cohorts are small. It is particularly effective when an IFE or factor structure for untreated outcomes is credible — for example, when treated and control units are subject to common industry shocks with differential exposure.
Limitations: relies heavily on correct specification of the outcome model; if the factor structure is misspecified or the number of factors $R$ is chosen incorrectly, imputation bias can exceed that of the simpler CS estimator; estimating factors reliably requires a sufficiently long pre-treatment period (at least 5–10 periods), so avoid BJS with short panels.
Software: did2s (Gardner’s two-stage imputation, closely related) in R; did_imputation in Stata.
Choosing among estimators
| Situation | Recommended estimator |
|---|---|
| Default starting point with multiple cohorts | CS (did) |
| Event-time dynamics $\theta_k$ are primary estimand | SA (sunab) |
| Diagnosing TWFE forbidden comparisons | dCdH (DIDmultiplegt) |
| Parallel trends in raw levels implausible; IFE credible; long pre-period | BJS (did2s / did_imputation) |
| No never-treated units (all units eventually adopt) | CS with not-yet-treated controls |
| Very large panels ($N$ and $T$ both large) | BJS (TWFE version scales well; CS can be slow) |
| Short pre-treatment period (2–3 periods) | CS or SA + conditional PT with covariates; avoid BJS |
Decision by data structure:
- Abundant never-treated units + long pre-period → CS and SA both work well.
- No never-treated units → CS with not-yet-treated control option.
- Few cohorts (2–3) → SA may struggle to estimate cohort-specific effects precisely; prefer CS.
- Short pre-period → pre-trend assessment is limited; factor models (BJS) not feasible.
- Large $N$ and $T$ → BJS with TWFE scaling; CS may be slow.
Always estimate multiple methods and check for agreement. If CS, SA, and BJS produce similar estimates, conclusions are robust. If estimates diverge substantially (different signs, or magnitudes differing by more than ~50%), conclusions are fragile and should be reported as such. Present results from multiple methods, explain which assumptions drive the differences, and let readers assess the credibility of each approach.
Software summary
| Estimator | R | Stata | Python |
|---|---|---|---|
| Callaway–Sant’Anna | did::att_gt() + aggte() | csdid | — |
| Sun–Abraham | fixest::sunab() | eventstudyinteract | pyfixest |
| dCdH | DIDmultiplegt | did_multiplegt | — |
| BJS / imputation | did2s | did_imputation | — |
Software implementations evolve. Check package documentation for current syntax and options — consult vignettes rather than relying on static code snippets.
Takeaway
All modern estimators share the same core discipline: estimate $\tau(g,t)$ using clean comparisons, aggregate with transparent non-negative weights, and report the estimand alongside the estimator. The choice among CS, SA, dCdH, and BJS is driven by the data structure, the estimand of interest, and the credibility of identifying assumptions — not by package preference. When in doubt, start with CS, diagnose with dCdH, and confirm with SA or BJS.
References
- Shaw, C. (2025). Causal Inference in Marketing: Panel Data and Machine Learning Methods (Community Review Edition), Section 4.5.
- Callaway, B., and Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200–230.
- Sun, L., and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous effects. Journal of Econometrics, 225(2), 175–199.
- de Chaisemartin, C., and d’Haultfœuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9), 2964–2996.
- Borusyak, K., Jaravel, X., and Spiess, J. (2024). Revisiting event study designs: Robust and efficient estimation. Review of Economic Studies, 91(6), 3253–3285.