MMM 707: Multiple Treated Units and Staggered Adoption

Hybrid methods extend naturally to settings with multiple treated units and staggered adoption. The main challenge is no longer constructing a counterfactual for a single treated store or market, but doing so coherently across many treated units that adopt at different times. The gains from hybrids in this setting are the same as before — better pre-treatment fit through augmentation, robustness to trend violations through time weighting and more stable weights through regularisation — but we now have to combine unit-level estimates into group-time and event-time summaries. Throughout this section, let $I$ denote the set of treated units, grouped into cohorts $g$ by their adoption date, and let $J$ denote the set of potential donors. For a given calendar period $t$, let $J_{-t}$ be the set of units that have not yet been treated by period $t$ and can therefore serve as valid donors. Let $\bar{Y}_{g,t}$ denote the average outcome for cohort $g$ in period $t$, and similarly for other bar-notation quantities.

Common Intervention Times: Unit-Level vs Pooled Estimation

When several treated units adopt at a common time, there are two natural ways to use hybrids.

The first is unit-level estimation. You estimate a separate synthetic control, augmented synthetic control or SDID for each treated unit using the same donor pool and predictor set. Let $\hat{\tau}_i$ denote a chosen summary effect for treated unit $i$ (for example, the average post-treatment effect over a specified horizon). This yields unit-specific effects that you can aggregate into an overall average using policy-relevant weights, $\hat{\tau}_{\text{pooled}} = \sum_{i \in I} w_i \hat{\tau}_i$, with non-negative weights $w_i$ that sum to one and reflect, for example, equal importance across treated units, market size or revenue contribution. These aggregation weights are distinct from the donor weights $w_j$ used to build each counterfactual. The unit-level route has the advantage of transparency. In a store-format trial with five flagships, you can show five separate hybrid counterfactuals and see directly that the format lifts sales in dense urban centres but has little effect in smaller suburban locations. That heterogeneity matters for decisions about scaling. The cost is partly computational, especially for more complex hybrids, and partly practical: some treated units will inevitably be harder to match than others, and unit-level estimates for those cases will be fragile.

The second route is pooled estimation. Here you estimate a single set of donor weights that minimises an aggregate pre-treatment loss across all treated units, for example:

$$ \min_{w}\; \sum_{i \in I} \sum_{t \in T_{\text{pre}}} \left(Y_{it} - \sum_{j \in J} w_j Y_{jt}\right)^2 + \eta \lVert w \rVert_2^2, $$

subject to the usual convexity constraints $w_j \ge 0$ and $\sum_{j \in J} w_j = 1$, where $T_{\text{pre}}$ denotes the set of pre-treatment periods and $\eta \ge 0$ is a ridge penalty on the weights. The resulting $w$ defines a common synthetic control path, which you compare to the average treated path. Pooled estimation is simpler to implement and produces a single set of weights to interpret, but it can fit some treated units well and others poorly. If treatment effects or untreated dynamics vary sharply across treated units, a single pooled synthetic control will obscure that variation. In practice, many applications start with unit-level hybrids to diagnose heterogeneity and then report both unit-level estimates and a pooled summary based on policy-motivated weights.
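The pooled objective above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation. It assumes NumPy is available and uses projected gradient descent, one of several possible solvers, to enforce the simplex constraints. The function names `project_to_simplex` and `pooled_sc_weights`, the array layout and the default value of `eta` are hypothetical choices made for this example.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]  # last index kept positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def pooled_sc_weights(Y_treated, Y_donors, eta=0.1, n_iter=5000):
    """Minimise the pooled pre-treatment loss over simplex weights.

    Y_treated : array of shape (n_treated, T_pre), pre-period outcomes
    Y_donors  : array of shape (n_donors, T_pre), pre-period outcomes
    eta       : ridge penalty on the donor weights (illustrative default)
    """
    n_treated = Y_treated.shape[0]
    n_donors = Y_donors.shape[0]
    # Step size 1/L, where L bounds the curvature of the quadratic loss.
    lipschitz = 2.0 * (n_treated * np.linalg.norm(Y_donors, 2) ** 2 + eta)
    step = 1.0 / lipschitz
    w = np.full(n_donors, 1.0 / n_donors)     # start from uniform weights
    for _ in range(n_iter):
        synthetic = w @ Y_donors              # common synthetic path, (T_pre,)
        resid = Y_treated - synthetic         # one residual row per treated unit
        grad = -2.0 * Y_donors @ resid.sum(axis=0) + 2.0 * eta * w
        w = project_to_simplex(w - step * grad)
    return w


# Hypothetical pre-period data: 3 donors, 2 treated units, 4 periods.
donors = np.array([[1.0, 2.0, 3.0, 4.0],
                   [2.0, 2.0, 2.0, 2.0],
                   [4.0, 3.0, 2.0, 1.0]])
treated = np.array([[1.5, 2.0, 2.5, 3.0],
                    [1.2, 2.0, 2.8, 3.6]])
weights = pooled_sc_weights(treated, donors, eta=0.01)
print(np.round(weights, 3))
```

Because the same synthetic path is subtracted from every treated unit, the pooled loss differs from the loss for the average treated path only by a constant and a rescaling of $\eta$. This is why the fitted weights track the average treated unit rather than any individual one, and why unit-level fits remain a useful diagnostic.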

Aggregation and Policy-Relevant Weighting

Once you have unit-level or cohort-level hybrid estimates, you must decide how to weight them in forming aggregate effects. The right weights depend on the question. If you care about the average effect across the treated units in your sample (for example, for an internal cost–benefit calculation), equal weights or sample-size weights are natural. If you care about extrapolating to a larger target population, such as a national store network, aggregation weights should reflect how representative each treated unit or cohort is of that population.

Hybrid methods do not change this logic. What they change is the quality of each unit- or cohort-level effect. The aggregation step is the same as in Chapter 4: choose weights that answer your substantive question, check that they are non-negative and interpretable, and report them.
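To make the role of the aggregation weights concrete, the sketch below computes the same set of unit-level effects under two weighting schemes. The store names, effect sizes and revenue figures are invented for illustration, and `aggregate_effects` is a hypothetical helper, not part of any library.

```python
def aggregate_effects(effects, raw_weights):
    """Weighted average of unit- or cohort-level effects.

    effects     : dict mapping unit -> estimated effect
    raw_weights : dict mapping unit -> non-negative raw weight
    Returns the aggregate estimate and the normalised weights used.
    """
    if set(effects) != set(raw_weights):
        raise ValueError("effects and weights must cover the same units")
    if any(w < 0 for w in raw_weights.values()):
        raise ValueError("aggregation weights must be non-negative")
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    weights = {u: w / total for u, w in raw_weights.items()}
    estimate = sum(weights[u] * effects[u] for u in effects)
    return estimate, weights


# Hypothetical unit-level effects from three treated stores.
tau_hat = {"store_A": 4.0, "store_B": 1.0, "store_C": 0.0}

# Question 1: average effect across the treated stores in the sample.
equal_est, equal_w = aggregate_effects(tau_hat, {u: 1.0 for u in tau_hat})

# Question 2: revenue-weighted effect (hypothetical revenue shares).
revenue = {"store_A": 50.0, "store_B": 30.0, "store_C": 20.0}
rev_est, rev_w = aggregate_effects(tau_hat, revenue)

print("equal weights:", round(equal_est, 3), equal_w)
print("revenue weights:", round(rev_est, 3), rev_w)
```

The two summaries differ only because the weights differ, which is the reason to report the weights alongside the estimate.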

Staggered Adoption and Event-Time Effects

When units adopt at different times, we index cohorts by their adoption date $g$. For each cohort $g$ and calendar period $t \ge g$, we can estimate a cohort–time treatment effect $\tau(g, t)$ by applying a hybrid estimator to the comparison between cohort $g$ and donors in $J_{-t}$ [Ben-Michael et al., 2022]. As defined above, $J_{-t}$ is the set of units that are not yet treated at time $t$ and therefore remain valid donors for cohort $g$ at that date. This choice of donor set matters for the estimand: the resulting estimate targets the effect for cohort $g$ at time $t$ relative to not-yet-treated donors (and never-treated donors, when available). For SDID, for instance, we estimate unit weights $\hat{w}_{jg}$ and time weights $\hat{v}_{sg}$ on the pre-treatment panel for cohort $g$ and its donors [Arkhangelsky et al., 2021], and then form a doubly differenced comparison between the cohort’s average outcome and its synthetic control in period $t$, exactly as in Section 7.4 but now applied to cohort means rather than a single unit.

We can convert these cohort–time effects into event-time profiles by defining event time $k = t - g$ and averaging over the cohorts that contribute data at each $k$. If $G_k$ is the set of cohorts with observations at event time $k$, we can form:

$$ \hat{\theta}_k = \sum_{g \in G_k} w_g \, \hat{\tau}(g, g + k), $$

with weights $w_g \ge 0$ that sum to one within each $G_k$. This is the same event-time aggregation logic developed in Chapter 5. The only difference is that hybrid methods improve the construction of $\hat{\tau}(g, t)$ inside each cohort–time cell by using unit and time weights and, for ASCM, regression adjustments.
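The event-time aggregation $\hat{\theta}_k$ can be written as a short routine. The sketch below assumes the cohort–time estimates $\hat{\tau}(g, t)$ have already been produced by some hybrid estimator and are stored in a dictionary keyed by `(g, t)`. The function name `event_time_profile`, the data layout and the numbers are illustrative assumptions.

```python
from collections import defaultdict


def event_time_profile(tau, cohort_weights):
    """Aggregate cohort-time effects into event-time effects theta_k.

    tau            : dict mapping (g, t) -> estimated tau(g, t), with t >= g
    cohort_weights : dict mapping g -> non-negative raw cohort weight
    Returns {k: (theta_k, {g: normalised weight within G_k})}.
    """
    cells_by_k = defaultdict(dict)
    for (g, t), estimate in tau.items():
        if t >= g:                         # keep post-adoption cells only
            cells_by_k[t - g][g] = estimate

    profile = {}
    for k in sorted(cells_by_k):
        cells = cells_by_k[k]              # cohorts in G_k
        total = sum(cohort_weights[g] for g in cells)
        # Renormalise so that weights sum to one within each G_k.
        w = {g: cohort_weights[g] / total for g in cells}
        theta_k = sum(w[g] * cells[g] for g in cells)
        profile[k] = (theta_k, w)
    return profile


# Hypothetical estimates: cohort 3 is observed for four post-periods,
# cohort 5 for two, in a panel that ends at period 6.
tau_hat = {
    (3, 3): 1.0, (3, 4): 2.0, (3, 5): 3.0, (3, 6): 4.0,
    (5, 5): 2.0, (5, 6): 4.0,
}
raw_weights = {3: 1.0, 5: 3.0}             # e.g. cohort sizes

for k, (theta_k, w) in event_time_profile(tau_hat, raw_weights).items():
    print(k, round(theta_k, 3), w)
```

In this example only cohort 3 contributes at $k = 2$ and $k = 3$, so the later part of the profile reflects the early adopters alone. Returning the within-$G_k$ weights alongside each estimate makes that composition change visible.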

A practical detail that matters in staggered designs is donor-pool evolution. Early-adopting cohorts enjoy large donor pools but relatively short pre-treatment windows. Late-adopting cohorts have longer pre-periods but smaller sets of not-yet-treated controls. Hybrid estimators help in both directions — ASCM and ridge SC stabilise estimation with short pre-periods, while SDID’s time weights focus comparisons on more comparable segments of the pre-period — but they cannot create donors where none exist. You should always inspect which cohorts contribute to each event-time estimate and how the donor pool changes over event time.
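A simple way to inspect donor-pool evolution is to tabulate, for each cohort and event time, how many not-yet-treated units remain. The sketch below assumes adoption dates are stored as integers, with `math.inf` marking never-treated units. The helper names and the toy panel are hypothetical.

```python
import math


def not_yet_treated(adoption, t):
    """Units whose adoption date is after period t, i.e. the donor set J_{-t}."""
    return {unit for unit, g in adoption.items() if g > t}


def donor_support(adoption, first_period, last_period):
    """Tabulate pre-period length and donor count by cohort and event time."""
    cohorts = sorted({g for g in adoption.values() if math.isfinite(g)})
    rows = []
    for g in cohorts:
        for t in range(g, last_period + 1):
            rows.append({
                "cohort": g,
                "event_time": t - g,
                "n_pre_periods": g - first_period,
                "n_donors": len(not_yet_treated(adoption, t)),
            })
    return rows


# Hypothetical panel: periods 1..6, two cohorts and two never-treated units.
adoption_dates = {"a": 3, "b": 3, "c": 5, "d": 5, "e": math.inf, "f": math.inf}

for row in donor_support(adoption_dates, first_period=1, last_period=6):
    print(row)
```

In this toy panel the early cohort starts with four donors but only two pre-periods, and its donor pool halves once the second cohort adopts. The late cohort has four pre-periods but only the two never-treated units as donors throughout.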

Connection to Group–Time Effects

The modern difference-in-differences literature, reviewed in Chapter 4, organises staggered designs around cohort–time effects $\tau(g, t)$ [Callaway and Sant’Anna, 2021; Sun and Abraham, 2021]. Many papers describe these as group–time ATT objects. Hybrid methods plug directly into this framework. For each $g$ and $t$, you construct a hybrid counterfactual for cohort $g$ using donors in $J_{-t}$; the gap between observed and synthetic outcomes in period $t$ is an estimate of $\tau(g, t)$. Formally, if $\hat{Y}_{g,t}(0)$ is the hybrid counterfactual for cohort $g$ at time $t$, then $\hat{\tau}(g, t) = \bar{Y}_{g,t} - \hat{Y}_{g,t}(0)$ feeds into the aggregation formulas in Chapter 4. Aggregating these cohort–time effects over cohorts and time yields the same overall, calendar-time or event-time summaries as in Chapter 4; hybrid methods do not change the aggregation algebra.

This also clarifies the relationship to the negative-weighting problem. As Chapter 4 shows, certain naive aggregation schemes over cohorts and time can assign negative weights to some $\tau(g, t)$ terms when treatment effects are heterogeneous [Sun and Abraham, 2021]. Estimating $\tau(g, t)$ with hybrids does not, by itself, eliminate negative weighting. What prevents negative weighting is structuring the analysis cohort-by-cohort and using aggregation schemes that keep weights non-negative and interpretable. The contribution of hybrids is to improve the quality of each $\tau(g, t)$ estimate by constructing better counterfactuals from not-yet-treated donors.
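The definition $\hat{\tau}(g, t) = \bar{Y}_{g,t} - \hat{Y}_{g,t}(0)$ can be illustrated for a single cohort–time cell. The sketch below takes the donor weights, and optionally a set of pre-period time weights, as given. It does not estimate them. With time weights it applies a doubly differenced comparison in the commonly used SDID form, which is an assumption here and may differ in detail from the construction in Section 7.4. All names and numbers are hypothetical.

```python
def cohort_time_effect(Y, treated, donors, unit_w, t, time_w=None):
    """Estimate tau(g, t) as the cohort mean minus a hybrid counterfactual.

    Y       : dict mapping unit -> {period: outcome}
    treated : list of units in cohort g
    donors  : list of units not yet treated at period t
    unit_w  : dict mapping donor -> weight (assumed already estimated)
    time_w  : optional dict mapping pre-period -> weight; if given, the
              time-weighted pre-period gap is added to the counterfactual
    """
    def cohort_mean(s):
        return sum(Y[u][s] for u in treated) / len(treated)

    def synthetic(s):
        return sum(unit_w[j] * Y[j][s] for j in donors)

    counterfactual = synthetic(t)
    if time_w is not None:
        # Doubly differenced version: shift the synthetic path by the
        # weighted average pre-period gap between cohort and synthetic.
        counterfactual += sum(
            v * (cohort_mean(s) - synthetic(s)) for s, v in time_w.items()
        )
    return cohort_mean(t) - counterfactual


# Hypothetical panel: cohort g = 3 (unit "a"), two never-treated donors.
Y = {
    "a":  {1: 12.0, 2: 13.0, 3: 17.0, 4: 18.0},
    "d1": {1: 12.0, 2: 13.0, 3: 14.0, 4: 15.0},
    "d2": {1: 8.0,  2: 9.0,  3: 10.0, 4: 11.0},
}
w = {"d1": 0.5, "d2": 0.5}

gap_only = cohort_time_effect(Y, ["a"], ["d1", "d2"], w, t=3)
double_diff = cohort_time_effect(
    Y, ["a"], ["d1", "d2"], w, t=3, time_w={1: 0.5, 2: 0.5}
)
print(gap_only, double_diff)
```

In this toy example the cohort sits a constant two units above its synthetic control before adoption, so the simple gap overstates the effect by that level difference while the doubly differenced version removes it.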

Practical Challenges in Staggered Panels

Two practical issues recur in staggered applications. The first is variation in pre-treatment length across cohorts. Early-adopting cohorts may have relatively few pre-periods, while late adopters have many. With very short pre-periods, synthetic control weights are under-identified and can overfit. In those cases, SDID with time regularisation or ridge SC can be more stable than unregularised SC, because they shrink weights and, in SDID’s case, focus attention on the most informative parts of the pre-period. At the extreme, when a cohort has only a handful of pre-treatment observations, you should be cautious about relying on any weighting-based method and consider reporting that cohort separately with appropriate caveats.

The second issue is shrinking donor pools. As more cohorts adopt treatment, fewer not-yet-treated units remain available as donors. By the last adoption wave, only never-treated units are left. When the donor pool drops to only a handful of units, hybrid estimators are forced to lean heavily on those remaining donors, and weight stability deteriorates. Monitoring donor-pool size and composition across cohorts, and being explicit about which cohorts have robust donor support, is essential.

Choosing Between Hybrid Methods in Staggered Designs

The basic trade-offs between SC, ASCM and SDID carry over to staggered adoption. SDID is attractive when you believe that, after reweighting units and periods, treated and not-yet-treated cohorts satisfy a weighted parallel-trends condition, and when pre-treatment dynamics differ across cohorts. Its time weights can down-weight months or quarters where cohort trends diverge, making it easier to isolate a common component. ASCM is more appealing when certain cohorts are clear outliers relative to donors and simple reweighting cannot achieve good fit, or when you have strong prior beliefs about which observables drive outcome differences across cohorts. In those cases, the regression adjustment gives you an additional lever to repair residual imbalances that SC and SDID leave behind. Standard SC remains competitive when cohorts lie comfortably within the donor convex hull and achieve excellent pre-treatment fit. In those rare but pleasant cases, the extra structure and tuning parameters of SDID or ASCM may not buy you much.

In practice, the best way forward is empirical. For a subset of cohorts, estimate SC, ASCM and SDID using the same donor sets $J_{-t}$ and compare pre-period fit, event-time pre-trends and post-period estimates. If all three methods tell a consistent story, you can report the simplest design with more confidence. If they diverge, that divergence is itself diagnostic. In terms of identification, severe discrepancies between SC and SDID point to tension between convex-hull coverage and (weighted) parallel trends, whereas discrepancies between SC/SDID and ASCM highlight sensitivity to the outcome-regression component. Either way, divergence tells you that some combination of convex-hull coverage, weighted parallel trends and outcome modelling is failing, and that any conclusions about long-run dynamic effects in the part of the event-time profile where the methods disagree should be presented with appropriate caution.
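The comparison described above can be organised as a small table of pre-period fit and post-period estimates per method. The sketch below assumes each method has already produced a counterfactual path for the cohort. It does not fit SC, ASCM or SDID itself. The paths, the method labels and the helper names are made up for illustration.

```python
import math


def pre_period_rmse(observed, counterfactual, pre_periods):
    """Root mean squared pre-treatment gap between observed and counterfactual."""
    gaps = [observed[s] - counterfactual[s] for s in pre_periods]
    return math.sqrt(sum(gap * gap for gap in gaps) / len(gaps))


def compare_methods(observed, counterfactuals, pre_periods, post_periods):
    """Summarise fit and average post-period effect for each method.

    observed        : dict mapping period -> observed cohort mean
    counterfactuals : dict mapping method name -> {period: counterfactual}
    """
    summary = {}
    for name, path in counterfactuals.items():
        post_gaps = [observed[t] - path[t] for t in post_periods]
        summary[name] = {
            "pre_rmse": pre_period_rmse(observed, path, pre_periods),
            "avg_post_effect": sum(post_gaps) / len(post_gaps),
        }
    return summary


# Hypothetical cohort mean and two hypothetical counterfactual paths.
observed_path = {1: 10.0, 2: 11.0, 3: 15.0, 4: 16.0}
paths = {
    "method_1": {1: 10.0, 2: 11.0, 3: 12.0, 4: 13.0},
    "method_2": {1: 9.0, 2: 12.0, 3: 12.0, 4: 14.0},
}

for name, stats in compare_methods(observed_path, paths, [1, 2], [3, 4]).items():
    print(name, stats)
```

A table like this is only a starting point. Agreement on the average post-period effect can still hide disagreement at particular event times, so the event-time profiles themselves are worth plotting side by side.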

Reporting and Transparency

Staggered hybrid designs place a premium on clear reporting. At minimum, you should report how each cohort contributes to aggregate and event-time estimates. In practice, this means tabulating the cohort weights $w_g$ used in each event-time estimate $\hat{\theta}_k$, listing $G_k$ for key values of $k$, and indicating where only early-adopting cohorts contribute. You should show pre-treatment fit diagnostics for each cohort (plots of treated vs hybrid counterfactual paths and summary measures of fit) and flag cohorts with visibly poor match. Finally, you should treat alternative aggregation schemes and estimator choices as sensitivity analyses, not afterthoughts. Reweight cohorts in ways that reflect different policy questions, rerun the analysis with SC, ASCM and SDID where feasible, and show how the dynamic treatment effect profile changes. When estimates prove stable across these perturbations, you have a much stronger case that your findings about campaign ramp-up, peak and decay are not artefacts of a particular hybrid specification.

7.8 Tuning, Implementation, and Donor Curation

References

Shaw, C. (2025). Causal Inference in Marketing: Panel Data and Machine Learning Methods (Community Review Edition), Section 7.7.