MMM 703: Regularised and Balancing Variants of SC

Standard synthetic control can overfit. When the donor pool is large and the pre-treatment period short, the optimisation can find weights that chase noise rather than signal. A donor that happens to share an idiosyncratic seasonal blip with the treated unit may receive high weight for the wrong reason. The synthetic control then fits pre-treatment outcomes tightly but predicts post-treatment outcomes poorly.

Return to the flagship stores. The retailer has twenty potential control stores. Standard synthetic control assigns weight 0.45 to one suburban store that happens to experience a similar seasonal spike in month fourteen, weight 0.35 to a regional store with an unrelated promotional calendar, and spreads the remaining 0.20 across three others. The pre-treatment root mean squared prediction error (RMSPE) looks impressively low. Yet the weights reflect coincidence, not structural similarity. When the loyalty programme launches, the synthetic control drifts away from the trajectory the flagships would have followed without it, because the weighted donors share noise, not fundamentals.

Regularised synthetic control attacks this problem by penalising weight configurations that are overly concentrated. Instead of letting the optimisation pursue pre-treatment fit at any cost, regularisation shrinks weights towards simpler, more diffuse patterns. This typically sacrifices a small amount of in-sample fit in exchange for weights that are more stable and less sensitive to sampling variation.

Ridge Synthetic Control

Ridge synthetic control adds an $L_2$ penalty to the standard optimisation problem [Doudchenko and Imbens, 2016]. Let $X_1$ collect pre-treatment outcomes and covariates for the treated unit and let $X_0$ stack the corresponding donor values, so that $X_1 \in \mathbb{R}^k$ and $X_0 \in \mathbb{R}^{k \times N_0}$, where $N_0$ is the number of donors. Let $V$ weight the predictors as in Chapter 6. Ridge SC solves

$$ \min_w \|X_1 - X_0 w\|_V^2 + \eta \|w\|_2^2 \quad \text{s.t. } w_j \ge 0,\ \sum_j w_j = 1, $$

where $\|a\|_V^2 := a' V a$, $\eta \ge 0$ controls the strength of the penalty, and $\|w\|_2^2 = \sum_j w_j^2$. Under the simplex constraints, penalising $\|w\|_2^2$ discourages putting all the mass on one or two donors and instead favours more diffuse weights. In fact, if $\bar{w}$ denotes the uniform vector with components $1/N_0$, then, because $\sum_j w_j = 1$, we have $\|w - \bar{w}\|_2^2 = \|w\|_2^2 - 2/N_0 + 1/N_0 = \|w\|_2^2 - 1/N_0$, so penalising $\|w\|_2^2$ is equivalent (up to an additive constant) to shrinking towards uniform weights. Uniform weights $w_j = 1/N_0$ minimise the penalty, while highly concentrated weights incur a large penalty. In terms of the variance decomposition in equation (6.24), shrinking $\|w\|_2^2$ reduces the Herfindahl index $\sum_j (w_j^*)^2$ and thereby the idiosyncratic-variance component $\sigma^2 (1 + \sum_j (w_j^*)^2)$.

The parameter $\eta$ governs the trade-off. If you set $\eta = 0$ you recover standard synthetic control. As $\eta$ grows, the objective gives more weight to simplicity and less to pre-treatment fit, and the solution moves towards uniform weights. The goal is to choose an intermediate $\eta$ that retains enough structure from the pre-treatment data while damping out idiosyncratic noise.

A practical way to do this is to use the same cross-validation logic introduced for ASCM. Split the pre-treatment period into training and validation blocks, estimate ridge SC over a grid of $\eta$ values on the training block, and select the $\eta$ that minimises prediction error on the validation block. Cross-validation can reduce overfitting to pre-period noise, but it cannot validate that the resulting weights will produce unbiased post-treatment counterfactuals.

Applied to the flagship stores, ridge SC with a cross-validated $\eta$ will typically spread weights across more donors. Pre-treatment RMSPE might increase from 0.03 to 0.05, which is a modest deterioration in fit. In return, the synthetic control reflects a broader set of stores with similar customer demographics and format characteristics, rather than a narrow pair that merely share a transient seasonal spike.

Regularisation is not free. If the true counterfactual is genuinely well-approximated by a sparse combination of donors that closely resemble the flagships, ridge SC dilutes those weights and introduces bias. You trade accuracy in those best-case designs for robustness in the more common case where the data are noisy and the pre-period is short.

From a design perspective, standard SC and ridge SC both construct counterfactuals as weighted averages of observed donors, using the same predictor set. Ridge SC simply adds a stronger structural assumption on the weights: that you prefer diffuse combinations unless the data provide a compelling reason to concentrate.
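The ridge problem and the train/validation split for $\eta$ can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the book's implementation: it defaults $V$ to the identity, uses pre-treatment outcomes as the only predictors, and solves the optimisation by projected gradient descent onto the simplex. The function names `project_simplex`, `ridge_sc` and `select_eta`, the iteration count, the $\eta$ grid and the simulated panel are all hypothetical choices made for the example.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w_j >= 0, sum_j w_j = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def ridge_sc(X1, X0, eta, V=None, n_iter=5000):
    """Minimise ||X1 - X0 w||_V^2 + eta * ||w||_2^2 over the simplex."""
    k, n0 = X0.shape
    V = np.eye(k) if V is None else V      # assumption: identity V by default
    G = X0.T @ V @ X0                      # quadratic part of the fit term
    b = X0.T @ V @ X1                      # linear part of the fit term
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * (np.linalg.eigvalsh(G)[-1] + eta))
    w = np.full(n0, 1.0 / n0)              # start from uniform weights
    for _ in range(n_iter):
        grad = 2.0 * (G @ w - b) + 2.0 * eta * w
        w = project_simplex(w - step * grad)
    return w

def select_eta(y1_pre, Y0_pre, etas, n_train):
    """Fit on the first n_train pre-periods, score RMSPE on the remainder."""
    errors = []
    for eta in etas:
        w = ridge_sc(y1_pre[:n_train], Y0_pre[:n_train], eta)
        resid = y1_pre[n_train:] - Y0_pre[n_train:] @ w
        errors.append(float(np.sqrt(np.mean(resid ** 2))))
    return etas[int(np.argmin(errors))], errors

# Simulated panel (assumption): 24 pre-periods, 20 donors, and a treated unit
# built as a noisy mix of the first three donors.
rng = np.random.default_rng(0)
T0, N0 = 24, 20
trend = np.linspace(0.0, 1.0, T0)[:, None]
Y0_pre = 1.0 + rng.uniform(0.0, 1.0, N0) * trend + 0.1 * rng.standard_normal((T0, N0))
y1_pre = Y0_pre[:, :3] @ np.array([0.5, 0.3, 0.2]) + 0.1 * rng.standard_normal(T0)

etas = [0.0, 0.01, 0.1, 1.0, 10.0]
best_eta, errors = select_eta(y1_pre, Y0_pre, etas, n_train=16)
w_hat = ridge_sc(y1_pre, Y0_pre, best_eta)   # refit on the full pre-period
print(best_eta, np.round(w_hat, 3))
```

Setting `eta=0.0` in `ridge_sc` gives the standard SC weights for this predictor set, and a very large `eta` drives the weights towards $1/N_0$, which is a quick sanity check on any implementation. The single train/validation split mirrors the text. The selected $\eta$ still says nothing about post-treatment bias.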

Balancing Synthetic Control

Ridge SC regularises via the objective function. Balancing SC takes a complementary route and builds explicit covariate balance into the constraints. The optimisation problem becomes

$$ \min_w \|w\|_2^2 \quad \text{s.t. } w_j \ge 0,\ \sum_j w_j = 1,\ \|X_1 - X_0 w\|_B \le \delta, $$

where $\|\cdot\|_B$ measures imbalance in a chosen set of balance statistics and $\delta$ is a tolerance parameter chosen ex ante as a credibility constraint. For example, you might take $\|a\|_B := \sqrt{a' B a}$ for a positive semidefinite weighting matrix $B$ built from standardised mean differences over a selected covariate subset. The objective again prefers diffuse weights, while the balance constraint insists that the synthetic control match the treated unit on key predictors within tolerance $\delta$.

This formulation inverts the emphasis of standard SC. Standard SC minimises a measure of covariate imbalance subject to convexity constraints on the weights. Balancing SC minimises weight complexity while treating covariate balance as non-negotiable. The inversion matters when you have strong prior views about which characteristics must be matched for the design to be credible.

Consider a DMA-level advertising study where the treated DMA (San Francisco) has population 4.7 million, median household income 112,000 dollars, and baseline monthly sales of 2.3 million dollars. Suppose the analyst believes that any credible counterfactual must match these characteristics within about 10 per cent. Standard SC might achieve an impressive match on pre-treatment sales by assigning large weight to a smaller DMA whose sales path happens to track San Francisco but whose demographics look nothing like it. Balancing SC encodes the demographic requirements directly: it looks for weights that bring standardised differences in population, income, and baseline sales below a chosen threshold, even if that forces some deterioration in the fit of the pre-treatment sales path.

The tolerance parameter $\delta$ tunes this tension. Very small $\delta$ values demand near-exact balance and may make the problem infeasible if the treated unit lies outside the donor convex hull. Very large $\delta$ values relax the constraint so much that the solution drifts towards uniform weights, which minimise $\|w\|_2^2$ on the simplex. In practice you choose $\delta$ by inspecting standardised mean differences for the critical covariates. A common heuristic is to aim for differences below 0.1 in absolute value, tightening or loosening the threshold according to how predictive each covariate is for outcomes.

Balancing SC and ridge SC serve different purposes. Balancing SC is useful when particular covariates, such as demographics or baseline volume, are non-negotiable for credibility and you want that requirement enforced mechanically. Ridge SC is better suited when you have a large donor pool relative to the pre-period and are primarily worried about overfitting idiosyncratic noise, without strong prior views about specific covariates beyond the predictor set already used in SC.
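The balancing problem can be sketched with a general-purpose constrained solver. The example below is an illustration under stated assumptions rather than a reference implementation: it hands the problem to SciPy's SLSQP routine (a swapped-in solver, chosen because the quadratic balance constraint rules out a simple simplex projection), defaults $B$ to the identity on covariates assumed to be already standardised, and uses simulated covariates in place of real DMA data. The function name `balancing_sc` and the numerical slack `tol` are hypothetical.

```python
import warnings
import numpy as np
from scipy.optimize import minimize

def balancing_sc(X1, X0, delta, B=None, tol=1e-6):
    """Minimise ||w||_2^2 over the simplex subject to ||X1 - X0 w||_B <= delta."""
    k, n0 = X0.shape
    B = np.eye(k) if B is None else B      # assumption: covariates already standardised

    def imbalance_sq(w):
        d = X1 - X0 @ w
        return float(d @ B @ d)

    constraints = [
        {"type": "eq", "fun": lambda w: np.sum(w) - 1.0},
        # SLSQP treats "ineq" constraints as fun(w) >= 0.
        {"type": "ineq", "fun": lambda w: delta ** 2 - imbalance_sq(w)},
    ]
    res = minimize(
        lambda w: float(w @ w),
        np.full(n0, 1.0 / n0),             # start from uniform weights
        jac=lambda w: 2.0 * w,
        bounds=[(0.0, 1.0)] * n0,
        constraints=constraints,
        method="SLSQP",
        options={"maxiter": 500, "ftol": 1e-9},
    )
    w = np.clip(res.x, 0.0, None)
    # Verify the constraints directly instead of trusting the solver status alone.
    feasible = abs(w.sum() - 1.0) < tol and np.sqrt(max(imbalance_sq(w), 0.0)) <= delta + tol
    if not feasible:
        raise ValueError("balance tolerance not met; relax delta or revisit the donor pool")
    if not res.success:
        warnings.warn("SLSQP did not report convergence: " + res.message)
    return w

# Simulated example (assumption): 3 standardised covariates, 8 donors, and a
# treated unit placed inside the donor convex hull so the problem is feasible.
rng = np.random.default_rng(1)
X0 = rng.standard_normal((3, 8))
w_true = np.array([0.6, 0.3, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0])
X1 = X0 @ w_true
w_bal = balancing_sc(X1, X0, delta=0.1)
print(np.round(w_bal, 3), np.linalg.norm(X1 - X0 @ w_bal))
```

Calling `balancing_sc` with a treated unit far outside the donor hull and a small `delta` raises the `ValueError`, which is the infeasibility case described above. With a very large `delta` the balance constraint is slack and the uniform starting point is already optimal.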

Elastic Net and Other Variants

Elastic net synthetic control combines $L_1$ and $L_2$ penalties in the objective. In regression settings the $L_1$ term induces sparsity by pushing many coefficients exactly to zero. In synthetic control designs with convexity constraints, many donors already receive zero weight even without an $L_1$ penalty, so in typical marketing panels the additional sparsity from an explicit $L_1$ term is modest. Under the non-negativity and sum-to-one constraints, $\|w\|_1 = 1$ for all feasible $w$, so any $L_1$ penalty must operate on deviations from a target (for example, $\|w - \bar{w}\|_1$) or be embedded in a formulation without the simplex constraint. In practice, this limits the incremental impact of $L_1$ penalties beyond what convexity already imposes. Elastic net SC can be useful when the donor pool is extremely large and you want aggressive regularisation, but ridge SC will suffice in most applications covered in this book.

Other weighting schemes push in similar directions. Entropy-balancing weights, for example, minimise a divergence measure from a reference distribution subject to moment constraints on covariates [Hainmueller, 2012], and can be adapted to panel settings by balancing on pre-treatment summaries (for example, means and trends) and then applying the resulting weights to the panel. Kernel-based approaches weight donors by similarity in a richer feature space, while Bayesian formulations place priors directly on weights or counterfactual trajectories. Each of these methods imposes a particular structure on how weights should behave. They differ in how they trade fit against stability and interpretability, but they share the core principle that some regularisation of weights is often necessary to obtain reliable out-of-sample performance.
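The point that a plain $L_1$ penalty is constant on the simplex is easy to check numerically. The short sketch below compares $\|w\|_1$, $\|w - \bar{w}\|_1$ and $\|w\|_2^2$ for three feasible weight vectors. The helper names are hypothetical, and the "flagship-style" vector assumes the remaining 0.20 from the earlier example is split evenly across three donors, which the text does not specify.

```python
import numpy as np

def l1_norm(w):
    """Plain L1 norm; equal to 1 for every vector on the simplex."""
    return float(np.abs(w).sum())

def l1_deviation(w, target):
    """L1 distance from a target weight vector, such as uniform weights."""
    return float(np.abs(w - target).sum())

n0 = 20
w_bar = np.full(n0, 1.0 / n0)               # uniform target
w_vertex = np.eye(n0)[0]                    # all mass on one donor
# Assumption: the remaining 0.20 is split evenly across three donors.
w_example = np.array([0.45, 0.35] + [0.2 / 3] * 3 + [0.0] * 15)

for name, w in [("uniform", w_bar), ("example", w_example), ("vertex", w_vertex)]:
    print(name, l1_norm(w), l1_deviation(w, w_bar), float(w @ w))
```

All three vectors have $\|w\|_1 = 1$ up to floating-point error, so that term cannot distinguish them. The deviation from uniform does: it is 0 for uniform weights, 1.5 for the flagship-style vector, and $2(1 - 1/N_0) = 1.9$ when all mass sits on a single donor.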

Extrapolation and Interpolation

Regularisation also changes where the synthetic control sits inside the donor space. Unregularised SC interpolates inside the convex hull of donors and often places substantial weight on boundary donors that most closely resemble the treated unit (that is, donors whose predictor vectors lie near the edge of the convex hull of $X_0$ in predictor space), as discussed in Chapter 6. Ridge SC shrinks those weights towards the centre of the donor distribution and reduces reliance on any single boundary donor. This is helpful when the treated unit is well represented in the middle of the donor cloud, because the synthetic control becomes less sensitive to outlier donors. It can hurt when the treated unit genuinely lies near the boundary. Shrinking weights towards the centre then pulls the synthetic control away from the treated unit’s true untreated trajectory.

Balancing SC brings its own potential failure mode. When covariate constraints are tight and the donor pool does a poor job of spanning the treated unit’s characteristics, the optimisation may be forced onto donors that match on demographics but differ sharply on outcome dynamics. The resulting synthetic control satisfies the covariate balance criteria but predicts pre-treatment outcomes poorly.

As in the earlier SC sections, diagnostics start with pre-treatment fit and covariate balance. If regularisation causes a dramatic increase in RMSPE or clearly worsens the alignment of pre-treatment paths, the method is fighting the data and should be treated with suspicion. If it modestly increases RMSPE while stabilising weights and preserving acceptable balance on key covariates, the trade-off is more likely to be favourable.
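The fit-versus-stability comparison can be put side by side with a few helper functions. The sketch below is illustrative only: the function names are hypothetical, the weight vectors are assumed to come from whichever solvers you use, and weight concentration is summarised by $1 / \sum_j w_j^2$, a common definition of an effective number of donors that is assumed here rather than taken from the text.

```python
import numpy as np

def rmspe(y1_pre, Y0_pre, w):
    """Root mean squared prediction error over the pre-treatment period."""
    return float(np.sqrt(np.mean((y1_pre - Y0_pre @ w) ** 2)))

def herfindahl(w):
    """Weight concentration: sum of squared weights."""
    return float(np.sum(np.asarray(w, dtype=float) ** 2))

def n_eff(w):
    """Effective number of donors, assumed here to be 1 / sum_j w_j^2."""
    return 1.0 / herfindahl(w)

def regularisation_report(y1_pre, Y0_pre, w_standard, w_regularised):
    """Compare pre-treatment fit and weight concentration for two weight vectors."""
    base = rmspe(y1_pre, Y0_pre, w_standard)
    reg = rmspe(y1_pre, Y0_pre, w_regularised)
    return {
        "rmspe_standard": base,
        "rmspe_regularised": reg,
        "rmspe_ratio": reg / base if base > 0 else float("inf"),
        "n_eff_standard": n_eff(w_standard),
        "n_eff_regularised": n_eff(w_regularised),
    }

# Toy panel (assumption): 2 pre-periods and 2 donors, with hand-picked weights.
Y0_pre = np.array([[1.0, 2.0], [3.0, 4.0]])
y1_pre = np.array([1.5, 4.5])
report = regularisation_report(y1_pre, Y0_pre, np.array([0.0, 1.0]), np.array([0.5, 0.5]))
print(report)
```

A large `rmspe_ratio` with little gain in the effective number of donors matches the "fighting the data" case described above. A ratio only modestly above one together with a clear rise in the effective number of donors matches the favourable case.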

Choosing Among Variants

The choice among standard, ridge, and balancing synthetic control depends on both the data structure and the credibility constraints of the application. In practice, choices among variants should be driven by pre-treatment RMSPE, effective number of donors $N_{\mathrm{eff}}$, covariate balance, and sensitivity of estimated effects to small changes in the donor pool and predictor set, as discussed in Section 6.5.

Standard SC is well suited to settings where the donor pool is modest, the pre-treatment period is long enough to pin down stable weights, and the treated unit lies comfortably inside the donor convex hull. In those designs, unregularised weights often achieve tight pre-treatment fit without obvious instability.

Ridge SC is more appropriate when the donor pool is large relative to the pre-treatment period and you see signs of overfitting, such as highly concentrated weights on donors that share only idiosyncratic patterns with the treated unit. The penalty spreads weight across more donors and typically improves out-of-sample performance at the cost of a small increase in pre-treatment RMSPE.

Balancing SC is most attractive when certain covariates are critical to the story you will tell. In the DMA advertising study, for instance, a marketing executive is unlikely to accept a counterfactual that differs sharply from San Francisco on population or income, regardless of how well it matches sales trends. In that case you should encode demographic balance as a hard design requirement via balancing SC.

In practice you do not need to commit to a single variant. A robust workflow applies standard SC and one or two regularised variants to the same campaign and compares both pre-treatment diagnostics and estimated treatment paths. When the estimates and diagnostics line up across methods, you gain confidence that conclusions are not driven by a particular regularisation choice. When they diverge, that disagreement is itself an important finding. It signals that some combination of convex-hull coverage, covariate balance, and regularisation structure is failing, and that your substantive conclusions should reflect this uncertainty explicitly rather than resting on a single preferred specification.
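The comparison step in this workflow can be sketched as follows. The example takes weight vectors from several variants as given, computes each variant's estimated effect path, and reports how far the average effects spread. It is a minimal illustration: the function names, the toy post-treatment panel and the two weight vectors are hypothetical, and in practice the weights would come from your standard, ridge and balancing fits.

```python
import numpy as np

def effect_path(y1_post, Y0_post, w):
    """Estimated effect per post-period: treated outcome minus synthetic control."""
    return y1_post - Y0_post @ w

def compare_variants(y1_post, Y0_post, weights_by_variant):
    """Return effect paths, average effects, and the spread of averages across variants."""
    paths = {name: effect_path(y1_post, Y0_post, np.asarray(w, dtype=float))
             for name, w in weights_by_variant.items()}
    averages = {name: float(path.mean()) for name, path in paths.items()}
    spread = max(averages.values()) - min(averages.values())
    return paths, averages, spread

# Toy post-treatment panel (assumption): 3 periods, 2 donors, made-up weights.
Y0_post = np.array([[10.0, 20.0], [10.0, 20.0], [10.0, 20.0]])
y1_post = np.array([18.0, 18.0, 18.0])
weights_by_variant = {
    "standard": [0.5, 0.5],
    "ridge": [0.25, 0.75],
}
paths, averages, spread = compare_variants(y1_post, Y0_post, weights_by_variant)
print(averages, spread)
```

In this toy case the two variants give average effects of 3.0 and 0.5, a spread of 2.5 that is large relative to either estimate. That is the kind of divergence the text says should be reported rather than resolved by picking one specification.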

References

Shaw, C. (2025). Causal Inference in Marketing: Panel Data and Machine Learning Methods (Community Review Edition), Section 7.3.