MMM 709: Diagnostics and Goodness of Fit

Hybrid methods are only as credible as the diagnostics that accompany them. In marketing panels, where data are noisy and designs are often tight, you cannot afford to treat a hybrid estimator as a black box. You need to check how well it fits the pre-treatment path, how well it balances covariates, how weights are distributed across donors and time, how sensitive estimates are to individual donors or periods, and how stable the design is under placebo checks. This section outlines a diagnostic workflow tailored to hybrids and connects each diagnostic to specific remedial actions, with Chapter 17 providing a more general framework. These are diagnostics and falsification checks. Passing them does not prove identification, but failing them is strong evidence against it.

Pre-Treatment Fit

Pre-period fit is the first line of defence. For any hybrid method, define the pre-treatment RMSPE as

$$\mathrm{RMSPE}_{\mathrm{pre}} = \sqrt{\frac{1}{T_0} \sum_{t=1}^{T_0} \left( Y_{1t} - \hat{Y}_{1t}^{\mathrm{syn}} \right)^2},$$

where $\hat{Y}_{1t}^{\mathrm{syn}}$ is the method-specific hybrid counterfactual in period $t$, and $T_0$ is the last pre-treatment period. For SC, this is the weighted donor average; for ASCM, it is the augmented synthetic control; for SDID, it is the reweighted, intercept-shifted comparison.

RMSPE is not a goal in itself but a way to compare specifications and methods on a common scale. Expressing RMSPE relative to the pre-treatment standard deviation of $Y_{1t}$ helps with interpretation: an RMSPE that is small relative to the natural variability in outcomes suggests that the synthetic path tracks the treated unit closely; an RMSPE of the same order as the outcome SD indicates that the counterfactual tracks the treated unit little better than its own pre-period mean would. Comparing RMSPE across SC, ASCM, SDID and DiD shows whether the extra structure in hybrids is buying you materially better pre-treatment fit.

In the factor-model notation from Chapter 6, $\mathrm{RMSPE}_{\mathrm{pre}}$ is best viewed as an empirical proxy for pre-period imputation error, not a direct bound on post-period bias. Small $\mathrm{RMSPE}_{\mathrm{pre}}$ is therefore necessary but not sufficient for small post-treatment bias. If RMSPE is of the same order as the post-period effect you hope to detect, the design cannot cleanly separate treatment from misspecification.

The key is to combine RMSPE with other diagnostics. A hybrid that shaves a few percentage points off RMSPE by pushing weights onto one odd donor is less credible than a slightly looser-fitting design with more stable weights and better covariate balance.
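As a minimal sketch of the RMSPE calculation and its normalisation by the pre-treatment standard deviation, the following Python snippet may help. The function names `rmspe_pre` and `relative_rmspe`, the array layout (one value per period, pre-treatment periods first) and the numbers are illustrative assumptions, not part of any particular package.

```python
import numpy as np

def rmspe_pre(y_treated, y_synth, T0):
    """Root mean squared prediction error over the first T0 (pre-treatment) periods."""
    y_treated = np.asarray(y_treated, dtype=float)
    y_synth = np.asarray(y_synth, dtype=float)
    gaps = y_treated[:T0] - y_synth[:T0]
    return float(np.sqrt(np.mean(gaps ** 2)))

def relative_rmspe(y_treated, y_synth, T0):
    """RMSPE divided by the pre-treatment sample SD of the treated outcome."""
    sd_pre = np.std(np.asarray(y_treated, dtype=float)[:T0], ddof=1)
    return rmspe_pre(y_treated, y_synth, T0) / sd_pre

# Made-up numbers for illustration only; T0 = 4 pre-treatment periods.
y1 = np.array([10.0, 12.0, 11.0, 13.0, 20.0, 21.0])    # treated unit
y_sc = np.array([10.5, 11.5, 11.5, 12.5, 13.0, 13.5])  # one candidate counterfactual

print(rmspe_pre(y1, y_sc, T0=4))       # absolute pre-period fit
print(relative_rmspe(y1, y_sc, T0=4))  # fit relative to outcome variability
```

Running the same two functions on the counterfactual paths from SC, ASCM, SDID and DiD puts the methods on the common scale described above.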

7.9 Diagnostics and Goodness of Fit

Covariate Balance

Outcome paths tell only part of the story. You also care about whether the hybrid synthetic control matches the treated unit on covariates that predict outcomes and treatment. Standardised mean differences (SMDs) are a useful summary:

$$\mathrm{SMD}_X = \frac{X_1 - \sum_j \hat{w}_j X_j}{s_X},$$

where $X_1$ is the treated unit’s covariate value, $\sum_j \hat{w}_j X_j$ is the weighted donor mean, and $s_X$ is a reference standard deviation (for example, the pooled standard deviation of $X$ across treated and donor units). Here $X$ represents either baseline (time-invariant) covariates or pre-period averages of time-varying covariates. Values close to zero indicate good balance; large absolute values indicate imbalance. Thresholds from the matching literature (such as SMD below 0.1 in absolute value as “good balance” and above about 0.2 as concerning) are useful guides rather than hard rules.

In marketing contexts you should focus especially on covariates that are central to the business story: store size, income, channel mix, competitive intensity, and so on. If one of these remains substantially imbalanced even under a hybrid design, you need to decide whether that imbalance plausibly biases the estimated effect. ASCM gives you a natural lever to correct residual imbalances by including problematic covariates in the augmentation model. Ridge and balancing SC let you tilt the weighting scheme towards better balance at the cost of some pre-treatment fit. If, after reasonable adjustments, a key covariate remains badly imbalanced, you should report that fact and discuss its implications for identification.
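The SMD calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `smd`, the array shapes and the covariate values are hypothetical, and the default reference SD (the sample SD over the treated unit and all donors stacked together) is only one possible choice of reference SD.

```python
import numpy as np

def smd(x_treated, X_donors, w, s_ref=None):
    """Standardised mean difference for each covariate.

    x_treated: (K,) covariates of the treated unit
    X_donors:  (J, K) covariates of the J donors
    w:         (J,) donor weights from the fitted hybrid
    s_ref:     (K,) reference SDs; if None, use the SD over treated and donors
               stacked together (an assumption made for this sketch)
    """
    x_treated = np.asarray(x_treated, dtype=float)
    X_donors = np.asarray(X_donors, dtype=float)
    w = np.asarray(w, dtype=float)
    if s_ref is None:
        s_ref = np.std(np.vstack([x_treated, X_donors]), axis=0, ddof=1)
    return (x_treated - w @ X_donors) / s_ref

# Made-up example: two covariates (say, store size and local income index).
x_treated = [100.0, 50.0]
X_donors = [[90.0, 40.0], [110.0, 70.0], [100.0, 55.0]]
w = [0.5, 0.5, 0.0]

balance = smd(x_treated, X_donors, w)
print(balance)                   # one SMD per covariate
print(np.abs(balance) > 0.2)     # flag covariates above the rough 0.2 guide
```

The 0.2 flag mirrors the rule of thumb quoted above and should be read as a prompt for discussion, not a pass/fail test.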

Weight Dispersion

Hybrid estimators work by reweighting donors and, in SDID, time periods. Diagnostics should therefore include a view of how concentrated or diffuse those weights are. For unit weights, an informative summary is the “effective number of donors”

$$N_{\mathrm{eff}} = \frac{1}{\sum_j \hat{w}_j^2},$$

which is the inverse of the Herfindahl index of the weights. A value of one means all weight sits on a single donor; a value equal to the number of donors $N_0$ means weights are uniform. For SDID, an analogous effective number of pre-treatment periods

$$T_{\mathrm{eff}} = \frac{1}{\sum_{t=1}^{T_0} \hat{v}_t^2}$$

shows how many periods meaningfully contribute.

Extremes in either direction are informative. Very low $N_{\mathrm{eff}}$ indicates that the hybrid is leaning heavily on one or two donors and may be vulnerable to their idiosyncrasies. Very high $N_{\mathrm{eff}}$ indicates that the estimator is close to uniformly averaging over donors and may be ignoring structure that could help sharpen the counterfactual. In many marketing panels, an interior solution where a modest number of sensible donors carry most of the weight is what you hope to see. If diagnostics show otherwise, revisit the penalty strength, donor pool and predictor set.
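Both effective-number summaries are the same calculation applied to different weight vectors. A minimal Python sketch, assuming the weights are non-negative and sum to one (the function name `effective_number` and the example weights are hypothetical):

```python
import numpy as np

def effective_number(weights):
    """Inverse Herfindahl index: 1 / sum of squared weights.

    Assumes the weights are non-negative and sum to one, as for
    SC unit weights or SDID time weights.
    """
    w = np.asarray(weights, dtype=float)
    return float(1.0 / np.sum(w ** 2))

# Made-up weight vectors over four donors.
print(effective_number([1.0, 0.0, 0.0, 0.0]))      # all weight on one donor
print(effective_number([0.25, 0.25, 0.25, 0.25]))  # uniform weights
print(effective_number([0.5, 0.3, 0.2, 0.0]))      # interior case
```

Passing the estimated time weights instead of the unit weights gives the effective number of pre-treatment periods.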

Leverage and Influence

Even when overall weight dispersion looks reasonable, individual donors or periods can have outsized influence. Simple leave-one-out checks are powerful here. Re-estimate the hybrid estimator excluding each of the largest-weight donors in turn and record how much the estimated treatment effect changes. Do the same for pre-treatment periods in SDID by excluding one period at a time when estimating time weights. If dropping a single donor or period barely moves the estimate, the design is robust along that dimension.

If excluding one donor or one pre-period causes large swings, dig deeper. Is that donor truly comparable, or does it have a unique pattern that the algorithm is over-using? Does the influential period contain an unusual local promotion, a data glitch or a macro shock that should be handled explicitly? The point is not to mechanically delete influential observations, but to understand why they matter. Large influence for a donor or period that is marginal to the business question (for example, a very small donor market or a holiday period outside the main evaluation window) may be less concerning than large influence for a core donor or a period central to the decision; diagnostics should be interpreted in that light.
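The leave-one-out loop over donors can be sketched as follows. To keep the example self-contained, `fit_weights` is a stand-in that uses unconstrained least squares; it is not a synthetic-control, ASCM or SDID fit, and in practice you would pass your own fitting routine. All names and data here are hypothetical.

```python
import numpy as np

def fit_weights(Y0_pre, y1_pre):
    """Stand-in fit (assumption): unconstrained least squares, NOT a simplex-constrained SC fit."""
    w, *_ = np.linalg.lstsq(Y0_pre, y1_pre, rcond=None)
    return w

def average_effect(Y0, y1, T0, fit=fit_weights):
    """Fit weights on the first T0 periods; return the mean post-period gap and the weights."""
    w = fit(Y0[:T0], y1[:T0])
    return float(np.mean(y1[T0:] - Y0[T0:] @ w)), w

def leave_one_out_donors(Y0, y1, T0, top_k=3, fit=fit_weights):
    """Drop each of the top_k largest-weight donors in turn; report the change in the estimate."""
    base, w = average_effect(Y0, y1, T0, fit)
    top = np.argsort(-np.abs(w))[:top_k]
    changes = {}
    for j in top:
        keep = [k for k in range(Y0.shape[1]) if k != j]
        est, _ = average_effect(Y0[:, keep], y1, T0, fit)
        changes[int(j)] = est - base
    return base, changes

# Made-up panel: 6 periods (T0 = 4), 3 donors in columns.
Y0 = np.column_stack([[1, 2, 3, 4, 5, 6],
                      [2, 1, 4, 3, 6, 5],
                      [1, 1, 2, 2, 3, 3]]).astype(float)
T0 = 4
y1 = 0.5 * (Y0[:, 0] + Y0[:, 1])  # treated path built from donors 0 and 1
y1[T0:] += 2.0                    # true post-period effect of 2

base, changes = leave_one_out_donors(Y0, y1, T0)
print(base)     # full-sample estimate
print(changes)  # change in the estimate when each donor is dropped
```

In this toy panel, dropping the donor that carries no weight leaves the estimate unchanged, while dropping one of the two donors that build the treated path moves it noticeably, which is the pattern the text asks you to investigate rather than mechanically correct.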

Placebo Checks

Placebo checks probe the stability assumptions that hybrids rely on. The idea is to pretend that treatment occurred earlier than it did and see whether the estimator generates spurious “effects” inside the pre-treatment window. Concretely, choose a pseudo-intervention date $T^* < T_0$. Treat periods up to $T^*$ as pre-treatment, estimate the hybrid estimator on that shorter pre-period, and then compute pseudo treatment effects for periods $T^* + 1$ to $T_0$. If the model extrapolates well, these pseudo effects should be small and centred near zero. Large, systematic pseudo effects indicate a violation of the stability condition in Assumption 22: the mechanism linking weights, covariates and untreated outcomes is not stable even within the pre-period, so there is little reason to trust it out of sample.

These checks are particularly revealing for complex hybrids such as ASCM, SDID and TROP, where overfitting the early pre-period is easy. When placebo checks fail, the appropriate reaction is not to hunt for a different pseudo date that “passes”, but to rethink the specification: shorten the pre-period used for estimation, simplify the augmentation model, strengthen regularisation, or reconsider whether the design can support a hybrid at all. Pre-specify a small set of pseudo-dates (for example, one or two meaningful cut points based on seasonality or business cycles) rather than scanning the entire pre-period.
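An in-time placebo of this kind can be sketched in Python as below. As a simplifying assumption, `fit_weights` is again an unconstrained least-squares stand-in for whichever hybrid you are actually estimating; the function names, the pseudo-date and the data are hypothetical.

```python
import numpy as np

def fit_weights(Y0_pre, y1_pre):
    """Stand-in fit (assumption): unconstrained least squares, not a real hybrid estimator."""
    w, *_ = np.linalg.lstsq(Y0_pre, y1_pre, rcond=None)
    return w

def placebo_in_time(Y0, y1, T0, T_star, fit=fit_weights):
    """Fit on the first T_star periods only; return pseudo effects for periods T_star+1..T0."""
    if not 0 < T_star < T0:
        raise ValueError("T_star must lie strictly inside the pre-treatment window")
    w = fit(Y0[:T_star], y1[:T_star])
    return y1[T_star:T0] - Y0[T_star:T0] @ w

# Made-up panel: 8 periods, T0 = 6 true pre-treatment periods, 2 donors.
Y0 = np.column_stack([[1, 2, 3, 4, 5, 6, 7, 8],
                      [2, 1, 4, 3, 6, 5, 8, 7]]).astype(float)
T0, T_star = 6, 4

y1_stable = Y0 @ np.array([0.5, 0.5])                    # stable donor relationship
y1_drift = y1_stable + 0.5 * np.arange(len(y1_stable))   # treated unit drifts away

pseudo_stable = placebo_in_time(Y0, y1_stable, T0, T_star)
pseudo_drift = placebo_in_time(Y0, y1_drift, T0, T_star)
print(pseudo_stable)  # near zero: relationship holds within the pre-period
print(pseudo_drift)   # systematically positive: stability fails
```

The pseudo-date is passed in explicitly so that it can be fixed in advance, in line with the advice to pre-specify a small set of cut points.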

Residual Diagnostics

Residuals provide a complementary view. For a given hybrid estimator, define residuals as $e_{1t} = Y_{1t} - \hat{Y}_{1t}^{\mathrm{syn}}$ for all periods $t$, and plot them over time with the intervention date marked. In a well-specified design, pre-treatment residuals should fluctuate around zero without obvious trend or seasonality. Systematic drifts, seasonal cycles or clusters of large outliers in the pre-period signal misspecification. Seasonal residual structure suggests the donor pool is mismatched on seasonality or the pre-period window is misaligned. Structural breaks suggest redefining the pre-period or excluding donors with breaks.

Post-treatment residuals will reflect both treatment effects and any remaining misspecification. A clean treatment story looks like a discernible level shift or pattern emerging after treatment against a background of pre-period noise that has been well controlled. If pre-period residuals already exhibit strong non-random structure, it becomes hard to argue that post-period deviations are due to treatment rather than to model failure.
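Alongside the plot, a few numerical summaries of the pre-period residuals can flag drift or serial dependence. The sketch below is one possible set of summaries (mean, linear trend slope and lag-1 autocorrelation), chosen for illustration; the function name `pre_residual_summary` and the data are hypothetical, and these summaries complement rather than replace visual inspection.

```python
import numpy as np

def pre_residual_summary(y_treated, y_synth, T0):
    """Mean, linear trend slope and lag-1 autocorrelation of pre-treatment residuals."""
    e = np.asarray(y_treated, dtype=float)[:T0] - np.asarray(y_synth, dtype=float)[:T0]
    slope = np.polyfit(np.arange(T0), e, 1)[0]   # per-period drift in the residuals
    c = e - e.mean()
    denom = np.sum(c ** 2)
    lag1 = np.sum(c[1:] * c[:-1]) / denom if denom > 0 else 0.0
    return {"mean": float(e.mean()), "slope": float(slope), "lag1_autocorr": float(lag1)}

# Made-up example: the counterfactual is flat while the treated unit drifts upwards
# over the T0 = 5 pre-treatment periods.
y1 = [1.0, 1.2, 1.4, 1.6, 1.8, 5.0]
y_syn = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

summary = pre_residual_summary(y1, y_syn, T0=5)
print(summary)  # non-zero mean and slope point to a pre-period trend
```

A clearly non-zero slope or strongly positive lag-1 autocorrelation corresponds to the systematic drift described above; seasonal structure would need a check at the seasonal lag instead.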

Method-Specific Considerations

Different hybrids bring their own diagnostic nuances.

For ASCM, the size of the augmentation correction relative to the pure synthetic-control component is informative. If the augmentation correction routinely accounts for a large share of the level of $\hat{Y}_{1t}^{\mathrm{syn}}$ across pre- and post-treatment periods, the estimator is heavily reliant on the outcome model. Inspecting coefficient paths for the augmentation regression, and how sensitive the correction is to tuning choices and alternative specifications of the augmentation model, helps you judge whether that reliance is defensible.

For SDID, the pattern of time weights $\hat{v}_t$ is revealing. Concentration on the very last pre-treatment period effectively turns the design into a first-difference comparison. Near-uniform time weights put equal emphasis across the whole pre-period. Combined with diffuse unit weights, the estimator moves closer to a conventional DiD-style comparison, but it is not identical to DiD in levels; both extremes should be justified by the seasonal and macro environment. When you see extreme patterns, you should ask whether they make sense given seasonality, promotions and macro conditions, and whether alternative specifications of the pre-period window would yield more interpretable weights.

For TROP and other factor-based hybrids, the tuning parameters and inferred factor rank play a diagnostic role. When cross-validation pushes the factor penalty so high that the factor component is effectively shut down, the data are telling you that a simpler hybrid (closer to SDID) suffices. When unit-weight decay is very strong, the estimator is effectively using nearest neighbours. Reporting these tuning outcomes alongside fit metrics gives readers a sense of how complex the underlying structure really needs to be. Given the current research stage of TROP, treat these tuning patterns as exploratory diagnostics rather than as definitive evidence that a rich factor structure is required.

Putting Diagnostics Together

In practice, you should view diagnostics as a package rather than as a sequence of hurdles. A credible hybrid design is one where pre-period RMSPE is small relative to outcome variability and clearly better than simpler alternatives; covariates central to the business story are well balanced; unit and time weights are neither excessively concentrated nor implausibly uniform; estimates are not unduly driven by a single donor or period; placebo checks show no large spurious effects; and residuals in the pre-period look like noise rather than trend.

When several of these diagnostics point in the same positive direction, you can be reasonably confident that the hybrid method is constructing a credible counterfactual. When they send mixed signals (for example, excellent RMSPE but poor balance on key covariates, or good balance but unstable placebo performance), you should say so and treat any causal claims with appropriate caution. The goal is not to certify a method as “valid” once a checklist is ticked, but to build a cumulative case that your design and estimator are fit for purpose in the marketing context at hand. Chapter 17 provides a complementary, design-level checklist; you should interpret the hybrid-specific diagnostics here within that broader framework.

7.10 Inference

References

Shaw, C. (2025). Causal Inference in Marketing: Panel Data and Machine Learning Methods (Community Review Edition), Section 7.9.