MMM 710: Inference

Hybrid methods do not change the basic purpose of inference. We still want to know how much uncertainty surrounds our estimated treatment effects and whether those effects are unusually large relative to what the design would produce by chance. Uncertainty bands can target either the counterfactual path $\hat{Y}_{1t}(0)$ itself or the treatment effect $\hat{\tau}_{1t} = Y_{1t} - \hat{Y}_{1t}(0)$; the interpretation and width differ. What hybrids do change is the structure of that uncertainty: unit weights are estimated, time weights $\hat{v}_t$ may be estimated, augmentation models contribute additional noise, and, for TROP, factor structure is tuned. Inference must respect this extra layer of estimation while remaining honest about the limits imposed by small marketing panels. In this section we sketch inference strategies that are well suited to the hybrid estimators in this chapter, and point to Chapter 16 for the full development of panel inference tools.

Method-Specific Variance Structures

Each hybrid estimator has its own variance structure, reflecting which objects are treated as fixed and which as estimated.

For SDID, both unit weights and time weights $\hat{v}_t$ are estimated from the pre-treatment panel. Arkhangelsky et al. [2021] derive an analytic variance formula that treats these weights as functions of the data and accounts for their sampling variability when the control group is large. In practice, many applications rely either on that formula as implemented in software or on bootstrap methods that resample units or cohorts, as described below. In small-N marketing panels, these large-control-group asymptotics can be fragile; in those cases it is safer to treat the analytic formula and bootstrap estimates as complementary checks rather than to rely on one alone.

For ASCM, uncertainty comes from two sources: the synthetic control weights and the augmentation model. The more weight the design puts on the regression adjustment, the more variance (and potential model dependence) flows through that channel; the more weight it puts on the synthetic control component, the more variance comes from which donors carry weight and how their outcomes fluctuate. Closed-form variance expressions exist under specific modelling assumptions, but they depend on nuisance quantities that are hard to estimate reliably in small samples. In particular, variance contributions from the augmentation model depend on how well its residual structure is captured; when the regression is high-dimensional or heavily regularised, asymptotic formulas can be especially misleading. For this reason, resampling-based approaches are the natural default.

For ridge SC and related regularised SC variants, the ridge penalty stabilises weights and typically reduces variance relative to unregularised SC, at the cost of some bias. Because the weights are complicated functions of the data and the penalty, analytic variance expressions quickly become unwieldy. Again, cluster-robust or bootstrap-based inference is the practical choice.

For TROP and other factor-based hybrids, the variance structure is more complex still because unit weights, time weights $\hat{v}_t$ and the factor component are all tuned from the data. Here, inference is best treated as a research topic rather than a settled routine. Proposals in the literature include cross-validation-style variance estimates based on how effects vary across folds. At the time of writing, these ideas are promising but not yet standard, and we recommend using TROP primarily as a robustness check alongside simpler methods whose inference is better understood.

Placebo and Permutation Tools

Placebo and permutation tools play a central role in hybrid inference because they make minimal modelling assumptions and directly probe the stability of the design.

In-space placebos apply the hybrid estimator to each donor unit in turn, pretending that donor j was treated at the same time as the actual treated unit. Comparing the resulting placebo gaps to the treated unit’s gap gives a sense of how unusual the observed effect is relative to what the method produces when applied to untreated units. Ranking statistics such as the ratio of post- to pre-treatment RMSPE, introduced in Chapter 6, can be turned into randomisation-style tail probabilities under symmetry assumptions. In observational marketing panels, it is safer to treat these as descriptive diagnostics (“our treated unit is more extreme than all but one placebo”) rather than as literal frequentist p-values. In other words, they tell you how unusual your treated unit looks when processed through the same estimator and design, conditional on the realised panel, not how likely such an effect would be under a random-assignment mechanism that did not in fact operate.

In-time permutations treat intervention timing as if it were unknown within the pre-treatment window. By shifting the pseudo-intervention date forward and backward and recomputing pseudo-effects, you obtain a distribution of “effects” that arise purely from fitting and extrapolating within the pre-period. If the actual post-treatment effect sits well outside the range of these pseudo-effects, that strengthens the case that you are seeing treatment rather than model artefacts. If it falls squarely inside that range, your design is not discriminating strongly between real and pseudo treatment. Large pseudo-effects under in-time permutations are direct evidence against the pre/post stability condition in Assumption 22: the mechanism that fits early pre-period data does not extrapolate even to later pre-periods.

Both procedures stress-test the hybrid’s weighting and augmentation components without additional parametric assumptions. They are especially valuable when conventional asymptotics are unreliable because the number of units or pre-periods is small.
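The in-space placebo ranking described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the gap paths (observed minus estimated counterfactual) have already been produced by re-running the hybrid estimator on each donor, and the function names and all numbers are hypothetical.

```python
import math

def rmspe(gaps):
    """Root mean squared prediction error of a sequence of gaps."""
    return math.sqrt(sum(g * g for g in gaps) / len(gaps))

def rmspe_ratio(gaps, n_pre):
    """Post/pre RMSPE ratio for one unit's gap path."""
    return rmspe(gaps[n_pre:]) / rmspe(gaps[:n_pre])

def placebo_rank(treated_gaps, placebo_gaps, n_pre):
    """Rank the treated unit's ratio among the placebo ratios.

    Returns (rank, tail_share): rank 1 means the treated unit is the most
    extreme, and tail_share = rank / (number of placebos + 1).
    """
    treated = rmspe_ratio(treated_gaps, n_pre)
    ratios = [rmspe_ratio(g, n_pre) for g in placebo_gaps]
    rank = 1 + sum(r >= treated for r in ratios)
    return rank, rank / (len(ratios) + 1)

# Hypothetical gap paths: 4 pre-periods followed by 3 post-periods.
treated_gaps = [0.1, -0.1, 0.1, -0.1, 1.0, 1.2, 0.9]
placebo_gaps = [
    [0.2, -0.2, 0.2, -0.2, 0.3, -0.1, 0.2],
    [0.1, 0.1, -0.1, -0.1, 0.1, 0.2, -0.1],
    [0.3, -0.3, 0.3, -0.3, -0.2, 0.4, 0.1],
    [0.1, -0.1, 0.1, -0.1, 1.5, 1.4, 1.6],  # one placebo with a larger ratio
]
rank, tail_share = placebo_rank(treated_gaps, placebo_gaps, n_pre=4)
print(rank, tail_share)
```

In this toy panel the treated unit is more extreme than all but one placebo. With only four placebos the tail share can only be a multiple of 1/5, which is one reason to read it as a descriptive ranking rather than a p-value.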

Bootstrap Approaches

Bootstrap methods, introduced in Chapter 16, provide a flexible route to standard errors and confidence intervals when analytic formulas are unavailable or untrustworthy. In panel hybrids, the natural resampling unit is the unit or cohort, not individual observations. A unit-level (or cohort-level) block bootstrap proceeds by resampling units with replacement within the treated and donor groups separately, re-estimating the hybrid estimator on each bootstrap sample and recording the resulting treatment effect. The dispersion of these bootstrap replicates provides a standard error, and percentile intervals provide confidence bands. This preserves the within-unit time-series structure while acknowledging that treated and donor units are the primary sources of sampling variability.

When there is only a single treated unit, resampling treated units is not meaningful; single-treated-unit settings typically rely on placebo distributions, time permutations, or residual-based procedures rather than a naive “treated-unit bootstrap”. If you do bootstrap in such settings, it is usually via resampling donors (and/or blocks of time under additional assumptions), with interpretation as a sensitivity device. When the number of treated units is extremely small (for example, one or two flagships), bootstrap variability in pooled effects will be dominated by between-unit heterogeneity rather than sampling noise; in such cases, simple unit-level bootstraps can understate uncertainty and should be interpreted conservatively.

Wild bootstrap variants adapt the same idea to settings with few units and heteroscedastic residuals. They either hold weights fixed and perturb residuals, or re-estimate the full procedure in each bootstrap draw; both aim to approximate sampling variation under a null. This is particularly useful when you pool effects across cohorts or event times and want to account for heterogeneity in noise levels across markets and over time. With either the block or the wild bootstrap, you must respect the design: resampling should not break the treatment–control structure (for example, treated units should remain treated) or the staggered-adoption pattern.
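The unit-level block bootstrap can be sketched as follows. This is a minimal illustration under stated assumptions: the panel is balanced and stored as one outcome list per unit, there are several treated units with a common treatment date, and `did_estimate` is a deliberately simple placeholder (a difference-in-differences of means) standing in for whichever hybrid estimator you would actually re-fit in each draw. All names and numbers are hypothetical.

```python
import random
import statistics

def did_estimate(treated, donors, n_pre):
    """Placeholder estimator: difference-in-differences of group means.
    In practice, replace this with a full re-fit of the hybrid estimator."""
    def change(unit):
        return statistics.fmean(unit[n_pre:]) - statistics.fmean(unit[:n_pre])
    return (statistics.fmean([change(u) for u in treated])
            - statistics.fmean([change(u) for u in donors]))

def block_bootstrap(treated, donors, n_pre, estimator, n_boot=500, seed=0):
    """Resample whole units with replacement, separately within the treated
    and donor groups, and re-estimate the effect on each bootstrap panel."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        t_star = rng.choices(treated, k=len(treated))  # treated stay treated
        d_star = rng.choices(donors, k=len(donors))    # donors stay donors
        draws.append(estimator(t_star, d_star, n_pre))
    draws.sort()
    se = statistics.stdev(draws)
    lo = draws[int(0.025 * (n_boot - 1))]  # simple percentile interval
    hi = draws[int(0.975 * (n_boot - 1))]
    return se, (lo, hi)

# Hypothetical panel: 4 pre-periods, 2 post-periods.
treated = [
    [10, 10, 10, 10, 12, 12],
    [8, 9, 8, 9, 11, 11],
    [5, 5, 6, 6, 8.5, 8.5],
]
donors = [
    [7, 7, 7, 7, 7, 7],
    [4, 4, 4, 4, 4.2, 4.2],
    [9, 9, 9, 9, 8.8, 8.8],
    [6, 6, 6, 6, 6.1, 6.1],
    [3, 3, 3, 3, 2.9, 2.9],
]
point = did_estimate(treated, donors, n_pre=4)
se, ci = block_bootstrap(treated, donors, 4, did_estimate)
print(point, se, ci)
```

Each unit's full time series is resampled as a block, so the within-unit serial structure is left intact. With only three treated units, the interval mostly reflects between-unit heterogeneity and should be read conservatively.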

Conformal and Prediction-Interval Views

Conformal inference, discussed in Chapter 16 [Chernozhukov et al., 2021, Cattaneo et al., 2021], offers an alternative perspective by constructing prediction intervals for the treated unit’s untreated outcomes based on the distribution of pre-treatment residuals. For hybrids, the idea is simple. You compute residuals between observed and synthetic outcomes in the pre-period, treat their empirical distribution as a reference for what “noise” looks like under no treatment, and then form intervals around the hybrid counterfactual in the post-period by adding and subtracting suitable quantiles of those residuals. If the observed post-treatment outcomes consistently lie outside these bands in one direction, that is evidence of a treatment effect. If they oscillate inside the bands, the data are consistent with noise around the estimated counterfactual.

The strength of this approach is that it makes minimal distributional assumptions beyond a form of exchangeability: conditional on no treatment, the distribution of residuals in the post-period is assumed to match the distribution of pre-treatment residuals. In panels with serial correlation, you need a blocked or exchangeable-by-block residual assumption, not i.i.d. residuals. Its weakness is that those assumptions are not guaranteed in short, seasonal marketing panels. We recommend using conformal-style bands as a complement to, not a replacement for, placebo and bootstrap-based checks.
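A residual-quantile band of this kind can be sketched as below. This is a simplified illustration, not the procedure from the cited papers: it uses the standard split-conformal quantile of absolute pre-period residuals, assumes those residuals are exchangeable with post-period noise, and ignores serial correlation. The function name and the data are hypothetical.

```python
import math

def conformal_band(pre_observed, pre_synthetic, post_synthetic, alpha=0.2):
    """Symmetric prediction band around the post-period counterfactual.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute pre-period
    residual as the half-width. If n is too small for the requested
    coverage, the half-width is infinite (an uninformative band).
    """
    scores = sorted(abs(y - s) for y, s in zip(pre_observed, pre_synthetic))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    q = scores[k - 1] if k <= n else math.inf
    return [(s - q, s + q) for s in post_synthetic]

# Hypothetical data: 10 pre-periods, 3 post-periods.
pre_synthetic = [50] * 10
pre_observed = [51, 48, 53, 50, 46, 55, 49, 52, 44, 57]
post_synthetic = [50, 51, 52]
post_observed = [58, 59, 60]

bands = conformal_band(pre_observed, pre_synthetic, post_synthetic, alpha=0.2)
above = [y > hi for y, (lo, hi) in zip(post_observed, bands)]
print(bands, above)

# With only 10 residuals, a 95% band cannot be formed from the data alone.
wide = conformal_band(pre_observed, pre_synthetic, post_synthetic, alpha=0.05)
print(wide[0])
```

Here every post-period outcome lies above its 80% band, the one-directional pattern the text treats as evidence of an effect. The second call shows the short-panel limitation directly: ten pre-period residuals are too few to support a finite 95% band under this construction.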

Randomisation Inference and Multiplicity

Randomisation inference has a clean interpretation in genuine experiments, where treatment is assigned by design. In that case, permuting treatment labels among units and recomputing the hybrid estimator for each permutation reproduces the exact randomisation distribution under a sharp null of no effect, and comparing the observed effect to this distribution yields p-values with a straightforward causal meaning. In genuine randomised geo-experiments, where treatment assignment across markets is known and under the experimenter’s control, this interpretation is exact.

In observational hybrids, treatment is not randomly assigned, so any such test can only be interpreted as a hypothetical sensitivity exercise, conditional on strong assumptions about the assignment (selection) mechanism. A small permutation tail probability (interpreted as a diagnostic measure of relative extremeness) tells you that, given the observed pattern of outcomes and your hybrid estimator, it would be rare to see an effect as large as the one you estimate if treatment labels were randomly shuffled. It does not, by itself, prove that treatment caused the effect. Use such tests as one piece of a larger inferential puzzle, not as a standalone verdict.

Multiplicity is unavoidable when you examine many post-treatment periods, cohorts or outcomes. Testing each post-treatment event-time effect $\theta_k$ at 5% inflates the chance of at least one false positive. Chapter 16 discusses formal tools such as Bonferroni corrections, false discovery rate control and stepdown procedures. In the hybrid context, it is often more informative to combine these with graphical event-study plots and joint tests (for example, testing that all post-treatment event-time effects $\theta_k$ are zero). For hybrid estimators that feed into event-time profiles, the uniform-band and joint-test ideas from the event-study chapter (Chapter 5) apply directly. In marketing applications with small samples, you will frequently be underpowered for strict family-wise error control; the focus should be on patterns that are consistent across periods, methods and specifications, rather than on individual stars in a table.
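To make the multiplicity point concrete, the sketch below computes the chance of at least one false positive across K tests (under the simplifying assumption that the tests are independent) and applies Bonferroni and Holm step-down adjustments to a set of per-period p-values. The p-values are hypothetical and could come from any of the inference methods in this section.

```python
def familywise_error(alpha, k):
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The (rank + 1)-th smallest p-value is scaled by (m - rank);
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical p-values for four post-treatment event-time effects.
pvals = [0.01, 0.04, 0.03, 0.20]
fwer = familywise_error(0.05, len(pvals))
bonf = bonferroni(pvals)
holm_adj = holm(pvals)
print(fwer, bonf, holm_adj)
```

With four tests at 5%, the unadjusted chance of at least one false positive is roughly 19% under independence. In this example Holm is never more conservative than Bonferroni, yet only the first period survives either correction at 5%, which illustrates why joint tests and consistent patterns tend to be more informative in small panels.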

Small-Sample Considerations

Most marketing panels are small enough that textbook asymptotics are, at best, rough guidance. With few treated units, modest donor pools and short pre-treatment histories, any inferential procedure must be interpreted with care. Chapter 16 discusses the distinction between sampling-based and design-based uncertainty in more detail [Abadie et al., 2020], which is especially relevant when inference is layered on top of observational hybrid designs.

With a small donor pool, rank-based placebo tests can only take on a coarse set of tail probabilities, and bootstrap distributions may be jagged. With short pre-periods, placebo checks in time have limited power and conformal intervals are based on few residuals. With few treated units, pooled estimates will have wide intervals and considerable sampling noise. In these settings, the most honest stance is to de-emphasise binary significance thresholds and focus on the size, sign and robustness of estimated effects across methods and specifications. In practice, this means showing effect paths with confidence or prediction bands, placebo distributions, and bootstrap intervals side by side, rather than relying on asterisks in tables.

Whatever inferential tools you deploy, good practice is to be explicit about sample sizes (numbers of treated units, donors, pre- and post-periods), about which inference method you used and why, and about how multiple testing was handled if you report many coefficients. This transparency lets readers calibrate their own confidence in your conclusions and see where the evidence is strong, where it is suggestive and where it is simply too thin to support sharp claims.

References

Shaw, C. (2025). Causal Inference in Marketing: Panel Data and Machine Learning Methods (Community Review Edition), Section 7.10.