Why inference is a first-order issue
Section 4.7 makes a central point: in DiD, identification and estimation are only half the job. Inference can fail even when the estimator is well chosen. Correlated errors, few clusters, and many simultaneous hypotheses can make confidence intervals too narrow and p-values overly optimistic.
Clustering
Outcomes for the same unit over time are correlated because of persistent unobservables, autocorrelated shocks, and dynamic feedback. Ignoring this serial correlation understates uncertainty.
Default practice is cluster-robust standard errors by unit. This allows arbitrary within-unit correlation while assuming independence across clusters.
If cross-unit correlation is plausible (regional shocks, category shocks, macro shocks), use two-way clustering by unit and time. This captures:
- within-unit serial dependence,
- cross-sectional dependence within each period.
The trade-off is larger standard errors (more realistic uncertainty) and slightly greater computational cost.
Common mistake: wrong clustering level
Cluster at the level at which treatment varies.
- If treatment is assigned at store level, cluster by store.
- If treatment is assigned at DMA level, cluster by DMA (not by store).
Clustering at a finer level than assignment (for example, store when treatment varies by DMA) typically yields SEs that are too small. Clustering coarser than needed is conservative but reduces power.
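The gap between the two clustering levels can be made concrete with a small simulation. The sketch below (illustrative only; the data, cluster counts, and effect size are invented for this example) assigns treatment at the "DMA" level, injects a common DMA shock, and compares CR0 cluster-robust standard errors clustered by store versus by DMA:

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_se(X, resid, clusters):
    """CR0 cluster-robust standard errors for OLS coefficients."""
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        s = X[clusters == g].T @ resid[clusters == g]
        meat += np.outer(s, s)
    return np.sqrt(np.diag(bread @ meat @ bread))

# Simulated data: treatment assigned at the DMA level, with common DMA shocks
n_dma, stores = 40, 25
dma = np.repeat(np.arange(n_dma), stores)
d = (dma < n_dma // 2).astype(float)                    # half the DMAs treated
y = (2.0 * d
     + np.repeat(rng.normal(0, 1.0, n_dma), stores)     # shared within-DMA shock
     + rng.normal(0, 1.0, dma.size))                    # idiosyncratic store noise

X = np.column_stack([np.ones(dma.size), d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

se_store = cluster_se(X, resid, np.arange(dma.size))[1]  # store-level: too fine
se_dma = cluster_se(X, resid, dma)[1]                    # DMA-level: assignment level
```

With a nontrivial DMA shock, `se_dma` is several times larger than `se_store`, which is exactly the understatement the guidance above warns about.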
Worked example: clustering can flip conclusions
The section’s example (loyalty rollout, 500 stores, 12 quarters):
- Unclustered: $\hat{\tau}=8.2$, $SE=2.1$, 95% CI $[4.1, 12.3]$, $p<0.001$.
- Clustered by store: $SE=4.8$, 95% CI $[-1.2, 17.6]$, $p=0.09$.
Same point estimate, very different inferential conclusion. This is why clustering is not optional in panel DiD.
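The reported intervals and p-values follow directly from the point estimate and standard error under a normal approximation. A minimal check, reproducing the numbers above from $\hat{\tau}$ and the two SEs:

```python
from scipy.stats import norm

def summarize(tau_hat, se, level=0.95):
    """Normal-approximation CI and two-sided p-value from an estimate and SE."""
    z = norm.ppf(1 - (1 - level) / 2)      # 1.96 for a 95% interval
    ci = (tau_hat - z * se, tau_hat + z * se)
    p = 2 * norm.sf(abs(tau_hat) / se)     # two-sided p-value
    return ci, p

ci_u, p_u = summarize(8.2, 2.1)   # unclustered: CI ~ [4.1, 12.3], p < 0.001
ci_c, p_c = summarize(8.2, 4.8)   # clustered by store: CI ~ [-1.2, 17.6], p ~ 0.09
```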
Small-sample corrections and few clusters
Cluster-robust asymptotics can be unreliable when the number of clusters $G$ is small.
Practical guidance from Section 4.7:
- $G<20$: asymptotic clustered SEs often unreliable; prefer wild cluster bootstrap or randomisation inference.
- $20 \le G \le 50$: asymptotic SEs may be usable, but check against bootstrap and apply small-sample corrections (for example, CR2/CR3 adjustments, the cluster analogues of HC2/HC3).
- $G\ge50$: asymptotic cluster-robust SEs are generally reliable.
These are guidelines, not hard thresholds. Reliability also depends on:
- cluster size imbalance,
- within-cluster correlation strength,
- leverage concentration in a few treated clusters.
Important notation reminder from the chapter: inference uses cluster count $G$; do not confuse this with adoption time $G_i$.
Wild cluster bootstrap
Wild cluster bootstrap perturbs cluster-level residuals with random sign flips (Rademacher weights) and builds a finite-sample null distribution for the test statistic. It is especially useful when treated clusters are few (for example, a geo experiment with only a handful of treated DMAs).
The bootstrap test imposes the null when constructing the distribution, so it is a hypothesis-testing tool. In practice, report bootstrap p-values alongside conventional CIs if they diverge.
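The procedure can be sketched end to end. This is an illustrative implementation, not a replacement for `boottest` or `fwildclusterboot`: fit the restricted model with the null imposed, rebuild outcomes with one Rademacher weight per cluster, and compare bootstrap t-statistics against the observed one (the toy data, cluster counts, and effect size are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

def cr_t(X, y, clusters, coef=1):
    """OLS t-statistic for X[:, coef] using CR0 cluster-robust standard errors."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        s = X[clusters == g].T @ resid[clusters == g]
        meat += np.outer(s, s)
    se = np.sqrt(np.diag(bread @ meat @ bread))
    return beta[coef] / se[coef]

def wild_cluster_p(X, y, clusters, reps=499):
    """Wild cluster bootstrap-t p-value for H0: coefficient = 0 (null imposed)."""
    t_obs = cr_t(X, y, clusters)
    X0 = X[:, :1]                                   # restricted model under the null
    b0 = np.linalg.lstsq(X0, y, rcond=None)[0]
    fit0 = X0 @ b0
    u0 = y - fit0
    ids = np.unique(clusters)
    pos = np.searchsorted(ids, clusters)            # map each obs to its cluster
    t_boot = np.empty(reps)
    for r in range(reps):
        w = rng.choice([-1.0, 1.0], size=ids.size)  # one Rademacher weight per cluster
        t_boot[r] = cr_t(X, fit0 + w[pos] * u0, clusters)
    return np.mean(np.abs(t_boot) >= np.abs(t_obs))

# Toy data: 12 clusters (few-cluster regime), treatment varies at the cluster level
G, m = 12, 30
cl = np.repeat(np.arange(G), m)
d = (cl < G // 2).astype(float)
y = 4.0 * d + np.repeat(rng.normal(0, 1.0, G), m) + rng.normal(0, 1.0, G * m)
X = np.column_stack([np.ones(G * m), d])
p_val = wild_cluster_p(X, y, cl)
```

The key design choice is that the bootstrap weights vary at the cluster level, not the observation level, so the resampled data preserve within-cluster dependence.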
Randomisation inference
Randomisation inference (permutation logic) gives exact p-values under:
- the sharp null of zero effect for all units and periods, and
- the stated randomisation protocol.
In true experiments, this is compelling because assignment is known by design. In observational staggered adoption, assignment is not known by design, so randomisation inference is only as credible as the assumed assignment mechanism (for example, timing random conditional on covariates). It is not design-free.
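The permutation logic is simple to sketch when the protocol is known. The example below assumes unit-level complete randomisation (half the units treated at random); the data and effect size are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def randomisation_p(y, d, draws=5000):
    """Permutation p-value under the sharp null (zero effect for every unit),
    re-drawing assignments under the assumed protocol: a random subset treated."""
    def stat(assign):
        return y[assign == 1].mean() - y[assign == 0].mean()
    t_obs = stat(d)
    t_perm = np.array([stat(rng.permutation(d)) for _ in range(draws)])
    return np.mean(np.abs(t_perm) >= np.abs(t_obs))

# Toy geo experiment: 40 units, half treated at random, large true effect
n = 40
d = np.zeros(n)
d[rng.choice(n, n // 2, replace=False)] = 1.0
y = 5.0 * d + rng.normal(0, 1.0, n)
p_val = randomisation_p(y, d)
```

If the true protocol were cluster-level or stratified assignment, the permutation step would have to mirror that structure; permuting at the wrong level invalidates the exactness claim.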
Multiple testing in DiD
Multiplicity appears naturally when estimating:
- full event-time paths $\{\theta_k\}$,
- many cohort-time effects $\{\tau(g,t)\}$,
- many subgroup ATTs.
Without adjustment, the chance of at least one false positive exceeds nominal levels.
- Bonferroni controls family-wise error, but can be conservative.
- FDR control is less conservative and often better for exploratory analyses.
- Romano-Wolf stepdown can improve power by exploiting dependence while controlling family-wise error.
Decision framing from the chapter:
- Use Bonferroni or Romano-Wolf when false positives are costly (regulatory or irreversible decisions).
- Use FDR for exploratory heterogeneity scans where power matters.
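Both adjustments are a few lines of arithmetic. The sketch below implements Bonferroni and Benjamini-Hochberg adjusted p-values (the latter mirroring R's `p.adjust(method = "BH")`); the raw p-values are hypothetical, standing in for a set of event-time coefficients:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR control)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]  # keep adjusted p monotone
    adj = np.empty(n)
    adj[order] = np.clip(scaled, 0.0, 1.0)
    return adj

def bonferroni(pvals):
    """Bonferroni adjusted p-values (family-wise error control)."""
    return np.clip(np.asarray(pvals, dtype=float) * len(pvals), 0.0, 1.0)

# Hypothetical raw p-values for four event-time coefficients
raw = [0.01, 0.02, 0.03, 0.04]
adj_bh = bh_adjust(raw)      # all 0.04: every coefficient survives FDR at 5%
adj_bonf = bonferroni(raw)   # 0.04, 0.08, 0.12, 0.16: only the first survives FWER at 5%
```

The contrast illustrates the trade-off stated above: Bonferroni protects against any false positive at the cost of power, while BH tolerates a controlled share of false discoveries.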
Practical workflow for marketing applications
Section 4.7 recommends this default workflow:
- Cluster by unit as baseline.
- Add two-way clustering when cross-unit shocks are plausible.
- Use wild cluster bootstrap when clusters are few.
- Adjust for multiplicity when testing many coefficients/subgroups.
- Report sensitivity to alternative clustering levels when uncertain.
Pre-specify primary vs exploratory analyses in the pre-analysis plan. Control type I error tightly for the primary estimand; label exploratory analyses as exploratory.
Table 4.5: Inference method decision guide
| Situation | Recommended method | Notes |
|---|---|---|
| Standard panel, $G \ge 50$ clusters | Cluster-robust SEs by unit | Default choice |
| Cross-unit correlation (geo experiments, regional shocks) | Two-way clustering (unit $\times$ time) | Accounts for serial and cross-sectional correlation |
| Few clusters ($G < 20$) | Wild cluster bootstrap | Better finite-sample inference |
| Experimental design with known randomisation | Randomisation inference | Exact p-values under sharp null and stated protocol |
| Many event-time coefficients or many subgroups | FDR control or Romano-Wolf | Adjust for multiplicity |
| Primary estimand + exploratory analyses | Adjust primary only | Label exploratory analyses clearly |
Software pointers from Section 4.7
Inference choices should follow design and estimand decisions.
- Wild cluster bootstrap: `boottest` (Stata), `fwildclusterboot` (R)
- Two-way clustering: `reghdfe` (Stata), `fixest` (R)
- Romano-Wolf: `rwolf` (Stata), `wildrwolf` (R)
- FDR control: `p.adjust(method = "BH")` (R)
Modern DiD estimators (for example, Callaway-Sant’Anna, Sun-Abraham) provide estimator-specific variance calculations for aggregated estimands. Consult package documentation for the exact asymptotic argument and variance construction.
Takeaway
In DiD, inferential choices can materially change conclusions even when point estimates are stable. The minimum standard is to align clustering with treatment assignment, diagnose small-$G$ risk, and handle multiplicity explicitly. Transparent reporting of alternative clustering and inference choices is part of identification credibility, not a robustness afterthought.
References
- Shaw, C. (2025). Causal Inference in Marketing: Panel Data and Machine Learning Methods (Community Review Edition), Section 4.7.
- Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. (2023). When should you adjust standard errors for clustering? Quarterly Journal of Economics, 138(1), 1-35.
- Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414-427.