Sections 6 and 7 of Sandoval, Waudby-Smith, and Jordan (2026) bring the paper full circle, applying the abstract machinery to the motivating example from the introduction: a pharmaceutical company testing multiple treatment variations against a control.
The setup: multi-armed randomized experiments
Each experimental unit $n$ has $K+1$ potential outcomes:
$$Y_n(0), Y_n(1), \ldots, Y_n(K),$$
where $Y_n(0)$ is the control outcome and $Y_n(a)$ for $a \in \mathcal{A} = \{1, \ldots, K\}$ are the treatment outcomes under variation $a$ (e.g., different dosages). All outcomes are bounded in $[0, 1]$.
The average treatment effect for variation $a$ under distribution $P$ is:
$$\psi_P(a) := E_P[Y(a) - Y(0)].$$
The hypothesis test asks whether any treatment variation exceeds a threshold $\delta$:
$$\mathcal{P}^{(\delta)} = \{P \mid \forall a \in \mathcal{A},\; \psi_P(a) \leq \delta\} \quad \text{vs.} \quad \mathcal{Q}^{(\delta)} = \{P \mid \exists a \in \mathcal{A},\; \psi_P(a) > \delta\}.$$
For testing a positive treatment effect, set $\delta = 0$.
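The global-null structure can be made concrete with a tiny helper (a hypothetical illustration, not from the paper): the null holds iff every arm's average treatment effect is at most $\delta$, and fails as soon as a single arm exceeds it.

```python
# Hypothetical helper illustrating P^(delta) vs. Q^(delta): membership in
# the global null requires EVERY arm's ATE to be at most delta.

def in_null(ate, delta=0.0):
    """True if the ATE vector (psi_P(1), ..., psi_P(K)) lies in P^(delta)."""
    return all(psi <= delta for psi in ate)

# With delta = 0, one effective arm already puts P in the alternative.
print(in_null([-0.1, 0.0, -0.05]))   # every arm <= 0 -> True (null)
print(in_null([-0.1, 0.2, -0.05]))   # arm 2 exceeds 0 -> False (alternative)
```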
Data collection with propensity scores
Algorithm 3 describes the data collection protocol:
- Nature samples $(Y_n(0), Y_n(1), \ldots, Y_n(K)) \sim P$.
- The statistician chooses $A_n \in \{1, \ldots, K\}$ based on past data.
- Draw $Z_n \sim \text{Bernoulli}(\pi)$ for a fixed propensity score $\pi \in (0, 1)$.
- Assign unit $n$ to treatment $A_n$ if $Z_n = 1$, or control if $Z_n = 0$.
- Observe $Y^{\text{obs}}_n(A_n) = Z_n Y_n(A_n) + (1 - Z_n) Y_n(0)$.
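The protocol above can be sketched in a few lines. Everything distribution-specific here is an assumption of the sketch: Bernoulli potential outcomes stand in for Nature's $P$, and a round-robin choice stands in for the statistician's adaptive arm-selection rule, which the protocol leaves open.

```python
import random

# Sketch of the Algorithm 3 data-collection protocol. Toy assumptions:
# Nature's P is a product of coins, and the statistician's rule is round
# robin (the protocol allows any rule based on past data).

def collect(n_rounds, K=3, pi=0.5, seed=0):
    rng = random.Random(seed)
    data = []
    for n in range(n_rounds):
        # Nature samples all K+1 potential outcomes (index 0 is control).
        y = [int(rng.random() < 0.4 + 0.1 * a) for a in range(K + 1)]
        # Statistician picks a treatment arm A_n in {1, ..., K}.
        a_n = 1 + (n % K)
        # Randomize treatment vs. control with propensity pi.
        z_n = int(rng.random() < pi)
        # Only one potential outcome is ever observed.
        y_obs = y[a_n] if z_n else y[0]
        data.append((a_n, z_n, y_obs))
    return data

for a_n, z_n, y_obs in collect(6):
    print(a_n, z_n, y_obs)
```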
Three causal identification assumptions are required:
- SUTVA: $Y^{\text{obs}}_n(A_n) = Z_n Y_n(A_n) + (1 - Z_n) Y_n(0)$ — no interference across units.
- No confounding: $(Y_n(0), \ldots, Y_n(K), A_n) \perp\!\!\!\perp Z_n$.
- Positivity: $\pi \in (0, 1)$ — both treatment and control have positive probability.
The Horvitz-Thompson estimator
For each arm $a$, the Horvitz-Thompson estimator is:
$$\hat{\psi}_n(a) = \frac{1}{N_a(n)} \sum_{i=1}^n \mathbf{1}\{A_i = a\} \left(\frac{Z_i Y^{\text{obs}}_i(A_i)}{\pi} - \frac{(1-Z_i) Y^{\text{obs}}_i(A_i)}{1-\pi}\right),$$
where $N_a(n) = \sum_{i=1}^n \mathbf{1}\{A_i = a\}$.
This is an unbiased estimator of the average treatment effect: $E^{\text{RCT}}[\hat{\psi}_n(a)] = \psi_P(a)$.
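A quick Monte Carlo check makes the unbiasedness tangible. The per-unit Horvitz-Thompson term below matches the summand in the display above; the toy distribution with known means (uniform control and treated outcomes) is an assumption of this sketch, standing in for the unknown $P$.

```python
import random

# Monte Carlo check that the per-unit Horvitz-Thompson term
#   Z * Y_obs / pi  -  (1 - Z) * Y_obs / (1 - pi)
# is unbiased for psi_P(a) = E[Y(a) - Y(0)]. The toy outcome
# distributions (with known means) are assumptions of this sketch.

def ht_term(z, y_obs, pi):
    return z * y_obs / pi - (1 - z) * y_obs / (1 - pi)

def simulate(pi=0.3, n=200_000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y0 = rng.random() * 0.5        # control outcome, mean 0.25
        y1 = 0.5 + rng.random() * 0.5  # treated outcome, mean 0.75
        z = int(rng.random() < pi)
        y_obs = y1 if z else y0
        total += ht_term(z, y_obs, pi)
    return total / n

# True ATE is 0.75 - 0.25 = 0.5; the HT average lands close to it.
print(simulate())
```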
Under bounded outcomes, $\hat{\psi}_n(a) \in [-1/(1-\pi), 1/\pi]$. Applying the transformation $x \mapsto \pi(1 + x(1-\pi))$ maps both $\hat{\psi}_n(a)$ and $\delta$ to the unit interval:
$$\tilde{\psi}_n(a) := \pi(1 + \hat{\psi}_n(a)(1-\pi)), \quad \tilde{\delta} := \pi(1 + \delta(1-\pi)).$$
The e-process
With $d = 1$, define the $\mathcal{P}^{(\delta)}$-e-values:
$$E_n := \left(1, \frac{\tilde{\psi}_n(A_n)}{\tilde{\delta}}\right).$$
The test statistic is:
$$W^{(\delta)}_n = \prod_{a \in \mathcal{A}} \exp\left\{\max_{\lambda \in [0,1]} \sum_{i=1}^n \mathbf{1}\{A_i = a\} \log\left(1 - \lambda + \lambda \frac{\tilde{\psi}_i(A_i)}{\tilde{\delta}}\right) - R^{\text{CO96}}_{N_a(n)}\right\}.$$
This is a $\mathcal{P}^{(\delta)}$-e-process that can be fed into SPRUCE for arm selection.
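The per-arm maximized log-wealth inside the exponent can be computed by a simple grid search over the betting fraction $\lambda$. Two simplifications in this sketch: the regret term $R^{\text{CO96}}$ is omitted, and the e-value ratios $\tilde{\psi}_i / \tilde{\delta}$ are taken as given inputs.

```python
import math

# Sketch of the per-arm maximized log-wealth inside the e-process, via
# grid search over lambda in [0, 1]. The R^CO96 regret term is omitted,
# and the ratios psi_tilde_i / delta_tilde are supplied directly (both
# simplifications of this sketch).

def max_log_wealth(ratios, grid=1001):
    """max over lambda of sum_i log(1 - lambda + lambda * r_i)."""
    best = float("-inf")
    for k in range(grid):
        lam = k / (grid - 1)
        try:
            total = sum(math.log(1 - lam + lam * r) for r in ratios)
        except ValueError:   # log of a non-positive wealth: invalid bet
            continue
        best = max(best, total)
    return best

# Ratios of exactly 1 carry no evidence, so the maximizer is lambda = 0
# and the log-wealth is zero; ratios centred above 1 reward betting.
print(max_log_wealth([1.0, 1.0, 1.0]))         # prints 0.0
print(max_log_wealth([1.4, 0.9, 1.3]) > 0.0)   # prints True
```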
Corollary 6.2: guarantees carry through
The e-process inherits all the guarantees from the main theorems:
Optimal growth rate:
$$\lim_{n \to \infty} \frac{1}{n} \log W^{(\delta)}_n = \max_{(a,\lambda) \in \mathcal{A} \times [0,1]} E_Q^{\text{RCT}}\!\left[\log\left(1 - \lambda + \lambda \frac{\tilde{\psi}_1(a)}{\tilde{\delta}}\right)\right] \quad \text{Q-almost surely.}$$
Optimal expected rejection time:
$$\lim_{\alpha \to 0^+} \frac{E_Q^{\text{RCT}}[\tau_\alpha]}{\log(1/\alpha)} = \left(\max_{(a,\lambda) \in \mathcal{A} \times [0,1]} E_Q^{\text{RCT}}\!\left[\log\left(1 - \lambda + \lambda \frac{\tilde{\psi}_1(a)}{\tilde{\delta}}\right)\right]\right)^{-1}.$$
Unimprovable: no other arm-selection or portfolio-choosing rule can do better.
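The shared quantity on the right-hand side can be estimated by Monte Carlo: draw ratios from each arm under a toy $Q$, average the log-increment over a grid of $\lambda$ values, and take the max over $(a, \lambda)$. The per-arm ratio distributions below are assumptions of this sketch, standing in for $\tilde{\psi}_1(a)/\tilde{\delta}$ under the true alternative.

```python
import math
import random

# Monte Carlo sketch of the optimal Kelly growth rate:
#   max over (arm, lambda) of E_Q[log(1 - lambda + lambda * ratio)].
# The toy per-arm ratio distributions are assumptions of this sketch.

def growth_rate(sample_ratio, arms, n=20_000, lam_grid=51, seed=2):
    rng = random.Random(seed)
    best = float("-inf")
    for a in arms:
        draws = [sample_ratio(a, rng) for _ in range(n)]
        for k in range(lam_grid):
            lam = k / (lam_grid - 1)
            avg = sum(math.log(1 - lam + lam * r) for r in draws) / n
            best = max(best, avg)
    return best

# Toy Q: arm 2's ratios are centred above 1, so it drives the growth rate;
# its reciprocal is the asymptotic rejection-time constant.
def sample_ratio(a, rng):
    centre = {1: 1.0, 2: 1.3}[a]
    return centre + (rng.random() - 0.5) * 0.4

rate = growth_rate(sample_ratio, arms=[1, 2])
print(rate > 0.0)   # prints True: a positive growth rate under this Q
```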
Figure 3 shows empirical growth rates and stopping time distributions for this setup. SPRUCE again tracks the oracle arm closely, while Round Robin and Random Selection lag behind.
Conclusions
The paper’s closing summary captures the main contributions:
- Generalized testing by betting to a multi-arm setting where the statistician chooses which data source to sample from at each step.
- Global nulls: test that all arms satisfy a property, reject if at least one doesn’t.
- Type-I error control is free under multi-arm data collection — any arm-selection rule preserves validity.
- Power requires new algorithms: the paper defines multi-armed log-optimality and proposes SPRUCE to achieve it.
- Matching bounds: both growth rate and expected rejection time achieve matching lower and upper bounds in the high-confidence regime.
- The same quantity governs both: the optimal expected log-increment (Kelly growth rate) determines both the fastest achievable growth rate and the shortest achievable expected rejection time.
The paper sits at a rich intersection of three fields — sequential hypothesis testing, multi-armed bandits, and causal inference — and its techniques (especially the concentration inequalities for Kelly growth rates) may find applications beyond the specific testing problem studied here.
The full series
This concludes our walkthrough of Sandoval, Waudby-Smith, and Jordan (2026):
- STB 101 — Single-arm foundation: e-processes, test supermartingales, Ville’s inequality, log-optimality
- STB 102 — Multi-arm protocol: partial information, global nulls, the oracle-history comparator
- STB 201 — SPRUCE algorithm: portfolio regret, allocation regret, achieving log-optimality
- STB 202 — Rejection time: matching lower and upper bounds with exact constants
- STB 203 — Concentration inequalities: sub-exponential log-increments, regret-based confidence bounds
- STB 204 — Treatment effect testing and conclusions
Key takeaways
- SPRUCE is practical: it combines a computationally efficient e-process ($W^{\text{CO96}}$) with a UCB-style arm selection rule that doesn’t require knowing optimal portfolios.
- The causal inference connection is real: the treatment effect testing example shows how SPRUCE plugs into standard RCT designs with Horvitz-Thompson estimation.
- The results are tight: matching lower and upper bounds with exact constants mean there’s no room for asymptotic improvement.
- The concentration inequalities are reusable: Proposition 5.2 and Lemma 3.8 could be applied to other problems involving Kelly-optimal wealth growth.
References
- Sandoval, Waudby-Smith, and Jordan (2026). Multi-Armed Sequential Hypothesis Testing by Betting.
- Horvitz and Thompson (1952). A generalization of sampling without replacement from a finite universe.
- Rubin (1974, 1978). The potential outcomes framework and Bayesian inference for causal effects.
- Imbens and Rubin (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction.