In the previous post, we covered the single-arm foundation of sequential testing by betting. Now we turn to the core innovation of Sandoval, Waudby-Smith, and Jordan (2026): what happens when the statistician must choose among multiple arms at each time step, observing only the outcome of the arm they selected?

This partial-information setting is where the paper gets interesting — and where it connects to multi-armed bandits and causal inference.

The multi-armed data collection protocol

At each time step $n$, the following happens:

  1. Nature samples a $K$-vector of outcomes: $(Y_n(1), \ldots, Y_n(K)) \sim P$.
  2. The statistician chooses an arm $A_n \in \{1, \ldots, K\}$ based on all previously observed data $(A_i, Y_i(A_i))_{i=1}^{n-1}$.
  3. The statistician observes only $Y_n(A_n)$ — the outcomes for all other arms $Y_n(a)$ with $a \neq A_n$ remain unobserved.

This is the classic partial information (or “bandit feedback”) setting: you only see the outcome of the action you took, not the counterfactual outcomes for actions you didn’t take. Think of a pharmaceutical company testing multiple treatment dosages — at each patient, they assign one dosage and observe only that patient’s response, never learning what would have happened under a different dose.
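To make the protocol concrete, here is a minimal simulation sketch in Python. The Bernoulli outcome distributions and the round-robin selection rule are placeholders invented for illustration, not anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                              # number of arms
true_means = [0.5, 0.5, 0.7]       # hypothetical outcome distributions, unknown to the statistician
n_rounds = 10

history = []                       # the statistician's data: (arm, observed outcome) pairs

for n in range(n_rounds):
    # 1. Nature samples a full K-vector of outcomes (never fully revealed).
    outcomes = rng.binomial(1, true_means)

    # 2. The statistician picks an arm using only past (A_i, Y_i(A_i)) pairs.
    #    Placeholder rule: round-robin; any rule measurable in the observed past is allowed.
    arm = n % K

    # 3. Only the chosen arm's outcome is observed; the others stay counterfactual.
    history.append((arm, int(outcomes[arm])))

print(history)
```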

The history-oracle filtration

A key technical device in the paper is the history-oracle filtration $\mathcal{H} = (\mathcal{H}_n)_{n \in \mathbb{N}_0}$, defined as:

$$\mathcal{H}_n := \sigma\big((Y_i(1), \ldots, Y_i(K))_{i=1}^n\big).$$

This is a mathematical construct that assumes access to all outcomes across all arms — something the statistician never actually has. A process $(X_n)_{n \in \mathbb{N}}$ is called $\mathcal{H}$-predictable if each $X_n$ is $\mathcal{H}_{n-1}$-measurable.

Why introduce a filtration the statistician can’t actually use? Because it serves as a comparator benchmark. The paper’s main results show that e-processes built from partial information can achieve the same asymptotic growth rates as processes that do have oracle access to the full history.

Global null hypotheses

The paper focuses on testing a global null hypothesis. Here’s the setup:

  • Let $\mathcal{P}_1, \ldots, \mathcal{P}_K$ be sub-collections of distributions, where $\mathcal{P}_a$ encodes some property of $Y_1(a)$ (e.g., $E_P[Y_1(a)] = \mu_a$) but is uninformative about the other arms.
  • The global null is the intersection over the set of arms $\mathcal{A} = \{1, \ldots, K\}$: $\mathcal{P} := \bigcap_{a \in \mathcal{A}} \mathcal{P}_a$.
  • The alternative is the complement: $\mathcal{Q} := \bigcup_{a \in \mathcal{A}} \mathcal{P}_a^c$.

In plain English: the global null says all arms are null (e.g., all dosages are ineffective), and the alternative says at least one arm is non-null (e.g., at least one dosage works).

Type-I error control is easy — power is hard

Here’s a key insight from Proposition 2.6: adaptive multi-armed data collection does not complicate type-I error control. If you construct e-values $E_n(a) = f_n(Y_n(a))$ satisfying $E_P[f_n(Y_n(a)) \mid \mathcal{H}_{n-1}] \leq 1$ for every arm $a$ and every $P$ in the null, then any predictable arm-selection rule $(A_n)_{n \in \mathbb{N}}$ produces a valid test supermartingale:

$$M_n = \prod_{i=1}^n f_i(Y_i(A_i)).$$

Thresholding at $1/\alpha$ gives a level-$\alpha$ sequential test. The validity holds regardless of how you choose arms.
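Here is a minimal sketch of the resulting sequential test, assuming the per-round factors have already been computed; the simple bet $1 + \lambda(Y - \mu_0)$ used to generate them below is just an illustrative choice whose expectation is 1 when $E[Y] = \mu_0$.

```python
import numpy as np

def sequential_test(bets, alpha=0.05):
    """Multiply per-round e-values; reject the global null once wealth reaches 1/alpha.

    `bets` holds the factors f_i(Y_i(A_i)); each must have conditional expectation
    at most 1 under the null given the past, no matter how the arm A_i was chosen.
    """
    wealth = 1.0
    for i, e in enumerate(bets, start=1):
        wealth *= e
        if wealth >= 1.0 / alpha:
            return i            # time of rejection
    return None                 # never rejected

# Illustrative bet for H0: E[Y(a)] = mu0 with Y in [0, 1]. The factor 1 + lam * (y - mu0)
# stays positive for small |lam| and has expectation exactly 1 under the null.
mu0, lam = 0.5, 0.5
rng = np.random.default_rng(1)
ys = rng.binomial(1, 0.8, size=200)          # observations from a (non-null) chosen arm
print(sequential_test(1.0 + lam * (ys - mu0)))
```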

The hard question is: what is a powerful rule for choosing arms? Since you only observe the outcome of the arm you chose, you face an exploration-exploitation tradeoff — do you pull the arm that looks best so far, or explore other arms that might be even better?

The test statistic class

The paper focuses on test supermartingales of the form:

$$W_n = \prod_{i=1}^n \lambda_i^\top E_i(A_i),$$

where:

  • $E_i(A_i)$ is a $(d+1)$-vector of $\mathcal{P}$-e-values for the chosen arm
  • $\lambda_i \in \Delta^d$ is a portfolio (betting weights on the simplex)
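In code, one round of this wealth update is just a dot product between the portfolio and the e-value vector of the chosen arm. A minimal sketch with hypothetical names:

```python
import numpy as np

def portfolio_update(wealth, lam, e_vec):
    """One step of W_n = W_{n-1} * (lam . E_n(A_n)).

    lam   : portfolio weights on the simplex (length d+1, nonnegative, summing to 1)
    e_vec : the (d+1)-vector of P-e-values computed from the chosen arm's outcome
    """
    lam, e_vec = np.asarray(lam, float), np.asarray(e_vec, float)
    assert np.all(lam >= 0) and np.isclose(lam.sum(), 1.0)
    return wealth * float(lam @ e_vec)
```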

This is a rich class that includes several important special cases:

Two-sided bounded mean testing

Test whether the mean of a bounded random variable equals $\mu_0 \in (0, 1)$ across all arms:

$$\mathcal{P}_= = \{P \mid \forall a \in \mathcal{A},\; E_P[Y(a)] = \mu_0\} \quad \text{vs.} \quad \mathcal{Q}_{\neq} = \{P \mid \exists a \in \mathcal{A},\; E_P[Y(a)] \neq \mu_0\}.$$

The test supermartingale takes the form:

$$W_n = \prod_{i=1}^n \left[(1-\lambda_i)\frac{1 - Y_i(A_i)}{1 - \mu_0} + \lambda_i \frac{Y_i(A_i)}{\mu_0}\right].$$
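This is the $d = 1$ instance of the portfolio form, with e-vector $\big((1 - Y_i(A_i))/(1 - \mu_0),\; Y_i(A_i)/\mu_0\big)$ and portfolio $(1 - \lambda_i, \lambda_i)$. A per-round sketch:

```python
def two_sided_factor(y, lam, mu0):
    """Per-round wealth factor for testing E[Y(a)] = mu0 with Y in [0, 1].

    Equals the portfolio (1 - lam, lam) dotted with ((1 - y)/(1 - mu0), y/mu0);
    its conditional expectation is exactly 1 under the null, for any predictable lam.
    """
    return (1.0 - lam) * (1.0 - y) / (1.0 - mu0) + lam * y / mu0
```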

The oracle-history comparator class

To define what “good” means in this setting, the paper introduces the oracle-history comparator class $\mathbb{W}$ — the collection of all e-processes of the portfolio form above (satisfying the paper’s Assumption 1), even those that could not have been generated under partial information. This includes processes that have full access to all arm outcomes at every step.

The ambition is striking: the paper aims to construct e-processes from partial information that are asymptotically as good as any process in $\mathbb{W}$, including those with oracle access.

Multi-armed log-optimality

This leads to the central definition:

A $\mathcal{P}$-e-process $(W_n)_{n \in \mathbb{N}}$ is multi-armed $\mathcal{Q}$-log-optimal if for all $Q \in \mathcal{Q}$ and any other process $\widetilde{W} \in \mathbb{W}$:

$$\liminf_{n \to \infty} \left(\frac{1}{n} \log W_n - \frac{1}{n} \log \widetilde{W}_n\right) \geq 0 \quad Q\text{-almost surely}.$$

This means your growth rate matches the best possible process — even ones that know the optimal arm $a_Q$ and the optimal portfolio $\lambda_Q(a_Q)$ from the start.

The Kelly criterion interpretation

The paper draws a connection to Breiman’s study of “favorable games.” Think of the data collection protocol as a repeated stochastic game where at each step, the gambler must:

  1. Choose which sub-game (arm) to play
  2. Place a bet (portfolio) on that sub-game

Multi-armed log-optimality means the gambler’s growth rate matches someone who knows both which sub-game is most favorable and how to bet optimally on it.
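As a toy illustration of this idea (not the paper's algorithm), the sketch below compares the average log-wealth of an oracle that knows the most favorable arm and its Kelly-optimal $\lambda$ against a naive plug-in gambler that estimates both from the outcomes it observes; as $n$ grows, the per-round gap shrinks. The arm means, the decaying exploration schedule, and the plug-in rule are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

mu0 = 0.5
true_means = np.array([0.5, 0.55, 0.8])      # hypothetical Bernoulli arms; the last one is most favorable
K, n_rounds, eps = len(true_means), 20000, 0.01

def factor(y, lam):
    # Two-sided bounded-mean wealth factor from the earlier section.
    return (1 - lam) * (1 - y) / (1 - mu0) + lam * y / mu0

# Oracle: plays the best arm with its Kelly-optimal lambda (equal to that arm's mean when mu0 = 0.5).
best = int(np.argmax(np.abs(true_means - mu0)))
oracle_lam = true_means[best]

counts, sums = np.ones(K), np.full(K, 0.5)   # tiny prior per arm so empirical means are defined
log_w_plugin = log_w_oracle = 0.0

for n in range(n_rounds):
    outcomes = rng.binomial(1, true_means)   # Nature's full outcome vector (mostly unobserved)

    # Plug-in gambler: explore with decaying probability, otherwise play the arm that looks
    # most non-null, betting lambda = clipped empirical mean (Kelly plug-in for mu0 = 0.5).
    means = sums / counts
    explore = rng.random() < 1.0 / np.sqrt(n + 1)
    arm = int(rng.integers(K)) if explore else int(np.argmax(np.abs(means - mu0)))
    lam = float(np.clip(means[arm], eps, 1 - eps))
    y = outcomes[arm]
    log_w_plugin += np.log(factor(y, lam))
    counts[arm] += 1
    sums[arm] += y

    # Oracle gambler: same round, but already knows the best arm and its optimal bet.
    log_w_oracle += np.log(factor(outcomes[best], oracle_lam))

print("avg log-growth, plug-in:", log_w_plugin / n_rounds)
print("avg log-growth, oracle :", log_w_oracle / n_rounds)
```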

What comes next

Defining log-optimality is one thing — achieving it is another. The next post covers the paper’s main algorithm, SPRUCE (Sublinear Portfolio Regret Upper Confidence Estimation), and the two types of regret that must be controlled:

  • Portfolio regret: how well you bet on each arm
  • Allocation regret: how well you choose which arms to pull

The key result: controlling both yields oracle-like performance.

Key takeaways

  1. Partial information is the challenge: you only see the outcome of the arm you chose, creating an exploration-exploitation tradeoff.
  2. Type-I error control is free: any arm-selection rule works for validity. Power is what requires care.
  3. Global null = all arms null: the alternative is that at least one arm is non-null.
  4. Multi-armed log-optimality means matching the growth rate of an oracle that knows the best arm and best portfolio.
  5. The history-oracle filtration is a mathematical benchmark, not something the statistician can actually use.

References

  • Sandoval, Waudby-Smith, and Jordan (2026). Multi-Armed Sequential Hypothesis Testing by Betting.
  • Cover and Ordentlich (1996). Universal portfolios with side information.
  • Kelly (1956). A new interpretation of information rate.
  • Breiman (1960, 1961). Optimal gambling systems for favorable games.