We’ve defined what it means to be multi-armed log-optimal. Now comes the hard part: actually achieving it. The paper’s main contribution is the SPRUCE algorithm (Sublinear Portfolio Regret Upper Confidence Estimation), which matches the growth rate of an oracle that knows both the best arm and the best portfolio from the start.

Two types of regret to control

The key insight of Proposition 3.5 is that multi-armed log-optimality follows from controlling two types of regret simultaneously:

1. Arm-wise portfolio regret

This measures how well you bet on each individual arm. For arm $a$:

$$R^{\text{Port}}_n(a) := \max_{\lambda \in \Delta^d} \sum_{i=1}^n \mathbf{1}\{A_i = a\} \log\big(\lambda^\top E_i(a)\big) - \log W_n(a).$$

This is just the single-arm portfolio regret, but restricted to the rounds where arm $a$ was actually pulled. If you use Cover’s universal portfolio algorithm on each arm, you get arm-wise portfolio regret bounded by $\frac{d \log(N_a(n) + 1)}{2} + \log(2)$, where $N_a(n)$ is the number of times arm $a$ has been pulled.

This part is easy. Any single-arm log-optimal betting algorithm works arm-by-arm.
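As a quick concrete check, the Cover-Ordentlich bound quoted above is a one-line function of the pull count (Python, standard library only; `d` is the number of e-values per arm):

```python
import math

def cover_ordentlich_regret(n_pulls: int, d: int) -> float:
    """R^CO96_N = d * log(N + 1) / 2 + log(2)."""
    return d * math.log(n_pulls + 1) / 2 + math.log(2)

# The bound grows only logarithmically, so the per-round regret
# R^CO96_N / N vanishes as an arm accumulates pulls.
print(cover_ordentlich_regret(100, d=2))
```

Because the bound is logarithmic in $N_a(n)$, the arm-wise portfolio regret is automatically sublinear no matter how pulls are allocated across arms.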

2. Portfolio-oracle allocation regret

This measures how well you choose which arms to pull:

$$R^{\text{Alloc}}_{n,Q} := n \cdot E_Q\!\left[\log\big(\lambda_Q(a_Q)^\top E(a_Q)\big)\right] - \sum_{i=1}^n E_Q\!\left[\log\big(\lambda_Q(A_i)^\top E(A_i)\big)\right].$$

The first term is what you’d get if you always pulled the optimal arm $a_Q$ with its optimal portfolio $\lambda_Q(a_Q)$. The second term is what you actually get, pulling arms $A_1, \ldots, A_n$ with their respective optimal portfolios.

The “portfolio-oracle” qualifier is crucial: the algorithm doesn’t know $\lambda_Q(a)$ for any arm, but the regret is measured as if it did. This is the source of all technical difficulty.
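To see the structure concretely: writing $g_Q(a) := E_Q[\log(\lambda_Q(a)^\top E(a))]$ for the oracle growth rate of arm $a$, the allocation regret collapses to a pull-count-weighted sum of per-arm gaps, since each round contributes the gap of whichever arm was pulled. A minimal sketch in Python (the growth rates below are hypothetical):

```python
import numpy as np

def allocation_regret(growth_rates, pull_counts):
    """R^Alloc = sum_a N_a(n) * (max_b g(b) - g(a)):
    only pulls of suboptimal arms contribute."""
    g = np.asarray(growth_rates, dtype=float)
    n_a = np.asarray(pull_counts, dtype=float)
    return float(np.sum(n_a * (g.max() - g)))

# Hypothetical example: arm 0 is optimal, so only the 30 pulls
# of arms 1 and 2 contribute to the regret.
print(allocation_regret([0.05, 0.02, 0.00], [70, 20, 10]))
```

This form makes the difficulty plain: keeping the regret sublinear means pulling suboptimal arms only $o(n)$ times, without ever being able to evaluate $g_Q(a)$ directly.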

The decomposition

The proof sketch of Proposition 3.5 decomposes the gap between your growth rate and the oracle’s into three pieces:

  1. Allocation regret — vanishes if sublinear
  2. Portfolio regret — vanishes if sublinear (arm-by-arm)
  3. Concentration error — vanishes by the strong law of large numbers applied separately to each arm

If both regrets are sublinear, your asymptotic growth rate matches the oracle’s exactly.
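Schematically (suppressing constants and the precise error terms), the three pieces line up as:

$$\underbrace{\frac{1}{n}\log W^\star_{n,Q} - \frac{1}{n}\log W_n}_{\text{gap to the oracle}} \;\le\; \frac{R^{\text{Alloc}}_{n,Q}}{n} \;+\; \frac{1}{n}\sum_{a=1}^K R^{\text{Port}}_n(a) \;+\; \varepsilon_n,$$

where $\varepsilon_n \to 0$ $Q$-almost surely by the strong law of large numbers applied per arm. This is a schematic rendering of the proof sketch rather than the paper's exact inequality, but it shows why sublinearity of both regrets is exactly what is needed: divided by $n$, the first two terms vanish, and the third vanishes on its own.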

Why off-the-shelf bandit algorithms don’t work

Standard UCB algorithms assume you can observe the reward $\log(\lambda_Q(A_i)^\top E(A_i))$ after pulling arm $A_i$. But $\lambda_Q(a)$ is unknown — it’s the log-optimal portfolio for arm $a$, which depends on the true alternative distribution $Q$. You can’t compute it.

Even worse, the log-increments $\log(\lambda_Q(a)^\top E(a))$ can take arbitrarily large negative values, so they’re not sub-Gaussian — another assumption that standard bandit analyses rely on.

The SPRUCE algorithm

SPRUCE solves this by constructing upper confidence bounds on the expected log-increment using only observable quantities. Here’s how it works:

For each arm $a$, define:

$$\text{UCB}_a(n) := \max_{\lambda \in \Delta^d} \frac{1}{N_a(n-1)} \sum_{i=1}^{n-1} \mathbf{1}\{A_i = a\} \log\big(\lambda^\top E_i(a)\big) + \sqrt{\frac{8b\gamma \log(\zeta n + 1)}{N_a(n-1)}} + \frac{4\gamma \log(\zeta n + 1)}{N_a(n-1)} + \frac{R^{\text{CO96}}_{N_a(n-1)}}{N_a(n-1)}.$$

The first term is the best empirical average log-increment achievable in hindsight by any constant portfolio on arm $a$'s past data. The remaining terms form a confidence bonus that shrinks as the arm is pulled more often; the parameters $\gamma > 2$ and $\zeta > 0$ tune the width of that bonus, and hence how aggressively the algorithm explores.
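A numerical sketch of this index for $d = 2$ (Python with `numpy`; grid search over the simplex stands in for the exact inner maximization, and the parameter values here are illustrative, not the paper's):

```python
import numpy as np

def ucb_index(arm_evals, n, b=1.0, gamma=2.5, zeta=1.0, grid=2001):
    """UCB for one arm, given its observed e-value vectors (shape (N_a, 2)).
    The inner max over the simplex is approximated by a 1-D grid (d = 2)."""
    E = np.asarray(arm_evals, dtype=float)
    n_a = len(E)
    lam = np.linspace(0.0, 1.0, grid)[:, None]   # weight on coordinate 0
    mix = lam * E[:, 0] + (1 - lam) * E[:, 1]    # shape (grid, N_a)
    emp_max = np.log(np.maximum(mix, 1e-300)).sum(axis=1).max() / n_a
    log_term = np.log(zeta * n + 1)
    bonus = np.sqrt(8 * b * gamma * log_term / n_a) + 4 * gamma * log_term / n_a
    r_co96 = (2 * np.log(n_a + 1) / 2 + np.log(2)) / n_a   # Cover-Ordentlich, d = 2
    return float(emp_max + bonus + r_co96)
```

Arms with few pulls get a large bonus, so the index forces every arm to be revisited until the empirical term dominates.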

Algorithm steps:

  1. Pull each arm once (initialization)
  2. For $n = K+1, \ldots$:
    • Select $A_n = \arg\max_a \text{UCB}_a(n)$
    • Observe $Y_n(A_n)$ and construct e-values
    • Update the e-process
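The loop above can be sketched end to end in a toy simulation. Everything below the SPRUCE skeleton is hypothetical: the e-value distributions are invented just to exercise the loop, and $b$, $\gamma$, $\zeta$ are illustrative values, not tuned choices:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, b, gamma, zeta = 3, 2, 1.0, 2.5, 1.0
lam = np.linspace(0.0, 1.0, 201)
evals = {a: [] for a in range(K)}

def draw_e(arm):
    # Hypothetical arm-dependent e-value vectors (exponential draws).
    means = [(1.00, 1.00), (1.02, 0.98), (1.10, 0.90)]
    return rng.exponential(means[arm])

def ucb(arm, n):
    E = np.asarray(evals[arm])
    n_a = len(E)
    mix = lam[:, None] * E[:, 0] + (1 - lam[:, None]) * E[:, 1]
    emp = np.log(np.maximum(mix, 1e-300)).sum(axis=1).max() / n_a
    lt = np.log(zeta * n + 1)
    bonus = np.sqrt(8 * b * gamma * lt / n_a) + 4 * gamma * lt / n_a
    r_co96 = (d * np.log(n_a + 1) / 2 + np.log(2)) / n_a
    return emp + bonus + r_co96

for a in range(K):                      # initialization: pull each arm once
    evals[a].append(draw_e(a))
for n in range(K + 1, 300):             # then follow the UCB index
    a_n = max(range(K), key=lambda a: ucb(a, n))
    evals[a_n].append(draw_e(a_n))

counts = [len(evals[a]) for a in range(K)]
print(counts)
```

The grid search over `lam` is a stand-in for the exact inner maximization; in higher dimensions one would use a proper convex solver.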

Two e-process constructions

SPRUCE offers two options for constructing the e-process:

$W^{\text{UP}}$ — Universal portfolio

$$W^{\text{UP}}_n := \prod_{i=1}^n \lambda^{\text{UP}}_i(A_i)^\top E_i(A_i),$$

where $\lambda^{\text{UP}}_i$ is Cover’s universal portfolio, computed as a mixture over all portfolios weighted by their past performance. This is a test supermartingale with guaranteed sublinear portfolio regret.
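For $d = 2$ the mixture has a simple numerical form: the next portfolio weight is the wealth-weighted average of all constant portfolios' weights. A sketch with a uniform prior and grid quadrature (the construction in the literature may use a different prior, e.g. Dirichlet$(1/2)$):

```python
import numpy as np

def universal_portfolio_weight(past_evals, grid=2001):
    """Next weight on coordinate 0 under Cover's universal portfolio (d = 2):
    average of all constant portfolios, weighted by their wealth so far."""
    lam = np.linspace(0.0, 1.0, grid)
    if len(past_evals) == 0:
        return 0.5  # prior mean before any data
    E = np.asarray(past_evals, dtype=float)                # shape (i-1, 2)
    per_round = lam[:, None] * E[:, 0] + (1 - lam[:, None]) * E[:, 1]
    log_wealth = np.log(np.maximum(per_round, 1e-300)).sum(axis=1)
    w = np.exp(log_wealth - log_wealth.max())              # stabilize the exp
    return float(np.sum(lam * w) / np.sum(w))
```

After a round where coordinate 0 paid off (e-value 2.0 vs 0.5), the weight tilts above the prior mean of 0.5, exactly the "weighted by past performance" behavior described above.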

$W^{\text{CO96}}$ — Regret-based lower bound

$$W^{\text{CO96}}_n := \prod_{a=1}^K \exp\left\{\max_{\lambda \in \Delta^d} \sum_{i=1}^n \mathbf{1}\{A_i = a\} \log\big(\lambda^\top E_i(a)\big) - R^{\text{CO96}}_{N_a(n)}\right\},$$

where $R^{\text{CO96}}_N = \frac{d \log(N+1)}{2} + \log(2)$ is the Cover-Ordentlich regret bound.

This is an e-process (not a test supermartingale) because it’s a pathwise lower bound on $W^{\text{UP}}$. The advantage: the maximum in the exponential can be computed efficiently via root-finding, while the integral in $\lambda^{\text{UP}}_n$ can be computationally expensive and numerically unstable.
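A sketch of the $W^{\text{CO96}}$ computation for $d = 2$, with grid search standing in for the root-finding step the paper uses:

```python
import numpy as np

def log_wealth_co96(evals_by_arm, d=2, grid=2001):
    """log W^CO96_n: per arm, the best constant portfolio's log-wealth in
    hindsight minus the Cover-Ordentlich regret bound, summed over arms."""
    lam = np.linspace(0.0, 1.0, grid)
    total = 0.0
    for arm_evals in evals_by_arm:
        E = np.asarray(arm_evals, dtype=float)
        n_a = len(E)
        if n_a == 0:
            continue  # arms never pulled contribute nothing
        per_round = lam[:, None] * E[:, 0] + (1 - lam[:, None]) * E[:, 1]
        best = np.log(np.maximum(per_round, 1e-300)).sum(axis=1).max()
        total += best - (d * np.log(n_a + 1) / 2 + np.log(2))
    return float(total)
```

Note that only a per-arm maximization over observed data is needed, with no mixture integral, which is why this construction is the computationally cheaper of the two.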

The main theorem

Theorem 3.7 states that both e-processes constructed by SPRUCE are multi-armed $Q$-log-optimal. Specifically:

  1. Asymptotic equivalence: $\frac{1}{n} \log W_n - \frac{1}{n} \log W^\star_{n,Q} \to 0$ Q-almost surely, where $W^\star_{n,Q}$ is the oracle process.
  2. Optimal growth rate: $\lim_{n \to \infty} \frac{1}{n} \log W_n = \max_{(a,\lambda)} E_Q[\log(\lambda^\top E_1(a))]$ Q-almost surely.
  3. Unimprovable: no other process (even with oracle access to the full history) can exceed this growth rate.

Empirical performance

The paper’s Figure 1 shows empirical growth rates under “easy” and “hard” data generating processes. SPRUCE converges to the oracle arm’s growth rate, while Round Robin and Random Selection lag behind — their growth rates are positive but strictly suboptimal.

Figure 2 shows the distribution of stopping times at $\alpha = 0.001$. SPRUCE’s rejection time distribution closely matches the oracle arm’s, while the baselines are slower and more variable.

The concentration inequality behind it all

Lemma 3.8 provides a nonasymptotic concentration bound on the deviation of $\frac{1}{n} \log W_n$ from the optimal growth rate. The bound has four components:

  1. Allocation regret tail — vanishes if regret is sublinear
  2. Portfolio regret tail — vanishes if regret is sublinear
  3. Sub-exponential concentration terms — summable over $n$
  4. Exponential tail — summable over $n$

When both regrets are sublinear, the entire bound vanishes as $n \to \infty$, proving almost sure convergence.

Key takeaways

  1. Two regrets, one goal: control portfolio regret (betting well) and allocation regret (choosing well) to achieve oracle-like performance.
  2. SPRUCE is a modified UCB: it constructs confidence bounds on unobservable expected log-increments using empirical maxima plus regret-based bonuses.
  3. Two constructions: $W^{\text{UP}}$ (universal portfolio, a test supermartingale) and $W^{\text{CO96}}$ (regret-based lower bound, an e-process) — the latter is computationally cheaper.
  4. The result is tight: no process, even with full history access, can grow faster asymptotically.
  5. The concentration inequality (Lemma 3.8) is the technical engine — it may be of independent interest for other problems involving Kelly-optimal wealth growth.

References

  • Sandoval, Waudby-Smith, and Jordan (2026). Multi-Armed Sequential Hypothesis Testing by Betting.
  • Cover and Ordentlich (1996). Universal portfolios with side information.
  • Orabona and Jun (2022). Sharp confidence sequences via regret bounds.
  • Waudby-Smith et al. (2025). Log-optimality of e-processes.
  • Auer, Cesa-Bianchi, and Fischer (2002). Finite-time analysis of the multiarmed bandit problem (UCB).