Designing a Bias-Aware Algorithm
Now that we have established the utility shift model (where markets share most parameters but differ on a sparse subset $s_0$), we need an algorithm that can exploit this structure.
The paper introduces Transfer Joint Assortment-Pricing (TJAP), an episodic, UCB-style bandit algorithm. TJAP orchestrates three distinct components to balance variance reduction, bias control, and exploration.
1. Aggregate-then-Debias Estimation
How do you find the shifted parameters without knowing which ones they are? TJAP uses a two-step estimation pipeline at the start of each episode $m$:
Step A (Aggregate): Pool all the data from the $H$ source markets during the previous episode. Fit an aggregate parameter estimate $\hat{\nu}^{(ag)}$ using Maximum Likelihood Estimation.
- Why: This step drives down variance massively because of the large sample size. For the shared coordinates, $\hat{\nu}^{(ag)}$ is highly accurate.
- The catch: For the $s_0$ shifted coordinates, $\hat{\nu}^{(ag)}$ is biased.
Step B (Debias): Freeze the aggregate estimate, and use the target-market data from the previous episode to fit a sparse correction vector $\hat{\delta}$. This is done via $\ell_1$-penalized maximum likelihood (analogous to Lasso regression) on the target-data residuals.
- Why: The $\ell_1$ penalty encourages $\delta$ to be sparse (mostly zeros). It will only “activate” on coordinates where the target data strongly disagrees with the pooled estimate.
- The final estimate: $\hat{\nu} = \hat{\nu}^{(ag)} + \hat{\delta}$.
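The two-step pipeline can be sketched on a toy problem. This is a minimal illustration, not the paper's estimator: it swaps the MNL likelihood for squared loss, so Step A becomes pooled least squares and Step B becomes a Lasso fit solved by proximal gradient (ISTA) on the target residuals. All dimensions, sample sizes, and the penalty level are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s0, H = 10, 2, 5                       # dimension, # shifted coords, # source markets
nu = rng.normal(size=d)                   # shared parameter vector
shift = np.zeros(d)
shift[:s0] = 1.5                          # sparse shift: target differs on coords 0, 1

# Source markets share nu; the target market follows nu + shift.
X_src = rng.normal(size=(H * 500, d))
y_src = X_src @ nu + 0.1 * rng.normal(size=H * 500)
X_tgt = rng.normal(size=(60, d))
y_tgt = X_tgt @ (nu + shift) + 0.1 * rng.normal(size=60)

# Step A (Aggregate): pooled fit on all source data.
# Low variance thanks to the large sample, but biased on the shifted coords.
nu_ag = np.linalg.lstsq(X_src, y_src, rcond=None)[0]

# Step B (Debias): freeze nu_ag, fit a sparse correction to the target
# residuals by l1-penalized least squares (ISTA / proximal gradient).
r = y_tgt - X_tgt @ nu_ag
lam = 0.05                                # l1 penalty level (illustrative)
L = np.linalg.norm(X_tgt.T @ X_tgt, 2) / len(X_tgt)   # gradient Lipschitz constant
delta = np.zeros(d)
for _ in range(1000):
    grad = X_tgt.T @ (X_tgt @ delta - r) / len(X_tgt)
    z = delta - grad / L
    delta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold

nu_hat = nu_ag + delta                    # final estimate: aggregate + correction
print(np.flatnonzero(np.abs(delta) > 0.5))  # coords where the penalty "activated"
```

On this toy instance the penalty activates only on the truly shifted coordinates, while the shared coordinates inherit the low-variance pooled estimate.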
2. Optimistic Decision Making (Two-Radius UCB)
Standard contextual bandits construct an Upper Confidence Bound (UCB) around the estimated reward and pick the arm that maximizes it. TJAP does the same, but with two major complications: decisions are joint (assortment + continuous prices), and the confidence interval must account for transfer bias.
TJAP constructs an optimistic utility envelope for every product $i$ at time $t$ that takes the form:
$$ \tilde{v}_{it}(p_{it}) = \underbrace{\langle x_{it}, \hat{\theta} \rangle - \langle x_{it}, \hat{\gamma} \rangle p_{it}}_{\text{point estimate}} + \underbrace{\alpha \|x_{it}\|_{W^{-1}}}_{\text{variance radius}} + \underbrace{\beta \|x_{it}\|_1}_{\text{bias radius}} $$

- Variance Radius ($\alpha$): A standard self-normalized confidence bound tracking statistical uncertainty. It is built from the massive pooled information matrix $W$, so this radius shrinks very quickly as $H$ grows.
- Bias Radius ($\beta$): A novel term proportional to $s_0 \lambda$ (where $\lambda$ is the Lasso penalty). It represents the irreducible uncertainty on the shifted coordinates that target data hasn’t fully resolved yet.
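Computing the envelope is mechanical once the estimates and the pooled matrix $W$ are in hand. A minimal sketch, with all symbols hypothetical and the radii $\alpha, \beta$ treated as given constants rather than derived from the theory:

```python
import numpy as np

def optimistic_utility(x, p, theta_hat, gamma_hat, W, alpha, beta):
    """Point estimate + variance radius (||x||_{W^-1}) + bias radius (||x||_1)."""
    point = x @ theta_hat - (x @ gamma_hat) * p
    variance_radius = alpha * np.sqrt(x @ np.linalg.solve(W, x))
    bias_radius = beta * np.abs(x).sum()
    return point + variance_radius + bias_radius

# Toy check: the envelope upper-bounds the point estimate, and the variance
# radius shrinks as the pooled information matrix W grows.
rng = np.random.default_rng(1)
x = rng.normal(size=4)
theta_hat, gamma_hat = rng.normal(size=4), np.abs(rng.normal(size=4))
W = 100.0 * np.eye(4)               # a well-conditioned pooled design
u = optimistic_utility(x, 2.0, theta_hat, gamma_hat, W, alpha=1.0, beta=0.1)
```

Note that scaling $W$ up (more pooled data) shrinks only the variance radius; the bias radius depends on $\beta$, which only target data can reduce.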
The algorithm then maximizes the expected revenue $\tilde{R}_t(S, p)$ over all valid assortments and continuous price vectors using this optimistic utility. Crucially, the authors prove that optimism at the utility level translates to optimism at the revenue level, even after the non-linear MNL transformation.
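To make the decision step concrete, here is a brute-force sketch of the revenue maximization: the optimistic utilities are taken as given, the continuous price optimization is replaced by a small price grid, and assortments of size at most $K$ are enumerated. This is purely illustrative of the objective, not the paper's solver.

```python
from itertools import combinations, product
import numpy as np

def mnl_revenue(prices, utilities):
    """Expected revenue under MNL choice: sum_i p_i * P(choose i),
    where P(choose i) = exp(v_i) / (1 + sum_j exp(v_j))."""
    w = np.exp(utilities)
    return float(np.dot(prices, w) / (1.0 + w.sum()))

def best_assortment(util_fn, n_products, K, price_grid):
    """Enumerate all assortments |S| <= K and per-product grid prices,
    maximizing the (optimistic) expected revenue."""
    best_rev, best_S, best_p = -np.inf, None, None
    for k in range(1, K + 1):
        for S in combinations(range(n_products), k):
            for prices in product(price_grid, repeat=k):
                p = np.array(prices)
                v = np.array([util_fn(i, p_i) for i, p_i in zip(S, p)])
                rev = mnl_revenue(p, v)
                if rev > best_rev:
                    best_rev, best_S, best_p = rev, S, p
    return best_rev, best_S, best_p

# Toy optimistic utility: product-specific intercept minus price sensitivity.
a = np.array([1.0, 0.5, 2.0])
util = lambda i, p: a[i] - 0.8 * p
rev, S, p = best_assortment(util, n_products=3, K=2,
                            price_grid=[0.5, 1.0, 1.5, 2.0])
```

Enumeration scales exponentially in $K$; the point here is only to show how the optimistic utilities feed through the MNL choice probabilities into a joint assortment-and-price objective.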
3. Episodic Information-Geometry Control
Bandit algorithms suffer if they stop exploring before they’ve gathered enough information across all dimensions of the feature space. In TJAP, the target data must be rich enough to identify the sparse shifts via the $\ell_1$-debiasing step.
This requires the target-market Fisher information matrix to satisfy a Restricted Eigenvalue (RE) condition. TJAP ensures this by operating in doubling episodes. During most of the episode, it greedily exploits the UCB estimates. But at the end of the episode, it checks if the target information matrix is sufficiently “curved.” If it isn’t, the algorithm triggers a short burst of forced random exploration to guarantee identifiability for the next episode’s debiasing step.
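The end-of-episode check can be approximated with a minimum-eigenvalue condition on the empirical information matrix, which is a simple sufficient proxy for RE. A sketch, with the threshold `kappa` and the rank-one "greedy play" scenario both hypothetical:

```python
import numpy as np

def needs_forced_exploration(X_tgt, kappa):
    """Return True if the empirical target information matrix is not
    'curved' enough: min eigenvalue below kappa (a sufficient proxy
    for the Restricted Eigenvalue condition)."""
    info = X_tgt.T @ X_tgt / len(X_tgt)
    return bool(np.linalg.eigvalsh(info)[0] < kappa)  # eigvalsh sorts ascending

rng = np.random.default_rng(2)
# Greedy exploitation can concentrate features on a few directions...
u = rng.normal(size=5)
X_flat = rng.normal(size=(200, 1)) * u      # rank-1: no curvature off span(u)
# ...whereas a burst of random exploration restores identifiability.
X_rich = rng.normal(size=(200, 5))          # isotropic features
```

Here `X_flat` fails the check (its information matrix is singular) and `X_rich` passes it, which is exactly the situation the forced-exploration burst is designed to repair before the next debiasing step.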
By freezing the estimation geometry within an episode and only updating at the episode boundaries, TJAP remains computationally efficient and statistically stable.
In the next post, we will look at the rigorous regret bounds for TJAP, which explicitly quantify the limits of transfer learning in pricing.