The Multi-Market Challenge
Consider a digital platform or retailer expanding into a new geographic location or launching a new product category. In this “target market,” the seller has very little historical data. However, they possess a wealth of data from existing “source markets” (other cities, older categories).
The goal is to maximize revenue in the target market, which requires solving the joint assortment-pricing problem:
- Assortment: Which subset of products should be displayed on the “shelf” (given capacity constraints)?
- Pricing: What price should be charged for each displayed product?
This joint problem is notoriously difficult. Under discrete-choice demand, products act as substitutes. Changing the price of one item affects the demand for every other item on the shelf. If the seller misestimates customer preferences, they will not only pick suboptimal prices but might also select the wrong assortment entirely.
Data from source markets could drastically speed up learning in the target market. But there is a catch: customer preferences differ across markets. If we simply pool all the data together without correcting for these differences, we risk learning the wrong preferences, which propagates systematic bias into our assortment and pricing decisions.
This is the exact problem addressed in the recent paper Transfer Learning for Contextual Joint Assortment-Pricing under Cross-Market Heterogeneity (Chen, Chen & Zhang, 2026). Let’s start by formalizing the environment.
The Contextual MNL Choice Model
In each period $t$, customers arrive with observable contextual information (e.g., product features, user demographics, seasonality). Let $x_{it} \in \mathbb{R}^d$ be the feature vector for product $i$.
In market $h$, the latent utility a customer derives from product $i$ is modeled via a contextual Multinomial Logit (MNL) specification:
$$ v_{it}^{(h)} = \langle x_{it}, \theta^{(h)} \rangle - \langle x_{it}, \gamma^{(h)} \rangle p_{it} $$

where:
- $\theta^{(h)}$ captures the baseline preference for the product’s attributes.
- $\gamma^{(h)}$ captures the price sensitivity.
- $p_{it}$ is the posted price.
If the seller offers an assortment $S_t^{(h)}$ with prices $p_t^{(h)}$, the probability that a customer in market $h$ purchases product $i \in S_t^{(h)}$ follows the standard MNL formulation:
$$ q_t^{(h)}(i \mid S_t^{(h)}, p_t^{(h)}) = \frac{\exp(v_{it}^{(h)})}{1 + \sum_{\ell \in S_t^{(h)}} \exp(v_{\ell t}^{(h)})} $$

(The “1” in the denominator is $\exp(0)$, the exponentiated utility of the outside option: choosing not to buy anything, whose utility is normalized to zero.)
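To make the model concrete, here is a minimal sketch of the contextual MNL choice probabilities. The function name, the synthetic features, and the parameter values are all illustrative, not from the paper:

```python
import numpy as np

def mnl_choice_probs(X, prices, theta, gamma):
    """Purchase probabilities for products in the offered assortment.

    X:      (n, d) feature matrix, one row per displayed product
    prices: (n,) posted prices
    theta:  (d,) baseline preference parameters
    gamma:  (d,) price-sensitivity parameters
    Returns an (n + 1,) vector; the last entry is the outside option.
    """
    # Latent utility: v_i = <x_i, theta> - <x_i, gamma> * p_i
    v = X @ theta - (X @ gamma) * prices
    # The outside option has utility 0, contributing exp(0) = 1.
    w = np.exp(np.append(v, 0.0))
    return w / w.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))           # 3 products on the shelf, d = 4
prices = np.array([1.0, 2.0, 3.0])
theta = rng.normal(size=4)
gamma = np.abs(rng.normal(size=4))    # positive price sensitivity
probs = mnl_choice_probs(X, prices, theta, gamma)
print(probs)                          # all entries positive, summing to 1
```

Note how raising any one product's price lowers its own utility and therefore shifts probability mass to every other product and to the outside option; this is the substitution effect discussed above.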
Stacking the baseline preferences and price sensitivities into a single parameter vector $\nu^{(h)} := (\theta^{(h)}, \gamma^{(h)})$, we see that the structural substitution pattern (the MNL form) is identical across markets, while the preference parameters $\nu^{(h)}$ are allowed to differ.
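The stacking above is more than notation: if we augment each product's features with a price-scaled copy, $z_{it} = (x_{it},\, -p_{it} x_{it})$, the utility becomes a single inner product $\langle z_{it}, \nu^{(h)} \rangle$, i.e., a standard contextual (generalized linear) model. A quick numeric check, with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
x = rng.normal(size=d)                     # product features
p = 2.5                                    # posted price
theta, gamma = rng.normal(size=d), rng.normal(size=d)

v_direct = x @ theta - (x @ gamma) * p     # utility in its original form
z = np.concatenate([x, -p * x])            # augmented feature in R^{2d}
nu = np.concatenate([theta, gamma])        # stacked parameters
v_stacked = z @ nu                         # same utility, single inner product

print(np.isclose(v_direct, v_stacked))     # True
```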
Bandit Feedback and the Regret Objective
The seller’s objective is strictly target-market centric ($h=0$): maximize expected cumulative revenue. The expected revenue in a given period is:
$$ R_t^{(0)}(S, p) = \sum_{i \in S} p_{i} q_t^{(0)}(i \mid S, p) $$

Crucially, learning happens under bandit feedback. The seller only observes the realized purchase—did the customer buy product A, product B, or nothing? They observe neither the latent utilities nor what the customer would have done if a different assortment or price vector had been offered.
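The revenue objective can be sketched in a few lines; again, all names and parameter values here are illustrative, not from the paper:

```python
import numpy as np

def expected_revenue(X, prices, theta, gamma):
    """Expected per-period revenue sum_i p_i * q_i for the displayed assortment."""
    v = X @ theta - (X @ gamma) * prices   # latent utilities
    w = np.exp(v)
    q = w / (1.0 + w.sum())                # purchase probabilities (MNL)
    return float(prices @ q)

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 4))
theta, gamma = rng.normal(size=4), np.abs(rng.normal(size=4))

rev_low  = expected_revenue(X, np.array([1.0, 1.0, 1.0]), theta, gamma)
rev_high = expected_revenue(X, np.array([5.0, 5.0, 5.0]), theta, gamma)
print(rev_low, rev_high)
```

Note the tension this objective creates: raising prices increases the margin $p_i$ but shrinks each purchase probability $q_i$ (demand leaks to the outside option), so revenue is not monotone in price. The per-period regret of any policy is just the gap between this quantity at the clairvoyant's choice and at the policy's choice.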
To measure performance, we evaluate the seller’s policy against a clairvoyant benchmark that knows the true target-market parameters $\nu^{(0)}$ and always selects the optimal assortment $S_t^*$ and optimal prices $p_t^*$:
$$ \text{Regret}(T) = \sum_{t=1}^T \left( R_t^{(0)}(S_t^*, p_t^*) - R_t^{(0)}(S_t, p_t) \right) $$

The Core Tension
This setup defines the central challenge of the paper:
- Learning $\nu^{(0)}$ using only target-market data is slow, leading to high regret.
- Source markets $h \in \{1, \dots, H\}$ generate data according to $\nu^{(h)}$.
- If $\nu^{(0)} \neq \nu^{(h)}$, naive pooling introduces a persistent bias that might never vanish, potentially leading to linear regret.
How can a learning algorithm safely harness the variance reduction of $H$ source markets while preventing the bias caused by cross-market preference shifts? The answer relies on assuming a structured form of heterogeneity, which we will explore in the next post.