R-hat is the first convergence diagnostic we read after fitting the Stan 101 logistic regression, and it stays on our dashboard as models grow. This follow-up focuses entirely on what R-hat measures, how Stan reports it, and the playbook we use when it drifts above a clean threshold.
Why R-hat is mandatory
- Detects whether independent Markov chains explore the same posterior bulk.
- Flags sticky chains faster than eyeballing trace plots.
- Ships with every Stan fit, so you never need to bolt on extra tooling.
- Extends to transformed parameters and generated quantities, not just raw draws.
We still scan divergences and effective sample sizes, but R-hat is the quick, global check for “did the sampler finish its job?”
How Stan computes R-hat
Stan follows the split-chain estimator from Vehtari, Gelman, Simpson, Carpenter, and Bürkner (2021). For each parameter $\theta$, chains are split in half to double the number of sequences, then within- and between-chain variances are compared:
$$ \hat{R} = \sqrt{\frac{\hat{V}}{W}}, \qquad \hat{V} = \frac{N - 1}{N} W + \frac{1}{N} B, $$
where
- $W$ is the mean of the within-chain variances,
- $B$ is the between-chain variance scaled by the number of draws per split chain $N$.
If chains mix, $B$ and $W$ agree and $\hat{R} \rightarrow 1$. Values above $1.01$ mean at least one chain is wandering in a different part of the posterior or stuck in a bad geometry.
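To make the estimator concrete, here is a minimal NumPy sketch of the plain split-chain formula above, without rank normalization; `split_rhat` and `draws` are illustrative names, not part of any Stan API.

```python
import numpy as np

def split_rhat(draws):
    """Plain split-chain R-hat for draws shaped (chains, iterations)."""
    chains, iters = draws.shape
    half = iters // 2
    # Split each chain in half, doubling the number of sequences.
    split = draws[:, : 2 * half].reshape(2 * chains, half)
    N = half
    chain_means = split.mean(axis=1)
    W = split.var(axis=1, ddof=1).mean()  # mean of within-chain variances
    B = N * chain_means.var(ddof=1)       # between-chain variance, scaled by N
    V_hat = (N - 1) / N * W + B / N
    return np.sqrt(V_hat / W)

# Well-mixed synthetic chains should land very close to 1.
rng = np.random.default_rng(20251030)
print(split_rhat(rng.normal(size=(4, 1000))))  # ~1.00
```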
Reading R-hat in CmdStanPy
CmdStanPy computes R-hat as part of the summary. Pull it programmatically so you can wire up alerts or build tidy tables.
```python
from cmdstanpy import CmdStanModel
import pandas as pd

model = CmdStanModel(stan_file="models/logistic_regression.stan")
fit = model.sample(
    data=stan_data,  # the data dict built in the Stan 101 post
    iter_warmup=1000,
    iter_sampling=1000,
    chains=4,
    seed=20251030,
)

# fit.summary() returns a pandas DataFrame indexed by variable name,
# with an R_hat column for every parameter and generated quantity.
summary = fit.summary()
rhat = summary.loc[:, ["R_hat"]].reset_index(names="variable")
rhat[rhat["R_hat"] > 1.01]
```
The last line filters for any parameter or transformed quantity with a suspicious R-hat. For richer reporting, convert the draws to ArviZ and use `az.rhat()`:
```python
import arviz as az

idata = az.from_cmdstanpy(fit)
az.rhat(idata, method="rank")
```
Rank-normalized R-hat stabilizes the diagnostic for heavy-tailed posteriors and is what Stan reports by default.
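For intuition about what `method="rank"` does, rank normalization replaces each draw with the normal-quantile transform of its pooled rank before applying the split-chain formula. A rough sketch, assuming SciPy is available and reusing the `split_rhat` helper above:

```python
import numpy as np
from scipy import stats

def rank_normalize(draws):
    # Fractional ranks pooled across all chains, pushed through the normal
    # quantile function (the z-scale transform of Vehtari et al. 2021).
    ranks = stats.rankdata(draws, method="average").reshape(draws.shape)
    return stats.norm.ppf((ranks - 0.375) / (draws.size + 0.25))

# Rank-normalized R-hat: the same split-chain formula on transformed draws.
# split_rhat(rank_normalize(draws))
```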
When R-hat raises a flag
- Thin effective samples: chains may alternate between modes without ever overlapping enough for $B$ and $W$ to agree.
- Centering pathologies: Hierarchical models with centered parameterizations can trap chains.
- Poor initialization: Outlying initial values create long warmup transients that contaminate the split chains.
- Posterior multimodality: If the model has label switching or true multimodal structure, chains must visit every mode for R-hat to settle.
Looking at trace plots still helps diagnose the exact failure mode, but R-hat is the first red light.
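When the filter above flags something, an ArviZ trace plot of just those parameters narrows the diagnosis; `alpha` and `beta` here are hypothetical names from the logistic regression, not fixed API values.

```python
import arviz as az

# Overlaid per-chain traces make stuck or non-overlapping chains obvious.
az.plot_trace(idata, var_names=["alpha", "beta"])
```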
Recover from inflated R-hat
- Re-run with more warmup (e.g. `iter_warmup=1500`) so adaptation fully settles; see the re-fit sketch after this list.
- Reparameterize: move centered hierarchies to non-centered forms or rescale parameters to the unit scale Stan prefers.
- Tighten priors to rule out implausible regions that the sampler keeps exploring.
- Increase `max_treedepth` only after geometry fixes; long treedepths alone rarely drop R-hat.
- If the model is multimodal, add identifiability constraints or marginalized likelihoods that collapse redundant modes.
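As a sketch of the first bullet, a re-fit with longer warmup, raising `max_treedepth` only once the geometry is fixed; both values here are illustrative, not prescriptions.

```python
# Re-fit with a longer warmup so adaptation settles before sampling.
fit = model.sample(
    data=stan_data,
    iter_warmup=1500,
    iter_sampling=1000,
    chains=4,
    max_treedepth=12,  # raise only after reparameterization, per the list above
    seed=20251030,
)
```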
Track R-hat after each change; the goal is every reported value at or below 1.01 (some teams enforce 1.005 for production runs).
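One way to enforce that ceiling in a pipeline, reusing the `rhat` table from the summary snippet above:

```python
# Halt downstream steps if any parameter exceeds the ceiling.
failing = rhat[rhat["R_hat"] > 1.01]
if not failing.empty:
    raise RuntimeError(f"R-hat check failed:\n{failing}")
```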
Takeaways
- R-hat compares between- and within-chain variation, so it is sensitive to both stuck chains and genuine multimodality.
- Split-chain, rank-normalized R-hat is the default in Stan; treat `1.01` as a soft ceiling and investigate anything higher.
- Automate R-hat checks in pipelines so failing diagnostics halt downstream decisions before they touch production.
- If R-hat refuses to cooperate, reassess model geometry before simply adding iterations.