This post translates the notebook at content/ipynb/pymc_101.ipynb into a narrative walkthrough. We start with synthetic features, wrap them in a PyMC model, recover the parameters with NUTS, and finish by generating fresh predictions. The goal is to help you connect each code cell in the notebook to the corresponding idea in Bayesian workflow.
Learning goals:
- organize covariate data with coordinates so PyMC keeps dimensions straight,
- specify a linear regression with informative names, and
- move from prior predictive checks to posterior inference and posterior predictive draws.
0. Confirm your tooling
Open the notebook (or a Colab runtime) and make sure PyMC is available. Recording the version helps when collaborators try to reproduce your results.
import pymc as pm
pm.__version__
Notebook output:
'5.26.1'
I used PyMC 5.x; if you get something older, upgrade before running the rest of the notebook so that pm.do, pm.observe, and pm.set_data behave the same.
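If you want the notebook to fail fast on an old install, a tiny guard like the one below (my own addition, not a cell in the notebook) checks that the helpers used later are actually present:

import pymc as pm

# The walkthrough relies on the intervention/conditioning helpers available in recent 5.x releases.
for helper in ("do", "observe", "set_data"):
    assert hasattr(pm, helper), f"pm.{helper} missing; upgrade with `pip install -U pymc`"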
1. Sample synthetic covariates
The notebook manufactures 100 plant-growth trials with three features—sunlight, water, and soil nitrogen—so we control the ground truth.
seed = 42
x_dist = pm.Normal.dist(shape=(100, 3))
x_data = pm.draw(x_dist, random_seed=seed)
coords = {
    "trial": range(100),
    "features": ["sunlight hours", "water amount", "soil nitrogen"],
}
pm.draw gives us a NumPy array of shape (100, 3). Passing coords to the model keeps downstream arrays well-labeled when we inspect traces or posterior predictive samples.
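Before building the model, it is worth confirming the array looks the way we expect. This quick peek with pandas is my own addition for illustration, not a notebook cell:

import pandas as pd

print(type(x_data), x_data.shape)  # <class 'numpy.ndarray'> (100, 3)

# Label the columns the same way the model's coords will, just to eyeball the data.
print(pd.DataFrame(x_data, columns=coords["features"]).head())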
2. Build the generative model
With data in hand, we write the regression in PyMC. The names match the science story: each feature controls plant growth through a coefficient stored in betas, and sigma captures leftover noise.
with pm.Model(coords=coords) as generative_model:
    x = pm.Data("x", x_data, dims=["trial", "features"])
    betas = pm.Normal("betas", dims="features")
    sigma = pm.HalfNormal("sigma")
    mu = x @ betas
    plant_growth = pm.Normal("plant growth", mu, sigma, dims="trial")
pm.Data lets us later replace x without rebuilding the graph, which is exactly what we do for out-of-sample predictions in section 5. The matrix multiply x @ betas operates on the underlying arrays as usual; the dims and coords metadata is what keeps traces and predictive samples labeled by trial and feature name.
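If you want to double-check the wiring before sampling, PyMC exposes the model's structure directly. The graphviz rendering is optional and only works if that extra dependency is installed:

print(generative_model.basic_RVs)           # [betas, sigma, plant growth]
print(generative_model.coords["features"])  # the labels attached to the betas dimension

# Optional visual rendering of the graph (needs the graphviz package):
# pm.model_to_graphviz(generative_model)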
3. Generate synthetic observations
To test inference, we need ground-truth parameters. The notebook pins the coefficients and noise scale, then draws one set of outcomes from the prior predictive distribution.
fixed_parameters = {"betas": [5, 20, 2], "sigma": 0.5}
with pm.do(generative_model, fixed_parameters):
    idata = pm.sample_prior_predictive(random_seed=seed)
synthetic_y = idata.prior["plant growth"].sel(draw=0, chain=0)
Notebook output:
Sampling: [plant growth]
pm.do is PyMC’s intervention helper—it overrides nodes without editing the model definition. Grabbing a single draw from idata.prior gives us a deterministic vector we can treat as observed data in the next step.
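A quick sanity check, not part of the notebook, is to verify that the drawn outcomes really follow the pinned coefficients:

import numpy as np

true_betas = np.array(fixed_parameters["betas"])
residuals = synthetic_y.values - x_data @ true_betas

# If the intervention worked, the residuals are just Normal(0, 0.5) noise.
print(residuals.mean(), residuals.std())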
4. Condition on the fake experiment
Now we pretend the synthetic measurements were collected in the lab. pm.observe swaps the likelihood’s random variable with observed values, and pm.sample runs NUTS to recover the latent parameters.
with pm.observe(generative_model, {"plant growth": synthetic_y}) as inference_model:
    idata = pm.sample(random_seed=seed)

summary = pm.stats.summary(idata, var_names=["betas", "sigma"])
print(summary)
Notebook output:
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, betas]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
                        mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
betas[sunlight hours]  4.973  0.054   4.875    5.076      0.001    0.001    5458.0    3238.0    1.0
betas[water amount]   19.963  0.050  19.868   20.059      0.001    0.001    4853.0    2804.0    1.0
betas[soil nitrogen]   1.996  0.056   1.897    2.103      0.001    0.001    5103.0    3037.0    1.0
sigma                  0.512  0.038   0.443    0.582      0.001    0.001    4906.0    3038.0    1.0
Expect the posterior means to sit near [5, 20, 2] and 0.5, with tight intervals because we generated 100 trials. The summary table confirms that the sampler can reconstruct the coefficients that produced the data.
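To make that comparison explicit in code (my addition, not a notebook cell), pull the posterior means and set them against the values we pinned with pm.do:

import numpy as np

posterior_betas = idata.posterior["betas"].mean(dim=("chain", "draw"))
print(posterior_betas.to_series())             # ~[5, 20, 2]
print(idata.posterior["sigma"].mean().item())  # ~0.5

# Absolute errors should sit within a couple of posterior standard deviations.
print(np.abs(posterior_betas.values - np.array(fixed_parameters["betas"])))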
5. Make posterior predictive checks
After inference we can reuse the fitted model to simulate new plants, possibly with different covariates. The notebook samples three fresh feature vectors, feeds them through the model, and extends the InferenceData object with predictive draws.
new_x_data = pm.draw(pm.Normal.dist(shape=(3, 3)), random_seed=seed)
new_coords = coords | {"trial": [0, 1, 2]}

with inference_model:
    pm.set_data({"x": new_x_data}, coords=new_coords)
    pm.sample_posterior_predictive(
        idata,
        predictions=True,
        extend_inferencedata=True,
        random_seed=seed,
    )

pm.stats.summary(idata.predictions, kind="stats")
Notebook output:
Sampling: [plant growth]
                   mean     sd  hdi_3%  hdi_97%
plant growth[0]  14.230  0.514  13.270   15.203
plant growth[1]  24.415  0.513  23.442   25.365
plant growth[2]  -6.749  0.517  -7.744   -5.808
Two important tricks are hiding here:
- pm.set_data swaps the design matrix so we can predict on new feature values.
- extend_inferencedata=True keeps everything in one InferenceData object, which simplifies charting in ArviZ or exporting results.
The final pm.stats.summary reports the predictive mean and standard deviation for each of the three new trials. Their spreads reflect both observation noise (sigma) and posterior uncertainty in the betas.
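For intuition about those numbers, you can roughly reproduce the predictive means by hand from the posterior. This back-of-the-envelope check is mine, not the notebook's:

# The predictive mean is approximately new_x_data @ E[betas]; the predictive sd
# combines observation noise (sigma) with posterior uncertainty in the betas.
beta_hat = idata.posterior["betas"].mean(dim=("chain", "draw")).values
print(new_x_data @ beta_hat)  # close to 14.2, 24.4, -6.7 from the table above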
Where to go next
- Plug the notebook into hugo server --buildDrafts --buildFuture and render the Markdown side-by-side to ensure equations and code blocks look right.
- Replace the synthetic Normal draws with measurements from your project; only the pm.Data block and coordinate names need to change.
- Explore the full idata object in ArviZ (trace plots, posterior predictive plots) to develop intuition for diagnosing PyMC runs; a starting point is sketched below.
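As a starting point for that ArviZ exploration (assuming the standard arviz install that ships alongside PyMC), the usual diagnostic plots look like this:

import arviz as az

az.plot_trace(idata, var_names=["betas", "sigma"])  # chain mixing and marginal shapes
az.plot_posterior(idata, var_names=["betas"], ref_val=[5, 20, 2])  # compare to the pinned truth
az.plot_energy(idata)  # NUTS energy diagnostic for overall sampler health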