This post translates the notebook at content/ipynb/pymc_101.ipynb into a narrative walkthrough. We start with synthetic features, wrap them in a PyMC model, recover the parameters with NUTS, and finish by generating fresh predictions. The goal is to help you connect each code cell in the notebook to the corresponding idea in Bayesian workflow.
Learning goals:
- organize covariate data with coordinates so PyMC keeps dimensions straight,
- specify a linear regression with informative names, and
- move from prior predictive checks to posterior inference and posterior predictive draws.
0. Confirm your tooling
Open the notebook (or a Colab runtime) and make sure PyMC is available. Recording the version helps when collaborators try to reproduce your results.
import pymc as pm
pm.__version__
Notebook output:
'5.26.1'
I used PyMC 5.x; if you get something older, upgrade before running the rest of the notebook so that pm.do, pm.observe, and pm.set_data behave the same.
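If you want the notebook to fail fast on an old install, a tiny guard like the one below (my own addition, not a cell in the notebook) checks that the helpers used later are actually present:

import pymc as pm

# The walkthrough relies on the intervention/conditioning helpers available in recent 5.x releases.
for helper in ("do", "observe", "set_data"):
    assert hasattr(pm, helper), f"pm.{helper} missing; upgrade with `pip install -U pymc`"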
1. Sample synthetic covariates
The notebook manufactures 100 plant-growth trials with three features—sunlight, water, and soil nitrogen—so we control the ground truth.
seed = 42
x_dist = pm.Normal.dist(shape=(100, 3))
x_data = pm.draw(x_dist, random_seed=seed)
coords = {
    "trial": range(100),
    "features": ["sunlight hours", "water amount", "soil nitrogen"],
}
pm.draw gives us a NumPy array of shape (100, 3). Passing coords to the model keeps downstream arrays well-labeled when we inspect traces or posterior predictive samples.
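Before building the model, it is worth confirming the array looks the way we expect. This quick peek with pandas is my own addition for illustration, not a notebook cell:

import pandas as pd

print(type(x_data), x_data.shape)  # <class 'numpy.ndarray'> (100, 3)

# Label the columns the same way the model's coords will, just to eyeball the data.
print(pd.DataFrame(x_data, columns=coords["features"]).head())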
2. Build the generative model
With data in hand, we write the regression in PyMC. The names match the science story: each feature controls plant growth through a coefficient stored in betas, and sigma captures leftover noise.
with pm.Model(coords=coords) as generative_model:
    x = pm.Data("x", x_data, dims=["trial", "features"])
    betas = pm.Normal("betas", dims="features")
    sigma = pm.HalfNormal("sigma")
    mu = x @ betas
    plant_growth = pm.Normal("plant growth", mu, sigma, dims="trial")
pm.Data lets us later replace x without rebuilding the graph, which is exactly what we do for out-of-sample predictions in section 5. The matrix multiply x @ betas operates on the underlying arrays as usual; the dims and coords metadata is what keeps traces and predictive samples labeled by trial and feature name.
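If you want to double-check the wiring before sampling, PyMC exposes the model's structure directly. The graphviz rendering is optional and only works if that extra dependency is installed:

print(generative_model.basic_RVs)           # [betas, sigma, plant growth]
print(generative_model.coords["features"])  # the labels attached to the betas dimension

# Optional visual rendering of the graph (needs the graphviz package):
# pm.model_to_graphviz(generative_model)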
3. Generate synthetic observations
To test inference, we need ground-truth parameters. The notebook pins the coefficients and noise scale, then draws one set of outcomes from the prior predictive distribution.
fixed_parameters = {"betas": [5, 20, 2], "sigma": 0.5}
with pm.do(generative_model, fixed_parameters):
    idata = pm.sample_prior_predictive(random_seed=seed)
synthetic_y = idata.prior["plant growth"].sel(draw=0, chain=0)
Notebook output:
Sampling: [plant growth]
pm.do is PyMC’s intervention helper—it overrides nodes without editing the model definition. Grabbing a single draw from idata.prior gives us a deterministic vector we can treat as observed data in the next step.
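A quick sanity check, not part of the notebook, is to verify that the drawn outcomes really follow the pinned coefficients:

import numpy as np

true_betas = np.array(fixed_parameters["betas"])
residuals = synthetic_y.values - x_data @ true_betas

# If the intervention worked, the residuals are just Normal(0, 0.5) noise.
print(residuals.mean(), residuals.std())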
4. Condition on the fake experiment
Now we pretend the synthetic measurements were collected in the lab. pm.observe swaps the likelihood’s random variable with observed values, and pm.sample runs NUTS to recover the latent parameters.
with pm.observe(generative_model, {"plant growth": synthetic_y}) as inference_model:
    idata = pm.sample(random_seed=seed)

summary = pm.stats.summary(idata, var_names=["betas", "sigma"])
print(summary)
Notebook output:
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, betas]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
                        mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
betas[sunlight hours]  4.973  0.054   4.875    5.076      0.001    0.001    5458.0    3238.0    1.0
betas[water amount]   19.963  0.050  19.868   20.059      0.001    0.001    4853.0    2804.0    1.0
betas[soil nitrogen]   1.996  0.056   1.897    2.103      0.001    0.001    5103.0    3037.0    1.0
sigma                  0.512  0.038   0.443    0.582      0.001    0.001    4906.0    3038.0    1.0
Expect the posterior means to sit near [5, 20, 2] and 0.5, with tight intervals because we generated 100 trials. The summary table confirms that the sampler can reconstruct the coefficients that produced the data.
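To make that comparison explicit in code (my addition, not a notebook cell), pull the posterior means and set them against the values we pinned with pm.do:

import numpy as np

posterior_betas = idata.posterior["betas"].mean(dim=("chain", "draw"))
print(posterior_betas.to_series())             # ~[5, 20, 2]
print(idata.posterior["sigma"].mean().item())  # ~0.5

# Absolute errors should sit within a couple of posterior standard deviations.
print(np.abs(posterior_betas.values - np.array(fixed_parameters["betas"])))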
5. Make posterior predictive checks
After inference we can reuse the fitted model to simulate new plants, possibly with different covariates. The notebook samples three fresh feature vectors, feeds them through the model, and extends the InferenceData object with predictive draws.
new_x_data = pm.draw(pm.Normal.dist(shape=(3, 3)), random_seed=seed)
new_coords = coords | {"trial": [0, 1, 2]}

with inference_model:
    pm.set_data({"x": new_x_data}, coords=new_coords)
    pm.sample_posterior_predictive(
        idata,
        predictions=True,
        extend_inferencedata=True,
        random_seed=seed,
    )

pm.stats.summary(idata.predictions, kind="stats")
Notebook output:
Sampling: [plant growth]
                   mean     sd  hdi_3%  hdi_97%
plant growth[0]  14.230  0.514  13.270   15.203
plant growth[1]  24.415  0.513  23.442   25.365
plant growth[2]  -6.749  0.517  -7.744   -5.808
Two important tricks are hiding here:
- pm.set_data swaps the design matrix so we can predict on new feature values.
- extend_inferencedata=True keeps everything in one InferenceData object, which simplifies charting in ArviZ or exporting results.
The final pm.stats.summary reports the predictive mean and standard deviation for each of the three new trials. Their spreads reflect both observation noise (sigma) and posterior uncertainty in the betas.
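For intuition about those numbers, you can roughly reproduce the predictive means by hand from the posterior. This back-of-the-envelope check is mine, not the notebook's:

# The predictive mean is approximately new_x_data @ E[betas]; the predictive sd
# combines observation noise (sigma) with posterior uncertainty in the betas.
beta_hat = idata.posterior["betas"].mean(dim=("chain", "draw")).values
print(new_x_data @ beta_hat)  # close to 14.2, 24.4, -6.7 from the table above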
Where to go next
- Plug the notebook into hugo server --buildDrafts --buildFuture and render the Markdown side-by-side to ensure equations and code blocks look right.
- Replace the synthetic Normal draws with measurements from your project; only the pm.Data block and coordinate names need to change.
- Explore the full idata object in ArviZ (trace plots, posterior predictive plots) to develop intuition for diagnosing PyMC runs; a starting point is sketched below.
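As a starting point for that ArviZ exploration (assuming the standard arviz install that ships alongside PyMC), the usual diagnostic plots look like this:

import arviz as az

az.plot_trace(idata, var_names=["betas", "sigma"])  # chain mixing and marginal shapes
az.plot_posterior(idata, var_names=["betas"], ref_val=[5, 20, 2])  # compare to the pinned truth
az.plot_energy(idata)  # NUTS energy diagnostic for overall sampler health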