Chapter 18 – Computational Methods¶
The motivating problem. In Chapter 17 we enjoyed the luxury of conjugate priors: the posterior had a closed form. But real-world models are rarely so cooperative. Consider a hierarchical model for clinical trials across multiple hospitals. Hospital \(j\) has response rate \(\theta_j\), and we believe these rates come from a common population: \(\theta_j \sim \text{Beta}(\mu\kappa,\, (1-\mu)\kappa)\), where \(\mu\) and \(\kappa\) are unknown hyperparameters.
The posterior is:
We can evaluate this pointwise, but we cannot integrate it analytically. What now? This chapter builds the computational highway system that gets you to any posterior, no matter how complicated.
# The hierarchical model we will tackle throughout this chapter
import numpy as np
from scipy.special import gammaln
np.random.seed(42)
# Data: 8 hospitals, each with n_j patients and y_j responders
n_hospitals = 8
n_j = np.array([20, 30, 25, 40, 15, 35, 28, 22])
y_j = np.array([11, 16, 10, 25, 7, 19, 14, 9])
print("Hospital data:")
print(f" {'Hospital':>10} {'n_j':>5} {'y_j':>5} {'MLE':>8}")
print(" " + "-" * 30)
for j in range(n_hospitals):
print(f" {j+1:>10} {n_j[j]:>5} {y_j[j]:>5} {y_j[j]/n_j[j]:>8.3f}")
print(f"\n Hospital MLEs range from {min(y_j/n_j):.3f} to {max(y_j/n_j):.3f}")
print(" Can we borrow strength across hospitals? -> Chapter 18 shows how.")
def log_posterior(theta_vec, mu, kappa):
"""Unnormalized log-posterior for the hierarchical model."""
a = mu * kappa
b = (1 - mu) * kappa
# Likelihood
ll = np.sum(y_j * np.log(theta_vec) + (n_j - y_j) * np.log(1 - theta_vec))
# Prior on theta_j: Beta(a, b)
lp_theta = np.sum((a - 1)*np.log(theta_vec) + (b - 1)*np.log(1 - theta_vec)
- gammaln(a) - gammaln(b) + gammaln(a + b))
# Hyperpriors: mu ~ Beta(2,2), kappa ~ Gamma(2, 0.1)
lp_mu = (2-1)*np.log(mu) + (2-1)*np.log(1-mu) if 0 < mu < 1 else -1e10
lp_kappa = (2-1)*np.log(kappa) - 0.1*kappa if kappa > 0 else -1e10
return ll + lp_theta + lp_mu + lp_kappa
18.1 Monte Carlo Integration¶
Why we need this. Many quantities of interest — posterior means, marginal likelihoods, predictive distributions — are integrals of the form
When the integral has no closed form we resort to Monte Carlo methods: generate samples \(\theta^{(1)}, \ldots, \theta^{(S)}\) from \(p(\theta \mid \mathbf{x})\) and approximate
By the law of large numbers \(\hat{I} \to I\) as \(S \to \infty\), and the central limit theorem gives the Monte Carlo standard error:
The error decreases as \(1/\sqrt{S}\) regardless of the dimension of \(\theta\) — a crucial advantage over deterministic quadrature in high dimensions.
# Monte Carlo integration: watch the estimate converge
import numpy as np
np.random.seed(42)
# Goal: estimate E[theta] where theta ~ Beta(35, 25)
# (our clinical trial posterior from Chapter 17)
from scipy import stats
true_mean = 35 / (35 + 25)
samples = stats.beta.rvs(35, 25, size=10_000)
print(f"{'S':>7} {'MC estimate':>12} {'SE':>10} {'|Error|':>10}")
print("-" * 42)
for S in [10, 50, 100, 500, 1000, 5000, 10000]:
mc_est = samples[:S].mean()
mc_se = samples[:S].std() / np.sqrt(S)
print(f"{S:>7} {mc_est:>12.6f} {mc_se:>10.6f} "
f"{abs(mc_est - true_mean):>10.6f}")
print(f"{'True':>7} {true_mean:>12.6f}")
print("\nSE decreases as 1/sqrt(S) regardless of dimension.")
Intuition
Monte Carlo integration is essentially “estimation by random polling.” Just as a political poll of 1000 random voters can approximate the opinion of millions, 10,000 random draws from a posterior can approximate intractable integrals to several decimal places.
Importance sampling¶
Often we cannot sample directly from the target \(p(\theta \mid \mathbf{x})\) but we can sample from a proposal distribution \(q(\theta)\). Write
where the importance weight is
In plain English: the weight measures how much more (or less) probable \(\theta\) is under the target distribution compared to the proposal. Samples from regions where the proposal underrepresents the target get upweighted; samples from overrepresented regions get downweighted.
Drawing \(\theta^{(s)} \sim q\) and computing
gives a consistent (self-normalized) importance sampling estimator. The denominator normalizes the weights so they sum to one, which is why this is called “self-normalized” — we do not need to know the normalizing constants of \(p\) or \(q\). The quality depends on how well \(q\) covers the tails of \(p\); if \(q\) has lighter tails the variance can be infinite.
# Importance sampling: good vs bad proposal
import numpy as np
from scipy import stats
np.random.seed(42)
# Target: Beta(35, 25), want E[theta]
true_mean = 35 / 60
S = 10_000
# Good proposal: N(0.58, 0.07) -- covers the target well
# Bad proposal: N(0.3, 0.05) -- centered far from the target
for label, mu_q, sigma_q in [("Good proposal N(0.58, 0.07)", 0.58, 0.07),
("Bad proposal N(0.30, 0.05)", 0.30, 0.05)]:
z = np.random.normal(mu_q, sigma_q, S)
z = z[(z > 0.01) & (z < 0.99)] # keep in valid range
log_w = (stats.beta.logpdf(z, 35, 25)
- stats.norm.logpdf(z, mu_q, sigma_q))
log_w -= log_w.max() # for numerical stability
w = np.exp(log_w)
estimate = np.sum(w * z) / np.sum(w)
ess = np.sum(w)**2 / np.sum(w**2)
print(f"{label}:")
print(f" Estimate: {estimate:.6f} (true: {true_mean:.6f})")
print(f" ESS: {ess:.0f} / {len(z)} samples")
print(f" ESS ratio: {ess/len(z):.1%}\n")
Common Pitfall
If your proposal distribution \(q\) has thinner tails than the target \(p\), a few extreme importance weights will dominate the estimate, making it highly unstable. Always check the effective sample size of your importance weights: \(\text{ESS} = (\sum w_s)^2 / \sum w_s^2\).
18.2 The Metropolis–Hastings Algorithm¶
Why Markov chains? Importance sampling works well in low dimensions, but in high dimensions it is very hard to find a good proposal \(q\). Instead, we construct a Markov chain whose stationary distribution is the target posterior. After a burn-in period the chain produces (correlated) samples from the posterior.
We will build a Metropolis–Hastings sampler for our hierarchical model. But first, let us derive the algorithm from first principles.
Detailed balance¶
A Markov chain with transition kernel \(T(\theta' \mid \theta)\) has stationary distribution \(\pi(\theta) = p(\theta \mid \mathbf{x})\) if detailed balance holds:
This says the probability of being at \(\theta\) and moving to \(\theta'\) equals the probability of the reverse.
Derivation of the acceptance probability¶
We choose a proposal distribution \(q(\theta' \mid \theta)\) and set the transition as: propose \(\theta'\) from \(q\), then accept with probability \(\alpha(\theta, \theta')\). Thus
Substituting into the detailed balance condition:
Rearranging:
We need acceptance probabilities \(\alpha\) that satisfy this ratio and lie in \([0, 1]\). The Metropolis–Hastings choice picks the largest such probabilities:
Reading this formula: the numerator is the posterior density at the proposed point times the probability of proposing a move back to where we are; the denominator is the same thing in the opposite direction. If the proposed point is “better” (higher posterior density, easy to get back from), the ratio exceeds 1 and we always accept. If it is “worse,” we accept with a probability equal to the ratio, which gives worse points a fighting chance — this is essential for exploring the full posterior.
One can verify this satisfies the ratio above. Because \(\pi\) is the posterior, only the ratio \(\pi(\theta')/\pi(\theta)\) is needed — the normalizing constant cancels.
Why does this matter?
The cancellation of the normalizing constant is what makes MCMC so powerful. You never need to compute the marginal likelihood \(p(\mathbf{x})\). You only need to evaluate the unnormalized posterior — which is just the likelihood times the prior. This is the engine that makes Bayesian inference practical for complex models.
Algorithm¶
Initialize \(\theta^{(0)}\).
For \(t = 1, 2, \ldots\):
Draw \(\theta^* \sim q(\cdot \mid \theta^{(t-1)})\).
Compute
\[\alpha = \min\!\left(1,\; \frac{p(\theta^* \mid \mathbf{x})\, q(\theta^{(t-1)} \mid \theta^*)} {p(\theta^{(t-1)} \mid \mathbf{x})\, q(\theta^* \mid \theta^{(t-1)})}\right).\]With probability \(\alpha\) set \(\theta^{(t)} = \theta^*\); otherwise set \(\theta^{(t)} = \theta^{(t-1)}\).
Special case: random-walk Metropolis. If \(q(\theta' \mid \theta) = q(\theta \mid \theta')\) (symmetric proposal, e.g., \(\theta' = \theta + \epsilon\) with \(\epsilon \sim N(0, \sigma^2)\)) the proposal ratio cancels and
Now let us apply this to Bayesian logistic regression, printing the first 20 iterations so you can see the sampler in action.
# Metropolis-Hastings for logistic regression: iteration by iteration
import numpy as np
from scipy.special import expit
np.random.seed(42)
# Simulate: 50 patients, dosage x predicts recovery y
n = 50
x = np.random.normal(0, 1, n)
true_beta = np.array([0.5, 1.5])
prob = expit(true_beta[0] + true_beta[1] * x)
y = np.random.binomial(1, prob)
def log_posterior(beta):
eta = beta[0] + beta[1] * x
log_lik = np.sum(y * eta - np.log(1 + np.exp(eta)))
log_prior = -0.5 * np.sum(beta**2) / 100 # N(0, 10^2)
return log_lik + log_prior
# MH sampler
T = 20_000
samples = np.zeros((T, 2))
samples[0] = [0.0, 0.0]
sigma_prop = 0.3
accepted = 0
# Print header for first 20 iterations
print(f"{'t':>4} {'beta0*':>8} {'beta1*':>8} {'log_alpha':>10} "
f"{'Accept?':>8} {'beta0':>8} {'beta1':>8}")
print("-" * 60)
for t in range(1, T):
proposal = samples[t-1] + np.random.normal(0, sigma_prop, 2)
log_alpha = log_posterior(proposal) - log_posterior(samples[t-1])
u = np.random.uniform()
accept = np.log(u) < log_alpha
if accept:
samples[t] = proposal
accepted += 1
else:
samples[t] = samples[t-1]
if t <= 20:
print(f"{t:>4} {proposal[0]:>8.3f} {proposal[1]:>8.3f} "
f"{log_alpha:>10.3f} {'YES' if accept else 'no':>8} "
f"{samples[t, 0]:>8.3f} {samples[t, 1]:>8.3f}")
burn_in = 5000
post_samples = samples[burn_in:]
print(f"\n--- After {T} iterations (burn-in={burn_in}) ---")
print(f"Acceptance rate: {accepted / T:.3f}")
print(f"Posterior mean beta: [{post_samples[:,0].mean():.3f}, "
f"{post_samples[:,1].mean():.3f}]")
print(f"Posterior std beta: [{post_samples[:,0].std():.3f}, "
f"{post_samples[:,1].std():.3f}]")
print(f"True beta: [{true_beta[0]:.3f}, {true_beta[1]:.3f}]")
Proposal tuning¶
The step size \(\sigma\) controls the trade-off between exploring widely (large steps, many rejections) and staying local (small steps, high acceptance but slow exploration). The optimal acceptance rate for a \(d\)-dimensional Gaussian target is approximately 23.4% (Roberts, Gelman, and Gilks, 1997).
# Effect of proposal step size on acceptance rate and mixing
import numpy as np
np.random.seed(42)
# Target: standard bivariate normal
def log_target(theta):
return -0.5 * np.sum(theta**2)
def run_mh(sigma_prop, T=10_000, start=np.array([5.0, 5.0])):
samples = np.zeros((T, 2))
samples[0] = start
acc = 0
for t in range(1, T):
proposal = samples[t-1] + np.random.normal(0, sigma_prop, 2)
log_alpha = log_target(proposal) - log_target(samples[t-1])
if np.log(np.random.uniform()) < log_alpha:
samples[t] = proposal
acc += 1
else:
samples[t] = samples[t-1]
return samples, acc / T
print(f"{'sigma_prop':>12} {'Accept rate':>13} {'Post mean':>16} "
f"{'Post std':>12}")
print("-" * 56)
for sigma in [0.05, 0.3, 1.0, 2.4, 5.0, 20.0]:
samp, rate = run_mh(sigma)
post = samp[2000:]
print(f"{sigma:>12.2f} {rate:>13.3f} "
f"[{post[:,0].mean():>5.2f}, {post[:,1].mean():>5.2f}] "
f"[{post[:,0].std():>4.2f}, {post[:,1].std():>4.2f}]")
print("\nOptimal acceptance rate for d=2: ~23.4%")
print("sigma_prop ~ 2.4 hits the sweet spot.")
18.3 Gibbs Sampling¶
Why Gibbs? In many models the full conditional distribution of each parameter block — the distribution of one parameter given all others and the data — has a known form (especially with conjugate priors). Instead of proposing moves in all dimensions at once, we cycle through one dimension at a time, always drawing from the exact conditional distribution. Every draw is accepted.
Full conditional distributions¶
For a parameter vector \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)\), the full conditional of \(\theta_j\) is
treating all \(\theta_k\) for \(k \neq j\) as fixed. In conjugate models these are standard distributions.
Algorithm¶
Initialize \(\boldsymbol{\theta}^{(0)}\).
For \(t = 1, 2, \ldots\):
Draw \(\theta_1^{(t)} \sim p(\theta_1 \mid \theta_2^{(t-1)}, \ldots, \theta_p^{(t-1)}, \mathbf{x})\).
Draw \(\theta_2^{(t)} \sim p(\theta_2 \mid \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, \theta_p^{(t-1)}, \mathbf{x})\).
Continue cycling through all components.
Gibbs as a special case of Metropolis–Hastings¶
Derivation. In MH, set the proposal for updating \(\theta_j\) to be the full conditional itself: \(q(\theta_j' \mid \theta_j, \theta_{-j}) = p(\theta_j' \mid \theta_{-j}, \mathbf{x})\). Then the acceptance ratio is:
Every proposal is accepted! Gibbs sampling is MH with an optimal component-wise proposal.
Example: Normal model with unknown mean and variance. With \(x_i \sim N(\mu, \sigma^2)\), \(\mu \sim N(\mu_0, \sigma_0^2)\), and \(\sigma^2 \sim \text{Inv-Gamma}(a, b)\), the full conditionals are:
\(\mu \mid \sigma^2, \mathbf{x} \sim N(\mu_n, \sigma_n^2)\) (Normal–Normal conjugacy from Chapter 17).
\(\sigma^2 \mid \mu, \mathbf{x} \sim \text{Inv-Gamma}(a + n/2,\; b + \frac{1}{2}\sum(x_i - \mu)^2)\).
Let us implement this, printing every iteration for the first 15 to see the alternating updates converge.
# Gibbs sampler for Normal model: mu and sigma^2 with alternating updates
import numpy as np
from scipy import stats
np.random.seed(42)
# Generate data from N(mu=5, sigma^2=4)
true_mu, true_sigma2 = 5.0, 4.0
data = np.random.normal(true_mu, np.sqrt(true_sigma2), size=30)
n = len(data)
x_bar = data.mean()
S_data = np.sum((data - x_bar)**2)
# Hyperpriors
mu_0, sigma_0_sq = 0.0, 100.0 # mu ~ N(0, 100)
a_prior, b_prior = 0.01, 0.01 # sigma^2 ~ InvGamma(0.01, 0.01)
# Gibbs sampler
T = 5_000
mu_samples = np.zeros(T)
sigma2_samples = np.zeros(T)
mu_samples[0] = 0.0 # start far from truth
sigma2_samples[0] = 20.0 # start far from truth
print(f"{'t':>4} {'mu':>10} {'sigma2':>10} Full conditional used")
print("-" * 55)
print(f"{'0':>4} {mu_samples[0]:>10.4f} {sigma2_samples[0]:>10.4f} (initial)")
for t in range(1, T):
# -- Update mu | sigma^2, data --
prec_prior = 1.0 / sigma_0_sq
prec_data = n / sigma2_samples[t-1]
prec_post = prec_prior + prec_data
mu_post = (prec_prior * mu_0 + prec_data * x_bar) / prec_post
sigma_post = np.sqrt(1.0 / prec_post)
mu_samples[t] = np.random.normal(mu_post, sigma_post)
# -- Update sigma^2 | mu, data --
a_post = a_prior + n / 2.0
b_post = b_prior + 0.5 * np.sum((data - mu_samples[t])**2)
sigma2_samples[t] = 1.0 / np.random.gamma(a_post, 1.0 / b_post)
if t <= 15 or t == T-1:
label = f" N({mu_post:.2f}, {sigma_post:.2f}^2) -> " \
f"IG({a_post:.1f}, {b_post:.1f})"
print(f"{t:>4} {mu_samples[t]:>10.4f} {sigma2_samples[t]:>10.4f} {label}")
burn_in = 500
print(f"\n--- After {T} iterations (burn-in={burn_in}) ---")
print(f"Posterior E[mu]: {mu_samples[burn_in:].mean():.4f} (true={true_mu})")
print(f"Posterior E[sigma^2]: {sigma2_samples[burn_in:].mean():.4f} (true={true_sigma2})")
print(f"Posterior sd(mu): {mu_samples[burn_in:].std():.4f}")
print(f"Data mean: {x_bar:.4f}")
Now let us tackle the full hierarchical model from the chapter opening using a Gibbs sampler. Each \(\theta_j\) has a conjugate full conditional (Beta), while \(\mu\) and \(\kappa\) require Metropolis steps within the Gibbs.
# Gibbs sampler for the hierarchical Beta-Binomial model
import numpy as np
from scipy.special import gammaln
np.random.seed(42)
# Data from chapter opening
n_hospitals = 8
n_j = np.array([20, 30, 25, 40, 15, 35, 28, 22])
y_j = np.array([11, 16, 10, 25, 7, 19, 14, 9])
T = 10_000
theta_samples = np.zeros((T, n_hospitals))
mu_samples = np.zeros(T)
kappa_samples = np.zeros(T)
# Initialize
theta_samples[0] = y_j / n_j
mu_samples[0] = 0.5
kappa_samples[0] = 10.0
def log_beta_binomial_hyperprior(mu, kappa, thetas):
"""Log-density of the hyperprior part: Beta priors on theta_j + hyperpriors."""
if not (0.01 < mu < 0.99 and kappa > 0.5):
return -1e10
a = mu * kappa
b = (1 - mu) * kappa
lp = np.sum((a-1)*np.log(thetas) + (b-1)*np.log(1-thetas)
- gammaln(a) - gammaln(b) + gammaln(a+b))
lp += (2-1)*np.log(mu) + (2-1)*np.log(1-mu) # mu ~ Beta(2,2)
lp += (2-1)*np.log(kappa) - 0.1*kappa # kappa ~ Gamma(2,0.1)
return lp
accepted_mu, accepted_kappa = 0, 0
for t in range(1, T):
mu_cur = mu_samples[t-1]
kappa_cur = kappa_samples[t-1]
# -- Gibbs step: update each theta_j from its Beta full conditional --
a = mu_cur * kappa_cur + y_j
b = (1 - mu_cur) * kappa_cur + n_j - y_j
theta_samples[t] = np.random.beta(a, b)
# -- MH step: update mu --
mu_prop = mu_cur + np.random.normal(0, 0.05)
if 0.01 < mu_prop < 0.99:
log_r = (log_beta_binomial_hyperprior(mu_prop, kappa_cur, theta_samples[t])
- log_beta_binomial_hyperprior(mu_cur, kappa_cur, theta_samples[t]))
if np.log(np.random.uniform()) < log_r:
mu_cur = mu_prop
accepted_mu += 1
mu_samples[t] = mu_cur
# -- MH step: update kappa --
kappa_prop = kappa_cur * np.exp(np.random.normal(0, 0.2)) # log-scale proposal
log_r = (log_beta_binomial_hyperprior(mu_cur, kappa_prop, theta_samples[t])
- log_beta_binomial_hyperprior(mu_cur, kappa_cur, theta_samples[t])
+ np.log(kappa_prop) - np.log(kappa_cur)) # Jacobian
if np.log(np.random.uniform()) < log_r:
kappa_cur = kappa_prop
accepted_kappa += 1
kappa_samples[t] = kappa_cur
burn_in = 3000
print("Hierarchical Beta-Binomial: Gibbs + MH results")
print(f" mu acceptance rate: {accepted_mu/T:.3f}")
print(f" kappa acceptance rate: {accepted_kappa/T:.3f}")
print(f"\n E[mu]: {mu_samples[burn_in:].mean():.4f}")
print(f" E[kappa]: {kappa_samples[burn_in:].mean():.2f}")
print(f"\n {'Hospital':>10} {'MLE':>8} {'Post mean':>10} {'Shrinkage':>10}")
print(" " + "-" * 40)
for j in range(n_hospitals):
mle = y_j[j] / n_j[j]
pm = theta_samples[burn_in:, j].mean()
shrink = (mle - pm) / (mle - mu_samples[burn_in:].mean())
print(f" {j+1:>10} {mle:>8.3f} {pm:>10.3f} {shrink:>10.1%}")
print("\n Note: smaller hospitals are shrunk more toward the overall mean.")
18.4 Hamiltonian Monte Carlo¶
Why HMC? Random-walk Metropolis explores the parameter space by diffusion, which becomes extremely slow in high dimensions. HMC uses the gradient of the log-posterior to make large, directed moves while maintaining a high acceptance rate. Think of it as giving the sampler a “GPS” rather than letting it wander blindly.
Hamiltonian mechanics analogy¶
Imagine a frictionless puck sliding on a surface whose height is \(U(\theta) = -\log p(\theta \mid \mathbf{x})\). The puck has position \(\theta\) and momentum \(\mathbf{r}\). The total energy (Hamiltonian) is:
where \(K(\mathbf{r}) = \frac{1}{2}\mathbf{r}^T M^{-1} \mathbf{r}\) is the kinetic energy and \(M\) is a “mass matrix” (usually set to the identity or an estimate of the posterior covariance).
Hamilton’s equations give the dynamics:
In plain English: the first equation says the parameter moves in the direction of the momentum (the puck slides). The second equation says the momentum changes according to the gradient of the log-posterior — the puck accelerates downhill toward regions of high posterior density. Together, these create smooth, curving trajectories that efficiently explore the posterior landscape.
These equations are energy-preserving and volume-preserving (symplectic), which is exactly what we need for a valid MCMC proposal.
The leapfrog integrator¶
We cannot solve Hamilton’s equations analytically, so we use the leapfrog integrator with step size \(\varepsilon\) and \(L\) steps:
\(\mathbf{r} \leftarrow \mathbf{r} + \frac{\varepsilon}{2}\, \nabla \log p(\theta \mid \mathbf{x})\) (half-step for momentum).
For \(l = 1, \ldots, L\):
\(\theta \leftarrow \theta + \varepsilon\, M^{-1}\mathbf{r}\) (full step for position).
If \(l < L\): \(\mathbf{r} \leftarrow \mathbf{r} + \varepsilon\, \nabla \log p(\theta \mid \mathbf{x})\) (full step for momentum).
\(\mathbf{r} \leftarrow \mathbf{r} + \frac{\varepsilon}{2}\, \nabla \log p(\theta \mid \mathbf{x})\) (final half-step for momentum).
The leapfrog integrator is symplectic (volume-preserving) and time-reversible, so the MH acceptance probability is:
where \((\theta^*, \mathbf{r}^*)\) is the endpoint. If the integrator were exact, \(H\) would be conserved and \(\alpha = 1\). In practice, the discretization error is small and acceptance rates are typically above 80%.
# HMC sampler: sampling a 2D correlated Gaussian
import numpy as np
np.random.seed(42)
# Target: bivariate normal with correlation 0.95
rho = 0.95
Sigma = np.array([[1, rho], [rho, 1]])
Sigma_inv = np.linalg.inv(Sigma)
def log_prob(q):
return -0.5 * q @ Sigma_inv @ q
def grad_log_prob(q):
return -Sigma_inv @ q
def hmc_step(q, epsilon, L):
"""One HMC step with leapfrog integration."""
p = np.random.normal(size=2) # sample momentum
current_H = -log_prob(q) + 0.5 * p @ p
q_new = q.copy()
p_new = p.copy()
# Leapfrog
p_new += 0.5 * epsilon * grad_log_prob(q_new)
for l in range(L):
q_new += epsilon * p_new
if l < L - 1:
p_new += epsilon * grad_log_prob(q_new)
p_new += 0.5 * epsilon * grad_log_prob(q_new)
proposed_H = -log_prob(q_new) + 0.5 * p_new @ p_new
if np.log(np.random.uniform()) < current_H - proposed_H:
return q_new, True
return q, False
# Run HMC and compare with random-walk MH
T = 5_000
hmc_samples = np.zeros((T, 2))
hmc_samples[0] = [0.0, 0.0]
hmc_acc = 0
for t in range(1, T):
hmc_samples[t], accepted = hmc_step(hmc_samples[t-1], epsilon=0.15, L=20)
hmc_acc += accepted
# Random-walk MH for comparison
mh_samples = np.zeros((T, 2))
mh_samples[0] = [0.0, 0.0]
mh_acc = 0
for t in range(1, T):
prop = mh_samples[t-1] + np.random.normal(0, 0.3, 2)
if np.log(np.random.uniform()) < log_prob(prop) - log_prob(mh_samples[t-1]):
mh_samples[t] = prop
mh_acc += 1
else:
mh_samples[t] = mh_samples[t-1]
def ess_1d(chain):
n = len(chain)
rho1 = np.corrcoef(chain[:-1], chain[1:])[0, 1]
return max(1, n * (1 - rho1) / (1 + rho1))
burn = 500
print(f"{'':>14} {'HMC':>12} {'Random-walk MH':>16}")
print("-" * 44)
print(f"{'Accept rate':>14} {hmc_acc/T:>12.3f} {mh_acc/T:>16.3f}")
print(f"{'Mean[0]':>14} {hmc_samples[burn:,0].mean():>12.3f} "
f"{mh_samples[burn:,0].mean():>16.3f}")
print(f"{'Std[0]':>14} {hmc_samples[burn:,0].std():>12.3f} "
f"{mh_samples[burn:,0].std():>16.3f}")
print(f"{'Corr':>14} "
f"{np.corrcoef(hmc_samples[burn:].T)[0,1]:>12.3f} "
f"{np.corrcoef(mh_samples[burn:].T)[0,1]:>16.3f}")
print(f"{'ESS[0]':>14} {ess_1d(hmc_samples[burn:,0]):>12.0f} "
f"{ess_1d(mh_samples[burn:,0]):>16.0f}")
print(f"\nHMC achieves much higher ESS for correlated targets.")
The No-U-Turn Sampler (NUTS)¶
NUTS (Hoffman and Gelman, 2014) eliminates the need to hand-tune \(L\) by building a binary tree of leapfrog steps and stopping when the trajectory begins to “double back” (the U-turn criterion). Combined with dual averaging to adapt \(\varepsilon\), NUTS is the default sampler in Stan and PyMC.
18.5 Convergence Diagnostics¶
Markov chains are guaranteed to converge only in the limit. In practice we need diagnostics to assess whether our finite chains have mixed well. We want concrete numbers, not just “it looks okay.”
R-hat (Gelman–Rubin statistic)¶
Run \(m \ge 2\) chains from dispersed starting values. Let \(B\) be the between-chain variance of the chain means and \(W\) the average within-chain variance. The potential scale reduction factor is:
where \(n\) is the chain length.
Reading this formula: the numerator estimates the true variance of the posterior using a mixture of within-chain and between-chain information. If the chains have converged to the same distribution, the between-chain variance \(B\) will be small relative to \(W\), and the ratio will be close to 1. If the chains are still exploring different regions, \(B\) will be large, inflating \(\hat{R}\). Values close to 1 (typically \(\hat{R} < 1.01\)) indicate convergence.
Effective sample size (ESS)¶
Because MCMC samples are autocorrelated, \(S\) samples carry less information than \(S\) independent samples. The ESS is:
where \(\rho_k\) is the autocorrelation at lag \(k\).
In plain English: if consecutive MCMC samples are highly correlated (the chain moves slowly), the denominator grows large, and the ESS shrinks well below the nominal number of samples \(S\). For example, if every sample is nearly identical to the previous one (\(\rho_k \approx 1\) for many lags), you might have 10,000 samples but an ESS of only 50 — meaning your 10,000 correlated draws carry as much information as 50 independent ones. A low ESS means we need more iterations or a better sampler.
Let us compute R-hat and ESS from scratch, running multiple chains.
# Convergence diagnostics from scratch: R-hat and ESS
import numpy as np
np.random.seed(42)
# Target: N(3, 2^2). Run 4 chains from dispersed starts.
def mh_chain(T, sigma_prop, start, target_mu=3.0, target_sigma=2.0):
chain = np.zeros(T)
chain[0] = start
for t in range(1, T):
proposal = chain[t-1] + np.random.normal(0, sigma_prop)
log_alpha = (-0.5*((proposal - target_mu)/target_sigma)**2
+ 0.5*((chain[t-1] - target_mu)/target_sigma)**2)
if np.log(np.random.uniform()) < log_alpha:
chain[t] = proposal
else:
chain[t] = chain[t-1]
return chain
# 4 chains from dispersed starting values
T = 3000
starts = [-10, 0, 10, 20]
chains = [mh_chain(T, sigma_prop=2.5, start=s) for s in starts]
# Discard first half as burn-in
burn = T // 2
chains_post = [c[burn:] for c in chains]
m = len(chains_post)
n = len(chains_post[0])
# -- R-hat --
chain_means = np.array([c.mean() for c in chains_post])
chain_vars = np.array([c.var(ddof=1) for c in chains_post])
W = chain_vars.mean()
B = n * chain_means.var(ddof=1)
var_hat = (n - 1) / n * W + B / n
R_hat = np.sqrt(var_hat / W)
# -- ESS (using initial positive sequence estimator) --
def compute_ess(chain):
n = len(chain)
mean = chain.mean()
var = chain.var()
if var == 0:
return n
acf_sum = 0
for k in range(1, n):
rho_k = np.corrcoef(chain[:-k], chain[k:])[0, 1]
if rho_k < 0.05:
break
acf_sum += rho_k
return n / (1 + 2 * acf_sum)
ess_values = [compute_ess(c) for c in chains_post]
print("Convergence Diagnostics (target: N(3, 4))")
print(f" Chains: {m}, length: {T}, burn-in: {burn}")
print(f"\n {'Chain':>6} {'Start':>7} {'Mean':>8} {'Std':>8} {'ESS':>8}")
print(" " + "-" * 42)
for i, (c, s) in enumerate(zip(chains_post, starts)):
print(f" {i+1:>6} {s:>7.1f} {c.mean():>8.3f} {c.std():>8.3f} "
f"{ess_values[i]:>8.0f}")
print(f"\n R-hat: {R_hat:.4f} {'(converged)' if R_hat < 1.01 else '(NOT converged)'}")
print(f" Total ESS: {sum(ess_values):.0f}")
print(f" W (within): {W:.4f}")
print(f" B/n (between): {B/n:.4f}")
Trace plots¶
A simple visual diagnostic: plot each parameter against the iteration number. The chain should look like a “fuzzy caterpillar” — stationary and well-mixed — rather than showing trends or long excursions.
# Contrast good vs bad mixing by printing trajectory summaries
import numpy as np
np.random.seed(42)
# Well-tuned chain (sigma=2.5) vs poorly-tuned chain (sigma=0.05)
T = 5000
good = np.zeros(T)
bad = np.zeros(T)
good[0] = bad[0] = 10.0 # start away from target mean=3
for t in range(1, T):
# Good chain
prop = good[t-1] + np.random.normal(0, 2.5)
if np.log(np.random.uniform()) < -0.5*((prop-3)/2)**2 + 0.5*((good[t-1]-3)/2)**2:
good[t] = prop
else:
good[t] = good[t-1]
# Bad chain
prop = bad[t-1] + np.random.normal(0, 0.05)
if np.log(np.random.uniform()) < -0.5*((prop-3)/2)**2 + 0.5*((bad[t-1]-3)/2)**2:
bad[t] = prop
else:
bad[t] = bad[t-1]
# Print snapshots of trajectory
print(f"{'Iteration':>10} {'Good chain':>12} {'Bad chain':>12}")
print("-" * 36)
for t in [0, 10, 50, 100, 500, 1000, 2000, 4999]:
print(f"{t:>10} {good[t]:>12.3f} {bad[t]:>12.3f}")
print(f"\n{'Diagnostic':>18} {'Good':>10} {'Bad':>10}")
print("-" * 40)
rho_good = np.corrcoef(good[:-1], good[1:])[0, 1]
rho_bad = np.corrcoef(bad[:-1], bad[1:])[0, 1]
ess_good = T * (1 - rho_good) / (1 + rho_good)
ess_bad = T * (1 - rho_bad) / (1 + rho_bad)
print(f"{'Mean (last 3000)':>18} {good[2000:].mean():>10.3f} {bad[2000:].mean():>10.3f}")
print(f"{'Std (last 3000)':>18} {good[2000:].std():>10.3f} {bad[2000:].std():>10.3f}")
print(f"{'Lag-1 autocorr':>18} {rho_good:>10.3f} {rho_bad:>10.3f}")
print(f"{'ESS':>18} {ess_good:>10.0f} {ess_bad:>10.0f}")
print(f"{'True mean':>18} {'3.000':>10} {'3.000':>10}")
18.6 Variational Inference¶
Why variational methods? MCMC is asymptotically exact but can be slow — our hierarchical model takes thousands of iterations per second. What if we had a million observations? Variational inference (VI) turns the inference problem into an optimization problem, trading some accuracy for speed.
The ELBO¶
We want to approximate the posterior \(p(\theta \mid \mathbf{x})\) by a simpler distribution \(q(\theta)\) from some family \(\mathcal{Q}\). The KL divergence from \(q\) to the posterior is:
This is non-negative and zero only when \(q = p\). We want to minimize it, but it involves the intractable \(p(\theta \mid \mathbf{x})\).
Derivation of the ELBO. Write \(p(\theta \mid \mathbf{x}) = p(\mathbf{x}, \theta) / p(\mathbf{x})\):
Rearranging:
Because \(\text{KL} \ge 0\), we have:
The Evidence Lower BOund (ELBO) is:
Reading the second form: the ELBO has two competing terms. The first, \(E_q[\log p(\mathbf{x} \mid \theta)]\), rewards the approximation \(q\) for placing mass on parameter values that explain the data well (good fit). The second, \(\text{KL}(q \| p(\theta))\), penalizes \(q\) for straying too far from the prior (regularization). Maximizing the ELBO finds the best trade-off — the approximate posterior that fits the data as well as possible while remaining close to prior expectations.
Maximizing the ELBO is equivalent to minimizing \(\text{KL}(q \| p)\).
Let us implement variational inference for a Beta posterior, optimizing the ELBO with gradient ascent and printing the ELBO at each iteration.
# Variational inference: fit a Gaussian to a Beta(35,25) posterior
# Watch the ELBO increase iteration by iteration
import numpy as np
from scipy import stats
np.random.seed(42)
# True posterior: Beta(35, 25) from clinical trial
a_true, b_true = 35, 25
true_mean = a_true / (a_true + b_true)
true_std = np.sqrt(a_true * b_true / ((a_true+b_true)**2 * (a_true+b_true+1)))
# Variational family: q(theta) = N(mu_q, sigma_q^2) (on logit scale)
# Parameterize on logit scale for unconstrained optimization
mu_q = 0.0 # logit scale
log_sigma_q = -1.0 # log(sigma_q)
lr = 0.05
S = 5_000 # MC samples for ELBO gradient
print(f"{'Iter':>5} {'ELBO':>10} {'mu_q (logit)':>13} {'sigma_q':>9} "
f"{'mean(theta)':>12}")
print("-" * 52)
for iteration in range(80):
sigma_q = np.exp(log_sigma_q)
# Reparameterization trick: z = mu_q + sigma_q * eps
eps = np.random.normal(size=S)
z_logit = mu_q + sigma_q * eps
z = 1.0 / (1.0 + np.exp(-z_logit)) # inverse logit -> theta in (0,1)
z = np.clip(z, 1e-10, 1 - 1e-10)
# log p(x, theta) using Beta kernel
log_joint = (a_true - 1) * np.log(z) + (b_true - 1) * np.log(1 - z)
# log q(logit(theta)) with Jacobian
log_q = stats.norm.logpdf(z_logit, mu_q, sigma_q)
elbo = np.mean(log_joint - log_q)
# Gradients (score function estimator)
score_mu = (z_logit - mu_q) / sigma_q**2
score_log_sigma = ((z_logit - mu_q)**2 / sigma_q**2 - 1)
advantage = log_joint - log_q - elbo
grad_mu = np.mean(advantage * score_mu)
grad_log_sigma = np.mean(advantage * score_log_sigma)
mu_q += lr * grad_mu
log_sigma_q += lr * 0.1 * grad_log_sigma
if iteration < 10 or iteration % 10 == 0:
theta_mean = np.mean(z)
print(f"{iteration:>5} {elbo:>10.3f} {mu_q:>13.4f} "
f"{sigma_q:>9.4f} {theta_mean:>12.4f}")
print(f"\nTrue posterior: mean={true_mean:.4f}, std={true_std:.4f}")
print(f"VI approx: mean={np.mean(z):.4f}, std={np.std(z):.4f}")
Mean-field approximation¶
The simplest variational family assumes the parameters are independent:
Each factor \(q_j\) is optimized in turn, holding the others fixed.
Coordinate ascent VI (CAVI). The optimal update for \(q_j\) is:
where \(E_{-j}\) denotes expectation with respect to all \(q_k\) for \(k \neq j\).
In plain English: to update our approximation for one parameter \(\theta_j\), we take the log of the full joint distribution, then average out all the other parameters using their current approximate distributions. The result tells us the optimal shape for \(q_j\). This is analogous to fitting one variable at a time while holding the others fixed, which is why it is called “coordinate ascent.” This is derived by taking the functional derivative of the ELBO with respect to \(q_j\) and setting it to zero.
CAVI iterates over all components until the ELBO converges.
# Coordinate Ascent VI for a mixture of two Gaussians (mean-field)
import numpy as np
from scipy import stats
np.random.seed(42)
# Data from a Gaussian with unknown mean and precision
true_mu, true_tau = 3.0, 0.5 # tau = 1/sigma^2
data = np.random.normal(true_mu, 1/np.sqrt(true_tau), size=50)
n = len(data)
x_bar = data.mean()
x_var = data.var()
# Mean-field: q(mu, tau) = q_mu(mu) * q_tau(tau)
# q_mu = N(m, s^2), q_tau = Gamma(a, b)
# Hyperpriors: mu ~ N(0, 100), tau ~ Gamma(0.01, 0.01)
m, s2 = 0.0, 10.0
a_q, b_q = 1.0, 1.0
print(f"{'Iter':>5} {'E[mu]':>8} {'E[tau]':>8} {'E[sigma]':>9} {'ELBO':>10}")
print("-" * 43)
for it in range(20):
# Update q(mu): N(m, s^2)
E_tau = a_q / b_q
s2 = 1.0 / (1.0/100.0 + n * E_tau)
m = s2 * (0.0/100.0 + E_tau * n * x_bar)
# Update q(tau): Gamma(a_q, b_q)
E_mu = m
E_mu2 = s2 + m**2
a_q = 0.01 + n / 2.0
b_q = 0.01 + 0.5 * (np.sum(data**2) - 2 * E_mu * np.sum(data) + n * E_mu2)
# Compute ELBO (simplified)
elbo = (-n/2 * np.log(2*np.pi) + n/2 * (np.euler_gamma + np.log(2*b_q)
- 1/(a_q)) - 0.5 * n * (a_q/b_q) * x_var)
print(f"{it:>5} {m:>8.4f} {a_q/b_q:>8.4f} "
f"{1/np.sqrt(a_q/b_q):>9.4f} {elbo:>10.2f}")
print(f"\nTrue: mu={true_mu}, sigma={1/np.sqrt(true_tau):.3f}")
Stochastic variational inference¶
For large datasets, computing the ELBO over all data is expensive. Stochastic VI (Hoffman et al., 2013) uses mini-batches: subsample the data, compute a noisy gradient of the ELBO, and update the variational parameters via stochastic gradient ascent. This scales VI to millions of observations.
18.7 Integrated Nested Laplace Approximations (INLA)¶
Why INLA? For a large class of models — latent Gaussian models (LGMs) — INLA provides fast, deterministic, and accurate posterior approximations without MCMC.
A latent Gaussian model has three stages:
Observations: \(y_i \mid \eta_i, \boldsymbol{\psi}\) from an exponential family.
Latent field: \(\boldsymbol{\eta} \mid \boldsymbol{\psi} \sim N(\boldsymbol{\mu}_\psi, \boldsymbol{Q}_\psi^{-1})\) (Gaussian with a sparse precision matrix \(\boldsymbol{Q}\)).
Hyperparameters: \(\boldsymbol{\psi}\) with prior \(p(\boldsymbol{\psi})\).
The key insight: with a Gaussian latent field, the Laplace approximation (Section 17.5 in Chapter 17) is very accurate for the conditional posterior \(p(\boldsymbol{\eta} \mid \boldsymbol{\psi}, \mathbf{y})\). INLA then numerically integrates over the (typically low-dimensional) \(\boldsymbol{\psi}\).
When to use INLA.
INLA is the method of choice when the model fits the LGM framework: generalized
linear models, spatial models, time series, survival models, etc. It is
implemented in the R package R-INLA.
18.8 Comparison of Methods¶
Method |
Accuracy |
Speed |
Scalability |
When to use |
|---|---|---|---|---|
Laplace approx. |
Moderate |
Very fast |
High |
Unimodal, symmetric posteriors; quick screening |
MCMC (MH/Gibbs) |
Exact (asymptotic) |
Slow |
Low–moderate |
Gold standard; moderate dimensions |
HMC / NUTS |
Exact (asymptotic) |
Moderate |
Moderate–high |
Continuous parameters; up to thousands of dimensions |
Variational inference |
Approximate |
Fast |
Very high |
Large datasets; exploratory analysis; deep generative models |
INLA |
Very good |
Fast |
High |
Latent Gaussian models |
Let us run a horse race: apply three methods to the same problem and compare.
# Head-to-head comparison: Laplace vs MH vs Gibbs on Normal-Normal
import numpy as np
from scipy import stats, optimize
import time
np.random.seed(42)
# Data: N(mu, 1), mu ~ N(0, 10^2), observe 50 points
true_mu = 3.0
data = np.random.normal(true_mu, 1.0, size=50)
n = len(data)
x_bar = data.mean()
# --- Exact posterior (conjugate) ---
prec_post = 1/100 + n
mu_post_exact = (0/100 + n*x_bar) / prec_post
sd_post_exact = np.sqrt(1/prec_post)
# --- Method 1: Laplace approximation ---
t0 = time.time()
neg_log_post = lambda mu: 0.5*n*(mu - x_bar)**2 + 0.5*mu**2/100
res = optimize.minimize_scalar(neg_log_post)
mu_lap = res.x
h2 = n + 1/100
sd_lap = np.sqrt(1/h2)
t_lap = time.time() - t0
# --- Method 2: Metropolis-Hastings ---
t0 = time.time()
T_mh = 10_000
mh = np.zeros(T_mh)
mh[0] = 0.0
acc_mh = 0
for t in range(1, T_mh):
prop = mh[t-1] + np.random.normal(0, 0.3)
la = (-0.5*n*(prop-x_bar)**2 - 0.5*prop**2/100
+ 0.5*n*(mh[t-1]-x_bar)**2 + 0.5*mh[t-1]**2/100)
if np.log(np.random.uniform()) < la:
mh[t] = prop
acc_mh += 1
else:
mh[t] = mh[t-1]
t_mh = time.time() - t0
mh_post = mh[2000:]
# --- Method 3: Direct sampling (exact conjugate) ---
t0 = time.time()
exact_samples = np.random.normal(mu_post_exact, sd_post_exact, 10_000)
t_exact = time.time() - t0
print(f"{'Method':>18} {'Mean':>8} {'Std':>8} {'Time (ms)':>10}")
print("-" * 48)
print(f"{'Exact':>18} {mu_post_exact:>8.4f} {sd_post_exact:>8.4f} {t_exact*1000:>10.2f}")
print(f"{'Laplace':>18} {mu_lap:>8.4f} {sd_lap:>8.4f} {t_lap*1000:>10.2f}")
print(f"{'MH (10k iter)':>18} {mh_post.mean():>8.4f} {mh_post.std():>8.4f} "
f"{t_mh*1000:>10.2f}")
Rules of thumb.
If the model is a latent Gaussian model, try INLA first.
For moderate-dimensional continuous models, use HMC/NUTS (Stan, PyMC).
For massive datasets or deep generative models, use variational inference.
Use Gibbs sampling when full conditionals are available and the dimension is not too high.
Use random-walk Metropolis as a baseline or when gradients are unavailable.
18.9 Summary¶
Monte Carlo integration approximates integrals by sampling; importance sampling reweights samples from a proposal.
Metropolis–Hastings constructs a Markov chain satisfying detailed balance; the acceptance probability is derived from the stationarity condition.
Gibbs sampling is MH with full conditional proposals and acceptance rate 1.
HMC uses Hamiltonian dynamics and the leapfrog integrator to make large, gradient-informed moves; NUTS automates the trajectory length.
Convergence is assessed with \(\hat{R}\), ESS, and trace plots.
Variational inference turns inference into optimization via the ELBO.
INLA combines Laplace approximations with numerical integration for latent Gaussian models.
# Summary: which method for which problem?
methods = [
("Conjugate (Beta-Binomial)", "Exact", "< 1 ms", "Low-dim conjugate"),
("Laplace approximation", "Good", "< 10 ms", "Unimodal posteriors"),
("Metropolis-Hastings", "Exact (lim)", "seconds", "General, low-dim"),
("Gibbs sampling", "Exact (lim)", "seconds", "Conjugate conditionals"),
("HMC / NUTS", "Exact (lim)", "seconds", "Continuous, high-dim"),
("Variational inference", "Approximate", "< 1 sec", "Large data, deep models"),
]
print(f"{'Method':<30} {'Accuracy':<14} {'Speed':<12} {'Best for'}")
print("-" * 80)
for name, acc, speed, use in methods:
print(f"{name:<30} {acc:<14} {speed:<12} {use}")