Chapter 18 – Computational Methods

The motivating problem. In Chapter 17 we enjoyed the luxury of conjugate priors: the posterior had a closed form. But real-world models are rarely so cooperative. Consider a hierarchical model for clinical trials across multiple hospitals. Hospital \(j\) has response rate \(\theta_j\), and we believe these rates come from a common population: \(\theta_j \sim \text{Beta}(\mu\kappa,\, (1-\mu)\kappa)\), where \(\mu\) and \(\kappa\) are unknown hyperparameters.

The posterior is:

\[p(\mu, \kappa, \boldsymbol{\theta} \mid \mathbf{y}) \;\propto\; \prod_{j=1}^J \binom{n_j}{y_j} \theta_j^{y_j}(1-\theta_j)^{n_j - y_j} \;\cdot\; \prod_{j=1}^J \text{Beta}(\theta_j \mid \mu\kappa, (1-\mu)\kappa) \;\cdot\; p(\mu)\, p(\kappa).\]

We can evaluate this pointwise, but we cannot integrate it analytically. What now? This chapter builds the computational highway system that gets you to any posterior, no matter how complicated.

# The hierarchical model we will tackle throughout this chapter
import numpy as np
from scipy.special import gammaln

np.random.seed(42)

# Data: 8 hospitals, each with n_j patients and y_j responders
n_hospitals = 8
n_j = np.array([20, 30, 25, 40, 15, 35, 28, 22])
y_j = np.array([11, 16, 10, 25,  7, 19, 14,  9])

print("Hospital data:")
print(f"  {'Hospital':>10} {'n_j':>5} {'y_j':>5} {'MLE':>8}")
print("  " + "-" * 30)
for j in range(n_hospitals):
    print(f"  {j+1:>10} {n_j[j]:>5} {y_j[j]:>5} {y_j[j]/n_j[j]:>8.3f}")
print(f"\n  Hospital MLEs range from {min(y_j/n_j):.3f} to {max(y_j/n_j):.3f}")
print("  Can we borrow strength across hospitals?  -> Chapter 18 shows how.")

def log_posterior(theta_vec, mu, kappa):
    """Unnormalized log-posterior for the hierarchical model."""
    a = mu * kappa
    b = (1 - mu) * kappa
    # Likelihood
    ll = np.sum(y_j * np.log(theta_vec) + (n_j - y_j) * np.log(1 - theta_vec))
    # Prior on theta_j: Beta(a, b)
    lp_theta = np.sum((a - 1)*np.log(theta_vec) + (b - 1)*np.log(1 - theta_vec)
                      - gammaln(a) - gammaln(b) + gammaln(a + b))
    # Hyperpriors: mu ~ Beta(2,2), kappa ~ Gamma(2, 0.1)
    lp_mu = (2-1)*np.log(mu) + (2-1)*np.log(1-mu) if 0 < mu < 1 else -1e10
    lp_kappa = (2-1)*np.log(kappa) - 0.1*kappa if kappa > 0 else -1e10
    return ll + lp_theta + lp_mu + lp_kappa

18.1 Monte Carlo Integration

Why we need this. Many quantities of interest — posterior means, marginal likelihoods, predictive distributions — are integrals of the form

\[I = \int g(\theta)\, p(\theta \mid \mathbf{x})\, d\theta.\]

When the integral has no closed form we resort to Monte Carlo methods: generate samples \(\theta^{(1)}, \ldots, \theta^{(S)}\) from \(p(\theta \mid \mathbf{x})\) and approximate

\[\hat{I} = \frac{1}{S}\sum_{s=1}^{S} g(\theta^{(s)}).\]

By the law of large numbers \(\hat{I} \to I\) as \(S \to \infty\), and the central limit theorem gives the Monte Carlo standard error:

\[\text{SE}(\hat{I}) = \frac{\text{sd}(g(\theta^{(s)}))}{\sqrt{S}}.\]

The error decreases as \(1/\sqrt{S}\) regardless of the dimension of \(\theta\) — a crucial advantage over deterministic quadrature in high dimensions.

# Monte Carlo integration: watch the estimate converge
import numpy as np

np.random.seed(42)

# Goal: estimate E[theta] where theta ~ Beta(35, 25)
# (our clinical trial posterior from Chapter 17)
from scipy import stats
true_mean = 35 / (35 + 25)

samples = stats.beta.rvs(35, 25, size=10_000)

print(f"{'S':>7} {'MC estimate':>12} {'SE':>10} {'|Error|':>10}")
print("-" * 42)
for S in [10, 50, 100, 500, 1000, 5000, 10000]:
    mc_est = samples[:S].mean()
    mc_se = samples[:S].std() / np.sqrt(S)
    print(f"{S:>7} {mc_est:>12.6f} {mc_se:>10.6f} "
          f"{abs(mc_est - true_mean):>10.6f}")
print(f"{'True':>7} {true_mean:>12.6f}")
print("\nSE decreases as 1/sqrt(S) regardless of dimension.")

Intuition

Monte Carlo integration is essentially “estimation by random polling.” Just as a political poll of 1000 random voters can approximate the opinion of millions, 10,000 random draws from a posterior can approximate intractable integrals to several decimal places.

Importance sampling

Often we cannot sample directly from the target \(p(\theta \mid \mathbf{x})\) but we can sample from a proposal distribution \(q(\theta)\). Write

\[I = \int g(\theta)\, \frac{p(\theta \mid \mathbf{x})}{q(\theta)}\, q(\theta)\, d\theta = E_q\!\left[g(\theta)\, w(\theta)\right],\]

where the importance weight is

\[w(\theta) = \frac{p(\theta \mid \mathbf{x})}{q(\theta)}.\]

In plain English: the weight measures how much more (or less) probable \(\theta\) is under the target distribution compared to the proposal. Samples from regions where the proposal underrepresents the target get upweighted; samples from overrepresented regions get downweighted.

Drawing \(\theta^{(s)} \sim q\) and computing

\[\hat{I} = \frac{\sum_{s=1}^S w(\theta^{(s)})\, g(\theta^{(s)})} {\sum_{s=1}^S w(\theta^{(s)})}\]

gives a consistent (self-normalized) importance sampling estimator. The denominator normalizes the weights so they sum to one, which is why this is called “self-normalized” — we do not need to know the normalizing constants of \(p\) or \(q\). The quality depends on how well \(q\) covers the tails of \(p\); if \(q\) has lighter tails the variance can be infinite.

# Importance sampling: good vs bad proposal
import numpy as np
from scipy import stats

np.random.seed(42)

# Target: Beta(35, 25), want E[theta]
true_mean = 35 / 60
S = 10_000

# Good proposal: N(0.58, 0.07) -- covers the target well
# Bad proposal:  N(0.3, 0.05) -- centered far from the target
for label, mu_q, sigma_q in [("Good proposal N(0.58, 0.07)", 0.58, 0.07),
                               ("Bad proposal  N(0.30, 0.05)", 0.30, 0.05)]:
    z = np.random.normal(mu_q, sigma_q, S)
    z = z[(z > 0.01) & (z < 0.99)]  # keep in valid range

    log_w = (stats.beta.logpdf(z, 35, 25)
             - stats.norm.logpdf(z, mu_q, sigma_q))
    log_w -= log_w.max()  # for numerical stability
    w = np.exp(log_w)

    estimate = np.sum(w * z) / np.sum(w)
    ess = np.sum(w)**2 / np.sum(w**2)

    print(f"{label}:")
    print(f"  Estimate: {estimate:.6f} (true: {true_mean:.6f})")
    print(f"  ESS:      {ess:.0f} / {len(z)} samples")
    print(f"  ESS ratio: {ess/len(z):.1%}\n")

Common Pitfall

If your proposal distribution \(q\) has thinner tails than the target \(p\), a few extreme importance weights will dominate the estimate, making it highly unstable. Always check the effective sample size of your importance weights: \(\text{ESS} = (\sum w_s)^2 / \sum w_s^2\).

18.2 The Metropolis–Hastings Algorithm

Why Markov chains? Importance sampling works well in low dimensions, but in high dimensions it is very hard to find a good proposal \(q\). Instead, we construct a Markov chain whose stationary distribution is the target posterior. After a burn-in period the chain produces (correlated) samples from the posterior.

We will build a Metropolis–Hastings sampler for our hierarchical model. But first, let us derive the algorithm from first principles.

Detailed balance

A Markov chain with transition kernel \(T(\theta' \mid \theta)\) has stationary distribution \(\pi(\theta) = p(\theta \mid \mathbf{x})\) if detailed balance holds:

\[\pi(\theta)\, T(\theta' \mid \theta) = \pi(\theta')\, T(\theta \mid \theta').\]

This says the probability of being at \(\theta\) and moving to \(\theta'\) equals the probability of the reverse.

Derivation of the acceptance probability

We choose a proposal distribution \(q(\theta' \mid \theta)\) and set the transition as: propose \(\theta'\) from \(q\), then accept with probability \(\alpha(\theta, \theta')\). Thus

\[T(\theta' \mid \theta) = q(\theta' \mid \theta)\, \alpha(\theta, \theta').\]

Substituting into the detailed balance condition:

\[\pi(\theta)\, q(\theta' \mid \theta)\, \alpha(\theta, \theta') = \pi(\theta')\, q(\theta \mid \theta')\, \alpha(\theta', \theta).\]

Rearranging:

\[\frac{\alpha(\theta, \theta')}{\alpha(\theta', \theta)} = \frac{\pi(\theta')\, q(\theta \mid \theta')} {\pi(\theta)\, q(\theta' \mid \theta)}.\]

We need acceptance probabilities \(\alpha\) that satisfy this ratio and lie in \([0, 1]\). The Metropolis–Hastings choice picks the largest such probabilities:

\[\alpha(\theta, \theta') = \min\!\left(1,\; \frac{\pi(\theta')\, q(\theta \mid \theta')} {\pi(\theta)\, q(\theta' \mid \theta)}\right).\]

Reading this formula: the numerator is the posterior density at the proposed point times the probability of proposing a move back to where we are; the denominator is the same thing in the opposite direction. If the proposed point is “better” (higher posterior density, easy to get back from), the ratio exceeds 1 and we always accept. If it is “worse,” we accept with a probability equal to the ratio, which gives worse points a fighting chance — this is essential for exploring the full posterior.

One can verify this satisfies the ratio above. Because \(\pi\) is the posterior, only the ratio \(\pi(\theta')/\pi(\theta)\) is needed — the normalizing constant cancels.

Why does this matter?

The cancellation of the normalizing constant is what makes MCMC so powerful. You never need to compute the marginal likelihood \(p(\mathbf{x})\). You only need to evaluate the unnormalized posterior — which is just the likelihood times the prior. This is the engine that makes Bayesian inference practical for complex models.

Algorithm

  1. Initialize \(\theta^{(0)}\).

  2. For \(t = 1, 2, \ldots\):

    1. Draw \(\theta^* \sim q(\cdot \mid \theta^{(t-1)})\).

    2. Compute

      \[\alpha = \min\!\left(1,\; \frac{p(\theta^* \mid \mathbf{x})\, q(\theta^{(t-1)} \mid \theta^*)} {p(\theta^{(t-1)} \mid \mathbf{x})\, q(\theta^* \mid \theta^{(t-1)})}\right).\]
    3. With probability \(\alpha\) set \(\theta^{(t)} = \theta^*\); otherwise set \(\theta^{(t)} = \theta^{(t-1)}\).

Special case: random-walk Metropolis. If \(q(\theta' \mid \theta) = q(\theta \mid \theta')\) (symmetric proposal, e.g., \(\theta' = \theta + \epsilon\) with \(\epsilon \sim N(0, \sigma^2)\)) the proposal ratio cancels and

\[\alpha = \min\!\left(1,\; \frac{p(\theta^* \mid \mathbf{x})} {p(\theta^{(t-1)} \mid \mathbf{x})}\right).\]

Now let us apply this to Bayesian logistic regression, printing the first 20 iterations so you can see the sampler in action.

# Metropolis-Hastings for logistic regression: iteration by iteration
import numpy as np
from scipy.special import expit

np.random.seed(42)

# Simulate: 50 patients, dosage x predicts recovery y
n = 50
x = np.random.normal(0, 1, n)
true_beta = np.array([0.5, 1.5])
prob = expit(true_beta[0] + true_beta[1] * x)
y = np.random.binomial(1, prob)

def log_posterior(beta):
    eta = beta[0] + beta[1] * x
    log_lik = np.sum(y * eta - np.log(1 + np.exp(eta)))
    log_prior = -0.5 * np.sum(beta**2) / 100  # N(0, 10^2)
    return log_lik + log_prior

# MH sampler
T = 20_000
samples = np.zeros((T, 2))
samples[0] = [0.0, 0.0]
sigma_prop = 0.3
accepted = 0

# Print header for first 20 iterations
print(f"{'t':>4} {'beta0*':>8} {'beta1*':>8} {'log_alpha':>10} "
      f"{'Accept?':>8} {'beta0':>8} {'beta1':>8}")
print("-" * 60)

for t in range(1, T):
    proposal = samples[t-1] + np.random.normal(0, sigma_prop, 2)
    log_alpha = log_posterior(proposal) - log_posterior(samples[t-1])
    u = np.random.uniform()
    accept = np.log(u) < log_alpha

    if accept:
        samples[t] = proposal
        accepted += 1
    else:
        samples[t] = samples[t-1]

    if t <= 20:
        print(f"{t:>4} {proposal[0]:>8.3f} {proposal[1]:>8.3f} "
              f"{log_alpha:>10.3f} {'YES' if accept else 'no':>8} "
              f"{samples[t, 0]:>8.3f} {samples[t, 1]:>8.3f}")

burn_in = 5000
post_samples = samples[burn_in:]
print(f"\n--- After {T} iterations (burn-in={burn_in}) ---")
print(f"Acceptance rate:     {accepted / T:.3f}")
print(f"Posterior mean beta: [{post_samples[:,0].mean():.3f}, "
      f"{post_samples[:,1].mean():.3f}]")
print(f"Posterior std beta:  [{post_samples[:,0].std():.3f}, "
      f"{post_samples[:,1].std():.3f}]")
print(f"True beta:           [{true_beta[0]:.3f}, {true_beta[1]:.3f}]")

Proposal tuning

The step size \(\sigma\) controls the trade-off between exploring widely (large steps, many rejections) and staying local (small steps, high acceptance but slow exploration). The optimal acceptance rate for a \(d\)-dimensional Gaussian target is approximately 23.4% (Roberts, Gelman, and Gilks, 1997).

# Effect of proposal step size on acceptance rate and mixing
import numpy as np

np.random.seed(42)

# Target: standard bivariate normal
def log_target(theta):
    return -0.5 * np.sum(theta**2)

def run_mh(sigma_prop, T=10_000, start=np.array([5.0, 5.0])):
    samples = np.zeros((T, 2))
    samples[0] = start
    acc = 0
    for t in range(1, T):
        proposal = samples[t-1] + np.random.normal(0, sigma_prop, 2)
        log_alpha = log_target(proposal) - log_target(samples[t-1])
        if np.log(np.random.uniform()) < log_alpha:
            samples[t] = proposal
            acc += 1
        else:
            samples[t] = samples[t-1]
    return samples, acc / T

print(f"{'sigma_prop':>12} {'Accept rate':>13} {'Post mean':>16} "
      f"{'Post std':>12}")
print("-" * 56)
for sigma in [0.05, 0.3, 1.0, 2.4, 5.0, 20.0]:
    samp, rate = run_mh(sigma)
    post = samp[2000:]
    print(f"{sigma:>12.2f} {rate:>13.3f} "
          f"[{post[:,0].mean():>5.2f}, {post[:,1].mean():>5.2f}] "
          f"[{post[:,0].std():>4.2f}, {post[:,1].std():>4.2f}]")

print("\nOptimal acceptance rate for d=2: ~23.4%")
print("sigma_prop ~ 2.4 hits the sweet spot.")

18.3 Gibbs Sampling

Why Gibbs? In many models the full conditional distribution of each parameter block — the distribution of one parameter given all others and the data — has a known form (especially with conjugate priors). Instead of proposing moves in all dimensions at once, we cycle through one dimension at a time, always drawing from the exact conditional distribution. Every draw is accepted.

Full conditional distributions

For a parameter vector \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)\), the full conditional of \(\theta_j\) is

\[p(\theta_j \mid \theta_{-j}, \mathbf{x}) \;\propto\; p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}),\]

treating all \(\theta_k\) for \(k \neq j\) as fixed. In conjugate models these are standard distributions.

Algorithm

  1. Initialize \(\boldsymbol{\theta}^{(0)}\).

  2. For \(t = 1, 2, \ldots\):

    1. Draw \(\theta_1^{(t)} \sim p(\theta_1 \mid \theta_2^{(t-1)}, \ldots, \theta_p^{(t-1)}, \mathbf{x})\).

    2. Draw \(\theta_2^{(t)} \sim p(\theta_2 \mid \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, \theta_p^{(t-1)}, \mathbf{x})\).

    3. Continue cycling through all components.

Gibbs as a special case of Metropolis–Hastings

Derivation. In MH, set the proposal for updating \(\theta_j\) to be the full conditional itself: \(q(\theta_j' \mid \theta_j, \theta_{-j}) = p(\theta_j' \mid \theta_{-j}, \mathbf{x})\). Then the acceptance ratio is:

\[\begin{split}\alpha &= \frac{p(\theta_j', \theta_{-j} \mid \mathbf{x})\; p(\theta_j \mid \theta_{-j}, \mathbf{x})} {p(\theta_j, \theta_{-j} \mid \mathbf{x})\; p(\theta_j' \mid \theta_{-j}, \mathbf{x})} \\ &= \frac{p(\theta_j' \mid \theta_{-j}, \mathbf{x})\, p(\theta_{-j} \mid \mathbf{x})\, p(\theta_j \mid \theta_{-j}, \mathbf{x})} {p(\theta_j \mid \theta_{-j}, \mathbf{x})\, p(\theta_{-j} \mid \mathbf{x})\, p(\theta_j' \mid \theta_{-j}, \mathbf{x})} \\ &= 1.\end{split}\]

Every proposal is accepted! Gibbs sampling is MH with an optimal component-wise proposal.

Example: Normal model with unknown mean and variance. With \(x_i \sim N(\mu, \sigma^2)\), \(\mu \sim N(\mu_0, \sigma_0^2)\), and \(\sigma^2 \sim \text{Inv-Gamma}(a, b)\), the full conditionals are:

  • \(\mu \mid \sigma^2, \mathbf{x} \sim N(\mu_n, \sigma_n^2)\) (Normal–Normal conjugacy from Chapter 17).

  • \(\sigma^2 \mid \mu, \mathbf{x} \sim \text{Inv-Gamma}(a + n/2,\; b + \frac{1}{2}\sum(x_i - \mu)^2)\).

Let us implement this, printing every iteration for the first 15 to see the alternating updates converge.

# Gibbs sampler for Normal model: mu and sigma^2 with alternating updates
import numpy as np
from scipy import stats

np.random.seed(42)

# Generate data from N(mu=5, sigma^2=4)
true_mu, true_sigma2 = 5.0, 4.0
data = np.random.normal(true_mu, np.sqrt(true_sigma2), size=30)
n = len(data)
x_bar = data.mean()
S_data = np.sum((data - x_bar)**2)

# Hyperpriors
mu_0, sigma_0_sq = 0.0, 100.0     # mu ~ N(0, 100)
a_prior, b_prior = 0.01, 0.01     # sigma^2 ~ InvGamma(0.01, 0.01)

# Gibbs sampler
T = 5_000
mu_samples = np.zeros(T)
sigma2_samples = np.zeros(T)
mu_samples[0] = 0.0       # start far from truth
sigma2_samples[0] = 20.0  # start far from truth

print(f"{'t':>4} {'mu':>10} {'sigma2':>10}   Full conditional used")
print("-" * 55)
print(f"{'0':>4} {mu_samples[0]:>10.4f} {sigma2_samples[0]:>10.4f}   (initial)")

for t in range(1, T):
    # -- Update mu | sigma^2, data --
    prec_prior = 1.0 / sigma_0_sq
    prec_data = n / sigma2_samples[t-1]
    prec_post = prec_prior + prec_data
    mu_post = (prec_prior * mu_0 + prec_data * x_bar) / prec_post
    sigma_post = np.sqrt(1.0 / prec_post)
    mu_samples[t] = np.random.normal(mu_post, sigma_post)

    # -- Update sigma^2 | mu, data --
    a_post = a_prior + n / 2.0
    b_post = b_prior + 0.5 * np.sum((data - mu_samples[t])**2)
    sigma2_samples[t] = 1.0 / np.random.gamma(a_post, 1.0 / b_post)

    if t <= 15 or t == T-1:
        label = f"  N({mu_post:.2f}, {sigma_post:.2f}^2) -> " \
                f"IG({a_post:.1f}, {b_post:.1f})"
        print(f"{t:>4} {mu_samples[t]:>10.4f} {sigma2_samples[t]:>10.4f} {label}")

burn_in = 500
print(f"\n--- After {T} iterations (burn-in={burn_in}) ---")
print(f"Posterior E[mu]:      {mu_samples[burn_in:].mean():.4f}  (true={true_mu})")
print(f"Posterior E[sigma^2]: {sigma2_samples[burn_in:].mean():.4f}  (true={true_sigma2})")
print(f"Posterior sd(mu):     {mu_samples[burn_in:].std():.4f}")
print(f"Data mean:            {x_bar:.4f}")

Now let us tackle the full hierarchical model from the chapter opening using a Gibbs sampler. Each \(\theta_j\) has a conjugate full conditional (Beta), while \(\mu\) and \(\kappa\) require Metropolis steps within the Gibbs.

# Gibbs sampler for the hierarchical Beta-Binomial model
import numpy as np
from scipy.special import gammaln

np.random.seed(42)

# Data from chapter opening
n_hospitals = 8
n_j = np.array([20, 30, 25, 40, 15, 35, 28, 22])
y_j = np.array([11, 16, 10, 25,  7, 19, 14,  9])

T = 10_000
theta_samples = np.zeros((T, n_hospitals))
mu_samples = np.zeros(T)
kappa_samples = np.zeros(T)

# Initialize
theta_samples[0] = y_j / n_j
mu_samples[0] = 0.5
kappa_samples[0] = 10.0

def log_beta_binomial_hyperprior(mu, kappa, thetas):
    """Log-density of the hyperprior part: Beta priors on theta_j + hyperpriors."""
    if not (0.01 < mu < 0.99 and kappa > 0.5):
        return -1e10
    a = mu * kappa
    b = (1 - mu) * kappa
    lp = np.sum((a-1)*np.log(thetas) + (b-1)*np.log(1-thetas)
                 - gammaln(a) - gammaln(b) + gammaln(a+b))
    lp += (2-1)*np.log(mu) + (2-1)*np.log(1-mu)    # mu ~ Beta(2,2)
    lp += (2-1)*np.log(kappa) - 0.1*kappa           # kappa ~ Gamma(2,0.1)
    return lp

accepted_mu, accepted_kappa = 0, 0

for t in range(1, T):
    mu_cur = mu_samples[t-1]
    kappa_cur = kappa_samples[t-1]

    # -- Gibbs step: update each theta_j from its Beta full conditional --
    a = mu_cur * kappa_cur + y_j
    b = (1 - mu_cur) * kappa_cur + n_j - y_j
    theta_samples[t] = np.random.beta(a, b)

    # -- MH step: update mu --
    mu_prop = mu_cur + np.random.normal(0, 0.05)
    if 0.01 < mu_prop < 0.99:
        log_r = (log_beta_binomial_hyperprior(mu_prop, kappa_cur, theta_samples[t])
                 - log_beta_binomial_hyperprior(mu_cur, kappa_cur, theta_samples[t]))
        if np.log(np.random.uniform()) < log_r:
            mu_cur = mu_prop
            accepted_mu += 1
    mu_samples[t] = mu_cur

    # -- MH step: update kappa --
    kappa_prop = kappa_cur * np.exp(np.random.normal(0, 0.2))  # log-scale proposal
    log_r = (log_beta_binomial_hyperprior(mu_cur, kappa_prop, theta_samples[t])
             - log_beta_binomial_hyperprior(mu_cur, kappa_cur, theta_samples[t])
             + np.log(kappa_prop) - np.log(kappa_cur))  # Jacobian
    if np.log(np.random.uniform()) < log_r:
        kappa_cur = kappa_prop
        accepted_kappa += 1
    kappa_samples[t] = kappa_cur

burn_in = 3000
print("Hierarchical Beta-Binomial: Gibbs + MH results")
print(f"  mu acceptance rate:    {accepted_mu/T:.3f}")
print(f"  kappa acceptance rate: {accepted_kappa/T:.3f}")
print(f"\n  E[mu]:     {mu_samples[burn_in:].mean():.4f}")
print(f"  E[kappa]:  {kappa_samples[burn_in:].mean():.2f}")
print(f"\n  {'Hospital':>10} {'MLE':>8} {'Post mean':>10} {'Shrinkage':>10}")
print("  " + "-" * 40)
for j in range(n_hospitals):
    mle = y_j[j] / n_j[j]
    pm = theta_samples[burn_in:, j].mean()
    shrink = (mle - pm) / (mle - mu_samples[burn_in:].mean())
    print(f"  {j+1:>10} {mle:>8.3f} {pm:>10.3f} {shrink:>10.1%}")
print("\n  Note: smaller hospitals are shrunk more toward the overall mean.")

18.4 Hamiltonian Monte Carlo

Why HMC? Random-walk Metropolis explores the parameter space by diffusion, which becomes extremely slow in high dimensions. HMC uses the gradient of the log-posterior to make large, directed moves while maintaining a high acceptance rate. Think of it as giving the sampler a “GPS” rather than letting it wander blindly.

Hamiltonian mechanics analogy

Imagine a frictionless puck sliding on a surface whose height is \(U(\theta) = -\log p(\theta \mid \mathbf{x})\). The puck has position \(\theta\) and momentum \(\mathbf{r}\). The total energy (Hamiltonian) is:

\[H(\theta, \mathbf{r}) = U(\theta) + K(\mathbf{r}),\]

where \(K(\mathbf{r}) = \frac{1}{2}\mathbf{r}^T M^{-1} \mathbf{r}\) is the kinetic energy and \(M\) is a “mass matrix” (usually set to the identity or an estimate of the posterior covariance).

Hamilton’s equations give the dynamics:

\[\begin{split}\frac{d\theta}{dt} &= \frac{\partial H}{\partial \mathbf{r}} = M^{-1}\mathbf{r}, \\ \frac{d\mathbf{r}}{dt} &= -\frac{\partial H}{\partial \theta} = -\nabla U(\theta) = \nabla \log p(\theta \mid \mathbf{x}).\end{split}\]

In plain English: the first equation says the parameter moves in the direction of the momentum (the puck slides). The second equation says the momentum changes according to the gradient of the log-posterior — the puck accelerates downhill toward regions of high posterior density. Together, these create smooth, curving trajectories that efficiently explore the posterior landscape.

These equations are energy-preserving and volume-preserving (symplectic), which is exactly what we need for a valid MCMC proposal.

The leapfrog integrator

We cannot solve Hamilton’s equations analytically, so we use the leapfrog integrator with step size \(\varepsilon\) and \(L\) steps:

  1. \(\mathbf{r} \leftarrow \mathbf{r} + \frac{\varepsilon}{2}\, \nabla \log p(\theta \mid \mathbf{x})\) (half-step for momentum).

  2. For \(l = 1, \ldots, L\):

    1. \(\theta \leftarrow \theta + \varepsilon\, M^{-1}\mathbf{r}\) (full step for position).

    2. If \(l < L\): \(\mathbf{r} \leftarrow \mathbf{r} + \varepsilon\, \nabla \log p(\theta \mid \mathbf{x})\) (full step for momentum).

  3. \(\mathbf{r} \leftarrow \mathbf{r} + \frac{\varepsilon}{2}\, \nabla \log p(\theta \mid \mathbf{x})\) (final half-step for momentum).

The leapfrog integrator is symplectic (volume-preserving) and time-reversible, so the MH acceptance probability is:

\[\alpha = \min\!\bigl(1,\; \exp\!\bigl[-H(\theta^*, \mathbf{r}^*) + H(\theta, \mathbf{r})\bigr]\bigr),\]

where \((\theta^*, \mathbf{r}^*)\) is the endpoint. If the integrator were exact, \(H\) would be conserved and \(\alpha = 1\). In practice, the discretization error is small and acceptance rates are typically above 80%.

# HMC sampler: sampling a 2D correlated Gaussian
import numpy as np

np.random.seed(42)

# Target: bivariate normal with correlation 0.95
rho = 0.95
Sigma = np.array([[1, rho], [rho, 1]])
Sigma_inv = np.linalg.inv(Sigma)

def log_prob(q):
    return -0.5 * q @ Sigma_inv @ q

def grad_log_prob(q):
    return -Sigma_inv @ q

def hmc_step(q, epsilon, L):
    """One HMC step with leapfrog integration."""
    p = np.random.normal(size=2)  # sample momentum
    current_H = -log_prob(q) + 0.5 * p @ p

    q_new = q.copy()
    p_new = p.copy()

    # Leapfrog
    p_new += 0.5 * epsilon * grad_log_prob(q_new)
    for l in range(L):
        q_new += epsilon * p_new
        if l < L - 1:
            p_new += epsilon * grad_log_prob(q_new)
    p_new += 0.5 * epsilon * grad_log_prob(q_new)

    proposed_H = -log_prob(q_new) + 0.5 * p_new @ p_new
    if np.log(np.random.uniform()) < current_H - proposed_H:
        return q_new, True
    return q, False

# Run HMC and compare with random-walk MH
T = 5_000
hmc_samples = np.zeros((T, 2))
hmc_samples[0] = [0.0, 0.0]
hmc_acc = 0

for t in range(1, T):
    hmc_samples[t], accepted = hmc_step(hmc_samples[t-1], epsilon=0.15, L=20)
    hmc_acc += accepted

# Random-walk MH for comparison
mh_samples = np.zeros((T, 2))
mh_samples[0] = [0.0, 0.0]
mh_acc = 0
for t in range(1, T):
    prop = mh_samples[t-1] + np.random.normal(0, 0.3, 2)
    if np.log(np.random.uniform()) < log_prob(prop) - log_prob(mh_samples[t-1]):
        mh_samples[t] = prop
        mh_acc += 1
    else:
        mh_samples[t] = mh_samples[t-1]

def ess_1d(chain):
    n = len(chain)
    rho1 = np.corrcoef(chain[:-1], chain[1:])[0, 1]
    return max(1, n * (1 - rho1) / (1 + rho1))

burn = 500
print(f"{'':>14} {'HMC':>12} {'Random-walk MH':>16}")
print("-" * 44)
print(f"{'Accept rate':>14} {hmc_acc/T:>12.3f} {mh_acc/T:>16.3f}")
print(f"{'Mean[0]':>14} {hmc_samples[burn:,0].mean():>12.3f} "
      f"{mh_samples[burn:,0].mean():>16.3f}")
print(f"{'Std[0]':>14} {hmc_samples[burn:,0].std():>12.3f} "
      f"{mh_samples[burn:,0].std():>16.3f}")
print(f"{'Corr':>14} "
      f"{np.corrcoef(hmc_samples[burn:].T)[0,1]:>12.3f} "
      f"{np.corrcoef(mh_samples[burn:].T)[0,1]:>16.3f}")
print(f"{'ESS[0]':>14} {ess_1d(hmc_samples[burn:,0]):>12.0f} "
      f"{ess_1d(mh_samples[burn:,0]):>16.0f}")
print(f"\nHMC achieves much higher ESS for correlated targets.")

The No-U-Turn Sampler (NUTS)

NUTS (Hoffman and Gelman, 2014) eliminates the need to hand-tune \(L\) by building a binary tree of leapfrog steps and stopping when the trajectory begins to “double back” (the U-turn criterion). Combined with dual averaging to adapt \(\varepsilon\), NUTS is the default sampler in Stan and PyMC.

18.5 Convergence Diagnostics

Markov chains are guaranteed to converge only in the limit. In practice we need diagnostics to assess whether our finite chains have mixed well. We want concrete numbers, not just “it looks okay.”

R-hat (Gelman–Rubin statistic)

Run \(m \ge 2\) chains from dispersed starting values. Let \(B\) be the between-chain variance of the chain means and \(W\) the average within-chain variance. The potential scale reduction factor is:

\[\hat{R} = \sqrt{\frac{(n-1)/n \cdot W + B/n}{W}},\]

where \(n\) is the chain length.

Reading this formula: the numerator estimates the true variance of the posterior using a mixture of within-chain and between-chain information. If the chains have converged to the same distribution, the between-chain variance \(B\) will be small relative to \(W\), and the ratio will be close to 1. If the chains are still exploring different regions, \(B\) will be large, inflating \(\hat{R}\). Values close to 1 (typically \(\hat{R} < 1.01\)) indicate convergence.

Effective sample size (ESS)

Because MCMC samples are autocorrelated, \(S\) samples carry less information than \(S\) independent samples. The ESS is:

\[\text{ESS} = \frac{S}{1 + 2\sum_{k=1}^\infty \rho_k},\]

where \(\rho_k\) is the autocorrelation at lag \(k\).

In plain English: if consecutive MCMC samples are highly correlated (the chain moves slowly), the denominator grows large, and the ESS shrinks well below the nominal number of samples \(S\). For example, if every sample is nearly identical to the previous one (\(\rho_k \approx 1\) for many lags), you might have 10,000 samples but an ESS of only 50 — meaning your 10,000 correlated draws carry as much information as 50 independent ones. A low ESS means we need more iterations or a better sampler.

Let us compute R-hat and ESS from scratch, running multiple chains.

# Convergence diagnostics from scratch: R-hat and ESS
import numpy as np

np.random.seed(42)

# Target: N(3, 2^2).  Run 4 chains from dispersed starts.
def mh_chain(T, sigma_prop, start, target_mu=3.0, target_sigma=2.0):
    chain = np.zeros(T)
    chain[0] = start
    for t in range(1, T):
        proposal = chain[t-1] + np.random.normal(0, sigma_prop)
        log_alpha = (-0.5*((proposal - target_mu)/target_sigma)**2
                     + 0.5*((chain[t-1] - target_mu)/target_sigma)**2)
        if np.log(np.random.uniform()) < log_alpha:
            chain[t] = proposal
        else:
            chain[t] = chain[t-1]
    return chain

# 4 chains from dispersed starting values
T = 3000
starts = [-10, 0, 10, 20]
chains = [mh_chain(T, sigma_prop=2.5, start=s) for s in starts]

# Discard first half as burn-in
burn = T // 2
chains_post = [c[burn:] for c in chains]
m = len(chains_post)
n = len(chains_post[0])

# -- R-hat --
chain_means = np.array([c.mean() for c in chains_post])
chain_vars = np.array([c.var(ddof=1) for c in chains_post])
W = chain_vars.mean()
B = n * chain_means.var(ddof=1)
var_hat = (n - 1) / n * W + B / n
R_hat = np.sqrt(var_hat / W)

# -- ESS (using initial positive sequence estimator) --
def compute_ess(chain):
    n = len(chain)
    mean = chain.mean()
    var = chain.var()
    if var == 0:
        return n
    acf_sum = 0
    for k in range(1, n):
        rho_k = np.corrcoef(chain[:-k], chain[k:])[0, 1]
        if rho_k < 0.05:
            break
        acf_sum += rho_k
    return n / (1 + 2 * acf_sum)

ess_values = [compute_ess(c) for c in chains_post]

print("Convergence Diagnostics (target: N(3, 4))")
print(f"  Chains: {m}, length: {T}, burn-in: {burn}")
print(f"\n  {'Chain':>6} {'Start':>7} {'Mean':>8} {'Std':>8} {'ESS':>8}")
print("  " + "-" * 42)
for i, (c, s) in enumerate(zip(chains_post, starts)):
    print(f"  {i+1:>6} {s:>7.1f} {c.mean():>8.3f} {c.std():>8.3f} "
          f"{ess_values[i]:>8.0f}")

print(f"\n  R-hat:         {R_hat:.4f}  {'(converged)' if R_hat < 1.01 else '(NOT converged)'}")
print(f"  Total ESS:     {sum(ess_values):.0f}")
print(f"  W (within):    {W:.4f}")
print(f"  B/n (between): {B/n:.4f}")

Trace plots

A simple visual diagnostic: plot each parameter against the iteration number. The chain should look like a “fuzzy caterpillar” — stationary and well-mixed — rather than showing trends or long excursions.

# Contrast good vs bad mixing by printing trajectory summaries
import numpy as np

np.random.seed(42)

# Well-tuned chain (sigma=2.5) vs poorly-tuned chain (sigma=0.05)
T = 5000
good = np.zeros(T)
bad = np.zeros(T)
good[0] = bad[0] = 10.0  # start away from target mean=3

for t in range(1, T):
    # Good chain
    prop = good[t-1] + np.random.normal(0, 2.5)
    if np.log(np.random.uniform()) < -0.5*((prop-3)/2)**2 + 0.5*((good[t-1]-3)/2)**2:
        good[t] = prop
    else:
        good[t] = good[t-1]
    # Bad chain
    prop = bad[t-1] + np.random.normal(0, 0.05)
    if np.log(np.random.uniform()) < -0.5*((prop-3)/2)**2 + 0.5*((bad[t-1]-3)/2)**2:
        bad[t] = prop
    else:
        bad[t] = bad[t-1]

# Print snapshots of trajectory
print(f"{'Iteration':>10} {'Good chain':>12} {'Bad chain':>12}")
print("-" * 36)
for t in [0, 10, 50, 100, 500, 1000, 2000, 4999]:
    print(f"{t:>10} {good[t]:>12.3f} {bad[t]:>12.3f}")

print(f"\n{'Diagnostic':>18} {'Good':>10} {'Bad':>10}")
print("-" * 40)
rho_good = np.corrcoef(good[:-1], good[1:])[0, 1]
rho_bad = np.corrcoef(bad[:-1], bad[1:])[0, 1]
ess_good = T * (1 - rho_good) / (1 + rho_good)
ess_bad = T * (1 - rho_bad) / (1 + rho_bad)
print(f"{'Mean (last 3000)':>18} {good[2000:].mean():>10.3f} {bad[2000:].mean():>10.3f}")
print(f"{'Std (last 3000)':>18} {good[2000:].std():>10.3f} {bad[2000:].std():>10.3f}")
print(f"{'Lag-1 autocorr':>18} {rho_good:>10.3f} {rho_bad:>10.3f}")
print(f"{'ESS':>18} {ess_good:>10.0f} {ess_bad:>10.0f}")
print(f"{'True mean':>18} {'3.000':>10} {'3.000':>10}")

18.6 Variational Inference

Why variational methods? MCMC is asymptotically exact but can be slow — our hierarchical model takes thousands of iterations per second. What if we had a million observations? Variational inference (VI) turns the inference problem into an optimization problem, trading some accuracy for speed.

The ELBO

We want to approximate the posterior \(p(\theta \mid \mathbf{x})\) by a simpler distribution \(q(\theta)\) from some family \(\mathcal{Q}\). The KL divergence from \(q\) to the posterior is:

\[\text{KL}(q \| p) = \int q(\theta)\, \log \frac{q(\theta)}{p(\theta \mid \mathbf{x})}\, d\theta.\]

This is non-negative and zero only when \(q = p\). We want to minimize it, but it involves the intractable \(p(\theta \mid \mathbf{x})\).

Derivation of the ELBO. Write \(p(\theta \mid \mathbf{x}) = p(\mathbf{x}, \theta) / p(\mathbf{x})\):

\[\begin{split}\text{KL}(q \| p) &= \int q(\theta)\, \log \frac{q(\theta)\, p(\mathbf{x})}{p(\mathbf{x}, \theta)}\, d\theta \\ &= \log p(\mathbf{x}) + \int q(\theta)\, \log \frac{q(\theta)}{p(\mathbf{x}, \theta)}\, d\theta.\end{split}\]

Rearranging:

\[\log p(\mathbf{x}) = \text{KL}(q \| p) - \int q(\theta)\, \log \frac{q(\theta)}{p(\mathbf{x}, \theta)}\, d\theta.\]

Because \(\text{KL} \ge 0\), we have:

\[\log p(\mathbf{x}) \;\ge\; \underbrace{\int q(\theta)\, \log \frac{p(\mathbf{x}, \theta)}{q(\theta)}\, d\theta} _{\text{ELBO}(\,q\,)}.\]

The Evidence Lower BOund (ELBO) is:

\[\text{ELBO}(q) = E_q[\log p(\mathbf{x}, \theta)] - E_q[\log q(\theta)] = E_q[\log p(\mathbf{x} \mid \theta)] - \text{KL}(q \| p(\theta)).\]

Reading the second form: the ELBO has two competing terms. The first, \(E_q[\log p(\mathbf{x} \mid \theta)]\), rewards the approximation \(q\) for placing mass on parameter values that explain the data well (good fit). The second, \(\text{KL}(q \| p(\theta))\), penalizes \(q\) for straying too far from the prior (regularization). Maximizing the ELBO finds the best trade-off — the approximate posterior that fits the data as well as possible while remaining close to prior expectations.

Maximizing the ELBO is equivalent to minimizing \(\text{KL}(q \| p)\).

Let us implement variational inference for a Beta posterior, optimizing the ELBO with gradient ascent and printing the ELBO at each iteration.

# Variational inference: fit a Gaussian to a Beta(35,25) posterior
# Watch the ELBO increase iteration by iteration
import numpy as np
from scipy import stats

np.random.seed(42)

# True posterior: Beta(35, 25) from clinical trial
a_true, b_true = 35, 25
true_mean = a_true / (a_true + b_true)
true_std = np.sqrt(a_true * b_true / ((a_true+b_true)**2 * (a_true+b_true+1)))

# Variational family: q(theta) = N(mu_q, sigma_q^2) (on logit scale)
# Parameterize on logit scale for unconstrained optimization
mu_q = 0.0       # logit scale
log_sigma_q = -1.0  # log(sigma_q)

lr = 0.05
S = 5_000  # MC samples for ELBO gradient

print(f"{'Iter':>5} {'ELBO':>10} {'mu_q (logit)':>13} {'sigma_q':>9} "
      f"{'mean(theta)':>12}")
print("-" * 52)

for iteration in range(80):
    sigma_q = np.exp(log_sigma_q)

    # Reparameterization trick: z = mu_q + sigma_q * eps
    eps = np.random.normal(size=S)
    z_logit = mu_q + sigma_q * eps
    z = 1.0 / (1.0 + np.exp(-z_logit))  # inverse logit -> theta in (0,1)
    z = np.clip(z, 1e-10, 1 - 1e-10)

    # log p(x, theta) using Beta kernel
    log_joint = (a_true - 1) * np.log(z) + (b_true - 1) * np.log(1 - z)
    # log q(logit(theta)) with Jacobian
    log_q = stats.norm.logpdf(z_logit, mu_q, sigma_q)

    elbo = np.mean(log_joint - log_q)

    # Gradients (score function estimator)
    score_mu = (z_logit - mu_q) / sigma_q**2
    score_log_sigma = ((z_logit - mu_q)**2 / sigma_q**2 - 1)
    advantage = log_joint - log_q - elbo

    grad_mu = np.mean(advantage * score_mu)
    grad_log_sigma = np.mean(advantage * score_log_sigma)

    mu_q += lr * grad_mu
    log_sigma_q += lr * 0.1 * grad_log_sigma

    if iteration < 10 or iteration % 10 == 0:
        theta_mean = np.mean(z)
        print(f"{iteration:>5} {elbo:>10.3f} {mu_q:>13.4f} "
              f"{sigma_q:>9.4f} {theta_mean:>12.4f}")

print(f"\nTrue posterior: mean={true_mean:.4f}, std={true_std:.4f}")
print(f"VI approx:     mean={np.mean(z):.4f}, std={np.std(z):.4f}")

Mean-field approximation

The simplest variational family assumes the parameters are independent:

\[q(\boldsymbol{\theta}) = \prod_{j=1}^p q_j(\theta_j).\]

Each factor \(q_j\) is optimized in turn, holding the others fixed.

Coordinate ascent VI (CAVI). The optimal update for \(q_j\) is:

\[\log q_j^*(\theta_j) = E_{-j}[\log p(\mathbf{x}, \boldsymbol{\theta})] + \text{const},\]

where \(E_{-j}\) denotes expectation with respect to all \(q_k\) for \(k \neq j\).

In plain English: to update our approximation for one parameter \(\theta_j\), we take the log of the full joint distribution, then average out all the other parameters using their current approximate distributions. The result tells us the optimal shape for \(q_j\). This is analogous to fitting one variable at a time while holding the others fixed, which is why it is called “coordinate ascent.” This is derived by taking the functional derivative of the ELBO with respect to \(q_j\) and setting it to zero.

CAVI iterates over all components until the ELBO converges.

# Coordinate Ascent VI for a mixture of two Gaussians (mean-field)
import numpy as np
from scipy import stats

np.random.seed(42)

# Data from a Gaussian with unknown mean and precision
true_mu, true_tau = 3.0, 0.5  # tau = 1/sigma^2
data = np.random.normal(true_mu, 1/np.sqrt(true_tau), size=50)
n = len(data)
x_bar = data.mean()
x_var = data.var()

# Mean-field: q(mu, tau) = q_mu(mu) * q_tau(tau)
# q_mu = N(m, s^2), q_tau = Gamma(a, b)
# Hyperpriors: mu ~ N(0, 100), tau ~ Gamma(0.01, 0.01)

m, s2 = 0.0, 10.0
a_q, b_q = 1.0, 1.0

print(f"{'Iter':>5} {'E[mu]':>8} {'E[tau]':>8} {'E[sigma]':>9} {'ELBO':>10}")
print("-" * 43)
for it in range(20):
    # Update q(mu): N(m, s^2)
    E_tau = a_q / b_q
    s2 = 1.0 / (1.0/100.0 + n * E_tau)
    m = s2 * (0.0/100.0 + E_tau * n * x_bar)

    # Update q(tau): Gamma(a_q, b_q)
    E_mu = m
    E_mu2 = s2 + m**2
    a_q = 0.01 + n / 2.0
    b_q = 0.01 + 0.5 * (np.sum(data**2) - 2 * E_mu * np.sum(data) + n * E_mu2)

    # Compute ELBO (simplified)
    elbo = (-n/2 * np.log(2*np.pi) + n/2 * (np.euler_gamma + np.log(2*b_q)
            - 1/(a_q)) - 0.5 * n * (a_q/b_q) * x_var)

    print(f"{it:>5} {m:>8.4f} {a_q/b_q:>8.4f} "
          f"{1/np.sqrt(a_q/b_q):>9.4f} {elbo:>10.2f}")

print(f"\nTrue: mu={true_mu}, sigma={1/np.sqrt(true_tau):.3f}")

Stochastic variational inference

For large datasets, computing the ELBO over all data is expensive. Stochastic VI (Hoffman et al., 2013) uses mini-batches: subsample the data, compute a noisy gradient of the ELBO, and update the variational parameters via stochastic gradient ascent. This scales VI to millions of observations.

18.7 Integrated Nested Laplace Approximations (INLA)

Why INLA? For a large class of models — latent Gaussian models (LGMs) — INLA provides fast, deterministic, and accurate posterior approximations without MCMC.

A latent Gaussian model has three stages:

  1. Observations: \(y_i \mid \eta_i, \boldsymbol{\psi}\) from an exponential family.

  2. Latent field: \(\boldsymbol{\eta} \mid \boldsymbol{\psi} \sim N(\boldsymbol{\mu}_\psi, \boldsymbol{Q}_\psi^{-1})\) (Gaussian with a sparse precision matrix \(\boldsymbol{Q}\)).

  3. Hyperparameters: \(\boldsymbol{\psi}\) with prior \(p(\boldsymbol{\psi})\).

The key insight: with a Gaussian latent field, the Laplace approximation (Section 17.5 in Chapter 17) is very accurate for the conditional posterior \(p(\boldsymbol{\eta} \mid \boldsymbol{\psi}, \mathbf{y})\). INLA then numerically integrates over the (typically low-dimensional) \(\boldsymbol{\psi}\).

When to use INLA. INLA is the method of choice when the model fits the LGM framework: generalized linear models, spatial models, time series, survival models, etc. It is implemented in the R package R-INLA.

18.8 Comparison of Methods

Method

Accuracy

Speed

Scalability

When to use

Laplace approx.

Moderate

Very fast

High

Unimodal, symmetric posteriors; quick screening

MCMC (MH/Gibbs)

Exact (asymptotic)

Slow

Low–moderate

Gold standard; moderate dimensions

HMC / NUTS

Exact (asymptotic)

Moderate

Moderate–high

Continuous parameters; up to thousands of dimensions

Variational inference

Approximate

Fast

Very high

Large datasets; exploratory analysis; deep generative models

INLA

Very good

Fast

High

Latent Gaussian models

Let us run a horse race: apply three methods to the same problem and compare.

# Head-to-head comparison: Laplace vs MH vs Gibbs on Normal-Normal
import numpy as np
from scipy import stats, optimize
import time

np.random.seed(42)

# Data: N(mu, 1), mu ~ N(0, 10^2), observe 50 points
true_mu = 3.0
data = np.random.normal(true_mu, 1.0, size=50)
n = len(data)
x_bar = data.mean()

# --- Exact posterior (conjugate) ---
prec_post = 1/100 + n
mu_post_exact = (0/100 + n*x_bar) / prec_post
sd_post_exact = np.sqrt(1/prec_post)

# --- Method 1: Laplace approximation ---
t0 = time.time()
neg_log_post = lambda mu: 0.5*n*(mu - x_bar)**2 + 0.5*mu**2/100
res = optimize.minimize_scalar(neg_log_post)
mu_lap = res.x
h2 = n + 1/100
sd_lap = np.sqrt(1/h2)
t_lap = time.time() - t0

# --- Method 2: Metropolis-Hastings ---
t0 = time.time()
T_mh = 10_000
mh = np.zeros(T_mh)
mh[0] = 0.0
acc_mh = 0
for t in range(1, T_mh):
    prop = mh[t-1] + np.random.normal(0, 0.3)
    la = (-0.5*n*(prop-x_bar)**2 - 0.5*prop**2/100
          + 0.5*n*(mh[t-1]-x_bar)**2 + 0.5*mh[t-1]**2/100)
    if np.log(np.random.uniform()) < la:
        mh[t] = prop
        acc_mh += 1
    else:
        mh[t] = mh[t-1]
t_mh = time.time() - t0
mh_post = mh[2000:]

# --- Method 3: Direct sampling (exact conjugate) ---
t0 = time.time()
exact_samples = np.random.normal(mu_post_exact, sd_post_exact, 10_000)
t_exact = time.time() - t0

print(f"{'Method':>18} {'Mean':>8} {'Std':>8} {'Time (ms)':>10}")
print("-" * 48)
print(f"{'Exact':>18} {mu_post_exact:>8.4f} {sd_post_exact:>8.4f} {t_exact*1000:>10.2f}")
print(f"{'Laplace':>18} {mu_lap:>8.4f} {sd_lap:>8.4f} {t_lap*1000:>10.2f}")
print(f"{'MH (10k iter)':>18} {mh_post.mean():>8.4f} {mh_post.std():>8.4f} "
      f"{t_mh*1000:>10.2f}")

Rules of thumb.

  • If the model is a latent Gaussian model, try INLA first.

  • For moderate-dimensional continuous models, use HMC/NUTS (Stan, PyMC).

  • For massive datasets or deep generative models, use variational inference.

  • Use Gibbs sampling when full conditionals are available and the dimension is not too high.

  • Use random-walk Metropolis as a baseline or when gradients are unavailable.

18.9 Summary

  • Monte Carlo integration approximates integrals by sampling; importance sampling reweights samples from a proposal.

  • Metropolis–Hastings constructs a Markov chain satisfying detailed balance; the acceptance probability is derived from the stationarity condition.

  • Gibbs sampling is MH with full conditional proposals and acceptance rate 1.

  • HMC uses Hamiltonian dynamics and the leapfrog integrator to make large, gradient-informed moves; NUTS automates the trajectory length.

  • Convergence is assessed with \(\hat{R}\), ESS, and trace plots.

  • Variational inference turns inference into optimization via the ELBO.

  • INLA combines Laplace approximations with numerical integration for latent Gaussian models.

# Summary: which method for which problem?
methods = [
    ("Conjugate (Beta-Binomial)",  "Exact",       "< 1 ms",   "Low-dim conjugate"),
    ("Laplace approximation",      "Good",        "< 10 ms",  "Unimodal posteriors"),
    ("Metropolis-Hastings",        "Exact (lim)", "seconds",   "General, low-dim"),
    ("Gibbs sampling",             "Exact (lim)", "seconds",   "Conjugate conditionals"),
    ("HMC / NUTS",                 "Exact (lim)", "seconds",   "Continuous, high-dim"),
    ("Variational inference",      "Approximate", "< 1 sec",  "Large data, deep models"),
]
print(f"{'Method':<30} {'Accuracy':<14} {'Speed':<12} {'Best for'}")
print("-" * 80)
for name, acc, speed, use in methods:
    print(f"{name:<30} {acc:<14} {speed:<12} {use}")