.. _ch18_computational: ========================================== Chapter 18 -- Computational Methods ========================================== **The motivating problem.** In Chapter 17 we enjoyed the luxury of conjugate priors: the posterior had a closed form. But real-world models are rarely so cooperative. Consider a hierarchical model for clinical trials across multiple hospitals. Hospital :math:`j` has response rate :math:`\theta_j`, and we believe these rates come from a common population: :math:`\theta_j \sim \text{Beta}(\mu\kappa,\, (1-\mu)\kappa)`, where :math:`\mu` and :math:`\kappa` are unknown hyperparameters. The posterior is: .. math:: p(\mu, \kappa, \boldsymbol{\theta} \mid \mathbf{y}) \;\propto\; \prod_{j=1}^J \binom{n_j}{y_j} \theta_j^{y_j}(1-\theta_j)^{n_j - y_j} \;\cdot\; \prod_{j=1}^J \text{Beta}(\theta_j \mid \mu\kappa, (1-\mu)\kappa) \;\cdot\; p(\mu)\, p(\kappa). We can evaluate this pointwise, but we cannot integrate it analytically. What now? This chapter builds the computational highway system that gets you to *any* posterior, no matter how complicated. .. code-block:: python # The hierarchical model we will tackle throughout this chapter import numpy as np from scipy.special import gammaln np.random.seed(42) # Data: 8 hospitals, each with n_j patients and y_j responders n_hospitals = 8 n_j = np.array([20, 30, 25, 40, 15, 35, 28, 22]) y_j = np.array([11, 16, 10, 25, 7, 19, 14, 9]) print("Hospital data:") print(f" {'Hospital':>10} {'n_j':>5} {'y_j':>5} {'MLE':>8}") print(" " + "-" * 30) for j in range(n_hospitals): print(f" {j+1:>10} {n_j[j]:>5} {y_j[j]:>5} {y_j[j]/n_j[j]:>8.3f}") print(f"\n Hospital MLEs range from {min(y_j/n_j):.3f} to {max(y_j/n_j):.3f}") print(" Can we borrow strength across hospitals? -> Chapter 18 shows how.") def log_posterior(theta_vec, mu, kappa): """Unnormalized log-posterior for the hierarchical model.""" a = mu * kappa b = (1 - mu) * kappa # Likelihood ll = np.sum(y_j * np.log(theta_vec) + (n_j - y_j) * np.log(1 - theta_vec)) # Prior on theta_j: Beta(a, b) lp_theta = np.sum((a - 1)*np.log(theta_vec) + (b - 1)*np.log(1 - theta_vec) - gammaln(a) - gammaln(b) + gammaln(a + b)) # Hyperpriors: mu ~ Beta(2,2), kappa ~ Gamma(2, 0.1) lp_mu = (2-1)*np.log(mu) + (2-1)*np.log(1-mu) if 0 < mu < 1 else -1e10 lp_kappa = (2-1)*np.log(kappa) - 0.1*kappa if kappa > 0 else -1e10 return ll + lp_theta + lp_mu + lp_kappa 18.1 Monte Carlo Integration ------------------------------- **Why we need this.** Many quantities of interest --- posterior means, marginal likelihoods, predictive distributions --- are integrals of the form .. math:: I = \int g(\theta)\, p(\theta \mid \mathbf{x})\, d\theta. When the integral has no closed form we resort to *Monte Carlo* methods: generate samples :math:`\theta^{(1)}, \ldots, \theta^{(S)}` from :math:`p(\theta \mid \mathbf{x})` and approximate .. math:: \hat{I} = \frac{1}{S}\sum_{s=1}^{S} g(\theta^{(s)}). By the law of large numbers :math:`\hat{I} \to I` as :math:`S \to \infty`, and the central limit theorem gives the Monte Carlo standard error: .. math:: \text{SE}(\hat{I}) = \frac{\text{sd}(g(\theta^{(s)}))}{\sqrt{S}}. The error decreases as :math:`1/\sqrt{S}` regardless of the dimension of :math:`\theta` --- a crucial advantage over deterministic quadrature in high dimensions. .. code-block:: python # Monte Carlo integration: watch the estimate converge import numpy as np np.random.seed(42) # Goal: estimate E[theta] where theta ~ Beta(35, 25) # (our clinical trial posterior from Chapter 17) from scipy import stats true_mean = 35 / (35 + 25) samples = stats.beta.rvs(35, 25, size=10_000) print(f"{'S':>7} {'MC estimate':>12} {'SE':>10} {'|Error|':>10}") print("-" * 42) for S in [10, 50, 100, 500, 1000, 5000, 10000]: mc_est = samples[:S].mean() mc_se = samples[:S].std() / np.sqrt(S) print(f"{S:>7} {mc_est:>12.6f} {mc_se:>10.6f} " f"{abs(mc_est - true_mean):>10.6f}") print(f"{'True':>7} {true_mean:>12.6f}") print("\nSE decreases as 1/sqrt(S) regardless of dimension.") .. admonition:: Intuition Monte Carlo integration is essentially "estimation by random polling." Just as a political poll of 1000 random voters can approximate the opinion of millions, 10,000 random draws from a posterior can approximate intractable integrals to several decimal places. Importance sampling ^^^^^^^^^^^^^^^^^^^^ Often we cannot sample directly from the target :math:`p(\theta \mid \mathbf{x})` but we can sample from a *proposal* distribution :math:`q(\theta)`. Write .. math:: I = \int g(\theta)\, \frac{p(\theta \mid \mathbf{x})}{q(\theta)}\, q(\theta)\, d\theta = E_q\!\left[g(\theta)\, w(\theta)\right], where the *importance weight* is .. math:: w(\theta) = \frac{p(\theta \mid \mathbf{x})}{q(\theta)}. In plain English: the weight measures how much more (or less) probable :math:`\theta` is under the target distribution compared to the proposal. Samples from regions where the proposal underrepresents the target get upweighted; samples from overrepresented regions get downweighted. Drawing :math:`\theta^{(s)} \sim q` and computing .. math:: \hat{I} = \frac{\sum_{s=1}^S w(\theta^{(s)})\, g(\theta^{(s)})} {\sum_{s=1}^S w(\theta^{(s)})} gives a consistent (self-normalized) importance sampling estimator. The denominator normalizes the weights so they sum to one, which is why this is called "self-normalized" --- we do not need to know the normalizing constants of :math:`p` or :math:`q`. The quality depends on how well :math:`q` covers the tails of :math:`p`; if :math:`q` has lighter tails the variance can be infinite. .. code-block:: python # Importance sampling: good vs bad proposal import numpy as np from scipy import stats np.random.seed(42) # Target: Beta(35, 25), want E[theta] true_mean = 35 / 60 S = 10_000 # Good proposal: N(0.58, 0.07) -- covers the target well # Bad proposal: N(0.3, 0.05) -- centered far from the target for label, mu_q, sigma_q in [("Good proposal N(0.58, 0.07)", 0.58, 0.07), ("Bad proposal N(0.30, 0.05)", 0.30, 0.05)]: z = np.random.normal(mu_q, sigma_q, S) z = z[(z > 0.01) & (z < 0.99)] # keep in valid range log_w = (stats.beta.logpdf(z, 35, 25) - stats.norm.logpdf(z, mu_q, sigma_q)) log_w -= log_w.max() # for numerical stability w = np.exp(log_w) estimate = np.sum(w * z) / np.sum(w) ess = np.sum(w)**2 / np.sum(w**2) print(f"{label}:") print(f" Estimate: {estimate:.6f} (true: {true_mean:.6f})") print(f" ESS: {ess:.0f} / {len(z)} samples") print(f" ESS ratio: {ess/len(z):.1%}\n") .. admonition:: Common Pitfall If your proposal distribution :math:`q` has thinner tails than the target :math:`p`, a few extreme importance weights will dominate the estimate, making it highly unstable. Always check the effective sample size of your importance weights: :math:`\text{ESS} = (\sum w_s)^2 / \sum w_s^2`. 18.2 The Metropolis--Hastings Algorithm ----------------------------------------- **Why Markov chains?** Importance sampling works well in low dimensions, but in high dimensions it is very hard to find a good proposal :math:`q`. Instead, we construct a *Markov chain* whose stationary distribution is the target posterior. After a burn-in period the chain produces (correlated) samples from the posterior. We will build a Metropolis--Hastings sampler for our hierarchical model. But first, let us derive the algorithm from first principles. Detailed balance ^^^^^^^^^^^^^^^^^^ A Markov chain with transition kernel :math:`T(\theta' \mid \theta)` has stationary distribution :math:`\pi(\theta) = p(\theta \mid \mathbf{x})` if *detailed balance* holds: .. math:: \pi(\theta)\, T(\theta' \mid \theta) = \pi(\theta')\, T(\theta \mid \theta'). This says the probability of being at :math:`\theta` and moving to :math:`\theta'` equals the probability of the reverse. Derivation of the acceptance probability ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We choose a proposal distribution :math:`q(\theta' \mid \theta)` and set the transition as: propose :math:`\theta'` from :math:`q`, then accept with probability :math:`\alpha(\theta, \theta')`. Thus .. math:: T(\theta' \mid \theta) = q(\theta' \mid \theta)\, \alpha(\theta, \theta'). Substituting into the detailed balance condition: .. math:: \pi(\theta)\, q(\theta' \mid \theta)\, \alpha(\theta, \theta') = \pi(\theta')\, q(\theta \mid \theta')\, \alpha(\theta', \theta). Rearranging: .. math:: \frac{\alpha(\theta, \theta')}{\alpha(\theta', \theta)} = \frac{\pi(\theta')\, q(\theta \mid \theta')} {\pi(\theta)\, q(\theta' \mid \theta)}. We need acceptance probabilities :math:`\alpha` that satisfy this ratio and lie in :math:`[0, 1]`. The Metropolis--Hastings choice picks the largest such probabilities: .. math:: \alpha(\theta, \theta') = \min\!\left(1,\; \frac{\pi(\theta')\, q(\theta \mid \theta')} {\pi(\theta)\, q(\theta' \mid \theta)}\right). Reading this formula: the numerator is the posterior density at the proposed point times the probability of proposing a move *back* to where we are; the denominator is the same thing in the opposite direction. If the proposed point is "better" (higher posterior density, easy to get back from), the ratio exceeds 1 and we always accept. If it is "worse," we accept with a probability equal to the ratio, which gives worse points a fighting chance --- this is essential for exploring the full posterior. One can verify this satisfies the ratio above. Because :math:`\pi` is the posterior, only the ratio :math:`\pi(\theta')/\pi(\theta)` is needed --- the normalizing constant cancels. .. admonition:: Why does this matter? The cancellation of the normalizing constant is what makes MCMC so powerful. You never need to compute the marginal likelihood :math:`p(\mathbf{x})`. You only need to evaluate the *unnormalized* posterior --- which is just the likelihood times the prior. This is the engine that makes Bayesian inference practical for complex models. Algorithm ^^^^^^^^^^ 1. Initialize :math:`\theta^{(0)}`. 2. For :math:`t = 1, 2, \ldots`: a. Draw :math:`\theta^* \sim q(\cdot \mid \theta^{(t-1)})`. b. Compute .. math:: \alpha = \min\!\left(1,\; \frac{p(\theta^* \mid \mathbf{x})\, q(\theta^{(t-1)} \mid \theta^*)} {p(\theta^{(t-1)} \mid \mathbf{x})\, q(\theta^* \mid \theta^{(t-1)})}\right). c. With probability :math:`\alpha` set :math:`\theta^{(t)} = \theta^*`; otherwise set :math:`\theta^{(t)} = \theta^{(t-1)}`. **Special case: random-walk Metropolis.** If :math:`q(\theta' \mid \theta) = q(\theta \mid \theta')` (symmetric proposal, e.g., :math:`\theta' = \theta + \epsilon` with :math:`\epsilon \sim N(0, \sigma^2)`) the proposal ratio cancels and .. math:: \alpha = \min\!\left(1,\; \frac{p(\theta^* \mid \mathbf{x})} {p(\theta^{(t-1)} \mid \mathbf{x})}\right). Now let us apply this to Bayesian logistic regression, printing the first 20 iterations so you can see the sampler in action. .. code-block:: python # Metropolis-Hastings for logistic regression: iteration by iteration import numpy as np from scipy.special import expit np.random.seed(42) # Simulate: 50 patients, dosage x predicts recovery y n = 50 x = np.random.normal(0, 1, n) true_beta = np.array([0.5, 1.5]) prob = expit(true_beta[0] + true_beta[1] * x) y = np.random.binomial(1, prob) def log_posterior(beta): eta = beta[0] + beta[1] * x log_lik = np.sum(y * eta - np.log(1 + np.exp(eta))) log_prior = -0.5 * np.sum(beta**2) / 100 # N(0, 10^2) return log_lik + log_prior # MH sampler T = 20_000 samples = np.zeros((T, 2)) samples[0] = [0.0, 0.0] sigma_prop = 0.3 accepted = 0 # Print header for first 20 iterations print(f"{'t':>4} {'beta0*':>8} {'beta1*':>8} {'log_alpha':>10} " f"{'Accept?':>8} {'beta0':>8} {'beta1':>8}") print("-" * 60) for t in range(1, T): proposal = samples[t-1] + np.random.normal(0, sigma_prop, 2) log_alpha = log_posterior(proposal) - log_posterior(samples[t-1]) u = np.random.uniform() accept = np.log(u) < log_alpha if accept: samples[t] = proposal accepted += 1 else: samples[t] = samples[t-1] if t <= 20: print(f"{t:>4} {proposal[0]:>8.3f} {proposal[1]:>8.3f} " f"{log_alpha:>10.3f} {'YES' if accept else 'no':>8} " f"{samples[t, 0]:>8.3f} {samples[t, 1]:>8.3f}") burn_in = 5000 post_samples = samples[burn_in:] print(f"\n--- After {T} iterations (burn-in={burn_in}) ---") print(f"Acceptance rate: {accepted / T:.3f}") print(f"Posterior mean beta: [{post_samples[:,0].mean():.3f}, " f"{post_samples[:,1].mean():.3f}]") print(f"Posterior std beta: [{post_samples[:,0].std():.3f}, " f"{post_samples[:,1].std():.3f}]") print(f"True beta: [{true_beta[0]:.3f}, {true_beta[1]:.3f}]") Proposal tuning ^^^^^^^^^^^^^^^^ The step size :math:`\sigma` controls the trade-off between exploring widely (large steps, many rejections) and staying local (small steps, high acceptance but slow exploration). The optimal acceptance rate for a :math:`d`-dimensional Gaussian target is approximately 23.4\% (Roberts, Gelman, and Gilks, 1997). .. code-block:: python # Effect of proposal step size on acceptance rate and mixing import numpy as np np.random.seed(42) # Target: standard bivariate normal def log_target(theta): return -0.5 * np.sum(theta**2) def run_mh(sigma_prop, T=10_000, start=np.array([5.0, 5.0])): samples = np.zeros((T, 2)) samples[0] = start acc = 0 for t in range(1, T): proposal = samples[t-1] + np.random.normal(0, sigma_prop, 2) log_alpha = log_target(proposal) - log_target(samples[t-1]) if np.log(np.random.uniform()) < log_alpha: samples[t] = proposal acc += 1 else: samples[t] = samples[t-1] return samples, acc / T print(f"{'sigma_prop':>12} {'Accept rate':>13} {'Post mean':>16} " f"{'Post std':>12}") print("-" * 56) for sigma in [0.05, 0.3, 1.0, 2.4, 5.0, 20.0]: samp, rate = run_mh(sigma) post = samp[2000:] print(f"{sigma:>12.2f} {rate:>13.3f} " f"[{post[:,0].mean():>5.2f}, {post[:,1].mean():>5.2f}] " f"[{post[:,0].std():>4.2f}, {post[:,1].std():>4.2f}]") print("\nOptimal acceptance rate for d=2: ~23.4%") print("sigma_prop ~ 2.4 hits the sweet spot.") 18.3 Gibbs Sampling --------------------- **Why Gibbs?** In many models the *full conditional* distribution of each parameter block --- the distribution of one parameter given all others and the data --- has a known form (especially with conjugate priors). Instead of proposing moves in all dimensions at once, we cycle through one dimension at a time, always drawing from the exact conditional distribution. Every draw is accepted. Full conditional distributions ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For a parameter vector :math:`\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)`, the full conditional of :math:`\theta_j` is .. math:: p(\theta_j \mid \theta_{-j}, \mathbf{x}) \;\propto\; p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}), treating all :math:`\theta_k` for :math:`k \neq j` as fixed. In conjugate models these are standard distributions. Algorithm ^^^^^^^^^^ 1. Initialize :math:`\boldsymbol{\theta}^{(0)}`. 2. For :math:`t = 1, 2, \ldots`: a. Draw :math:`\theta_1^{(t)} \sim p(\theta_1 \mid \theta_2^{(t-1)}, \ldots, \theta_p^{(t-1)}, \mathbf{x})`. b. Draw :math:`\theta_2^{(t)} \sim p(\theta_2 \mid \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, \theta_p^{(t-1)}, \mathbf{x})`. c. Continue cycling through all components. Gibbs as a special case of Metropolis--Hastings ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Derivation.** In MH, set the proposal for updating :math:`\theta_j` to be the full conditional itself: :math:`q(\theta_j' \mid \theta_j, \theta_{-j}) = p(\theta_j' \mid \theta_{-j}, \mathbf{x})`. Then the acceptance ratio is: .. math:: \alpha &= \frac{p(\theta_j', \theta_{-j} \mid \mathbf{x})\; p(\theta_j \mid \theta_{-j}, \mathbf{x})} {p(\theta_j, \theta_{-j} \mid \mathbf{x})\; p(\theta_j' \mid \theta_{-j}, \mathbf{x})} \\ &= \frac{p(\theta_j' \mid \theta_{-j}, \mathbf{x})\, p(\theta_{-j} \mid \mathbf{x})\, p(\theta_j \mid \theta_{-j}, \mathbf{x})} {p(\theta_j \mid \theta_{-j}, \mathbf{x})\, p(\theta_{-j} \mid \mathbf{x})\, p(\theta_j' \mid \theta_{-j}, \mathbf{x})} \\ &= 1. Every proposal is accepted! Gibbs sampling is MH with an optimal component-wise proposal. **Example: Normal model with unknown mean and variance.** With :math:`x_i \sim N(\mu, \sigma^2)`, :math:`\mu \sim N(\mu_0, \sigma_0^2)`, and :math:`\sigma^2 \sim \text{Inv-Gamma}(a, b)`, the full conditionals are: * :math:`\mu \mid \sigma^2, \mathbf{x} \sim N(\mu_n, \sigma_n^2)` (Normal--Normal conjugacy from :ref:`Chapter 17 `). * :math:`\sigma^2 \mid \mu, \mathbf{x} \sim \text{Inv-Gamma}(a + n/2,\; b + \frac{1}{2}\sum(x_i - \mu)^2)`. Let us implement this, printing every iteration for the first 15 to see the alternating updates converge. .. code-block:: python # Gibbs sampler for Normal model: mu and sigma^2 with alternating updates import numpy as np from scipy import stats np.random.seed(42) # Generate data from N(mu=5, sigma^2=4) true_mu, true_sigma2 = 5.0, 4.0 data = np.random.normal(true_mu, np.sqrt(true_sigma2), size=30) n = len(data) x_bar = data.mean() S_data = np.sum((data - x_bar)**2) # Hyperpriors mu_0, sigma_0_sq = 0.0, 100.0 # mu ~ N(0, 100) a_prior, b_prior = 0.01, 0.01 # sigma^2 ~ InvGamma(0.01, 0.01) # Gibbs sampler T = 5_000 mu_samples = np.zeros(T) sigma2_samples = np.zeros(T) mu_samples[0] = 0.0 # start far from truth sigma2_samples[0] = 20.0 # start far from truth print(f"{'t':>4} {'mu':>10} {'sigma2':>10} Full conditional used") print("-" * 55) print(f"{'0':>4} {mu_samples[0]:>10.4f} {sigma2_samples[0]:>10.4f} (initial)") for t in range(1, T): # -- Update mu | sigma^2, data -- prec_prior = 1.0 / sigma_0_sq prec_data = n / sigma2_samples[t-1] prec_post = prec_prior + prec_data mu_post = (prec_prior * mu_0 + prec_data * x_bar) / prec_post sigma_post = np.sqrt(1.0 / prec_post) mu_samples[t] = np.random.normal(mu_post, sigma_post) # -- Update sigma^2 | mu, data -- a_post = a_prior + n / 2.0 b_post = b_prior + 0.5 * np.sum((data - mu_samples[t])**2) sigma2_samples[t] = 1.0 / np.random.gamma(a_post, 1.0 / b_post) if t <= 15 or t == T-1: label = f" N({mu_post:.2f}, {sigma_post:.2f}^2) -> " \ f"IG({a_post:.1f}, {b_post:.1f})" print(f"{t:>4} {mu_samples[t]:>10.4f} {sigma2_samples[t]:>10.4f} {label}") burn_in = 500 print(f"\n--- After {T} iterations (burn-in={burn_in}) ---") print(f"Posterior E[mu]: {mu_samples[burn_in:].mean():.4f} (true={true_mu})") print(f"Posterior E[sigma^2]: {sigma2_samples[burn_in:].mean():.4f} (true={true_sigma2})") print(f"Posterior sd(mu): {mu_samples[burn_in:].std():.4f}") print(f"Data mean: {x_bar:.4f}") Now let us tackle the full hierarchical model from the chapter opening using a Gibbs sampler. Each :math:`\theta_j` has a conjugate full conditional (Beta), while :math:`\mu` and :math:`\kappa` require Metropolis steps within the Gibbs. .. code-block:: python # Gibbs sampler for the hierarchical Beta-Binomial model import numpy as np from scipy.special import gammaln np.random.seed(42) # Data from chapter opening n_hospitals = 8 n_j = np.array([20, 30, 25, 40, 15, 35, 28, 22]) y_j = np.array([11, 16, 10, 25, 7, 19, 14, 9]) T = 10_000 theta_samples = np.zeros((T, n_hospitals)) mu_samples = np.zeros(T) kappa_samples = np.zeros(T) # Initialize theta_samples[0] = y_j / n_j mu_samples[0] = 0.5 kappa_samples[0] = 10.0 def log_beta_binomial_hyperprior(mu, kappa, thetas): """Log-density of the hyperprior part: Beta priors on theta_j + hyperpriors.""" if not (0.01 < mu < 0.99 and kappa > 0.5): return -1e10 a = mu * kappa b = (1 - mu) * kappa lp = np.sum((a-1)*np.log(thetas) + (b-1)*np.log(1-thetas) - gammaln(a) - gammaln(b) + gammaln(a+b)) lp += (2-1)*np.log(mu) + (2-1)*np.log(1-mu) # mu ~ Beta(2,2) lp += (2-1)*np.log(kappa) - 0.1*kappa # kappa ~ Gamma(2,0.1) return lp accepted_mu, accepted_kappa = 0, 0 for t in range(1, T): mu_cur = mu_samples[t-1] kappa_cur = kappa_samples[t-1] # -- Gibbs step: update each theta_j from its Beta full conditional -- a = mu_cur * kappa_cur + y_j b = (1 - mu_cur) * kappa_cur + n_j - y_j theta_samples[t] = np.random.beta(a, b) # -- MH step: update mu -- mu_prop = mu_cur + np.random.normal(0, 0.05) if 0.01 < mu_prop < 0.99: log_r = (log_beta_binomial_hyperprior(mu_prop, kappa_cur, theta_samples[t]) - log_beta_binomial_hyperprior(mu_cur, kappa_cur, theta_samples[t])) if np.log(np.random.uniform()) < log_r: mu_cur = mu_prop accepted_mu += 1 mu_samples[t] = mu_cur # -- MH step: update kappa -- kappa_prop = kappa_cur * np.exp(np.random.normal(0, 0.2)) # log-scale proposal log_r = (log_beta_binomial_hyperprior(mu_cur, kappa_prop, theta_samples[t]) - log_beta_binomial_hyperprior(mu_cur, kappa_cur, theta_samples[t]) + np.log(kappa_prop) - np.log(kappa_cur)) # Jacobian if np.log(np.random.uniform()) < log_r: kappa_cur = kappa_prop accepted_kappa += 1 kappa_samples[t] = kappa_cur burn_in = 3000 print("Hierarchical Beta-Binomial: Gibbs + MH results") print(f" mu acceptance rate: {accepted_mu/T:.3f}") print(f" kappa acceptance rate: {accepted_kappa/T:.3f}") print(f"\n E[mu]: {mu_samples[burn_in:].mean():.4f}") print(f" E[kappa]: {kappa_samples[burn_in:].mean():.2f}") print(f"\n {'Hospital':>10} {'MLE':>8} {'Post mean':>10} {'Shrinkage':>10}") print(" " + "-" * 40) for j in range(n_hospitals): mle = y_j[j] / n_j[j] pm = theta_samples[burn_in:, j].mean() shrink = (mle - pm) / (mle - mu_samples[burn_in:].mean()) print(f" {j+1:>10} {mle:>8.3f} {pm:>10.3f} {shrink:>10.1%}") print("\n Note: smaller hospitals are shrunk more toward the overall mean.") 18.4 Hamiltonian Monte Carlo ------------------------------- **Why HMC?** Random-walk Metropolis explores the parameter space by diffusion, which becomes extremely slow in high dimensions. HMC uses the *gradient* of the log-posterior to make large, directed moves while maintaining a high acceptance rate. Think of it as giving the sampler a "GPS" rather than letting it wander blindly. Hamiltonian mechanics analogy ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Imagine a frictionless puck sliding on a surface whose height is :math:`U(\theta) = -\log p(\theta \mid \mathbf{x})`. The puck has position :math:`\theta` and momentum :math:`\mathbf{r}`. The total energy (Hamiltonian) is: .. math:: H(\theta, \mathbf{r}) = U(\theta) + K(\mathbf{r}), where :math:`K(\mathbf{r}) = \frac{1}{2}\mathbf{r}^T M^{-1} \mathbf{r}` is the kinetic energy and :math:`M` is a "mass matrix" (usually set to the identity or an estimate of the posterior covariance). Hamilton's equations give the dynamics: .. math:: \frac{d\theta}{dt} &= \frac{\partial H}{\partial \mathbf{r}} = M^{-1}\mathbf{r}, \\ \frac{d\mathbf{r}}{dt} &= -\frac{\partial H}{\partial \theta} = -\nabla U(\theta) = \nabla \log p(\theta \mid \mathbf{x}). In plain English: the first equation says the parameter moves in the direction of the momentum (the puck slides). The second equation says the momentum changes according to the gradient of the log-posterior --- the puck accelerates downhill toward regions of high posterior density. Together, these create smooth, curving trajectories that efficiently explore the posterior landscape. These equations are *energy-preserving* and *volume-preserving* (symplectic), which is exactly what we need for a valid MCMC proposal. The leapfrog integrator ^^^^^^^^^^^^^^^^^^^^^^^^^ We cannot solve Hamilton's equations analytically, so we use the *leapfrog* integrator with step size :math:`\varepsilon` and :math:`L` steps: 1. :math:`\mathbf{r} \leftarrow \mathbf{r} + \frac{\varepsilon}{2}\, \nabla \log p(\theta \mid \mathbf{x})` (half-step for momentum). 2. For :math:`l = 1, \ldots, L`: a. :math:`\theta \leftarrow \theta + \varepsilon\, M^{-1}\mathbf{r}` (full step for position). b. If :math:`l < L`: :math:`\mathbf{r} \leftarrow \mathbf{r} + \varepsilon\, \nabla \log p(\theta \mid \mathbf{x})` (full step for momentum). 3. :math:`\mathbf{r} \leftarrow \mathbf{r} + \frac{\varepsilon}{2}\, \nabla \log p(\theta \mid \mathbf{x})` (final half-step for momentum). The leapfrog integrator is symplectic (volume-preserving) and time-reversible, so the MH acceptance probability is: .. math:: \alpha = \min\!\bigl(1,\; \exp\!\bigl[-H(\theta^*, \mathbf{r}^*) + H(\theta, \mathbf{r})\bigr]\bigr), where :math:`(\theta^*, \mathbf{r}^*)` is the endpoint. If the integrator were exact, :math:`H` would be conserved and :math:`\alpha = 1`. In practice, the discretization error is small and acceptance rates are typically above 80\%. .. code-block:: python # HMC sampler: sampling a 2D correlated Gaussian import numpy as np np.random.seed(42) # Target: bivariate normal with correlation 0.95 rho = 0.95 Sigma = np.array([[1, rho], [rho, 1]]) Sigma_inv = np.linalg.inv(Sigma) def log_prob(q): return -0.5 * q @ Sigma_inv @ q def grad_log_prob(q): return -Sigma_inv @ q def hmc_step(q, epsilon, L): """One HMC step with leapfrog integration.""" p = np.random.normal(size=2) # sample momentum current_H = -log_prob(q) + 0.5 * p @ p q_new = q.copy() p_new = p.copy() # Leapfrog p_new += 0.5 * epsilon * grad_log_prob(q_new) for l in range(L): q_new += epsilon * p_new if l < L - 1: p_new += epsilon * grad_log_prob(q_new) p_new += 0.5 * epsilon * grad_log_prob(q_new) proposed_H = -log_prob(q_new) + 0.5 * p_new @ p_new if np.log(np.random.uniform()) < current_H - proposed_H: return q_new, True return q, False # Run HMC and compare with random-walk MH T = 5_000 hmc_samples = np.zeros((T, 2)) hmc_samples[0] = [0.0, 0.0] hmc_acc = 0 for t in range(1, T): hmc_samples[t], accepted = hmc_step(hmc_samples[t-1], epsilon=0.15, L=20) hmc_acc += accepted # Random-walk MH for comparison mh_samples = np.zeros((T, 2)) mh_samples[0] = [0.0, 0.0] mh_acc = 0 for t in range(1, T): prop = mh_samples[t-1] + np.random.normal(0, 0.3, 2) if np.log(np.random.uniform()) < log_prob(prop) - log_prob(mh_samples[t-1]): mh_samples[t] = prop mh_acc += 1 else: mh_samples[t] = mh_samples[t-1] def ess_1d(chain): n = len(chain) rho1 = np.corrcoef(chain[:-1], chain[1:])[0, 1] return max(1, n * (1 - rho1) / (1 + rho1)) burn = 500 print(f"{'':>14} {'HMC':>12} {'Random-walk MH':>16}") print("-" * 44) print(f"{'Accept rate':>14} {hmc_acc/T:>12.3f} {mh_acc/T:>16.3f}") print(f"{'Mean[0]':>14} {hmc_samples[burn:,0].mean():>12.3f} " f"{mh_samples[burn:,0].mean():>16.3f}") print(f"{'Std[0]':>14} {hmc_samples[burn:,0].std():>12.3f} " f"{mh_samples[burn:,0].std():>16.3f}") print(f"{'Corr':>14} " f"{np.corrcoef(hmc_samples[burn:].T)[0,1]:>12.3f} " f"{np.corrcoef(mh_samples[burn:].T)[0,1]:>16.3f}") print(f"{'ESS[0]':>14} {ess_1d(hmc_samples[burn:,0]):>12.0f} " f"{ess_1d(mh_samples[burn:,0]):>16.0f}") print(f"\nHMC achieves much higher ESS for correlated targets.") The No-U-Turn Sampler (NUTS) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ NUTS (Hoffman and Gelman, 2014) eliminates the need to hand-tune :math:`L` by building a binary tree of leapfrog steps and stopping when the trajectory begins to "double back" (the U-turn criterion). Combined with dual averaging to adapt :math:`\varepsilon`, NUTS is the default sampler in Stan and PyMC. 18.5 Convergence Diagnostics ------------------------------- Markov chains are guaranteed to converge only in the limit. In practice we need diagnostics to assess whether our finite chains have mixed well. We want concrete numbers, not just "it looks okay." R-hat (Gelman--Rubin statistic) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Run :math:`m \ge 2` chains from dispersed starting values. Let :math:`B` be the between-chain variance of the chain means and :math:`W` the average within-chain variance. The potential scale reduction factor is: .. math:: \hat{R} = \sqrt{\frac{(n-1)/n \cdot W + B/n}{W}}, where :math:`n` is the chain length. Reading this formula: the numerator estimates the true variance of the posterior using a mixture of within-chain and between-chain information. If the chains have converged to the same distribution, the between-chain variance :math:`B` will be small relative to :math:`W`, and the ratio will be close to 1. If the chains are still exploring different regions, :math:`B` will be large, inflating :math:`\hat{R}`. Values close to 1 (typically :math:`\hat{R} < 1.01`) indicate convergence. Effective sample size (ESS) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Because MCMC samples are autocorrelated, :math:`S` samples carry less information than :math:`S` independent samples. The ESS is: .. math:: \text{ESS} = \frac{S}{1 + 2\sum_{k=1}^\infty \rho_k}, where :math:`\rho_k` is the autocorrelation at lag :math:`k`. In plain English: if consecutive MCMC samples are highly correlated (the chain moves slowly), the denominator grows large, and the ESS shrinks well below the nominal number of samples :math:`S`. For example, if every sample is nearly identical to the previous one (:math:`\rho_k \approx 1` for many lags), you might have 10,000 samples but an ESS of only 50 --- meaning your 10,000 correlated draws carry as much information as 50 independent ones. A low ESS means we need more iterations or a better sampler. Let us compute R-hat and ESS from scratch, running multiple chains. .. code-block:: python # Convergence diagnostics from scratch: R-hat and ESS import numpy as np np.random.seed(42) # Target: N(3, 2^2). Run 4 chains from dispersed starts. def mh_chain(T, sigma_prop, start, target_mu=3.0, target_sigma=2.0): chain = np.zeros(T) chain[0] = start for t in range(1, T): proposal = chain[t-1] + np.random.normal(0, sigma_prop) log_alpha = (-0.5*((proposal - target_mu)/target_sigma)**2 + 0.5*((chain[t-1] - target_mu)/target_sigma)**2) if np.log(np.random.uniform()) < log_alpha: chain[t] = proposal else: chain[t] = chain[t-1] return chain # 4 chains from dispersed starting values T = 3000 starts = [-10, 0, 10, 20] chains = [mh_chain(T, sigma_prop=2.5, start=s) for s in starts] # Discard first half as burn-in burn = T // 2 chains_post = [c[burn:] for c in chains] m = len(chains_post) n = len(chains_post[0]) # -- R-hat -- chain_means = np.array([c.mean() for c in chains_post]) chain_vars = np.array([c.var(ddof=1) for c in chains_post]) W = chain_vars.mean() B = n * chain_means.var(ddof=1) var_hat = (n - 1) / n * W + B / n R_hat = np.sqrt(var_hat / W) # -- ESS (using initial positive sequence estimator) -- def compute_ess(chain): n = len(chain) mean = chain.mean() var = chain.var() if var == 0: return n acf_sum = 0 for k in range(1, n): rho_k = np.corrcoef(chain[:-k], chain[k:])[0, 1] if rho_k < 0.05: break acf_sum += rho_k return n / (1 + 2 * acf_sum) ess_values = [compute_ess(c) for c in chains_post] print("Convergence Diagnostics (target: N(3, 4))") print(f" Chains: {m}, length: {T}, burn-in: {burn}") print(f"\n {'Chain':>6} {'Start':>7} {'Mean':>8} {'Std':>8} {'ESS':>8}") print(" " + "-" * 42) for i, (c, s) in enumerate(zip(chains_post, starts)): print(f" {i+1:>6} {s:>7.1f} {c.mean():>8.3f} {c.std():>8.3f} " f"{ess_values[i]:>8.0f}") print(f"\n R-hat: {R_hat:.4f} {'(converged)' if R_hat < 1.01 else '(NOT converged)'}") print(f" Total ESS: {sum(ess_values):.0f}") print(f" W (within): {W:.4f}") print(f" B/n (between): {B/n:.4f}") Trace plots ^^^^^^^^^^^^ A simple visual diagnostic: plot each parameter against the iteration number. The chain should look like a "fuzzy caterpillar" --- stationary and well-mixed --- rather than showing trends or long excursions. .. code-block:: python # Contrast good vs bad mixing by printing trajectory summaries import numpy as np np.random.seed(42) # Well-tuned chain (sigma=2.5) vs poorly-tuned chain (sigma=0.05) T = 5000 good = np.zeros(T) bad = np.zeros(T) good[0] = bad[0] = 10.0 # start away from target mean=3 for t in range(1, T): # Good chain prop = good[t-1] + np.random.normal(0, 2.5) if np.log(np.random.uniform()) < -0.5*((prop-3)/2)**2 + 0.5*((good[t-1]-3)/2)**2: good[t] = prop else: good[t] = good[t-1] # Bad chain prop = bad[t-1] + np.random.normal(0, 0.05) if np.log(np.random.uniform()) < -0.5*((prop-3)/2)**2 + 0.5*((bad[t-1]-3)/2)**2: bad[t] = prop else: bad[t] = bad[t-1] # Print snapshots of trajectory print(f"{'Iteration':>10} {'Good chain':>12} {'Bad chain':>12}") print("-" * 36) for t in [0, 10, 50, 100, 500, 1000, 2000, 4999]: print(f"{t:>10} {good[t]:>12.3f} {bad[t]:>12.3f}") print(f"\n{'Diagnostic':>18} {'Good':>10} {'Bad':>10}") print("-" * 40) rho_good = np.corrcoef(good[:-1], good[1:])[0, 1] rho_bad = np.corrcoef(bad[:-1], bad[1:])[0, 1] ess_good = T * (1 - rho_good) / (1 + rho_good) ess_bad = T * (1 - rho_bad) / (1 + rho_bad) print(f"{'Mean (last 3000)':>18} {good[2000:].mean():>10.3f} {bad[2000:].mean():>10.3f}") print(f"{'Std (last 3000)':>18} {good[2000:].std():>10.3f} {bad[2000:].std():>10.3f}") print(f"{'Lag-1 autocorr':>18} {rho_good:>10.3f} {rho_bad:>10.3f}") print(f"{'ESS':>18} {ess_good:>10.0f} {ess_bad:>10.0f}") print(f"{'True mean':>18} {'3.000':>10} {'3.000':>10}") 18.6 Variational Inference ----------------------------- **Why variational methods?** MCMC is asymptotically exact but can be slow --- our hierarchical model takes thousands of iterations per second. What if we had a million observations? Variational inference (VI) turns the inference problem into an *optimization* problem, trading some accuracy for speed. The ELBO ^^^^^^^^^ We want to approximate the posterior :math:`p(\theta \mid \mathbf{x})` by a simpler distribution :math:`q(\theta)` from some family :math:`\mathcal{Q}`. The KL divergence from :math:`q` to the posterior is: .. math:: \text{KL}(q \| p) = \int q(\theta)\, \log \frac{q(\theta)}{p(\theta \mid \mathbf{x})}\, d\theta. This is non-negative and zero only when :math:`q = p`. We want to minimize it, but it involves the intractable :math:`p(\theta \mid \mathbf{x})`. **Derivation of the ELBO.** Write :math:`p(\theta \mid \mathbf{x}) = p(\mathbf{x}, \theta) / p(\mathbf{x})`: .. math:: \text{KL}(q \| p) &= \int q(\theta)\, \log \frac{q(\theta)\, p(\mathbf{x})}{p(\mathbf{x}, \theta)}\, d\theta \\ &= \log p(\mathbf{x}) + \int q(\theta)\, \log \frac{q(\theta)}{p(\mathbf{x}, \theta)}\, d\theta. Rearranging: .. math:: \log p(\mathbf{x}) = \text{KL}(q \| p) - \int q(\theta)\, \log \frac{q(\theta)}{p(\mathbf{x}, \theta)}\, d\theta. Because :math:`\text{KL} \ge 0`, we have: .. math:: \log p(\mathbf{x}) \;\ge\; \underbrace{\int q(\theta)\, \log \frac{p(\mathbf{x}, \theta)}{q(\theta)}\, d\theta} _{\text{ELBO}(\,q\,)}. The **Evidence Lower BOund** (ELBO) is: .. math:: \text{ELBO}(q) = E_q[\log p(\mathbf{x}, \theta)] - E_q[\log q(\theta)] = E_q[\log p(\mathbf{x} \mid \theta)] - \text{KL}(q \| p(\theta)). Reading the second form: the ELBO has two competing terms. The first, :math:`E_q[\log p(\mathbf{x} \mid \theta)]`, rewards the approximation :math:`q` for placing mass on parameter values that explain the data well (good fit). The second, :math:`\text{KL}(q \| p(\theta))`, penalizes :math:`q` for straying too far from the prior (regularization). Maximizing the ELBO finds the best trade-off --- the approximate posterior that fits the data as well as possible while remaining close to prior expectations. Maximizing the ELBO is equivalent to minimizing :math:`\text{KL}(q \| p)`. Let us implement variational inference for a Beta posterior, optimizing the ELBO with gradient ascent and printing the ELBO at each iteration. .. code-block:: python # Variational inference: fit a Gaussian to a Beta(35,25) posterior # Watch the ELBO increase iteration by iteration import numpy as np from scipy import stats np.random.seed(42) # True posterior: Beta(35, 25) from clinical trial a_true, b_true = 35, 25 true_mean = a_true / (a_true + b_true) true_std = np.sqrt(a_true * b_true / ((a_true+b_true)**2 * (a_true+b_true+1))) # Variational family: q(theta) = N(mu_q, sigma_q^2) (on logit scale) # Parameterize on logit scale for unconstrained optimization mu_q = 0.0 # logit scale log_sigma_q = -1.0 # log(sigma_q) lr = 0.05 S = 5_000 # MC samples for ELBO gradient print(f"{'Iter':>5} {'ELBO':>10} {'mu_q (logit)':>13} {'sigma_q':>9} " f"{'mean(theta)':>12}") print("-" * 52) for iteration in range(80): sigma_q = np.exp(log_sigma_q) # Reparameterization trick: z = mu_q + sigma_q * eps eps = np.random.normal(size=S) z_logit = mu_q + sigma_q * eps z = 1.0 / (1.0 + np.exp(-z_logit)) # inverse logit -> theta in (0,1) z = np.clip(z, 1e-10, 1 - 1e-10) # log p(x, theta) using Beta kernel log_joint = (a_true - 1) * np.log(z) + (b_true - 1) * np.log(1 - z) # log q(logit(theta)) with Jacobian log_q = stats.norm.logpdf(z_logit, mu_q, sigma_q) elbo = np.mean(log_joint - log_q) # Gradients (score function estimator) score_mu = (z_logit - mu_q) / sigma_q**2 score_log_sigma = ((z_logit - mu_q)**2 / sigma_q**2 - 1) advantage = log_joint - log_q - elbo grad_mu = np.mean(advantage * score_mu) grad_log_sigma = np.mean(advantage * score_log_sigma) mu_q += lr * grad_mu log_sigma_q += lr * 0.1 * grad_log_sigma if iteration < 10 or iteration % 10 == 0: theta_mean = np.mean(z) print(f"{iteration:>5} {elbo:>10.3f} {mu_q:>13.4f} " f"{sigma_q:>9.4f} {theta_mean:>12.4f}") print(f"\nTrue posterior: mean={true_mean:.4f}, std={true_std:.4f}") print(f"VI approx: mean={np.mean(z):.4f}, std={np.std(z):.4f}") Mean-field approximation ^^^^^^^^^^^^^^^^^^^^^^^^^^ The simplest variational family assumes the parameters are independent: .. math:: q(\boldsymbol{\theta}) = \prod_{j=1}^p q_j(\theta_j). Each factor :math:`q_j` is optimized in turn, holding the others fixed. **Coordinate ascent VI (CAVI).** The optimal update for :math:`q_j` is: .. math:: \log q_j^*(\theta_j) = E_{-j}[\log p(\mathbf{x}, \boldsymbol{\theta})] + \text{const}, where :math:`E_{-j}` denotes expectation with respect to all :math:`q_k` for :math:`k \neq j`. In plain English: to update our approximation for one parameter :math:`\theta_j`, we take the log of the full joint distribution, then average out all the other parameters using their current approximate distributions. The result tells us the optimal shape for :math:`q_j`. This is analogous to fitting one variable at a time while holding the others fixed, which is why it is called "coordinate ascent." This is derived by taking the functional derivative of the ELBO with respect to :math:`q_j` and setting it to zero. CAVI iterates over all components until the ELBO converges. .. code-block:: python # Coordinate Ascent VI for a mixture of two Gaussians (mean-field) import numpy as np from scipy import stats np.random.seed(42) # Data from a Gaussian with unknown mean and precision true_mu, true_tau = 3.0, 0.5 # tau = 1/sigma^2 data = np.random.normal(true_mu, 1/np.sqrt(true_tau), size=50) n = len(data) x_bar = data.mean() x_var = data.var() # Mean-field: q(mu, tau) = q_mu(mu) * q_tau(tau) # q_mu = N(m, s^2), q_tau = Gamma(a, b) # Hyperpriors: mu ~ N(0, 100), tau ~ Gamma(0.01, 0.01) m, s2 = 0.0, 10.0 a_q, b_q = 1.0, 1.0 print(f"{'Iter':>5} {'E[mu]':>8} {'E[tau]':>8} {'E[sigma]':>9} {'ELBO':>10}") print("-" * 43) for it in range(20): # Update q(mu): N(m, s^2) E_tau = a_q / b_q s2 = 1.0 / (1.0/100.0 + n * E_tau) m = s2 * (0.0/100.0 + E_tau * n * x_bar) # Update q(tau): Gamma(a_q, b_q) E_mu = m E_mu2 = s2 + m**2 a_q = 0.01 + n / 2.0 b_q = 0.01 + 0.5 * (np.sum(data**2) - 2 * E_mu * np.sum(data) + n * E_mu2) # Compute ELBO (simplified) elbo = (-n/2 * np.log(2*np.pi) + n/2 * (np.euler_gamma + np.log(2*b_q) - 1/(a_q)) - 0.5 * n * (a_q/b_q) * x_var) print(f"{it:>5} {m:>8.4f} {a_q/b_q:>8.4f} " f"{1/np.sqrt(a_q/b_q):>9.4f} {elbo:>10.2f}") print(f"\nTrue: mu={true_mu}, sigma={1/np.sqrt(true_tau):.3f}") Stochastic variational inference ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For large datasets, computing the ELBO over all data is expensive. Stochastic VI (Hoffman et al., 2013) uses mini-batches: subsample the data, compute a noisy gradient of the ELBO, and update the variational parameters via stochastic gradient ascent. This scales VI to millions of observations. 18.7 Integrated Nested Laplace Approximations (INLA) ------------------------------------------------------- **Why INLA?** For a large class of models --- *latent Gaussian models* (LGMs) --- INLA provides fast, deterministic, and accurate posterior approximations without MCMC. A latent Gaussian model has three stages: 1. **Observations:** :math:`y_i \mid \eta_i, \boldsymbol{\psi}` from an exponential family. 2. **Latent field:** :math:`\boldsymbol{\eta} \mid \boldsymbol{\psi} \sim N(\boldsymbol{\mu}_\psi, \boldsymbol{Q}_\psi^{-1})` (Gaussian with a sparse precision matrix :math:`\boldsymbol{Q}`). 3. **Hyperparameters:** :math:`\boldsymbol{\psi}` with prior :math:`p(\boldsymbol{\psi})`. The key insight: with a Gaussian latent field, the Laplace approximation (Section 17.5 in :ref:`Chapter 17 `) is very accurate for the conditional posterior :math:`p(\boldsymbol{\eta} \mid \boldsymbol{\psi}, \mathbf{y})`. INLA then numerically integrates over the (typically low-dimensional) :math:`\boldsymbol{\psi}`. **When to use INLA.** INLA is the method of choice when the model fits the LGM framework: generalized linear models, spatial models, time series, survival models, etc. It is implemented in the R package ``R-INLA``. 18.8 Comparison of Methods ---------------------------- .. list-table:: :header-rows: 1 :widths: 20 20 20 20 20 * - Method - Accuracy - Speed - Scalability - When to use * - Laplace approx. - Moderate - Very fast - High - Unimodal, symmetric posteriors; quick screening * - MCMC (MH/Gibbs) - Exact (asymptotic) - Slow - Low--moderate - Gold standard; moderate dimensions * - HMC / NUTS - Exact (asymptotic) - Moderate - Moderate--high - Continuous parameters; up to thousands of dimensions * - Variational inference - Approximate - Fast - Very high - Large datasets; exploratory analysis; deep generative models * - INLA - Very good - Fast - High - Latent Gaussian models Let us run a horse race: apply three methods to the same problem and compare. .. code-block:: python # Head-to-head comparison: Laplace vs MH vs Gibbs on Normal-Normal import numpy as np from scipy import stats, optimize import time np.random.seed(42) # Data: N(mu, 1), mu ~ N(0, 10^2), observe 50 points true_mu = 3.0 data = np.random.normal(true_mu, 1.0, size=50) n = len(data) x_bar = data.mean() # --- Exact posterior (conjugate) --- prec_post = 1/100 + n mu_post_exact = (0/100 + n*x_bar) / prec_post sd_post_exact = np.sqrt(1/prec_post) # --- Method 1: Laplace approximation --- t0 = time.time() neg_log_post = lambda mu: 0.5*n*(mu - x_bar)**2 + 0.5*mu**2/100 res = optimize.minimize_scalar(neg_log_post) mu_lap = res.x h2 = n + 1/100 sd_lap = np.sqrt(1/h2) t_lap = time.time() - t0 # --- Method 2: Metropolis-Hastings --- t0 = time.time() T_mh = 10_000 mh = np.zeros(T_mh) mh[0] = 0.0 acc_mh = 0 for t in range(1, T_mh): prop = mh[t-1] + np.random.normal(0, 0.3) la = (-0.5*n*(prop-x_bar)**2 - 0.5*prop**2/100 + 0.5*n*(mh[t-1]-x_bar)**2 + 0.5*mh[t-1]**2/100) if np.log(np.random.uniform()) < la: mh[t] = prop acc_mh += 1 else: mh[t] = mh[t-1] t_mh = time.time() - t0 mh_post = mh[2000:] # --- Method 3: Direct sampling (exact conjugate) --- t0 = time.time() exact_samples = np.random.normal(mu_post_exact, sd_post_exact, 10_000) t_exact = time.time() - t0 print(f"{'Method':>18} {'Mean':>8} {'Std':>8} {'Time (ms)':>10}") print("-" * 48) print(f"{'Exact':>18} {mu_post_exact:>8.4f} {sd_post_exact:>8.4f} {t_exact*1000:>10.2f}") print(f"{'Laplace':>18} {mu_lap:>8.4f} {sd_lap:>8.4f} {t_lap*1000:>10.2f}") print(f"{'MH (10k iter)':>18} {mh_post.mean():>8.4f} {mh_post.std():>8.4f} " f"{t_mh*1000:>10.2f}") **Rules of thumb.** * If the model is a latent Gaussian model, try INLA first. * For moderate-dimensional continuous models, use HMC/NUTS (Stan, PyMC). * For massive datasets or deep generative models, use variational inference. * Use Gibbs sampling when full conditionals are available and the dimension is not too high. * Use random-walk Metropolis as a baseline or when gradients are unavailable. 18.9 Summary -------------- * Monte Carlo integration approximates integrals by sampling; importance sampling reweights samples from a proposal. * Metropolis--Hastings constructs a Markov chain satisfying detailed balance; the acceptance probability is derived from the stationarity condition. * Gibbs sampling is MH with full conditional proposals and acceptance rate 1. * HMC uses Hamiltonian dynamics and the leapfrog integrator to make large, gradient-informed moves; NUTS automates the trajectory length. * Convergence is assessed with :math:`\hat{R}`, ESS, and trace plots. * Variational inference turns inference into optimization via the ELBO. * INLA combines Laplace approximations with numerical integration for latent Gaussian models. .. code-block:: python # Summary: which method for which problem? methods = [ ("Conjugate (Beta-Binomial)", "Exact", "< 1 ms", "Low-dim conjugate"), ("Laplace approximation", "Good", "< 10 ms", "Unimodal posteriors"), ("Metropolis-Hastings", "Exact (lim)", "seconds", "General, low-dim"), ("Gibbs sampling", "Exact (lim)", "seconds", "Conjugate conditionals"), ("HMC / NUTS", "Exact (lim)", "seconds", "Continuous, high-dim"), ("Variational inference", "Approximate", "< 1 sec", "Large data, deep models"), ] print(f"{'Method':<30} {'Accuracy':<14} {'Speed':<12} {'Best for'}") print("-" * 80) for name, acc, speed, use in methods: print(f"{name:<30} {acc:<14} {speed:<12} {use}")