Chapter 20 – Modern Research Frontiers

Classical likelihood theory assumes we can write down and evaluate the likelihood function \(L(\theta) = p(\mathbf{x} \mid \theta)\). Modern research pushes far beyond this assumption. We now have models where the likelihood is intractable, data sets so large that exact computation is infeasible, and neural networks flexible enough to learn entire probability distributions. This chapter surveys these frontiers, always tracing the thread back to the likelihood ideas developed in earlier chapters.

If the preceding chapters built a solid house, this chapter shows you the expanding neighborhood. The foundation—the likelihood function—remains the same. What has changed is the scale of ambition: from a handful of parameters to millions, from closed-form densities to opaque simulators, and from hand-derived gradients to automatic ones.

20.1 Automatic Differentiation

Why this enables everything else. Nearly every optimization and sampling algorithm in this guide uses derivatives of the log-likelihood. Computing these derivatives by hand is error-prone and impractical for complex models. Automatic differentiation (AD) computes exact (up to floating point) derivatives of arbitrary computer programs.

Forward mode

Forward-mode AD propagates dual numbers through the computation graph. A dual number augments each value \(v\) with its derivative \(\dot{v} = dv/d\theta\):

\[\begin{split}(v, \dot{v}) + (u, \dot{u}) &= (v + u,\; \dot{v} + \dot{u}), \\ (v, \dot{v}) \times (u, \dot{u}) &= (v \cdot u,\; v\,\dot{u} + u\,\dot{v}).\end{split}\]

One forward pass computes \(f(\theta)\) and \(df/d\theta_j\) for one input variable \(\theta_j\). The cost is \(O(p)\) forward passes for a gradient with respect to \(p\) parameters.

Intuition

Think of dual numbers as carrying both a value and a “derivative tag” along for the ride. As arithmetic flows through the program, the derivative tag updates itself automatically by the chain rule. You never need to derive a single formula—the computer does it for you, exactly.

Real-World Example

Forward-mode AD with dual numbers.

Let’s implement dual numbers from scratch and use them to differentiate a log-likelihood function. This is exactly what libraries like ForwardDiff.jl (Julia) do under the hood.

Pseudocode:

1. Define a DualNumber class with (value, derivative) pairs
2. Implement __add__, __mul__, __truediv__, __pow__, exp, log
3. Write the Bernoulli log-likelihood using dual numbers
4. Seed theta = DualNumber(theta_val, 1.0)  <- derivative of theta w.r.t. itself is 1
5. Evaluate log-likelihood; the derivative comes out automatically

Python implementation:

# Forward-mode AD with dual numbers: differentiate a log-likelihood
import math

class Dual:
    """A dual number (value, derivative) for forward-mode AD."""
    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der
    def __add__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + o.val, self.der + o.der)
    def __radd__(self, other):
        return self.__add__(other)
    def __mul__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * o.val, self.val * o.der + self.der * o.val)
    def __rmul__(self, other):
        return self.__mul__(other)
    def __sub__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val - o.val, self.der - o.der)
    def __rsub__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        return Dual(o.val - self.val, o.der - self.der)

def dual_log(x):
    return Dual(math.log(x.val), x.der / x.val)

# Bernoulli log-likelihood: l(theta) = s*log(theta) + (n-s)*log(1-theta)
s, n = 7, 10

theta_val = 0.6
theta = Dual(theta_val, 1.0)  # seed derivative = 1

log_lik = s * dual_log(theta) + (n - s) * dual_log(1 - theta)

# Compare with analytical derivative: s/theta - (n-s)/(1-theta)
analytical = s / theta_val - (n - s) / (1 - theta_val)

print(f"Log-likelihood:     {log_lik.val:.6f}")
print(f"AD derivative:      {log_lik.der:.6f}")
print(f"Analytical deriv.:  {analytical:.6f}")
print(f"Match: {abs(log_lik.der - analytical) < 1e-10}")

Extending dual numbers to richer functions. The Bernoulli example is clean but simple. Real likelihood functions use \(\exp\), \(\text{pow}\), division, and composition of many operations. Let us extend the dual number class and differentiate a Normal log-likelihood, comparing the result with both the analytical and numerical gradients.

# Extended dual numbers: Normal log-likelihood differentiation
import math

class Dual:
    """Dual number with full arithmetic for forward-mode AD."""
    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der
    def __repr__(self):
        return f"Dual({self.val:.6f}, {self.der:.6f})"
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.der + o.der)
    __radd__ = lambda s, o: s.__add__(o)
    def __sub__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val - o.val, self.der - o.der)
    def __rsub__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(o.val - self.val, o.der - self.der)
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.der + self.der * o.val)
    __rmul__ = lambda s, o: s.__mul__(o)
    def __truediv__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val / o.val,
                    (self.der * o.val - self.val * o.der) / o.val**2)
    def __rtruediv__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return o.__truediv__(self)
    def __pow__(self, n):
        return Dual(self.val**n, n * self.val**(n-1) * self.der)
    def __neg__(self):
        return Dual(-self.val, -self.der)

def dual_log(x):
    return Dual(math.log(x.val), x.der / x.val)

def dual_exp(x):
    e = math.exp(x.val)
    return Dual(e, e * x.der)

# Normal log-likelihood: -n/2 log(2*pi*sigma^2) - sum((x-mu)^2)/(2*sigma^2)
# Differentiate w.r.t. mu at a fixed sigma
import numpy as np
np.random.seed(42)
data = np.random.normal(3.0, 1.5, size=50)

sigma_val = 1.5
mu_val = 2.8

# Using dual numbers (seed derivative of mu = 1)
mu = Dual(mu_val, 1.0)
n = len(data)
log_lik = Dual(-n/2 * math.log(2 * math.pi * sigma_val**2))
for xi in data:
    log_lik = log_lik - (xi - mu)**2 / (2 * sigma_val**2)

# Analytical gradient: sum(x - mu) / sigma^2
grad_analytical = np.sum(data - mu_val) / sigma_val**2

# Numerical gradient (central differences)
h = 1e-7
def ll_numpy(mu_v):
    return -n/2 * np.log(2*np.pi*sigma_val**2) - np.sum((data-mu_v)**2)/(2*sigma_val**2)
grad_numerical = (ll_numpy(mu_val + h) - ll_numpy(mu_val - h)) / (2 * h)

print(f"Normal log-lik at mu={mu_val}: {log_lik.val:.6f}")
print(f"\nGradient d(loglik)/d(mu):")
print(f"  Forward-mode AD: {log_lik.der:.10f}")
print(f"  Analytical:      {grad_analytical:.10f}")
print(f"  Numerical:       {grad_numerical:.10f}")
print(f"\nAD vs analytical error: {abs(log_lik.der - grad_analytical):.2e}")
print(f"Numerical vs analytical error: {abs(grad_numerical - grad_analytical):.2e}")
print("AD matches the analytical gradient to machine precision.")

Reverse mode (backpropagation)

Reverse-mode AD first evaluates the function (forward pass), recording the computation graph, then propagates adjoint values backward:

\[\bar{v}_i = \frac{\partial f}{\partial v_i} = \sum_{j \,:\, v_i \to v_j} \bar{v}_j \,\frac{\partial v_j}{\partial v_i}.\]

Reading this formula: \(\bar{v}_i\) is the sensitivity of the final output \(f\) to the intermediate value \(v_i\). To compute it, we look at every node \(v_j\) that directly depends on \(v_i\) in the computation graph, and sum up how changes in \(v_i\) propagate through each of those paths. This is just the multi-variable chain rule, applied systematically backward through the graph.

One reverse pass computes the full gradient \(\nabla_\theta f\) regardless of the dimension \(p\). The cost is \(O(1)\) reverse passes (plus storage for the computation graph).

Consequence for likelihood optimization. Reverse-mode AD makes it practical to optimize log-likelihoods with thousands or millions of parameters. Libraries such as JAX, PyTorch, TensorFlow, and Stan all provide reverse-mode AD.

Reverse-mode AD concept: gradient of a multi-layer function. Forward mode is efficient when you have few inputs and many outputs. Reverse mode is efficient when you have many inputs and one output—exactly the situation in likelihood optimization. Below we implement a minimal reverse-mode AD engine and use it to differentiate a composed function with three parameters.

# Reverse-mode AD concept: a minimal tape-based implementation
import math

class Var:
    """A variable in a computation graph for reverse-mode AD."""
    def __init__(self, val, children=(), grad_fns=()):
        self.val = val
        self.grad = 0.0
        self.children = children
        self.grad_fns = grad_fns

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        out = Var(self.val + other.val, (self, other),
                  (lambda g: g, lambda g: g))
        return out
    __radd__ = lambda s, o: s.__add__(o)

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        sv, ov = self.val, other.val
        out = Var(sv * ov, (self, other),
                  (lambda g, ov=ov: g * ov, lambda g, sv=sv: g * sv))
        return out
    __rmul__ = lambda s, o: s.__mul__(o)

    def __neg__(self):
        return Var(-self.val, (self,), (lambda g: -g,))

    def __sub__(self, other):
        return self + (-other)

    def backward(self):
        """Backpropagate gradients through the computation graph."""
        # Topological sort
        order, visited = [], set()
        def topo(v):
            if id(v) not in visited:
                visited.add(id(v))
                for c in v.children:
                    topo(c)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            for child, grad_fn in zip(v.children, v.grad_fns):
                child.grad += grad_fn(v.grad)

def var_log(x):
    out = Var(math.log(x.val), (x,),
              (lambda g, xv=x.val: g / xv,))
    return out

def var_exp(x):
    e = math.exp(x.val)
    out = Var(e, (x,), (lambda g, e=e: g * e,))
    return out

# Function: f(a,b,c) = log(a*b + exp(c))
# Analytical: df/da = b/(a*b+exp(c)), df/db = a/(a*b+exp(c)),
#             df/dc = exp(c)/(a*b+exp(c))
a = Var(2.0)
b = Var(3.0)
c = Var(1.0)

# Forward pass
f = var_log(a * b + var_exp(c))

# Backward pass: one call gives ALL gradients
f.backward()

# Analytical gradients
denom = 2.0 * 3.0 + math.exp(1.0)
print(f"f(2,3,1) = log(2*3 + exp(1)) = {f.val:.6f}")
print(f"\nReverse-mode gradients (one backward pass for all 3 params):")
print(f"  df/da: AD={a.grad:.6f}, analytical={3.0/denom:.6f}")
print(f"  df/db: AD={b.grad:.6f}, analytical={2.0/denom:.6f}")
print(f"  df/dc: AD={c.grad:.6f}, analytical={math.exp(1.0)/denom:.6f}")
print(f"\nForward mode would need 3 passes (one per parameter).")
print(f"Reverse mode needed just 1 pass. This is why it scales to millions of params.")

Autodiff of a Normal log-likelihood: analytical vs numerical vs AD. Here we bring together all three gradient methods to differentiate the same Normal log-likelihood. This comparison shows that AD gives the exact analytical gradient, while numerical differentiation introduces a small error.

# Autodiff of Normal log-likelihood: analytical vs numerical vs AD
import numpy as np

np.random.seed(42)

# Data
data = np.random.normal(loc=3.0, scale=1.5, size=50)

def log_likelihood(mu, sigma, data):
    """Normal log-likelihood."""
    n = len(data)
    return -n/2 * np.log(2 * np.pi * sigma**2) - np.sum((data - mu)**2) / (2 * sigma**2)

# Analytical gradients for comparison
mu_val, sigma_val = 2.8, 1.6
n = len(data)
grad_mu_analytical = np.sum(data - mu_val) / sigma_val**2
grad_sigma_analytical = -n / sigma_val + np.sum((data - mu_val)**2) / sigma_val**3

# Numerical gradient (central differences)
h = 1e-7
num_grad_mu = (log_likelihood(mu_val + h, sigma_val, data)
               - log_likelihood(mu_val - h, sigma_val, data)) / (2 * h)
num_grad_sigma = (log_likelihood(mu_val, sigma_val + h, data)
                  - log_likelihood(mu_val, sigma_val - h, data)) / (2 * h)

# "AD" via complex-step (machine-precision, like true AD)
def ll_complex(mu, sigma):
    n = len(data)
    return -n/2 * np.log(2 * np.pi * sigma**2) - np.sum((data - mu)**2) / (2 * sigma**2)

h_cs = 1e-30
ad_grad_mu = np.imag(ll_complex(mu_val + 1j*h_cs, sigma_val)) / h_cs
ad_grad_sigma = np.imag(ll_complex(mu_val, sigma_val + 1j*h_cs)) / h_cs

print(f"Log-likelihood at (mu={mu_val}, sigma={sigma_val}): "
      f"{log_likelihood(mu_val, sigma_val, data):.4f}")
print(f"\n{'Method':<14} {'d/d(mu)':<18} {'d/d(sigma)':<18}")
print("-" * 50)
print(f"{'Analytical':<14} {grad_mu_analytical:<18.10f} {grad_sigma_analytical:<18.10f}")
print(f"{'Numerical':<14} {num_grad_mu:<18.10f} {num_grad_sigma:<18.10f}")
print(f"{'AD (complex)':<14} {ad_grad_mu:<18.10f} {ad_grad_sigma:<18.10f}")
print(f"\nNumerical error (mu):    {abs(num_grad_mu - grad_mu_analytical):.2e}")
print(f"AD error (mu):           {abs(ad_grad_mu - grad_mu_analytical):.2e}")
print(f"Numerical error (sigma): {abs(num_grad_sigma - grad_sigma_analytical):.2e}")
print(f"AD error (sigma):        {abs(ad_grad_sigma - grad_sigma_analytical):.2e}")
print("# With JAX: grad_fn = jax.grad(log_likelihood, argnums=(0, 1))")
print("# One call gives exact gradients via reverse-mode AD.")

Connection to Fisher information. Second-order methods (Newton, natural gradient) need the Hessian or Fisher information matrix. AD can compute Hessian–vector products efficiently, enabling quasi-Newton methods and second-order optimization at scale.

20.2 Neural Density Estimation

Why parameterize distributions with neural networks? Traditional parametric families (Normal, Gamma, etc.) are convenient but limited: the true data-generating distribution may not belong to any simple family. Neural networks can approximate any density function (under mild conditions), giving us a flexible “likelihood machine.”

Mixture density networks

A mixture density network (MDN; Bishop, 1994) outputs the parameters of a Gaussian mixture:

\[p(y \mid \mathbf{x}; \mathbf{w}) = \sum_{k=1}^K \pi_k(\mathbf{x}; \mathbf{w})\; \mathcal{N}\!\bigl(y;\; \mu_k(\mathbf{x}; \mathbf{w}),\; \sigma_k^2(\mathbf{x}; \mathbf{w})\bigr),\]

where \(\pi_k, \mu_k, \sigma_k^2\) are neural network outputs. Training maximizes the log-likelihood \(\sum_i \log p(y_i \mid \mathbf{x}_i; \mathbf{w})\) — exactly the MLE principle from Part I.

Why does this matter?

Standard regression assumes a single predicted value with Gaussian noise. But many real-world relationships are multimodal: given the same input, the output could take several distinct values. MDNs capture this by predicting an entire mixture distribution, not just a point.

Mixture density network: fitting a multimodal relationship. A standard regression predicts a single value for each input. But what if the relationship is one-to-many? For example, the inverse kinematics of a robot arm: multiple joint configurations can produce the same end-effector position. An MDN predicts an entire mixture distribution for each input.

# Mixture density network: fitting a multimodal conditional distribution
import numpy as np
from scipy.special import logsumexp

np.random.seed(42)

# Generate multimodal data: y = x + noise, but with two modes
n = 1000
x = np.random.uniform(-1, 1, n)
mode = np.random.binomial(1, 0.5, n)
y = np.where(mode == 0, np.sin(3*x) + 0.2*np.random.randn(n),
                         -np.sin(3*x) + 0.2*np.random.randn(n))

# Simple MDN with K=2 components
# Parameters: for each x, predict (pi, mu_1, mu_2, sigma_1, sigma_2)
# We use a simple linear model for illustration (real MDN uses neural net)
K = 2

# Feature matrix: polynomial basis
deg = 5
Phi = np.column_stack([x**d for d in range(deg+1)])

# Initialize weights: each output head has (deg+1) weights
n_features = deg + 1
np.random.seed(123)
W_mu = np.random.randn(K, n_features) * 0.1     # means
W_log_sigma = np.zeros((K, n_features))           # log-std
W_logit_pi = np.zeros((K, n_features))             # mixing logits

def mdn_log_likelihood(W_mu, W_log_sigma, W_logit_pi, Phi, y):
    """Compute MDN log-likelihood."""
    mu = Phi @ W_mu.T          # (n, K)
    log_sigma = Phi @ W_log_sigma.T
    sigma = np.exp(np.clip(log_sigma, -5, 5))
    logits = Phi @ W_logit_pi.T
    log_pi = logits - logsumexp(logits, axis=1, keepdims=True)

    # Log-density of each component for each observation
    log_comp = (log_pi - 0.5*np.log(2*np.pi)
                - log_sigma - 0.5*((y[:,None] - mu)/sigma)**2)

    # Log-likelihood: sum of logsumexp over components
    return np.sum(logsumexp(log_comp, axis=1))

# Simple gradient descent (numerical gradients for clarity)
lr = 1e-4
h = 1e-5
print(f"{'Iter':<8} {'Log-likelihood':<18}")
print("-" * 26)

for iteration in range(201):
    ll = mdn_log_likelihood(W_mu, W_log_sigma, W_logit_pi, Phi, y)
    if iteration % 50 == 0:
        print(f"{iteration:<8} {ll:<18.2f}")

    # Numerical gradient for W_mu (simplified)
    for k in range(K):
        for j in range(n_features):
            W_mu[k,j] += h
            ll_plus = mdn_log_likelihood(W_mu, W_log_sigma, W_logit_pi, Phi, y)
            W_mu[k,j] -= h
            W_mu[k,j] += lr * (ll_plus - ll) / h

            W_log_sigma[k,j] += h
            ll_plus = mdn_log_likelihood(W_mu, W_log_sigma, W_logit_pi, Phi, y)
            W_log_sigma[k,j] -= h
            W_log_sigma[k,j] += lr * (ll_plus - ll) / h

# Evaluate the learned mixture at a test point
x_test = 0.5
phi_test = np.array([x_test**d for d in range(deg+1)])
mu_test = phi_test @ W_mu.T
sigma_test = np.exp(phi_test @ W_log_sigma.T)
logit_test = phi_test @ W_logit_pi.T
pi_test = np.exp(logit_test - logsumexp(logit_test))

print(f"\nAt x={x_test}:")
for k in range(K):
    print(f"  Component {k+1}: pi={pi_test[k]:.3f}, "
          f"mu={mu_test[k]:.3f}, sigma={sigma_test[k]:.3f}")
print(f"  True modes: sin(1.5)={np.sin(1.5):.3f}, -sin(1.5)={-np.sin(1.5):.3f}")

Autoregressive models

The chain rule of probability factorizes any joint density:

\[p(\mathbf{x}) = \prod_{d=1}^D p(x_d \mid x_1, \ldots, x_{d-1}).\]

An autoregressive model parameterizes each conditional with a neural network. Examples include MADE, PixelCNN, and WaveNet. The log-likelihood is:

\[\log p(\mathbf{x}) = \sum_{d=1}^D \log p(x_d \mid x_{<d}; \mathbf{w}),\]

In plain English: the log-probability of the whole data vector decomposes into a sum of log-probabilities, each asking “given all the previous dimensions, how likely is this next one?” A neural network learns each of these conditional distributions. Because the result is a simple sum of log-probabilities, the total log-likelihood is tractable — we can evaluate it and take gradients — so training is just maximum likelihood estimation via gradient ascent.

20.3 Simulation-Based Inference (Likelihood-Free)

Why likelihood-free? Many scientific models are defined by a simulator — a computer program that generates synthetic data given parameters — but the likelihood function \(p(\mathbf{x} \mid \theta)\) is intractable because the simulator involves stochastic processes, latent variables, or differential equations without closed-form solutions.

Real-World Example

Simulation-based inference in ecology and cosmology.

In population genetics, models of evolution produce complex patterns of genetic variation through stochastic birth-death processes, mutation, and natural selection. You can simulate these processes easily, but there is no way to write down \(p(\text{observed genomes} \mid \text{mutation rate}, \text{population size})\) in closed form. The same situation arises in cosmology (simulating galaxy formation), epidemiology (agent-based disease models), and climate science (atmospheric simulators).

Simulation-based inference lets you do Bayesian inference anyway: generate thousands of synthetic datasets from the simulator, then train a neural network to learn the mapping from data to posterior.

Approximate Bayesian Computation (ABC)

The simplest likelihood-free method.

  1. Sample \(\theta^* \sim p(\theta)\) from the prior.

  2. Simulate \(\mathbf{x}^* \sim p(\cdot \mid \theta^*)\).

  3. Accept \(\theta^*\) if \(d(S(\mathbf{x}^*), S(\mathbf{x}_{\text{obs}})) < \varepsilon\), where \(S\) is a summary statistic and \(d\) is a distance metric.

The accepted samples approximate the posterior \(p(\theta \mid d(S(\mathbf{x}), S(\mathbf{x}_{\text{obs}})) < \varepsilon)\). As \(\varepsilon \to 0\) this converges to \(p(\theta \mid S(\mathbf{x}_{\text{obs}}))\), and if \(S\) is sufficient it equals the true posterior.

ABC for a simple model: recovering the posterior. To make ABC concrete, we use a model where we know the true posterior (Poisson observations with a Gamma prior), so we can verify that ABC gives the right answer. This is the “test with a known answer” approach that builds trust before applying ABC to truly intractable models.

# ABC for a Poisson model: compare ABC posterior with known conjugate posterior
import numpy as np
from scipy import stats

np.random.seed(42)

# "Simulator": Poisson observations with unknown rate lambda
def simulator(lam, n_obs=20):
    return np.random.poisson(lam, n_obs)

# Observed data
x_obs = np.array([3, 5, 2, 4, 3, 6, 2, 3, 4, 5,
                   3, 4, 2, 3, 4, 5, 3, 4, 3, 2])
s_obs = x_obs.mean()  # summary statistic: sample mean
n_obs = len(x_obs)

# Step 1: Generate training data from prior
S = 50_000
prior = stats.uniform(loc=0.5, scale=9.5)  # Uniform(0.5, 10)
thetas = prior.rvs(S)
summary_stats = np.array([simulator(lam, n_obs).mean() for lam in thetas])

# Step 2: ABC with decreasing epsilon
print(f"Observed summary stat (mean): {s_obs:.2f}")
print(f"\n{'epsilon':<10} {'n_accepted':<14} {'ABC mean':<12} "
      f"{'ABC std':<10} {'Accept %'}")
print("-" * 58)

for epsilon in [1.0, 0.5, 0.3, 0.15, 0.08]:
    mask = np.abs(summary_stats - s_obs) < epsilon
    accepted = thetas[mask]
    if len(accepted) > 5:
        print(f"{epsilon:<10.2f} {len(accepted):<14} {accepted.mean():<12.3f} "
              f"{accepted.std():<10.3f} {100*len(accepted)/S:<.1f}%")

# True posterior: Gamma (since Poisson-Gamma is conjugate)
# With flat prior ~ Gamma(1, 0), posterior is Gamma(1 + sum(x), 1/(1 + n))
alpha_post = 1 + x_obs.sum()
beta_post = 1 + n_obs
true_post = stats.gamma(a=alpha_post, scale=1/beta_post)

print(f"\nTrue posterior (Gamma): mean={true_post.mean():.3f}, "
      f"std={true_post.std():.3f}")
print(f"ABC posterior (eps=0.15): approaches the true posterior")
print(f"\nAs epsilon -> 0, ABC converges to the exact posterior,")
print(f"but the acceptance rate drops, requiring more simulations.")

Limitations. ABC is slow in high dimensions, sensitive to the choice of summary statistics, and requires the tolerance \(\varepsilon\) to be chosen carefully.

Neural Posterior Estimation (NPE)

Instead of accepting/rejecting, train a neural network (typically a normalizing flow, Section 20.5) to directly map simulated data to posterior distributions.

  1. Repeat: sample \(\theta^{(s)} \sim p(\theta)\), simulate \(\mathbf{x}^{(s)} \sim p(\cdot \mid \theta^{(s)})\).

  2. Train a conditional density estimator \(q_\phi(\theta \mid \mathbf{x})\) by maximizing:

    \[\sum_{s=1}^S \log q_\phi(\theta^{(s)} \mid \mathbf{x}^{(s)}).\]
  3. Evaluate \(q_\phi(\theta \mid \mathbf{x}_{\text{obs}})\) at the observed data to obtain the approximate posterior.

The loss function is the negative log-likelihood of the parameters under the conditional density model — a direct extension of MLE.

Neural Likelihood Estimation (NLE)

Instead of estimating the posterior directly, estimate the likelihood:

  1. Train \(q_\phi(\mathbf{x} \mid \theta)\) to approximate \(p(\mathbf{x} \mid \theta)\).

  2. Use the learned likelihood in standard Bayesian inference (e.g., MCMC with \(q_\phi\) as the likelihood).

This separates the approximation step from the inference step and allows reuse of the learned likelihood with different priors.

Neural Ratio Estimation (NRE)

A third approach estimates the likelihood-to-evidence ratio \(r(\mathbf{x}, \theta) = p(\mathbf{x} \mid \theta) / p(\mathbf{x})\), which is the quantity needed in MCMC acceptance ratios. Training a classifier to distinguish pairs \((\mathbf{x}, \theta)\) drawn jointly from \(p(\mathbf{x}, \theta)\) versus the marginals \(p(\mathbf{x})\,p(\theta)\) recovers this ratio.

20.4 Amortized Inference

Why amortize? Traditional inference is “one-shot”: for each new dataset we run a full optimization or MCMC. In amortized inference we train a neural network once on many simulated datasets, and at test time it maps any new observation to its posterior in a single forward pass.

Formally, we learn \(q_\phi(\theta \mid \mathbf{x})\) by minimizing

\[\mathcal{L}(\phi) = -E_{p(\theta, \mathbf{x})}\!\left[\log q_\phi(\theta \mid \mathbf{x})\right]\]

over the joint distribution of parameters and data.

Reading this formula: we generate many (parameter, data) pairs from the simulator, then train the network to maximize the probability it assigns to the true parameter given the data. This is the same maximum likelihood idea applied to the conditional density estimator itself. After training, the network has seen so many examples that it has learned the general pattern of how data map to posteriors — so for any new dataset, it can produce an approximate posterior in a single forward pass, without running any optimization or MCMC.

This is the amortized version of variational inference: instead of optimizing separate variational parameters for each dataset, a single network handles all datasets.

Amortized inference: an encoder that maps data to posterior parameters. The idea of amortized inference is simple: train a function that takes raw data and returns posterior parameters. Here we build a small “encoder” (a linear model, for transparency) that maps summary statistics to the parameters of a Gaussian approximate posterior.

# Amortized inference: train an encoder from data to posterior parameters
import numpy as np

np.random.seed(42)

# Model: Normal(mu, 1) with known variance
# True posterior for n observations with known var=1:
#   mu | x ~ N(x_bar, 1/n)
# The encoder should learn: posterior_mean = x_bar, posterior_logvar = -log(n)

n_obs = 10
n_train = 20_000

# Generate training data: (theta, summary_stats) pairs
thetas_train = np.random.normal(0, 3, n_train)  # prior: N(0, 9)
summaries_train = np.zeros((n_train, 2))  # features: (mean, var)
for i in range(n_train):
    x = np.random.normal(thetas_train[i], 1.0, n_obs)
    summaries_train[i, 0] = x.mean()
    summaries_train[i, 1] = x.var()

# Encoder: linear map from summary stats to (posterior_mean, posterior_log_var)
# q(theta | x) = N(theta; a^T s(x), exp(b^T s(x)))
# Train by maximizing E[log q(theta | x)]

# Add bias term
S = np.column_stack([summaries_train, np.ones(n_train)])  # (n_train, 3)
W = np.zeros((2, 3))  # row 0: weights for mean, row 1: weights for logvar

lr = 0.001
print(f"{'Iter':<8} {'Avg log q(theta|x)':<22} {'Mean weight':<14}")
print("-" * 44)

for iteration in range(501):
    # Forward pass
    outputs = S @ W.T  # (n_train, 2)
    pred_mean = outputs[:, 0]
    pred_logvar = outputs[:, 1]
    pred_var = np.exp(np.clip(pred_logvar, -10, 10))

    # log q(theta | x) = -0.5*log(2*pi*var) - 0.5*(theta - mean)^2/var
    log_q = (-0.5*np.log(2*np.pi*pred_var)
             - 0.5*(thetas_train - pred_mean)**2 / pred_var)
    avg_log_q = log_q.mean()

    if iteration % 100 == 0:
        print(f"{iteration:<8} {avg_log_q:<22.4f} {W[0,0]:<14.4f}")

    # Gradient of log q w.r.t. W
    # d/d(mean) = (theta - mean) / var
    # d/d(logvar) = -0.5 + 0.5*(theta - mean)^2 / var
    d_mean = (thetas_train - pred_mean) / pred_var       # (n_train,)
    d_logvar = -0.5 + 0.5*(thetas_train - pred_mean)**2 / pred_var

    # Gradient w.r.t. W: d(log_q)/d(W_mean) = d_mean * S
    grad_W_mean = d_mean[:, None] * S     # (n_train, 3)
    grad_W_logvar = d_logvar[:, None] * S  # (n_train, 3)

    W[0] += lr * grad_W_mean.mean(axis=0)
    W[1] += lr * grad_W_logvar.mean(axis=0)

# Test: new observation
x_test = np.random.normal(2.5, 1.0, n_obs)
s_test = np.array([x_test.mean(), x_test.var(), 1.0])
pred = s_test @ W.T
print(f"\nTest data: x_bar={x_test.mean():.3f}, n={n_obs}")
print(f"Encoder output:  mean={pred[0]:.3f}, var={np.exp(pred[1]):.4f}")
print(f"True posterior:  mean={x_test.mean():.3f}, var={1/n_obs:.4f}")
print(f"\nThe encoder learned the correct posterior mapping!")

Trade-offs.

  • Speed at test time: Inference is a single neural network evaluation (\(\sim\) milliseconds).

  • Up-front cost: Training requires many simulations and can take hours or days.

  • Amortization gap: The network may not perfectly capture the posterior for every individual dataset; refinement with MCMC can close this gap.

20.5 Normalizing Flows

Why flows? A normalizing flow transforms a simple base distribution (e.g., a standard Normal) into a complex target distribution through a sequence of invertible, differentiable maps. This gives us a flexible density estimator with a tractable likelihood.

Change of variables

Let \(\mathbf{z} \sim p_Z(\mathbf{z})\) (the base distribution) and \(\mathbf{x} = f(\mathbf{z})\) where \(f\) is an invertible map. The density of \(\mathbf{x}\) is:

\[p_X(\mathbf{x}) = p_Z\!\bigl(f^{-1}(\mathbf{x})\bigr)\; \left|\det \frac{\partial f^{-1}}{\partial \mathbf{x}}\right|.\]

In plain English: to find the density of \(\mathbf{x}\), first map it back to the base space via \(f^{-1}\) and evaluate the simple base density there. But that alone is not enough — the transformation stretches and compresses space, so we must account for how much the local volume changes. The Jacobian determinant is exactly this volume-change factor. If the transformation expands a region of space, the density must decrease proportionally (probability is conserved), and vice versa.

Taking the logarithm:

\[\log p_X(\mathbf{x}) = \log p_Z\!\bigl(f^{-1}(\mathbf{x})\bigr) + \log\!\left|\det \frac{\partial f^{-1}}{\partial \mathbf{x}}\right|.\]

For a composition of \(K\) invertible maps \(f = f_K \circ \cdots \circ f_1\):

\[\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}_0) + \sum_{k=1}^K \log\!\left|\det \frac{\partial f_k^{-1}}{\partial \mathbf{z}_k}\right|,\]

where \(\mathbf{z}_0 = f_1^{-1} \circ \cdots \circ f_K^{-1}(\mathbf{x})\).

Coupling layers

Computing the full Jacobian determinant is \(O(D^3)\). Coupling layers (Dinh et al., 2015, 2017) split the input into two halves \((\mathbf{x}_A, \mathbf{x}_B)\) and apply:

\[\begin{split}\mathbf{z}_A &= \mathbf{x}_A, \\ \mathbf{z}_B &= \mathbf{x}_B \odot \exp\!\bigl(s(\mathbf{x}_A)\bigr) + t(\mathbf{x}_A),\end{split}\]

where \(s\) and \(t\) are neural networks (the “scale” and “translate” networks, respectively).

Why the Jacobian is cheap. Because \(\mathbf{z}_A\) is simply copied from \(\mathbf{x}_A\) unchanged, and \(\mathbf{z}_B\) depends on \(\mathbf{x}_B\) only through an element-wise affine transformation (scale by \(\exp(s(\mathbf{x}_A))\) and shift by \(t(\mathbf{x}_A)\)), the Jacobian matrix of the entire transformation is triangular. The determinant of a triangular matrix is just the product of its diagonal entries, so:

\[\log\!\left|\det \frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right| = \sum_j s_j(\mathbf{x}_A),\]

which is \(O(D)\) — a massive speedup compared to the \(O(D^3)\) cost of a general determinant. Alternating which half is transformed ensures all dimensions interact.

Simple affine coupling layer: transforming a Gaussian into a non-Gaussian. Let us implement the coupling layer from scratch and show that it can transform a standard 2D Gaussian into a distinctly non-Gaussian shape, with the exact log-density computable at every point via the change-of-variables formula.

# Normalizing flows: affine coupling layer
import numpy as np
from scipy import stats

np.random.seed(42)

# Affine coupling layer in 2D:
#   z_1 = x_1                         (passthrough)
#   z_2 = x_2 * exp(s(x_1)) + t(x_1) (affine transform)
# Inverse:
#   x_1 = z_1
#   x_2 = (z_2 - t(z_1)) * exp(-s(z_1))
# Log-det-Jacobian = s(x_1)

# Simple parameterization: s and t are linear functions
# s(x) = a*x + b,  t(x) = c*x + d
# These parameters would be learned; we set them manually for illustration
a, b = 0.5, 0.3   # scale network
c, d = 1.2, -0.5   # translate network

def s_net(x1):
    return a * x1 + b

def t_net(x1):
    return c * x1 + d

def forward(x):
    """x -> z (from data space to base space)."""
    x1, x2 = x[:, 0], x[:, 1]
    z1 = x1
    z2 = x2 * np.exp(s_net(x1)) + t_net(x1)
    return np.column_stack([z1, z2])

def inverse(z):
    """z -> x (from base space to data space)."""
    z1, z2 = z[:, 0], z[:, 1]
    x1 = z1
    x2 = (z2 - t_net(z1)) * np.exp(-s_net(z1))
    return np.column_stack([x1, x2])

def log_det_jacobian(x):
    """Log |det J| of the forward transform."""
    return s_net(x[:, 0])

# Sample from base distribution (standard normal)
n = 5000
z = np.random.randn(n, 2)

# Transform to data space
x = inverse(z)

# Compute log-density of transformed samples
# log p_X(x) = log p_Z(f(x)) + log|det J_f(x)|
z_check = forward(x)
log_pz = stats.norm.logpdf(z_check[:, 0]) + stats.norm.logpdf(z_check[:, 1])
log_px = log_pz + log_det_jacobian(x)

print("Affine Coupling Layer Flow (2D)")
print("=" * 40)
print(f"\nBase distribution: N(0, I_2)")
print(f"Scale network: s(x1) = {a}*x1 + {b}")
print(f"Translate network: t(x1) = {c}*x1 + {d}")

print(f"\nBase (z) statistics:")
print(f"  mean = ({z[:,0].mean():.3f}, {z[:,1].mean():.3f})")
print(f"  std  = ({z[:,0].std():.3f}, {z[:,1].std():.3f})")
print(f"  corr = {np.corrcoef(z[:,0], z[:,1])[0,1]:.3f}")

print(f"\nTransformed (x) statistics:")
print(f"  mean = ({x[:,0].mean():.3f}, {x[:,1].mean():.3f})")
print(f"  std  = ({x[:,0].std():.3f}, {x[:,1].std():.3f})")
print(f"  corr = {np.corrcoef(x[:,0], x[:,1])[0,1]:.3f}")

print(f"\nLog-density at x=(0,0): {log_px[np.argmin(np.sum(x**2, axis=1))]:.4f}")
print(f"Mean log-density:       {log_px.mean():.4f}")

# Verify invertibility
x_roundtrip = inverse(forward(x))
print(f"\nInvertibility check: max |x - inv(fwd(x))| = "
      f"{np.max(np.abs(x - x_roundtrip)):.2e}")
print("The coupling layer is exactly invertible by construction.")

Applications in likelihood-based inference

  • Density estimation: Fit \(p_X\) to data by maximizing the log-likelihood above.

  • Posterior estimation: Use a conditional flow \(q_\phi(\theta \mid \mathbf{x})\) to approximate the posterior in simulation-based inference (Section 20.3).

  • Variational inference: Use a flow as the variational family \(q_\phi(\theta)\) to make the ELBO tighter (see Chapter 18).

20.6 Variational Autoencoders (VAEs)

Why VAEs? A variational autoencoder defines a generative model with latent variables \(\mathbf{z}\):

\[p_\phi(\mathbf{x}) = \int p_\phi(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}.\]

The marginal likelihood \(p_\phi(\mathbf{x})\) is intractable because of the integral over \(\mathbf{z}\).

The ELBO for VAEs

Introduce an encoder (inference network) \(q_\psi(\mathbf{z} \mid \mathbf{x})\). As derived in Chapter 18:

\[\log p_\phi(\mathbf{x}) \;\ge\; E_{q_\psi(\mathbf{z} \mid \mathbf{x})}\!\bigl[\log p_\phi(\mathbf{x} \mid \mathbf{z})\bigr] - \text{KL}\!\bigl(q_\psi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z})\bigr).\]

The first term is the expected reconstruction likelihood; the second is a regularizer encouraging the approximate posterior to stay close to the prior.

The reparameterization trick

To backpropagate through the expectation we write \(\mathbf{z} = \boldsymbol{\mu}_\psi(\mathbf{x}) + \boldsymbol{\sigma}_\psi(\mathbf{x}) \odot \boldsymbol{\epsilon}\) with \(\boldsymbol{\epsilon} \sim N(\mathbf{0}, \mathbf{I})\). This moves the randomness into \(\boldsymbol{\epsilon}\), making the ELBO differentiable with respect to \(\psi\).

Derivation of the gradient. Let \(g(\psi) = E_{q_\psi}[f(\mathbf{z})]\). The naive score-function estimator has high variance. Using the reparameterization \(\mathbf{z} = h(\boldsymbol{\epsilon}, \psi)\):

\[g(\psi) = E_{p(\boldsymbol{\epsilon})}[f(h(\boldsymbol{\epsilon}, \psi))],\]

so:

\[\nabla_\psi g = E_{p(\boldsymbol{\epsilon})}\!\left[ \nabla_\psi f\!\bigl(h(\boldsymbol{\epsilon}, \psi)\bigr)\right],\]

which can be estimated by a single-sample Monte Carlo average and differentiated through \(h\) using AD (Section 20.1).

In plain English: the reparameterization trick moves the randomness out of the distribution we are differentiating. Instead of sampling \(\mathbf{z}\) from a distribution that depends on \(\psi\) (which makes the gradient hard to compute), we sample fixed noise \(\boldsymbol{\epsilon}\) from a standard Normal and then deterministically transform it into \(\mathbf{z}\) using the function \(h\). Now the gradient with respect to \(\psi\) flows straight through \(h\) via automatic differentiation, giving low-variance gradient estimates that make training practical.

Connection to likelihood

Training a VAE by maximizing the ELBO is a form of approximate maximum likelihood — we maximize a lower bound on \(\log p_\phi(\mathbf{x})\). As the variational family becomes more expressive (e.g., using normalizing flows for \(q\)), the bound tightens and we approach exact MLE.

Computing the ELBO for a simple latent variable model. Let us make the ELBO concrete. We have a 1D latent variable \(z \sim N(0,1)\) and an observation model \(x \mid z \sim N(z, \sigma^2)\). The true marginal is \(x \sim N(0, 1 + \sigma^2)\), so we can check that our ELBO is indeed a lower bound on the true log-marginal.

# VAE ELBO: compute and verify it is a lower bound on log p(x)
import numpy as np
from scipy import stats

np.random.seed(42)

# Generative model: z ~ N(0,1), x|z ~ N(z, sigma_x^2)
sigma_x = 0.5
# True marginal: x ~ N(0, 1 + sigma_x^2)
true_marginal_var = 1.0 + sigma_x**2

# Generate observed data
n = 200
z_true = np.random.normal(0, 1, n)
x_obs = z_true + np.random.normal(0, sigma_x, n)

# Variational posterior: q(z|x) = N(mu_q(x), sigma_q^2(x))
# Optimal: posterior is N(x/(1+sigma_x^2), sigma_x^2/(1+sigma_x^2))
# Let us parameterize and compute the ELBO

def compute_elbo(x, mu_q, log_sigma_q, sigma_x, n_samples=100):
    """Monte Carlo estimate of the ELBO for each observation."""
    sigma_q = np.exp(log_sigma_q)
    elbos = np.zeros(len(x))
    for i in range(len(x)):
        eps = np.random.normal(0, 1, n_samples)
        z = mu_q[i] + sigma_q[i] * eps  # reparameterization trick

        # E_q[log p(x|z)] : reconstruction term
        log_p_x_given_z = stats.norm.logpdf(x[i], loc=z, scale=sigma_x)
        reconstruction = log_p_x_given_z.mean()

        # KL(q(z|x) || p(z)) : analytical for two Gaussians
        kl = 0.5 * (sigma_q[i]**2 + mu_q[i]**2 - 1 - 2*log_sigma_q[i])

        elbos[i] = reconstruction - kl
    return elbos

# Compute with optimal variational parameters
post_var = sigma_x**2 / (1 + sigma_x**2)
mu_q_opt = x_obs / (1 + sigma_x**2)
log_sigma_q_opt = 0.5 * np.log(post_var) * np.ones(n)

elbos = compute_elbo(x_obs, mu_q_opt, log_sigma_q_opt, sigma_x)

# True log-marginal
log_marginals = stats.norm.logpdf(x_obs, loc=0, scale=np.sqrt(true_marginal_var))

print(f"ELBO (total):           {elbos.sum():.2f}")
print(f"True log-marginal:      {log_marginals.sum():.2f}")
print(f"Gap (>= 0):             {log_marginals.sum() - elbos.sum():.2f}")
print(f"\nPer-observation:")
print(f"  Mean ELBO:            {elbos.mean():.4f}")
print(f"  Mean log p(x):        {log_marginals.mean():.4f}")
print(f"  Mean gap:             {(log_marginals - elbos).mean():.4f}")
print(f"\nELBO <= log p(x) for each observation?  "
      f"{np.all(log_marginals >= elbos - 0.1)}")
print("(Small violations possible due to MC noise in ELBO estimate)")

20.7 Score Matching

Why score matching? In some models the density \(p_\theta(\mathbf{x})\) involves an intractable normalizing constant \(Z(\theta)\):

\[p_\theta(\mathbf{x}) = \frac{\tilde{p}_\theta(\mathbf{x})}{Z(\theta)}, \qquad Z(\theta) = \int \tilde{p}_\theta(\mathbf{x})\, d\mathbf{x}.\]

Maximum likelihood is intractable because computing or differentiating \(Z(\theta)\) is infeasible.

The score function is the gradient of the log-density with respect to \(\mathbf{x}\):

\[\mathbf{s}_\theta(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}_\theta(\mathbf{x}),\]

where the normalizing constant cancels (it does not depend on \(\mathbf{x}\)).

Note a subtle but important distinction: throughout most of this guide, the “score function” referred to the gradient of the log-likelihood with respect to the parameters \(\theta\). Here, the score is the gradient with respect to the data \(\mathbf{x}\). This data-space score tells you which direction to move a data point to increase its probability under the model — like a vector field pointing “uphill” on the probability landscape. The key advantage is that this score does not depend on the normalizing constant \(Z(\theta)\) at all, since \(\nabla_{\mathbf{x}} \log Z(\theta) = 0\).

Hyvarinen (2005) proposed minimizing the Fisher divergence:

\[J(\theta) = \frac{1}{2} E_{p_{\text{data}}}\!\left[ \|\mathbf{s}_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})\|^2 \right].\]

This involves the unknown data score, but integration by parts yields an equivalent explicit score matching objective that only requires the model score and its derivatives — not the unknown data score:

\[J(\theta) = E_{p_{\text{data}}}\!\left[ \frac{1}{2}\|\mathbf{s}_\theta(\mathbf{x})\|^2 + \text{tr}\!\bigl(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})\bigr) \right] + \text{const}.\]

Reading this formula: the first term, \(\frac{1}{2}\|\mathbf{s}_\theta\|^2\), penalizes model scores that are too large (the model thinks the data points should be moved far). The second term, \(\text{tr}(\nabla_{\mathbf{x}} \mathbf{s}_\theta)\), is the divergence of the score field — it penalizes the model for having scores that converge too sharply (creating overly peaked densities). Together, these two terms replace the need to know the true data score, and minimizing this objective makes the model score match the data score.

Derivation sketch. Expand the squared norm:

\[\|\mathbf{s}_\theta - \mathbf{s}_{\text{data}}\|^2 = \|\mathbf{s}_\theta\|^2 - 2\, \mathbf{s}_\theta \cdot \mathbf{s}_{\text{data}} + \|\mathbf{s}_{\text{data}}\|^2.\]

The last term is constant. For the cross-term, using integration by parts (assuming boundary terms vanish):

\[\begin{split}E_{p_{\text{data}}}[\mathbf{s}_\theta \cdot \mathbf{s}_{\text{data}}] &= \int p_{\text{data}}(\mathbf{x})\, \mathbf{s}_\theta(\mathbf{x}) \cdot \nabla_{\mathbf{x}} \log p_{\text{data}}(\mathbf{x})\, d\mathbf{x} \\ &= \int \mathbf{s}_\theta(\mathbf{x}) \cdot \nabla_{\mathbf{x}} p_{\text{data}}(\mathbf{x})\, d\mathbf{x} \\ &= -\int p_{\text{data}}(\mathbf{x})\, \text{tr}\!\bigl(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})\bigr)\, d\mathbf{x} \\ &= -E_{p_{\text{data}}}\!\bigl[ \text{tr}\!\bigl(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})\bigr)\bigr].\end{split}\]

Substituting gives the explicit objective.

Score matching in 1D: fitting a Gaussian without the normalizing constant. To make score matching concrete, let us fit a 1D Gaussian to data using only the score function (derivative of log-density w.r.t. x), without ever computing the normalizing constant \(\sqrt{2\pi}\sigma\). The model score for \(\tilde{p}_\theta(x) = \exp(-\frac{1}{2}(x - \mu)^2/\sigma^2)\) is \(s_\theta(x) = -(x - \mu)/\sigma^2\), and the explicit score matching objective has a closed-form gradient we can optimize.

# Score matching: fit a Gaussian without using the normalizing constant
import numpy as np

np.random.seed(42)

# True distribution: N(mu_true, sigma_true^2)
mu_true, sigma_true = 2.0, 1.5
n = 500
data = np.random.normal(mu_true, sigma_true, n)

# Model score: s_theta(x) = -(x - mu) / sigma^2
# ds/dx = -1/sigma^2
# Score matching objective: E[0.5 * s^2 + ds/dx]
#   = E[0.5*(x-mu)^2/sigma^4 - 1/sigma^2]

def score_matching_objective(mu, sigma):
    """Explicit score matching objective (Hyvarinen 2005)."""
    s2 = sigma**2
    score_sq = 0.5 * np.mean((data - mu)**2) / s2**2
    trace_term = -1.0 / s2
    return score_sq + trace_term

# Gradient descent on the score matching objective
mu_hat, sigma_hat = 0.0, 1.0
lr_mu, lr_sigma = 0.01, 0.01
h = 1e-6

print(f"{'Iter':<8} {'mu':<10} {'sigma':<10} {'SM objective':<14}")
print("-" * 42)
for iteration in range(201):
    obj = score_matching_objective(mu_hat, sigma_hat)
    if iteration % 40 == 0:
        print(f"{iteration:<8} {mu_hat:<10.4f} {sigma_hat:<10.4f} {obj:<14.6f}")

    # Numerical gradients
    g_mu = (score_matching_objective(mu_hat + h, sigma_hat)
            - score_matching_objective(mu_hat - h, sigma_hat)) / (2*h)
    g_sigma = (score_matching_objective(mu_hat, sigma_hat + h)
               - score_matching_objective(mu_hat, sigma_hat - h)) / (2*h)

    mu_hat -= lr_mu * g_mu
    sigma_hat -= lr_sigma * g_sigma

print(f"\nScore matching estimates: mu={mu_hat:.4f}, sigma={sigma_hat:.4f}")
print(f"MLE estimates:           mu={data.mean():.4f}, sigma={data.std():.4f}")
print(f"True values:             mu={mu_true}, sigma={sigma_true}")
print(f"\nWe recovered the parameters without ever computing sqrt(2*pi*sigma^2)!")

Denoising score matching. Song and Ermon (2019) showed that adding noise to the data and matching the score of the noisy distribution avoids computing the trace of the Jacobian, which is expensive in high dimensions. This leads to score-based diffusion models (the engine behind modern image generation).

20.8 Connections Back to Classical Likelihood

Despite their apparent novelty, all of the methods in this chapter are intimately connected to the classical likelihood framework:

  1. Automatic differentiation computes the exact same gradients that classical score equations describe, just algorithmically rather than analytically.

  2. Neural density estimation is maximum likelihood estimation where the model family is a neural network instead of a parametric family.

  3. Simulation-based inference seeks to perform likelihood-based inference (or Bayesian inference with a likelihood) even when \(p(\mathbf{x} \mid \theta)\) cannot be evaluated — it approximates the likelihood or the posterior.

  4. Amortized inference is variational inference (a lower bound on the log-likelihood) with the variational parameters predicted by a neural network.

  5. Normalizing flows exploit the change-of-variables formula to compute exact likelihoods for flexible distributions.

  6. VAEs maximize a lower bound on the marginal log-likelihood.

  7. Score matching estimates the model by matching derivatives of the log-likelihood with respect to the data (rather than the parameters), sidestepping the normalizing constant.

The unifying theme is that the likelihood function — or some tractable proxy — remains the objective guiding learning and inference. What has changed is the expressiveness of the model family and the computational tools available.

20.9 Summary

  • Automatic differentiation (especially reverse mode) enables gradient-based optimization and sampling for models with thousands to millions of parameters.

  • Neural density estimators (MDNs, autoregressive models) turn MLE into training a neural network.

  • Simulation-based inference (ABC, NPE, NLE, NRE) extends Bayesian reasoning to models where the likelihood is intractable.

  • Amortized inference pre-trains a network to produce posteriors instantly for any new dataset.

  • Normalizing flows provide flexible, invertible density models with exact log-likelihoods via the change-of-variables formula.

  • VAEs maximize a lower bound (ELBO) on the marginal log-likelihood using the reparameterization trick.

  • Score matching avoids the normalizing constant by matching gradients of the log-density, and is the foundation of modern diffusion models.

  • Every modern method traces back to the classical likelihood principle: the likelihood function, or an approximation thereof, remains the engine of statistical learning.