Appendix B: Calculus Review¶
This appendix reviews the calculus tools that underpin likelihood-based inference. Every rule and formula is tied to a concrete statistical application and verified in code. The goal is not to re-teach calculus but to show why each tool matters the moment you open a likelihood function.
If you have taken a standard calculus sequence, the rules below will be familiar. What is different here is the lens: every derivative is a score function, every integral is a normalizing constant, and every Taylor expansion is an approximation to a log-likelihood.
Differentiation Rules¶
Throughout, we let \(f\) and \(g\) denote differentiable real-valued functions.
Sum Rule¶
Why it matters for statistics: the sum rule is the reason log-likelihoods are so much easier than likelihoods. Taking the log of a product turns it into a sum, and then we differentiate each observation’s contribution independently.
# Sum rule in action: log-likelihood of n independent observations
# is a SUM of individual log-likelihoods
import numpy as np
from scipy.stats import poisson
data = np.array([3, 5, 2, 7, 4])
lam = 4.0
# Log-likelihood: sum of individual terms
individual_terms = [poisson.logpmf(x, mu=lam) for x in data]
total_ll = sum(individual_terms)
print("Individual log-likelihood contributions:")
for i, (x, ll) in enumerate(zip(data, individual_terms)):
print(f" x_{i} = {x}: log f(x|lam) = {ll:.4f}")
print(f"Total log-likelihood = {total_ll:.4f}")
print(f"Verify: {np.sum(poisson.logpmf(data, mu=lam)):.4f}")
Product Rule¶
You encounter the product rule when differentiating likelihood contributions where both the location and scale depend on the same parameter — for example, in heteroscedastic models.
# Product rule: derivative of x * exp(-x^2)
# This shape appears in the score of a Rayleigh distribution
import numpy as np
x = 2.0
# f(x) = x, g(x) = exp(-x^2)
# d/dx [x * exp(-x^2)] = 1 * exp(-x^2) + x * (-2x) * exp(-x^2)
# = exp(-x^2) * (1 - 2x^2)
analytical = np.exp(-x**2) * (1 - 2 * x**2)
eps = 1e-7
h = lambda t: t * np.exp(-t**2)
numerical = (h(x + eps) - h(x)) / eps
print(f"Product rule (analytical): {analytical:.8f}")
print(f"Finite difference: {numerical:.8f}")
Chain Rule¶
If \(h(x) = f(g(x))\), then:
The chain rule is the single most important differentiation tool in likelihood theory. The score function is the chain rule applied to \(\log(\cdot)\) composed with \(L(\cdot)\):
# Chain rule = score function: d/d(theta) log L(theta)
import numpy as np
from scipy.stats import poisson
data = np.array([3, 5, 2, 7, 4])
theta = 4.0 # Poisson rate
# The chain rule gives us the score:
# d/d(lam) log L = d/d(lam) [sum x_i * log(lam) - n*lam - sum log(x_i!)]
# = sum(x_i) / lam - n
score_chain_rule = data.sum() / theta - len(data)
# Verify numerically
def loglik(lam):
return np.sum(poisson.logpmf(data, mu=lam))
eps = 1e-7
score_numerical = (loglik(theta + eps) - loglik(theta)) / eps
print(f"Score via chain rule: {score_chain_rule:.8f}")
print(f"Score via finite diff: {score_numerical:.8f}")
print(f"MLE (where score = 0): lam_hat = {data.mean()}")
Logarithmic differentiation. For \(L(\theta) > 0\):
This converts differentiating a product (likelihood) into differentiating a sum (log-likelihood) — one of the most powerful tricks in statistics.
# Logarithmic differentiation for a Binomial likelihood
# L(p) = C * p^s * (1-p)^(n-s)
# log L = const + s*log(p) + (n-s)*log(1-p)
# d/dp log L = s/p - (n-s)/(1-p)
import numpy as np
n, s = 100, 73
p = 0.7
# Score via log differentiation
score = s / p - (n - s) / (1 - p)
# Score via direct differentiation of L, then dividing by L
# dL/dp = C * [s * p^(s-1) * (1-p)^(n-s) - (n-s) * p^s * (1-p)^(n-s-1)]
# dL/dp / L = s/p - (n-s)/(1-p) (same thing!)
print(f"Score at p={p}: {score:.4f}")
print(f"Score at MLE p={s/n}: {s/(s/n) - (n-s)/(1-s/n):.10f}")
Partial Derivatives and Gradients¶
For a function \(f : \mathbb{R}^n \to \mathbb{R}\), the gradient is
Setting \(\nabla \ell = \mathbf{0}\) gives the MLE in multiparameter models.
# Gradient of a Normal log-likelihood with two parameters (mu, sigma)
import numpy as np
from scipy.stats import norm
np.random.seed(42)
data = np.random.normal(loc=3.0, scale=2.0, size=100)
def loglik(mu, log_sigma):
sigma = np.exp(log_sigma)
return np.sum(norm.logpdf(data, loc=mu, scale=sigma))
# Analytical gradient for Normal:
# d/d(mu) = sum(x_i - mu) / sigma^2
# d/d(sigma) = -n/sigma + sum(x_i - mu)^2 / sigma^3
mu_hat = data.mean()
sigma_hat = data.std()
grad_mu = np.sum(data - mu_hat) / sigma_hat**2
grad_sigma = -len(data) / sigma_hat + np.sum((data - mu_hat)**2) / sigma_hat**3
print(f"MLE: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
print(f"Gradient at MLE: d/d(mu) = {grad_mu:.6f}, d/d(sigma) = {grad_sigma:.6f}")
print(f"(Both should be ~0 at the MLE)")
# Numerical verification
eps = 1e-7
grad_mu_num = (loglik(mu_hat + eps, np.log(sigma_hat))
- loglik(mu_hat, np.log(sigma_hat))) / eps
print(f"Numerical d/d(mu) at MLE: {grad_mu_num:.6f}")
Multivariate chain rule and the Jacobian. If \(f\) depends on \(\mathbf{x}\) and \(\mathbf{x}\) depends on \(\mathbf{t}\):
where \(\mathbf{J}\) is the Jacobian. This is the foundation of both backpropagation and the delta method in statistics.
# Delta method: SE of a transformation g(theta_hat)
# If theta_hat ~ N(theta, se^2), then g(theta_hat) ~ N(g(theta), [g'(theta)]^2 * se^2)
import numpy as np
# Example: Poisson rate lambda, want SE of exp(-lambda) = P(X=0)
np.random.seed(42)
data = np.random.poisson(3.0, size=100)
lam_hat = data.mean()
se_lam = np.sqrt(lam_hat / len(data))
# g(lambda) = exp(-lambda), g'(lambda) = -exp(-lambda)
g_hat = np.exp(-lam_hat)
g_prime = -np.exp(-lam_hat)
se_g = abs(g_prime) * se_lam # delta method
print(f"lam_hat = {lam_hat:.4f}, SE(lam) = {se_lam:.4f}")
print(f"P(X=0) = exp(-lam_hat) = {g_hat:.6f}")
print(f"SE of P(X=0) via delta method = {se_g:.6f}")
print(f"95% CI for P(X=0): [{g_hat - 1.96*se_g:.6f}, {g_hat + 1.96*se_g:.6f}]")
Taylor Series¶
Univariate Taylor Expansion¶
The second-order expansion is the basis for the Laplace approximation and for understanding the curvature of the log-likelihood near its maximum.
# Taylor approximations to log(x) around a=1
# This is relevant because log appears in every log-likelihood
import numpy as np
a = 1.0
x_test = 1.5
f_true = np.log(x_test)
taylor_1 = (x_test - a) # order 1
taylor_2 = taylor_1 - (1/2) * (x_test - a)**2 # order 2
taylor_3 = taylor_2 + (1/3) * (x_test - a)**3 # order 3
print(f"log({x_test}) = {f_true:.6f}")
print(f"{'Order':>6} {'Approx':>10} {'Error':>10}")
print("-" * 30)
for order, approx in [(1, taylor_1), (2, taylor_2), (3, taylor_3)]:
print(f"{order:6d} {approx:10.6f} {abs(approx - f_true):10.6f}")
Application: quadratic approximation to the log-likelihood¶
Expanding \(\ell(\theta)\) about the MLE \(\hat\theta\) (where \(\ell'(\hat\theta) = 0\)):
The linear term vanishes because the score is zero at the MLE. What remains is a quadratic — and a quadratic in the exponent is a Gaussian. This is precisely why the MLE is approximately normally distributed.
# Quadratic approximation to a Poisson log-likelihood
import numpy as np
from scipy.stats import poisson
np.random.seed(42)
data = np.array([3, 7, 2, 5, 4, 6, 3, 5])
n = len(data)
theta_mle = data.mean()
def loglik(theta):
return np.sum(poisson.logpmf(data, mu=theta))
# Analytical Hessian: d^2 ell / d(lambda)^2 = -sum(x_i) / lambda^2
hessian = -data.sum() / theta_mle**2
obs_info = -hessian # observed Fisher information
# Quadratic approximation
theta_grid = np.linspace(2.0, 7.0, 200)
ell_true = np.array([loglik(t) for t in theta_grid])
ell_quad = loglik(theta_mle) + 0.5 * hessian * (theta_grid - theta_mle)**2
# Compare at specific points
print(f"MLE = {theta_mle:.2f}")
print(f"Observed info = {obs_info:.4f}")
print(f"SE = 1/sqrt(info) = {1/np.sqrt(obs_info):.4f}")
print(f"Exact SE = sqrt(lam/n) = {np.sqrt(theta_mle/n):.4f}")
print()
print(f"{'theta':>8} {'True ell':>12} {'Quad approx':>12} {'Error':>10}")
print("-" * 46)
for t in [3.0, 3.5, 4.0, 4.375, 5.0, 5.5, 6.0]:
true_val = loglik(t)
quad_val = loglik(theta_mle) + 0.5 * hessian * (t - theta_mle)**2
print(f"{t:8.2f} {true_val:12.4f} {quad_val:12.4f} {abs(true_val-quad_val):10.4f}")
Common Pitfall
The quadratic approximation is only good near the MLE. If the true log-likelihood is strongly skewed (small samples, parameters near boundaries), Wald intervals based on this approximation can be misleading. Profile likelihood intervals are more robust.
Multivariate Taylor Expansion¶
For \(f : \mathbb{R}^n \to \mathbb{R}\), expanded about \(\mathbf{a}\):
where \(\mathbf{H}\) is the Hessian matrix.
# Multivariate Taylor: Normal log-likelihood with (mu, log_sigma)
import numpy as np
from scipy.stats import norm
np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=50)
n = len(data)
def loglik(params):
mu, log_sig = params
sig = np.exp(log_sig)
return np.sum(norm.logpdf(data, mu, sig))
# MLE
mu_hat = data.mean()
sig_hat = data.std()
mle = np.array([mu_hat, np.log(sig_hat)])
# Numerical Hessian at MLE
eps = 1e-5
H = np.zeros((2, 2))
for i in range(2):
for j in range(2):
pp = mle.copy(); pp[i] += eps; pp[j] += eps
pm = mle.copy(); pm[i] += eps; pm[j] -= eps
mp = mle.copy(); mp[i] -= eps; mp[j] += eps
mm = mle.copy(); mm[i] -= eps; mm[j] -= eps
H[i, j] = (loglik(pp) - loglik(pm) - loglik(mp) + loglik(mm)) / (4 * eps**2)
obs_info = -H
cov = np.linalg.inv(obs_info)
se = np.sqrt(np.diag(cov))
print(f"MLE: mu = {mu_hat:.4f}, sigma = {sig_hat:.4f}")
print(f"SEs: SE(mu) = {se[0]:.4f}, SE(log sigma) = {se[1]:.4f}")
print(f"\nHessian at MLE:")
print(np.array2string(H, precision=2))
print(f"\nObserved information (= -H):")
print(np.array2string(obs_info, precision=2))
Optimization¶
Necessary Conditions for Extrema¶
First-order condition (score equation):
Second derivative test: at a critical point \(x^*\):
\(f''(x^*) < 0 \implies\) local maximum (what we want for log-likelihoods).
The negative Hessian \(-\mathbf{H}\) is the observed information matrix.
# Finding the MLE numerically: Gamma distribution
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma
np.random.seed(42)
alpha_true, beta_true = 3.0, 2.0
data = gamma.rvs(a=alpha_true, scale=1/beta_true, size=50)
def neg_loglik(alpha):
return -np.sum(gamma.logpdf(data, a=alpha, scale=1/beta_true))
result = minimize_scalar(neg_loglik, bounds=(0.5, 10.0), method='bounded')
alpha_hat = result.x
# Second derivative test: should be positive for neg_loglik (= minimum)
eps = 1e-5
d2 = (neg_loglik(alpha_hat+eps) - 2*neg_loglik(alpha_hat)
+ neg_loglik(alpha_hat-eps)) / eps**2
se = 1.0 / np.sqrt(d2) # d2 = observed info for neg_loglik
print(f"MLE alpha = {alpha_hat:.4f} (true: {alpha_true})")
print(f"d^2(-ell)/d(alpha)^2 at MLE = {d2:.4f} (>0 confirms minimum of -ell)")
print(f"SE(alpha) = {se:.4f}")
print(f"95% CI: [{alpha_hat - 1.96*se:.4f}, {alpha_hat + 1.96*se:.4f}]")
Convexity¶
A function \(f\) is convex if for all \(\mathbf{x}, \mathbf{y}\) and \(t \in [0,1]\):
Many log-likelihoods for exponential family distributions are concave (negative Hessian is positive definite), which guarantees the MLE is unique.
# Concavity check: logistic regression log-likelihood
import numpy as np
np.random.seed(42)
n = 50
X = np.column_stack([np.ones(n), np.random.randn(n)])
beta_true = np.array([0.5, 1.5])
prob = 1 / (1 + np.exp(-X @ beta_true))
y = (np.random.rand(n) < prob).astype(float)
def loglik(beta):
eta = X @ beta
return np.sum(y * eta - np.log(1 + np.exp(eta)))
# Check Hessian at several points --- should always be negative semi-definite
for beta_test in [np.array([0, 0]), np.array([1, 2]), np.array([-1, 0.5])]:
eps = 1e-5
H = np.zeros((2, 2))
for i in range(2):
for j in range(2):
bp, bm = beta_test.copy(), beta_test.copy()
bpp = beta_test.copy(); bpp[i] += eps; bpp[j] += eps
bpm = beta_test.copy(); bpm[i] += eps; bpm[j] -= eps
bmp = beta_test.copy(); bmp[i] -= eps; bmp[j] += eps
bmm = beta_test.copy(); bmm[i] -= eps; bmm[j] -= eps
H[i, j] = (loglik(bpp) - loglik(bpm)
- loglik(bmp) + loglik(bmm)) / (4 * eps**2)
eigenvalues = np.linalg.eigvalsh(H)
print(f"beta = {beta_test}: eigenvalues of H = "
f"[{eigenvalues[0]:.3f}, {eigenvalues[1]:.3f}] "
f"{'<= 0 (concave)' if all(eigenvalues <= 1e-10) else 'NOT concave'}")
Why concavity matters
When the log-likelihood is concave, any critical point is the global maximum. This is the case for logistic regression, Poisson regression, and all canonical-link GLMs. When the log-likelihood is not concave (mixture models, neural networks), algorithms like EM may converge to local maxima.
Integration Techniques¶
Integration by Substitution¶
This is how we compute normalizing constants of transformed random variables.
# Substitution: verify that Y = log(X) has density f_Y(y) = f_X(e^y) * e^y
# when X ~ Gamma(alpha, beta)
import numpy as np
from scipy.stats import gamma
from scipy import integrate
alpha, beta = 3.0, 2.0
# The density of Y = log(X) where X ~ Gamma
def f_Y(y):
x = np.exp(y)
return gamma.pdf(x, a=alpha, scale=1/beta) * x # Jacobian |dx/dy| = e^y
# Verify it integrates to 1
integral, _ = integrate.quad(f_Y, -10, 20)
print(f"int f_Y(y) dy = {integral:.8f} (should be 1)")
# Verify moments: E[Y] = E[log X] = psi(alpha) - log(beta) (digamma)
from scipy.special import digamma
EY_theory = digamma(alpha) - np.log(beta)
EY_numerical, _ = integrate.quad(lambda y: y * f_Y(y), -10, 20)
print(f"E[log X] numerical: {EY_numerical:.6f}")
print(f"E[log X] theory: {EY_theory:.6f}")
Integration by Parts¶
Integration by parts is essential for deriving moments and for the Gamma function recursion \(\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha)\).
# Integration by parts proves Gamma(alpha+1) = alpha * Gamma(alpha)
# Verify computationally:
import numpy as np
from scipy.special import gamma as gamma_fn
for alpha in [1.0, 2.5, 5.0, 10.0]:
lhs = gamma_fn(alpha + 1)
rhs = alpha * gamma_fn(alpha)
print(f"Gamma({alpha+1:.1f}) = {lhs:.6f}, "
f"{alpha:.1f} * Gamma({alpha:.1f}) = {rhs:.6f}, "
f"match: {np.isclose(lhs, rhs)}")
Leibniz Integral Rule¶
When the limits are constant:
This rule justifies differentiating under the integral sign — the technical step behind deriving score functions and Fisher information from the likelihood.
# Leibniz rule: d/d(beta) of the Gamma normalizing constant
import numpy as np
from scipy import integrate
from scipy.special import gamma as gamma_fn
alpha = 3.0
beta = 2.0
# I(beta) = int_0^inf x^(a-1) exp(-beta*x) dx = Gamma(a) / beta^a
def I(b):
val, _ = integrate.quad(lambda x: x**(alpha-1) * np.exp(-b*x), 0, np.inf)
return val
# Numerical dI/dbeta
eps = 1e-7
dI_numerical = (I(beta + eps) - I(beta)) / eps
# Analytical: dI/dbeta = -alpha * Gamma(alpha) / beta^(alpha+1)
dI_analytical = -alpha * gamma_fn(alpha) / beta**(alpha + 1)
print(f"dI/d(beta) numerical: {dI_numerical:.8f}")
print(f"dI/d(beta) analytical: {dI_analytical:.8f}")
When Leibniz fails
The Leibniz rule requires the support to be independent of \(\theta\). When the support does depend on the parameter — for example, the Uniform(0, \(\theta\)) distribution — the rule fails, and the MLE behaves in non-standard ways (the MLE is \(\max(x_i)\) and converges at rate \(1/n\) instead of \(1/\sqrt{n}\)).
# When Leibniz fails: Uniform(0, theta) MLE
import numpy as np
np.random.seed(42)
theta_true = 5.0
for n in [10, 100, 1000, 10000]:
data = np.random.uniform(0, theta_true, size=n)
theta_hat = data.max() # MLE
# Bias and rate of convergence
error = theta_true - theta_hat
print(f"n = {n:5d}: MLE = {theta_hat:.6f}, "
f"error = {error:.6f}, "
f"n * error = {n * error:.4f}")
print("\n(n * error stays roughly constant => O(1/n) rate, not O(1/sqrt(n)))")
The Gamma Function¶
The Gamma function is the normalizing constant that makes the Gamma, Beta, Chi-squared, Student’s t, and F distributions integrate to one.
# Gamma function: recursion, factorials, and special values
import numpy as np
from scipy.special import gamma as gamma_fn
from math import factorial
# Recursion: Gamma(a+1) = a * Gamma(a)
alpha = 3.7
print(f"Gamma({alpha+1}) = {gamma_fn(alpha+1):.6f}")
print(f"{alpha} * Gamma({alpha}) = {alpha * gamma_fn(alpha):.6f}")
# Factorial connection
print(f"\n{'n':>3} {'Gamma(n)':>12} {'(n-1)!':>12}")
for n in range(1, 7):
print(f"{n:3d} {gamma_fn(n):12.1f} {factorial(n-1):12d}")
# Special value
print(f"\nGamma(1/2) = {gamma_fn(0.5):.6f}")
print(f"sqrt(pi) = {np.sqrt(np.pi):.6f}")
The digamma function¶
The digamma function appears in the score function of every distribution whose normalizing constant involves \(\Gamma(\alpha)\).
# Digamma: the score function of the Gamma normalizing constant
import numpy as np
from scipy.special import digamma, gammaln
alpha = 2.5
# Analytical
psi = digamma(alpha)
# Numerical: d/d(alpha) log Gamma(alpha)
eps = 1e-7
psi_num = (gammaln(alpha + eps) - gammaln(alpha)) / eps
print(f"digamma({alpha}) = {psi:.8f}")
print(f"numerical: {psi_num:.8f}")
# Application: score of Gamma(alpha, beta) w.r.t. alpha
beta = 2.0
np.random.seed(42)
data = np.random.gamma(alpha, 1/beta, size=100)
# Score: log(beta) - psi(alpha) + mean(log x)
score_alpha = np.log(beta) - digamma(alpha) + np.mean(np.log(data))
print(f"\nGamma score w.r.t. alpha at true value: {score_alpha:.4f} (should be ~0)")
The Beta function¶
# Beta function: integration vs Gamma formula
import numpy as np
from scipy.special import gamma as gamma_fn, beta as beta_fn
from scipy import integrate
alpha, beta = 2.5, 3.0
numerical, _ = integrate.quad(lambda t: t**(alpha-1) * (1-t)**(beta-1), 0, 1)
via_gamma = gamma_fn(alpha) * gamma_fn(beta) / gamma_fn(alpha + beta)
via_scipy = beta_fn(alpha, beta)
print(f"B({alpha},{beta}) by integration: {numerical:.8f}")
print(f"B({alpha},{beta}) via Gamma: {via_gamma:.8f}")
print(f"B({alpha},{beta}) via scipy: {via_scipy:.8f}")
# Application: verify Beta(alpha, beta) density integrates to 1
def beta_pdf(x, a, b):
return x**(a-1) * (1-x)**(b-1) / beta_fn(a, b)
integral, _ = integrate.quad(beta_pdf, 0, 1, args=(alpha, beta))
print(f"\nint Beta({alpha},{beta}) pdf = {integral:.8f} (should be 1)")
Stirling’s Approximation¶
Stirling’s approximation simplifies log-likelihoods involving factorials (multinomial, Poisson limit of Binomial) and appears in asymptotic arguments for MLE normality.
# Stirling's approximation: accuracy at various n
import numpy as np
from scipy.special import gammaln
print(f"{'n':>6} {'log(n!)':>12} {'Stirling':>12} {'Rel Error':>12}")
print("-" * 48)
for n in [5, 10, 20, 50, 100, 1000]:
exact = gammaln(n + 1)
stirling = n * np.log(n) - n + 0.5 * np.log(2 * np.pi * n)
rel_err = abs(stirling - exact) / abs(exact)
print(f"{n:>6} {exact:>12.4f} {stirling:>12.4f} {rel_err:>12.2e}")
print("\n(Relative error drops below 1% for n >= 10)")
# Application: Stirling in the Poisson-Binomial connection
# P(X = k) for Binomial(n, lam/n) -> Poisson(lam) as n -> inf
import numpy as np
from scipy.special import comb
from scipy.stats import poisson
lam = 3.0
k = 4
print(f"P(X={k}) for Binomial(n, {lam}/n) vs Poisson({lam}):")
print(f"{'n':>8} {'Binomial':>12} {'Poisson':>12} {'Abs Diff':>12}")
print("-" * 48)
for n in [10, 50, 100, 1000, 10000]:
p = lam / n
binom_prob = comb(n, k, exact=True) * p**k * (1-p)**(n-k)
pois_prob = poisson.pmf(k, lam)
print(f"{n:8d} {binom_prob:12.8f} {pois_prob:12.8f} {abs(binom_prob-pois_prob):12.2e}")
Useful Integrals¶
Gaussian Integral¶
This identity is used every time we derive the normalizing constant of a normal distribution or “complete the square” in an exponent.
# Gaussian integrals: numerical verification
import numpy as np
from scipy import integrate
# Basic: int exp(-x^2) dx = sqrt(pi)
val, _ = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(f"int exp(-x^2) dx = {val:.6f}, sqrt(pi) = {np.sqrt(np.pi):.6f}")
# With parameters: int exp(-ax^2 + bx) dx
a, b = 2.0, 3.0
val_ab, _ = integrate.quad(lambda x: np.exp(-a*x**2 + b*x), -np.inf, np.inf)
expected = np.sqrt(np.pi / a) * np.exp(b**2 / (4*a))
print(f"int exp(-{a}x^2+{b}x) dx = {val_ab:.6f}, formula = {expected:.6f}")
# Application: verify Normal density integrates to 1
mu, sigma = 3.0, 2.0
normal_integral, _ = integrate.quad(
lambda x: (1/(sigma*np.sqrt(2*np.pi))) * np.exp(-(x-mu)**2/(2*sigma**2)),
-np.inf, np.inf)
print(f"\nN({mu},{sigma}^2) integrates to {normal_integral:.8f}")
# Completing the square: the trick behind conjugate Bayesian updates
# Posterior of Normal mean with Normal prior
import numpy as np
from scipy import integrate
# Prior: mu ~ N(mu_0, tau^2)
mu_0, tau = 0.0, 5.0
# Data: x_1, ..., x_n ~ N(mu, sigma^2), sigma known
np.random.seed(42)
sigma = 2.0
data = np.random.normal(3.0, sigma, size=10)
n = len(data)
x_bar = data.mean()
# Posterior by completing the square:
# precision: 1/tau^2 + n/sigma^2
post_prec = 1/tau**2 + n/sigma**2
post_var = 1 / post_prec
post_mean = post_var * (mu_0/tau**2 + n*x_bar/sigma**2)
print(f"Prior: N({mu_0}, {tau**2})")
print(f"Data mean: {x_bar:.4f} (n={n})")
print(f"Posterior: N({post_mean:.4f}, {post_var:.4f})")
print(f"Post SD: {np.sqrt(post_var):.4f}")
# The posterior precision = prior precision + data precision
print(f"\nPrior precision: {1/tau**2:.4f}")
print(f"Data precision: {n/sigma**2:.4f}")
print(f"Posterior precision: {post_prec:.4f}")
Gamma Function Integral¶
This is the normalizing constant of the Gamma distribution and appears whenever we integrate out a rate or precision parameter.
# Gamma integral: normalizing constant verification
import numpy as np
from scipy import integrate
from scipy.special import gamma as gamma_fn
alpha, beta = 3.5, 2.0
numerical, _ = integrate.quad(
lambda x: x**(alpha-1) * np.exp(-beta*x), 0, np.inf)
analytical = gamma_fn(alpha) / beta**alpha
print(f"Numerical: {numerical:.8f}")
print(f"Gamma(a)/beta^a: {analytical:.8f}")
# Application: verify Gamma(alpha, beta) density integrates to 1
def gamma_pdf(x, a, b):
return b**a / gamma_fn(a) * x**(a-1) * np.exp(-b*x)
integral, _ = integrate.quad(gamma_pdf, 0, np.inf, args=(alpha, beta))
print(f"\nGamma({alpha},{beta}) pdf integrates to {integral:.8f}")
Conjugacy connection
In Bayesian inference, if the prior on a Poisson rate is Gamma(\(\alpha_0, \beta_0\)) and we observe \(n\) data points summing to \(s\), the posterior is Gamma(\(\alpha_0 + s, \beta_0 + n\)). The Gamma integral is what makes this conjugate update work.
# Bayesian conjugacy: Poisson-Gamma in action
import numpy as np
np.random.seed(42)
# Prior: lambda ~ Gamma(alpha_0, beta_0)
alpha_0, beta_0 = 2.0, 1.0 # weak prior
# Data: n Poisson observations
lam_true = 4.0
data = np.random.poisson(lam_true, size=20)
n = len(data)
s = data.sum()
# Posterior: Gamma(alpha_0 + s, beta_0 + n)
alpha_post = alpha_0 + s
beta_post = beta_0 + n
post_mean = alpha_post / beta_post
post_var = alpha_post / beta_post**2
# Frequentist MLE for comparison
mle = data.mean()
se_mle = np.sqrt(mle / n)
print(f"Data: n={n}, sum={s}, mean={data.mean():.2f}")
print(f"Prior mean: {alpha_0/beta_0:.4f}")
print(f"Posterior mean: {post_mean:.4f}")
print(f"Posterior SD: {np.sqrt(post_var):.4f}")
print(f"MLE: {mle:.4f}")
print(f"MLE SE: {se_mle:.4f}")
print(f"True lambda: {lam_true}")
Moments of the Gaussian¶
For \(X \sim \mathcal{N}(\mu, \sigma^2)\):
These moments appear when computing the Fisher information for models involving squared observations.
# Gaussian moments: simulation vs formula
import numpy as np
np.random.seed(42)
mu, sigma = 2.0, 1.5
X = np.random.normal(mu, sigma, size=1_000_000)
print(f"{'Moment':>8} {'Simulated':>12} {'Formula':>12}")
print("-" * 36)
print(f"{'E[X^2]':>8} {np.mean(X**2):12.4f} {mu**2 + sigma**2:12.4f}")
print(f"{'E[X^4]':>8} {np.mean(X**4):12.2f} "
f"{mu**4 + 6*mu**2*sigma**2 + 3*sigma**4:12.2f}")
The Chi-Squared Integral¶
If \(X_1, \ldots, X_n \sim \mathcal{N}(0,1)\) i.i.d., then \(Q = \sum X_i^2 \sim \chi^2_n\).
This connects to likelihood ratio tests via Wilks’ theorem: \(-2\log\Lambda \xrightarrow{d} \chi^2_r\).
# Chi-squared: simulation and connection to likelihood ratio tests
import numpy as np
from scipy.stats import chi2
np.random.seed(42)
n = 5
n_sim = 100_000
# Simulate: sum of squared standard normals
Z = np.random.randn(n_sim, n)
Q = np.sum(Z**2, axis=1)
print(f"Chi-squared({n}) distribution:")
print(f" Mean -- simulated: {Q.mean():.3f}, theory: {n}")
print(f" Var -- simulated: {Q.var():.3f}, theory: {2*n}")
# Connection to Wilks' theorem:
# Under H0, -2 log(L0/L1) ~ chi2(r) where r = # restrictions
# This is why the chi2 critical values matter for hypothesis testing
print(f"\nChi2 critical values (for likelihood ratio tests):")
for df in [1, 2, 5, 10]:
for alpha in [0.05, 0.01]:
cv = chi2.ppf(1 - alpha, df=df)
print(f" df={df}, alpha={alpha}: critical value = {cv:.4f}")