Appendix B: Calculus Review¶

This appendix reviews the calculus tools that underpin likelihood-based inference. Every rule and formula is tied to a concrete statistical application and verified in code. The goal is not to re-teach calculus but to show why each tool matters the moment you open a likelihood function.

If you have taken a standard calculus sequence, the rules below will be familiar. What is different here is the lens: every derivative is a score function, every integral is a normalizing constant, and every Taylor expansion is an approximation to a log-likelihood.

Differentiation Rules¶

Throughout, we let \(f\) and \(g\) denote differentiable real-valued functions.

Sum Rule¶

\[\frac{d}{dx}\bigl[f(x) + g(x)\bigr] = f'(x) + g'(x).\]

Why it matters for statistics: the sum rule is the reason log-likelihoods are so much easier than likelihoods. Taking the log of a product turns it into a sum, and then we differentiate each observation’s contribution independently.

# Sum rule in action: log-likelihood of n independent observations
# is a SUM of individual log-likelihoods
import numpy as np
from scipy.stats import poisson

data = np.array([3, 5, 2, 7, 4])
lam = 4.0

# Log-likelihood: sum of individual terms
individual_terms = [poisson.logpmf(x, mu=lam) for x in data]
total_ll = sum(individual_terms)

print("Individual log-likelihood contributions:")
for i, (x, ll) in enumerate(zip(data, individual_terms)):
    print(f"  x_{i} = {x}: log f(x|lam) = {ll:.4f}")
print(f"Total log-likelihood = {total_ll:.4f}")
print(f"Verify: {np.sum(poisson.logpmf(data, mu=lam)):.4f}")

Product Rule¶

\[\frac{d}{dx}\bigl[f(x)\,g(x)\bigr] = f'(x)\,g(x) + f(x)\,g'(x).\]

You encounter the product rule when differentiating likelihood contributions where both the location and scale depend on the same parameter — for example, in heteroscedastic models.

# Product rule: derivative of x * exp(-x^2)
# This shape appears in the score of a Rayleigh distribution
import numpy as np

x = 2.0
# f(x) = x, g(x) = exp(-x^2)
# d/dx [x * exp(-x^2)] = 1 * exp(-x^2) + x * (-2x) * exp(-x^2)
#                       = exp(-x^2) * (1 - 2x^2)

analytical = np.exp(-x**2) * (1 - 2 * x**2)

eps = 1e-7
h = lambda t: t * np.exp(-t**2)
numerical = (h(x + eps) - h(x)) / eps

print(f"Product rule (analytical): {analytical:.8f}")
print(f"Finite difference:         {numerical:.8f}")

Chain Rule¶

If \(h(x) = f(g(x))\), then:

\[h'(x) = f'\bigl(g(x)\bigr)\,g'(x).\]

The chain rule is the single most important differentiation tool in likelihood theory. The score function is the chain rule applied to \(\log(\cdot)\) composed with \(L(\cdot)\):

\[s(\theta) = \frac{d}{d\theta}\log L(\theta) = \frac{1}{L(\theta)}\cdot\frac{dL}{d\theta}.\]

# Chain rule = score function: d/d(theta) log L(theta)
import numpy as np
from scipy.stats import poisson

data = np.array([3, 5, 2, 7, 4])
theta = 4.0  # Poisson rate

# The chain rule gives us the score:
# d/d(lam) log L = d/d(lam) [sum x_i * log(lam) - n*lam - sum log(x_i!)]
#                = sum(x_i) / lam - n
score_chain_rule = data.sum() / theta - len(data)

# Verify numerically
def loglik(lam):
    return np.sum(poisson.logpmf(data, mu=lam))

eps = 1e-7
score_numerical = (loglik(theta + eps) - loglik(theta)) / eps

print(f"Score via chain rule: {score_chain_rule:.8f}")
print(f"Score via finite diff: {score_numerical:.8f}")
print(f"MLE (where score = 0): lam_hat = {data.mean()}")

Logarithmic differentiation. For \(L(\theta) > 0\):

\[\frac{d}{d\theta}\log L(\theta) = \frac{1}{L(\theta)}\,\frac{dL}{d\theta}.\]

This converts differentiating a product (likelihood) into differentiating a sum (log-likelihood) — one of the most powerful tricks in statistics.

# Logarithmic differentiation for a Binomial likelihood
# L(p) = C * p^s * (1-p)^(n-s)
# log L = const + s*log(p) + (n-s)*log(1-p)
# d/dp log L = s/p - (n-s)/(1-p)
import numpy as np

n, s = 100, 73
p = 0.7

# Score via log differentiation
score = s / p - (n - s) / (1 - p)

# Score via direct differentiation of L, then dividing by L
# dL/dp = C * [s * p^(s-1) * (1-p)^(n-s) - (n-s) * p^s * (1-p)^(n-s-1)]
# dL/dp / L = s/p - (n-s)/(1-p)  (same thing!)
print(f"Score at p={p}: {score:.4f}")
print(f"Score at MLE p={s/n}: {s/(s/n) - (n-s)/(1-s/n):.10f}")

Partial Derivatives and Gradients¶

For a function \(f : \mathbb{R}^n \to \mathbb{R}\), the gradient is

\[\begin{split}\nabla f(\mathbf{x}) = \begin{pmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{pmatrix}.\end{split}\]

Setting \(\nabla \ell = \mathbf{0}\) gives the MLE in multiparameter models.

# Gradient of a Normal log-likelihood with two parameters (mu, sigma)
import numpy as np
from scipy.stats import norm

np.random.seed(42)
data = np.random.normal(loc=3.0, scale=2.0, size=100)

def loglik(mu, log_sigma):
    sigma = np.exp(log_sigma)
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# Analytical gradient for Normal:
#   d/d(mu) = sum(x_i - mu) / sigma^2
#   d/d(sigma) = -n/sigma + sum(x_i - mu)^2 / sigma^3
mu_hat = data.mean()
sigma_hat = data.std()

grad_mu = np.sum(data - mu_hat) / sigma_hat**2
grad_sigma = -len(data) / sigma_hat + np.sum((data - mu_hat)**2) / sigma_hat**3

print(f"MLE: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
print(f"Gradient at MLE: d/d(mu) = {grad_mu:.6f}, d/d(sigma) = {grad_sigma:.6f}")
print(f"(Both should be ~0 at the MLE)")

# Numerical verification
eps = 1e-7
grad_mu_num = (loglik(mu_hat + eps, np.log(sigma_hat))
               - loglik(mu_hat, np.log(sigma_hat))) / eps
print(f"Numerical d/d(mu) at MLE: {grad_mu_num:.6f}")

Multivariate chain rule and the Jacobian. If \(f\) depends on \(\mathbf{x}\) and \(\mathbf{x}\) depends on \(\mathbf{t}\):

\[\nabla_{\mathbf{t}} f = \mathbf{J}^\top \nabla_{\mathbf{x}} f,\]

where \(\mathbf{J}\) is the Jacobian. This is the foundation of both backpropagation and the delta method in statistics.

# Delta method: SE of a transformation g(theta_hat)
# If theta_hat ~ N(theta, se^2), then g(theta_hat) ~ N(g(theta), [g'(theta)]^2 * se^2)
import numpy as np

# Example: Poisson rate lambda, want SE of exp(-lambda) = P(X=0)
np.random.seed(42)
data = np.random.poisson(3.0, size=100)
lam_hat = data.mean()
se_lam = np.sqrt(lam_hat / len(data))

# g(lambda) = exp(-lambda), g'(lambda) = -exp(-lambda)
g_hat = np.exp(-lam_hat)
g_prime = -np.exp(-lam_hat)
se_g = abs(g_prime) * se_lam  # delta method

print(f"lam_hat = {lam_hat:.4f}, SE(lam) = {se_lam:.4f}")
print(f"P(X=0) = exp(-lam_hat) = {g_hat:.6f}")
print(f"SE of P(X=0) via delta method = {se_g:.6f}")
print(f"95% CI for P(X=0): [{g_hat - 1.96*se_g:.6f}, {g_hat + 1.96*se_g:.6f}]")

Taylor Series¶

Univariate Taylor Expansion¶

\[f(x) = f(a) + f'(a)\,(x - a) + \frac{f''(a)}{2!}\,(x - a)^2 + \cdots\]

The second-order expansion is the basis for the Laplace approximation and for understanding the curvature of the log-likelihood near its maximum.

# Taylor approximations to log(x) around a=1
# This is relevant because log appears in every log-likelihood
import numpy as np

a = 1.0
x_test = 1.5

f_true = np.log(x_test)
taylor_1 = (x_test - a)                                   # order 1
taylor_2 = taylor_1 - (1/2) * (x_test - a)**2             # order 2
taylor_3 = taylor_2 + (1/3) * (x_test - a)**3             # order 3

print(f"log({x_test}) = {f_true:.6f}")
print(f"{'Order':>6} {'Approx':>10} {'Error':>10}")
print("-" * 30)
for order, approx in [(1, taylor_1), (2, taylor_2), (3, taylor_3)]:
    print(f"{order:6d} {approx:10.6f} {abs(approx - f_true):10.6f}")

Application: quadratic approximation to the log-likelihood¶

Expanding \(\ell(\theta)\) about the MLE \(\hat\theta\) (where \(\ell'(\hat\theta) = 0\)):

\[\ell(\theta) \approx \ell(\hat\theta) + \tfrac{1}{2}\,\ell''(\hat\theta)\,(\theta - \hat\theta)^2.\]

The linear term vanishes because the score is zero at the MLE. What remains is a quadratic — and a quadratic in the exponent is a Gaussian. This is precisely why the MLE is approximately normally distributed.

# Quadratic approximation to a Poisson log-likelihood
import numpy as np
from scipy.stats import poisson

np.random.seed(42)
data = np.array([3, 7, 2, 5, 4, 6, 3, 5])
n = len(data)
theta_mle = data.mean()

def loglik(theta):
    return np.sum(poisson.logpmf(data, mu=theta))

# Analytical Hessian: d^2 ell / d(lambda)^2 = -sum(x_i) / lambda^2
hessian = -data.sum() / theta_mle**2
obs_info = -hessian  # observed Fisher information

# Quadratic approximation
theta_grid = np.linspace(2.0, 7.0, 200)
ell_true = np.array([loglik(t) for t in theta_grid])
ell_quad = loglik(theta_mle) + 0.5 * hessian * (theta_grid - theta_mle)**2

# Compare at specific points
print(f"MLE = {theta_mle:.2f}")
print(f"Observed info = {obs_info:.4f}")
print(f"SE = 1/sqrt(info) = {1/np.sqrt(obs_info):.4f}")
print(f"Exact SE = sqrt(lam/n) = {np.sqrt(theta_mle/n):.4f}")
print()
print(f"{'theta':>8} {'True ell':>12} {'Quad approx':>12} {'Error':>10}")
print("-" * 46)
for t in [3.0, 3.5, 4.0, 4.375, 5.0, 5.5, 6.0]:
    true_val = loglik(t)
    quad_val = loglik(theta_mle) + 0.5 * hessian * (t - theta_mle)**2
    print(f"{t:8.2f} {true_val:12.4f} {quad_val:12.4f} {abs(true_val-quad_val):10.4f}")

Common Pitfall

The quadratic approximation is only good near the MLE. If the true log-likelihood is strongly skewed (small samples, parameters near boundaries), Wald intervals based on this approximation can be misleading. Profile likelihood intervals are more robust.

Multivariate Taylor Expansion¶

For \(f : \mathbb{R}^n \to \mathbb{R}\), expanded about \(\mathbf{a}\):

\[f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}\,(\mathbf{x} - \mathbf{a})^\top \mathbf{H}(\mathbf{a})\,(\mathbf{x} - \mathbf{a}),\]

where \(\mathbf{H}\) is the Hessian matrix.

# Multivariate Taylor: Normal log-likelihood with (mu, log_sigma)
import numpy as np
from scipy.stats import norm

np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=50)
n = len(data)

def loglik(params):
    mu, log_sig = params
    sig = np.exp(log_sig)
    return np.sum(norm.logpdf(data, mu, sig))

# MLE
mu_hat = data.mean()
sig_hat = data.std()
mle = np.array([mu_hat, np.log(sig_hat)])

# Numerical Hessian at MLE
eps = 1e-5
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        pp = mle.copy(); pp[i] += eps; pp[j] += eps
        pm = mle.copy(); pm[i] += eps; pm[j] -= eps
        mp = mle.copy(); mp[i] -= eps; mp[j] += eps
        mm = mle.copy(); mm[i] -= eps; mm[j] -= eps
        H[i, j] = (loglik(pp) - loglik(pm) - loglik(mp) + loglik(mm)) / (4 * eps**2)

obs_info = -H
cov = np.linalg.inv(obs_info)
se = np.sqrt(np.diag(cov))

print(f"MLE: mu = {mu_hat:.4f}, sigma = {sig_hat:.4f}")
print(f"SEs: SE(mu) = {se[0]:.4f}, SE(log sigma) = {se[1]:.4f}")
print(f"\nHessian at MLE:")
print(np.array2string(H, precision=2))
print(f"\nObserved information (= -H):")
print(np.array2string(obs_info, precision=2))

Optimization¶

Necessary Conditions for Extrema¶

First-order condition (score equation):

\[f'(x^*) = 0 \quad \text{(univariate)}, \qquad \nabla f(\mathbf{x}^*) = \mathbf{0} \quad \text{(multivariate)}.\]

Second derivative test: at a critical point \(x^*\):

\(f''(x^*) < 0 \implies\) local maximum (what we want for log-likelihoods).
The negative Hessian \(-\mathbf{H}\) is the observed information matrix.

# Finding the MLE numerically: Gamma distribution
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma

np.random.seed(42)
alpha_true, beta_true = 3.0, 2.0
data = gamma.rvs(a=alpha_true, scale=1/beta_true, size=50)

def neg_loglik(alpha):
    return -np.sum(gamma.logpdf(data, a=alpha, scale=1/beta_true))

result = minimize_scalar(neg_loglik, bounds=(0.5, 10.0), method='bounded')
alpha_hat = result.x

# Second derivative test: should be positive for neg_loglik (= minimum)
eps = 1e-5
d2 = (neg_loglik(alpha_hat+eps) - 2*neg_loglik(alpha_hat)
      + neg_loglik(alpha_hat-eps)) / eps**2

se = 1.0 / np.sqrt(d2)  # d2 = observed info for neg_loglik

print(f"MLE alpha = {alpha_hat:.4f} (true: {alpha_true})")
print(f"d^2(-ell)/d(alpha)^2 at MLE = {d2:.4f} (>0 confirms minimum of -ell)")
print(f"SE(alpha) = {se:.4f}")
print(f"95% CI: [{alpha_hat - 1.96*se:.4f}, {alpha_hat + 1.96*se:.4f}]")

Convexity¶

A function \(f\) is convex if for all \(\mathbf{x}, \mathbf{y}\) and \(t \in [0,1]\):

\[f\bigl(t\,\mathbf{x} + (1-t)\,\mathbf{y}\bigr) \leq t\,f(\mathbf{x}) + (1-t)\,f(\mathbf{y}).\]

Many log-likelihoods for exponential family distributions are concave (negative Hessian is positive definite), which guarantees the MLE is unique.

# Concavity check: logistic regression log-likelihood
import numpy as np

np.random.seed(42)
n = 50
X = np.column_stack([np.ones(n), np.random.randn(n)])
beta_true = np.array([0.5, 1.5])
prob = 1 / (1 + np.exp(-X @ beta_true))
y = (np.random.rand(n) < prob).astype(float)

def loglik(beta):
    eta = X @ beta
    return np.sum(y * eta - np.log(1 + np.exp(eta)))

# Check Hessian at several points --- should always be negative semi-definite
for beta_test in [np.array([0, 0]), np.array([1, 2]), np.array([-1, 0.5])]:
    eps = 1e-5
    H = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            bp, bm = beta_test.copy(), beta_test.copy()
            bpp = beta_test.copy(); bpp[i] += eps; bpp[j] += eps
            bpm = beta_test.copy(); bpm[i] += eps; bpm[j] -= eps
            bmp = beta_test.copy(); bmp[i] -= eps; bmp[j] += eps
            bmm = beta_test.copy(); bmm[i] -= eps; bmm[j] -= eps
            H[i, j] = (loglik(bpp) - loglik(bpm)
                        - loglik(bmp) + loglik(bmm)) / (4 * eps**2)

    eigenvalues = np.linalg.eigvalsh(H)
    print(f"beta = {beta_test}: eigenvalues of H = "
          f"[{eigenvalues[0]:.3f}, {eigenvalues[1]:.3f}] "
          f"{'<= 0 (concave)' if all(eigenvalues <= 1e-10) else 'NOT concave'}")

Why concavity matters

When the log-likelihood is concave, any critical point is the global maximum. This is the case for logistic regression, Poisson regression, and all canonical-link GLMs. When the log-likelihood is not concave (mixture models, neural networks), algorithms like EM may converge to local maxima.

Integration Techniques¶

Integration by Substitution¶

\[\int f(g(x))\,g'(x)\,dx = \int f(u)\,du.\]

This is how we compute normalizing constants of transformed random variables.

# Substitution: verify that Y = log(X) has density f_Y(y) = f_X(e^y) * e^y
# when X ~ Gamma(alpha, beta)
import numpy as np
from scipy.stats import gamma
from scipy import integrate

alpha, beta = 3.0, 2.0

# The density of Y = log(X) where X ~ Gamma
def f_Y(y):
    x = np.exp(y)
    return gamma.pdf(x, a=alpha, scale=1/beta) * x  # Jacobian |dx/dy| = e^y

# Verify it integrates to 1
integral, _ = integrate.quad(f_Y, -10, 20)
print(f"int f_Y(y) dy = {integral:.8f} (should be 1)")

# Verify moments: E[Y] = E[log X] = psi(alpha) - log(beta) (digamma)
from scipy.special import digamma
EY_theory = digamma(alpha) - np.log(beta)
EY_numerical, _ = integrate.quad(lambda y: y * f_Y(y), -10, 20)
print(f"E[log X] numerical: {EY_numerical:.6f}")
print(f"E[log X] theory:    {EY_theory:.6f}")

Integration by Parts¶

\[\int u\,dv = uv - \int v\,du.\]

Integration by parts is essential for deriving moments and for the Gamma function recursion \(\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha)\).

# Integration by parts proves Gamma(alpha+1) = alpha * Gamma(alpha)
# Verify computationally:
import numpy as np
from scipy.special import gamma as gamma_fn

for alpha in [1.0, 2.5, 5.0, 10.0]:
    lhs = gamma_fn(alpha + 1)
    rhs = alpha * gamma_fn(alpha)
    print(f"Gamma({alpha+1:.1f}) = {lhs:.6f},  "
          f"{alpha:.1f} * Gamma({alpha:.1f}) = {rhs:.6f},  "
          f"match: {np.isclose(lhs, rhs)}")

Leibniz Integral Rule¶

When the limits are constant:

\[\frac{d}{d\theta} \int_a^b f(x, \theta)\,dx = \int_a^b \frac{\partial f}{\partial \theta}\,dx.\]

This rule justifies differentiating under the integral sign — the technical step behind deriving score functions and Fisher information from the likelihood.

# Leibniz rule: d/d(beta) of the Gamma normalizing constant
import numpy as np
from scipy import integrate
from scipy.special import gamma as gamma_fn

alpha = 3.0
beta = 2.0

# I(beta) = int_0^inf x^(a-1) exp(-beta*x) dx = Gamma(a) / beta^a
def I(b):
    val, _ = integrate.quad(lambda x: x**(alpha-1) * np.exp(-b*x), 0, np.inf)
    return val

# Numerical dI/dbeta
eps = 1e-7
dI_numerical = (I(beta + eps) - I(beta)) / eps

# Analytical: dI/dbeta = -alpha * Gamma(alpha) / beta^(alpha+1)
dI_analytical = -alpha * gamma_fn(alpha) / beta**(alpha + 1)

print(f"dI/d(beta) numerical:  {dI_numerical:.8f}")
print(f"dI/d(beta) analytical: {dI_analytical:.8f}")

When Leibniz fails

The Leibniz rule requires the support to be independent of \(\theta\). When the support does depend on the parameter — for example, the Uniform(0, \(\theta\)) distribution — the rule fails, and the MLE behaves in non-standard ways (the MLE is \(\max(x_i)\) and converges at rate \(1/n\) instead of \(1/\sqrt{n}\)).

# When Leibniz fails: Uniform(0, theta) MLE
import numpy as np

np.random.seed(42)
theta_true = 5.0

for n in [10, 100, 1000, 10000]:
    data = np.random.uniform(0, theta_true, size=n)
    theta_hat = data.max()  # MLE
    # Bias and rate of convergence
    error = theta_true - theta_hat
    print(f"n = {n:5d}: MLE = {theta_hat:.6f}, "
          f"error = {error:.6f}, "
          f"n * error = {n * error:.4f}")

print("\n(n * error stays roughly constant => O(1/n) rate, not O(1/sqrt(n)))")

The Gamma Function¶

\[\Gamma(\alpha) = \int_0^\infty t^{\alpha - 1}\,e^{-t}\,dt, \qquad \alpha > 0.\]

The Gamma function is the normalizing constant that makes the Gamma, Beta, Chi-squared, Student’s t, and F distributions integrate to one.

# Gamma function: recursion, factorials, and special values
import numpy as np
from scipy.special import gamma as gamma_fn
from math import factorial

# Recursion: Gamma(a+1) = a * Gamma(a)
alpha = 3.7
print(f"Gamma({alpha+1}) = {gamma_fn(alpha+1):.6f}")
print(f"{alpha} * Gamma({alpha}) = {alpha * gamma_fn(alpha):.6f}")

# Factorial connection
print(f"\n{'n':>3} {'Gamma(n)':>12} {'(n-1)!':>12}")
for n in range(1, 7):
    print(f"{n:3d} {gamma_fn(n):12.1f} {factorial(n-1):12d}")

# Special value
print(f"\nGamma(1/2) = {gamma_fn(0.5):.6f}")
print(f"sqrt(pi)   = {np.sqrt(np.pi):.6f}")

The digamma function¶

\[\psi(\alpha) = \frac{d}{d\alpha}\log\Gamma(\alpha).\]

The digamma function appears in the score function of every distribution whose normalizing constant involves \(\Gamma(\alpha)\).

# Digamma: the score function of the Gamma normalizing constant
import numpy as np
from scipy.special import digamma, gammaln

alpha = 2.5

# Analytical
psi = digamma(alpha)

# Numerical: d/d(alpha) log Gamma(alpha)
eps = 1e-7
psi_num = (gammaln(alpha + eps) - gammaln(alpha)) / eps

print(f"digamma({alpha}) = {psi:.8f}")
print(f"numerical:        {psi_num:.8f}")

# Application: score of Gamma(alpha, beta) w.r.t. alpha
beta = 2.0
np.random.seed(42)
data = np.random.gamma(alpha, 1/beta, size=100)

# Score: log(beta) - psi(alpha) + mean(log x)
score_alpha = np.log(beta) - digamma(alpha) + np.mean(np.log(data))
print(f"\nGamma score w.r.t. alpha at true value: {score_alpha:.4f} (should be ~0)")

The Beta function¶

\[B(\alpha, \beta) = \int_0^1 t^{\alpha - 1}(1 - t)^{\beta - 1}\,dt = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}.\]

# Beta function: integration vs Gamma formula
import numpy as np
from scipy.special import gamma as gamma_fn, beta as beta_fn
from scipy import integrate

alpha, beta = 2.5, 3.0

numerical, _ = integrate.quad(lambda t: t**(alpha-1) * (1-t)**(beta-1), 0, 1)
via_gamma = gamma_fn(alpha) * gamma_fn(beta) / gamma_fn(alpha + beta)
via_scipy = beta_fn(alpha, beta)

print(f"B({alpha},{beta}) by integration: {numerical:.8f}")
print(f"B({alpha},{beta}) via Gamma:      {via_gamma:.8f}")
print(f"B({alpha},{beta}) via scipy:       {via_scipy:.8f}")

# Application: verify Beta(alpha, beta) density integrates to 1
def beta_pdf(x, a, b):
    return x**(a-1) * (1-x)**(b-1) / beta_fn(a, b)

integral, _ = integrate.quad(beta_pdf, 0, 1, args=(alpha, beta))
print(f"\nint Beta({alpha},{beta}) pdf = {integral:.8f} (should be 1)")

Stirling’s Approximation¶

\[\log(n!) \approx n\log n - n + \tfrac{1}{2}\log(2\pi n).\]

Stirling’s approximation simplifies log-likelihoods involving factorials (multinomial, Poisson limit of Binomial) and appears in asymptotic arguments for MLE normality.

# Stirling's approximation: accuracy at various n
import numpy as np
from scipy.special import gammaln

print(f"{'n':>6}  {'log(n!)':>12}  {'Stirling':>12}  {'Rel Error':>12}")
print("-" * 48)
for n in [5, 10, 20, 50, 100, 1000]:
    exact = gammaln(n + 1)
    stirling = n * np.log(n) - n + 0.5 * np.log(2 * np.pi * n)
    rel_err = abs(stirling - exact) / abs(exact)
    print(f"{n:>6}  {exact:>12.4f}  {stirling:>12.4f}  {rel_err:>12.2e}")

print("\n(Relative error drops below 1% for n >= 10)")

# Application: Stirling in the Poisson-Binomial connection
# P(X = k) for Binomial(n, lam/n) -> Poisson(lam) as n -> inf
import numpy as np
from scipy.special import comb
from scipy.stats import poisson

lam = 3.0
k = 4

print(f"P(X={k}) for Binomial(n, {lam}/n) vs Poisson({lam}):")
print(f"{'n':>8}  {'Binomial':>12}  {'Poisson':>12}  {'Abs Diff':>12}")
print("-" * 48)
for n in [10, 50, 100, 1000, 10000]:
    p = lam / n
    binom_prob = comb(n, k, exact=True) * p**k * (1-p)**(n-k)
    pois_prob = poisson.pmf(k, lam)
    print(f"{n:8d}  {binom_prob:12.8f}  {pois_prob:12.8f}  {abs(binom_prob-pois_prob):12.2e}")

Useful Integrals¶

Gaussian Integral¶

\[\int_{-\infty}^{\infty} e^{-ax^2 + bx}\,dx = \sqrt{\frac{\pi}{a}}\,\exp\!\left(\frac{b^2}{4a}\right), \qquad a > 0.\]

This identity is used every time we derive the normalizing constant of a normal distribution or “complete the square” in an exponent.

# Gaussian integrals: numerical verification
import numpy as np
from scipy import integrate

# Basic: int exp(-x^2) dx = sqrt(pi)
val, _ = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(f"int exp(-x^2) dx = {val:.6f},  sqrt(pi) = {np.sqrt(np.pi):.6f}")

# With parameters: int exp(-ax^2 + bx) dx
a, b = 2.0, 3.0
val_ab, _ = integrate.quad(lambda x: np.exp(-a*x**2 + b*x), -np.inf, np.inf)
expected = np.sqrt(np.pi / a) * np.exp(b**2 / (4*a))
print(f"int exp(-{a}x^2+{b}x) dx = {val_ab:.6f},  formula = {expected:.6f}")

# Application: verify Normal density integrates to 1
mu, sigma = 3.0, 2.0
normal_integral, _ = integrate.quad(
    lambda x: (1/(sigma*np.sqrt(2*np.pi))) * np.exp(-(x-mu)**2/(2*sigma**2)),
    -np.inf, np.inf)
print(f"\nN({mu},{sigma}^2) integrates to {normal_integral:.8f}")

# Completing the square: the trick behind conjugate Bayesian updates
# Posterior of Normal mean with Normal prior
import numpy as np
from scipy import integrate

# Prior: mu ~ N(mu_0, tau^2)
mu_0, tau = 0.0, 5.0
# Data: x_1, ..., x_n ~ N(mu, sigma^2), sigma known
np.random.seed(42)
sigma = 2.0
data = np.random.normal(3.0, sigma, size=10)
n = len(data)
x_bar = data.mean()

# Posterior by completing the square:
# precision: 1/tau^2 + n/sigma^2
post_prec = 1/tau**2 + n/sigma**2
post_var = 1 / post_prec
post_mean = post_var * (mu_0/tau**2 + n*x_bar/sigma**2)

print(f"Prior:     N({mu_0}, {tau**2})")
print(f"Data mean: {x_bar:.4f} (n={n})")
print(f"Posterior: N({post_mean:.4f}, {post_var:.4f})")
print(f"Post SD:   {np.sqrt(post_var):.4f}")

# The posterior precision = prior precision + data precision
print(f"\nPrior precision:    {1/tau**2:.4f}")
print(f"Data precision:     {n/sigma**2:.4f}")
print(f"Posterior precision: {post_prec:.4f}")

Gamma Function Integral¶

\[\int_0^\infty x^{\alpha - 1}\,e^{-\beta x}\,dx = \frac{\Gamma(\alpha)}{\beta^\alpha}.\]

This is the normalizing constant of the Gamma distribution and appears whenever we integrate out a rate or precision parameter.

# Gamma integral: normalizing constant verification
import numpy as np
from scipy import integrate
from scipy.special import gamma as gamma_fn

alpha, beta = 3.5, 2.0

numerical, _ = integrate.quad(
    lambda x: x**(alpha-1) * np.exp(-beta*x), 0, np.inf)
analytical = gamma_fn(alpha) / beta**alpha

print(f"Numerical:       {numerical:.8f}")
print(f"Gamma(a)/beta^a: {analytical:.8f}")

# Application: verify Gamma(alpha, beta) density integrates to 1
def gamma_pdf(x, a, b):
    return b**a / gamma_fn(a) * x**(a-1) * np.exp(-b*x)

integral, _ = integrate.quad(gamma_pdf, 0, np.inf, args=(alpha, beta))
print(f"\nGamma({alpha},{beta}) pdf integrates to {integral:.8f}")

Conjugacy connection

In Bayesian inference, if the prior on a Poisson rate is Gamma(\(\alpha_0, \beta_0\)) and we observe \(n\) data points summing to \(s\), the posterior is Gamma(\(\alpha_0 + s, \beta_0 + n\)). The Gamma integral is what makes this conjugate update work.

# Bayesian conjugacy: Poisson-Gamma in action
import numpy as np

np.random.seed(42)
# Prior: lambda ~ Gamma(alpha_0, beta_0)
alpha_0, beta_0 = 2.0, 1.0  # weak prior

# Data: n Poisson observations
lam_true = 4.0
data = np.random.poisson(lam_true, size=20)
n = len(data)
s = data.sum()

# Posterior: Gamma(alpha_0 + s, beta_0 + n)
alpha_post = alpha_0 + s
beta_post = beta_0 + n
post_mean = alpha_post / beta_post
post_var = alpha_post / beta_post**2

# Frequentist MLE for comparison
mle = data.mean()
se_mle = np.sqrt(mle / n)

print(f"Data: n={n}, sum={s}, mean={data.mean():.2f}")
print(f"Prior mean:     {alpha_0/beta_0:.4f}")
print(f"Posterior mean: {post_mean:.4f}")
print(f"Posterior SD:   {np.sqrt(post_var):.4f}")
print(f"MLE:            {mle:.4f}")
print(f"MLE SE:         {se_mle:.4f}")
print(f"True lambda:    {lam_true}")

Moments of the Gaussian¶

For \(X \sim \mathcal{N}(\mu, \sigma^2)\):

\[\begin{split}E[X^2] &= \mu^2 + \sigma^2, \\ E[X^4] &= \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4.\end{split}\]

These moments appear when computing the Fisher information for models involving squared observations.

# Gaussian moments: simulation vs formula
import numpy as np

np.random.seed(42)
mu, sigma = 2.0, 1.5
X = np.random.normal(mu, sigma, size=1_000_000)

print(f"{'Moment':>8} {'Simulated':>12} {'Formula':>12}")
print("-" * 36)
print(f"{'E[X^2]':>8} {np.mean(X**2):12.4f} {mu**2 + sigma**2:12.4f}")
print(f"{'E[X^4]':>8} {np.mean(X**4):12.2f} "
      f"{mu**4 + 6*mu**2*sigma**2 + 3*sigma**4:12.2f}")

The Chi-Squared Integral¶

If \(X_1, \ldots, X_n \sim \mathcal{N}(0,1)\) i.i.d., then \(Q = \sum X_i^2 \sim \chi^2_n\).

\[f_Q(q) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,q^{n/2 - 1}\,e^{-q/2}.\]

This connects to likelihood ratio tests via Wilks’ theorem: \(-2\log\Lambda \xrightarrow{d} \chi^2_r\).

# Chi-squared: simulation and connection to likelihood ratio tests
import numpy as np
from scipy.stats import chi2

np.random.seed(42)
n = 5
n_sim = 100_000

# Simulate: sum of squared standard normals
Z = np.random.randn(n_sim, n)
Q = np.sum(Z**2, axis=1)

print(f"Chi-squared({n}) distribution:")
print(f"  Mean  -- simulated: {Q.mean():.3f},  theory: {n}")
print(f"  Var   -- simulated: {Q.var():.3f},  theory: {2*n}")

# Connection to Wilks' theorem:
# Under H0, -2 log(L0/L1) ~ chi2(r) where r = # restrictions
# This is why the chi2 critical values matter for hypothesis testing
print(f"\nChi2 critical values (for likelihood ratio tests):")
for df in [1, 2, 5, 10]:
    for alpha in [0.05, 0.01]:
        cv = chi2.ppf(1 - alpha, df=df)
        print(f"  df={df}, alpha={alpha}: critical value = {cv:.4f}")