.. _appendix_calculus: ================================ Appendix B: Calculus Review ================================ This appendix reviews the calculus tools that underpin likelihood-based inference. Every rule and formula is tied to a concrete statistical application and verified in code. The goal is not to re-teach calculus but to show *why* each tool matters the moment you open a likelihood function. If you have taken a standard calculus sequence, the rules below will be familiar. What is different here is the lens: every derivative is a score function, every integral is a normalizing constant, and every Taylor expansion is an approximation to a log-likelihood. Differentiation Rules ====================== Throughout, we let :math:`f` and :math:`g` denote differentiable real-valued functions. Sum Rule -------- .. math:: \frac{d}{dx}\bigl[f(x) + g(x)\bigr] = f'(x) + g'(x). **Why it matters for statistics:** the sum rule is the reason log-likelihoods are so much easier than likelihoods. Taking the log of a product turns it into a sum, and then we differentiate each observation's contribution independently. .. code-block:: python # Sum rule in action: log-likelihood of n independent observations # is a SUM of individual log-likelihoods import numpy as np from scipy.stats import poisson data = np.array([3, 5, 2, 7, 4]) lam = 4.0 # Log-likelihood: sum of individual terms individual_terms = [poisson.logpmf(x, mu=lam) for x in data] total_ll = sum(individual_terms) print("Individual log-likelihood contributions:") for i, (x, ll) in enumerate(zip(data, individual_terms)): print(f" x_{i} = {x}: log f(x|lam) = {ll:.4f}") print(f"Total log-likelihood = {total_ll:.4f}") print(f"Verify: {np.sum(poisson.logpmf(data, mu=lam)):.4f}") Product Rule ------------ .. math:: \frac{d}{dx}\bigl[f(x)\,g(x)\bigr] = f'(x)\,g(x) + f(x)\,g'(x). You encounter the product rule when differentiating likelihood contributions where both the location and scale depend on the same parameter --- for example, in heteroscedastic models. .. code-block:: python # Product rule: derivative of x * exp(-x^2) # This shape appears in the score of a Rayleigh distribution import numpy as np x = 2.0 # f(x) = x, g(x) = exp(-x^2) # d/dx [x * exp(-x^2)] = 1 * exp(-x^2) + x * (-2x) * exp(-x^2) # = exp(-x^2) * (1 - 2x^2) analytical = np.exp(-x**2) * (1 - 2 * x**2) eps = 1e-7 h = lambda t: t * np.exp(-t**2) numerical = (h(x + eps) - h(x)) / eps print(f"Product rule (analytical): {analytical:.8f}") print(f"Finite difference: {numerical:.8f}") Chain Rule ---------- If :math:`h(x) = f(g(x))`, then: .. math:: h'(x) = f'\bigl(g(x)\bigr)\,g'(x). The chain rule is the single most important differentiation tool in likelihood theory. **The score function is the chain rule applied to** :math:`\log(\cdot)` **composed with** :math:`L(\cdot)`: .. math:: s(\theta) = \frac{d}{d\theta}\log L(\theta) = \frac{1}{L(\theta)}\cdot\frac{dL}{d\theta}. .. code-block:: python # Chain rule = score function: d/d(theta) log L(theta) import numpy as np from scipy.stats import poisson data = np.array([3, 5, 2, 7, 4]) theta = 4.0 # Poisson rate # The chain rule gives us the score: # d/d(lam) log L = d/d(lam) [sum x_i * log(lam) - n*lam - sum log(x_i!)] # = sum(x_i) / lam - n score_chain_rule = data.sum() / theta - len(data) # Verify numerically def loglik(lam): return np.sum(poisson.logpmf(data, mu=lam)) eps = 1e-7 score_numerical = (loglik(theta + eps) - loglik(theta)) / eps print(f"Score via chain rule: {score_chain_rule:.8f}") print(f"Score via finite diff: {score_numerical:.8f}") print(f"MLE (where score = 0): lam_hat = {data.mean()}") **Logarithmic differentiation.** For :math:`L(\theta) > 0`: .. math:: \frac{d}{d\theta}\log L(\theta) = \frac{1}{L(\theta)}\,\frac{dL}{d\theta}. This converts differentiating a product (likelihood) into differentiating a sum (log-likelihood) --- one of the most powerful tricks in statistics. .. code-block:: python # Logarithmic differentiation for a Binomial likelihood # L(p) = C * p^s * (1-p)^(n-s) # log L = const + s*log(p) + (n-s)*log(1-p) # d/dp log L = s/p - (n-s)/(1-p) import numpy as np n, s = 100, 73 p = 0.7 # Score via log differentiation score = s / p - (n - s) / (1 - p) # Score via direct differentiation of L, then dividing by L # dL/dp = C * [s * p^(s-1) * (1-p)^(n-s) - (n-s) * p^s * (1-p)^(n-s-1)] # dL/dp / L = s/p - (n-s)/(1-p) (same thing!) print(f"Score at p={p}: {score:.4f}") print(f"Score at MLE p={s/n}: {s/(s/n) - (n-s)/(1-s/n):.10f}") Partial Derivatives and Gradients ================================== For a function :math:`f : \mathbb{R}^n \to \mathbb{R}`, the **gradient** is .. math:: \nabla f(\mathbf{x}) = \begin{pmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{pmatrix}. Setting :math:`\nabla \ell = \mathbf{0}` gives the MLE in multiparameter models. .. code-block:: python # Gradient of a Normal log-likelihood with two parameters (mu, sigma) import numpy as np from scipy.stats import norm np.random.seed(42) data = np.random.normal(loc=3.0, scale=2.0, size=100) def loglik(mu, log_sigma): sigma = np.exp(log_sigma) return np.sum(norm.logpdf(data, loc=mu, scale=sigma)) # Analytical gradient for Normal: # d/d(mu) = sum(x_i - mu) / sigma^2 # d/d(sigma) = -n/sigma + sum(x_i - mu)^2 / sigma^3 mu_hat = data.mean() sigma_hat = data.std() grad_mu = np.sum(data - mu_hat) / sigma_hat**2 grad_sigma = -len(data) / sigma_hat + np.sum((data - mu_hat)**2) / sigma_hat**3 print(f"MLE: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}") print(f"Gradient at MLE: d/d(mu) = {grad_mu:.6f}, d/d(sigma) = {grad_sigma:.6f}") print(f"(Both should be ~0 at the MLE)") # Numerical verification eps = 1e-7 grad_mu_num = (loglik(mu_hat + eps, np.log(sigma_hat)) - loglik(mu_hat, np.log(sigma_hat))) / eps print(f"Numerical d/d(mu) at MLE: {grad_mu_num:.6f}") **Multivariate chain rule and the Jacobian.** If :math:`f` depends on :math:`\mathbf{x}` and :math:`\mathbf{x}` depends on :math:`\mathbf{t}`: .. math:: \nabla_{\mathbf{t}} f = \mathbf{J}^\top \nabla_{\mathbf{x}} f, where :math:`\mathbf{J}` is the Jacobian. This is the foundation of both backpropagation and the delta method in statistics. .. code-block:: python # Delta method: SE of a transformation g(theta_hat) # If theta_hat ~ N(theta, se^2), then g(theta_hat) ~ N(g(theta), [g'(theta)]^2 * se^2) import numpy as np # Example: Poisson rate lambda, want SE of exp(-lambda) = P(X=0) np.random.seed(42) data = np.random.poisson(3.0, size=100) lam_hat = data.mean() se_lam = np.sqrt(lam_hat / len(data)) # g(lambda) = exp(-lambda), g'(lambda) = -exp(-lambda) g_hat = np.exp(-lam_hat) g_prime = -np.exp(-lam_hat) se_g = abs(g_prime) * se_lam # delta method print(f"lam_hat = {lam_hat:.4f}, SE(lam) = {se_lam:.4f}") print(f"P(X=0) = exp(-lam_hat) = {g_hat:.6f}") print(f"SE of P(X=0) via delta method = {se_g:.6f}") print(f"95% CI for P(X=0): [{g_hat - 1.96*se_g:.6f}, {g_hat + 1.96*se_g:.6f}]") Taylor Series ============== Univariate Taylor Expansion ---------------------------- .. math:: f(x) = f(a) + f'(a)\,(x - a) + \frac{f''(a)}{2!}\,(x - a)^2 + \cdots The second-order expansion is the basis for the **Laplace approximation** and for understanding the curvature of the log-likelihood near its maximum. .. code-block:: python # Taylor approximations to log(x) around a=1 # This is relevant because log appears in every log-likelihood import numpy as np a = 1.0 x_test = 1.5 f_true = np.log(x_test) taylor_1 = (x_test - a) # order 1 taylor_2 = taylor_1 - (1/2) * (x_test - a)**2 # order 2 taylor_3 = taylor_2 + (1/3) * (x_test - a)**3 # order 3 print(f"log({x_test}) = {f_true:.6f}") print(f"{'Order':>6} {'Approx':>10} {'Error':>10}") print("-" * 30) for order, approx in [(1, taylor_1), (2, taylor_2), (3, taylor_3)]: print(f"{order:6d} {approx:10.6f} {abs(approx - f_true):10.6f}") Application: quadratic approximation to the log-likelihood ------------------------------------------------------------- Expanding :math:`\ell(\theta)` about the MLE :math:`\hat\theta` (where :math:`\ell'(\hat\theta) = 0`): .. math:: \ell(\theta) \approx \ell(\hat\theta) + \tfrac{1}{2}\,\ell''(\hat\theta)\,(\theta - \hat\theta)^2. The linear term vanishes because the score is zero at the MLE. What remains is a quadratic --- and a quadratic in the exponent is a Gaussian. **This is precisely why the MLE is approximately normally distributed.** .. code-block:: python # Quadratic approximation to a Poisson log-likelihood import numpy as np from scipy.stats import poisson np.random.seed(42) data = np.array([3, 7, 2, 5, 4, 6, 3, 5]) n = len(data) theta_mle = data.mean() def loglik(theta): return np.sum(poisson.logpmf(data, mu=theta)) # Analytical Hessian: d^2 ell / d(lambda)^2 = -sum(x_i) / lambda^2 hessian = -data.sum() / theta_mle**2 obs_info = -hessian # observed Fisher information # Quadratic approximation theta_grid = np.linspace(2.0, 7.0, 200) ell_true = np.array([loglik(t) for t in theta_grid]) ell_quad = loglik(theta_mle) + 0.5 * hessian * (theta_grid - theta_mle)**2 # Compare at specific points print(f"MLE = {theta_mle:.2f}") print(f"Observed info = {obs_info:.4f}") print(f"SE = 1/sqrt(info) = {1/np.sqrt(obs_info):.4f}") print(f"Exact SE = sqrt(lam/n) = {np.sqrt(theta_mle/n):.4f}") print() print(f"{'theta':>8} {'True ell':>12} {'Quad approx':>12} {'Error':>10}") print("-" * 46) for t in [3.0, 3.5, 4.0, 4.375, 5.0, 5.5, 6.0]: true_val = loglik(t) quad_val = loglik(theta_mle) + 0.5 * hessian * (t - theta_mle)**2 print(f"{t:8.2f} {true_val:12.4f} {quad_val:12.4f} {abs(true_val-quad_val):10.4f}") .. admonition:: Common Pitfall The quadratic approximation is only good *near* the MLE. If the true log-likelihood is strongly skewed (small samples, parameters near boundaries), Wald intervals based on this approximation can be misleading. Profile likelihood intervals are more robust. Multivariate Taylor Expansion ------------------------------ For :math:`f : \mathbb{R}^n \to \mathbb{R}`, expanded about :math:`\mathbf{a}`: .. math:: f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) + \tfrac{1}{2}\,(\mathbf{x} - \mathbf{a})^\top \mathbf{H}(\mathbf{a})\,(\mathbf{x} - \mathbf{a}), where :math:`\mathbf{H}` is the Hessian matrix. .. code-block:: python # Multivariate Taylor: Normal log-likelihood with (mu, log_sigma) import numpy as np from scipy.stats import norm np.random.seed(42) data = np.random.normal(loc=5.0, scale=2.0, size=50) n = len(data) def loglik(params): mu, log_sig = params sig = np.exp(log_sig) return np.sum(norm.logpdf(data, mu, sig)) # MLE mu_hat = data.mean() sig_hat = data.std() mle = np.array([mu_hat, np.log(sig_hat)]) # Numerical Hessian at MLE eps = 1e-5 H = np.zeros((2, 2)) for i in range(2): for j in range(2): pp = mle.copy(); pp[i] += eps; pp[j] += eps pm = mle.copy(); pm[i] += eps; pm[j] -= eps mp = mle.copy(); mp[i] -= eps; mp[j] += eps mm = mle.copy(); mm[i] -= eps; mm[j] -= eps H[i, j] = (loglik(pp) - loglik(pm) - loglik(mp) + loglik(mm)) / (4 * eps**2) obs_info = -H cov = np.linalg.inv(obs_info) se = np.sqrt(np.diag(cov)) print(f"MLE: mu = {mu_hat:.4f}, sigma = {sig_hat:.4f}") print(f"SEs: SE(mu) = {se[0]:.4f}, SE(log sigma) = {se[1]:.4f}") print(f"\nHessian at MLE:") print(np.array2string(H, precision=2)) print(f"\nObserved information (= -H):") print(np.array2string(obs_info, precision=2)) Optimization ============= Necessary Conditions for Extrema ---------------------------------- **First-order condition (score equation):** .. math:: f'(x^*) = 0 \quad \text{(univariate)}, \qquad \nabla f(\mathbf{x}^*) = \mathbf{0} \quad \text{(multivariate)}. **Second derivative test:** at a critical point :math:`x^*`: - :math:`f''(x^*) < 0 \implies` local maximum (what we want for log-likelihoods). - The negative Hessian :math:`-\mathbf{H}` is the **observed information matrix**. .. code-block:: python # Finding the MLE numerically: Gamma distribution import numpy as np from scipy.optimize import minimize_scalar from scipy.stats import gamma np.random.seed(42) alpha_true, beta_true = 3.0, 2.0 data = gamma.rvs(a=alpha_true, scale=1/beta_true, size=50) def neg_loglik(alpha): return -np.sum(gamma.logpdf(data, a=alpha, scale=1/beta_true)) result = minimize_scalar(neg_loglik, bounds=(0.5, 10.0), method='bounded') alpha_hat = result.x # Second derivative test: should be positive for neg_loglik (= minimum) eps = 1e-5 d2 = (neg_loglik(alpha_hat+eps) - 2*neg_loglik(alpha_hat) + neg_loglik(alpha_hat-eps)) / eps**2 se = 1.0 / np.sqrt(d2) # d2 = observed info for neg_loglik print(f"MLE alpha = {alpha_hat:.4f} (true: {alpha_true})") print(f"d^2(-ell)/d(alpha)^2 at MLE = {d2:.4f} (>0 confirms minimum of -ell)") print(f"SE(alpha) = {se:.4f}") print(f"95% CI: [{alpha_hat - 1.96*se:.4f}, {alpha_hat + 1.96*se:.4f}]") Convexity --------- A function :math:`f` is **convex** if for all :math:`\mathbf{x}, \mathbf{y}` and :math:`t \in [0,1]`: .. math:: f\bigl(t\,\mathbf{x} + (1-t)\,\mathbf{y}\bigr) \leq t\,f(\mathbf{x}) + (1-t)\,f(\mathbf{y}). Many log-likelihoods for exponential family distributions are **concave** (negative Hessian is positive definite), which guarantees the MLE is unique. .. code-block:: python # Concavity check: logistic regression log-likelihood import numpy as np np.random.seed(42) n = 50 X = np.column_stack([np.ones(n), np.random.randn(n)]) beta_true = np.array([0.5, 1.5]) prob = 1 / (1 + np.exp(-X @ beta_true)) y = (np.random.rand(n) < prob).astype(float) def loglik(beta): eta = X @ beta return np.sum(y * eta - np.log(1 + np.exp(eta))) # Check Hessian at several points --- should always be negative semi-definite for beta_test in [np.array([0, 0]), np.array([1, 2]), np.array([-1, 0.5])]: eps = 1e-5 H = np.zeros((2, 2)) for i in range(2): for j in range(2): bp, bm = beta_test.copy(), beta_test.copy() bpp = beta_test.copy(); bpp[i] += eps; bpp[j] += eps bpm = beta_test.copy(); bpm[i] += eps; bpm[j] -= eps bmp = beta_test.copy(); bmp[i] -= eps; bmp[j] += eps bmm = beta_test.copy(); bmm[i] -= eps; bmm[j] -= eps H[i, j] = (loglik(bpp) - loglik(bpm) - loglik(bmp) + loglik(bmm)) / (4 * eps**2) eigenvalues = np.linalg.eigvalsh(H) print(f"beta = {beta_test}: eigenvalues of H = " f"[{eigenvalues[0]:.3f}, {eigenvalues[1]:.3f}] " f"{'<= 0 (concave)' if all(eigenvalues <= 1e-10) else 'NOT concave'}") .. admonition:: Why concavity matters When the log-likelihood is concave, any critical point is the global maximum. This is the case for logistic regression, Poisson regression, and all canonical-link GLMs. When the log-likelihood is *not* concave (mixture models, neural networks), algorithms like EM may converge to local maxima. Integration Techniques ======================= Integration by Substitution ----------------------------- .. math:: \int f(g(x))\,g'(x)\,dx = \int f(u)\,du. This is how we compute normalizing constants of transformed random variables. .. code-block:: python # Substitution: verify that Y = log(X) has density f_Y(y) = f_X(e^y) * e^y # when X ~ Gamma(alpha, beta) import numpy as np from scipy.stats import gamma from scipy import integrate alpha, beta = 3.0, 2.0 # The density of Y = log(X) where X ~ Gamma def f_Y(y): x = np.exp(y) return gamma.pdf(x, a=alpha, scale=1/beta) * x # Jacobian |dx/dy| = e^y # Verify it integrates to 1 integral, _ = integrate.quad(f_Y, -10, 20) print(f"int f_Y(y) dy = {integral:.8f} (should be 1)") # Verify moments: E[Y] = E[log X] = psi(alpha) - log(beta) (digamma) from scipy.special import digamma EY_theory = digamma(alpha) - np.log(beta) EY_numerical, _ = integrate.quad(lambda y: y * f_Y(y), -10, 20) print(f"E[log X] numerical: {EY_numerical:.6f}") print(f"E[log X] theory: {EY_theory:.6f}") Integration by Parts --------------------- .. math:: \int u\,dv = uv - \int v\,du. Integration by parts is essential for deriving moments and for the Gamma function recursion :math:`\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha)`. .. code-block:: python # Integration by parts proves Gamma(alpha+1) = alpha * Gamma(alpha) # Verify computationally: import numpy as np from scipy.special import gamma as gamma_fn for alpha in [1.0, 2.5, 5.0, 10.0]: lhs = gamma_fn(alpha + 1) rhs = alpha * gamma_fn(alpha) print(f"Gamma({alpha+1:.1f}) = {lhs:.6f}, " f"{alpha:.1f} * Gamma({alpha:.1f}) = {rhs:.6f}, " f"match: {np.isclose(lhs, rhs)}") Leibniz Integral Rule ---------------------- When the limits are constant: .. math:: \frac{d}{d\theta} \int_a^b f(x, \theta)\,dx = \int_a^b \frac{\partial f}{\partial \theta}\,dx. This rule justifies **differentiating under the integral sign** --- the technical step behind deriving score functions and Fisher information from the likelihood. .. code-block:: python # Leibniz rule: d/d(beta) of the Gamma normalizing constant import numpy as np from scipy import integrate from scipy.special import gamma as gamma_fn alpha = 3.0 beta = 2.0 # I(beta) = int_0^inf x^(a-1) exp(-beta*x) dx = Gamma(a) / beta^a def I(b): val, _ = integrate.quad(lambda x: x**(alpha-1) * np.exp(-b*x), 0, np.inf) return val # Numerical dI/dbeta eps = 1e-7 dI_numerical = (I(beta + eps) - I(beta)) / eps # Analytical: dI/dbeta = -alpha * Gamma(alpha) / beta^(alpha+1) dI_analytical = -alpha * gamma_fn(alpha) / beta**(alpha + 1) print(f"dI/d(beta) numerical: {dI_numerical:.8f}") print(f"dI/d(beta) analytical: {dI_analytical:.8f}") .. admonition:: When Leibniz fails The Leibniz rule requires the support to be independent of :math:`\theta`. When the support *does* depend on the parameter --- for example, the Uniform(0, :math:`\theta`) distribution --- the rule fails, and the MLE behaves in non-standard ways (the MLE is :math:`\max(x_i)` and converges at rate :math:`1/n` instead of :math:`1/\sqrt{n}`). .. code-block:: python # When Leibniz fails: Uniform(0, theta) MLE import numpy as np np.random.seed(42) theta_true = 5.0 for n in [10, 100, 1000, 10000]: data = np.random.uniform(0, theta_true, size=n) theta_hat = data.max() # MLE # Bias and rate of convergence error = theta_true - theta_hat print(f"n = {n:5d}: MLE = {theta_hat:.6f}, " f"error = {error:.6f}, " f"n * error = {n * error:.4f}") print("\n(n * error stays roughly constant => O(1/n) rate, not O(1/sqrt(n)))") The Gamma Function =================== .. math:: \Gamma(\alpha) = \int_0^\infty t^{\alpha - 1}\,e^{-t}\,dt, \qquad \alpha > 0. The Gamma function is the normalizing constant that makes the Gamma, Beta, Chi-squared, Student's t, and F distributions integrate to one. .. code-block:: python # Gamma function: recursion, factorials, and special values import numpy as np from scipy.special import gamma as gamma_fn from math import factorial # Recursion: Gamma(a+1) = a * Gamma(a) alpha = 3.7 print(f"Gamma({alpha+1}) = {gamma_fn(alpha+1):.6f}") print(f"{alpha} * Gamma({alpha}) = {alpha * gamma_fn(alpha):.6f}") # Factorial connection print(f"\n{'n':>3} {'Gamma(n)':>12} {'(n-1)!':>12}") for n in range(1, 7): print(f"{n:3d} {gamma_fn(n):12.1f} {factorial(n-1):12d}") # Special value print(f"\nGamma(1/2) = {gamma_fn(0.5):.6f}") print(f"sqrt(pi) = {np.sqrt(np.pi):.6f}") The digamma function --------------------- .. math:: \psi(\alpha) = \frac{d}{d\alpha}\log\Gamma(\alpha). The digamma function appears in the score function of every distribution whose normalizing constant involves :math:`\Gamma(\alpha)`. .. code-block:: python # Digamma: the score function of the Gamma normalizing constant import numpy as np from scipy.special import digamma, gammaln alpha = 2.5 # Analytical psi = digamma(alpha) # Numerical: d/d(alpha) log Gamma(alpha) eps = 1e-7 psi_num = (gammaln(alpha + eps) - gammaln(alpha)) / eps print(f"digamma({alpha}) = {psi:.8f}") print(f"numerical: {psi_num:.8f}") # Application: score of Gamma(alpha, beta) w.r.t. alpha beta = 2.0 np.random.seed(42) data = np.random.gamma(alpha, 1/beta, size=100) # Score: log(beta) - psi(alpha) + mean(log x) score_alpha = np.log(beta) - digamma(alpha) + np.mean(np.log(data)) print(f"\nGamma score w.r.t. alpha at true value: {score_alpha:.4f} (should be ~0)") The Beta function ------------------ .. math:: B(\alpha, \beta) = \int_0^1 t^{\alpha - 1}(1 - t)^{\beta - 1}\,dt = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}. .. code-block:: python # Beta function: integration vs Gamma formula import numpy as np from scipy.special import gamma as gamma_fn, beta as beta_fn from scipy import integrate alpha, beta = 2.5, 3.0 numerical, _ = integrate.quad(lambda t: t**(alpha-1) * (1-t)**(beta-1), 0, 1) via_gamma = gamma_fn(alpha) * gamma_fn(beta) / gamma_fn(alpha + beta) via_scipy = beta_fn(alpha, beta) print(f"B({alpha},{beta}) by integration: {numerical:.8f}") print(f"B({alpha},{beta}) via Gamma: {via_gamma:.8f}") print(f"B({alpha},{beta}) via scipy: {via_scipy:.8f}") # Application: verify Beta(alpha, beta) density integrates to 1 def beta_pdf(x, a, b): return x**(a-1) * (1-x)**(b-1) / beta_fn(a, b) integral, _ = integrate.quad(beta_pdf, 0, 1, args=(alpha, beta)) print(f"\nint Beta({alpha},{beta}) pdf = {integral:.8f} (should be 1)") Stirling's Approximation ========================== .. math:: \log(n!) \approx n\log n - n + \tfrac{1}{2}\log(2\pi n). Stirling's approximation simplifies log-likelihoods involving factorials (multinomial, Poisson limit of Binomial) and appears in asymptotic arguments for MLE normality. .. code-block:: python # Stirling's approximation: accuracy at various n import numpy as np from scipy.special import gammaln print(f"{'n':>6} {'log(n!)':>12} {'Stirling':>12} {'Rel Error':>12}") print("-" * 48) for n in [5, 10, 20, 50, 100, 1000]: exact = gammaln(n + 1) stirling = n * np.log(n) - n + 0.5 * np.log(2 * np.pi * n) rel_err = abs(stirling - exact) / abs(exact) print(f"{n:>6} {exact:>12.4f} {stirling:>12.4f} {rel_err:>12.2e}") print("\n(Relative error drops below 1% for n >= 10)") .. code-block:: python # Application: Stirling in the Poisson-Binomial connection # P(X = k) for Binomial(n, lam/n) -> Poisson(lam) as n -> inf import numpy as np from scipy.special import comb from scipy.stats import poisson lam = 3.0 k = 4 print(f"P(X={k}) for Binomial(n, {lam}/n) vs Poisson({lam}):") print(f"{'n':>8} {'Binomial':>12} {'Poisson':>12} {'Abs Diff':>12}") print("-" * 48) for n in [10, 50, 100, 1000, 10000]: p = lam / n binom_prob = comb(n, k, exact=True) * p**k * (1-p)**(n-k) pois_prob = poisson.pmf(k, lam) print(f"{n:8d} {binom_prob:12.8f} {pois_prob:12.8f} {abs(binom_prob-pois_prob):12.2e}") Useful Integrals ================== Gaussian Integral ------------------ .. math:: \int_{-\infty}^{\infty} e^{-ax^2 + bx}\,dx = \sqrt{\frac{\pi}{a}}\,\exp\!\left(\frac{b^2}{4a}\right), \qquad a > 0. This identity is used every time we derive the normalizing constant of a normal distribution or "complete the square" in an exponent. .. code-block:: python # Gaussian integrals: numerical verification import numpy as np from scipy import integrate # Basic: int exp(-x^2) dx = sqrt(pi) val, _ = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf) print(f"int exp(-x^2) dx = {val:.6f}, sqrt(pi) = {np.sqrt(np.pi):.6f}") # With parameters: int exp(-ax^2 + bx) dx a, b = 2.0, 3.0 val_ab, _ = integrate.quad(lambda x: np.exp(-a*x**2 + b*x), -np.inf, np.inf) expected = np.sqrt(np.pi / a) * np.exp(b**2 / (4*a)) print(f"int exp(-{a}x^2+{b}x) dx = {val_ab:.6f}, formula = {expected:.6f}") # Application: verify Normal density integrates to 1 mu, sigma = 3.0, 2.0 normal_integral, _ = integrate.quad( lambda x: (1/(sigma*np.sqrt(2*np.pi))) * np.exp(-(x-mu)**2/(2*sigma**2)), -np.inf, np.inf) print(f"\nN({mu},{sigma}^2) integrates to {normal_integral:.8f}") .. code-block:: python # Completing the square: the trick behind conjugate Bayesian updates # Posterior of Normal mean with Normal prior import numpy as np from scipy import integrate # Prior: mu ~ N(mu_0, tau^2) mu_0, tau = 0.0, 5.0 # Data: x_1, ..., x_n ~ N(mu, sigma^2), sigma known np.random.seed(42) sigma = 2.0 data = np.random.normal(3.0, sigma, size=10) n = len(data) x_bar = data.mean() # Posterior by completing the square: # precision: 1/tau^2 + n/sigma^2 post_prec = 1/tau**2 + n/sigma**2 post_var = 1 / post_prec post_mean = post_var * (mu_0/tau**2 + n*x_bar/sigma**2) print(f"Prior: N({mu_0}, {tau**2})") print(f"Data mean: {x_bar:.4f} (n={n})") print(f"Posterior: N({post_mean:.4f}, {post_var:.4f})") print(f"Post SD: {np.sqrt(post_var):.4f}") # The posterior precision = prior precision + data precision print(f"\nPrior precision: {1/tau**2:.4f}") print(f"Data precision: {n/sigma**2:.4f}") print(f"Posterior precision: {post_prec:.4f}") Gamma Function Integral ------------------------- .. math:: \int_0^\infty x^{\alpha - 1}\,e^{-\beta x}\,dx = \frac{\Gamma(\alpha)}{\beta^\alpha}. This is the normalizing constant of the Gamma distribution and appears whenever we integrate out a rate or precision parameter. .. code-block:: python # Gamma integral: normalizing constant verification import numpy as np from scipy import integrate from scipy.special import gamma as gamma_fn alpha, beta = 3.5, 2.0 numerical, _ = integrate.quad( lambda x: x**(alpha-1) * np.exp(-beta*x), 0, np.inf) analytical = gamma_fn(alpha) / beta**alpha print(f"Numerical: {numerical:.8f}") print(f"Gamma(a)/beta^a: {analytical:.8f}") # Application: verify Gamma(alpha, beta) density integrates to 1 def gamma_pdf(x, a, b): return b**a / gamma_fn(a) * x**(a-1) * np.exp(-b*x) integral, _ = integrate.quad(gamma_pdf, 0, np.inf, args=(alpha, beta)) print(f"\nGamma({alpha},{beta}) pdf integrates to {integral:.8f}") .. admonition:: Conjugacy connection In Bayesian inference, if the prior on a Poisson rate is Gamma(:math:`\alpha_0, \beta_0`) and we observe :math:`n` data points summing to :math:`s`, the posterior is Gamma(:math:`\alpha_0 + s, \beta_0 + n`). The Gamma integral is what makes this conjugate update work. .. code-block:: python # Bayesian conjugacy: Poisson-Gamma in action import numpy as np np.random.seed(42) # Prior: lambda ~ Gamma(alpha_0, beta_0) alpha_0, beta_0 = 2.0, 1.0 # weak prior # Data: n Poisson observations lam_true = 4.0 data = np.random.poisson(lam_true, size=20) n = len(data) s = data.sum() # Posterior: Gamma(alpha_0 + s, beta_0 + n) alpha_post = alpha_0 + s beta_post = beta_0 + n post_mean = alpha_post / beta_post post_var = alpha_post / beta_post**2 # Frequentist MLE for comparison mle = data.mean() se_mle = np.sqrt(mle / n) print(f"Data: n={n}, sum={s}, mean={data.mean():.2f}") print(f"Prior mean: {alpha_0/beta_0:.4f}") print(f"Posterior mean: {post_mean:.4f}") print(f"Posterior SD: {np.sqrt(post_var):.4f}") print(f"MLE: {mle:.4f}") print(f"MLE SE: {se_mle:.4f}") print(f"True lambda: {lam_true}") Moments of the Gaussian ------------------------- For :math:`X \sim \mathcal{N}(\mu, \sigma^2)`: .. math:: E[X^2] &= \mu^2 + \sigma^2, \\ E[X^4] &= \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4. These moments appear when computing the Fisher information for models involving squared observations. .. code-block:: python # Gaussian moments: simulation vs formula import numpy as np np.random.seed(42) mu, sigma = 2.0, 1.5 X = np.random.normal(mu, sigma, size=1_000_000) print(f"{'Moment':>8} {'Simulated':>12} {'Formula':>12}") print("-" * 36) print(f"{'E[X^2]':>8} {np.mean(X**2):12.4f} {mu**2 + sigma**2:12.4f}") print(f"{'E[X^4]':>8} {np.mean(X**4):12.2f} " f"{mu**4 + 6*mu**2*sigma**2 + 3*sigma**4:12.2f}") The Chi-Squared Integral -------------------------- If :math:`X_1, \ldots, X_n \sim \mathcal{N}(0,1)` i.i.d., then :math:`Q = \sum X_i^2 \sim \chi^2_n`. .. math:: f_Q(q) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,q^{n/2 - 1}\,e^{-q/2}. This connects to likelihood ratio tests via Wilks' theorem: :math:`-2\log\Lambda \xrightarrow{d} \chi^2_r`. .. code-block:: python # Chi-squared: simulation and connection to likelihood ratio tests import numpy as np from scipy.stats import chi2 np.random.seed(42) n = 5 n_sim = 100_000 # Simulate: sum of squared standard normals Z = np.random.randn(n_sim, n) Q = np.sum(Z**2, axis=1) print(f"Chi-squared({n}) distribution:") print(f" Mean -- simulated: {Q.mean():.3f}, theory: {n}") print(f" Var -- simulated: {Q.var():.3f}, theory: {2*n}") # Connection to Wilks' theorem: # Under H0, -2 log(L0/L1) ~ chi2(r) where r = # restrictions # This is why the chi2 critical values matter for hypothesis testing print(f"\nChi2 critical values (for likelihood ratio tests):") for df in [1, 2, 5, 10]: for alpha in [0.05, 0.01]: cv = chi2.ppf(1 - alpha, df=df) print(f" df={df}, alpha={alpha}: critical value = {cv:.4f}")