.. _ch8_specialized: ========================================== Chapter 8 --- Specialized Likelihoods ========================================== The "standard" likelihood --- the product of i.i.d. densities --- is a powerful tool, but many practical problems require modifications. This chapter surveys the most important variants, each built around a concrete problem where the standard approach breaks down. **Three running scenarios thread through this chapter:** 1. **Censored data** --- a clinical trial where patients drop out before the study ends, so some survival times are only partially observed. 2. **High-dimensional regression** --- gene expression data with 20,000 genes and only 100 patients, where the standard MLE does not exist. 3. **Spatial dependence** --- ecological surveys where nearby plots are correlated, making the full joint likelihood intractable. Each method trades something --- efficiency, full distributional assumptions, exact probability interpretation --- in exchange for tractability or robustness. Knowing which tool to reach for is a core skill in applied statistics. .. contents:: Topics in this chapter :local: :depth: 1 :class: this-will-duplicate-information-and-it-is-still-useful-here .. _sec_profile: 8.1 Profile Likelihood ======================== The problem with nuisance parameters -------------------------------------- You are estimating the mean :math:`\mu` of a Normal distribution, but the variance :math:`\sigma^2` is also unknown. The log-likelihood :math:`\ell(\mu, \sigma^2)` depends on both, yet you only care about :math:`\mu`. Reporting a confidence interval for :math:`\mu` that accounts for uncertainty in :math:`\sigma^2` requires eliminating the nuisance parameter. The **profile likelihood** does this by, for each candidate :math:`\mu`, plugging in the best-fitting :math:`\sigma^2`. Definition ---------- The profile log-likelihood for :math:`\theta` is .. math:: \ell_P(\theta) = \max_{\lambda}\;\ell(\theta, \lambda) = \ell\!\left(\theta,\; \hat{\lambda}(\theta)\right), where :math:`\hat{\lambda}(\theta)` is the constrained MLE of the nuisance parameter :math:`\lambda` holding :math:`\theta` fixed. In plain English: for each hypothetical value of :math:`\theta`, ask "what is the best :math:`\lambda` could be?" and record the resulting log-likelihood. This sweeps out a curve over :math:`\theta` alone, and its peak is at the full MLE :math:`\hat{\theta}`. The profile likelihood is *not* a true likelihood, but it behaves like one for inference purposes. .. code-block:: python # Profile likelihood: the idea in code # For each candidate mu, find the best sigma^2, then record the log-lik import numpy as np np.random.seed(42) mu_true, sigma_true = 5.0, 2.0 n = 50 data = np.random.normal(mu_true, sigma_true, size=n) def full_loglik(mu, sigma2): """Normal log-likelihood for given mu and sigma^2.""" return -n/2 * np.log(2 * np.pi * sigma2) - np.sum((data - mu)**2) / (2 * sigma2) # For each mu, the optimal sigma^2 is the sample variance around mu def sigma2_hat(mu): return np.mean((data - mu)**2) mu_grid = np.linspace(3.5, 6.5, 200) profile_ll = np.array([full_loglik(mu, sigma2_hat(mu)) for mu in mu_grid]) mu_hat = data.mean() print(f"MLE mu_hat = {mu_hat:.4f} (true = {mu_true})") print(f"Profile log-lik at MLE = {full_loglik(mu_hat, sigma2_hat(mu_hat)):.4f}") Confidence intervals via Wilks' theorem ----------------------------------------- Wilks' theorem states that, under regularity conditions, the twice-difference in profile log-likelihoods converges to a Chi-squared distribution: .. math:: 2\bigl[\ell_P(\hat{\theta}) - \ell_P(\theta_0)\bigr] \;\xrightarrow{d}\; \chi^2_1, This is a powerful result: it provides a confidence interval without needing to compute standard errors or Fisher information. A :math:`(1-\alpha)` confidence region is simply the set of parameter values where the profile log-likelihood is "close enough" to its maximum: :math:`\{\theta : 2[\ell_P(\hat\theta) - \ell_P(\theta)] \le \chi^2_{1,\alpha}\}`. .. code-block:: python # Profile CI via Wilks' theorem from scipy.stats import chi2 ll_max = full_loglik(mu_hat, sigma2_hat(mu_hat)) cutoff = chi2.ppf(0.95, df=1) / 2 # Find the CI by checking which mu values are within the cutoff in_ci = mu_grid[profile_ll >= ll_max - cutoff] ci_lo, ci_hi = in_ci[0], in_ci[-1] # Compare to naive Wald CI se_wald = np.sqrt(sigma2_hat(mu_hat) / n) print(f"Profile 95% CI: [{ci_lo:.4f}, {ci_hi:.4f}]") print(f"Wald 95% CI: [{mu_hat - 1.96*se_wald:.4f}, {mu_hat + 1.96*se_wald:.4f}]") print(f"True mu = {mu_true}") .. admonition:: When to use profile likelihood Use profile likelihood when you have nuisance parameters that can be maximised out, and you need confidence intervals for a subset of parameters. It is especially valuable when the Wald approximation is poor (small samples, parameters near boundaries). .. _sec_partial: 8.2 Partial Likelihood ======================== The problem with unknown baselines ------------------------------------ You are running a clinical trial with 200 cancer patients. You want to know whether a new drug reduces mortality, adjusting for age and tumour stage. The survival times have a complex baseline hazard :math:`h_0(t)` that you cannot specify parametrically. The **partial likelihood** extracts the information about covariate effects while leaving the baseline completely unspecified. Cox proportional hazards model ------------------------------- The hazard for individual :math:`i` with covariate vector :math:`\mathbf{z}_i` is .. math:: h(t \mid \mathbf{z}_i) = h_0(t)\,\exp(\boldsymbol{\beta}^\top \mathbf{z}_i), where :math:`h_0(t)` is an unspecified baseline hazard and :math:`\boldsymbol{\beta}` is the parameter of interest. Construction of the partial likelihood ---------------------------------------- At each event time :math:`t_{(j)}`, the probability that the specific individual :math:`d_j` fails (given exactly one failure occurs) is .. math:: \frac{\exp(\boldsymbol{\beta}^\top \mathbf{z}_{d_j})} {\sum_{i \in \mathcal{R}_j}\exp(\boldsymbol{\beta}^\top \mathbf{z}_i)}, where the risk set :math:`\mathcal{R}_j` contains everyone still at risk at time :math:`t_{(j)}`. This formula is a **softmax**: the numerator is the hazard contribution of the individual who actually failed, and the denominator sums the hazard contributions of everyone who *could have* failed. The crucial trick is that :math:`h_0(t_{(j)})` appears in both numerator and denominator, so it cancels completely. The partial likelihood is the product of these conditional probabilities over all :math:`D` observed events: .. math:: L_{\text{partial}}(\boldsymbol{\beta}) = \prod_{j=1}^{D} \frac{\exp(\boldsymbol{\beta}^\top \mathbf{z}_{d_j})} {\sum_{i \in \mathcal{R}_j}\exp(\boldsymbol{\beta}^\top \mathbf{z}_i)}. Each factor asks: "among everyone at risk at this time, how plausible was it that *this particular person* was the one to fail?" A positive coefficient :math:`\beta_k` means higher values of covariate :math:`k` increase the hazard (higher failure risk). .. code-block:: python # Partial likelihood: Cox model from scratch import numpy as np from scipy.optimize import minimize np.random.seed(42) n = 200 # Simulate: drug (0/1) and age (standardized) drug = np.random.binomial(1, 0.5, size=n) age = np.random.randn(n) beta_true = np.array([-0.5, 0.3]) # drug reduces risk, age increases it # Simulate survival times from Cox model with Weibull baseline linear_pred = beta_true[0] * drug + beta_true[1] * age baseline_times = np.random.weibull(1.5, size=n) true_times = baseline_times * np.exp(-linear_pred) # Administrative censoring at 5 years + random dropout censor_times = np.minimum(5.0, np.random.exponential(8.0, size=n)) time = np.minimum(true_times, censor_times) event = (true_times <= censor_times).astype(int) print(f"Events: {event.sum()}, Censored: {n - event.sum()}") # Build covariate matrix Z = np.column_stack([drug, age]) # Partial log-likelihood def partial_loglik(beta): eta = Z @ beta ll = 0.0 for j in range(n): if event[j] == 1: risk_set = time >= time[j] ll += eta[j] - np.log(np.sum(np.exp(eta[risk_set]))) return ll # Maximize result = minimize(lambda b: -partial_loglik(b), x0=[0.0, 0.0], method='Nelder-Mead') beta_hat = result.x print(f"\nTrue beta: drug = {beta_true[0]:.3f}, age = {beta_true[1]:.3f}") print(f"Partial MLE: drug = {beta_hat[0]:.3f}, age = {beta_hat[1]:.3f}") print(f"Hazard ratio (drug): {np.exp(beta_hat[0]):.3f} " f"(<1 means drug is protective)") .. code-block:: python # Partial likelihood: numerical Hessian for SEs eps = 1e-5 H = np.zeros((2, 2)) for i in range(2): for j in range(2): pp = beta_hat.copy(); pp[i] += eps; pp[j] += eps pm = beta_hat.copy(); pm[i] += eps; pm[j] -= eps mp = beta_hat.copy(); mp[i] -= eps; mp[j] += eps mm = beta_hat.copy(); mm[i] -= eps; mm[j] -= eps H[i, j] = (partial_loglik(pp) - partial_loglik(pm) - partial_loglik(mp) + partial_loglik(mm)) / (4 * eps**2) obs_info = -H se = np.sqrt(np.diag(np.linalg.inv(obs_info))) print(f"{'Covariate':<8} {'beta_hat':>10} {'SE':>8} {'95% CI':>22} {'HR':>8}") print("-" * 56) for k, name in enumerate(["drug", "age"]): ci_lo = beta_hat[k] - 1.96 * se[k] ci_hi = beta_hat[k] + 1.96 * se[k] print(f"{name:<8} {beta_hat[k]:10.4f} {se[k]:8.4f} " f"[{ci_lo:8.4f}, {ci_hi:8.4f}] {np.exp(beta_hat[k]):8.3f}") .. admonition:: Why the partial likelihood works The baseline hazard :math:`h_0(t)` is an infinite-dimensional nuisance parameter. The partial likelihood sidesteps it entirely by conditioning on the event times, making the baseline cancel. You can learn about what *increases or decreases* risk without specifying the overall level of risk. .. _sec_marginal: 8.3 Marginal Likelihood ========================= The problem with too many parameters -------------------------------------- A psychometrician fits a random-effects model to test scores from students nested within schools. Each school has its own intercept (a nuisance parameter), but she cares about the between-school variance. With 500 schools, profiling out 500 intercepts is awkward. The **marginal likelihood** integrates out all school effects simultaneously using a prior/mixing distribution. Definition ---------- If :math:`\theta` is the parameter of interest and :math:`\lambda` is a nuisance parameter with distribution :math:`\pi(\lambda)`: .. math:: L_M(\theta) = \int L(\theta, \lambda)\,\pi(\lambda)\,d\lambda. Where the profile likelihood maximizes over the nuisance parameter, the marginal likelihood *averages* over it, weighting each value of :math:`\lambda` by a prior or mixing distribution :math:`\pi(\lambda)`. This is conceptually closer to Bayesian thinking: rather than picking the single best :math:`\lambda`, we account for all possible values. .. code-block:: python # Marginal likelihood: one-way random effects model import numpy as np from scipy import integrate from scipy.stats import norm from scipy.optimize import minimize_scalar np.random.seed(42) n_schools = 30 n_per_school = 20 mu_true = 50.0 # grand mean sigma_b_true = 5.0 # between-school SD sigma_w_true = 10.0 # within-school SD # Simulate data: y_{ij} = mu + b_i + e_{ij} school_effects = np.random.normal(0, sigma_b_true, size=n_schools) data = [] school_ids = [] for i in range(n_schools): y = mu_true + school_effects[i] + np.random.normal(0, sigma_w_true, size=n_per_school) data.append(y) school_ids.extend([i] * n_per_school) data = np.concatenate(data) school_ids = np.array(school_ids) # For fixed sigma_b, the marginal likelihood integrates out school effects # For school i with data y_i, the marginal is Normal: # y_i | mu, sigma_b, sigma_w ~ N(mu * 1, sigma_b^2 * J + sigma_w^2 * I) def marginal_loglik(sigma_b, mu=None, sigma_w=None): if mu is None: mu = data.mean() if sigma_w is None: sigma_w = sigma_w_true # known for simplicity ll = 0.0 for i in range(n_schools): y_i = data[school_ids == i] n_i = len(y_i) # Marginal variance of y_bar_i var_i = sigma_b**2 + sigma_w**2 / n_i y_bar_i = y_i.mean() ll += norm.logpdf(y_bar_i, loc=mu, scale=np.sqrt(var_i)) return ll # Maximize over sigma_b result = minimize_scalar(lambda sb: -marginal_loglik(max(sb, 0.01)), bounds=(0.01, 20), method='bounded') sigma_b_hat = result.x print(f"True sigma_b = {sigma_b_true}") print(f"Marginal MLE sigma_b = {sigma_b_hat:.4f}") print(f"Grand mean = {data.mean():.2f} (true = {mu_true})") Connection to Bayesian model evidence --------------------------------------- The fully marginal likelihood :math:`m(\mathbf{x}) = \int\!\int f(\mathbf{x} \mid \theta, \lambda)\,\pi(\theta)\,\pi(\lambda)\, d\theta\,d\lambda` is the **model evidence**, used for Bayes factors: .. math:: \text{BF}_{12} = \frac{m_1(\mathbf{x})}{m_2(\mathbf{x})}. .. code-block:: python # Bayes factor: comparing two models for the school data # Model 1: sigma_b > 0 (random effects) # Model 2: sigma_b = 0 (no school effects) import numpy as np from scipy.stats import norm # Log marginal likelihood at the MLEs ll_model1 = marginal_loglik(sigma_b_hat) ll_model2 = marginal_loglik(0.001) # effectively sigma_b = 0 log_bf = ll_model1 - ll_model2 print(f"Log Bayes factor (random effects vs. fixed): {log_bf:.2f}") print(f"Bayes factor: {np.exp(log_bf):.2f}") if log_bf > 3: print("=> Strong evidence for between-school variation") .. _sec_conditional: 8.4 Conditional Likelihood ============================ The problem with incidental parameters ---------------------------------------- A matched case-control study has :math:`K = 200` matched pairs. Each pair has its own intercept :math:`\alpha_k` (representing the shared environment of the pair), but you care about the effect :math:`\beta` of an exposure. With 200 nuisance parameters and only 400 observations, the unconditional MLE for :math:`\beta` is inconsistent (the Neyman-Scott problem). The **conditional likelihood** eliminates all :math:`\alpha_k` by conditioning on sufficient statistics. Definition ---------- .. math:: L_C(\theta) = f(\mathbf{x} \mid T = t;\; \theta), where :math:`T` is a sufficient statistic for the nuisance parameters. The idea is elegant: if :math:`T` "captures" all the information about the nuisance parameters, then conditioning on :math:`T = t` removes the nuisance parameters from the probability model. What remains depends only on :math:`\theta`. Example: matched pairs in logistic regression ------------------------------------------------ The logistic model is :math:`\text{logit}\,P(Y_{kj}=1) = \alpha_k + \beta\,x_{kj}`. Conditioning on the pair totals :math:`Y_{k1} + Y_{k0} = 1`: .. math:: L_C(\beta) = \prod_{k=1}^{K} \frac{1}{1 + \exp(-\beta\,(x_{k1} - x_{k0}))}. The :math:`\alpha_k` cancel entirely. .. code-block:: python # Conditional likelihood: matched case-control study import numpy as np from scipy.optimize import minimize_scalar np.random.seed(42) K = 200 # matched pairs beta_true = 0.8 # Simulate exposure for cases and controls x_case = np.random.normal(1.0, 1.0, size=K) # cases tend to be more exposed x_control = np.random.normal(0.0, 1.0, size=K) # Pair-specific intercepts (nuisance) --- these cancel out alpha = np.random.normal(0, 2.0, size=K) # Generate outcomes (one case and one control per pair, total = 1) # The conditional likelihood depends only on d_k = x_case - x_control d = x_case - x_control # Conditional log-likelihood: product of logistic terms def cond_loglik(beta): return np.sum(-np.log(1 + np.exp(-beta * d))) # Maximize result = minimize_scalar(lambda b: -cond_loglik(b), bounds=(-5, 5), method='bounded') beta_hat = result.x # SE from observed information eps = 1e-5 H = (cond_loglik(beta_hat + eps) - 2 * cond_loglik(beta_hat) + cond_loglik(beta_hat - eps)) / eps**2 se = 1.0 / np.sqrt(-H) print(f"True beta = {beta_true}") print(f"Conditional MLE = {beta_hat:.4f}") print(f"SE = {se:.4f}") print(f"95% CI = [{beta_hat - 1.96*se:.4f}, {beta_hat + 1.96*se:.4f}]") print(f"Odds ratio = {np.exp(beta_hat):.3f}") .. _sec_composite: 8.5 Composite Likelihood ========================== The problem with intractable joint distributions -------------------------------------------------- **Running scenario: spatial ecology.** An ecologist surveys 200 plots across a landscape, recording presence/absence of a tree species. Nearby plots are correlated (if a species thrives in one spot, it likely thrives 100m away). The full joint likelihood for 200 correlated binary variables requires a 200-dimensional integral --- intractable. The **composite likelihood** breaks this into manageable pieces: pairs of observations whose bivariate distributions are tractable. Definition ---------- Let :math:`\{A_1, \dots, A_K\}` be subsets of the data. The composite likelihood is .. math:: L_{\text{CL}}(\theta) = \prod_{k=1}^{K} f_{A_k}(\mathbf{x}_{A_k} \mid \theta). The idea is to replace the intractable full joint density with a product of tractable *marginal* or *conditional* densities over subsets of the data. Each factor :math:`f_{A_k}` is a valid probability model for a small piece of the data, and we multiply them together as if they were independent (even though they are not). This is not a true likelihood, but the resulting estimator is consistent. **Pairwise likelihood** (the most common variant) uses all pairs of observations: .. math:: L_{\text{pair}}(\theta) = \prod_{i < j} f(x_i, x_j \mid \theta). Each pair's bivariate density is easy to compute, even when the full :math:`n`-dimensional joint density is not. .. code-block:: python # Composite likelihood: spatial ecology scenario import numpy as np from scipy.optimize import minimize_scalar from scipy.stats import norm np.random.seed(42) n_plots = 200 # Spatial coordinates on a 2D landscape coords = np.column_stack([ np.random.uniform(0, 10, size=n_plots), np.random.uniform(0, 10, size=n_plots) ]) # True spatial correlation: exponential with range parameter theta theta_true = 2.0 dists = np.sqrt(((coords[:, None] - coords[None, :])**2).sum(axis=2)) Sigma = np.exp(-dists / theta_true) Sigma += 0.01 * np.eye(n_plots) # numerical stability # Simulate correlated binary data via latent Gaussian L_chol = np.linalg.cholesky(Sigma) z = L_chol @ np.random.randn(n_plots) presence = (z > 0).astype(int) print(f"Species present at {presence.sum()}/{n_plots} plots") Here is the key issue: the full joint likelihood requires computing the determinant and inverse of a 200x200 correlation matrix for binary data --- and the probit link makes this a 200-dimensional integral over a truncated multivariate normal. Instead, we use **pairwise composite likelihood**, considering only pairs of nearby plots. .. code-block:: python # Pairwise composite log-likelihood from scipy.stats import mvn def pairwise_cll(theta): """Pairwise composite log-likelihood for spatial binary data.""" rho_mat = np.exp(-dists / theta) cll = 0.0 n_pairs = 0 for i in range(n_plots): for j in range(i + 1, n_plots): if dists[i, j] > 4.0: # only nearby pairs contribute continue rho = rho_mat[i, j] # Bivariate probit probability # P(Y_i=y_i, Y_j=y_j) via bivariate normal CDF lo_i = 0.0 if presence[i] == 1 else -np.inf hi_i = np.inf if presence[i] == 1 else 0.0 lo_j = 0.0 if presence[j] == 1 else -np.inf hi_j = np.inf if presence[j] == 1 else 0.0 cov = np.array([[1.0, rho], [rho, 1.0]]) lower = np.array([lo_i, lo_j]) upper = np.array([hi_i, hi_j]) # Use scipy.stats.mvn for bivariate normal rectangle prob p, _ = mvn.mvnun(lower, upper, np.array([0.0, 0.0]), cov) cll += np.log(max(p, 1e-15)) n_pairs += 1 return cll, n_pairs # Optimize over theta def neg_cll(theta): val, _ = pairwise_cll(max(theta, 0.1)) return -val result = minimize_scalar(neg_cll, bounds=(0.3, 8.0), method='bounded') theta_hat = result.x _, n_pairs_used = pairwise_cll(theta_hat) print(f"\nTrue theta = {theta_true}") print(f"Composite MLE theta = {theta_hat:.4f}") print(f"Pairs used: {n_pairs_used}") Properties ---------- * The composite MLE is consistent but not fully efficient. * Standard errors require the **Godambe sandwich correction** (also called the "robust" or "sandwich" variance): .. math:: \text{Var}(\hat\theta_{\text{CL}}) \approx H^{-1}\,J\,H^{-1}, where :math:`H = -E[\nabla^2 \ell_{\text{CL}}]` (the sensitivity, measuring curvature) and :math:`J = \text{Var}(\nabla\ell_{\text{CL}})` (the variability of the composite score). The "sandwich" name comes from :math:`J` being sandwiched between two copies of :math:`H^{-1}`. For a true likelihood, :math:`H = J` and the sandwich reduces to the usual :math:`H^{-1}`, but for a composite likelihood the mismatch between :math:`H` and :math:`J` inflates the standard errors --- the price we pay for not using the full joint distribution. .. code-block:: python # Sandwich SE for composite likelihood (numerical) eps = 1e-4 cll_plus, _ = pairwise_cll(theta_hat + eps) cll_center, _ = pairwise_cll(theta_hat) cll_minus, _ = pairwise_cll(theta_hat - eps) # H = -second derivative (sensitivity) H_num = -(cll_plus - 2 * cll_center + cll_minus) / eps**2 # For the sandwich, we would ideally compute J from per-pair scores. # As a rough approximation, use the inverse Hessian SE: se_naive = 1.0 / np.sqrt(abs(H_num)) print(f"Composite MLE = {theta_hat:.4f}") print(f"Naive SE = {se_naive:.4f}") print(f"95% CI = [{theta_hat - 1.96*se_naive:.4f}, {theta_hat + 1.96*se_naive:.4f}]") print(f"True theta = {theta_true}") .. _sec_quasi: 8.6 Quasi-Likelihood ====================== The problem with unknown distributions ---------------------------------------- An entomologist counts insect larvae on crop plants. The counts show variance about 3 times the mean (overdispersion). She does not want to commit to a specific distribution (Poisson? Negative Binomial?), but she does know that the variance is proportional to the mean. **Quasi-likelihood** requires only a mean--variance relationship and produces valid inference even without a fully specified distribution. Definition ---------- Suppose we specify only two things about the distribution: 1. The mean: :math:`E[Y_i] = \mu_i(\boldsymbol{\beta})`. 2. The variance: :math:`\text{Var}(Y_i) = \phi\,V(\mu_i)`, where :math:`V(\mu)` is a known **variance function** and :math:`\phi` is an unknown scale (the dispersion parameter). We do *not* specify the full probability distribution. The **quasi-score** mimics what the score function would look like if we had a full likelihood: .. math:: q_i(\boldsymbol{\beta}) = \frac{y_i - \mu_i}{V(\mu_i)}\,\frac{\partial\mu_i}{\partial\boldsymbol{\beta}}. Reading this formula: :math:`(y_i - \mu_i)/V(\mu_i)` is a standardized residual (deviation from the mean, scaled by the variance function), and :math:`\partial\mu_i/\partial\boldsymbol{\beta}` propagates the residual back to the parameter space. Setting :math:`\sum q_i = 0` gives estimating equations that produce consistent estimates even without knowing the true distribution. .. code-block:: python # Quasi-likelihood: overdispersed Poisson regression import numpy as np from scipy.optimize import minimize np.random.seed(42) n = 200 # Covariates: fertiliser amount (standardized) x = np.random.randn(n) beta_true = np.array([1.5, 0.8]) # intercept and slope (log link) # True mean: mu = exp(beta0 + beta1 * x) eta_true = beta_true[0] + beta_true[1] * x mu_true = np.exp(eta_true) # Simulate overdispersed counts (NegBin as stand-in for unknown dist) phi_true = 3.0 # overdispersion factor # NegBin with mean mu and var = phi * mu r_nb = mu_true / (phi_true - 1) # NegBin r parameter p_nb = r_nb / (r_nb + mu_true) # NegBin p parameter y = np.array([np.random.negative_binomial(max(ri, 0.1), pi) for ri, pi in zip(r_nb, p_nb)]) print(f"Sample mean = {y.mean():.2f}, Sample var = {y.var():.2f}") print(f"Var/Mean ratio = {y.var()/y.mean():.2f} (true phi ~ {phi_true})") # Quasi-Poisson: V(mu) = mu, estimate beta by solving quasi-score = 0 # This is equivalent to Poisson GLM (ignoring overdispersion for point estimates) X = np.column_stack([np.ones(n), x]) def quasi_score_norm(beta): eta = X @ beta mu = np.exp(eta) # Quasi-score: sum of [(y - mu) / V(mu)] * d_mu/d_beta # With log link: d_mu/d_beta = mu * X residuals = (y - mu) / mu # (y-mu)/V(mu) with V(mu) = mu score = X.T @ (residuals * mu) # = X.T @ (y - mu) return -np.sum(score**2) # minimize negative squared norm # Solve via IRLS (equivalent to Poisson GLM fit) result = minimize(lambda b: np.sum((y - np.exp(X @ b))**2 / np.exp(X @ b)), x0=[0.0, 0.0], method='Nelder-Mead') beta_hat = result.x # Estimate dispersion from Pearson residuals mu_hat = np.exp(X @ beta_hat) pearson_resid = (y - mu_hat) / np.sqrt(mu_hat) phi_hat = np.sum(pearson_resid**2) / (n - 2) print(f"\nTrue beta: intercept = {beta_true[0]:.3f}, slope = {beta_true[1]:.3f}") print(f"Quasi MLE: intercept = {beta_hat[0]:.3f}, slope = {beta_hat[1]:.3f}") print(f"Estimated dispersion phi = {phi_hat:.2f} (true ~ {phi_true})") # Sandwich SE (accounts for overdispersion) # Naive SE from Poisson * sqrt(phi) = quasi SE H = np.zeros((2, 2)) for i in range(n): xi = X[i:i+1].T H += mu_hat[i] * (xi @ xi.T) naive_cov = np.linalg.inv(H) quasi_cov = phi_hat * naive_cov quasi_se = np.sqrt(np.diag(quasi_cov)) print(f"\nQuasi-SEs: intercept = {quasi_se[0]:.4f}, slope = {quasi_se[1]:.4f}") print(f"95% CI for slope: [{beta_hat[1]-1.96*quasi_se[1]:.4f}, " f"{beta_hat[1]+1.96*quasi_se[1]:.4f}]") .. _sec_pseudo: 8.7 Pseudo-Likelihood ======================= The problem with intractable normalizing constants ----------------------------------------------------- A neuroscientist models the joint firing patterns of 50 neurons as an Ising model on a graph. The joint probability involves a normalizing constant that sums over :math:`2^{50}` configurations --- utterly intractable. The **pseudo-likelihood** replaces the joint density with the product of full conditional densities, each of which is a simple logistic function. Definition ---------- .. math:: L_{\text{PL}}(\theta) = \prod_{i=1}^{n} f(x_i \mid x_{-i};\; \theta). For a Markov random field, each conditional depends only on the neighbours: .. math:: P(x_i = 1 \mid x_{\partial i}) = \frac{\exp\bigl(\beta\sum_{j \in \partial i} x_j\bigr)} {1 + \exp\bigl(\beta\sum_{j \in \partial i} x_j\bigr)}. .. code-block:: python # Pseudo-likelihood: Ising model on a grid import numpy as np from scipy.optimize import minimize_scalar np.random.seed(42) grid_size = 20 n = grid_size * grid_size # 400 nodes beta_true = 0.4 # Simulate Ising model via Gibbs sampling x = np.random.choice([0, 1], size=(grid_size, grid_size)) for sweep in range(500): # burn-in for i in range(grid_size): for j in range(grid_size): neighbours = [] if i > 0: neighbours.append(x[i-1, j]) if i < grid_size-1: neighbours.append(x[i+1, j]) if j > 0: neighbours.append(x[i, j-1]) if j < grid_size-1: neighbours.append(x[i, j+1]) s = sum(neighbours) prob = 1 / (1 + np.exp(-beta_true * s)) x[i, j] = int(np.random.rand() < prob) print(f"Fraction of 1s: {x.mean():.3f}") # Pseudo-log-likelihood def pseudo_loglik(beta): pll = 0.0 for i in range(grid_size): for j in range(grid_size): neighbours = [] if i > 0: neighbours.append(x[i-1, j]) if i < grid_size-1: neighbours.append(x[i+1, j]) if j > 0: neighbours.append(x[i, j-1]) if j < grid_size-1: neighbours.append(x[i, j+1]) s = sum(neighbours) eta = beta * s pll += x[i, j] * eta - np.log(1 + np.exp(eta)) return pll # Maximize result = minimize_scalar(lambda b: -pseudo_loglik(b), bounds=(0.0, 2.0), method='bounded') beta_hat = result.x # SE from curvature eps = 1e-5 H = (pseudo_loglik(beta_hat+eps) - 2*pseudo_loglik(beta_hat) + pseudo_loglik(beta_hat-eps)) / eps**2 se = 1.0 / np.sqrt(-H) print(f"True beta = {beta_true}") print(f"Pseudo-MLE = {beta_hat:.4f}") print(f"SE = {se:.4f}") print(f"95% CI = [{beta_hat - 1.96*se:.4f}, {beta_hat + 1.96*se:.4f}]") .. _sec_empirical: 8.8 Empirical Likelihood ========================== The problem with distributional assumptions ---------------------------------------------- A labour economist has income data that are heavily right-skewed. She wants a confidence interval for the population mean, but the data are far from Normal and the sample is too small for the CLT to give accurate coverage. The **empirical likelihood** constructs likelihood-ratio-type confidence intervals without assuming any parametric family, automatically adapting to skewness. Definition ---------- Given observations :math:`x_1, \dots, x_n`, the empirical likelihood for the mean :math:`\mu` is .. math:: L_{\text{EL}}(\mu) = \max_{w_i} \left\{\prod_{i=1}^{n} w_i \;\Bigg|\; \sum w_i(x_i - \mu) = 0,\; \sum w_i = 1,\; w_i \ge 0 \right\}. Owen's theorem: :math:`-2\ln R(\mu_0) \xrightarrow{d} \chi^2_1`. .. code-block:: python # Empirical likelihood: CI for the mean of skewed data import numpy as np from scipy.optimize import brentq from scipy.stats import chi2 np.random.seed(42) # Simulate right-skewed income data (log-normal) n = 60 data = np.random.lognormal(mean=3.5, sigma=0.8, size=n) print(f"Sample mean = {data.mean():.2f}") print(f"Sample median = {np.median(data):.2f}") print(f"Skewness ~ {((data - data.mean())**3).mean() / data.std()**3:.2f}") # Compute -2 log R(mu) for a given mu def neg2logR(mu): """Compute -2 log empirical likelihood ratio at mu.""" g = data - mu # Find Lagrange multiplier lambda via root-finding def equation(lam): return np.sum(g / (1 + lam * g)) # Bracket: lam must satisfy 1 + lam * g_i > 0 for all i lam_lo = -1 / (max(g) + 1e-10) + 1e-10 if max(g) > 0 else -1e6 lam_hi = -1 / (min(g) - 1e-10) - 1e-10 if min(g) < 0 else 1e6 lam_lo = max(lam_lo, -1e6) lam_hi = min(lam_hi, 1e6) try: lam = brentq(equation, lam_lo, lam_hi) return 2 * np.sum(np.log(1 + lam * g)) except: return np.inf # Find EL confidence interval: {mu : -2 log R(mu) <= chi2_{0.95}} cutoff = chi2.ppf(0.95, df=1) x_bar = data.mean() # Search for CI boundaries mu_grid = np.linspace(data.mean() * 0.7, data.mean() * 1.3, 500) el_values = np.array([neg2logR(mu) for mu in mu_grid]) in_ci = mu_grid[el_values <= cutoff] if len(in_ci) > 0: el_ci_lo, el_ci_hi = in_ci[0], in_ci[-1] else: el_ci_lo, el_ci_hi = np.nan, np.nan # Compare to Wald CI se = data.std() / np.sqrt(n) wald_ci_lo = x_bar - 1.96 * se wald_ci_hi = x_bar + 1.96 * se print(f"\nEmpirical likelihood 95% CI: [{el_ci_lo:.2f}, {el_ci_hi:.2f}]") print(f"Wald 95% CI: [{wald_ci_lo:.2f}, {wald_ci_hi:.2f}]") print(f"EL CI is asymmetric: left width = {x_bar - el_ci_lo:.2f}, " f"right width = {el_ci_hi - x_bar:.2f}") print(f"(Asymmetry correctly reflects the right skew of the data)") .. _sec_censored: 8.9 Censored and Truncated Likelihoods ======================================== The problem with incomplete observations ------------------------------------------ **Running scenario: clinical trial with dropouts.** A cancer trial follows 200 patients for up to 5 years. Some patients die during the trial (observed events). Others are lost to follow-up or are still alive when the study ends (censored). You cannot simply throw away censored observations --- that would bias your estimates. And you cannot treat censored times as actual death times --- that would also be wrong. The likelihood must reflect *exactly what you know* about each patient. The core principle: for each observation, contribute to the likelihood exactly the information you have. * Event observed: contribute the density :math:`f(y_i)`. * Right censored (only know :math:`T_i > y_i`): contribute :math:`S(y_i)`. * Left censored (only know :math:`T_i \le y_i`): contribute :math:`F(y_i)`. * Interval censored (:math:`T_i \in (a_i, b_i]`): contribute :math:`F(b_i) - F(a_i)`. Right-censored likelihood -------------------------- .. math:: L(\theta) = \prod_{i=1}^{n} f(y_i;\;\theta)^{\delta_i}\,S(y_i;\;\theta)^{1-\delta_i}. where :math:`\delta_i = 1` if the event was observed, 0 if censored. .. code-block:: python # Censored likelihood: clinical trial with Exponential survival import numpy as np np.random.seed(42) n = 200 lam_true = 0.3 # true mortality rate # Simulate true event times ~ Exp(lambda) true_times = np.random.exponential(1 / lam_true, size=n) # Simulate censoring: administrative at 5 years + random dropout admin_censor = 5.0 * np.ones(n) dropout_censor = np.random.exponential(8.0, size=n) censor_times = np.minimum(admin_censor, dropout_censor) # What we actually observe y = np.minimum(true_times, censor_times) delta = (true_times <= censor_times).astype(int) d = delta.sum() print(f"Events: {d}, Censored: {n - d}") print(f"Total person-years: {y.sum():.1f}") print(f"Median observed time: {np.median(y):.2f} years") Now the standard Poisson-type MLE :math:`\hat\lambda = n / \sum y_i` would be *wrong* because it treats censored observations as events. The censored MLE correctly uses only the number of *events* in the numerator. .. code-block:: python # Censored Exponential log-likelihood # ell(lambda) = d * log(lambda) - lambda * sum(y_i) lam_grid = np.linspace(0.05, 0.8, 300) loglik = d * np.log(lam_grid) - lam_grid * y.sum() lam_peak = lam_grid[np.argmax(loglik)] lam_hat = d / y.sum() # exact MLE print(f"Grid search MLE: {lam_peak:.4f}") print(f"Exact MLE: {lam_hat:.4f}") print(f"Naive (wrong): {n / y.sum():.4f}") print(f"True lambda: {lam_true}") .. code-block:: python # Censored Exponential: SE and CI # Fisher info for censored exponential: approx d / lambda^2 se = lam_hat / np.sqrt(d) ci_lo = lam_hat - 1.96 * se ci_hi = lam_hat + 1.96 * se # Verify score = 0 at MLE score = d / lam_hat - y.sum() print(f"MLE = {lam_hat:.4f}, SE = {se:.4f}, " f"95% CI = [{ci_lo:.4f}, {ci_hi:.4f}], true = {lam_true}") print(f"Score at MLE = {score:.2e}") print(f"Median survival (estimated): {np.log(2)/lam_hat:.2f} years") print(f"Median survival (true): {np.log(2)/lam_true:.2f} years") .. code-block:: python # Why censoring matters: compare correct vs. naive estimates import numpy as np # Correct: events / total person-time lam_correct = d / y.sum() # Wrong approach 1: pretend censored = events lam_naive1 = n / y.sum() # Wrong approach 2: throw away censored observations lam_naive2 = d / y[delta == 1].sum() print(f"{'Method':<35} {'Estimate':>10} {'Error':>8}") print("-" * 56) print(f"{'Correct (censored likelihood)':<35} {lam_correct:10.4f} {abs(lam_correct - lam_true):8.4f}") print(f"{'Wrong: all obs are events':<35} {lam_naive1:10.4f} {abs(lam_naive1 - lam_true):8.4f}") print(f"{'Wrong: drop censored obs':<35} {lam_naive2:10.4f} {abs(lam_naive2 - lam_true):8.4f}") print(f"{'True lambda':<35} {lam_true:10.4f}") Truncation ---------- **Left truncation** (delayed entry): individual :math:`i` is observed only if :math:`T_i > t_{\text{entry},i}`. The contribution is :math:`f(y_i) / S(t_{\text{entry},i})`. .. code-block:: python # Left truncation: delayed entry in a survival study import numpy as np from scipy.optimize import minimize_scalar np.random.seed(42) n_potential = 500 lam_true = 0.2 # Simulate event times true_times = np.random.exponential(1 / lam_true, size=n_potential) # Left truncation: only observe individuals who survive past entry time entry_times = np.random.uniform(0, 2.0, size=n_potential) observed = true_times > entry_times y = true_times[observed] t_entry = entry_times[observed] n_obs = observed.sum() print(f"Potential subjects: {n_potential}") print(f"Observed (survived past entry): {n_obs}") # Truncated log-likelihood: sum of [log f(y_i) - log S(t_entry_i)] # For Exp(lambda): log f(y) = log(lambda) - lambda*y # log S(t) = -lambda * t def trunc_loglik(lam): return np.sum(np.log(lam) - lam * y - (-lam * t_entry)) result = minimize_scalar(lambda l: -trunc_loglik(l), bounds=(0.01, 2.0), method='bounded') lam_hat = result.x # Naive (ignoring truncation) lam_naive = n_obs / y.sum() print(f"\nTrue lambda = {lam_true}") print(f"Truncation-adjusted MLE = {lam_hat:.4f}") print(f"Naive MLE (ignoring truncation) = {lam_naive:.4f}") .. _sec_penalized: 8.10 Penalized Likelihood =========================== The problem with too many predictors -------------------------------------- **Running scenario: gene expression.** A geneticist measures the expression of 20,000 genes in 100 cancer patients and wants to predict tumour growth rate. With :math:`p = 20{,}000` predictors and :math:`n = 100` observations, the ordinary least-squares estimator does not exist --- :math:`\mathbf{X}^\top\mathbf{X}` is singular. Even if it did exist, it would overfit catastrophically. **Penalized likelihood** adds a penalty that shrinks coefficients toward zero, trading a little bias for a dramatic reduction in variance. General form ------------- .. math:: \ell_{\text{pen}}(\theta) = \ell(\theta) - \lambda\,P(\theta), where :math:`P(\theta) \ge 0` is a penalty and :math:`\lambda \ge 0` controls the strength of regularisation. .. code-block:: python # The problem: OLS fails when p >> n import numpy as np np.random.seed(42) n_patients = 100 n_genes = 500 # using 500 for speed; real genomics has 20000+ n_causal = 10 X = np.random.randn(n_patients, n_genes) true_beta = np.zeros(n_genes) causal_idx = np.random.choice(n_genes, n_causal, replace=False) true_beta[causal_idx] = np.random.randn(n_causal) * 2 y = X @ true_beta + np.random.randn(n_patients) * 0.5 # OLS: X'X is singular XtX = X.T @ X print(f"Dimensions: n = {n_patients}, p = {n_genes}") print(f"Rank of X'X = {np.linalg.matrix_rank(XtX)} (need {n_genes} for OLS)") print(f"=> OLS is impossible. We need penalization.") Ridge regression (L2 penalty) ------------------------------ .. math:: \hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + 2\lambda\sigma^2\mathbf{I})^{-1} \mathbf{X}^\top\mathbf{y}. Ridge always exists, even when :math:`\mathbf{X}^\top\mathbf{X}` is singular. .. code-block:: python # Ridge regression: closed-form solution import numpy as np lam_ridge = 1.0 I = np.eye(n_genes) beta_ridge = np.linalg.solve(X.T @ X + 2 * lam_ridge * I, X.T @ y) # How well does it recover the true signal? mse_ridge = np.mean((beta_ridge - true_beta)**2) nonzero_ridge = np.sum(np.abs(beta_ridge) > 0.01) print(f"Ridge MSE(beta): {mse_ridge:.6f}") print(f"Ridge nonzero coefficients: {nonzero_ridge} (true: {n_causal})") print(f"Ridge does NOT produce sparsity --- all {n_genes} coefficients are nonzero") LASSO (L1 penalty) -------------------- .. math:: P(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_1 = \sum_j |\beta_j|. The L1 penalty sets some coefficients to **exactly zero**, performing simultaneous estimation and variable selection. .. code-block:: python # LASSO: sparse solution via coordinate descent (scikit-learn) import numpy as np from sklearn.linear_model import LassoCV np.random.seed(42) lasso = LassoCV(cv=5, max_iter=10000) lasso.fit(X, y) selected = np.where(lasso.coef_ != 0)[0] recovered = set(selected) & set(causal_idx) false_positives = set(selected) - set(causal_idx) print(f"Best lambda: {lasso.alpha_:.4f}") print(f"True causal genes: {n_causal}") print(f"Selected by LASSO: {len(selected)}") print(f"Correctly identified: {len(recovered)}") print(f"False positives: {len(false_positives)}") print(f"R^2 (training): {lasso.score(X, y):.4f}") .. code-block:: python # Compare Ridge vs LASSO: which coefficients survive? import numpy as np print(f"{'Gene':>6} {'True':>8} {'Ridge':>8} {'LASSO':>8} {'Causal?':>8}") print("-" * 44) # Show causal genes for idx in sorted(causal_idx): marker = "***" print(f"{idx:6d} {true_beta[idx]:8.3f} {beta_ridge[idx]:8.3f} " f"{lasso.coef_[idx]:8.3f} {marker:>8}") # Show a few non-causal genes for comparison non_causal = [i for i in range(min(5, n_genes)) if i not in causal_idx] for idx in non_causal[:3]: print(f"{idx:6d} {true_beta[idx]:8.3f} {beta_ridge[idx]:8.3f} " f"{lasso.coef_[idx]:8.3f} {'':>8}") Elastic net ----------- .. math:: P(\boldsymbol{\beta}) = \alpha\|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2}\|\boldsymbol{\beta}\|_2^2. The elastic net combines L1 sparsity with L2 grouping: when covariates are correlated, it tends to select them together. Choosing the tuning parameter ------------------------------- :math:`\lambda` is typically chosen by **cross-validation**: divide the data into :math:`K` folds, fit on :math:`K-1`, evaluate on the held-out fold, and select the :math:`\lambda` that minimises prediction error. .. code-block:: python # Cross-validation path: how many genes are selected at each lambda? import numpy as np from sklearn.linear_model import lasso_path alphas, coefs, _ = lasso_path(X, y, n_alphas=50) print(f"{'lambda':>10} {'Nonzero':>10} {'MSE':>10}") print("-" * 34) for i in range(0, len(alphas), 10): n_nonzero = np.sum(coefs[:, i] != 0) pred = X @ coefs[:, i] mse = np.mean((y - pred)**2) print(f"{alphas[i]:10.4f} {n_nonzero:10d} {mse:10.4f}") Connection to Bayesian priors ------------------------------- .. list-table:: Penalties and their Bayesian priors :header-rows: 1 :widths: 25 25 30 * - Penalty - Prior - Effect * - L2 (Ridge) - Gaussian - Shrinkage, no sparsity * - L1 (LASSO) - Laplace - Shrinkage + sparsity * - Elastic net - Gaussian + Laplace - Shrinkage + grouped sparsity .. _sec_specialized_summary: 8.11 Summary ============= .. list-table:: When to use each specialised likelihood :header-rows: 1 :widths: 25 50 * - Method - Use when... * - Profile likelihood - Nuisance parameters can be maximised out; you need confidence intervals for a subset of parameters. * - Partial likelihood - The baseline hazard is unspecified (Cox model); you care about regression coefficients, not the baseline. * - Marginal likelihood - You want to integrate out nuisance parameters; Bayesian model comparison via Bayes factors. * - Conditional likelihood - A sufficient statistic for the nuisance exists; matched designs. * - Composite likelihood - The full joint density is intractable; spatial or high-dimensional data where marginals/pairs are manageable. * - Quasi-likelihood - You specify only a mean--variance relationship, not a full distribution; overdispersed GLM data. * - Pseudo-likelihood - Markov random fields or network models with intractable normalising constants. * - Empirical likelihood - You want nonparametric, likelihood-ratio-based inference without specifying a distributional family. * - Censored / truncated - Observations are incompletely observed (survival data, detection limits). * - Penalized likelihood - Over-parametrised models; you need regularisation for prediction or variable selection. .. code-block:: python # Decision tree: which likelihood variant do you need? import numpy as np def diagnose_likelihood(n_obs, n_params, has_censoring, has_spatial_corr, has_nuisance, dist_known): """Print which specialized likelihood to consider.""" print("Likelihood diagnosis:") print(f" n = {n_obs}, p = {n_params}") recommendations = [] if has_censoring: recommendations.append("Censored/truncated likelihood (Sec 8.9)") if has_spatial_corr: recommendations.append("Composite or pseudo-likelihood (Sec 8.5, 8.7)") if has_nuisance: recommendations.append("Profile, marginal, or conditional likelihood (Sec 8.1, 8.3, 8.4)") if n_params > n_obs: recommendations.append("Penalized likelihood (Sec 8.10)") if not dist_known: recommendations.append("Quasi-likelihood or empirical likelihood (Sec 8.6, 8.8)") if not recommendations: recommendations.append("Standard likelihood should work fine!") for r in recommendations: print(f" => {r}") # Example: clinical trial print("--- Clinical trial ---") diagnose_likelihood(200, 3, has_censoring=True, has_spatial_corr=False, has_nuisance=False, dist_known=True) print("\n--- Gene expression ---") diagnose_likelihood(100, 20000, has_censoring=False, has_spatial_corr=False, has_nuisance=False, dist_known=True) print("\n--- Spatial ecology ---") diagnose_likelihood(500, 3, has_censoring=False, has_spatial_corr=True, has_nuisance=False, dist_known=True) .. admonition:: Choosing the right likelihood variant When facing a new problem, ask yourself these questions in order: 1. **Are all observations fully observed?** If not, use censored/truncated likelihoods (Section 8.9). 2. **Is the joint distribution tractable?** If not, consider composite, pseudo-, or partial likelihood (Sections 8.2, 8.5, 8.7). 3. **Are there nuisance parameters?** If so, choose between profile (maximise out), marginal (integrate out), or conditional (condition out) likelihoods (Sections 8.1, 8.3, 8.4). 4. **Is the model over-parametrised?** If so, add a penalty (Section 8.10). 5. **Do you want to avoid distributional assumptions?** Consider empirical or quasi-likelihood (Sections 8.6, 8.8).