.. _ch7_multivariate: ========================================== Chapter 7 --- Multivariate Likelihoods ========================================== Moving from scalar to vector- and matrix-valued random variables introduces new mathematical machinery---matrix calculus, determinants, and traces---but the core ideas of likelihood, score, information, and maximum-likelihood estimation carry over intact. This chapter treats the most important multivariate distributions and the calculus tools required for their derivations. If you have made it through the univariate distributions in :ref:`ch5_discrete` and :ref:`ch6_continuous`, the good news is that the *logic* is identical here. The only difference is notation: scalars become vectors, division becomes matrix inversion, and squares become quadratic forms. Let's build up the tools and then put them to work. .. contents:: Topics in this chapter :local: :depth: 1 :class: this-will-duplicate-information-and-it-is-still-useful-here .. _sec_matrix_calculus: 7.1 Matrix Calculus Essentials =============================== Before tackling multivariate likelihoods we collect the key matrix-calculus identities that appear repeatedly in the derivations below. The reader who is already comfortable with these results can skip ahead to :ref:`sec_mvn`. Think of this section as your reference card. You will find yourself coming back to these identities every time you differentiate a multivariate log-likelihood. Notation -------- * :math:`\mathbf{x}` --- a column vector in :math:`\mathbb{R}^p`. * :math:`\mathbf{A}` --- a :math:`p \times p` matrix. * :math:`\text{tr}(\mathbf{A})` --- the trace (sum of diagonal entries). * :math:`|\mathbf{A}|` --- the determinant. * :math:`\mathbf{A}^{-1}` --- the matrix inverse (when it exists). * :math:`\mathbf{A}^\top` --- the transpose. Key identities --------------- 1. **Derivative of a scalar quadratic form.** For a symmetric, positive definite matrix :math:`\boldsymbol{\Sigma}`: .. math:: \frac{\partial}{\partial\boldsymbol{\mu}} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = -2\,\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}). This is the multivariate analogue of :math:`d(x - \mu)^2/d\mu = -2(x - \mu)`. The :math:`\boldsymbol{\Sigma}^{-1}` plays the role of :math:`1/\sigma^2`, rescaling the residual vector to account for different variances and correlations along different directions. This identity is used every time we differentiate a MVN log-likelihood with respect to the mean. 2. **Derivative of log-determinant.** For a positive definite matrix :math:`\boldsymbol{\Sigma}`: .. math:: \frac{\partial}{\partial\boldsymbol{\Sigma}} \ln|\boldsymbol{\Sigma}| = \boldsymbol{\Sigma}^{-1}. Here the derivative is taken in the sense that :math:`[\partial\ln|\boldsymbol{\Sigma}|/\partial\boldsymbol{\Sigma}]_{jk}` is the derivative with respect to :math:`\Sigma_{jk}`. .. admonition:: Intuition This identity is the multivariate version of :math:`d\ln\sigma^2/d\sigma^2 = 1/\sigma^2`. The inverse plays the role of the reciprocal. 3. **Derivative of inverse times vector.** .. math:: \frac{\partial}{\partial\boldsymbol{\Sigma}} \text{tr}\!\left(\boldsymbol{\Sigma}^{-1}\mathbf{A}\right) = -\boldsymbol{\Sigma}^{-1}\mathbf{A}\,\boldsymbol{\Sigma}^{-1}. 4. **Derivative of trace of product.** .. math:: \frac{\partial}{\partial\mathbf{A}}\text{tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^\top. 5. **Quadratic form as trace.** For vectors :math:`\mathbf{a}` and :math:`\mathbf{b}`: .. math:: \mathbf{a}^\top \mathbf{B} \mathbf{a} = \text{tr}(\mathbf{B}\,\mathbf{a}\mathbf{a}^\top). This identity is used constantly to convert sums of quadratic forms into a single trace expression. It works because the left-hand side is a scalar, and a scalar equals its own trace; then we use the cyclic property of the trace to rearrange the terms. 6. **Sum of outer products.** Given observations :math:`\mathbf{x}_1, \dots, \mathbf{x}_n` with sample mean :math:`\bar{\mathbf{x}}`: .. math:: \sum_{i=1}^{n}(\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top = \mathbf{S}_n + n(\bar{\mathbf{x}} - \boldsymbol{\mu}) (\bar{\mathbf{x}} - \boldsymbol{\mu})^\top, where :math:`\mathbf{S}_n = \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top` is the scatter matrix. .. _sec_mvn: 7.2 Multivariate Normal Distribution ====================================== Motivation ---------- The Multivariate Normal (MVN) is the cornerstone of multivariate statistics. It arises from the multivariate Central Limit Theorem, serves as the default model for continuous vector data, and underpins linear regression, principal component analysis, factor analysis, and Gaussian processes. You can think of the MVN as the natural extension of the univariate Normal to higher dimensions. The mean becomes a vector, the variance becomes a covariance matrix, and the bell curve becomes a bell-shaped ellipsoid. PDF --- Let :math:`\mathbf{x} \in \mathbb{R}^p` with mean :math:`\boldsymbol{\mu} \in \mathbb{R}^p` and :math:`p \times p` positive definite covariance matrix :math:`\boldsymbol{\Sigma}`: .. math:: f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-p/2}\,|\boldsymbol{\Sigma}|^{-1/2} \exp\!\left( -\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right). The exponent :math:`(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})` is the squared **Mahalanobis distance** from :math:`\mathbf{x}` to the mean. It generalises :math:`(x - \mu)^2/\sigma^2` by accounting for correlations between components. **The Mahalanobis distance in practice.** A hospital measures three blood markers on each patient. Raw Euclidean distance from the population mean treats all markers equally and ignores correlations. The Mahalanobis distance rescales by the covariance, so a patient who is 2 standard deviations away on a *correlated* pair of markers is properly flagged as unusual. Under a true MVN, the squared Mahalanobis distance follows a :math:`\chi^2_p` distribution, giving a principled threshold for outlier detection. .. code-block:: python # Mahalanobis distance: outlier detection under the MVN import numpy as np from scipy import stats np.random.seed(42) p = 3 n = 500 mu = np.array([5.0, 3.0, 8.0]) Sigma = np.array([[2.0, 0.8, 0.3], [0.8, 1.5, 0.1], [0.3, 0.1, 1.0]]) # Generate "healthy" patients from MVN X = np.random.multivariate_normal(mu, Sigma, size=n) # Plant 5 outlier patients outliers = np.array([[12.0, 7.0, 4.0], [1.0, -2.0, 14.0], [10.0, 8.0, 11.0], [0.0, 0.0, 0.0], [9.0, 6.0, 12.0]]) X_all = np.vstack([X, outliers]) # Compute Mahalanobis distances mu_hat = X.mean(axis=0) Sigma_hat = np.cov(X, rowvar=False) Sigma_inv = np.linalg.inv(Sigma_hat) diff = X_all - mu_hat mahal_sq = np.sum(diff @ Sigma_inv * diff, axis=1) # vectorized quadratic form # Under MVN, mahal^2 ~ chi2(p) -> threshold at 99th percentile threshold = stats.chi2.ppf(0.99, df=p) n_flagged_normal = np.sum(mahal_sq[:n] > threshold) n_flagged_outlier = np.sum(mahal_sq[n:] > threshold) print(f"Chi-squared threshold (p={p}, 99%): {threshold:.2f}") print(f"Normal patients flagged: {n_flagged_normal}/{n} " f"(expect ~{n*0.01:.0f})") print(f"Planted outliers flagged: {n_flagged_outlier}/{len(outliers)}") print(f"\nMahalanobis^2 of outliers: {np.round(mahal_sq[n:], 1)}") print(f"Euclidean distances: " f"{np.round(np.sqrt(np.sum(diff[n:]**2, axis=1)), 1)}") print("Note: Mahalanobis accounts for correlations; Euclidean does not.") Likelihood ---------- For :math:`n` i.i.d. observations :math:`\mathbf{x}_1, \dots, \mathbf{x}_n`: .. math:: L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-np/2}\,|\boldsymbol{\Sigma}|^{-n/2} \exp\!\left( -\tfrac{1}{2}\sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right). Log-likelihood -------------- Using the trace identity :math:`\mathbf{a}^\top\mathbf{B}\mathbf{a} = \text{tr}(\mathbf{B}\mathbf{a}\mathbf{a}^\top)` to collect the sum: .. math:: \ell(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\text{tr}\!\left( \boldsymbol{\Sigma}^{-1}\sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top \right). Define the matrix .. math:: \mathbf{W}(\boldsymbol{\mu}) = \sum_{i=1}^{n}(\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top, so that .. math:: \ell = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\text{tr}\!\left(\boldsymbol{\Sigma}^{-1}\mathbf{W}\right). The entire dataset is summarised by two sufficient statistics: the sample mean :math:`\bar{\mathbf{x}}` and the scatter matrix :math:`\mathbf{W}`. No matter how many observations you have, these are all you need. Score for :math:`\boldsymbol{\mu}` ------------------------------------ Using identity 1 from :ref:`sec_matrix_calculus`: .. math:: \frac{\partial\ell}{\partial\boldsymbol{\mu}} = \boldsymbol{\Sigma}^{-1}\sum_{i=1}^{n}(\mathbf{x}_i - \boldsymbol{\mu}). Setting to zero (and noting :math:`\boldsymbol{\Sigma}^{-1}` is invertible): .. math:: \sum_{i=1}^{n}(\mathbf{x}_i - \boldsymbol{\mu}) = \mathbf{0} \;\Longrightarrow\; \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i. Just as in the univariate case, the MLE for the mean is the sample mean---now a vector. Score for :math:`\boldsymbol{\Sigma}` --------------------------------------- Using identities 2 and 3: .. math:: \frac{\partial\ell}{\partial\boldsymbol{\Sigma}} = -\frac{n}{2}\boldsymbol{\Sigma}^{-1} + \frac{1}{2}\boldsymbol{\Sigma}^{-1}\mathbf{W}\,\boldsymbol{\Sigma}^{-1}. Setting to zero and multiplying on the right by :math:`\boldsymbol{\Sigma}`: .. math:: \frac{n}{2}\boldsymbol{\Sigma}^{-1} = \frac{1}{2}\boldsymbol{\Sigma}^{-1}\mathbf{W}\,\boldsymbol{\Sigma}^{-1} \;\Longrightarrow\; n\mathbf{I} = \mathbf{W}\,\boldsymbol{\Sigma}^{-1} \;\Longrightarrow\; \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\mathbf{W}(\hat{\boldsymbol{\mu}}). Evaluating :math:`\mathbf{W}` at :math:`\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}`: .. math:: \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}}) (\mathbf{x}_i - \bar{\mathbf{x}})^\top = \frac{1}{n}\mathbf{S}_n. As in the univariate case the MLE divides by :math:`n`, not :math:`n-1`. **Computing the MLE and verifying the score equations.** The score equations tell us that at the MLE, the gradient of the log-likelihood must be zero. This is the multivariate analogue of :math:`\partial\ell/\partial\mu = 0` and :math:`\partial\ell/\partial\sigma = 0` from the univariate Normal. Let us generate MVN data, compute the MLE, and verify that the score is indeed zero at the estimates. .. code-block:: python # MVN MLE: estimate mean and covariance, verify score = 0 import numpy as np np.random.seed(42) p = 3 n = 200 mu_true = np.array([1.0, -2.0, 3.0]) Sigma_true = np.array([[2.0, 0.5, 0.3], [0.5, 1.5, -0.2], [0.3, -0.2, 1.0]]) data = np.random.multivariate_normal(mu_true, Sigma_true, size=n) # MLE for mean mu_hat = data.mean(axis=0) # MLE for covariance (divides by n, not n-1) centered = data - mu_hat Sigma_hat = centered.T @ centered / n print("True mean:", mu_true) print("MLE mean: ", np.round(mu_hat, 4)) print("\nTrue Sigma:\n", Sigma_true) print("\nMLE Sigma:\n", np.round(Sigma_hat, 4)) # --- Verify score = 0 at the MLE --- # Score for mu: Sigma_inv @ sum(x_i - mu) Sigma_inv = np.linalg.inv(Sigma_hat) score_mu = Sigma_inv @ np.sum(data - mu_hat, axis=0) print(f"\nScore for mu at MLE: {np.round(score_mu, 12)}") print(f" (should be zero vector)") # Score for Sigma: -n/2 * Sigma_inv + 1/2 * Sigma_inv @ W @ Sigma_inv W = centered.T @ centered # = n * Sigma_hat score_Sigma = -n/2 * Sigma_inv + 0.5 * Sigma_inv @ W @ Sigma_inv print(f"\nScore for Sigma at MLE (Frobenius norm): " f"{np.linalg.norm(score_Sigma):.2e}") print(f" (should be ~0)") .. admonition:: Real-World Example **Portfolio returns.** A financial analyst models the daily returns of 3 stocks as a Multivariate Normal to estimate expected returns and the covariance structure (which determines portfolio risk). .. code-block:: text Scenario: 252 trading days of daily returns for 3 stocks Data: n x 3 matrix of log-returns Goal: estimate mean return vector and covariance matrix Approach: MVN(mu, Sigma) -> MLE gives sample mean and sample covariance Use: portfolio optimisation, Value-at-Risk .. code-block:: python # Portfolio returns: MVN MLE for mean and covariance import numpy as np np.random.seed(42) n_days = 252 # one trading year p = 3 # three stocks # Simulate realistic daily log-returns mu_true = np.array([0.0005, 0.0003, 0.0008]) # daily expected returns Sigma_true = np.array([ [0.0004, 0.0001, 0.00015], [0.0001, 0.0003, 0.00005], [0.00015, 0.00005, 0.0005] ]) returns = np.random.multivariate_normal(mu_true, Sigma_true, size=n_days) mu_hat = returns.mean(axis=0) Sigma_hat = (returns - mu_hat).T @ (returns - mu_hat) / n_days # Annualised estimates mu_annual = mu_hat * 252 vol_annual = np.sqrt(np.diag(Sigma_hat) * 252) corr = Sigma_hat / np.sqrt(np.outer(np.diag(Sigma_hat), np.diag(Sigma_hat))) print("Annualised expected returns:", np.round(mu_annual * 100, 2), "%") print("Annualised volatilities: ", np.round(vol_annual * 100, 2), "%") print("\nCorrelation matrix:\n", np.round(corr, 3)) Fisher information matrix -------------------------- The Fisher information for a single observation is a matrix indexed by the :math:`p` entries of :math:`\boldsymbol{\mu}` and the :math:`p(p+1)/2` unique entries of :math:`\boldsymbol{\Sigma}`. In block form: .. math:: \mathcal{I}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \begin{pmatrix} \boldsymbol{\Sigma}^{-1} & \mathbf{0} \\ \mathbf{0} & \tfrac{1}{2} (\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{\Sigma}^{-1}) \end{pmatrix}, where :math:`\otimes` denotes the Kronecker product (with appropriate duplication-matrix adjustments for the symmetric parametrisation). The key point is that the mean and covariance blocks are **orthogonal** in the information sense, meaning :math:`\bar{\mathbf{x}}` and :math:`\hat{\boldsymbol{\Sigma}}` are asymptotically independent. .. admonition:: Why does this matter? The block-diagonal structure means that your estimate of the mean vector is equally good whether or not you know the covariance, and vice versa. This orthogonality is a special property of the MVN that does *not* hold for most other multivariate distributions. **Verifying the block-diagonal structure.** The Fisher information matrix of the MVN is block-diagonal: the cross-information between :math:`\boldsymbol{\mu}` and :math:`\boldsymbol{\Sigma}` is zero. We can verify this numerically by computing the empirical Fisher information from simulated score vectors and checking that the off-diagonal block is negligible. .. code-block:: python # MVN Fisher information: verify block-diagonal structure import numpy as np np.random.seed(42) p = 2 # keep small for clear display n_sim = 50_000 mu = np.array([1.0, -1.0]) Sigma = np.array([[2.0, 0.5], [0.5, 1.0]]) Sigma_inv = np.linalg.inv(Sigma) # Compute score vectors for many single observations scores = [] for _ in range(n_sim): x = np.random.multivariate_normal(mu, Sigma) # Score for mu: Sigma_inv @ (x - mu), length p s_mu = Sigma_inv @ (x - mu) # Score for vech(Sigma): use the unique upper-triangle entries # d ell / d Sigma = -1/2 Sigma_inv + 1/2 Sigma_inv (x-mu)(x-mu)^T Sigma_inv outer = np.outer(x - mu, x - mu) dL_dSigma = -0.5 * Sigma_inv + 0.5 * Sigma_inv @ outer @ Sigma_inv # Extract upper triangle (unique entries for symmetric matrix) idx = np.triu_indices(p) s_sigma = dL_dSigma[idx] scores.append(np.concatenate([s_mu, s_sigma])) scores = np.array(scores) # Empirical Fisher = Cov(score) = E[score @ score^T] (since E[score]=0) empirical_fisher = scores.T @ scores / n_sim print("Empirical Fisher information matrix:") print(np.round(empirical_fisher, 3)) print(f"\nDimensions: {p} (mu) + {p*(p+1)//2} (vech Sigma) " f"= {empirical_fisher.shape[0]}") print(f"\nOff-diagonal block (mu vs Sigma), should be ~0:") print(np.round(empirical_fisher[:p, p:], 4)) print("\nThis confirms the block-diagonal structure of the Fisher information.") .. _sec_wishart: 7.3 Wishart Distribution ========================== Motivation ---------- The Wishart distribution is the multivariate generalisation of the Chi-squared. If :math:`\mathbf{x}_1, \dots, \mathbf{x}_n` are i.i.d. :math:`\mathcal{N}_p(\mathbf{0}, \boldsymbol{\Sigma})`, then the scatter matrix :math:`\mathbf{S} = \sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^\top` follows a Wishart distribution. It plays a central role in multivariate hypothesis testing and as the likelihood for covariance matrices. Think of the Wishart as the distribution of "sample covariance matrices". Just as the Chi-squared tells you how sample variances behave, the Wishart tells you how sample covariance matrices behave. PDF --- Let :math:`\mathbf{W} \sim \mathcal{W}_p(n, \boldsymbol{\Sigma})` where :math:`n \ge p` (degrees of freedom) and :math:`\boldsymbol{\Sigma}` is :math:`p \times p` positive definite: .. math:: f(\mathbf{W} \mid n, \boldsymbol{\Sigma}) = \frac{|\mathbf{W}|^{(n-p-1)/2}}{ 2^{np/2}\,|\boldsymbol{\Sigma}|^{n/2}\,\Gamma_p(n/2) } \exp\!\left( -\tfrac{1}{2}\text{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{W}) \right), The exponential term :math:`\exp(-\frac{1}{2}\text{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{W}))` is the multivariate analogue of :math:`e^{-w/(2\sigma^2)}` for the Chi-squared. The trace :math:`\text{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{W})` measures the "size" of :math:`\mathbf{W}` relative to :math:`\boldsymbol{\Sigma}` --- larger scatter matrices (relative to the true covariance) are penalised. The :math:`|\mathbf{W}|^{(n-p-1)/2}` term plays the role of :math:`x^{k/2-1}` in the Chi-squared PDF. Here :math:`\Gamma_p` is the multivariate gamma function: .. math:: \Gamma_p(a) = \pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\!\left(a + \frac{1-j}{2}\right). Log-likelihood -------------- For a single observation :math:`\mathbf{W}` with known :math:`n`: .. math:: \ell(\boldsymbol{\Sigma}) = \frac{n-p-1}{2}\ln|\mathbf{W}| - \frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\text{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{W}) + \text{const}. MLE for :math:`\boldsymbol{\Sigma}` ------------------------------------- Differentiating with respect to :math:`\boldsymbol{\Sigma}` using the same identities as for the MVN: .. math:: \frac{\partial\ell}{\partial\boldsymbol{\Sigma}} = -\frac{n}{2}\boldsymbol{\Sigma}^{-1} + \frac{1}{2}\boldsymbol{\Sigma}^{-1}\mathbf{W}\,\boldsymbol{\Sigma}^{-1}. Setting to zero: .. math:: \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\mathbf{W}. Connection to MVN MLE ---------------------- When :math:`\mathbf{W} = \sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}}) (\mathbf{x}_i - \bar{\mathbf{x}})^\top` (the scatter matrix from :math:`n` MVN observations), the Wishart MLE :math:`\mathbf{W}/n` coincides with the MVN MLE :math:`\hat{\boldsymbol{\Sigma}}`. **Simulating scatter matrices and verifying the Wishart distribution.** The Wishart distribution tells us how scatter matrices behave. If we draw many samples of size :math:`n` from a MVN, compute the scatter matrix each time, and look at properties like the expected value and the determinant, the results should match the Wishart theory. The expected value of :math:`\mathbf{W} \sim \mathcal{W}_p(n, \boldsymbol{\Sigma})` is :math:`n\boldsymbol{\Sigma}`. .. code-block:: python # Wishart: simulate scatter matrices, verify E[W] = n * Sigma import numpy as np np.random.seed(42) p = 3 n = 20 # degrees of freedom (sample size per scatter matrix) n_sim = 5000 # number of scatter matrices to simulate Sigma = np.array([[2.0, 0.5, 0.3], [0.5, 1.5, -0.2], [0.3, -0.2, 1.0]]) # Simulate many scatter matrices scatter_matrices = np.zeros((n_sim, p, p)) log_dets = np.zeros(n_sim) for s in range(n_sim): X = np.random.multivariate_normal(np.zeros(p), Sigma, size=n) W = X.T @ X # scatter matrix ~ Wishart(n, Sigma) scatter_matrices[s] = W log_dets[s] = np.linalg.slogdet(W)[1] # Verify E[W] = n * Sigma E_W = scatter_matrices.mean(axis=0) print("E[W] / n (should approximate Sigma):") print(np.round(E_W / n, 3)) print("\nTrue Sigma:") print(Sigma) print(f"\nMax absolute error in E[W]/n: {np.max(np.abs(E_W/n - Sigma)):.4f}") # Verify log-determinant distribution # E[log|W|] = p*log(2) + log|Sigma| + sum(digamma((n+1-j)/2)) from scipy.special import digamma expected_logdet = (p * np.log(2) + np.linalg.slogdet(Sigma)[1] + sum(digamma((n + 1 - j) / 2) for j in range(1, p + 1))) print(f"\nE[log|W|]: simulated = {log_dets.mean():.3f}, " f"theoretical = {expected_logdet:.3f}") **So what?** The Wishart distribution is not just a mathematical curiosity. Every time you report a sample covariance matrix and wonder "how much could this vary?", the Wishart is the answer. The simulation above confirms that the theory correctly predicts the sampling behavior of scatter matrices. .. _sec_inv_wishart: 7.4 Inverse-Wishart Distribution ================================== Motivation ---------- The Inverse-Wishart is the distribution of the inverse of a Wishart-distributed matrix. In Bayesian statistics it serves as the **conjugate prior** for the covariance matrix of a Multivariate Normal: if :math:`\boldsymbol{\Sigma} \sim \text{Inv-Wishart}(\nu, \boldsymbol{\Psi})` and the data are :math:`\mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})`, then the posterior for :math:`\boldsymbol{\Sigma}` is also Inverse-Wishart. This conjugacy makes the Inverse-Wishart extremely convenient for Bayesian analysis. The posterior update has the same form as the prior, with the parameters simply absorbing the data. PDF --- Let :math:`\boldsymbol{\Sigma} \sim \text{IW}_p(\nu, \boldsymbol{\Psi})` where :math:`\nu > p + 1` and :math:`\boldsymbol{\Psi}` is :math:`p \times p` positive definite: .. math:: f(\boldsymbol{\Sigma} \mid \nu, \boldsymbol{\Psi}) = \frac{|\boldsymbol{\Psi}|^{\nu/2}}{ 2^{\nu p/2}\,\Gamma_p(\nu/2) } \,|\boldsymbol{\Sigma}|^{-(\nu+p+1)/2} \exp\!\left( -\tfrac{1}{2}\text{tr}(\boldsymbol{\Psi}\,\boldsymbol{\Sigma}^{-1}) \right). Properties ---------- * **Mean:** :math:`E[\boldsymbol{\Sigma}] = \boldsymbol{\Psi}/(\nu - p - 1)` for :math:`\nu > p + 1`. * **Mode:** :math:`\boldsymbol{\Psi}/(\nu + p + 1)`. Posterior under MVN sampling ----------------------------- Prior: :math:`\boldsymbol{\Sigma} \sim \text{IW}_p(\nu_0, \boldsymbol{\Psi}_0)`. Data: :math:`\mathbf{x}_1, \dots, \mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})`. Posterior (with :math:`\boldsymbol{\mu}` known): .. math:: \boldsymbol{\Sigma} \mid \text{data} \sim \text{IW}_p\!\left( \nu_0 + n,\; \boldsymbol{\Psi}_0 + \sum_{i=1}^{n}(\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top \right). This shows how the prior "pseudo-observations" :math:`\nu_0` and "pseudo-scatter" :math:`\boldsymbol{\Psi}_0` combine with the data scatter matrix. **Bayesian conjugate update in action.** Suppose a lab measures two biomarkers on patients. Before collecting data, the researcher has a prior belief about the covariance structure (perhaps from a pilot study). After observing new patients, the Inverse-Wishart conjugate update combines prior and data seamlessly. We verify that the posterior mean is a weighted average of the prior mean and the sample covariance. .. code-block:: python # Inverse-Wishart conjugate update for MVN covariance import numpy as np np.random.seed(42) p = 2 # True covariance (unknown to the analyst) Sigma_true = np.array([[1.0, 0.6], [0.6, 2.0]]) mu_known = np.array([0.0, 0.0]) # assume mean is known # Prior: IW(nu_0, Psi_0) with E[Sigma] = Psi_0 / (nu_0 - p - 1) nu_0 = 5 Psi_0 = (nu_0 - p - 1) * np.eye(p) # prior mean = I prior_mean = Psi_0 / (nu_0 - p - 1) print("Prior mean for Sigma:") print(prior_mean) # Observe data n = 30 data = np.random.multivariate_normal(mu_known, Sigma_true, size=n) scatter = (data - mu_known).T @ (data - mu_known) # Posterior: IW(nu_0 + n, Psi_0 + scatter) nu_post = nu_0 + n Psi_post = Psi_0 + scatter posterior_mean = Psi_post / (nu_post - p - 1) # Sample covariance for comparison sample_cov = scatter / n print(f"\nSample covariance (n={n}):") print(np.round(sample_cov, 3)) print(f"\nPosterior mean for Sigma (nu_post={nu_post}):") print(np.round(posterior_mean, 3)) print(f"\nTrue Sigma:") print(Sigma_true) # Show it is a weighted combination w_prior = (nu_0 - p - 1) / (nu_post - p - 1) w_data = n / (nu_post - p - 1) weighted = w_prior * prior_mean + w_data * sample_cov print(f"\nWeights: prior={w_prior:.3f}, data={w_data:.3f}") print(f"Weighted combination:") print(np.round(weighted, 3)) print(f"Match posterior mean: {np.allclose(weighted, posterior_mean)}") # Simulate from posterior to get credible intervals n_post_samples = 5000 post_samples_11 = [] post_samples_12 = [] for _ in range(n_post_samples): # IW(nu, Psi) = inv(Wishart(nu, Psi_inv)) W = np.random.multivariate_normal( np.zeros(p), np.linalg.inv(Psi_post), size=nu_post) S = W.T @ W Sigma_sample = np.linalg.inv(S) post_samples_11.append(Sigma_sample[0, 0]) post_samples_12.append(Sigma_sample[0, 1]) post_samples_11 = np.array(post_samples_11) post_samples_12 = np.array(post_samples_12) print(f"\n95% credible interval for Sigma[1,1]: " f"({np.percentile(post_samples_11, 2.5):.3f}, " f"{np.percentile(post_samples_11, 97.5):.3f}), " f"true={Sigma_true[0,0]}") print(f"95% credible interval for Sigma[1,2]: " f"({np.percentile(post_samples_12, 2.5):.3f}, " f"{np.percentile(post_samples_12, 97.5):.3f}), " f"true={Sigma_true[0,1]}") .. _sec_dirichlet: 7.5 Dirichlet Distribution ============================ Motivation ---------- The Dirichlet distribution is the multivariate generalisation of the Beta. It is defined on the probability simplex :math:`\{(p_1, \dots, p_k) : p_j \ge 0,\; \sum p_j = 1\}` and is the conjugate prior for the Multinomial. It is used in topic modelling (Latent Dirichlet Allocation), Bayesian nonparametrics, and any setting where one needs a prior over categorical distributions. If you need a distribution over *distributions*---a probability distribution whose samples are themselves probability vectors---the Dirichlet is your primary tool. PDF --- Let :math:`\mathbf{p} = (p_1, \dots, p_k)` lie on the simplex and let :math:`\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_k)` with each :math:`\alpha_j > 0`: .. math:: f(\mathbf{p} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\prod_{j=1}^{k}\Gamma(\alpha_j)} \prod_{j=1}^{k} p_j^{\alpha_j - 1}, where :math:`\alpha_0 = \sum_{j=1}^{k}\alpha_j` is called the **concentration**. .. admonition:: Intuition The concentration parameter :math:`\alpha_0` controls how "peaked" the distribution is. When :math:`\alpha_0` is large, samples cluster tightly around the mean. When :math:`\alpha_0` is small, samples tend toward the corners of the simplex (sparse distributions). When all :math:`\alpha_j = 1`, you get the Uniform distribution on the simplex. Likelihood ---------- Given :math:`n` i.i.d. observations :math:`\mathbf{p}^{(1)}, \dots, \mathbf{p}^{(n)}` on the simplex: .. math:: L(\boldsymbol{\alpha}) = \left[\frac{\Gamma(\alpha_0)}{\prod_j \Gamma(\alpha_j)}\right]^n \prod_{i=1}^{n}\prod_{j=1}^{k} \bigl(p_j^{(i)}\bigr)^{\alpha_j - 1}. Log-likelihood -------------- .. math:: \ell(\boldsymbol{\alpha}) = n\!\left[\ln\Gamma(\alpha_0) - \sum_{j=1}^{k}\ln\Gamma(\alpha_j)\right] + \sum_{j=1}^{k}(\alpha_j - 1)\sum_{i=1}^{n}\ln p_j^{(i)}. Score function -------------- .. math:: \frac{\partial\ell}{\partial\alpha_j} = n\!\left[\psi(\alpha_0) - \psi(\alpha_j)\right] + \sum_{i=1}^{n}\ln p_j^{(i)}, where :math:`\psi` is the digamma function. Setting each score to zero: .. math:: \psi(\alpha_j) - \psi(\alpha_0) = \frac{1}{n}\sum_{i=1}^{n}\ln p_j^{(i)}. MLE --- Fixed-point iteration ------------------------------- No closed form exists. The standard method is the **fixed-point iteration** of Minka (2000). Define :math:`g_j = (1/n)\sum_i \ln p_j^{(i)}`. Then iterate: .. math:: \alpha_j^{(\text{new})} = \psi^{-1}\!\left(\psi(\alpha_0^{(\text{old})}) + g_j\right), where :math:`\psi^{-1}` is the inverse digamma function (computed via Newton's method), and update :math:`\alpha_0 = \sum_j \alpha_j` after each sweep. This converges reliably for reasonable starting values (e.g., method of moments). **Minka's fixed-point iteration with convergence tracking.** The Dirichlet MLE has no closed form, making it a good case study for iterative optimization. We implement the full algorithm, print a convergence table showing how the estimates improve each iteration, and compare with the true parameters. .. code-block:: python # Dirichlet MLE: Minka's fixed-point iteration with convergence table import numpy as np from scipy.special import digamma, polygamma np.random.seed(42) k = 4 alpha_true = np.array([2.0, 5.0, 3.0, 1.5]) n = 300 # Simulate Dirichlet observations data = np.random.dirichlet(alpha_true, size=n) # Sufficient statistics g = np.log(data).mean(axis=0) # Inverse digamma via Newton's method def inv_digamma(y, tol=1e-10): x = np.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y - digamma(1)) for _ in range(20): x -= (digamma(x) - y) / polygamma(1, x) return x # Fixed-point iteration with convergence tracking alpha_hat = np.ones(k) # starting values print(f"{'Iter':<6} {'alpha_1':<10} {'alpha_2':<10} " f"{'alpha_3':<10} {'alpha_4':<10} {'max change':<12}") print("-" * 58) for iteration in range(50): alpha0 = alpha_hat.sum() psi_alpha0 = digamma(alpha0) alpha_new = np.array([inv_digamma(psi_alpha0 + g[j]) for j in range(k)]) max_change = np.max(np.abs(alpha_new - alpha_hat)) alpha_hat = alpha_new if iteration < 8 or max_change < 1e-6: print(f"{iteration+1:<6} {alpha_hat[0]:<10.4f} {alpha_hat[1]:<10.4f} " f"{alpha_hat[2]:<10.4f} {alpha_hat[3]:<10.4f} {max_change:<12.2e}") if max_change < 1e-8: print(f"\nConverged in {iteration+1} iterations.") break print(f"\nTrue alpha: {alpha_true}") print(f"MLE alpha: {np.round(alpha_hat, 4)}") # Verify score ~ 0 at MLE alpha0_hat = alpha_hat.sum() score = n * (digamma(alpha0_hat) - digamma(alpha_hat)) + n * g print(f"\nScore at MLE: {np.round(score, 8)}") print(f" (should be ~0 for each component)") .. admonition:: Real-World Example **Text topic modelling.** In Latent Dirichlet Allocation (LDA), each document's topic distribution is modelled as a draw from a Dirichlet. Estimating the Dirichlet parameters tells us about the "average" topic mixture and how much variation there is across documents. .. code-block:: text Scenario: 1000 documents, each assigned to a mixture of 5 topics Data: for each document, the proportion of words in each topic Goal: estimate the Dirichlet concentration parameters Approach: Dirichlet(alpha) MLE via Minka's fixed-point iteration .. code-block:: python # Topic modelling: Dirichlet MLE for document-topic mixtures import numpy as np from scipy.special import digamma, polygamma np.random.seed(42) n_docs = 1000 n_topics = 5 alpha_true = np.array([0.5, 1.0, 0.3, 0.8, 0.4]) # sparse topics topic_mixtures = np.random.dirichlet(alpha_true, size=n_docs) # Check sparsity: many documents dominated by 1-2 topics max_topic_prob = topic_mixtures.max(axis=1) print(f"Mean max topic proportion: {max_topic_prob.mean():.3f}") print(f"Proportion with one dominant topic (>0.5): " f"{(max_topic_prob > 0.5).mean():.2%}") # Sufficient statistics and MLE g = np.log(topic_mixtures).mean(axis=0) def inv_digamma(y, tol=1e-10): x = np.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y - digamma(1)) for _ in range(20): x -= (digamma(x) - y) / polygamma(1, x) return x alpha_hat = np.ones(n_topics) for _ in range(100): alpha0 = alpha_hat.sum() alpha_hat = np.array([inv_digamma(digamma(alpha0) + g[j]) for j in range(n_topics)]) print(f"\nTrue alpha: {alpha_true}") print(f"MLE alpha: {np.round(alpha_hat, 4)}") print(f"Concentration: {alpha_hat.sum():.2f} " f"(true: {alpha_true.sum():.2f})") Fisher information ------------------ The Fisher information matrix for a single Dirichlet observation is .. math:: \mathcal{I}_{jl} = \begin{cases} n\bigl[\psi'(\alpha_j) - \psi'(\alpha_0)\bigr], & j = l, \\ -n\,\psi'(\alpha_0), & j \ne l, \end{cases} which can be written compactly as .. math:: \mathcal{I} = \text{diag}\!\left(\psi'(\alpha_1), \dots, \psi'(\alpha_k)\right) - \psi'(\alpha_0)\,\mathbf{1}\mathbf{1}^\top. The matrix is positive definite because the diagonal elements exceed :math:`\psi'(\alpha_0)` (a consequence of the strict convexity of :math:`\ln\Gamma`). .. _sec_multinomial_mv: 7.6 Multinomial Distribution (Multivariate Treatment) ======================================================= Motivation ---------- We gave a brief treatment of the Multinomial in :ref:`sec_multinomial` from the discrete-distribution perspective. Here we revisit it using matrix notation and the Lagrange-multiplier derivation in full detail, as preparation for more sophisticated constrained optimisation problems. This revisit lets you see the same distribution through a different lens. The result is the same, but the derivation via Lagrange multipliers is a technique you will use again and again in multivariate settings. Log-likelihood (restated) -------------------------- Pooling :math:`n` independent multinomial vectors, each with :math:`m` trials and :math:`k` categories: .. math:: \ell(\mathbf{p}) = \text{const} + \sum_{j=1}^{k} S_j \ln p_j, where :math:`S_j = \sum_{i=1}^{n} x_j^{(i)}` is the total count in category :math:`j` and :math:`\sum_j S_j = nm`. MLE with Lagrange multiplier --- full derivation --------------------------------------------------- We want to maximise :math:`\ell(\mathbf{p})` subject to :math:`g(\mathbf{p}) = \sum_{j=1}^{k} p_j - 1 = 0`. **Step 1.** Form the Lagrangian: .. math:: \mathcal{L}(\mathbf{p}, \mu) = \sum_{j=1}^{k} S_j \ln p_j - \mu\!\left(\sum_{j=1}^{k} p_j - 1\right). **Step 2.** Take partial derivatives and set to zero: .. math:: \frac{\partial\mathcal{L}}{\partial p_j} = \frac{S_j}{p_j} - \mu = 0 \;\Longrightarrow\; p_j = \frac{S_j}{\mu}, \qquad j = 1, \dots, k. **Step 3.** Determine :math:`\mu` from the constraint: .. math:: \sum_{j=1}^{k} p_j = 1 \;\Longrightarrow\; \sum_{j=1}^{k}\frac{S_j}{\mu} = 1 \;\Longrightarrow\; \mu = \sum_{j=1}^{k} S_j = nm. **Step 4.** Substitute back: .. math:: \hat{p}_j = \frac{S_j}{nm}. **Step 5.** Verify second-order conditions. The bordered Hessian confirms this is a maximum because each :math:`\partial^2\mathcal{L}/\partial p_j^2 = -S_j/p_j^2 < 0`. **Lagrange multiplier MLE with constraint verification.** The Multinomial MLE is the most natural example of constrained optimization in statistics: we maximize the log-likelihood subject to the constraint that probabilities sum to one. Below we implement the full Lagrange multiplier approach, compute the MLE, and verify that both the first-order conditions and the constraint are satisfied. .. code-block:: python # Multinomial MLE via Lagrange multiplier, with constraint verification import numpy as np np.random.seed(42) k = 5 # categories m = 10 # trials per observation n = 100 # number of observations p_true = np.array([0.1, 0.25, 0.35, 0.15, 0.15]) # Simulate multinomial data data = np.random.multinomial(m, p_true, size=n) # n x k matrix of counts # Sufficient statistics: total counts per category S = data.sum(axis=0) print(f"Total counts S: {S}") print(f"Sum of S (= nm): {S.sum()} (expected: {n*m})") # MLE from Lagrange multiplier derivation: p_hat_j = S_j / (nm) p_hat = S / (n * m) print(f"\nTrue p: {p_true}") print(f"MLE p_hat: {np.round(p_hat, 4)}") # Verify constraint: sum(p_hat) = 1 print(f"\nConstraint check: sum(p_hat) = {p_hat.sum():.10f}") # Verify first-order conditions: S_j / p_hat_j = mu for all j # The Lagrange multiplier mu = nm mu_lagrange = n * m ratios = S / p_hat print(f"\nLagrange multiplier mu = nm = {mu_lagrange}") print(f"S_j / p_hat_j for each j: {ratios}") print(f"All equal to mu? {np.allclose(ratios, mu_lagrange)}") # Log-likelihood at MLE vs at true p def multinomial_loglik(p_vec, counts): return np.sum(counts * np.log(p_vec)) ll_mle = multinomial_loglik(p_hat, S) ll_true = multinomial_loglik(p_true, S) print(f"\nLog-lik at MLE: {ll_mle:.4f}") print(f"Log-lik at true p: {ll_true:.4f}") print(f"Difference: {ll_mle - ll_true:.4f} (MLE >= true, as expected)") # Second-order condition: diagonal of Hessian < 0 hessian_diag = -S / p_hat**2 print(f"\nHessian diagonal: {np.round(hessian_diag, 1)}") print(f"All negative? {np.all(hessian_diag < 0)}") Fisher information matrix (unrestricted parametrisation) --------------------------------------------------------- Working with the free parameters :math:`p_1, \dots, p_{k-1}` (with :math:`p_k = 1 - \sum_{j