Appendix C: Notation and Glossary
This appendix provides a centralized reference for all mathematical notation and terminology used throughout this guide. When a symbol has multiple common meanings, the one adopted in this text is indicated.
Mathematical Notation
Sets and Spaces
| Symbol | Meaning |
|---|---|
| \(\mathbb{R}\) | The set of real numbers |
| \(\mathbb{R}^n\) | The set of real \(n\)-dimensional column vectors |
| \(\mathbb{R}^{n \times m}\) | The set of real \(n \times m\) matrices |
| \(\mathbb{R}^+\) | The set of strictly positive real numbers \((0, \infty)\) |
| \(\mathbb{Z}\) | The set of integers |
| \(\mathbb{Z}^+\) | The set of positive integers \(\{1, 2, 3, \ldots\}\) |
| \(\mathbb{N}_0\) | The set of non-negative integers \(\{0, 1, 2, \ldots\}\) |
| \(\emptyset\) | The empty set |
| \(\mathcal{X}\) | The sample space (set of all possible outcomes) |
| \(\Theta\) | The parameter space |
| \(\mathcal{S}_n^{++}\) | The cone of \(n \times n\) symmetric positive definite matrices |
| \(\in\) | Element of |
| \(\subset, \subseteq\) | Proper subset, subset (or equal) |
| \(\cup, \cap\) | Union, intersection |
| \(A^c\) | Complement of set \(A\) |
Probability Notation
| Symbol | Meaning |
|---|---|
| \(P(A)\) | Probability of event \(A\) |
| \(P(A \mid B)\) | Conditional probability of \(A\) given \(B\) |
| \(P(A \cap B)\) | Joint probability of \(A\) and \(B\) |
| \(f(x)\) or \(f_X(x)\) | Probability density function (PDF) of a continuous r.v. |
| \(p(x)\) or \(p_X(x)\) | Probability mass function (PMF) of a discrete r.v. |
| \(F(x)\) or \(F_X(x)\) | Cumulative distribution function (CDF) |
| \(f(x \mid \theta)\) | Density of \(X\) given parameter \(\theta\) |
| \(f(\mathbf{x} \mid \boldsymbol{\theta})\) | Joint density of data vector \(\mathbf{x}\) given parameters |
| \(\sim\) | “is distributed as” (e.g., \(X \sim \mathcal{N}(\mu, \sigma^2)\)) |
| \(\stackrel{d}{\to}\) | Convergence in distribution |
| \(\stackrel{p}{\to}\) | Convergence in probability |
| \(\stackrel{a.s.}{\to}\) | Almost sure convergence |
| \(\perp\!\!\!\perp\) | Statistical independence |
| \(\text{i.i.d.}\) | Independent and identically distributed |
Random Variables and Expectations
| Symbol | Meaning |
|---|---|
| \(X, Y, Z\) | Random variables (uppercase) |
| \(x, y, z\) | Observed values / realizations (lowercase) |
| \(\mathbf{X}\) | Random vector or random matrix |
| \(E[X]\) or \(\mu\) | Expected value (mean) of \(X\) |
| \(E[X \mid Y]\) | Conditional expectation of \(X\) given \(Y\) |
| \(\operatorname{Var}(X)\) or \(\sigma^2\) | Variance of \(X\) |
| \(\operatorname{Cov}(X, Y)\) | Covariance of \(X\) and \(Y\) |
| \(\operatorname{Corr}(X, Y)\) or \(\rho\) | Correlation of \(X\) and \(Y\) |
| \(\boldsymbol{\Sigma}\) | Covariance matrix |
| \(M_X(t)\) | Moment generating function of \(X\) |
| \(\phi_X(t)\) | Characteristic function of \(X\) |
| \(E_\theta[\cdot]\) | Expectation taken under the distribution indexed by \(\theta\) |
Likelihood and Inference
| Symbol | Meaning |
|---|---|
| \(L(\theta)\) or \(L(\theta ; \mathbf{x})\) | Likelihood function |
| \(\ell(\theta)\) or \(\ell(\theta ; \mathbf{x})\) | Log-likelihood function, \(\ell = \log L\) |
| \(\hat{\theta}\) or \(\hat{\theta}_{\text{MLE}}\) | Maximum likelihood estimator / estimate |
| \(U(\theta)\) or \(S(\theta)\) | Score function, \(U(\theta) = \partial \ell / \partial \theta\) |
| \(\mathcal{I}(\theta)\) | Fisher information (expected information) |
| \(\mathcal{J}(\theta)\) or \(J(\hat{\theta})\) | Observed information, \(-\partial^2 \ell / \partial \theta^2\) |
| \(\Lambda\) | Likelihood ratio statistic |
| \(R(\theta)\) | Profile likelihood or relative likelihood |
| \(\ell_p(\psi)\) | Profile log-likelihood for parameter of interest \(\psi\) |
| \(\text{se}(\hat{\theta})\) | Standard error of estimator \(\hat{\theta}\) |
| \(\text{AIC}\) | Akaike Information Criterion, \(-2\ell(\hat\theta) + 2p\) |
| \(\text{BIC}\) | Bayesian Information Criterion, \(-2\ell(\hat\theta) + p\log n\) |
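To show how the symbols in this table relate to one another, here is a minimal sketch for an i.i.d. \(\text{Exp}(\lambda)\) sample, where every quantity has a closed form. The data values and the function names (`loglik`, `score`, `observed_info`) are made up for illustration and are not taken from this guide.

```python
import math

# Hypothetical i.i.d. sample, assumed to follow Exp(lambda) with rate lambda.
x = [0.8, 1.9, 0.4, 2.7, 1.2, 0.6, 3.1, 0.9]
n, total = len(x), sum(x)

def loglik(lam):
    """Log-likelihood: l(lambda) = n*log(lambda) - lambda*sum(x)."""
    return n * math.log(lam) - lam * total

def score(lam):
    """Score: U(lambda) = dl/dlambda = n/lambda - sum(x)."""
    return n / lam - total

def observed_info(lam):
    """Observed information: J(lambda) = -d2l/dlambda2 = n/lambda**2."""
    return n / lam ** 2

lam_hat = n / total                             # MLE: the root of U(lambda) = 0
se = 1.0 / math.sqrt(observed_info(lam_hat))    # se(lam_hat) = J(lam_hat)^(-1/2)
p = 1                                           # number of free parameters
aic = -2.0 * loglik(lam_hat) + 2 * p
bic = -2.0 * loglik(lam_hat) + p * math.log(n)

print(f"lam_hat={lam_hat:.4f}  se={se:.4f}  AIC={aic:.3f}  BIC={bic:.3f}")
```

In this model the standard error simplifies to \(\hat\lambda / \sqrt{n}\), which gives a quick sanity check on the observed-information calculation.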
Optimization Notation
| Symbol | Meaning |
|---|---|
| \(\nabla f\) or \(\nabla_{\mathbf{x}} f\) | Gradient of \(f\) with respect to \(\mathbf{x}\) |
| \(\mathbf{H}\) or \(\nabla^2 f\) | Hessian matrix (matrix of second partial derivatives) |
| \(\mathbf{J}\) | Jacobian matrix |
| \(\arg\max_\theta f(\theta)\) | Value of \(\theta\) that maximizes \(f\) |
| \(\arg\min_\theta f(\theta)\) | Value of \(\theta\) that minimizes \(f\) |
| \(\eta\) | Learning rate / step size |
| \(\theta^{(k)}\) | Parameter value at iteration \(k\) of an iterative algorithm |
| \(\epsilon\) | Convergence tolerance |
| \(O(\cdot)\) | Big-O notation (asymptotic upper bound) |
| \(o(\cdot)\) | Little-o notation (asymptotically negligible) |
| \(O_p(\cdot)\) | Stochastic big-O (bounded in probability) |
| \(o_p(\cdot)\) | Stochastic little-o (converges to zero in probability) |
Matrix Notation
| Symbol | Meaning |
|---|---|
| \(\mathbf{A}, \mathbf{B}, \mathbf{C}\) | Matrices (bold uppercase) |
| \(\mathbf{x}, \mathbf{y}, \mathbf{z}\) | Vectors (bold lowercase) |
| \(\mathbf{I}\) or \(\mathbf{I}_n\) | Identity matrix (\(n \times n\)) |
| \(\mathbf{0}\) | Zero vector or zero matrix |
| \(\mathbf{1}\) or \(\mathbf{1}_n\) | Vector of ones |
| \(\mathbf{A}^\top\) | Transpose of \(\mathbf{A}\) |
| \(\mathbf{A}^{-1}\) | Inverse of \(\mathbf{A}\) |
| \(\mathbf{A}^{-\top}\) | \((\mathbf{A}^{-1})^\top = (\mathbf{A}^\top)^{-1}\) |
| \(\det(\mathbf{A})\) or \(\lvert\mathbf{A}\rvert\) | Determinant of \(\mathbf{A}\) |
| \(\operatorname{tr}(\mathbf{A})\) | Trace of \(\mathbf{A}\) |
| \(\operatorname{rank}(\mathbf{A})\) | Rank of \(\mathbf{A}\) |
| \(\operatorname{diag}(\mathbf{A})\) | Vector of diagonal entries of \(\mathbf{A}\) |
| \(\operatorname{diag}(\mathbf{x})\) | Diagonal matrix with entries of \(\mathbf{x}\) on the diagonal |
| \(\lambda_i(\mathbf{A})\) | \(i\)-th eigenvalue of \(\mathbf{A}\) |
| \(\sigma_i(\mathbf{A})\) | \(i\)-th singular value of \(\mathbf{A}\) |
| \(\lVert\mathbf{x}\rVert\) or \(\lVert\mathbf{x}\rVert_2\) | Euclidean (L2) norm, \(\sqrt{\mathbf{x}^\top\mathbf{x}}\) |
| \(\lVert\mathbf{A}\rVert_F\) | Frobenius norm, \(\sqrt{\operatorname{tr}(\mathbf{A}^\top\mathbf{A})}\) |
| \(\mathbf{A} \succ 0\) | \(\mathbf{A}\) is positive definite |
| \(\mathbf{A} \succeq 0\) | \(\mathbf{A}\) is positive semi-definite |
| \(\mathbf{A} \otimes \mathbf{B}\) | Kronecker product of \(\mathbf{A}\) and \(\mathbf{B}\) |
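A few of these definitions can be checked numerically on a small example. The sketch below uses plain Python lists so that it runs without third-party packages; in practice a linear algebra library would normally be used instead. The matrices and the helper names (`transpose`, `matmul`, `trace`, `frobenius`, `kron`, `is_pd_2x2`) are illustrative assumptions.

```python
import math

# Two small example matrices, stored as lists of rows.
A = [[2.0, 1.0],
     [1.0, 3.0]]
B = [[0.0, 1.0],
     [1.0, 0.0]]

def transpose(M):
    """Return M^T."""
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    """Return the matrix product M N."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*N)] for row in M]

def trace(M):
    """tr(M): the sum of the diagonal entries."""
    return sum(M[i][i] for i in range(len(M)))

def frobenius(M):
    """||M||_F computed directly as the root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in M for v in row))

def kron(M, N):
    """Kronecker product: the block matrix whose (i, j) block is M[i][j] * N."""
    return [[m * v for m in mrow for v in nrow] for mrow in M for nrow in N]

def is_pd_2x2(M):
    """Sylvester's criterion for a symmetric 2x2 matrix: both leading minors > 0."""
    return M[0][0] > 0 and M[0][0] * M[1][1] - M[0][1] * M[1][0] > 0

# ||A||_F agrees with sqrt(tr(A^T A)), as stated in the table.
print(frobenius(A), math.sqrt(trace(matmul(transpose(A), A))))
# tr(A kron B) = tr(A) * tr(B).
print(trace(kron(A, B)), trace(A) * trace(B))
# A is symmetric with positive leading minors, so A is positive definite.
print(is_pd_2x2(A))
```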
Named Distributions
The following shorthand is used for standard distributions:
| Notation | Distribution |
|---|---|
| \(\text{Bernoulli}(p)\) | Bernoulli with success probability \(p\) |
| \(\text{Bin}(n, p)\) | Binomial with \(n\) trials and success probability \(p\) |
| \(\text{Poisson}(\lambda)\) | Poisson with rate \(\lambda\) |
| \(\text{Geom}(p)\) | Geometric with success probability \(p\) |
| \(\text{NegBin}(r, p)\) | Negative binomial with \(r\) successes, probability \(p\) |
| \(\mathcal{U}(a, b)\) | Uniform on \([a, b]\) |
| \(\text{Exp}(\lambda)\) | Exponential with rate \(\lambda\) |
| \(\mathcal{N}(\mu, \sigma^2)\) | Normal with mean \(\mu\) and variance \(\sigma^2\) |
| \(\mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) | \(p\)-variate normal with mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\) |
| \(\text{Gamma}(\alpha, \beta)\) | Gamma with shape \(\alpha\) and rate \(\beta\) |
| \(\text{Beta}(\alpha, \beta)\) | Beta with shape parameters \(\alpha\) and \(\beta\) |
| \(\chi^2_n\) | Chi-squared with \(n\) degrees of freedom |
| \(t_n\) | Student’s \(t\) with \(n\) degrees of freedom |
| \(F_{m,n}\) | \(F\)-distribution with \(m\) and \(n\) degrees of freedom |
| \(\text{Mult}(n, \mathbf{p})\) | Multinomial with \(n\) trials and probability vector \(\mathbf{p}\) |
| \(\text{Dir}(\boldsymbol{\alpha})\) | Dirichlet with concentration parameter \(\boldsymbol{\alpha}\) |
| \(\text{Wishart}_p(n, \mathbf{V})\) | Wishart with \(n\) degrees of freedom and scale matrix \(\mathbf{V}\) |
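Software often parametrizes these distributions differently from the table above, so translating notation into code needs some care. The sketch below uses only Python's standard library to illustrate three cases: the Gamma rate versus scale convention, the exponential rate, and the normal variance versus standard deviation. The parameter values are arbitrary, and other libraries may follow yet other conventions, so their documentation should be checked rather than inferred from this example.

```python
import random
import statistics

random.seed(0)   # fixed seed so repeated runs give the same draws
N = 100_000

# Text: Gamma(alpha, beta) has shape alpha and RATE beta, so E[X] = alpha / beta.
# random.gammavariate(alpha, beta) takes shape and SCALE, so pass 1 / rate.
alpha, rate = 3.0, 2.0
gamma_draws = [random.gammavariate(alpha, 1.0 / rate) for _ in range(N)]

# Text: Exp(lambda) has rate lambda, so E[X] = 1 / lambda.
# random.expovariate(lambd) also takes the rate, so no conversion is needed.
lam = 4.0
exp_draws = [random.expovariate(lam) for _ in range(N)]

# Text: N(mu, sigma^2) is written with the VARIANCE as second argument.
# statistics.NormalDist(mu, sigma) takes the STANDARD DEVIATION.
mu, var = 1.0, 9.0
normal = statistics.NormalDist(mu, var ** 0.5)

print("Gamma mean:", statistics.fmean(gamma_draws), "theory:", alpha / rate)
print("Exp mean:  ", statistics.fmean(exp_draws), "theory:", 1.0 / lam)
print("Normal var:", normal.variance, "theory:", var)
```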
Greek Letters and Their Typical Uses
| Letter | Name | Typical Use in Statistics |
|---|---|---|
| \(\alpha\) | alpha | Significance level; shape parameter; Type I error rate |
| \(\beta\) | beta | Regression coefficient; rate parameter; Type II error rate |
| \(\gamma\) | gamma | Skewness; Euler–Mascheroni constant; threshold parameter |
| \(\delta\) | delta | Effect size; small perturbation; Kronecker delta |
| \(\epsilon, \varepsilon\) | epsilon | Error term; small positive quantity; convergence tolerance |
| \(\zeta\) | zeta | Latent variable; link function parameter |
| \(\eta\) | eta | Natural (canonical) parameter; learning rate |
| \(\theta\) | theta | Generic parameter (the most common choice) |
| \(\iota\) | iota | (Rarely used in statistics) |
| \(\kappa\) | kappa | Cumulant; condition number; concentration parameter |
| \(\lambda\) | lambda | Rate parameter; eigenvalue; Lagrange multiplier; penalty |
| \(\mu\) | mu | Mean; location parameter |
| \(\nu\) | nu | Degrees of freedom |
| \(\xi\) | xi | Latent variable; auxiliary parameter |
| \(\pi\) | pi | Prior probability; the constant 3.14159… |
| \(\rho\) | rho | Correlation coefficient; spectral radius |
| \(\sigma\) | sigma | Standard deviation (\(\sigma^2\) = variance) |
| \(\tau\) | tau | Precision (\(1/\sigma^2\)); Kendall’s rank correlation |
| \(\upsilon\) | upsilon | (Rarely used in statistics) |
| \(\phi, \varphi\) | phi | Standard normal density; dispersion parameter; basis function |
| \(\chi\) | chi | Chi-squared distribution |
| \(\psi\) | psi | Digamma function; parameter of interest |
| \(\omega\) | omega | Weight; angular frequency |
Uppercase Greek letters with common statistical uses:
| Letter | Name | Typical Use |
|---|---|---|
| \(\Gamma\) | Gamma | Gamma function; Gamma distribution |
| \(\Delta\) | Delta | Change or difference |
| \(\Theta\) | Theta | Parameter space |
| \(\Lambda\) | Lambda | Likelihood ratio; diagonal matrix of eigenvalues |
| \(\Sigma\) | Sigma | Covariance matrix; summation (\(\sum\)) |
| \(\Phi\) | Phi | Standard normal CDF |
| \(\Psi\) | Psi | Polygamma function |
| \(\Omega\) | Omega | Sample space; precision matrix (\(\Sigma^{-1}\)) |
Glossary of Key Terms
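- Asymptotic normality
The property that the distribution of a suitably centered and scaled estimator approaches a normal distribution as the sample size grows. Under regularity conditions, the MLE from an i.i.d. sample satisfies \(\sqrt{n}(\hat\theta - \theta_0) \stackrel{d}{\to} \mathcal{N}(0, \mathcal{I}(\theta_0)^{-1})\), where \(\mathcal{I}(\theta_0)\) is the Fisher information of a single observation.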
- Bias
The difference \(E[\hat\theta] - \theta\) between the expected value of an estimator and the true parameter value.
- Completeness
A statistic \(T\) is complete if \(E_\theta[g(T)] = 0\) for all \(\theta\) implies that \(g(T) = 0\) almost surely for all \(\theta\).
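- Confidence interval
A random interval \([L(\mathbf{X}), U(\mathbf{X})]\) that contains the true parameter with a specified probability (the confidence level). The probability refers to the interval over repeated samples, not to the parameter, which is treated as fixed.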
- Conjugate prior
A prior distribution that, when combined with the likelihood via Bayes’ theorem, yields a posterior of the same parametric family.
- Consistency
An estimator \(\hat\theta_n\) is consistent if \(\hat\theta_n \stackrel{p}{\to} \theta_0\) as \(n \to \infty\).
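- Cramér–Rao lower bound
A lower bound on the variance of any unbiased estimator: \(\operatorname{Var}(\hat\theta) \geq \mathcal{I}(\theta)^{-1}\), where \(\mathcal{I}(\theta)\) is the Fisher information of the full sample. The bound is not always attainable.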
- Deviance
Twice the difference between the log-likelihood of the saturated model and the fitted model: \(D = 2[\ell_{\text{sat}} - \ell(\hat\theta)]\).
- Efficiency
The ratio of the Cramér–Rao lower bound to the actual variance of an estimator. An efficient estimator achieves the bound.
- EM algorithm
Expectation–Maximization algorithm, an iterative method for finding MLEs when the model involves latent variables or missing data.
- Estimator
A function of the data used to estimate an unknown parameter. The distinction between “estimator” (the rule) and “estimate” (the numerical value) is maintained in this text.
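- Exponential family
A parametric family whose density can be written as \(f(x \mid \theta) = h(x)\exp[\eta(\theta)^\top T(x) - A(\theta)]\), where \(\eta(\theta)\) is the natural parameter, \(T(x)\) the sufficient statistic, \(h(x)\) the base measure, and \(A(\theta)\) the log-normalizer.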
- Fisher information
The variance of the score function, or equivalently the negative expected Hessian of the log-likelihood: \(\mathcal{I}(\theta) = E[U(\theta)^2] = -E[\ell''(\theta)]\).
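- Gradient descent
An iterative optimization algorithm that minimizes a function \(f\) by stepping against its gradient: \(\theta^{(k+1)} = \theta^{(k)} - \eta\,\nabla f(\theta^{(k)})\). To maximize a log-likelihood, apply it to \(f = -\ell\), which gives the gradient ascent update \(\theta^{(k+1)} = \theta^{(k)} + \eta\,\nabla\ell(\theta^{(k)})\).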
- Hessian matrix
The matrix of second partial derivatives of a function: \(H_{ij} = \partial^2 f / \partial \theta_i \partial \theta_j\).
- Kullback–Leibler divergence
A measure of the difference between two distributions: \(\text{KL}(p \| q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx\).
- Likelihood function
The joint density or mass function of the data, viewed as a function of the parameters: \(L(\theta) = f(\mathbf{x} \mid \theta)\).
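- Likelihood ratio test
A hypothesis test based on the statistic \(\Lambda = 2[\ell(\hat\theta) - \ell(\theta_0)]\). Under the null hypothesis and regularity conditions, \(\Lambda\) is asymptotically \(\chi^2\) with degrees of freedom equal to the number of parameters fixed by the null hypothesis.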
- Log-likelihood
The natural logarithm of the likelihood function: \(\ell(\theta) = \log L(\theta)\).
- Maximum likelihood estimator (MLE)
The parameter value that maximizes the likelihood (or equivalently the log-likelihood): \(\hat\theta = \arg\max_\theta \ell(\theta)\).
- Method of moments
An estimation approach that equates sample moments to population moments and solves for the parameters.
- Newton–Raphson method
An iterative root-finding algorithm applied to the score equation: \(\theta^{(k+1)} = \theta^{(k)} - [\ell''(\theta^{(k)})]^{-1}\,\ell'(\theta^{(k)})\).
- Nuisance parameter
A parameter that is not of direct interest but must be accounted for in the inference procedure.
- Observed information
The negative Hessian of the log-likelihood evaluated at the MLE: \(\mathcal{J}(\hat\theta) = -\ell''(\hat\theta)\).
- p-value
The probability, computed under the null hypothesis, of obtaining a test statistic at least as extreme as the one actually observed.
- Power
The probability of correctly rejecting a false null hypothesis: \(1 - \beta\), where \(\beta\) is the Type II error rate.
- Profile likelihood
The likelihood maximized over nuisance parameters: \(L_p(\psi) = \max_\lambda L(\psi, \lambda)\).
- Regularity conditions
Technical conditions (differentiability, integrability, parameter not on boundary) that ensure standard asymptotic results hold for MLEs.
- Score function
The derivative of the log-likelihood with respect to the parameter: \(U(\theta) = \partial\ell / \partial\theta\). Under regularity conditions, \(E[U(\theta_0)] = 0\).
- Sufficient statistic
A statistic \(T(\mathbf{X})\) that captures all the information in the data about the parameter. By the Fisher–Neyman factorization theorem, \(T\) is sufficient if and only if \(f(\mathbf{x} \mid \theta) = g(T(\mathbf{x}), \theta)\,h(\mathbf{x})\) for some non-negative functions \(g\) and \(h\).
- Wald test
A hypothesis test based on the statistic \(W = (\hat\theta - \theta_0)^2 / \widehat{\operatorname{Var}}(\hat\theta)\), which is asymptotically \(\chi^2_1\).
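Several glossary entries (Newton–Raphson, gradient-based updates, observed information, the Wald test, and the likelihood ratio test) can be tied together in one small numerical example. The sketch below assumes i.i.d. Poisson counts with rate \(e^\theta\), so that \(\ell(\theta) = \theta \sum_i x_i - n e^\theta\) up to an additive constant. The data, the null value, the step size, and all function names are illustrative assumptions rather than material from this guide. The iterative update shown alongside Newton–Raphson is gradient ascent on \(\ell\), which is the same as gradient descent on \(-\ell\).

```python
import math

# Hypothetical count data, assumed i.i.d. Poisson with rate exp(theta).
x = [2, 4, 3, 5, 1, 3, 4, 2, 3, 3]
n, total = len(x), sum(x)

def loglik(theta):
    """l(theta) = theta*sum(x) - n*exp(theta), dropping the constant -sum(log x_i!)."""
    return theta * total - n * math.exp(theta)

def score(theta):
    """U(theta) = l'(theta) = sum(x) - n*exp(theta)."""
    return total - n * math.exp(theta)

def hessian(theta):
    """l''(theta) = -n*exp(theta)."""
    return -n * math.exp(theta)

def newton_raphson(theta, tol=1e-10, max_iter=50):
    """Iterate theta <- theta - l'(theta) / l''(theta) until the step is below tol."""
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

def gradient_ascent(theta, eta=0.01, tol=1e-12, max_iter=10_000):
    """Iterate theta <- theta + eta * l'(theta); first-order, so it needs more steps."""
    for _ in range(max_iter):
        step = eta * score(theta)
        theta += step
        if abs(step) < tol:
            break
    return theta

theta_nr = newton_raphson(0.0)
theta_ga = gradient_ascent(0.0)
theta_closed = math.log(total / n)    # closed-form MLE, used only as a check

theta_0 = math.log(2.0)               # null hypothesis: the Poisson rate equals 2
obs_info = -hessian(theta_nr)         # J(theta_hat) = -l''(theta_hat)
wald = (theta_nr - theta_0) ** 2 * obs_info        # Var-hat(theta_hat) = 1 / J(theta_hat)
lrt = 2.0 * (loglik(theta_nr) - loglik(theta_0))   # Lambda = 2[l(theta_hat) - l(theta_0)]

def chi2_1_pvalue(stat):
    """P(chi^2_1 > stat) = erfc(sqrt(stat / 2)), since chi^2_1 is a squared N(0, 1)."""
    return math.erfc(math.sqrt(stat / 2.0))

print(f"NR: {theta_nr:.6f}  GA: {theta_ga:.6f}  closed form: {theta_closed:.6f}")
print(f"Wald W = {wald:.3f} (p = {chi2_1_pvalue(wald):.4f})")
print(f"LRT Lambda = {lrt:.3f} (p = {chi2_1_pvalue(lrt):.4f})")
```

The Wald and likelihood ratio statistics differ in finite samples even though both are asymptotically \(\chi^2_1\) under the null hypothesis, so their p-values need not agree exactly.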
Common Abbreviations
| Abbreviation | Full Name |
|---|---|
| AIC | Akaike Information Criterion |
| BIC | Bayesian Information Criterion |
| CDF | Cumulative distribution function |
| CLT | Central Limit Theorem |
| CRLB | Cramér–Rao lower bound |
| EM | Expectation–Maximization |
| GLM | Generalized linear model |
| i.i.d. | Independent and identically distributed |
| IRLS | Iteratively reweighted least squares |
| KL | Kullback–Leibler |
| LLN | Law of large numbers |
| LRT | Likelihood ratio test |
| MGF | Moment generating function |
| MLE | Maximum likelihood estimator / estimate |
| MSE | Mean squared error |
| MVN | Multivariate normal |
| NR | Newton–Raphson |
| PDF | Probability density function |
| PMF | Probability mass function |
| r.v. | Random variable |
| SVD | Singular value decomposition |
| UMVUE | Uniformly minimum variance unbiased estimator |
| w.r.t. | With respect to |