Appendix C: Notation and Glossary

This appendix provides a centralized reference for all mathematical notation and terminology used throughout this guide. When a symbol has multiple common meanings, the one adopted in this text is indicated.

Mathematical Notation

Sets and Spaces

Symbol	Meaning
\(\mathbb{R}\)	The set of real numbers
\(\mathbb{R}^n\)	The set of real \(n\)-dimensional column vectors
\(\mathbb{R}^{n \times m}\)	The set of real \(n \times m\) matrices
\(\mathbb{R}^+\)	The set of strictly positive real numbers \((0, \infty)\)
\(\mathbb{Z}\)	The set of integers
\(\mathbb{Z}^+\)	The set of positive integers \(\{1, 2, 3, \ldots\}\)
\(\mathbb{N}_0\)	The set of non-negative integers \(\{0, 1, 2, \ldots\}\)
\(\emptyset\)	The empty set
\(\mathcal{X}\)	The sample space (set of all possible outcomes)
\(\Theta\)	The parameter space
\(\mathcal{S}_n^{++}\)	The cone of \(n \times n\) symmetric positive definite matrices
\(\in\)	Element of
\(\subset, \subseteq\)	Proper subset, subset (or equal)
\(\cup, \cap\)	Union, intersection
\(A^c\)	Complement of set \(A\)

Probability Notation

Symbol	Meaning
\(P(A)\)	Probability of event \(A\)
\(P(A \mid B)\)	Conditional probability of \(A\) given \(B\)
\(P(A \cap B)\)	Joint probability of \(A\) and \(B\)
\(f(x)\) or \(f_X(x)\)	Probability density function (PDF) of a continuous r.v.
\(p(x)\) or \(p_X(x)\)	Probability mass function (PMF) of a discrete r.v.
\(F(x)\) or \(F_X(x)\)	Cumulative distribution function (CDF)
\(f(x \mid \theta)\)	Density of \(X\) given parameter \(\theta\)
\(f(\mathbf{x} \mid \boldsymbol{\theta})\)	Joint density of data vector \(\mathbf{x}\) given parameters
\(\sim\)	“is distributed as” (e.g., \(X \sim \mathcal{N}(\mu, \sigma^2)\))
\(\stackrel{d}{\to}\)	Convergence in distribution
\(\stackrel{p}{\to}\)	Convergence in probability
\(\stackrel{a.s.}{\to}\)	Almost sure convergence
\(\perp\!\!\!\perp\)	Statistical independence
\(\text{i.i.d.}\)	Independent and identically distributed

Random Variables and Expectations

Symbol	Meaning
\(X, Y, Z\)	Random variables (uppercase)
\(x, y, z\)	Observed values / realizations (lowercase)
\(\mathbf{X}\)	Random vector or random matrix
\(E[X]\) or \(\mu\)	Expected value (mean) of \(X\)
\(E[X \mid Y]\)	Conditional expectation of \(X\) given \(Y\)
\(\operatorname{Var}(X)\) or \(\sigma^2\)	Variance of \(X\)
\(\operatorname{Cov}(X, Y)\)	Covariance of \(X\) and \(Y\)
\(\operatorname{Corr}(X, Y)\) or \(\rho\)	Correlation of \(X\) and \(Y\)
\(\boldsymbol{\Sigma}\)	Covariance matrix
\(M_X(t)\)	Moment generating function of \(X\)
\(\phi_X(t)\)	Characteristic function of \(X\)
\(E_\theta[\cdot]\)	Expectation taken under the distribution indexed by \(\theta\)

Likelihood and Inference

Symbol	Meaning
\(L(\theta)\) or \(L(\theta ; \mathbf{x})\)	Likelihood function
\(\ell(\theta)\) or \(\ell(\theta ; \mathbf{x})\)	Log-likelihood function, \(\ell = \log L\)
\(\hat{\theta}\) or \(\hat{\theta}_{\text{MLE}}\)	Maximum likelihood estimator / estimate
\(U(\theta)\) or \(S(\theta)\)	Score function, \(U(\theta) = \partial \ell / \partial \theta\)
\(\mathcal{I}(\theta)\)	Fisher information (expected information)
\(\mathcal{J}(\theta)\) or \(J(\hat{\theta})\)	Observed information, \(-\partial^2 \ell / \partial \theta^2\)
\(\Lambda\)	Likelihood ratio statistic
\(R(\theta)\)	Profile likelihood or relative likelihood
\(\ell_p(\psi)\)	Profile log-likelihood for parameter of interest \(\psi\)
\(\text{se}(\hat{\theta})\)	Standard error of estimator \(\hat{\theta}\)
\(\text{AIC}\)	Akaike Information Criterion, \(-2\ell(\hat\theta) + 2p\)
\(\text{BIC}\)	Bayesian Information Criterion, \(-2\ell(\hat\theta) + p\log n\)
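
As a minimal sketch of how the symbols in this table fit together, the Python snippet below evaluates them for an i.i.d. \(\text{Exp}(\lambda)\) sample. The choice of model, the data values, and the function names (`log_lik`, `score`, `observed_info`) are illustrative assumptions and do not come from the guide itself.

```python
import math

# Hypothetical data, assumed i.i.d. Exp(lambda) with rate lambda.
x = [0.8, 1.9, 0.4, 2.7, 1.2]
n, total = len(x), sum(x)

def log_lik(lam):
    """Log-likelihood l(lambda) = n*log(lambda) - lambda*sum(x)."""
    return n * math.log(lam) - lam * total

def score(lam):
    """Score U(lambda) = dl/dlambda = n/lambda - sum(x)."""
    return n / lam - total

def observed_info(lam):
    """Observed information J(lambda) = -d2l/dlambda2 = n/lambda**2."""
    return n / lam ** 2

lam_hat = n / total                            # closed-form MLE; solves U(lambda) = 0
se = 1.0 / math.sqrt(observed_info(lam_hat))   # se(lam_hat) = J(lam_hat)^(-1/2)

p = 1                                          # number of free parameters
aic = -2 * log_lik(lam_hat) + 2 * p
bic = -2 * log_lik(lam_hat) + p * math.log(n)

print(lam_hat, se, aic, bic)
```

For this model the standard error reduces to \(\hat\lambda / \sqrt{n}\), which gives a quick check on the code.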

Optimization Notation

Symbol	Meaning
\(\nabla f\) or \(\nabla_{\mathbf{x}} f\)	Gradient of \(f\) with respect to \(\mathbf{x}\)
\(\mathbf{H}\) or \(\nabla^2 f\)	Hessian matrix (matrix of second partial derivatives)
\(\mathbf{J}\)	Jacobian matrix
\(\arg\max_\theta f(\theta)\)	Value of \(\theta\) that maximizes \(f\)
\(\arg\min_\theta f(\theta)\)	Value of \(\theta\) that minimizes \(f\)
\(\eta\)	Learning rate / step size
\(\theta^{(k)}\)	Parameter value at iteration \(k\) of an iterative algorithm
\(\epsilon\)	Convergence tolerance
\(O(\cdot)\)	Big-O notation (asymptotic upper bound)
\(o(\cdot)\)	Little-o notation (asymptotically negligible)
\(O_p(\cdot)\)	Stochastic big-O (bounded in probability)
\(o_p(\cdot)\)	Stochastic little-o (converges to zero in probability)

Matrix Notation

Symbol	Meaning
\(\mathbf{A}, \mathbf{B}, \mathbf{C}\)	Matrices (bold uppercase)
\(\mathbf{x}, \mathbf{y}, \mathbf{z}\)	Vectors (bold lowercase)
\(\mathbf{I}\) or \(\mathbf{I}_n\)	Identity matrix (\(n \times n\))
\(\mathbf{0}\)	Zero vector or zero matrix
\(\mathbf{1}\) or \(\mathbf{1}_n\)	Vector of ones
\(\mathbf{A}^\top\)	Transpose of \(\mathbf{A}\)
\(\mathbf{A}^{-1}\)	Inverse of \(\mathbf{A}\)
\(\mathbf{A}^{-\top}\)	\((\mathbf{A}^{-1})^\top = (\mathbf{A}^\top)^{-1}\)
\(\det(\mathbf{A})\) or \(|\mathbf{A}|\)	Determinant of \(\mathbf{A}\)
\(\operatorname{tr}(\mathbf{A})\)	Trace of \(\mathbf{A}\)
\(\operatorname{rank}(\mathbf{A})\)	Rank of \(\mathbf{A}\)
\(\operatorname{diag}(\mathbf{A})\)	Vector of diagonal entries of \(\mathbf{A}\)
\(\operatorname{diag}(\mathbf{x})\)	Diagonal matrix with entries of \(\mathbf{x}\) on the diagonal
\(\lambda_i(\mathbf{A})\)	\(i\)-th eigenvalue of \(\mathbf{A}\)
\(\sigma_i(\mathbf{A})\)	\(i\)-th singular value of \(\mathbf{A}\)
\(\|\mathbf{x}\|\) or \(\|\mathbf{x}\|_2\)	Euclidean (L2) norm, \(\sqrt{\mathbf{x}^\top\mathbf{x}}\)
\(\|\mathbf{A}\|_F\)	Frobenius norm, \(\sqrt{\operatorname{tr}(\mathbf{A}^\top\mathbf{A})}\)
\(\mathbf{A} \succ 0\)	\(\mathbf{A}\) is positive definite
\(\mathbf{A} \succeq 0\)	\(\mathbf{A}\) is positive semi-definite
\(\mathbf{A} \otimes \mathbf{B}\)	Kronecker product of \(\mathbf{A}\) and \(\mathbf{B}\)
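
A few of the entries above can be checked numerically. The sketch below uses plain Python lists and a hypothetical \(2 \times 2\) matrix; the helper names (`transpose`, `matmul`, `trace`, `diag_of_matrix`, `diag_of_vector`) are made up for illustration. It shows the two meanings of \(\operatorname{diag}\) and confirms that the Frobenius norm defined through the trace equals the square root of the sum of squared entries.

```python
import math

# Small symmetric example matrix (hypothetical values).
A = [[2.0, 1.0],
     [1.0, 3.0]]

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(M, N):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*N)]
            for row in M]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def diag_of_matrix(M):
    """diag(A): vector of diagonal entries of a matrix."""
    return [M[i][i] for i in range(len(M))]

def diag_of_vector(v):
    """diag(x): diagonal matrix with the entries of a vector on the diagonal."""
    return [[v[i] if i == j else 0.0 for j in range(len(v))]
            for i in range(len(v))]

# Frobenius norm two ways: sqrt(tr(A^T A)) and sqrt(sum of squared entries).
frob = math.sqrt(trace(matmul(transpose(A), A)))
frob_direct = math.sqrt(sum(a * a for row in A for a in row))

# Determinant of a 2x2 matrix.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]

# For a symmetric matrix, positive definiteness is equivalent to all leading
# principal minors being positive (Sylvester's criterion).
is_pd = A[0][0] > 0 and det > 0

print(frob, frob_direct, det, is_pd)
```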

Named Distributions

The following shorthand is used for standard distributions:

Notation	Distribution
\(\text{Bernoulli}(p)\)	Bernoulli with success probability \(p\)
\(\text{Bin}(n, p)\)	Binomial with \(n\) trials and success probability \(p\)
\(\text{Poisson}(\lambda)\)	Poisson with rate \(\lambda\)
\(\text{Geom}(p)\)	Geometric with success probability \(p\)
\(\text{NegBin}(r, p)\)	Negative binomial with \(r\) successes, probability \(p\)
\(\mathcal{U}(a, b)\)	Uniform on \([a, b]\)
\(\text{Exp}(\lambda)\)	Exponential with rate \(\lambda\)
\(\mathcal{N}(\mu, \sigma^2)\)	Normal with mean \(\mu\) and variance \(\sigma^2\)
\(\mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})\)	\(p\)-variate normal with mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\)
\(\text{Gamma}(\alpha, \beta)\)	Gamma with shape \(\alpha\) and rate \(\beta\)
\(\text{Beta}(\alpha, \beta)\)	Beta with shape parameters \(\alpha\) and \(\beta\)
\(\chi^2_n\)	Chi-squared with \(n\) degrees of freedom
\(t_n\)	Student’s \(t\) with \(n\) degrees of freedom
\(F_{m,n}\)	\(F\)-distribution with \(m\) and \(n\) degrees of freedom
\(\text{Mult}(n, \mathbf{p})\)	Multinomial with \(n\) trials and probability vector \(\mathbf{p}\)
\(\text{Dir}(\boldsymbol{\alpha})\)	Dirichlet with concentration parameter \(\boldsymbol{\alpha}\)
\(\text{Wishart}_p(n, \mathbf{V})\)	Wishart with \(n\) degrees of freedom and scale matrix \(\mathbf{V}\)
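
Software libraries do not always share the parameterizations in this table, so it helps to check the convention before simulating. As one example, Python's standard-library `random.gammavariate` takes a shape and a scale, whereas \(\text{Gamma}(\alpha, \beta)\) above uses shape and rate; `random.expovariate` takes the rate directly. The sketch below, with hypothetical parameter values, converts between the two and compares sample means with the theoretical means \(\alpha/\beta\) and \(1/\lambda\).

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

alpha, beta = 3.0, 2.0   # shape and RATE, as in Gamma(alpha, beta) above
lam = 0.5                # rate of Exp(lambda)
n_draws = 100_000

# random.gammavariate expects (shape, SCALE), so pass scale = 1 / rate.
gamma_draws = [random.gammavariate(alpha, 1.0 / beta) for _ in range(n_draws)]

# random.expovariate expects the rate, matching Exp(lambda) above.
exp_draws = [random.expovariate(lam) for _ in range(n_draws)]

gamma_mean = statistics.fmean(gamma_draws)   # theoretical mean: alpha / beta
exp_mean = statistics.fmean(exp_draws)       # theoretical mean: 1 / lam

print(gamma_mean, alpha / beta)
print(exp_mean, 1.0 / lam)
```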

Greek Letters and Their Typical Uses

Letter	Name	Typical Use in Statistics
\(\alpha\)	alpha	Significance level; shape parameter; Type I error rate
\(\beta\)	beta	Regression coefficient; rate parameter; Type II error rate
\(\gamma\)	gamma	Skewness; Euler–Mascheroni constant; threshold parameter
\(\delta\)	delta	Effect size; small perturbation; Kronecker delta
\(\epsilon, \varepsilon\)	epsilon	Error term; small positive quantity; convergence tolerance
\(\zeta\)	zeta	Latent variable; link function parameter
\(\eta\)	eta	Natural (canonical) parameter; learning rate
\(\theta\)	theta	Generic parameter (the most common choice)
\(\iota\)	iota	(Rarely used in statistics)
\(\kappa\)	kappa	Cumulant; condition number; concentration parameter
\(\lambda\)	lambda	Rate parameter; eigenvalue; Lagrange multiplier; penalty
\(\mu\)	mu	Mean; location parameter
\(\nu\)	nu	Degrees of freedom
\(\xi\)	xi	Latent variable; auxiliary parameter
\(\pi\)	pi	Prior probability; the constant 3.14159…
\(\rho\)	rho	Correlation coefficient; spectral radius
\(\sigma\)	sigma	Standard deviation (\(\sigma^2\) = variance)
\(\tau\)	tau	Precision (\(1/\sigma^2\)); Kendall’s rank correlation
\(\upsilon\)	upsilon	(Rarely used in statistics)
\(\phi, \varphi\)	phi	Standard normal density; dispersion parameter; basis function
\(\chi\)	chi	Chi-squared distribution
\(\psi\)	psi	Digamma function; parameter of interest
\(\omega\)	omega	Weight; angular frequency

Uppercase Greek letters with common statistical uses:

Letter	Name	Typical Use
\(\Gamma\)	Gamma	Gamma function; Gamma distribution
\(\Delta\)	Delta	Change or difference
\(\Theta\)	Theta	Parameter space
\(\Lambda\)	Lambda	Likelihood ratio; diagonal matrix of eigenvalues
\(\Sigma\)	Sigma	Covariance matrix; summation (\(\sum\))
\(\Phi\)	Phi	Standard normal CDF
\(\Psi\)	Psi	Polygamma function
\(\Omega\)	Omega	Sample space; precision matrix (\(\Sigma^{-1}\))

Glossary of Key Terms

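Asymptotic normality: The property that the distribution of a suitably centered and scaled estimator approaches a normal distribution as the sample size grows. For the MLE under regularity conditions, \(\sqrt{n}(\hat\theta - \theta_0) \stackrel{d}{\to} \mathcal{N}(0, \mathcal{I}(\theta_0)^{-1})\), where \(\mathcal{I}(\theta_0)\) here denotes the Fisher information of a single observation.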
Bias: The difference \(E[\hat\theta] - \theta\) between the expected value of an estimator and the true parameter value.
Completeness: A statistic \(T\) is complete if \(E_\theta[g(T)] = 0\) for all \(\theta\) implies that \(g(T) = 0\) almost surely for all \(\theta\).
Confidence interval: A random interval \([L(\mathbf{X}), U(\mathbf{X})]\) constructed so that, before the data are observed, it contains the true parameter with a specified probability (the confidence level).
Conjugate prior: A prior distribution that, when combined with the likelihood via Bayes’ theorem, yields a posterior of the same parametric family.
Consistency: An estimator \(\hat\theta_n\) is consistent if \(\hat\theta_n \stackrel{p}{\to} \theta_0\) as \(n \to \infty\).
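Cramér–Rao lower bound: A lower bound on the variance of any unbiased estimator: \(\operatorname{Var}(\hat\theta) \geq \mathcal{I}(\theta)^{-1}\), where \(\mathcal{I}(\theta)\) is the Fisher information of the full sample. The bound holds under regularity conditions and is not always attained.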
Deviance: Twice the difference between the log-likelihood of the saturated model and the fitted model: \(D = 2[\ell_{\text{sat}} - \ell(\hat\theta)]\).
Efficiency: The ratio of the Cramér–Rao lower bound to the actual variance of an estimator. An efficient estimator achieves the bound.
EM algorithm: Expectation–Maximization algorithm, an iterative method for finding MLEs when the model involves latent variables or missing data.
Estimator: A function of the data used to estimate an unknown parameter. The distinction between “estimator” (the rule) and “estimate” (the numerical value) is maintained in this text.
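Exponential family: A parametric family whose density can be written as \(f(x \mid \theta) = h(x)\exp[\eta(\theta)^\top T(x) - A(\theta)]\).
Fisher information: The variance of the score function or, equivalently under regularity conditions, the negative expected Hessian of the log-likelihood: \(\mathcal{I}(\theta) = E[U(\theta)^2] = -E[\ell''(\theta)]\).
Gradient descent: An iterative optimization algorithm that moves against the gradient of an objective \(f\) to be minimized: \(\theta^{(k+1)} = \theta^{(k)} - \eta\,\nabla f(\theta^{(k)})\). To maximize a log-likelihood, take \(f = -\ell\), which gives the gradient ascent update \(\theta^{(k+1)} = \theta^{(k)} + \eta\,\nabla\ell(\theta^{(k)})\).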
Hessian matrix: The matrix of second partial derivatives of a function: \(H_{ij} = \partial^2 f / \partial \theta_i \partial \theta_j\).
Kullback–Leibler divergence: A measure of the difference between two distributions: \(\text{KL}(p \| q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx\).
Likelihood function: The joint density or mass function of the data, viewed as a function of the parameters: \(L(\theta) = f(\mathbf{x} \mid \theta)\).
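Likelihood ratio test: A hypothesis test based on the statistic \(\Lambda = 2[\ell(\hat\theta) - \ell(\theta_0)]\), which under the null hypothesis and regularity conditions is asymptotically \(\chi^2\) with degrees of freedom equal to the number of parameters fixed by the null.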
Log-likelihood: The natural logarithm of the likelihood function: \(\ell(\theta) = \log L(\theta)\).
Maximum likelihood estimator (MLE): The parameter value that maximizes the likelihood (or equivalently the log-likelihood): \(\hat\theta = \arg\max_\theta \ell(\theta)\).
Method of moments: An estimation approach that equates sample moments to population moments and solves for the parameters.
Newton–Raphson method: An iterative root-finding algorithm applied to the score equation: \(\theta^{(k+1)} = \theta^{(k)} - [\ell''(\theta^{(k)})]^{-1}\,\ell'(\theta^{(k)})\).
Nuisance parameter: A parameter that is not of direct interest but must be accounted for in the inference procedure.
Observed information: The negative Hessian of the log-likelihood evaluated at the MLE: \(\mathcal{J}(\hat\theta) = -\ell''(\hat\theta)\).
p-value: The probability, under the null hypothesis, of observing a test statistic as extreme as or more extreme than the observed value.
Power: The probability of correctly rejecting a false null hypothesis: \(1 - \beta\), where \(\beta\) is the Type II error rate.
Profile likelihood: The likelihood maximized over nuisance parameters: \(L_p(\psi) = \max_\lambda L(\psi, \lambda)\).
Regularity conditions: Technical conditions (differentiability, integrability, parameter not on boundary) that ensure standard asymptotic results hold for MLEs.
Score function: The derivative of the log-likelihood with respect to the parameter: \(U(\theta) = \partial\ell / \partial\theta\). Under regularity conditions, \(E[U(\theta_0)] = 0\).
Sufficient statistic: A statistic \(T(\mathbf{X})\) that captures all the information in the data about the parameter. By the Fisher–Neyman factorization theorem, \(T\) is sufficient if and only if \(f(\mathbf{x} \mid \theta) = g(T(\mathbf{x}), \theta)\,h(\mathbf{x})\).
Wald test: A hypothesis test based on the statistic \(W = (\hat\theta - \theta_0)^2 / \widehat{\operatorname{Var}}(\hat\theta)\), which for a scalar parameter is asymptotically \(\chi^2_1\) under the null hypothesis \(\theta = \theta_0\).
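
Several of the entries above (Newton–Raphson method, observed information, Wald test, likelihood ratio test) can be tied together in one short computation. The sketch below assumes an i.i.d. \(\text{Poisson}(\lambda)\) model with made-up counts and a hypothetical null value \(\lambda_0 = 3\); the function names are illustrative only. It finds the MLE by Newton–Raphson on the score equation, then forms the Wald and likelihood ratio statistics, using the inverse observed information as the variance estimate.

```python
import math

# Hypothetical data, assumed i.i.d. Poisson(lambda).
counts = [3, 5, 2, 4, 6]
n, s = len(counts), sum(counts)

def log_lik(lam):
    # Poisson log-likelihood without the constant -sum(log(x_i!)),
    # which cancels in likelihood ratios.
    return s * math.log(lam) - n * lam

def d1(lam):
    """First derivative l'(lambda), i.e. the score."""
    return s / lam - n

def d2(lam):
    """Second derivative l''(lambda)."""
    return -s / lam ** 2

def newton_raphson(lam, tol=1e-10, max_iter=50):
    """Iterate lam <- lam - l'(lam) / l''(lam) until the step is tiny."""
    for _ in range(max_iter):
        step = d1(lam) / d2(lam)
        lam -= step
        if abs(step) < tol:
            break
    return lam

lam_hat = newton_raphson(2.0)        # starting value chosen by hand
lam_0 = 3.0                          # hypothetical null value

var_hat = -1.0 / d2(lam_hat)         # inverse observed information
wald = (lam_hat - lam_0) ** 2 / var_hat
lrt = 2 * (log_lik(lam_hat) - log_lik(lam_0))

print(lam_hat, wald, lrt)
```

Both statistics would be compared with a \(\chi^2_1\) reference distribution. For this model the MLE is also available in closed form as the sample mean, which makes it easy to confirm that the iteration has converged. Newton–Raphson is sensitive to the starting value, so a poor start can send the iterate outside the parameter space.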

Common Abbreviations

Abbreviation	Full Name
AIC	Akaike Information Criterion
BIC	Bayesian Information Criterion
CDF	Cumulative distribution function
CLT	Central Limit Theorem
CRLB	Cramér–Rao lower bound
EM	Expectation–Maximization
GLM	Generalized linear model
i.i.d.	Independent and identically distributed
IRLS	Iteratively reweighted least squares
KL	Kullback–Leibler
LLN	Law of large numbers
LRT	Likelihood ratio test
MGF	Moment generating function
MLE	Maximum likelihood estimator / estimate
MSE	Mean squared error
MVN	Multivariate normal
NR	Newton–Raphson
PDF	Probability density function
PMF	Probability mass function
r.v.	Random variable
SVD	Singular value decomposition
UMVUE	Uniformly minimum variance unbiased estimator
w.r.t.	With respect to