Appendix C: Notation and Glossary

This appendix provides a centralized reference for all mathematical notation and terminology used throughout this guide. When a symbol has multiple common meanings, the one adopted in this text is indicated.

Mathematical Notation

Sets and Spaces

Symbol

Meaning

\(\mathbb{R}\)

The set of real numbers

\(\mathbb{R}^n\)

The set of real \(n\)-dimensional column vectors

\(\mathbb{R}^{n \times m}\)

The set of real \(n \times m\) matrices

\(\mathbb{R}^+\)

The set of strictly positive real numbers \((0, \infty)\)

\(\mathbb{Z}\)

The set of integers

\(\mathbb{Z}^+\)

The set of positive integers \(\{1, 2, 3, \ldots\}\)

\(\mathbb{N}_0\)

The set of non-negative integers \(\{0, 1, 2, \ldots\}\)

\(\emptyset\)

The empty set

\(\mathcal{X}\)

The sample space (set of all possible outcomes)

\(\Theta\)

The parameter space

\(\mathcal{S}_n^{++}\)

The cone of \(n \times n\) symmetric positive definite matrices

\(\in\)

Element of

\(\subset, \subseteq\)

Proper subset, subset (or equal)

\(\cup, \cap\)

Union, intersection

\(A^c\)

Complement of set \(A\)

Probability Notation

Symbol

Meaning

\(P(A)\)

Probability of event \(A\)

\(P(A \mid B)\)

Conditional probability of \(A\) given \(B\)

\(P(A \cap B)\)

Joint probability of \(A\) and \(B\)

\(f(x)\) or \(f_X(x)\)

Probability density function (PDF) of a continuous r.v.

\(p(x)\) or \(p_X(x)\)

Probability mass function (PMF) of a discrete r.v.

\(F(x)\) or \(F_X(x)\)

Cumulative distribution function (CDF)

\(f(x \mid \theta)\)

Density of \(X\) given parameter \(\theta\)

\(f(\mathbf{x} \mid \boldsymbol{\theta})\)

Joint density of data vector \(\mathbf{x}\) given parameters

\(\sim\)

“is distributed as” (e.g., \(X \sim \mathcal{N}(\mu, \sigma^2)\))

\(\stackrel{d}{\to}\)

Convergence in distribution

\(\stackrel{p}{\to}\)

Convergence in probability

\(\stackrel{a.s.}{\to}\)

Almost sure convergence

\(\perp\!\!\!\perp\)

Statistical independence

\(\text{i.i.d.}\)

Independent and identically distributed

Random Variables and Expectations

Symbol

Meaning

\(X, Y, Z\)

Random variables (uppercase)

\(x, y, z\)

Observed values / realizations (lowercase)

\(\mathbf{X}\)

Random vector or random matrix

\(E[X]\) or \(\mu\)

Expected value (mean) of \(X\)

\(E[X \mid Y]\)

Conditional expectation of \(X\) given \(Y\)

\(\operatorname{Var}(X)\) or \(\sigma^2\)

Variance of \(X\)

\(\operatorname{Cov}(X, Y)\)

Covariance of \(X\) and \(Y\)

\(\operatorname{Corr}(X, Y)\) or \(\rho\)

Correlation of \(X\) and \(Y\)

\(\boldsymbol{\Sigma}\)

Covariance matrix

\(M_X(t)\)

Moment generating function of \(X\)

\(\phi_X(t)\)

Characteristic function of \(X\)

\(E_\theta[\cdot]\)

Expectation taken under the distribution indexed by \(\theta\)

Likelihood and Inference

Symbol

Meaning

\(L(\theta)\) or \(L(\theta ; \mathbf{x})\)

Likelihood function

\(\ell(\theta)\) or \(\ell(\theta ; \mathbf{x})\)

Log-likelihood function, \(\ell = \log L\)

\(\hat{\theta}\) or \(\hat{\theta}_{\text{MLE}}\)

Maximum likelihood estimator / estimate

\(U(\theta)\) or \(S(\theta)\)

Score function, \(U(\theta) = \partial \ell / \partial \theta\)

\(\mathcal{I}(\theta)\)

Fisher information (expected information)

\(\mathcal{J}(\theta)\) or \(J(\hat{\theta})\)

Observed information, \(-\partial^2 \ell / \partial \theta^2\)

\(\Lambda\)

Likelihood ratio statistic

\(R(\theta)\)

Profile likelihood or relative likelihood

\(\ell_p(\psi)\)

Profile log-likelihood for parameter of interest \(\psi\)

\(\text{se}(\hat{\theta})\)

Standard error of estimator \(\hat{\theta}\)

\(\text{AIC}\)

Akaike Information Criterion, \(-2\ell(\hat\theta) + 2p\)

\(\text{BIC}\)

Bayesian Information Criterion, \(-2\ell(\hat\theta) + p\log n\)
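As a rough illustration of the two criteria defined above, the following Python sketch evaluates them from a maximized log-likelihood. The function names and the numbers in the usage lines are hypothetical and chosen only for illustration; `p` is the number of free parameters and `n` the sample size.

```python
import math

def aic(loglik, p):
    """Akaike Information Criterion: -2 * loglik + 2 * p."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    """Bayesian Information Criterion: -2 * loglik + p * log(n)."""
    return -2.0 * loglik + p * math.log(n)

# Hypothetical fit: maximized log-likelihood of -100 with 3 parameters on 50 observations.
aic_value = aic(-100.0, p=3)
bic_value = bic(-100.0, p=3, n=50)
print(aic_value, bic_value)
```

Because \(\log n > 2\) once \(n \geq 8\), BIC penalizes each additional parameter more heavily than AIC for all but very small samples.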

Optimization Notation

Symbol

Meaning

\(\nabla f\) or \(\nabla_{\mathbf{x}} f\)

Gradient of \(f\) with respect to \(\mathbf{x}\)

\(\mathbf{H}\) or \(\nabla^2 f\)

Hessian matrix (matrix of second partial derivatives)

\(\mathbf{J}\)

Jacobian matrix

\(\arg\max_\theta f(\theta)\)

Value of \(\theta\) that maximizes \(f\)

\(\arg\min_\theta f(\theta)\)

Value of \(\theta\) that minimizes \(f\)

\(\eta\)

Learning rate / step size

\(\theta^{(k)}\)

Parameter value at iteration \(k\) of an iterative algorithm

\(\epsilon\)

Convergence tolerance

\(O(\cdot)\)

Big-O notation (asymptotic upper bound)

\(o(\cdot)\)

Little-o notation (asymptotically negligible)

\(O_p(\cdot)\)

Stochastic big-O (bounded in probability)

\(o_p(\cdot)\)

Stochastic little-o (converges to zero in probability)

Matrix Notation

Symbol

Meaning

\(\mathbf{A}, \mathbf{B}, \mathbf{C}\)

Matrices (bold uppercase)

\(\mathbf{x}, \mathbf{y}, \mathbf{z}\)

Vectors (bold lowercase)

\(\mathbf{I}\) or \(\mathbf{I}_n\)

Identity matrix (\(n \times n\))

\(\mathbf{0}\)

Zero vector or zero matrix

\(\mathbf{1}\) or \(\mathbf{1}_n\)

Vector of ones

\(\mathbf{A}^\top\)

Transpose of \(\mathbf{A}\)

\(\mathbf{A}^{-1}\)

Inverse of \(\mathbf{A}\)

\(\mathbf{A}^{-\top}\)

\((\mathbf{A}^{-1})^\top = (\mathbf{A}^\top)^{-1}\)

\(\det(\mathbf{A})\) or \(|\mathbf{A}|\)

Determinant of \(\mathbf{A}\)

\(\operatorname{tr}(\mathbf{A})\)

Trace of \(\mathbf{A}\)

\(\operatorname{rank}(\mathbf{A})\)

Rank of \(\mathbf{A}\)

\(\operatorname{diag}(\mathbf{A})\)

Vector of diagonal entries of \(\mathbf{A}\)

\(\operatorname{diag}(\mathbf{x})\)

Diagonal matrix with entries of \(\mathbf{x}\) on the diagonal

\(\lambda_i(\mathbf{A})\)

\(i\)-th eigenvalue of \(\mathbf{A}\)

\(\sigma_i(\mathbf{A})\)

\(i\)-th singular value of \(\mathbf{A}\)

\(\|\mathbf{x}\|\) or \(\|\mathbf{x}\|_2\)

Euclidean (L2) norm, \(\sqrt{\mathbf{x}^\top\mathbf{x}}\)

\(\|\mathbf{A}\|_F\)

Frobenius norm, \(\sqrt{\operatorname{tr}(\mathbf{A}^\top\mathbf{A})}\)

\(\mathbf{A} \succ 0\)

\(\mathbf{A}\) is positive definite

\(\mathbf{A} \succeq 0\)

\(\mathbf{A}\) is positive semi-definite

\(\mathbf{A} \otimes \mathbf{B}\)

Kronecker product of \(\mathbf{A}\) and \(\mathbf{B}\)
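To make two of the entries above concrete, here is a small Python sketch that checks the Frobenius norm identity \(\|\mathbf{A}\|_F = \sqrt{\operatorname{tr}(\mathbf{A}^\top\mathbf{A})}\) on a hypothetical \(2 \times 2\) matrix. The helper functions are written out by hand purely for illustration; in practice a linear algebra library would normally be used.

```python
import math

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def trace(a):
    """Sum of the diagonal entries of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))

A = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical example matrix

# Frobenius norm via the trace identity and via the sum of squared entries.
frob_via_trace = math.sqrt(trace(matmul(transpose(A), A)))
frob_direct = math.sqrt(sum(x * x for row in A for x in row))
print(frob_via_trace, frob_direct)
```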

Named Distributions

The following shorthand is used for standard distributions:

Notation

Distribution

\(\text{Bernoulli}(p)\)

Bernoulli with success probability \(p\)

\(\text{Bin}(n, p)\)

Binomial with \(n\) trials and success probability \(p\)

\(\text{Poisson}(\lambda)\)

Poisson with rate \(\lambda\)

\(\text{Geom}(p)\)

Geometric with success probability \(p\)

\(\text{NegBin}(r, p)\)

Negative binomial with \(r\) successes, probability \(p\)

\(\mathcal{U}(a, b)\)

Uniform on \([a, b]\)

\(\text{Exp}(\lambda)\)

Exponential with rate \(\lambda\)

\(\mathcal{N}(\mu, \sigma^2)\)

Normal with mean \(\mu\) and variance \(\sigma^2\)

\(\mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})\)

\(p\)-variate normal with mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\)

\(\text{Gamma}(\alpha, \beta)\)

Gamma with shape \(\alpha\) and rate \(\beta\)

\(\text{Beta}(\alpha, \beta)\)

Beta with shape parameters \(\alpha\) and \(\beta\)

\(\chi^2_n\)

Chi-squared with \(n\) degrees of freedom

\(t_n\)

Student’s \(t\) with \(n\) degrees of freedom

\(F_{m,n}\)

\(F\)-distribution with \(m\) and \(n\) degrees of freedom

\(\text{Mult}(n, \mathbf{p})\)

Multinomial with \(n\) trials and probability vector \(\mathbf{p}\)

\(\text{Dir}(\boldsymbol{\alpha})\)

Dirichlet with concentration parameter \(\boldsymbol{\alpha}\)

\(\text{Wishart}_p(n, \mathbf{V})\)

Wishart with \(n\) degrees of freedom and scale matrix \(\mathbf{V}\)

Greek Letters and Their Typical Uses

Letter

Name

Typical Use in Statistics

\(\alpha\)

alpha

Significance level; shape parameter; Type I error rate

\(\beta\)

beta

Regression coefficient; rate parameter; Type II error rate

\(\gamma\)

gamma

Skewness; Euler–Mascheroni constant; threshold parameter

\(\delta\)

delta

Effect size; small perturbation; Kronecker delta

\(\epsilon, \varepsilon\)

epsilon

Error term; small positive quantity; convergence tolerance

\(\zeta\)

zeta

Latent variable; link function parameter

\(\eta\)

eta

Natural (canonical) parameter; learning rate

\(\theta\)

theta

Generic parameter (the most common choice)

\(\iota\)

iota

(Rarely used in statistics)

\(\kappa\)

kappa

Cumulant; condition number; concentration parameter

\(\lambda\)

lambda

Rate parameter; eigenvalue; Lagrange multiplier; penalty

\(\mu\)

mu

Mean; location parameter

\(\nu\)

nu

Degrees of freedom

\(\xi\)

xi

Latent variable; auxiliary parameter

\(\pi\)

pi

Prior probability; the constant 3.14159…

\(\rho\)

rho

Correlation coefficient; spectral radius

\(\sigma\)

sigma

Standard deviation (\(\sigma^2\) = variance)

\(\tau\)

tau

Precision (\(1/\sigma^2\)); Kendall’s rank correlation

\(\upsilon\)

upsilon

(Rarely used in statistics)

\(\phi, \varphi\)

phi

Standard normal density; dispersion parameter; basis function

\(\chi\)

chi

Chi-squared distribution

\(\psi\)

psi

Digamma function; parameter of interest

\(\omega\)

omega

Weight; angular frequency

Uppercase Greek letters with common statistical uses:

Letter

Name

Typical Use

\(\Gamma\)

Gamma

Gamma function; Gamma distribution

\(\Delta\)

Delta

Change or difference

\(\Theta\)

Theta

Parameter space

\(\Lambda\)

Lambda

Likelihood ratio; diagonal matrix of eigenvalues

\(\Sigma\)

Sigma

Covariance matrix; summation (\(\sum\))

\(\Phi\)

Phi

Standard normal CDF

\(\Psi\)

Psi

Polygamma function

\(\Omega\)

Omega

Sample space; precision matrix (\(\Sigma^{-1}\))

Glossary of Key Terms

Asymptotic normality

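The property that the distribution of a suitably centered and scaled estimator approaches a normal distribution as the sample size \(n\) grows. Under regularity conditions and i.i.d. sampling, the MLE satisfies \(\sqrt{n}(\hat\theta - \theta_0) \stackrel{d}{\to} \mathcal{N}(0, \mathcal{I}(\theta_0)^{-1})\), where \(\theta_0\) is the true parameter value and \(\mathcal{I}(\theta_0)\) is the Fisher information of a single observation.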

Bias

The difference \(E[\hat\theta] - \theta\) between the expected value of an estimator and the true parameter value.

Completeness

A statistic \(T\) is complete if \(E_\theta[g(T)] = 0\) for all \(\theta\) implies \(g(T) = 0\) almost surely for all \(\theta\).

Confidence interval

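A random interval \([L(\mathbf{X}), U(\mathbf{X})]\) constructed so that, over repeated sampling, it contains the true parameter with a specified probability (the confidence level). The probability statement refers to the random endpoints, not to the fixed parameter.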

Conjugate prior

A prior distribution that, when combined with the likelihood via Bayes’ theorem, yields a posterior of the same parametric family.

Consistency

An estimator \(\hat\theta_n\) is consistent if \(\hat\theta_n \stackrel{p}{\to} \theta_0\) as \(n \to \infty\).

Cramér–Rao lower bound

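A lower bound on the variance of any unbiased estimator, valid under regularity conditions: \(\operatorname{Var}(\hat\theta) \geq \mathcal{I}(\theta)^{-1}\), where \(\mathcal{I}(\theta)\) is the Fisher information of the full sample. The bound is not always attainable.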

Deviance

Twice the difference between the log-likelihood of the saturated model and the fitted model: \(D = 2[\ell_{\text{sat}} - \ell(\hat\theta)]\).

Efficiency

The ratio of the Cramér–Rao lower bound to the actual variance of an estimator. An efficient estimator achieves the bound.

EM algorithm

Expectation–Maximization algorithm, an iterative method for finding MLEs when the model involves latent variables or missing data.
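The alternation between the two steps can be sketched in Python for a deliberately simplified case: a mixture of two normal components with equal weights and unit variances, where only the two means are unknown. These simplifications, the data, and the starting values are assumptions made for illustration, not a general implementation.

```python
import math

def em_two_gaussians(data, mu_init=(-1.0, 1.0), n_iter=50):
    """EM for an equal-weight mixture of N(mu1, 1) and N(mu2, 1); only the means are estimated."""
    mu1, mu2 = mu_init
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 2.
        resp = []
        for x in data:
            d1 = math.exp(-0.5 * (x - mu1) ** 2)
            d2 = math.exp(-0.5 * (x - mu2) ** 2)
            resp.append(d2 / (d1 + d2))
        # M-step: each mean becomes a responsibility-weighted average of the data.
        w2 = sum(resp)
        w1 = len(data) - w2
        mu1 = sum((1.0 - r) * x for r, x in zip(resp, data)) / w1
        mu2 = sum(r * x for r, x in zip(resp, data)) / w2
    return mu1, mu2

# Hypothetical data with two well-separated clusters near -2 and 2.
data = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
mu1_hat, mu2_hat = em_two_gaussians(data)
print(mu1_hat, mu2_hat)
```

The latent variable here is the unobserved component label of each observation; the E-step replaces it with its conditional expectation given the current means.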

Estimator

A function of the data used to estimate an unknown parameter. The distinction between “estimator” (the rule) and “estimate” (the numerical value) is maintained in this text.

Exponential family

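A parametric family whose density can be written as \(f(x \mid \theta) = h(x)\exp[\eta(\theta)^\top T(x) - A(\theta)]\), where \(\eta(\theta)\) is the natural parameter, \(T(x)\) is the sufficient statistic, \(h(x)\) is the base measure, and \(A(\theta)\) is the log-normalizer.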

Fisher information

The variance of the score function, or equivalently the negative expected Hessian of the log-likelihood: \(\mathcal{I}(\theta) = E[U(\theta)^2] = -E[\ell''(\theta)]\).
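The equality of the two expressions can be checked directly for a single Bernoulli(\(p\)) observation, where both expectations are finite sums over \(x \in \{0, 1\}\). The sketch below is illustrative only; the function name and the value \(p = 0.3\) are arbitrary choices.

```python
def bernoulli_information(p):
    """Return (E[U^2], -E[l'']) for one Bernoulli(p) observation."""
    pmf = {0: 1.0 - p, 1: p}
    # log f(x | p) = x * log(p) + (1 - x) * log(1 - p)
    score = lambda x: x / p - (1 - x) / (1 - p)              # first derivative in p
    second = lambda x: -x / p**2 - (1 - x) / (1 - p) ** 2    # second derivative in p
    var_score = sum(prob * score(x) ** 2 for x, prob in pmf.items())
    neg_exp_hessian = -sum(prob * second(x) for x, prob in pmf.items())
    return var_score, neg_exp_hessian

var_score, neg_exp_hessian = bernoulli_information(0.3)
print(var_score, neg_exp_hessian)
```

Both quantities agree with the closed form \(1 / [p(1 - p)]\) up to floating-point rounding.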

Gradient descent

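An iterative optimization algorithm that steps against the gradient of an objective \(f\) to be minimized: \(\theta^{(k+1)} = \theta^{(k)} - \eta\,\nabla f(\theta^{(k)})\), where \(\eta > 0\) is the step size. For maximum likelihood, taking \(f = -\ell\) gives the equivalent gradient ascent update \(\theta^{(k+1)} = \theta^{(k)} + \eta\,\nabla\ell(\theta^{(k)})\).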
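A minimal Python sketch of the update, applied to the average log-likelihood of i.i.d. \(\mathcal{N}(\mu, 1)\) data. Adding \(\eta\,\nabla\ell\) climbs the log-likelihood, which is the same as descending \(-\ell\). The data, step size, and tolerance are hypothetical choices for illustration.

```python
def gradient_ascent_normal_mean(data, eta=0.1, tol=1e-10, max_iter=10_000):
    """Maximize the average N(mu, 1) log-likelihood in mu by gradient steps."""
    n = len(data)
    mu = 0.0  # arbitrary starting value
    for _ in range(max_iter):
        grad = sum(x - mu for x in data) / n  # derivative of the average log-likelihood
        mu_new = mu + eta * grad              # ascent on l, i.e. descent on -l
        if abs(mu_new - mu) < tol:            # stop when the step is negligible
            return mu_new
        mu = mu_new
    return mu

data = [2.1, 1.9, 2.4, 2.0, 1.6]  # hypothetical sample
mu_hat = gradient_ascent_normal_mean(data)
print(mu_hat)
```

For this model the MLE is the sample mean, so the iterates should settle close to it; the example is useful mainly as a check that the sign of the update is right.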

Hessian matrix

The matrix of second partial derivatives of a function: \(H_{ij} = \partial^2 f / \partial \theta_i \partial \theta_j\).

Kullback–Leibler divergence

A measure of the difference between two distributions: \(\text{KL}(p \| q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx\).
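For discrete distributions the integral becomes a sum. The sketch below, with hypothetical probability vectors, computes the discrete analogue and illustrates that the divergence is zero when the two distributions coincide and is not symmetric in its arguments. It assumes \(q_i > 0\) wherever \(p_i > 0\).

```python
import math

def kl_divergence(p, q):
    """Discrete KL(p || q) = sum_i p_i * log(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # hypothetical distributions on two outcomes
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)
print(kl_pq, kl_qp, kl_pp)
```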

Likelihood function

The joint density or mass function of the data, viewed as a function of the parameters: \(L(\theta) = f(\mathbf{x} \mid \theta)\).

Likelihood ratio test

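A hypothesis test based on the statistic \(\Lambda = 2[\ell(\hat\theta) - \ell(\theta_0)]\), which under the null hypothesis \(\theta = \theta_0\) and regularity conditions is asymptotically \(\chi^2\) with degrees of freedom equal to the dimension of \(\theta\).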
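As an illustration, the statistic can be computed for a binomial experiment with \(k\) successes in \(n\) trials and null value \(p_0\). This is a sketch under stated assumptions: the counts are hypothetical, \(0 < k < n\) is required so the logarithms are finite, and the p-value uses the identity \(P(\chi^2_1 > s) = \operatorname{erfc}(\sqrt{s/2})\) for one degree of freedom.

```python
import math

def binomial_loglik(p, k, n):
    """Log-likelihood of success probability p, dropping the binomial coefficient."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def lrt_binomial(k, n, p0):
    """Likelihood ratio statistic and asymptotic chi-squared(1) p-value for H0: p = p0."""
    p_hat = k / n  # MLE of p; requires 0 < k < n
    stat = 2.0 * (binomial_loglik(p_hat, k, n) - binomial_loglik(p0, k, n))
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-squared with 1 df
    return stat, p_value

# Hypothetical data: 60 successes in 100 trials, testing p0 = 0.5.
lrt_stat, lrt_p = lrt_binomial(k=60, n=100, p0=0.5)
print(lrt_stat, lrt_p)
```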

Log-likelihood

The natural logarithm of the likelihood function: \(\ell(\theta) = \log L(\theta)\).

Maximum likelihood estimator (MLE)

The parameter value that maximizes the likelihood (or equivalently the log-likelihood): \(\hat\theta = \arg\max_\theta \ell(\theta)\).

Method of moments

An estimation approach that equates sample moments to population moments and solves for the parameters.
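For example, a \(\text{Gamma}(\alpha, \beta)\) distribution with shape \(\alpha\) and rate \(\beta\) has mean \(\alpha/\beta\) and variance \(\alpha/\beta^2\). Equating these to the sample mean \(m\) and sample variance \(v\) gives \(\hat\alpha = m^2/v\) and \(\hat\beta = m/v\). The Python sketch below applies this to a small hypothetical sample; it uses the divide-by-\(n\) variance, which is one of several reasonable conventions.

```python
from statistics import fmean, pvariance

def gamma_method_of_moments(data):
    """Method-of-moments estimates (shape, rate) for a Gamma distribution."""
    m = fmean(data)              # first sample moment
    v = pvariance(data, mu=m)    # second central sample moment (divides by n)
    alpha_hat = m * m / v        # solves alpha / beta = m and alpha / beta^2 = v
    beta_hat = m / v
    return alpha_hat, beta_hat

sample = [1.2, 0.8, 2.5, 1.9, 0.6, 3.1]  # hypothetical positive observations
alpha_hat, beta_hat = gamma_method_of_moments(sample)
print(alpha_hat, beta_hat)
```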

Newton–Raphson method

An iterative root-finding algorithm applied to the score equation: \(\theta^{(k+1)} = \theta^{(k)} - [\ell''(\theta^{(k)})]^{-1}\,\ell'(\theta^{(k)})\).
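A short Python sketch of this iteration for a scalar parameter, applied to the Poisson rate, where \(\ell'(\lambda) = \sum_i x_i / \lambda - n\) and \(\ell''(\lambda) = -\sum_i x_i / \lambda^2\). The counts and the starting value are hypothetical; the closed-form MLE (the sample mean) serves as a check.

```python
def newton_raphson(score, hessian, theta0, tol=1e-12, max_iter=100):
    """Solve score(theta) = 0 by Newton-Raphson for a scalar parameter."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)  # [l''(theta)]^{-1} * l'(theta)
        theta = theta - step
        if abs(step) < tol:                   # converged: the update is negligible
            break
    return theta

counts = [2, 4, 3, 5, 1]  # hypothetical Poisson counts
n, total = len(counts), sum(counts)

score = lambda lam: total / lam - n        # l'(lambda)
hessian = lambda lam: -total / lam**2      # l''(lambda)

lam_hat = newton_raphson(score, hessian, theta0=1.0)
print(lam_hat)
```

Convergence depends on the starting value: for this log-likelihood the iteration converges only when started between 0 and twice the sample mean.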

Nuisance parameter

A parameter that is not of direct interest but must be accounted for in the inference procedure.

Observed information

The negative Hessian of the log-likelihood evaluated at the MLE: \(\mathcal{J}(\hat\theta) = -\ell''(\hat\theta)\).

p-value

The probability, under the null hypothesis, of observing a test statistic as extreme as or more extreme than the observed value.

Power

The probability of correctly rejecting a false null hypothesis: \(1 - \beta\), where \(\beta\) is the Type II error rate.

Profile likelihood

The likelihood maximized over nuisance parameters: \(L_p(\psi) = \max_\lambda L(\psi, \lambda)\).
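A simple case is the normal model \(\mathcal{N}(\mu, \sigma^2)\) with \(\mu\) as the parameter of interest (\(\psi\)) and \(\sigma^2\) as the nuisance parameter (\(\lambda\)). For fixed \(\mu\), the likelihood is maximized at \(\hat\sigma^2(\mu) = n^{-1}\sum_i (x_i - \mu)^2\), so the profile log-likelihood is \(-\tfrac{n}{2}\log\hat\sigma^2(\mu)\) up to an additive constant. The Python sketch below evaluates it on a grid; the data and grid are hypothetical.

```python
import math

def profile_loglik_mu(mu, data):
    """Profile log-likelihood of mu for N(mu, sigma^2) data, up to an additive constant."""
    n = len(data)
    sigma2_hat = sum((x - mu) ** 2 for x in data) / n  # maximizer over sigma^2 at this mu
    return -0.5 * n * math.log(sigma2_hat)

data = [2.1, 1.9, 2.4, 2.0, 1.6]            # hypothetical sample
grid = [1.0 + 0.1 * i for i in range(21)]   # candidate values of mu from 1.0 to 3.0

best_mu = max(grid, key=lambda mu: profile_loglik_mu(mu, data))
print(best_mu)
```

The grid value that maximizes the profile is the one closest to the sample mean, as expected for this model.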

Regularity conditions

Technical conditions (differentiability, integrability, parameter not on boundary) that ensure standard asymptotic results hold for MLEs.

Score function

The derivative of the log-likelihood with respect to the parameter: \(U(\theta) = \partial\ell / \partial\theta\). Under regularity conditions, \(E[U(\theta_0)] = 0\).

Sufficient statistic

A statistic \(T(\mathbf{X})\) that captures all the information in the data about the parameter. By the Fisher–Neyman factorization theorem, \(T\) is sufficient if and only if \(f(\mathbf{x} \mid \theta) = g(T(\mathbf{x}), \theta)\,h(\mathbf{x})\).

Wald test

A hypothesis test for a scalar parameter based on the statistic \(W = (\hat\theta - \theta_0)^2 / \widehat{\operatorname{Var}}(\hat\theta)\), which is asymptotically \(\chi^2_1\) under the null hypothesis \(\theta = \theta_0\).
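The statistic can be sketched in Python for a binomial proportion, using the plug-in variance estimate \(\hat p(1 - \hat p)/n\). The counts are hypothetical, \(0 < k < n\) is assumed so the variance estimate is positive, and the p-value uses \(P(\chi^2_1 > w) = \operatorname{erfc}(\sqrt{w/2})\).

```python
import math

def wald_binomial(k, n, p0):
    """Wald statistic and asymptotic chi-squared(1) p-value for H0: p = p0."""
    p_hat = k / n                            # MLE of the success probability
    var_hat = p_hat * (1.0 - p_hat) / n      # estimated variance of p_hat
    stat = (p_hat - p0) ** 2 / var_hat
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-squared with 1 df
    return stat, p_value

# Hypothetical data: 60 successes in 100 trials, testing p0 = 0.5.
wald_stat, wald_p = wald_binomial(k=60, n=100, p0=0.5)
print(wald_stat, wald_p)
```

Unlike the likelihood ratio statistic, the Wald statistic changes under reparameterization, so the two tests can give somewhat different values in finite samples.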

Common Abbreviations

Abbreviation

Full Name

AIC

Akaike Information Criterion

BIC

Bayesian Information Criterion

CDF

Cumulative distribution function

CLT

Central Limit Theorem

CRLB

Cramér–Rao lower bound

EM

Expectation–Maximization

GLM

Generalized linear model

i.i.d.

Independent and identically distributed

IRLS

Iteratively reweighted least squares

KL

Kullback–Leibler

LLN

Law of large numbers

LRT

Likelihood ratio test

MGF

Moment generating function

MLE

Maximum likelihood estimator / estimate

MSE

Mean squared error

MVN

Multivariate normal

NR

Newton–Raphson

PDF

Probability density function

PMF

Probability mass function

r.v.

Random variable

SVD

Singular value decomposition

UMVUE

Uniformly minimum variance unbiased estimator

w.r.t.

With respect to