.. _ch2_random_variables:

====================================
Chapter 2: Random Variables
====================================

In :ref:`ch1_probability` we learned how to assign probabilities to *events*
--- subsets of a sample space.  But in practice we almost always care about
*numerical* quantities: the number of heads in 10 flips, the average height of
a sample, the time until a server fails.  **Random variables** bridge the gap
between the abstract world of sample spaces and the numerical world of data.

Once you can attach numbers to outcomes, you can compute averages, measure
spread, and --- most importantly for this book --- write down likelihood
functions.

Two running examples will thread through this chapter.  The first is
**insurance**: an insurer needs to price policies by computing expected claim
costs and quantifying risk.  The second is **stock returns**: a portfolio
analyst needs to understand how two assets co-move and how diversification
reduces risk.  Every formula will be applied to one or both scenarios so you
can see the mathematics earning its keep.

.. code-block:: python

   # ------------------------------------------------------------------
   # Setup: parameters for our two running examples
   # ------------------------------------------------------------------
   import numpy as np
   from scipy import stats

   np.random.seed(42)

   # --- Insurance scenario ---
   # A simple auto-insurance model:
   #   With prob 0.85 — no claim ($0)
   #   With prob 0.10 — minor claim (mean $800)
   #   With prob 0.04 — moderate claim (mean $5,000)
   #   With prob 0.01 — major claim (mean $25,000)
   claim_probs = np.array([0.85, 0.10, 0.04, 0.01])
   claim_costs = np.array([0, 800, 5_000, 25_000])
   claim_labels = ["None", "Minor", "Moderate", "Major"]

   # --- Stock returns scenario ---
   # Two stocks: Tech (higher return, higher volatility) and Utility (lower, steadier)
   mu_tech, sigma_tech   = 0.12, 0.25    # annualized mean return 12%, std 25%
   mu_util, sigma_util   = 0.06, 0.12    # annualized mean return  6%, std 12%
   rho_stocks            = 0.30           # correlation between the two

   print("=== Insurance scenario ===")
   for label, prob, cost in zip(claim_labels, claim_probs, claim_costs):
       print(f"  {label:>10s}: P = {prob:.2f}, Cost = ${cost:>7,}")
   print()
   print("=== Stock returns scenario ===")
   print(f"  Tech:    mu = {mu_tech:.0%}, sigma = {sigma_tech:.0%}")
   print(f"  Utility: mu = {mu_util:.0%}, sigma = {sigma_util:.0%}")
   print(f"  Correlation: rho = {rho_stocks}")

.. contents:: Chapter Contents
   :local:
   :depth: 2
   :class: this-will-duplicate-information-and-it-is-still-useful-here


2.1 What Is a Random Variable?
================================

Your insurance company has 50,000 policyholders.  Each one might or might not
file a claim this year, and if they do, the claim amount could be anything from
a few hundred dollars to tens of thousands.  To set premiums, you need to
attach *numbers* to the random experiment "what happens to policyholder #i this
year."  That mapping from outcome to number is precisely what a random variable
provides.

**Plain-English definition.**  A random variable is a rule that assigns a
number to every outcome of a random experiment.

*Example.*  Roll two dice and let :math:`X` be the sum of the faces.  The
outcome :math:`(3, 5)` maps to :math:`X = 8`.  The outcome :math:`(1, 1)` maps
to :math:`X = 2`.

**Formal definition.**  Given a probability space :math:`(\Omega, \mathcal{F}, P)`,
a **random variable** is a function

.. math::

   X : \Omega \longrightarrow \mathbb{R}

such that for every real number :math:`x`, the set
:math:`\{\omega \in \Omega : X(\omega) \leq x\}` belongs to :math:`\mathcal{F}`.
This measurability condition ensures that we can compute probabilities like
:math:`P(X \leq x)`.

Random variables come in two main flavors:

- **Discrete**: :math:`X` takes on a finite or countably infinite set of
  values (e.g., 0, 1, 2, ...).
- **Continuous**: :math:`X` takes on an uncountably infinite set of values
  (e.g., any real number in an interval).

.. code-block:: python

   # A random variable in action: insurance claim amount
   # The "random experiment" is what happens to one policyholder in a year.
   # The random variable X maps the outcome to a dollar amount.

   # Simulate one year for 10 policyholders
   n_demo = 10
   claim_types = np.random.choice(len(claim_probs), size=n_demo, p=claim_probs)
   X = claim_costs[claim_types]  # the random variable: outcome -> dollar amount

   print("Policyholder  Outcome      X (claim $)")
   print("-" * 42)
   for i in range(n_demo):
       print(f"     {i+1:2d}       {claim_labels[claim_types[i]]:>10s}     ${X[i]:>7,}")

.. admonition:: Intuition

   Think of a random variable as a "measurement device" pointed at a random
   experiment.  The experiment produces an outcome; the random variable reads
   off a number.  Different random variables can measure different things about
   the same experiment --- for instance, the sum of two dice versus whether the
   sum exceeds 7.


2.2 Probability Mass Function (PMF)
=====================================

Suppose you are the insurer's actuary.  You know the probability of each claim
type (none, minor, moderate, major).  The **PMF** is simply the table that
lists every possible value of the random variable alongside its probability ---
exactly the information you need to price a policy.

For a **discrete** random variable :math:`X`, the **probability mass function**
is

.. math::

   p_X(x) = P(X = x).

The PMF must satisfy two properties:

1. :math:`p_X(x) \geq 0` for all :math:`x`.
2. :math:`\displaystyle\sum_{\text{all } x} p_X(x) = 1`.

.. code-block:: python

   # PMF of the insurance claim amount X
   print("x (claim $)    p_X(x)")
   print("-" * 26)
   for cost, prob in zip(claim_costs, claim_probs):
       print(f"  ${cost:>7,}     {prob:.2f}")
   print(f"\n  Sum of p_X(x) = {claim_probs.sum():.2f}  (must be 1.0)")

*Example.*  Let :math:`X` be the result of a single fair die roll.

.. math::

   p_X(x) = \frac{1}{6}, \quad x \in \{1, 2, 3, 4, 5, 6\}.

Let us also build the PMF for the sum of two dice --- a slightly richer
example.

.. code-block:: python

   # PMF of the sum of two fair dice
   counts = np.zeros(13)  # indices 0..12; we use 2..12
   for i in range(1, 7):
       for j in range(1, 7):
           counts[i + j] += 1
   pmf_dice = counts / 36

   print("x   p_X(x) = P(Sum = x)")
   for x in range(2, 13):
       bar = "#" * int(pmf_dice[x] * 120)  # simple text histogram
       print(f"{x:2d}  {pmf_dice[x]:.4f}  {bar}")
   print(f"\nSum of PMF: {pmf_dice.sum():.4f}  (should be 1.0)")


2.3 Probability Density Function (PDF)
========================================

Insurance claims are not always nice round numbers.  A car repair might cost
$1,247.63 or $3,891.02 --- any non-negative real number is possible.  For
such continuous random variables, asking "what is the probability the claim is
exactly $1,247.63?" gives the unhelpful answer zero.  Instead, we describe the
distribution through a *density*: a curve whose area over any interval gives
the probability of landing in that interval.

For a **continuous** random variable :math:`X`, individual points have zero
probability: :math:`P(X = x) = 0`.  Instead we describe the distribution
through a **probability density function** :math:`f_X(x)`, defined by the
property

.. math::

   P(a \leq X \leq b) = \int_a^b f_X(x)\,dx.

The PDF must satisfy:

1. :math:`f_X(x) \geq 0` for all :math:`x`.
2. :math:`\displaystyle\int_{-\infty}^{\infty} f_X(x)\,dx = 1`.

Note: :math:`f_X(x)` is **not** a probability.  It can exceed 1 at some
points.  It is a *density*: probability per unit length.

.. code-block:: python

   # PDF example: Exponential distribution for claim sizes
   # If a minor claim follows Exp(rate = 1/800), the mean claim is $800
   rate = 1 / 800  # lambda
   x_vals = np.linspace(0, 5000, 500)
   f_X = rate * np.exp(-rate * x_vals)  # f_X(x) = lambda * exp(-lambda * x)

   # Verify: integral over [0, inf) should be 1
   dx = x_vals[1] - x_vals[0]
   approx_integral = np.sum(f_X * dx)
   print(f"Exp(rate={rate:.5f}) — approximate integral over [0, 5000]: {approx_integral:.4f}")

   # P(500 <= X <= 1500)
   mask = (x_vals >= 500) & (x_vals <= 1500)
   P_500_1500_numerical = np.sum(f_X[mask] * dx)
   P_500_1500_exact = np.exp(-rate * 500) - np.exp(-rate * 1500)
   print(f"P(500 <= X <= 1500): numerical = {P_500_1500_numerical:.4f}, exact = {P_500_1500_exact:.4f}")

.. admonition:: Common Pitfall

   A common source of confusion: a PDF value of 3.0 at some point does *not*
   mean a 300 % probability.  The density is only meaningful when integrated
   over an interval.  For example, the Uniform(0, 0.1) distribution has a
   constant density of 10, yet it is a perfectly valid probability distribution
   because :math:`10 \times 0.1 = 1`.

.. code-block:: python

   # Demonstrating the pitfall: Uniform(0, 0.1) has density = 10 everywhere
   f_uniform = 1 / 0.1  # = 10
   print(f"Uniform(0, 0.1): f(x) = {f_uniform:.1f} for x in [0, 0.1]")
   print(f"  f(0.05) = {f_uniform}  (NOT a probability!)")
   print(f"  P(0 <= X <= 0.1) = {f_uniform} × 0.1 = {f_uniform * 0.1:.1f}  (this is a probability)")


2.4 Cumulative Distribution Function (CDF)
============================================

The insurer often asks threshold questions: "What is the probability that a
claim is at most $2,000?"  or "What is the probability that annual losses
exceed $1 million?"  The CDF directly answers the first question; the second
follows via the complement rule.

Whether discrete or continuous, every random variable has a **cumulative
distribution function** (CDF):

.. math::

   F_X(x) = P(X \leq x).

The CDF is a unified tool that works for both types of random variables.  It
answers the question: "What is the probability that :math:`X` lands at or
below this threshold?"

**Properties of the CDF** (provable from the axioms):

1. :math:`F_X` is non-decreasing: if :math:`a < b` then :math:`F_X(a) \leq F_X(b)`.
2. :math:`\displaystyle\lim_{x \to -\infty} F_X(x) = 0` and
   :math:`\displaystyle\lim_{x \to +\infty} F_X(x) = 1`.
3. :math:`F_X` is right-continuous: :math:`\displaystyle\lim_{h \downarrow 0} F_X(x+h) = F_X(x)`.

**Relationship to PDF/PMF.**

- Continuous case: :math:`F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt`, hence
  :math:`f_X(x) = F_X'(x)` wherever the derivative exists.
- Discrete case: :math:`F_X(x) = \sum_{t \leq x} p_X(t)`.

**Useful identity.**

.. math::

   P(a < X \leq b) = F_X(b) - F_X(a).

.. code-block:: python

   # CDF of the Exponential(rate = 1/800) claim distribution
   # F_X(x) = 1 - exp(-lambda * x)  for x >= 0
   thresholds = [500, 1000, 2000, 5000, 10000]
   print("Exponential(mean=$800) CDF:")
   print(f"{'Threshold':>12s}  {'F_X(x)':>8s}  {'P(X > x)':>8s}")
   for t in thresholds:
       F_x = 1 - np.exp(-rate * t)
       print(f"  ${t:>8,}  {F_x:8.4f}  {1 - F_x:8.4f}")

   # Useful identity: P(500 < X <= 2000)
   F_500  = 1 - np.exp(-rate * 500)
   F_2000 = 1 - np.exp(-rate * 2000)
   P_interval = F_2000 - F_500
   print(f"\nP(500 < X <= 2000) = F(2000) - F(500) = {F_2000:.4f} - {F_500:.4f} = {P_interval:.4f}")


2.5 Expectation (Expected Value)
==================================

The actuary needs a single number that summarizes "on average, how much will
this policy cost us?"  That number is the **expected value** --- the
probability-weighted average of every possible claim amount.  If the insurer
sells 50,000 identical policies, the total payout will be close to 50,000
times this expected value.

The **expected value** (or **mean**) of a random variable is the
probability-weighted average of its possible values.  It tells you the "center
of mass" of the distribution.

Here is the key idea: if you could repeat the random experiment millions of
times and average the results, you would get something very close to :math:`E[X]`.
The expected value is the theoretical version of that long-run average.

**Discrete case:**

.. math::

   E[X] = \sum_{x} x \, p_X(x).

**Continuous case:**

.. math::

   E[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx.

**Insurance example — expected claim cost.**

.. math::

   E[X] = \sum_{x} x \cdot p_X(x)
        = 0(0.85) + 800(0.10) + 5000(0.04) + 25000(0.01)

.. code-block:: python

   # E[X] for the insurance claim random variable
   # E[X] = sum of x * p_X(x)
   E_X = np.sum(claim_costs * claim_probs)

   print("E[X] = sum of x * p_X(x):")
   for cost, prob in zip(claim_costs, claim_probs):
       print(f"  {cost:>7,} × {prob:.2f} = {cost * prob:>8.2f}")
   print(f"  {'':>7s}   {'':>4s}   --------")
   print(f"  {'E[X]':>7s}   {'':>4s}   ${E_X:>7.2f}")
   print(f"\nThe insurer should charge at least ${E_X:.2f} per policy to break even.")

.. code-block:: python

   # Verify E[X] by simulation: 1 million policyholders
   n_sim = 1_000_000
   claim_types_sim = np.random.choice(len(claim_probs), size=n_sim, p=claim_probs)
   X_sim = claim_costs[claim_types_sim]

   print(f"E[X] (formula):     ${E_X:.2f}")
   print(f"E[X] (simulation):  ${X_sim.mean():.2f}  (n = {n_sim:,})")

*Example.*  Fair die: :math:`E[X] = \sum_{x=1}^{6} x \cdot \frac{1}{6}
= \frac{1+2+3+4+5+6}{6} = 3.5`.

.. code-block:: python

   # E[X] for a fair die: theory vs. simulation
   E_X_die = sum(x * (1/6) for x in range(1, 7))
   rolls = np.random.randint(1, 7, size=100_000)
   print(f"E[X] (theory)     = {E_X_die:.4f}")
   print(f"E[X] (simulation) = {rolls.mean():.4f}")

**Law of the Unconscious Statistician (LOTUS).**  If :math:`g` is a function,
then

.. math::

   E[g(X)] = \sum_x g(x)\,p_X(x) \quad \text{(discrete)}, \qquad
   E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx \quad \text{(continuous)}.

Why does this matter?  LOTUS lets you compute the expected value of any
transformation of :math:`X` without first deriving the distribution of
:math:`g(X)`.  This saves enormous effort.

.. code-block:: python

   # LOTUS example: the insurer charges a $500 deductible.
   # Actual payout g(X) = max(X - 500, 0).  What is E[g(X)]?
   def g(x):
       """Payout after $500 deductible."""
       return np.maximum(x - 500, 0)

   E_gX = np.sum(g(claim_costs) * claim_probs)

   print("LOTUS: E[g(X)] where g(x) = max(x - 500, 0)")
   for cost, prob in zip(claim_costs, claim_probs):
       print(f"  g({cost:>7,}) = {g(np.array([cost]))[0]:>7,},  × {prob:.2f} = {g(np.array([cost]))[0] * prob:>8.2f}")
   print(f"\n  E[g(X)] = ${E_gX:.2f}  (expected payout after deductible)")
   print(f"  Compare to E[X] = ${E_X:.2f}  (expected claim without deductible)")
   print(f"  The deductible saves the insurer ${E_X - E_gX:.2f} per policy on average.")


2.5.1 Linearity of Expectation
--------------------------------

You insure not just one policyholder but 50,000.  The total expected payout is
simply 50,000 times the expected payout per policy --- no matter how the claims
are correlated across policyholders.  This is linearity of expectation at work.

**Theorem.**  For any random variables :math:`X, Y` and constants :math:`a, b, c`,

.. math::

   E[aX + bY + c] = a\,E[X] + b\,E[Y] + c.

This holds **regardless** of whether :math:`X` and :math:`Y` are independent.

.. code-block:: python

   # Linearity of expectation: total payout for 50,000 policies
   n_policies = 50_000
   E_total = n_policies * E_X
   print(f"E[X per policy]   = ${E_X:.2f}")
   print(f"E[total for {n_policies:,}] = {n_policies} × ${E_X:.2f} = ${E_total:,.2f}")

   # Verify with simulation
   total_sim = sum(
       claim_costs[np.random.choice(len(claim_probs), size=n_policies, p=claim_probs)]
   )
   print(f"Simulated total   = ${total_sim:,.2f}")

.. admonition:: Intuition

   Linearity of expectation is deceptively powerful.  It says that averages
   distribute over sums --- even when the variables are dependent.  This one
   property lets you compute the expected number of fixed points in a random
   permutation, the expected value of a Binomial, and countless other results
   with almost no effort.

**Proof (discrete case for two variables).**  Let :math:`(X, Y)` have joint
PMF :math:`p_{X,Y}(x,y)`.

.. math::

   E[aX + bY]
   &= \sum_x \sum_y (ax + by)\,p_{X,Y}(x,y) \\
   &= a \sum_x \sum_y x\,p_{X,Y}(x,y) + b \sum_x \sum_y y\,p_{X,Y}(x,y) \\
   &= a \sum_x x \underbrace{\sum_y p_{X,Y}(x,y)}_{p_X(x)} \;+\;
      b \sum_y y \underbrace{\sum_x p_{X,Y}(x,y)}_{p_Y(y)} \\
   &= a\,E[X] + b\,E[Y]. \quad \square

Adding a constant :math:`c` contributes :math:`c \sum_x \sum_y p_{X,Y}(x,y) = c`.


2.6 Variance
==============

Two insurance policies might have the same expected cost but very different
risk profiles.  Policy A might pay $530 almost every year; Policy B might
pay $0 most years but occasionally pay $53,000.  The actuary needs to
distinguish these --- and that is exactly what **variance** does.

Expectation tells us the center; **variance** measures the *spread* of a
distribution --- how far values typically fall from the mean.

**Definition.**

.. math::

   \operatorname{Var}(X) = E\!\left[(X - E[X])^2\right].

We also write :math:`\sigma^2 = \operatorname{Var}(X)`.  The **standard
deviation** is :math:`\sigma = \sqrt{\operatorname{Var}(X)}`.


2.6.1 Computational Formula
-----------------------------

The definition involves :math:`E[X]` inside a squared term, which is unwieldy.
A cleaner formula is:

**Theorem.**

.. math::

   \operatorname{Var}(X) = E[X^2] - (E[X])^2.

This is the formula you will reach for most often in practice.  It is
sometimes called the "shortcut formula" because it avoids computing deviations
from the mean one by one.

**Proof.**  Let :math:`\mu = E[X]`.

.. math::

   \operatorname{Var}(X)
   &= E[(X - \mu)^2] \\
   &= E[X^2 - 2\mu X + \mu^2] \\
   &= E[X^2] - 2\mu\,E[X] + \mu^2 \qquad \text{(linearity)} \\
   &= E[X^2] - 2\mu^2 + \mu^2 \\
   &= E[X^2] - \mu^2
    = E[X^2] - (E[X])^2. \quad \square

.. code-block:: python

   # Var(X) for the insurance claim variable, computed both ways
   mu = E_X  # E[X] = 530 from above

   # Method 1: Definition — E[(X - mu)^2]
   Var_X_def = np.sum((claim_costs - mu)**2 * claim_probs)

   # Method 2: Shortcut — E[X^2] - (E[X])^2
   E_X2 = np.sum(claim_costs**2 * claim_probs)
   Var_X_shortcut = E_X2 - mu**2

   sigma = np.sqrt(Var_X_shortcut)

   print("Variance of insurance claim X:")
   print(f"  E[X]              = ${mu:.2f}")
   print(f"  E[X^2]            = {E_X2:,.2f}")
   print(f"  Var(X) (def)      = {Var_X_def:,.2f}")
   print(f"  Var(X) (shortcut) = E[X^2] - (E[X])^2 = {E_X2:,.2f} - {mu:.2f}^2 = {Var_X_shortcut:,.2f}")
   print(f"  Std(X)            = ${sigma:,.2f}")
   print(f"\n  The standard deviation (${sigma:,.0f}) is much larger than the mean (${mu:.0f}),")
   print(f"  reflecting the heavy tail of rare large claims.")

.. code-block:: python

   # Verify Var(X) by simulation
   print(f"Var(X) (formula):    {Var_X_shortcut:>12,.2f}")
   print(f"Var(X) (simulation): {X_sim.var():>12,.2f}")
   print(f"Std(X) (formula):    ${sigma:>10,.2f}")
   print(f"Std(X) (simulation): ${X_sim.std():>10,.2f}")

.. code-block:: python

   # Var(X) for a fair die — a simpler check
   x_die = np.arange(1, 7)
   pmf_die = np.ones(6) / 6

   E_X_die  = np.sum(x_die * pmf_die)
   E_X2_die = np.sum(x_die**2 * pmf_die)
   Var_X_die = E_X2_die - E_X_die**2

   print(f"Fair die:")
   print(f"  E[X]   = {E_X_die:.4f}")
   print(f"  E[X^2] = {E_X2_die:.4f}")
   print(f"  Var(X) = E[X^2] - (E[X])^2 = {E_X2_die:.4f} - {E_X_die:.4f}^2 = {Var_X_die:.4f}")
   print(f"  Std(X) = {np.sqrt(Var_X_die):.4f}")

   # Simulation check
   die_rolls = np.random.randint(1, 7, size=100_000)
   print(f"\n  Simulated: mean = {die_rolls.mean():.4f}, var = {die_rolls.var():.4f}")


2.6.2 Properties of Variance
------------------------------

1. :math:`\operatorname{Var}(X) \geq 0`, with equality if and only if :math:`X`
   is constant almost surely.

2. **Scaling:** :math:`\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)`.

   *Proof.*

   .. math::

      \operatorname{Var}(aX + b)
      &= E[(aX + b - E[aX + b])^2] \\
      &= E[(aX + b - a\mu - b)^2] \\
      &= E[a^2(X - \mu)^2] = a^2 \operatorname{Var}(X). \quad \square

   Notice that adding a constant :math:`b` does not change the variance ---
   it shifts the distribution without stretching it.

3. **Sum of independent variables:**  If :math:`X` and :math:`Y` are
   independent,

   .. math::

      \operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y).

   (We prove this after introducing covariance.)

.. code-block:: python

   # Variance scaling: currency conversion
   # If X is a claim in USD with Var(X), what is the variance in Euros?
   # Let Y = a*X + b where a = exchange rate, b = fixed processing fee
   a = 0.92   # USD to EUR exchange rate
   b = 25     # fixed €25 processing fee (doesn't affect variance)

   Var_Y = a**2 * Var_X_shortcut
   sigma_Y = np.sqrt(Var_Y)

   print(f"Var(X) in USD           = {Var_X_shortcut:>14,.2f}")
   print(f"Var(aX+b) = a^2·Var(X)  = {a}^2 × {Var_X_shortcut:,.2f} = {Var_Y:>14,.2f}")
   print(f"Std in EUR              = €{sigma_Y:>10,.2f}")
   print(f"(The fixed fee b = €{b} does not affect variance.)")


2.7 Moment Generating Functions
=================================

The actuary has computed :math:`E[X]` and :math:`\operatorname{Var}(X)` for the
claim distribution.  But the risk department also wants the skewness (are large
claims much more common than small ones?) and the kurtosis (how heavy is the
tail?).  Computing each moment separately is tedious.  The **moment generating
function** packages *all* moments into a single function, and any individual
moment can be extracted by differentiation.

The **moment generating function** (MGF) of :math:`X` is

.. math::

   M_X(t) = E[e^{tX}],

defined for all :math:`t` in some neighborhood of zero.

**Why is it useful?**

1. **Moments from derivatives.**  The :math:`n`-th moment of :math:`X` can be
   extracted by differentiating the MGF and evaluating at :math:`t = 0`:

   .. math::

      E[X^n] = M_X^{(n)}(0) = \left.\frac{d^n}{dt^n} M_X(t)\right|_{t=0}.

   *Derivation.*  Expand the exponential in a Taylor series:

   .. math::

      M_X(t) = E[e^{tX}] = E\!\left[\sum_{n=0}^{\infty} \frac{(tX)^n}{n!}\right]
      = \sum_{n=0}^{\infty} \frac{t^n}{n!}\,E[X^n].

   The coefficient of :math:`t^n / n!` is :math:`E[X^n]`, so differentiating
   :math:`n` times and setting :math:`t=0` extracts :math:`E[X^n]`.

2. **Uniqueness.**  If two random variables have the same MGF in a neighborhood
   of zero, they have the same distribution.

3. **Sums of independent variables.**  If :math:`X` and :math:`Y` are
   independent, then

   .. math::

      M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX}]\,E[e^{tY}] = M_X(t)\,M_Y(t).

   The MGF of a sum is the product of the MGFs.

*Example.*  For :math:`X \sim \text{Bernoulli}(p)`:

.. math::

   M_X(t) = E[e^{tX}] = (1-p)\,e^{0} + p\,e^{t} = 1 - p + p\,e^t.

Differentiating: :math:`M_X'(t) = p\,e^t`, so :math:`E[X] = M_X'(0) = p`.
:math:`M_X''(t) = p\,e^t`, so :math:`E[X^2] = p`.  Then
:math:`\operatorname{Var}(X) = p - p^2 = p(1-p)`.

.. code-block:: python

   # MGF of Bernoulli(p): extract moments via numerical differentiation
   p = 0.3

   def mgf_bernoulli(t, p):
       """M_X(t) = (1-p) + p*exp(t)"""
       return (1 - p) + p * np.exp(t)

   # Numerical derivatives at t=0 using central differences
   dt = 1e-7
   # First moment: E[X] = M'(0)
   M_prime_0 = (mgf_bernoulli(dt, p) - mgf_bernoulli(-dt, p)) / (2 * dt)
   # Second moment: E[X^2] = M''(0)
   M_double_prime_0 = (mgf_bernoulli(dt, p) - 2*mgf_bernoulli(0, p)
                        + mgf_bernoulli(-dt, p)) / dt**2

   print(f"Bernoulli(p={p}):")
   print(f"  E[X]   from MGF: {M_prime_0:.6f}  (exact: {p})")
   print(f"  E[X^2] from MGF: {M_double_prime_0:.6f}  (exact: {p})")
   print(f"  Var(X) from MGF: {M_double_prime_0 - M_prime_0**2:.6f}  "
         f"(exact: {p*(1-p):.6f})")

.. code-block:: python

   # MGF of Exponential(rate=lambda): M_X(t) = lambda / (lambda - t) for t < lambda
   # Used for claim-size modeling
   lam = 1 / 800  # rate parameter for minor claim distribution
   def mgf_exponential(t, lam):
       """M_X(t) = lambda / (lambda - t), defined for t < lambda."""
       return lam / (lam - t)

   # Extract moments
   dt = 1e-8
   M1 = (mgf_exponential(dt, lam) - mgf_exponential(-dt, lam)) / (2 * dt)
   M2 = (mgf_exponential(dt, lam) - 2*mgf_exponential(0, lam)
         + mgf_exponential(-dt, lam)) / dt**2

   print(f"\nExponential(rate={lam:.5f}), mean = 1/lambda = ${1/lam:.0f}:")
   print(f"  E[X]   from MGF: ${M1:>10.2f}  (exact: ${1/lam:.2f})")
   print(f"  E[X^2] from MGF: {M2:>12.2f}  (exact: {2/lam**2:.2f})")
   print(f"  Var(X) from MGF: {M2 - M1**2:>12.2f}  (exact: {1/lam**2:.2f})")

.. code-block:: python

   # MGF property: sum of independent variables
   # If X1, X2 ~ iid Exponential(lam), then X1+X2 ~ Gamma(2, lam)
   # M_{X1+X2}(t) = M_X1(t) * M_X2(t) = [lam/(lam-t)]^2
   def mgf_sum_of_two_exp(t, lam):
       return (lam / (lam - t)) ** 2

   # Extract E[X1+X2] — should be 2/lam = 2 * 800 = 1600
   M1_sum = (mgf_sum_of_two_exp(dt, lam) - mgf_sum_of_two_exp(-dt, lam)) / (2 * dt)
   print(f"\nSum of 2 independent Exp(1/800):")
   print(f"  E[X1+X2] from MGF product: ${M1_sum:>10.2f}  (exact: ${2/lam:.2f})")

   # Verify by simulation
   X1_sim = np.random.exponential(1/lam, size=100_000)
   X2_sim = np.random.exponential(1/lam, size=100_000)
   print(f"  E[X1+X2] simulated:        ${(X1_sim + X2_sim).mean():>10.2f}")


2.8 Joint Distributions
=========================

An insurance company does not insure just one risk.  A homeowner's policy
covers *both* fire damage and water damage.  To price the policy correctly, the
actuary must understand how these two claim amounts behave *together* --- does
a year with high fire risk also tend to have high water risk?  This requires
the **joint distribution**.

When we study two (or more) random variables simultaneously, we need their
**joint distribution**.  This tells us not only how each variable behaves on
its own, but how they behave *together*.

**Discrete case.**  The **joint PMF** of :math:`(X, Y)` is

.. math::

   p_{X,Y}(x,y) = P(X = x, Y = y).

**Continuous case.**  The **joint PDF** :math:`f_{X,Y}(x,y)` satisfies

.. math::

   P((X,Y) \in A) = \iint_A f_{X,Y}(x,y)\,dx\,dy.

.. code-block:: python

   # Joint distribution: two correlated stock returns
   # We model (R_tech, R_util) as bivariate normal
   cov_tech_util = rho_stocks * sigma_tech * sigma_util
   mean_vec = [mu_tech, mu_util]
   cov_matrix = [
       [sigma_tech**2,  cov_tech_util],
       [cov_tech_util,  sigma_util**2],
   ]

   # Draw 10,000 joint observations
   returns = np.random.multivariate_normal(mean_vec, cov_matrix, size=10_000)
   R_tech = returns[:, 0]
   R_util = returns[:, 1]

   print("Joint distribution of (R_tech, R_util):")
   print(f"  Sample means:  Tech = {R_tech.mean():.4f}, Utility = {R_util.mean():.4f}")
   print(f"  Sample stds:   Tech = {R_tech.std():.4f}, Utility = {R_util.std():.4f}")
   print(f"  Sample corr:   {np.corrcoef(R_tech, R_util)[0, 1]:.4f}  (target: {rho_stocks})")


2.8.1 Marginal Distributions
------------------------------

The portfolio analyst has the joint distribution of Tech and Utility returns.
But a client who holds only Tech stock cares about the **marginal** distribution
of Tech alone --- the joint information about Utility is irrelevant for them.

The **marginal** distribution of :math:`X` alone is obtained by "summing (or
integrating) out" :math:`Y`:

.. math::

   p_X(x) = \sum_y p_{X,Y}(x,y) \quad \text{(discrete)},

.. math::

   f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy \quad \text{(continuous)}.

.. code-block:: python

   # Marginal distributions: the individual stock returns
   # From the joint simulation above, the marginals should be normal
   # with the parameters we specified
   print("Marginal of R_tech:")
   print(f"  Mean: {R_tech.mean():.4f}  (target: {mu_tech})")
   print(f"  Std:  {R_tech.std():.4f}  (target: {sigma_tech})")
   _, p_val_tech = stats.shapiro(R_tech[:500])
   print(f"  Shapiro-Wilk p-value: {p_val_tech:.4f}  (large p => consistent with normal)")

   print(f"\nMarginal of R_util:")
   print(f"  Mean: {R_util.mean():.4f}  (target: {mu_util})")
   print(f"  Std:  {R_util.std():.4f}  (target: {sigma_util})")
   _, p_val_util = stats.shapiro(R_util[:500])
   print(f"  Shapiro-Wilk p-value: {p_val_util:.4f}")


2.8.2 Independence of Random Variables
----------------------------------------

Are Tech and Utility returns independent?  If they were, knowing Tech had a
great year would tell you nothing about Utility.  In practice, stocks are
rarely independent --- they share exposure to the broader economy.  But
independence is a critical concept: when it holds, joint distributions factor
into products of marginals, and the likelihood function decomposes.

Random variables :math:`X` and :math:`Y` are **independent** if their joint
distribution factors:

.. math::

   p_{X,Y}(x,y) = p_X(x)\,p_Y(y) \quad \text{for all } x, y.

(Similarly for PDFs.)  Independence means knowing :math:`X` tells you nothing
about :math:`Y`, and vice versa.

.. code-block:: python

   # Test for independence: compare joint density to product of marginals
   # We will use a simple binning approach
   from scipy.stats import chi2_contingency

   # Bin the returns into quartiles
   tech_bins = np.digitize(R_tech, np.quantile(R_tech, [0.25, 0.5, 0.75]))
   util_bins = np.digitize(R_util, np.quantile(R_util, [0.25, 0.5, 0.75]))

   # Contingency table
   table = np.zeros((4, 4), dtype=int)
   for t, u in zip(tech_bins, util_bins):
       table[t, u] += 1

   chi2, p_value, dof, expected = chi2_contingency(table)
   print(f"Chi-squared test for independence:")
   print(f"  chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.6f}")
   if p_value < 0.05:
       print(f"  Reject independence (p < 0.05): returns are NOT independent.")
   else:
       print(f"  Cannot reject independence at 5% level.")


2.9 Covariance and Correlation
================================

A portfolio manager is deciding how to split money between Tech and Utility
stocks.  If the two stocks tend to move together (positive covariance),
diversification helps less.  If they move in opposite directions (negative
covariance), holding both dramatically reduces risk.  **Covariance** quantifies
this co-movement.

**Covariance** measures the *linear association* between two random variables.
When :math:`X` tends to be above its mean at the same time :math:`Y` is above
its mean, the covariance is positive.  When they move in opposite directions,
it is negative.

.. math::

   \operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])].

Expanding:

.. math::

   \operatorname{Cov}(X, Y)
   &= E[XY - X\,E[Y] - E[X]\,Y + E[X]\,E[Y]] \\
   &= E[XY] - E[X]\,E[Y] - E[X]\,E[Y] + E[X]\,E[Y] \\
   &= E[XY] - E[X]\,E[Y].

.. code-block:: python

   # Covariance: compute both ways for Tech and Utility returns
   # Method 1: Definition — E[(X - mu_X)(Y - mu_Y)]
   Cov_def = np.mean((R_tech - R_tech.mean()) * (R_util - R_util.mean()))

   # Method 2: Shortcut — E[XY] - E[X]*E[Y]
   E_XY = np.mean(R_tech * R_util)
   Cov_shortcut = E_XY - R_tech.mean() * R_util.mean()

   # Theoretical value
   Cov_theory = rho_stocks * sigma_tech * sigma_util

   print("Cov(R_tech, R_util):")
   print(f"  Theoretical:        {Cov_theory:.6f}")
   print(f"  Definition method:  {Cov_def:.6f}")
   print(f"  Shortcut method:    {Cov_shortcut:.6f}")
   print(f"  NumPy:              {np.cov(R_tech, R_util, ddof=0)[0, 1]:.6f}")

**Properties:**

- :math:`\operatorname{Cov}(X, X) = \operatorname{Var}(X)`.
- :math:`\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)`.
- :math:`\operatorname{Cov}(aX + b, Y) = a\,\operatorname{Cov}(X, Y)`.
- If :math:`X, Y` are independent, :math:`\operatorname{Cov}(X,Y) = 0`
  (but not conversely).

**Variance of a sum (general):**

.. math::

   \operatorname{Var}(X + Y)
   &= E[(X + Y - E[X+Y])^2] \\
   &= E[((X-\mu_X) + (Y-\mu_Y))^2] \\
   &= E[(X-\mu_X)^2] + 2E[(X-\mu_X)(Y-\mu_Y)] + E[(Y-\mu_Y)^2] \\
   &= \operatorname{Var}(X) + 2\operatorname{Cov}(X,Y) + \operatorname{Var}(Y).

When :math:`X` and :math:`Y` are independent, :math:`\operatorname{Cov}(X,Y) = 0`,
recovering :math:`\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)`.

.. code-block:: python

   # Variance of a portfolio: Var(w*R_tech + (1-w)*R_util)
   # This is the formula that makes portfolio theory work.
   weights = np.linspace(0, 1, 11)  # 0% to 100% in Tech

   print(f"{'w_tech':>6s}  {'E[R_p]':>8s}  {'Std(R_p)':>9s}")
   print("-" * 28)
   for w in weights:
       E_Rp = w * mu_tech + (1 - w) * mu_util
       Var_Rp = (w**2 * sigma_tech**2
                 + (1 - w)**2 * sigma_util**2
                 + 2 * w * (1 - w) * Cov_theory)
       Std_Rp = np.sqrt(Var_Rp)
       print(f"{w:6.1%}  {E_Rp:8.2%}  {Std_Rp:9.2%}")

   print("\nNotice: 100% Utility is NOT the lowest-risk portfolio!")
   print("Diversification (mixing) reduces volatility because rho < 1.")

**Correlation** standardizes covariance to lie in :math:`[-1, 1]`:

.. math::

   \rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}.

Values of :math:`\rho = \pm 1` correspond to a perfect linear relationship.

.. code-block:: python

   # Correlation: compute from covariance and standard deviations
   # rho = Cov(X,Y) / (Std(X) * Std(Y))
   rho_computed = Cov_def / (R_tech.std() * R_util.std())

   print(f"Correlation(R_tech, R_util):")
   print(f"  From formula: Cov/(Std_X·Std_Y) = {Cov_def:.6f} / ({R_tech.std():.4f}×{R_util.std():.4f}) = {rho_computed:.4f}")
   print(f"  NumPy:        {np.corrcoef(R_tech, R_util)[0, 1]:.4f}")
   print(f"  Target:       {rho_stocks}")

.. code-block:: python

   # What does correlation LOOK like? Simulate 3 scenarios.
   for rho_val in [-0.8, 0.0, 0.8]:
       cov_val = rho_val * sigma_tech * sigma_util
       cov_mat = [[sigma_tech**2, cov_val], [cov_val, sigma_util**2]]
       sample = np.random.multivariate_normal(mean_vec, cov_mat, size=5000)
       r = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]
       print(f"  rho = {rho_val:+.1f}: sample correlation = {r:+.4f}")


2.10 Conditional Distributions and Conditional Expectation
===========================================================

The portfolio analyst learns that Tech stock dropped 10 % this year.
Given this bad news, what should they *expect* for Utility?  This is not the
unconditional mean of Utility (6 %); the correlation between the stocks means
that a bad year for Tech shifts our expectation for Utility.  **Conditional
expectation** formalizes this shift.

**Discrete conditional PMF.**  Given :math:`Y = y` (with :math:`p_Y(y) > 0`),

.. math::

   p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}.

**Continuous conditional PDF.**  Given :math:`Y = y` (with :math:`f_Y(y) > 0`),

.. math::

   f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.

**Conditional expectation.**

.. math::

   E[X \mid Y = y] = \sum_x x\,p_{X|Y}(x \mid y) \quad \text{(discrete)},
   \qquad
   E[X \mid Y = y] = \int_{-\infty}^{\infty} x\,f_{X|Y}(x \mid y)\,dx \quad \text{(continuous)}.

.. code-block:: python

   # Conditional expectation for bivariate normal:
   # E[R_util | R_tech = r] = mu_util + rho*(sigma_util/sigma_tech)*(r - mu_tech)
   #
   # If Tech dropped 10%:
   r_tech_observed = -0.10
   E_util_given_tech = (mu_util
                        + rho_stocks * (sigma_util / sigma_tech)
                        * (r_tech_observed - mu_tech))

   print(f"Observed R_tech = {r_tech_observed:.0%}")
   print(f"E[R_util | R_tech = {r_tech_observed:.0%}]")
   print(f"  = mu_util + rho·(sigma_util/sigma_tech)·(r_tech - mu_tech)")
   print(f"  = {mu_util} + {rho_stocks}·({sigma_util}/{sigma_tech})·({r_tech_observed} - {mu_tech})")
   print(f"  = {E_util_given_tech:.4f}  ({E_util_given_tech:.2%})")
   print(f"\nUnconditional E[R_util] = {mu_util:.2%}")
   print(f"The bad Tech year lowers our Utility expectation from {mu_util:.2%} to {E_util_given_tech:.2%}.")

.. code-block:: python

   # Verify by simulation: subset returns where R_tech is near -10%
   mask_near = (R_tech >= -0.12) & (R_tech <= -0.08)
   if mask_near.sum() > 0:
       conditional_mean_sim = R_util[mask_near].mean()
       print(f"Simulated E[R_util | R_tech near -10%] = {conditional_mean_sim:.4f}")
       print(f"Formula:                                  {E_util_given_tech:.4f}")
       print(f"(Based on {mask_near.sum()} observations with R_tech in [-12%, -8%])")

**Law of Total Expectation (iterated expectation).**

.. math::

   E[X] = E\!\big[E[X \mid Y]\big].

This is one of the most useful formulas in probability.  It says you can
compute :math:`E[X]` in two steps: first compute the conditional expectation
given :math:`Y`, then average that over :math:`Y`.

*Proof sketch (discrete).*

.. math::

   E[E[X \mid Y]]
   &= \sum_y E[X \mid Y=y]\,p_Y(y) \\
   &= \sum_y \left(\sum_x x\,p_{X|Y}(x \mid y)\right) p_Y(y) \\
   &= \sum_y \sum_x x\,\frac{p_{X,Y}(x,y)}{p_Y(y)}\,p_Y(y) \\
   &= \sum_x x \sum_y p_{X,Y}(x,y) \\
   &= \sum_x x\,p_X(x) = E[X]. \quad \square

.. code-block:: python

   # Law of total expectation: insurance payout
   # E[X] = E[E[X | claim_type]]
   #       = sum over types: E[X|type] * P(type)
   #
   # This is exactly how we computed E_X earlier — now we see it's an
   # instance of the law of total expectation!
   E_X_via_total_exp = np.sum(claim_costs * claim_probs)
   print(f"E[X] via law of total expectation:")
   for label, cost, prob in zip(claim_labels, claim_costs, claim_probs):
       print(f"  E[X|{label:>10s}]·P({label:>10s}) = ${cost:>7,} × {prob:.2f} = ${cost*prob:>8.2f}")
   print(f"  E[X] = ${E_X_via_total_exp:.2f}")

**Law of Total Variance.**

.. math::

   \operatorname{Var}(X) = E[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(E[X \mid Y]).

.. code-block:: python

   # Law of total variance: insurance example
   # Suppose claim sizes within each type are exponentially distributed.
   # Var(X|type) = (mean_type)^2 for exponential.
   # E[X|type] = mean_type.
   claim_vars = claim_costs**2  # Var of Exp(1/mu) = mu^2; for "None", 0^2=0

   # E[Var(X|type)] = sum P(type) * Var(X|type)
   E_var_given_type = np.sum(claim_probs * claim_vars)

   # Var(E[X|type]) = E[(E[X|type])^2] - (E[E[X|type]])^2
   E_cond_mean_sq = np.sum(claim_probs * claim_costs**2)
   E_cond_mean = np.sum(claim_probs * claim_costs)  # = E[X]
   Var_cond_mean = E_cond_mean_sq - E_cond_mean**2

   Var_total = E_var_given_type + Var_cond_mean

   print("Law of Total Variance:")
   print(f"  E[Var(X|Type)]          = {E_var_given_type:>14,.2f}")
   print(f"  Var(E[X|Type])          = {Var_cond_mean:>14,.2f}")
   print(f"  Total = E[Var] + Var[E] = {Var_total:>14,.2f}")
   print(f"  Direct Var(X)           = {Var_X_shortcut:>14,.2f}")
   print(f"  (They match because Var(X|type=None) = 0 and claim_costs are")
   print(f"   the conditional means, so this reduces to the same calculation.)")


2.11 The Law of Large Numbers
===============================

The insurance company sells 50,000 policies.  Can the CEO sleep well knowing
that the average claim cost will be close to the theoretical expected value ---
so that premium income will cover payouts?  The **Law of Large Numbers**
provides exactly this guarantee: with enough policies, the sample average
converges to the true mean.

The **Law of Large Numbers** (LLN) says that the sample average converges to
the population mean as the sample size grows.  This is the theoretical
guarantee behind the idea that "more data is better."

Let :math:`X_1, X_2, \dots` be i.i.d. random variables with :math:`E[X_i] = \mu`
and :math:`\operatorname{Var}(X_i) = \sigma^2 < \infty`.  Define the sample
mean

.. math::

   \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.

**Weak Law of Large Numbers.**  For every :math:`\varepsilon > 0`,

.. math::

   P(|\bar{X}_n - \mu| > \varepsilon) \to 0 \quad \text{as } n \to \infty.

*Intuition.*  The variance of the sample mean is

.. math::

   \operatorname{Var}(\bar{X}_n)
   = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right)
   = \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}(X_i)
   = \frac{\sigma^2}{n}.

As :math:`n \to \infty`, the variance shrinks to zero, so the sample mean
concentrates around :math:`\mu`.

A quick formal proof uses **Chebyshev's inequality**, which states that for
any random variable :math:`Z` with mean :math:`\mu_Z` and variance
:math:`\sigma_Z^2`,

.. math::

   P(|Z - \mu_Z| > \varepsilon) \leq \frac{\sigma_Z^2}{\varepsilon^2}.

Applying this to :math:`\bar{X}_n`:

.. math::

   P(|\bar{X}_n - \mu| > \varepsilon) \leq \frac{\sigma^2}{n\varepsilon^2} \to 0.
   \quad \square

The **Strong Law of Large Numbers** strengthens this to almost-sure
convergence:

.. math::

   P\!\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1.

The LLN justifies using sample averages as estimates of population quantities
and is the theoretical backbone of frequentist statistics.

.. code-block:: python

   # LLN in action: insurance — average claim converges to E[X]
   # Simulate claims for increasing numbers of policyholders
   mu_claim = E_X  # theoretical mean ($530)
   max_n = 50_000

   # Generate all claims at once, then compute running averages
   all_claim_types = np.random.choice(len(claim_probs), size=max_n, p=claim_probs)
   all_claims = claim_costs[all_claim_types]
   running_mean = np.cumsum(all_claims) / np.arange(1, max_n + 1)

   # Report at selected sample sizes
   checkpoints = [10, 100, 500, 1_000, 5_000, 10_000, 50_000]
   print(f"{'n':>8s}  {'X_bar_n':>10s}  {'|X_bar - mu|':>14s}  {'sigma^2/n':>10s}")
   print("-" * 50)
   for n in checkpoints:
       x_bar = running_mean[n - 1]
       chebyshev_bound = Var_X_shortcut / n  # variance of sample mean
       print(f"{n:>8,}  ${x_bar:>9.2f}  ${abs(x_bar - mu_claim):>13.2f}  {chebyshev_bound:>10.2f}")

   print(f"\nTrue E[X] = ${mu_claim:.2f}")
   print("As n grows, the sample mean converges to the true mean.")

.. code-block:: python

   # Var(X_bar_n) = sigma^2 / n — verify the shrinkage
   n_trials = 5_000
   sample_sizes = [10, 50, 200, 1000, 5000]

   print(f"{'n':>6s}  {'Var(X_bar) theory':>18s}  {'Var(X_bar) simulated':>22s}")
   print("-" * 52)
   for n in sample_sizes:
       # Simulate n_trials sample means, each from n claims
       x_bars = np.array([
           claim_costs[np.random.choice(len(claim_probs), size=n, p=claim_probs)].mean()
           for _ in range(n_trials)
       ])
       var_theory = Var_X_shortcut / n
       var_sim = x_bars.var()
       print(f"{n:>6,}  {var_theory:>18,.2f}  {var_sim:>22,.2f}")


2.12 The Central Limit Theorem
================================

The insurance company has 50,000 policies.  Even though individual claims
follow a highly skewed distribution (most are zero, a few are very large), the
*total* payout across all policies is approximately normal.  This seemingly
magical result is the **Central Limit Theorem** --- arguably the most important
theorem in statistics.  It explains why the normal distribution appears
everywhere and why confidence intervals work.

**Theorem.**  Let :math:`X_1, X_2, \dots` be i.i.d. with mean :math:`\mu` and
variance :math:`\sigma^2 \in (0, \infty)`.  Define the standardized sample
mean

.. math::

   Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}
       = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}}.

Then, as :math:`n \to \infty`,

.. math::

   Z_n \xrightarrow{d} Z \sim N(0, 1).

That is, the distribution of :math:`Z_n` converges to a standard normal.

Equivalently,

.. math::

   \bar{X}_n \;\dot\sim\; N\!\left(\mu, \frac{\sigma^2}{n}\right)
   \quad \text{for large } n.

**Intuition.**  No matter what the original distribution of the :math:`X_i`
looks like (as long as it has finite mean and variance), the sample average
of many observations is approximately normally distributed.  This is why the
normal distribution is so prevalent --- many real-world measurements are sums
or averages of many small, independent contributions.

**Proof sketch using MGFs.**  The standardized variable has MGF

.. math::

   M_{Z_n}(t) = \left[M_{X_1}\!\left(\frac{t}{\sigma\sqrt{n}}\right)\,
   e^{-\mu t / (\sigma\sqrt{n})}\right]^n.

Expanding the MGF of :math:`X_1` in a Taylor series around :math:`t = 0` and
taking the limit as :math:`n \to \infty` yields :math:`e^{t^2/2}`, which is
the MGF of the standard normal.

The CLT is critical for constructing confidence intervals, hypothesis tests,
and understanding the behavior of maximum likelihood estimators
(see :ref:`ch4_likelihood` and later chapters).

.. code-block:: python

   # CLT in action: watch the distribution of Z_n become standard normal
   # Source distribution: our insurance claims (highly skewed!)
   mu_claim = E_X
   sigma_claim = np.sqrt(Var_X_shortcut)

   print("CLT demonstration: insurance claims (highly skewed source)\n")
   print(f"Source: E[X] = ${mu_claim:.2f}, Std(X) = ${sigma_claim:,.2f}\n")
   print(f"{'n':>5s}  {'mean(Z_n)':>10s}  {'std(Z_n)':>10s}  {'theory std':>10s}  {'Shapiro p':>10s}")
   print("-" * 58)

   for n in [1, 5, 10, 30, 50, 100, 500]:
       n_reps = 10_000
       x_bars = np.array([
           claim_costs[np.random.choice(len(claim_probs), size=n, p=claim_probs)].mean()
           for _ in range(n_reps)
       ])
       # Standardize: Z_n = (X_bar - mu) / (sigma / sqrt(n))
       Z_n = (x_bars - mu_claim) / (sigma_claim / np.sqrt(n))

       _, p_val = stats.shapiro(Z_n[:500])
       print(f"{n:>5d}  {Z_n.mean():>10.4f}  {Z_n.std():>10.4f}  {1.0:>10.4f}  {p_val:>10.4f}")

   print("\nAs n increases, mean(Z_n) -> 0, std(Z_n) -> 1, and Shapiro p increases,")
   print("confirming convergence to N(0,1).")

.. code-block:: python

   # CLT with a different source: Exponential(rate=2)
   # This shows the CLT is universal — it works for ANY distribution
   lam_clt = 2.0
   mu_exp = 1 / lam_clt        # 0.5
   sigma_exp = 1 / lam_clt     # 0.5

   print(f"\nCLT with Exponential(rate={lam_clt}): mu={mu_exp}, sigma={sigma_exp}\n")
   print(f"{'n':>5s}  {'mean(X_bar)':>12s}  {'std(X_bar)':>12s}  {'theory std':>12s}  {'Shapiro p':>10s}")
   print("-" * 60)

   for n in [1, 5, 30, 100]:
       sample_means = np.array([
           np.random.exponential(1/lam_clt, size=n).mean()
           for _ in range(10_000)
       ])
       z_scores = (sample_means - mu_exp) / (sigma_exp / np.sqrt(n))
       _, p_val = stats.shapiro(z_scores[:500])
       print(f"{n:>5d}  {sample_means.mean():>12.4f}  {sample_means.std():>12.4f}  "
             f"{sigma_exp/np.sqrt(n):>12.4f}  {p_val:>10.4f}")

.. code-block:: python

   # CLT application: poll sampling — why 1,000 respondents are often enough
   true_p = 0.53
   n_respondents = 1000
   n_polls = 10_000

   # Simulate many polls
   polls = np.random.binomial(n_respondents, true_p, size=n_polls) / n_respondents
   se_theory = np.sqrt(true_p * (1 - true_p) / n_respondents)
   margin = 1.96 * se_theory

   print(f"Poll sampling (CLT in action):")
   print(f"  True proportion:       {true_p}")
   print(f"  n respondents:         {n_respondents}")
   print(f"  Theoretical SE:        {se_theory:.4f}")
   print(f"  Simulated SE:          {polls.std():.4f}")
   print(f"  95% margin of error:   +/- {margin:.4f}")
   print(f"  95% CI:                ({true_p - margin:.4f}, {true_p + margin:.4f})")
   coverage = np.mean(np.abs(polls - true_p) <= margin)
   print(f"  Simulated coverage:    {coverage:.4f}  (should be ~0.95)")


2.13 Summary
==============

- A **random variable** maps outcomes to numbers, enabling us to do arithmetic
  with randomness.
- **PMFs** describe discrete variables; **PDFs** describe continuous ones;
  **CDFs** work for both.
- **Expectation** is the probability-weighted average, and it is **linear**.
- **Variance** measures spread; the formula
  :math:`\operatorname{Var}(X) = E[X^2] - (E[X])^2` is indispensable.
- **MGFs** encode all moments and are a powerful tool for proving distributional
  results.
- **Joint, marginal, and conditional** distributions extend these ideas to
  multiple random variables.
- **Covariance** and **correlation** quantify linear association.
- The **Law of Large Numbers** guarantees that averages converge to means.
- The **Central Limit Theorem** guarantees that averages are approximately
  normal.

Throughout this chapter we computed every formula using two concrete scenarios
--- insurance pricing and stock portfolio analysis --- and verified results
with simulation.  The insurance example showed how :math:`E[X]` sets the break-even
premium, how :math:`\operatorname{Var}(X)` measures risk, and how the CLT
makes the total payout predictable.  The stock example showed how covariance
and correlation drive diversification benefits.

With these tools in hand, we are ready to study the most important
distributions in detail in :ref:`ch3_distributions`.