.. _ch16_constrained:

========================================
Chapter 16: Constrained Optimization
========================================

You manage a portfolio of 5 assets. You want to maximize expected return, but
you have a risk budget: the portfolio variance must not exceed 2%. The weights
must be non-negative (no short selling) and sum to one (fully invested). This
is a constrained optimization problem---and if you ignore the constraints, your
optimizer will happily tell you to put 500% of your money in the highest-return
asset and short-sell everything else.

.. code-block:: python

   # What happens when you ignore constraints
   import numpy as np

   np.random.seed(42)
   p = 5
   mu = np.array([0.12, 0.10, 0.07, 0.03, 0.14])   # expected returns
   A = np.random.randn(p, p) * 0.1
   Sigma = A.T @ A + 0.05 * np.eye(p)                # covariance of returns

   # Unconstrained "optimal" portfolio: just maximize mu^T w
   # Without constraints, w -> infinity in the direction of max return
   w_unconstrained = mu / np.linalg.norm(mu)  # just point at highest return
   w_unconstrained *= 10  # scale up for "more return"
   print("Unconstrained 'solution':")
   print(f"  Weights: {np.round(w_unconstrained, 2)}")
   print(f"  Sum of weights: {w_unconstrained.sum():.2f} (should be 1.00)")
   print(f"  Min weight: {w_unconstrained.min():.2f} (should be >= 0)")
   print(f"  Portfolio variance: {w_unconstrained @ Sigma @ w_unconstrained:.4f} (budget: 0.02)")
   print("\nThis is nonsense. We NEED constraints.")

In many likelihood problems the parameters cannot take arbitrary values.
Probabilities must be non-negative and sum to one. Variances must be positive.
Correlation matrices must be positive semi-definite. This chapter develops the
theory and algorithms for optimization under constraints, starting from
Lagrange multipliers for equality constraints, extending to the
Karush--Kuhn--Tucker (KKT) conditions for inequality constraints, and then
covering the main algorithmic approaches: augmented Lagrangian methods,
barrier (interior-point) methods, and penalty methods. We close with the
practical alternative of reparameterization, which transforms a constrained
problem into an unconstrained one.

.. admonition:: Why Constrained Optimization Matters

   Every time you fit a model with probabilities, variances, or proportions,
   you are doing constrained optimization --- whether you realize it or not.
   The unconstrained methods from earlier chapters (:ref:`ch12_gradient`,
   :ref:`ch13_newton`, :ref:`ch14_quasi_newton`) are powerful, but they can
   happily produce a negative variance or probabilities that sum to 1.3.
   Constrained optimization gives you the mathematical framework to keep
   parameters in their proper domains while still finding the optimum.


16.1 Equality Constraints and Lagrange Multipliers
====================================================

Problem Formulation
--------------------

Consider the problem

.. math::
   :label: eq_constrained

   \min_{\boldsymbol{\theta} \in \mathbb{R}^p}\;
   f(\boldsymbol{\theta})
   \qquad\text{subject to}\qquad
   h_j(\boldsymbol{\theta}) = 0,
   \quad j = 1, \dots, m,

where :math:`f` and :math:`h_1, \dots, h_m` are continuously differentiable.

Geometric Intuition
--------------------

At a constrained minimum :math:`\boldsymbol{\theta}^*`, the gradient of
:math:`f` cannot have a component in the feasible set --- otherwise we could
decrease :math:`f` by moving along the constraint surface. More precisely, the
gradient :math:`\nabla f(\boldsymbol{\theta}^*)` must be *normal* to the
constraint surface.

Think of it this way: you are hiking on a mountain and are confined to a trail
(the constraint surface).  At the lowest point of the trail, the slope of the
mountain must be perpendicular to the trail --- if there were any downhill
component along the trail, you could walk further and go lower.

The constraint surface is (locally) a manifold defined by
:math:`\mathbf{h}(\boldsymbol{\theta}) = \mathbf{0}`. Its tangent space at
:math:`\boldsymbol{\theta}^*` is the null space of the Jacobian matrix

.. math::

   \mathbf{J}_h(\boldsymbol{\theta}^*)
   = \begin{pmatrix}
     \nabla h_1(\boldsymbol{\theta}^*)^{\!\top} \\
     \vdots \\
     \nabla h_m(\boldsymbol{\theta}^*)^{\!\top}
     \end{pmatrix}
   \in \mathbb{R}^{m \times p}.

The normal space to the constraint surface is spanned by the rows of
:math:`\mathbf{J}_h`, i.e., by the gradients :math:`\nabla h_j`. The condition
that :math:`\nabla f` lies in this normal space is

.. math::

   \nabla f(\boldsymbol{\theta}^*)
   = \sum_{j=1}^m \lambda_j^*\,\nabla h_j(\boldsymbol{\theta}^*),

or equivalently

.. math::
   :label: tangent_cond

   \nabla f(\boldsymbol{\theta}^*)
   - \sum_{j=1}^m \lambda_j^*\,\nabla h_j(\boldsymbol{\theta}^*)
   = \mathbf{0}.

The scalars :math:`\lambda_1^*, \dots, \lambda_m^*` are the **Lagrange
multipliers**.

The Lagrangian
--------------

Define the **Lagrangian function**

.. math::
   :label: lagrangian

   \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda})
   = f(\boldsymbol{\theta})
   - \sum_{j=1}^m \lambda_j\,h_j(\boldsymbol{\theta}).

Reading this formula: the Lagrangian takes the objective :math:`f` and subtracts
a penalty for each constraint, weighted by the multiplier :math:`\lambda_j`.
If :math:`\lambda_j` is large, the corresponding constraint exerts a strong
pull on the solution. The Lagrangian converts the constrained problem into an
unconstrained one: finding a stationary point of
:math:`\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda})` with respect to
*both* :math:`\boldsymbol{\theta}` and :math:`\boldsymbol{\lambda}` gives you
the constrained optimum and the multipliers simultaneously.

The necessary conditions for a constrained minimum are found by setting the
partial derivatives to zero:

.. math::
   :label: lag_conditions

   \nabla_{\boldsymbol{\theta}} \mathcal{L}
   &= \nabla f(\boldsymbol{\theta})
   - \sum_{j=1}^m \lambda_j\,\nabla h_j(\boldsymbol{\theta})
   = \mathbf{0}, \\
   \nabla_{\boldsymbol{\lambda}} \mathcal{L}
   &= -\mathbf{h}(\boldsymbol{\theta}) = \mathbf{0}.

The first equation is :eq:`tangent_cond` --- the gradient of :math:`f` must lie
in the span of the constraint gradients. The second equation is simply the
requirement that all constraints are satisfied (:math:`h_j = 0`). Together
they form a system of :math:`p + m` equations in :math:`p + m` unknowns
:math:`(\boldsymbol{\theta}, \boldsymbol{\lambda})`, which can be solved by
Newton's method or other root-finding techniques.

Let us solve this both analytically and numerically and compare.

.. code-block:: python

   # Lagrange multipliers: analytical vs. numerical
   # Problem: minimize x^2 + 2y^2 subject to x + y = 1
   import numpy as np
   from scipy.optimize import fsolve

   # --- Analytical solution ---
   # Lagrangian: L = x^2 + 2y^2 - lambda*(x + y - 1)
   # dL/dx = 2x - lambda = 0  =>  x = lambda/2
   # dL/dy = 4y - lambda = 0  =>  y = lambda/4
   # Constraint: x + y = 1  =>  lambda/2 + lambda/4 = 1  =>  lambda = 4/3
   lam_analytic = 4.0 / 3.0
   x_analytic = lam_analytic / 2.0
   y_analytic = lam_analytic / 4.0

   # --- Numerical solution (fsolve) ---
   def lagrange_system(vars):
       x, y, lam = vars
       return [2*x - lam,        # dL/dx = 0
               4*y - lam,        # dL/dy = 0
               x + y - 1]        # constraint

   sol = fsolve(lagrange_system, [0.5, 0.5, 1.0])
   x_num, y_num, lam_num = sol

   print("Minimize x^2 + 2y^2 subject to x + y = 1")
   print(f"\n{'':>12s}  {'x':>8s}  {'y':>8s}  {'lambda':>8s}  {'f(x,y)':>8s}")
   print("-" * 50)
   print(f"{'Analytical':>12s}  {x_analytic:8.4f}  {y_analytic:8.4f}  "
         f"{lam_analytic:8.4f}  {x_analytic**2 + 2*y_analytic**2:8.4f}")
   print(f"{'Numerical':>12s}  {x_num:8.4f}  {y_num:8.4f}  "
         f"{lam_num:8.4f}  {x_num**2 + 2*y_num**2:8.4f}")
   print(f"\nConstraint satisfied: x + y = {x_num + y_num:.8f}")

.. admonition:: Intuition

   The Lagrange multiplier :math:`\lambda^*` has a beautiful economic
   interpretation: it tells you the *rate* at which the optimal value of
   :math:`f` changes if you relax the constraint by an infinitesimal amount.
   In operations research, this is called the **shadow price** of the
   constraint.  If :math:`\lambda^*` is large, the constraint is "expensive"
   --- the objective would improve significantly if the constraint were
   loosened.

.. code-block:: python

   # Shadow price interpretation: relax the constraint and observe
   import numpy as np
   from scipy.optimize import fsolve

   def solve_relaxed(c):
       """Solve min x^2 + 2y^2 s.t. x + y = c, return (x*, y*, f*)."""
       # Analytical: x = 2c/3, y = c/3, f = 2c^2/3
       return 2*c/3, c/3, 2*c**2/3

   print("Shadow price of the constraint x + y = c:")
   print(f"{'c':>6s}  {'x*':>8s}  {'y*':>8s}  {'f*':>8s}  {'df*/dc':>8s}")
   print("-" * 44)
   prev_f = None
   for c in [0.9, 0.95, 1.0, 1.05, 1.1]:
       x, y, f = solve_relaxed(c)
       df_dc = f - prev_f if prev_f is not None else 0.0
       # The analytical derivative df*/dc = 4c/3, at c=1 this is 4/3 = lambda*
       print(f"{c:6.2f}  {x:8.4f}  {y:8.4f}  {f:8.4f}  {df_dc / 0.05:8.4f}")
       prev_f = f

   print(f"\ndf*/dc at c=1 = {4.0/3.0:.4f} = lambda* (the Lagrange multiplier)")

Second-Order Sufficient Conditions
------------------------------------

The first-order conditions :eq:`lag_conditions` are necessary but not
sufficient. A second-order sufficient condition for a local minimum is that the
**bordered Hessian** (or equivalently the Hessian of the Lagrangian restricted
to the tangent space of the constraints) is positive definite:

.. math::

   \mathbf{v}^{\!\top}
   \nabla^2_{\boldsymbol{\theta}\boldsymbol{\theta}} \mathcal{L}
   (\boldsymbol{\theta}^*, \boldsymbol{\lambda}^*)\,\mathbf{v} > 0
   \quad\text{for all } \mathbf{v} \neq \mathbf{0}
   \text{ with } \mathbf{J}_h(\boldsymbol{\theta}^*)\mathbf{v} = \mathbf{0},

where

.. math::

   \nabla^2_{\boldsymbol{\theta}\boldsymbol{\theta}} \mathcal{L}
   = \nabla^2 f - \sum_{j=1}^m \lambda_j\,\nabla^2 h_j.


16.2 Inequality Constraints and KKT Conditions
=================================================

Problem Formulation
--------------------

The general constrained optimization problem is

.. math::
   :label: gen_constrained

   \min_{\boldsymbol{\theta}}\; f(\boldsymbol{\theta})
   \qquad\text{subject to}\qquad
   &h_j(\boldsymbol{\theta}) = 0, \quad j = 1, \dots, m, \\
   &g_i(\boldsymbol{\theta}) \leq 0, \quad i = 1, \dots, q.

Inequality constraints bring a new subtlety: at the solution, some constraints
will be *active* (holding with equality) and others will be *inactive* (strict
inequality).  The KKT conditions elegantly handle both cases.

Deriving the KKT Conditions
-----------------------------

Define the Lagrangian with both equality and inequality constraints:

.. math::
   :label: kkt_lagrangian

   \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}, \boldsymbol{\mu})
   = f(\boldsymbol{\theta})
   - \sum_{j=1}^m \lambda_j\,h_j(\boldsymbol{\theta})
   + \sum_{i=1}^q \mu_i\,g_i(\boldsymbol{\theta}).

At a constrained minimum :math:`(\boldsymbol{\theta}^*, \boldsymbol{\lambda}^*,
\boldsymbol{\mu}^*)`, the following **Karush--Kuhn--Tucker (KKT) conditions**
must hold (under a constraint qualification such as LICQ):

1. **Stationarity:**

   .. math::

      \nabla_{\boldsymbol{\theta}} \mathcal{L}
      = \nabla f(\boldsymbol{\theta}^*)
      - \sum_{j=1}^m \lambda_j^*\,\nabla h_j(\boldsymbol{\theta}^*)
      + \sum_{i=1}^q \mu_i^*\,\nabla g_i(\boldsymbol{\theta}^*)
      = \mathbf{0}.

2. **Primal feasibility:**

   .. math::

      h_j(\boldsymbol{\theta}^*) = 0 \quad\forall\, j,
      \qquad
      g_i(\boldsymbol{\theta}^*) \leq 0 \quad\forall\, i.

3. **Dual feasibility:**

   .. math::

      \mu_i^* \geq 0 \quad\forall\, i.

4. **Complementary slackness:**

   .. math::
      :label: comp_slack

      \mu_i^*\,g_i(\boldsymbol{\theta}^*) = 0 \quad\forall\, i.

Complementary Slackness Explained
-----------------------------------

Condition :eq:`comp_slack` says that for each inequality constraint, either the
constraint is *active* (:math:`g_i(\boldsymbol{\theta}^*) = 0`) and the
multiplier can be positive, or the constraint is *inactive*
(:math:`g_i(\boldsymbol{\theta}^*) < 0`) and the multiplier must be zero.

**Intuition:** An inactive constraint does not affect the solution --- it is as
if the constraint were not there. Only active constraints "push" on the
solution, and the multiplier :math:`\mu_i^*` measures the strength of that push.

The KKT conditions are *sufficient* for a global minimum when :math:`f` is
convex and the constraints define a convex feasible set (i.e., :math:`g_i`
convex, :math:`h_j` affine).

Let us verify the KKT conditions in code on the portfolio problem.

.. code-block:: python

   # Checking KKT conditions on the portfolio problem
   import numpy as np
   from scipy.optimize import minimize

   np.random.seed(42)
   p = 5
   mu = np.array([0.12, 0.10, 0.07, 0.03, 0.14])
   A = np.random.randn(p, p) * 0.1
   Sigma = A.T @ A + 0.05 * np.eye(p)
   sigma2_max = 0.02

   # Solve with scipy
   constraints = [
       {'type': 'eq',   'fun': lambda w: np.sum(w) - 1},
       {'type': 'ineq', 'fun': lambda w: sigma2_max - w @ Sigma @ w},
   ]
   bounds = [(0, 1)] * p
   w0 = np.ones(p) / p
   res = minimize(lambda w: -mu @ w, w0, method='SLSQP',
                  bounds=bounds, constraints=constraints)
   w_star = res.x

   # Check all four KKT conditions
   print("=== KKT Condition Check ===\n")

   # 1. Primal feasibility
   eq_violation = abs(np.sum(w_star) - 1)
   risk = w_star @ Sigma @ w_star
   print("1. PRIMAL FEASIBILITY")
   print(f"   Sum of weights = {np.sum(w_star):.8f} (violation: {eq_violation:.2e})")
   print(f"   Portfolio variance = {risk:.6f} <= {sigma2_max} : "
         f"{'OK' if risk <= sigma2_max + 1e-8 else 'VIOLATED'}")
   for i in range(p):
       status = "ACTIVE" if w_star[i] < 1e-6 else "inactive"
       print(f"   w[{i}] = {w_star[i]:.6f} >= 0 : {status}")

   # 2. Complementary slackness (for bound constraints)
   print("\n2. COMPLEMENTARY SLACKNESS")
   print("   For each w_i >= 0 constraint:")
   for i in range(p):
       # mu_i * w_i should = 0 (either mu_i = 0 or w_i = 0)
       active = w_star[i] < 1e-6
       print(f"   Asset {i}: w={w_star[i]:.6f}, "
             f"{'constraint active (multiplier can be > 0)' if active else 'constraint inactive (multiplier = 0)'}")

   # 3. Risk constraint
   risk_active = abs(risk - sigma2_max) < 1e-4
   print(f"\n   Risk constraint: {'ACTIVE' if risk_active else 'inactive'} "
         f"(variance = {risk:.6f}, budget = {sigma2_max})")

   # Solution quality
   print(f"\n=== Optimal Portfolio ===")
   print(f"{'Asset':>5s}  {'Weight':>8s}  {'Return':>8s}  {'Contribution':>12s}")
   print("-" * 38)
   for i in range(p):
       print(f"{i:5d}  {w_star[i]:8.4f}  {mu[i]:8.2%}  {w_star[i]*mu[i]:12.4%}")
   print(f"{'Total':>5s}  {np.sum(w_star):8.4f}  {mu @ w_star:8.2%}")


16.3 Examples in Maximum Likelihood
=====================================

Multinomial MLE with Simplex Constraint
-----------------------------------------

Suppose :math:`X = (X_1, \dots, X_K) \sim \text{Multinomial}(n, \boldsymbol{\pi})`
with :math:`\boldsymbol{\pi} = (\pi_1, \dots, \pi_K)`. The log-likelihood is

.. math::

   \ell(\boldsymbol{\pi})
   = \sum_{k=1}^K X_k \log \pi_k + \text{const}.

We maximize subject to :math:`\sum_k \pi_k = 1` and :math:`\pi_k \geq 0`.

**Using Lagrange multipliers** for the equality constraint (the positivity
constraints turn out to be inactive at the interior solution):

.. math::

   \mathcal{L} = \sum_{k=1}^K X_k \log \pi_k + \lambda\Bigl(1 - \sum_{k=1}^K \pi_k\Bigr).

Setting :math:`\partial\mathcal{L}/\partial\pi_k = 0`:

.. math::

   \frac{X_k}{\pi_k} - \lambda = 0
   \quad\Longrightarrow\quad
   \pi_k = \frac{X_k}{\lambda}.

Summing over :math:`k`: :math:`1 = n/\lambda`, so :math:`\lambda = n` and

.. math::

   \hat{\pi}_k = \frac{X_k}{n},

as expected. Let us verify this Lagrange multiplier derivation numerically.

.. code-block:: python

   # Multinomial MLE: Lagrange multiplier solution vs. direct formula
   import numpy as np
   from scipy.optimize import minimize

   np.random.seed(42)
   K = 6
   counts = np.array([45, 30, 15, 5, 3, 2])
   n = counts.sum()

   # Analytical solution from Lagrange multipliers
   pi_analytical = counts / n

   # Numerical solution using scipy with constraints
   constraints = [{'type': 'eq', 'fun': lambda pi: np.sum(pi) - 1}]
   bounds = [(1e-10, 1)] * K
   pi0 = np.ones(K) / K
   res = minimize(lambda pi: -np.sum(counts * np.log(pi)), pi0,
                  method='SLSQP', bounds=bounds, constraints=constraints)
   pi_numerical = res.x

   print(f"{'Category':>8s}  {'Count':>5s}  {'Analytical':>10s}  {'Numerical':>10s}  {'Match':>6s}")
   print("-" * 48)
   for k in range(K):
       match = "yes" if abs(pi_analytical[k] - pi_numerical[k]) < 1e-6 else "no"
       print(f"{k+1:8d}  {counts[k]:5d}  {pi_analytical[k]:10.4f}  "
             f"{pi_numerical[k]:10.4f}  {match:>6s}")
   print(f"\nLagrange multiplier lambda = n = {n}")

The Full Portfolio Optimization
----------------------------------

Now let us solve the portfolio problem properly with constraints, comparing
the analytical Lagrange approach with numerical optimization.

.. code-block:: python

   # Portfolio optimization: solve analytically (equality only) and numerically
   import numpy as np
   from scipy.optimize import minimize

   np.random.seed(42)
   p = 5
   mu = np.array([0.12, 0.10, 0.07, 0.03, 0.14])
   A = np.random.randn(p, p) * 0.1
   Sigma = A.T @ A + 0.05 * np.eye(p)

   # Minimum-variance portfolio (equality constraint only: w^T 1 = 1)
   # Lagrangian: L = 0.5 w^T Sigma w - lambda*(1^T w - 1)
   # dL/dw = Sigma w - lambda 1 = 0  =>  w = lambda * Sigma^{-1} 1
   # Constraint: 1^T w = 1  =>  lambda = 1 / (1^T Sigma^{-1} 1)
   Sigma_inv = np.linalg.inv(Sigma)
   ones = np.ones(p)
   lam_mv = 1.0 / (ones @ Sigma_inv @ ones)
   w_mv_analytic = lam_mv * Sigma_inv @ ones

   # Numerical
   res_mv = minimize(lambda w: 0.5 * w @ Sigma @ w, np.ones(p)/p,
                     method='SLSQP',
                     constraints=[{'type': 'eq', 'fun': lambda w: np.sum(w) - 1}])
   w_mv_numerical = res_mv.x

   print("Minimum-variance portfolio (equality constraint only):")
   print(f"{'Asset':>5s}  {'Analytic':>10s}  {'Numerical':>10s}")
   print("-" * 30)
   for i in range(p):
       print(f"{i:5d}  {w_mv_analytic[i]:10.4f}  {w_mv_numerical[i]:10.4f}")
   print(f"{'Var':>5s}  {w_mv_analytic @ Sigma @ w_mv_analytic:10.6f}  "
         f"{w_mv_numerical @ Sigma @ w_mv_numerical:10.6f}")

   # Full problem: maximize return with risk budget + no short selling
   sigma2_max = 0.02
   constraints_full = [
       {'type': 'eq',   'fun': lambda w: np.sum(w) - 1},
       {'type': 'ineq', 'fun': lambda w: sigma2_max - w @ Sigma @ w},
   ]
   bounds = [(0, 1)] * p
   res_full = minimize(lambda w: -mu @ w, np.ones(p)/p,
                       method='SLSQP', bounds=bounds, constraints=constraints_full)

   print(f"\nFull constrained portfolio (risk budget {sigma2_max}):")
   print(f"  Weights: {np.round(res_full.x, 4)}")
   print(f"  Return:  {mu @ res_full.x:.4f}")
   print(f"  Risk:    {res_full.x @ Sigma @ res_full.x:.6f}")

Positive-Definiteness Constraint on Covariance
-------------------------------------------------

In multivariate normal MLE, :math:`\boldsymbol{\Sigma}` must be positive
definite. This is a matrix inequality constraint. In practice, this is handled
by reparameterization (Section 16.7) rather than by explicit KKT conditions ---
for instance, working with the Cholesky factor :math:`\mathbf{L}` such that
:math:`\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^{\!\top}`.

Variance Positivity
--------------------

For a normal model :math:`\mathcal{N}(\mu, \sigma^2)`, the constraint
:math:`\sigma^2 > 0` is automatically satisfied by the MLE
:math:`\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2 > 0` (for
non-degenerate data). But in mixed models or penalized settings, the boundary
:math:`\sigma^2 = 0` can be relevant, and the KKT conditions (specifically
complementary slackness) govern the behavior.


16.4 Augmented Lagrangian Method
==================================

Motivation
----------

The basic Lagrangian approach requires solving the KKT system exactly, which
can be difficult. The **augmented Lagrangian method** adds a quadratic penalty
term to improve numerical conditioning.

Formulation
-----------

For equality constraints :math:`h_j(\boldsymbol{\theta}) = 0`, the augmented
Lagrangian is

.. math::
   :label: aug_lag

   \mathcal{L}_\rho(\boldsymbol{\theta}, \boldsymbol{\lambda})
   = f(\boldsymbol{\theta})
   - \sum_{j=1}^m \lambda_j\,h_j(\boldsymbol{\theta})
   + \frac{\rho}{2}\sum_{j=1}^m h_j(\boldsymbol{\theta})^2,

where :math:`\rho > 0` is a penalty parameter. The last term penalizes
constraint violations, encouraging feasibility even when the Lagrange
multiplier estimates are imperfect.

The augmented Lagrangian combines the best of both worlds: the Lagrange
multipliers handle the constraint "exactly" in the limit, while the quadratic
penalty provides numerical stability along the way.

Algorithm
---------

.. code-block:: text

   Initialize: theta_0, lambda_0, rho_0
   For k = 0, 1, 2, ...
       1. Minimize L_rho(theta, lambda_k) with respect to theta -> theta_{k+1}
       2. Update multipliers: lambda_{k+1,j} = lambda_{k,j} - rho_k h_j(theta_{k+1})
       3. Optionally increase rho_k -> rho_{k+1}
       4. Check convergence

The multiplier update in step 2 is a **dual ascent** step: it moves the
multipliers in the direction that increases the Lagrangian, encouraging the
constraints to be satisfied more tightly.

Let us implement the augmented Lagrangian from scratch and watch it converge.

.. code-block:: python

   # Augmented Lagrangian method: min x^2 + 2y^2 s.t. x + y = 1
   import numpy as np

   np.random.seed(42)

   # Objective and constraint
   def f(x, y):
       return x**2 + 2*y**2

   def h(x, y):
       return x + y - 1  # constraint: h = 0

   def grad_f(x, y):
       return np.array([2*x, 4*y])

   def grad_h(x, y):
       return np.array([1.0, 1.0])

   # Augmented Lagrangian method
   theta = np.array([0.0, 0.0])  # initial guess
   lam = 0.0                      # initial multiplier
   rho = 1.0                      # penalty parameter

   print(f"{'Iter':>4s}  {'x':>8s}  {'y':>8s}  {'lambda':>8s}  {'rho':>6s}  "
         f"{'f(x,y)':>8s}  {'|h(x,y)|':>10s}")
   print("-" * 62)

   for k in range(15):
       # Inner loop: minimize augmented Lagrangian w.r.t. theta using GD
       for _ in range(100):
           x, y = theta
           hval = h(x, y)
           grad = grad_f(x, y) - lam * grad_h(x, y) + rho * hval * grad_h(x, y)
           theta = theta - 0.05 * grad

       x, y = theta
       hval = h(x, y)
       print(f"{k:4d}  {x:8.5f}  {y:8.5f}  {lam:8.4f}  {rho:6.1f}  "
             f"{f(x, y):8.5f}  {abs(hval):10.2e}")

       # Update multiplier (dual ascent)
       lam = lam - rho * hval

       # Optionally increase rho
       if abs(hval) > 0.01:
           rho *= 1.5

       if abs(hval) < 1e-8:
           print(f"Converged at iteration {k}.")
           break

   print(f"\nSolution: x={theta[0]:.6f}, y={theta[1]:.6f}")
   print(f"Exact:    x={2/3:.6f}, y={1/3:.6f}")

**Advantages over a pure penalty method:** The augmented Lagrangian converges
to the correct solution for a *finite* value of :math:`\rho`; a pure penalty
method requires :math:`\rho \to \infty`, causing ill-conditioning.

Connection to ADMM
--------------------

The **Alternating Direction Method of Multipliers (ADMM)** applies the
augmented Lagrangian method to a problem that has been split into two blocks:

.. math::

   \min\; f_1(\boldsymbol{\theta}_1) + f_2(\boldsymbol{\theta}_2)
   \quad\text{s.t.}\quad
   \mathbf{A}_1\boldsymbol{\theta}_1 + \mathbf{A}_2\boldsymbol{\theta}_2
   = \mathbf{b}.

The augmented Lagrangian is minimized alternately over
:math:`\boldsymbol{\theta}_1` and :math:`\boldsymbol{\theta}_2`, followed by a
multiplier update. ADMM is widely used in machine learning for distributed
optimization and for problems with structured regularizers (e.g., LASSO,
group LASSO).


16.5 Barrier (Interior-Point) Methods
========================================

Motivation
----------

Interior-point methods approach the constrained optimum from strictly inside
the feasible region, using a barrier function that goes to infinity at the
boundary.

.. admonition:: Intuition

   Imagine you are optimizing in a room, and the walls represent constraints.
   A barrier function creates a "force field" near the walls that repels you
   from getting too close.  As the optimization proceeds, the force field is
   gradually weakened (by increasing :math:`t`), allowing you to creep closer
   and closer to the walls --- but you never actually hit them.  In the limit,
   you end up right at the constrained optimum, possibly touching one or more
   walls.

Log-Barrier Function
---------------------

For inequality constraints :math:`g_i(\boldsymbol{\theta}) \leq 0`, the
**log-barrier** function is

.. math::
   :label: log_barrier

   B(\boldsymbol{\theta}) = -\sum_{i=1}^q \log\bigl(-g_i(\boldsymbol{\theta})\bigr).

Note that :math:`B(\boldsymbol{\theta}) \to +\infty` as any constraint becomes
active (:math:`g_i \to 0^-`), so minimizing :math:`f + (1/t)\,B` keeps the
iterates strictly inside the feasible region.

The Barrier Problem
--------------------

For a parameter :math:`t > 0`, solve the unconstrained problem

.. math::
   :label: barrier_prob

   \min_{\boldsymbol{\theta}}\;
   f(\boldsymbol{\theta}) + \frac{1}{t}\,B(\boldsymbol{\theta})
   = f(\boldsymbol{\theta})
   - \frac{1}{t}\sum_{i=1}^q \log\bigl(-g_i(\boldsymbol{\theta})\bigr).

As :math:`t \to \infty`, the barrier term vanishes and the solution approaches
the constrained optimum.

Let us implement the barrier method on our portfolio problem and watch the
barrier parameter grow as the solution converges to the constraint boundary.

.. code-block:: python

   # Barrier method for portfolio optimization
   # Maximize mu^T w s.t. w^T Sigma w <= sigma2_max, sum(w)=1, w >= 0
   # Reformulated as: min -mu^T w with barrier for inequality constraints
   import numpy as np

   np.random.seed(42)
   p = 3  # small for clarity
   mu = np.array([0.12, 0.07, 0.14])
   A = np.random.randn(p, p) * 0.1
   Sigma = A.T @ A + 0.05 * np.eye(p)
   sigma2_max = 0.03

   # Constraints: w >= 0 (p constraints), w^T Sigma w <= sigma2_max (1 constraint)
   # We handle sum(w) = 1 by projecting after each step

   def barrier_obj(w, t):
       """Objective + (1/t) * barrier."""
       obj = -mu @ w
       # Barrier for w_i > 0
       if np.any(w <= 0):
           return 1e10
       obj -= (1.0 / t) * np.sum(np.log(w))
       # Barrier for sigma2_max - w^T Sigma w > 0
       slack = sigma2_max - w @ Sigma @ w
       if slack <= 0:
           return 1e10
       obj -= (1.0 / t) * np.log(slack)
       return obj

   def barrier_grad(w, t):
       """Gradient of objective + (1/t) * barrier."""
       grad = -mu
       grad -= (1.0 / t) / w                           # d/dw [-log(w_i)] = -1/w_i
       slack = sigma2_max - w @ Sigma @ w
       grad -= (1.0 / t) * (-2 * Sigma @ w) / slack    # d/dw [-log(slack)]
       return grad

   def project_simplex(w):
       """Project w onto the simplex {w >= 0, sum(w) = 1}."""
       # Simple projection: normalize positive part
       w = np.maximum(w, 1e-10)
       return w / np.sum(w)

   w = np.ones(p) / p  # start at equal weights (feasible)

   print(f"{'t':>8s}  ", end="")
   for i in range(p):
       print(f"{'w'+str(i):>8s}  ", end="")
   print(f"{'Return':>8s}  {'Risk':>8s}  {'Barrier obj':>12s}")
   print("-" * 70)

   for t in [1, 5, 20, 100, 500, 2000, 10000]:
       # Inner loop: projected gradient descent
       for step in range(500):
           g = barrier_grad(w, t)
           lr = 0.0005 / (1 + step * 0.001)
           w = w - lr * g
           w = project_simplex(w)
           # Ensure feasibility
           while w @ Sigma @ w >= sigma2_max:
               w = 0.95 * w + 0.05 * np.ones(p) / p

       ret = mu @ w
       risk = w @ Sigma @ w
       bobj = barrier_obj(w, t)
       print(f"{t:8d}  ", end="")
       for i in range(p):
           print(f"{w[i]:8.4f}  ", end="")
       print(f"{ret:8.4f}  {risk:8.6f}  {bobj:12.6f}")

   print(f"\nAs t -> inf, the barrier weakens and the solution approaches")
   print(f"the constrained optimum. Risk budget: {sigma2_max}")

The Central Path
-----------------

The set of solutions :math:`\{\boldsymbol{\theta}^*(t) : t > 0\}` traces a
smooth curve in parameter space called the **central path**. As :math:`t`
increases, the points on the central path approach the constrained optimum.

The first-order optimality condition for :eq:`barrier_prob` is

.. math::

   \nabla f(\boldsymbol{\theta})
   + \frac{1}{t}\sum_{i=1}^q \frac{\nabla g_i(\boldsymbol{\theta})}{-g_i(\boldsymbol{\theta})}
   = \mathbf{0}.

Defining :math:`\mu_i = 1/(t \cdot (-g_i(\boldsymbol{\theta})))`, this becomes

.. math::

   \nabla f(\boldsymbol{\theta})
   + \sum_{i=1}^q \mu_i\,\nabla g_i(\boldsymbol{\theta}) = \mathbf{0},
   \qquad
   \mu_i\,g_i(\boldsymbol{\theta}) = -1/t,

which converges to the KKT conditions (with complementary slackness
:math:`\mu_i g_i = 0`) as :math:`t \to \infty`.

Algorithm
---------

.. code-block:: text

   Initialize: theta_0 strictly feasible, t_0 > 0, growth factor kappa > 1
   For k = 0, 1, 2, ...
       1. Solve the barrier problem (16.7) using Newton's method -> theta_k*
       2. Increase t: t_{k+1} = kappa * t_k
       3. Use theta_k* as the starting point for the next barrier problem
       4. Stop when q/t < epsilon (duality gap bound)

Interior-point methods have excellent theoretical properties: they can solve
convex problems with :math:`q` inequality constraints in
:math:`O(\sqrt{q}\,\log(1/\epsilon))` Newton steps.


16.6 Penalty Methods
======================

Quadratic Penalty Method
--------------------------

The simplest approach to constrained optimization is to add a penalty for
constraint violations. For equality constraints:

.. math::
   :label: quad_penalty

   \min_{\boldsymbol{\theta}}\;
   f(\boldsymbol{\theta})
   + \frac{\rho}{2}\sum_{j=1}^m h_j(\boldsymbol{\theta})^2.

For inequality constraints :math:`g_i(\boldsymbol{\theta}) \leq 0`:

.. math::

   \min_{\boldsymbol{\theta}}\;
   f(\boldsymbol{\theta})
   + \frac{\rho}{2}\sum_{i=1}^q \bigl[\max(0, g_i(\boldsymbol{\theta}))\bigr]^2.

Let us implement the penalty method and see how increasing :math:`\rho` drives
the solution toward feasibility---and also how it causes ill-conditioning.

.. code-block:: python

   # Penalty method: min x^2 + 2y^2 s.t. x + y = 1
   # Compare with exact solution (2/3, 1/3) and augmented Lagrangian
   import numpy as np

   np.random.seed(42)
   exact_x, exact_y = 2.0/3, 1.0/3
   exact_f = exact_x**2 + 2*exact_y**2

   print(f"{'rho':>10s}  {'x':>8s}  {'y':>8s}  {'f(x,y)':>8s}  "
         f"{'|h(x,y)|':>10s}  {'||error||':>10s}  {'cond(H)':>10s}")
   print("-" * 78)

   for rho in [1, 10, 100, 1000, 10000, 100000]:
       # Penalized objective: f(x,y) + (rho/2)*(x+y-1)^2
       # Gradient: [2x + rho*(x+y-1), 4y + rho*(x+y-1)]
       # Hessian:  [[2+rho, rho], [rho, 4+rho]]
       # Solve: H * [x,y] = [rho, rho]  (from setting gradient = 0)
       H = np.array([[2 + rho, rho], [rho, 4 + rho]])
       b = np.array([rho, rho])
       sol = np.linalg.solve(H, b)
       x, y = sol
       fval = x**2 + 2*y**2
       hval = x + y - 1
       err = np.sqrt((x - exact_x)**2 + (y - exact_y)**2)
       cond = np.linalg.cond(H)
       print(f"{rho:10d}  {x:8.6f}  {y:8.6f}  {fval:8.6f}  "
             f"{abs(hval):10.2e}  {err:10.2e}  {cond:10.1f}")

   print(f"\nExact: x={exact_x:.6f}, y={exact_y:.6f}, f={exact_f:.6f}")
   print("Note: as rho increases, h(x,y)->0 but cond(H) grows -> ill-conditioning!")

Convergence
-----------

**Theorem.** Let :math:`\boldsymbol{\theta}(\rho)` be the minimizer of the
quadratic penalty problem for a given :math:`\rho`. Under mild conditions, as
:math:`\rho \to \infty`, any limit point of :math:`\boldsymbol{\theta}(\rho)`
is a solution of the constrained problem.

*Proof sketch.* For any feasible point :math:`\bar{\boldsymbol{\theta}}`:

.. math::

   f(\boldsymbol{\theta}(\rho))
   + \frac{\rho}{2}\|\mathbf{h}(\boldsymbol{\theta}(\rho))\|^2
   \leq f(\bar{\boldsymbol{\theta}})
   + \frac{\rho}{2}\|\mathbf{h}(\bar{\boldsymbol{\theta}})\|^2
   = f(\bar{\boldsymbol{\theta}}),

since :math:`\mathbf{h}(\bar{\boldsymbol{\theta}}) = \mathbf{0}`. Therefore
:math:`\|\mathbf{h}(\boldsymbol{\theta}(\rho))\|^2 \leq
2[f(\bar{\boldsymbol{\theta}}) - f(\boldsymbol{\theta}(\rho))]/\rho`. As
:math:`\rho \to \infty`, the constraint violation shrinks to zero (assuming
:math:`f` is bounded below on the feasible set).

**Drawback:** For large :math:`\rho`, the penalized objective becomes
ill-conditioned (its Hessian has eigenvalues of order :math:`\rho`), making
the subproblem hard to solve. This motivates the augmented Lagrangian method
(Section 16.4), which achieves feasibility without driving :math:`\rho` to
infinity.

.. admonition:: Common Pitfall

   The quadratic penalty method is tempting because of its simplicity, but
   beware of the ill-conditioning trap.  As :math:`\rho` grows, the Hessian
   becomes dominated by the penalty term, making Newton's method and even
   gradient descent struggle with numerical precision.  If you find yourself
   needing :math:`\rho > 10^6`, switch to the augmented Lagrangian method or
   reparameterize the problem.


16.7 Reparameterization as an Alternative to Constraints
==========================================================

In practice, the most common way to handle parameter constraints in
statistical models is to transform the parameters so that the transformed
parameters are unconstrained. This is especially convenient with gradient-based
optimization.

.. admonition:: Why Reparameterize?

   Reparameterization is often the simplest and most robust approach.  Instead
   of fighting the constraints with multipliers and penalties, you change
   variables so the constraints simply *vanish*.  The transformed problem is
   unconstrained, so you can use any method from Chapters 12--14 directly.
   The catch is that the transformation must be smooth and invertible, and the
   transformed parameter space may have different curvature properties.

Positivity: The Log Transform
-------------------------------

If :math:`\sigma^2 > 0`, define :math:`\phi = \log \sigma^2`. Then
:math:`\phi \in \mathbb{R}` is unconstrained, and :math:`\sigma^2 = e^{\phi}`
is automatically positive.

The gradient transforms as

.. math::

   \frac{\partial \ell}{\partial \phi}
   = \frac{\partial \ell}{\partial \sigma^2}\,
     \frac{\partial \sigma^2}{\partial \phi}
   = \frac{\partial \ell}{\partial \sigma^2}\,e^{\phi}
   = \sigma^2\,\frac{\partial \ell}{\partial \sigma^2}.

.. code-block:: python

   # Log reparameterization: MLE of normal variance
   import numpy as np

   np.random.seed(42)
   n = 100
   sigma2_true = 4.0
   x = np.random.normal(0, np.sqrt(sigma2_true), n)

   # Direct: gradient ascent on sigma^2 (must stay positive!)
   sigma2 = 1.0
   print("--- Without reparameterization (constrained) ---")
   print(f"{'Iter':>4s}  {'sigma^2':>10s}  {'grad':>10s}")
   for k in range(8):
       grad = -n / (2 * sigma2) + np.sum(x**2) / (2 * sigma2**2)
       sigma2 = sigma2 + 0.01 * grad
       sigma2 = max(sigma2, 1e-6)  # MANUAL clipping to enforce positivity
       print(f"{k:4d}  {sigma2:10.4f}  {grad:10.4f}")

   # Reparameterized: gradient ascent on phi = log(sigma^2) (unconstrained!)
   phi = 0.0  # log(1.0) = 0
   print("\n--- With log reparameterization (unconstrained) ---")
   print(f"{'Iter':>4s}  {'phi':>10s}  {'sigma^2':>10s}  {'grad_phi':>10s}")
   for k in range(8):
       sigma2_reparam = np.exp(phi)
       # Chain rule: d(ell)/d(phi) = sigma^2 * d(ell)/d(sigma^2)
       grad_sigma2 = -n / (2 * sigma2_reparam) + np.sum(x**2) / (2 * sigma2_reparam**2)
       grad_phi = sigma2_reparam * grad_sigma2  # no clipping needed!
       phi = phi + 0.01 * grad_phi
       print(f"{k:4d}  {phi:10.4f}  {np.exp(phi):10.4f}  {grad_phi:10.4f}")

   print(f"\nTrue sigma^2 = {sigma2_true:.4f}")
   print(f"Sample variance = {np.var(x):.4f}")

Bounded Parameters: The Logit Transform
------------------------------------------

If :math:`p \in (0, 1)`, define :math:`\eta = \log(p / (1-p))`. Then
:math:`\eta \in \mathbb{R}`, and :math:`p = 1/(1 + e^{-\eta})` is
automatically in :math:`(0, 1)`.

More generally, for :math:`\theta \in (a, b)`:

.. math::

   \eta = \log\frac{\theta - a}{b - \theta},
   \qquad
   \theta = a + (b-a)\,\frac{1}{1 + e^{-\eta}}.

.. code-block:: python

   # Logit reparameterization for a binomial proportion
   import numpy as np
   from scipy.special import expit, logit

   np.random.seed(42)
   p_true = 0.73
   n = 50
   x = np.random.binomial(n, p_true)  # x successes out of n trials
   print(f"Observed: {x}/{n} = {x/n:.2f}, True p = {p_true}")

   # Gradient ascent on eta = logit(p), unconstrained
   eta = 0.0  # start at p = 0.5
   print(f"\n{'Iter':>4s}  {'eta':>8s}  {'p':>8s}  {'grad_eta':>10s}  {'log-lik':>10s}")
   for k in range(15):
       p = expit(eta)
       # Log-likelihood: x*log(p) + (n-x)*log(1-p)
       ll = x * np.log(p) + (n - x) * np.log(1 - p)
       # d(ell)/d(eta) = x - n*p  (a beautiful simplification!)
       grad_eta = x - n * p
       eta = eta + 0.05 * grad_eta
       print(f"{k:4d}  {eta:8.4f}  {p:8.4f}  {grad_eta:10.4f}  {ll:10.4f}")

   print(f"\nMLE: p = {x/n:.4f} (exact), p = {expit(eta):.4f} (reparameterized GD)")

Simplex Constraint: The Softmax Transform
--------------------------------------------

If :math:`\boldsymbol{\pi} = (\pi_1, \dots, \pi_K)` with
:math:`\pi_k \geq 0` and :math:`\sum_k \pi_k = 1`, define unconstrained
:math:`\boldsymbol{\eta} = (\eta_1, \dots, \eta_{K-1}) \in \mathbb{R}^{K-1}` and set

.. math::
   :label: softmax

   \pi_k = \frac{e^{\eta_k}}{\sum_{j=1}^{K} e^{\eta_j}},
   \qquad k = 1, \dots, K,

where :math:`\eta_K = 0` for identifiability (one degree of freedom is removed
by the sum-to-one constraint).

The softmax transform is smooth and invertible (on the interior of the
simplex), so gradient methods can be applied to :math:`\boldsymbol{\eta}`
without any constraint-handling machinery.

.. code-block:: python

   # Softmax reparameterization: multinomial MLE without constraints
   import numpy as np
   from scipy.special import softmax

   np.random.seed(42)
   K = 5
   counts = np.array([45, 30, 15, 5, 5])
   n = counts.sum()
   pi_true = counts / n

   # Unconstrained parameters (K-1 free, last fixed at 0)
   eta = np.zeros(K - 1)

   print(f"{'Iter':>4s}  ", end="")
   for k in range(K):
       print(f"{'pi_'+str(k+1):>8s}  ", end="")
   print(f"{'||pi - true||':>14s}")
   print("-" * 65)

   for step in range(200):
       # Forward: softmax to get probabilities
       eta_full = np.append(eta, 0.0)
       pi = softmax(eta_full)

       # Gradient: d(ell)/d(eta_k) = counts_k - n*pi_k for k=1,...,K-1
       grad = counts[:K-1] - n * pi[:K-1]
       eta = eta + 0.01 * grad

       if step % 40 == 0 or step == 199:
           err = np.linalg.norm(pi - pi_true)
           print(f"{step:4d}  ", end="")
           for k in range(K):
               print(f"{pi[k]:8.4f}  ", end="")
           print(f"{err:14.6f}")

   print(f"\nTrue:  {pi_true}")
   print(f"Sum:   {pi.sum():.8f} (always exactly 1 by construction)")

The key insight: the softmax guarantees :math:`\pi_k > 0` and
:math:`\sum_k \pi_k = 1` by construction, so we never need to worry about
constraint violations.

Positive-Definite Matrices: The Cholesky Factor
--------------------------------------------------

For a covariance matrix :math:`\boldsymbol{\Sigma}` that must be positive
definite, write :math:`\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^{\!\top}`
where :math:`\mathbf{L}` is lower triangular with positive diagonal entries.
Optimize over the entries of :math:`\mathbf{L}`, parameterizing the diagonal as
:math:`L_{jj} = e^{\phi_j}` to ensure positivity.

This gives :math:`p(p+1)/2` unconstrained parameters, which is the correct
number of degrees of freedom for a :math:`p \times p` symmetric positive
definite matrix.

.. code-block:: python

   # Cholesky reparameterization for covariance MLE
   import numpy as np

   np.random.seed(42)
   d = 3
   # True covariance
   Sigma_true = np.array([[4.0, 1.0, 0.5],
                           [1.0, 2.0, 0.3],
                           [0.5, 0.3, 1.0]])
   L_true = np.linalg.cholesky(Sigma_true)

   # Generate data
   n = 500
   X = np.random.randn(n, d) @ L_true.T

   # Parameterize: L has d*(d+1)/2 free parameters
   # Diagonal: L_jj = exp(phi_j) (always positive)
   # Off-diagonal: unconstrained
   n_params = d * (d + 1) // 2
   params = np.zeros(n_params)  # initial: L = I

   def params_to_L(params, d):
       """Convert unconstrained params to lower triangular L."""
       L = np.zeros((d, d))
       idx = 0
       for i in range(d):
           for j in range(i + 1):
               if i == j:
                   L[i, j] = np.exp(params[idx])  # positive diagonal
               else:
                   L[i, j] = params[idx]           # free off-diagonal
               idx += 1
       return L

   def neg_log_lik(params, X, d):
       """Negative log-likelihood for multivariate normal (mu=0)."""
       L = params_to_L(params, d)
       Sigma = L @ L.T
       n = len(X)
       sign, logdet = np.linalg.slogdet(Sigma)
       Sigma_inv = np.linalg.inv(Sigma)
       return 0.5 * n * logdet + 0.5 * np.sum(X @ Sigma_inv * X)

   # Optimize using numerical gradient descent
   print(f"{'Iter':>4s}  {'neg-LL':>12s}  {'||Sigma - Sigma_hat||':>22s}")
   print("-" * 44)
   lr = 0.0001
   for step in range(500):
       # Numerical gradient
       grad = np.zeros(n_params)
       eps = 1e-5
       f0 = neg_log_lik(params, X, d)
       for j in range(n_params):
           params_p = params.copy()
           params_p[j] += eps
           grad[j] = (neg_log_lik(params_p, X, d) - f0) / eps
       params = params - lr * grad

       if step % 100 == 0 or step == 499:
           L = params_to_L(params, d)
           Sigma_hat = L @ L.T
           err = np.linalg.norm(Sigma_hat - Sigma_true, 'fro')
           print(f"{step:4d}  {f0:12.2f}  {err:22.4f}")

   L_final = params_to_L(params, d)
   Sigma_hat = L_final @ L_final.T
   print(f"\nTrue Sigma:\n{Sigma_true}")
   print(f"\nEstimated Sigma:\n{np.round(Sigma_hat, 3)}")
   print(f"Positive definite: {np.all(np.linalg.eigvalsh(Sigma_hat) > 0)}")

Correlation Matrices: Fisher's z-Transform
---------------------------------------------

For a correlation parameter :math:`r \in (-1, 1)`, use

.. math::

   z = \operatorname{arctanh}(r) = \frac{1}{2}\log\frac{1+r}{1-r},
   \qquad
   r = \tanh(z).

Then :math:`z \in \mathbb{R}` is unconstrained.

.. code-block:: python

   # Fisher's z-transform for correlation estimation
   import numpy as np

   np.random.seed(42)
   r_true = 0.85
   n = 100

   # Generate correlated data
   Sigma = np.array([[1.0, r_true], [r_true, 1.0]])
   L = np.linalg.cholesky(Sigma)
   data = np.random.randn(n, 2) @ L.T

   # Gradient ascent on z = arctanh(r), unconstrained
   z = 0.0  # start at r = 0
   print(f"{'Iter':>4s}  {'z':>8s}  {'r=tanh(z)':>10s}  {'log-lik':>10s}")
   for k in range(20):
       r = np.tanh(z)
       # Bivariate normal log-likelihood (mu=0, sigma=1, correlation r)
       det = 1 - r**2
       ll = -0.5 * n * np.log(det) - 0.5 / det * np.sum(
           data[:, 0]**2 - 2*r*data[:, 0]*data[:, 1] + data[:, 1]**2)
       # d(ell)/d(r)
       dll_dr = n * r / det + (1 / det**2) * np.sum(
           -r * (data[:, 0]**2 + data[:, 1]**2) + (1 + r**2) * data[:, 0] * data[:, 1])
       # Chain rule: d(ell)/d(z) = d(ell)/d(r) * d(r)/d(z) = dll_dr * (1 - r^2)
       grad_z = dll_dr * (1 - r**2)
       z = z + 0.001 * grad_z
       if k % 4 == 0:
           print(f"{k:4d}  {z:8.4f}  {np.tanh(z):10.4f}  {ll:10.2f}")

   print(f"\nTrue r = {r_true}, Sample r = {np.corrcoef(data.T)[0,1]:.4f}, "
         f"Estimated r = {np.tanh(z):.4f}")

All Four Reparameterizations at a Glance
-------------------------------------------

.. code-block:: python

   # Summary: all reparameterization strategies
   import numpy as np

   print(f"{'Constraint':>25s}  {'Transform':>25s}  {'Inverse':>25s}")
   print("-" * 80)
   examples = [
       ("sigma^2 > 0",       "phi = log(sigma^2)",    "sigma^2 = exp(phi)"),
       ("p in (0, 1)",        "eta = logit(p)",         "p = sigmoid(eta)"),
       ("r in (-1, 1)",       "z = arctanh(r)",         "r = tanh(z)"),
       ("pi on simplex",      "eta_k unconstrained",    "pi = softmax(eta)"),
       ("Sigma pos. def.",    "L = chol(Sigma)",        "Sigma = L @ L.T"),
   ]
   for constraint, transform, inverse in examples:
       print(f"{constraint:>25s}  {transform:>25s}  {inverse:>25s}")

   # Numerical demo: all transforms preserve constraints
   print("\nNumerical verification:")
   phi = -2.5
   print(f"  exp({phi}) = {np.exp(phi):.4f} > 0: True")
   eta = 3.7
   p = 1 / (1 + np.exp(-eta))
   print(f"  sigmoid({eta}) = {p:.4f} in (0,1): {0 < p < 1}")
   z = 1.2
   r = np.tanh(z)
   print(f"  tanh({z}) = {r:.4f} in (-1,1): {-1 < r < 1}")
   eta_vec = np.array([1.0, -0.5, 0.3])
   from scipy.special import softmax
   pi = softmax(np.append(eta_vec, 0.0))
   print(f"  softmax({list(np.round(eta_vec, 1))} + [0]) = {np.round(pi, 3)}, "
         f"sum={pi.sum():.6f}")

When to Reparameterize vs. Use Constrained Methods
-----------------------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 40 30 30

   * - Scenario
     - Reparameterize
     - Constrained method
   * - Simple positivity/boundedness
     - Preferred
     - Overkill
   * - Simplex (few categories)
     - Preferred (softmax)
     - Fine too
   * - Complex coupled constraints
     - May be difficult
     - Preferred
   * - Linear constraints (:math:`\mathbf{A}\boldsymbol{\theta} = \mathbf{b}`)
     - Null-space projection
     - Lagrangian
   * - Interpretability of Hessian
     - Harder (need Jacobian)
     - Original parameterization


16.8 Putting It All Together: Constrained MLE
================================================

A typical workflow for constrained maximum likelihood is:

1. **Identify the constraints.** Are they simple (positivity, simplex) or
   complex (coupled inequalities, linear equalities)?

2. **Try reparameterization first.** For simple constraints, the log, logit,
   softmax, or Cholesky transforms eliminate the constraints. Apply any
   unconstrained method from :ref:`ch12_gradient`, :ref:`ch13_newton`, or
   :ref:`ch14_quasi_newton`.

3. **If reparameterization is infeasible,** use:

   - **Lagrange multipliers** (for small problems with equality constraints).
   - **Interior-point methods** (for convex problems with many inequality
     constraints).
   - **Augmented Lagrangian / ADMM** (for large-scale or separable problems).
   - **Penalty methods** (for quick-and-dirty solutions).

4. **Verify the KKT conditions** at the solution. Check that all constraints
   are satisfied, multipliers have the correct sign, and complementary
   slackness holds.

5. **Compute standard errors** using the constrained observed information
   (which may involve the bordered Hessian or the Hessian of the Lagrangian
   restricted to the constraint manifold).

Let us run a final comprehensive example comparing all three approaches---
penalty, augmented Lagrangian, and reparameterization---on the same problem.

.. code-block:: python

   # Head-to-head: three approaches to the same constrained problem
   # Problem: MLE of Bernoulli p with constraint p in (0.2, 0.8)
   import numpy as np
   from scipy.special import expit

   np.random.seed(42)
   n = 30
   p_true = 0.65
   x = np.random.binomial(1, p_true, n)
   p_mle = x.mean()
   print(f"Observed: {x.sum()}/{n} successes, unconstrained MLE = {p_mle:.4f}")

   # Method 1: Quadratic penalty for constraint p >= 0.2 and p <= 0.8
   p = 0.5
   for rho in [10, 100, 1000]:
       for _ in range(200):
           grad = x.sum() / p - (n - x.sum()) / (1 - p)
           # Penalty gradient for g1: 0.2 - p <= 0 and g2: p - 0.8 <= 0
           if p < 0.2:
               grad -= rho * (0.2 - p)
           if p > 0.8:
               grad -= rho * (p - 0.8)
           p = p + 0.001 * grad
           p = np.clip(p, 0.01, 0.99)

   p_penalty = p
   print(f"\nPenalty method:          p = {p_penalty:.6f}")

   # Method 2: Reparameterization via logit with bounds
   # Map p in (0.2, 0.8) to eta in R via eta = log((p-0.2)/(0.8-p))
   a, b = 0.2, 0.8
   eta = 0.0  # start at midpoint p = 0.5
   for _ in range(500):
       p = a + (b - a) * expit(eta)
       grad_p = x.sum() / p - (n - x.sum()) / (1 - p)
       dp_deta = (b - a) * expit(eta) * (1 - expit(eta))
       grad_eta = grad_p * dp_deta
       eta = eta + 0.01 * grad_eta

   p_reparam = a + (b - a) * expit(eta)
   print(f"Reparameterization:     p = {p_reparam:.6f}")

   # Method 3: Projected gradient (project onto [0.2, 0.8] after each step)
   p = 0.5
   for _ in range(500):
       grad = x.sum() / p - (n - x.sum()) / (1 - p)
       p = p + 0.001 * grad
       p = np.clip(p, 0.2, 0.8)  # project onto feasible set

   p_projected = p
   print(f"Projected gradient:     p = {p_projected:.6f}")

   # Analytical: MLE is x.mean(), clipped to [0.2, 0.8]
   p_exact = np.clip(p_mle, 0.2, 0.8)
   print(f"Exact (clipped MLE):    p = {p_exact:.6f}")


16.9 Summary
==============

Constrained optimization is an essential part of likelihood-based inference.
Lagrange multipliers handle equality constraints by introducing dual variables
whose values measure the sensitivity of the optimum to the constraints. The
KKT conditions extend this framework to inequality constraints, with
complementary slackness determining which constraints are active. Algorithmic
approaches --- augmented Lagrangian, barrier methods, and penalty methods ---
provide practical ways to solve constrained problems numerically.

In statistical practice, the most common approach is **reparameterization**:
the log transform for positivity, the logit for bounded parameters, the
softmax for simplex constraints, and the Cholesky decomposition for positive
definiteness. These transforms convert the constrained problem into an
unconstrained one, allowing direct application of the gradient methods
(:ref:`ch12_gradient`), Newton methods (:ref:`ch13_newton`), and quasi-Newton
methods (:ref:`ch14_quasi_newton`) developed in earlier chapters. When
reparameterization is not feasible or when the constraints have complex
structure, the formal constrained-optimization methods of this chapter provide
the necessary tools.