Chapter 16: Constrained Optimization

You manage a portfolio of 5 assets. You want to maximize expected return, but you have a risk budget: the portfolio variance must not exceed 2%. The weights must be non-negative (no short selling) and sum to one (fully invested). This is a constrained optimization problem—and if you ignore the constraints, your optimizer will happily tell you to put 500% of your money in the highest-return asset and short-sell everything else.

# What happens when you ignore constraints
import numpy as np

np.random.seed(42)
p = 5
mu = np.array([0.12, 0.10, 0.07, 0.03, 0.14])   # expected returns
A = np.random.randn(p, p) * 0.1
Sigma = A.T @ A + 0.05 * np.eye(p)                # covariance of returns

# Unconstrained "optimal" portfolio: just maximize mu^T w
# Without constraints, w -> infinity in the direction of max return
w_unconstrained = mu / np.linalg.norm(mu)  # just point at highest return
w_unconstrained *= 10  # scale up for "more return"
print("Unconstrained 'solution':")
print(f"  Weights: {np.round(w_unconstrained, 2)}")
print(f"  Sum of weights: {w_unconstrained.sum():.2f} (should be 1.00)")
print(f"  Min weight: {w_unconstrained.min():.2f} (should be >= 0)")
print(f"  Portfolio variance: {w_unconstrained @ Sigma @ w_unconstrained:.4f} (budget: 0.02)")
print("\nThis is nonsense. We NEED constraints.")

In many likelihood problems the parameters cannot take arbitrary values. Probabilities must be non-negative and sum to one. Variances must be positive. Correlation matrices must be positive semi-definite. This chapter develops the theory and algorithms for optimization under constraints, starting from Lagrange multipliers for equality constraints, extending to the Karush–Kuhn–Tucker (KKT) conditions for inequality constraints, and then covering the main algorithmic approaches: augmented Lagrangian methods, barrier (interior-point) methods, and penalty methods. We close with the practical alternative of reparameterization, which transforms a constrained problem into an unconstrained one.

Why Constrained Optimization Matters

Every time you fit a model with probabilities, variances, or proportions, you are doing constrained optimization — whether you realize it or not. The unconstrained methods from earlier chapters (Chapter 12: Gradient Methods, Chapter 13: Newton and Scoring Methods, Chapter 14: Quasi-Newton Methods) are powerful, but they can happily produce a negative variance or probabilities that sum to 1.3. Constrained optimization gives you the mathematical framework to keep parameters in their proper domains while still finding the optimum.

16.1 Equality Constraints and Lagrange Multipliers

Problem Formulation

Consider the problem

(30)\[\min_{\boldsymbol{\theta} \in \mathbb{R}^p}\; f(\boldsymbol{\theta}) \qquad\text{subject to}\qquad h_j(\boldsymbol{\theta}) = 0, \quad j = 1, \dots, m,\]

where \(f\) and \(h_1, \dots, h_m\) are continuously differentiable.

Geometric Intuition

At a constrained minimum \(\boldsymbol{\theta}^*\), the gradient of \(f\) cannot have a component in the feasible set — otherwise we could decrease \(f\) by moving along the constraint surface. More precisely, the gradient \(\nabla f(\boldsymbol{\theta}^*)\) must be normal to the constraint surface.

Think of it this way: you are hiking on a mountain and are confined to a trail (the constraint surface). At the lowest point of the trail, the slope of the mountain must be perpendicular to the trail — if there were any downhill component along the trail, you could walk further and go lower.

The constraint surface is (locally) a manifold defined by \(\mathbf{h}(\boldsymbol{\theta}) = \mathbf{0}\). Its tangent space at \(\boldsymbol{\theta}^*\) is the null space of the Jacobian matrix

\[\begin{split}\mathbf{J}_h(\boldsymbol{\theta}^*) = \begin{pmatrix} \nabla h_1(\boldsymbol{\theta}^*)^{\!\top} \\ \vdots \\ \nabla h_m(\boldsymbol{\theta}^*)^{\!\top} \end{pmatrix} \in \mathbb{R}^{m \times p}.\end{split}\]

The normal space to the constraint surface is spanned by the rows of \(\mathbf{J}_h\), i.e., by the gradients \(\nabla h_j\). The condition that \(\nabla f\) lies in this normal space is

\[\nabla f(\boldsymbol{\theta}^*) = \sum_{j=1}^m \lambda_j^*\,\nabla h_j(\boldsymbol{\theta}^*),\]

or equivalently

(31)\[\nabla f(\boldsymbol{\theta}^*) - \sum_{j=1}^m \lambda_j^*\,\nabla h_j(\boldsymbol{\theta}^*) = \mathbf{0}.\]

The scalars \(\lambda_1^*, \dots, \lambda_m^*\) are the Lagrange multipliers.

The Lagrangian

Define the Lagrangian function

(32)\[\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}) = f(\boldsymbol{\theta}) - \sum_{j=1}^m \lambda_j\,h_j(\boldsymbol{\theta}).\]

Reading this formula: the Lagrangian takes the objective \(f\) and subtracts a penalty for each constraint, weighted by the multiplier \(\lambda_j\). If \(\lambda_j\) is large, the corresponding constraint exerts a strong pull on the solution. The Lagrangian converts the constrained problem into an unconstrained one: finding a stationary point of \(\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda})\) with respect to both \(\boldsymbol{\theta}\) and \(\boldsymbol{\lambda}\) gives you the constrained optimum and the multipliers simultaneously.

The necessary conditions for a constrained minimum are found by setting the partial derivatives to zero:

(33)\[\begin{split}\nabla_{\boldsymbol{\theta}} \mathcal{L} &= \nabla f(\boldsymbol{\theta}) - \sum_{j=1}^m \lambda_j\,\nabla h_j(\boldsymbol{\theta}) = \mathbf{0}, \\ \nabla_{\boldsymbol{\lambda}} \mathcal{L} &= -\mathbf{h}(\boldsymbol{\theta}) = \mathbf{0}.\end{split}\]

The first equation is (31) — the gradient of \(f\) must lie in the span of the constraint gradients. The second equation is simply the requirement that all constraints are satisfied (\(h_j = 0\)). Together they form a system of \(p + m\) equations in \(p + m\) unknowns \((\boldsymbol{\theta}, \boldsymbol{\lambda})\), which can be solved by Newton’s method or other root-finding techniques.

Let us solve this both analytically and numerically and compare.

# Lagrange multipliers: analytical vs. numerical
# Problem: minimize x^2 + 2y^2 subject to x + y = 1
import numpy as np
from scipy.optimize import fsolve

# --- Analytical solution ---
# Lagrangian: L = x^2 + 2y^2 - lambda*(x + y - 1)
# dL/dx = 2x - lambda = 0  =>  x = lambda/2
# dL/dy = 4y - lambda = 0  =>  y = lambda/4
# Constraint: x + y = 1  =>  lambda/2 + lambda/4 = 1  =>  lambda = 4/3
lam_analytic = 4.0 / 3.0
x_analytic = lam_analytic / 2.0
y_analytic = lam_analytic / 4.0

# --- Numerical solution (fsolve) ---
def lagrange_system(vars):
    x, y, lam = vars
    return [2*x - lam,        # dL/dx = 0
            4*y - lam,        # dL/dy = 0
            x + y - 1]        # constraint

sol = fsolve(lagrange_system, [0.5, 0.5, 1.0])
x_num, y_num, lam_num = sol

print("Minimize x^2 + 2y^2 subject to x + y = 1")
print(f"\n{'':>12s}  {'x':>8s}  {'y':>8s}  {'lambda':>8s}  {'f(x,y)':>8s}")
print("-" * 50)
print(f"{'Analytical':>12s}  {x_analytic:8.4f}  {y_analytic:8.4f}  "
      f"{lam_analytic:8.4f}  {x_analytic**2 + 2*y_analytic**2:8.4f}")
print(f"{'Numerical':>12s}  {x_num:8.4f}  {y_num:8.4f}  "
      f"{lam_num:8.4f}  {x_num**2 + 2*y_num**2:8.4f}")
print(f"\nConstraint satisfied: x + y = {x_num + y_num:.8f}")

Intuition

The Lagrange multiplier \(\lambda^*\) has a beautiful economic interpretation: it tells you the rate at which the optimal value of \(f\) changes if you relax the constraint by an infinitesimal amount. In operations research, this is called the shadow price of the constraint. If \(\lambda^*\) is large, the constraint is “expensive” — the objective would improve significantly if the constraint were loosened.

# Shadow price interpretation: relax the constraint and observe
import numpy as np
from scipy.optimize import fsolve

def solve_relaxed(c):
    """Solve min x^2 + 2y^2 s.t. x + y = c, return (x*, y*, f*)."""
    # Analytical: x = 2c/3, y = c/3, f = 2c^2/3
    return 2*c/3, c/3, 2*c**2/3

print("Shadow price of the constraint x + y = c:")
print(f"{'c':>6s}  {'x*':>8s}  {'y*':>8s}  {'f*':>8s}  {'df*/dc':>8s}")
print("-" * 44)
prev_f = None
for c in [0.9, 0.95, 1.0, 1.05, 1.1]:
    x, y, f = solve_relaxed(c)
    df_dc = f - prev_f if prev_f is not None else 0.0
    # The analytical derivative df*/dc = 4c/3, at c=1 this is 4/3 = lambda*
    print(f"{c:6.2f}  {x:8.4f}  {y:8.4f}  {f:8.4f}  {df_dc / 0.05:8.4f}")
    prev_f = f

print(f"\ndf*/dc at c=1 = {4.0/3.0:.4f} = lambda* (the Lagrange multiplier)")

Second-Order Sufficient Conditions

The first-order conditions (33) are necessary but not sufficient. A second-order sufficient condition for a local minimum is that the bordered Hessian (or equivalently the Hessian of the Lagrangian restricted to the tangent space of the constraints) is positive definite:

\[\mathbf{v}^{\!\top} \nabla^2_{\boldsymbol{\theta}\boldsymbol{\theta}} \mathcal{L} (\boldsymbol{\theta}^*, \boldsymbol{\lambda}^*)\,\mathbf{v} > 0 \quad\text{for all } \mathbf{v} \neq \mathbf{0} \text{ with } \mathbf{J}_h(\boldsymbol{\theta}^*)\mathbf{v} = \mathbf{0},\]

where

\[\nabla^2_{\boldsymbol{\theta}\boldsymbol{\theta}} \mathcal{L} = \nabla^2 f - \sum_{j=1}^m \lambda_j\,\nabla^2 h_j.\]

16.2 Inequality Constraints and KKT Conditions

Problem Formulation

The general constrained optimization problem is

(34)\[\begin{split}\min_{\boldsymbol{\theta}}\; f(\boldsymbol{\theta}) \qquad\text{subject to}\qquad &h_j(\boldsymbol{\theta}) = 0, \quad j = 1, \dots, m, \\ &g_i(\boldsymbol{\theta}) \leq 0, \quad i = 1, \dots, q.\end{split}\]

Inequality constraints bring a new subtlety: at the solution, some constraints will be active (holding with equality) and others will be inactive (strict inequality). The KKT conditions elegantly handle both cases.

Deriving the KKT Conditions

Define the Lagrangian with both equality and inequality constraints:

(35)\[\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\boldsymbol{\theta}) - \sum_{j=1}^m \lambda_j\,h_j(\boldsymbol{\theta}) + \sum_{i=1}^q \mu_i\,g_i(\boldsymbol{\theta}).\]

At a constrained minimum \((\boldsymbol{\theta}^*, \boldsymbol{\lambda}^*, \boldsymbol{\mu}^*)\), the following Karush–Kuhn–Tucker (KKT) conditions must hold (under a constraint qualification such as LICQ):

  1. Stationarity:

    \[\nabla_{\boldsymbol{\theta}} \mathcal{L} = \nabla f(\boldsymbol{\theta}^*) - \sum_{j=1}^m \lambda_j^*\,\nabla h_j(\boldsymbol{\theta}^*) + \sum_{i=1}^q \mu_i^*\,\nabla g_i(\boldsymbol{\theta}^*) = \mathbf{0}.\]
  2. Primal feasibility:

    \[h_j(\boldsymbol{\theta}^*) = 0 \quad\forall\, j, \qquad g_i(\boldsymbol{\theta}^*) \leq 0 \quad\forall\, i.\]
  3. Dual feasibility:

    \[\mu_i^* \geq 0 \quad\forall\, i.\]
  4. Complementary slackness:

    (36)\[\mu_i^*\,g_i(\boldsymbol{\theta}^*) = 0 \quad\forall\, i.\]

Complementary Slackness Explained

Condition (36) says that for each inequality constraint, either the constraint is active (\(g_i(\boldsymbol{\theta}^*) = 0\)) and the multiplier can be positive, or the constraint is inactive (\(g_i(\boldsymbol{\theta}^*) < 0\)) and the multiplier must be zero.

Intuition: An inactive constraint does not affect the solution — it is as if the constraint were not there. Only active constraints “push” on the solution, and the multiplier \(\mu_i^*\) measures the strength of that push.

The KKT conditions are sufficient for a global minimum when \(f\) is convex and the constraints define a convex feasible set (i.e., \(g_i\) convex, \(h_j\) affine).

Let us verify the KKT conditions in code on the portfolio problem.

# Checking KKT conditions on the portfolio problem
import numpy as np
from scipy.optimize import minimize

np.random.seed(42)
p = 5
mu = np.array([0.12, 0.10, 0.07, 0.03, 0.14])
A = np.random.randn(p, p) * 0.1
Sigma = A.T @ A + 0.05 * np.eye(p)
sigma2_max = 0.02

# Solve with scipy
constraints = [
    {'type': 'eq',   'fun': lambda w: np.sum(w) - 1},
    {'type': 'ineq', 'fun': lambda w: sigma2_max - w @ Sigma @ w},
]
bounds = [(0, 1)] * p
w0 = np.ones(p) / p
res = minimize(lambda w: -mu @ w, w0, method='SLSQP',
               bounds=bounds, constraints=constraints)
w_star = res.x

# Check all four KKT conditions
print("=== KKT Condition Check ===\n")

# 1. Primal feasibility
eq_violation = abs(np.sum(w_star) - 1)
risk = w_star @ Sigma @ w_star
print("1. PRIMAL FEASIBILITY")
print(f"   Sum of weights = {np.sum(w_star):.8f} (violation: {eq_violation:.2e})")
print(f"   Portfolio variance = {risk:.6f} <= {sigma2_max} : "
      f"{'OK' if risk <= sigma2_max + 1e-8 else 'VIOLATED'}")
for i in range(p):
    status = "ACTIVE" if w_star[i] < 1e-6 else "inactive"
    print(f"   w[{i}] = {w_star[i]:.6f} >= 0 : {status}")

# 2. Complementary slackness (for bound constraints)
print("\n2. COMPLEMENTARY SLACKNESS")
print("   For each w_i >= 0 constraint:")
for i in range(p):
    # mu_i * w_i should = 0 (either mu_i = 0 or w_i = 0)
    active = w_star[i] < 1e-6
    print(f"   Asset {i}: w={w_star[i]:.6f}, "
          f"{'constraint active (multiplier can be > 0)' if active else 'constraint inactive (multiplier = 0)'}")

# 3. Risk constraint
risk_active = abs(risk - sigma2_max) < 1e-4
print(f"\n   Risk constraint: {'ACTIVE' if risk_active else 'inactive'} "
      f"(variance = {risk:.6f}, budget = {sigma2_max})")

# Solution quality
print(f"\n=== Optimal Portfolio ===")
print(f"{'Asset':>5s}  {'Weight':>8s}  {'Return':>8s}  {'Contribution':>12s}")
print("-" * 38)
for i in range(p):
    print(f"{i:5d}  {w_star[i]:8.4f}  {mu[i]:8.2%}  {w_star[i]*mu[i]:12.4%}")
print(f"{'Total':>5s}  {np.sum(w_star):8.4f}  {mu @ w_star:8.2%}")

16.3 Examples in Maximum Likelihood

Multinomial MLE with Simplex Constraint

Suppose \(X = (X_1, \dots, X_K) \sim \text{Multinomial}(n, \boldsymbol{\pi})\) with \(\boldsymbol{\pi} = (\pi_1, \dots, \pi_K)\). The log-likelihood is

\[\ell(\boldsymbol{\pi}) = \sum_{k=1}^K X_k \log \pi_k + \text{const}.\]

We maximize subject to \(\sum_k \pi_k = 1\) and \(\pi_k \geq 0\).

Using Lagrange multipliers for the equality constraint (the positivity constraints turn out to be inactive at the interior solution):

\[\mathcal{L} = \sum_{k=1}^K X_k \log \pi_k + \lambda\Bigl(1 - \sum_{k=1}^K \pi_k\Bigr).\]

Setting \(\partial\mathcal{L}/\partial\pi_k = 0\):

\[\frac{X_k}{\pi_k} - \lambda = 0 \quad\Longrightarrow\quad \pi_k = \frac{X_k}{\lambda}.\]

Summing over \(k\): \(1 = n/\lambda\), so \(\lambda = n\) and

\[\hat{\pi}_k = \frac{X_k}{n},\]

as expected. Let us verify this Lagrange multiplier derivation numerically.

# Multinomial MLE: Lagrange multiplier solution vs. direct formula
import numpy as np
from scipy.optimize import minimize

np.random.seed(42)
K = 6
counts = np.array([45, 30, 15, 5, 3, 2])
n = counts.sum()

# Analytical solution from Lagrange multipliers
pi_analytical = counts / n

# Numerical solution using scipy with constraints
constraints = [{'type': 'eq', 'fun': lambda pi: np.sum(pi) - 1}]
bounds = [(1e-10, 1)] * K
pi0 = np.ones(K) / K
res = minimize(lambda pi: -np.sum(counts * np.log(pi)), pi0,
               method='SLSQP', bounds=bounds, constraints=constraints)
pi_numerical = res.x

print(f"{'Category':>8s}  {'Count':>5s}  {'Analytical':>10s}  {'Numerical':>10s}  {'Match':>6s}")
print("-" * 48)
for k in range(K):
    match = "yes" if abs(pi_analytical[k] - pi_numerical[k]) < 1e-6 else "no"
    print(f"{k+1:8d}  {counts[k]:5d}  {pi_analytical[k]:10.4f}  "
          f"{pi_numerical[k]:10.4f}  {match:>6s}")
print(f"\nLagrange multiplier lambda = n = {n}")

The Full Portfolio Optimization

Now let us solve the portfolio problem properly with constraints, comparing the analytical Lagrange approach with numerical optimization.

# Portfolio optimization: solve analytically (equality only) and numerically
import numpy as np
from scipy.optimize import minimize

np.random.seed(42)
p = 5
mu = np.array([0.12, 0.10, 0.07, 0.03, 0.14])
A = np.random.randn(p, p) * 0.1
Sigma = A.T @ A + 0.05 * np.eye(p)

# Minimum-variance portfolio (equality constraint only: w^T 1 = 1)
# Lagrangian: L = 0.5 w^T Sigma w - lambda*(1^T w - 1)
# dL/dw = Sigma w - lambda 1 = 0  =>  w = lambda * Sigma^{-1} 1
# Constraint: 1^T w = 1  =>  lambda = 1 / (1^T Sigma^{-1} 1)
Sigma_inv = np.linalg.inv(Sigma)
ones = np.ones(p)
lam_mv = 1.0 / (ones @ Sigma_inv @ ones)
w_mv_analytic = lam_mv * Sigma_inv @ ones

# Numerical
res_mv = minimize(lambda w: 0.5 * w @ Sigma @ w, np.ones(p)/p,
                  method='SLSQP',
                  constraints=[{'type': 'eq', 'fun': lambda w: np.sum(w) - 1}])
w_mv_numerical = res_mv.x

print("Minimum-variance portfolio (equality constraint only):")
print(f"{'Asset':>5s}  {'Analytic':>10s}  {'Numerical':>10s}")
print("-" * 30)
for i in range(p):
    print(f"{i:5d}  {w_mv_analytic[i]:10.4f}  {w_mv_numerical[i]:10.4f}")
print(f"{'Var':>5s}  {w_mv_analytic @ Sigma @ w_mv_analytic:10.6f}  "
      f"{w_mv_numerical @ Sigma @ w_mv_numerical:10.6f}")

# Full problem: maximize return with risk budget + no short selling
sigma2_max = 0.02
constraints_full = [
    {'type': 'eq',   'fun': lambda w: np.sum(w) - 1},
    {'type': 'ineq', 'fun': lambda w: sigma2_max - w @ Sigma @ w},
]
bounds = [(0, 1)] * p
res_full = minimize(lambda w: -mu @ w, np.ones(p)/p,
                    method='SLSQP', bounds=bounds, constraints=constraints_full)

print(f"\nFull constrained portfolio (risk budget {sigma2_max}):")
print(f"  Weights: {np.round(res_full.x, 4)}")
print(f"  Return:  {mu @ res_full.x:.4f}")
print(f"  Risk:    {res_full.x @ Sigma @ res_full.x:.6f}")

Positive-Definiteness Constraint on Covariance

In multivariate normal MLE, \(\boldsymbol{\Sigma}\) must be positive definite. This is a matrix inequality constraint. In practice, this is handled by reparameterization (Section 16.7) rather than by explicit KKT conditions — for instance, working with the Cholesky factor \(\mathbf{L}\) such that \(\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^{\!\top}\).

Variance Positivity

For a normal model \(\mathcal{N}(\mu, \sigma^2)\), the constraint \(\sigma^2 > 0\) is automatically satisfied by the MLE \(\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2 > 0\) (for non-degenerate data). But in mixed models or penalized settings, the boundary \(\sigma^2 = 0\) can be relevant, and the KKT conditions (specifically complementary slackness) govern the behavior.

16.4 Augmented Lagrangian Method

Motivation

The basic Lagrangian approach requires solving the KKT system exactly, which can be difficult. The augmented Lagrangian method adds a quadratic penalty term to improve numerical conditioning.

Formulation

For equality constraints \(h_j(\boldsymbol{\theta}) = 0\), the augmented Lagrangian is

(37)\[\mathcal{L}_\rho(\boldsymbol{\theta}, \boldsymbol{\lambda}) = f(\boldsymbol{\theta}) - \sum_{j=1}^m \lambda_j\,h_j(\boldsymbol{\theta}) + \frac{\rho}{2}\sum_{j=1}^m h_j(\boldsymbol{\theta})^2,\]

where \(\rho > 0\) is a penalty parameter. The last term penalizes constraint violations, encouraging feasibility even when the Lagrange multiplier estimates are imperfect.

The augmented Lagrangian combines the best of both worlds: the Lagrange multipliers handle the constraint “exactly” in the limit, while the quadratic penalty provides numerical stability along the way.

Algorithm

Initialize: theta_0, lambda_0, rho_0
For k = 0, 1, 2, ...
    1. Minimize L_rho(theta, lambda_k) with respect to theta -> theta_{k+1}
    2. Update multipliers: lambda_{k+1,j} = lambda_{k,j} - rho_k h_j(theta_{k+1})
    3. Optionally increase rho_k -> rho_{k+1}
    4. Check convergence

The multiplier update in step 2 is a dual ascent step: it moves the multipliers in the direction that increases the Lagrangian, encouraging the constraints to be satisfied more tightly.

Let us implement the augmented Lagrangian from scratch and watch it converge.

# Augmented Lagrangian method: min x^2 + 2y^2 s.t. x + y = 1
import numpy as np

np.random.seed(42)

# Objective and constraint
def f(x, y):
    return x**2 + 2*y**2

def h(x, y):
    return x + y - 1  # constraint: h = 0

def grad_f(x, y):
    return np.array([2*x, 4*y])

def grad_h(x, y):
    return np.array([1.0, 1.0])

# Augmented Lagrangian method
theta = np.array([0.0, 0.0])  # initial guess
lam = 0.0                      # initial multiplier
rho = 1.0                      # penalty parameter

print(f"{'Iter':>4s}  {'x':>8s}  {'y':>8s}  {'lambda':>8s}  {'rho':>6s}  "
      f"{'f(x,y)':>8s}  {'|h(x,y)|':>10s}")
print("-" * 62)

for k in range(15):
    # Inner loop: minimize augmented Lagrangian w.r.t. theta using GD
    for _ in range(100):
        x, y = theta
        hval = h(x, y)
        grad = grad_f(x, y) - lam * grad_h(x, y) + rho * hval * grad_h(x, y)
        theta = theta - 0.05 * grad

    x, y = theta
    hval = h(x, y)
    print(f"{k:4d}  {x:8.5f}  {y:8.5f}  {lam:8.4f}  {rho:6.1f}  "
          f"{f(x, y):8.5f}  {abs(hval):10.2e}")

    # Update multiplier (dual ascent)
    lam = lam - rho * hval

    # Optionally increase rho
    if abs(hval) > 0.01:
        rho *= 1.5

    if abs(hval) < 1e-8:
        print(f"Converged at iteration {k}.")
        break

print(f"\nSolution: x={theta[0]:.6f}, y={theta[1]:.6f}")
print(f"Exact:    x={2/3:.6f}, y={1/3:.6f}")

Advantages over a pure penalty method: The augmented Lagrangian converges to the correct solution for a finite value of \(\rho\); a pure penalty method requires \(\rho \to \infty\), causing ill-conditioning.

Connection to ADMM

The Alternating Direction Method of Multipliers (ADMM) applies the augmented Lagrangian method to a problem that has been split into two blocks:

\[\min\; f_1(\boldsymbol{\theta}_1) + f_2(\boldsymbol{\theta}_2) \quad\text{s.t.}\quad \mathbf{A}_1\boldsymbol{\theta}_1 + \mathbf{A}_2\boldsymbol{\theta}_2 = \mathbf{b}.\]

The augmented Lagrangian is minimized alternately over \(\boldsymbol{\theta}_1\) and \(\boldsymbol{\theta}_2\), followed by a multiplier update. ADMM is widely used in machine learning for distributed optimization and for problems with structured regularizers (e.g., LASSO, group LASSO).

16.5 Barrier (Interior-Point) Methods

Motivation

Interior-point methods approach the constrained optimum from strictly inside the feasible region, using a barrier function that goes to infinity at the boundary.

Intuition

Imagine you are optimizing in a room, and the walls represent constraints. A barrier function creates a “force field” near the walls that repels you from getting too close. As the optimization proceeds, the force field is gradually weakened (by increasing \(t\)), allowing you to creep closer and closer to the walls — but you never actually hit them. In the limit, you end up right at the constrained optimum, possibly touching one or more walls.

Log-Barrier Function

For inequality constraints \(g_i(\boldsymbol{\theta}) \leq 0\), the log-barrier function is

(38)\[B(\boldsymbol{\theta}) = -\sum_{i=1}^q \log\bigl(-g_i(\boldsymbol{\theta})\bigr).\]

Note that \(B(\boldsymbol{\theta}) \to +\infty\) as any constraint becomes active (\(g_i \to 0^-\)), so minimizing \(f + (1/t)\,B\) keeps the iterates strictly inside the feasible region.

The Barrier Problem

For a parameter \(t > 0\), solve the unconstrained problem

(39)\[\min_{\boldsymbol{\theta}}\; f(\boldsymbol{\theta}) + \frac{1}{t}\,B(\boldsymbol{\theta}) = f(\boldsymbol{\theta}) - \frac{1}{t}\sum_{i=1}^q \log\bigl(-g_i(\boldsymbol{\theta})\bigr).\]

As \(t \to \infty\), the barrier term vanishes and the solution approaches the constrained optimum.

Let us implement the barrier method on our portfolio problem and watch the barrier parameter grow as the solution converges to the constraint boundary.

# Barrier method for portfolio optimization
# Maximize mu^T w s.t. w^T Sigma w <= sigma2_max, sum(w)=1, w >= 0
# Reformulated as: min -mu^T w with barrier for inequality constraints
import numpy as np

np.random.seed(42)
p = 3  # small for clarity
mu = np.array([0.12, 0.07, 0.14])
A = np.random.randn(p, p) * 0.1
Sigma = A.T @ A + 0.05 * np.eye(p)
sigma2_max = 0.03

# Constraints: w >= 0 (p constraints), w^T Sigma w <= sigma2_max (1 constraint)
# We handle sum(w) = 1 by projecting after each step

def barrier_obj(w, t):
    """Objective + (1/t) * barrier."""
    obj = -mu @ w
    # Barrier for w_i > 0
    if np.any(w <= 0):
        return 1e10
    obj -= (1.0 / t) * np.sum(np.log(w))
    # Barrier for sigma2_max - w^T Sigma w > 0
    slack = sigma2_max - w @ Sigma @ w
    if slack <= 0:
        return 1e10
    obj -= (1.0 / t) * np.log(slack)
    return obj

def barrier_grad(w, t):
    """Gradient of objective + (1/t) * barrier."""
    grad = -mu
    grad -= (1.0 / t) / w                           # d/dw [-log(w_i)] = -1/w_i
    slack = sigma2_max - w @ Sigma @ w
    grad -= (1.0 / t) * (-2 * Sigma @ w) / slack    # d/dw [-log(slack)]
    return grad

def project_simplex(w):
    """Project w onto the simplex {w >= 0, sum(w) = 1}."""
    # Simple projection: normalize positive part
    w = np.maximum(w, 1e-10)
    return w / np.sum(w)

w = np.ones(p) / p  # start at equal weights (feasible)

print(f"{'t':>8s}  ", end="")
for i in range(p):
    print(f"{'w'+str(i):>8s}  ", end="")
print(f"{'Return':>8s}  {'Risk':>8s}  {'Barrier obj':>12s}")
print("-" * 70)

for t in [1, 5, 20, 100, 500, 2000, 10000]:
    # Inner loop: projected gradient descent
    for step in range(500):
        g = barrier_grad(w, t)
        lr = 0.0005 / (1 + step * 0.001)
        w = w - lr * g
        w = project_simplex(w)
        # Ensure feasibility
        while w @ Sigma @ w >= sigma2_max:
            w = 0.95 * w + 0.05 * np.ones(p) / p

    ret = mu @ w
    risk = w @ Sigma @ w
    bobj = barrier_obj(w, t)
    print(f"{t:8d}  ", end="")
    for i in range(p):
        print(f"{w[i]:8.4f}  ", end="")
    print(f"{ret:8.4f}  {risk:8.6f}  {bobj:12.6f}")

print(f"\nAs t -> inf, the barrier weakens and the solution approaches")
print(f"the constrained optimum. Risk budget: {sigma2_max}")

The Central Path

The set of solutions \(\{\boldsymbol{\theta}^*(t) : t > 0\}\) traces a smooth curve in parameter space called the central path. As \(t\) increases, the points on the central path approach the constrained optimum.

The first-order optimality condition for (39) is

\[\nabla f(\boldsymbol{\theta}) + \frac{1}{t}\sum_{i=1}^q \frac{\nabla g_i(\boldsymbol{\theta})}{-g_i(\boldsymbol{\theta})} = \mathbf{0}.\]

Defining \(\mu_i = 1/(t \cdot (-g_i(\boldsymbol{\theta})))\), this becomes

\[\nabla f(\boldsymbol{\theta}) + \sum_{i=1}^q \mu_i\,\nabla g_i(\boldsymbol{\theta}) = \mathbf{0}, \qquad \mu_i\,g_i(\boldsymbol{\theta}) = -1/t,\]

which converges to the KKT conditions (with complementary slackness \(\mu_i g_i = 0\)) as \(t \to \infty\).

Algorithm

Initialize: theta_0 strictly feasible, t_0 > 0, growth factor kappa > 1
For k = 0, 1, 2, ...
    1. Solve the barrier problem (16.7) using Newton's method -> theta_k*
    2. Increase t: t_{k+1} = kappa * t_k
    3. Use theta_k* as the starting point for the next barrier problem
    4. Stop when q/t < epsilon (duality gap bound)

Interior-point methods have excellent theoretical properties: they can solve convex problems with \(q\) inequality constraints in \(O(\sqrt{q}\,\log(1/\epsilon))\) Newton steps.

16.6 Penalty Methods

Quadratic Penalty Method

The simplest approach to constrained optimization is to add a penalty for constraint violations. For equality constraints:

(40)\[\min_{\boldsymbol{\theta}}\; f(\boldsymbol{\theta}) + \frac{\rho}{2}\sum_{j=1}^m h_j(\boldsymbol{\theta})^2.\]

For inequality constraints \(g_i(\boldsymbol{\theta}) \leq 0\):

\[\min_{\boldsymbol{\theta}}\; f(\boldsymbol{\theta}) + \frac{\rho}{2}\sum_{i=1}^q \bigl[\max(0, g_i(\boldsymbol{\theta}))\bigr]^2.\]

Let us implement the penalty method and see how increasing \(\rho\) drives the solution toward feasibility—and also how it causes ill-conditioning.

# Penalty method: min x^2 + 2y^2 s.t. x + y = 1
# Compare with exact solution (2/3, 1/3) and augmented Lagrangian
import numpy as np

np.random.seed(42)
exact_x, exact_y = 2.0/3, 1.0/3
exact_f = exact_x**2 + 2*exact_y**2

print(f"{'rho':>10s}  {'x':>8s}  {'y':>8s}  {'f(x,y)':>8s}  "
      f"{'|h(x,y)|':>10s}  {'||error||':>10s}  {'cond(H)':>10s}")
print("-" * 78)

for rho in [1, 10, 100, 1000, 10000, 100000]:
    # Penalized objective: f(x,y) + (rho/2)*(x+y-1)^2
    # Gradient: [2x + rho*(x+y-1), 4y + rho*(x+y-1)]
    # Hessian:  [[2+rho, rho], [rho, 4+rho]]
    # Solve: H * [x,y] = [rho, rho]  (from setting gradient = 0)
    H = np.array([[2 + rho, rho], [rho, 4 + rho]])
    b = np.array([rho, rho])
    sol = np.linalg.solve(H, b)
    x, y = sol
    fval = x**2 + 2*y**2
    hval = x + y - 1
    err = np.sqrt((x - exact_x)**2 + (y - exact_y)**2)
    cond = np.linalg.cond(H)
    print(f"{rho:10d}  {x:8.6f}  {y:8.6f}  {fval:8.6f}  "
          f"{abs(hval):10.2e}  {err:10.2e}  {cond:10.1f}")

print(f"\nExact: x={exact_x:.6f}, y={exact_y:.6f}, f={exact_f:.6f}")
print("Note: as rho increases, h(x,y)->0 but cond(H) grows -> ill-conditioning!")

Convergence

Theorem. Let \(\boldsymbol{\theta}(\rho)\) be the minimizer of the quadratic penalty problem for a given \(\rho\). Under mild conditions, as \(\rho \to \infty\), any limit point of \(\boldsymbol{\theta}(\rho)\) is a solution of the constrained problem.

Proof sketch. For any feasible point \(\bar{\boldsymbol{\theta}}\):

\[f(\boldsymbol{\theta}(\rho)) + \frac{\rho}{2}\|\mathbf{h}(\boldsymbol{\theta}(\rho))\|^2 \leq f(\bar{\boldsymbol{\theta}}) + \frac{\rho}{2}\|\mathbf{h}(\bar{\boldsymbol{\theta}})\|^2 = f(\bar{\boldsymbol{\theta}}),\]

since \(\mathbf{h}(\bar{\boldsymbol{\theta}}) = \mathbf{0}\). Therefore \(\|\mathbf{h}(\boldsymbol{\theta}(\rho))\|^2 \leq 2[f(\bar{\boldsymbol{\theta}}) - f(\boldsymbol{\theta}(\rho))]/\rho\). As \(\rho \to \infty\), the constraint violation shrinks to zero (assuming \(f\) is bounded below on the feasible set).

Drawback: For large \(\rho\), the penalized objective becomes ill-conditioned (its Hessian has eigenvalues of order \(\rho\)), making the subproblem hard to solve. This motivates the augmented Lagrangian method (Section 16.4), which achieves feasibility without driving \(\rho\) to infinity.

Common Pitfall

The quadratic penalty method is tempting because of its simplicity, but beware of the ill-conditioning trap. As \(\rho\) grows, the Hessian becomes dominated by the penalty term, making Newton’s method and even gradient descent struggle with numerical precision. If you find yourself needing \(\rho > 10^6\), switch to the augmented Lagrangian method or reparameterize the problem.

16.7 Reparameterization as an Alternative to Constraints

In practice, the most common way to handle parameter constraints in statistical models is to transform the parameters so that the transformed parameters are unconstrained. This is especially convenient with gradient-based optimization.

Why Reparameterize?

Reparameterization is often the simplest and most robust approach. Instead of fighting the constraints with multipliers and penalties, you change variables so the constraints simply vanish. The transformed problem is unconstrained, so you can use any method from Chapters 12–14 directly. The catch is that the transformation must be smooth and invertible, and the transformed parameter space may have different curvature properties.

Positivity: The Log Transform

If \(\sigma^2 > 0\), define \(\phi = \log \sigma^2\). Then \(\phi \in \mathbb{R}\) is unconstrained, and \(\sigma^2 = e^{\phi}\) is automatically positive.

The gradient transforms as

\[\frac{\partial \ell}{\partial \phi} = \frac{\partial \ell}{\partial \sigma^2}\, \frac{\partial \sigma^2}{\partial \phi} = \frac{\partial \ell}{\partial \sigma^2}\,e^{\phi} = \sigma^2\,\frac{\partial \ell}{\partial \sigma^2}.\]
# Log reparameterization: MLE of normal variance
import numpy as np

np.random.seed(42)
n = 100
sigma2_true = 4.0
x = np.random.normal(0, np.sqrt(sigma2_true), n)

# Direct: gradient ascent on sigma^2 (must stay positive!)
sigma2 = 1.0
print("--- Without reparameterization (constrained) ---")
print(f"{'Iter':>4s}  {'sigma^2':>10s}  {'grad':>10s}")
for k in range(8):
    grad = -n / (2 * sigma2) + np.sum(x**2) / (2 * sigma2**2)
    sigma2 = sigma2 + 0.01 * grad
    sigma2 = max(sigma2, 1e-6)  # MANUAL clipping to enforce positivity
    print(f"{k:4d}  {sigma2:10.4f}  {grad:10.4f}")

# Reparameterized: gradient ascent on phi = log(sigma^2) (unconstrained!)
phi = 0.0  # log(1.0) = 0
print("\n--- With log reparameterization (unconstrained) ---")
print(f"{'Iter':>4s}  {'phi':>10s}  {'sigma^2':>10s}  {'grad_phi':>10s}")
for k in range(8):
    sigma2_reparam = np.exp(phi)
    # Chain rule: d(ell)/d(phi) = sigma^2 * d(ell)/d(sigma^2)
    grad_sigma2 = -n / (2 * sigma2_reparam) + np.sum(x**2) / (2 * sigma2_reparam**2)
    grad_phi = sigma2_reparam * grad_sigma2  # no clipping needed!
    phi = phi + 0.01 * grad_phi
    print(f"{k:4d}  {phi:10.4f}  {np.exp(phi):10.4f}  {grad_phi:10.4f}")

print(f"\nTrue sigma^2 = {sigma2_true:.4f}")
print(f"Sample variance = {np.var(x):.4f}")

Bounded Parameters: The Logit Transform

If \(p \in (0, 1)\), define \(\eta = \log(p / (1-p))\). Then \(\eta \in \mathbb{R}\), and \(p = 1/(1 + e^{-\eta})\) is automatically in \((0, 1)\).

More generally, for \(\theta \in (a, b)\):

\[\eta = \log\frac{\theta - a}{b - \theta}, \qquad \theta = a + (b-a)\,\frac{1}{1 + e^{-\eta}}.\]
# Logit reparameterization for a binomial proportion
import numpy as np
from scipy.special import expit, logit

np.random.seed(42)
p_true = 0.73
n = 50
x = np.random.binomial(n, p_true)  # x successes out of n trials
print(f"Observed: {x}/{n} = {x/n:.2f}, True p = {p_true}")

# Gradient ascent on eta = logit(p), unconstrained
eta = 0.0  # start at p = 0.5
print(f"\n{'Iter':>4s}  {'eta':>8s}  {'p':>8s}  {'grad_eta':>10s}  {'log-lik':>10s}")
for k in range(15):
    p = expit(eta)
    # Log-likelihood: x*log(p) + (n-x)*log(1-p)
    ll = x * np.log(p) + (n - x) * np.log(1 - p)
    # d(ell)/d(eta) = x - n*p  (a beautiful simplification!)
    grad_eta = x - n * p
    eta = eta + 0.05 * grad_eta
    print(f"{k:4d}  {eta:8.4f}  {p:8.4f}  {grad_eta:10.4f}  {ll:10.4f}")

print(f"\nMLE: p = {x/n:.4f} (exact), p = {expit(eta):.4f} (reparameterized GD)")

Simplex Constraint: The Softmax Transform

If \(\boldsymbol{\pi} = (\pi_1, \dots, \pi_K)\) with \(\pi_k \geq 0\) and \(\sum_k \pi_k = 1\), define unconstrained \(\boldsymbol{\eta} = (\eta_1, \dots, \eta_{K-1}) \in \mathbb{R}^{K-1}\) and set

(41)\[\pi_k = \frac{e^{\eta_k}}{\sum_{j=1}^{K} e^{\eta_j}}, \qquad k = 1, \dots, K,\]

where \(\eta_K = 0\) for identifiability (one degree of freedom is removed by the sum-to-one constraint).

The softmax transform is smooth and invertible (on the interior of the simplex), so gradient methods can be applied to \(\boldsymbol{\eta}\) without any constraint-handling machinery.

# Softmax reparameterization: multinomial MLE without constraints
import numpy as np
from scipy.special import softmax

np.random.seed(42)
K = 5
counts = np.array([45, 30, 15, 5, 5])
n = counts.sum()
pi_true = counts / n

# Unconstrained parameters (K-1 free, last fixed at 0)
eta = np.zeros(K - 1)

print(f"{'Iter':>4s}  ", end="")
for k in range(K):
    print(f"{'pi_'+str(k+1):>8s}  ", end="")
print(f"{'||pi - true||':>14s}")
print("-" * 65)

for step in range(200):
    # Forward: softmax to get probabilities
    eta_full = np.append(eta, 0.0)
    pi = softmax(eta_full)

    # Gradient: d(ell)/d(eta_k) = counts_k - n*pi_k for k=1,...,K-1
    grad = counts[:K-1] - n * pi[:K-1]
    eta = eta + 0.01 * grad

    if step % 40 == 0 or step == 199:
        err = np.linalg.norm(pi - pi_true)
        print(f"{step:4d}  ", end="")
        for k in range(K):
            print(f"{pi[k]:8.4f}  ", end="")
        print(f"{err:14.6f}")

print(f"\nTrue:  {pi_true}")
print(f"Sum:   {pi.sum():.8f} (always exactly 1 by construction)")

The key insight: the softmax guarantees \(\pi_k > 0\) and \(\sum_k \pi_k = 1\) by construction, so we never need to worry about constraint violations.

Positive-Definite Matrices: The Cholesky Factor

For a covariance matrix \(\boldsymbol{\Sigma}\) that must be positive definite, write \(\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^{\!\top}\) where \(\mathbf{L}\) is lower triangular with positive diagonal entries. Optimize over the entries of \(\mathbf{L}\), parameterizing the diagonal as \(L_{jj} = e^{\phi_j}\) to ensure positivity.

This gives \(p(p+1)/2\) unconstrained parameters, which is the correct number of degrees of freedom for a \(p \times p\) symmetric positive definite matrix.

# Cholesky reparameterization for covariance MLE
import numpy as np

np.random.seed(42)
d = 3
# True covariance
Sigma_true = np.array([[4.0, 1.0, 0.5],
                        [1.0, 2.0, 0.3],
                        [0.5, 0.3, 1.0]])
L_true = np.linalg.cholesky(Sigma_true)

# Generate data
n = 500
X = np.random.randn(n, d) @ L_true.T

# Parameterize: L has d*(d+1)/2 free parameters
# Diagonal: L_jj = exp(phi_j) (always positive)
# Off-diagonal: unconstrained
n_params = d * (d + 1) // 2
params = np.zeros(n_params)  # initial: L = I

def params_to_L(params, d):
    """Convert unconstrained params to lower triangular L."""
    L = np.zeros((d, d))
    idx = 0
    for i in range(d):
        for j in range(i + 1):
            if i == j:
                L[i, j] = np.exp(params[idx])  # positive diagonal
            else:
                L[i, j] = params[idx]           # free off-diagonal
            idx += 1
    return L

def neg_log_lik(params, X, d):
    """Negative log-likelihood for multivariate normal (mu=0)."""
    L = params_to_L(params, d)
    Sigma = L @ L.T
    n = len(X)
    sign, logdet = np.linalg.slogdet(Sigma)
    Sigma_inv = np.linalg.inv(Sigma)
    return 0.5 * n * logdet + 0.5 * np.sum(X @ Sigma_inv * X)

# Optimize using numerical gradient descent
print(f"{'Iter':>4s}  {'neg-LL':>12s}  {'||Sigma - Sigma_hat||':>22s}")
print("-" * 44)
lr = 0.0001
for step in range(500):
    # Numerical gradient
    grad = np.zeros(n_params)
    eps = 1e-5
    f0 = neg_log_lik(params, X, d)
    for j in range(n_params):
        params_p = params.copy()
        params_p[j] += eps
        grad[j] = (neg_log_lik(params_p, X, d) - f0) / eps
    params = params - lr * grad

    if step % 100 == 0 or step == 499:
        L = params_to_L(params, d)
        Sigma_hat = L @ L.T
        err = np.linalg.norm(Sigma_hat - Sigma_true, 'fro')
        print(f"{step:4d}  {f0:12.2f}  {err:22.4f}")

L_final = params_to_L(params, d)
Sigma_hat = L_final @ L_final.T
print(f"\nTrue Sigma:\n{Sigma_true}")
print(f"\nEstimated Sigma:\n{np.round(Sigma_hat, 3)}")
print(f"Positive definite: {np.all(np.linalg.eigvalsh(Sigma_hat) > 0)}")

Correlation Matrices: Fisher’s z-Transform

For a correlation parameter \(r \in (-1, 1)\), use

\[z = \operatorname{arctanh}(r) = \frac{1}{2}\log\frac{1+r}{1-r}, \qquad r = \tanh(z).\]

Then \(z \in \mathbb{R}\) is unconstrained.

# Fisher's z-transform for correlation estimation
import numpy as np

np.random.seed(42)
r_true = 0.85
n = 100

# Generate correlated data
Sigma = np.array([[1.0, r_true], [r_true, 1.0]])
L = np.linalg.cholesky(Sigma)
data = np.random.randn(n, 2) @ L.T

# Gradient ascent on z = arctanh(r), unconstrained
z = 0.0  # start at r = 0
print(f"{'Iter':>4s}  {'z':>8s}  {'r=tanh(z)':>10s}  {'log-lik':>10s}")
for k in range(20):
    r = np.tanh(z)
    # Bivariate normal log-likelihood (mu=0, sigma=1, correlation r)
    det = 1 - r**2
    ll = -0.5 * n * np.log(det) - 0.5 / det * np.sum(
        data[:, 0]**2 - 2*r*data[:, 0]*data[:, 1] + data[:, 1]**2)
    # d(ell)/d(r)
    dll_dr = n * r / det + (1 / det**2) * np.sum(
        -r * (data[:, 0]**2 + data[:, 1]**2) + (1 + r**2) * data[:, 0] * data[:, 1])
    # Chain rule: d(ell)/d(z) = d(ell)/d(r) * d(r)/d(z) = dll_dr * (1 - r^2)
    grad_z = dll_dr * (1 - r**2)
    z = z + 0.001 * grad_z
    if k % 4 == 0:
        print(f"{k:4d}  {z:8.4f}  {np.tanh(z):10.4f}  {ll:10.2f}")

print(f"\nTrue r = {r_true}, Sample r = {np.corrcoef(data.T)[0,1]:.4f}, "
      f"Estimated r = {np.tanh(z):.4f}")

All Four Reparameterizations at a Glance

# Summary: all reparameterization strategies
import numpy as np

print(f"{'Constraint':>25s}  {'Transform':>25s}  {'Inverse':>25s}")
print("-" * 80)
examples = [
    ("sigma^2 > 0",       "phi = log(sigma^2)",    "sigma^2 = exp(phi)"),
    ("p in (0, 1)",        "eta = logit(p)",         "p = sigmoid(eta)"),
    ("r in (-1, 1)",       "z = arctanh(r)",         "r = tanh(z)"),
    ("pi on simplex",      "eta_k unconstrained",    "pi = softmax(eta)"),
    ("Sigma pos. def.",    "L = chol(Sigma)",        "Sigma = L @ L.T"),
]
for constraint, transform, inverse in examples:
    print(f"{constraint:>25s}  {transform:>25s}  {inverse:>25s}")

# Numerical demo: all transforms preserve constraints
print("\nNumerical verification:")
phi = -2.5
print(f"  exp({phi}) = {np.exp(phi):.4f} > 0: True")
eta = 3.7
p = 1 / (1 + np.exp(-eta))
print(f"  sigmoid({eta}) = {p:.4f} in (0,1): {0 < p < 1}")
z = 1.2
r = np.tanh(z)
print(f"  tanh({z}) = {r:.4f} in (-1,1): {-1 < r < 1}")
eta_vec = np.array([1.0, -0.5, 0.3])
from scipy.special import softmax
pi = softmax(np.append(eta_vec, 0.0))
print(f"  softmax({list(np.round(eta_vec, 1))} + [0]) = {np.round(pi, 3)}, "
      f"sum={pi.sum():.6f}")

When to Reparameterize vs. Use Constrained Methods

Scenario

Reparameterize

Constrained method

Simple positivity/boundedness

Preferred

Overkill

Simplex (few categories)

Preferred (softmax)

Fine too

Complex coupled constraints

May be difficult

Preferred

Linear constraints (\(\mathbf{A}\boldsymbol{\theta} = \mathbf{b}\))

Null-space projection

Lagrangian

Interpretability of Hessian

Harder (need Jacobian)

Original parameterization

16.8 Putting It All Together: Constrained MLE

A typical workflow for constrained maximum likelihood is:

  1. Identify the constraints. Are they simple (positivity, simplex) or complex (coupled inequalities, linear equalities)?

  2. Try reparameterization first. For simple constraints, the log, logit, softmax, or Cholesky transforms eliminate the constraints. Apply any unconstrained method from Chapter 12: Gradient Methods, Chapter 13: Newton and Scoring Methods, or Chapter 14: Quasi-Newton Methods.

  3. If reparameterization is infeasible, use:

    • Lagrange multipliers (for small problems with equality constraints).

    • Interior-point methods (for convex problems with many inequality constraints).

    • Augmented Lagrangian / ADMM (for large-scale or separable problems).

    • Penalty methods (for quick-and-dirty solutions).

  4. Verify the KKT conditions at the solution. Check that all constraints are satisfied, multipliers have the correct sign, and complementary slackness holds.

  5. Compute standard errors using the constrained observed information (which may involve the bordered Hessian or the Hessian of the Lagrangian restricted to the constraint manifold).

Let us run a final comprehensive example comparing all three approaches— penalty, augmented Lagrangian, and reparameterization—on the same problem.

# Head-to-head: three approaches to the same constrained problem
# Problem: MLE of Bernoulli p with constraint p in (0.2, 0.8)
import numpy as np
from scipy.special import expit

np.random.seed(42)
n = 30
p_true = 0.65
x = np.random.binomial(1, p_true, n)
p_mle = x.mean()
print(f"Observed: {x.sum()}/{n} successes, unconstrained MLE = {p_mle:.4f}")

# Method 1: Quadratic penalty for constraint p >= 0.2 and p <= 0.8
p = 0.5
for rho in [10, 100, 1000]:
    for _ in range(200):
        grad = x.sum() / p - (n - x.sum()) / (1 - p)
        # Penalty gradient for g1: 0.2 - p <= 0 and g2: p - 0.8 <= 0
        if p < 0.2:
            grad -= rho * (0.2 - p)
        if p > 0.8:
            grad -= rho * (p - 0.8)
        p = p + 0.001 * grad
        p = np.clip(p, 0.01, 0.99)

p_penalty = p
print(f"\nPenalty method:          p = {p_penalty:.6f}")

# Method 2: Reparameterization via logit with bounds
# Map p in (0.2, 0.8) to eta in R via eta = log((p-0.2)/(0.8-p))
a, b = 0.2, 0.8
eta = 0.0  # start at midpoint p = 0.5
for _ in range(500):
    p = a + (b - a) * expit(eta)
    grad_p = x.sum() / p - (n - x.sum()) / (1 - p)
    dp_deta = (b - a) * expit(eta) * (1 - expit(eta))
    grad_eta = grad_p * dp_deta
    eta = eta + 0.01 * grad_eta

p_reparam = a + (b - a) * expit(eta)
print(f"Reparameterization:     p = {p_reparam:.6f}")

# Method 3: Projected gradient (project onto [0.2, 0.8] after each step)
p = 0.5
for _ in range(500):
    grad = x.sum() / p - (n - x.sum()) / (1 - p)
    p = p + 0.001 * grad
    p = np.clip(p, 0.2, 0.8)  # project onto feasible set

p_projected = p
print(f"Projected gradient:     p = {p_projected:.6f}")

# Analytical: MLE is x.mean(), clipped to [0.2, 0.8]
p_exact = np.clip(p_mle, 0.2, 0.8)
print(f"Exact (clipped MLE):    p = {p_exact:.6f}")

16.9 Summary

Constrained optimization is an essential part of likelihood-based inference. Lagrange multipliers handle equality constraints by introducing dual variables whose values measure the sensitivity of the optimum to the constraints. The KKT conditions extend this framework to inequality constraints, with complementary slackness determining which constraints are active. Algorithmic approaches — augmented Lagrangian, barrier methods, and penalty methods — provide practical ways to solve constrained problems numerically.

In statistical practice, the most common approach is reparameterization: the log transform for positivity, the logit for bounded parameters, the softmax for simplex constraints, and the Cholesky decomposition for positive definiteness. These transforms convert the constrained problem into an unconstrained one, allowing direct application of the gradient methods (Chapter 12: Gradient Methods), Newton methods (Chapter 13: Newton and Scoring Methods), and quasi-Newton methods (Chapter 14: Quasi-Newton Methods) developed in earlier chapters. When reparameterization is not feasible or when the constraints have complex structure, the formal constrained-optimization methods of this chapter provide the necessary tools.