Chapter 16: Constrained Optimization¶
You manage a portfolio of 5 assets. You want to maximize expected return, but you have a risk budget: the portfolio variance must not exceed 2%. The weights must be non-negative (no short selling) and sum to one (fully invested). This is a constrained optimization problem—and if you ignore the constraints, your optimizer will happily tell you to put 500% of your money in the highest-return asset and short-sell everything else.
# What happens when you ignore constraints
import numpy as np
np.random.seed(42)
p = 5
mu = np.array([0.12, 0.10, 0.07, 0.03, 0.14]) # expected returns
A = np.random.randn(p, p) * 0.1
Sigma = A.T @ A + 0.05 * np.eye(p) # covariance of returns
# Unconstrained "optimal" portfolio: just maximize mu^T w
# Without constraints, w -> infinity in the direction of max return
w_unconstrained = mu / np.linalg.norm(mu) # just point at highest return
w_unconstrained *= 10 # scale up for "more return"
print("Unconstrained 'solution':")
print(f" Weights: {np.round(w_unconstrained, 2)}")
print(f" Sum of weights: {w_unconstrained.sum():.2f} (should be 1.00)")
print(f" Min weight: {w_unconstrained.min():.2f} (should be >= 0)")
print(f" Portfolio variance: {w_unconstrained @ Sigma @ w_unconstrained:.4f} (budget: 0.02)")
print("\nThis is nonsense. We NEED constraints.")
In many likelihood problems the parameters cannot take arbitrary values. Probabilities must be non-negative and sum to one. Variances must be positive. Correlation matrices must be positive semi-definite. This chapter develops the theory and algorithms for optimization under constraints, starting from Lagrange multipliers for equality constraints, extending to the Karush–Kuhn–Tucker (KKT) conditions for inequality constraints, and then covering the main algorithmic approaches: augmented Lagrangian methods, barrier (interior-point) methods, and penalty methods. We close with the practical alternative of reparameterization, which transforms a constrained problem into an unconstrained one.
Why Constrained Optimization Matters
Every time you fit a model with probabilities, variances, or proportions, you are doing constrained optimization — whether you realize it or not. The unconstrained methods from earlier chapters (Chapter 12: Gradient Methods, Chapter 13: Newton and Scoring Methods, Chapter 14: Quasi-Newton Methods) are powerful, but they can happily produce a negative variance or probabilities that sum to 1.3. Constrained optimization gives you the mathematical framework to keep parameters in their proper domains while still finding the optimum.
16.1 Equality Constraints and Lagrange Multipliers¶
Problem Formulation¶
Consider the problem
where \(f\) and \(h_1, \dots, h_m\) are continuously differentiable.
Geometric Intuition¶
At a constrained minimum \(\boldsymbol{\theta}^*\), the gradient of \(f\) cannot have a component in the feasible set — otherwise we could decrease \(f\) by moving along the constraint surface. More precisely, the gradient \(\nabla f(\boldsymbol{\theta}^*)\) must be normal to the constraint surface.
Think of it this way: you are hiking on a mountain and are confined to a trail (the constraint surface). At the lowest point of the trail, the slope of the mountain must be perpendicular to the trail — if there were any downhill component along the trail, you could walk further and go lower.
The constraint surface is (locally) a manifold defined by \(\mathbf{h}(\boldsymbol{\theta}) = \mathbf{0}\). Its tangent space at \(\boldsymbol{\theta}^*\) is the null space of the Jacobian matrix
The normal space to the constraint surface is spanned by the rows of \(\mathbf{J}_h\), i.e., by the gradients \(\nabla h_j\). The condition that \(\nabla f\) lies in this normal space is
or equivalently
The scalars \(\lambda_1^*, \dots, \lambda_m^*\) are the Lagrange multipliers.
The Lagrangian¶
Define the Lagrangian function
Reading this formula: the Lagrangian takes the objective \(f\) and subtracts a penalty for each constraint, weighted by the multiplier \(\lambda_j\). If \(\lambda_j\) is large, the corresponding constraint exerts a strong pull on the solution. The Lagrangian converts the constrained problem into an unconstrained one: finding a stationary point of \(\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\lambda})\) with respect to both \(\boldsymbol{\theta}\) and \(\boldsymbol{\lambda}\) gives you the constrained optimum and the multipliers simultaneously.
The necessary conditions for a constrained minimum are found by setting the partial derivatives to zero:
The first equation is (31) — the gradient of \(f\) must lie in the span of the constraint gradients. The second equation is simply the requirement that all constraints are satisfied (\(h_j = 0\)). Together they form a system of \(p + m\) equations in \(p + m\) unknowns \((\boldsymbol{\theta}, \boldsymbol{\lambda})\), which can be solved by Newton’s method or other root-finding techniques.
Let us solve this both analytically and numerically and compare.
# Lagrange multipliers: analytical vs. numerical
# Problem: minimize x^2 + 2y^2 subject to x + y = 1
import numpy as np
from scipy.optimize import fsolve
# --- Analytical solution ---
# Lagrangian: L = x^2 + 2y^2 - lambda*(x + y - 1)
# dL/dx = 2x - lambda = 0 => x = lambda/2
# dL/dy = 4y - lambda = 0 => y = lambda/4
# Constraint: x + y = 1 => lambda/2 + lambda/4 = 1 => lambda = 4/3
lam_analytic = 4.0 / 3.0
x_analytic = lam_analytic / 2.0
y_analytic = lam_analytic / 4.0
# --- Numerical solution (fsolve) ---
def lagrange_system(vars):
x, y, lam = vars
return [2*x - lam, # dL/dx = 0
4*y - lam, # dL/dy = 0
x + y - 1] # constraint
sol = fsolve(lagrange_system, [0.5, 0.5, 1.0])
x_num, y_num, lam_num = sol
print("Minimize x^2 + 2y^2 subject to x + y = 1")
print(f"\n{'':>12s} {'x':>8s} {'y':>8s} {'lambda':>8s} {'f(x,y)':>8s}")
print("-" * 50)
print(f"{'Analytical':>12s} {x_analytic:8.4f} {y_analytic:8.4f} "
f"{lam_analytic:8.4f} {x_analytic**2 + 2*y_analytic**2:8.4f}")
print(f"{'Numerical':>12s} {x_num:8.4f} {y_num:8.4f} "
f"{lam_num:8.4f} {x_num**2 + 2*y_num**2:8.4f}")
print(f"\nConstraint satisfied: x + y = {x_num + y_num:.8f}")
Intuition
The Lagrange multiplier \(\lambda^*\) has a beautiful economic interpretation: it tells you the rate at which the optimal value of \(f\) changes if you relax the constraint by an infinitesimal amount. In operations research, this is called the shadow price of the constraint. If \(\lambda^*\) is large, the constraint is “expensive” — the objective would improve significantly if the constraint were loosened.
# Shadow price interpretation: relax the constraint and observe
import numpy as np
from scipy.optimize import fsolve
def solve_relaxed(c):
"""Solve min x^2 + 2y^2 s.t. x + y = c, return (x*, y*, f*)."""
# Analytical: x = 2c/3, y = c/3, f = 2c^2/3
return 2*c/3, c/3, 2*c**2/3
print("Shadow price of the constraint x + y = c:")
print(f"{'c':>6s} {'x*':>8s} {'y*':>8s} {'f*':>8s} {'df*/dc':>8s}")
print("-" * 44)
prev_f = None
for c in [0.9, 0.95, 1.0, 1.05, 1.1]:
x, y, f = solve_relaxed(c)
df_dc = f - prev_f if prev_f is not None else 0.0
# The analytical derivative df*/dc = 4c/3, at c=1 this is 4/3 = lambda*
print(f"{c:6.2f} {x:8.4f} {y:8.4f} {f:8.4f} {df_dc / 0.05:8.4f}")
prev_f = f
print(f"\ndf*/dc at c=1 = {4.0/3.0:.4f} = lambda* (the Lagrange multiplier)")
Second-Order Sufficient Conditions¶
The first-order conditions (33) are necessary but not sufficient. A second-order sufficient condition for a local minimum is that the bordered Hessian (or equivalently the Hessian of the Lagrangian restricted to the tangent space of the constraints) is positive definite:
where
16.2 Inequality Constraints and KKT Conditions¶
Problem Formulation¶
The general constrained optimization problem is
Inequality constraints bring a new subtlety: at the solution, some constraints will be active (holding with equality) and others will be inactive (strict inequality). The KKT conditions elegantly handle both cases.
Deriving the KKT Conditions¶
Define the Lagrangian with both equality and inequality constraints:
At a constrained minimum \((\boldsymbol{\theta}^*, \boldsymbol{\lambda}^*, \boldsymbol{\mu}^*)\), the following Karush–Kuhn–Tucker (KKT) conditions must hold (under a constraint qualification such as LICQ):
Stationarity:
\[\nabla_{\boldsymbol{\theta}} \mathcal{L} = \nabla f(\boldsymbol{\theta}^*) - \sum_{j=1}^m \lambda_j^*\,\nabla h_j(\boldsymbol{\theta}^*) + \sum_{i=1}^q \mu_i^*\,\nabla g_i(\boldsymbol{\theta}^*) = \mathbf{0}.\]Primal feasibility:
\[h_j(\boldsymbol{\theta}^*) = 0 \quad\forall\, j, \qquad g_i(\boldsymbol{\theta}^*) \leq 0 \quad\forall\, i.\]Dual feasibility:
\[\mu_i^* \geq 0 \quad\forall\, i.\]Complementary slackness:
(36)¶\[\mu_i^*\,g_i(\boldsymbol{\theta}^*) = 0 \quad\forall\, i.\]
Complementary Slackness Explained¶
Condition (36) says that for each inequality constraint, either the constraint is active (\(g_i(\boldsymbol{\theta}^*) = 0\)) and the multiplier can be positive, or the constraint is inactive (\(g_i(\boldsymbol{\theta}^*) < 0\)) and the multiplier must be zero.
Intuition: An inactive constraint does not affect the solution — it is as if the constraint were not there. Only active constraints “push” on the solution, and the multiplier \(\mu_i^*\) measures the strength of that push.
The KKT conditions are sufficient for a global minimum when \(f\) is convex and the constraints define a convex feasible set (i.e., \(g_i\) convex, \(h_j\) affine).
Let us verify the KKT conditions in code on the portfolio problem.
# Checking KKT conditions on the portfolio problem
import numpy as np
from scipy.optimize import minimize
np.random.seed(42)
p = 5
mu = np.array([0.12, 0.10, 0.07, 0.03, 0.14])
A = np.random.randn(p, p) * 0.1
Sigma = A.T @ A + 0.05 * np.eye(p)
sigma2_max = 0.02
# Solve with scipy
constraints = [
{'type': 'eq', 'fun': lambda w: np.sum(w) - 1},
{'type': 'ineq', 'fun': lambda w: sigma2_max - w @ Sigma @ w},
]
bounds = [(0, 1)] * p
w0 = np.ones(p) / p
res = minimize(lambda w: -mu @ w, w0, method='SLSQP',
bounds=bounds, constraints=constraints)
w_star = res.x
# Check all four KKT conditions
print("=== KKT Condition Check ===\n")
# 1. Primal feasibility
eq_violation = abs(np.sum(w_star) - 1)
risk = w_star @ Sigma @ w_star
print("1. PRIMAL FEASIBILITY")
print(f" Sum of weights = {np.sum(w_star):.8f} (violation: {eq_violation:.2e})")
print(f" Portfolio variance = {risk:.6f} <= {sigma2_max} : "
f"{'OK' if risk <= sigma2_max + 1e-8 else 'VIOLATED'}")
for i in range(p):
status = "ACTIVE" if w_star[i] < 1e-6 else "inactive"
print(f" w[{i}] = {w_star[i]:.6f} >= 0 : {status}")
# 2. Complementary slackness (for bound constraints)
print("\n2. COMPLEMENTARY SLACKNESS")
print(" For each w_i >= 0 constraint:")
for i in range(p):
# mu_i * w_i should = 0 (either mu_i = 0 or w_i = 0)
active = w_star[i] < 1e-6
print(f" Asset {i}: w={w_star[i]:.6f}, "
f"{'constraint active (multiplier can be > 0)' if active else 'constraint inactive (multiplier = 0)'}")
# 3. Risk constraint
risk_active = abs(risk - sigma2_max) < 1e-4
print(f"\n Risk constraint: {'ACTIVE' if risk_active else 'inactive'} "
f"(variance = {risk:.6f}, budget = {sigma2_max})")
# Solution quality
print(f"\n=== Optimal Portfolio ===")
print(f"{'Asset':>5s} {'Weight':>8s} {'Return':>8s} {'Contribution':>12s}")
print("-" * 38)
for i in range(p):
print(f"{i:5d} {w_star[i]:8.4f} {mu[i]:8.2%} {w_star[i]*mu[i]:12.4%}")
print(f"{'Total':>5s} {np.sum(w_star):8.4f} {mu @ w_star:8.2%}")
16.3 Examples in Maximum Likelihood¶
Multinomial MLE with Simplex Constraint¶
Suppose \(X = (X_1, \dots, X_K) \sim \text{Multinomial}(n, \boldsymbol{\pi})\) with \(\boldsymbol{\pi} = (\pi_1, \dots, \pi_K)\). The log-likelihood is
We maximize subject to \(\sum_k \pi_k = 1\) and \(\pi_k \geq 0\).
Using Lagrange multipliers for the equality constraint (the positivity constraints turn out to be inactive at the interior solution):
Setting \(\partial\mathcal{L}/\partial\pi_k = 0\):
Summing over \(k\): \(1 = n/\lambda\), so \(\lambda = n\) and
as expected. Let us verify this Lagrange multiplier derivation numerically.
# Multinomial MLE: Lagrange multiplier solution vs. direct formula
import numpy as np
from scipy.optimize import minimize
np.random.seed(42)
K = 6
counts = np.array([45, 30, 15, 5, 3, 2])
n = counts.sum()
# Analytical solution from Lagrange multipliers
pi_analytical = counts / n
# Numerical solution using scipy with constraints
constraints = [{'type': 'eq', 'fun': lambda pi: np.sum(pi) - 1}]
bounds = [(1e-10, 1)] * K
pi0 = np.ones(K) / K
res = minimize(lambda pi: -np.sum(counts * np.log(pi)), pi0,
method='SLSQP', bounds=bounds, constraints=constraints)
pi_numerical = res.x
print(f"{'Category':>8s} {'Count':>5s} {'Analytical':>10s} {'Numerical':>10s} {'Match':>6s}")
print("-" * 48)
for k in range(K):
match = "yes" if abs(pi_analytical[k] - pi_numerical[k]) < 1e-6 else "no"
print(f"{k+1:8d} {counts[k]:5d} {pi_analytical[k]:10.4f} "
f"{pi_numerical[k]:10.4f} {match:>6s}")
print(f"\nLagrange multiplier lambda = n = {n}")
The Full Portfolio Optimization¶
Now let us solve the portfolio problem properly with constraints, comparing the analytical Lagrange approach with numerical optimization.
# Portfolio optimization: solve analytically (equality only) and numerically
import numpy as np
from scipy.optimize import minimize
np.random.seed(42)
p = 5
mu = np.array([0.12, 0.10, 0.07, 0.03, 0.14])
A = np.random.randn(p, p) * 0.1
Sigma = A.T @ A + 0.05 * np.eye(p)
# Minimum-variance portfolio (equality constraint only: w^T 1 = 1)
# Lagrangian: L = 0.5 w^T Sigma w - lambda*(1^T w - 1)
# dL/dw = Sigma w - lambda 1 = 0 => w = lambda * Sigma^{-1} 1
# Constraint: 1^T w = 1 => lambda = 1 / (1^T Sigma^{-1} 1)
Sigma_inv = np.linalg.inv(Sigma)
ones = np.ones(p)
lam_mv = 1.0 / (ones @ Sigma_inv @ ones)
w_mv_analytic = lam_mv * Sigma_inv @ ones
# Numerical
res_mv = minimize(lambda w: 0.5 * w @ Sigma @ w, np.ones(p)/p,
method='SLSQP',
constraints=[{'type': 'eq', 'fun': lambda w: np.sum(w) - 1}])
w_mv_numerical = res_mv.x
print("Minimum-variance portfolio (equality constraint only):")
print(f"{'Asset':>5s} {'Analytic':>10s} {'Numerical':>10s}")
print("-" * 30)
for i in range(p):
print(f"{i:5d} {w_mv_analytic[i]:10.4f} {w_mv_numerical[i]:10.4f}")
print(f"{'Var':>5s} {w_mv_analytic @ Sigma @ w_mv_analytic:10.6f} "
f"{w_mv_numerical @ Sigma @ w_mv_numerical:10.6f}")
# Full problem: maximize return with risk budget + no short selling
sigma2_max = 0.02
constraints_full = [
{'type': 'eq', 'fun': lambda w: np.sum(w) - 1},
{'type': 'ineq', 'fun': lambda w: sigma2_max - w @ Sigma @ w},
]
bounds = [(0, 1)] * p
res_full = minimize(lambda w: -mu @ w, np.ones(p)/p,
method='SLSQP', bounds=bounds, constraints=constraints_full)
print(f"\nFull constrained portfolio (risk budget {sigma2_max}):")
print(f" Weights: {np.round(res_full.x, 4)}")
print(f" Return: {mu @ res_full.x:.4f}")
print(f" Risk: {res_full.x @ Sigma @ res_full.x:.6f}")
Positive-Definiteness Constraint on Covariance¶
In multivariate normal MLE, \(\boldsymbol{\Sigma}\) must be positive definite. This is a matrix inequality constraint. In practice, this is handled by reparameterization (Section 16.7) rather than by explicit KKT conditions — for instance, working with the Cholesky factor \(\mathbf{L}\) such that \(\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^{\!\top}\).
Variance Positivity¶
For a normal model \(\mathcal{N}(\mu, \sigma^2)\), the constraint \(\sigma^2 > 0\) is automatically satisfied by the MLE \(\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2 > 0\) (for non-degenerate data). But in mixed models or penalized settings, the boundary \(\sigma^2 = 0\) can be relevant, and the KKT conditions (specifically complementary slackness) govern the behavior.
16.4 Augmented Lagrangian Method¶
Motivation¶
The basic Lagrangian approach requires solving the KKT system exactly, which can be difficult. The augmented Lagrangian method adds a quadratic penalty term to improve numerical conditioning.
Formulation¶
For equality constraints \(h_j(\boldsymbol{\theta}) = 0\), the augmented Lagrangian is
where \(\rho > 0\) is a penalty parameter. The last term penalizes constraint violations, encouraging feasibility even when the Lagrange multiplier estimates are imperfect.
The augmented Lagrangian combines the best of both worlds: the Lagrange multipliers handle the constraint “exactly” in the limit, while the quadratic penalty provides numerical stability along the way.
Algorithm¶
Initialize: theta_0, lambda_0, rho_0
For k = 0, 1, 2, ...
1. Minimize L_rho(theta, lambda_k) with respect to theta -> theta_{k+1}
2. Update multipliers: lambda_{k+1,j} = lambda_{k,j} - rho_k h_j(theta_{k+1})
3. Optionally increase rho_k -> rho_{k+1}
4. Check convergence
The multiplier update in step 2 is a dual ascent step: it moves the multipliers in the direction that increases the Lagrangian, encouraging the constraints to be satisfied more tightly.
Let us implement the augmented Lagrangian from scratch and watch it converge.
# Augmented Lagrangian method: min x^2 + 2y^2 s.t. x + y = 1
import numpy as np
np.random.seed(42)
# Objective and constraint
def f(x, y):
return x**2 + 2*y**2
def h(x, y):
return x + y - 1 # constraint: h = 0
def grad_f(x, y):
return np.array([2*x, 4*y])
def grad_h(x, y):
return np.array([1.0, 1.0])
# Augmented Lagrangian method
theta = np.array([0.0, 0.0]) # initial guess
lam = 0.0 # initial multiplier
rho = 1.0 # penalty parameter
print(f"{'Iter':>4s} {'x':>8s} {'y':>8s} {'lambda':>8s} {'rho':>6s} "
f"{'f(x,y)':>8s} {'|h(x,y)|':>10s}")
print("-" * 62)
for k in range(15):
# Inner loop: minimize augmented Lagrangian w.r.t. theta using GD
for _ in range(100):
x, y = theta
hval = h(x, y)
grad = grad_f(x, y) - lam * grad_h(x, y) + rho * hval * grad_h(x, y)
theta = theta - 0.05 * grad
x, y = theta
hval = h(x, y)
print(f"{k:4d} {x:8.5f} {y:8.5f} {lam:8.4f} {rho:6.1f} "
f"{f(x, y):8.5f} {abs(hval):10.2e}")
# Update multiplier (dual ascent)
lam = lam - rho * hval
# Optionally increase rho
if abs(hval) > 0.01:
rho *= 1.5
if abs(hval) < 1e-8:
print(f"Converged at iteration {k}.")
break
print(f"\nSolution: x={theta[0]:.6f}, y={theta[1]:.6f}")
print(f"Exact: x={2/3:.6f}, y={1/3:.6f}")
Advantages over a pure penalty method: The augmented Lagrangian converges to the correct solution for a finite value of \(\rho\); a pure penalty method requires \(\rho \to \infty\), causing ill-conditioning.
Connection to ADMM¶
The Alternating Direction Method of Multipliers (ADMM) applies the augmented Lagrangian method to a problem that has been split into two blocks:
The augmented Lagrangian is minimized alternately over \(\boldsymbol{\theta}_1\) and \(\boldsymbol{\theta}_2\), followed by a multiplier update. ADMM is widely used in machine learning for distributed optimization and for problems with structured regularizers (e.g., LASSO, group LASSO).
16.5 Barrier (Interior-Point) Methods¶
Motivation¶
Interior-point methods approach the constrained optimum from strictly inside the feasible region, using a barrier function that goes to infinity at the boundary.
Intuition
Imagine you are optimizing in a room, and the walls represent constraints. A barrier function creates a “force field” near the walls that repels you from getting too close. As the optimization proceeds, the force field is gradually weakened (by increasing \(t\)), allowing you to creep closer and closer to the walls — but you never actually hit them. In the limit, you end up right at the constrained optimum, possibly touching one or more walls.
Log-Barrier Function¶
For inequality constraints \(g_i(\boldsymbol{\theta}) \leq 0\), the log-barrier function is
Note that \(B(\boldsymbol{\theta}) \to +\infty\) as any constraint becomes active (\(g_i \to 0^-\)), so minimizing \(f + (1/t)\,B\) keeps the iterates strictly inside the feasible region.
The Barrier Problem¶
For a parameter \(t > 0\), solve the unconstrained problem
As \(t \to \infty\), the barrier term vanishes and the solution approaches the constrained optimum.
Let us implement the barrier method on our portfolio problem and watch the barrier parameter grow as the solution converges to the constraint boundary.
# Barrier method for portfolio optimization
# Maximize mu^T w s.t. w^T Sigma w <= sigma2_max, sum(w)=1, w >= 0
# Reformulated as: min -mu^T w with barrier for inequality constraints
import numpy as np
np.random.seed(42)
p = 3 # small for clarity
mu = np.array([0.12, 0.07, 0.14])
A = np.random.randn(p, p) * 0.1
Sigma = A.T @ A + 0.05 * np.eye(p)
sigma2_max = 0.03
# Constraints: w >= 0 (p constraints), w^T Sigma w <= sigma2_max (1 constraint)
# We handle sum(w) = 1 by projecting after each step
def barrier_obj(w, t):
"""Objective + (1/t) * barrier."""
obj = -mu @ w
# Barrier for w_i > 0
if np.any(w <= 0):
return 1e10
obj -= (1.0 / t) * np.sum(np.log(w))
# Barrier for sigma2_max - w^T Sigma w > 0
slack = sigma2_max - w @ Sigma @ w
if slack <= 0:
return 1e10
obj -= (1.0 / t) * np.log(slack)
return obj
def barrier_grad(w, t):
"""Gradient of objective + (1/t) * barrier."""
grad = -mu
grad -= (1.0 / t) / w # d/dw [-log(w_i)] = -1/w_i
slack = sigma2_max - w @ Sigma @ w
grad -= (1.0 / t) * (-2 * Sigma @ w) / slack # d/dw [-log(slack)]
return grad
def project_simplex(w):
"""Project w onto the simplex {w >= 0, sum(w) = 1}."""
# Simple projection: normalize positive part
w = np.maximum(w, 1e-10)
return w / np.sum(w)
w = np.ones(p) / p # start at equal weights (feasible)
print(f"{'t':>8s} ", end="")
for i in range(p):
print(f"{'w'+str(i):>8s} ", end="")
print(f"{'Return':>8s} {'Risk':>8s} {'Barrier obj':>12s}")
print("-" * 70)
for t in [1, 5, 20, 100, 500, 2000, 10000]:
# Inner loop: projected gradient descent
for step in range(500):
g = barrier_grad(w, t)
lr = 0.0005 / (1 + step * 0.001)
w = w - lr * g
w = project_simplex(w)
# Ensure feasibility
while w @ Sigma @ w >= sigma2_max:
w = 0.95 * w + 0.05 * np.ones(p) / p
ret = mu @ w
risk = w @ Sigma @ w
bobj = barrier_obj(w, t)
print(f"{t:8d} ", end="")
for i in range(p):
print(f"{w[i]:8.4f} ", end="")
print(f"{ret:8.4f} {risk:8.6f} {bobj:12.6f}")
print(f"\nAs t -> inf, the barrier weakens and the solution approaches")
print(f"the constrained optimum. Risk budget: {sigma2_max}")
The Central Path¶
The set of solutions \(\{\boldsymbol{\theta}^*(t) : t > 0\}\) traces a smooth curve in parameter space called the central path. As \(t\) increases, the points on the central path approach the constrained optimum.
The first-order optimality condition for (39) is
Defining \(\mu_i = 1/(t \cdot (-g_i(\boldsymbol{\theta})))\), this becomes
which converges to the KKT conditions (with complementary slackness \(\mu_i g_i = 0\)) as \(t \to \infty\).
Algorithm¶
Initialize: theta_0 strictly feasible, t_0 > 0, growth factor kappa > 1
For k = 0, 1, 2, ...
1. Solve the barrier problem (16.7) using Newton's method -> theta_k*
2. Increase t: t_{k+1} = kappa * t_k
3. Use theta_k* as the starting point for the next barrier problem
4. Stop when q/t < epsilon (duality gap bound)
Interior-point methods have excellent theoretical properties: they can solve convex problems with \(q\) inequality constraints in \(O(\sqrt{q}\,\log(1/\epsilon))\) Newton steps.
16.6 Penalty Methods¶
Quadratic Penalty Method¶
The simplest approach to constrained optimization is to add a penalty for constraint violations. For equality constraints:
For inequality constraints \(g_i(\boldsymbol{\theta}) \leq 0\):
Let us implement the penalty method and see how increasing \(\rho\) drives the solution toward feasibility—and also how it causes ill-conditioning.
# Penalty method: min x^2 + 2y^2 s.t. x + y = 1
# Compare with exact solution (2/3, 1/3) and augmented Lagrangian
import numpy as np
np.random.seed(42)
exact_x, exact_y = 2.0/3, 1.0/3
exact_f = exact_x**2 + 2*exact_y**2
print(f"{'rho':>10s} {'x':>8s} {'y':>8s} {'f(x,y)':>8s} "
f"{'|h(x,y)|':>10s} {'||error||':>10s} {'cond(H)':>10s}")
print("-" * 78)
for rho in [1, 10, 100, 1000, 10000, 100000]:
# Penalized objective: f(x,y) + (rho/2)*(x+y-1)^2
# Gradient: [2x + rho*(x+y-1), 4y + rho*(x+y-1)]
# Hessian: [[2+rho, rho], [rho, 4+rho]]
# Solve: H * [x,y] = [rho, rho] (from setting gradient = 0)
H = np.array([[2 + rho, rho], [rho, 4 + rho]])
b = np.array([rho, rho])
sol = np.linalg.solve(H, b)
x, y = sol
fval = x**2 + 2*y**2
hval = x + y - 1
err = np.sqrt((x - exact_x)**2 + (y - exact_y)**2)
cond = np.linalg.cond(H)
print(f"{rho:10d} {x:8.6f} {y:8.6f} {fval:8.6f} "
f"{abs(hval):10.2e} {err:10.2e} {cond:10.1f}")
print(f"\nExact: x={exact_x:.6f}, y={exact_y:.6f}, f={exact_f:.6f}")
print("Note: as rho increases, h(x,y)->0 but cond(H) grows -> ill-conditioning!")
Convergence¶
Theorem. Let \(\boldsymbol{\theta}(\rho)\) be the minimizer of the quadratic penalty problem for a given \(\rho\). Under mild conditions, as \(\rho \to \infty\), any limit point of \(\boldsymbol{\theta}(\rho)\) is a solution of the constrained problem.
Proof sketch. For any feasible point \(\bar{\boldsymbol{\theta}}\):
since \(\mathbf{h}(\bar{\boldsymbol{\theta}}) = \mathbf{0}\). Therefore \(\|\mathbf{h}(\boldsymbol{\theta}(\rho))\|^2 \leq 2[f(\bar{\boldsymbol{\theta}}) - f(\boldsymbol{\theta}(\rho))]/\rho\). As \(\rho \to \infty\), the constraint violation shrinks to zero (assuming \(f\) is bounded below on the feasible set).
Drawback: For large \(\rho\), the penalized objective becomes ill-conditioned (its Hessian has eigenvalues of order \(\rho\)), making the subproblem hard to solve. This motivates the augmented Lagrangian method (Section 16.4), which achieves feasibility without driving \(\rho\) to infinity.
Common Pitfall
The quadratic penalty method is tempting because of its simplicity, but beware of the ill-conditioning trap. As \(\rho\) grows, the Hessian becomes dominated by the penalty term, making Newton’s method and even gradient descent struggle with numerical precision. If you find yourself needing \(\rho > 10^6\), switch to the augmented Lagrangian method or reparameterize the problem.
16.7 Reparameterization as an Alternative to Constraints¶
In practice, the most common way to handle parameter constraints in statistical models is to transform the parameters so that the transformed parameters are unconstrained. This is especially convenient with gradient-based optimization.
Why Reparameterize?
Reparameterization is often the simplest and most robust approach. Instead of fighting the constraints with multipliers and penalties, you change variables so the constraints simply vanish. The transformed problem is unconstrained, so you can use any method from Chapters 12–14 directly. The catch is that the transformation must be smooth and invertible, and the transformed parameter space may have different curvature properties.
Positivity: The Log Transform¶
If \(\sigma^2 > 0\), define \(\phi = \log \sigma^2\). Then \(\phi \in \mathbb{R}\) is unconstrained, and \(\sigma^2 = e^{\phi}\) is automatically positive.
The gradient transforms as
# Log reparameterization: MLE of normal variance
import numpy as np
np.random.seed(42)
n = 100
sigma2_true = 4.0
x = np.random.normal(0, np.sqrt(sigma2_true), n)
# Direct: gradient ascent on sigma^2 (must stay positive!)
sigma2 = 1.0
print("--- Without reparameterization (constrained) ---")
print(f"{'Iter':>4s} {'sigma^2':>10s} {'grad':>10s}")
for k in range(8):
grad = -n / (2 * sigma2) + np.sum(x**2) / (2 * sigma2**2)
sigma2 = sigma2 + 0.01 * grad
sigma2 = max(sigma2, 1e-6) # MANUAL clipping to enforce positivity
print(f"{k:4d} {sigma2:10.4f} {grad:10.4f}")
# Reparameterized: gradient ascent on phi = log(sigma^2) (unconstrained!)
phi = 0.0 # log(1.0) = 0
print("\n--- With log reparameterization (unconstrained) ---")
print(f"{'Iter':>4s} {'phi':>10s} {'sigma^2':>10s} {'grad_phi':>10s}")
for k in range(8):
sigma2_reparam = np.exp(phi)
# Chain rule: d(ell)/d(phi) = sigma^2 * d(ell)/d(sigma^2)
grad_sigma2 = -n / (2 * sigma2_reparam) + np.sum(x**2) / (2 * sigma2_reparam**2)
grad_phi = sigma2_reparam * grad_sigma2 # no clipping needed!
phi = phi + 0.01 * grad_phi
print(f"{k:4d} {phi:10.4f} {np.exp(phi):10.4f} {grad_phi:10.4f}")
print(f"\nTrue sigma^2 = {sigma2_true:.4f}")
print(f"Sample variance = {np.var(x):.4f}")
Bounded Parameters: The Logit Transform¶
If \(p \in (0, 1)\), define \(\eta = \log(p / (1-p))\). Then \(\eta \in \mathbb{R}\), and \(p = 1/(1 + e^{-\eta})\) is automatically in \((0, 1)\).
More generally, for \(\theta \in (a, b)\):
# Logit reparameterization for a binomial proportion
import numpy as np
from scipy.special import expit, logit
np.random.seed(42)
p_true = 0.73
n = 50
x = np.random.binomial(n, p_true) # x successes out of n trials
print(f"Observed: {x}/{n} = {x/n:.2f}, True p = {p_true}")
# Gradient ascent on eta = logit(p), unconstrained
eta = 0.0 # start at p = 0.5
print(f"\n{'Iter':>4s} {'eta':>8s} {'p':>8s} {'grad_eta':>10s} {'log-lik':>10s}")
for k in range(15):
p = expit(eta)
# Log-likelihood: x*log(p) + (n-x)*log(1-p)
ll = x * np.log(p) + (n - x) * np.log(1 - p)
# d(ell)/d(eta) = x - n*p (a beautiful simplification!)
grad_eta = x - n * p
eta = eta + 0.05 * grad_eta
print(f"{k:4d} {eta:8.4f} {p:8.4f} {grad_eta:10.4f} {ll:10.4f}")
print(f"\nMLE: p = {x/n:.4f} (exact), p = {expit(eta):.4f} (reparameterized GD)")
Simplex Constraint: The Softmax Transform¶
If \(\boldsymbol{\pi} = (\pi_1, \dots, \pi_K)\) with \(\pi_k \geq 0\) and \(\sum_k \pi_k = 1\), define unconstrained \(\boldsymbol{\eta} = (\eta_1, \dots, \eta_{K-1}) \in \mathbb{R}^{K-1}\) and set
where \(\eta_K = 0\) for identifiability (one degree of freedom is removed by the sum-to-one constraint).
The softmax transform is smooth and invertible (on the interior of the simplex), so gradient methods can be applied to \(\boldsymbol{\eta}\) without any constraint-handling machinery.
# Softmax reparameterization: multinomial MLE without constraints
import numpy as np
from scipy.special import softmax
np.random.seed(42)
K = 5
counts = np.array([45, 30, 15, 5, 5])
n = counts.sum()
pi_true = counts / n
# Unconstrained parameters (K-1 free, last fixed at 0)
eta = np.zeros(K - 1)
print(f"{'Iter':>4s} ", end="")
for k in range(K):
print(f"{'pi_'+str(k+1):>8s} ", end="")
print(f"{'||pi - true||':>14s}")
print("-" * 65)
for step in range(200):
# Forward: softmax to get probabilities
eta_full = np.append(eta, 0.0)
pi = softmax(eta_full)
# Gradient: d(ell)/d(eta_k) = counts_k - n*pi_k for k=1,...,K-1
grad = counts[:K-1] - n * pi[:K-1]
eta = eta + 0.01 * grad
if step % 40 == 0 or step == 199:
err = np.linalg.norm(pi - pi_true)
print(f"{step:4d} ", end="")
for k in range(K):
print(f"{pi[k]:8.4f} ", end="")
print(f"{err:14.6f}")
print(f"\nTrue: {pi_true}")
print(f"Sum: {pi.sum():.8f} (always exactly 1 by construction)")
The key insight: the softmax guarantees \(\pi_k > 0\) and \(\sum_k \pi_k = 1\) by construction, so we never need to worry about constraint violations.
Positive-Definite Matrices: The Cholesky Factor¶
For a covariance matrix \(\boldsymbol{\Sigma}\) that must be positive definite, write \(\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^{\!\top}\) where \(\mathbf{L}\) is lower triangular with positive diagonal entries. Optimize over the entries of \(\mathbf{L}\), parameterizing the diagonal as \(L_{jj} = e^{\phi_j}\) to ensure positivity.
This gives \(p(p+1)/2\) unconstrained parameters, which is the correct number of degrees of freedom for a \(p \times p\) symmetric positive definite matrix.
# Cholesky reparameterization for covariance MLE
import numpy as np
np.random.seed(42)
d = 3
# True covariance
Sigma_true = np.array([[4.0, 1.0, 0.5],
[1.0, 2.0, 0.3],
[0.5, 0.3, 1.0]])
L_true = np.linalg.cholesky(Sigma_true)
# Generate data
n = 500
X = np.random.randn(n, d) @ L_true.T
# Parameterize: L has d*(d+1)/2 free parameters
# Diagonal: L_jj = exp(phi_j) (always positive)
# Off-diagonal: unconstrained
n_params = d * (d + 1) // 2
params = np.zeros(n_params) # initial: L = I
def params_to_L(params, d):
"""Convert unconstrained params to lower triangular L."""
L = np.zeros((d, d))
idx = 0
for i in range(d):
for j in range(i + 1):
if i == j:
L[i, j] = np.exp(params[idx]) # positive diagonal
else:
L[i, j] = params[idx] # free off-diagonal
idx += 1
return L
def neg_log_lik(params, X, d):
"""Negative log-likelihood for multivariate normal (mu=0)."""
L = params_to_L(params, d)
Sigma = L @ L.T
n = len(X)
sign, logdet = np.linalg.slogdet(Sigma)
Sigma_inv = np.linalg.inv(Sigma)
return 0.5 * n * logdet + 0.5 * np.sum(X @ Sigma_inv * X)
# Optimize using numerical gradient descent
print(f"{'Iter':>4s} {'neg-LL':>12s} {'||Sigma - Sigma_hat||':>22s}")
print("-" * 44)
lr = 0.0001
for step in range(500):
# Numerical gradient
grad = np.zeros(n_params)
eps = 1e-5
f0 = neg_log_lik(params, X, d)
for j in range(n_params):
params_p = params.copy()
params_p[j] += eps
grad[j] = (neg_log_lik(params_p, X, d) - f0) / eps
params = params - lr * grad
if step % 100 == 0 or step == 499:
L = params_to_L(params, d)
Sigma_hat = L @ L.T
err = np.linalg.norm(Sigma_hat - Sigma_true, 'fro')
print(f"{step:4d} {f0:12.2f} {err:22.4f}")
L_final = params_to_L(params, d)
Sigma_hat = L_final @ L_final.T
print(f"\nTrue Sigma:\n{Sigma_true}")
print(f"\nEstimated Sigma:\n{np.round(Sigma_hat, 3)}")
print(f"Positive definite: {np.all(np.linalg.eigvalsh(Sigma_hat) > 0)}")
Correlation Matrices: Fisher’s z-Transform¶
For a correlation parameter \(r \in (-1, 1)\), use
Then \(z \in \mathbb{R}\) is unconstrained.
# Fisher's z-transform for correlation estimation
import numpy as np
np.random.seed(42)
r_true = 0.85
n = 100
# Generate correlated data
Sigma = np.array([[1.0, r_true], [r_true, 1.0]])
L = np.linalg.cholesky(Sigma)
data = np.random.randn(n, 2) @ L.T
# Gradient ascent on z = arctanh(r), unconstrained
z = 0.0 # start at r = 0
print(f"{'Iter':>4s} {'z':>8s} {'r=tanh(z)':>10s} {'log-lik':>10s}")
for k in range(20):
r = np.tanh(z)
# Bivariate normal log-likelihood (mu=0, sigma=1, correlation r)
det = 1 - r**2
ll = -0.5 * n * np.log(det) - 0.5 / det * np.sum(
data[:, 0]**2 - 2*r*data[:, 0]*data[:, 1] + data[:, 1]**2)
# d(ell)/d(r)
dll_dr = n * r / det + (1 / det**2) * np.sum(
-r * (data[:, 0]**2 + data[:, 1]**2) + (1 + r**2) * data[:, 0] * data[:, 1])
# Chain rule: d(ell)/d(z) = d(ell)/d(r) * d(r)/d(z) = dll_dr * (1 - r^2)
grad_z = dll_dr * (1 - r**2)
z = z + 0.001 * grad_z
if k % 4 == 0:
print(f"{k:4d} {z:8.4f} {np.tanh(z):10.4f} {ll:10.2f}")
print(f"\nTrue r = {r_true}, Sample r = {np.corrcoef(data.T)[0,1]:.4f}, "
f"Estimated r = {np.tanh(z):.4f}")
All Four Reparameterizations at a Glance¶
# Summary: all reparameterization strategies
import numpy as np
print(f"{'Constraint':>25s} {'Transform':>25s} {'Inverse':>25s}")
print("-" * 80)
examples = [
("sigma^2 > 0", "phi = log(sigma^2)", "sigma^2 = exp(phi)"),
("p in (0, 1)", "eta = logit(p)", "p = sigmoid(eta)"),
("r in (-1, 1)", "z = arctanh(r)", "r = tanh(z)"),
("pi on simplex", "eta_k unconstrained", "pi = softmax(eta)"),
("Sigma pos. def.", "L = chol(Sigma)", "Sigma = L @ L.T"),
]
for constraint, transform, inverse in examples:
print(f"{constraint:>25s} {transform:>25s} {inverse:>25s}")
# Numerical demo: all transforms preserve constraints
print("\nNumerical verification:")
phi = -2.5
print(f" exp({phi}) = {np.exp(phi):.4f} > 0: True")
eta = 3.7
p = 1 / (1 + np.exp(-eta))
print(f" sigmoid({eta}) = {p:.4f} in (0,1): {0 < p < 1}")
z = 1.2
r = np.tanh(z)
print(f" tanh({z}) = {r:.4f} in (-1,1): {-1 < r < 1}")
eta_vec = np.array([1.0, -0.5, 0.3])
from scipy.special import softmax
pi = softmax(np.append(eta_vec, 0.0))
print(f" softmax({list(np.round(eta_vec, 1))} + [0]) = {np.round(pi, 3)}, "
f"sum={pi.sum():.6f}")
When to Reparameterize vs. Use Constrained Methods¶
Scenario |
Reparameterize |
Constrained method |
|---|---|---|
Simple positivity/boundedness |
Preferred |
Overkill |
Simplex (few categories) |
Preferred (softmax) |
Fine too |
Complex coupled constraints |
May be difficult |
Preferred |
Linear constraints (\(\mathbf{A}\boldsymbol{\theta} = \mathbf{b}\)) |
Null-space projection |
Lagrangian |
Interpretability of Hessian |
Harder (need Jacobian) |
Original parameterization |
16.8 Putting It All Together: Constrained MLE¶
A typical workflow for constrained maximum likelihood is:
Identify the constraints. Are they simple (positivity, simplex) or complex (coupled inequalities, linear equalities)?
Try reparameterization first. For simple constraints, the log, logit, softmax, or Cholesky transforms eliminate the constraints. Apply any unconstrained method from Chapter 12: Gradient Methods, Chapter 13: Newton and Scoring Methods, or Chapter 14: Quasi-Newton Methods.
If reparameterization is infeasible, use:
Lagrange multipliers (for small problems with equality constraints).
Interior-point methods (for convex problems with many inequality constraints).
Augmented Lagrangian / ADMM (for large-scale or separable problems).
Penalty methods (for quick-and-dirty solutions).
Verify the KKT conditions at the solution. Check that all constraints are satisfied, multipliers have the correct sign, and complementary slackness holds.
Compute standard errors using the constrained observed information (which may involve the bordered Hessian or the Hessian of the Lagrangian restricted to the constraint manifold).
Let us run a final comprehensive example comparing all three approaches— penalty, augmented Lagrangian, and reparameterization—on the same problem.
# Head-to-head: three approaches to the same constrained problem
# Problem: MLE of Bernoulli p with constraint p in (0.2, 0.8)
import numpy as np
from scipy.special import expit
np.random.seed(42)
n = 30
p_true = 0.65
x = np.random.binomial(1, p_true, n)
p_mle = x.mean()
print(f"Observed: {x.sum()}/{n} successes, unconstrained MLE = {p_mle:.4f}")
# Method 1: Quadratic penalty for constraint p >= 0.2 and p <= 0.8
p = 0.5
for rho in [10, 100, 1000]:
for _ in range(200):
grad = x.sum() / p - (n - x.sum()) / (1 - p)
# Penalty gradient for g1: 0.2 - p <= 0 and g2: p - 0.8 <= 0
if p < 0.2:
grad -= rho * (0.2 - p)
if p > 0.8:
grad -= rho * (p - 0.8)
p = p + 0.001 * grad
p = np.clip(p, 0.01, 0.99)
p_penalty = p
print(f"\nPenalty method: p = {p_penalty:.6f}")
# Method 2: Reparameterization via logit with bounds
# Map p in (0.2, 0.8) to eta in R via eta = log((p-0.2)/(0.8-p))
a, b = 0.2, 0.8
eta = 0.0 # start at midpoint p = 0.5
for _ in range(500):
p = a + (b - a) * expit(eta)
grad_p = x.sum() / p - (n - x.sum()) / (1 - p)
dp_deta = (b - a) * expit(eta) * (1 - expit(eta))
grad_eta = grad_p * dp_deta
eta = eta + 0.01 * grad_eta
p_reparam = a + (b - a) * expit(eta)
print(f"Reparameterization: p = {p_reparam:.6f}")
# Method 3: Projected gradient (project onto [0.2, 0.8] after each step)
p = 0.5
for _ in range(500):
grad = x.sum() / p - (n - x.sum()) / (1 - p)
p = p + 0.001 * grad
p = np.clip(p, 0.2, 0.8) # project onto feasible set
p_projected = p
print(f"Projected gradient: p = {p_projected:.6f}")
# Analytical: MLE is x.mean(), clipped to [0.2, 0.8]
p_exact = np.clip(p_mle, 0.2, 0.8)
print(f"Exact (clipped MLE): p = {p_exact:.6f}")
16.9 Summary¶
Constrained optimization is an essential part of likelihood-based inference. Lagrange multipliers handle equality constraints by introducing dual variables whose values measure the sensitivity of the optimum to the constraints. The KKT conditions extend this framework to inequality constraints, with complementary slackness determining which constraints are active. Algorithmic approaches — augmented Lagrangian, barrier methods, and penalty methods — provide practical ways to solve constrained problems numerically.
In statistical practice, the most common approach is reparameterization: the log transform for positivity, the logit for bounded parameters, the softmax for simplex constraints, and the Cholesky decomposition for positive definiteness. These transforms convert the constrained problem into an unconstrained one, allowing direct application of the gradient methods (Chapter 12: Gradient Methods), Newton methods (Chapter 13: Newton and Scoring Methods), and quasi-Newton methods (Chapter 14: Quasi-Newton Methods) developed in earlier chapters. When reparameterization is not feasible or when the constraints have complex structure, the formal constrained-optimization methods of this chapter provide the necessary tools.