Unit 6 - Notes

INT255

Unit 6: Regularization and Generalization Theory

1. The Bias-Variance Trade-off

The bias-variance trade-off is a fundamental concept in supervised learning that describes the tension between model complexity and its ability to generalize to new, unseen data. It decomposes the expected generalization error of a learning algorithm into three components.

1.1. Decomposing Generalization Error

Let's consider a regression problem. The true underlying relationship between features x and target y is given by y = f(x) + ε, where ε is random noise with zero mean and variance σ².

  • f(x): The true, unknown function we want to approximate.
  • ε: Irreducible noise (e.g., measurement error). E[ε] = 0, Var(ε) = σ².

We use a dataset D to train a model, resulting in a learned hypothesis h_D(x). Note that the hypothesis h_D is a random variable because it depends on the specific training dataset D we happened to draw from the underlying data distribution.

The goal is to minimize the Expected Prediction Error (EPE) for a new data point (x, y):

EPE(x) = E_D[E_y[(y - h_D(x))² | x]]

This is the expected squared error at a point x, averaged over all possible training datasets D and all possible values of y for that x. This can be decomposed as follows:

EPE(x) = (Bias)² + Variance + Irreducible Error
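The decomposition can be verified by adding and subtracting E_D[h_D(x)] inside the square; both cross terms vanish because E[ε] = 0 and because h_D(x) − E_D[h_D(x)] has zero mean over D. In LaTeX form:

```latex
\mathbb{E}_{D,\varepsilon}\big[(y - h_D(x))^2\big]
  = \mathbb{E}_D\big[(f(x) - h_D(x))^2\big] + \sigma^2
  = \underbrace{\big(f(x) - \mathbb{E}_D[h_D(x)]\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\big[\big(h_D(x) - \mathbb{E}_D[h_D(x)]\big)^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible}}
```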

Let's break down each term:

a) Bias

Bias measures the difference between the average prediction of our model and the correct value f(x). It represents the model's inherent assumptions and simplifications.

Bias[h_D(x)] = E_D[h_D(x)] - f(x)

  • High Bias: The model's average prediction is far from the true function. This happens when the model is too simple to capture the underlying structure of the data (e.g., fitting a linear model to a quadratic relationship). This leads to underfitting.
  • Low Bias: The model's average prediction is close to the true function. This indicates the model is flexible enough to capture the data's true structure.

b) Variance

Variance measures how much the model's prediction h_D(x) changes for a given point x if we train it on different datasets D. It represents the model's sensitivity to the specific training data.

Var[h_D(x)] = E_D[(h_D(x) - E_D[h_D(x)])²]

  • High Variance: The model changes significantly with small changes in the training data. It is learning the noise and specific quirks of the training set, not the underlying function. This leads to overfitting.
  • Low Variance: The model's predictions are stable across different training sets. It produces a consistent output, indicating it has learned the general patterns.

c) Irreducible Error

This error component is due to the inherent noise ε in the data itself. No model, no matter how good, can eliminate this error.

Irreducible Error = Var(ε) = σ²
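All three quantities can be estimated empirically by resampling training sets. The sketch below uses an illustrative setup not taken from these notes (f(x) = sin(2πx), σ = 0.3, and numpy.polyfit models of degree 1 and 9) to estimate bias² and variance at a single test point:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)    # true function (illustrative choice)
sigma = 0.3                            # noise std, so irreducible error = sigma^2
x_test = 0.25                          # point at which we decompose the error

def predictions(degree, n_datasets=500, n=30):
    """Train one polynomial per resampled dataset D; return h_D(x_test) for each D."""
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x_test)
    return preds

for degree in (1, 9):
    p = predictions(degree)
    bias2 = (p.mean() - f(x_test)) ** 2        # (E_D[h_D(x)] - f(x))^2
    var = p.var()                              # E_D[(h_D(x) - E_D[h_D(x)])^2]
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

With these choices the degree-1 model should show the larger bias² (it cannot follow the sinusoid) and the degree-9 model the larger variance, matching the trade-off described above.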

1.2. The Trade-off in Practice

  • Simple Models (e.g., Linear Regression): Tend to have high bias and low variance. They make strong assumptions (e.g., linearity) and are not overly sensitive to the training data.
  • Complex Models (e.g., High-Degree Polynomials, Deep Neural Networks): Tend to have low bias and high variance. They are flexible enough to fit the training data very well (low bias) but are highly sensitive to its noise, leading to poor generalization (high variance).

The relationship is illustrated by the "U-shaped" curve of total error versus model complexity.

The optimal model complexity is at the bottom of the "U," where the total error is minimized. This is the sweet spot that balances bias and variance.

2. Overfitting and Underfitting: A Mathematical Viewpoint

Let h be a hypothesis from a hypothesis space H.

  • True Risk / Generalization Error R(h): The expected loss over the entire data distribution.
    R(h) = E_{(x,y)∼P}[L(h(x), y)], where L is a loss function (e.g., squared error). This is what we truly want to minimize, but we can't compute it because we don't know the true distribution P.
  • Empirical Risk / Training Error R_emp(h): The average loss over our training set S = {(x_i, y_i)}_{i=1}^n.
    R_emp(h) = (1/n) * Σ_{i=1}^n L(h(x_i), y_i). This is what we can compute and minimize during training.

Underfitting (High Bias):

  • The model is too simple. The hypothesis space H is not rich enough to contain a function that can approximate the true function f.
  • Mathematical Symptom: Both R_emp(h) and R(h) are high. The model performs poorly on both the training data and new data.

Overfitting (High Variance):

  • The model is too complex. It has learned the noise and specific patterns of the training set S.
  • Mathematical Symptom: R_emp(h) is very low, but R(h) is high. There is a large generalization gap: R(h) - R_emp(h) >> 0. The model has "memorized" the training data but fails to generalize.
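The generalization gap can be made visible with a toy experiment (illustrative setup, not from these notes): fit polynomials of low and very high degree to 15 noisy samples of f(x) = sin(2πx), and estimate R(h) on a large held-out sample.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)          # true function (illustrative choice)

x_train = rng.uniform(0, 1, 15)
y_train = f(x_train) + rng.normal(0, 0.3, 15)
x_test = rng.uniform(0, 1, 1000)             # large held-out sample approximates R(h)
y_test = f(x_test) + rng.normal(0, 0.3, 1000)

def risks(degree):
    """Empirical risk on the training set vs. estimated true risk on held-out data."""
    coeffs = np.polyfit(x_train, y_train, degree)
    r_emp = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    r_true = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    return r_emp, r_true

for degree in (3, 14):                       # modest model vs. interpolating model
    r_emp, r_true = risks(degree)
    print(f"degree {degree:2d}: R_emp = {r_emp:.4f}, R ≈ {r_true:.4f}")
```

The degree-14 model has 15 parameters for 15 training points, so it essentially interpolates them (R_emp ≈ 0) while its held-out error, and hence the gap R(h) − R_emp(h), is far larger than the modest model's.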

The goal of machine learning, particularly regularization, is to find a hypothesis h that not only has a low empirical risk R_emp(h) but also a small generalization gap, thereby ensuring a low true risk R(h).

3. L1 and L2 Regularization

Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. This penalty discourages the model from becoming too complex.

The general form of a regularized objective function is:

J(w) = R_emp(w) + λ * Ω(w)

  • R_emp(w): The original loss function (e.g., Mean Squared Error).
  • w: The model parameters (weights).
  • Ω(w): The regularization term, which penalizes complexity.
  • λ: The regularization parameter, a hyperparameter that controls the strength of the penalty. λ ≥ 0.

3.1. L2 Regularization (Ridge Regression)

L2 regularization adds a penalty proportional to the squared magnitude of the weights. The penalty term is the squared L2 norm of the weight vector.

Ω(w) = ||w||₂² = Σ_{j=1}^d w_j²

The objective function becomes:

J(w) = Σ_{i=1}^n (y_i - wᵀx_i)² + λ * Σ_{j=1}^d w_j²

Mathematical Effects:

  1. Shrinks Weights: It encourages the model to use smaller weights. Large weights are heavily penalized, pushing them towards zero, though they rarely become exactly zero.
  2. Distributes Weights: In the presence of correlated features, L2 tends to distribute the weight more evenly among them.
  3. Makes the Problem Well-Posed: For linear regression, the closed-form solution is w = (XᵀX)⁻¹Xᵀy. If features are correlated, XᵀX can be singular or ill-conditioned, making its inverse unstable. The L2-regularized solution is w = (XᵀX + λI)⁻¹Xᵀy. The term λI shifts every eigenvalue of XᵀX up by λ, making the matrix positive definite (hence invertible) and the solution stable.
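Effects 2 and 3 can be checked numerically. In this sketch (hypothetical data: two almost perfectly correlated features), XᵀX is severely ill-conditioned, while XᵀX + λI is well behaved and splits the weight roughly evenly across the duplicated features:

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 100, 1.0
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)      # almost perfectly correlated feature
X = np.column_stack([x1, x2])
y = x1 + rng.normal(0, 0.1, n)           # true signal uses x1 only

A = X.T @ X
print("cond(X^T X)         =", np.linalg.cond(A))                  # huge
print("cond(X^T X + lam*I) =", np.linalg.cond(A + lam * np.eye(2)))  # modest

w_ridge = np.linalg.solve(A + lam * np.eye(2), X.T @ y)
print("ridge weights:", w_ridge)         # small, split evenly over the duplicates
```

Each ridge weight lands near 0.5 even though only x1 carries the signal: since the two columns are indistinguishable, the L2 penalty prefers the even split over loading everything on one of them.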

3.2. L1 Regularization (Lasso Regression)

L1 regularization adds a penalty proportional to the absolute value of the weights. The penalty term is the L1 norm of the weight vector.

Ω(w) = ||w||₁ = Σ_{j=1}^d |w_j|

The objective function becomes:

J(w) = Σ_{i=1}^n (y_i - wᵀx_i)² + λ * Σ_{j=1}^d |w_j|

Mathematical Effects:

  1. Creates Sparsity: The key feature of L1 is that it can shrink some weights to be exactly zero. This effectively performs automatic feature selection, as features with zero weights are removed from the model.
  2. Non-differentiable: The |w_j| term is not differentiable at w_j = 0. This means there is no simple closed-form solution, and optimization must be done with algorithms like Coordinate Descent or Subgradient Descent.
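One standard optimizer is proximal gradient descent (ISTA): each gradient step on the smooth squared-error part is followed by soft-thresholding, the proximal operator of the L1 penalty. A minimal sketch on synthetic data (all constants here are illustrative, not from these notes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]                      # only 3 informative features
y = X @ w_true + rng.normal(0, 0.5, n)

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrinks toward 0, clipping small values to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, steps=2000):
    """Minimize 0.5*||y - Xw||^2 + lam*||w||_1 by proximal gradient descent."""
    w = np.zeros(X.shape[1])
    eta = 1.0 / np.linalg.norm(X, 2) ** 2          # step size 1/L for the smooth part
    for _ in range(steps):
        grad = X.T @ (X @ w - y)                   # gradient of 0.5*||y - Xw||^2
        w = soft_threshold(w - eta * grad, eta * lam)
    return w

w = lasso_ista(X, y, lam=50.0)
print("nonzero weights:", np.flatnonzero(w))       # sparsity = automatic feature selection
```

The noise features are driven to exactly zero, while the three informative weights survive (slightly shrunken), illustrating the automatic feature selection described in point 1.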

3.3. Geometric Intuition: L1 vs. L2

The difference between L1 and L2 can be understood geometrically by viewing regularization as a constrained optimization problem.

Minimize R_emp(w) subject to Ω(w) ≤ C

For a given λ, there is a corresponding constraint radius C.

  • L2 Regularization: The constraint region is ||w||₂² ≤ C, which is a hypersphere (a circle in 2D). The level curves of the loss function are ellipses. The optimal solution is the point where an ellipse first touches the circle. This tangency point is unlikely to be on an axis, so weights are small but non-zero.
  • L1 Regularization: The constraint region is ||w||₁ ≤ C, which is a hyper-rhombus (a diamond in 2D). Because the L1 "ball" has sharp corners (vertices) at the axes, the expanding ellipses of the loss function are much more likely to hit one of these corners first. At a corner, one of the weight components is zero, leading to a sparse solution.

4. Norm-based Constraints and Structural Risk Minimization

The L1 and L2 penalties are specific instances of a more general idea of controlling model complexity.

4.1. Lp Norms

The Lp norm of a vector w is defined as:

||w||_p = (Σ_{i=1}^d |w_i|^p)^(1/p)

  • p=1: L1 norm (Manhattan distance)
  • p=2: L2 norm (Euclidean distance)

Regularization can use any Lp norm. The "Elastic Net" regularization combines both L1 and L2 penalties: λ₁||w||₁ + λ₂||w||₂².
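As a quick sanity check, the norms and the Elastic Net penalty can be computed directly with numpy (the λ values below are arbitrary illustrations):

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])

l1 = np.linalg.norm(w, 1)           # |3| + |-4| + |0| = 7
l2 = np.linalg.norm(w, 2)           # sqrt(9 + 16) = 5
lam1, lam2 = 0.1, 0.01              # hypothetical regularization strengths
elastic = lam1 * l1 + lam2 * l2**2  # Elastic Net penalty: 0.1*7 + 0.01*25 = 0.95

print(l1, l2, elastic)
```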

4.2. Structural Risk Minimization (SRM)

SRM is a theoretical principle from statistical learning theory (developed by Vapnik) that provides a formal justification for regularization.

The core idea is to balance the quality of the fit on the training data (empirical risk) with the complexity of the hypothesis class. Instead of just minimizing R_emp(h), SRM aims to minimize an upper bound on the true risk R(h).

A key result from learning theory states that with probability at least 1-δ, the following bound holds for any hypothesis h in a hypothesis class H:

R(h) ≤ R_emp(h) + Φ(H, n, δ)

  • R(h): True risk (what we care about).
  • R_emp(h): Empirical risk (what we measure).
  • Φ(H, n, δ): The "complexity penalty" or "generalization bound". This term depends on:
    • H: The complexity/capacity of the hypothesis space.
    • n: The number of training samples.
    • δ: A confidence parameter (how certain we are that the bound holds).

The SRM Principle:
Instead of naively picking a hypothesis h that minimizes R_emp(h) (which can lead to overfitting), we should pick the hypothesis h that minimizes the entire right-hand side of the inequality.

Regularization is a practical implementation of SRM. The term λ * Ω(w) in our regularized objective function is a proxy for the theoretical complexity penalty Φ(H, n, δ). By penalizing large weights, we are effectively restricting ourselves to a simpler subset of the hypothesis space, thereby controlling the complexity term and improving generalization.

5. VC Dimension (Vapnik-Chervonenkis Dimension)

The VC dimension is a formal, quantitative measure of the capacity, complexity, or expressive power of a hypothesis class H. It is a key concept used in the complexity term Φ from the SRM principle.

5.1. Intuition and Definition

Intuition: The VC dimension measures how many data points a model class can perfectly classify, no matter how the labels are assigned.

Shattering: A hypothesis class H is said to shatter a set of n points {x₁, ..., x_n} if for every possible binary labeling of these points, there exists a hypothesis h ∈ H that can produce that exact labeling.

  • There are 2^n possible binary labelings for n points. To shatter the set, H must be able to realize all of them.

Definition: VC Dimension
The VC dimension of a hypothesis class H, denoted VC(H), is the size of the largest set of points that H can shatter. If H can shatter any arbitrarily large set of points, its VC dimension is infinite.

5.2. Examples

a) Linear Classifier in 1D (Thresholds on a line)

  • A set of 1 point can be shattered. ({+} or {-})
  • A set of 2 points {x₁, x₂} can be shattered. All four labelings {+,+}, {+,-}, {-,-}, {-,+} are realizable by choosing the threshold's position and its orientation (positive side to the left or to the right).
  • A set of 3 points {x₁, x₂, x₃} cannot be shattered. The labeling {+,-,+} is impossible to achieve with a single threshold.
  • VC(H_threshold) = 2.
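The 1D case is small enough to verify exhaustively. The sketch below enumerates every labeling a threshold classifier (in either orientation) can realize; only threshold positions below, between, and above the sorted points need to be tried:

```python
def threshold_labelings(points):
    """All labelings realizable by a 1D threshold classifier, in either orientation."""
    xs = sorted(points)
    # one threshold below all points, one between each adjacent pair, one above all
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    out = set()
    for t in thresholds:
        for sign in (+1, -1):                      # which side is labeled positive
            out.add(tuple(sign * (1 if x > t else -1) for x in points))
    return out

print(len(threshold_labelings([0.0, 1.0])))                # 4 = 2^2: shattered
print((1, -1, 1) in threshold_labelings([0.0, 1.0, 2.0]))  # False: {+,-,+} unachievable
```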

b) Linear Classifier in 2D (Lines in a plane)

  • Can H shatter 3 points (non-collinear)? Yes. For any of the 2³=8 labelings, we can draw a line to separate the positives from the negatives.

  • Can H shatter 4 points? No. Consider 4 points in convex position (e.g., the corners of a square). It is impossible to find a line that correctly classifies the labeling where diagonally opposite points share the same label (the XOR problem).

  • VC(H_line_in_2D) = 3.

Generalization: The VC dimension of a linear classifier (hyperplane) in d dimensions is d+1.

5.3. Role in Generalization Theory

The VC dimension provides a concrete measure of complexity for the SRM bound. A simplified version of the VC bound is:

R(h) ≤ R_emp(h) + sqrt( (VC(H) * (log(2n/VC(H)) + 1) - log(δ/4)) / n )

Key Takeaways from the Bound:

  1. Complexity: If VC(H) is high, the second term (the generalization bound) is large. This means that for a complex model, the true risk could be much higher than the empirical risk. This is the mathematical formalization of overfitting.
  2. Data: As the number of samples n increases, the generalization bound decreases, at a rate of roughly √((log n)/n). This means with enough data, even a complex model can generalize well.
  3. SRM in Practice: SRM involves defining a nested sequence of hypothesis spaces H₁ ⊂ H₂ ⊂ ... where VC(H₁) < VC(H₂) < .... We then choose the hypothesis space H_k and the specific hypothesis h ∈ H_k that together minimize the VC bound. Regularization is a smoother, more practical way to do this: increasing λ effectively restricts the model to a hypothesis space with a smaller "effective" VC dimension.
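Plugging numbers into the bound makes takeaways 1 and 2 concrete. The helper below (hypothetical values: VC(H) = 10, δ = 0.05) evaluates the complexity term for increasing n; it shrinks as n grows, roughly like √((log n)/n):

```python
import math

def vc_bound_term(vc, n, delta=0.05):
    """Complexity term of the simplified VC bound:
    sqrt((vc*(log(2n/vc) + 1) - log(delta/4)) / n)."""
    return math.sqrt((vc * (math.log(2 * n / vc) + 1) - math.log(delta / 4)) / n)

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:6d}: bound term = {vc_bound_term(vc=10, n=n):.4f}")
```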