Unit 5 - Notes

INT255

Unit 5: Mathematical Foundations of Support Vector Machines

1. Geometric Interpretation of Classification Margins

The core idea of a Support Vector Machine (SVM) is to find an optimal separating hyperplane between classes. This optimality is defined by maximizing the "margin" of separation.

The Hyperplane

For a binary classification problem with data points in an n-dimensional space (ℝⁿ), a hyperplane is an (n-1)-dimensional affine subspace (it need not pass through the origin) that divides the space into two half-spaces.

  • Equation of a Hyperplane: A hyperplane can be defined by the set of points x satisfying:

    w^T x + b = 0

    where:

    • w is the normal vector to the hyperplane (a non-zero vector perpendicular to the hyperplane). The orientation of the hyperplane is determined by w.
    • b is a scalar bias term. It determines the offset of the hyperplane from the origin.
  • Classification Rule: Once we have a hyperplane (w, b), we can classify a new data point x using the sign of f(x) = w^T x + b:

    • If w^T x + b > 0, classify x as class +1.
    • If w^T x + b < 0, classify x as class -1.
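As a minimal sketch of this rule (with a hypothetical weight vector and bias chosen for illustration), classification reduces to a sign check:

```python
import numpy as np

def classify(w, b, x):
    """Classify x by the sign of f(x) = w^T x + b."""
    return 1 if w @ x + b > 0 else -1

# Hypothetical hyperplane: w = (1, -1), b = 0
w = np.array([1.0, -1.0])
print(classify(w, 0.0, np.array([2.0, 1.0])))   # f(x) = 1 > 0  -> class +1
print(classify(w, 0.0, np.array([0.0, 1.0])))   # f(x) = -1 < 0 -> class -1
```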

Defining the Margin

The margin is the "street" or "cushion" between the separating hyperplane and the closest data points from either class.

  1. Functional Margin: The functional margin of a data point (x_i, y_i), where y_i ∈ {-1, 1}, with respect to a hyperplane (w, b) is given by:

    γ̂_i = y_i (w^T x_i + b)

    A correct classification means γ̂_i > 0. However, the functional margin can be arbitrarily increased by scaling w and b (e.g., replacing them with 2w and 2b) without changing the hyperplane itself. This makes it unsuitable for optimization.

  2. Geometric Margin: The geometric margin is the actual perpendicular distance from a data point x_i to the hyperplane w^T x + b = 0. This distance is given by:

    γ_i = (y_i (w^T x_i + b)) / ||w||

    where ||w|| is the Euclidean norm (L2 norm) of w, i.e., ||w|| = sqrt(w_1^2 + w_2^2 + ... + w_n^2).

    The geometric margin is invariant to the scaling of w and b, making it a suitable metric for optimization.
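A short NumPy check of this formula (with a made-up w, b, and a single labeled point) also confirms the scale invariance directly:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Per-point geometric margins: y_i (w^T x_i + b) / ||w||."""
    return y * (X @ w + b) / np.linalg.norm(w)

X = np.array([[1.0, 1.0]])
y = np.array([1.0])
w, b = np.array([0.0, 2.0]), 1.0

g1 = geometric_margin(w, b, X, y)
g2 = geometric_margin(2 * w, 2 * b, X, y)   # rescale (w, b) by 2
# g1 == g2: the geometric margin is unchanged by scaling,
# while the functional margin y_i (w^T x_i + b) would double.
```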

The Optimal Hyperplane (Maximal Margin Classifier)

The goal of an SVM is to find the hyperplane (w, b) that maximizes the geometric margin for the entire dataset. This is the hyperplane that is farthest from the closest training points of both classes.

  • The margin of the classifier is γ = min_i γ_i.
  • The objective is to maximize this γ.
  • The points that lie on the edge of the margin (the closest points to the separating hyperplane) are called support vectors. They are critical because they alone define the position and orientation of the optimal hyperplane.

2. Hard Margin vs. Soft Margin SVM

Hard Margin SVM (Linearly Separable Case)

The hard margin SVM assumes that the training data is perfectly linearly separable. It finds a hyperplane that separates all data points with no misclassifications.

Formulation (Primal Problem)

To find the maximal margin hyperplane, we solve the following optimization problem:

max_(w,b) γ
subject to: y_i (w^T x_i + b) / ||w|| ≥ γ for all i=1, ..., m

This problem is awkward to optimize directly, but it can be simplified. Because the geometric margin γ is invariant to the scaling of (w, b), we are free to fix the functional margin of the support vectors to 1: for the closest points, y_i (w^T x_i + b) = 1.

This constraint implies that γ = 1 / ||w||. Now, maximizing γ is equivalent to maximizing 1 / ||w||, which is in turn equivalent to minimizing ||w||, or more conveniently, minimizing (1/2)||w||^2 (the 1/2 and the square are for mathematical convenience during differentiation).

The final optimization problem for the Hard Margin SVM is:

min_(w,b) (1/2) ||w||^2
subject to: y_i (w^T x_i + b) ≥ 1 for all i=1, ..., m

This is a convex quadratic programming problem with linear constraints.

Support Vectors

Support vectors are the training data points x_i for which the constraint is active (i.e., the equality holds):

y_i (w^T x_i + b) = 1

These points lie exactly on the margin boundaries. The optimal hyperplane is determined only by these support vectors.

Limitations

  1. Requires Linear Separability: It fails if the data cannot be perfectly separated by a hyperplane.
  2. Sensitivity to Outliers: A single outlier can drastically change the position of the optimal hyperplane and reduce the margin significantly.

Soft Margin SVM (Non-Linearly Separable Case)

The soft margin SVM relaxes the strict assumption of linear separability. It allows for some data points to be misclassified or to fall inside the margin.

Introducing Slack Variables

To allow for violations, we introduce a non-negative slack variable ξ_i ≥ 0 for each data point x_i.

  • If ξ_i = 0, the point x_i is correctly classified and is on or outside the margin.
  • If 0 < ξ_i ≤ 1, the point x_i is correctly classified but lies inside the margin.
  • If ξ_i > 1, the point x_i is misclassified (it's on the wrong side of the hyperplane).

The constraint is modified to: y_i (w^T x_i + b) ≥ 1 - ξ_i
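The three cases above can be read directly off the slack values. A quick NumPy illustration with a hypothetical hyperplane w = (1, 0), b = 0:

```python
import numpy as np

def slack(w, b, X, y):
    """xi_i = max(0, 1 - y_i (w^T x_i + b)) for each point."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[2.0, 0.0],    # outside the margin
              [0.5, 0.0],    # inside the margin
              [-1.0, 0.0]])  # wrong side of the hyperplane
y = np.array([1.0, 1.0, 1.0])

print(slack(w, b, X, y))   # [0.  0.5 2. ]: xi = 0, 0 < xi <= 1, xi > 1
```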

Formulation (Primal Problem)

We want to find a balance between maximizing the margin (minimizing ||w||^2) and minimizing the classification errors (minimizing the sum of slack variables Σ ξ_i). This trade-off is controlled by a hyperparameter C.

The optimization problem for the Soft Margin SVM is:

min_(w,b,ξ) (1/2) ||w||^2 + C * Σ_(i=1)^m ξ_i
subject to:

  1. y_i (w^T x_i + b) ≥ 1 - ξ_i for all i=1, ..., m
  2. ξ_i ≥ 0 for all i=1, ..., m

The Role of the Hyperparameter C

C > 0 is a regularization parameter that controls the trade-off:

  • Large C: A large C places a heavy penalty on misclassification (Σ ξ_i). The optimizer will try very hard to classify all points correctly, leading to a smaller margin and a more complex decision boundary. This can lead to overfitting.
  • Small C: A small C places less penalty on misclassification. The optimizer will prioritize a larger margin, even if it means misclassifying some points. This can lead to underfitting.

3. Primal and Dual Optimization Problems

The Primal Problem (Recap)

The formulations derived above are known as the primal problems.

  • Hard Margin Primal: min_(w,b) (1/2) ||w||^2 s.t. y_i (w^T x_i + b) ≥ 1.
  • Soft Margin Primal: min_(w,b,ξ) (1/2) ||w||^2 + C Σ ξ_i s.t. y_i (w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0.

The Need for the Dual Problem

While the primal problem can be solved directly, converting it to its dual form offers significant advantages:

  1. The Kernel Trick: The dual formulation expresses the optimization problem in terms of dot products of the input data points (x_i^T x_j). This allows us to use the kernel trick to handle non-linear data efficiently.
  2. Computational Efficiency: For datasets where the number of features (n) is much larger than the number of samples (m), the dual problem can be much faster to solve.
  3. Identifies Support Vectors: The solution to the dual problem directly gives us the support vectors.

The Dual Problem Formulation

The process of deriving the dual problem involves the Lagrangian formulation. The final dual problem (for the soft margin case) is:

max_α Σ_(i=1)^m α_i - (1/2) Σ_(i=1)^m Σ_(j=1)^m y_i y_j α_i α_j (x_i^T x_j)
subject to:

  1. Σ_(i=1)^m α_i y_i = 0
  2. 0 ≤ α_i ≤ C for all i=1, ..., m

Here, α_i are the Lagrange multipliers.

4. The Lagrangian Formulation: Bridging Primal and Dual

Constrained Optimization and Lagrange Multipliers

The Lagrangian is a technique used to solve constrained optimization problems. We convert a constrained problem into an unconstrained one by introducing Lagrange multipliers (one for each constraint).

Constructing the Lagrangian for Hard Margin SVM

Let's start with the simpler hard margin case.
Primal Problem: min_(w,b) (1/2) ||w||^2
Constraint: y_i (w^T x_i + b) ≥ 1 which can be rewritten as 1 - y_i (w^T x_i + b) ≤ 0.

We introduce a Lagrange multiplier α_i ≥ 0 for each constraint. The Lagrangian function L is:

L(w, b, α) = (1/2) ||w||^2 - Σ_(i=1)^m α_i [y_i (w^T x_i + b) - 1]

The primal problem is equivalent to solving min_(w,b) max_(α≥0) L(w, b, α). The dual problem is found by swapping the min and max: max_(α≥0) min_(w,b) L(w, b, α).

Deriving the Dual Problem

Step 1: Minimize L with respect to w and b

To find the minimum of L with respect to the primal variables w and b, we take the partial derivatives and set them to zero.

  • Derivative w.r.t. w:

    ∂L / ∂w = w - Σ_(i=1)^m α_i y_i x_i = 0

    This gives us a crucial result: w = Σ_(i=1)^m α_i y_i x_i

    This shows that the optimal weight vector w is a linear combination of the input data points x_i. Only the points for which α_i > 0 contribute to w. These are the support vectors.

  • Derivative w.r.t. b:

    ∂L / ∂b = - Σ_(i=1)^m α_i y_i = 0

    This gives us a constraint on the α's: Σ_(i=1)^m α_i y_i = 0
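These two results can be checked numerically. For a toy two-point problem with a hypothetical dual solution α = (0.25, 0.25), the recovered w is exactly the maximal margin direction:

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])          # hypothetical optimal multipliers

assert abs(alpha @ y) < 1e-12           # constraint: sum_i alpha_i y_i = 0
w = (alpha * y) @ X                     # w = sum_i alpha_i y_i x_i
print(w)                                # [0.5 0.5]; both points satisfy y_i (w^T x_i) = 1
```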

Step 2: Substitute back into the Lagrangian

Now we substitute these two results back into the original Lagrangian L to eliminate w and b, leaving a problem that only depends on α.

L(α) = (1/2) (Σ_i α_i y_i x_i)^T (Σ_j α_j y_j x_j) - Σ_i α_i y_i ( (Σ_j α_j y_j x_j)^T x_i + b ) + Σ_i α_i

Simplifying this using Σ_i α_i y_i = 0 and expanding the dot products:

L(α) = Σ_(i=1)^m α_i - (1/2) Σ_(i=1)^m Σ_(j=1)^m y_i y_j α_i α_j (x_i^T x_j)

This is the objective function for the dual problem. We need to maximize it subject to the constraints we found (Σ α_i y_i = 0) and the original constraint on the multipliers (α_i ≥ 0).

The Lagrangian for Soft Margin SVM

The derivation is similar but includes the slack variables ξ_i and their constraints.
The Lagrangian for the soft margin primal is:

L(w, b, ξ, α, μ) = (1/2) ||w||^2 + C Σ_i ξ_i - Σ_i α_i [y_i (w^T x_i + b) - 1 + ξ_i] - Σ_i μ_i ξ_i

where both α_i ≥ 0 and μ_i ≥ 0 are Lagrange multipliers. After taking derivatives w.r.t. w, b, and ξ, and substituting back, we arrive at the same dual objective function, but with a different constraint on α: 0 ≤ α_i ≤ C.

Karush-Kuhn-Tucker (KKT) Conditions

The KKT conditions are the necessary conditions for the solution of a constrained optimization problem to be optimal. For the SVM dual problem, they provide powerful insights into the solution. For the soft margin SVM, the key KKT conditions are:

  1. α_i ≥ 0 (Lagrange multiplier for margin constraint)
  2. μ_i ≥ 0 (Lagrange multiplier for slack constraint)
  3. y_i (w^T x_i + b) - 1 + ξ_i ≥ 0 (Primal feasibility)
  4. ξ_i ≥ 0 (Primal feasibility)
  5. α_i [y_i (w^T x_i + b) - 1 + ξ_i] = 0 (Complementary slackness)
  6. μ_i ξ_i = 0 (Complementary slackness)
  7. C - α_i - μ_i = 0 (From ∂L/∂ξ_i = 0)

From these conditions, we can deduce the nature of the support vectors:

  • If α_i = 0: The point x_i is not a support vector and is correctly classified outside the margin.
  • If 0 < α_i < C: Then μ_i > 0, which implies ξ_i = 0. This means y_i (w^T x_i + b) = 1. The point x_i is a margin support vector; it lies exactly on the margin.
  • If α_i = C: Then μ_i = 0, so complementary slackness no longer forces ξ_i = 0 and the slack can be positive. The point x_i is a non-margin (bound) support vector: it lies inside the margin (0 < ξ_i ≤ 1) or is misclassified (ξ_i > 1).

The bias b can be calculated using any margin support vector (0 < α_i < C): b = y_i - w^T x_i = y_i - Σ_j α_j y_j (x_j^T x_i).
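Continuing the toy two-point example with the hypothetical dual solution α = (0.25, 0.25), b can be recovered from either point (both are margin support vectors here, assuming C > 0.25):

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])           # hypothetical optimal multipliers

i = 0                                    # index of a margin support vector (0 < alpha_i < C)
b = y[i] - (alpha * y) @ (X @ X[i])      # b = y_i - sum_j alpha_j y_j (x_j^T x_i)
print(b)                                 # 0.0 for this symmetric dataset
```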

5. The Kernel Trick and Kernel Functions

The Motivation: Non-Linear Decision Boundaries

Many real-world datasets are not linearly separable. The kernel trick is an elegant mathematical technique that allows SVMs to learn complex, non-linear decision boundaries.

The core idea is to map the data from its original input space X to a higher-dimensional feature space F where the data becomes linearly separable.
φ: X -> F

We could then run the SVM algorithm in this new feature space F. The decision boundary would be linear in F but would correspond to a non-linear boundary back in the original space X. However, computing this mapping φ(x) for high-dimensional spaces can be computationally infeasible.

The "Trick" in the Dual Formulation

Let's look at the SVM dual objective function and the prediction function:

  • Dual Objective: max_α Σ α_i - (1/2) Σ_i Σ_j y_i y_j α_i α_j (x_i^T x_j)
  • Prediction: f(x) = w^T x + b = (Σ_i α_i y_i x_i)^T x + b = Σ_i α_i y_i (x_i^T x) + b

Notice that the data points x_i only ever appear inside a dot product. If we map the data to the feature space F using φ, these dot products become φ(x_i)^T φ(x_j).

The kernel trick is to define a kernel function K(x_i, x_j) that computes this dot product in the high-dimensional space without ever explicitly computing the mapping φ.

K(x_i, x_j) = φ(x_i)^T φ(x_j)

Formalizing the Kernel Function

By replacing x_i^T x_j with K(x_i, x_j), we can operate in the high-dimensional space implicitly.

  • Kernelized Dual Objective:
    max_α Σ α_i - (1/2) Σ_i Σ_j y_i y_j α_i α_j K(x_i, x_j)
    subject to Σ α_i y_i = 0 and 0 ≤ α_i ≤ C.

  • Kernelized Prediction Function:
    f(x) = Σ_(i ∈ SV) α_i y_i K(x_i, x) + b
    (where SV is the set of support vectors)

Common Kernel Functions

  1. Linear Kernel:

    • K(x, z) = x^T z
    • This is the standard SVM, operating in the original feature space. It's equivalent to no transformation.
  2. Polynomial Kernel:

    • K(x, z) = (γ * x^T z + r)^d
    • d is the degree of the polynomial, γ is a scaling factor, and r is a constant offset.
    • This implicitly maps the data to a feature space of polynomial combinations of the original features up to degree d.
  3. Radial Basis Function (RBF) Kernel (or Gaussian Kernel):

    • K(x, z) = exp(-γ ||x - z||^2)
    • γ (gamma) is a parameter that controls the width of the Gaussian. A small γ means a wide influence, leading to a smoother decision boundary. A large γ means a narrow influence, leading to a more complex, potentially overfit boundary.
    • This kernel is very powerful as it corresponds to an infinite-dimensional feature space. It is a popular default choice.
  4. Sigmoid Kernel:

    • K(x, z) = tanh(γ * x^T z + r)
    • This kernel is related to neural networks with a sigmoid activation function.
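A minimal sketch of these four kernels in NumPy (the default parameter values are arbitrary choices for illustration, not recommendations):

```python
import numpy as np

def linear(x, z):
    return x @ z

def polynomial(x, z, gamma=1.0, r=1.0, d=2):
    return (gamma * (x @ z) + r) ** d

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid(x, z, gamma=1.0, r=0.0):
    return np.tanh(gamma * (x @ z) + r)

x, z = np.array([1.0, 0.0]), np.array([1.0, 0.0])
print(linear(x, z), polynomial(x, z), rbf(x, z))   # 1.0 4.0 1.0
```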

Mercer's Theorem

Not every function K(x, z) can be used as a kernel. A function is a valid kernel if it corresponds to some dot product in a feature space. Mercer's theorem provides the condition for this: a symmetric function K(x, z) is a valid kernel if and only if the kernel matrix K (where K_ij = K(x_i, x_j)) is positive semi-definite for any set of points {x_1, ..., x_m}.
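Mercer's condition can be spot-checked numerically: build the kernel matrix for a sample of points and confirm its eigenvalues are non-negative up to floating-point error. A sketch using the RBF kernel:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Kernel matrix K_ij = K(x_i, x_j) for the rows of X."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

rbf = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = gram_matrix(X, rbf)

# Symmetric with all eigenvalues >= 0 (numerically) => consistent with a valid kernel
eigs = np.linalg.eigvalsh(K)
print(eigs.min() >= -1e-10)   # True
```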

6. Optimization Perspective of SVM Training

SVM as a Quadratic Programming (QP) Problem

Training an SVM is fundamentally an optimization task. The primal and dual formulations are both instances of a Quadratic Programming (QP) problem.

A QP problem has the form:
min_u (1/2) u^T P u + q^T u
subject to:
G u ≤ h
A u = b

  • The objective function is quadratic (u^T P u).
  • The constraints are linear (G u ≤ h, A u = b).

The SVM dual problem fits this form perfectly:

  • u is the vector of α's.
  • The objective Σ α_i - (1/2) Σ_i Σ_j y_i y_j α_i α_j K(x_i, x_j) is quadratic in α.
  • The constraints Σ α_i y_i = 0 and 0 ≤ α_i ≤ C are linear.

Because the objective function is convex (the kernel matrix is positive semi-definite), this QP problem has no spurious local optima: every local minimum is a global minimum, so a solver is guaranteed to find an optimal solution.
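To make the correspondence concrete, the dual can be put into the standard QP template by setting P_ij = y_i y_j K(x_i, x_j) and q = -1, so that minimizing (1/2) αᵀPα + qᵀα is the same as maximizing the dual objective. A sketch with a linear kernel on toy data:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T                  # linear kernel matrix K_ij = x_i^T x_j
P = np.outer(y, y) * K       # P_ij = y_i y_j K(x_i, x_j)
q = -np.ones(len(y))         # (1/2) a^T P a + q^T a = -(dual objective)

# P = diag(y) K diag(y) is positive semi-definite whenever K is,
# which is exactly what makes the dual a convex QP.
print(np.linalg.eigvalsh(P).min() >= -1e-8)   # True
```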

Solving the QP Problem: Sequential Minimal Optimization (SMO)

General-purpose QP solvers can be slow for large datasets. The Sequential Minimal Optimization (SMO) algorithm is a highly efficient algorithm specifically for training SVMs.

  • Idea: SMO breaks down the large QP problem into a series of the smallest possible QP sub-problems.
  • Process: In each step, SMO selects two Lagrange multipliers (α_i, α_j) to optimize jointly, while keeping all other multipliers fixed.
  • Advantage: The sub-problem involving just two multipliers can be solved analytically, which is extremely fast and avoids complex numerical QP optimization in the inner loop. SMO iterates through the multipliers until the solution converges.
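The two-multiplier update can be sketched in NumPy. This follows the *simplified* SMO variant (the second multiplier is chosen at random rather than by Platt's heuristics), with a linear kernel and a toy dataset; it illustrates the idea and is not a production solver:

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-4, max_passes=20, seed=0):
    """Simplified SMO with a linear kernel: optimize one pair (alpha_i, alpha_j) at a time."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    K = X @ X.T                      # Gram matrix K_ij = x_i^T x_j
    alpha, b, passes = np.zeros(m), 0.0, 0

    def f(i):                        # decision value for training point i
        return (alpha * y) @ K[:, i] + b

    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = f(i) - y[i]
            # consider i only if it violates its KKT condition by more than tol
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = int(rng.integers(m - 1))
                if j >= i:
                    j += 1           # ensure j != i
                E_j = f(j) - y[j]
                a_i, a_j = alpha[i], alpha[j]
                # box endpoints that keep both multipliers in [0, C]
                if y[i] != y[j]:
                    L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
                if L == H:
                    continue
                eta = 2.0 * K[i, j] - K[i, i] - K[j, j]   # curvature along the pair (< 0)
                if eta >= 0:
                    continue
                alpha[j] = np.clip(a_j - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-5:
                    continue
                alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])
                # update b so the KKT conditions hold for i or j
                b1 = b - E_i - y[i] * (alpha[i] - a_i) * K[i, i] - y[j] * (alpha[j] - a_j) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i) * K[i, j] - y[j] * (alpha[j] - a_j) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2.0
                changed += 1
        passes = passes + 1 if changed == 0 else 0

    w = (alpha * y) @ X              # recover w explicitly (linear kernel only)
    return w, b, alpha

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha = smo_train(X, y, C=1.0)
# sign(X @ w + b) should reproduce y, with alpha > 0 only for the support vectors
```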

The Hinge Loss Formulation (An alternative perspective)

The soft margin SVM primal problem can be re-written from a loss minimization perspective, which connects it to other machine learning models.

The objective min (1/2) ||w||^2 + C Σ ξ_i can be seen as:

min_(w,b) [ λ ||w||^2 + Σ_i max(0, 1 - y_i (w^T x_i + b)) ]

where λ = 1/(2C): dividing the original objective (1/2) ||w||^2 + C Σ ξ_i through by C gives this form, since at the optimum ξ_i = max(0, 1 - y_i (w^T x_i + b)). This formulation consists of two parts:

  1. Regularization Term: λ ||w||^2. This is an L2 regularization term that penalizes large weights, promoting a simpler model and a larger margin.
  2. Loss Term: Σ_i max(0, 1 - y_i (w^T x_i + b)). This is the Hinge Loss.
    • If a point is correctly classified and outside the margin (y_i f(x_i) ≥ 1), the loss is 0.
    • If a point is inside the margin or misclassified (y_i f(x_i) < 1), the loss is a linear penalty 1 - y_i f(x_i).

This perspective frames SVM as a model that minimizes the sum of a regularization term (to control complexity and margin size) and a loss function (to penalize classification errors).
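This objective is straightforward to evaluate directly. A sketch of the regularized hinge-loss objective (the value of λ here is an arbitrary illustration, with λ playing the role of the inverse of C), reusing the toy separable dataset:

```python
import numpy as np

def svm_objective(w, b, X, y, lam=0.1):
    """Regularized hinge loss: lam * ||w||^2 + sum_i max(0, 1 - y_i (w^T x_i + b))."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return lam * (w @ w) + hinge.sum()

w, b = np.array([0.25, 0.25]), 0.0
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# All margins are >= 1 here, so the hinge term is zero and
# only the regularizer lam * ||w||^2 contributes.
print(svm_objective(w, b, X, y))   # 0.0125
```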