Unit 5 - Subjective Questions
INT255 • Practice Questions with Detailed Answers
Define the concept of a hyperplane and its fundamental role in Support Vector Machines (SVMs).
In an $n$-dimensional space, a hyperplane is a flat, $(n-1)$-dimensional subspace. For example, in a 2D space, a hyperplane is a line; in a 3D space, it's a plane.

In SVMs, a hyperplane serves as the decision boundary that separates data points belonging to different classes. The primary goal of an SVM is to find the optimal hyperplane that maximizes the margin (the distance) between the closest data points of the different classes. This optimal hyperplane is defined by the equation $\mathbf{w} \cdot \mathbf{x} + b = 0$, where:
* $\mathbf{w}$ is the weight vector, which is orthogonal to the hyperplane.
* $\mathbf{x}$ represents a point on the hyperplane.
* $b$ is the bias term, which shifts the hyperplane from the origin.
Explain the geometric interpretation of the classification margin in SVM. How is it related to the support vectors?
The classification margin in SVM refers to the distance between the separating hyperplane and the closest data points from either class. Geometrically, it's the smallest distance from any data point to the hyperplane.

The SVM algorithm aims to find the hyperplane that maximizes this margin. A larger margin generally implies better generalization capability and robustness to new, unseen data.

The support vectors are the data points that lie on the margin hyperplanes (i.e., the hyperplanes parallel to the decision boundary and passing through the closest points of each class). These are the critical data points that define the orientation and position of the optimal separating hyperplane. If these points were removed or moved, the optimal hyperplane might change. All other data points lie outside the margin and do not influence the position of the decision boundary, hence they are not 'support vectors'.
Derive the expression for the margin in terms of the weight vector and bias for a linearly separable dataset.
Consider a separating hyperplane defined by $\mathbf{w} \cdot \mathbf{x} + b = 0$. For any point $\mathbf{x}$, the signed distance to the hyperplane is given by $\frac{\mathbf{w} \cdot \mathbf{x} + b}{\|\mathbf{w}\|}$.

Let $\mathbf{x}_+$ be a support vector from class $+1$ and $\mathbf{x}_-$ be a support vector from class $-1$. The margin hyperplanes are defined as:
* For class $+1$: $\mathbf{w} \cdot \mathbf{x}_+ + b = 1$
* For class $-1$: $\mathbf{w} \cdot \mathbf{x}_- + b = -1$

The distance from $\mathbf{x}_+$ to the separating hyperplane is therefore $\frac{1}{\|\mathbf{w}\|}$, and by symmetry the same holds for $\mathbf{x}_-$. The total margin (the gap between the two margin hyperplanes) is:

$$\text{margin} = \frac{2}{\|\mathbf{w}\|}$$

Maximizing the margin is thus equivalent to minimizing $\|\mathbf{w}\|$, or equivalently $\frac{1}{2}\|\mathbf{w}\|^2$, subject to the classification constraints $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ for all $i$.
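The derivation above can be sanity-checked numerically. The sketch below uses a hand-picked toy $\mathbf{w}$ and $b$ (illustrative values, not a trained model) and confirms that a point on the $+1$ margin hyperplane lies at distance $1/\|\mathbf{w}\|$ from the separating hyperplane, so the full margin is $2/\|\mathbf{w}\|$:

```python
import numpy as np

# Toy hyperplane chosen by hand: w = (3, 4) so ||w|| = 5, b = -2.
w = np.array([3.0, 4.0])
b = -2.0

# This point satisfies w.x + b = 3 - 2 = 1, i.e. it sits on the +1 margin.
x_plus = np.array([1.0, 0.0])

# Signed distance of x_plus to the separating hyperplane w.x + b = 0.
distance = (w @ x_plus + b) / np.linalg.norm(w)
print(distance)                   # 0.2, i.e. 1/||w||

# Total margin between the two margin hyperplanes.
print(2 / np.linalg.norm(w))      # 0.4, i.e. 2/||w||
```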
Distinguish between Hard Margin SVM and Soft Margin SVM. When is each appropriate?
The primary distinction between Hard Margin and Soft Margin SVM lies in their handling of misclassifications and linearly inseparable data:

* Hard Margin SVM:
  * Objective: Seeks a separating hyperplane that perfectly separates the two classes without any misclassifications. It maximizes the margin by minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ subject to all data points being correctly classified and lying on or outside the margin.
  * Constraints: Requires that for every training sample $(\mathbf{x}_i, y_i)$, $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$.
  * When appropriate: Only when the data is strictly linearly separable and essentially noise-free; a single outlier can make the problem infeasible or shift the boundary drastically.

* Soft Margin SVM:
  * Objective: Tolerates some misclassifications and margin violations by introducing slack variables $\xi_i \ge 0$. It minimizes $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i$.
  * Constraints: $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$, with $\xi_i \ge 0$.
  * When appropriate: For real-world data that is noisy or not perfectly linearly separable. A smaller $C$ emphasizes a wider margin (more violations tolerated), while a larger $C$ emphasizes fewer misclassifications (smaller margin).
Explain the role of slack variables in Soft Margin SVM. How do they allow for misclassifications and handle non-linearly separable data?
In Soft Margin SVM, slack variables (denoted $\xi_i$, one for each data point $\mathbf{x}_i$) are introduced to relax the strict classification constraints of Hard Margin SVM. They quantify the degree of misclassification or margin violation for each data point.

Here's how they work:
* Measuring Constraint Violation:
  * If $\xi_i = 0$: the data point is correctly classified and lies on or outside the correct margin boundary. This is the ideal case.
  * If $0 < \xi_i \le 1$: the point violates the margin but is still on the correct side of the hyperplane.
  * If $\xi_i > 1$: the point is misclassified.
* Relaxed Constraints: The constraint becomes $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$ instead of $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$. This allows some points to violate the margin or even be misclassified, making the SVM robust to noise and capable of handling non-linearly separable data (in the original feature space).
* Penalty in the Objective: The SVM's objective function for Soft Margin now includes a penalty term $C \sum_{i=1}^{m} \xi_i$. This term penalizes larger slack variables, meaning that the model tries to minimize both the magnitude of $\|\mathbf{w}\|$ (to maximize the margin) and the sum of slack variables (to minimize misclassifications). The regularization parameter $C$ balances these two objectives: a larger $C$ puts more emphasis on minimizing misclassifications, potentially leading to a smaller margin, while a smaller $C$ prioritizes a wider margin, allowing more violations.
Describe the objective function of a Hard Margin SVM, including its constraints. Explain why it's formulated this way.
The Hard Margin SVM is designed for linearly separable datasets where no misclassifications are tolerated. Its objective is to find the hyperplane that maximizes the margin.

Objective Function:
Minimize $\frac{1}{2}\|\mathbf{w}\|^2$

Constraints:
Subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ for $i = 1, \ldots, m$.

Why it is formulated this way:
* The geometric margin is $\frac{2}{\|\mathbf{w}\|}$, so maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|$. Minimizing $\frac{1}{2}\|\mathbf{w}\|^2$ instead yields a differentiable, convex quadratic objective with the same minimizer (the factor $\frac{1}{2}$ simply cleans up the derivative).
* For a positive-class point ($y_i = +1$), the constraint becomes $\mathbf{w} \cdot \mathbf{x}_i + b \ge 1$.
* For a negative-class point ($y_i = -1$), the constraint becomes $\mathbf{w} \cdot \mathbf{x}_i + b \le -1$.
* Together, these constraints ensure every point lies on or outside the correct margin boundary, at a distance of at least $\frac{1}{\|\mathbf{w}\|}$ from the separating hyperplane.
Describe the objective function of a Soft Margin SVM, explaining the significance of the regularization parameter .
The Soft Margin SVM is designed to handle linearly inseparable data and outliers by allowing some misclassifications or margin violations.

Objective Function:
Minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i$

Constraints:
Subject to:
1. $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$ for $i = 1, \ldots, m$
2. $\xi_i \ge 0$

Explanation of Terms:
* $\frac{1}{2}\|\mathbf{w}\|^2$: This term is the same as in Hard Margin SVM, aiming to maximize the geometric margin.
* $\sum_{i=1}^{m} \xi_i$: This is the sum of slack variables. Each $\xi_i$ measures the extent to which the $i$-th data point violates the margin or is misclassified.
* $C$ (Regularization Parameter): This positive constant is crucial for balancing the two competing objectives:
  * Large $C$: A large value of $C$ assigns a high penalty to misclassifications and margin violations. The SVM will prioritize correctly classifying most (if not all) training points, even if it results in a smaller margin. This can lead to overfitting if the data is noisy.
  * Small $C$: A small value of $C$ means misclassifications are tolerated more readily. The SVM will prioritize finding a wider margin, even if it means misclassifying more training points. This can lead to underfitting if the margin is too wide and too many points are misclassified.

In essence, $C$ controls the trade-off between maximizing the margin (model simplicity) and minimizing the training error (model accuracy). It's a hyperparameter that is typically tuned using cross-validation.
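The effect of $C$ can be seen directly with scikit-learn. In this sketch (synthetic overlapping blobs and hand-picked $C$ values, chosen purely for illustration), the small-$C$ model tolerates more margin violations, so more training points end up inside its margin and become support vectors:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two slightly overlapping clusters: not perfectly linearly separable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

# Small C -> wide margin, many violations tolerated.
wide_margin = SVC(kernel="linear", C=0.01).fit(X, y)
# Large C -> narrow margin, violations penalized heavily.
narrow_margin = SVC(kernel="linear", C=100.0).fit(X, y)

# The tolerant model typically has more support vectors.
print(len(wide_margin.support_), len(narrow_margin.support_))
```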
Explain why the Primal Optimization Problem for SVM is typically formulated as a quadratic programming problem.
The Primal Optimization Problem for SVM is formulated as a quadratic programming (QP) problem due to the nature of its objective function and constraints.

* Quadratic Objective Function: The objective of an SVM is to minimize $\frac{1}{2}\|\mathbf{w}\|^2$ (or $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i$ for the Soft Margin case). This is a quadratic function of the weight vector $\mathbf{w}$. A quadratic objective function is characteristic of QP problems.

* Linear Constraints: The constraints for both Hard Margin and Soft Margin SVMs are linear inequalities in the variables $\mathbf{w}$, $b$, and $\xi_i$:
  * Hard Margin: $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$
  * Soft Margin: $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$

* Convexity: The objective $\frac{1}{2}\|\mathbf{w}\|^2$ is convex, and the linear constraints define a convex feasible region. This means that a unique global minimum exists for the optimization problem. Standard algorithms for quadratic programming can efficiently find this global optimum.

* Solvability: Quadratic programming is a well-understood field in optimization, and robust, efficient solvers are available to find the solution for such problems. This makes the SVM formulation computationally tractable.
What is the significance of moving from the Primal to the Dual Optimization Problem in SVM training?
Moving from the Primal to the Dual Optimization Problem is a crucial step in SVM training, especially for practical applications. Its significance stems from several key advantages:

* Computational Efficiency with High-Dimensional Data:
  * The primal problem's complexity depends on the dimensionality of the feature space (the number of features $d$).
  * The dual problem's complexity, however, depends on the number of training samples ($m$). When the dimensionality of the feature space is very high (or even infinite, after a kernel mapping), but the number of training samples is manageable, the dual problem becomes significantly more efficient to solve.

* Introduction of the Kernel Trick:
  * The most important advantage is that the dual formulation expresses the problem solely in terms of inner products (dot products) of the input data points, $\mathbf{x}_i \cdot \mathbf{x}_j$.
  * This allows for the direct application of the kernel trick, where these inner products can be replaced by a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$. This implicitly maps the data into a higher (potentially infinite) dimensional feature space without explicitly calculating the new feature vectors, making non-linear separation possible.

* Sparsity of Solution (Support Vectors):
  * The solution to the dual problem involves Lagrange multipliers ($\alpha_i$). Only a small number of these will be non-zero, corresponding to the support vectors. This leads to a sparse solution, meaning that the decision function relies only on a subset of the training data. This makes the model more interpretable and sometimes more memory-efficient during prediction.

* Convexity and Global Optimum: Both the primal and dual SVM problems are convex, guaranteeing that any local optimum found is also a global optimum.
Describe the general structure of a Lagrangian function for a constrained optimization problem.
For a general constrained optimization problem of the form:

Minimize $f(x)$
Subject to:
* $g_i(x) \le 0$ for $i = 1, \ldots, k$ (inequality constraints)
* $h_j(x) = 0$ for $j = 1, \ldots, l$ (equality constraints)

the Lagrangian function is:

$$L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{k} \lambda_i g_i(x) + \sum_{j=1}^{l} \mu_j h_j(x)$$

where:
* $f(x)$: The original objective function to be minimized.
* $g_i(x)$: The $i$-th inequality constraint, formulated as $g_i(x) \le 0$; its multiplier $\lambda_i$ must satisfy $\lambda_i \ge 0$.
* $h_j(x)$: The $j$-th equality constraint, formulated as $h_j(x) = 0$; its multiplier $\mu_j$ is unrestricted in sign.

The Lagrangian folds the constraints into the objective. Candidate optima are found by setting the partial derivatives of $L$ with respect to $x$, $\lambda$, and $\mu$ to zero (along with satisfying the KKT conditions for the inequality constraints).
Formulate the Lagrangian for the Hard Margin SVM primal problem.
The Hard Margin SVM primal problem is:
Minimize $\frac{1}{2}\|\mathbf{w}\|^2$
Subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ for $i = 1, \ldots, m$.

The constraint can be rewritten in standard form as $1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \le 0$. Let $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$ be the objective function and $g_i(\mathbf{w}, b) = 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$ be the inequality constraints.

Introducing Lagrange multipliers $\alpha_i \ge 0$ for each of the $m$ constraints, the Lagrangian $L(\mathbf{w}, b, \alpha)$ is:

$$L(\mathbf{w}, b, \alpha) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{m} \alpha_i \left[1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\right]$$

where $\mathbf{w}$ and $b$ are the primal variables, $\alpha = [\alpha_1, \ldots, \alpha_m]$ is the vector of Lagrange multipliers, and each $\alpha_i \ge 0$.
Explain how the Karush-Kuhn-Tucker (KKT) conditions are applied to the SVM Lagrangian to derive the dual problem.
The Karush-Kuhn-Tucker (KKT) conditions are a set of first-order necessary conditions for a solution in non-linear programming to be optimal, provided that some regularity conditions are satisfied (which they are for SVM). They are crucial for moving from the primal to the dual problem.

Given the Hard Margin SVM Lagrangian:

$$L(\mathbf{w}, b, \alpha) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{m} \alpha_i \left[1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\right]$$

the KKT conditions are:

1. Stationarity: Set the partial derivatives of $L$ with respect to the primal variables to zero:
   * $\frac{\partial L}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i$
   * $\frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{m} \alpha_i y_i = 0$
2. Primal Feasibility: $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ for all $i$.
3. Dual Feasibility: $\alpha_i \ge 0$ for all $i$.
4. Complementary Slackness: This is a critical condition linking primal and dual variables: $\alpha_i \left[1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\right] = 0$ for all $i$.
   * If $\alpha_i > 0$, then $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$. These points are the support vectors, lying exactly on the margin boundary.
   * If $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 1$, then $\alpha_i = 0$.

Substituting the stationarity results ($\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$) back into the Lagrangian eliminates $\mathbf{w}$ and $b$, yielding the dual problem: maximize $\sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$ subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$.
Discuss the implications of the KKT slackness condition for support vectors in SVM.
The KKT complementary slackness condition is $\alpha_i \left[1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\right] = 0$ for each training point $\mathbf{x}_i$. This condition has profound implications for identifying and understanding support vectors:

* Definition of Support Vectors: The condition implies that for any data point $\mathbf{x}_i$:
  * If $\alpha_i > 0$, then the term in the brackets must be zero: $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$, i.e. the point lies exactly on a margin boundary.
  * If $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 1$ (meaning the point is correctly classified and lies outside the margin), then the term in the brackets is negative. For the product to be zero, $\alpha_i$ must be 0.

* Sparsity of the Solution: This implies that only the support vectors (the points closest to or violating the margin) will have non-zero Lagrange multipliers ($\alpha_i > 0$). All other data points, which lie well outside the margin, have $\alpha_i = 0$ and thus do not contribute to the definition of the hyperplane. This leads to a sparse solution, where the decision boundary is determined by only a subset of the training data.

* Reduced Computational Complexity: During prediction, the decision function only requires computations involving the support vectors (since $\mathbf{w}$ itself is a sum over points with $\alpha_i > 0$). This significantly reduces the computational burden and memory requirements for making predictions on new data, especially in large datasets.
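This sparsity is easy to observe in scikit-learn (the synthetic dataset below is an illustrative choice): `SVC` exposes the support-vector indices and the dual coefficients $\alpha_i y_i$, from which $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ can be recovered using the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=42)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Far fewer support vectors than training samples -> sparse solution.
print(f"{len(clf.support_)} support vectors out of {len(X)} samples")

# Recover w from the dual solution: w = sum_i alpha_i y_i x_i,
# where dual_coef_ stores alpha_i * y_i for the support vectors only.
w_from_dual = clf.dual_coef_ @ X[clf.support_]
print(np.allclose(w_from_dual, clf.coef_))  # True
```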
Explain the "kernel trick" in the context of SVMs. Why is it necessary?
The kernel trick is a method used in SVMs (and other machine learning algorithms) to handle non-linearly separable data by implicitly mapping the data into a higher-dimensional feature space without explicitly calculating the coordinates of the data in that space.

How it works:
1. Non-linear data: Many real-world datasets are not linearly separable in their original input space. A linear decision boundary would perform poorly.
2. Feature mapping: One way to handle this is to transform the original input features into a higher-dimensional feature space using a non-linear mapping function, say $\phi$. In this higher-dimensional space, the data might become linearly separable.
3. Inner products in dual form: The dual formulation of SVMs only involves data points through their inner products (dot products), i.e., $\mathbf{x}_i \cdot \mathbf{x}_j$. If we map the data to a higher dimension, these inner products become $\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$.
4. Kernel function: A kernel function $K$ is a function that directly computes this inner product in the high-dimensional space, i.e., $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$, without ever explicitly calculating $\phi(\mathbf{x}_i)$ or $\phi(\mathbf{x}_j)$. This avoids the computational burden and memory requirements of working in the high-dimensional space.

Why it is necessary:
* Handling Non-linear Separability: It allows SVMs to find non-linear decision boundaries in the original input space, making them powerful for complex datasets that are not linearly separable.
* Computational Efficiency: Explicitly computing $\phi(\mathbf{x})$ can be computationally very expensive or even impossible if the target feature space is infinite-dimensional. The kernel trick provides an efficient way to work in these high-dimensional spaces implicitly.
* Avoids "Curse of Dimensionality": By not explicitly constructing the high-dimensional feature vectors, it helps mitigate the computational and statistical challenges associated with high dimensionality.
Describe at least three common types of kernel functions used in SVMs (e.g., Linear, Polynomial, RBF) and briefly explain when each might be preferred.
Here are three common types of kernel functions used in SVMs:

1. Linear Kernel:
   * Formula: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$
   * Description: This is the simplest kernel, representing the standard dot product in the original feature space. It means the SVM operates essentially as a linear classifier.
   * When preferred: When the data is believed to be linearly separable or nearly so, or when the number of features is very large compared to the number of samples (e.g., text classification), where a non-linear mapping adds little.

2. Polynomial Kernel:
   * Formula: $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + c)^d$, with degree $d$ and constant $c \ge 0$
   * Description: Implicitly maps the data into the space of all monomials of the input features up to degree $d$, capturing interactions between features.
   * When preferred: When feature interactions are expected to matter; low degrees (e.g., $d = 2$ or $3$) are most common, as higher degrees risk overfitting and numerical instability.

3. RBF (Gaussian) Kernel:
   * Formula: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$
   * Description: Implicitly maps the data into an infinite-dimensional feature space; similarity decays with distance between points.
   * When preferred: As a general-purpose default when there is no strong prior knowledge about the data; it performs well across a wide range of scenarios. Requires careful tuning of the $\gamma$ parameter: too small a $\gamma$ means too smooth a boundary (underfitting), too large a $\gamma$ means too wiggly a boundary (overfitting).
How does a kernel function implicitly map data into a higher-dimensional feature space without explicit computation?
A kernel function performs an implicit mapping into a higher-dimensional feature space through a clever mathematical trick that leverages the nature of the dual SVM formulation.

Here's the breakdown:

1. The Need for Feature Mapping: Often, data is not linearly separable in its original input space $\mathcal{X}$. To find a linear decision boundary, we might map the data to a higher-dimensional feature space $\mathcal{F}$ using a non-linear transformation $\phi: \mathcal{X} \to \mathcal{F}$. In $\mathcal{F}$, the data might become linearly separable.

2. SVM Dual Problem's Dependence on Inner Products: The critical insight is that in the dual formulation of SVM, the optimization problem (and the resulting decision function) only depends on the inner products (dot products) between data points. Specifically, terms like $\mathbf{x}_i \cdot \mathbf{x}_j$ (in the original space) or $\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$ (in the feature space) appear.

3. The Kernel Function as an Inner Product Calculator: A kernel function is defined such that it directly computes this inner product in the high-dimensional space without ever needing to explicitly calculate the $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$ feature vectors. That is, $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$.
   * For example, consider $\mathbf{x} = (x_1, x_2)$ and the mapping $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. Explicitly calculating $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$ and then taking their dot product can be tedious.
   * However, the polynomial kernel of degree 2 is $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^2$. If you expand this, you will find it equals $\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$.

Key Point: The kernel function avoids the explicit computation and storage of potentially very high (even infinite) dimensional feature vectors. Instead, it provides a shortcut to calculate the required inner product directly from the original low-dimensional input vectors. This allows SVMs to find complex non-linear decision boundaries efficiently and without encountering the 'curse of dimensionality' associated with explicit feature mapping.
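The degree-2 identity described above can be verified numerically for the standard mapping $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ (the sample vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-D input.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

explicit = phi(x) @ phi(z)  # dot product computed in the 3-D feature space
kernel = (x @ z) ** 2       # same value computed in the 2-D input space

print(explicit, kernel)     # both equal (1*2 + 3*(-1))^2 = 1.0
```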
What is Mercer's theorem, and why is it important for valid kernel functions?
Mercer's Theorem (or Mercer's condition) is a fundamental theorem in the context of kernel methods, particularly for Support Vector Machines.

Statement: Mercer's theorem states that a continuous, symmetric kernel function $K(\mathbf{x}, \mathbf{z})$ corresponds to an inner product in some (possibly infinite-dimensional) feature space, i.e. $K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{z})$ for some mapping $\phi$, if and only if, for any square-integrable function $g$,

$$\int \int K(\mathbf{x}, \mathbf{z})\, g(\mathbf{x})\, g(\mathbf{z})\, d\mathbf{x}\, d\mathbf{z} \ge 0.$$

Equivalently, for any finite set of points, the kernel (Gram) matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ must be positive semi-definite.

Why it is important:
* Guaranteeing a Valid Feature Space: Mercer's condition guarantees the existence of a feature map $\phi$ such that $K$ computes a genuine inner product. Without this guarantee, we wouldn't be sure that the kernel function is actually performing a valid dot product in some feature space.
* Ensuring Convexity: The positive semi-definiteness of the kernel matrix is crucial for the convexity of the dual optimization problem in SVMs. A convex optimization problem guarantees that any local minimum found is also the global minimum, ensuring that the SVM training process will converge to a unique, optimal solution.
* Physical Interpretation: If a kernel function does not satisfy Mercer's condition, it means it cannot be interpreted as an inner product in any feature space. Such a kernel would make the optimization problem non-convex, potentially leading to multiple local optima or no stable solution, and the geometric interpretation of maximizing margin in a feature space would break down. Therefore, all 'valid' kernel functions used in SVMs must implicitly satisfy Mercer's theorem.
Briefly outline the general steps involved in training an SVM.
Training an SVM involves a sequence of steps, primarily focused on solving the optimization problem to find the optimal separating hyperplane. Here's a general outline:

1. Data Preparation:
   * Feature Scaling: Normalize or standardize the input features (e.g., using Min-Max scaling or Z-score standardization). This is crucial because SVMs are sensitive to feature scales, as they rely on distance calculations.
   * Data Splitting: Divide the dataset into training, validation (optional but recommended for hyperparameter tuning), and test sets.

2. Model Selection and Hyperparameter Tuning:
   * Choose Kernel: Select an appropriate kernel function (e.g., Linear, Polynomial, RBF). RBF is a common starting point.
   * Tune Hyperparameters: For Soft Margin SVM, the regularization parameter $C$ needs to be tuned. Non-linear kernels add kernel-specific hyperparameters (e.g., $\gamma$ for RBF, degree for polynomial), typically tuned via grid search with cross-validation.

3. Formulate the Optimization Problem: Set up the (dual) quadratic programming problem, with kernel values $K(\mathbf{x}_i, \mathbf{x}_j)$ in place of the raw inner products.

4. Solve the Optimization Problem: Run a QP solver (or a specialized algorithm such as SMO) to obtain the optimal Lagrange multipliers $\alpha_i^*$.

5. Recover the Model: Identify the support vectors (points with $\alpha_i^* > 0$), and compute the bias $b^*$ from any support vector $\mathbf{x}_k$ using the margin condition $y_k(\mathbf{w}^* \cdot \mathbf{x}_k + b^*) = 1$.

6. Model Evaluation:
   * Evaluate the trained SVM's performance on the unseen test set using metrics like accuracy, precision, recall, F1-score, etc.
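The steps above map naturally onto a scikit-learn pipeline. In this sketch, the dataset and parameter grid are illustrative assumptions; fitting the `GridSearchCV` object covers steps 3-5 internally (scikit-learn's `SVC` solves the dual problem via an SMO-style algorithm):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 1. Data preparation: split, and scale inside the pipeline.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 2. Model selection: RBF kernel; tune C and gamma by cross-validation.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10],
                           "svc__gamma": ["scale", 0.01]}, cv=5)

# 3-5. Fitting formulates and solves the dual QP, recovering the model.
grid.fit(X_train, y_train)

# 6. Evaluate on the held-out test set.
print(f"test accuracy: {grid.score(X_test, y_test):.3f}")
```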
Explain how the dual problem's solution relates to the weight vector and bias of the separating hyperplane.
The dual problem's solution provides the optimal Lagrange multipliers $\alpha_i^*$ (and, for Soft Margin SVM, implicitly the active slack constraints). These values are directly used to determine the optimal weight vector $\mathbf{w}^*$ and bias $b^*$ of the separating hyperplane.

1. Relation to the Weight Vector $\mathbf{w}^*$:
From the KKT conditions (specifically, the derivative of the Lagrangian with respect to $\mathbf{w}$ set to zero), we established the relationship:

$$\mathbf{w}^* = \sum_{i=1}^{m} \alpha_i^* y_i \mathbf{x}_i$$

This equation shows that the optimal weight vector is a linear combination of the training data points $\mathbf{x}_i$, weighted by their respective Lagrange multipliers $\alpha_i^*$ and class labels $y_i$. Since $\alpha_i^* = 0$ for all non-support vectors, the sum effectively runs only over the support vectors.

2. Relation to the Bias $b^*$:
For any support vector $\mathbf{x}_k$ (a point with $\alpha_k^* > 0$), complementary slackness gives $y_k(\mathbf{w}^* \cdot \mathbf{x}_k + b^*) = 1$. Multiplying both sides by $y_k$ (and using $y_k^2 = 1$) gives $\mathbf{w}^* \cdot \mathbf{x}_k + b^* = y_k$, so:

$$b^* = y_k - \mathbf{w}^* \cdot \mathbf{x}_k$$

In practice, $b^*$ is often computed as an average over all support vectors for numerical stability.

Together, the dual solution ($\alpha_i^*$) yields everything ($\mathbf{w}^*$ and $b^*$) needed to define the optimal separating hyperplane for the SVM classifier.
Discuss the computational advantages of solving the dual problem over the primal problem, especially when using the kernel trick.
Solving the dual problem often offers significant computational advantages over the primal problem in SVM training, particularly with the kernel trick:

1. Kernel Trick Integration:
   * Primal: The primal problem explicitly involves $\mathbf{w}$ and the feature vectors $\mathbf{x}_i$. To use a kernel, one would theoretically need to transform each $\mathbf{x}_i$ into $\phi(\mathbf{x}_i)$ and then solve in the high-dimensional feature space. This explicit transformation can be computationally prohibitive or impossible if the feature space is very high- or infinite-dimensional.
   * Dual: The dual problem is formulated entirely in terms of inner products $\mathbf{x}_i \cdot \mathbf{x}_j$. This is where the kernel trick shines: $\mathbf{x}_i \cdot \mathbf{x}_j$ can be replaced by $K(\mathbf{x}_i, \mathbf{x}_j)$, which calculates the inner product in the high-dimensional space without ever explicitly computing $\phi(\mathbf{x}_i)$. This makes non-linear classification with high-dimensional mappings tractable.

2. Dimensionality vs. Number of Samples:
   * Primal: The complexity of the primal problem depends on the dimensionality of the feature space ($d$): there is one weight component per feature.
   * Dual: The complexity of the dual problem depends on the number of training samples ($m$): there is one multiplier $\alpha_i$ per sample. If $d \gg m$ (which is common in many applications, especially after implicit mapping to high dimensions), the dual problem is significantly faster to solve.

3. Sparsity of Solution (Support Vectors):
   * The dual formulation yields the Lagrange multipliers $\alpha_i$. Only a small fraction of these will be non-zero, corresponding to the support vectors. This means the decision boundary is defined by a sparse set of training points.
   * This sparsity reduces the complexity of the final model: for prediction, only computations involving the support vectors are needed, making inference faster and requiring less memory.

4. Convexity Guarantees: Both primal and dual SVM problems are convex, ensuring that efficient quadratic programming solvers can find a global optimum. However, the dual form often presents a more numerically stable and well-conditioned problem, especially with the introduction of kernels.