1. What is the primary geometric goal of a Support Vector Machine (SVM) for classification?
Geometric interpretation of classification margins
Easy
A.To find the line that passes through the most data points
B.To maximize the margin between classes
C.To connect all data points of the same class
D.To minimize the number of support vectors
Correct Answer: To maximize the margin between classes
Explanation:
The core idea of SVM is to find the optimal hyperplane that has the largest possible distance, or margin, to the nearest data points of any class, which improves the model's generalization.
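The margin-maximization idea is easy to check numerically. Below is a minimal sketch (not part of the original quiz), assuming scikit-learn and NumPy are available; the toy dataset and the very large C value, which approximates a hard margin, are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in 2D (toy data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates a hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)  # geometric margin width = 2 / ||w||
print(margin)
```

For this dataset the optimal boundary is the vertical line x = 1.5, so the margin width comes out to about 3.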
2. In the context of SVM, what are "support vectors"?
Geometric interpretation of classification margins
Easy
A.The data points that are misclassified
B.All the data points in the training set
C.The data points that lie on or closest to the margin boundaries
D.The data points that are furthest from the decision boundary
Correct Answer: The data points that lie on or closest to the margin boundaries
Explanation:
Support vectors are the critical data points that "support" or define the position of the hyperplane. They are the points that are closest to the decision boundary.
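To see that only a few points define the boundary, one can inspect a fitted model's support vectors. A small sketch assuming scikit-learn; the 1-D toy data is invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Most points sit far from the class boundary; only the closest ones matter.
X = np.array([[0.0], [0.5], [1.0], [4.0], [4.5], [5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only the points nearest the boundary are retained as support vectors.
print(clf.support_vectors_)
print(clf.n_support_)  # number of support vectors per class
```

Here only x = 1.0 and x = 4.0, the points closest to the separating threshold, end up as support vectors.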
3. What is the decision boundary created by a linear SVM called?
Geometric interpretation of classification margins
Easy
A.A centroid
B.A hyperplane
C.A regression line
D.A decision tree
Correct Answer: A hyperplane
Explanation:
A linear SVM separates data using a flat boundary. In two dimensions this is a line, in three dimensions a plane, and in higher dimensions it is called a hyperplane.
4. A Hard Margin SVM is suitable only when the training data is...
Hard margin and soft margin SVM
Easy
A.Perfectly linearly separable
B.Clustered into a single group
C.Very large
D.Not linearly separable
Correct Answer: Perfectly linearly separable
Explanation:
A Hard Margin SVM requires that all data points are correctly classified with a margin, which is only possible if the data is perfectly separable by a hyperplane without any errors.
5. What is the main advantage of a Soft Margin SVM over a Hard Margin SVM?
Hard margin and soft margin SVM
Easy
A.It always finds a wider margin
B.It can handle data that is not linearly separable and is more robust to outliers
C.It only works for non-linear data
D.It is computationally faster
Correct Answer: It can handle data that is not linearly separable and is more robust to outliers
Explanation:
Soft Margin SVM introduces slack variables to allow for some misclassifications, making it flexible enough to handle overlapping classes and outliers which would prevent a hard-margin SVM from finding a solution.
6. In a Soft Margin SVM, what is the role of the slack variable ξᵢ?
Hard margin and soft margin SVM
Easy
A.It defines the width of the margin
B.It is the weight vector of the hyperplane
C.It measures how much a data point violates the margin
D.It is a random noise parameter
Correct Answer: It measures how much a data point violates the margin
Explanation:
A slack variable ξᵢ is introduced for each data point xᵢ. If ξᵢ > 0, it measures the degree to which the point is either inside the margin or on the wrong side of the hyperplane.
7. What does the hyperparameter C control in a Soft Margin SVM?
Hard margin and soft margin SVM
Easy
A.The learning rate of the optimizer
B.The type of kernel to be used
C.The number of dimensions in the feature space
D.The trade-off between maximizing the margin and minimizing classification errors
Correct Answer: The trade-off between maximizing the margin and minimizing classification errors
Explanation:
A small C value creates a wider margin but allows more margin violations (a "softer" margin). A large C value penalizes violations more heavily, resulting in a narrower margin (closer to a "hard" margin).
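The trade-off controlled by C can be observed by counting support vectors at two extremes. A hedged sketch with scikit-learn; the overlapping Gaussian blobs and the two C values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not perfectly separable.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

n_sv = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_sv[C] = clf.n_support_.sum()

# A small C tolerates many margin violations -> wider margin, more points
# end up inside the margin, so more support vectors.
print(n_sv)
```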
8. What is the main purpose of using the Lagrangian formulation in the context of SVM optimization?
Lagrangian formulation
Easy
A.To visualize the data in 2D
B.To select the best kernel function automatically
C.To convert a constrained optimization problem into a form that is easier to solve
D.To increase the number of features
Correct Answer: To convert a constrained optimization problem into a form that is easier to solve
Explanation:
The Lagrangian method combines the objective function and its constraints into a single function. This allows us to solve the problem by finding the saddle point of this function, often by moving to the dual problem.
9. In the Lagrangian formulation of SVM, what are the variables αᵢ called?
Lagrangian formulation
Easy
A.Slack variables
B.Lagrange multipliers
C.Bias terms
D.Weight vectors
Correct Answer: Lagrange multipliers
Explanation:
The variables αᵢ, one for each data point, are introduced to incorporate the classification constraints into the objective function and are known as Lagrange multipliers.
10. According to the Karush-Kuhn-Tucker (KKT) conditions for SVM, if a data point is NOT a support vector, its corresponding Lagrange multiplier will be:
Lagrangian formulation
Easy
A.αᵢ > 0
B.αᵢ = 0
C.αᵢ = C
D.αᵢ < 0
Correct Answer: αᵢ = 0
Explanation:
A key result of the KKT conditions is that only support vectors can have non-zero Lagrange multipliers. For all other points that are correctly classified and lie outside the margin, αᵢ = 0.
11. The primal optimization problem for a hard-margin SVM aims to minimize which quantity?
Primal and dual optimization problems
Easy
A.The number of misclassified points
B.The bias term, b
C.The norm of the weight vector, ‖w‖
D.The sum of the distances from the margin
Correct Answer: The norm of the weight vector, ‖w‖
Explanation:
The primal problem is formulated as minimizing ½‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1 for all i. Minimizing the norm of the weight vector, ‖w‖, is mathematically equivalent to maximizing the margin, which is 2/‖w‖.
12. A primary motivation for solving the dual problem instead of the primal problem in SVMs is that it enables the use of:
Primal and dual optimization problems
Easy
A.Gradient descent
B.Feature scaling
C.Regularization
D.The kernel trick
Correct Answer: The kernel trick
Explanation:
The dual formulation expresses the problem in terms of dot products of the input data points. This structure allows us to replace the dot product with a kernel function, which is the essence of the kernel trick for non-linear classification.
13. What is the fundamental idea behind the "kernel trick"?
Kernel trick and kernel functions
Easy
A.To reduce the dimensionality of the data before classification
B.To convert a classification problem into a regression problem
C.To compute dot products in a high-dimensional feature space without explicitly transforming the data
D.To randomly guess the support vectors to speed up training
Correct Answer: To compute dot products in a high-dimensional feature space without explicitly transforming the data
Explanation:
The kernel trick is a clever mathematical technique that uses a kernel function to calculate the result of a dot product in a high-dimensional space, avoiding the computationally expensive step of explicit data transformation.
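The kernel trick can be verified directly for the degree-2 homogeneous polynomial kernel in 2-D, whose explicit feature map is known in closed form. A small NumPy sketch (the vectors x and z are arbitrary examples):

```python
import numpy as np

# Explicit feature map for the degree-2 homogeneous polynomial kernel in 2D:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that phi(x).phi(z) = (x.z)^2
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = phi(x) @ phi(z)  # dot product after explicit mapping
kernel = (x @ z) ** 2       # kernel trick: same number, no mapping needed
print(explicit, kernel)     # both equal 16.0
```

The kernel evaluates the high-dimensional dot product at the cost of one dot product in the original space.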
14. Which of the following is a widely used kernel function in SVMs for handling non-linear data?
Kernel trick and kernel functions
Easy
A.Radial Basis Function (RBF) kernel
B.Mean Squared Error (MSE) kernel
C.Stochastic Gradient Descent (SGD) kernel
D.Cross-Entropy kernel
Correct Answer: Radial Basis Function (RBF) kernel
Explanation:
The Radial Basis Function (RBF) kernel is a popular default choice for SVMs because of its flexibility in modeling complex, non-linear relationships. Other common kernels include Linear, Polynomial, and Sigmoid.
15. Using a linear kernel in an SVM is equivalent to...
Kernel trick and kernel functions
Easy
A.Always misclassifying half the data
B.Using a very complex RBF kernel
C.Applying no non-linear transformation and finding a linear separator in the original feature space
D.Projecting the data into an infinite-dimensional space
Correct Answer: Applying no non-linear transformation and finding a linear separator in the original feature space
Explanation:
The linear kernel, defined as K(x, z) = x·z, is simply the dot product in the original feature space. This means the SVM operates without any non-linear mapping, creating a linear decision boundary.
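One way to see this equivalence in practice: an SVM given a precomputed Gram matrix of plain dot products makes the same predictions as one using kernel='linear'. A sketch assuming scikit-learn; the toy data is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 2.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

# The linear kernel is just the Gram matrix of dot products
# in the original feature space.
lin = SVC(kernel="linear", C=1.0).fit(X, y)
pre = SVC(kernel="precomputed", C=1.0).fit(X @ X.T, y)

print(lin.predict(X))
print(pre.predict(X @ X.T))  # identical decisions
```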
16. When is it most appropriate to use a non-linear kernel like the Polynomial or RBF kernel?
Kernel trick and kernel functions
Easy
A.When you have a very small number of features
B.When the decision boundary between the classes is likely non-linear
C.When the data is perfectly linearly separable
D.When you want the fastest possible training time
Correct Answer: When the decision boundary between the classes is likely non-linear
Explanation:
Non-linear kernels are used to map data into a higher-dimensional space where a linear separator might exist. This is necessary when the classes cannot be separated by a simple line or plane in their original space.
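The classic XOR pattern illustrates when a non-linear kernel is needed. A sketch assuming scikit-learn; the gamma and C values are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# XOR: no single line can separate these two classes.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

linear = SVC(kernel="linear", C=10.0).fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

print(linear.score(X, y))  # < 1.0: a linear boundary cannot fit XOR
print(rbf.score(X, y))     # 1.0: the RBF kernel separates it
```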
17. The task of training an SVM is fundamentally what type of mathematical problem?
Optimization perspective of SVM training
Easy
A.A convex quadratic programming problem
B.A non-convex optimization problem
C.A linear programming problem
D.A system of linear equations
Correct Answer: A convex quadratic programming problem
Explanation:
SVM training involves minimizing a quadratic objective function (½‖w‖²) subject to linear inequality constraints. This specific type of problem is known as a convex quadratic program, which guarantees a unique global minimum.
18. The objective function for a hard-margin SVM is to minimize ½‖w‖². This is equivalent to maximizing what geometric quantity?
Optimization perspective of SVM training
Easy
A.The angle between the support vectors
B.The distance to the origin
C.The margin, which is proportional to 1/‖w‖
D.The number of support vectors
Correct Answer: The margin, which is proportional to 1/‖w‖
Explanation:
The margin of an SVM is geometrically defined as 2/‖w‖. Therefore, minimizing ‖w‖ (or its squared value for mathematical convenience) is the same as maximizing the margin.
19. In the dual formulation of SVM, the final decision function for a new data point x depends on...
Primal and dual optimization problems
Easy
A.The dot product of x with only the support vectors
B.The average of all feature vectors
C.All the data points in the training set
D.Only the bias term
Correct Answer: The dot product of x with only the support vectors
Explanation:
Because the Lagrange multipliers are zero for non-support vectors, the sum in the decision function only needs to be computed over the support vectors. This makes prediction efficient, as it doesn't depend on the entire dataset.
20. For a 2D dataset, a linear SVM's margin is visually represented by the region between two...
Geometric interpretation of classification margins
Easy
A.Points
B.Concentric squares
C.Circles
D.Parallel lines
Correct Answer: Parallel lines
Explanation:
In a 2D space, the decision boundary is a line. The margin is the region between two other lines that are parallel to the decision boundary and pass through the closest points of each class (the support vectors).
21. In a linearly separable dataset, if we scale all feature vectors by a factor of 2 (i.e., xᵢ → 2xᵢ), how does the maximal geometric margin of a hard-margin SVM change?
Geometric interpretation of classification margins
Medium
A.It is halved.
B.It is squared.
C.It remains unchanged.
D.It is doubled.
Correct Answer: It is doubled.
Explanation:
The geometric margin is given by 2/‖w‖. The optimal hyperplane is defined by w·x + b = 0. When the data points are scaled to 2xᵢ, the optimal weight vector scales to w/2 (with b unchanged) to preserve the same separating decision boundary. The new margin becomes 2/‖w/2‖ = 4/‖w‖, which is double the original margin.
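This doubling can be confirmed numerically by fitting a near-hard-margin SVM before and after scaling. A sketch assuming scikit-learn; the dataset and the C=1e6 hard-margin approximation are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])

def hard_margin_width(X, y):
    # Very large C approximates the hard-margin solution.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    return 2.0 / np.linalg.norm(clf.coef_)

m1 = hard_margin_width(X, y)
m2 = hard_margin_width(2 * X, y)  # scale every feature vector by 2
print(m1, m2)                     # m2 is twice m1
```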
22. In a soft-margin SVM, what is the effect of choosing a very large value for the hyperparameter C?
Hard margin and soft margin SVM
Medium
A.It reduces the number of support vectors to zero.
B.It leads to a narrower margin and penalizes margin violations more heavily, behaving more like a hard-margin SVM.
C.It makes the decision boundary completely linear, regardless of the kernel used.
D.It leads to a wider margin and allows more margin violations.
Correct Answer: It leads to a narrower margin and penalizes margin violations more heavily, behaving more like a hard-margin SVM.
Explanation:
The hyperparameter C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error. A large C assigns a high penalty to misclassified points and points within the margin, forcing the optimizer to find a solution with fewer margin violations, which typically results in a narrower margin, similar to a hard-margin SVM.
23. What is the primary motivation for solving the dual optimization problem of an SVM instead of the primal problem?
Primal and dual optimization problems
Medium
A.The primal problem is not a convex optimization problem, while the dual is.
B.The dual formulation allows the use of the kernel trick to handle non-linearly separable data.
C.The dual problem's objective function is simpler to differentiate.
D.The dual problem always has fewer constraints than the primal.
Correct Answer: The dual formulation allows the use of the kernel trick to handle non-linearly separable data.
Explanation:
The dual formulation expresses the optimization problem in terms of dot products of the input feature vectors (xᵢ·xⱼ). This structure is key because the kernel trick works by replacing this dot product with a kernel function K(xᵢ, xⱼ) = φ(xᵢ)·φ(xⱼ), allowing SVMs to efficiently learn non-linear decision boundaries in high-dimensional feature spaces without explicitly computing the transformations.
24. In the context of the SVM dual problem, the Karush-Kuhn-Tucker (KKT) conditions imply that for a data point that is NOT a support vector, its corresponding Lagrange multiplier must be:
Lagrangian formulation
Medium
A.αᵢ > 0
B.αᵢ = 0
C.αᵢ = C
D.αᵢ < 0
Correct Answer: αᵢ = 0
Explanation:
The KKT conditions for SVMs include the complementary slackness condition: αᵢ[yᵢ(w·xᵢ + b) − 1] = 0. A point that is not a support vector is correctly classified and lies outside the margin, so yᵢ(w·xᵢ + b) − 1 > 0. For the product to be zero, its corresponding Lagrange multiplier αᵢ must be zero.
25. Consider a polynomial kernel K(x, z) = (x·z + c)^d. What does the parameter d control?
Kernel trick and kernel functions
Medium
A.The width of the margin.
B.The radial influence of a single training example.
C.The degree of the polynomial in the higher-dimensional feature space, influencing the complexity of the decision boundary.
D.The penalty for misclassification.
Correct Answer: The degree of the polynomial in the higher-dimensional feature space, influencing the complexity of the decision boundary.
Explanation:
The parameter d in the polynomial kernel represents the degree of the polynomial. A higher value of d corresponds to a more complex decision boundary, which can fit more intricate patterns in the data but also has a higher risk of overfitting.
26. The primal optimization problem for a hard-margin SVM is to minimize ½‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1. This type of problem is best classified as:
Optimization perspective of SVM training
Medium
A.Linear Programming (LP)
B.Integer Programming (IP)
C.Unconstrained Optimization
D.Quadratic Programming (QP)
Correct Answer: Quadratic Programming (QP)
Explanation:
This is a Quadratic Programming (QP) problem because the objective function (½‖w‖²) is a quadratic function of the variables (the components of w), and all the constraints (yᵢ(w·xᵢ + b) ≥ 1) are linear with respect to the optimization variables w and b.
27. Which of the following statements correctly describes the support vectors in a hard-margin linear SVM?
Geometric interpretation of classification margins
Medium
A.They are the data points that lie exactly on the margin boundaries.
B.They are the data points that are misclassified by the hyperplane.
C.They are all the data points in the training set.
D.They are the data points furthest away from the decision boundary.
Correct Answer: They are the data points that lie exactly on the margin boundaries.
Explanation:
In a hard-margin SVM, the support vectors are the critical data points that lie precisely on the margin hyperplanes (i.e., where yᵢ(w·xᵢ + b) = 1). These points alone define the position and orientation of the optimal separating hyperplane.
28. In a soft-margin SVM, a data point xᵢ has a corresponding slack variable ξᵢ > 1. What can you conclude about this point?
Hard margin and soft margin SVM
Medium
A.The point lies on the correct side of the hyperplane but inside the margin.
B.The point is misclassified (on the wrong side of the hyperplane).
C.The point lies exactly on the decision boundary.
D.The point is correctly classified and outside the margin.
Correct Answer: The point is misclassified (on the wrong side of the hyperplane).
Explanation:
The constraint for the soft-margin SVM is yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ. A point is correctly classified if yᵢ(w·xᵢ + b) > 0. If ξᵢ > 1, then 1 − ξᵢ < 0, so the constraint can be satisfied even when yᵢ(w·xᵢ + b) is negative, meaning the point is on the wrong side of the hyperplane. At the optimum, ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)), so ξᵢ > 1 always indicates a misclassified point.
29. The objective function of the SVM dual problem is W(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼyᵢyⱼ(xᵢ·xⱼ). What do the variables αᵢ represent?
Lagrangian formulation
Medium
A.The bias term of the hyperplane.
B.The slack variables for each data point.
C.The components of the weight vector .
D.The Lagrange multipliers associated with the margin constraints.
Correct Answer: The Lagrange multipliers associated with the margin constraints.
Explanation:
The dual problem is derived using the method of Lagrange multipliers. Each αᵢ is a Lagrange multiplier introduced for the margin constraint corresponding to the data point xᵢ. Their optimal values determine which points are support vectors.
30. The Radial Basis Function (RBF) kernel is given by K(x, z) = exp(−γ‖x − z‖²). What is the effect of a very small γ value?
Kernel trick and kernel functions
Medium
A.It forces all data points to become support vectors.
B.It makes the decision boundary smoother and less complex, behaving like a linear classifier.
C.It creates a very complex, high-variance decision boundary that overfits the data.
D.It has no effect on the model's complexity.
Correct Answer: It makes the decision boundary smoother and less complex, behaving like a linear classifier.
Explanation:
The parameter γ defines how much influence a single training example has. A small γ means that the influence is large and far-reaching, resulting in a smoother, less complex decision boundary. As γ approaches 0, the RBF kernel SVM behaves similarly to a linear SVM.
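The far-reaching influence of a small γ is visible in the kernel values themselves. A small NumPy sketch (the points and γ values are chosen arbitrarily):

```python
import numpy as np

def rbf(x, z, gamma):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.0])
z = np.array([3.0, 4.0])  # distance 5 from x

# Small gamma: even distant points keep a kernel value near 1
# (far-reaching influence, smooth boundary).
# Large gamma: influence dies off almost immediately.
print(rbf(x, z, gamma=0.001))  # close to 1
print(rbf(x, z, gamma=10.0))   # close to 0
```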
31. In the SVM dual formulation, the weight vector w can be expressed as a linear combination of which data points?
Primal and dual optimization problems
Medium
A.Only the data points that are misclassified.
B.Only the data points that are not support vectors.
C.Only the support vectors.
D.All data points in the training set.
Correct Answer: Only the support vectors.
Explanation:
From the KKT conditions, the optimal weight vector is given by w = Σᵢ αᵢyᵢxᵢ. Since the Lagrange multipliers αᵢ are zero for all non-support vectors, the sum runs only over the support vectors (where αᵢ > 0). Therefore, w is a linear combination of only the support vectors.
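This identity can be checked against a fitted model: scikit-learn exposes αᵢyᵢ for the support vectors as dual_coef_, so summing over the support vectors should reproduce the weight vector. A sketch with an invented random dataset:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only;
# w = sum over support vectors of alpha_i * y_i * x_i.
w_reconstructed = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_reconstructed, clf.coef_))  # True
```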
32. You train a soft-margin SVM and find that the optimal solution has many support vectors with αᵢ = C. What does this imply about your choice of C?
Hard margin and soft margin SVM
Medium
A.The choice of is irrelevant to the number of support vectors.
B.The value of is appropriately chosen for this dataset.
C.C might be too small, allowing for a soft margin that misclassifies or violates the margin for many points.
D.C is too large, causing the model to overfit.
Correct Answer: C might be too small, allowing for a soft margin that misclassifies or violates the margin for many points.
Explanation:
In a soft-margin SVM, the dual constraints are 0 ≤ αᵢ ≤ C. When αᵢ = C, the point is a margin-violating support vector (either misclassified or inside the margin). If many points have αᵢ = C, it suggests that the penalty for misclassification is not high enough to enforce a stricter margin, meaning C is likely too small for the desired level of accuracy.
33. Maximizing the geometric margin in a hard-margin SVM is equivalent to minimizing which of the following expressions?
Geometric interpretation of classification margins
Medium
A.2/‖w‖
B.½‖w‖²
C.−‖w‖²
D.1/‖w‖²
Correct Answer: ½‖w‖²
Explanation:
Maximizing the margin 2/‖w‖ is equivalent to minimizing ‖w‖. For mathematical convenience (to make the objective function differentiable and remove the square root), this is further transformed into minimizing ½‖w‖². This is a strictly convex function, which guarantees a unique global minimum, and the factor of 1/2 simplifies the derivative during optimization.
34. Which of the following is NOT a valid Mercer kernel (i.e., cannot be used as a kernel function in an SVM)?
Kernel trick and kernel functions
Medium
A.K(x, z) = (x·z + 1)²
B.K(x, z) = tanh(x·z + c) for some values of c
C.K(x, z) = exp(−‖x − z‖²)
D.K(x, z) = ‖x − z‖²
Correct Answer: K(x, z) = ‖x − z‖²
Explanation:
A function is a valid kernel if its corresponding Gram matrix is positive semi-definite for any set of data points. While the polynomial, RBF, and (for some parameters) sigmoid kernels satisfy this condition, the squared Euclidean distance ‖x − z‖² does not generally produce a positive semi-definite Gram matrix and thus does not correspond to a dot product in some feature space. It violates Mercer's theorem.
35. If an SVM is trained on n data points with d features, and n ≫ d, which formulation is generally more computationally efficient to solve?
Primal and dual optimization problems
Medium
A.The dual problem.
B.Neither can be solved efficiently in this case.
C.The primal problem.
D.Both have the same computational complexity.
Correct Answer: The primal problem.
Explanation:
The primal problem optimizes over d + 1 variables (w and b). The dual problem optimizes over n variables (the αᵢ). When the number of data points n is much larger than the number of features d, solving the primal problem with d + 1 variables is often more efficient than solving the dual with n variables, especially when using a linear kernel.
36. The decision function for a kernel SVM is given by f(x) = Σᵢ∈SV αᵢyᵢK(xᵢ, x) + b. Why is this function efficient to evaluate for a new point x even in a very high-dimensional feature space?
Optimization perspective of SVM training
Medium
A.Because the number of support vectors (SV) is typically much smaller than the total number of training points.
B.Because the kernel function simplifies to a linear operation.
C.Because the Lagrange multipliers are always equal to 1.
D.Because the bias term is always zero.
Correct Answer: Because the number of support vectors (SV) is typically much smaller than the total number of training points.
Explanation:
The prediction for a new point depends only on the kernel evaluations between the new point and the support vectors. Since the set of support vectors is often a small subset of the entire training dataset, the summation is over a relatively small number of terms, making the prediction computationally efficient regardless of the dimensionality of the feature space.
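This can be demonstrated by recomputing the decision function by hand using only the support vectors. A sketch assuming scikit-learn; the dataset and γ value are illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Prediction only touches the support vectors, not the full training set:
# f(x) = sum_i (alpha_i * y_i) K(x_i, x) + b over support vectors.
X_new = rng.normal(0, 1, (5, 2))
K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_

print(np.allclose(f_manual, clf.decision_function(X_new)))  # True
```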
37. In the soft-margin SVM Lagrangian, the term C Σᵢ ξᵢ is added to the objective function. What is the role of this term?
Lagrangian formulation
Medium
A.It forces the weight vector to have a smaller magnitude.
B.It normalizes the input features.
C.It acts as a penalty term to minimize the sum of slack variables, thereby reducing classification errors and margin violations.
D.It ensures the margin is as wide as possible.
Correct Answer: It acts as a penalty term to minimize the sum of slack variables, thereby reducing classification errors and margin violations.
Explanation:
The primal objective for the soft-margin SVM is to minimize ½‖w‖² + C Σᵢ ξᵢ. The term C Σᵢ ξᵢ penalizes the model for having non-zero slack variables. Since ξᵢ > 0 corresponds to a point violating the margin, minimizing this sum encourages the model to classify points correctly and keep them outside the margin.
38. You are working with text data represented by high-dimensional but sparse TF-IDF vectors. Which kernel is often a good starting choice for an SVM classifier in this scenario?
Kernel trick and kernel functions
Medium
A.Sigmoid kernel
B.Radial Basis Function (RBF) kernel
C.Linear kernel
D.Polynomial kernel of a high degree
Correct Answer: Linear kernel
Explanation:
For high-dimensional and sparse data like text features, linear models often perform very well and are computationally efficient. In such high-dimensional spaces the classes are frequently already close to linearly separable, so no kernel-induced mapping is needed. Non-linear kernels like RBF can be less effective and much slower to train in this context.
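A minimal version of this text-classification setup, assuming scikit-learn; the four tiny documents and their labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = [
    "great movie loved it",
    "wonderful film great acting",
    "terrible movie hated it",
    "awful film bad acting",
]
labels = [1, 1, 0, 0]

# TF-IDF yields high-dimensional sparse vectors; a linear SVM handles
# them efficiently without any kernel expansion.
X = TfidfVectorizer().fit_transform(docs)
clf = LinearSVC(C=1.0).fit(X, labels)
print(clf.score(X, labels))
```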
39. For a non-linearly separable dataset, which of the following statements is true?
Hard margin and soft margin SVM
Medium
A.A hard-margin SVM will find a solution by ignoring the outliers.
B.A hard-margin SVM will find the optimal non-linear boundary.
C.A hard-margin SVM has no feasible solution.
D.A soft-margin SVM will perform identically to a hard-margin SVM.
Correct Answer: A hard-margin SVM has no feasible solution.
Explanation:
The hard-margin SVM requires that every data point is correctly classified with a margin of at least 1. If the data is not linearly separable, it is impossible to find a hyperplane that satisfies this condition for all points. Therefore, the optimization problem has no feasible solution.
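The infeasibility of a hard margin, and how slack variables rescue the soft-margin problem, can be seen on a 1-D overlapping dataset. A sketch assuming scikit-learn; the data points are invented:

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes: no threshold classifies every point correctly,
# so a hard margin is infeasible; slack variables make the soft-margin
# problem solvable anyway.
X = np.array([[0.0], [1.0], [2.0], [3.0], [1.6], [1.4]])
y = np.array([-1, -1, 1, 1, -1, 1])  # the last two points overlap

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Slack at the optimum: xi_i = max(0, 1 - y_i * f(x_i));
# a positive xi marks a margin violation.
xi = np.maximum(0, 1 - y * clf.decision_function(X))
print(xi)
```

Since the data is non-separable, at least one slack variable must come out positive.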
40. The strong duality principle holds for the SVM optimization problem. What does this imply?
Primal and dual optimization problems
Medium
A.The optimal value of the primal objective function is equal to the optimal value of the dual objective function.
B.The number of primal variables is equal to the number of dual variables.
C.The dual problem is always easier to solve than the primal problem.
D.The solution to the primal problem is always zero.
Correct Answer: The optimal value of the primal objective function is equal to the optimal value of the dual objective function.
Explanation:
Strong duality means that there is zero gap between the primal and dual solutions. The maximum value of the dual objective is equal to the minimum value of the primal objective. This is a crucial property that allows us to solve the dual problem to find the solution for the primal, which is particularly useful for applying the kernel trick.
41. Consider a hard-margin SVM trained on a linearly separable dataset. If every feature vector xᵢ is transformed to x′ᵢ = Dxᵢ, where D is a non-singular diagonal matrix whose diagonal entries dⱼ > 0 are not all equal, how does this non-uniform scaling affect the geometric margin and the set of support vectors?
Geometric interpretation of classification margins
Hard
A.The geometric margin will change, but the set of support vectors will remain unchanged.
B.The geometric margin will remain unchanged, but the set of support vectors may change.
C.Both the geometric margin and the set of support vectors are guaranteed to remain unchanged.
D.The geometric margin will change, and the set of support vectors may also change.
Correct Answer: The geometric margin will change, and the set of support vectors may also change.
Explanation:
Non-uniform scaling (D is not a multiple of the identity matrix) changes the geometry of the feature space, altering distances and angles. This will change the optimal separating hyperplane's orientation and position relative to the data points. Consequently, the maximal geometric margin will change, and the set of points that become support vectors can also change, as different points may become closest to the new optimal boundary.
42. In a soft-margin SVM, the objective is to minimize ½‖w‖² + C Σᵢ ξᵢ. What is the precise consequence of setting the hyperparameter C to a very large value (i.e., C → ∞) for a dataset that is not linearly separable?
Hard margin and soft margin SVM
Hard
A.The optimization will result in a weight vector that is close to zero.
B.The model converges to the hard-margin SVM solution.
C.The optimization problem becomes infeasible.
D.The decision boundary will have a very small margin and will be highly sensitive to individual data points.
Correct Answer: The decision boundary will have a very small margin and will be highly sensitive to individual data points.
Explanation:
As C → ∞, the penalty for misclassification becomes infinitely high. The optimizer will try desperately to make all slack variables equal to zero. For non-linearly separable data, this is impossible. The model will compromise by finding a hyperplane with a very small margin (large ‖w‖) that attempts to thread the needle between the classes, making the boundary highly influenced by every single point and leading to extreme overfitting.
43. The dual formulation of the SVM is often preferred over the primal. Under which scenario is solving the primal problem using methods like stochastic gradient descent on the hinge loss formulation computationally more advantageous than solving the dual?
Primal and dual optimization problems
Hard
A.When the number of features (d) is much larger than the number of training samples (n), and a complex non-linear kernel is used.
B.When the number of training samples (n) is much larger than the number of features (d), and a linear kernel is used.
C.The dual is always computationally superior to the primal when using a kernel.
D.When the Gram matrix is sparse.
Correct Answer: When the number of training samples (n) is much larger than the number of features (d), and a linear kernel is used.
Explanation:
The primal problem's complexity depends on the number of features d (solving for w ∈ ℝᵈ). The dual problem's complexity depends on the number of samples n (solving for α ∈ ℝⁿ, often requiring an n × n Gram matrix). When n ≫ d and a linear kernel is used, solving the primal directly is much more efficient, as you are optimizing over a d-dimensional space instead of an n-dimensional one.
44. Let K₁ and K₂ be two valid Mercer kernels. Which of the following operations is NOT guaranteed to produce a valid Mercer kernel?
Kernel trick and kernel functions
Hard
A.K₁(x, z) + K₂(x, z)
B.K₁(x, z) − K₂(x, z)
C.cK₁(x, z) for a constant c > 0
D.p(K₁(x, z)), where p is a polynomial with non-negative coefficients.
Correct Answer: K₁(x, z) − K₂(x, z)
Explanation:
The set of valid kernels (functions that produce a positive semi-definite Gram matrix) is closed under positive scaling, addition, and element-wise product (Hadamard product). However, the difference K₁ − K₂ of two kernels is not guaranteed to be a valid kernel because the resulting Gram matrix may not be positive semi-definite (it could have negative eigenvalues).
45. In the dual formulation of a soft-margin SVM, consider the Karush-Kuhn-Tucker (KKT) conditions. If for a particular data point xᵢ, its corresponding Lagrange multiplier αᵢ is found to be exactly equal to the hyperparameter C (i.e., αᵢ = C), what can be definitively concluded about this point?
Lagrangian formulation
Hard
A.The point is a support vector that is either inside the margin or is misclassified, with slack variable ξᵢ > 0.
B.The point is not a support vector and is correctly classified.
C.The point is a support vector that lies exactly on the margin.
D.The point is an outlier that has been ignored by the model.
Correct Answer: The point is a support vector that is either inside the margin or is misclassified, with slack variable ξᵢ > 0.
Explanation:
The KKT conditions for the soft-margin SVM include αᵢ + μᵢ = C and μᵢξᵢ = 0, where μᵢ ≥ 0 is the multiplier for the constraint ξᵢ ≥ 0. If αᵢ = C, then μᵢ = 0. This means the condition μᵢξᵢ = 0 is satisfied for any ξᵢ ≥ 0. Since αᵢ > 0, the point is a support vector. Such points are often called 'bounded' support vectors and represent margin violators (either misclassified or correctly classified but inside the margin).
46The SVM optimization problem is a Quadratic Programming (QP) problem. What is the primary implication of this for the uniqueness of the solution?
Optimization perspective of SVM training
Hard
A.The uniqueness of the solution depends entirely on the choice of the QP solver.
B.The solution is never unique, as multiple hyperplanes can achieve the same margin.
C.If a solution exists, the value of the objective function (the margin) is unique, and if the objective is strictly convex, the optimal weight vector is also unique.
D.A unique solution for both the weight vector and bias is always guaranteed.
Correct Answer: If a solution exists, the value of the objective function (the margin) is unique, and if the objective is strictly convex, the optimal weight vector is also unique.
Explanation:
The SVM objective function (1/2)||w||^2 is strictly convex with respect to w. For a convex optimization problem, any local minimum is a global minimum, and the minimum value of the objective function is unique. Because the objective is strictly convex, the optimal solution for the primal variables (the weight vector w) is also unique. The bias b may not be unique in certain edge cases (e.g., if there are no support vectors with 0 < α_i < C).
Incorrect! Try again.
47In a hard-margin linear SVM, the margin is given by 2/||w||. How does the dimensionality of the feature space theoretically affect the maximum possible margin for a given dataset of n points?
Geometric interpretation of classification margins
Hard
A.Higher dimensionality always decreases the maximum possible margin due to the curse of dimensionality.
B.The margin is only dependent on the number of support vectors, not the dimensionality.
C.The dimensionality of the feature space has no theoretical relationship with the maximum possible margin.
D.Higher dimensionality generally allows for a larger maximum margin, as it provides more degrees of freedom to find a separating hyperplane.
Correct Answer: Higher dimensionality generally allows for a larger maximum margin, as it provides more degrees of freedom to find a separating hyperplane.
Explanation:
In a higher-dimensional space, data points become sparser, and it's more likely that a separating hyperplane can be found. With more dimensions (degrees of freedom), there is more 'room' to place a hyperplane that is further away from all data points. This is a key concept behind Cover's theorem and the motivation for using kernels to map data to higher dimensions, where linear separability and larger margins become more probable.
Incorrect! Try again.
48A soft-margin SVM with a non-linear kernel is trained on a dataset. If you remove a data point which is correctly classified and lies strictly outside the margin (i.e., y_i f(x_i) > 1), what is the most likely outcome upon retraining the SVM with the same hyperparameters?
Hard margin and soft margin SVM
Hard
A.The model will now overfit the remaining data.
B.The decision boundary will remain exactly the same.
C.The decision boundary will change significantly.
D.The margin will decrease.
Correct Answer: The decision boundary will remain exactly the same.
Explanation:
The SVM decision boundary is determined solely by the support vectors. A point correctly classified and strictly outside the margin has a corresponding Lagrange multiplier α_i = 0. Such a point is not a support vector. Removing it from the dataset does not change the set of active constraints in the optimization problem. Therefore, retraining the model will yield the exact same decision boundary.
Incorrect! Try again.
49What is the primary reason that the kernel trick can be applied to the dual formulation of the SVM but not directly to the primal formulation?
Primal and dual optimization problems
Hard
A.The dual objective function and the decision rule depend on the data only through dot products of feature vectors, whereas the primal depends on the feature vectors themselves.
B.The primal formulation does not involve a bias term .
C.The primal problem is non-convex, while the dual is convex.
D.The dual problem has fewer constraints than the primal problem.
Correct Answer: The dual objective function and the decision rule depend on the data only through dot products of feature vectors, whereas the primal depends on the feature vectors themselves.
Explanation:
The dual formulation's objective function involves terms of the form (x_i · x_j). The final decision rule is also based on dot products: f(x) = sign(Σ_i α_i y_i (x_i · x) + b). The kernel trick works by replacing these dot products with a kernel function K(x_i, x_j) = φ(x_i) · φ(x_j). The primal formulation, which solves for w directly, involves the feature vectors themselves in its constraints (y_i(w · x_i + b) ≥ 1) and is not structured in a way that only involves dot products, making the kernel trick inapplicable.
Incorrect! Try again.
50Consider the Radial Basis Function (RBF) kernel K(x, z) = exp(-γ||x - z||^2) with parameter γ. What happens to the decision boundary of an SVM as γ → 0?
Kernel trick and kernel functions
Hard
A.The SVM fails to find any support vectors.
B.The decision boundary approaches a linear hyperplane.
C.The decision boundary becomes highly complex and overfits the data.
D.The influence of each support vector becomes extremely localized.
Correct Answer: The decision boundary approaches a linear hyperplane.
Explanation:
As γ → 0, the exponent -γ||x - z||^2 also approaches 0. Using a Taylor expansion, e^u ≈ 1 + u for small u. So, K(x, z) ≈ 1 - γ||x - z||^2 = 1 - γ||x||^2 - γ||z||^2 + 2γ(x · z). In the SVM decision function, many of these terms are constant or depend on the norm of a single point, behaving like offsets. The dominant term that depends on both x and z is the linear dot product x · z. Therefore, the behavior of the kernel becomes increasingly linear, and the decision boundary approaches a linear hyperplane.
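The expansion can be checked numerically (a sketch; the random points are arbitrary): subtracting the constant and single-point norm terms from (K - 1)/γ should recover the plain dot product as γ shrinks.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # ||x - z||^2
norms = (X ** 2).sum(1)
linear = X @ X.T

def recovered_linear(gamma):
    K = np.exp(-gamma * sq)
    # (K - 1)/gamma -> -||x - z||^2 = -||x||^2 - ||z||^2 + 2 (x . z),
    # so adding the norms back and halving recovers the dot product.
    return ((K - 1) / gamma + norms[:, None] + norms[None, :]) / 2

for gamma in (1e-2, 1e-4, 1e-6):
    err = np.abs(recovered_linear(gamma) - linear).max()
    print(f"gamma={gamma:g}  max error={err:.2e}")  # shrinks with gamma
```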
Incorrect! Try again.
51The Lagrangian for the hard-margin SVM primal problem is L(w, b, α) = (1/2)||w||^2 - Σ_i α_i [y_i(w · x_i + b) - 1]. What is the interpretation of the stationarity condition ∂L/∂w = 0?
Lagrangian formulation
Hard
A.It determines the value of the optimal bias term b.
B.It shows that the optimal weight vector must be a linear combination of the feature vectors of the support vectors.
C.It establishes the constraint Σ_i α_i y_i = 0 in the dual problem.
D.It proves that the optimization problem is convex.
Correct Answer: It shows that the optimal weight vector must be a linear combination of the feature vectors of the support vectors.
Explanation:
Taking the derivative of the Lagrangian with respect to w and setting it to zero gives: w - Σ_i α_i y_i x_i = 0. This directly implies that the optimal weight vector is w = Σ_i α_i y_i x_i. From the KKT conditions, we know that α_i > 0 only for support vectors, meaning w is a linear combination of the feature vectors of only the support vectors.
Incorrect! Try again.
52Consider the unconstrained hinge loss formulation of a linear SVM: min_{w,b} (1/2)||w||^2 + C Σ_i max(0, 1 - y_i(w · x_i + b)). How does the solution change if the regularization term is changed from the L2-norm squared (||w||_2^2) to the L1-norm (||w||_1)?
Hard margin and soft margin SVM
Hard
A.The problem becomes non-convex and difficult to solve.
B.The margin is no longer maximized, and the model focuses only on minimizing classification errors.
C.The solution remains identical, as both are convex regularizers.
D.The optimization problem is no longer a QP problem, and the resulting weight vector is likely to be sparse (have many zero components).
Correct Answer: The optimization problem is no longer a QP problem, and the resulting weight vector is likely to be sparse (have many zero components).
Explanation:
Changing the regularization to the L1-norm results in what is known as an L1-SVM. The objective function is no longer quadratic, so it is not a standard QP problem (though it is still a convex problem). The L1-norm is known for inducing sparsity, meaning it encourages many of the components of the weight vector to be exactly zero. This has the effect of performing automatic feature selection.
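A sketch of this sparsity effect using scikit-learn's LinearSVC (note it uses the squared hinge loss rather than the plain hinge, and the synthetic dataset with mostly irrelevant features is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# 50 features but only 5 carry signal: L1 should zero out many weights.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

common = dict(loss="squared_hinge", dual=False, C=0.1, max_iter=20000)
l2 = LinearSVC(penalty="l2", **common).fit(X, y)
l1 = LinearSVC(penalty="l1", **common).fit(X, y)

n_zero = lambda m: int((np.abs(m.coef_) < 1e-8).sum())
print("exact-zero weights  L2:", n_zero(l2), " L1:", n_zero(l1))
```

The L2-regularized weights are generically all non-zero, while the L1-regularized model drives many irrelevant weights to exactly zero, acting as feature selection.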
Incorrect! Try again.
53Mercer's theorem provides the conditions for a function K(x, z) to be a valid kernel. It states that K must be a continuous, symmetric function such that the Gram matrix K_ij = K(x_i, x_j) is positive semi-definite for any finite set of points {x_1, ..., x_n}. What does positive semi-definite imply in this context?
Kernel trick and kernel functions
Hard
A.The kernel function corresponds to a dot product in a finite-dimensional space.
B.For any non-zero vector c, the quadratic form c^T K c ≥ 0.
C.The determinant of the Gram matrix must be strictly positive.
D.All entries of the Gram matrix must be non-negative.
Correct Answer: For any non-zero vector c, the quadratic form c^T K c ≥ 0.
Explanation:
This is the definition of a positive semi-definite (PSD) matrix. It ensures that the kernel function behaves like a dot product in some (possibly infinite-dimensional) feature space, which is crucial for the geometry of the SVM. It guarantees that squared distances in the feature space are non-negative and that the dual optimization problem is convex and well-posed.
Incorrect! Try again.
54If you add a new data point to a perfectly linearly separable dataset, under which condition is the hard-margin SVM decision boundary guaranteed to NOT change?
Geometric interpretation of classification margins
Hard
A.If the new point is correctly classified and lies outside the existing margin.
B.If the new point is from the positive class.
C.If the new point lies exactly on the decision boundary.
D.The decision boundary will always change when a new data point is added.
Correct Answer: If the new point is correctly classified and lies outside the existing margin.
Explanation:
The hard-margin SVM boundary is defined by the support vectors, which are the points lying exactly on the margin hyperplanes (y_i(w · x_i + b) = 1). If a new point satisfies the margin constraint of the existing solution, i.e., y(w · x + b) ≥ 1, it does not violate any constraints and does not provide any new information that would force the hyperplane to move. It is not a support vector, so the optimal solution remains unchanged.
Incorrect! Try again.
55In the context of SVMs, what is the 'duality gap' and what does it mean if it is zero?
Primal and dual optimization problems
Hard
A.It is the difference between the primal objective value and the dual objective value. A zero gap (strong duality) means the optimal solutions to both problems are equivalent.
B.It is the difference in performance between a linear SVM and a kernelized SVM.
C.It is the number of misclassified points in a soft-margin SVM.
D.It is the geometric distance between the two margin boundaries.
Correct Answer: It is the difference between the primal objective value and the dual objective value. A zero gap (strong duality) means the optimal solutions to both problems are equivalent.
Explanation:
In optimization theory, weak duality states that the optimal value of the dual problem provides a lower bound on the optimal value of the primal problem. The difference between these values is the duality gap. For SVMs, the problem is constructed such that strong duality holds, meaning the duality gap is zero. This guarantees that solving the easier dual problem gives us the same optimal solution as solving the primal problem.
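The zero gap can be verified numerically; in the sketch below the blob data is an assumption, and scikit-learn's dual_coef_ stores y_i · α_i for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(30, 2)), rng.normal(2, 1, size=(30, 2))])
y = np.array([-1] * 30 + [1] * 30)
C = 1.0

clf = SVC(kernel="linear", C=C, tol=1e-8).fit(X, y)
dc = clf.dual_coef_.ravel()     # y_i * alpha_i for the support vectors
sv = X[clf.support_]

# Dual objective: sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
dual = np.abs(dc).sum() - 0.5 * dc @ (sv @ sv.T) @ dc

# Primal objective: 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))
w = dc @ sv
b = clf.intercept_[0]
primal = 0.5 * w @ w + C * np.maximum(0, 1 - y * (X @ w + b)).sum()

print(primal, dual)  # strong duality: the two values coincide
assert abs(primal - dual) < 1e-3
```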
Incorrect! Try again.
56In the soft-margin SVM, what is the role of the Lagrange multipliers μ_i associated with the constraints ξ_i ≥ 0?
Lagrangian formulation
Hard
A.They enforce the relationship α_i = C - μ_i, which leads to the box constraint 0 ≤ α_i ≤ C in the dual.
B.They are the weights of the support vectors in the final decision function.
C.They directly measure the geometric margin of the classifier.
D.They are hyperparameters that need to be tuned using cross-validation.
Correct Answer: They enforce the relationship α_i = C - μ_i, which leads to the box constraint 0 ≤ α_i ≤ C in the dual.
Explanation:
In the Lagrangian of the primal problem, we have the terms C Σ_i ξ_i, -Σ_i α_i ξ_i, and -Σ_i μ_i ξ_i. The stationarity condition with respect to ξ_i yields C - α_i - μ_i = 0, which gives α_i = C - μ_i. Since the KKT conditions require α_i ≥ 0 and μ_i ≥ 0, this relationship directly implies that α_i must be less than or equal to C, creating the box constraint 0 ≤ α_i ≤ C in the dual problem.
Incorrect! Try again.
57The Sequential Minimal Optimization (SMO) algorithm iteratively picks pairs of Lagrange multipliers to optimize. What is a common heuristic for choosing the first multiplier, α_1?
Optimization perspective of SVM training
Hard
A.Choose the α_i with the largest current value.
B.Choose an α_i at random.
C.Choose an α_i corresponding to the point furthest from the current decision boundary.
D.Choose an α_i corresponding to a data point that most violates the KKT conditions.
Correct Answer: Choose an α_i corresponding to a data point that most violates the KKT conditions.
Explanation:
To make the fastest progress towards the global optimum, SMO prioritizes the multipliers that are currently 'most wrong'. The KKT conditions must hold at the optimal solution. Therefore, a good heuristic is to select a point that currently violates these conditions by the largest amount. This strategy helps the algorithm converge more quickly than random or sequential selection.
Incorrect! Try again.
58Which of the following functions k(x, z), where x, z ∈ R^d, is NOT a valid Mercer kernel?
Kernel trick and kernel functions
Hard
A.k(x, z) = -(x · z)
B.k(x, z) = (x · z + 1)^d for integer d ≥ 1
C.k(x, z) = exp(-γ||x - z||^2) for γ > 0
D.k(x, z) = Σ_j min(x_j, z_j) for non-negative vectors x, z
Correct Answer: k(x, z) = -(x · z)
Explanation:
A function is a valid Mercer kernel if its Gram matrix is positive semi-definite (PSD) for any set of points. The function k(x, z) = -(x · z) is not a valid kernel. The diagonal elements of its Gram matrix are -||x_i||^2, which are negative for any non-zero x_i. A PSD matrix must have non-negative diagonal elements. More formally, the Gram matrix for this function can have negative eigenvalues, violating the PSD condition. The other options are the well-known Polynomial kernel, RBF kernel, and a valid intersection-style kernel.
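A minimal numeric check of this claim (the sample points are arbitrary):

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
K = -(X @ X.T)  # Gram matrix of the candidate kernel k(x, z) = -(x . z)

print(np.diag(K))             # [-1. -4. -2.]  (= -||x_i||^2, all negative)
print(np.linalg.eigvalsh(K))  # contains negative eigenvalues, so not PSD
```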
Incorrect! Try again.
59Consider training a soft-margin linear SVM. If the dataset is perfectly linearly separable with a large margin, and you choose a very small value for the hyperparameter C (e.g., C → 0), what is the likely outcome?
Hard margin and soft margin SVM
Hard
A.All data points will become support vectors.
B.The model will be identical to the hard-margin SVM because the data is separable.
C.The optimization will fail because C is too small.
D.The model may produce a decision boundary with a wider margin than the hard-margin SVM, potentially misclassifying some points even though a perfect separation is possible.
Correct Answer: The model may produce a decision boundary with a wider margin than the hard-margin SVM, potentially misclassifying some points even though a perfect separation is possible.
Explanation:
A very small C places a high emphasis on minimizing (1/2)||w||^2 (maximizing the margin) and a very low emphasis on penalizing the slack variables ξ_i. The optimizer might find that it can achieve a much smaller ||w|| (a much wider margin) by allowing a few points to fall within the margin or even be misclassified, as the penalty for doing so is negligible. It prioritizes a simple, wide-margin solution over perfect classification.
Incorrect! Try again.
60After solving the dual SVM problem and obtaining the optimal Lagrange multipliers α_i, the weight vector is constructed as w = Σ_i α_i y_i x_i. What happens to the norm of this weight vector, ||w||, as the regularization parameter C increases?
Primal and dual optimization problems
Hard
A.It oscillates unpredictably.
B.It generally increases or stays the same.
C.It is independent of C.
D.It generally decreases.
Correct Answer: It generally increases or stays the same.
Explanation:
A larger C places a higher penalty on misclassification errors (the slack terms ξ_i). To reduce these errors, the model will try to fit the data more closely. This often requires a more complex decision boundary, which in the linear case corresponds to a smaller margin. Since the margin 2/||w|| is inversely related to ||w||, a smaller margin implies a larger ||w||. Therefore, as C increases, the model tolerates a larger ||w|| in order to reduce classification errors, causing ||w|| to increase or stay the same.
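A sketch of this trend with scikit-learn's LinearSVC (which uses the squared hinge loss; the synthetic overlapping data is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Overlapping classes, so larger C trades margin width for fewer errors.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

norms = [np.linalg.norm(LinearSVC(C=C, max_iter=50000).fit(X, y).coef_)
         for C in (0.01, 1.0, 100.0)]
print(norms)  # ||w|| grows (or plateaus) as C increases
```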