1. In the context of machine learning, what does 'bias' represent?
Bias-variance trade-off
Easy
A.The error from erroneous assumptions in the learning algorithm, leading to a model that is too simple.
B.The error from the model's sensitivity to small fluctuations in the training set.
C.The inherent noise present in the dataset.
D.The model's accuracy on the test dataset.
Correct Answer: The error from erroneous assumptions in the learning algorithm, leading to a model that is too simple.
Explanation:
Bias is the error introduced by approximating a real-world problem, which may be complex, by a much simpler model. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
Incorrect! Try again.
2. A model that has high variance is likely to:
Bias-variance trade-off
Easy
A.Perform very well on the training data but poorly on unseen test data.
B.Perform poorly on the training data but well on unseen test data.
C.Perform very well on both the training data and test data.
D.Perform poorly on both the training data and test data.
Correct Answer: Perform very well on the training data but poorly on unseen test data.
Explanation:
High variance indicates that the model is too complex and has learned the noise in the training data, a condition known as overfitting. This makes it perform poorly when generalizing to new, unseen data.
Incorrect! Try again.
3. Which scenario best describes an 'underfitting' model?
Overfitting and underfitting from a mathematical viewpoint
Easy
A.Low training error and low test error.
B.High training error and high test error.
C.Low training error and high test error.
D.High training error and low test error.
Correct Answer: High training error and high test error.
Explanation:
Underfitting occurs when a model is too simple to capture the underlying pattern of the data. As a result, it performs poorly not only on the test data but also on the data it was trained on.
Incorrect! Try again.
4. Overfitting occurs when a model has:
Overfitting and underfitting from a mathematical viewpoint
Easy
A.Low bias and low variance.
B.High bias and low variance.
C.Low bias and high variance.
D.High bias and high variance.
Correct Answer: Low bias and high variance.
Explanation:
An overfitted model is very flexible (low bias) and captures the noise in the training data, making it sensitive to small changes (high variance). This harms its ability to generalize.
Incorrect! Try again.
5. What is the primary goal of applying regularization to a machine learning model?
L1 and L2 regularization
Easy
A.To eliminate the need for a separate test set.
B.To prevent overfitting by penalizing model complexity.
C.To reduce the training time of the model.
D.To increase the model's training accuracy to 100%.
Correct Answer: To prevent overfitting by penalizing model complexity.
Explanation:
Regularization techniques add a penalty term to the loss function for large coefficient values, which discourages the model from becoming overly complex and fitting the noise in the training data.
Incorrect! Try again.
6. The L2 regularization penalty term added to the loss function is proportional to the:
L1 and L2 regularization
Easy
A.Maximum absolute value among the model's weights.
B.Sum of the absolute magnitudes of the model's weights.
C.Sum of the squared magnitudes of the model's weights.
D.Number of non-zero weights in the model.
Correct Answer: Sum of the squared magnitudes of the model's weights.
Explanation:
L2 regularization, or Ridge regression, adds a penalty term of the form $\lambda \sum_j w_j^2$, where $w_j$ are the model weights. This penalizes large weights, encouraging them to be small and distributed.
Incorrect! Try again.
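A quick numerical sketch of how the two penalties are computed (the weight vector and regularization strength below are illustrative values, not from any particular model):

```python
import numpy as np

# Hypothetical weight vector and regularization strength (illustrative values).
w = np.array([0.5, -1.2, 3.0])
lam = 0.1

# L2 (Ridge) penalty: lambda times the sum of squared weights.
l2_penalty = lam * np.sum(w ** 2)   # 0.1 * (0.25 + 1.44 + 9.0) ≈ 1.069

# For comparison, the L1 (Lasso) penalty sums absolute values instead.
l1_penalty = lam * np.sum(np.abs(w))  # 0.1 * (0.5 + 1.2 + 3.0) ≈ 0.47
```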
7. A key advantage of L1 regularization (Lasso) over L2 regularization (Ridge) is that it:
L1 and L2 regularization
Easy
A.Always produces a model with lower bias.
B.Does not require tuning a hyperparameter.
C.Is computationally less expensive to calculate.
D.Can produce sparse models by shrinking some weights to exactly zero.
Correct Answer: Can produce sparse models by shrinking some weights to exactly zero.
Explanation:
The L1 penalty term, $\lambda \sum_j |w_j|$, has the effect of forcing some coefficient estimates to be exactly zero, which makes it useful for feature selection.
Incorrect! Try again.
8. According to the bias-variance trade-off, as we increase a model's complexity, what typically happens to bias and variance?
Bias-variance trade-off
Easy
A.Bias increases and variance decreases.
B.Both bias and variance increase.
C.Bias decreases and variance increases.
D.Both bias and variance decrease.
Correct Answer: Bias decreases and variance increases.
Explanation:
A more complex model can better fit the training data, reducing bias. However, this increased flexibility makes it more sensitive to the training data's noise, thus increasing variance.
Incorrect! Try again.
9. The L1 norm of a vector $x$ is defined as:
Norm-based constraints
Easy
A.
B.
C.
D.
Correct Answer: $\|x\|_1 = \sum_i |x_i|$
Explanation:
The L1 norm, also known as the Manhattan distance or Taxicab norm, is the sum of the absolute values of the components of the vector. It is represented as $\|x\|_1 = \sum_i |x_i|$.
Incorrect! Try again.
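The definition can be checked numerically; `np.linalg.norm` with `ord=1` computes the same sum of absolute values (the vector below is illustrative):

```python
import numpy as np

v = np.array([3.0, -4.0, 2.0])  # example vector (illustrative values)

# L1 norm: sum of absolute values of the components.
l1 = np.sum(np.abs(v))

# numpy's norm with ord=1 computes the same quantity.
assert np.isclose(np.linalg.norm(v, ord=1), l1)

print(l1)  # 9.0
```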
10. From a geometric viewpoint, L1 regularization constrains the coefficients to lie within a:
Norm-based constraints
Easy
A.Hypercube.
B.Hyperdiamond (a square rotated by 45 degrees in 2D).
C.Hypersphere (a circle in 2D).
D.Hyperplane.
Correct Answer: Hyperdiamond (a square rotated by 45 degrees in 2D).
Explanation:
The L1 constraint defines a region whose shape is a hyperdiamond (or orthoplex). The sharp corners of this shape are why the optimization is likely to find solutions where some weights are exactly zero.
Incorrect! Try again.
11. In the regularization term $\lambda \sum_j w_j^2$, what happens if the hyperparameter $\lambda$ is set to zero?
L1 and L2 regularization
Easy
A.The model will fail to train.
B.The model's bias becomes infinitely high.
C.The model becomes maximally regularized, and all weights go to zero.
D.The model becomes a standard, unregularized model.
Correct Answer: The model becomes a standard, unregularized model.
Explanation:
The hyperparameter $\lambda$ controls the strength of the regularization. If $\lambda = 0$, the penalty term becomes zero, and the loss function reduces to the standard empirical risk, meaning no regularization is applied.
Incorrect! Try again.
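A small numpy sketch (synthetic data; the closed-form Ridge solution is assumed) showing that the Ridge estimate with $\lambda = 0$ coincides with ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form Ridge solution: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ordinary least squares via lstsq.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# With lam = 0 the penalty vanishes and Ridge reduces to OLS.
w_lam0 = ridge(X, y, 0.0)
assert np.allclose(w_lam0, w_ols)
```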
12. What is 'empirical risk' in the context of machine learning?
Structural risk minimization
Easy
A.The measured average loss over the training set.
B.The complexity of the chosen model.
C.The expected loss over all possible unseen data.
D.The irreducible error or noise in the data.
Correct Answer: The measured average loss over the training set.
Explanation:
Empirical risk is the error that the model makes on the training data. Simply minimizing this risk can lead to overfitting, which is why we also consider the model's complexity.
Incorrect! Try again.
13. The principle of Structural Risk Minimization (SRM) suggests choosing a model that:
Structural risk minimization
Easy
A.Minimizes only the empirical risk.
B.Minimizes only the variance of the model.
C.Balances low empirical risk with low model complexity.
D.Maximizes the model complexity to fit all data points.
Correct Answer: Balances low empirical risk with low model complexity.
Explanation:
SRM aims to minimize a bound on the true generalization error. This bound is a sum of two terms: the empirical risk (training error) and a term that depends on the model's complexity (like its VC dimension).
Incorrect! Try again.
14. What does the VC dimension of a hypothesis class (a set of models) measure?
VC dimension (intuition)
Easy
A.The amount of bias in the model.
B.The learning rate of the optimization algorithm.
C.The number of data points in the training set.
D.The capacity or expressive power of the class.
Correct Answer: The capacity or expressive power of the class.
Explanation:
The VC dimension is a measure of a model's complexity. It is defined as the maximum number of points that the model can 'shatter,' meaning it can perfectly classify them no matter how they are labeled.
Incorrect! Try again.
15. A model with an infinite VC dimension is:
VC dimension (intuition)
Easy
A.A model that cannot be trained.
B.The ideal model for all classification tasks.
C.So simple that it cannot learn any patterns.
D.So powerful that it can memorize any training set, making it highly prone to overfitting.
Correct Answer: So powerful that it can memorize any training set, making it highly prone to overfitting.
Explanation:
An infinite VC dimension means the model has unlimited capacity. It can shatter any number of points, allowing it to perfectly memorize the training data, including noise, which leads to poor generalization.
Incorrect! Try again.
16. For a linear classifier in a 2D plane (a simple line), what is the maximum number of points that it can 'shatter'?
VC dimension (intuition)
Easy
A.Unlimited
B.4
C.3
D.2
Correct Answer: 3
Explanation:
A line can separate any 3 non-collinear points in all possible ways. However, it cannot shatter 4 points; for example, it's impossible to separate points of a square where opposite corners have the same label. Thus, its VC dimension is 3.
Incorrect! Try again.
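This can be verified by brute force. The sketch below (a minimal perceptron; the points and labelings are illustrative choices) checks all $2^3$ labelings of 3 non-collinear points and the XOR labeling of 4 points. Failing to converge within a fixed number of epochs is only evidence, not proof, of non-separability, but the XOR configuration is known to be non-separable:

```python
import itertools
import numpy as np

def linearly_separable(points, labels, epochs=2000):
    """Run the perceptron algorithm (with a bias term); if it reaches
    zero mistakes, the labeling is realizable by a line."""
    X = np.hstack([points, np.ones((len(points), 1))])  # append bias feature
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(X, labels):
            if y * (w @ x) <= 0:   # misclassified (or on the boundary)
                w += y * x
                mistakes += 1
        if mistakes == 0:
            return True
    return False

# Three non-collinear points: every one of the 2^3 labelings is separable.
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
tri_ok = all(linearly_separable(tri, np.array(lab))
             for lab in itertools.product([-1, 1], repeat=3))

# Four corners of a square with the XOR labeling are not separable,
# so a line cannot shatter 4 points: its VC dimension is 3.
square = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
xor_ok = linearly_separable(square, np.array([1, 1, -1, -1]))

print(tri_ok, xor_ok)  # True False
```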
17. From a mathematical viewpoint, what is a common characteristic of the weights (coefficients) in a heavily overfitted polynomial regression model?
Overfitting and underfitting from a mathematical viewpoint
Easy
A.They are all very close to zero.
B.They are all exactly one.
C.They have very large positive and negative magnitudes.
D.They are all positive.
Correct Answer: They have very large positive and negative magnitudes.
Explanation:
To make a high-degree polynomial pass through all the training points, the model often develops extremely large coefficients. This makes the function very 'wiggly' and sensitive to small changes in input, which is a hallmark of overfitting.
Incorrect! Try again.
18. When L2 regularization is applied to linear regression, it is commonly known as:
L1 and L2 regularization
Easy
A.Lasso Regression
B.Principal Component Regression
C.Elastic Net
D.Ridge Regression
Correct Answer: Ridge Regression
Explanation:
Ridge Regression is the specific name for a linear regression model that is regularized using the L2 norm of the coefficients.
Incorrect! Try again.
19. Which of the following describes a model with high bias?
Bias-variance trade-off
Easy
A.An overfitting model.
B.A model with low training error.
C.A complex model like a deep neural network.
D.An underfitting model.
Correct Answer: An underfitting model.
Explanation:
High bias means the model makes strong, potentially incorrect assumptions about the data. This leads to underfitting, where the model is too simple to capture the true underlying patterns.
Incorrect! Try again.
20. Which algorithm is most directly associated with L1 regularization for the purpose of feature selection?
L1 and L2 regularization
Easy
A.PCA
B.LASSO
C.Ridge
D.k-Nearest Neighbors
Correct Answer: LASSO
Explanation:
LASSO stands for Least Absolute Shrinkage and Selection Operator. Its use of the L1 penalty inherently performs feature selection by shrinking the coefficients of less important features to exactly zero.
Incorrect! Try again.
21. A machine learning model is trained on a dataset, and its performance is evaluated. The expected squared prediction error is decomposed as $\mathbb{E}[(y - \hat{f}(x))^2] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$. If we are given a perfect model and an infinite amount of training data, which of these terms would still remain non-zero?
Bias-variance trade-off
Medium
A.$\mathrm{Bias}^2$, the squared bias
B.$\sigma^2$, the irreducible error
C.All terms would become zero
D.$\mathrm{Variance}$, the variance
Correct Answer: $\sigma^2$, the irreducible error
Explanation:
The irreducible error, $\sigma^2$, represents the inherent noise in the data itself. Even with a perfect model and infinite data, this randomness cannot be eliminated. Bias and variance are properties of the model and would approach zero under these ideal conditions.
Incorrect! Try again.
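A Monte Carlo sketch of this point (the true function, noise level, and sample size below are all assumed for illustration): even the perfect model's expected squared error floors out at $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(42)

# True function and noise level (assumed for this simulation).
f = lambda x: np.sin(x)
sigma = 0.5

# Even the *perfect* model f cannot beat the noise: its expected
# squared error approaches sigma^2, the irreducible error.
x = rng.uniform(0, 2 * np.pi, size=200_000)
y = f(x) + sigma * rng.normal(size=x.size)

mse_perfect_model = np.mean((y - f(x)) ** 2)
print(mse_perfect_model)  # ≈ sigma**2 = 0.25
```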
22. A model is considered to be overfitting when...
Overfitting and underfitting from a mathematical viewpoint
Medium
A.Training error is low but validation error is high and potentially increasing.
B.Both training error and validation error are low and converging.
C.Training error is high but validation error is low.
D.Both training error and validation error are high.
Correct Answer: Training error is low but validation error is high and potentially increasing.
Explanation:
Overfitting occurs when a model learns the training data too well, including its noise and specific quirks. This results in excellent performance on the training set (low error) but poor performance on unseen data (high validation error), indicating a failure to generalize.
Incorrect! Try again.
23. In the context of linear regression, what is the primary difference in the effect of L1 (Lasso) versus L2 (Ridge) regularization on the model's weight vector $w$?
L1 and L2 regularization
Medium
A.L1 regularization penalizes large weights more severely than L2 regularization.
B.L1 regularization always results in a lower bias model compared to L2 regularization.
C.L1 regularization can force some weights to be exactly zero, while L2 only shrinks them towards zero.
D.L2 regularization can force some weights to be exactly zero, while L1 only shrinks them towards zero.
Correct Answer: L1 regularization can force some weights to be exactly zero, while L2 only shrinks them towards zero.
Explanation:
L1 regularization adds a penalty term proportional to $\|w\|_1 = \sum_j |w_j|$. Due to the geometry of the L1 norm (a diamond shape), it tends to produce sparse solutions where many model weights are exactly zero, effectively performing feature selection. L2's penalty, $\|w\|_2^2 = \sum_j w_j^2$, shrinks weights towards zero but rarely sets them to exactly zero.
Incorrect! Try again.
24. Consider the k-Nearest Neighbors (k-NN) algorithm. How does decreasing the value of $k$ (e.g., from a large $k$ down to $k=1$) typically affect the bias and variance of the model?
Bias-variance trade-off
Medium
A.Both bias and variance increase.
B.Bias increases, and variance decreases.
C.Both bias and variance decrease.
D.Bias decreases, and variance increases.
Correct Answer: Bias decreases, and variance increases.
Explanation:
A smaller $k$ makes the k-NN model more flexible and sensitive to local noise in the training data. This reduces the model's bias (it can fit complex patterns) but increases its variance (it is highly dependent on the specific training points). A $k=1$ classifier, for example, has very high variance.
Incorrect! Try again.
25. The principle of Structural Risk Minimization (SRM) provides a framework for model selection by balancing two key quantities. What are they?
Structural risk minimization
Medium
A.Empirical risk (training error) and a model complexity term.
B.Bias and variance.
C.The L1 norm and the L2 norm of the weights.
D.Training accuracy and validation accuracy.
Correct Answer: Empirical risk (training error) and a model complexity term.
Explanation:
SRM aims to minimize an upper bound on the true risk (generalization error). This bound is expressed as the sum of the empirical risk (the error on the training data) and a complexity term (which depends on the model's capacity, like its VC dimension). Regularization is a practical implementation of SRM.
Incorrect! Try again.
26. The cost function for Ridge regression is given by $J(w) = \mathrm{MSE}(w) + \lambda \sum_j w_j^2$. This is equivalent to which of the following constrained optimization problems for some value $t$?
Norm-based constraints
Medium
A.Minimize subject to
B.Minimize subject to
C.Minimize subject to
D.Minimize subject to
Correct Answer: Minimize $\mathrm{MSE}(w)$ subject to $\|w\|_2^2 \le t$
Explanation:
The Lagrangian formulation of adding a penalty term is equivalent to a constrained optimization problem. Ridge regression minimizes the Mean Squared Error (MSE) while keeping the squared L2 norm of the weight vector below a certain threshold $t$. The value of $t$ is inversely related to the regularization parameter $\lambda$.
Incorrect! Try again.
27. If model class A has a VC dimension of 10 and model class B has a VC dimension of 1000, which of the following is generally true when training on a dataset of a fixed size?
VC dimension (intuition)
Medium
A.Model A will always have a lower training error than Model B.
B.Model A is guaranteed to generalize better than Model B.
C.Model B will have lower variance than Model A.
D.Model B has a higher capacity to overfit the data compared to Model A.
Correct Answer: Model B has a higher capacity to overfit the data compared to Model A.
Explanation:
VC dimension is a measure of a model class's capacity or complexity. A higher VC dimension means the model can shatter a larger set of points, allowing it to fit more complex patterns. This increased flexibility makes it more prone to fitting the noise in the training data (overfitting), especially with a limited dataset.
Incorrect! Try again.
28. You have fit a high-degree polynomial regression model to a dataset with a small number of data points ($N$). You observe that the number of model parameters ($p$) is greater than $N$. What is the most likely issue?
Overfitting and underfitting from a mathematical viewpoint
Medium
A.The optimization algorithm failed to converge.
B.The model is overfitting because its capacity is too high for the amount of data.
C.The model has high bias and low variance.
D.The model is underfitting because polynomials cannot capture the true relationship.
Correct Answer: The model is overfitting because its capacity is too high for the amount of data.
Explanation:
When the number of parameters ($p$) exceeds the number of data points ($N$), the model has enough flexibility to perfectly memorize the training data, including its noise. This is a classic recipe for overfitting, leading to very low training error but poor generalization to new data.
Incorrect! Try again.
29. Consider a Ridge regression model. What happens to the learned weight coefficients as the regularization parameter $\lambda$ approaches infinity ($\lambda \to \infty$)?
L1 and L2 regularization
Medium
A.All weight coefficients approach zero.
B.All weight coefficients approach infinity.
C.The weight coefficients become equal to the ordinary least squares solution.
D.Some weight coefficients become exactly zero, while others remain large.
Correct Answer: All weight coefficients approach zero.
Explanation:
The Ridge cost function is $\mathrm{MSE}(w) + \lambda \sum_j w_j^2$. As $\lambda \to \infty$, the penalty term dominates the loss function. To minimize this term, the optimizer must force all weights to be as close to zero as possible.
Incorrect! Try again.
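A closed-form sketch of this shrinkage on synthetic data (the Ridge solution $(X^\top X + \lambda I)^{-1} X^\top y$ is assumed): the norm of the weight vector decreases monotonically as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

def ridge(X, y, lam):
    """Closed-form Ridge estimate: (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# As lam grows, the penalty dominates and every weight is driven toward 0.
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 100.0, 1e6)]
print(norms)  # strictly decreasing, last entry near zero
```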
30. A data scientist observes that their model performs poorly on both the training set and the test set. The errors are high and very similar. Which of the following best describes this situation in terms of bias and variance?
Bias-variance trade-off
Medium
A.High bias, high variance
B.Low bias, low variance
C.High bias, low variance
D.Low bias, high variance
Correct Answer: High bias, low variance
Explanation:
Poor performance on both training and test sets suggests the model is too simple to capture the underlying structure of the data (underfitting). This is characteristic of high bias. The errors being similar suggests the model's predictions do not change much with different training sets, which implies low variance.
Incorrect! Try again.
31. From a geometric perspective, why does L1 regularization (Lasso) promote sparse solutions (i.e., weights being exactly zero)?
Norm-based constraints
Medium
A.The L1 norm is not differentiable, which causes the optimization to get stuck at zero.
B.The L1 norm constraint region is a hypersphere, which forces weights to be small and uniform.
C.The L1 norm constraint region is a hyperdiamond, and the elliptical contours of the loss function are likely to make contact at a corner.
D.The L1 norm penalizes small weights more than large weights, forcing them to zero.
Correct Answer: The L1 norm constraint region is a hyperdiamond, and the elliptical contours of the loss function are likely to make contact at a corner.
Explanation:
In two dimensions, the L1 constraint forms a diamond. The unregularized loss function forms elliptical contours. The optimal solution is the point where an ellipse first touches the diamond. Due to the sharp corners of the diamond lying on the axes, this contact point is very likely to be at a corner, where one of the weights is exactly zero.
Incorrect! Try again.
32. You are working on a problem with 10,000 features, but you suspect that only about 100 of them are actually useful. Which regularization technique would be a more suitable initial choice and why?
L1 and L2 regularization
Medium
A.L2 (Ridge), because it handles multicollinearity better by shrinking all weights.
B.L1 (Lasso), because it performs automatic feature selection by driving irrelevant feature weights to zero.
C.L1 (Lasso), because its optimization is computationally faster than L2.
D.L2 (Ridge), because it results in a model with lower bias.
Correct Answer: L1 (Lasso), because it performs automatic feature selection by driving irrelevant feature weights to zero.
Explanation:
When dealing with high-dimensional data where many features are expected to be irrelevant, L1 regularization is highly advantageous. Its tendency to produce sparse models (setting many weights to zero) acts as an embedded feature selection method, simplifying the model and potentially improving its interpretability and performance.
Incorrect! Try again.
33. According to generalization theory involving VC dimension, what is the impact of increasing the number of training samples ($N$) on the generalization gap, assuming the model class (and thus its VC dimension) remains fixed?
VC dimension (intuition)
Medium
A.The upper bound on the generalization gap tends to decrease.
B.The generalization gap remains constant.
C.The upper bound on the generalization gap tends to increase.
D.The impact depends on whether the model is linear or non-linear.
Correct Answer: The upper bound on the generalization gap tends to decrease.
Explanation:
Generalization bounds, such as the one derived from VC theory, typically show that the gap between true risk and empirical risk is bounded by a term on the order of $\sqrt{d_{\mathrm{VC}} \ln N / N}$. As the number of samples $N$ increases, this bound becomes tighter, suggesting that with more data, the model's performance on the training set becomes a more reliable estimate of its performance on unseen data.
Incorrect! Try again.
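One common form of the VC bound (assumed here purely for illustration) can be evaluated directly to see the gap tighten as $N$ grows for a fixed model class:

```python
import numpy as np

def vc_bound_gap(d_vc, n, delta=0.05):
    """One classic VC-style bound on the generalization gap (assumed form,
    for illustration): sqrt((d*(ln(2n/d) + 1) + ln(4/delta)) / n)."""
    return np.sqrt((d_vc * (np.log(2 * n / d_vc) + 1) + np.log(4 / delta)) / n)

# Fixed model class (d_vc = 10): more samples tighten the bound.
gaps = [vc_bound_gap(10, n) for n in (100, 1_000, 10_000, 100_000)]
print(gaps)  # monotonically decreasing
```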
34. What is the most likely mathematical consequence of adding a significant number of new, relevant data points to the training set of a model that is currently overfitting?
Overfitting and underfitting from a mathematical viewpoint
Medium
A.The model's bias will likely increase, and the training error will increase.
B.The model's bias will likely decrease, and the validation error will worsen.
C.The model's variance will likely increase, and the validation error will worsen.
D.The model's variance will likely decrease, and the validation error will improve.
Correct Answer: The model's variance will likely decrease, and the validation error will improve.
Explanation:
Overfitting is a high-variance problem. Providing more data helps the model learn the true underlying pattern instead of memorizing noise. This reduces the model's sensitivity to the specific training sample, thereby decreasing its variance and improving its ability to generalize, which is reflected in a lower validation error.
Incorrect! Try again.
35. How does adding an L2 regularization term, $\lambda \|w\|_2^2$, to a loss function practically implement the Structural Risk Minimization (SRM) principle?
Structural risk minimization
Medium
A.It adds a penalty for model complexity, where complexity is measured by the magnitude of the weights.
B.It increases the empirical risk to better match the true risk.
C.It guarantees that the empirical risk will be zero.
D.It directly minimizes the VC dimension of the model class.
Correct Answer: It adds a penalty for model complexity, where complexity is measured by the magnitude of the weights.
Explanation:
SRM balances empirical risk (loss on training data) with a penalty for model complexity. L2 regularization operationalizes this by adding a penalty term, $\lambda \sum_j w_j^2$, to the loss. This term penalizes models with large weights, which are considered more complex. The optimizer must therefore find a solution that fits the data well (low empirical risk) without making the weights too large (low complexity penalty).
Incorrect! Try again.
36. The absolute value function in the L1 penalty, $|w_j|$, is not differentiable at $w_j = 0$. What is a practical implication of this for training a model with L1 regularization?
L1 and L2 regularization
Medium
A.Standard gradient descent cannot be used directly; specialized optimizers like Coordinate Descent or subgradient methods are needed.
B.The model can never learn a weight that is exactly zero.
C.The cost function becomes non-convex, making it hard to find a global minimum.
D.The training process becomes significantly faster than with L2 regularization.
Correct Answer: Standard gradient descent cannot be used directly; specialized optimizers like Coordinate Descent or subgradient methods are needed.
Explanation:
The non-differentiability of the L1 norm at zero means that the gradient is not defined for any weight that is exactly zero. Therefore, standard gradient-based optimization methods cannot be applied without modification. Algorithms like Proximal Gradient Descent, Coordinate Descent, or those using subgradients are employed to handle this issue.
Incorrect! Try again.
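A minimal proximal-gradient (ISTA) sketch of one such specialized optimizer, on synthetic data where only the first feature matters (the data, step size, and iteration count are illustrative assumptions): the soft-thresholding step handles the non-differentiable L1 term and produces exact zeros:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t*||w||_1: shrink toward zero, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_ista(X, y, lam, step=None, iters=5000):
    """Proximal gradient descent (ISTA) for
    (1/2n)||y - Xw||^2 + lam*||w||_1 — a minimal sketch."""
    n, d = X.shape
    if step is None:
        # 1/L, where L is the Lipschitz constant of the smooth part's gradient.
        step = n / np.linalg.norm(X.T @ X, 2)
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n          # gradient of the smooth term
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # only feature 0 matters

w = lasso_ista(X, y, lam=0.5)
print(w)  # feature 0 gets a sizable weight; the others are exactly 0.0
```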
37. A team builds a very complex neural network with millions of parameters for a simple classification task with only a few hundred data points. Without any regularization, the model achieves 99.9% training accuracy but only 60% test accuracy. This large gap between training and test accuracy is primarily due to...
Bias-variance trade-off
Medium
A.High irreducible error
B.High bias
C.A non-convex loss function
D.High variance
Correct Answer: High variance
Explanation:
A large gap between training and test performance is the hallmark of overfitting. Mathematically, overfitting corresponds to a model with high variance. The model is so complex that it has learned the specific noise and artifacts of the small training set, and its predictions vary drastically when presented with new, unseen data.
Incorrect! Try again.
38. Comparing the L1 and L2 norms as penalty functions, how do they differ in their treatment of a single, very large weight coefficient versus many medium-sized coefficients (assuming the sum of magnitudes is the same)?
Norm-based constraints
Medium
A.L2 penalizes the single large weight more heavily than L1 due to the squaring effect.
B.Both norms penalize them equally as long as $\|w\|_1$ is the same.
C.L2 encourages a single large weight, while L1 encourages many medium-sized weights.
D.L1 penalizes the single large weight more heavily than L2.
Correct Answer: L2 penalizes the single large weight more heavily than L1 due to the squaring effect.
Explanation:
Consider two weight vectors with the same L1 norm, for example $w_A = (4, 0, 0, 0)$ and $w_B = (1, 1, 1, 1)$: $\|w_A\|_1 = \|w_B\|_1 = 4$, but $\|w_A\|_2^2 = 16$ while $\|w_B\|_2^2 = 4$. The L2 penalty for the single large weight is much higher. This shows that the L2 norm prefers to distribute weight values more evenly, while being very averse to individual large weights.
Incorrect! Try again.
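The comparison can be checked numerically; the specific vectors below (one large weight vs. four medium weights with the same L1 norm) are illustrative:

```python
import numpy as np

# One large weight vs. four medium weights with the same L1 norm.
w_single = np.array([4.0, 0.0, 0.0, 0.0])
w_spread = np.array([1.0, 1.0, 1.0, 1.0])

l1_single = np.sum(np.abs(w_single))   # 4.0
l1_spread = np.sum(np.abs(w_spread))   # 4.0  -> L1 treats them identically

l2_single = np.sum(w_single ** 2)      # 16.0
l2_spread = np.sum(w_spread ** 2)      # 4.0  -> L2 punishes the large weight
```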
39. An underfit model is characterized by high bias. Which of the following is a direct mathematical interpretation of high bias?
Overfitting and underfitting from a mathematical viewpoint
Medium
A.The model has a large number of parameters relative to the number of data points.
B.The model's predictions vary significantly for different training sets.
C.The model's average prediction over all possible training sets is far from the true underlying function.
D.The training error is close to zero, but the test error is large.
Correct Answer: The model's average prediction over all possible training sets is far from the true underlying function.
Explanation:
Bias, in the context of the bias-variance decomposition, measures the difference between the average prediction of our model and the correct value which we are trying to predict. High bias means that, on average, the model's predictions are systematically incorrect because its assumptions are too simplistic to approximate the true function.
Incorrect! Try again.
40. In a linear model $y = w_1 x_1 + w_2 x_2 + \dots$, if two features $x_1$ and $x_2$ are highly correlated, how would Ridge (L2) and Lasso (L1) regularization typically distribute the weights $w_1$ and $w_2$?
L1 and L2 regularization
Medium
A.Lasso tends to give both $w_1$ and $w_2$ similar, non-zero coefficients, while Ridge will set one to zero.
B.Both Ridge and Lasso will set one of the correlated feature weights to zero.
C.Ridge will make one coefficient large and positive and the other large and negative, while Lasso will shrink both towards zero.
D.Ridge tends to shrink both $w_1$ and $w_2$ together, giving them similar coefficients, while Lasso might arbitrarily set one to zero and keep the other.
Correct Answer: Ridge tends to shrink both $w_1$ and $w_2$ together, giving them similar coefficients, while Lasso might arbitrarily set one to zero and keep the other.
Explanation:
Ridge regression is known to handle multicollinearity by distributing the weight among correlated features. It will shrink their coefficients towards each other (and towards zero). Lasso, on the other hand, is unstable in the presence of high correlation; it will often arbitrarily pick one feature from a correlated group and assign it a non-zero weight, while setting the others to zero.
Incorrect! Try again.
41. The standard bias-variance decomposition for squared error loss is $\mathbb{E}[(y - \hat{f}(x))^2] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$. If the loss function is changed to the absolute error, $\mathbb{E}[|y - \hat{f}(x)|]$, how does the decomposition of the Expected Prediction Error change?
Bias-variance trade-off
Hard
A.It becomes .
B.The decomposition remains the same, but Bias is calculated with respect to the median of $y$ and Variance is calculated using the L1 norm.
C.The decomposition is no longer possible for non-differentiable loss functions.
D.The decomposition no longer holds in this simple additive form; the relationship becomes more complex and is often expressed as an inequality.
Correct Answer: The decomposition no longer holds in this simple additive form; the relationship becomes more complex and is often expressed as an inequality.
Explanation:
The clean, additive bias-variance-noise decomposition is a special property derived from the linearity of expectations and the properties of variance, which are inherently tied to the squared error loss. For other loss functions like absolute error, the cross-terms in the expansion do not conveniently cancel out. The decomposition is not as straightforward and typically results in an inequality rather than a strict equality.
Incorrect! Try again.
42. Consider a linear regression problem with two highly correlated features, $x_1$ and $x_2$. The model is $y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. How would the estimated coefficients $\beta_1, \beta_2$ behave for Lasso (L1) vs. Ridge (L2) regression as the regularization strength $\lambda$ is increased?
L1 and L2 regularization
Hard
A.Lasso will tend to select one feature and set the other's coefficient to zero, while Ridge will shrink both coefficients towards zero but keep them roughly equal.
B.Ridge will set one coefficient to zero while Lasso will shrink both, keeping them roughly equal.
C.Both Lasso and Ridge will shrink both coefficients towards each other and then towards zero.
D.Lasso will shrink both coefficients to zero at the same rate, while Ridge will arbitrarily pick one to shrink faster.
Correct Answer: Lasso will tend to select one feature and set the other's coefficient to zero, while Ridge will shrink both coefficients towards zero but keep them roughly equal.
Explanation:
With correlated features, the loss function surface has a long, flat valley. Ridge's circular L2 constraint will intersect this valley where $\beta_1 \approx \beta_2$, so it shrinks them together towards zero. Lasso's diamond-shaped L1 constraint is likely to intersect the valley at a corner, which lies on an axis. This means one coefficient becomes exactly zero while the other takes on the shared effect, thus performing feature selection.
Incorrect! Try again.
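A quick numerical sketch of the Ridge half of this behavior, using the closed-form solution (the data, seed, and variable names here are illustrative, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 is almost an exact copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    # closed-form Ridge solution: (X^T X + lam*I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w = ridge(X, y, lam=10.0)
print(w)  # the shared effect is split roughly evenly between the twin features
```

Lasso, by contrast, would typically keep only one of the two coefficients nonzero for the same data.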
43Consider a hypothesis class consisting of classifiers that are unions of at most k disjoint intervals on the real line. A point is classified as +1 if it falls within any of these intervals. What is the Vapnik-Chervonenkis (VC) dimension of this hypothesis class?
VC dimension (intuition)
Hard
A.Infinite
B.2k
C.2k+1
D.k
Correct Answer: 2k
Explanation:
A single interval can shatter 2 points (VCdim = 2). A union of k intervals labels points inside them as positive and points outside as negative. To shatter a set of points, we must realize all possible dichotomies. For 2k points on a line, any labeling contains at most k maximal runs of positive points, and each run can be enclosed by its own interval, so a set of 2k points can be shattered. However, 2k+1 points cannot be shattered. For example, labeling them alternatingly (+, -, +, -, ..., +) requires k+1 positive intervals, which is not allowed by the hypothesis class. Therefore, the VC dimension is 2k.
Incorrect! Try again.
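The counting argument — one interval per maximal run of positive labels — can be checked exhaustively for a small k; the helper below is illustrative:

```python
from itertools import product

def intervals_needed(labels):
    # minimal number of disjoint intervals covering exactly the +1 points of an
    # ordered sample: one interval per maximal run of consecutive +1 labels
    runs, prev = 0, -1
    for lab in labels:
        if lab == 1 and prev != 1:
            runs += 1
        prev = lab
    return runs

k = 3
# every labeling of 2k ordered points has at most k runs of +1s,
# so k intervals can realize all dichotomies of 2k points ...
print(max(intervals_needed(l) for l in product([-1, 1], repeat=2 * k)))  # k
# ... but the alternating labeling of 2k+1 points needs k+1 positive intervals
print(intervals_needed([1, -1] * k + [1]))                               # k + 1
```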
44According to the Structural Risk Minimization (SRM) principle, the generalization error is bounded by R(h) <= R_emp(h) + Ω(h, n, δ), where R_emp is the empirical risk, n is the sample size, and h is the VC dimension. If, for a fixed dataset, Model A (VCdim=5) has an empirical error of 0.10 and Model B (VCdim=10) has an empirical error of 0.08, which statement is most accurate?
Structural risk minimization
Hard
A.Without knowing the exact form of the complexity penalty Ω and the values of n and δ, we cannot definitively choose between Model A and Model B based on SRM.
B.Model B is always better because its empirical error is lower.
C.Model A is always better because its complexity (VC dimension) is lower.
D.The SRM principle would select the model that minimizes R_emp(h) + h, so Model A is chosen (0.10 + 5 vs. 0.08 + 10).
Correct Answer: Without knowing the exact form of the complexity penalty Ω and the values of n and δ, we cannot definitively choose between Model A and Model B based on SRM.
Explanation:
SRM seeks to minimize an upper bound on the true risk, which is the sum of the empirical risk and a complexity penalty term Ω(h, n, δ). The penalty term is an increasing function of the VC dimension h and a decreasing function of the sample size n. Model B has lower empirical risk but a higher penalty; Model A has higher empirical risk but a lower penalty. The optimal choice depends on this trade-off, which is dictated by the specific form of the bound and the sample size n. A small n would favor the simpler Model A, while a very large n would favor the model with lower empirical risk, Model B. Without this information, no definitive choice can be made.
Incorrect! Try again.
45In a high-dimensional setting (a large number of predictors p, with n > p), consider a linear model y = Xβ + ε. The variance of the Ordinary Least Squares (OLS) estimator is given by Var(β̂) = σ²(X^T X)^(-1). How does the phenomenon of multicollinearity (high correlation between predictor variables) specifically affect the bias-variance trade-off for this model?
Bias-variance trade-off
Hard
A.It dramatically increases the variance of the coefficient estimates without affecting the bias, which remains zero for OLS.
B.It increases both the bias and the variance, as the model struggles to attribute effects to specific predictors.
C.It increases the bias by forcing some coefficients to be estimated incorrectly, but it decreases the variance by stabilizing the model.
D.It has no effect on the bias-variance trade-off, as it is an issue of numerical stability, not statistical performance.
Correct Answer: It dramatically increases the variance of the coefficient estimates without affecting the bias, which remains zero for OLS.
Explanation:
Multicollinearity means that columns of X are nearly linearly dependent. This causes the matrix X^T X to be near-singular, and its inverse to have very large diagonal entries. This directly translates to a very high variance in the coefficient estimates β̂. However, the OLS estimator remains unbiased (E[β̂] = β) as long as X^T X is invertible. Therefore, multicollinearity inflates the variance component of the error significantly while leaving the bias unchanged (at zero).
Incorrect! Try again.
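The variance inflation can be seen directly by computing diag((X^T X)^(-1)) for independent versus nearly collinear columns; a minimal numpy sketch with illustrative data (σ² is set to 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
a = rng.normal(size=n)
b = rng.normal(size=n)

X_indep = np.column_stack([a, b])             # uncorrelated columns
X_corr  = np.column_stack([a, a + 0.05 * b])  # nearly collinear columns

def coef_variances(X, sigma2=1.0):
    # OLS coefficient variances: sigma^2 * diag((X^T X)^{-1})
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))

print(coef_variances(X_indep))  # small, comparable entries
print(coef_variances(X_corr))   # entries inflated by orders of magnitude
```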
46A model is trained on a dataset of size n and has p parameters. Let E_train be the training error and E_gen be the true generalization error. From a mathematical viewpoint, which condition provides the strongest evidence of significant overfitting?
Overfitting and underfitting from a mathematical viewpoint
Hard
A.E_train ≈ 0 and E_gen >> E_train.
B.E_train ≈ E_gen, with both errors high.
C.The learning curve for training error and validation error shows a large, persistent gap.
D.p >> n.
Correct Answer: E_train ≈ 0 and E_gen >> E_train.
Explanation:
Overfitting is characterized by a model that learns the training data, including its noise, almost perfectly but fails to generalize to new, unseen data. Mathematically, this is captured by a very low training error (E_train ≈ 0) combined with a much higher generalization error (E_gen >> E_train). Option B describes underfitting. Option D (p >> n) is a condition that often leads to overfitting but is not a direct measure of it. Option C is a qualitative description of the same phenomenon as A, but A provides the direct mathematical condition on the error values themselves.
Incorrect! Try again.
47The solution for Ridge Regression is given by w* = (X^T X + λI)^(-1) X^T y. What is the behavior of the norm of this solution, ||w*||_2, as the regularization parameter λ → ∞?
L1 and L2 regularization
Hard
A.||w*||_2 → 0.
B.The norm is undefined because the matrix X^T X + λI becomes singular.
C.||w*||_2 → ∞.
D.||w*||_2 approaches a non-zero constant determined by X^T y.
Correct Answer: ||w*||_2 → 0.
Explanation:
As λ → ∞, the λI term dominates the matrix X^T X + λI. So (X^T X + λI)^(-1) ≈ (1/λ)I, and therefore w* ≈ X^T y / λ. As λ → ∞, the factor 1/λ → 0, causing the entire vector to approach the zero vector. Consequently, its L2 norm ||w*||_2 also approaches 0.
Incorrect! Try again.
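This limit is easy to verify numerically with the closed-form solution (data and regularization values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    # closed-form Ridge solution: (X^T X + lam*I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lams = [0.0, 1.0, 100.0, 1e6]
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in lams]
print(dict(zip(lams, norms)))  # the norm shrinks monotonically towards 0
```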
48Consider solving a least squares problem subject to a norm constraint: minimize ||y - Xw||² subject to ||w||_p <= t. Which statement correctly describes the relationship between this constrained optimization problem and the standard regularized formulation, minimize ||y - Xw||² + λ||w||_p?
Norm-based constraints
Hard
A.The constrained formulation is a relaxation of the regularized formulation and always yields a sparser solution.
B.For any t, there exists a corresponding λ such that the two formulations have the same solution, and vice-versa (except for a boundary case when t exceeds the norm of the unconstrained solution). This is due to Lagrangian duality.
C.The two formulations are equivalent only for the L2 norm (p = 2) but not for the L1 norm (p = 1).
D.The regularized formulation is only an approximation of the constrained problem and there is no guaranteed equivalence.
Correct Answer: For any t, there exists a corresponding λ such that the two formulations have the same solution, and vice-versa (except for a boundary case when t exceeds the norm of the unconstrained solution). This is due to Lagrangian duality.
Explanation:
This is a fundamental result from optimization theory, related to Lagrangian duality. The regularized form (often called the penalized form) is the Lagrangian relaxation of the constrained problem. For convex problems like these, there is a one-to-one correspondence between the constraint budget t and the regularization parameter λ. A smaller t (tighter constraint) corresponds to a larger λ (stronger penalty), and vice-versa. They are essentially two ways of formulating the same underlying trade-off.
Incorrect! Try again.
49Consider the hypothesis class of all circles in R² (a point is classified as +1 if inside the circle, -1 if outside). What is the VC dimension of this class?
VC dimension (intuition)
Hard
A.3
B.Infinite
C.4
D.2
Correct Answer: 3
Explanation:
You can shatter 3 points (e.g., the vertices of a non-degenerate triangle). For any labeling of these 3 points, you can find a circle that contains the positive points but not the negative ones. However, you cannot shatter 4 points. Consider 4 points on the boundary of a circle. If their labels are alternating (+, -, +, -), no single circle can contain the two positive points without also containing one of the negative points. Therefore, the VC dimension is 3.
Incorrect! Try again.
50For a polynomial regression model of degree d, the coefficients are estimated by minimizing the sum of squared errors. As d increases to the point of overfitting, the L2 norm of the weight vector, ||w||_2, often grows very large. What is the mathematical reason for this phenomenon?
Overfitting and underfitting from a mathematical viewpoint
Hard
A.To fit the noise and specific data points more precisely, the polynomial must make sharp turns, which requires coefficients of large magnitude with alternating signs.
B.The loss function becomes non-convex for large d, leading to unstable solutions with large norms.
C.A higher degree introduces multicollinearity between the polynomial features (x, x², ..., x^d), which always inflates the coefficient estimates.
D.The number of parameters exceeds the number of data points, and the Moore-Penrose pseudoinverse used to solve the system results in large coefficient values.
Correct Answer: To fit the noise and specific data points more precisely, the polynomial must make sharp turns, which requires coefficients of large magnitude with alternating signs.
Explanation:
When a high-degree polynomial overfits, it tries to pass exactly through or very close to each training data point. To create the necessary "wiggles" and sharp turns to interpolate the data (including noise), the function requires large positive and negative coefficients that largely cancel each other out except in the vicinity of the data points. This interplay of large, alternating-sign coefficients leads to a large overall norm of the weight vector.
Incorrect! Try again.
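The coefficient blow-up is easy to observe by fitting monomial bases of increasing degree to a small noisy sample (data, seed, and degrees below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=15)

norms = {}
for deg in [3, 9, 14]:
    V = np.vander(x, deg + 1)                    # monomial design matrix
    w, *_ = np.linalg.lstsq(V, y, rcond=None)    # least-squares fit
    norms[deg] = np.linalg.norm(w)
print(norms)  # the norm explodes as the degree reaches the interpolation regime
```

At degree 14 the polynomial interpolates all 15 noisy points, and the large, alternating-sign coefficients described above appear.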
51The L1 penalty term is λ Σ_i |w_i|. Why is this function not differentiable at w_i = 0, and what is a common approach to optimize loss functions involving this term?
L1 and L2 regularization
Hard
A.The function has a value of 0 at the origin, but its gradient is non-zero, violating differentiability conditions. Proximal gradient descent is used.
B.The function is not convex. Optimization requires specialized non-convex solvers.
C.The derivative has a jump discontinuity at 0 (from -1 to +1). Optimization is typically handled using subgradient descent or coordinate descent.
D.The derivative is infinite at 0. This is handled by adding a small smoothing constant ε, replacing |w_i| with sqrt(w_i² + ε).
Correct Answer: The derivative has a jump discontinuity at 0 (from -1 to +1). Optimization is typically handled using subgradient descent or coordinate descent.
Explanation:
The derivative of |w_i| with respect to w_i is sign(w_i), which is -1 for w_i < 0 and +1 for w_i > 0. At w_i = 0, the left-hand and right-hand derivatives do not match, so the function is not differentiable there. This point is crucial for sparsity. Optimization algorithms must handle it explicitly: subgradient descent replaces the gradient with a subgradient (any value in [-1, 1] at w_i = 0). Other popular methods include coordinate descent and proximal gradient methods.
Incorrect! Try again.
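The proximal operator used by proximal gradient methods for the L1 term has a simple closed form, the soft-thresholding function; a minimal sketch:

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t*|.|: argmin_w 0.5*(w - z)^2 + t*|w|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

shrunk = soft_threshold(np.array([-3.0, -0.5, 0.0, 0.5, 3.0]), 1.0)
print(shrunk)  # [-2.  0.  0.  0.  2.]
```

Values inside [-t, t] are snapped exactly to zero (the source of sparsity); the rest are shrunk towards zero by t.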
52For a k-Nearest Neighbors (k-NN) regression model, how do the bias and variance change as the value of k is increased from 1 to n (the total number of data points)?
Bias-variance trade-off
Hard
A.Both bias and variance decrease.
B.Bias increases and variance decreases.
C.Both bias and variance increase.
D.Bias decreases and variance increases.
Correct Answer: Bias increases and variance decreases.
Explanation:
When k = 1, the model is very flexible and captures local noise, resulting in low bias but high variance. As k increases, the model averages over more neighbors, smoothing its predictions. This makes it less sensitive to the specific training data (lower variance). However, by averaging over a larger, less local neighborhood, the model becomes less flexible and may fail to capture the true underlying function (higher bias). When k = n, the model predicts the global average for any input, resulting in maximum bias and minimum variance.
Incorrect! Try again.
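The two extremes are easy to check with a tiny 1D k-NN regressor (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=30)

def knn_predict(x_train, y_train, x0, k):
    # average the targets of the k nearest training points
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

# k = 1: each training point predicts its own (noisy) label -> zero training error
train_err_k1 = np.mean([(knn_predict(x, y, xi, 1) - yi) ** 2 for xi, yi in zip(x, y)])
# k = n: every query receives the global mean -> maximally smooth, high bias
pred_kn = knn_predict(x, y, 0.3, len(x))
print(train_err_k1, pred_kn, y.mean())
```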
53The SRM principle is based on minimizing a guaranteed upper bound on the true risk, often of the form R(h) <= R_emp(h) + sqrt((h(ln(2n/h) + 1) + ln(4/δ)) / n). What does the presence of the sample size n in the denominator of the complexity term imply about the relationship between Empirical Risk Minimization (ERM) and SRM?
Structural risk minimization
Hard
A.For small , SRM is more important than ERM, and for large , ERM is more important than SRM.
B.The n in the denominator indicates that a larger dataset justifies using a more complex model (higher VC dimension).
C.The bound is only valid for n > h, implying ERM is never sufficient.
D.As n → ∞, the complexity term vanishes, and the SRM solution converges to the ERM solution.
Correct Answer: As n → ∞, the complexity term vanishes, and the SRM solution converges to the ERM solution.
Explanation:
The complexity term, or "VC confidence," is the penalty for model complexity. The sample size n appears in its denominator, so as the amount of training data grows towards infinity, the complexity penalty goes to zero. In this asymptotic regime, minimizing the upper bound (SRM) becomes equivalent to minimizing just the empirical risk (ERM). This provides the theoretical justification for why ERM is often a reasonable approach with a very large amount of data.
Incorrect! Try again.
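Plugging numbers into the VC confidence term shows how quickly it decays with n (the δ value and sample sizes below are illustrative):

```python
import math

def vc_confidence(h, n, delta=0.05):
    # VC penalty term: sqrt((h*(ln(2n/h) + 1) + ln(4/delta)) / n)
    return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)

for n in [100, 10_000, 1_000_000]:
    print(n, round(vc_confidence(h=10, n=n), 4))  # shrinks towards 0 as n grows
```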
54How does standardizing features (rescaling to have mean 0 and standard deviation 1) before applying Ridge (L2) or Lasso (L1) regression affect the solution?
Norm-based constraints
Hard
A.It is crucial because regularization penalizes the magnitude of coefficients, making the solution dependent on the scale of the features. Without it, features with larger scales are unfairly penalized.
B.It has no effect on the final solution because the penalty is applied proportionally to all coefficients.
C.It only improves the numerical stability of the optimization algorithm but does not change the optimal coefficient values found.
D.It only matters for Ridge regression due to its quadratic penalty, but not for Lasso's linear penalty.
Correct Answer: It is crucial because regularization penalizes the magnitude of coefficients, making the solution dependent on the scale of the features. Without it, features with larger scales are unfairly penalized.
Explanation:
Both L1 and L2 regularization add a penalty based on the magnitude of the coefficients. If features are on different scales, the penalty is applied inequitably. For example, if feature x1 has a range of 1000 and x2 has a range of 1, the coefficient w1 must be much smaller than w2 to express a similar effect. The regularization penalty would then bear far more heavily on w2 than on w1, even though the two features are equally informative. Standardizing features puts them on a common scale, ensuring the penalty is applied fairly and is not biased by arbitrary units.
Incorrect! Try again.
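A small Ridge sketch of the scale dependence: two equally informative features, one measured in units 1000 times larger (data, seed, and λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
f1 = rng.normal(size=n)             # feature on a unit scale
f2 = 1000 * rng.normal(size=n)      # same kind of signal, in different units
X = np.column_stack([f1, f2])
y = f1 + f2 / 1000 + 0.1 * rng.normal(size=n)   # both features equally informative

def ridge(X, y, lam):
    # closed-form Ridge solution: (X^T X + lam*I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize: mean 0, std 1
w_raw = ridge(X, y, lam=100.0)
w_std = ridge(Xs, y, lam=100.0)
print(w_raw)  # the unit-scale feature absorbs nearly all of the shrinkage
print(w_std)  # on standardized data both coefficients are shrunk equally
```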
55Which of the following statements correctly captures the relationship between the number of parameters in a model and its VC dimension?
VC dimension (intuition)
Hard
A.The VC dimension is only defined for models with a finite number of parameters.
B.The VC dimension can be much larger or much smaller than the number of parameters; there is no universal direct relationship.
C.The VC dimension is always equal to the number of trainable parameters plus one.
D.The VC dimension is always upper-bounded by the number of parameters.
Correct Answer: The VC dimension can be much larger or much smaller than the number of parameters; there is no universal direct relationship.
Explanation:
While the number of parameters is often a rough heuristic for model complexity, there is no strict relationship with VC dimension. For linear classifiers in R^d, the VC dimension is d + 1, matching the number of parameters. However, a model with just one parameter, such as f(x) = sign(sin(ωx)), has infinite VC dimension because it can be made to oscillate arbitrarily fast and thereby shatter any number of points. Conversely, models can be constructed with many heavily constrained parameters, giving a VC dimension much smaller than the parameter count.
Incorrect! Try again.
56Consider the geometry of the unregularized least-squares loss function, which has elliptical level sets. Lasso (L1) and Ridge (L2) add a constraint region that is a diamond (L1-ball) and a circle (L2-ball), respectively. Why is the Lasso solution more likely to be sparse (have zero-valued coefficients)?
L1 and L2 regularization
Hard
A.The L1 norm is a linear function, while the loss function is quadratic, and their intersection is always at an axis.
B.The sharp corners of the L1-ball are more likely to be the first point of contact as the elliptical level sets of the loss function expand, and these corners lie on the axes where some coefficients are zero.
C.The L2-ball's curved surface ensures that the point of tangency with the loss function's level sets will almost never be on an axis, while the L1-ball has flat sides.
D.The L1-ball has a smaller volume than the L2-ball for a given radius, which forces more coefficients to be zero.
Correct Answer: The sharp corners of the L1-ball are more likely to be the first point of contact as the elliptical level sets of the loss function expand, and these corners lie on the axes where some coefficients are zero.
Explanation:
The optimal solution is the point where the expanding level set of the loss function first touches the constraint region. The L2-ball (circle/sphere) is uniformly round, so the point of tangency can be anywhere on its surface. The L1-ball (diamond/polytope) has sharp corners that protrude along the axes. It is geometrically much more probable that the expanding ellipse will touch one of these corners before it touches any other part of the diamond's boundary. This contact at a corner corresponds to a sparse solution where one or more coefficients are exactly zero.
Incorrect! Try again.
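The corner-touching geometry shows up concretely as exact zeros in a Lasso fit. A minimal cyclic coordinate-descent sketch (data, seed, and λ are illustrative, not a production solver):

```python
import numpy as np

def lasso_cd(X, y, lam, iters=200):
    # cyclic coordinate descent for 0.5*||y - Xw||^2 + lam*||w||_1
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]          # residual excluding feature j
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            # soft-thresholding update sets small coordinates exactly to zero
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=100)
w = lasso_cd(X, y, lam=20.0)
print(w)  # the irrelevant coefficients are driven exactly to zero
```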
57A model's performance is evaluated using a learning curve, plotting training and validation error against the number of training samples. In a classic case of high variance (overfitting), what is the expected behavior of these curves as the number of training samples increases?
Overfitting and underfitting from a mathematical viewpoint
Hard
A.Both errors will be high and will remain high, indicating the model cannot learn the data.
B.The training error will be very low and will increase, while the validation error will be high and will decrease. They will converge towards each other.
C.The training error will start low and stay low, while the validation error will start high and stay high.
D.Both errors will start high and decrease, with a persistent large gap between them.
Correct Answer: The training error will be very low and will increase, while the validation error will be high and will decrease. They will converge towards each other.
Explanation:
A model with high variance overfits the training data. With a small training set, it can fit it almost perfectly, so training error is near zero, but it generalizes poorly, so validation error is high. As the training set size increases, it becomes harder for the complex model to perfectly fit all the data, so the training error increases. At the same time, more data provides a better signal of the underlying pattern, so the model begins to generalize better, and the validation error decreases. The gap between the two curves narrows, indicating that more data can help mitigate high variance.
Incorrect! Try again.
58Consider the bias-variance decomposition: Expected Error = Bias² + Variance + σ². Which of these terms can, in principle, be reduced to zero by choosing a sufficiently complex model and having an infinite amount of training data?
Bias-variance trade-off
Hard
A.Only Variance.
B.Only Bias.
C.Bias, Variance, and Noise.
D.Both Bias and Variance.
Correct Answer: Both Bias and Variance.
Explanation:
The Bias term measures how far the average model prediction is from the true function. With a sufficiently flexible model class (e.g., one that contains the true function), the bias can be reduced to zero. The Variance term measures the model's sensitivity to the specific training set. As the amount of training data approaches infinity, the model's prediction becomes stable and independent of the specific sample, thus the variance approaches zero. The Noise term (σ²), or irreducible error, is a property of the data-generating process itself and cannot be reduced by any model.
Incorrect! Try again.
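The decomposition can be estimated by Monte Carlo: refit a deliberately simple model on many fresh training sets and measure the bias² and variance of its prediction at one point (all constants and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
true_f = lambda x: np.sin(2 * np.pi * x)
x0, sigma = 0.3, 0.3            # evaluation point and noise standard deviation

def fit_predict(deg):
    # draw a fresh training set, fit a degree-`deg` polynomial, predict at x0
    x = rng.uniform(0, 1, 30)
    y = true_f(x) + sigma * rng.normal(size=30)
    w, *_ = np.linalg.lstsq(np.vander(x, deg + 1), y, rcond=None)
    return np.polyval(w, x0)

preds = np.array([fit_predict(deg=1) for _ in range(2000)])
bias2 = (preds.mean() - true_f(x0)) ** 2
variance = preds.var()
print(bias2, variance)  # a straight-line fit: bias dominates, variance is small
```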
59Elastic Net regularization combines L1 and L2 penalties: λ1||w||_1 + λ2||w||_2². What is the primary advantage of this combination, especially in a scenario with a group of highly correlated features?
L1 and L2 regularization
Hard
A.It allows the use of standard gradient descent, which is not possible for the pure L1 penalty.
B.It is computationally more efficient to optimize than either Lasso or Ridge alone.
C.It exhibits the 'grouping effect', tending to select all the correlated features together, while still promoting overall sparsity.
D.It creates a solution that is sparser than Lasso and more stable than Ridge.
Correct Answer: It exhibits the 'grouping effect', tending to select all the correlated features together, while still promoting overall sparsity.
Explanation:
In a situation with a group of highly correlated features, Lasso tends to arbitrarily select only one feature from the group. The Ridge penalty component in Elastic Net encourages the coefficients of correlated features to be similar. The combination, known as the 'grouping effect', means the Elastic Net will often select or discard the entire group of correlated features together, which is often a desirable property. The L1 component ensures that the overall solution remains sparse by setting the coefficients of irrelevant features (or groups) to zero.
Incorrect! Try again.
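A coordinate-descent sketch of the grouping effect using two exactly duplicated features (data and penalty values are illustrative; setting lam2 = 0 recovers Lasso):

```python
import numpy as np

def enet_cd(X, y, lam1, lam2, iters=300):
    # cyclic coordinate descent for 0.5*||y - Xw||^2 + lam1*||w||_1 + 0.5*lam2*||w||^2
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]          # residual excluding feature j
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            w[j] = np.sign(rho) * max(abs(rho) - lam1, 0.0) / (z + lam2)
    return w

rng = np.random.default_rng(9)
x = rng.normal(size=200)
X = np.column_stack([x, x])                 # two exactly duplicated features
y = 4 * x + 0.1 * rng.normal(size=200)

w_lasso = enet_cd(X, y, lam1=50.0, lam2=0.0)
w_enet  = enet_cd(X, y, lam1=50.0, lam2=200.0)
print(w_lasso)  # Lasso keeps one twin and zeroes the other
print(w_enet)   # Elastic Net splits the effect evenly between the twins
```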
60Consider the hypothesis class H of 1-Nearest Neighbor (1-NN) classifiers defined by a training set S of n points in R^d. What is the VC dimension of this hypothesis class with respect to the training set itself?
VC dimension (intuition)
Hard
A.1
B.Infinite
C.n
D.d+1
Correct Answer: n
Explanation:
The 1-NN classifier's decision boundary is determined entirely by the training data. If we ask whether the classifier can shatter its own training set S of size n, the answer is yes. For any of the 2^n possible labelings of the points in S, we can simply assign those labels to the points. When we then classify each point x in S, its nearest neighbor is x itself, so it is assigned its own label. Thus, 1-NN can perfectly realize any dichotomy on its own training set, and its VC dimension is at least n. It cannot shatter larger sets in general, because it only has n 'parameters' (the training points). This highlights how the complexity of non-parametric models can grow with the data.
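The shattering argument can be verified exhaustively for a small training set (points and seed are illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(8)
pts = rng.normal(size=(5, 2))       # n = 5 training points in R^2

def one_nn(train_x, train_y, x):
    # predict the label of the nearest training point (Euclidean distance)
    return train_y[np.argmin(np.linalg.norm(train_x - x, axis=1))]

# check every one of the 2^n labelings: each training point's nearest
# neighbor is itself, so every dichotomy is realized exactly
shattered = all(
    all(one_nn(pts, np.array(labels), p) == lab for p, lab in zip(pts, labels))
    for labels in product([-1, 1], repeat=len(pts))
)
print(shattered)  # True: 1-NN shatters its own training set
```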