Unit 6 - Practice Quiz

INT255 60 Questions

1 In the context of machine learning, what does 'bias' represent?

Bias-variance trade-off Easy
A. The error from erroneous assumptions in the learning algorithm, leading to a model that is too simple.
B. The error from the model's sensitivity to small fluctuations in the training set.
C. The inherent noise present in the dataset.
D. The model's accuracy on the test dataset.

2 A model that has high variance is likely to:

Bias-variance trade-off Easy
A. Perform very well on the training data but poorly on unseen test data.
B. Perform poorly on the training data but well on unseen test data.
C. Perform very well on both the training data and test data.
D. Perform poorly on both the training data and test data.

3 Which scenario best describes an 'underfitting' model?

Overfitting and underfitting from a mathematical viewpoint Easy
A. Low training error and low test error.
B. High training error and high test error.
C. Low training error and high test error.
D. High training error and low test error.

4 Overfitting occurs when a model has:

Overfitting and underfitting from a mathematical viewpoint Easy
A. Low bias and low variance.
B. High bias and low variance.
C. Low bias and high variance.
D. High bias and high variance.

5 What is the primary goal of applying regularization to a machine learning model?

L1 and L2 regularization Easy
A. To eliminate the need for a separate test set.
B. To prevent overfitting by penalizing model complexity.
C. To reduce the training time of the model.
D. To increase the model's training accuracy to 100%.

6 The L2 regularization penalty term added to the loss function is proportional to the:

L1 and L2 regularization Easy
A. Maximum absolute value among the model's weights.
B. Sum of the absolute magnitudes of the model's weights.
C. Sum of the squared magnitudes of the model's weights.
D. Number of non-zero weights in the model.

7 A key advantage of L1 regularization (Lasso) over L2 regularization (Ridge) is that it:

L1 and L2 regularization Easy
A. Always produces a model with lower bias.
B. Does not require tuning a hyperparameter.
C. Is computationally less expensive to calculate.
D. Can produce sparse models by shrinking some weights to exactly zero.
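The sparsity contrast in this question can be made concrete with the one-dimensional closed-form updates for each penalty. This is a minimal sketch with invented numbers, not tied to any particular library:

```python
# One-dimensional illustration of why L1 yields exact zeros while L2 only shrinks.
# For the loss (w - z)^2 / 2 plus a penalty, the minimizers are:
#   L2 (ridge, penalty lam * w^2 / 2): w* = z / (1 + lam)              -> never exactly 0 for z != 0
#   L1 (lasso, penalty lam * |w|):     w* = sign(z) * max(|z|-lam, 0)  -> exactly 0 when |z| <= lam

def ridge_update(z, lam):
    return z / (1.0 + lam)

def lasso_update(z, lam):
    # soft-thresholding operator
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

for z in (0.3, -0.8, 2.0):
    print(z, ridge_update(z, 1.0), lasso_update(z, 1.0))
```

For any |z| below the threshold λ, the L1 update lands exactly on zero, while the L2 update merely divides by (1 + λ).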

8 According to the bias-variance trade-off, as we increase a model's complexity, what typically happens to bias and variance?

Bias-variance trade-off Easy
A. Bias increases and variance decreases.
B. Both bias and variance increase.
C. Bias decreases and variance increases.
D. Both bias and variance decrease.

9 The L1 norm of a vector is defined as:

Norm-based constraints Easy
A. ‖x‖₁ = |x₁| + |x₂| + … + |xₙ|
B. ‖x‖₂ = √(x₁² + x₂² + … + xₙ²)
C. ‖x‖∞ = max(|x₁|, …, |xₙ|)
D. ‖x‖₀ = the number of non-zero entries of x

10 From a geometric viewpoint, L1 regularization constrains the coefficients to lie within a:

Norm-based constraints Easy
A. Hypercube.
B. Hyperdiamond (a square rotated by 45 degrees in 2D).
C. Hypersphere (a circle in 2D).
D. Hyperplane.

11 In the regularization term λ·penalty(w), what happens if the hyperparameter λ is set to zero?

L1 and L2 regularization Easy
A. The model will fail to train.
B. The model's bias becomes infinitely high.
C. The model becomes maximally regularized, and all weights go to zero.
D. The model becomes a standard, unregularized model.

12 What is 'empirical risk' in the context of machine learning?

Structural risk minimization Easy
A. The measured average loss over the training set.
B. The complexity of the chosen model.
C. The expected loss over all possible unseen data.
D. The irreducible error or noise in the data.

13 The principle of Structural Risk Minimization (SRM) suggests choosing a model that:

Structural risk minimization Easy
A. Minimizes only the empirical risk.
B. Minimizes only the variance of the model.
C. Balances low empirical risk with low model complexity.
D. Maximizes the model complexity to fit all data points.

14 What does the VC dimension of a hypothesis class (a set of models) measure?

VC dimension (intuition) Easy
A. The amount of bias in the model.
B. The learning rate of the optimization algorithm.
C. The number of data points in the training set.
D. The capacity or expressive power of the class.

15 A model with an infinite VC dimension is:

VC dimension (intuition) Easy
A. A model that cannot be trained.
B. The ideal model for all classification tasks.
C. So simple that it cannot learn any patterns.
D. So powerful that it can memorize any training set, making it highly prone to overfitting.

16 For a linear classifier in a 2D plane (a simple line), what is the maximum number of points that it can 'shatter'?

VC dimension (intuition) Easy
A. Unlimited
B. 4
C. 3
D. 2

17 From a mathematical viewpoint, what is a common characteristic of the weights (coefficients) in a heavily overfitted polynomial regression model?

Overfitting and underfitting from a mathematical viewpoint Easy
A. They are all very close to zero.
B. They are all exactly one.
C. They have very large positive and negative magnitudes.
D. They are all positive.

18 When L2 regularization is applied to linear regression, it is commonly known as:

L1 and L2 regularization Easy
A. Lasso Regression
B. Principal Component Regression
C. Elastic Net
D. Ridge Regression

19 Which of the following describes a model with high bias?

Bias-variance trade-off Easy
A. An overfitting model.
B. A model with low training error.
C. A complex model like a deep neural network.
D. An underfitting model.

20 Which algorithm is most directly associated with L1 regularization for the purpose of feature selection?

L1 and L2 regularization Easy
A. PCA
B. LASSO
C. Ridge
D. k-Nearest Neighbors

21 A machine learning model is trained on a dataset, and its performance is evaluated. The expected squared prediction error is decomposed as Error = Bias² + Variance + σ². If we are given a perfect model and an infinite amount of training data, which of these terms would still remain non-zero?

Bias-variance trade-off Medium
A. Bias², the squared bias
B. σ², the irreducible error
C. All terms would become zero
D. Variance, the variance

22 A model is considered to be overfitting when...

Overfitting and underfitting from a mathematical viewpoint Medium
A. Training error is low but validation error is high and potentially increasing.
B. Both training error and validation error are low and converging.
C. Training error is high but validation error is low.
D. Both training error and validation error are high.

23 In the context of linear regression, what is the primary difference in the effect of L1 (Lasso) versus L2 (Ridge) regularization on the model's weight vector w?

L1 and L2 regularization Medium
A. L1 regularization penalizes large weights more severely than L2 regularization.
B. L1 regularization always results in a lower bias model compared to L2 regularization.
C. L1 regularization can force some weights to be exactly zero, while L2 only shrinks them towards zero.
D. L2 regularization can force some weights to be exactly zero, while L1 only shrinks them towards zero.

24 Consider the k-Nearest Neighbors (k-NN) algorithm. How does decreasing the value of k (e.g., from a large value of k down to k = 1) typically affect the bias and variance of the model?

Bias-variance trade-off Medium
A. Both bias and variance increase.
B. Bias increases, and variance decreases.
C. Both bias and variance decrease.
D. Bias decreases, and variance increases.

25 The principle of Structural Risk Minimization (SRM) provides a framework for model selection by balancing two key quantities. What are they?

Structural risk minimization Medium
A. Empirical risk (training error) and a model complexity term.
B. Bias and variance.
C. The L1 norm and the L2 norm of the weights.
D. Training accuracy and validation accuracy.

26 The cost function for Ridge regression is given by ‖y − Xw‖² + λ‖w‖₂². This is equivalent to which of the following constrained optimization problems for some value t > 0?

Norm-based constraints Medium
A. Minimize ‖y − Xw‖² subject to ‖w‖₂² ≤ t
B. Minimize ‖y − Xw‖² subject to ‖w‖₁ ≤ t
C. Minimize ‖w‖₂² subject to ‖y − Xw‖² ≤ t
D. Minimize ‖y − Xw‖² subject to ‖w‖₂² ≥ t

27 If model class A has a VC dimension of 10 and model class B has a VC dimension of 1000, which of the following is generally true when training on a dataset of a fixed size?

VC dimension (intuition) Medium
A. Model A will always have a lower training error than Model B.
B. Model A is guaranteed to generalize better than Model B.
C. Model B will have lower variance than Model A.
D. Model B has a higher capacity to overfit the data compared to Model A.

28 You have fit a high-degree polynomial regression model to a dataset with a small number of data points (n). You observe that the number of model parameters (p) is greater than n. What is the most likely issue?

Overfitting and underfitting from a mathematical viewpoint Medium
A. The optimization algorithm failed to converge.
B. The model is overfitting because its capacity is too high for the amount of data.
C. The model has high bias and low variance.
D. The model is underfitting because polynomials cannot capture the true relationship.

29 Consider a Ridge regression model. What happens to the learned weight coefficients as the regularization parameter λ approaches infinity (λ → ∞)?

L1 and L2 regularization Medium
A. All weight coefficients approach zero.
B. All weight coefficients approach infinity.
C. The weight coefficients become equal to the ordinary least squares solution.
D. Some weight coefficients become exactly zero, while others remain large.
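The limiting behavior asked about here can be checked numerically with the one-feature closed form w(λ) = (x·y)/(x·x + λ), assuming no intercept and toy data invented for illustration:

```python
# Ridge with a single feature and no intercept: w(lam) = (x . y) / (x . x + lam).
# As lam grows, the coefficient shrinks smoothly toward zero but never jumps to 0.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]
xty = sum(a * b for a, b in zip(x, y))  # x . y
xtx = sum(a * a for a in x)             # x . x

def ridge_w(lam):
    return xty / (xtx + lam)

for lam in (0.0, 1.0, 100.0, 1e6):
    print(lam, ridge_w(lam))
```

Each increase in λ shrinks the coefficient further, and the coefficient stays strictly non-zero for any finite λ.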

30 A data scientist observes that their model performs poorly on both the training set and the test set. The errors are high and very similar. Which of the following best describes this situation in terms of bias and variance?

Bias-variance trade-off Medium
A. High bias, high variance
B. Low bias, low variance
C. High bias, low variance
D. Low bias, high variance

31 From a geometric perspective, why does L1 regularization (Lasso) promote sparse solutions (i.e., weights being exactly zero)?

Norm-based constraints Medium
A. The L1 norm is not differentiable, which causes the optimization to get stuck at zero.
B. The L1 norm constraint region is a hypersphere, which forces weights to be small and uniform.
C. The L1 norm constraint region is a hyperdiamond, and the elliptical contours of the loss function are likely to make contact at a corner.
D. The L1 norm penalizes small weights more than large weights, forcing them to zero.

32 You are working on a problem with 10,000 features, but you suspect that only about 100 of them are actually useful. Which regularization technique would be a more suitable initial choice and why?

L1 and L2 regularization Medium
A. L2 (Ridge), because it handles multicollinearity better by shrinking all weights.
B. L1 (Lasso), because it performs automatic feature selection by driving irrelevant feature weights to zero.
C. L1 (Lasso), because its optimization is computationally faster than L2.
D. L2 (Ridge), because it results in a model with lower bias.

33 According to generalization theory involving VC dimension, what is the impact of increasing the number of training samples (n) on the generalization gap, assuming the model class (and thus its VC dimension) remains fixed?

VC dimension (intuition) Medium
A. The upper bound on the generalization gap tends to decrease.
B. The generalization gap remains constant.
C. The upper bound on the generalization gap tends to increase.
D. The impact depends on whether the model is linear or non-linear.

34 What is the most likely mathematical consequence of adding a significant number of new, relevant data points to the training set of a model that is currently overfitting?

Overfitting and underfitting from a mathematical viewpoint Medium
A. The model's bias will likely increase, and the training error will increase.
B. The model's bias will likely decrease, and the validation error will worsen.
C. The model's variance will likely increase, and the validation error will worsen.
D. The model's variance will likely decrease, and the validation error will improve.

35 How does adding an L2 regularization term, λ‖w‖₂², to a loss function practically implement the Structural Risk Minimization (SRM) principle?

Structural risk minimization Medium
A. It adds a penalty for model complexity, where complexity is measured by the magnitude of the weights.
B. It increases the empirical risk to better match the true risk.
C. It guarantees that the empirical risk will be zero.
D. It directly minimizes the VC dimension of the model class.

36 The absolute value function in the L1 penalty, λ Σ_j |w_j|, is not differentiable at w_j = 0. What is a practical implication of this for training a model with L1 regularization?

L1 and L2 regularization Medium
A. Standard gradient descent cannot be used directly; specialized optimizers like Coordinate Descent or subgradient methods are needed.
B. The model can never learn a weight that is exactly zero.
C. The cost function becomes non-convex, making it hard to find a global minimum.
D. The training process becomes significantly faster than with L2 regularization.

37 A team builds a very complex neural network with millions of parameters for a simple classification task with only a few hundred data points. Without any regularization, the model achieves 99.9% training accuracy but only 60% test accuracy. This large gap between training and test accuracy is primarily due to...

Bias-variance trade-off Medium
A. High irreducible error
B. High bias
C. A non-convex loss function
D. High variance

38 Comparing the L1 and L2 norms as penalty functions, how do they differ in their treatment of a single, very large weight coefficient versus many medium-sized coefficients (assuming the sum of absolute magnitudes, Σ_j |w_j|, is the same)?

Norm-based constraints Medium
A. L2 penalizes the single large weight more heavily than L1 due to the squaring effect.
B. Both norms penalize them equally as long as Σ_j |w_j| is the same.
C. L2 encourages a single large weight, while L1 encourages many medium-sized weights.
D. L1 penalizes the single large weight more heavily than L2.

39 An underfit model is characterized by high bias. Which of the following is a direct mathematical interpretation of high bias?

Overfitting and underfitting from a mathematical viewpoint Medium
A. The model has a large number of parameters relative to the number of data points.
B. The model's predictions vary significantly for different training sets.
C. The model's average prediction over all possible training sets is far from the true underlying function.
D. The training error is close to zero, but the test error is large.

40 In a linear model y = w₁x₁ + w₂x₂ + …, if two features x₁ and x₂ are highly correlated, how would Ridge (L2) and Lasso (L1) regularization typically distribute the weights w₁ and w₂?

L1 and L2 regularization Medium
A. Lasso tends to give both w₁ and w₂ similar, non-zero coefficients, while Ridge will set one to zero.
B. Both Ridge and Lasso will set one of the correlated feature weights to zero.
C. Ridge will make one coefficient large and positive and the other large and negative, while Lasso will shrink both towards zero.
D. Ridge tends to shrink w₁ and w₂ together, giving them similar coefficients, while Lasso might arbitrarily set one to zero and keep the other.

41 The standard bias-variance decomposition for squared error loss is E[(y − ŷ)²] = Bias² + Variance + Noise. If the loss function is changed to the absolute error, E[|y − ŷ|], how does the decomposition of the Expected Prediction Error change?

Bias-variance trade-off Hard
A. It becomes |Bias| + Variance + Noise.
B. The decomposition remains the same, but Bias is calculated with respect to the median of ŷ and Variance is calculated using the L1 norm.
C. The decomposition is no longer possible for non-differentiable loss functions.
D. The decomposition no longer holds in this simple additive form; the relationship becomes more complex and is often expressed as an inequality.

42 Consider a linear regression problem with two highly correlated features, x₁ ≈ x₂. The model is y = β₁x₁ + β₂x₂ + ε. How would the estimated coefficients behave for Lasso (L1) vs. Ridge (L2) regression as the regularization strength λ is increased?

L1 and L2 regularization Hard
A. Lasso will tend to select one feature and set the other's coefficient to zero, while Ridge will shrink both coefficients towards zero but keep them roughly equal.
B. Ridge will set one coefficient to zero while Lasso will shrink both, keeping them roughly equal.
C. Both Lasso and Ridge will shrink both coefficients towards each other and then towards zero.
D. Lasso will shrink both coefficients to zero at the same rate, while Ridge will arbitrarily pick one to shrink faster.
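The Ridge half of this behavior can be sketched numerically: with two identical copies of a feature, the L2 penalty pulls the two coefficients together even from an asymmetric start. A minimal gradient-descent sketch on invented toy data (the learning rate, λ, and data are all assumptions for illustration):

```python
# Gradient descent on ridge with two *identical* features.
# Loss = sum_i (y_i - (w1 + w2) * x_i)^2 + lam * (w1^2 + w2^2)
# The L2 penalty is the only force distinguishing w1 from w2, and it
# drives their difference to zero; Lasso has no such grouping pressure.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
lam, lr = 0.5, 0.01
w1, w2 = 1.7, 0.1  # deliberately unequal starting point

for _ in range(5000):
    # shared data-fit gradient (identical features => identical term)
    g = sum(-2 * xi * (yi - (w1 + w2) * xi) for xi, yi in zip(x, y))
    w1, w2 = w1 - lr * (g + 2 * lam * w1), w2 - lr * (g + 2 * lam * w2)

print(w1, w2)  # nearly equal after convergence
```

The difference w1 − w2 shrinks by a factor (1 − 2·lr·λ) each step, so the coefficients converge to equal values regardless of the starting point.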

43 Consider a hypothesis class consisting of classifiers that are unions of at most k disjoint intervals on the real line. A point is classified as +1 if it falls within any of these intervals. What is the Vapnik-Chervonenkis (VC) dimension of this hypothesis class?

VC dimension (intuition) Hard
A. Infinite
B. 2k
C. 2k+1
D. k
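The k = 1 case of this question can be verified by brute force: enumerate every labeling of a point set and search over candidate interval endpoints. The grid-based search below is an assumption that suffices here, since only the ordering of the points matters:

```python
# Brute-force check for k = 1: a single interval [a, b] shatters any
# 2 points but cannot realize the +,-,+ labeling on 3 collinear points.
from itertools import product

def realizable(points, labels):
    # Try interval endpoints drawn from a grid around the points
    # (an empty interval arises when a > b, covering the all-negative labeling).
    cands = sorted(points)
    grid = [cands[0] - 1.0] + cands + [cands[-1] + 1.0]
    return any(
        all((a <= p <= b) == lab for p, lab in zip(points, labels))
        for a in grid for b in grid
    )

def shattered(points):
    return all(realizable(points, labs)
               for labs in product([True, False], repeat=len(points)))

print(shattered([0.0, 1.0]))       # 2 points: shattered
print(shattered([0.0, 1.0, 2.0]))  # 3 points: +,-,+ is unrealizable
```

The same enumeration idea extends to unions of k intervals, where 2k points can be shattered but 2k + 1 cannot.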

44 According to the Structural Risk Minimization (SRM) principle, the generalization error is bounded by R ≤ R_emp + Ω(h, n), where R_emp is the empirical risk, n is the sample size, and h is the VC dimension. If for a fixed dataset, Model A (VCdim=5) has an empirical error of 0.10 and Model B (VCdim=10) has an empirical error of 0.08, which statement is most accurate?

Structural risk minimization Hard
A. Without knowing the exact form of the complexity penalty and the values of n and h, we cannot definitively choose between Model A and Model B based on SRM.
B. Model B is always better because its empirical error is lower.
C. Model A is always better because its complexity (VC dimension) is lower.
D. The SRM principle would select the model that minimizes R_emp + h, so Model A is chosen (0.10 + 5 vs 0.08 + 10).

45 In a high-dimensional setting (large p), consider a linear model y = Xβ + ε. The variance of the Ordinary Least Squares (OLS) estimator is given by Var(β̂) = σ²(XᵀX)⁻¹. How does the phenomenon of multicollinearity (high correlation between predictor variables) specifically affect the bias-variance trade-off for this model?

Bias-variance trade-off Hard
A. It dramatically increases the variance of the coefficient estimates without affecting the bias, which remains zero for OLS.
B. It increases both the bias and the variance, as the model struggles to attribute effects to specific predictors.
C. It increases the bias by forcing some coefficients to be estimated incorrectly, but it decreases the variance by stabilizing the model.
D. It has no effect on the bias-variance trade-off, as it is an issue of numerical stability, not statistical performance.

46 A model is trained on a dataset of size n and has p parameters. Let E_train be the training error and E_gen be the true generalization error. From a mathematical viewpoint, which condition provides the strongest evidence of significant overfitting?

Overfitting and underfitting from a mathematical viewpoint Hard
A. E_train is high and E_train ≈ E_gen.
B. p > n.
C. The learning curve for training error and validation error shows a large, persistent gap.
D. E_train ≈ 0 and E_gen ≫ E_train.

47 The solution for Ridge Regression is given by ŵ = (XᵀX + λI)⁻¹Xᵀy. What is the behavior of the norm of this solution, ‖ŵ‖₂, as the regularization parameter λ → ∞?

L1 and L2 regularization Hard
A. ‖ŵ‖₂ → 0.
B. The norm is undefined as (XᵀX + λI) becomes singular.
C. ‖ŵ‖₂ → ∞.
D. ‖ŵ‖₂ approaches a non-zero constant determined by X and y.

48 Consider solving a least squares problem subject to a norm constraint: minimize ‖y − Xw‖² subject to ‖w‖_q ≤ t. Which statement correctly describes the relationship between this constrained optimization problem and the standard regularized regression formulation, minimize ‖y − Xw‖² + λ‖w‖_q?

Norm-based constraints Hard
A. The constrained formulation is a relaxation of the regularized formulation and always yields a sparser solution.
B. For any λ > 0, there exists a corresponding t such that the two formulations have the same solution, and vice-versa (except for a boundary case at the maximum t). This is due to Lagrangian duality.
C. The two formulations are equivalent only for the L2 norm (q = 2) but not for the L1 norm (q = 1).
D. The regularized formulation is only an approximation of the constrained problem and there is no guaranteed equivalence.

49 Consider the hypothesis class of all circles in ℝ² (a point is classified as +1 if inside the circle, -1 if outside). What is the VC dimension of this class?

VC dimension (intuition) Hard
A. 3
B. Infinite
C. 4
D. 2

50 For a polynomial regression model of degree d, the coefficients are estimated by minimizing the sum of squared errors. As d increases to the point of overfitting, the L2 norm of the weight vector, ‖w‖₂, often grows very large. What is the mathematical reason for this phenomenon?

Overfitting and underfitting from a mathematical viewpoint Hard
A. To fit the noise and specific data points more precisely, the polynomial must make sharp turns, which requires coefficients of large magnitude with alternating signs.
B. The loss function becomes non-convex for large d, leading to unstable solutions with large norms.
C. A higher degree introduces multicollinearity between the polynomial features (x, x², …, x^d), which always inflates the coefficient estimates.
D. The number of parameters exceeds the number of data points, and the Moore-Penrose pseudoinverse used to solve the system results in large coefficient values.

51 The L1 penalty term is λ Σ_j |w_j|. Why is this function not differentiable at w_j = 0, and what is a common approach to optimize loss functions involving this term?

L1 and L2 regularization Hard
A. The function has a value of 0 at the origin, but its gradient is non-zero, violating differentiability conditions. Proximal gradient descent is used.
B. The function is not convex. Optimization requires specialized non-convex solvers.
C. The derivative has a jump discontinuity at 0 (from -1 to +1). Optimization is typically handled using subgradient descent or coordinate descent.
D. The derivative is infinite at 0. This is handled by adding a small constant, ε, to |w_j|.

52 For a k-Nearest Neighbors (k-NN) regression model, how do the bias and variance change as the value of k is increased from 1 to n (the total number of data points)?

Bias-variance trade-off Hard
A. Both bias and variance decrease.
B. Bias increases and variance decreases.
C. Both bias and variance increase.
D. Bias decreases and variance increases.
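The two extremes of k can be seen directly on toy one-dimensional data (invented here for illustration): k = 1 reproduces each training label exactly (low bias, high variance), while k = n predicts the global mean everywhere (high bias, low variance).

```python
# k-NN regression: average the labels of the k training points nearest x0.
def knn_predict(xs, ys, x0, k):
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 3.0]
print(knn_predict(xs, ys, 2.0, 1))        # k = 1: exact training label, 2.0
print(knn_predict(xs, ys, 2.0, len(xs)))  # k = n: global mean, 1.5
```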

53 The SRM principle is based on minimizing a guaranteed upper bound on the true risk, often of the form R ≤ R_emp + √(h/n) (up to constants and logarithmic factors), where h is the VC dimension. What does the presence of the sample size n in the denominator of the complexity term imply about the relationship between Empirical Risk Minimization (ERM) and SRM?

Structural risk minimization Hard
A. For small n, SRM is more important than ERM, and for large n, ERM is more important than SRM.
B. The in the denominator indicates that a larger dataset justifies using a more complex model (higher VC dimension).
C. The bound is only valid for n > h, implying ERM is never sufficient.
D. As n → ∞, the complexity term vanishes, and the SRM solution converges to the ERM solution.

54 How does standardizing features (rescaling to have mean 0 and standard deviation 1) before applying Ridge (L2) or Lasso (L1) regression affect the solution?

Norm-based constraints Hard
A. It is crucial because regularization penalizes the magnitude of coefficients, making the solution dependent on the scale of the features. Without it, features with larger scales are unfairly penalized.
B. It has no effect on the final solution because the penalty is applied proportionally to all coefficients.
C. It only improves the numerical stability of the optimization algorithm but does not change the optimal coefficient values found.
D. It only matters for Ridge regression due to its quadratic penalty, but not for Lasso's linear penalty.
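The scale sensitivity at issue here is easy to demonstrate with the one-feature ridge closed form (toy data and units invented for illustration): the same data expressed in different units gives different penalized predictions, while the unregularized fit is unaffected by the unit change.

```python
# Ridge with one feature, no intercept: w = (x . y) / (x . x + lam).
# The penalty acts on the raw coefficient magnitude, so rescaling the
# feature changes the regularized solution; with lam = 0 it is harmless.
def ridge_pred(x, y, lam, x0):
    w = sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)
    return w * x0

x_m = [1.0, 2.0, 3.0]             # feature in metres
x_km = [v / 1000.0 for v in x_m]  # the same feature in kilometres
y = [1.0, 2.0, 3.0]

print(ridge_pred(x_m, y, 1.0, 2.0), ridge_pred(x_km, y, 1.0, 0.002))  # differ
print(ridge_pred(x_m, y, 0.0, 2.0), ridge_pred(x_km, y, 0.0, 0.002))  # agree
```

In kilometres the coefficient must be 1000 times larger to make the same prediction, so the same λ penalizes it far more heavily, which is why standardization matters before applying L1 or L2 penalties.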

55 Which of the following statements correctly captures the relationship between the number of parameters in a model and its VC dimension?

VC dimension (intuition) Hard
A. The VC dimension is only defined for models with a finite number of parameters.
B. The VC dimension can be much larger or much smaller than the number of parameters; there is no universal direct relationship.
C. The VC dimension is always equal to the number of trainable parameters plus one.
D. The VC dimension is always upper-bounded by the number of parameters.

56 Consider the geometry of the unregularized least-squares loss function, which has elliptical level sets. Lasso (L1) and Ridge (L2) add a constraint region that is a diamond (L1-ball) and a circle (L2-ball), respectively. Why is the Lasso solution more likely to be sparse (have zero-valued coefficients)?

L1 and L2 regularization Hard
A. The L1 norm is a linear function, while the loss function is quadratic, and their intersection is always at an axis.
B. The sharp corners of the L1-ball are more likely to be the first point of contact as the elliptical level sets of the loss function expand, and these corners lie on the axes where some coefficients are zero.
C. The L2-ball's curved surface ensures that the point of tangency with the loss function's level sets will almost never be on an axis, while the L1-ball has flat sides.
D. The L1-ball has a smaller volume than the L2-ball for a given radius, which forces more coefficients to be zero.

57 A model's performance is evaluated using a learning curve, plotting training and validation error against the number of training samples. In a classic case of high variance (overfitting), what is the expected behavior of these curves as the number of training samples increases?

Overfitting and underfitting from a mathematical viewpoint Hard
A. Both errors will be high and will remain high, indicating the model cannot learn the data.
B. The training error will be very low and will increase, while the validation error will be high and will decrease. They will converge towards each other.
C. The training error will start low and stay low, while the validation error will start high and stay high.
D. Both errors will start high and decrease, with a persistent large gap between them.

58 Consider the bias-variance decomposition: E[(y − ŷ)²] = Bias² + Variance + Noise. Which of these terms can, in principle, be reduced to zero by choosing a sufficiently complex model and having an infinite amount of training data?

Bias-variance trade-off Hard
A. Only Variance.
B. Only Bias.
C. Bias, Variance, and Noise.
D. Both Bias and Variance.

59 Elastic Net regularization combines L1 and L2 penalties: λ₁‖w‖₁ + λ₂‖w‖₂². What is the primary advantage of this combination, especially in a scenario with a group of highly correlated features and p > n?

L1 and L2 regularization Hard
A. It allows the use of standard gradient descent, which is not possible for the pure L1 penalty.
B. It is computationally more efficient to optimize than either Lasso or Ridge alone.
C. It exhibits the 'grouping effect', tending to select all the correlated features together, while still promoting overall sparsity.
D. It creates a solution that is sparser than Lasso and more stable than Ridge.
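How the two penalties combine can be sketched in one dimension, assuming the single-coordinate form of the problem: for the loss (w − z)²/2 + λ₁|w| + λ₂w²/2, the minimizer applies L1 soft-thresholding (exact zeros) followed by L2 shrinkage (the stabilizing effect behind grouping).

```python
# One-dimensional Elastic Net update: threshold first (L1), then shrink (L2).
def enet_update(z, lam1, lam2):
    if abs(z) <= lam1:
        return 0.0                    # L1 part: exact sparsity
    shrunk = z - lam1 if z > 0 else z + lam1
    return shrunk / (1.0 + lam2)      # L2 part: smooth shrinkage

print(enet_update(0.4, 0.5, 1.0))  # small input is thresholded to zero
print(enet_update(2.0, 0.5, 1.0))  # large input is thresholded, then shrunk
```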

60 Consider the hypothesis class H of 1-Nearest Neighbor (1-NN) classifiers defined by a training set of n points in ℝᵈ. What is the VC dimension of this hypothesis class with respect to the training set itself?

VC dimension (intuition) Hard
A. 1
B. Infinite
C. n
D. d+1