1. In the context of machine learning, what does 'bias' represent?
Bias-variance trade-off
Easy
A.The error from erroneous assumptions in the learning algorithm, leading to a model that is too simple.
B.The error from the model's sensitivity to small fluctuations in the training set.
C.The inherent noise present in the dataset.
D.The model's accuracy on the test dataset.
Correct Answer: The error from erroneous assumptions in the learning algorithm, leading to a model that is too simple.
Explanation:
Bias is the error introduced by approximating a real-world problem, which may be complex, by a much simpler model. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
Incorrect! Try again.
2. A model that has high variance is likely to:
Bias-variance trade-off
Easy
A.Perform very well on the training data but poorly on unseen test data.
B.Perform poorly on the training data but well on unseen test data.
C.Perform very well on both the training data and test data.
D.Perform poorly on both the training data and test data.
Correct Answer: Perform very well on the training data but poorly on unseen test data.
Explanation:
High variance indicates that the model is too complex and has learned the noise in the training data, a condition known as overfitting. This makes it perform poorly when generalizing to new, unseen data.
Incorrect! Try again.
3. Which scenario best describes an 'underfitting' model?
Overfitting and underfitting from a mathematical viewpoint
Easy
A.Low training error and low test error.
B.High training error and high test error.
C.Low training error and high test error.
D.High training error and low test error.
Correct Answer: High training error and high test error.
Explanation:
Underfitting occurs when a model is too simple to capture the underlying pattern of the data. As a result, it performs poorly not only on the test data but also on the data it was trained on.
Incorrect! Try again.
4. Overfitting occurs when a model has:
Overfitting and underfitting from a mathematical viewpoint
Easy
A.Low bias and low variance.
B.High bias and low variance.
C.Low bias and high variance.
D.High bias and high variance.
Correct Answer: Low bias and high variance.
Explanation:
An overfitted model is very flexible (low bias) and captures the noise in the training data, making it sensitive to small changes (high variance). This harms its ability to generalize.
Incorrect! Try again.
5. What is the primary goal of applying regularization to a machine learning model?
L1 and L2 regularization
Easy
A.To eliminate the need for a separate test set.
B.To prevent overfitting by penalizing model complexity.
C.To reduce the training time of the model.
D.To increase the model's training accuracy to 100%.
Correct Answer: To prevent overfitting by penalizing model complexity.
Explanation:
Regularization techniques add a penalty term to the loss function for large coefficient values, which discourages the model from becoming overly complex and fitting the noise in the training data.
Incorrect! Try again.
6. The L2 regularization penalty term added to the loss function is proportional to the:
L1 and L2 regularization
Easy
A.Maximum absolute value among the model's weights.
B.Sum of the absolute magnitudes of the model's weights.
C.Sum of the squared magnitudes of the model's weights.
D.Number of non-zero weights in the model.
Correct Answer: Sum of the squared magnitudes of the model's weights.
Explanation:
L2 regularization, or Ridge regression, adds a penalty term of the form $\lambda \sum_j w_j^2$, where $w_j$ are the model weights. This penalizes large weights, encouraging them to be small and distributed.
Incorrect! Try again.
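A quick numerical sketch of how the two penalties are computed (the weight vector and regularization strength below are illustrative values, not from any particular model):

```python
import numpy as np

# Hypothetical weight vector and regularization strength (illustrative values).
w = np.array([0.5, -1.2, 3.0])
lam = 0.1

# L2 (Ridge) penalty: lambda times the sum of squared weights.
l2_penalty = lam * np.sum(w ** 2)   # 0.1 * (0.25 + 1.44 + 9.0) ≈ 1.069

# For comparison, the L1 (Lasso) penalty sums absolute values instead.
l1_penalty = lam * np.sum(np.abs(w))  # 0.1 * (0.5 + 1.2 + 3.0) ≈ 0.47
```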
7. A key advantage of L1 regularization (Lasso) over L2 regularization (Ridge) is that it:
L1 and L2 regularization
Easy
A.Always produces a model with lower bias.
B.Does not require tuning a hyperparameter.
C.Is computationally less expensive to calculate.
D.Can produce sparse models by shrinking some weights to exactly zero.
Correct Answer: Can produce sparse models by shrinking some weights to exactly zero.
Explanation:
The L1 penalty term, $\lambda \sum_j |w_j|$, has the effect of forcing some coefficient estimates to be exactly zero, which makes it useful for feature selection.
Incorrect! Try again.
8. According to the bias-variance trade-off, as we increase a model's complexity, what typically happens to bias and variance?
Bias-variance trade-off
Easy
A.Bias increases and variance decreases.
B.Both bias and variance increase.
C.Bias decreases and variance increases.
D.Both bias and variance decrease.
Correct Answer: Bias decreases and variance increases.
Explanation:
A more complex model can better fit the training data, reducing bias. However, this increased flexibility makes it more sensitive to the training data's noise, thus increasing variance.
Incorrect! Try again.
9. The L1 norm of a vector $x$ is defined as:
Norm-based constraints
Easy
A.
B.
C.
D.
Correct Answer: $\|x\|_1 = \sum_i |x_i|$
Explanation:
The L1 norm, also known as the Manhattan distance or Taxicab norm, is the sum of the absolute values of the components of the vector. It is represented as $\|x\|_1 = \sum_i |x_i|$.
Incorrect! Try again.
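The definition can be checked numerically; `np.linalg.norm` with `ord=1` computes the same sum of absolute values (the vector below is illustrative):

```python
import numpy as np

v = np.array([3.0, -4.0, 2.0])  # example vector (illustrative values)

# L1 norm: sum of absolute values of the components.
l1 = np.sum(np.abs(v))

# numpy's norm with ord=1 computes the same quantity.
assert np.isclose(np.linalg.norm(v, ord=1), l1)

print(l1)  # 9.0
```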
10. From a geometric viewpoint, L1 regularization constrains the coefficients to lie within a:
Norm-based constraints
Easy
A.Hypercube.
B.Hyperdiamond (a square rotated by 45 degrees in 2D).
C.Hypersphere (a circle in 2D).
D.Hyperplane.
Correct Answer: Hyperdiamond (a square rotated by 45 degrees in 2D).
Explanation:
The L1 constraint defines a region whose shape is a hyperdiamond (or orthoplex). The sharp corners of this shape are why the optimization is likely to find solutions where some weights are exactly zero.
Incorrect! Try again.
11. In the regularization term $\lambda \sum_j w_j^2$, what happens if the hyperparameter $\lambda$ is set to zero?
L1 and L2 regularization
Easy
A.The model will fail to train.
B.The model's bias becomes infinitely high.
C.The model becomes maximally regularized, and all weights go to zero.
D.The model becomes a standard, unregularized model.
Correct Answer: The model becomes a standard, unregularized model.
Explanation:
The hyperparameter $\lambda$ controls the strength of the regularization. If $\lambda = 0$, the penalty term becomes zero, and the loss function reduces to the standard empirical risk, meaning no regularization is applied.
Incorrect! Try again.
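A small numpy sketch (synthetic data; the closed-form Ridge solution is assumed) showing that the Ridge estimate with $\lambda = 0$ coincides with ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form Ridge solution: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ordinary least squares via lstsq.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# With lam = 0 the penalty vanishes and Ridge reduces to OLS.
w_lam0 = ridge(X, y, 0.0)
assert np.allclose(w_lam0, w_ols)
```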
12. What is 'empirical risk' in the context of machine learning?
Structural risk minimization
Easy
A.The measured average loss over the training set.
B.The complexity of the chosen model.
C.The expected loss over all possible unseen data.
D.The irreducible error or noise in the data.
Correct Answer: The measured average loss over the training set.
Explanation:
Empirical risk is the error that the model makes on the training data. Simply minimizing this risk can lead to overfitting, which is why we also consider the model's complexity.
Incorrect! Try again.
13. The principle of Structural Risk Minimization (SRM) suggests choosing a model that:
Structural risk minimization
Easy
A.Minimizes only the empirical risk.
B.Minimizes only the variance of the model.
C.Balances low empirical risk with low model complexity.
D.Maximizes the model complexity to fit all data points.
Correct Answer: Balances low empirical risk with low model complexity.
Explanation:
SRM aims to minimize a bound on the true generalization error. This bound is a sum of two terms: the empirical risk (training error) and a term that depends on the model's complexity (like its VC dimension).
Incorrect! Try again.
14. What does the VC dimension of a hypothesis class (a set of models) measure?
VC dimension (intuition)
Easy
A.The amount of bias in the model.
B.The learning rate of the optimization algorithm.
C.The number of data points in the training set.
D.The capacity or expressive power of the class.
Correct Answer: The capacity or expressive power of the class.
Explanation:
The VC dimension is a measure of a model's complexity. It is defined as the maximum number of points that the model can 'shatter,' meaning it can perfectly classify them no matter how they are labeled.
Incorrect! Try again.
15. A model with an infinite VC dimension is:
VC dimension (intuition)
Easy
A.A model that cannot be trained.
B.The ideal model for all classification tasks.
C.So simple that it cannot learn any patterns.
D.So powerful that it can memorize any training set, making it highly prone to overfitting.
Correct Answer: So powerful that it can memorize any training set, making it highly prone to overfitting.
Explanation:
An infinite VC dimension means the model has unlimited capacity. It can shatter any number of points, allowing it to perfectly memorize the training data, including noise, which leads to poor generalization.
Incorrect! Try again.
16. For a linear classifier in a 2D plane (a simple line), what is the maximum number of points that it can 'shatter'?
VC dimension (intuition)
Easy
A.Unlimited
B.4
C.3
D.2
Correct Answer: 3
Explanation:
A line can separate any 3 non-collinear points in all possible ways. However, it cannot shatter 4 points; for example, it's impossible to separate points of a square where opposite corners have the same label. Thus, its VC dimension is 3.
Incorrect! Try again.
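This can be verified by brute force. The sketch below (a minimal perceptron; the points and labelings are illustrative choices) checks all $2^3$ labelings of 3 non-collinear points and the XOR labeling of 4 points. Failing to converge within a fixed number of epochs is only evidence, not proof, of non-separability, but the XOR configuration is known to be non-separable:

```python
import itertools
import numpy as np

def linearly_separable(points, labels, epochs=2000):
    """Run the perceptron algorithm (with a bias term); if it reaches
    zero mistakes, the labeling is realizable by a line."""
    X = np.hstack([points, np.ones((len(points), 1))])  # append bias feature
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(X, labels):
            if y * (w @ x) <= 0:   # misclassified (or on the boundary)
                w += y * x
                mistakes += 1
        if mistakes == 0:
            return True
    return False

# Three non-collinear points: every one of the 2^3 labelings is separable.
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
tri_ok = all(linearly_separable(tri, np.array(lab))
             for lab in itertools.product([-1, 1], repeat=3))

# Four corners of a square with the XOR labeling are not separable,
# so a line cannot shatter 4 points: its VC dimension is 3.
square = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
xor_ok = linearly_separable(square, np.array([1, 1, -1, -1]))

print(tri_ok, xor_ok)  # True False
```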
17. From a mathematical viewpoint, what is a common characteristic of the weights (coefficients) in a heavily overfitted polynomial regression model?
Overfitting and underfitting from a mathematical viewpoint
Easy
A.They are all very close to zero.
B.They are all exactly one.
C.They have very large positive and negative magnitudes.
D.They are all positive.
Correct Answer: They have very large positive and negative magnitudes.
Explanation:
To make a high-degree polynomial pass through all the training points, the model often develops extremely large coefficients. This makes the function very 'wiggly' and sensitive to small changes in input, which is a hallmark of overfitting.
Incorrect! Try again.
18. When L2 regularization is applied to linear regression, it is commonly known as:
L1 and L2 regularization
Easy
A.Lasso Regression
B.Principal Component Regression
C.Elastic Net
D.Ridge Regression
Correct Answer: Ridge Regression
Explanation:
Ridge Regression is the specific name for a linear regression model that is regularized using the L2 norm of the coefficients.
Incorrect! Try again.
19. Which of the following describes a model with high bias?
Bias-variance trade-off
Easy
A.An overfitting model.
B.A model with low training error.
C.A complex model like a deep neural network.
D.An underfitting model.
Correct Answer: An underfitting model.
Explanation:
High bias means the model makes strong, potentially incorrect assumptions about the data. This leads to underfitting, where the model is too simple to capture the true underlying patterns.
Incorrect! Try again.
20. Which algorithm is most directly associated with L1 regularization for the purpose of feature selection?
L1 and L2 regularization
Easy
A.PCA
B.LASSO
C.Ridge
D.k-Nearest Neighbors
Correct Answer: LASSO
Explanation:
LASSO stands for Least Absolute Shrinkage and Selection Operator. Its use of the L1 penalty inherently performs feature selection by shrinking the coefficients of less important features to exactly zero.
Incorrect! Try again.
21. A machine learning model is trained on a dataset, and its performance is evaluated. The expected squared prediction error is decomposed as $\mathbb{E}[(y - \hat{f}(x))^2] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$. If we are given a perfect model and an infinite amount of training data, which of these terms would still remain non-zero?
Bias-variance trade-off
Medium
A.$\mathrm{Bias}^2$, the squared bias
B.$\sigma^2$, the irreducible error
C.All terms would become zero
D.$\mathrm{Variance}$, the variance
Correct Answer: $\sigma^2$, the irreducible error
Explanation:
The irreducible error, $\sigma^2$, represents the inherent noise in the data itself. Even with a perfect model and infinite data, this randomness cannot be eliminated. Bias and variance are properties of the model and would approach zero under these ideal conditions.
Incorrect! Try again.
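A Monte Carlo sketch of this point (the true function, noise level, and sample size below are all assumed for illustration): even the perfect model's expected squared error floors out at $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(42)

# True function and noise level (assumed for this simulation).
f = lambda x: np.sin(x)
sigma = 0.5

# Even the *perfect* model f cannot beat the noise: its expected
# squared error approaches sigma^2, the irreducible error.
x = rng.uniform(0, 2 * np.pi, size=200_000)
y = f(x) + sigma * rng.normal(size=x.size)

mse_perfect_model = np.mean((y - f(x)) ** 2)
print(mse_perfect_model)  # ≈ sigma**2 = 0.25
```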
22. A model is considered to be overfitting when...
Overfitting and underfitting from a mathematical viewpoint
Medium
A.Training error is low but validation error is high and potentially increasing.
B.Both training error and validation error are low and converging.
C.Training error is high but validation error is low.
D.Both training error and validation error are high.
Correct Answer: Training error is low but validation error is high and potentially increasing.
Explanation:
Overfitting occurs when a model learns the training data too well, including its noise and specific quirks. This results in excellent performance on the training set (low error) but poor performance on unseen data (high validation error), indicating a failure to generalize.
Incorrect! Try again.
23. In the context of linear regression, what is the primary difference in the effect of L1 (Lasso) versus L2 (Ridge) regularization on the model's weight vector $w$?
L1 and L2 regularization
Medium
A.L1 regularization penalizes large weights more severely than L2 regularization.
B.L1 regularization always results in a lower bias model compared to L2 regularization.
C.L1 regularization can force some weights to be exactly zero, while L2 only shrinks them towards zero.
D.L2 regularization can force some weights to be exactly zero, while L1 only shrinks them towards zero.
Correct Answer: L1 regularization can force some weights to be exactly zero, while L2 only shrinks them towards zero.
Explanation:
L1 regularization adds a penalty term proportional to $\|w\|_1 = \sum_j |w_j|$. Due to the geometry of the L1 norm (a diamond shape), it tends to produce sparse solutions where many model weights are exactly zero, effectively performing feature selection. L2's penalty, $\|w\|_2^2 = \sum_j w_j^2$, shrinks weights towards zero but rarely sets them to exactly zero.
Incorrect! Try again.
24. Consider the k-Nearest Neighbors (k-NN) algorithm. How does decreasing the value of $k$ (e.g., from a large $k$ down to $k=1$) typically affect the bias and variance of the model?
Bias-variance trade-off
Medium
A.Both bias and variance increase.
B.Bias increases, and variance decreases.
C.Both bias and variance decrease.
D.Bias decreases, and variance increases.
Correct Answer: Bias decreases, and variance increases.
Explanation:
A smaller $k$ makes the k-NN model more flexible and sensitive to local noise in the training data. This reduces the model's bias (it can fit complex patterns) but increases its variance (it is highly dependent on the specific training points). A $k=1$ classifier, for example, has very high variance.
Incorrect! Try again.
25. The principle of Structural Risk Minimization (SRM) provides a framework for model selection by balancing two key quantities. What are they?
Structural risk minimization
Medium
A.Empirical risk (training error) and a model complexity term.
B.Bias and variance.
C.The L1 norm and the L2 norm of the weights.
D.Training accuracy and validation accuracy.
Correct Answer: Empirical risk (training error) and a model complexity term.
Explanation:
SRM aims to minimize an upper bound on the true risk (generalization error). This bound is expressed as the sum of the empirical risk (the error on the training data) and a complexity term (which depends on the model's capacity, like its VC dimension). Regularization is a practical implementation of SRM.
Incorrect! Try again.
26. The cost function for Ridge regression is given by $J(w) = \mathrm{MSE}(w) + \lambda \sum_j w_j^2$. This is equivalent to which of the following constrained optimization problems for some value $t$?
Norm-based constraints
Medium
A.Minimize subject to
B.Minimize subject to
C.Minimize subject to
D.Minimize subject to
Correct Answer: Minimize $\mathrm{MSE}(w)$ subject to $\|w\|_2^2 \le t$
Explanation:
The Lagrangian formulation of adding a penalty term is equivalent to a constrained optimization problem. Ridge regression minimizes the Mean Squared Error (MSE) while keeping the squared L2 norm of the weight vector below a certain threshold $t$. The value of $t$ is inversely related to the regularization parameter $\lambda$.
Incorrect! Try again.
27. If model class A has a VC dimension of 10 and model class B has a VC dimension of 1000, which of the following is generally true when training on a dataset of a fixed size?
VC dimension (intuition)
Medium
A.Model A will always have a lower training error than Model B.
B.Model A is guaranteed to generalize better than Model B.
C.Model B will have lower variance than Model A.
D.Model B has a higher capacity to overfit the data compared to Model A.
Correct Answer: Model B has a higher capacity to overfit the data compared to Model A.
Explanation:
VC dimension is a measure of a model class's capacity or complexity. A higher VC dimension means the model can shatter a larger set of points, allowing it to fit more complex patterns. This increased flexibility makes it more prone to fitting the noise in the training data (overfitting), especially with a limited dataset.
Incorrect! Try again.
28. You have fit a high-degree polynomial regression model to a dataset with a small number of data points ($N$). You observe that the number of model parameters ($p$) is greater than $N$. What is the most likely issue?
Overfitting and underfitting from a mathematical viewpoint
Medium
A.The optimization algorithm failed to converge.
B.The model is overfitting because its capacity is too high for the amount of data.
C.The model has high bias and low variance.
D.The model is underfitting because polynomials cannot capture the true relationship.
Correct Answer: The model is overfitting because its capacity is too high for the amount of data.
Explanation:
When the number of parameters ($p$) exceeds the number of data points ($N$), the model has enough flexibility to perfectly memorize the training data, including its noise. This is a classic recipe for overfitting, leading to very low training error but poor generalization to new data.
Incorrect! Try again.
29. Consider a Ridge regression model. What happens to the learned weight coefficients as the regularization parameter $\lambda$ approaches infinity ($\lambda \to \infty$)?
L1 and L2 regularization
Medium
A.All weight coefficients approach zero.
B.All weight coefficients approach infinity.
C.The weight coefficients become equal to the ordinary least squares solution.
D.Some weight coefficients become exactly zero, while others remain large.
Correct Answer: All weight coefficients approach zero.
Explanation:
The Ridge cost function is $\mathrm{MSE}(w) + \lambda \sum_j w_j^2$. As $\lambda \to \infty$, the penalty term dominates the loss function. To minimize this term, the optimizer must force all weights to be as close to zero as possible.
Incorrect! Try again.
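A closed-form sketch of this shrinkage on synthetic data (the Ridge solution $(X^\top X + \lambda I)^{-1} X^\top y$ is assumed): the norm of the weight vector decreases monotonically as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

def ridge(X, y, lam):
    """Closed-form Ridge estimate: (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# As lam grows, the penalty dominates and every weight is driven toward 0.
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 100.0, 1e6)]
print(norms)  # strictly decreasing, last entry near zero
```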
30. A data scientist observes that their model performs poorly on both the training set and the test set. The errors are high and very similar. Which of the following best describes this situation in terms of bias and variance?
Bias-variance trade-off
Medium
A.High bias, high variance
B.Low bias, low variance
C.High bias, low variance
D.Low bias, high variance
Correct Answer: High bias, low variance
Explanation:
Poor performance on both training and test sets suggests the model is too simple to capture the underlying structure of the data (underfitting). This is characteristic of high bias. The errors being similar suggests the model's predictions do not change much with different training sets, which implies low variance.
Incorrect! Try again.
31. From a geometric perspective, why does L1 regularization (Lasso) promote sparse solutions (i.e., weights being exactly zero)?
Norm-based constraints
Medium
A.The L1 norm is not differentiable, which causes the optimization to get stuck at zero.
B.The L1 norm constraint region is a hypersphere, which forces weights to be small and uniform.
C.The L1 norm constraint region is a hyperdiamond, and the elliptical contours of the loss function are likely to make contact at a corner.
D.The L1 norm penalizes small weights more than large weights, forcing them to zero.
Correct Answer: The L1 norm constraint region is a hyperdiamond, and the elliptical contours of the loss function are likely to make contact at a corner.
Explanation:
In two dimensions, the L1 constraint forms a diamond. The unregularized loss function forms elliptical contours. The optimal solution is the point where an ellipse first touches the diamond. Due to the sharp corners of the diamond lying on the axes, this contact point is very likely to be at a corner, where one of the weights is exactly zero.
Incorrect! Try again.
32. You are working on a problem with 10,000 features, but you suspect that only about 100 of them are actually useful. Which regularization technique would be a more suitable initial choice and why?
L1 and L2 regularization
Medium
A.L2 (Ridge), because it handles multicollinearity better by shrinking all weights.
B.L1 (Lasso), because it performs automatic feature selection by driving irrelevant feature weights to zero.
C.L1 (Lasso), because its optimization is computationally faster than L2.
D.L2 (Ridge), because it results in a model with lower bias.
Correct Answer: L1 (Lasso), because it performs automatic feature selection by driving irrelevant feature weights to zero.
Explanation:
When dealing with high-dimensional data where many features are expected to be irrelevant, L1 regularization is highly advantageous. Its tendency to produce sparse models (setting many weights to zero) acts as an embedded feature selection method, simplifying the model and potentially improving its interpretability and performance.
Incorrect! Try again.
33. According to generalization theory involving VC dimension, what is the impact of increasing the number of training samples ($N$) on the generalization gap, assuming the model class (and thus its VC dimension) remains fixed?
VC dimension (intuition)
Medium
A.The upper bound on the generalization gap tends to decrease.
B.The generalization gap remains constant.
C.The upper bound on the generalization gap tends to increase.
D.The impact depends on whether the model is linear or non-linear.
Correct Answer: The upper bound on the generalization gap tends to decrease.
Explanation:
Generalization bounds, such as the one derived from VC theory, typically show that the gap between true risk and empirical risk is bounded by a term on the order of $\sqrt{d_{\mathrm{VC}} \ln N / N}$. As the number of samples $N$ increases, this bound becomes tighter, suggesting that with more data, the model's performance on the training set becomes a more reliable estimate of its performance on unseen data.
Incorrect! Try again.
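One common form of the VC bound (assumed here purely for illustration) can be evaluated directly to see the gap tighten as $N$ grows for a fixed model class:

```python
import numpy as np

def vc_bound_gap(d_vc, n, delta=0.05):
    """One classic VC-style bound on the generalization gap (assumed form,
    for illustration): sqrt((d*(ln(2n/d) + 1) + ln(4/delta)) / n)."""
    return np.sqrt((d_vc * (np.log(2 * n / d_vc) + 1) + np.log(4 / delta)) / n)

# Fixed model class (d_vc = 10): more samples tighten the bound.
gaps = [vc_bound_gap(10, n) for n in (100, 1_000, 10_000, 100_000)]
print(gaps)  # monotonically decreasing
```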
34. What is the most likely mathematical consequence of adding a significant number of new, relevant data points to the training set of a model that is currently overfitting?
Overfitting and underfitting from a mathematical viewpoint
Medium
A.The model's bias will likely increase, and the training error will increase.
B.The model's bias will likely decrease, and the validation error will worsen.
C.The model's variance will likely increase, and the validation error will worsen.
D.The model's variance will likely decrease, and the validation error will improve.
Correct Answer: The model's variance will likely decrease, and the validation error will improve.
Explanation:
Overfitting is a high-variance problem. Providing more data helps the model learn the true underlying pattern instead of memorizing noise. This reduces the model's sensitivity to the specific training sample, thereby decreasing its variance and improving its ability to generalize, which is reflected in a lower validation error.
Incorrect! Try again.
35. How does adding an L2 regularization term, $\lambda \|w\|_2^2$, to a loss function practically implement the Structural Risk Minimization (SRM) principle?
Structural risk minimization
Medium
A.It adds a penalty for model complexity, where complexity is measured by the magnitude of the weights.
B.It increases the empirical risk to better match the true risk.
C.It guarantees that the empirical risk will be zero.
D.It directly minimizes the VC dimension of the model class.
Correct Answer: It adds a penalty for model complexity, where complexity is measured by the magnitude of the weights.
Explanation:
SRM balances empirical risk (loss on training data) with a penalty for model complexity. L2 regularization operationalizes this by adding a penalty term, $\lambda \sum_j w_j^2$, to the loss. This term penalizes models with large weights, which are considered more complex. The optimizer must therefore find a solution that fits the data well (low empirical risk) without making the weights too large (low complexity penalty).
Incorrect! Try again.
36. The absolute value function in the L1 penalty, $|w_j|$, is not differentiable at $w_j = 0$. What is a practical implication of this for training a model with L1 regularization?
L1 and L2 regularization
Medium
A.Standard gradient descent cannot be used directly; specialized optimizers like Coordinate Descent or subgradient methods are needed.
B.The model can never learn a weight that is exactly zero.
C.The cost function becomes non-convex, making it hard to find a global minimum.
D.The training process becomes significantly faster than with L2 regularization.
Correct Answer: Standard gradient descent cannot be used directly; specialized optimizers like Coordinate Descent or subgradient methods are needed.
Explanation:
The non-differentiability of the L1 norm at zero means that the gradient is not defined for any weight that is exactly zero. Therefore, standard gradient-based optimization methods cannot be applied without modification. Algorithms like Proximal Gradient Descent, Coordinate Descent, or those using subgradients are employed to handle this issue.
Incorrect! Try again.
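A minimal proximal-gradient (ISTA) sketch of one such specialized optimizer, on synthetic data where only the first feature matters (the data, step size, and iteration count are illustrative assumptions): the soft-thresholding step handles the non-differentiable L1 term and produces exact zeros:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t*||w||_1: shrink toward zero, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_ista(X, y, lam, step=None, iters=5000):
    """Proximal gradient descent (ISTA) for
    (1/2n)||y - Xw||^2 + lam*||w||_1 — a minimal sketch."""
    n, d = X.shape
    if step is None:
        # 1/L, where L is the Lipschitz constant of the smooth part's gradient.
        step = n / np.linalg.norm(X.T @ X, 2)
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n          # gradient of the smooth term
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # only feature 0 matters

w = lasso_ista(X, y, lam=0.5)
print(w)  # feature 0 gets a sizable weight; the others are exactly 0.0
```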
37. A team builds a very complex neural network with millions of parameters for a simple classification task with only a few hundred data points. Without any regularization, the model achieves 99.9% training accuracy but only 60% test accuracy. This large gap between training and test accuracy is primarily due to...
Bias-variance trade-off
Medium
A.High irreducible error
B.High bias
C.A non-convex loss function
D.High variance
Correct Answer: High variance
Explanation:
A large gap between training and test performance is the hallmark of overfitting. Mathematically, overfitting corresponds to a model with high variance. The model is so complex that it has learned the specific noise and artifacts of the small training set, and its predictions vary drastically when presented with new, unseen data.
Incorrect! Try again.
38. Comparing the L1 and L2 norms as penalty functions, how do they differ in their treatment of a single, very large weight coefficient versus many medium-sized coefficients (assuming the sum of magnitudes is the same)?
Norm-based constraints
Medium
A.L2 penalizes the single large weight more heavily than L1 due to the squaring effect.
B.Both norms penalize them equally as long as $\|w\|_1$ is the same.
C.L2 encourages a single large weight, while L1 encourages many medium-sized weights.
D.L1 penalizes the single large weight more heavily than L2.
Correct Answer: L2 penalizes the single large weight more heavily than L1 due to the squaring effect.
Explanation:
Consider two weight vectors with the same L1 norm, for example $w_A = (4, 0, 0, 0)$ and $w_B = (1, 1, 1, 1)$: $\|w_A\|_1 = \|w_B\|_1 = 4$, but $\|w_A\|_2^2 = 16$ while $\|w_B\|_2^2 = 4$. The L2 penalty for the single large weight is much higher. This shows that the L2 norm prefers to distribute weight values more evenly, while being very averse to individual large weights.
Incorrect! Try again.
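The comparison can be checked numerically; the specific vectors below (one large weight vs. four medium weights with the same L1 norm) are illustrative:

```python
import numpy as np

# One large weight vs. four medium weights with the same L1 norm.
w_single = np.array([4.0, 0.0, 0.0, 0.0])
w_spread = np.array([1.0, 1.0, 1.0, 1.0])

l1_single = np.sum(np.abs(w_single))   # 4.0
l1_spread = np.sum(np.abs(w_spread))   # 4.0  -> L1 treats them identically

l2_single = np.sum(w_single ** 2)      # 16.0
l2_spread = np.sum(w_spread ** 2)      # 4.0  -> L2 punishes the large weight
```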
39. An underfit model is characterized by high bias. Which of the following is a direct mathematical interpretation of high bias?
Overfitting and underfitting from a mathematical viewpoint
Medium
A.The model has a large number of parameters relative to the number of data points.
B.The model's predictions vary significantly for different training sets.
C.The model's average prediction over all possible training sets is far from the true underlying function.
D.The training error is close to zero, but the test error is large.
Correct Answer: The model's average prediction over all possible training sets is far from the true underlying function.
Explanation:
Bias, in the context of the bias-variance decomposition, measures the difference between the average prediction of our model and the correct value which we are trying to predict. High bias means that, on average, the model's predictions are systematically incorrect because its assumptions are too simplistic to approximate the true function.
Incorrect! Try again.
40. In a linear model $y = w_1 x_1 + w_2 x_2 + \dots$, if two features $x_1$ and $x_2$ are highly correlated, how would Ridge (L2) and Lasso (L1) regularization typically distribute the weights $w_1$ and $w_2$?
L1 and L2 regularization
Medium
A.Lasso tends to give both $w_1$ and $w_2$ similar, non-zero coefficients, while Ridge will set one to zero.
B.Both Ridge and Lasso will set one of the correlated feature weights to zero.
C.Ridge will make one coefficient large and positive and the other large and negative, while Lasso will shrink both towards zero.
D.Ridge tends to shrink both $w_1$ and $w_2$ together, giving them similar coefficients, while Lasso might arbitrarily set one to zero and keep the other.
Correct Answer: Ridge tends to shrink both $w_1$ and $w_2$ together, giving them similar coefficients, while Lasso might arbitrarily set one to zero and keep the other.
Explanation:
Ridge regression is known to handle multicollinearity by distributing the weight among correlated features. It will shrink their coefficients towards each other (and towards zero). Lasso, on the other hand, is unstable in the presence of high correlation; it will often arbitrarily pick one feature from a correlated group and assign it a non-zero weight, while setting the others to zero.
Incorrect! Try again.
41. The standard bias-variance decomposition for squared error loss is $\mathbb{E}[(y - \hat{f}(x))^2] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$. If the loss function is changed to the absolute error, $\mathbb{E}[|y - \hat{f}(x)|]$, how does the decomposition of the Expected Prediction Error change?
Bias-variance trade-off
Hard
A.It becomes .
B.The decomposition remains the same, but Bias is calculated with respect to the median of $y$ and Variance is calculated using the L1 norm.
C.The decomposition is no longer possible for non-differentiable loss functions.
D.The decomposition no longer holds in this simple additive form; the relationship becomes more complex and is often expressed as an inequality.
Correct Answer: The decomposition no longer holds in this simple additive form; the relationship becomes more complex and is often expressed as an inequality.
Explanation:
The clean, additive bias-variance-noise decomposition is a special property derived from the linearity of expectations and the properties of variance, which are inherently tied to the squared error loss. For other loss functions like absolute error, the cross-terms in the expansion do not conveniently cancel out. The decomposition is not as straightforward and typically results in an inequality rather than a strict equality.
Incorrect! Try again.
42. Consider a linear regression problem with two highly correlated features, $x_1$ and $x_2$. The model is $y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. How would the estimated coefficients $\beta_1, \beta_2$ behave for Lasso (L1) vs. Ridge (L2) regression as the regularization strength $\lambda$ is increased?
L1 and L2 regularization
Hard
A.Lasso will tend to select one feature and set the other's coefficient to zero, while Ridge will shrink both coefficients towards zero but keep them roughly equal.
B.Ridge will set one coefficient to zero while Lasso will shrink both, keeping them roughly equal.
C.Both Lasso and Ridge will shrink both coefficients towards each other and then towards zero.
D.Lasso will shrink both coefficients to zero at the same rate, while Ridge will arbitrarily pick one to shrink faster.
Correct Answer: Lasso will tend to select one feature and set the other's coefficient to zero, while Ridge will shrink both coefficients towards zero but keep them roughly equal.
Explanation:
With correlated features, the loss function surface has a long, flat valley. Ridge's circular L2 constraint will intersect this valley where $\beta_1 \approx \beta_2$, so it shrinks them together towards zero. Lasso's diamond-shaped L1 constraint is likely to intersect the valley at a corner, which lies on an axis. This means one coefficient becomes exactly zero while the other takes on the shared effect, thus performing feature selection.
Incorrect! Try again.
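A quick numerical sketch of the Ridge half of this behavior, using the closed-form solution (the data, seed, and variable names here are illustrative, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 is almost an exact copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    # closed-form Ridge solution: (X^T X + lam*I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w = ridge(X, y, lam=10.0)
print(w)  # the shared effect is split roughly evenly between the twin features
```

Lasso, by contrast, would typically keep only one of the two coefficients nonzero for the same data.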
43Consider a hypothesis class consisting of classifiers that are unions of at most k disjoint intervals on the real line. A point is classified as +1 if it falls within any of these intervals. What is the Vapnik-Chervonenkis (VC) dimension of this hypothesis class?
VC dimension (intuition)
Hard
A.Infinite
B.2k
C.2k+1
D.k
Correct Answer: 2k
Explanation:
A single interval can shatter 2 points (VCdim = 2). A union of k intervals labels points inside them as positive and points outside as negative. To shatter a set of points, we must realize all possible dichotomies. For 2k points on a line, any labeling contains at most k maximal runs of positive points, and each run can be enclosed by its own interval, so a set of 2k points can be shattered. However, 2k+1 points cannot be shattered. For example, labeling them alternatingly (+, -, +, -, ..., +) requires k+1 positive intervals, which is not allowed by the hypothesis class. Therefore, the VC dimension is 2k.
Incorrect! Try again.
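The counting argument — one interval per maximal run of positive labels — can be checked exhaustively for a small k; the helper below is illustrative:

```python
from itertools import product

def intervals_needed(labels):
    # minimal number of disjoint intervals covering exactly the +1 points of an
    # ordered sample: one interval per maximal run of consecutive +1 labels
    runs, prev = 0, -1
    for lab in labels:
        if lab == 1 and prev != 1:
            runs += 1
        prev = lab
    return runs

k = 3
# every labeling of 2k ordered points has at most k runs of +1s,
# so k intervals can realize all dichotomies of 2k points ...
print(max(intervals_needed(l) for l in product([-1, 1], repeat=2 * k)))  # k
# ... but the alternating labeling of 2k+1 points needs k+1 positive intervals
print(intervals_needed([1, -1] * k + [1]))                               # k + 1
```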
44According to the Structural Risk Minimization (SRM) principle, the generalization error is bounded by R(h) <= R_emp(h) + Ω(h, n, δ), where R_emp is the empirical risk, n is the sample size, and h is the VC dimension. If, for a fixed dataset, Model A (VCdim=5) has an empirical error of 0.10 and Model B (VCdim=10) has an empirical error of 0.08, which statement is most accurate?
Structural risk minimization
Hard
A.Without knowing the exact form of the complexity penalty Ω and the values of n and δ, we cannot definitively choose between Model A and Model B based on SRM.
B.Model B is always better because its empirical error is lower.
C.Model A is always better because its complexity (VC dimension) is lower.
D.The SRM principle would select the model that minimizes R_emp(h) + h, so Model A is chosen (0.10 + 5 vs. 0.08 + 10).
Correct Answer: Without knowing the exact form of the complexity penalty Ω and the values of n and δ, we cannot definitively choose between Model A and Model B based on SRM.
Explanation:
SRM seeks to minimize an upper bound on the true risk, which is the sum of the empirical risk and a complexity penalty term Ω(h, n, δ). The penalty term is an increasing function of the VC dimension h and a decreasing function of the sample size n. Model B has lower empirical risk but a higher penalty; Model A has higher empirical risk but a lower penalty. The optimal choice depends on this trade-off, which is dictated by the specific form of the bound and the sample size n. A small n would favor the simpler Model A, while a very large n would favor the model with lower empirical risk, Model B. Without this information, no definitive choice can be made.
Incorrect! Try again.
45In a high-dimensional setting (a large number of predictors p, with n > p), consider a linear model y = Xβ + ε. The variance of the Ordinary Least Squares (OLS) estimator is given by Var(β̂) = σ²(X^T X)^(-1). How does the phenomenon of multicollinearity (high correlation between predictor variables) specifically affect the bias-variance trade-off for this model?
Bias-variance trade-off
Hard
A.It dramatically increases the variance of the coefficient estimates without affecting the bias, which remains zero for OLS.
B.It increases both the bias and the variance, as the model struggles to attribute effects to specific predictors.
C.It increases the bias by forcing some coefficients to be estimated incorrectly, but it decreases the variance by stabilizing the model.
D.It has no effect on the bias-variance trade-off, as it is an issue of numerical stability, not statistical performance.
Correct Answer: It dramatically increases the variance of the coefficient estimates without affecting the bias, which remains zero for OLS.
Explanation:
Multicollinearity means that columns of X are nearly linearly dependent. This causes the matrix X^T X to be near-singular, and its inverse to have very large diagonal entries. This directly translates to a very high variance in the coefficient estimates β̂. However, the OLS estimator remains unbiased (E[β̂] = β) as long as X^T X is invertible. Therefore, multicollinearity inflates the variance component of the error significantly while leaving the bias unchanged (at zero).
Incorrect! Try again.
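The variance inflation can be seen directly by computing diag((X^T X)^(-1)) for independent versus nearly collinear columns; a minimal numpy sketch with illustrative data (σ² is set to 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
a = rng.normal(size=n)
b = rng.normal(size=n)

X_indep = np.column_stack([a, b])             # uncorrelated columns
X_corr  = np.column_stack([a, a + 0.05 * b])  # nearly collinear columns

def coef_variances(X, sigma2=1.0):
    # OLS coefficient variances: sigma^2 * diag((X^T X)^{-1})
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))

print(coef_variances(X_indep))  # small, comparable entries
print(coef_variances(X_corr))   # entries inflated by orders of magnitude
```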
46A model is trained on a dataset of size n and has p parameters. Let E_train be the training error and E_gen be the true generalization error. From a mathematical viewpoint, which condition provides the strongest evidence of significant overfitting?
Overfitting and underfitting from a mathematical viewpoint
Hard
A.E_train ≈ 0 and E_gen >> E_train.
B.E_train ≈ E_gen, with both errors high.
C.The learning curve for training error and validation error shows a large, persistent gap.
D.p >> n.
Correct Answer: E_train ≈ 0 and E_gen >> E_train.
Explanation:
Overfitting is characterized by a model that learns the training data, including its noise, almost perfectly but fails to generalize to new, unseen data. Mathematically, this is captured by a very low training error (E_train ≈ 0) combined with a much higher generalization error (E_gen >> E_train). Option B describes underfitting. Option D (p >> n) is a condition that often leads to overfitting but is not a direct measure of it. Option C is a qualitative description of the same phenomenon as A, but A provides the direct mathematical condition on the error values themselves.
Incorrect! Try again.
47The solution for Ridge Regression is given by w* = (X^T X + λI)^(-1) X^T y. What is the behavior of the norm of this solution, ||w*||_2, as the regularization parameter λ → ∞?
L1 and L2 regularization
Hard
A.||w*||_2 → 0.
B.The norm is undefined because the matrix X^T X + λI becomes singular.
C.||w*||_2 → ∞.
D.||w*||_2 approaches a non-zero constant determined by X^T y.
Correct Answer: ||w*||_2 → 0.
Explanation:
As λ → ∞, the λI term dominates the matrix X^T X + λI. So (X^T X + λI)^(-1) ≈ (1/λ)I, and therefore w* ≈ X^T y / λ. As λ → ∞, the factor 1/λ → 0, causing the entire vector to approach the zero vector. Consequently, its L2 norm ||w*||_2 also approaches 0.
Incorrect! Try again.
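This limit is easy to verify numerically with the closed-form solution (data and regularization values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    # closed-form Ridge solution: (X^T X + lam*I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lams = [0.0, 1.0, 100.0, 1e6]
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in lams]
print(dict(zip(lams, norms)))  # the norm shrinks monotonically towards 0
```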
48Consider solving a least squares problem subject to a norm constraint: minimize ||y - Xw||² subject to ||w||_p <= t. Which statement correctly describes the relationship between this constrained optimization problem and the standard regularized formulation, minimize ||y - Xw||² + λ||w||_p?
Norm-based constraints
Hard
A.The constrained formulation is a relaxation of the regularized formulation and always yields a sparser solution.
B.For any t, there exists a corresponding λ such that the two formulations have the same solution, and vice-versa (except for a boundary case when t exceeds the norm of the unconstrained solution). This is due to Lagrangian duality.
C.The two formulations are equivalent only for the L2 norm (p = 2) but not for the L1 norm (p = 1).
D.The regularized formulation is only an approximation of the constrained problem and there is no guaranteed equivalence.
Correct Answer: For any t, there exists a corresponding λ such that the two formulations have the same solution, and vice-versa (except for a boundary case when t exceeds the norm of the unconstrained solution). This is due to Lagrangian duality.
Explanation:
This is a fundamental result from optimization theory, related to Lagrangian duality. The regularized form (often called the penalized form) is the Lagrangian relaxation of the constrained problem. For convex problems like these, there is a one-to-one correspondence between the constraint budget t and the regularization parameter λ. A smaller t (tighter constraint) corresponds to a larger λ (stronger penalty), and vice-versa. They are essentially two ways of formulating the same underlying trade-off.
Incorrect! Try again.
49Consider the hypothesis class of all circles in R² (a point is classified as +1 if inside the circle, -1 if outside). What is the VC dimension of this class?
VC dimension (intuition)
Hard
A.3
B.Infinite
C.4
D.2
Correct Answer: 3
Explanation:
You can shatter 3 points (e.g., the vertices of a non-degenerate triangle). For any labeling of these 3 points, you can find a circle that contains the positive points but not the negative ones. However, you cannot shatter 4 points. Consider 4 points on the boundary of a circle. If their labels are alternating (+, -, +, -), no single circle can contain the two positive points without also containing one of the negative points. Therefore, the VC dimension is 3.
Incorrect! Try again.
50For a polynomial regression model of degree d, the coefficients are estimated by minimizing the sum of squared errors. As d increases to the point of overfitting, the L2 norm of the weight vector, ||w||_2, often grows very large. What is the mathematical reason for this phenomenon?
Overfitting and underfitting from a mathematical viewpoint
Hard
A.To fit the noise and specific data points more precisely, the polynomial must make sharp turns, which requires coefficients of large magnitude with alternating signs.
B.The loss function becomes non-convex for large d, leading to unstable solutions with large norms.
C.A higher degree introduces multicollinearity between the polynomial features (x, x², ..., x^d), which always inflates the coefficient estimates.
D.The number of parameters exceeds the number of data points, and the Moore-Penrose pseudoinverse used to solve the system results in large coefficient values.
Correct Answer: To fit the noise and specific data points more precisely, the polynomial must make sharp turns, which requires coefficients of large magnitude with alternating signs.
Explanation:
When a high-degree polynomial overfits, it tries to pass exactly through or very close to each training data point. To create the necessary "wiggles" and sharp turns to interpolate the data (including noise), the function requires large positive and negative coefficients that largely cancel each other out except in the vicinity of the data points. This interplay of large, alternating-sign coefficients leads to a large overall norm of the weight vector.
Incorrect! Try again.
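The coefficient blow-up is easy to observe by fitting monomial bases of increasing degree to a small noisy sample (data, seed, and degrees below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=15)

norms = {}
for deg in [3, 9, 14]:
    V = np.vander(x, deg + 1)                    # monomial design matrix
    w, *_ = np.linalg.lstsq(V, y, rcond=None)    # least-squares fit
    norms[deg] = np.linalg.norm(w)
print(norms)  # the norm explodes as the degree reaches the interpolation regime
```

At degree 14 the polynomial interpolates all 15 noisy points, and the large, alternating-sign coefficients described above appear.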
51The L1 penalty term is λ Σ_i |w_i|. Why is this function not differentiable at w_i = 0, and what is a common approach to optimize loss functions involving this term?
L1 and L2 regularization
Hard
A.The function has a value of 0 at the origin, but its gradient is non-zero, violating differentiability conditions. Proximal gradient descent is used.
B.The function is not convex. Optimization requires specialized non-convex solvers.
C.The derivative has a jump discontinuity at 0 (from -1 to +1). Optimization is typically handled using subgradient descent or coordinate descent.
D.The derivative is infinite at 0. This is handled by adding a small smoothing constant ε, replacing |w_i| with sqrt(w_i² + ε).
Correct Answer: The derivative has a jump discontinuity at 0 (from -1 to +1). Optimization is typically handled using subgradient descent or coordinate descent.
Explanation:
The derivative of |w_i| with respect to w_i is sign(w_i), which is -1 for w_i < 0 and +1 for w_i > 0. At w_i = 0, the left-hand and right-hand derivatives do not match, so the function is not differentiable there. This point is crucial for sparsity. Optimization algorithms must handle it explicitly: subgradient descent replaces the gradient with a subgradient (any value in [-1, 1] at w_i = 0). Other popular methods include coordinate descent and proximal gradient methods.
Incorrect! Try again.
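The proximal operator used by proximal gradient methods for the L1 term has a simple closed form, the soft-thresholding function; a minimal sketch:

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t*|.|: argmin_w 0.5*(w - z)^2 + t*|w|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

shrunk = soft_threshold(np.array([-3.0, -0.5, 0.0, 0.5, 3.0]), 1.0)
print(shrunk)  # [-2.  0.  0.  0.  2.]
```

Values inside [-t, t] are snapped exactly to zero (the source of sparsity); the rest are shrunk towards zero by t.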
52For a k-Nearest Neighbors (k-NN) regression model, how do the bias and variance change as the value of k is increased from 1 to n (the total number of data points)?
Bias-variance trade-off
Hard
A.Both bias and variance decrease.
B.Bias increases and variance decreases.
C.Both bias and variance increase.
D.Bias decreases and variance increases.
Correct Answer: Bias increases and variance decreases.
Explanation:
When k = 1, the model is very flexible and captures local noise, resulting in low bias but high variance. As k increases, the model averages over more neighbors, smoothing its predictions. This makes it less sensitive to the specific training data (lower variance). However, by averaging over a larger, less local neighborhood, the model becomes less flexible and may fail to capture the true underlying function (higher bias). When k = n, the model predicts the global average for any input, resulting in maximum bias and minimum variance.
Incorrect! Try again.
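The two extremes are easy to check with a tiny 1D k-NN regressor (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=30)

def knn_predict(x_train, y_train, x0, k):
    # average the targets of the k nearest training points
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

# k = 1: each training point predicts its own (noisy) label -> zero training error
train_err_k1 = np.mean([(knn_predict(x, y, xi, 1) - yi) ** 2 for xi, yi in zip(x, y)])
# k = n: every query receives the global mean -> maximally smooth, high bias
pred_kn = knn_predict(x, y, 0.3, len(x))
print(train_err_k1, pred_kn, y.mean())
```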
53The SRM principle is based on minimizing a guaranteed upper bound on the true risk, often of the form R(h) <= R_emp(h) + sqrt((h(ln(2n/h) + 1) + ln(4/δ)) / n). What does the presence of the sample size n in the denominator of the complexity term imply about the relationship between Empirical Risk Minimization (ERM) and SRM?
Structural risk minimization
Hard
A.For small , SRM is more important than ERM, and for large , ERM is more important than SRM.
B.The n in the denominator indicates that a larger dataset justifies using a more complex model (higher VC dimension).
C.The bound is only valid for n > h, implying ERM is never sufficient.
D.As n → ∞, the complexity term vanishes, and the SRM solution converges to the ERM solution.
Correct Answer: As n → ∞, the complexity term vanishes, and the SRM solution converges to the ERM solution.
Explanation:
The complexity term, or "VC confidence," is the penalty for model complexity. The sample size n appears in its denominator, so as the amount of training data grows towards infinity, the complexity penalty goes to zero. In this asymptotic regime, minimizing the upper bound (SRM) becomes equivalent to minimizing just the empirical risk (ERM). This provides the theoretical justification for why ERM is often a reasonable approach with a very large amount of data.
Incorrect! Try again.
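Plugging numbers into the VC confidence term shows how quickly it decays with n (the δ value and sample sizes below are illustrative):

```python
import math

def vc_confidence(h, n, delta=0.05):
    # VC penalty term: sqrt((h*(ln(2n/h) + 1) + ln(4/delta)) / n)
    return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)

for n in [100, 10_000, 1_000_000]:
    print(n, round(vc_confidence(h=10, n=n), 4))  # shrinks towards 0 as n grows
```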
54How does standardizing features (rescaling to have mean 0 and standard deviation 1) before applying Ridge (L2) or Lasso (L1) regression affect the solution?
Norm-based constraints
Hard
A.It is crucial because regularization penalizes the magnitude of coefficients, making the solution dependent on the scale of the features. Without it, features with larger scales are unfairly penalized.
B.It has no effect on the final solution because the penalty is applied proportionally to all coefficients.
C.It only improves the numerical stability of the optimization algorithm but does not change the optimal coefficient values found.
D.It only matters for Ridge regression due to its quadratic penalty, but not for Lasso's linear penalty.
Correct Answer: It is crucial because regularization penalizes the magnitude of coefficients, making the solution dependent on the scale of the features. Without it, features with larger scales are unfairly penalized.
Explanation:
Both L1 and L2 regularization add a penalty based on the magnitude of the coefficients. If features are on different scales, the penalty is applied inequitably. For example, if feature x1 has a range of 1000 and x2 has a range of 1, the coefficient w1 must be much smaller than w2 to express a similar effect. The regularization penalty would then bear far more heavily on w2 than on w1, even though the two features are equally informative. Standardizing features puts them on a common scale, ensuring the penalty is applied fairly and is not biased by arbitrary units.
Incorrect! Try again.
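A small Ridge sketch of the scale dependence: two equally informative features, one measured in units 1000 times larger (data, seed, and λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
f1 = rng.normal(size=n)             # feature on a unit scale
f2 = 1000 * rng.normal(size=n)      # same kind of signal, in different units
X = np.column_stack([f1, f2])
y = f1 + f2 / 1000 + 0.1 * rng.normal(size=n)   # both features equally informative

def ridge(X, y, lam):
    # closed-form Ridge solution: (X^T X + lam*I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize: mean 0, std 1
w_raw = ridge(X, y, lam=100.0)
w_std = ridge(Xs, y, lam=100.0)
print(w_raw)  # the unit-scale feature absorbs nearly all of the shrinkage
print(w_std)  # on standardized data both coefficients are shrunk equally
```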
55Which of the following statements correctly captures the relationship between the number of parameters in a model and its VC dimension?
VC dimension (intuition)
Hard
A.The VC dimension is only defined for models with a finite number of parameters.
B.The VC dimension can be much larger or much smaller than the number of parameters; there is no universal direct relationship.
C.The VC dimension is always equal to the number of trainable parameters plus one.
D.The VC dimension is always upper-bounded by the number of parameters.
Correct Answer: The VC dimension can be much larger or much smaller than the number of parameters; there is no universal direct relationship.
Explanation:
While the number of parameters is often a rough heuristic for model complexity, there is no strict relationship with VC dimension. For linear classifiers in R^d, the VC dimension is d + 1, matching the number of parameters. However, a model with just one parameter, such as f(x) = sign(sin(ωx)), has infinite VC dimension because it can be made to oscillate arbitrarily fast and thereby shatter any number of points. Conversely, models can be constructed with many heavily constrained parameters, giving a VC dimension much smaller than the parameter count.
Incorrect! Try again.
56Consider the geometry of the unregularized least-squares loss function, which has elliptical level sets. Lasso (L1) and Ridge (L2) add a constraint region that is a diamond (L1-ball) and a circle (L2-ball), respectively. Why is the Lasso solution more likely to be sparse (have zero-valued coefficients)?
L1 and L2 regularization
Hard
A.The L1 norm is a linear function, while the loss function is quadratic, and their intersection is always at an axis.
B.The sharp corners of the L1-ball are more likely to be the first point of contact as the elliptical level sets of the loss function expand, and these corners lie on the axes where some coefficients are zero.
C.The L2-ball's curved surface ensures that the point of tangency with the loss function's level sets will almost never be on an axis, while the L1-ball has flat sides.
D.The L1-ball has a smaller volume than the L2-ball for a given radius, which forces more coefficients to be zero.
Correct Answer: The sharp corners of the L1-ball are more likely to be the first point of contact as the elliptical level sets of the loss function expand, and these corners lie on the axes where some coefficients are zero.
Explanation:
The optimal solution is the point where the expanding level set of the loss function first touches the constraint region. The L2-ball (circle/sphere) is uniformly round, so the point of tangency can be anywhere on its surface. The L1-ball (diamond/polytope) has sharp corners that protrude along the axes. It is geometrically much more probable that the expanding ellipse will touch one of these corners before it touches any other part of the diamond's boundary. This contact at a corner corresponds to a sparse solution where one or more coefficients are exactly zero.
Incorrect! Try again.
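The corner-touching geometry shows up concretely as exact zeros in a Lasso fit. A minimal cyclic coordinate-descent sketch (data, seed, and λ are illustrative, not a production solver):

```python
import numpy as np

def lasso_cd(X, y, lam, iters=200):
    # cyclic coordinate descent for 0.5*||y - Xw||^2 + lam*||w||_1
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]          # residual excluding feature j
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            # soft-thresholding update sets small coordinates exactly to zero
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=100)
w = lasso_cd(X, y, lam=20.0)
print(w)  # the irrelevant coefficients are driven exactly to zero
```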
57A model's performance is evaluated using a learning curve, plotting training and validation error against the number of training samples. In a classic case of high variance (overfitting), what is the expected behavior of these curves as the number of training samples increases?
Overfitting and underfitting from a mathematical viewpoint
Hard
A.Both errors will be high and will remain high, indicating the model cannot learn the data.
B.The training error will be very low and will increase, while the validation error will be high and will decrease. They will converge towards each other.
C.The training error will start low and stay low, while the validation error will start high and stay high.
D.Both errors will start high and decrease, with a persistent large gap between them.
Correct Answer: The training error will be very low and will increase, while the validation error will be high and will decrease. They will converge towards each other.
Explanation:
A model with high variance overfits the training data. With a small training set, it can fit it almost perfectly, so training error is near zero, but it generalizes poorly, so validation error is high. As the training set size increases, it becomes harder for the complex model to perfectly fit all the data, so the training error increases. At the same time, more data provides a better signal of the underlying pattern, so the model begins to generalize better, and the validation error decreases. The gap between the two curves narrows, indicating that more data can help mitigate high variance.
Incorrect! Try again.
58Consider the bias-variance decomposition: Expected Error = Bias² + Variance + σ². Which of these terms can, in principle, be reduced to zero by choosing a sufficiently complex model and having an infinite amount of training data?
Bias-variance trade-off
Hard
A.Only Variance.
B.Only Bias.
C.Bias, Variance, and Noise.
D.Both Bias and Variance.
Correct Answer: Both Bias and Variance.
Explanation:
The Bias term measures how far the average model prediction is from the true function. With a sufficiently flexible model class (e.g., one that contains the true function), the bias can be reduced to zero. The Variance term measures the model's sensitivity to the specific training set. As the amount of training data approaches infinity, the model's prediction becomes stable and independent of the specific sample, thus the variance approaches zero. The Noise term (σ²), or irreducible error, is a property of the data-generating process itself and cannot be reduced by any model.
Incorrect! Try again.
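The decomposition can be estimated by Monte Carlo: refit a deliberately simple model on many fresh training sets and measure the bias² and variance of its prediction at one point (all constants and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
true_f = lambda x: np.sin(2 * np.pi * x)
x0, sigma = 0.3, 0.3            # evaluation point and noise standard deviation

def fit_predict(deg):
    # draw a fresh training set, fit a degree-`deg` polynomial, predict at x0
    x = rng.uniform(0, 1, 30)
    y = true_f(x) + sigma * rng.normal(size=30)
    w, *_ = np.linalg.lstsq(np.vander(x, deg + 1), y, rcond=None)
    return np.polyval(w, x0)

preds = np.array([fit_predict(deg=1) for _ in range(2000)])
bias2 = (preds.mean() - true_f(x0)) ** 2
variance = preds.var()
print(bias2, variance)  # a straight-line fit: bias dominates, variance is small
```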
59Elastic Net regularization combines L1 and L2 penalties: λ1||w||_1 + λ2||w||_2². What is the primary advantage of this combination, especially in a scenario with a group of highly correlated features?
L1 and L2 regularization
Hard
A.It allows the use of standard gradient descent, which is not possible for the pure L1 penalty.
B.It is computationally more efficient to optimize than either Lasso or Ridge alone.
C.It exhibits the 'grouping effect', tending to select all the correlated features together, while still promoting overall sparsity.
D.It creates a solution that is sparser than Lasso and more stable than Ridge.
Correct Answer: It exhibits the 'grouping effect', tending to select all the correlated features together, while still promoting overall sparsity.
Explanation:
In a situation with a group of highly correlated features, Lasso tends to arbitrarily select only one feature from the group. The Ridge penalty component in Elastic Net encourages the coefficients of correlated features to be similar. The combination, known as the 'grouping effect', means the Elastic Net will often select or discard the entire group of correlated features together, which is often a desirable property. The L1 component ensures that the overall solution remains sparse by setting the coefficients of irrelevant features (or groups) to zero.
Incorrect! Try again.
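A coordinate-descent sketch of the grouping effect using two exactly duplicated features (data and penalty values are illustrative; setting lam2 = 0 recovers Lasso):

```python
import numpy as np

def enet_cd(X, y, lam1, lam2, iters=300):
    # cyclic coordinate descent for 0.5*||y - Xw||^2 + lam1*||w||_1 + 0.5*lam2*||w||^2
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]          # residual excluding feature j
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            w[j] = np.sign(rho) * max(abs(rho) - lam1, 0.0) / (z + lam2)
    return w

rng = np.random.default_rng(9)
x = rng.normal(size=200)
X = np.column_stack([x, x])                 # two exactly duplicated features
y = 4 * x + 0.1 * rng.normal(size=200)

w_lasso = enet_cd(X, y, lam1=50.0, lam2=0.0)
w_enet  = enet_cd(X, y, lam1=50.0, lam2=200.0)
print(w_lasso)  # Lasso keeps one twin and zeroes the other
print(w_enet)   # Elastic Net splits the effect evenly between the twins
```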
60Consider the hypothesis class H of 1-Nearest Neighbor (1-NN) classifiers defined by a training set S of n points in R^d. What is the VC dimension of this hypothesis class with respect to the training set itself?
VC dimension (intuition)
Hard
A.1
B.Infinite
C.n
D.d+1
Correct Answer: n
Explanation:
The 1-NN classifier's decision boundary is determined entirely by the training data. If we ask whether the classifier can shatter its own training set S of size n, the answer is yes. For any of the 2^n possible labelings of the points in S, we can simply assign those labels to the points. When we then classify each point x in S, its nearest neighbor is x itself, so it is assigned its own label. Thus, 1-NN can perfectly realize any dichotomy on its own training set, and its VC dimension is at least n. It cannot shatter larger sets in general, because it only has n 'parameters' (the training points). This highlights how the complexity of non-parametric models can grow with the data.
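The shattering argument can be verified exhaustively for a small training set (points and seed are illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(8)
pts = rng.normal(size=(5, 2))       # n = 5 training points in R^2

def one_nn(train_x, train_y, x):
    # predict the label of the nearest training point (Euclidean distance)
    return train_y[np.argmin(np.linalg.norm(train_x - x, axis=1))]

# check every one of the 2^n labelings: each training point's nearest
# neighbor is itself, so every dichotomy is realized exactly
shattered = all(
    all(one_nn(pts, np.array(labels), p) == lab for p, lab in zip(pts, labels))
    for labels in product([-1, 1], repeat=len(pts))
)
print(shattered)  # True: 1-NN shatters its own training set
```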