Unit 6 - Practice Quiz

INT394

1 What is the VC Dimension (Vapnik-Chervonenkis dimension) of a hypothesis class H?

A. The maximum number of points that can be shattered by H.
B. The minimum number of points required to train a model from H.
C. The total number of parameters in the model.
D. The number of training errors made by the hypothesis.

2 If a hypothesis space has an infinite VC dimension, what does this imply about the learnability of the task?

A. The task is learnable with a small number of samples.
B. The task is not PAC learnable.
C. The model will always underfit.
D. The training error will always be high.

3 What is the VC dimension of a linear classifier (perceptron) in a d-dimensional space (R^d)?

A. d
B. d + 1
C. 2d
D. Infinite

4 Rademacher Complexity measures the ability of a hypothesis class to fit:

A. Random noise
B. The true underlying distribution
C. Linear functions only
D. Training data with zero error

5 In the Bias-Variance Decomposition, the Bias term corresponds to:

A. The error due to sensitivity to small fluctuations in the training set.
B. The error due to erroneous assumptions in the learning algorithm (e.g., assuming data is linear when it is quadratic).
C. The inherent noise in the problem itself.
D. The computational cost of the algorithm.

6 As model complexity increases, what generally happens to Bias and Variance?

A. Bias increases, Variance increases
B. Bias decreases, Variance decreases
C. Bias decreases, Variance increases
D. Bias increases, Variance decreases

7 Which scenario best describes Overfitting?

A. High Training Error, High Test Error
B. Low Training Error, Low Test Error
C. Low Training Error, High Test Error
D. High Training Error, Low Test Error

8 Which of the following is NOT a technique to prevent overfitting?

A. Regularization (L1/L2)
B. Cross-validation
C. Early Stopping
D. Increasing the number of features significantly without increasing data

9 Mathematically, L2 Regularization (Ridge Regression) adds which term to the loss function J(w)?

A. λ Σ |w_j|
B. λ Σ w_j^2
C. λ Σ w_j
D. λ max |w_j|

10 What is the primary feature selection property of L1 Regularization (Lasso)?

A. It forces weights to be small but non-zero.
B. It drives some weights exactly to zero, inducing sparsity.
C. It smooths the decision boundary more than L2.
D. It increases the variance of the estimator.
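
For reference on questions 9 and 10: a minimal NumPy sketch (an illustration, not the course's reference code) of how the L2 and L1 penalty terms enter a squared-error loss; X, y, w, and lam (the regularization strength λ) are hypothetical placeholders.

    import numpy as np

    def ridge_loss(w, X, y, lam):
        # L2 (Ridge): data-fit term plus lam * sum of squared weights
        residual = X @ w - y
        return 0.5 * np.mean(residual ** 2) + lam * np.sum(w ** 2)

    def lasso_loss(w, X, y, lam):
        # L1 (Lasso): data-fit term plus lam * sum of absolute weights;
        # the non-smooth |w_j| penalty is what drives some weights to exactly zero
        residual = X @ w - y
        return 0.5 * np.mean(residual ** 2) + lam * np.sum(np.abs(w))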

11 Structural Risk Minimization (SRM) aims to minimize:

A. Training error only
B. Test error only
C. Empirical Risk + Complexity Penalty
D. Validation error minus Training error

12 In k-fold Cross-Validation, how many times is the model trained and tested?

A. 1 time
B. k times
C. k - 1 times
D. k + 1 times

13 Leave-One-Out Cross-Validation (LOOCV) is a special case of k-fold cross-validation where k equals:

A. 1
B. 5
C. 10
D. N (the number of data points)
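
For reference on questions 12 and 13: a minimal sketch of plain k-fold splitting with no library helpers. With k folds the model is trained and evaluated k times; setting k = N (one point per fold) gives LOOCV. The function train_and_score is a hypothetical user-supplied callable, not part of any specific library.

    import numpy as np

    def k_fold_score(X, y, k, train_and_score, seed=0):
        N = len(y)
        idx = np.random.default_rng(seed).permutation(N)
        folds = np.array_split(idx, k)              # k roughly equal folds
        scores = []
        for i in range(k):                          # the model is fit k times
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_and_score(X[train], y[train], X[val], y[val]))
        return np.mean(scores)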

14 In the context of Gradient Descent, what is the role of the Learning Rate (η)?

A. It determines the direction of the descent.
B. It determines the step size taken in the direction of the negative gradient.
C. It determines the number of iterations.
D. It determines the starting point of the parameters.

15 The standard Gradient Descent update rule for a parameter θ is:

A. θ ← θ + η ∇J(θ)
B. θ ← θ − η ∇J(θ)
C. θ ← θ − ∇J(θ)
D. θ ← η ∇J(θ)
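
For reference on questions 14 and 15: a minimal sketch of vanilla gradient descent. The learning rate η only scales the step size; the direction comes from the negative gradient. grad_J is a hypothetical gradient function of the loss J(θ).

    import numpy as np

    def gradient_descent(theta, grad_J, eta=0.1, n_steps=100):
        for _ in range(n_steps):
            theta = theta - eta * grad_J(theta)     # θ ← θ − η ∇J(θ)
        return theta

    # Example on J(θ) = θ², whose gradient is 2θ; the iterates shrink toward 0.
    print(gradient_descent(np.array([5.0]), lambda t: 2 * t))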

16 Which variant of Gradient Descent updates weights using only one training example at a time?

A. Batch Gradient Descent
B. Mini-batch Gradient Descent
C. Stochastic Gradient Descent (SGD)
D. Newton's Method

17 What is the primary advantage of Momentum in optimization algorithms?

A. It guarantees finding the global minimum in non-convex functions.
B. It reduces the learning rate automatically.
C. It helps accelerate the gradient vectors in the right directions, leading to faster convergence.
D. It eliminates the need for a learning rate.

18 In RMSprop, the learning rate is adapted by dividing by:

A. The number of iterations.
B. The sum of past gradients.
C. The square root of the exponential moving average of squared gradients.
D. The L2 norm of the weights.
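
For reference on questions 17 and 18: minimal sketches of the Momentum and RMSprop updates in their commonly stated forms (assumed textbook formulations, not any particular library's code). Momentum accumulates a velocity that speeds movement along consistently pointing gradients; RMSprop divides the step by the square root of an exponential moving average of squared gradients.

    import numpy as np

    def momentum_step(theta, v, grad, eta=0.01, beta=0.9):
        v = beta * v + eta * grad                       # accumulated velocity
        return theta - v, v

    def rmsprop_step(theta, s, grad, eta=0.01, rho=0.9, eps=1e-8):
        s = rho * s + (1 - rho) * grad ** 2             # EMA of squared gradients
        return theta - eta * grad / (np.sqrt(s) + eps), s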

19 For a convex loss function, Gradient Descent is guaranteed to converge to:

A. A local minimum
B. The global minimum
C. A saddle point
D. Any point on the boundary

20 What is Hyperparameter Tuning?

A. Updating weights during backpropagation.
B. Selecting the optimal values for parameters like learning rate, regularization strength, or tree depth.
C. Cleaning the data before training.
D. Selecting the best features for the model.

21 Which search strategy involves testing a fixed set of hyperparameters arranged in a lattice structure?

A. Random Search
B. Grid Search
C. Bayesian Optimization
D. Gradient Search
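
For reference on question 21: a minimal sketch of grid search, which evaluates every point of a fixed lattice of hyperparameter values. The evaluate callable and the grid values are hypothetical placeholders.

    from itertools import product

    def grid_search(evaluate):
        grid = {"learning_rate": [0.001, 0.01, 0.1],
                "reg_strength":  [0.0, 0.1, 1.0]}
        best_score, best_cfg = float("-inf"), None
        for values in product(*grid.values()):          # walk the full lattice
            cfg = dict(zip(grid.keys(), values))
            score = evaluate(cfg)
            if score > best_score:
                best_score, best_cfg = score, cfg
        return best_cfg, best_score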

22 If a learning curve shows that both training and validation errors are high and close to each other, the model suffers from:

A. High Variance (Overfitting)
B. High Bias (Underfitting)
C. Ideally tuned parameters
D. Data leakage

23 What is the relationship between the number of training samples (N) and the generalization bound involving VC dimension (d_VC)?

A. Generalization error shrinks roughly as sqrt(d_VC / N), so more samples tighten the bound
B. Generalization error grows as N increases
C. Generalization error is proportional to N · d_VC
D. Generalization error is independent of N

24 In the context of optimization, what is a Saddle Point?

A. A point where the gradient is zero, but it is a minimum in one direction and a maximum in another.
B. The lowest point in the loss landscape.
C. A point where the gradient is infinite.
D. The point where training starts.

25 The Empirical Risk corresponds to:

A. The expected error on unseen data.
B. The average loss calculated over the training dataset.
C. The maximum possible error of the classifier.
D. The error on the validation set.

26 Which regularization technique randomly sets a fraction of input units to 0 at each update during training time?

A. L2 Regularization
B. Early Stopping
C. Dropout
D. Data Augmentation
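
For reference on question 26: a minimal sketch of generic "inverted dropout" (a standard formulation, not a specific framework's implementation). At training time a random fraction p of units is zeroed and the survivors are rescaled so the expected activation stays the same; at test time nothing is dropped.

    import numpy as np

    def dropout(activations, p=0.5, training=True, rng=np.random.default_rng()):
        if not training or p == 0.0:
            return activations
        mask = rng.random(activations.shape) >= p       # keep each unit with prob 1 - p
        return activations * mask / (1.0 - p)           # rescale the kept units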

27 What happens if the learning rate in Gradient Descent is set too high?

A. The algorithm converges very slowly.
B. The algorithm may oscillate or diverge.
C. The algorithm will get stuck in a local minimum.
D. The model will overfit.

28 Rademacher Complexity is often considered tighter (more accurate) than VC dimension bounds because:

A. It is independent of the data distribution.
B. It depends on the specific data distribution and the training sample size.
C. It is easier to calculate.
D. It is always zero for linear models.

29 The 'Shattering' coefficient (Growth function) for a hypothesis class with finite VC dimension grows:

A. Exponentially with N (2^N)
B. Polynomially with N (on the order of N^d_VC, by Sauer's lemma)
C. Logarithmically with N
D. Constantly

30 Which optimization algorithm combines the properties of RMSprop (exponentially moving averages of squared gradients) and Momentum?

A. SGD
B. Batch Gradient Descent
C. RMSprop
D. Adam

31 The approximation error in the Bias-Variance decomposition is associated with:

A. Variance
B. Bias
C. Irreducible Error
D. Noise

32 In the context of Regularization, λ (lambda) is a hyperparameter that controls:

A. The learning rate.
B. The strength of the penalty on the weights.
C. The number of epochs.
D. The size of the validation set.

33 Why is Early Stopping considered a regularization technique?

A. It adds a penalty term to the loss function.
B. It stops training when validation error starts to increase, preventing the model from learning noise.
C. It removes features from the dataset.
D. It increases the training data size.

34 Which of the following indicates that a model has High Variance?

A. Training error: 1%, Validation error: 15%
B. Training error: 15%, Validation error: 16%
C. Training error: 20%, Validation error: 20%
D. Training error: 0%, Validation error: 0%

35 In convergence analysis, if the objective function has Lipschitz continuous gradients, it implies:

A. The function is convex.
B. The rate of change of the gradient is bounded.
C. The function has no global minimum.
D. Gradient descent cannot be used.

36 The total expected error of a learning algorithm can be decomposed into:

A. Bias + Variance
B. Bias + Variance + Irreducible Error
C. Bias - Variance
D. Training Error + Test Error
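
For reference on questions 5, 31, and 36: a small simulation sketch (an illustration under made-up assumptions, not a formal derivation) that fits the same model class on many resampled training sets and estimates bias², variance, and irreducible noise at a single test point.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * x)                 # assumed "true" function
    sigma = 0.3                                 # assumed irreducible noise std
    x0, degree, n_train, n_repeats = 0.8, 1, 30, 2000

    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(0, np.pi, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        coeffs = np.polyfit(x, y, degree)       # a deliberately simple (high-bias) fit
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)

    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  noise={sigma**2:.4f}")
    # Expected squared test error at x0 ≈ bias^2 + variance + sigma^2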

37 Which of the following is true regarding Batch Gradient Descent vs Stochastic Gradient Descent (SGD)?

A. SGD is computationally more expensive per iteration than Batch.
B. Batch gradient descent always converges faster in terms of time.
C. SGD updates are noisier, helping to escape local minima.
D. Batch gradient descent uses a subset of data.

38 When using k-fold cross-validation for hyperparameter tuning, the final model is typically trained on:

A. One of the folds.
B. The validation sets only.
C. The entire training dataset using the best hyperparameters found.
D. The test set.

39 The Occam's Razor principle in machine learning supports:

A. Choosing the most complex model that fits the data.
B. Choosing the simplest model that explains the data well.
C. Ignoring training errors.
D. Using only linear models.

40 In the momentum update rule v_t = β·v_(t-1) + η·∇J(θ), what does β represent?

A. Learning rate
B. Momentum coefficient (decay factor)
C. Regularization parameter
D. Gradient magnitude

41 Which complexity measure is derived from the maximum correlation between the function class and a set of random signs?

A. VC Dimension
B. Rademacher Complexity
C. Shattering Coefficient
D. L2 Norm

42 If the training set size N is much smaller than the VC dimension (N ≪ d_VC), then:

A. Overfitting is highly likely.
B. Underfitting is highly likely.
C. The model will generalize well.
D. Training error will be high.

43 Gradient Descent with Momentum helps specifically in scenarios where:

A. The surface curves much more steeply in one dimension than in another (ravines).
B. The learning rate is zero.
C. The function is perfectly spherical.
D. There is no gradient.

44 The No Free Lunch Theorem implies that:

A. One algorithm is superior to all others for all problems.
B. Averaged over all possible problems, no algorithm performs better than random guessing.
C. Regularization is always necessary.
D. Gradient descent is the best optimizer.

45 What is the convergence rate of Gradient Descent for a strongly convex function?

A. Linear (Geometric)
B. Logarithmic
C. Quadratic
D. Exponential

46 Lasso Regression (L1) can be interpreted as a Bayesian estimate with a specific prior distribution on the weights. Which distribution?

A. Gaussian (Normal) Prior
B. Laplace Prior
C. Uniform Prior
D. Bernoulli Prior

47 Ridge Regression (L2) corresponds to a Bayesian estimate with which prior?

A. Gaussian (Normal) Prior
B. Laplace Prior
C. Poisson Prior
D. Beta Prior

48 A model with Low Bias and Low Variance is:

A. Impossible to achieve.
B. The ideal goal of machine learning.
C. An underfitted model.
D. An overfitted model.

49 In Structural Risk Minimization, the bound on True Risk is given by: True Risk ≤ Empirical Risk + Complexity Penalty. As the sample size N → ∞, the complexity term typically:

A. Approaches Infinity
B. Approaches 0
C. Remains constant
D. Oscillates

50 Which gradient descent variant adapts the learning rate for each parameter individually based on the history of gradients?

A. Standard Gradient Descent
B. Momentum
C. Adagrad / RMSprop
D. Nesterov Momentum
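
For reference on question 50: a minimal sketch of the standard AdaGrad step (stated as an assumed textbook form, not any library's exact code). Each parameter gets its own effective learning rate, shrunk by the accumulated history of its squared gradients; RMSprop replaces the running sum with an exponential moving average.

    import numpy as np

    def adagrad_step(theta, hist, grad, eta=0.1, eps=1e-8):
        hist = hist + grad ** 2                         # per-parameter gradient history
        return theta - eta * grad / (np.sqrt(hist) + eps), hist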