Unit 6 - Practice Quiz

INT394 50 Questions

1 What is the VC Dimension (Vapnik-Chervonenkis dimension) of a hypothesis class H?

A. The total number of parameters in the model.
B. The minimum number of points required to train a model from H.
C. The number of training errors made by the hypothesis.
D. The maximum number of points that can be shattered by H.

2 If a hypothesis space has an infinite VC dimension, what does this imply about the learnability of the task?

A. The task is learnable with a small number of samples.
B. The training error will always be high.
C. The task is not PAC learnable.
D. The model will always underfit.

3 What is the VC dimension of a linear classifier (perceptron) in a d-dimensional space (R^d)?

A. Infinite
B. d
C. d + 1
D. 2d

4 Rademacher Complexity measures the ability of a hypothesis class to fit:

A. Training data with zero error
B. The true underlying distribution
C. Random noise
D. Linear functions only

5 In the Bias-Variance Decomposition, the Bias term corresponds to:

A. The error due to erroneous assumptions in the learning algorithm (e.g., assuming data is linear when it is quadratic).
B. The computational cost of the algorithm.
C. The error due to sensitivity to small fluctuations in the training set.
D. The inherent noise in the problem itself.

6 As model complexity increases, what generally happens to Bias and Variance?

A. Bias increases, Variance decreases
B. Bias decreases, Variance increases
C. Bias increases, Variance increases
D. Bias decreases, Variance decreases

7 Which scenario best describes Overfitting?

A. High Training Error, High Test Error
B. Low Training Error, High Test Error
C. Low Training Error, Low Test Error
D. High Training Error, Low Test Error

8 Which of the following is NOT a technique to prevent overfitting?

A. Early Stopping
B. Cross-validation
C. Regularization (L1/L2)
D. Increasing the number of features significantly without increasing data

9 Mathematically, L2 Regularization (Ridge Regression) adds which term to the loss function J(w)?

A. λ Σ wᵢ²
B. λ Σ |wᵢ|
C. λ Σ wᵢ
D. λ max |wᵢ|

10 What is the primary feature selection property of L1 Regularization (Lasso)?

A. It forces weights to be small but non-zero.
B. It smooths the decision boundary more than L2.
C. It drives some weights exactly to zero, inducing sparsity.
D. It increases the variance of the estimator.
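The contrast between the L2 and L1 penalties (questions 9–10) can be sketched with their one-dimensional closed forms, assuming an orthonormal design so each weight can be treated independently; the weights and penalty strength below are hypothetical values for illustration:

```python
# L2 (ridge) shrinks every weight toward zero; L1 (lasso) soft-thresholds,
# sending weights with |w_ols| <= lam exactly to zero (sparsity).

def ridge_shrink(w_ols, lam):
    # Closed-form ridge solution for one weight: w_ols / (1 + lam).
    return w_ols / (1.0 + lam)

def lasso_soft_threshold(w_ols, lam):
    # Closed-form lasso solution: sign(w_ols) * max(|w_ols| - lam, 0).
    mag = max(abs(w_ols) - lam, 0.0)
    if mag == 0.0:
        return 0.0
    return mag if w_ols > 0 else -mag

weights = [2.0, 0.3, -0.1]   # hypothetical OLS weights
lam = 0.5
print([round(ridge_shrink(w, lam), 3) for w in weights])          # [1.333, 0.2, -0.067]
print([round(lasso_soft_threshold(w, lam), 3) for w in weights])  # [1.5, 0.0, 0.0]
```

Ridge leaves every weight non-zero; lasso drives the two small weights exactly to zero, which is the feature-selection property asked about.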

11 Structural Risk Minimization (SRM) aims to minimize:

A. Training error only
B. Validation error minus Training error
C. Test error only
D. Empirical Risk + Complexity Penalty

12 In k-fold Cross-Validation, how many times is the model trained and tested?

A. 1 time
B. k times
C. k − 1 times
D. k² times

13 Leave-One-Out Cross-Validation (LOOCV) is a special case of k-fold cross-validation where k equals:

A. 10
B. 1
C. N (number of data points)
D. 5
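The fold mechanics behind questions 12–13 can be sketched as index splits; setting k equal to the number of data points recovers LOOCV:

```python
# Sketch: generating k-fold train/test index splits over n data points.
# The model is trained and tested once per fold, i.e. k times in total.

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs, one per fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

splits = list(k_fold_indices(10, 5))
print(len(splits))                                   # 5 -- one train/test round per fold
loocv = list(k_fold_indices(10, 10))
print(all(len(test) == 1 for _, test in loocv))      # True -- k = N: each test fold is one point
```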

14 In the context of Gradient Descent, what is the role of the Learning Rate (η)?

A. It determines the number of iterations.
B. It determines the direction of the descent.
C. It determines the starting point of the parameters.
D. It determines the step size taken in the direction of the negative gradient.

15 The standard Gradient Descent update rule for a parameter θ (with learning rate η and loss J) is:

A. θ ← θ + η ∇J(θ)
B. θ ← θ − η ∇J(θ)
C. θ ← θ − ∇J(θ)
D. θ ← η ∇J(θ)
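The standard update θ ← θ − η∇J(θ) can be sketched on the one-dimensional loss J(θ) = (θ − 3)², whose gradient is 2(θ − 3); repeated steps against the gradient converge to the minimizer θ = 3:

```python
# Sketch: gradient descent on J(theta) = (theta - 3)^2.

def grad(theta):
    return 2.0 * (theta - 3.0)   # gradient of (theta - 3)^2

theta, eta = 0.0, 0.1
for _ in range(100):
    theta = theta - eta * grad(theta)   # step in the negative gradient direction
print(round(theta, 4))                  # 3.0
```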

16 Which variant of Gradient Descent updates weights using only one training example at a time?

A. Batch Gradient Descent
B. Stochastic Gradient Descent (SGD)
C. Newton's Method
D. Mini-batch Gradient Descent

17 What is the primary advantage of Momentum in optimization algorithms?

A. It helps accelerate gradient vectors in the right directions, leading to faster convergence.
B. It guarantees finding the global minimum in non-convex functions.
C. It eliminates the need for a learning rate.
D. It reduces the learning rate automatically.
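A minimal sketch of the momentum update (velocity v ← γv + η∇J, then step by v) on a one-dimensional quadratic; γ = 0.9 and η = 0.05 are illustrative values:

```python
# Sketch: momentum accumulates a velocity that smooths and accelerates
# descent; gamma is the momentum (decay) coefficient.

def grad(theta):
    return 2.0 * (theta - 3.0)   # gradient of (theta - 3)^2

theta, v = 0.0, 0.0
eta, gamma = 0.05, 0.9
for _ in range(200):
    v = gamma * v + eta * grad(theta)   # velocity: decayed history + current gradient
    theta = theta - v                    # step by the velocity
print(round(theta, 2))
```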

18 In RMSprop, the learning rate is adapted by dividing by:

A. The L2 norm of the weights.
B. The square root of the exponential moving average of squared gradients.
C. The number of iterations.
D. The sum of past gradients.
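The RMSprop denominator can be sketched directly: keep an exponential moving average s of squared gradients and divide each step by √s (plus a small ε for numerical stability); the hyperparameter values are illustrative:

```python
# Sketch: RMSprop's adaptive step -- steep directions (large recent
# gradients) automatically get smaller effective learning rates.

def grad(theta):
    return 2.0 * (theta - 3.0)   # gradient of (theta - 3)^2

theta, s = 0.0, 0.0
eta, beta, eps = 0.01, 0.9, 1e-8
for _ in range(2000):
    g = grad(theta)
    s = beta * s + (1 - beta) * g * g            # EMA of squared gradients
    theta = theta - eta * g / (s ** 0.5 + eps)   # divide by its square root
print(round(theta, 1))
```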

19 For a convex loss function, Gradient Descent is guaranteed to converge to:

A. A saddle point
B. A local minimum
C. The global minimum
D. Any point on the boundary

20 What is Hyperparameter Tuning?

A. Selecting the optimal values for parameters like learning rate, regularization strength, or tree depth.
B. Updating weights during backpropagation.
C. Cleaning the data before training.
D. Selecting the best features for the model.

21 Which search strategy involves testing a fixed set of hyperparameters arranged in a lattice structure?

A. Bayesian Optimization
B. Grid Search
C. Gradient Search
D. Random Search

22 If a learning curve shows that both training and validation errors are high and close to each other, the model suffers from:

A. Data leakage
B. High Variance (Overfitting)
C. High Bias (Underfitting)
D. Ideally tuned parameters

23 What is the relationship between the number of training samples (N) and the generalization bound involving VC dimension (d_VC)?

A. Generalization error ∝ √(d_VC / N)
B. Generalization error is independent of N
C. Generalization error ∝ N / d_VC
D. Generalization error ∝ d_VC · N

24 In the context of optimization, what is a Saddle Point?

A. The point where training starts.
B. A point where the gradient is zero, but it is a minimum in one direction and a maximum in another.
C. A point where the gradient is infinite.
D. The lowest point in the loss landscape.

25 The Empirical Risk corresponds to:

A. The error on the validation set.
B. The average loss calculated over the training dataset.
C. The expected error on unseen data.
D. The maximum possible error of the classifier.

26 Which regularization technique randomly sets a fraction of input units to 0 at each update during training time?

A. Early Stopping
B. Data Augmentation
C. Dropout
D. L2 Regularization
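A sketch of (inverted) dropout: each unit is zeroed with probability p at each training update, and survivors are rescaled by 1/(1 − p) so the expected activation is unchanged; the fixed seed is only for reproducibility:

```python
import random

def dropout(activations, p, rng):
    # Keep each unit with probability (1 - p); rescale survivors by 1/(1 - p).
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10
dropped = dropout(acts, p=0.5, rng=rng)
print(sorted(set(dropped)))   # [0.0, 2.0] -- zeroed units and rescaled survivors
```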

27 What happens if the learning rate in Gradient Descent is set too high?

A. The algorithm will get stuck in a local minimum.
B. The algorithm may oscillate or diverge.
C. The model will overfit.
D. The algorithm converges very slowly.
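The oscillate-or-diverge behavior is easy to see on J(θ) = θ², where one GD step multiplies θ by (1 − 2η): for η > 1 that factor has magnitude greater than 1, so the iterates blow up:

```python
# Sketch: GD on J(theta) = theta^2 (gradient 2*theta).
# The update map is theta <- (1 - 2*eta) * theta.

def run_gd(eta, steps=20, theta=1.0):
    for _ in range(steps):
        theta = theta - eta * 2.0 * theta
    return abs(theta)

print(run_gd(0.1) < 1e-1)    # True -- small step: converges toward 0
print(run_gd(1.5) > 1e3)     # True -- oversized step: diverges
```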

28 Rademacher Complexity is often considered tighter (more accurate) than VC dimension bounds because:

A. It is always zero for linear models.
B. It is independent of the data distribution.
C. It is easier to calculate.
D. It depends on the specific data distribution and the training sample size.

29 The 'Shattering' coefficient (Growth function) for a hypothesis class with finite VC dimension d_VC grows:

A. Logarithmically with N
B. Exponentially with N (O(2^N))
C. Constantly
D. Polynomially with N (O(N^d_VC))

30 Which optimization algorithm combines the properties of AdaGrad and Momentum (specifically using exponentially moving averages of squared gradients)?

A. Batch Gradient Descent
B. Adam
C. RMSprop
D. SGD
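Adam's combination of the two ideas can be sketched as two exponential moving averages, a momentum-style first moment m and an RMSprop-style second moment v, each bias-corrected; the decay constants are the commonly cited defaults, while the step size is an illustrative choice:

```python
# Sketch: Adam = momentum (EMA of gradients) + RMSprop (EMA of squared
# gradients), with bias correction for the zero-initialized EMAs.

def grad(theta):
    return 2.0 * (theta - 3.0)   # gradient of (theta - 3)^2

theta, m, v = 0.0, 0.0, 0.0
eta, b1, b2, eps = 0.02, 0.9, 0.999, 1e-8
for t in range(1, 5001):
    g = grad(theta)
    m = b1 * m + (1 - b1) * g        # first moment (momentum-style)
    v = b2 * v + (1 - b2) * g * g    # second moment (RMSprop-style)
    m_hat = m / (1 - b1 ** t)        # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (v_hat ** 0.5 + eps)
print(round(theta, 1))
```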

31 The approximation error in the Bias-Variance decomposition is associated with:

A. Irreducible Error
B. Variance
C. Bias
D. Noise

32 In the context of Regularization, λ (lambda) is a hyperparameter that controls:

A. The strength of the penalty on the weights.
B. The number of epochs.
C. The size of the validation set.
D. The learning rate.

33 Why is Early Stopping considered a regularization technique?

A. It increases the training data size.
B. It adds a penalty term to the loss function.
C. It stops training when validation error starts to increase, preventing the model from learning noise.
D. It removes features from the dataset.

34 Which of the following indicates that a model has High Variance?

A. Training error: 20%, Validation error: 20%
B. Training error: 0%, Validation error: 0%
C. Training error: 1%, Validation error: 15%
D. Training error: 15%, Validation error: 16%

35 In convergence analysis, if the objective function has Lipschitz-continuous gradients, it implies:

A. Gradient descent cannot be used.
B. The function is convex.
C. The function has no global minimum.
D. The rate of change of the gradient is bounded.

36 The total expected error of a learning algorithm can be decomposed into:

A. Training Error + Test Error
B. Bias - Variance
C. Bias + Variance + Irreducible Error
D. Bias + Variance

37 Which of the following is true regarding Batch Gradient Descent vs Stochastic Gradient Descent (SGD)?

A. SGD updates are noisier, helping to escape local minima.
B. Batch gradient descent always converges faster in terms of time.
C. Batch gradient descent uses a subset of data.
D. SGD is computationally more expensive per iteration than Batch.

38 When using k-fold cross-validation for hyperparameter tuning, the final model is typically trained on:

A. The entire training dataset using the best hyperparameters found.
B. The test set.
C. One of the folds.
D. The validation sets only.

39 The Occam's Razor principle in machine learning supports:

A. Choosing the most complex model that fits the data.
B. Ignoring training errors.
C. Using only linear models.
D. Choosing the simplest model that explains the data well.

40 In the momentum update rule v_t = γ v_(t−1) + η ∇J(θ), what does γ represent?

A. Momentum coefficient (decay factor)
B. Gradient magnitude
C. Learning rate
D. Regularization parameter

41 Which complexity measure is derived from the maximum correlation between the function class and a set of random signs?

A. Rademacher Complexity
B. L2 Norm
C. Shattering Coefficient
D. VC Dimension

42 If the training set size N is much smaller than the VC dimension (N ≪ d_VC), then:

A. Training error will be high.
B. Overfitting is highly likely.
C. Underfitting is highly likely.
D. The model will generalize well.

43 Gradient Descent with Momentum helps specifically in scenarios where:

A. The function is perfectly spherical.
B. The learning rate is zero.
C. The surface curves much more steeply in one dimension than in another (ravines).
D. There is no gradient.

44 The No Free Lunch Theorem implies that:

A. Gradient descent is the best optimizer.
B. Averaged over all possible problems, no algorithm performs better than random guessing.
C. Regularization is always necessary.
D. One algorithm is superior to all others for all problems.

45 What is the convergence rate of Gradient Descent for a strongly convex function?

A. Linear (Geometric)
B. Exponential
C. Quadratic
D. Logarithmic
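"Linear" here means geometric: the error shrinks by a constant factor per iteration. On the strongly convex J(θ) = θ² with η = 0.25, each GD step multiplies the error by exactly (1 − 2η) = 0.5:

```python
# Sketch: linear (geometric) convergence -- the error halves every step.

theta = 8.0
errors = []
for _ in range(4):
    theta = theta - 0.25 * 2.0 * theta   # theta <- (1 - 2*eta) * theta = 0.5 * theta
    errors.append(abs(theta))
print(errors)   # [4.0, 2.0, 1.0, 0.5] -- constant ratio 0.5 per iteration
```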

46 Lasso Regression (L1) can be interpreted as a Bayesian estimate with a specific prior distribution on the weights. Which distribution?

A. Gaussian (Normal) Prior
B. Laplace Prior
C. Uniform Prior
D. Bernoulli Prior

47 Ridge Regression (L2) corresponds to a Bayesian estimate with which prior?

A. Beta Prior
B. Poisson Prior
C. Laplace Prior
D. Gaussian (Normal) Prior

48 A model with Low Bias and Low Variance is:

A. Impossible to achieve.
B. The ideal goal of machine learning.
C. An underfitted model.
D. An overfitted model.

49 In Structural Risk Minimization, the bound on True Risk is given by True Risk ≤ Empirical Risk + Complexity Penalty Ω(N, d_VC). As the sample size N → ∞, the complexity term Ω typically:

A. Oscillates
B. Approaches Infinity
C. Approaches 0
D. Remains constant

50 Which gradient descent variant adapts the learning rate for each parameter individually based on the history of gradients?

A. Adagrad / RMSprop
B. Nesterov Momentum
C. Standard Gradient Descent
D. Momentum
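A per-parameter sketch of the Adagrad idea: accumulate each parameter's squared-gradient history and divide its step by the square root of that accumulator, so each parameter gets its own effective learning rate. The two-parameter loss below is hypothetical, chosen to give a 100× difference in gradient scale:

```python
# Sketch: Adagrad -- per-parameter accumulated squared gradients.
# Both parameters converge similarly despite very different gradient scales.

def grads(theta):
    # Hypothetical loss 50*(t0 - 1)^2 + 0.5*(t1 - 1)^2 -> gradients below.
    return [100.0 * (theta[0] - 1.0), 1.0 * (theta[1] - 1.0)]

theta, acc = [0.0, 0.0], [0.0, 0.0]
eta, eps = 0.5, 1e-8
for _ in range(100):
    g = grads(theta)
    for i in range(2):
        acc[i] += g[i] * g[i]                           # per-parameter history (a SUM, not an EMA)
        theta[i] -= eta * g[i] / (acc[i] ** 0.5 + eps)  # per-parameter effective learning rate
print([round(t, 2) for t in theta])   # [1.0, 1.0]
```

The ever-growing sum is also Adagrad's weakness (the effective rate decays toward zero), which is what RMSprop's exponential moving average addresses.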