1. What is the VC Dimension (Vapnik-Chervonenkis dimension) of a hypothesis class $\mathcal{H}$?
A.The maximum number of points that can be shattered by $\mathcal{H}$.
B.The minimum number of points required to train a model from $\mathcal{H}$.
C.The total number of parameters in the model.
D.The number of training errors made by the hypothesis.
Correct Answer: The maximum number of points that can be shattered by $\mathcal{H}$.
Explanation: The VC dimension is a measure of the capacity of a hypothesis class. It is defined as the cardinality of the largest set of points that the class can shatter, that is, for which it can realize every possible binary labeling.
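The shattering definition can be checked by brute force on a small example. The sketch below assumes a hypothetical class of one-dimensional threshold classifiers (`threshold_class`, `can_shatter`, and the chosen thresholds are illustrative names and values, not part of the quiz) and tests whether every binary labeling of a point set is realized.

```python
from itertools import product

def can_shatter(points, hypotheses):
    """Return True if the hypotheses realize every binary labeling of points."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(labels in realized for labels in product((0, 1), repeat=len(points)))

def threshold_class(candidates):
    # Hypothetical class: h_t(x) = 1 if x >= t else 0, one hypothesis per threshold t.
    return [lambda x, t=t: int(x >= t) for t in candidates]

thresholds = threshold_class([-1.0, 0.5, 1.5, 2.5])
one_point = can_shatter([1.0], thresholds)        # both labels are reachable
two_points = can_shatter([1.0, 2.0], thresholds)  # labeling (1, 0) is never produced
print(one_point, two_points)
```

A single point is shattered but no pair is, which is consistent with threshold classifiers on the line having VC dimension 1.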
2. If a hypothesis space has an infinite VC dimension, what does this imply about the learnability of the task?
A.The task is learnable with a small number of samples.
B.The task is not PAC learnable.
C.The model will always underfit.
D.The training error will always be high.
Correct Answer: The task is not PAC learnable.
Explanation: According to the fundamental theorem of statistical learning, a hypothesis class for binary classification is PAC (Probably Approximately Correct) learnable if and only if its VC dimension is finite.
3. What is the VC dimension of a linear classifier (perceptron) in a $d$-dimensional space ($\mathbb{R}^d$)?
A.
B.
C.
D.Infinite
Correct Answer: $d + 1$
Explanation: A linear classifier in $d$ dimensions involves $d$ weights and 1 bias term. It can shatter at most $d + 1$ points.
4. Rademacher Complexity measures the ability of a hypothesis class to fit:
A.Random noise
B.The true underlying distribution
C.Linear functions only
D.Training data with zero error
Correct Answer: Random noise
Explanation: Empirical Rademacher complexity measures how well a function class correlates with random noise (Rademacher variables $\sigma_i \in \{-1, +1\}$). High complexity means the class can fit random noise well, implying a risk of overfitting.
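A Monte Carlo estimate makes the "fit random noise" idea concrete. The sketch below assumes a finite hypothesis class given as a list of prediction vectors with entries in {-1, +1}; `empirical_rademacher`, the sample size, and the two toy classes are illustrative choices, not from the quiz.

```python
import random

def empirical_rademacher(predictions, n_draws=500, seed=0):
    """Estimate E_sigma[ max_h (1/n) * sum_i sigma_i * h(x_i) ] by sampling sigma."""
    rng = random.Random(seed)
    n = len(predictions[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]  # random signs
        total += max(sum(s * p for s, p in zip(sigma, preds)) / n for preds in predictions)
    return total / n_draws

n = 6
single = [[1] * n]  # a class containing one constant hypothesis
# A class rich enough to realize every labeling of the n points.
rich = [[1 if (mask >> i) & 1 else -1 for i in range(n)] for mask in range(2 ** n)]
r_single = empirical_rademacher(single)
r_rich = empirical_rademacher(rich)
```

The rich class matches any sign pattern perfectly, so its estimate is 1, while the single-hypothesis class stays near 0.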
5. In the Bias-Variance Decomposition, the Bias term corresponds to:
A.The error due to sensitivity to small fluctuations in the training set.
B.The error due to erroneous assumptions in the learning algorithm (e.g., assuming data is linear when it is quadratic).
C.The inherent noise in the problem itself.
D.The computational cost of the algorithm.
Correct Answer: The error due to erroneous assumptions in the learning algorithm (e.g., assuming data is linear when it is quadratic).
Explanation: Bias represents the error introduced by approximating a real-world problem (which may be complex) by a much simpler model. High bias usually leads to underfitting.
6. As model complexity increases, what generally happens to Bias and Variance?
Correct Answer: Bias decreases and Variance increases.
Explanation: A more complex model can fit the training data better (lower bias) but becomes more sensitive to specific samples in the training data (higher variance), leading to the bias-variance trade-off.
7. Which scenario best describes Overfitting?
A.High Training Error, High Test Error
B.Low Training Error, Low Test Error
C.Low Training Error, High Test Error
D.High Training Error, Low Test Error
Correct Answer: Low Training Error, High Test Error
Explanation: Overfitting occurs when the model learns the noise and incidental details of the training data so closely that its performance on new data suffers.
8. Which of the following is NOT a technique to prevent overfitting?
A.Regularization (L1/L2)
B.Cross-validation
C.Early Stopping
D.Increasing the number of features significantly without increasing data
Correct Answer: Increasing the number of features significantly without increasing data
Explanation: Increasing features (dimensions) without more data typically increases model complexity and exacerbates the curse of dimensionality, leading to more overfitting.
9. Mathematically, L2 Regularization (Ridge Regression) adds which term to the loss function $J(\mathbf{w})$?
A.
B.
C.
D.
Correct Answer: $\lambda \sum_{j} w_j^2$
Explanation: L2 regularization penalizes the sum of the squares of the weights. The added term, $\lambda \lVert \mathbf{w} \rVert_2^2 = \lambda \sum_{j} w_j^2$, is proportional to the squared Euclidean norm of the weight vector.
10. What is the primary feature selection property of L1 Regularization (Lasso)?
A.It forces weights to be small but non-zero.
B.It drives some weights exactly to zero, inducing sparsity.
C.It smooths the decision boundary more than L2.
D.It increases the variance of the estimator.
Correct Answer: It drives some weights exactly to zero, inducing sparsity.
Explanation: Because the L1 constraint region is a diamond with corners on the coordinate axes, the optimal solution often lies at a corner where some parameters are exactly zero, effectively performing feature selection.
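The difference between shrinking weights and zeroing them shows up already in one dimension. The sketch below uses the closed-form minimizers of a single-weight problem with a quadratic data term; the function names, the targets, and the penalty strength `lam` are assumptions made for illustration.

```python
def ridge_1d(z, lam):
    # Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: shrinks z but stays non-zero for z != 0.
    return z / (1.0 + lam)

def lasso_1d(z, lam):
    # Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding, exactly 0 when |z| <= lam.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

targets = [3.0, 0.4, -0.2]  # unregularized solutions for three separate weights
ridge = [ridge_1d(z, lam=0.5) for z in targets]
lasso = [lasso_1d(z, lam=0.5) for z in targets]
```

With these values the L1 solution sets the two small weights exactly to zero, while the L2 solution only scales all three down.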
11. Structural Risk Minimization (SRM) aims to minimize:
Correct Answer: The empirical risk plus a complexity penalty (the VC confidence term).
Explanation: SRM balances the empirical risk (training error) and a confidence interval term related to the VC dimension (complexity penalty) to bound the true risk.
12. In k-fold Cross-Validation, how many times is the model trained and tested?
A.1 time
B. times
C. times
D. times
Correct Answer: $k$ times
Explanation: In k-fold cross-validation, the data is divided into $k$ subsets. The model is trained $k$ times, each time using $k - 1$ subsets for training and the remaining one for testing.
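The splitting procedure can be sketched in a few lines of plain Python. `k_fold_indices` is a hypothetical helper written for this example; it assumes contiguous folds without shuffling, which real pipelines usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print(len(train), "train /", len(test), "test")
```

Each sample lands in exactly one test fold, and the model would be fit once per fold. Setting `k` equal to `n_samples` gives folds with a single test point each.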
13. Leave-One-Out Cross-Validation (LOOCV) is a special case of k-fold cross-validation where $k$ equals:
A.1
B.5
C.10
D.$N$ (number of data points)
Correct Answer: $N$ (number of data points)
Explanation: In LOOCV, every single data point is used as a test set once, while the remaining $N - 1$ points form the training set. Thus, $k = N$.
14. In the context of Gradient Descent, what is the role of the Learning Rate ($\eta$)?
A.It determines the direction of the descent.
B.It determines the step size taken in the direction of the negative gradient.
C.It determines the number of iterations.
D.It determines the starting point of the parameters.
Correct Answer: It determines the step size taken in the direction of the negative gradient.
Explanation: The learning rate controls how large a step the algorithm takes when updating parameters. If it is too small, convergence is slow; if it is too large, the updates may diverge.
15. The standard Gradient Descent update rule for a parameter $\theta$ is:
A.
B.
C.
D.
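Correct Answer: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
Explanation: To minimize the loss function $J(\theta)$, we move in the direction opposite to the gradient ($-\nabla_\theta J(\theta)$), with the step scaled by the learning rate $\eta$.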
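As a minimal sketch of this update, the loop below minimizes a toy quadratic loss with a hand-written gradient. The loss $(\theta - 3)^2$, the step size `eta`, and the helper name are assumptions chosen for illustration.

```python
def gradient_descent(grad, theta0, eta, steps):
    """Repeat theta <- theta - eta * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)  # step against the gradient
    return theta

# Toy loss J(theta) = (theta - 3)**2 has gradient 2 * (theta - 3) and its minimum at 3.
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0, eta=0.1, steps=100)
print(theta_star)
```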
16. Which variant of Gradient Descent updates weights using only one training example at a time?
A.Batch Gradient Descent
B.Mini-batch Gradient Descent
C.Stochastic Gradient Descent (SGD)
D.Newton's Method
Correct Answer: Stochastic Gradient Descent (SGD)
Explanation: SGD computes the gradient and updates parameters using a single training example per iteration, making each update cheaper but noisier.
17. What is the primary advantage of Momentum in optimization algorithms?
A.It guarantees finding the global minimum in non-convex functions.
B.It reduces the learning rate automatically.
C.It helps accelerate gradient vectors in the right directions, thus leading to faster convergence.
D.It eliminates the need for a learning rate.
Correct Answer: It helps accelerate gradient vectors in the right directions, thus leading to faster convergence.
Explanation: Momentum accumulates an exponentially weighted moving average of past gradients to smooth out updates and dampen oscillations, which helps the optimizer move through shallow local minima or flat regions faster.
18. In RMSprop, the learning rate is adapted by dividing by:
A.The number of iterations.
B.The sum of past gradients.
C.The square root of the exponential moving average of squared gradients.
D.The L2 norm of the weights.
Correct Answer: The square root of the exponential moving average of squared gradients.
Explanation: RMSprop normalizes the gradient by the magnitude of recent gradients (root mean square) so that parameters with large gradients have reduced effective learning rates, and vice versa.
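The normalization can be sketched for a single scalar parameter. The default values for `eta`, `rho`, and `eps` below are common choices assumed for illustration, and `rmsprop_step` is a hypothetical helper, not a library function.

```python
import math

def rmsprop_step(theta, avg_sq, grad, eta=0.01, rho=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad * grad       # moving average of squared gradients
    theta = theta - eta * grad / (math.sqrt(avg_sq) + eps)  # divide by its square root
    return theta, avg_sq

# Two parameters whose gradients differ by four orders of magnitude.
theta_big, _ = rmsprop_step(theta=0.0, avg_sq=0.0, grad=100.0)
theta_small, _ = rmsprop_step(theta=0.0, avg_sq=0.0, grad=0.01)
print(theta_big, theta_small)
```

Both parameters move by almost the same amount on the first step, which is the sense in which large gradients receive a smaller effective learning rate.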
19. For a convex loss function, Gradient Descent is guaranteed to converge to:
A.A local minimum
B.The global minimum
C.A saddle point
D.Any point on the boundary
Correct Answer: The global minimum
Explanation: For a convex function, every local minimum is also a global minimum. Standard gradient descent (with an appropriate learning rate) will therefore converge to a global minimum.
20. What is Hyperparameter Tuning?
A.Updating weights during backpropagation.
B.Selecting the optimal values for parameters like learning rate, regularization strength, or tree depth.
C.Cleaning the data before training.
D.Selecting the best features for the model.
Correct Answer: Selecting the optimal values for parameters like learning rate, regularization strength, or tree depth.
Explanation: Hyperparameters are configuration variables external to the model whose values are not learned from the training data (unlike weights) and must be tuned to optimize performance.
21. Which search strategy involves testing a fixed set of hyperparameters arranged in a lattice structure?
A.Random Search
B.Grid Search
C.Bayesian Optimization
D.Gradient Search
Correct Answer: Grid Search
Explanation: Grid search exhaustively evaluates every combination in a manually specified grid of hyperparameter values.
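A minimal sketch of the exhaustive loop, using only the standard library. The grid values and the `toy_score` function are made up for illustration; in practice the score would come from cross-validation.

```python
from itertools import product

def grid_search(grid, score):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

grid = {"learning_rate": [0.001, 0.01, 0.1], "reg_strength": [0.0, 0.1, 1.0]}

def toy_score(learning_rate, reg_strength):
    # Hypothetical validation score that peaks at learning_rate=0.01, reg_strength=0.1.
    return -abs(learning_rate - 0.01) - abs(reg_strength - 0.1)

best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

The number of evaluations is the product of the list lengths (9 here), which is why grid search becomes expensive as hyperparameters are added.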
22. If a learning curve shows that both training and validation errors are high and close to each other, the model suffers from:
A.High Variance (Overfitting)
B.High Bias (Underfitting)
C.Ideally tuned parameters
D.Data leakage
Correct Answer: High Bias (Underfitting)
Explanation: When both errors are high, the model is not complex enough to capture the underlying pattern in the data, indicating underfitting (high bias).
23. What is the relationship between the number of training samples ($N$) and the generalization bound involving VC dimension ($d_{VC}$)?
A.Generalization error
B.Generalization error
C.Generalization error
D.Generalization error is independent of $N$
Correct Answer: Generalization error $\propto \sqrt{d_{VC} / N}$
Explanation: The bound on the difference between test error and training error typically scales with the square root of the VC dimension $d_{VC}$ divided by the number of samples $N$.
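To see the scaling numerically, the sketch below evaluates one commonly quoted form of the VC confidence term, $\sqrt{\big(d_{VC}(\ln(2N/d_{VC}) + 1) + \ln(4/\delta)\big)/N}$. The exact constants vary between textbooks, so treat this form, the function name, and the value of `delta` as assumptions.

```python
import math

def vc_confidence_term(d_vc, n, delta=0.05):
    """One commonly quoted VC confidence term, intended for n > d_vc."""
    numerator = d_vc * (math.log(2.0 * n / d_vc) + 1.0) + math.log(4.0 / delta)
    return math.sqrt(numerator / n)

# Fixed capacity, growing sample size.
terms = [vc_confidence_term(d_vc=10, n=n) for n in (1_000, 10_000, 100_000)]
print(terms)
```

For a fixed VC dimension the term shrinks toward zero as the sample size grows, roughly like the square root of the ratio up to a logarithmic factor.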
24. In the context of optimization, what is a Saddle Point?
A.A point where the gradient is zero, but it is a minimum in one direction and a maximum in another.
B.The lowest point in the loss landscape.
C.A point where the gradient is infinite.
D.The point where training starts.
Correct Answer: A point where the gradient is zero, but it is a minimum in one direction and a maximum in another.
Explanation: Saddle points are critical points (zero gradient) that are not local extrema. They pose challenges for optimization, especially in high dimensions.
25. The Empirical Risk corresponds to:
A.The expected error on unseen data.
B.The average loss calculated over the training dataset.
C.The maximum possible error of the classifier.
D.The error on the validation set.
Correct Answer: The average loss calculated over the training dataset.
Explanation: Empirical risk is the error measured strictly on the provided training samples, defined as $\hat{R}(h) = \frac{1}{N} \sum_{i=1}^{N} L(h(x_i), y_i)$.
26. Which regularization technique randomly sets a fraction of input units to 0 at each update during training time?
A.L2 Regularization
B.Early Stopping
C.Dropout
D.Data Augmentation
Correct Answer: Dropout
Explanation: Dropout is a regularization technique mainly used in neural networks where neurons are randomly 'dropped' (ignored) during training to prevent co-adaptation and overfitting.
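A minimal sketch of the masking step on a plain list of activations. It assumes the "inverted dropout" variant, which rescales the surviving units during training so that nothing needs to change at test time; the function name and drop probability are illustrative.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Zero each unit with probability p_drop and rescale the survivors (inverted dropout)."""
    if not training or p_drop == 0.0:
        return list(activations)  # at test time every unit is kept unchanged
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
dropped = dropout([1.0] * 1000, p_drop=0.5, rng=rng)
mean_activation = sum(dropped) / len(dropped)
unchanged = dropout([1.0, 2.0, 3.0], p_drop=0.5, rng=rng, training=False)
```

Roughly half the units are zeroed and the rest are doubled, so the average activation stays close to its original value.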
27. What happens if the learning rate in Gradient Descent is set too high?
A.The algorithm converges very slowly.
B.The algorithm may oscillate or diverge.
C.The algorithm will get stuck in a local minimum.
D.The model will overfit.
Correct Answer: The algorithm may oscillate or diverge.
Explanation: A large learning rate causes the updates to overshoot the minimum, potentially leading to increasing error (divergence) or oscillation around the minimum.
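The three regimes are easy to reproduce on the toy loss $f(x) = x^2$, whose gradient is $2x$. The step sizes below are chosen only to illustrate convergence, oscillation, and divergence for this particular function.

```python
def run(eta, steps=20, x0=1.0):
    """Run gradient descent on f(x) = x**2 and return the final iterate."""
    x = x0
    for _ in range(steps):
        x -= eta * 2.0 * x  # each step multiplies x by (1 - 2 * eta)
    return x

small = run(0.1)  # factor 0.8 per step: converges toward 0
edge = run(1.0)   # factor -1 per step: bounces between +1 and -1 forever
large = run(1.1)  # factor -1.2 per step: overshoots further each time
print(small, edge, large)
```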
28. Rademacher Complexity is often considered tighter (more accurate) than VC dimension bounds because:
A.It is independent of the data distribution.
B.It depends on the specific data distribution and the training sample size.
C.It is easier to calculate.
D.It is always zero for linear models.
Correct Answer: It depends on the specific data distribution and the training sample size.
Explanation: Unlike the VC dimension, which is a distribution-free worst-case measure, Rademacher complexity is data-dependent, offering a sharper bound for the specific data at hand.
29. The 'Shattering' coefficient (Growth function) for a hypothesis class with finite VC dimension $d_{VC}$ grows, as a function of the sample size $N$:
A.Exponentially with $N$ ($2^N$)
B.Polynomially with $N$ ($O(N^{d_{VC}})$)
C.Logarithmically with $N$
D.Constantly
Correct Answer: Polynomially with $N$ ($O(N^{d_{VC}})$)
Explanation: By Sauer's Lemma, if the VC dimension is $d_{VC}$, the growth function is bounded polynomially by $O(N^{d_{VC}})$ for $N \geq d_{VC}$.
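Sauer's Lemma bounds the growth function by the sum of binomial coefficients $\sum_{i=0}^{d_{VC}} \binom{N}{i}$. The sketch below compares that sum with the unrestricted count $2^N$ for an assumed VC dimension of 3; the helper name and the sample sizes are illustrative.

```python
from math import comb

def sauer_bound(n, d_vc):
    """Upper bound on the number of labelings of n points for VC dimension d_vc."""
    return sum(comb(n, i) for i in range(d_vc + 1))

# Each row: (sample size, Sauer bound, number of all possible labelings).
rows = [(n, sauer_bound(n, 3), 2 ** n) for n in (3, 10, 20)]
for row in rows:
    print(row)
```

The two counts agree while the sample size is at most the VC dimension and separate quickly afterwards: at 20 points the bound is 1351 against more than a million possible labelings.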
30. Which optimization algorithm combines the properties of AdaGrad and Momentum (specifically using exponential moving averages of squared gradients)?
A.SGD
B.Batch Gradient Descent
C.RMSprop
D.Adam
Correct Answer: Adam
Explanation: Adam keeps an exponential moving average of squared gradients to adapt per-parameter learning rates (the idea RMSprop introduced as a refinement of AdaGrad) and adds a momentum-style moving average of the gradients themselves. Plain RMSprop uses only the squared-gradient average, so it does not combine the two ideas.
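How Adam pairs a momentum-style average with RMSprop-style scaling can be sketched for a single scalar parameter. The defaults below are the commonly cited values, assumed here for illustration, and `adam_step` is a hypothetical helper rather than a library call.

```python
import math

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # momentum-style average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad  # RMSprop-style average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, m=0.0, v=0.0, grad=4.0, t=1)
print(theta)
```

On the first step the corrected averages reduce to the gradient and its square, so the parameter moves by almost exactly `eta` regardless of the gradient's magnitude.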
31. The approximation error in the Bias-Variance decomposition is associated with:
A.Variance
B.Bias
C.Irreducible Error
D.Noise
Correct Answer: Bias
Explanation: Approximation error refers to the inability of the hypothesis class to represent the true function perfectly. This corresponds to the Bias term.
32. In the context of Regularization, $\lambda$ (lambda) is a hyperparameter that controls:
A.The learning rate.
B.The strength of the penalty on the weights.
C.The number of epochs.
D.The size of the validation set.
Correct Answer: The strength of the penalty on the weights.
Explanation: A higher $\lambda$ imposes a stronger penalty, forcing weights to be smaller (more regularization), while $\lambda = 0$ implies no regularization.
33. Why is Early Stopping considered a regularization technique?
A.It adds a penalty term to the loss function.
B.It stops training when validation error starts to increase, preventing the model from learning noise.
C.It removes features from the dataset.
D.It increases the training data size.
Correct Answer: It stops training when validation error starts to increase, preventing the model from learning noise.
Explanation: By stopping the training process once the validation error begins to rise, usually before the training error has reached its minimum, early stopping limits how closely the model can fit the training data, which has an effect similar to explicit regularization.
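The stopping rule is often implemented with a "patience" counter. The sketch below applies it to a made-up list of validation errors; the function name, the patience value, and the error curve are assumptions for illustration.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return (best_epoch, best_error), stopping after `patience` epochs without improvement."""
    best_err = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0  # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:
                break  # validation error has stopped improving
    return best_epoch, best_err

val_curve = [0.50, 0.40, 0.35, 0.36, 0.38, 0.45]
best_epoch, best_err = early_stopping_epoch(val_curve)
print(best_epoch, best_err)
```

In a real training loop the weights saved at `best_epoch` would be restored after stopping.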
34. Which of the following indicates that a model has High Variance?
A.Training error: 1%, Validation error: 15%
B.Training error: 15%, Validation error: 16%
C.Training error: 20%, Validation error: 20%
D.Training error: 0%, Validation error: 0%
Correct Answer: Training error: 1%, Validation error: 15%
Explanation: A large gap between training error (low) and validation error (high) indicates that the model generalizes poorly, which is a symptom of high variance (overfitting).
35. In convergence analysis, if the objective function has Lipschitz continuous gradients, it implies:
A.The function is convex.
B.The rate of change of the gradient is bounded.
C.The function has no global minimum.
D.Gradient descent cannot be used.
Correct Answer: The rate of change of the gradient is bounded.
Explanation: Lipschitz continuity of gradients means the gradient does not change arbitrarily fast, which ensures that taking small steps will result in a predictable decrease in the loss function.
36. The total expected error of a learning algorithm can be decomposed into:
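Correct Answer: $\text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$
Explanation: For squared loss, the standard decomposition is $\text{Bias}^2 + \text{Variance} + \text{Noise}$ (Irreducible Error).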
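Written out for squared loss, and assuming the usual setup $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$ (notation introduced here for illustration), the decomposition at a point $x$ reads:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Noise}}
$$

Here $\hat{f}$ is the model fitted on a random training set, and the expectations are taken over training sets and the noise. The bias enters squared, which is why the terms are all non-negative.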
37. Which of the following is true regarding Batch Gradient Descent vs Stochastic Gradient Descent (SGD)?
A.SGD is computationally more expensive per iteration than Batch.
B.Batch gradient descent always converges faster in terms of time.
C.SGD updates are noisier, helping to escape local minima.
D.Batch gradient descent uses a subset of data.
Correct Answer: SGD updates are noisier, helping to escape local minima.
Explanation: Because SGD calculates gradients on single samples, the path to the minimum oscillates. This noise can be beneficial for escaping shallow local minima.
38. When using k-fold cross-validation for hyperparameter tuning, the final model is typically trained on:
A.One of the folds.
B.The validation sets only.
C.The entire training dataset using the best hyperparameters found.
D.The test set.
Correct Answer: The entire training dataset using the best hyperparameters found.
Explanation: Cross-validation is used to estimate performance and select parameters. Once selected, the final model should be retrained on all available training data to maximize learning.
39. The Occam's Razor principle in machine learning supports:
A.Choosing the most complex model that fits the data.
B.Choosing the simplest model that explains the data well.
C.Ignoring training errors.
D.Using only linear models.
Correct Answer: Choosing the simplest model that explains the data well.
Explanation: Occam's Razor suggests that among competing hypotheses that predict equally well, the one with the fewest assumptions (the simplest) should be selected. Structural Risk Minimization embodies this principle.
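40. In the momentum update rule $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$, what does $\gamma$ represent?
Correct Answer: The momentum coefficient.
Explanation: $\gamma$ (usually around 0.9) determines how much of the previous velocity is retained. It acts as a friction/decay term.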
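A minimal sketch of a velocity-based update for a scalar parameter, assuming the common form "new velocity = coefficient times old velocity plus learning rate times gradient". The names `gamma` and `eta`, their values, and the toy loss are assumptions for illustration.

```python
def momentum_step(theta, velocity, grad, eta=0.1, gamma=0.9):
    """One momentum update: keep a fraction gamma of the old velocity, then add the new gradient step."""
    velocity = gamma * velocity + eta * grad
    theta = theta - velocity
    return theta, velocity

# Toy loss J(theta) = theta**2 with gradient 2 * theta, started from rest at theta = 1.
theta, velocity = 1.0, 0.0
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, grad=2.0 * theta)
print(theta)
```

With `gamma=0` this reduces to plain gradient descent; values close to 1 retain more of the past velocity, so the iterate can overshoot and oscillate before settling.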
41. Which complexity measure is derived from the maximum correlation between the function class and a set of random signs?
A.VC Dimension
B.Rademacher Complexity
C.Shattering Coefficient
D.L2 Norm
Correct Answer: Rademacher Complexity
Explanation: This is the definition of Empirical Rademacher Complexity: the expectation of the maximum correlation between the model predictions and random signs (Rademacher variables).
42. If the training set size $N$ is much smaller than the VC dimension ($N \ll d_{VC}$), then:
A.Overfitting is highly likely.
B.Underfitting is highly likely.
C.The model will generalize well.
D.Training error will be high.
Correct Answer: Overfitting is highly likely.
Explanation: If the model capacity (VC dimension) significantly exceeds the number of data points, the model can memorize the data, leading to zero training error but poor generalization (overfitting).
43. Gradient Descent with Momentum helps specifically in scenarios where:
A.The surface curves much more steeply in one dimension than in another (ravines).
B.The learning rate is zero.
C.The function is perfectly spherical.
D.There is no gradient.
Correct Answer: The surface curves much more steeply in one dimension than in another (ravines).
Explanation: In ravines, standard SGD oscillates across the narrow slopes. Momentum dampens these oscillations and accelerates along the shallow direction.
44. The No Free Lunch Theorem implies that:
A.One algorithm is superior to all others for all problems.
B.Averaged over all possible problems, no algorithm performs better than random guessing.
C.Regularization is always necessary.
D.Gradient descent is the best optimizer.
Correct Answer: Averaged over all possible problems, no algorithm performs better than random guessing.
Explanation: The theorem states that there is no universally best learning algorithm; an algorithm's performance depends on how well its assumptions match the structure of the specific problem.
45. What is the convergence rate of Gradient Descent for a strongly convex function?
A.Linear (Geometric)
B.Logarithmic
C.Quadratic
D.Exponential
Correct Answer: Linear (Geometric)
Explanation: For strongly convex functions with Lipschitz gradients, Gradient Descent converges linearly (the error decreases by a constant factor at each step).
46. Lasso Regression (L1) can be interpreted as a Bayesian estimate with a specific prior distribution on the weights. Which distribution?
A.Gaussian (Normal) Prior
B.Laplace Prior
C.Uniform Prior
D.Bernoulli Prior
Correct Answer: Laplace Prior
Explanation: L1 regularization is equivalent to Maximum A Posteriori (MAP) estimation with a Laplace prior on the weights. The Laplace density is sharply peaked at zero, which promotes sparsity.
47. Ridge Regression (L2) corresponds to a Bayesian estimate with which prior?
A.Gaussian (Normal) Prior
B.Laplace Prior
C.Poisson Prior
D.Beta Prior
Correct Answer: Gaussian (Normal) Prior
Explanation: L2 regularization is equivalent to MAP estimation with a Gaussian prior on the weights.
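Both prior correspondences (Laplace for L1, Gaussian for L2) follow from taking the negative log of the posterior. The derivation below is a sketch under assumed notation: $\mathcal{D}$ is the data, $\tau$ the Gaussian prior scale, and $b$ the Laplace prior scale, none of which appear in the quiz itself.

$$
\hat{\mathbf{w}}_{\text{MAP}}
= \arg\max_{\mathbf{w}} \big[\log p(\mathcal{D} \mid \mathbf{w}) + \log p(\mathbf{w})\big]
= \arg\min_{\mathbf{w}} \big[-\log p(\mathcal{D} \mid \mathbf{w}) - \log p(\mathbf{w})\big]
$$

$$
\text{Gaussian: } p(\mathbf{w}) \propto \exp\!\Big(-\frac{\lVert \mathbf{w} \rVert_2^2}{2\tau^2}\Big)
\;\Rightarrow\; -\log p(\mathbf{w}) = \frac{1}{2\tau^2} \lVert \mathbf{w} \rVert_2^2 + \text{const}
$$

$$
\text{Laplace: } p(\mathbf{w}) \propto \exp\!\Big(-\frac{\lVert \mathbf{w} \rVert_1}{b}\Big)
\;\Rightarrow\; -\log p(\mathbf{w}) = \frac{1}{b} \lVert \mathbf{w} \rVert_1 + \text{const}
$$

In both cases the negative log-likelihood plays the role of the loss and the prior contributes the penalty, with the regularization strength set by the inverse of the prior scale.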
48. A model with Low Bias and Low Variance is:
A.Impossible to achieve.
B.The ideal goal of machine learning.
C.An underfitted model.
D.An overfitted model.
Correct Answer: The ideal goal of machine learning.
Explanation: This represents a model that captures the underlying trend accurately (low bias) and is insensitive to the particular training sample drawn (low variance), minimizing the total error.
49. In Structural Risk Minimization, the bound on True Risk is given by $R(h) \leq \hat{R}(h) + \Omega(d_{VC}, N)$. As the sample size $N \to \infty$, the complexity term $\Omega$ typically:
A.Approaches Infinity
B.Approaches 0
C.Remains constant
D.Oscillates
Correct Answer: Approaches 0
Explanation: As the amount of data increases, the confidence interval (complexity penalty) shrinks, and the empirical risk becomes a better approximation of the true risk.
50. Which gradient descent variant adapts the learning rate for each parameter individually based on the history of gradients?
A.Standard Gradient Descent
B.Momentum
C.Adagrad / RMSprop
D.Nesterov Momentum
Correct Answer: Adagrad / RMSprop
Explanation: Algorithms like Adagrad, RMSprop, and Adam are 'Adaptive' methods that maintain per-parameter learning rates.