Explanation: Bagging aggregates high-variance, low-bias models (like deep trees) to reduce variance. Boosting aggregates high-bias, low-variance models (like stumps) to reduce bias.
3. In the context of ensemble learning, what is a Weak Learner?
A. A model that performs slightly better than random guessing
B. A model that has 0% training error
C. A model that has too many parameters
D. A model that overfits the data significantly
Correct Answer: A model that performs slightly better than random guessing
Explanation: A weak learner is a classifier that is only slightly correlated with the true classification (better than random guessing), which can be boosted into a strong learner.
4. What does Bagging stand for?
A. Basic Aggregating
B. Bootstrap Aggregating
C. Binary Aggregating
D. Backward Aggregating
Correct Answer: Bootstrap Aggregating
Explanation: Bagging stands for Bootstrap Aggregating.
5. Which statistical technique involves sampling data subsets with replacement?
A. Jackknife
B. Bootstrapping
C. Cross-Validation
D. Stratification
Correct Answer: Bootstrapping
Explanation: Bootstrapping is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.
6. In a Random Forest, which two randomization techniques are combined?
A. Boosting and Bagging
B. Bootstrap sampling and random feature selection
C. Grid Search and Random Search
D. L1 and L2 regularization
Correct Answer: Bootstrap sampling and random feature selection
Explanation: Random Forest uses Bagging (bootstrap sampling) for rows and random feature selection for split candidates at each node to decorrelate trees.
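Both randomizations map directly to constructor arguments in scikit-learn; a minimal sketch (the dataset and parameter values here are illustrative, not part of the quiz):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,        # randomization 1: bootstrap-sample the rows for each tree
    max_features="sqrt",   # randomization 2: random feature subset at each split
    random_state=0,
).fit(X, y)
```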
7. If you are training a Bagging ensemble with $n$ samples, approximately what fraction of the samples is left out of a single bootstrap sample (Out-Of-Bag)?
A. $\approx 25\%$
B. $\approx 37\%$
C. $\approx 50\%$
D. $\approx 63\%$
Correct Answer: $\approx 37\%$
Explanation: The probability of a sample not being picked in any of the $n$ draws is $(1 - 1/n)^n$. As $n \to \infty$, this converges to $e^{-1} \approx 0.368$, i.e. roughly 37%.
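The limit is easy to verify empirically; a short numpy sketch (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)      # bootstrap: draw n row indices with replacement
left_out = n - np.unique(sample).size    # rows that were never drawn
print(left_out / n)                      # ~0.368, matching (1 - 1/n)^n -> 1/e
```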
8. Which of the following is true regarding the parallelization of Bagging and Boosting?
A. Both Bagging and Boosting can be easily parallelized.
B. Neither can be parallelized.
C. Bagging is easy to parallelize, whereas Boosting is inherently sequential.
D. Boosting is easy to parallelize, whereas Bagging is inherently sequential.
Correct Answer: Bagging is easy to parallelize, whereas Boosting is inherently sequential.
Explanation: In Bagging, trees are independent and can be trained simultaneously. In Boosting, each tree depends on the errors of the previous tree.
9. In AdaBoost, how are the weights of training instances updated after each iteration?
A. Weights are kept constant throughout training.
B. Misclassified instances are given higher weights.
C. Correctly classified instances are given higher weights.
D. Weights are assigned randomly.
Correct Answer: Misclassified instances are given higher weights.
Explanation: AdaBoost increases the weights of misclassified samples so the next weak learner focuses more on the 'hard' cases.
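A numpy sketch of one round of the discrete binary AdaBoost reweighting rule (the toy labels below are illustrative):

```python
import numpy as np

def update_weights(w, y_true, y_pred):
    """One AdaBoost round: upweight misclassified samples, then renormalize."""
    miss = y_true != y_pred
    eps = np.average(miss, weights=w)               # weighted error of this learner
    alpha = 0.5 * np.log((1 - eps) / eps)           # the learner's vote weight
    w = w * np.exp(np.where(miss, alpha, -alpha))   # raise hard cases, shrink easy ones
    return w / w.sum(), alpha

w = np.full(4, 0.25)
w, alpha = update_weights(w, np.array([1, 1, -1, -1]), np.array([1, -1, -1, -1]))
print(w)   # the single misclassified sample now carries weight 0.5
```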
10. What is the main difference between AdaBoost and Gradient Boosting?
A. AdaBoost minimizes the loss function using gradient descent, while Gradient Boosting uses weighted voting.
B. AdaBoost changes sample weights, while Gradient Boosting fits the new predictor to the residual errors of the previous predictor.
C. AdaBoost cannot be used for regression, while Gradient Boosting can.
D. There is no difference; they are synonyms.
Correct Answer: AdaBoost changes sample weights, while Gradient Boosting fits the new predictor to the residual errors of the previous predictor.
Explanation: Gradient Boosting generalizes boosting by training on residuals (gradients of the loss function), whereas AdaBoost specifically reweights samples.
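The residual-fitting loop is only a few lines; a minimal sketch for squared loss (data, tree depth, and learning rate are illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

eta, trees = 0.1, []
F = np.full_like(y, y.mean())              # F_0: constant initial prediction
for _ in range(100):
    residuals = y - F                      # negative gradient of squared loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + eta * h.predict(X)             # F_m = F_{m-1} + eta * h_m
    trees.append(h)
```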
11. In the context of Stacking, what is a Meta-Learner?
A. The first layer of base models
B. A model that learns how to combine the predictions of the base models
C. A model used for hyperparameter tuning
D. A specific type of Deep Neural Network
Correct Answer: A model that learns how to combine the predictions of the base models
Explanation: Stacking involves a second-level model (meta-learner) that takes the outputs of the first-level models as input features to make the final prediction.
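scikit-learn exposes this directly; a sketch where the base models and meta-learner are illustrative choices:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("tree", DecisionTreeClassifier(max_depth=3))],
    final_estimator=LogisticRegression(),   # the meta-learner
    cv=5,  # base predictions are generated out-of-fold to avoid leakage
)
```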
12. Which ensemble method is mathematically represented by $F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$, where $\eta$ is the learning rate?
A. Random Forest
B. Gradient Boosting
C. Hard Voting
D. Stacking
Correct Answer: Gradient Boosting
Explanation: This equation represents additive modeling in Gradient Boosting, where a new weak learner $h_m$ is added to the previous ensemble prediction, scaled by the learning rate.
13. What is the primary risk when using Boosting with a large number of iterations (trees)?
A. Underfitting
B. Overfitting
C. High bias
D. Vanishing gradients
Correct Answer: Overfitting
Explanation: Because Boosting focuses on reducing bias and correcting errors, running it for too many iterations can lead to overfitting the training noise.
14. What is Hard Voting in ensemble classifiers?
A. Averaging the probabilities of all classifiers
B. Taking the majority class prediction as the final output
C. Weighting votes based on classifier confidence
D. Using a meta-model to decide the vote
Correct Answer: Taking the majority class prediction as the final output
Explanation: Hard voting predicts the class that receives the largest number of votes from the individual classifiers.
15. What is Soft Voting?
A. Predicting the class with the highest summed predicted probability across classifiers
B. Predicting the class with the most votes
C. Randomly selecting a classifier's output
D. Using a soft-margin SVM as the ensemble
Correct Answer: Predicting the class with the highest summed predicted probability across classifiers
Explanation: Soft voting averages the class probabilities (confidence) predicted by the base estimators and selects the class with the highest average probability.
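Both modes are one argument apart in scikit-learn; a sketch with illustrative estimator choices (note that SVC needs probability=True to expose predict_proba):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

estimators = [("lr", LogisticRegression()),
              ("knn", KNeighborsClassifier()),
              ("svm", SVC(probability=True))]  # probability=True enables predict_proba

hard = VotingClassifier(estimators, voting="hard")  # majority class wins
soft = VotingClassifier(estimators, voting="soft")  # highest average probability wins
```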
16. Why is diversity important in an ensemble?
A. It ensures all models are identical.
B. It allows models to make independent errors, which cancel out when aggregated.
C. It increases the bias of the ensemble.
D. It simplifies the hyperparameter tuning process.
Correct Answer: It allows models to make independent errors, which cancel out when aggregated.
Explanation: If all models make the same errors, averaging them yields no benefit. Diversity ensures errors are uncorrelated, reducing overall variance.
17. Which of the following is NOT a Hyperparameter?
A. The depth of a decision tree
B. The number of neighbors ($k$) in KNN
C. The weights learned by a linear regression model
D. The learning rate in Gradient Descent
Correct Answer: The weights learned by a linear regression model
Explanation: Weights are model parameters learned during training. Hyperparameters are configuration settings external to the model, defined before training.
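The distinction is visible in code; a tiny sketch with toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X, y = np.array([[0.0], [1.0], [2.0]]), np.array([1.0, 3.0, 5.0])

model = LinearRegression(fit_intercept=True)  # hyperparameter: set before training
model.fit(X, y)
print(model.coef_, model.intercept_)          # parameters: learned during training
```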
18. What is the primary purpose of Hyperparameter Tuning?
A. To train the model parameters like weights and biases
B. To select the optimal configuration for the learning algorithm to maximize performance
C. To clean the dataset
D. To visualize the results
Correct Answer: To select the optimal configuration for the learning algorithm to maximize performance
Explanation: Hyperparameter tuning optimizes the settings that govern the learning process to prevent overfitting/underfitting and improve accuracy.
19. How does Grid Search work?
A. It randomly samples hyperparameters from a distribution.
B. It exhaustively tries every combination of a specified list of values for hyperparameters.
C. It uses gradient descent to find optimal hyperparameters.
D. It manually asks the user to input values during training.
Correct Answer: It exhaustively tries every combination of a specified list of values for hyperparameters.
Explanation: Grid Search defines a grid of hyperparameter values and evaluates the model performance for every possible combination in that grid.
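A GridSearchCV sketch (grid values, estimator, and scorer are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
# search.fit(X, y); search.best_params_ then holds the winning combination
```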
20. What is the major drawback of Grid Search?
A. It does not find the optimal parameters.
B. It suffers from the Curse of Dimensionality (computationally expensive with many parameters).
C. It is difficult to implement.
D. It only works for Decision Trees.
Correct Answer: It suffers from the Curse of Dimensionality (computationally expensive with many parameters).
Explanation: The number of combinations grows exponentially with the number of hyperparameters, making Grid Search extremely slow for high-dimensional spaces.
21. How does Random Search differ from Grid Search?
A. It checks more combinations than Grid Search.
B. It samples a fixed number of parameter settings from specified distributions.
C. It is always slower than Grid Search.
D. It guarantees finding the global optimum.
Correct Answer: It samples a fixed number of parameter settings from specified distributions.
Explanation: Random Search picks random combinations from the search space, which is often more efficient than Grid Search for finding good parameters in high-dimensional spaces.
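A RandomizedSearchCV sketch; the distributions and budget are illustrative:

```python
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {"learning_rate": loguniform(1e-3, 1e0),  # continuous distribution
              "n_estimators": randint(50, 500)}        # discrete distribution
search = RandomizedSearchCV(GradientBoostingClassifier(), param_dist,
                            n_iter=30, cv=5, random_state=0)  # fixed budget of 30 draws
```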
22. According to Bergstra and Bengio, why is Random Search often more efficient than Grid Search?
A. Because random numbers are faster to generate.
B. Because usually only a few hyperparameters are actually important for model performance.
C. Because Grid Search introduces bias.
D. Because Random Search uses deep learning.
Correct Answer: Because usually only a few hyperparameters are actually important for model performance.
Explanation: In high dimensions, not all parameters affect the objective function equally. Random search explores unique values for important parameters more effectively than a grid.
23. What happens if we tune hyperparameters on the Test Set?
A. The model will generalize better.
B. Information leakage occurs, leading to an optimistic bias in performance estimation.
C. The training time decreases.
D. Nothing; this is standard practice.
Correct Answer: Information leakage occurs, leading to an optimistic bias in performance estimation.
Explanation: The test set must remain unseen. Tuning on it leaks test-data information into the model configuration, invalidating the test set as an unbiased estimate of generalization performance.
24. Which technique is commonly used alongside Grid Search to evaluate the performance of each parameter combination?
A. Standardization
B. K-Fold Cross-Validation
C. Principal Component Analysis
D. Clustering
Correct Answer: K-Fold Cross-Validation
Explanation: To ensure the hyperparameter performance isn't specific to one random split of data, Cross-Validation is used to average performance across multiple folds.
25. In a Bagging classifier, if the base models are unstable (e.g., fully grown Decision Trees), what is the expected outcome?
A. The ensemble will perform worse than a single model.
B. The ensemble will significantly reduce variance and improve accuracy.
C. The ensemble will increase bias significantly.
D. Bagging cannot be used with unstable models.
Correct Answer: The ensemble will significantly reduce variance and improve accuracy.
Explanation: Bagging works best with unstable, high-variance models. By averaging them, the variance is smoothed out.
26. If you perform a Grid Search with Parameter A = [1, 2, 3], Parameter B = [10, 20], and 5-fold Cross-Validation, how many total training runs are executed?
A. 6
B. 10
C. 30
D. 60
Correct Answer: 30
Explanation: Total runs = (number of combinations) × (number of folds). Combinations = 3 × 2 = 6. Total runs = 6 × 5 = 30.
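The count can be checked with scikit-learn's ParameterGrid helper:

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({"A": [1, 2, 3], "B": [10, 20]})
print(len(grid))       # 6 combinations
print(len(grid) * 5)   # 30 training runs with 5-fold CV
```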
27. What is Stacking usually vulnerable to if not implemented correctly with cross-validation?
A. Underfitting
B. Data leakage / Overfitting on the training data
C. High bias
D. Convergence failure
Correct Answer: Data leakage / Overfitting on the training data
Explanation: If the meta-learner is trained on the same data used to train base learners (without cross-validated prediction generation), it will learn to rely on the base learners' overfitting.
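One safe pattern is cross_val_predict, which yields out-of-fold predictions; a sketch with an illustrative dataset and base model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
# Each row's meta-feature comes from a model that never saw that row.
meta = cross_val_predict(DecisionTreeClassifier(), X, y, cv=5,
                         method="predict_proba")
# `meta` (shape [n_samples, n_classes]) is a leakage-free input for the meta-learner.
```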
28. In Gradient Boosting, what is the role of the Learning Rate (shrinkage)?
A. It controls the size of the tree.
B. It scales the contribution of each tree; lower values require more trees but improve generalization.
C. It determines the number of features to select.
D. It sets the random seed.
Correct Answer: It scales the contribution of each tree; lower values require more trees but improve generalization.
Explanation: The learning rate ($\eta$) scales the update $F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$. Smaller steps lead to better convergence but require more iterations.
29. Which of the following ensemble methods uses Decision Stumps as the default base estimator?
A. Random Forest
B. Bagging
C. AdaBoost
D. Stacking
Correct Answer: AdaBoost
Explanation: AdaBoost typically uses decision stumps (trees with a depth of 1) as weak learners.
30. What is the key difference between Stacking and Blending?
A. Stacking typically uses cross-validated predictions for the meta-learner; Blending uses a hold-out validation set.
B. Blending is an older name for Bagging.
C. Stacking is parallel; Blending is sequential.
Correct Answer: Stacking typically uses cross-validated predictions for the meta-learner; Blending uses a hold-out validation set.
Explanation: Blending is a simplified version of stacking where predictions for the meta-learner are generated from a hold-out set rather than full k-fold cross-validation.
31. When performing hyperparameter tuning for a Decision Tree, which parameter typically controls overfitting?
A. Max Depth
B. Criterion (Gini/Entropy)
C. Random State
D. Splitter (Best/Random)
Correct Answer: Max Depth
Explanation: Limiting the Max Depth prevents the tree from growing too complex and memorizing noise in the training data.
32. Which theoretical theorem states that if individual classifiers are independent and better than random guessing, the ensemble accuracy approaches 1 as the number of classifiers increases?
A. Bayes Theorem
B. Condorcet's Jury Theorem
C. Central Limit Theorem
D. No Free Lunch Theorem
Correct Answer: Condorcet's Jury Theorem
Explanation: Condorcet's Jury Theorem provides the mathematical justification for why combining weak independent learners results in a strong learner.
33. In Random Forest, increasing the number of trees typically:
A. Causes overfitting.
B. Decreases the variance up to a point without significantly increasing overfitting.
C. Increases the bias significantly.
D. Makes the model faster to train.
Correct Answer: Decreases the variance up to a point without significantly increasing overfitting.
Explanation: Unlike Boosting, adding more trees to a Random Forest does not lead to overfitting; the error rate usually stabilizes.
34. Which method is best suited if you have high-variance models (e.g., unpruned decision trees)?
A. Boosting
B. Bagging
C. Linear Regression
D. Logistic Regression
Correct Answer: Bagging
Explanation: Bagging reduces variance by averaging, making it ideal for high-variance base models.
35. Which method is best suited if you have high-bias models (e.g., shallow trees)?
A. Boosting
B. Bagging
C. Naive Bayes
D. Clustering
Correct Answer: Boosting
Explanation: Boosting turns weak learners (high bias) into strong learners by sequentially correcting errors.
36. What is the OOB (Out-Of-Bag) Error used for?
A. To calculate the gradient in boosting.
B. To estimate the generalization error of a Bagging ensemble without needing a separate validation set.
C. To select features in Grid Search.
D. To stop the training early.
Correct Answer: To estimate the generalization error of a Bagging ensemble without needing a separate validation set.
Explanation: Since ~37% of data is not seen by each tree, these samples can be used as a built-in validation set to evaluate performance.
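In scikit-learn this is a single flag; a sketch with an illustrative dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=200, bootstrap=True,
                            oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)   # accuracy estimated on each tree's out-of-bag rows
```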
37. In the context of hyperparameter tuning, what is a continuous hyperparameter?
A. Number of trees
B. Depth of a tree
C. Learning rate ($\eta$)
D. Number of features
Correct Answer: Learning rate ($\eta$)
Explanation: Integer values like depth or number of trees are discrete. The learning rate is a floating-point value and is therefore continuous.
38. Why might one choose XGBoost over standard Gradient Boosting?
A. XGBoost is slower.
B. XGBoost includes regularization (L1/L2) and is optimized for speed/scalability.
C. XGBoost does not support regression.
D. XGBoost is a bagging technique.
Correct Answer: XGBoost includes regularization (L1/L2) and is optimized for speed/scalability.
Explanation: XGBoost (Extreme Gradient Boosting) is an optimized implementation that includes regularization terms in the objective function to control overfitting and supports parallel processing.
39. What is the Base Estimator in a heterogeneous Stacking ensemble?
A. It must be a Decision Tree.
B. It can be any supervised learning algorithm (SVM, KNN, Tree, etc.).
C. It must be the same algorithm with different hyperparameters.
D. It must be a Neural Network.
Correct Answer: It can be any supervised learning algorithm (SVM, KNN, Tree, etc.).
Explanation: Stacking thrives on heterogeneity; combining different types of algorithms often yields better results than combining variations of the same algorithm.
40. Which search strategy uses probability to choose the next set of hyperparameters based on past results (e.g., using Gaussian Processes)?
A. Grid Search
B. Random Search
C. Bayesian Optimization
D. Exhaustive Search
Correct Answer: Bayesian Optimization
Explanation: Bayesian Optimization builds a probabilistic model of the function mapping hyperparameters to a target objective to select the most promising hyperparameters to evaluate next.
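A sketch assuming the third-party scikit-optimize package is installed; the toy objective below stands in for a real cross-validation score:

```python
from skopt import gp_minimize
from skopt.space import Real

def objective(params):
    (learning_rate,) = params
    # In practice: train a model with this learning rate and return a validation loss.
    return (learning_rate - 0.1) ** 2   # toy stand-in

result = gp_minimize(objective,                               # function to minimize
                     [Real(1e-4, 1.0, prior="log-uniform")],  # search space
                     n_calls=25, random_state=0)              # evaluation budget
print(result.x)   # hyperparameter values with the best observed objective
```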
41. In Grid Search, if the optimal value lies between two grid points, the method will:
A. Automatically interpolate to find it.
B. Fail.
C. Select the closest defined grid point.
D. Switch to Random Search.
Correct Answer: Select the closest defined grid point.
Explanation: Grid Search is restricted to the specific values provided in the grid. It cannot find values that were not explicitly requested.
42. Which of the following is an advantage of Ensemble Methods?
A. Interpretability (easy to explain distinct rules).
B. Compactness (small model size).
C. Robustness and Stability.
D. Low training time.
Correct Answer: Robustness and Stability.
Explanation: Ensembles are generally more robust to noise and outliers than single models. However, they lose interpretability and are computationally heavier.
43. In a Voting Classifier, what requirement must be met to use Soft Voting?
A. The base classifiers must support the predict_proba method.
B. The base classifiers must be Decision Trees.
C. The data must be linearly separable.
D. There must be an odd number of classifiers.
Correct Answer: The base classifiers must support the predict_proba method.
Explanation: Soft voting relies on averaging predicted probabilities, so the underlying models must be able to output probability estimates.
44. When defining a parameter grid for SVM, which parameters are commonly tuned?
A. $\alpha$ and $\beta$
B. $C$ and $\gamma$ (Gamma)
C. $k$ and distance metric
D. Learning rate and momentum
Correct Answer: $C$ and $\gamma$ (Gamma)
Explanation: $C$ controls the regularization (margin hardness) and $\gamma$ defines the influence of a single training example in RBF kernels.
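A typical grid (the specific values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.001, 0.01, 0.1, "scale"]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
```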
45. What is the concept of Feature Subsampling in Gradient Boosting?
A. Removing features that are not important.
B. Using only a random fraction of features at each split or tree construction to reduce variance.
C. Using PCA before training.
D. Manually selecting features.
Correct Answer: Using only a random fraction of features at each split or tree construction to reduce variance.
Explanation: Similar to Random Forests, Stochastic Gradient Boosting can subsample columns (features) to decorrelate trees and reduce overfitting.
46. A Random Forest has $p$ features in total. For classification, what is the recommended number of features to search at each split?
A. $p$
B. $\sqrt{p}$
C. $p/2$
D. $\log_2 p$
Correct Answer: $\sqrt{p}$
Explanation: A common heuristic for the number of features to consider at each split in Random Forest classification is the square root of the total number of features.
47. Why is Accuracy sometimes a poor metric to optimize during hyperparameter tuning?
A. It is computationally expensive to calculate.
B. It is not differentiable.
C. In imbalanced datasets, it can be misleading (e.g., predicting the majority class exclusively).
D. Grid search does not support accuracy.
Correct Answer: In imbalanced datasets, it can be misleading (e.g., predicting the majority class exclusively).
Explanation: In skewed classes, a model can achieve high accuracy by ignoring the minority class. Metrics like F1-score or AUC-ROC are often better targets for tuning.
48. In Stacking, the Level-0 models are:
A. The meta-learners.
B. The base models trained on the original dataset.
C. The models used for feature selection.
D. The final output layer.
Correct Answer: The base models trained on the original dataset.
Explanation: Level-0 refers to the base models. Level-1 refers to the meta-learner that stacks the predictions of Level-0.
49. Which component of the error does Random Forest specifically aim to keep low compared to a single Decision Tree?
A. Bias
B. Variance
C. Noise
D. Computation time
Correct Answer: Variance
Explanation: A single deep decision tree has low bias but high variance. Random Forest averages many such trees to lower the variance while maintaining low bias.
50. When using Random Search, if you increase the number of iterations:
A. The probability of finding the optimal parameters decreases.
B. The computational cost decreases.
C. The probability of finding a near-optimal combination increases.
D. The search space shrinks.
Correct Answer: The probability of finding a near-optimal combination increases.
Explanation: More iterations mean more samples from the hyperparameter space, increasing the likelihood of hitting a high-performance configuration.