Unit 3 - Practice Quiz

INT395 50 Questions

1 What is the primary motivation behind using Ensemble Methods in machine learning?

A. To combine multiple weak models to improve overall performance and generalization
B. To eliminate the need for hyperparameter tuning
C. To increase the computational speed of training models
D. To reduce the number of features in the dataset

2 Which of the following statements best explains how Ensemble methods reduce error according to the bias-variance decomposition?

A. They always reduce variance without affecting bias.
B. Bagging primarily reduces bias, while Boosting primarily reduces variance.
C. They always reduce bias without affecting variance.
D. Bagging primarily reduces variance, while Boosting primarily reduces bias.

3 In the context of ensemble learning, what is a Weak Learner?

A. A model that has 0% training error
B. A model that overfits the data significantly
C. A model that has too many parameters
D. A model that performs slightly better than random guessing

4 What does Bagging stand for?

A. Basic Aggregating
B. Backward Aggregating
C. Binary Aggregating
D. Bootstrap Aggregating

5 Which statistical technique involves sampling data subsets with replacement?

A. Bootstrapping
B. Cross-Validation
C. Stratification
D. Jackknife
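
Bootstrapping hinges on sampling *with replacement*: the sample has the same size as the data, but points can repeat. A minimal Python sketch with hypothetical data:

```python
import random

# Bootstrapping: draw a sample the same size as the data, WITH replacement,
# so some points repeat while others are left out entirely.
rng = random.Random(0)
data = list(range(10))
sample = [rng.choice(data) for _ in range(len(data))]

print(sorted(sample))  # same length as data, typically with duplicates
```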

6 In a Random Forest, which two randomization techniques are combined?

A. Boosting and Bagging
B. Grid Search and Random Search
C. L1 and L2 regularization
D. Bootstrap sampling and random feature selection

7 If you are training a Bagging ensemble with n samples, approximately what fraction of the samples is left out of a single bootstrap sample (Out-Of-Bag)?

A. ≈ 63%
B. ≈ 37%
C. ≈ 50%
D. ≈ 25%
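
The left-out fraction can be checked empirically: a point is never drawn with probability (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. A small simulation sketch:

```python
import math
import random

# Empirical check: fraction of points never drawn into a bootstrap sample
# of size n approaches (1 - 1/n)**n -> 1/e ~ 0.368.
rng = random.Random(42)
n, trials = 1000, 200
fractions = []
for _ in range(trials):
    drawn = {rng.randrange(n) for _ in range(n)}  # indices that made it in
    fractions.append(1 - len(drawn) / n)          # fraction left out (OOB)

avg = sum(fractions) / trials
print(round(avg, 3))  # close to 1/e ~ 0.368
```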

8 Which of the following is true regarding the parallelization of Bagging and Boosting?

A. Bagging is easy to parallelize, whereas Boosting is inherently sequential.
B. Neither can be parallelized.
C. Both Bagging and Boosting can be easily parallelized.
D. Boosting is easy to parallelize, whereas Bagging is inherently sequential.

9 In AdaBoost, how are the weights of training instances updated after each iteration?

A. Weights are kept constant throughout training.
B. Misclassified instances are given higher weights.
C. Weights are assigned randomly.
D. Correctly classified instances are given higher weights.
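
One AdaBoost weight update can be sketched in a few lines; labels and predictions below are made up (binary targets in {-1, +1}, uniform starting weights). Misclassified points are scaled up, correct ones down, then weights are renormalized:

```python
import math

# One AdaBoost round on hypothetical data: the second point is misclassified.
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]
w = [0.25] * 4                     # uniform initial weights

err = sum(wi for wi, t, p in zip(w, y_true, y_pred) if t != p)
alpha = 0.5 * math.log((1 - err) / err)          # learner's vote weight
w = [wi * math.exp(-alpha * t * p) for wi, t, p in zip(w, y_true, y_pred)]
total = sum(w)
w = [wi / total for wi in w]       # renormalize so weights sum to 1

print([round(wi, 3) for wi in w])  # the misclassified point now dominates
```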

10 What is the main difference between AdaBoost and Gradient Boosting?

A. AdaBoost minimizes the loss function using gradient descent, while Gradient Boosting uses weighted voting.
B. AdaBoost cannot be used for regression, while Gradient Boosting can.
C. There is no difference; they are synonyms.
D. AdaBoost changes sample weights, while Gradient Boosting fits the new predictor to the residual errors of the previous predictor.

11 In the context of Stacking, what is a Meta-Learner?

A. A model that learns how to combine the predictions of the base models
B. A specific type of Deep Neural Network
C. The first layer of base models
D. A model used for hyperparameter tuning

12 Which ensemble method is mathematically represented by F_m(x) = F_{m-1}(x) + η · h_m(x), where η is the learning rate?

A. Random Forest
B. Hard Voting
C. Gradient Boosting
D. Stacking

13 What is the primary risk when using Boosting with a large number of iterations (trees)?

A. Vanishing gradients
B. Overfitting
C. Underfitting
D. High bias

14 What is Hard Voting in ensemble classifiers?

A. Using a meta-model to decide the vote
B. Weighting votes based on classifier confidence
C. Averaging the probabilities of all classifiers
D. Taking the majority class prediction as the final output

15 What is Soft Voting?

A. Predicting the class with the most votes
B. Predicting the class with the highest summed predicted probability across classifiers
C. Randomly selecting a classifier's output
D. Using a soft-margin SVM as the ensemble
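
The difference between hard voting (Q14) and soft voting (Q15) shows up when one confident classifier is outvoted by two hesitant ones. A small sketch with made-up probabilities:

```python
from collections import Counter

# Three classifiers' predicted probabilities for classes "A" and "B".
probas = [
    {"A": 0.9, "B": 0.1},    # very confident in A
    {"A": 0.4, "B": 0.6},    # slight lean to B
    {"A": 0.45, "B": 0.55},  # slight lean to B
]

# Hard voting: majority of each classifier's argmax label.
hard_votes = [max(p, key=p.get) for p in probas]
hard = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: class with the highest summed probability.
soft = max("AB", key=lambda c: sum(p[c] for p in probas))

print(hard, soft)  # hard -> 'B' (2 votes to 1), soft -> 'A' (1.75 vs 1.25)
```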

16 Why is diversity important in an ensemble?

A. It increases the bias of the ensemble.
B. It ensures all models are identical.
C. It simplifies the hyperparameter tuning process.
D. It allows models to make independent errors, which cancel out when aggregated.

17 Which of the following is NOT a Hyperparameter?

A. The weights learned by a linear regression model
B. The number of neighbors (k) in KNN
C. The learning rate in Gradient Descent
D. The depth of a decision tree

18 What is the primary purpose of Hyperparameter Tuning?

A. To clean the dataset
B. To visualize the results
C. To select the optimal configuration for the learning algorithm to maximize performance
D. To train the model parameters like weights and biases

19 How does Grid Search work?

A. It uses gradient descent to find optimal hyperparameters.
B. It exhaustively tries every combination of a specified list of values for hyperparameters.
C. It randomly samples hyperparameters from a distribution.
D. It manually asks the user to input values during training.

20 What is the major drawback of Grid Search?

A. It does not find the optimal parameters.
B. It is difficult to implement.
C. It suffers from the Curse of Dimensionality (computationally expensive with many parameters).
D. It only works for Decision Trees.
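
The curse-of-dimensionality drawback is easy to see by enumerating a grid: cost multiplies with every extra hyperparameter. A sketch with a hypothetical parameter grid:

```python
from itertools import product

# Grid Search enumerates the FULL cross-product of the value lists, so the
# number of fits grows multiplicatively with each added hyperparameter.
grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 200],
}
combos = list(product(*grid.values()))
print(len(combos))  # 3 * 2 * 2 = 12 fits, before cross-validation multiplies it
```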

21 How does Random Search differ from Grid Search?

A. It guarantees finding the global optimum.
B. It samples a fixed number of parameter settings from specified distributions.
C. It is always slower than Grid Search.
D. It checks more combinations than Grid Search.
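
Random Search's fixed budget can be sketched directly; the parameter names and ranges below are hypothetical:

```python
import random

# Random Search: draw a FIXED budget of settings from distributions,
# instead of enumerating a grid.
rng = random.Random(0)
budget = 8
settings = [
    {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform draw
        "max_depth": rng.randint(2, 10),
    }
    for _ in range(budget)
]
print(len(settings))  # always exactly `budget`, however many parameters there are
```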

22 According to Bergstra and Bengio, why is Random Search often more efficient than Grid Search?

A. Because Random Search uses deep learning.
B. Because Grid Search introduces bias.
C. Because usually only a few hyperparameters are actually important for model performance.
D. Because random numbers are faster to generate.

23 What happens if we tune hyperparameters on the Test Set?

A. The model will generalize better.
B. Information leakage occurs, leading to an optimistic bias in performance estimation.
C. The training time decreases.
D. Nothing; this is standard practice.

24 Which technique is commonly used alongside Grid Search to evaluate the performance of each parameter combination?

A. K-Fold Cross-Validation
B. Clustering
C. Standardization
D. Principal Component Analysis

25 In a Bagging classifier, if the base models are unstable (e.g., fully grown Decision Trees), what is the expected outcome?

A. Bagging cannot be used with unstable models.
B. The ensemble will significantly reduce variance and improve accuracy.
C. The ensemble will perform worse than a single model.
D. The ensemble will increase bias significantly.

26 If you perform a Grid Search with: Parameter A = [1, 2, 3], Parameter B = [10, 20], and 5-fold Cross-Validation, how many total training runs are executed?

A. 6
B. 30
C. 10
D. 60
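
The arithmetic behind this question is just combinations times folds:

```python
# Total fits = (combinations in the grid) x (CV folds).
param_a = [1, 2, 3]
param_b = [10, 20]
folds = 5
runs = len(param_a) * len(param_b) * folds
print(runs)  # 3 * 2 * 5 = 30
```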

27 What is Stacking usually vulnerable to if not implemented correctly with cross-validation?

A. Underfitting
B. High bias
C. Data leakage / Overfitting on the training data
D. Convergence failure

28 In Gradient Boosting, what is the role of the Learning Rate (shrinkage)?

A. It sets the random seed.
B. It determines the number of features to select.
C. It controls the size of the tree.
D. It scales the contribution of each tree; lower values require more trees but improve generalization.
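
The behavior in option D can be sketched in a few lines: each stage fits the residuals of the current ensemble, and its contribution is scaled by the learning rate. A toy 1-D regression sketch (all data hypothetical):

```python
# Minimal 1-D gradient boosting for squared loss with depth-1 stumps.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.0, 1.0, 1.0]
lr = 0.5  # shrinkage: smaller values need more stages but generalize better

def fit_stump(xs, residuals):
    # Brute-force the single split threshold minimizing squared error.
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = sum((r - (lm if x <= t else rm)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

pred = [0.0] * len(xs)
for _ in range(20):                                  # 20 boosting stages
    residuals = [y - p for y, p in zip(ys, pred)]    # fit what is still wrong
    t, lm, rm = fit_stump(xs, residuals)
    pred = [p + lr * (lm if x <= t else rm) for p, x in zip(pred, xs)]

print([round(p, 3) for p in pred])  # approaches ys as stages accumulate
```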

29 Which of the following ensemble methods uses Decision Stumps as the default base estimator?

A. Bagging
B. Stacking
C. AdaBoost
D. Random Forest

30 What is the key difference between Stacking and Blending?

A. Stacking typically uses cross-validated predictions for the meta-learner; Blending uses a hold-out validation set.
B. Stacking uses regression; Blending uses classification.
C. Stacking is parallel; Blending is sequential.
D. Blending is an older name for Bagging.

31 When performing hyperparameter tuning for a Decision Tree, which parameter typically controls overfitting?

A. Criterion (Gini/Entropy)
B. Random State
C. Max Depth
D. Splitter (Best/Random)

32 Which theoretical theorem states that if individual classifiers are independent and better than random guessing, the ensemble accuracy approaches 1 as the number of classifiers increases?

A. Central Limit Theorem
B. Bayes Theorem
C. No Free Lunch Theorem
D. Condorcet's Jury Theorem
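
The theorem in this question can be checked numerically: with independent voters each correct with probability p > 0.5, majority accuracy climbs toward 1 as the ensemble grows. A sketch:

```python
from math import comb

def majority_correct(n, p):
    # P(a strict majority of n independent voters is right),
    # each voter being right with probability p (Condorcet's setting).
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

for n in (1, 11, 101):
    print(n, round(majority_correct(n, 0.6), 4))  # accuracy climbs toward 1
```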

33 In Random Forest, increasing the number of trees typically:

A. Decreases the variance up to a point without significantly increasing overfitting.
B. Increases the bias significantly.
C. Causes overfitting.
D. Makes the model faster to train.

34 Which method is best suited if you have high-variance models (e.g., unpruned decision trees)?

A. Linear Regression
B. Bagging
C. Boosting
D. Logistic Regression

35 Which method is best suited if you have high-bias models (e.g., shallow trees)?

A. Naive Bayes
B. Clustering
C. Bagging
D. Boosting

36 What is the OOB (Out-Of-Bag) Error used for?

A. To estimate the generalization error of a Bagging ensemble without needing a separate validation set.
B. To calculate the gradient in boosting.
C. To stop the training early.
D. To select features in Grid Search.

37 In the context of hyperparameter tuning, what is a continuous hyperparameter?

A. Learning rate
B. Depth of a tree
C. Number of trees
D. Number of features

38 Why might one choose XGBoost over standard Gradient Boosting?

A. XGBoost is a bagging technique.
B. XGBoost includes regularization (L1/L2) and is optimized for speed/scalability.
C. XGBoost is slower.
D. XGBoost does not support regression.

39 What is the Base Estimator in a heterogeneous Stacking ensemble?

A. It can be any supervised learning algorithm (SVM, KNN, Tree, etc.).
B. It must be a Neural Network.
C. It must be a Decision Tree.
D. It must be the same algorithm with different hyperparameters.

40 Which search strategy uses probability to choose the next set of hyperparameters based on past results (e.g., using Gaussian Processes)?

A. Bayesian Optimization
B. Grid Search
C. Random Search
D. Exhaustive Search

41 In Grid Search, if the optimal value lies between two grid points, the method will:

A. Fail.
B. Switch to Random Search.
C. Select the closest defined grid point.
D. Automatically interpolate to find it.

42 Which of the following is an advantage of Ensemble Methods?

A. Low training time.
B. Compactness (small model size).
C. Interpretability (easy to explain distinct rules).
D. Robustness and Stability.

43 In a Voting Classifier, what requirement must be met to use Soft Voting?

A. The data must be linearly separable.
B. The base classifiers must support the predict_proba method.
C. The base classifiers must be Decision Trees.
D. There must be an odd number of classifiers.

44 When defining a parameter grid for SVM, which parameters are commonly tuned?

A. C and γ (Gamma)
B. Learning rate and momentum
C. n_estimators and max_depth
D. k and distance metric

45 What is the concept of Feature Subsampling in Gradient Boosting?

A. Manually selecting features.
B. Using only a random fraction of features at each split or tree construction to reduce variance.
C. Using PCA before training.
D. Removing features that are not important.

46 A Random Forest has p features in total. For classification, what is the recommended number of features to search at each split?

A. p
B. √p
C. p/2
D. log₂(p)
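
The usual heuristics (√p for classification, p/3 for regression) in a quick check, using a hypothetical p = 100:

```python
import math

p = 100  # hypothetical total number of features
m_classification = round(math.sqrt(p))  # common heuristic: sqrt(p)
m_regression = max(1, p // 3)           # common heuristic: p / 3
print(m_classification, m_regression)   # 10 33
```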

47 Why is Accuracy sometimes a poor metric to optimize during hyperparameter tuning?

A. It is computationally expensive to calculate.
B. It is not differentiable.
C. In imbalanced datasets, it can be misleading (e.g., predicting the majority class exclusively).
D. Grid search does not support accuracy.

48 In Stacking, the Level-0 models are:

A. The meta-learners.
B. The models used for feature selection.
C. The final output layer.
D. The base models trained on the original dataset.

49 Which component of the error does Random Forest specifically aim to keep low compared to a single Decision Tree?

A. Bias
B. Computation time
C. Noise
D. Variance

50 When using Random Search, if you increase the number of iterations:

A. The search space shrinks.
B. The probability of finding a near-optimal combination increases.
C. The computational cost decreases.
D. The probability of finding the optimal parameters decreases.