Explanation: Bagging aggregates high-variance, low-bias models (like deep trees) to reduce variance. Boosting aggregates high-bias, low-variance models (like stumps) to reduce bias.
3. In the context of ensemble learning, what is a Weak Learner?
A. A model that performs slightly better than random guessing
B. A model that has 0% training error
C. A model that has too many parameters
D. A model that overfits the data significantly
Correct Answer: A model that performs slightly better than random guessing
Explanation: A weak learner is a classifier that is only slightly correlated with the true classification (better than random guessing), which can be boosted into a strong learner.
4. What does Bagging stand for?
A. Basic Aggregating
B. Bootstrap Aggregating
C. Binary Aggregating
D. Backward Aggregating
Correct Answer: Bootstrap Aggregating
Explanation: Bagging stands for Bootstrap Aggregating.
5. Which statistical technique involves sampling data subsets with replacement?
A. Jackknife
B. Bootstrapping
C. Cross-Validation
D. Stratification
Correct Answer: Bootstrapping
Explanation: Bootstrapping is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.
6. In a Random Forest, which two randomization techniques are combined?
A. Boosting and Bagging
B. Bootstrap sampling and random feature selection
C. Grid Search and Random Search
D. L1 and L2 regularization
Correct Answer: Bootstrap sampling and random feature selection
Explanation: Random Forest uses Bagging (bootstrap sampling) for rows and random feature selection for split candidates at each node to decorrelate trees.
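Both randomizations map directly to constructor arguments in scikit-learn; a minimal sketch (the dataset and parameter values here are illustrative, not part of the quiz):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,        # randomization 1: bootstrap-sample the rows for each tree
    max_features="sqrt",   # randomization 2: random feature subset at each split
    random_state=0,
).fit(X, y)
```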
7. If you are training a Bagging ensemble with $n$ samples, approximately what fraction of the samples is left out of a single bootstrap sample (Out-Of-Bag)?
A. $\approx 25\%$
B. $\approx 37\%$
C. $\approx 50\%$
D. $\approx 63\%$
Correct Answer: $\approx 37\%$
Explanation: The probability of a sample not being picked in any of the $n$ draws is $(1 - 1/n)^n$. As $n \to \infty$, this converges to $e^{-1} \approx 0.368$, i.e. roughly 37%.
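The limit is easy to verify empirically; a short numpy sketch (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)      # bootstrap: draw n row indices with replacement
left_out = n - np.unique(sample).size    # rows that were never drawn
print(left_out / n)                      # ~0.368, matching (1 - 1/n)^n -> 1/e
```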
8. Which of the following is true regarding the parallelization of Bagging and Boosting?
A. Both Bagging and Boosting can be easily parallelized.
B. Neither can be parallelized.
C. Bagging is easy to parallelize, whereas Boosting is inherently sequential.
D. Boosting is easy to parallelize, whereas Bagging is inherently sequential.
Correct Answer: Bagging is easy to parallelize, whereas Boosting is inherently sequential.
Explanation: In Bagging, trees are independent and can be trained simultaneously. In Boosting, each tree depends on the errors of the previous tree.
9. In AdaBoost, how are the weights of training instances updated after each iteration?
A. Weights are kept constant throughout training.
B. Misclassified instances are given higher weights.
C. Correctly classified instances are given higher weights.
D. Weights are assigned randomly.
Correct Answer: Misclassified instances are given higher weights.
Explanation: AdaBoost increases the weights of misclassified samples so the next weak learner focuses more on the 'hard' cases.
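A numpy sketch of one round of the discrete binary AdaBoost reweighting rule (the toy labels below are illustrative):

```python
import numpy as np

def update_weights(w, y_true, y_pred):
    """One AdaBoost round: upweight misclassified samples, then renormalize."""
    miss = y_true != y_pred
    eps = np.average(miss, weights=w)               # weighted error of this learner
    alpha = 0.5 * np.log((1 - eps) / eps)           # the learner's vote weight
    w = w * np.exp(np.where(miss, alpha, -alpha))   # raise hard cases, shrink easy ones
    return w / w.sum(), alpha

w = np.full(4, 0.25)
w, alpha = update_weights(w, np.array([1, 1, -1, -1]), np.array([1, -1, -1, -1]))
print(w)   # the single misclassified sample now carries weight 0.5
```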
10. What is the main difference between AdaBoost and Gradient Boosting?
A. AdaBoost minimizes the loss function using gradient descent, while Gradient Boosting uses weighted voting.
B. AdaBoost changes sample weights, while Gradient Boosting fits the new predictor to the residual errors of the previous predictor.
C. AdaBoost cannot be used for regression, while Gradient Boosting can.
D. There is no difference; they are synonyms.
Correct Answer: AdaBoost changes sample weights, while Gradient Boosting fits the new predictor to the residual errors of the previous predictor.
Explanation: Gradient Boosting generalizes boosting by training on residuals (gradients of the loss function), whereas AdaBoost specifically reweights samples.
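The residual-fitting loop is only a few lines; a minimal sketch for squared loss (data, tree depth, and learning rate are illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

eta, trees = 0.1, []
F = np.full_like(y, y.mean())              # F_0: constant initial prediction
for _ in range(100):
    residuals = y - F                      # negative gradient of squared loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + eta * h.predict(X)             # F_m = F_{m-1} + eta * h_m
    trees.append(h)
```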
11. In the context of Stacking, what is a Meta-Learner?
A. The first layer of base models
B. A model that learns how to combine the predictions of the base models
C. A model used for hyperparameter tuning
D. A specific type of Deep Neural Network
Correct Answer: A model that learns how to combine the predictions of the base models
Explanation: Stacking involves a second-level model (meta-learner) that takes the outputs of the first-level models as input features to make the final prediction.
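scikit-learn exposes this directly; a sketch where the base models and meta-learner are illustrative choices:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("tree", DecisionTreeClassifier(max_depth=3))],
    final_estimator=LogisticRegression(),   # the meta-learner
    cv=5,  # base predictions are generated out-of-fold to avoid leakage
)
```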
12. Which ensemble method is mathematically represented by $F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$, where $\eta$ is the learning rate?
A. Random Forest
B. Gradient Boosting
C. Hard Voting
D. Stacking
Correct Answer: Gradient Boosting
Explanation: This equation represents additive modeling in Gradient Boosting, where a new weak learner $h_m$ is added to the previous ensemble prediction, scaled by the learning rate.
13. What is the primary risk when using Boosting with a large number of iterations (trees)?
A. Underfitting
B. Overfitting
C. High bias
D. Vanishing gradients
Correct Answer: Overfitting
Explanation: Because Boosting focuses on reducing bias and correcting errors, running it for too many iterations can lead to overfitting the training noise.
14. What is Hard Voting in ensemble classifiers?
A. Averaging the probabilities of all classifiers
B. Taking the majority class prediction as the final output
C. Weighting votes based on classifier confidence
D. Using a meta-model to decide the vote
Correct Answer: Taking the majority class prediction as the final output
Explanation: Hard voting predicts the class that receives the largest number of votes from the individual classifiers.
15. What is Soft Voting?
A. Predicting the class with the highest summed predicted probability across classifiers
B. Predicting the class with the most votes
C. Randomly selecting a classifier's output
D. Using a soft-margin SVM as the ensemble
Correct Answer: Predicting the class with the highest summed predicted probability across classifiers
Explanation: Soft voting averages the class probabilities (confidence) predicted by the base estimators and selects the class with the highest average probability.
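Both modes are one argument apart in scikit-learn; a sketch with illustrative estimator choices (note that SVC needs probability=True to expose predict_proba):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

estimators = [("lr", LogisticRegression()),
              ("knn", KNeighborsClassifier()),
              ("svm", SVC(probability=True))]  # probability=True enables predict_proba

hard = VotingClassifier(estimators, voting="hard")  # majority class wins
soft = VotingClassifier(estimators, voting="soft")  # highest average probability wins
```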
16. Why is diversity important in an ensemble?
A. It ensures all models are identical.
B. It allows models to make independent errors, which cancel out when aggregated.
C. It increases the bias of the ensemble.
D. It simplifies the hyperparameter tuning process.
Correct Answer: It allows models to make independent errors, which cancel out when aggregated.
Explanation: If all models make the same errors, averaging them yields no benefit. Diversity ensures errors are uncorrelated, reducing overall variance.
17. Which of the following is NOT a Hyperparameter?
A. The depth of a decision tree
B. The number of neighbors ($k$) in KNN
C. The weights learned by a linear regression model
D. The learning rate in Gradient Descent
Correct Answer: The weights learned by a linear regression model
Explanation: Weights are model parameters learned during training. Hyperparameters are configuration settings external to the model, defined before training.
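The distinction is visible in code; a tiny sketch with toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X, y = np.array([[0.0], [1.0], [2.0]]), np.array([1.0, 3.0, 5.0])

model = LinearRegression(fit_intercept=True)  # hyperparameter: set before training
model.fit(X, y)
print(model.coef_, model.intercept_)          # parameters: learned during training
```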
18. What is the primary purpose of Hyperparameter Tuning?
A. To train the model parameters like weights and biases
B. To select the optimal configuration for the learning algorithm to maximize performance
C. To clean the dataset
D. To visualize the results
Correct Answer: To select the optimal configuration for the learning algorithm to maximize performance
Explanation: Hyperparameter tuning optimizes the settings that govern the learning process to prevent overfitting/underfitting and improve accuracy.
19. How does Grid Search work?
A. It randomly samples hyperparameters from a distribution.
B. It exhaustively tries every combination of a specified list of values for hyperparameters.
C. It uses gradient descent to find optimal hyperparameters.
D. It manually asks the user to input values during training.
Correct Answer: It exhaustively tries every combination of a specified list of values for hyperparameters.
Explanation: Grid Search defines a grid of hyperparameter values and evaluates the model performance for every possible combination in that grid.
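A GridSearchCV sketch (grid values, estimator, and scorer are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
# search.fit(X, y); search.best_params_ then holds the winning combination
```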
20. What is the major drawback of Grid Search?
A. It does not find the optimal parameters.
B. It suffers from the Curse of Dimensionality (computationally expensive with many parameters).
C. It is difficult to implement.
D. It only works for Decision Trees.
Correct Answer: It suffers from the Curse of Dimensionality (computationally expensive with many parameters).
Explanation: The number of combinations grows exponentially with the number of hyperparameters, making Grid Search extremely slow for high-dimensional spaces.
21. How does Random Search differ from Grid Search?
A. It checks more combinations than Grid Search.
B. It samples a fixed number of parameter settings from specified distributions.
C. It is always slower than Grid Search.
D. It guarantees finding the global optimum.
Correct Answer: It samples a fixed number of parameter settings from specified distributions.
Explanation: Random Search picks random combinations from the search space, which is often more efficient than Grid Search for finding good parameters in high-dimensional spaces.
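A RandomizedSearchCV sketch; the distributions and budget are illustrative:

```python
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {"learning_rate": loguniform(1e-3, 1e0),  # continuous distribution
              "n_estimators": randint(50, 500)}        # discrete distribution
search = RandomizedSearchCV(GradientBoostingClassifier(), param_dist,
                            n_iter=30, cv=5, random_state=0)  # fixed budget of 30 draws
```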
22. According to Bergstra and Bengio, why is Random Search often more efficient than Grid Search?
A. Because random numbers are faster to generate.
B. Because usually only a few hyperparameters are actually important for model performance.
C. Because Grid Search introduces bias.
D. Because Random Search uses deep learning.
Correct Answer: Because usually only a few hyperparameters are actually important for model performance.
Explanation: In high dimensions, not all parameters affect the objective function equally. Random search explores unique values for important parameters more effectively than a grid.
23. What happens if we tune hyperparameters on the Test Set?
A. The model will generalize better.
B. Information leakage occurs, leading to an optimistic bias in performance estimation.
C. The training time decreases.
D. Nothing; this is standard practice.
Correct Answer: Information leakage occurs, leading to an optimistic bias in performance estimation.
Explanation: The test set must remain unseen. Tuning on it leaks test-data information into the model configuration, invalidating the test set as an unbiased estimate of generalization performance.
24. Which technique is commonly used alongside Grid Search to evaluate the performance of each parameter combination?
A. Standardization
B. K-Fold Cross-Validation
C. Principal Component Analysis
D. Clustering
Correct Answer: K-Fold Cross-Validation
Explanation: To ensure the hyperparameter performance isn't specific to one random split of data, Cross-Validation is used to average performance across multiple folds.
25. In a Bagging classifier, if the base models are unstable (e.g., fully grown Decision Trees), what is the expected outcome?
A. The ensemble will perform worse than a single model.
B. The ensemble will significantly reduce variance and improve accuracy.
C. The ensemble will increase bias significantly.
D. Bagging cannot be used with unstable models.
Correct Answer: The ensemble will significantly reduce variance and improve accuracy.
Explanation: Bagging works best with unstable, high-variance models. By averaging them, the variance is smoothed out.
26. If you perform a Grid Search with Parameter A = [1, 2, 3], Parameter B = [10, 20], and 5-fold Cross-Validation, how many total training runs are executed?
A. 6
B. 10
C. 30
D. 60
Correct Answer: 30
Explanation: Total runs = (number of combinations) × (number of folds). Combinations = 3 × 2 = 6. Total runs = 6 × 5 = 30.
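The count can be checked with scikit-learn's ParameterGrid helper:

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({"A": [1, 2, 3], "B": [10, 20]})
print(len(grid))       # 6 combinations
print(len(grid) * 5)   # 30 training runs with 5-fold CV
```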
27. What is Stacking usually vulnerable to if not implemented correctly with cross-validation?
A. Underfitting
B. Data leakage / Overfitting on the training data
C. High bias
D. Convergence failure
Correct Answer: Data leakage / Overfitting on the training data
Explanation: If the meta-learner is trained on the same data used to train base learners (without cross-validated prediction generation), it will learn to rely on the base learners' overfitting.
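One safe pattern is cross_val_predict, which yields out-of-fold predictions; a sketch with an illustrative dataset and base model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
# Each row's meta-feature comes from a model that never saw that row.
meta = cross_val_predict(DecisionTreeClassifier(), X, y, cv=5,
                         method="predict_proba")
# `meta` (shape [n_samples, n_classes]) is a leakage-free input for the meta-learner.
```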
28. In Gradient Boosting, what is the role of the Learning Rate (shrinkage)?
A. It controls the size of the tree.
B. It scales the contribution of each tree; lower values require more trees but improve generalization.
C. It determines the number of features to select.
D. It sets the random seed.
Correct Answer: It scales the contribution of each tree; lower values require more trees but improve generalization.
Explanation: The learning rate ($\eta$) scales the update $F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$. Smaller steps lead to better convergence but require more iterations.
29. Which of the following ensemble methods uses Decision Stumps as the default base estimator?
A. Random Forest
B. Bagging
C. AdaBoost
D. Stacking
Correct Answer: AdaBoost
Explanation: AdaBoost typically uses decision stumps (trees with a depth of 1) as weak learners.
30. What is the key difference between Stacking and Blending?
A. Stacking typically uses cross-validated predictions for the meta-learner; Blending uses a hold-out validation set.
B. Blending is an older name for Bagging.
C. Stacking is parallel; Blending is sequential.
Correct Answer: Stacking typically uses cross-validated predictions for the meta-learner; Blending uses a hold-out validation set.
Explanation: Blending is a simplified version of stacking where predictions for the meta-learner are generated from a hold-out set rather than full k-fold cross-validation.
31. When performing hyperparameter tuning for a Decision Tree, which parameter typically controls overfitting?
A. Max Depth
B. Criterion (Gini/Entropy)
C. Random State
D. Splitter (Best/Random)
Correct Answer: Max Depth
Explanation: Limiting the Max Depth prevents the tree from growing too complex and memorizing noise in the training data.
32. Which theoretical theorem states that if individual classifiers are independent and better than random guessing, the ensemble accuracy approaches 1 as the number of classifiers increases?
A. Bayes Theorem
B. Condorcet's Jury Theorem
C. Central Limit Theorem
D. No Free Lunch Theorem
Correct Answer: Condorcet's Jury Theorem
Explanation: Condorcet's Jury Theorem provides the mathematical justification for why combining weak independent learners results in a strong learner.
33. In Random Forest, increasing the number of trees typically:
A. Causes overfitting.
B. Decreases the variance up to a point without significantly increasing overfitting.
C. Increases the bias significantly.
D. Makes the model faster to train.
Correct Answer: Decreases the variance up to a point without significantly increasing overfitting.
Explanation: Unlike Boosting, adding more trees to a Random Forest does not lead to overfitting; the error rate usually stabilizes.
34. Which method is best suited if you have high-variance models (e.g., unpruned decision trees)?
A. Boosting
B. Bagging
C. Linear Regression
D. Logistic Regression
Correct Answer: Bagging
Explanation: Bagging reduces variance by averaging, making it ideal for high-variance base models.
35. Which method is best suited if you have high-bias models (e.g., shallow trees)?
A. Boosting
B. Bagging
C. Naive Bayes
D. Clustering
Correct Answer: Boosting
Explanation: Boosting turns weak learners (high bias) into strong learners by sequentially correcting errors.
36. What is the OOB (Out-Of-Bag) Error used for?
A. To calculate the gradient in boosting.
B. To estimate the generalization error of a Bagging ensemble without needing a separate validation set.
C. To select features in Grid Search.
D. To stop the training early.
Correct Answer: To estimate the generalization error of a Bagging ensemble without needing a separate validation set.
Explanation: Since ~37% of data is not seen by each tree, these samples can be used as a built-in validation set to evaluate performance.
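In scikit-learn this is a single flag; a sketch with an illustrative dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=200, bootstrap=True,
                            oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)   # accuracy estimated on each tree's out-of-bag rows
```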
37. In the context of hyperparameter tuning, what is a continuous hyperparameter?
A. Number of trees
B. Depth of a tree
C. Learning rate ($\eta$)
D. Number of features
Correct Answer: Learning rate ($\eta$)
Explanation: Integer values like depth or number of trees are discrete. The learning rate is a floating-point value and is therefore continuous.
38. Why might one choose XGBoost over standard Gradient Boosting?
A. XGBoost is slower.
B. XGBoost includes regularization (L1/L2) and is optimized for speed/scalability.
C. XGBoost does not support regression.
D. XGBoost is a bagging technique.
Correct Answer: XGBoost includes regularization (L1/L2) and is optimized for speed/scalability.
Explanation: XGBoost (Extreme Gradient Boosting) is an optimized implementation that includes regularization terms in the objective function to control overfitting and supports parallel processing.
39. What is the Base Estimator in a heterogeneous Stacking ensemble?
A. It must be a Decision Tree.
B. It can be any supervised learning algorithm (SVM, KNN, Tree, etc.).
C. It must be the same algorithm with different hyperparameters.
D. It must be a Neural Network.
Correct Answer: It can be any supervised learning algorithm (SVM, KNN, Tree, etc.).
Explanation: Stacking thrives on heterogeneity; combining different types of algorithms often yields better results than combining variations of the same algorithm.
40. Which search strategy uses probability to choose the next set of hyperparameters based on past results (e.g., using Gaussian Processes)?
A. Grid Search
B. Random Search
C. Bayesian Optimization
D. Exhaustive Search
Correct Answer: Bayesian Optimization
Explanation: Bayesian Optimization builds a probabilistic model of the function mapping hyperparameters to a target objective to select the most promising hyperparameters to evaluate next.
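A sketch assuming the third-party scikit-optimize package is installed; the toy objective below stands in for a real cross-validation score:

```python
from skopt import gp_minimize
from skopt.space import Real

def objective(params):
    (learning_rate,) = params
    # In practice: train a model with this learning rate and return a validation loss.
    return (learning_rate - 0.1) ** 2   # toy stand-in

result = gp_minimize(objective,                               # function to minimize
                     [Real(1e-4, 1.0, prior="log-uniform")],  # search space
                     n_calls=25, random_state=0)              # evaluation budget
print(result.x)   # hyperparameter values with the best observed objective
```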
41. In Grid Search, if the optimal value lies between two grid points, the method will:
A. Automatically interpolate to find it.
B. Fail.
C. Select the closest defined grid point.
D. Switch to Random Search.
Correct Answer: Select the closest defined grid point.
Explanation: Grid Search is restricted to the specific values provided in the grid. It cannot find values that were not explicitly requested.
42. Which of the following is an advantage of Ensemble Methods?
A. Interpretability (easy to explain distinct rules).
B. Compactness (small model size).
C. Robustness and Stability.
D. Low training time.
Correct Answer: Robustness and Stability.
Explanation: Ensembles are generally more robust to noise and outliers than single models. However, they lose interpretability and are computationally heavier.
43. In a Voting Classifier, what requirement must be met to use Soft Voting?
A. The base classifiers must support the predict_proba method.
B. The base classifiers must be Decision Trees.
C. The data must be linearly separable.
D. There must be an odd number of classifiers.
Correct Answer: The base classifiers must support the predict_proba method.
Explanation: Soft voting relies on averaging predicted probabilities, so the underlying models must be able to output probability estimates.
44. When defining a parameter grid for SVM, which parameters are commonly tuned?
A. $\alpha$ and $\beta$
B. $C$ and $\gamma$ (Gamma)
C. $k$ and distance metric
D. Learning rate and momentum
Correct Answer: $C$ and $\gamma$ (Gamma)
Explanation: $C$ controls the regularization (margin hardness) and $\gamma$ defines the influence of a single training example in RBF kernels.
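A typical grid (the specific values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.001, 0.01, 0.1, "scale"]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
```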
45. What is the concept of Feature Subsampling in Gradient Boosting?
A. Removing features that are not important.
B. Using only a random fraction of features at each split or tree construction to reduce variance.
C. Using PCA before training.
D. Manually selecting features.
Correct Answer: Using only a random fraction of features at each split or tree construction to reduce variance.
Explanation: Similar to Random Forests, Stochastic Gradient Boosting can subsample columns (features) to decorrelate trees and reduce overfitting.
46. A Random Forest has $p$ features in total. For classification, what is the recommended number of features to search at each split?
A. $p$
B. $\sqrt{p}$
C. $p/2$
D. $\log_2 p$
Correct Answer: $\sqrt{p}$
Explanation: A common heuristic for the number of features to consider at each split in Random Forest classification is the square root of the total number of features.
47. Why is Accuracy sometimes a poor metric to optimize during hyperparameter tuning?
A. It is computationally expensive to calculate.
B. It is not differentiable.
C. In imbalanced datasets, it can be misleading (e.g., predicting the majority class exclusively).
D. Grid search does not support accuracy.
Correct Answer: In imbalanced datasets, it can be misleading (e.g., predicting the majority class exclusively).
Explanation: In skewed classes, a model can achieve high accuracy by ignoring the minority class. Metrics like F1-score or AUC-ROC are often better targets for tuning.
48. In Stacking, the Level-0 models are:
A. The meta-learners.
B. The base models trained on the original dataset.
C. The models used for feature selection.
D. The final output layer.
Correct Answer: The base models trained on the original dataset.
Explanation: Level-0 refers to the base models. Level-1 refers to the meta-learner that stacks the predictions of Level-0.
49. Which component of the error does Random Forest specifically aim to keep low compared to a single Decision Tree?
A. Bias
B. Variance
C. Noise
D. Computation time
Correct Answer: Variance
Explanation: A single deep decision tree has low bias but high variance. Random Forest averages many such trees to lower the variance while maintaining low bias.
50. When using Random Search, if you increase the number of iterations:
A. The probability of finding the optimal parameters decreases.
B. The computational cost decreases.
C. The probability of finding a near-optimal combination increases.
D. The search space shrinks.
Correct Answer: The probability of finding a near-optimal combination increases.
Explanation: More iterations mean more samples from the hyperparameter space, increasing the likelihood of hitting a high-performance configuration.