1. What is the primary motivation behind using Ensemble Learning methods compared to single individual models?
A. To increase the computational speed of training
B. To improve the predictive performance by combining the strengths of multiple models
C. To reduce the number of features required for training
D. To simplify the interpretability of the final model
Correct Answer: To improve the predictive performance by combining the strengths of multiple models
Explanation: Ensemble learning combines several base models to produce one optimal predictive model. The goal is to reduce generalization error and improve robustness over a single estimator.
2. In the context of ensemble learning, what is the Bias-Variance Trade-off implication for Bagging?
A. Bagging primarily reduces bias while variance remains high
B. Bagging primarily reduces variance while bias remains unchanged
C. Bagging increases both bias and variance
D. Bagging reduces bias but significantly increases variance
Correct Answer: Bagging primarily reduces variance while bias remains unchanged
Explanation: Bagging (Bootstrap Aggregation) averages the predictions of high-variance, low-bias models (like deep decision trees), effectively reducing the variance without significantly increasing bias.
3. Which of the following describes Hard Voting in a classification ensemble?
A. Averaging the predicted probabilities of all classifiers
B. Weighting the votes based on the confidence of the classifier
C. Selecting the class that receives the majority of votes from the individual classifiers
D. Using a meta-classifier to learn from the predictions of base classifiers
Correct Answer: Selecting the class that receives the majority of votes from the individual classifiers
Explanation: Hard voting involves predicting the class label that gets the majority of votes (the mode) from the base classifiers. Soft voting involves averaging probabilities.
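The majority-vote rule above can be sketched in a few lines of plain Python (a minimal illustration, independent of any library; the function name hard_vote is ours):

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class label predicted by the majority of classifiers.

    predictions: list of class labels, one per base classifier.
    """
    # Counter.most_common(1) gives the (label, count) pair with the
    # highest count -- i.e. the mode of the votes.
    return Counter(predictions).most_common(1)[0][0]

# Three classifiers vote on one sample: two say "spam", one says "ham".
print(hard_vote(["spam", "ham", "spam"]))  # -> spam
```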
4. What is Bootstrapping in the context of Bagging?
A. Sampling data subsets without replacement
B. Sampling data subsets with replacement
C. Sampling features only without touching rows
D. Training models sequentially on residuals
Correct Answer: Sampling data subsets with replacement
Explanation: Bootstrapping is a statistical method that relies on random sampling with replacement. In Bagging, multiple datasets are created by sampling with replacement from the original data.
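Sampling with replacement can be sketched directly with the standard library (a toy illustration; the function name bootstrap_sample is ours):

```python
import random

def bootstrap_sample(data, rng):
    # Sampling *with replacement*: the same row can appear multiple
    # times, and some rows may be left out entirely.
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(42)
data = list(range(10))
sample = bootstrap_sample(data, rng)
print(sample)  # same length as data; duplicates are allowed
```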
5. In a Random Forest, how does the algorithm introduce randomness beyond simple Bagging?
A. By using different loss functions for each tree
B. By selecting a random subset of features to consider for each split
C. By randomly pruning the trees after construction
D. By assigning random weights to data points
Correct Answer: By selecting a random subset of features to consider for each split
Explanation: Random Forest de-correlates the trees by considering only a random subset of features (typically √p out of the p total features for classification) at each candidate split, rather than searching all features.
6. What is the Out-of-Bag (OOB) Error in Random Forests?
A. The error calculated on a separate validation set
B. The training error averaged across all trees
C. The error calculated using data points that were not included in the bootstrap sample for a specific tree
D. The error resulting from missing values in the dataset
Correct Answer: The error calculated using data points that were not included in the bootstrap sample for a specific tree
Explanation: Since bootstrapping samples with replacement, about one-third of the data (more precisely, about 1/e ≈ 37%) is left out (out-of-bag) for each tree. These samples can be used to estimate generalization error without a separate validation set.
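The "about one-third left out" figure is easy to verify empirically with a quick simulation (a sketch, not tied to any library):

```python
import random

rng = random.Random(0)
n = 100_000

# Draw one bootstrap sample of size n (indices sampled with replacement)...
in_bag = {rng.randrange(n) for _ in range(n)}

# ...and measure the fraction of indices that were never drawn.
oob_fraction = 1 - len(in_bag) / n
print(round(oob_fraction, 3))  # close to 1 - 1/e ~= 0.368
```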
7. Which algorithm is best described as an ensemble method that trains predictors sequentially, where each new predictor tries to correct the errors of its predecessor?
A. Bagging
B. Random Forest
C. Boosting
D. Stacking
Correct Answer: Boosting
Explanation: Boosting is a sequential technique where subsequent models focus on the observations that were misclassified or had high errors in the previous models.
8. In AdaBoost (Adaptive Boosting), how are the weights of misclassified instances updated?
A. Their weights are decreased
B. Their weights represent the average of neighbors
C. Their weights are increased so the next classifier focuses on them
D. Their weights remain constant throughout the process
Correct Answer: Their weights are increased so the next classifier focuses on them
Explanation: AdaBoost increases the weights of misclassified samples and decreases the weights of correctly classified samples, forcing the next weak learner to focus on the difficult cases.
9. What is the typical Base Estimator used in standard AdaBoost?
A. Deep Decision Trees
B. Decision Stumps (trees with depth 1)
C. Linear Regression models
D. Support Vector Machines with RBF kernels
Correct Answer: Decision Stumps (trees with depth 1)
Explanation: AdaBoost typically uses Decision Stumps (one-level decision trees) as weak learners because they are simple, fast, and act as high-bias, low-variance classifiers.
10. Which loss function does AdaBoost essentially minimize?
A. Mean Squared Error (MSE)
B. Hinge Loss
C. Exponential Loss (exp(−y·f(x)))
D. Log Loss
Correct Answer: Exponential Loss (exp(−y·f(x)))
Explanation: AdaBoost can be shown to minimize the exponential loss function, which penalizes incorrect predictions exponentially.
11. How does Gradient Boosting differ from AdaBoost?
A. Gradient Boosting uses parallel processing, AdaBoost is sequential
B. Gradient Boosting fits new models to the residuals (negative gradients) of the previous model
C. Gradient Boosting only works for regression, AdaBoost only for classification
D. Gradient Boosting cannot use decision trees
Correct Answer: Gradient Boosting fits new models to the residuals (negative gradients) of the previous model
Explanation: While AdaBoost reweights data points, Gradient Boosting trains the next learner on the residual errors (pseudo-residuals) of the ensemble so far.
12. In Gradient Boosting, what is the role of the Learning Rate (shrinkage)?
A. It determines the maximum depth of the trees
B. It scales the contribution of each new tree to the final prediction
C. It sets the number of cross-validation folds
D. It controls the ratio of features sampled
Correct Answer: It scales the contribution of each new tree to the final prediction
Explanation: The learning rate (η) scales the output of each new tree (e.g., F_m(x) = F_{m−1}(x) + η·h_m(x)). Lower learning rates generally require more trees but lead to better generalization.
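The shrinkage update can be illustrated with a toy boosting loop for the squared loss, where each "weak learner" simply predicts the mean of the current residuals (a deliberately simplified sketch, not a real tree-based booster):

```python
# Toy gradient boosting: each stage h_m is just the mean of the
# current residuals, scaled by the learning rate before being added.
y = [3.0, 5.0, 7.0, 9.0]
learning_rate = 0.1
prediction = 0.0          # F_0: start from a constant model

for _ in range(200):      # n_estimators
    residuals = [yi - prediction for yi in y]
    weak_learner = sum(residuals) / len(residuals)    # h_m
    prediction += learning_rate * weak_learner        # F_m = F_{m-1} + eta * h_m

print(round(prediction, 4))  # converges toward mean(y) = 6.0
```

A smaller learning rate makes each step smaller, so more stages are needed to reach the same fit, which matches the "lower rate, more trees" trade-off above.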
13. Which regularization technique is native to XGBoost but not standard Gradient Boosting (sklearn implementation)?
A. L1 and L2 regularization on leaf weights
B. Tree pruning based on max depth
C. Minimum samples per leaf
D. Bootstrap sampling
Correct Answer: L1 and L2 regularization on leaf weights
Explanation: XGBoost includes regularization terms (L1/alpha and L2/lambda) in its objective function to control model complexity and prevent overfitting.
14. How does LightGBM typically grow its trees, differing from XGBoost's level-wise approach?
A. Depth-wise (column-wise)
B. Leaf-wise (best-first)
C. Breadth-first
D. Random growth
Correct Answer: Leaf-wise (best-first)
Explanation: LightGBM uses a leaf-wise growth strategy, splitting the leaf with the maximum loss reduction, which often results in lower loss but can overfit on small datasets.
15. What is the primary feature of CatBoost that distinguishes it from other boosting libraries?
A. It only works on CPUs
B. It handles categorical features automatically using Ordered Target Statistics
C. It uses neural networks as base learners
D. It requires one-hot encoding for all inputs
Correct Answer: It handles categorical features automatically using Ordered Target Statistics
Explanation: CatBoost is specifically designed to handle categorical variables efficiently without explicit preprocessing like One-Hot Encoding, using a technique called Ordered Target Statistics.
16. In Ensemble Regression, if you have predictions from N base regressor models, what is the simplest way to combine them?
A. Majority voting
B. Calculating the standard deviation
C. Simple Averaging (ŷ = (1/N) Σ ŷ_i)
D. Selecting the prediction with the highest value
Correct Answer: Simple Averaging (ŷ = (1/N) Σ ŷ_i)
Explanation: For regression, the simplest ensemble technique is averaging the predictions of the base models. Weighted averaging is also common.
17. When building an ensemble pipeline, why is it crucial to perform data preprocessing (e.g., scaling) inside the cross-validation loop?
A. To save memory
B. To prevent Data Leakage
C. To speed up the training process
D. To ensure the scaler fits to the test set
Correct Answer: To prevent Data Leakage
Explanation: Preprocessing on the whole dataset before splitting allows information from the test set to leak into the training process (e.g., mean/variance calculation). Doing it inside the loop ensures the scaler is fitted only on the training fold.
18. What is Grid Search in the context of hyperparameter tuning?
A. An optimization algorithm using derivatives
B. A technique that tries every combination of a preset list of values for hyperparameters
C. A method of randomly selecting hyperparameters
D. A manual process of guessing parameters
Correct Answer: A technique that tries every combination of a preset list of values for hyperparameters
Explanation: Grid Search exhaustively generates candidates from a grid of parameter values specified by the user and evaluates them.
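The "every combination" idea is just a Cartesian product over the grid; a minimal sketch (the grid values are illustrative, not recommendations):

```python
from itertools import product

# A small hyperparameter grid (illustrative values only).
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.01, 0.1],
}

# Grid search evaluates the Cartesian product of all listed values.
keys = list(param_grid)
combinations = [dict(zip(keys, values))
                for values in product(*param_grid.values())]
print(len(combinations))  # 3 * 2 * 2 = 12 candidate settings
```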
19. What is the main advantage of Random Search over Grid Search?
A. It guarantees finding the global minimum
B. It is computationally more expensive
C. It is often more efficient because not all hyperparameters are equally important
D. It checks every possible combination
Correct Answer: It is often more efficient because not all hyperparameters are equally important
Explanation: Random search samples parameter settings a fixed number of times. It is statistically more likely to find good values for the important parameters with fewer iterations than grid search in high-dimensional spaces.
20. Which method uses a probabilistic model (often a Gaussian Process) to model the objective function and decide which hyperparameters to evaluate next?
A. Grid Search
B. Random Search
C. Bayesian Optimization
D. Gradient Descent
Correct Answer: Bayesian Optimization
Explanation: Bayesian Optimization builds a surrogate probability model of the objective function and uses an acquisition function to choose the next hyperparameters to evaluate, balancing exploration and exploitation.
21. In K-Fold Cross-Validation, if K = 5, what percentage of the data is used for validation in each iteration?
A. 10%
B. 20%
C. 25%
D. 50%
Correct Answer: 20%
Explanation: In K-Fold CV, the data is split into K parts. In each iteration, 1 part is used for validation and K − 1 for training. 1/5 = 0.2, or 20%.
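The fold arithmetic can be made concrete with a small index generator (a sketch; the helper name kfold_indices is ours, and real implementations usually shuffle first):

```python
def kfold_indices(n, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        # Training indices are everything outside the validation block.
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

folds = list(kfold_indices(100, 5))
print(len(folds[0][1]))  # each validation fold holds 100/5 = 20 samples (20%)
```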
22. Which cross-validation strategy is recommended for imbalanced classification datasets?
A. Standard K-Fold
B. Leave-One-Out CV
C. Stratified K-Fold
D. TimeSeriesSplit
Correct Answer: Stratified K-Fold
Explanation: Stratified K-Fold ensures that each fold of the dataset has the same proportion of observations with a given label as the whole dataset, which is critical for imbalanced classes.
23. What is the relevance of Condorcet's Jury Theorem to Ensemble Learning?
A. It states that adding more weak learners always increases variance
B. It suggests that if individual classifiers are slightly better than random guessing and independent, the majority vote accuracy approaches 100% as the number of voters increases
C. It proves that Neural Networks are superior to Decision Trees
D. It defines the stopping criteria for Boosting
Correct Answer: It suggests that if individual classifiers are slightly better than random guessing and independent, the majority vote accuracy approaches 100% as the number of voters increases
Explanation: This theorem provides the theoretical foundation for why majority voting works, assuming independence and accuracy > 0.5 for base learners.
24. In Stacking (Stacked Generalization), what is the 'Meta-Learner' trained on?
A. The original raw features
B. The residuals of the base models
C. The predictions (outputs) of the base models
D. Random noise
Correct Answer: The predictions (outputs) of the base models
Explanation: In Stacking, base models predict on the dataset, and these predictions become the input features for the second-level model (meta-learner).
25. When using XGBoost, what does the parameter colsample_bytree control?
A. The fraction of rows to subsample
B. The learning rate
C. The fraction of columns (features) to be randomly sampled for each tree
D. The maximum depth of the tree
Correct Answer: The fraction of columns (features) to be randomly sampled for each tree
Explanation: colsample_bytree is the subsample ratio of columns when constructing each tree. It is similar to max_features in Random Forest.
26. Which of the following is a disadvantage of Random Forests compared to a single Decision Tree?
A. Lower accuracy
B. Higher risk of overfitting
C. Lack of model interpretability/visualizability
D. Inability to handle categorical data
Correct Answer: Lack of model interpretability/visualizability
Explanation: While a single decision tree is easily visualized and interpreted (white box), a Random Forest consists of hundreds of trees, making it a 'black box' model that is harder to interpret intuitively.
27. In the context of Pipelines, what is the purpose of the fit_transform() method?
A. It trains the model and makes predictions simultaneously
B. It fits the transformer to the data and then returns the transformed version of the data
C. It is used only for the final estimator in the pipeline
D. It transforms the data without learning any parameters
Correct Answer: It fits the transformer to the data and then returns the transformed version of the data
Explanation: fit_transform() is a convenience method that calls fit() and then transform() on the same data, commonly used on the training set during preprocessing.
28. What is Nested Cross-Validation used for?
A. To tune hyperparameters only
B. To estimate the generalization error of the model while performing hyperparameter tuning, preventing bias
C. To visualize the decision boundary
D. To handle missing values in time series
Correct Answer: To estimate the generalization error of the model while performing hyperparameter tuning, preventing bias
Explanation: Nested CV separates the hyperparameter tuning step (inner loop) from the error estimation step (outer loop) to prevent overfitting the hyperparameters to the validation set.
29. What technique does LightGBM use to bundle mutually exclusive features to reduce dimensionality?
A. Gradient-based One-Side Sampling (GOSS)
B. Exclusive Feature Bundling (EFB)
C. Principal Component Analysis (PCA)
D. Feature hashing
Correct Answer: Exclusive Feature Bundling (EFB)
Explanation: EFB is a technique in LightGBM that bundles mutually exclusive features (features that are rarely non-zero simultaneously) into a single feature to speed up training.
30. In Bayesian Optimization, what is an Acquisition Function?
A. The function that calculates the training error
B. A function that guides the search by determining which point to evaluate next based on the surrogate model
C. The actual cost function of the machine learning model
D. A function to acquire data from the database
Correct Answer: A function that guides the search by determining which point to evaluate next based on the surrogate model
Explanation: The acquisition function (e.g., Expected Improvement) uses the posterior distribution of the surrogate model to trade off exploration (high uncertainty) and exploitation (low predicted mean).
31. Which ensemble method is generally considered the fastest to train on large datasets among the following?
A. Standard Gradient Boosting (sklearn)
B. XGBoost (exact greedy algorithm)
C. LightGBM
D. Random Forest with 10000 trees
Correct Answer: LightGBM
Explanation: LightGBM uses histogram-based algorithms and GOSS to significantly speed up training and reduce memory usage compared to standard GBM or exact XGBoost.
32. What is the effect of increasing n_estimators (number of trees) in Random Forest?
A. It causes severe overfitting
B. It increases the variance of the model
C. It stabilizes the error rate but increases computational cost
D. It reduces the bias significantly
Correct Answer: It stabilizes the error rate but increases computational cost
Explanation: Unlike Boosting, Random Forests do not overfit as n_estimators increases. The error rate stabilizes, but the model becomes slower to train and predict.
33. What is the effect of increasing n_estimators in Boosting without adjusting the learning rate?
A. It always improves accuracy
B. It leads to overfitting
C. It decreases model complexity
D. It has no effect
Correct Answer: It leads to overfitting
Explanation: In Boosting, adding too many trees allows the model to learn the noise in the training data, leading to overfitting. This is mitigated by early stopping or a lower learning rate.
34. What does GOSS stand for in LightGBM?
A. Global Optimization Search Strategy
B. Gradient-based One-Side Sampling
C. Generalized Ordered Subset Selection
D. Gaussian Over-Sampling Strategy
Correct Answer: Gradient-based One-Side Sampling
Explanation: GOSS keeps instances with large gradients (large errors) and randomly samples instances with small gradients, maintaining accuracy while reducing data size.
35. If an ensemble model uses Weighted Voting, how is the final class determined?
A. ŷ = mode(h_1(x), …, h_N(x))
B. ŷ = (1/N) Σ_i h_i(x)
C. ŷ = argmax_c Σ_i w_i · 1(h_i(x) = c)
D. Random selection
Correct Answer: ŷ = argmax_c Σ_i w_i · 1(h_i(x) = c)
Explanation: In weighted voting, the vote of classifier h_i is multiplied by its weight w_i. The class with the highest sum of weighted votes is selected.
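The weighted argmax can be written out in a few lines (a minimal sketch; the function name weighted_vote is ours):

```python
from collections import defaultdict

def weighted_vote(votes, weights):
    """votes: one class label per classifier; weights: one weight per classifier."""
    scores = defaultdict(float)
    for label, w in zip(votes, weights):
        scores[label] += w          # accumulate each classifier's weighted vote
    # The class with the highest total weighted vote wins.
    return max(scores, key=scores.get)

# Two weaker classifiers vote "cat", one strong classifier votes "dog".
print(weighted_vote(["cat", "cat", "dog"], [0.2, 0.2, 0.7]))  # -> dog
```

Note that with the weights above, a single well-trusted classifier outvotes two weaker ones, which is exactly what plain majority voting cannot express.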
36. Which evaluation metric is most appropriate for a regression ensemble model predicting house prices?
A. Accuracy
B. F1-Score
C. Root Mean Squared Error (RMSE)
D. ROC-AUC
Correct Answer: Root Mean Squared Error (RMSE)
Explanation: RMSE is a standard metric for regression that measures the average magnitude of the errors in the same units as the target variable.
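RMSE is simple enough to compute by hand (a sketch on made-up house prices; the function name rmse is ours):

```python
import math

def rmse(y_true, y_pred):
    # Errors are squared (penalising large mistakes), averaged,
    # then square-rooted back into the target's original units.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Two houses, each predicted $10,000 off in opposite directions.
print(rmse([200_000, 300_000], [210_000, 290_000]))  # -> 10000.0 (dollars)
```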
37. In Extremely Randomized Trees (ExtraTrees), how are splits chosen compared to Random Forest?
A. They calculate the optimal split for every feature
B. They select cut-points completely randomly for each feature and pick the best among them
C. They use the entire dataset instead of bootstrapping
D. They use Gradient Descent to find splits
Correct Answer: They select cut-points completely randomly for each feature and pick the best among them
Explanation: ExtraTrees adds more randomness by choosing random cut-points for features instead of searching for the optimal cut-point, further reducing variance.
38. What is Early Stopping in the context of training Gradient Boosting models?
A. Stopping training when the training error reaches zero
B. Stopping training when the validation score stops improving for a specified number of rounds
C. Stopping training after the first tree is built
D. Stopping training when CPU usage is too high
Correct Answer: Stopping training when the validation score stops improving for a specified number of rounds
Explanation: Early stopping is a regularization technique where training is halted if performance on a hold-out validation set fails to improve, preventing overfitting.
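The patience-based stopping rule can be sketched as a small pure-Python loop over a (made-up) sequence of validation scores; the function name early_stopping_round is ours:

```python
def early_stopping_round(val_scores, patience=3):
    """Return the round at which training would stop (higher score = better)."""
    best, best_round = float("-inf"), 0
    for i, score in enumerate(val_scores):
        if score > best:
            best, best_round = score, i      # new best: reset the counter
        elif i - best_round >= patience:
            return i                          # no improvement for `patience` rounds
    return len(val_scores) - 1                # never triggered: used all rounds

# Validation score peaks at round 3, then plateaus and declines.
scores = [0.70, 0.75, 0.78, 0.80, 0.79, 0.79, 0.78, 0.77]
print(early_stopping_round(scores))  # -> 6 (three rounds after the round-3 peak)
```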
39. When performing Hyperparameter Tuning with sklearn, how do you access a parameter of a specific step in a Pipeline (e.g., n_estimators of a step named rf)?
A. rf.n_estimators
B. rf->n_estimators
C. rf__n_estimators
D. rf[n_estimators]
Correct Answer: rf__n_estimators
Explanation: Scikit-learn uses a double underscore convention (<step_name>__<parameter_name>) to access parameters of steps inside a pipeline for grid/random search.
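A minimal sketch of the naming convention (the step name "rf" and the grid values are illustrative):

```python
# A grid for a pipeline whose Random Forest step is named "rf":
# keys follow the <step_name>__<parameter_name> convention.
param_grid = {
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [None, 10],
}

# Scikit-learn splits each key at the first double underscore to route
# the value to the right pipeline step.
step, param = "rf__n_estimators".split("__", 1)
print(step, param)  # -> rf n_estimators
```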
40. Which of the following describes a Heterogeneous Ensemble?
A. An ensemble where all base learners are Decision Trees
B. An ensemble combining different types of algorithms (e.g., SVM + Naive Bayes + Decision Tree)
C. An ensemble used for regression only
D. An ensemble trained on different hardware
Correct Answer: An ensemble combining different types of algorithms (e.g., SVM + Naive Bayes + Decision Tree)
Explanation: Heterogeneous ensembles combine models from different hypothesis spaces/algorithms, whereas homogeneous ensembles (like Random Forest) use the same algorithm.
41. Why is Cross-Validation preferred over a single Train/Test split for model evaluation?
A. It is faster
B. It provides a more reliable estimate of model performance by reducing the variance associated with the data split
C. It eliminates the need for a test set completely
D. It automatically tunes hyperparameters
Correct Answer: It provides a more reliable estimate of model performance by reducing the variance associated with the data split
Explanation: A single split might be lucky or unlucky (biased). CV averages results over multiple splits, giving a better estimate of how the model performs on unseen data.
42. In CatBoost, what is the concept of Symmetric Trees?
A. Trees where the left child is always deeper than the right
B. Trees where the same split condition is applied to all nodes at the same depth
C. Trees that are mirror images of each other
D. Trees with only two leaves
Correct Answer: Trees where the same split condition is applied to all nodes at the same depth
Explanation: CatBoost builds oblivious (symmetric) trees where the splitting criterion is consistent across an entire level of the tree. This allows for very fast inference.
43. What is the main drawback of Grid Search when the dimensionality of the hyperparameter space is high?
A. It is too inaccurate
B. It suffers from the Curse of Dimensionality and becomes computationally infeasible
C. It cannot handle categorical parameters
D. It requires GPU acceleration
Correct Answer: It suffers from the Curse of Dimensionality and becomes computationally infeasible
Explanation: As the number of hyperparameters increases, the number of combinations in Grid Search grows exponentially, making it impractical.
44. Which method involves training a model on all observations except one and testing on that single held-out observation, repeated for every observation?
A. Holdout validation
B. Stratified K-Fold
C. Leave-One-Out Cross-Validation (LOOCV)
D. Bootstrapping
Correct Answer: Leave-One-Out Cross-Validation (LOOCV)
Explanation: LOOCV is K-Fold CV where K equals the number of observations (K = n). It is very computationally expensive for large datasets.
45. In the context of XGBoost, what is the purpose of the Gamma (γ) parameter?
A. It is the learning rate
B. It is the minimum loss reduction required to make a further partition on a leaf node
C. It is the maximum depth
D. It is the subsample ratio
Correct Answer: It is the minimum loss reduction required to make a further partition on a leaf node
Explanation: Gamma specifies the minimum loss reduction required to make a split. It acts as a pseudo-regularization parameter to control tree growth (pruning).
46. When using TimeSeriesSplit for cross-validation, how are the training and validation sets created?
A. Randomly shuffling time points
B. Successive training sets are supersets of those that come before them, preserving temporal order
C. Standard K-Fold split
D. Using future data to predict past data
Correct Answer: Successive training sets are supersets of those that come before them, preserving temporal order
Explanation: In time series, we cannot train on future data to predict the past. TimeSeriesSplit creates expanding windows of training data with the subsequent window as the test set.
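The expanding-window idea can be sketched without any library (the function name time_series_splits is ours, and the equal-sized-block layout is a simplification of what sklearn's TimeSeriesSplit does):

```python
def time_series_splits(n, n_splits):
    """Expanding-window splits: each training set extends the previous one."""
    fold = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * fold))              # everything up to the cut
        val = list(range(i * fold, (i + 1) * fold))   # the next block in time
        yield train, val

splits = list(time_series_splits(12, 3))
for train, val in splits:
    print(train, val)
# Training sets grow (superset of the previous one); validation always
# lies strictly in the future of its training data.
```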
47. Which of the following is an example of Automated Machine Learning (AutoML) capabilities regarding ensembles?
A. Manually tuning a Decision Tree
B. Automatically selecting, tuning, and stacking multiple models to form an ensemble
C. Writing a loop for Grid Search
D. Calculating the mean of a column
Correct Answer: Automatically selecting, tuning, and stacking multiple models to form an ensemble
Explanation: AutoML frameworks (like H2O, Auto-sklearn) automate the process of algorithm selection and hyperparameter tuning, often resulting in complex stacked ensembles.
48. In Soft Voting, if Classifier A predicts [0.9, 0.1] and Classifier B predicts [0.6, 0.4] for classes [0, 1], what is the averaged probability for Class 0?
A. 0.9
B. 0.75
C. 0.6
D. 1.5
Correct Answer: 0.75
Explanation: Soft voting averages the probabilities. For Class 0: (0.9 + 0.6) / 2 = 0.75.
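The same averaging can be written as a tiny helper (a sketch; the function name soft_vote is ours):

```python
def soft_vote(probabilities):
    """probabilities: one probability vector per classifier, same class order."""
    n = len(probabilities)
    # Average each class's probability across the classifiers.
    return [sum(p[c] for p in probabilities) / n
            for c in range(len(probabilities[0]))]

# Classifier A: [0.9, 0.1], Classifier B: [0.6, 0.4]
avg = soft_vote([[0.9, 0.1], [0.6, 0.4]])
print(avg)  # Class 0 average: (0.9 + 0.6) / 2 = 0.75
```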
49. Why might one choose Random Search or Bayesian Optimization over manual tuning?
A. Manual tuning is always superior due to human intuition
B. To remove human bias and systematically explore the hyperparameter space more efficiently
C. Because manual tuning is not supported by Python libraries
D. To increase the bias of the model
Correct Answer: To remove human bias and systematically explore the hyperparameter space more efficiently
Explanation: Automated tuning methods explore the space more systematically and often find better non-intuitive combinations than manual trial and error.
50. What is the primary difference between Bagging and Pasting?
A. Bagging uses decision trees, Pasting uses SVMs
B. Bagging samples with replacement, Pasting samples without replacement
C. Bagging is for regression, Pasting is for classification
D. Pasting allows parallel processing, Bagging does not
Correct Answer: Bagging samples with replacement, Pasting samples without replacement
Explanation: Both are ensemble methods based on sampling. Bagging (Bootstrap Aggregation) samples with replacement, while Pasting samples without replacement.