Unit 3 - Notes
INT395
Unit 3: Ensemble Methods and Hyperparameter Tuning
1. Overview of Ensemble Models
1.1 Definition
Ensemble learning is a machine learning paradigm where multiple distinct models, known as "weak learners" or "base learners," are trained to solve the same problem and combined to get a single result. The primary hypothesis is that combining multiple models will produce a predictive performance that is superior to any single constituent model.
1.2 The "Wisdom of the Crowds"
The theoretical foundation of ensemble methods mimics the "wisdom of crowds." If you ask a single expert a complex question, they might make a mistake due to bias or specific knowledge gaps. However, if you ask 100 people and take the majority vote (classification) or the average (regression), individual errors tend to cancel each other out, leading to a more accurate collective decision.
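This intuition can be checked with a small simulation (the 60% per-voter accuracy and 101-voter crowd are illustrative assumptions, not figures from the notes):

```python
import random

random.seed(0)

def majority_vote_accuracy(n_voters, p_correct, trials=10000):
    """Estimate the accuracy of a majority vote of independent voters,
    each of whom is correct with probability p_correct."""
    wins = 0
    for _ in range(trials):
        correct_votes = sum(random.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / trials

single = majority_vote_accuracy(1, 0.6)     # one "expert", ~60% accurate
crowd = majority_vote_accuracy(101, 0.6)    # 101 independent voters
```

With independent voters, the crowd's majority vote is far more accurate than any single voter; note that this gain depends on the errors being uncorrelated, which is exactly why ensemble methods try to decorrelate their base learners.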
1.3 Types of Weak Learners
- Homogeneous Ensembles: Use a single type of base algorithm (e.g., a forest of only Decision Trees).
- Heterogeneous Ensembles: Use different types of algorithms (e.g., combining Logistic Regression, SVM, and k-NN).
2. How Ensembles Reduce Error
To understand how ensembles improve performance, one must understand the decomposition of prediction error.
2.1 Bias vs. Variance
- Bias: The error introduced by approximating a real-world problem with a simplified model. High bias causes underfitting (the model misses the relevant relations).
- Variance: The error introduced by the model's sensitivity to small fluctuations in the training set. High variance causes overfitting (modeling the random noise in the training data).
2.2 The Ensemble Solution
Ensemble methods target these error components specifically:
- Variance Reduction: Algorithms like Bagging average the predictions of multiple models. If the models are uncorrelated, averaging reduces the variance without increasing the bias.
- Analogy: Diversification in a stock portfolio reduces risk (variance) without necessarily lowering expected return.
- Bias Reduction: Algorithms like Boosting sequentially focus on the difficult observations that previous models got wrong. This increases the model's complexity and ability to capture underlying patterns, thereby reducing bias.
3. Bagging (Bootstrap Aggregating)
Bagging is an ensemble technique mainly used to reduce variance and prevent overfitting. It is most effective with high-variance base learners (like deep Decision Trees).
3.1 The Mechanism
Bagging involves two core steps:
- Bootstrapping: The creation of multiple subsets of the original training data using random sampling with replacement.
- If the original dataset has size n, each subset also has size n.
- Because of replacement, some data points may appear multiple times in a subset, while others are omitted entirely (roughly 37%, about 1/e, of the data is omitted on average; these omitted points are known as "Out-of-Bag" data).
- Aggregating: Independent models are trained on each bootstrap subset in parallel.
- Regression: The final prediction is the average of all model predictions.
- Classification: The final prediction is determined by majority voting (Hard Voting) or averaging probabilities (Soft Voting).
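The two steps above can be sketched in a few lines; the model predictions at the end are hypothetical numbers standing in for trained regressors:

```python
import random

random.seed(42)

n = 10000
data = list(range(n))

# Bootstrapping: sample n points with replacement
bootstrap = [random.choice(data) for _ in range(n)]

# Out-of-Bag fraction: points never drawn.
# Expected value is (1 - 1/n)**n, which approaches 1/e ~ 0.368.
oob_fraction = 1 - len(set(bootstrap)) / n

# Aggregating (regression): average the predictions of the models
predictions = [4.1, 3.9, 4.3, 4.0]          # hypothetical model outputs
bagged_prediction = sum(predictions) / len(predictions)
```

Running this confirms the "roughly 37% omitted" figure empirically; for classification, the averaging step would be replaced by a majority vote over predicted labels.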
3.2 Random Forest (An Extension of Bagging)
While Bagging can work with any algorithm, its most famous implementation is the Random Forest. Random Forest improves on standard Bagging by introducing feature randomness.
- Standard Bagging: Considers all features when splitting a node in a tree.
- Random Forest: At each split, selects a random subset of features. This "decorrelates" the trees, ensuring that strong predictors do not dominate every tree, which further reduces variance.
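A minimal sketch of the feature-randomness idea, assuming a hypothetical list of feature names and the common sqrt(n_features) default for classification:

```python
import random

random.seed(0)

features = ["age", "income", "height", "weight", "city", "score"]

def candidate_features(all_features, max_features):
    """Random Forest: at each split, consider only a random subset of
    features; standard bagging would consider all of them."""
    return random.sample(all_features, max_features)

# A common default for classification is sqrt(n_features): int(6**0.5) = 2
subset = candidate_features(features, int(len(features) ** 0.5))
```

Because each split sees a different random subset, a single dominant predictor cannot appear at the top of every tree, which is what decorrelates the forest.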
4. Boosting
Boosting is an ensemble technique that converts weak learners into strong learners by training models sequentially, not in parallel. Its primary goal is to reduce bias.
4.1 The Mechanism
- Sequential Training: Train a weak model M1 on the data.
- Error Analysis: Identify the instances where M1 made errors.
- Reweighting: Increase the importance (weight) of the misclassified instances.
- Iterate: Train model M2 on the reweighted data; M2 is forced to focus on the hard cases M1 missed.
- Combine: Combine the models using a weighted sum, where models with higher accuracy are given more influence.
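One round of this loop can be sketched with a single toy stump; the data and the stump are illustrative, while the update formulas are the standard AdaBoost ones:

```python
import math

# Toy data: each instance is (value, label)
data = [(1, +1), (2, +1), (3, -1), (4, -1), (5, +1)]
weights = [1 / len(data)] * len(data)   # start with uniform weights

def weak_learner(x):
    """A 'stump': predicts +1 for x <= 2, else -1 (misclassifies x = 5)."""
    return +1 if x <= 2 else -1

# Error analysis: weighted error of the weak learner
error = sum(w for (x, y), w in zip(data, weights) if weak_learner(x) != y)

# Model influence (AdaBoost): alpha = 0.5 * ln((1 - err) / err)
alpha = 0.5 * math.log((1 - error) / error)

# Reweighting: misclassified points gain weight, then renormalize
new_weights = [w * math.exp(-alpha * y * weak_learner(x))
               for (x, y), w in zip(data, weights)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]
```

After the update, the single misclassified point carries half of the total weight, so the next learner is strongly pushed to get it right.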
4.2 Popular Boosting Algorithms
A. AdaBoost (Adaptive Boosting)
- Uses "stumps" (trees with a single split) as base learners.
- After every iteration, it increases weights of misclassified data points and decreases weights of correctly classified ones.
- Sensitive to noisy data and outliers.
B. Gradient Boosting
- Instead of updating weights of data points, it trains subsequent models to predict the residuals (errors) of the previous models.
- Mathematically, it uses Gradient Descent to minimize a loss function by adding new trees that move the prediction closer to the target.
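A minimal sketch of fitting to residuals, assuming the simplest possible base learner (a constant predictor) so the arithmetic stays visible; real gradient boosting fits small trees to the per-point residuals instead:

```python
targets = [3.0, 5.0, 8.0, 12.0]
prediction = [0.0] * len(targets)
learning_rate = 0.5

for _ in range(20):
    # Residuals = negative gradient of the squared-error loss
    residuals = [y - p for y, p in zip(targets, prediction)]
    # "Fit" a trivial model to the residuals: predict their mean
    step = sum(residuals) / len(residuals)
    # Add the new model's (shrunken) output to the ensemble prediction
    prediction = [p + learning_rate * step for p in prediction]
```

With a constant base learner the ensemble can only converge to the mean of the targets (7.0 here); swapping in trees lets each stage reduce the individual residuals as well, which is the actual bias-reduction mechanism.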
C. XGBoost (Extreme Gradient Boosting)
- An optimized implementation of gradient boosting designed for speed and performance.
- Includes regularization (L1/L2) to prevent overfitting, making it a standard winner in ML competitions.
5. Stacking (Stacked Generalization)
Stacking is a heterogeneous ensemble method that combines multiple classification or regression models via a meta-classifier or meta-regressor.
5.1 The Architecture
Stacking typically involves two levels:
- Level 0 (Base Learners): A diverse set of models (e.g., SVM, Random Forest, Neural Net) are trained on the full training dataset.
- Level 1 (Meta Learner): The predictions made by the Level 0 models are used as input features for the Level 1 model. The Level 1 model learns how to best combine the predictions of the base models to predict the final target.
5.2 The Process
- Split the training data into k folds.
- Train the base models on k − 1 folds and predict on the held-out fold.
- Repeat until predictions exist for the entire dataset. These predictions form a "New Feature Matrix."
- Train the Meta-Model on this New Feature Matrix.
Key Difference: Unlike Bagging (voting) or Boosting (weighted sum), Stacking allows the meta-model to learn when to trust specific base models (e.g., "Trust the SVM when Feature X is high, but trust the Tree when Feature X is low").
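The fold-wise process above can be sketched as follows; the two base models are fixed stand-in functions (the real procedure would retrain them on the k − 1 training folds each round):

```python
data = [1, 2, 3, 4, 5, 6]                 # toy inputs
labels = [0, 0, 0, 1, 1, 1]
k = 3
fold_size = len(data) // k

def base_model_a(x):      # hypothetical base learner A
    return 1 if x > 3 else 0

def base_model_b(x):      # hypothetical base learner B
    return 1 if x > 4 else 0

# Out-of-fold predictions: each row of the new feature matrix holds the
# base models' predictions for an instance they did not train on.
new_feature_matrix = []
for fold in range(k):
    holdout = data[fold * fold_size:(fold + 1) * fold_size]
    for x in holdout:
        new_feature_matrix.append([base_model_a(x), base_model_b(x)])

# The meta-model would now be trained on new_feature_matrix vs labels.
```

Each row is a pair of base-model opinions, so the meta-model can learn rules like "trust model B only when model A also says 1", which is the key difference noted above.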
6. Need for Hyperparameter Tuning
6.1 Parameters vs. Hyperparameters
- Model Parameters: Internal variables learned from the training data automatically (e.g., coefficients in Linear Regression, weights in Neural Networks, split points in Decision Trees).
- Hyperparameters: Configuration settings used to tune the learning process itself. They are set before training begins and cannot be learned directly from the data (e.g., learning rate, k in k-NN, max depth of a tree, number of estimators in a Random Forest).
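A tiny illustration of the distinction, assuming a 1-D least-squares fit (its slope is a learned parameter) and a toy k-NN (its k is chosen by hand before training):

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Parameter: the slope is learned automatically from the data
# (least-squares fit through the origin).
slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hyperparameter: k is set by the practitioner, not learned.
k = 3

def knn_predict(x_new, k):
    """Average the targets of the k nearest training points."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

prediction = knn_predict(2.5, k)
```

Changing k changes the prediction without any retraining of learned quantities, whereas the slope can only change if the data changes.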
6.2 Why Tune?
- Controlling Model Complexity: Tuning adjusts the balance between Underfitting and Overfitting.
- Example: Increasing the "Max Depth" of a tree increases complexity (low bias/high variance). Tuning finds the sweet spot.
- Algorithm Specificity: Default values in libraries (like scikit-learn) are generic and rarely optimal for a specific dataset.
- Performance Maximization: Proper tuning allows the extraction of the maximum possible accuracy/F1-score from a chosen architecture.
7. Hyperparameter Search Strategies
Finding the optimal combination of hyperparameters is an optimization problem. Two common strategies are Grid Search and Random Search.
7.1 Grid Search
Grid Search is an exhaustive search technique.
- Methodology:
- Define a specific grid of hyperparameter values.
- The algorithm builds a model for every possible combination of these values.
- Evaluates performance using Cross-Validation.
- Selects the combination with the best score.
- Example:
- Parameter A options: [1, 10, 100]
- Parameter B options: [X, Y]
- Total models trained: 3 × 2 = 6 models.
- Pros:
- Guaranteed to find the best combination within the specified grid.
- Reproducible.
- Cons:
- Computationally Expensive: The cost grows exponentially with the number of parameters (Curse of Dimensionality).
- Inefficient if the optimal value lies between the grid points (e.g., if you check 10 and 20, but the best is 14).
7.2 Random Search
Random Search is a stochastic technique.
- Methodology:
- Define a statistical distribution (or a discrete list) for each hyperparameter.
- Select a fixed number of iterations (e.g., 100 trials).
- For each iteration, select a random combination of parameters.
- Evaluate via Cross-Validation and keep the best one.
- Pros:
- Efficiency: Can explore a wider range of values. Empirical studies show that Random Search is more efficient than Grid Search for high-dimensional spaces because not all hyperparameters are equally important.
- Can find values "between the grid points" of a standard grid search.
- Better computational budget control (you set the number of iterations).
- Cons:
- No guarantee of finding the global optimum.
- Results may vary slightly between runs (high variance).
7.3 Comparison Summary
| Feature | Grid Search | Random Search |
|---|---|---|
| Search Space | Discrete, user-defined points | Statistical distributions or lists |
| Completeness | Exhaustive (within grid) | Sampled (Non-exhaustive) |
| Cost | High (Exponential growth) | Lower (Linear control) |
| Best Used For | Few parameters, small range | Many parameters, unknown range |
7.4 Python Implementation Example (Concept)
```python
# Grid Search (pseudocode logic)
param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [50, 100]}
# Result: will train every combination, i.e. exactly 3 * 2 = 6 models.

# Random Search (pseudocode logic)
from scipy.stats import randint

param_dist = {'max_depth': randint(1, 20), 'n_estimators': randint(10, 200)}
# Result: will train 'n_iter' models (e.g., 20), picking random values
# from the distributions.
```
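A runnable sketch of both strategies, with a hypothetical score() standing in for cross-validated model performance (its peak near depth 7 and 120 trees is an invented example):

```python
import itertools
import random

random.seed(0)

def score(max_depth, n_estimators):
    """Hypothetical validation score, peaking near depth 7, 120 trees."""
    return -((max_depth - 7) ** 2) - ((n_estimators - 120) / 50) ** 2

# Grid Search: evaluate every combination of the listed values (3 * 2 = 6)
param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [50, 100]}
grid_best = max(itertools.product(param_grid['max_depth'],
                                  param_grid['n_estimators']),
                key=lambda p: score(*p))

# Random Search: n_iter random draws from wider ranges (20 evaluations)
n_iter = 20
candidates = [(random.randint(1, 20), random.randint(10, 200))
              for _ in range(n_iter)]
random_best = max(candidates, key=lambda p: score(*p))
```

Grid search can only ever return one of its six listed combinations, while random search may land between the grid points (e.g., a depth of 6 or 8), which is exactly the trade-off summarized in the table above.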