Unit 6 - Notes
INT234
Unit 6: MODEL PERFORMANCE
1. The Bias-Variance Trade-off
In predictive analytics, the primary goal is to build a model that generalizes well to unseen data. The prediction error of any supervised learning algorithm can be broken down into three fundamental components: Bias, Variance, and Irreducible Error. Understanding the balance between bias and variance is critical for preventing underfitting and overfitting.
1.1 Definitions
- Bias (Error due to erroneous assumptions):
- Bias refers to the error introduced by approximating a real-world problem, which may be complex, by a simplified model.
- High Bias: Suggests the model pays very little attention to the training data and oversimplifies the underlying relationship. It leads to high error on both training and test data.
- Result: Underfitting. The model fails to capture the underlying trend of the data.
- Example: Linear regression applied to non-linear data.
- Variance (Error due to sensitivity to fluctuations):
- Variance refers to the amount by which the estimate of the target function would change if we used a different training data set.
- High Variance: Suggests the model pays too much attention to the training data, capturing random noise rather than the intended outputs.
- Result: Overfitting. The model performs very well on training data but poorly on test data.
- Example: A decision tree grown to full depth with no pruning.
- Irreducible Error (Var(ε)):
- The error inherent in the problem itself (noise in the system) that cannot be reduced by any model.
1.2 Mathematical Decomposition
The expected squared prediction error (test Mean Squared Error) at a point x₀ can be decomposed as:
E[(y₀ − f̂(x₀))²] = [Bias(f̂(x₀))]² + Var(f̂(x₀)) + Var(ε)
where f̂ is the fitted model and Var(ε) is the irreducible error.
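As a purely hypothetical illustration of the decomposition: if a model had a bias of 0.3, a variance of 0.04, and the noise variance Var(ε) were 0.01, the expected test MSE would be 0.3² + 0.04 + 0.01 = 0.09 + 0.04 + 0.01 = 0.14, and no choice of model could push the error below the 0.01 floor set by Var(ε).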
1.3 The Trade-off
There is an inverse relationship between bias and variance:
- Increasing Model Complexity: Decreases bias but increases variance.
- Decreasing Model Complexity: Increases bias but decreases variance.
The Goal: Find the "Sweet Spot" (optimal complexity) where the sum of squared bias and variance is minimized, resulting in the lowest Total Error. The sketch below illustrates this with polynomials of increasing degree.
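The trade-off can be made concrete by fitting polynomials of increasing degree to the same noisy data and comparing training and test scores. This is a minimal sketch assuming scikit-learn and NumPy are available; the synthetic data, degrees, and random seed are illustrative, so exact numbers will vary.

# Bias-variance illustration: underfit (degree 1) vs. overfit (degree 15) on noisy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=60)        # non-linear target + noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):                                   # simple -> balanced -> complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))   # train vs. test R^2

A low score on both sets signals high bias (underfitting), while a large gap between a high training score and a much lower test score signals high variance (overfitting).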
2. Cross-Validation Methods
Cross-validation is a statistical method used to estimate the skill of machine learning models. It is primarily used to estimate how accurately a predictive model will perform in practice (generalization) and to tune hyperparameters.
2.1 Leave-One-Out Cross-Validation (LOO-CV)
LOO-CV is an exhaustive cross-validation technique.
Mechanism:
- Assume a dataset has n observations.
- The model is trained on n − 1 observations.
- The model is validated on the 1 remaining observation.
- This process is repeated n times, such that each observation serves as the validation set exactly once (see the sketch at the end of this subsection).
- The final performance metric is the average of the individual errors.
Pros:
- Unbiased: Because almost all data is used for training (n − 1 of the n observations), the bias of the error estimate is very low.
- Deterministic: There is no randomness in the split; running LOO twice yields the exact same result.
Cons:
- Computationally Expensive: The model must be fitted n times. For large datasets, this is often infeasible.
- High Variance in Error Estimate: Because the training sets are nearly identical (differing by only one observation), the test error estimates are highly correlated, leading to a higher variance in the estimation of the model performance.
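A minimal LOO-CV sketch, assuming scikit-learn; the iris data and logistic regression model are placeholders for any dataset and estimator.

# Leave-One-Out: n splits, each observation is the validation set exactly once.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())        # n individual scores, then their average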
2.2 K-Fold Cross-Validation
This is the standard approximation method for model evaluation.
Mechanism:
- Randomly shuffle the dataset.
- Split the dataset into K groups (folds) of approximately equal size.
- For each unique group:
- Take the group as the hold-out (test) data set.
- Take the remaining K − 1 groups as the training data set.
- Fit a model on the training set and evaluate it on the test set.
- Retain the evaluation score and discard the model.
- Summarize the skill of the model using the mean of the model evaluation scores.
Common Configurations:
- K=5 or K=10: These values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
Pros:
- Computationally Efficient: Requires fitting the model only K times (usually 5 or 10) rather than n times.
- Bias-Variance Balance: K-fold generally provides a better estimate of the test error (lower variance) than LOO-CV.
Example Implementation (Python):
# K-Fold Concept: train() and evaluate() are placeholders supplied by the caller.
import random

def k_fold_cross_validation(data, K, train, evaluate):
    data = list(data)
    random.shuffle(data)                      # 1. randomly shuffle the dataset
    folds = [data[i::K] for i in range(K)]    # 2. split into K roughly equal folds
    scores = []
    for i in range(K):
        test_set = folds[i]                   # hold-out fold
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_set)              # fit on the remaining K - 1 folds
        scores.append(evaluate(model, test_set))
    return sum(scores) / K                    # final performance = mean of the K scores
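In practice, if scikit-learn is available, the whole procedure is a single helper call; model, X, and y below stand for whatever estimator and arrays you are evaluating:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)   # fits and scores the estimator on 5 folds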
3. Ensemble Learning Overview
Ensemble methods combine several "base" models (weak learners) to produce one optimal predictive model. The core idea is that a group of weak learners can come together to form a strong learner.
- Bagging: Focuses on reducing Variance.
- Boosting: Focuses on reducing Bias (and variance to an extent).
4. Bagging (Bootstrap Aggregating)
Bagging is an ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression.
4.1 Mechanism
- Bootstrapping: From the original dataset of size n, create B new datasets (bootstrap samples).
- Each bootstrap sample is created by drawing n observations with replacement.
- Some observations may appear multiple times in a single sample, while others may be omitted (Out-of-Bag samples).
- Parallel Training: Train a separate model (base learner) on each bootstrap sample independently.
- The models are usually fully grown (high variance) decision trees.
- Aggregation: Combine the predictions of the B models (see the sketch at the end of this section).
- Regression: Average the predictions.
- Classification: Majority voting (Hard voting).
4.2 Impact on Performance
- Variance Reduction: By averaging many models that are not perfectly correlated, Bagging reduces the variance of the final prediction without increasing the bias.
- Robustness: It helps avoid overfitting, even if the base models are prone to overfitting.
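A minimal bagging sketch, assuming scikit-learn; the synthetic dataset and the number of estimators are illustrative. BaggingClassifier's default base learner is a fully grown decision tree, matching the description above.

# Bagging: B bootstrap samples, one unpruned tree per sample, majority-vote aggregation.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bag = BaggingClassifier(
    n_estimators=100,       # B = 100 bootstrap samples / base models
    bootstrap=True,         # sample with replacement
    oob_score=True,         # score the ensemble on Out-of-Bag observations
    random_state=0,
)
bag.fit(X, y)
print(bag.oob_score_)       # internal estimate of generalization accuracy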
5. Random Forests
Random Forests are an extension and improvement of Bagging, specifically designed for Decision Trees.
5.1 The Problem with Standard Bagging
In standard bagging, if there is one very strong predictor (feature) in the dataset, most trees will use that feature for the top split. Consequently, all the trees will look very similar (highly correlated). Averaging highly correlated quantities does not significantly reduce variance.
5.2 The Random Forest Solution: Feature Randomness
Random Forests introduce a modification to the tree-growing process to decorrelate the trees.
- Bootstrap Samples: Like bagging, create bootstrap samples of the data.
- Split Selection: When building each tree, at each split point:
- The algorithm does not consider all p available features.
- Instead, it randomly selects a subset of m features, where m < p.
- Typically, m ≈ √p for classification and m ≈ p/3 for regression (see the sketch at the end of this section).
- Aggregation: Combine predictions via averaging or voting.
5.3 Key Characteristics
- Decorrelation: By forcing trees to use different features, the trees become diverse. When diverse trees are averaged, the variance reduction is far superior to standard bagging.
- No Pruning: Trees are typically grown deep without pruning.
- Out-of-Bag (OOB) Error: Random Forests allow for an internal estimation of error using the data not included in the bootstrap sample, negating the need for separate cross-validation in some contexts.
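A minimal random forest sketch, assuming scikit-learn; the dataset and hyperparameter values are illustrative. Setting max_features="sqrt" corresponds to m ≈ √p candidate features per split, and oob_score=True uses the Out-of-Bag observations as the internal error estimate described above.

# Random forest: bagged trees plus feature subsampling at every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,       # number of decorrelated trees
    max_features="sqrt",    # m = sqrt(p) candidate features per split
    oob_score=True,         # internal generalization estimate, no separate CV needed
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)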
6. Boosting
Boosting is a sequential ensemble technique that converts weak learners into strong learners. Unlike Bagging (parallel), Boosting builds models sequentially.
6.1 Mechanism
- Initialize: Assign equal weight to all data points.
- Iterative Learning:
- Train a weak model on the data (usually a shallow decision tree; a single-split tree is called a "stump").
- Calculate the error. Identify which data points were misclassified.
- Reweight: Increase the weight of misclassified points and decrease the weight of correctly classified points.
- Train the next model. This new model is forced to focus on the "hard" examples that the previous model missed.
- Weighted Voting: The final prediction is a weighted sum of the sequential models. Models with higher accuracy are given more weight in the final vote.
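A minimal AdaBoost sketch, assuming scikit-learn; the synthetic dataset and number of rounds are illustrative. scikit-learn's default base learner for AdaBoostClassifier is a depth-1 tree (a stump), matching the mechanism above.

# AdaBoost: 50 sequential stumps, each focusing on the points the previous ones got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

ada = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())   # 5-fold estimate of generalization accuracy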
6.2 Types of Boosting
- AdaBoost (Adaptive Boosting):
- Updates weights of data points after every iteration.
- Sensitive to noisy data and outliers.
- Gradient Boosting (GBM):
- Instead of updating weights, it trains the next model to predict the residuals (errors) of the previous model.
- It uses Gradient Descent to minimize a loss function; for squared loss this reduces to repeatedly fitting trees to the current residuals (see the sketch after this list).
- XGBoost (Extreme Gradient Boosting):
- An optimized implementation of Gradient Boosting.
- Includes regularization (L1/L2) to prevent overfitting.
- Highly efficient and widely used in competitions.
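The residual-fitting idea behind Gradient Boosting can be shown in a few lines. This is a hand-rolled sketch for squared loss, assuming NumPy and scikit-learn trees; the synthetic data, learning rate, and tree depth are illustrative, not a production implementation (libraries such as GradientBoostingRegressor or XGBoost add regularization and many optimizations).

# Gradient boosting for squared loss: each new tree is fitted to the residuals
# (negative gradients) of the current ensemble prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

learning_rate, n_rounds = 0.1, 100
prediction = np.full_like(y, y.mean())            # start from a constant model
for _ in range(n_rounds):
    residuals = y - prediction                    # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)   # weak learner on residuals
    prediction += learning_rate * tree.predict(X) # shrunken update of the ensemble

print(np.mean((y - prediction) ** 2))             # training MSE after boosting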
6.3 Bagging vs. Boosting Comparison
| Feature | Bagging (e.g., Random Forest) | Boosting (e.g., AdaBoost, XGBoost) |
|---|---|---|
| Goal | Decrease Variance | Decrease Bias (and Variance) |
| Method | Parallel ensemble | Sequential ensemble |
| Base Learners | Independent | Dependent (Predecessor influences Successor) |
| Tree Depth | Deep (High Variance, Low Bias) | Shallow (High Bias, Low Variance) |
| Weighting | All models weighted equally | Models weighted by performance |
| Overfitting | Robust to overfitting | Can overfit if iterations are too high |
Summary of Model Performance Strategies
| Strategy | Addresses | Best Used When |
|---|---|---|
| Cross-Validation | Generalization estimation | Determining how a model will perform on unseen data. |
| Bagging | High Variance | You have a complex model (like deep decision trees) that overfits. |
| Random Forest | High Variance | You have high-dimensional data and standard bagging produces correlated trees. |
| Boosting | High Bias | You have simple models (weak learners) that underfit the data. |