Unit 3 - Notes
INT394
Unit 3: Non-parametric classification and Ensemble Models
1. Introduction to Non-Parametric Classification
Unlike parametric models (e.g., Linear Regression) which assume a fixed number of parameters and a specific data distribution (like a normal distribution), non-parametric models do not make strong assumptions about the form of the mapping function. The number of parameters in these models grows with the amount of training data.
Key Characteristics:
- Flexibility: Can fit a wide range of functional forms.
- Data-Driven: The model structure is determined by the data, not a pre-defined equation.
- Performance: Generally performs better with large amounts of data, but prediction (and, for some methods, training) can be slow.
2. K-Nearest Neighbours (KNN)
KNN is a supervised, instance-based (lazy) learning algorithm used for both classification and regression. It does not learn a discriminative function from the training data but memorizes the training dataset instead.
2.1 The Algorithm
- Store all training examples.
- Receive a new, unseen query point $x_q$.
- Calculate the distance between $x_q$ and every data point in the training set.
- Sort the distances and pick the $K$ nearest data points.
- Classification: Assign the class label by majority vote of the $K$ neighbors.
- Regression: Assign the value as the average of the $K$ neighbors' values (a minimal code sketch follows this list).
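A minimal from-scratch sketch of these steps using NumPy; the toy arrays below are made up purely for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    # Steps 1-2: all training examples are stored; a query point arrives.
    # Step 3: Euclidean distance between the query and every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 4: indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Step 5 (classification): majority vote among the neighbors' labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two well-separated 2-D classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, x_query=np.array([1.1, 0.9]), k=3))  # -> 0
```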
2.2 Distance Metrics
The choice of distance metric is critical.
- Euclidean Distance: Straight-line distance (L2 Norm). Most common for continuous variables.
- Manhattan Distance: Sum of absolute differences (L1 Norm). Useful for high-dimensional data (grid-like).
- Minkowski Distance: A generalization of both, with order $p$: $p = 1$ gives Manhattan and $p = 2$ gives Euclidean (see the worked example below).
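A quick worked example of the three metrics with NumPy (the vectors `a` and `b` are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # L2 norm: sqrt(9 + 4 + 0) ~ 3.606
manhattan = np.sum(np.abs(a - b))                   # L1 norm: 3 + 2 + 0 = 5
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)   # reduces to L1 at p=1, L2 at p=2

print(euclidean, manhattan, minkowski)
```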
2.3 Choosing 'K'
- Small $K$ (e.g., $K = 1$): Low bias, high variance. The model is sensitive to noise/outliers. Overfitting risk.
- Large $K$: High bias, low variance. The decision boundary becomes smooth, potentially ignoring local patterns. Underfitting risk.
- Rule of thumb: $K \approx \sqrt{N}$, where $N$ is the number of samples. Usually, an odd number is chosen to avoid ties in binary classification (a cross-validation sketch for choosing $K$ follows this list).
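In practice, $K$ is usually tuned by cross-validation rather than by the rule of thumb alone. A sketch with scikit-learn, using the built-in Iris dataset purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# Search odd values of K and keep the one with the best cross-validated accuracy.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```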
3. Decision Trees
A Decision Tree is a flowchart-like tree structure where an internal node represents a feature (or attribute), the branch represents a decision rule, and each leaf node represents the outcome.
3.1 Terminology
- Root Node: The topmost node representing the entire population.
- Splitting: Dividing a node into two or more sub-nodes.
- Leaf Node: A node that cannot be split further (final output).
- Pruning: Removing sub-nodes of a decision node (opposite of splitting).
3.2 Splitting Criteria
To decide which attribute to split on, the algorithm evaluates which split results in the most homogeneous (pure) sub-nodes.
A. Entropy
A measure of impurity or randomness in the data:
$$\text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
- Where $p_i$ is the probability (proportion) of class $i$ in the sample $S$.
- Entropy is 0 if the sample is completely homogeneous (all same class).
- Entropy is 1 if the sample is equally divided (50/50).
B. Information Gain (IG)
The reduction in entropy achieved by partitioning the data according to an attribute $A$:
$$IG(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$
- The attribute with the highest Information Gain is chosen as the splitting node.
C. Gini Impurity
A measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the class distribution:
$$\text{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^2$$
- Computational efficiency: Gini is faster to calculate than Entropy (no log function).
- Range: [0, 0.5] for binary classification (all three criteria are computed in the sketch below).
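A small NumPy sketch computing Entropy, Information Gain, and Gini Impurity for a made-up parent node and one candidate split:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child nodes.
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])           # 50/50 split -> entropy 1.0, gini 0.5
left, right = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])
print(entropy(parent), gini(parent))                   # 1.0 0.5
print(information_gain(parent, [left, right]))         # ~0.189
```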
4. Decision Tree Algorithms: ID3, C4.5, CART
| Feature | ID3 (Iterative Dichotomiser 3) | C4.5 (Successor to ID3) | CART (Classification & Regression Trees) |
|---|---|---|---|
| Splitting Criterion | Information Gain | Gain Ratio | Gini Impurity (Classification) / Variance (Regression) |
| Attribute Type | Categorical only | Categorical & Continuous | Categorical & Continuous |
| Tree Structure | Multi-way splits | Multi-way splits | Binary splits only (Always two branches) |
| Missing Values | Cannot handle | Handles implicitly | Handles using surrogate splits |
| Pruning | None (prone to overfitting) | Error-Based Pruning | Cost-Complexity Pruning |
4.1 ID3
- Top-down greedy approach.
- Drawback: Bias towards attributes with many distinct values (e.g., "User ID" would yield high info gain but is useless).
4.2 C4.5
- Improves ID3 by using Gain Ratio to normalize Information Gain, reducing the bias toward high-cardinality attributes.
- Can handle continuous data by creating thresholds (e.g., Age > 25).
4.3 CART
- Constructs binary trees.
- Uses Gini Index for classification and Least Squared Deviation for regression.
- The standard algorithm used in libraries like Scikit-Learn.
5. Pruning and Preventing Overfitting
Decision trees are prone to overfitting, creating overly complex trees that memorize noise in the training data (High Variance).
5.1 Pre-Pruning (Early Stopping)
Halting the tree construction before it becomes perfectly pure.
- Max Depth: Limit the depth of the tree.
- Min Samples Split: Minimum samples required to split an internal node.
- Min Samples Leaf: Minimum samples required at a leaf node (illustrated in the sketch after this list).
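A pre-pruning sketch with scikit-learn's DecisionTreeClassifier; the Breast Cancer dataset and the specific limits are stand-ins chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: stop growth early via depth and minimum-sample constraints.
tree = DecisionTreeClassifier(
    max_depth=4,            # limit the depth of the tree
    min_samples_split=20,   # a node needs at least 20 samples to be split
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
    random_state=42,
)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
```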
5.2 Post-Pruning
Growing the full tree and then removing branches that do not add predictive power.
- Reduced Error Pruning: Remove a node; if validation accuracy remains the same or improves, keep it removed.
- Cost Complexity Pruning (Weakest Link Pruning): Penalizes the objective function based on the number of terminal nodes, minimizing $R_\alpha(T) = R(T) + \alpha |T|$ (see the pruning sketch below).
- $R(T)$: Misclassification error of the tree $T$.
- $|T|$: Number of leaf nodes.
- $\alpha$: Complexity parameter (tunable).
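Scikit-Learn's CART implementation exposes cost-complexity pruning via the ccp_alpha parameter. A sketch that enumerates candidate values of $\alpha$ and keeps the one scoring best on held-out data (dataset is a stand-in; in practice a separate validation set or cross-validation is preferable):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Enumerate the effective alpha values for this training set ...
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# ... and keep the alpha whose pruned tree scores best on held-out data.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)
```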
6. Ensemble Models
6.1 Importance of Ensemble
Ensemble learning combines the predictions of multiple distinct models to achieve better predictive performance than any of the constituent models alone.
Why Ensembles work:
- Reduction of Variance: Averaging multiple models reduces the risk of picking a single model that fits the noise (Bagging).
- Reduction of Bias: Aggregating weak learners (models slightly better than random guessing) can create a strong learner (Boosting).
- Robustness: Less sensitive to specific data peculiarities.
6.2 Voting and Averaging (Basic Ensembles)
Voting (Classification)
- Hard Voting: Every model votes for a class. The majority wins.
- Model A: Class 1
- Model B: Class 1
- Model C: Class 2
- Result: Class 1
- Soft Voting: Average the probabilities predicted by each model. The class with the highest average probability wins. Generally performs better than hard voting.
Averaging (Regression)
Sum the predictions of all models and divide by the number of models.
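A soft-voting sketch with scikit-learn's VotingClassifier; the three base models and the Iris dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset

# Soft voting: average the three models' predicted probabilities per class.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
    ],
    voting="soft",  # "hard" would use a majority vote of predicted labels instead
)
print(cross_val_score(ensemble, X, y, cv=5).mean())

# For regression, VotingRegressor averages the models' numeric predictions.
```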
7. Advanced Ensemble Techniques
7.1 Bagging (Bootstrap Aggregating)
Bagging aims to decrease variance. It is a parallel ensemble method.
- Bootstrapping: Create subsets of the original training data by sampling with replacement. Some data points may appear multiple times in a subset, others not at all (Out-of-Bag samples).
- Training: Train a separate model (usually decision trees) on each subset independently.
- Aggregating:
- Classification: Majority Vote.
- Regression: Average.
Key Example: Random Forest
- Uses Bagging with Decision Trees.
- Feature Randomness: When splitting a node, it searches for the best split only among a random subset of features (not all features). This de-correlates the trees, ensuring they learn different patterns (both approaches are sketched below).
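A sketch comparing plain bagging of decision trees with a Random Forest in scikit-learn (dataset and hyperparameters are stand-ins):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# Plain bagging: 100 trees, each trained on a bootstrap sample of the data.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)

# Random Forest: bagging plus a random subset of features considered at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```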
7.2 Boosting
Boosting aims to decrease bias (and variance). It is a sequential ensemble method.
- Sequential Learning: Models are trained one after another.
- Error Focus: Each subsequent model focuses on the errors made by the previous model.
- Weighting:
- Instances that were misclassified are assigned higher weights.
- The next model is forced to concentrate on these "hard" cases.
- Final Prediction: A weighted sum of the weak learners' predictions.
Key Algorithms:
- AdaBoost (Adaptive Boosting): Typically uses decision stumps (trees of depth 1) as weak learners. Increases the weights of misclassified points.
- Gradient Boosting: Optimizes a loss function. Each new tree predicts the residuals (errors) of the previous trees.
- XGBoost: An optimized, distributed gradient boosting library, known for high performance and built-in regularization (see the example after this list).
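A minimal boosting sketch with scikit-learn's AdaBoostClassifier and GradientBoostingClassifier (XGBoost is a separate library and is omitted here); the dataset is a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# AdaBoost: sequentially re-weights misclassified samples (decision stumps by default).
ada = AdaBoostClassifier(n_estimators=100, random_state=42)

# Gradient Boosting: each new tree fits the residual errors of the current ensemble.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

for name, model in [("AdaBoost", ada), ("Gradient Boosting", gbm)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```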
7.3 Stacking (Stacked Generalization)
Stacking combines models of different types (heterogeneous) using a meta-learner.
Architecture:
- Level 0 (Base Learners): Train different models (e.g., KNN, SVM, Decision Tree) on the full training set.
- Predictions: Generate predictions from these models, typically out-of-fold predictions via cross-validation so the meta-learner is not trained on labels the base models have already seen.
- Level 1 (Meta Learner): Use the predictions of the Level 0 models as input features to train a new model (often Logistic Regression or Linear Regression).
- Goal: The meta-learner learns how to best combine the base models' outputs to minimize error (demonstrated in the sketch below).
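A stacking sketch with scikit-learn's StackingClassifier; the choice of base learners and meta-learner mirrors the architecture above, with dataset and hyperparameters as stand-ins:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# Level 0: heterogeneous base learners.
base_learners = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("svm", SVC(random_state=42)),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
]

# Level 1: a logistic-regression meta-learner trained on the base models'
# out-of-fold predictions (StackingClassifier handles the internal cross-validation).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print(cross_val_score(stack, X, y, cv=5).mean().round(3))
```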
8. Summary Comparison of Ensemble Methods
| Feature | Bagging | Boosting | Stacking |
|---|---|---|---|
| Goal | Reduce Variance (Overfitting) | Reduce Bias (Underfitting) & Variance | Improve predictive power |
| Structure | Parallel (Independent models) | Sequential (Dependent models) | Hierarchical |
| Models | Usually Homogeneous (e.g., all trees) | Usually Homogeneous (e.g., all stumps) | Heterogeneous (Trees, SVM, KNN) |
| Data Sampling | Bootstrap (Random with replacement) | Reweighted Data based on errors | Original Data + Predictions |
| Example | Random Forest | AdaBoost, XGBoost | Stacked Generalization |