Unit 3 - Notes
INT394
Unit 3: Non-parametric classification and Ensemble Models
1. Introduction to Non-Parametric Classification
Unlike parametric models (e.g., Linear Regression) which assume a fixed number of parameters and a specific data distribution (like a normal distribution), non-parametric models do not make strong assumptions about the form of the mapping function. The number of parameters in these models grows with the amount of training data.
Key Characteristics:
- Flexibility: Can fit a wide range of functional forms.
- Data-Driven: The model structure is determined by the data, not a pre-defined equation.
- Performance: Generally performs better with large amounts of data, but prediction (and, for some methods, training) can be slow.
2. K-Nearest Neighbours (KNN)
KNN is a supervised, instance-based (lazy) learning algorithm used for both classification and regression. It does not learn a discriminative function from the training data but memorizes the training dataset instead.
2.1 The Algorithm
- Store all training examples.
- Receive a new, unseen query point $x_q$.
- Calculate the distance between $x_q$ and every data point in the training set.
- Sort the distances and pick the $K$ nearest data points.
- Classification: Assign the class label by majority vote of the $K$ neighbors.
- Regression: Assign the value as the average of the $K$ neighbors' values (a minimal code sketch follows this list).
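A minimal from-scratch sketch of these steps using NumPy; the toy arrays below are made up purely for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    # Steps 1-2: all training examples are stored; a query point arrives.
    # Step 3: Euclidean distance between the query and every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 4: indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Step 5 (classification): majority vote among the neighbors' labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two well-separated 2-D classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, x_query=np.array([1.1, 0.9]), k=3))  # -> 0
```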
2.2 Distance Metrics
The choice of distance metric is critical.
- Euclidean Distance: Straight-line distance (L2 Norm). Most common for continuous variables.
- Manhattan Distance: Sum of absolute differences (L1 Norm). Useful for high-dimensional data (grid-like).
- Minkowski Distance: A generalization of both, with order $p$: $p = 1$ gives Manhattan and $p = 2$ gives Euclidean (see the worked example below).
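A quick worked example of the three metrics with NumPy (the vectors `a` and `b` are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # L2 norm: sqrt(9 + 4 + 0) ~ 3.606
manhattan = np.sum(np.abs(a - b))                   # L1 norm: 3 + 2 + 0 = 5
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)   # reduces to L1 at p=1, L2 at p=2

print(euclidean, manhattan, minkowski)
```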
2.3 Choosing 'K'
- Small $K$ (e.g., $K = 1$): Low bias, high variance. The model is sensitive to noise/outliers. Overfitting risk.
- Large $K$: High bias, low variance. The decision boundary becomes smooth, potentially ignoring local patterns. Underfitting risk.
- Rule of thumb: $K \approx \sqrt{N}$, where $N$ is the number of samples. Usually, an odd number is chosen to avoid ties in binary classification (a cross-validation sketch for choosing $K$ follows this list).
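In practice, $K$ is usually tuned by cross-validation rather than by the rule of thumb alone. A sketch with scikit-learn, using the built-in Iris dataset purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# Search odd values of K and keep the one with the best cross-validated accuracy.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```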
3. Decision Trees
A Decision Tree is a flowchart-like tree structure where an internal node represents a feature (or attribute), the branch represents a decision rule, and each leaf node represents the outcome.
3.1 Terminology
- Root Node: The topmost node representing the entire population.
- Splitting: Dividing a node into two or more sub-nodes.
- Leaf Node: A node that cannot be split further (final output).
- Pruning: Removing sub-nodes of a decision node (opposite of splitting).
3.2 Splitting Criteria
To decide which attribute to split on, the algorithm evaluates which split results in the most homogeneous (pure) sub-nodes.
A. Entropy
A measure of impurity or randomness in the data:
$$\text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
- Where $p_i$ is the probability (proportion) of class $i$ in the sample $S$.
- Entropy is 0 if the sample is completely homogeneous (all same class).
- Entropy is 1 if the sample is equally divided (50/50).
B. Information Gain (IG)
The reduction in entropy achieved by partitioning the data according to an attribute $A$:
$$IG(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$
- The attribute with the highest Information Gain is chosen as the splitting node.
C. Gini Impurity
A measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the class distribution:
$$\text{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^2$$
- Computational efficiency: Gini is faster to calculate than Entropy (no log function).
- Range: [0, 0.5] for binary classification (all three criteria are computed in the sketch below).
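A small NumPy sketch computing Entropy, Information Gain, and Gini Impurity for a made-up parent node and one candidate split:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child nodes.
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])           # 50/50 split -> entropy 1.0, gini 0.5
left, right = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])
print(entropy(parent), gini(parent))                   # 1.0 0.5
print(information_gain(parent, [left, right]))         # ~0.189
```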
4. Decision Tree Algorithms: ID3, C4.5, CART
| Feature | ID3 (Iterative Dichotomiser 3) | C4.5 (Successor to ID3) | CART (Classification & Regression Trees) |
|---|---|---|---|
| Splitting Criterion | Information Gain | Gain Ratio | Gini Impurity (Classification) / Variance (Regression) |
| Attribute Type | Categorical only | Categorical & Continuous | Categorical & Continuous |
| Tree Structure | Multi-way splits | Multi-way splits | Binary splits only (Always two branches) |
| Missing Values | Cannot handle | Handles implicitly | Handles using surrogate splits |
| Pruning | None (prone to overfitting) | Error-Based Pruning | Cost-Complexity Pruning |
4.1 ID3
- Top-down greedy approach.
- Drawback: Bias towards attributes with many distinct values (e.g., "User ID" would yield high info gain but is useless).
4.2 C4.5
- Improves ID3 by using Gain Ratio to normalize Information Gain, reducing the bias toward high-cardinality attributes.
- Can handle continuous data by creating thresholds (e.g., Age > 25).
4.3 CART
- Constructs binary trees.
- Uses Gini Index for classification and Least Squared Deviation for regression.
- The standard algorithm used in libraries like Scikit-Learn.
5. Pruning and Preventing Overfitting
Decision trees are prone to overfitting, creating overly complex trees that memorize noise in the training data (High Variance).
5.1 Pre-Pruning (Early Stopping)
Halting the tree construction before it becomes perfectly pure.
- Max Depth: Limit the depth of the tree.
- Min Samples Split: Minimum samples required to split an internal node.
- Min Samples Leaf: Minimum samples required at a leaf node (illustrated in the sketch after this list).
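A pre-pruning sketch with scikit-learn's DecisionTreeClassifier; the Breast Cancer dataset and the specific limits are stand-ins chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: stop growth early via depth and minimum-sample constraints.
tree = DecisionTreeClassifier(
    max_depth=4,            # limit the depth of the tree
    min_samples_split=20,   # a node needs at least 20 samples to be split
    min_samples_leaf=5,     # every leaf keeps at least 5 samples
    random_state=42,
)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
```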
5.2 Post-Pruning
Growing the full tree and then removing branches that do not add predictive power.
- Reduced Error Pruning: Remove a node; if validation accuracy remains the same or improves, keep it removed.
- Cost Complexity Pruning (Weakest Link Pruning): Penalizes the objective function based on the number of terminal nodes, minimizing $R_\alpha(T) = R(T) + \alpha |T|$ (see the pruning sketch below).
- $R(T)$: Misclassification error of the tree $T$.
- $|T|$: Number of leaf nodes.
- $\alpha$: Complexity parameter (tunable).
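Scikit-Learn's CART implementation exposes cost-complexity pruning via the ccp_alpha parameter. A sketch that enumerates candidate values of $\alpha$ and keeps the one scoring best on held-out data (dataset is a stand-in; in practice a separate validation set or cross-validation is preferable):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Enumerate the effective alpha values for this training set ...
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# ... and keep the alpha whose pruned tree scores best on held-out data.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)
```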
6. Ensemble Models
6.1 Importance of Ensemble
Ensemble learning combines the predictions of multiple distinct models to achieve better predictive performance than any of the constituent models alone.
Why Ensembles work:
- Reduction of Variance: Averaging multiple models reduces the risk of picking a single model that fits the noise (Bagging).
- Reduction of Bias: Aggregating weak learners (models slightly better than random guessing) can create a strong learner (Boosting).
- Robustness: Less sensitive to specific data peculiarities.
6.2 Voting and Averaging (Basic Ensembles)
Voting (Classification)
- Hard Voting: Every model votes for a class. The majority wins.
- Model A: Class 1
- Model B: Class 1
- Model C: Class 2
- Result: Class 1
- Soft Voting: Average the probabilities predicted by each model. The class with the highest average probability wins. Generally performs better than hard voting.
Averaging (Regression)
Sum the predictions of all models and divide by the number of models.
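A soft-voting sketch with scikit-learn's VotingClassifier; the three base models and the Iris dataset are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset

# Soft voting: average the three models' predicted probabilities per class.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
    ],
    voting="soft",  # "hard" would use a majority vote of predicted labels instead
)
print(cross_val_score(ensemble, X, y, cv=5).mean())

# For regression, VotingRegressor averages the models' numeric predictions.
```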
7. Advanced Ensemble Techniques
7.1 Bagging (Bootstrap Aggregating)
Bagging aims to decrease variance. It is a parallel ensemble method.
- Bootstrapping: Create subsets of the original training data by sampling with replacement. Some data points may appear multiple times in a subset, others not at all (Out-of-Bag samples).
- Training: Train a separate model (usually decision trees) on each subset independently.
- Aggregating:
- Classification: Majority Vote.
- Regression: Average.
Key Example: Random Forest
- Uses Bagging with Decision Trees.
- Feature Randomness: When splitting a node, it searches for the best split only among a random subset of features (not all features). This de-correlates the trees, ensuring they learn different patterns (both approaches are sketched below).
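A sketch comparing plain bagging of decision trees with a Random Forest in scikit-learn (dataset and hyperparameters are stand-ins):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# Plain bagging: 100 trees, each trained on a bootstrap sample of the data.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)

# Random Forest: bagging plus a random subset of features considered at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```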
7.2 Boosting
Boosting aims to decrease bias (and variance). It is a sequential ensemble method.
- Sequential Learning: Models are trained one after another.
- Error Focus: Each subsequent model focuses on the errors made by the previous model.
- Weighting:
- Instances that were misclassified are assigned higher weights.
- The next model is forced to concentrate on these "hard" cases.
- Final Prediction: A weighted sum of the weak learners' predictions.
Key Algorithms:
- AdaBoost (Adaptive Boosting): Typically uses decision stumps (trees of depth 1) as weak learners. Increases the weights of misclassified points.
- Gradient Boosting: Optimizes a loss function. Each new tree predicts the residuals (errors) of the previous trees.
- XGBoost: An optimized, distributed gradient boosting library, known for high performance and built-in regularization (see the example after this list).
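A minimal boosting sketch with scikit-learn's AdaBoostClassifier and GradientBoostingClassifier (XGBoost is a separate library and is omitted here); the dataset is a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# AdaBoost: sequentially re-weights misclassified samples (decision stumps by default).
ada = AdaBoostClassifier(n_estimators=100, random_state=42)

# Gradient Boosting: each new tree fits the residual errors of the current ensemble.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

for name, model in [("AdaBoost", ada), ("Gradient Boosting", gbm)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```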
7.3 Stacking (Stacked Generalization)
Stacking combines models of different types (heterogeneous) using a meta-learner.
Architecture:
- Level 0 (Base Learners): Train different models (e.g., KNN, SVM, Decision Tree) on the full training set.
- Predictions: Generate predictions from these models, typically out-of-fold predictions via cross-validation so the meta-learner is not trained on labels the base models have already seen.
- Level 1 (Meta Learner): Use the predictions of the Level 0 models as input features to train a new model (often Logistic Regression or Linear Regression).
- Goal: The meta-learner learns how to best combine the base models' outputs to minimize error (demonstrated in the sketch below).
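A stacking sketch with scikit-learn's StackingClassifier; the choice of base learners and meta-learner mirrors the architecture above, with dataset and hyperparameters as stand-ins:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# Level 0: heterogeneous base learners.
base_learners = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("svm", SVC(random_state=42)),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
]

# Level 1: a logistic-regression meta-learner trained on the base models'
# out-of-fold predictions (StackingClassifier handles the internal cross-validation).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print(cross_val_score(stack, X, y, cv=5).mean().round(3))
```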
8. Summary Comparison of Ensemble Methods
| Feature | Bagging | Boosting | Stacking |
|---|---|---|---|
| Goal | Reduce Variance (Overfitting) | Reduce Bias (Underfitting) & Variance | Improve predictive power |
| Structure | Parallel (Independent models) | Sequential (Dependent models) | Hierarchical |
| Models | Usually Homogeneous (e.g., all trees) | Usually Homogeneous (e.g., all stumps) | Heterogeneous (Trees, SVM, KNN) |
| Data Sampling | Bootstrap (Random with replacement) | Reweighted Data based on errors | Original Data + Predictions |
| Example | Random Forest | AdaBoost, XGBoost | Stacked Generalization |