Unit 3 - Notes
INT234
Unit 3: SUPERVISED LEARNING: CLASSIFICATION
1. Introduction to Classification
Classification is a subcategory of Supervised Learning where the goal is to predict the categorical class labels (discrete values) of new instances based on past observations. Unlike regression, which predicts continuous values, classification answers questions like "Is this email spam?" or "Is the tumor malignant or benign?"
2. Lazy Learning: Nearest Neighbors
Lazy learning algorithms do not generate a model during the training phase. Instead, they store the training data and wait until a query (a new instance to classify) is made to perform generalization.
k-Nearest Neighbors (k-NN)
k-NN is a non-parametric method used for classification and regression. It relies on the assumption that similar things exist in close proximity.
How it Works:
- Store Data: Save all training examples.
- Receive Query: A new data point needs classification.
- Calculate Distance: Compute the distance between the new point and all stored training points using a distance metric.
- Find Neighbors: Identify the nearest data points.
- Vote: The majority class among the neighbors determines the class of the new point.
Distance Metrics:
- Euclidean Distance: The straight-line distance between two points.
- Manhattan Distance: The sum of absolute differences (taxicab geometry).
- Minkowski Distance: A generalization of Euclidean and Manhattan distances.
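As a quick illustration, the three metrics can be computed directly with NumPy. This is a minimal sketch; the two example points are arbitrary.
```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])   # arbitrary example points
b = np.array([4.0, 0.0, 3.0])

# Euclidean: straight-line distance
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan: sum of absolute differences (taxicab geometry)
manhattan = np.sum(np.abs(a - b))

# Minkowski with order p (p = 2 gives Euclidean, p = 1 gives Manhattan)
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)

print(euclidean, manhattan, minkowski)
```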
Choosing 'k' (Hyperparameter Tuning):
- Small k (e.g., k = 1): The model is sensitive to noise (outliers). Highly complex decision boundaries. High variance, low bias (overfitting).
- Large k: The model smooths out the prediction boundaries. Computationally expensive. Low variance, high bias (underfitting).
- Rule of Thumb: k is often set to the square root of the number of samples (k ≈ √n) and is usually an odd number to avoid tied votes in binary classification.
Pros and Cons:
- Pros: Simple to implement; effective with large training data; no training phase (fast training).
- Cons: Slow prediction phase (must calculate distance to every point); sensitive to outliers; requires Feature Scaling (normalization) because variables with larger scales dominate distance calculations.
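A minimal scikit-learn sketch of the workflow above; the Iris dataset is just an assumed stand-in, and note the feature scaling step mentioned in the cons:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features so no single variable dominates the distance calculation,
# then classify by majority vote among the k = 5 nearest neighbours.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```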
3. Naïve Bayes
Naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' Theorem with strong (naïve) independence assumptions between the features.
Bayes' Theorem
P(A|B) = [P(B|A) × P(A)] / P(B)
Where:
- P(A|B): Posterior probability (Probability of Class A given Feature B).
- P(B|A): Likelihood (Probability of Feature B given Class A).
- P(A): Prior probability of Class A.
- P(B): Evidence (Probability of Feature B).
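A small worked example showing how the four terms combine; all numbers here are made up for illustration:
```python
# Hypothetical spam-filter numbers: P(spam) = 0.3, P("free" | spam) = 0.6,
# P("free") = 0.25. Posterior = likelihood * prior / evidence.
prior      = 0.3    # P(A): P(spam)
likelihood = 0.6    # P(B|A): P("free" | spam)
evidence   = 0.25   # P(B): P("free")

posterior = likelihood * prior / evidence   # P(spam | "free")
print(posterior)                            # 0.72
```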
The "Naïve" Assumption
The algorithm assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if features depend on each other (e.g., age and salary), Naïve Bayes treats them as independent.
Types of Naïve Bayes Classifiers:
- Gaussian Naïve Bayes: Used when features are continuous and follow a normal distribution.
- Multinomial Naïve Bayes: Used for discrete counts (e.g., text classification/word counts).
- Bernoulli Naïve Bayes: Used for binary/boolean features.
Laplace Smoothing (Additive Smoothing)
If a categorical feature value appears in the test data but not in the training data, the probability becomes 0, wiping out the entire calculation. Laplace smoothing adds a small count (usually 1) to all variable counts to ensure no probability is ever zero.
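In scikit-learn, Laplace smoothing is exposed through the alpha parameter of MultinomialNB. A sketch assuming a tiny toy word-count matrix and hypothetical ham/spam labels:
```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count features (rows = documents, columns = word counts)
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 2, 3],
              [0, 1, 4]])
y = np.array([0, 0, 1, 1])   # 0 = ham, 1 = spam (hypothetical labels)

# alpha=1.0 is classic Laplace (add-one) smoothing: every count is
# incremented so no class-conditional probability collapses to zero.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, y)
# The third word never appears in class 0, but smoothing keeps its
# probability non-zero instead of wiping out the whole product.
print(clf.predict(np.array([[0, 0, 5]])))
```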
4. Divide and Conquer: Decision Trees and Rules
These algorithms function by recursively breaking down a problem into two or more sub-problems of the same or related type until these become simple enough to be solved directly.
A. Decision Trees
A decision tree is a flowchart-like structure where:
- Root Node: Represents the entire population/sample.
- Internal Node: Represents a test on an attribute (e.g., "Is Age > 30?").
- Branch: Represents the outcome of the test.
- Leaf Node (Terminal Node): Holds the class label (the final decision).
Splitting Criteria
To build the tree, the algorithm must decide which feature to split on at each step. Ideally, we want the split that results in the most homogeneous (pure) child nodes.
- Entropy: A measure of randomness or disorder in the dataset.
- Entropy is 0 if the sample is completely homogeneous (all same class).
- Entropy is 1 (for a two-class problem) if the sample is equally divided between the classes.
- Information Gain: The reduction in entropy achieved by partitioning the data according to an attribute.
- The attribute with the highest Information Gain is chosen as the splitting node.
- Gini Impurity: A measure of how often a randomly chosen element from the set would be incorrectly labeled. Used in the CART algorithm.
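The three criteria can be computed directly. A short sketch for a two-class node; the parent and child counts are assumed example numbers:
```python
import numpy as np

def entropy(counts):
    """Entropy in bits: 0 when the node is pure, 1 when a binary node is 50/50."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

def gini(counts):
    """Gini impurity: chance a randomly chosen element is labelled incorrectly."""
    p = np.array(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

# Parent node: 10 positive, 10 negative examples (hypothetical numbers)
parent = [10, 10]
# A candidate split produces children with counts [8, 2] and [2, 8]
left, right = [8, 2], [2, 8]

weighted_child_entropy = (sum(left) / 20) * entropy(left) + (sum(right) / 20) * entropy(right)
info_gain = entropy(parent) - weighted_child_entropy
print(entropy(parent), gini(parent), info_gain)
```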
Pruning
Decision trees are prone to overfitting (creating trees so complex they memorize the training data). Pruning reduces the size of the tree by removing sections that provide little power to classify instances.
- Pre-pruning: Stop growing the tree earlier (e.g., limit max depth).
- Post-pruning: Allow the tree to grow fully, then remove branches.
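A scikit-learn sketch contrasting pre-pruning (max_depth, min_samples_leaf) with post-pruning (cost-complexity pruning via ccp_alpha); the breast-cancer dataset is just an assumed example:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early by limiting depth and leaf size
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)

# Post-pruning: grow fully, then prune weak branches (cost-complexity pruning)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)

for name, tree in [("pre-pruned", pre), ("post-pruned", post)]:
    tree.fit(X_train, y_train)
    print(name, tree.get_depth(), tree.score(X_test, y_test))
```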
B. Classification Rules
Rules represent information in the form of IF-THEN statements.
- Example: IF (Outlook = Sunny) AND (Humidity = High) THEN (Play = No).
Algorithms:
- ZeroR: Predicts the majority class (baseline).
- OneR (One Rule): Generates one rule for each predictor in the data, then selects the one with the smallest total error as the "one rule."
- RIPPER (Repeated Incremental Pruning to Produce Error Reduction): A "Sequential Covering" algorithm. It finds a rule that covers a subset of positive examples, removes those examples from the dataset, and repeats the process until no positive examples remain.
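A minimal sketch of the OneR idea (not a library implementation): for each feature, map every value to its most common class, then keep the feature whose rule makes the fewest mistakes. The weather rows below are a hypothetical toy dataset.
```python
from collections import Counter

# Toy categorical data: (Outlook, Humidity) -> Play
rows = [("Sunny", "High", "No"), ("Sunny", "Normal", "Yes"),
        ("Overcast", "High", "Yes"), ("Rain", "High", "No"),
        ("Rain", "Normal", "Yes"), ("Overcast", "Normal", "Yes")]
features = ["Outlook", "Humidity"]

best_feature, best_rule, best_errors = None, None, len(rows) + 1
for i, name in enumerate(features):
    # For each feature value, predict the majority class seen with that value.
    by_value = {}
    for row in rows:
        by_value.setdefault(row[i], Counter())[row[-1]] += 1
    rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    errors = sum(1 for row in rows if rule[row[i]] != row[-1])
    if errors < best_errors:   # keep the feature with the smallest total error
        best_feature, best_rule, best_errors = name, rule, errors

print(best_feature, best_rule, best_errors)
```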
5. Support Vector Machine (SVM)
SVM is a powerful algorithm capable of performing linear or non-linear classification, regression, and outlier detection. The objective is to find a hyperplane in an N-dimensional space that distinctly classifies the data points.
Key Concepts:
- The Hyperplane: The decision boundary that separates the classes. In 2D, it is a line; in 3D, a plane.
- Support Vectors: The data points closest to the hyperplane. These points are critical because they define the position and orientation of the hyperplane. If you remove other points, the boundary doesn't change; if you remove support vectors, it does.
- The Margin: The distance between the hyperplane and the nearest data point from either class. SVM seeks to maximize this margin. A wider margin implies greater confidence in classification.
Handling Non-Linear Data (The Kernel Trick)
Real-world data is rarely linearly separable (cannot be separated by a straight line). SVM solves this by mapping input data into a higher-dimensional space where a linear separator can be found.
- Linear Kernel: For linearly separable data.
- Polynomial Kernel: Produces curved (polynomial) decision boundaries.
- Radial Basis Function (RBF) Kernel: Implicitly maps data to an infinite-dimensional space (the most common choice for non-linear data).
Soft Margin vs. Hard Margin (C-Parameter)
- Hard Margin: Does not tolerate any misclassification (prone to overfitting, works only on linearly separable data).
- Soft Margin: Allows some misclassification to achieve a wider margin (better generalization).
- Parameter C: Controls the trade-off.
- High C = Strict (Try not to miss any, risk overfitting).
- Low C = Loose (Smoother boundary, risk underfitting).
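A scikit-learn sketch showing the kernel and C choices discussed above; the moons dataset is just an assumed non-linear example:
```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel handles the non-linear boundary; C controls the soft margin:
# high C = few training errors tolerated (risk of overfitting),
# low C = wider margin, smoother boundary (risk of underfitting).
for C in (0.1, 1.0, 100.0):
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
    svm.fit(X_train, y_train)
    print(f"C={C}: test accuracy = {svm.score(X_test, y_test):.3f}")
```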
6. Evaluate Model Performance
Evaluating how well a classification model performs is crucial to ensure it generalizes well to unseen data.
Confusion Matrix
A tabular summary of the number of correct and incorrect predictions made by a classifier.
|  | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | TN (True Negative) | FP (False Positive - Type I Error) |
| Actual Positive | FN (False Negative - Type II Error) | TP (True Positive) |
- Type I Error (FP): False Alarm (e.g., telling a healthy man he has cancer).
- Type II Error (FN): Miss (e.g., telling a sick man he is healthy).
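With scikit-learn the matrix can be produced directly from true and predicted labels; the two label vectors below are hypothetical:
```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # hypothetical ground truth (1 = positive)
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]   # hypothetical model predictions

# Rows = actual class, columns = predicted class (negative class first by default)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```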
Basic Metrics
1. Accuracy
The ratio of correctly predicted observations to the total observations.
- Limitation: Misleading if the dataset is imbalanced (e.g., 99% of data is Class A).
2. Precision (Exactness)
Of all instances predicted as positive, how many were actually positive?
- Crucial when the cost of False Positives is high (e.g., Email Spam detection).
3. Recall / Sensitivity (Completeness)
Of all actual positive instances, how many did the model capture?
- Crucial when the cost of False Negatives is high (e.g., Cancer detection).
4. F1 Score
The harmonic mean of Precision and Recall. It provides a balance between the two.
- Best metric to use when the class distribution is uneven (imbalanced).
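A sketch computing the four metrics from confusion-matrix counts, reusing the hypothetical TN/FP/FN/TP values from the previous snippet:
```python
tn, fp, fn, tp = 3, 1, 1, 3   # counts from the confusion-matrix sketch above

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)    # exactness: of predicted positives, how many were right
recall    = tp / (tp + fn)    # completeness: of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean

print(accuracy, precision, recall, f1)   # 0.75 0.75 0.75 0.75
```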
Advanced Metrics
Logarithmic Loss (Log Loss)
Measures the performance of a classification model where the prediction input is a probability value between 0 and 1.
- It penalizes false classifications.
- It heavily penalizes classifiers that are confident about an incorrect prediction.
- Goal: Minimize Log Loss (closer to 0 is better).
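A sketch of the binary log-loss computation (equivalent to sklearn.metrics.log_loss); the labels and probabilities are made up:
```python
import numpy as np

y_true = np.array([1, 0, 1, 1])          # hypothetical labels
y_prob = np.array([0.9, 0.2, 0.6, 0.1])  # predicted probability of the positive class

# Log loss = -mean( y*log(p) + (1-y)*log(1-p) );
# the last prediction (a confident 0.1 for a true positive) dominates the penalty.
log_loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(log_loss)
```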
Area Under Curve (AUC) - ROC Curve
- ROC (Receiver Operating Characteristic) Curve: A graph plotting True Positive Rate (Recall) against False Positive Rate (1 - Specificity) at various threshold settings.
- AUC (Area Under the ROC Curve): Represents the degree of separability. It tells how much the model is capable of distinguishing between classes.
- AUC = 1.0: Perfect classifier.
- AUC = 0.5: No discrimination (random guessing).
- AUC = 0.0: The model reciprocates the classes (predicts every actual Negative as Positive and vice versa).
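A scikit-learn sketch of the ROC/AUC computation from predicted probabilities; the logistic-regression model and breast-cancer dataset are just assumed stand-ins:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_prob)   # TPR vs FPR at each threshold
print("AUC:", roc_auc_score(y_test, y_prob))       # 0.5 = random guessing, 1.0 = perfect
```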