Unit 2 - Notes
INT395
Unit 2: Classification with scikit-learn
1. Overview of Classification
Definition
Classification is a subcategory of Supervised Learning where the goal is to predict the categorical class labels of new instances based on past observations. The algorithm learns a mapping function from input variables to discrete output variables.
Types of Classification
- Binary Classification: The target variable has only two distinct classes (e.g., Spam vs. Not Spam, 0 vs. 1).
- Multi-class Classification: The target variable has more than two classes, but each sample belongs to only one class (e.g., classifying digits 0–9).
- Multi-label Classification: Each sample is assigned a set of target labels (e.g., tagging a movie as both "Action" and "Sci-Fi").
The Scikit-learn API Workflow
Scikit-learn (sklearn) provides a consistent interface for machine learning models (estimators). The standard workflow is:
- Import the model class.
- Instantiate the class with desired hyperparameters.
- Fit the model to the training data (model.fit(X_train, y_train)).
- Predict labels for new data (model.predict(X_test)); a minimal end-to-end sketch of these steps is shown below.
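A minimal sketch of the four steps, assuming the built-in Iris dataset and a LogisticRegression estimator (any classifier covered in this unit could be swapped in):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# 1. Load a toy dataset (Iris is used here purely for illustration)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 2. Instantiate the estimator with desired hyperparameters
model = LogisticRegression(max_iter=200)
# 3. Fit on the training data
model.fit(X_train, y_train)
# 4. Predict labels for unseen data
y_pred = model.predict(X_test)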
2. Evaluation Metrics
Evaluating the performance of a classification model is critical to understanding how well it generalizes to unseen data.
Confusion Matrix
A tabular summary of the number of correct and incorrect predictions broken down by each class.
| | Predicted Negative (0) | Predicted Positive (1) |
|---|---|---|
| Actual Negative (0) | True Negative (TN) | False Positive (FP) |
| Actual Positive (1) | False Negative (FN) | True Positive (TP) |
- TP: Correctly predicted positive.
- TN: Correctly predicted negative.
- FP (Type I Error): Incorrectly predicted positive.
- FN (Type II Error): Incorrectly predicted negative.
Core Metrics
1. Accuracy
The ratio of correctly predicted observations to the total observations.
- Limitation: Misleading in imbalanced datasets (e.g., if 99% of data is Class A, a model predicting "All A" has 99% accuracy but is useless); see the sketch after this list.
2. Precision
The ratio of correctly predicted positive observations to the total predicted positive observations. Measures the quality of the positive predictions.
- Use case: Spam detection (we want to minimize FP, marking valid email as spam).
3. Recall (Sensitivity / True Positive Rate)
The ratio of correctly predicted positive observations to all observations in the actual positive class.
- Use case: Cancer diagnosis (we want to minimize FN, missing a sick patient).
4. F1-Score
The harmonic mean of Precision and Recall. It balances the two metrics in a single score.
- Use case: Comparing models when dataset is imbalanced.
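A minimal sketch computing these four metrics directly from confusion-matrix counts, using deliberately imbalanced, made-up counts to show how accuracy can mislead:
# Hypothetical counts: 990 actual negatives, 10 actual positives, and a model
# that predicts "negative" for everything
TN, FP, FN, TP = 990, 0, 10, 0
accuracy  = (TP + TN) / (TP + TN + FP + FN)          # 0.99 -- looks great
precision = TP / (TP + FP) if (TP + FP) else 0.0     # 0.0  -- no positive predictions at all
recall    = TP / (TP + FN) if (TP + FN) else 0.0     # 0.0  -- every positive is missed
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
print(accuracy, precision, recall, f1)               # 0.99 0.0 0.0 0.0
Despite 99% accuracy, precision, recall, and F1 all collapse to zero, which is why these metrics are preferred on imbalanced data.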
ROC-AUC
- ROC (Receiver Operating Characteristic) Curve: A plot of the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) at various threshold settings.
- AUC (Area Under the Curve): A single number summarizing how well the model separates the classes across all thresholds.
- AUC = 1.0: Perfect classifier.
- AUC = 0.5: Random classifier (no discrimination capacity).
- AUC < 0.5: Worse than random; the model's predictions are systematically inverted (flipping them would give AUC > 0.5).
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
# Example usage (assumes a fitted classifier `model` and a held-out X_test, y_test)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
# ROC-AUC needs scores/probabilities rather than hard labels (binary case shown)
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
3. The Perceptron
Concept
The Perceptron is the simplest type of artificial neural network (ANN). It is a linear classifier inspired by the biological neuron.
- Mechanism: It takes input features, multiplies them by weights, adds a bias term (b), and passes the result through a step activation function.
- Equation: z = w · x + b = Σᵢ wᵢxᵢ + b
- Decision:
- If z ≥ 0, Output = 1
- If z < 0, Output = 0
Learning Rule
The weights are updated iteratively based on the error:
wᵢ := wᵢ + η (y − ŷ) xᵢ
(Where η is the learning rate, y is the true label, and ŷ is the predicted output)
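A minimal NumPy sketch of a single update step under this rule (the feature values, label, and 0/1 step convention are illustrative assumptions):
import numpy as np
eta = 0.1                            # learning rate
w = np.zeros(3)                      # weights for 3 features
b = 0.0                              # bias term
x = np.array([1.0, 0.5, -0.2])       # one training sample
y = 1                                # its true label (0 or 1)
y_hat = int(np.dot(w, x) + b >= 0)   # step activation on z = w.x + b
update = eta * (y - y_hat)           # zero when the prediction is already correct
w += update * x                      # adjust each weight in proportion to its input
b += update                          # bias is updated like a weight whose input is 1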
Limitations
The single-layer Perceptron can only classify linearly separable data (it cannot solve the XOR problem).
Implementation
from sklearn.linear_model import Perceptron
clf = Perceptron(tol=1e-3, random_state=0)
clf.fit(X_train, y_train)
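To illustrate the linear-separability limitation above, a small sketch fitting the same Perceptron on the XOR truth table (the exact score may vary, but it cannot reach 1.0 because no straight line separates the two classes):
import numpy as np
from sklearn.linear_model import Perceptron
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
xor_clf = Perceptron(tol=1e-3, random_state=0)
xor_clf.fit(X_xor, y_xor)
print(xor_clf.score(X_xor, y_xor))  # stays below 1.0 -- XOR is not linearly separable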
4. Logistic Regression
Concept
Despite its name, this is a classification algorithm. It estimates the probability that an instance belongs to a particular class.
- Sigmoid Function: Unlike linear regression which outputs continuous values, Logistic Regression applies the Sigmoid (Logistic) function to squash the output between 0 and 1.
- Decision Boundary: A threshold (usually 0.5) is applied to the probability to decide the class.
Key Features
- Probabilistic Output: Provides confidence scores, not just classes.
- Linear Decision Boundary: It creates a linear separation in the feature space.
- Regularization: Scikit-learn implements L2 (Ridge) regularization by default to prevent overfitting.
Implementation
from sklearn.linear_model import LogisticRegression
# C is the inverse of regularization strength (smaller C = stronger regularization)
clf = LogisticRegression(C=1.0, solver='lbfgs')
clf.fit(X_train, y_train)
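A short follow-up sketch showing the probabilistic output and the 0.5 decision threshold (assumes a fitted binary classifier clf whose classes are 0 and 1, and a held-out X_test):
# Probabilistic output: one column per class, in the order of clf.classes_
proba = clf.predict_proba(X_test)
# For a binary problem, thresholding the positive-class column at 0.5
# reproduces the hard labels returned by clf.predict(X_test)
manual_pred = (proba[:, 1] > 0.5).astype(int)
print(manual_pred[:5])
print(clf.predict(X_test)[:5])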
5. k-Nearest Neighbors (k-NN)
Concept
k-NN is a non-parametric, lazy learning algorithm. It makes no assumptions about the underlying data distribution and does not "learn" a model during training. Instead, it stores the training dataset.
Mechanism
To classify a new data point:
- Calculate the distance (Euclidean, Manhattan, etc.) between the new point and all stored training points.
- Identify the k nearest neighbors.
- Assign the class label based on a majority vote of those neighbors.
Choosing k
- Small k: Low bias, High variance (sensitive to noise/outliers).
- Large k: High bias, Low variance (smoother decision boundary, but may miss local patterns).
Pros and Cons
- Pros: Simple, effective for multi-class problems.
- Cons: Computationally expensive at prediction time; performance degrades with high dimensionality (Curse of Dimensionality).
Implementation
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2) # p=2 is Euclidean
knn.fit(X_train, y_train)
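A small sketch for choosing k empirically with cross-validation over a range of candidate values (the candidate list and 5-fold setup are illustrative; it assumes the same X_train/y_train as above):
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# Score each candidate k with 5-fold cross-validation on the training set
for k in [1, 3, 5, 7, 11, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(k, scores.mean())
# Very small k tends to overfit (high variance); very large k underfits (high bias)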
6. Support Vector Machine (SVM)
Concept
SVM finds the optimal hyperplane that separates classes with the maximum margin.
- Hyperplane: The decision boundary.
- Support Vectors: The data points closest to the hyperplane. These are the "difficult" points that define the margin.
- Margin: The distance between the hyperplane and the support vectors. SVM maximizes this distance to improve generalization.
Hard vs. Soft Margin
- Hard Margin: Assumes data is perfectly linearly separable (strict).
- Soft Margin: Allows some misclassification to handle outliers and non-linear data (controlled by the hyperparameter C).
  - High C: Strict (tries hard not to misclassify any point; risk of overfitting).
  - Low C: Loose (wider margin, allows some errors, smoother boundary).
The Kernel Trick
SVM can handle non-linearly separable data by projecting inputs into a higher-dimensional space where they become linearly separable.
- Common Kernels: Linear, Polynomial, RBF (Radial Basis Function).
Implementation
from sklearn.svm import SVC
# kernel='rbf' (the default) handles non-linear problems via the kernel trick
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
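A hedged sketch of the kernel trick in action, comparing a linear kernel against RBF on the non-linear make_moons toy dataset (dataset choice and noise level are illustrative):
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# Two interleaving half-moons: not linearly separable
X_m, y_m = make_moons(n_samples=300, noise=0.2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_m, y_m, random_state=0)
for kernel in ['linear', 'rbf']:
    svc_k = SVC(kernel=kernel, C=1.0, gamma='scale').fit(Xtr, ytr)
    # RBF typically scores noticeably higher here; support_vectors_ holds the margin-defining points
    print(kernel, svc_k.score(Xte, yte), len(svc_k.support_vectors_))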
7. Decision Tree
Concept
A Decision Tree builds a model in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets based on feature values.
- Root Node: Represents the entire population.
- Decision Node: A sub-node that splits into further sub-nodes.
- Leaf/Terminal Node: Nodes that do not split (holds the final class prediction).
Splitting Criteria
The algorithm selects the feature and threshold that result in the most homogeneous (pure) child nodes.
- Gini Impurity (Default in sklearn): Measures the probability of misclassifying a randomly chosen element. Lower is better.
- Entropy (Information Gain): Measures the amount of disorder. The split aims to maximize Information Gain (reduction in Entropy).
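A small sketch computing both impurity measures for a candidate node by hand (NumPy only; the class counts are made up for illustration):
import numpy as np
counts = np.array([40, 10])            # hypothetical class counts in a node
p = counts / counts.sum()              # class proportions
gini = 1 - np.sum(p ** 2)              # Gini impurity: 1 - sum(p_i^2)
entropy = -np.sum(p * np.log2(p))      # Entropy: -sum(p_i * log2(p_i))
print(gini, entropy)                   # ~0.32 and ~0.72; both are 0 for a perfectly pure node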
Pruning
Decision Trees are prone to overfitting (creating overly complex trees that memorize noise).
- Pre-pruning: Setting limits such as max_depth or min_samples_split while the tree is being grown.
Implementation
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
tree.fit(X_train, y_train)
8. Naïve Bayes
Concept
A probabilistic classifier based on Bayes' Theorem with a strong (naive) assumption of independence between features.
Bayes' Theorem
P(y | X) = P(X | y) · P(y) / P(X)
- P(y | X): Posterior probability (probability of class y given features X).
- P(X | y): Likelihood.
- P(y): Prior probability of the class.
- P(X): Predictor prior probability (evidence).
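A tiny worked example of the theorem with made-up numbers (a spam filter triggered by a single word; all probabilities below are illustrative assumptions):
# Hypothetical quantities for P(spam | word "offer")
p_word_given_spam = 0.60   # likelihood P(X | y)
p_spam = 0.20              # class prior P(y)
p_word = 0.18              # evidence P(X): how often "offer" appears in any email
posterior = p_word_given_spam * p_spam / p_word
print(posterior)           # ~0.67: seeing "offer" raises the spam probability from the 0.20 prior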
The "Naïve" Assumption
It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if this doesn't hold true in reality, the classifier often performs remarkably well.
Types of Naïve Bayes Classifiers
- Gaussian Naïve Bayes: Assumes features follow a normal distribution. Good for continuous data (e.g., Iris dataset).
- Multinomial Naïve Bayes: Used for discrete counts. Standard for text classification (word counts).
- Bernoulli Naïve Bayes: Used for binary/boolean features.
Implementation
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
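Since Multinomial Naïve Bayes is described above as the standard choice for text, here is a hedged sketch pairing it with CountVectorizer on a few made-up documents (the texts and labels are illustrative only):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Tiny illustrative corpus: 1 = spam, 0 = not spam
texts = ["free offer click now", "meeting agenda attached",
         "win a free prize now", "project status update"]
labels = [1, 0, 1, 0]
vec = CountVectorizer()                 # turn documents into word-count vectors
X_counts = vec.fit_transform(texts)
mnb = MultinomialNB().fit(X_counts, labels)
print(mnb.predict(vec.transform(["free prize offer"])))  # expected to lean towards spam (1)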