Unit 2 - Notes

INT395

Unit 2: Classification with scikit-learn

1. Overview of Classification

Definition

Classification is a subcategory of Supervised Learning where the goal is to predict the categorical class labels of new instances based on past observations. The algorithm learns a mapping function from input variables to discrete output variables.

Types of Classification

  1. Binary Classification: The target variable has only two distinct classes (e.g., Spam vs. Not Spam, 0 vs. 1).
  2. Multi-class Classification: The target variable has more than two classes, but each sample belongs to only one class (e.g., classifying digits 0–9).
  3. Multi-label Classification: Each sample is assigned a set of target labels (e.g., tagging a movie as both "Action" and "Sci-Fi").

The Scikit-learn API Workflow

Scikit-learn (sklearn) provides a consistent interface for machine learning models (estimators). The standard workflow, sketched end-to-end after this list, is:

  1. Import the model class.
  2. Instantiate the class with desired hyperparameters.
  3. Fit the model to the training data (model.fit(X_train, y_train)).
  4. Predict labels for new data (model.predict(X_test)).
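
A minimal sketch of this four-step workflow, assuming the Iris dataset and a LogisticRegression estimator purely for illustration:

PYTHON
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1-2. Import and instantiate the estimator
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(max_iter=200)

# 3. Fit on the training data
model.fit(X_train, y_train)

# 4. Predict labels for unseen data
y_pred = model.predict(X_test)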

2. Evaluation Metrics

Evaluating the performance of a classification model is critical to understanding how well it generalizes to unseen data.

Confusion Matrix

A tabular summary of the number of correct and incorrect predictions broken down by each class.

                        Predicted Negative (0)    Predicted Positive (1)
Actual Negative (0)     True Negative (TN)        False Positive (FP)
Actual Positive (1)     False Negative (FN)       True Positive (TP)
  • TP: Correctly predicted positive.
  • TN: Correctly predicted negative.
  • FP (Type I Error): Incorrectly predicted positive.
  • FN (Type II Error): Incorrectly predicted negative.

Core Metrics

1. Accuracy

The ratio of correctly predicted observations to the total observations: Accuracy = (TP + TN) / (TP + TN + FP + FN).

  • Limitation: Misleading in imbalanced datasets (e.g., if 99% of data is Class A, a model predicting "All A" has 99% accuracy but is useless).

2. Precision

The ratio of correctly predicted positive observations to the total predicted positive observations: Precision = TP / (TP + FP). It measures the quality of the positive predictions.

  • Use case: Spam detection (we want to minimize FP, marking valid email as spam).

3. Recall (Sensitivity / True Positive Rate)

The ratio of correctly predicted positive observations to all observations in the actual positive class: Recall = TP / (TP + FN).

  • Use case: Cancer diagnosis (we want to minimize FN, missing a sick patient).

4. F1-Score

The harmonic mean of Precision and Recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It balances the two metrics.

  • Use case: Comparing models when dataset is imbalanced.
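
The short sketch below computes all four metrics with scikit-learn on a pair of made-up label vectors (y_true and y_pred are hypothetical, chosen only to illustrate the formulas):

PYTHON
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]        # TP=3, TN=2, FP=2, FN=1

print(accuracy_score(y_true, y_pred))    # (TP + TN) / total  = 0.625
print(precision_score(y_true, y_pred))   # TP / (TP + FP)     = 0.6
print(recall_score(y_true, y_pred))      # TP / (TP + FN)     = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean      ≈ 0.667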

ROC-AUC

  • ROC (Receiver Operating Characteristic) Curve: A plot of the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) at various threshold settings.
  • AUC (Area Under the Curve): Summarizes the model's ability to separate the classes across all thresholds.
    • AUC = 1.0: Perfect classifier.
    • AUC = 0.5: Random classifier (no discrimination capacity).
    • AUC < 0.5: Worse than random (the model's predictions are systematically inverted).

PYTHON
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Example usage (assumes `model` has already been fitted on X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# ROC-AUC is computed from scores/probabilities rather than hard labels; this
# assumes a binary problem and an estimator that supports predict_proba
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))


3. The Perceptron

Concept

The Perceptron is the simplest type of artificial neural network (ANN). It is a linear classifier inspired by the biological neuron.

  • Mechanism: It takes the input features, multiplies them by their weights, adds a bias term (b), and passes the result through a step activation function.
  • Equation: z = w1·x1 + w2·x2 + ... + wn·xn + b = w·x + b
  • Decision:
    • If z ≥ 0, Output = 1
    • If z < 0, Output = 0

Learning Rule

The weights are updated iteratively based on the error:

w_j := w_j + η · (y − ŷ) · x_j        b := b + η · (y − ŷ)

(where η is the learning rate, y the true label, and ŷ the predicted label)
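
A bare-bones sketch of a single update step in NumPy (the learning rate, initial weights, and the training sample below are hypothetical, purely to show the rule):

PYTHON
import numpy as np

eta = 0.1                              # learning rate (hypothetical value)
w, b = np.array([-0.5, 0.1]), 0.0      # current weights and bias
x, y = np.array([1.0, 2.0]), 1         # one made-up training sample

y_hat = int(np.dot(w, x) + b >= 0)     # step activation: here z = -0.3, so y_hat = 0
w = w + eta * (y - y_hat) * x          # weights move toward the misclassified point
b = b + eta * (y - y_hat)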

Limitations

The single-layer Perceptron can only classify linearly separable data (it cannot solve the XOR problem).
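
A quick way to see this limitation is to fit scikit-learn's Perceptron on the four XOR points (toy data for illustration); no linear classifier can get all four right:

PYTHON
from sklearn.linear_model import Perceptron

X_xor = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_xor = [0, 1, 1, 0]

p = Perceptron(random_state=0).fit(X_xor, y_xor)
print(p.score(X_xor, y_xor))   # stays below 1.0 -- XOR is not linearly separable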

Implementation

PYTHON
from sklearn.linear_model import Perceptron
clf = Perceptron(tol=1e-3, random_state=0)
clf.fit(X_train, y_train)


4. Logistic Regression

Concept

Despite its name, this is a classification algorithm. It estimates the probability that an instance belongs to a particular class.

  • Sigmoid Function: Unlike linear regression, which outputs continuous values, Logistic Regression applies the Sigmoid (Logistic) function to squash the output between 0 and 1 (sketched below).
  • Decision Boundary: A threshold (usually 0.5) is applied to the probability to decide the class.
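
A minimal NumPy sketch of the squashing and thresholding (the values of z are arbitrary examples):

PYTHON
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 2.5])          # arbitrary linear-model outputs
probs = sigmoid(z)                      # squashed into (0, 1): ~[0.047, 0.5, 0.924]
labels = (probs >= 0.5).astype(int)     # threshold at 0.5 -> [0, 1, 1]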

Key Features

  • Probabilistic Output: Provides confidence scores, not just classes.
  • Linear Decision Boundary: It creates a linear separation in the feature space.
  • Regularization: Scikit-learn implements L2 (Ridge) regularization by default to prevent overfitting.

Implementation

PYTHON
from sklearn.linear_model import LogisticRegression
# C is the inverse of regularization strength (smaller C = stronger regularization)
clf = LogisticRegression(C=1.0, solver='lbfgs')
clf.fit(X_train, y_train)


5. k-Nearest Neighbors (k-NN)

Concept

k-NN is a non-parametric, lazy learning algorithm. It makes no assumptions about the underlying data distribution and does not "learn" a model during training. Instead, it stores the training dataset.

Mechanism

To classify a new data point (a hand-rolled sketch follows this list):

  1. Calculate the distance (Euclidean, Manhattan, etc.) between the new point and all stored training points.
  2. Identify the k nearest neighbors.
  3. Assign the class label based on a majority vote of those neighbors.
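
A hand-rolled sketch of these three steps for a single query point, assuming Euclidean distance and the tiny toy arrays below (illustrative only):

PYTHON
import numpy as np
from collections import Counter

X_stored = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_stored = np.array([0, 0, 1, 1])
query = np.array([1.1, 0.9])
k = 3

dists = np.linalg.norm(X_stored - query, axis=1)           # 1. distances to all stored points
nearest = np.argsort(dists)[:k]                            # 2. indices of the k nearest
label = Counter(y_stored[nearest]).most_common(1)[0][0]    # 3. majority vote -> 0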

Choosing k

  • Small k: Low bias, High variance (sensitive to noise/outliers).
  • Large k: High bias, Low variance (smoother decision boundary, but may miss local patterns).

Pros and Cons

  • Pros: Simple, effective for multi-class problems.
  • Cons: Computationally expensive at prediction time; performance degrades with high dimensionality (Curse of Dimensionality).

Implementation

PYTHON
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2) # p=2 is Euclidean
knn.fit(X_train, y_train)


6. Support Vector Machine (SVM)

Concept

SVM finds the optimal hyperplane that separates classes with the maximum margin.

  • Hyperplane: The decision boundary.
  • Support Vectors: The data points closest to the hyperplane. These are the "difficult" points that define the margin.
  • Margin: The distance between the hyperplane and the support vectors. SVM maximizes this distance to improve generalization.

Hard vs. Soft Margin

  • Hard Margin: Assumes data is perfectly linearly separable (strict).
  • Soft Margin: Allows some misclassification to handle outliers and non-linear data (controlled by hyperparameter C).
    • High C: Strict (heavily penalizes misclassified training points; narrower margin, risk of overfitting).
    • Low C: Loose (wider margin, allows errors, smoother boundary).

The Kernel Trick

SVM can handle non-linearly separable data by projecting inputs into a higher-dimensional space where they become linearly separable.

  • Common Kernels: Linear, Polynomial, RBF (Radial Basis Function).
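
As a rough illustration, the snippet below compares a linear and an RBF kernel on concentric-circle data generated with make_circles (the dataset and parameters are chosen only to make the contrast visible):

PYTHON
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    acc = SVC(kernel=kernel, C=1.0, gamma="scale").fit(X_tr, y_tr).score(X_te, y_te)
    print(kernel, acc)   # the RBF kernel should score far higher on this data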

Implementation

PYTHON
from sklearn.svm import SVC
# kernel='rbf' (the default) handles non-linear decision boundaries
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)


7. Decision Tree

Concept

A Decision Tree builds a model in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets based on feature values.

  • Root Node: Represents the entire population.
  • Decision Node: A sub-node that splits into further sub-nodes.
  • Leaf/Terminal Node: Nodes that do not split (holds the final class prediction).

Splitting Criteria

The algorithm selects the feature and threshold that result in the most homogeneous (pure) child nodes; a worked Gini example follows the list below.

  1. Gini Impurity (Default in sklearn): Measures the probability of misclassifying a randomly chosen element. Lower is better.
  2. Entropy (Information Gain): Measures the amount of disorder. The split aims to maximize Information Gain (reduction in Entropy).
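
A small worked Gini example, assuming a two-class node whose class counts are made up for illustration:

PYTHON
def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini([8, 2]))    # 1 - (0.8^2 + 0.2^2) = 0.32  (fairly pure)
print(gini([5, 5]))    # 0.5 -- maximally impure for two classes
print(gini([10, 0]))   # 0.0 -- perfectly pure leaf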

Pruning

Decision Trees are prone to overfitting (creating overly complex trees that memorize noise).

  • Pre-pruning: Setting limits like max_depth or min_samples_split.

Implementation

PYTHON
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
tree.fit(X_train, y_train)


8. Naïve Bayes

Concept

A probabilistic classifier based on Bayes' Theorem with a strong (naive) assumption of independence between features.

Bayes' Theorem

P(y | X) = [ P(X | y) · P(y) ] / P(X)

  • P(y | X): Posterior probability (probability of class y given features X).
  • P(X | y): Likelihood.
  • P(y): Prior probability of the class.
  • P(X): Predictor prior probability (evidence).

The "Naïve" Assumption

It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if this doesn't hold true in reality, the classifier often performs remarkably well.

Types of Naïve Bayes Classifiers

  1. Gaussian Naïve Bayes: Assumes features follow a normal distribution. Good for continuous data (e.g., Iris dataset).
  2. Multinomial Naïve Bayes: Used for discrete counts. Standard for text classification using word counts (see the sketch after this list).
  3. Bernoulli Naïve Bayes: Used for binary/boolean features.
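
A minimal text-classification sketch with Multinomial Naïve Bayes (the two training sentences and their spam labels are made up purely for illustration):

PYTHON
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free prize money now", "meeting agenda for monday"]
labels = [1, 0]                                    # 1 = spam, 0 = not spam (hypothetical)

vec = CountVectorizer()
X_counts = vec.fit_transform(docs)                 # word-count features
clf = MultinomialNB().fit(X_counts, labels)
print(clf.predict(vec.transform(["free money"])))  # expected to lean towards [1]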

Implementation

PYTHON
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)