Unit 2 - Notes
INT394
Unit 2: Classification
1. Overview of Classification
1.1 Definition
Classification is a supervised learning task where the objective is to predict the categorical class labels (discrete values) of new instances based on past observations. The algorithm learns a mapping function $f: X \to Y$ from input variables $X$ to discrete output variables $Y$.
1.2 Key Terminology
- Input (Feature Vector): $\mathbf{x} = (x_1, x_2, \dots, x_n)$, representing $n$ features.
- Output (Label): $y \in \{c_1, c_2, \dots, c_K\}$, a discrete class label.
- Training Set: A labeled dataset $\{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_m, y_m)\}$ used to train the model.
- Classifier: An algorithm that maps input data to a specific category.
1.3 Types of Classification
- Binary Classification: The target has only two possible outcomes (e.g., Spam vs. Not Spam, 0 vs. 1).
- Multi-class Classification: The target has more than two classes (e.g., classifying images of digits 0–9).
- Multi-label Classification: An instance can be assigned multiple labels simultaneously (e.g., tagging a movie as both "Action" and "Comedy").
2. Decision Boundaries and their Properties
2.1 Concept
A decision boundary is a hypersurface that partitions the underlying vector space into two or more sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.
2.2 Properties
- Dimensionality:
- In 1D space (1 feature): The boundary is a point.
- In 2D space (2 features): The boundary is a line (linear) or a curve (non-linear).
- In 3D space: The boundary is a plane or surface.
- In $n$-dimensional space: The boundary is a hyperplane (linear) or a hypersurface (non-linear).
- Mathematical Representation:
The boundary is the set of points where the posterior probabilities of the classes are equal. For a binary classification between Class A and Class B:
$P(A \mid \mathbf{x}) = P(B \mid \mathbf{x})$
Or, using a discriminant function $g(\mathbf{x})$: the boundary is the set of points where $g(\mathbf{x}) = 0$.
2.3 Linear vs. Non-Linear Boundaries
- Linear Decision Boundary: Can be defined by a linear combination of features (e.g., $w_1 x_1 + w_2 x_2 + b = 0$). Algorithms: Logistic Regression, Linear SVM, Perceptron.
- Non-Linear Decision Boundary: Required when classes are not linearly separable. Algorithms: k-Nearest Neighbors (k-NN), Decision Trees, SVM with Kernel trick, Neural Networks.
3. Linear Classifier
3.1 Definition
A linear classifier makes its classification decision based on the value of a linear combination of the input features.
3.2 The Discriminant Function
A linear classifier makes predictions using the function:
$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$
Where:
- $\mathbf{w}$ is the weight vector (orthogonal to the decision boundary).
- $\mathbf{x}$ is the feature vector.
- $b$ is the bias (shifts the decision boundary away from the origin).
3.3 Prediction Rule
For binary classification ($y \in \{+1, -1\}$):
- If $f(\mathbf{x}) > 0$, predict Class +1.
- If $f(\mathbf{x}) < 0$, predict Class -1.
- If $f(\mathbf{x}) = 0$, the point lies exactly on the decision boundary.
3.4 Geometry
The decision boundary is defined by $\mathbf{w}^\top \mathbf{x} + b = 0$. The vector $\mathbf{w}$ determines the orientation of the boundary, and $b$ determines the distance of the hyperplane from the origin.
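A minimal sketch of this prediction rule, using hypothetical (not learned) weights and bias:

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration
w = np.array([2.0, -1.0])   # weight vector, orthogonal to the boundary
b = -0.5                    # bias, shifts the boundary away from the origin

def predict(x):
    """Return +1 or -1 based on the sign of the discriminant f(x) = w.x + b."""
    f = np.dot(w, x) + b
    return 1 if f > 0 else -1

print(predict(np.array([1.0, 0.5])))   # f = 2.0 - 0.5 - 0.5 = 1.0  -> +1
print(predict(np.array([0.0, 1.0])))   # f = -1.0 - 0.5 = -1.5      -> -1
```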
4. Multi-class Classification Strategies
Many linear classifiers (like SVM and Perceptron) are natively binary. To handle $K$ classes ($K > 2$), heuristic methods are used.
4.1 One-vs-All (OvA) / One-vs-Rest (OvR)
Strategy: Train $K$ distinct binary classifiers, one per class.
- Process: For each class $c_i$ (where $i = 1, \dots, K$), train a classifier to distinguish Class $c_i$ (positive) from all other classes combined (negative).
- Prediction: Given a new input $\mathbf{x}$, run all $K$ classifiers. The class with the highest confidence score (or highest probability) is selected.
- Pros: Efficient (only $K$ classifiers).
- Cons: Can suffer from class imbalance (the "Rest" class is usually much larger than the "One" class).
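A brief sketch of OvR using scikit-learn's OneVsRestClassifier; the Iris dataset and logistic-regression base model are placeholder choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)           # 3 classes -> 3 binary classifiers

# Each binary classifier separates one class from the other two combined
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))                 # 3 (K classifiers for K classes)
print(ovr.predict(X[:5]))                   # class with the highest score wins
```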
4.2 One-vs-One (OvO)
Strategy: Train a binary classifier for every pair of classes.
- Process: If there are $K$ classes, we train $K(K-1)/2$ classifiers. Each classifier distinguishes between Class $c_i$ and Class $c_j$.
- Prediction:
- Run the input through all classifiers.
- Use a Voting Scheme: If the classifier for the pair $(c_i, c_j)$ predicts $c_i$, Class $c_i$ gets a vote.
- The class with the most votes wins.
- Pros: Each classifier is trained on a smaller subset of data; less sensitive to imbalance.
- Cons: Computationally expensive for large $K$ (quadratic growth in the number of models).
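A matching sketch using scikit-learn's OneVsOneClassifier, again with placeholder data and base model:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)            # K = 3 classes

# Trains K(K-1)/2 = 3 pairwise classifiers; prediction is by majority vote
ovo = OneVsOneClassifier(LinearSVC())
ovo.fit(X, y)

print(len(ovo.estimators_))                  # 3 pairwise models
print(ovo.predict(X[:5]))                    # class with the most votes wins
```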
5. Probabilistic Approaches for Classification
Instead of outputting a hard class label directly, probabilistic classifiers output the probability that an instance belongs to a specific class: $P(y = c \mid \mathbf{x})$.
5.1 Generative vs. Discriminative Models
- Discriminative Models:
- Model $P(y \mid \mathbf{x})$ directly.
- Focus on finding the decision boundary.
- Examples: Logistic Regression, SVM.
- Generative Models:
- Model how the data is generated: $P(\mathbf{x} \mid y)$ (likelihood) and $P(y)$ (prior).
- Use Bayes theorem to calculate the posterior $P(y \mid \mathbf{x})$.
- Examples: Naïve Bayes, Gaussian Discriminant Analysis.
6. Bayes Theorem
6.1 The Theorem
Bayes Theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. In Machine Learning, it is used to update the probability for a hypothesis as more evidence becomes available.
6.2 Application to Classification
We want to find the class $c$ given features $\mathbf{x}$. Using Bayes Theorem:
$P(c \mid \mathbf{x}) = \dfrac{P(\mathbf{x} \mid c)\, P(c)}{P(\mathbf{x})}$
Where:
- $P(c \mid \mathbf{x})$ - Posterior Probability: The probability of the class $c$ given the observed data $\mathbf{x}$. (This is what we want to calculate.)
- $P(\mathbf{x} \mid c)$ - Likelihood: The probability of observing features $\mathbf{x}$ given that the class is $c$.
- $P(c)$ - Prior Probability: The initial probability of class $c$ before seeing any data (usually the frequency of class $c$ in the training set).
- $P(\mathbf{x})$ - Evidence (Marginal Likelihood): The total probability of observing the data $\mathbf{x}$. Since $P(\mathbf{x})$ is constant for all classes, it is often ignored during maximization.
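A small numeric sketch of these terms, using made-up spam-filter probabilities purely for illustration:

```python
# Hypothetical binary feature x = "message contains the word 'offer'"
p_spam = 0.3                       # prior P(spam)
p_ham = 0.7                        # prior P(ham)
p_x_given_spam = 0.8               # likelihood P(x | spam)
p_x_given_ham = 0.1                # likelihood P(x | ham)

# Evidence P(x) = sum over classes of likelihood * prior
p_x = p_x_given_spam * p_spam + p_x_given_ham * p_ham

# Posterior P(spam | x) via Bayes theorem
p_spam_given_x = p_x_given_spam * p_spam / p_x
print(round(p_spam_given_x, 3))    # about 0.774
```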
7. Bayesian Decision Theory
7.1 Concept
Bayesian Decision Theory is a fundamental statistical approach to the problem of pattern classification. It quantifies the trade-offs between various classification decisions using probability and costs.
7.2 MAP (Maximum A Posteriori) Estimation
To classify an observation $\mathbf{x}$, we select the class that maximizes the posterior probability:
$\hat{c} = \arg\max_{c} P(c \mid \mathbf{x})$
Using Bayes theorem (and ignoring the denominator $P(\mathbf{x})$):
$\hat{c} = \arg\max_{c} P(\mathbf{x} \mid c)\, P(c)$
7.3 Risk and Loss
Sometimes, misclassifying Class A as Class B is more costly than the reverse (e.g., diagnosing a healthy person as sick vs. diagnosing a sick person as healthy).
- Loss Function $\lambda(\alpha_i \mid c_j)$: The cost incurred when taking action $\alpha_i$ (predicting class $c_i$) when the true state of nature is $c_j$ (the true class).
- Expected Risk: The goal of Bayesian Decision Theory is to minimize the expected risk (total cost), $R(\alpha_i \mid \mathbf{x}) = \sum_{j} \lambda(\alpha_i \mid c_j)\, P(c_j \mid \mathbf{x})$.
If we assume a Zero-One Loss function (loss is 0 for correct classification, 1 for error), minimizing risk is equivalent to maximizing the posterior probability (MAP).
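A toy sketch of expected-risk minimization with an asymmetric loss; the posteriors and costs below are assumed values for illustration only:

```python
# Posterior probabilities for one patient (assumed values)
posteriors = {"healthy": 0.7, "sick": 0.3}

# loss[action][true_class]: cost of predicting `action` when the truth is `true_class`
loss = {
    "healthy": {"healthy": 0, "sick": 10},   # missing a sick patient is very costly
    "sick":    {"healthy": 1, "sick": 0},
}

# Expected risk of each action: R(action) = sum over c of loss(action, c) * P(c | x)
risk = {a: sum(loss[a][c] * posteriors[c] for c in posteriors) for a in loss}
print(risk)                      # {'healthy': 3.0, 'sick': 0.7}
print(min(risk, key=risk.get))   # 'sick': minimizing risk overrides the MAP class here
```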
8. Naïve Bayes Classifier
8.1 The "Naïve" Assumption
Calculating the joint likelihood $P(x_1, x_2, \dots, x_n \mid c)$ directly is computationally infeasible for high-dimensional data because it requires an exponential amount of data.
Naïve Bayes makes a strong independence assumption:
It assumes that all features $x_i$ are mutually independent given the class label $c$.
8.2 Mathematical Formulation
Due to the independence assumption:
$P(x_1, x_2, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$
Therefore, the classification rule becomes:
$\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(x_i \mid c)$
8.3 Types of Naïve Bayes Classifiers
The type depends on the assumed distribution of $P(x_i \mid c)$:
- Gaussian Naïve Bayes:
- Used for continuous data.
- Assumes features follow a normal (Gaussian) distribution.
- Parameters learned: Mean $\mu_c$ and Variance $\sigma_c^2$ for each class/feature pair.
- Likelihood: $P(x_i \mid c) = \dfrac{1}{\sqrt{2\pi\sigma_c^2}} \exp\!\left(-\dfrac{(x_i - \mu_c)^2}{2\sigma_c^2}\right)$ (see the from-scratch sketch after this list).
- Multinomial Naïve Bayes:
- Used for discrete counts (e.g., text classification/word counts).
- Features represent counts or frequencies.
- Bernoulli Naïve Bayes:
- Used for binary features (0 or 1).
- Example: Does the word "buy" appear in the email? (Yes/No).
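Below is a minimal from-scratch sketch of Gaussian Naïve Bayes on toy data, estimating the prior, mean, and variance per class and applying the arg-max rule from Section 8.2 (a real library adds variance smoothing and other safeguards):

```python
import numpy as np

# Toy training data: 2 continuous features, binary class labels
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],    # class 0
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.4]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
priors = {c: np.mean(y == c) for c in classes}            # P(c)
means  = {c: X[y == c].mean(axis=0) for c in classes}     # mu per class/feature
vars_  = {c: X[y == c].var(axis=0) for c in classes}      # sigma^2 per class/feature

def log_gaussian(x, mu, var):
    """Log of the Gaussian likelihood P(x_i | c) for each feature."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(x):
    # arg max over classes of log P(c) + sum_i log P(x_i | c)
    scores = {c: np.log(priors[c]) + log_gaussian(x, means[c], vars_[c]).sum()
              for c in classes}
    return max(scores, key=scores.get)

print(predict(np.array([1.1, 2.0])))   # expected: 0
print(predict(np.array([3.1, 0.6])))   # expected: 1
```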
8.4 The Zero-Frequency Problem (Laplace Smoothing)
If a categorical feature value occurs in the test set but was never seen in the training set for a specific class, then $P(x_i \mid c) = 0$. This zeros out the entire probability product.
Solution (Laplace / add-one smoothing): Add a small constant $\alpha$ (usually 1) to the count of every feature value: $P(x_i \mid c) = \dfrac{\text{count}(x_i, c) + \alpha}{\text{count}(c) + \alpha V}$, where $V$ is the number of possible feature values.
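A small sketch of the smoothed estimate, with assumed counts from a hypothetical text classifier:

```python
# Hypothetical counts for the word "lottery" in the spam class
count_word_in_spam = 0        # never seen in training spam messages
total_words_in_spam = 500
vocabulary_size = 10000       # number of distinct feature values (words), V

# Unsmoothed estimate is 0, which would zero out the whole product
p_unsmoothed = count_word_in_spam / total_words_in_spam

# Laplace (add-1) smoothing: add alpha to every count, add alpha*V to the denominator
alpha = 1
p_smoothed = (count_word_in_spam + alpha) / (total_words_in_spam + alpha * vocabulary_size)

print(p_unsmoothed)           # 0.0
print(round(p_smoothed, 6))   # about 9.5e-05, small but non-zero
```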
8.5 Advantages and Disadvantages
- Advantages:
- Very fast training and prediction.
- Works well with high-dimensional data (e.g., text).
- Performs well with small training data.
- Disadvantages:
- The assumption of feature independence is rarely true in real life (e.g., in text, "Hong" is likely followed by "Kong"). However, it often performs well despite this.