Unit 2 - Notes
INT394
Unit 2: Classification
1. Overview of Classification
1.1 Definition
Classification is a supervised learning task where the objective is to predict the categorical class labels (discrete values) of new instances based on past observations. The algorithm learns a mapping function $f: X \to Y$ from input variables $X$ to discrete output variables $Y$.
1.2 Key Terminology
- Input (Feature Vector): $\mathbf{x} = (x_1, x_2, \dots, x_n)$, representing $n$ features.
- Output (Label): $y \in \{c_1, c_2, \dots, c_K\}$, a discrete class label.
- Training Set: A labeled dataset $\{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_m, y_m)\}$ used to train the model.
- Classifier: An algorithm that maps input data to a specific category.
1.3 Types of Classification
- Binary Classification: The target has only two possible outcomes (e.g., Spam vs. Not Spam, 0 vs. 1).
- Multi-class Classification: The target has more than two classes (e.g., classifying images of digits 0–9).
- Multi-label Classification: An instance can be assigned multiple labels simultaneously (e.g., tagging a movie as both "Action" and "Comedy").
2. Decision Boundaries and their Properties
2.1 Concept
A decision boundary is a hypersurface that partitions the underlying vector space into two or more sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.
2.2 Properties
- Dimensionality:
- In 1D space (1 feature): The boundary is a point.
- In 2D space (2 features): The boundary is a line (linear) or a curve (non-linear).
- In 3D space: The boundary is a plane or surface.
- In $n$-dimensional space: The boundary is a hyperplane (linear) or a hypersurface (non-linear).
- Mathematical Representation:
The boundary is the set of points where the posterior probabilities of the classes are equal. For a binary classification between Class A and Class B:
$P(A \mid \mathbf{x}) = P(B \mid \mathbf{x})$
Or, using a discriminant function $g(\mathbf{x})$: the boundary is the set of points where $g(\mathbf{x}) = 0$.
2.3 Linear vs. Non-Linear Boundaries
- Linear Decision Boundary: Can be defined by a linear combination of features (e.g., $w_1 x_1 + w_2 x_2 + b = 0$). Algorithms: Logistic Regression, Linear SVM, Perceptron.
- Non-Linear Decision Boundary: Required when classes are not linearly separable. Algorithms: k-Nearest Neighbors (k-NN), Decision Trees, SVM with Kernel trick, Neural Networks.
3. Linear Classifier
3.1 Definition
A linear classifier makes its classification decision based on the value of a linear combination of the input features.
3.2 The Discriminant Function
A linear classifier makes predictions using the function:
$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$
Where:
- $\mathbf{w}$ is the weight vector (orthogonal to the decision boundary).
- $\mathbf{x}$ is the feature vector.
- $b$ is the bias (shifts the decision boundary away from the origin).
3.3 Prediction Rule
For binary classification ($y \in \{+1, -1\}$):
- If $f(\mathbf{x}) > 0$, predict Class +1.
- If $f(\mathbf{x}) < 0$, predict Class -1.
- If $f(\mathbf{x}) = 0$, the point lies exactly on the decision boundary.
3.4 Geometry
The decision boundary is defined by $\mathbf{w}^\top \mathbf{x} + b = 0$. The vector $\mathbf{w}$ determines the orientation of the boundary, and $b$ determines the distance of the hyperplane from the origin.
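A minimal sketch of this prediction rule, using hypothetical (not learned) weights and bias:

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration
w = np.array([2.0, -1.0])   # weight vector, orthogonal to the boundary
b = -0.5                    # bias, shifts the boundary away from the origin

def predict(x):
    """Return +1 or -1 based on the sign of the discriminant f(x) = w.x + b."""
    f = np.dot(w, x) + b
    return 1 if f > 0 else -1

print(predict(np.array([1.0, 0.5])))   # f = 2.0 - 0.5 - 0.5 = 1.0  -> +1
print(predict(np.array([0.0, 1.0])))   # f = -1.0 - 0.5 = -1.5      -> -1
```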
4. Multi-class Classification Strategies
Many linear classifiers (like SVM and Perceptron) are natively binary. To handle $K$ classes ($K > 2$), heuristic methods are used.
4.1 One-vs-All (OvA) / One-vs-Rest (OvR)
Strategy: Train $K$ distinct binary classifiers, one per class.
- Process: For each class $c_i$ (where $i = 1, \dots, K$), train a classifier to distinguish Class $c_i$ (positive) from all other classes combined (negative).
- Prediction: Given a new input $\mathbf{x}$, run all $K$ classifiers. The class with the highest confidence score (or highest probability) is selected.
- Pros: Efficient (only $K$ classifiers).
- Cons: Can suffer from class imbalance (the "Rest" class is usually much larger than the "One" class).
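A brief sketch of OvR using scikit-learn's OneVsRestClassifier; the Iris dataset and logistic-regression base model are placeholder choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)           # 3 classes -> 3 binary classifiers

# Each binary classifier separates one class from the other two combined
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))                 # 3 (K classifiers for K classes)
print(ovr.predict(X[:5]))                   # class with the highest score wins
```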
4.2 One-vs-One (OvO)
Strategy: Train a binary classifier for every pair of classes.
- Process: If there are $K$ classes, we train $K(K-1)/2$ classifiers. Each classifier distinguishes between Class $c_i$ and Class $c_j$.
- Prediction:
- Run the input through all classifiers.
- Use a Voting Scheme: If the classifier for the pair $(c_i, c_j)$ predicts $c_i$, Class $c_i$ gets a vote.
- The class with the most votes wins.
- Pros: Each classifier is trained on a smaller subset of data; less sensitive to imbalance.
- Cons: Computationally expensive for large $K$ (quadratic growth in the number of models).
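A matching sketch using scikit-learn's OneVsOneClassifier, again with placeholder data and base model:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)            # K = 3 classes

# Trains K(K-1)/2 = 3 pairwise classifiers; prediction is by majority vote
ovo = OneVsOneClassifier(LinearSVC())
ovo.fit(X, y)

print(len(ovo.estimators_))                  # 3 pairwise models
print(ovo.predict(X[:5]))                    # class with the most votes wins
```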
5. Probabilistic Approaches for Classification
Instead of outputting a hard class label directly, probabilistic classifiers output the probability that an instance belongs to a specific class: $P(y = c \mid \mathbf{x})$.
5.1 Generative vs. Discriminative Models
- Discriminative Models:
- Model $P(y \mid \mathbf{x})$ directly.
- Focus on finding the decision boundary.
- Examples: Logistic Regression, SVM.
- Generative Models:
- Model how the data is generated: $P(\mathbf{x} \mid y)$ (likelihood) and $P(y)$ (prior).
- Use Bayes theorem to calculate the posterior $P(y \mid \mathbf{x})$.
- Examples: Naïve Bayes, Gaussian Discriminant Analysis.
6. Bayes Theorem
6.1 The Theorem
Bayes Theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. In Machine Learning, it is used to update the probability for a hypothesis as more evidence becomes available.
6.2 Application to Classification
We want to find the class $c$ given features $\mathbf{x}$. Using Bayes Theorem:
$P(c \mid \mathbf{x}) = \dfrac{P(\mathbf{x} \mid c)\, P(c)}{P(\mathbf{x})}$
Where:
- $P(c \mid \mathbf{x})$ - Posterior Probability: The probability of the class $c$ given the observed data $\mathbf{x}$. (This is what we want to calculate.)
- $P(\mathbf{x} \mid c)$ - Likelihood: The probability of observing features $\mathbf{x}$ given that the class is $c$.
- $P(c)$ - Prior Probability: The initial probability of class $c$ before seeing any data (usually the frequency of class $c$ in the training set).
- $P(\mathbf{x})$ - Evidence (Marginal Likelihood): The total probability of observing the data $\mathbf{x}$. Since $P(\mathbf{x})$ is constant for all classes, it is often ignored during maximization.
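A small numeric sketch of these terms, using made-up spam-filter probabilities purely for illustration:

```python
# Hypothetical binary feature x = "message contains the word 'offer'"
p_spam = 0.3                       # prior P(spam)
p_ham = 0.7                        # prior P(ham)
p_x_given_spam = 0.8               # likelihood P(x | spam)
p_x_given_ham = 0.1                # likelihood P(x | ham)

# Evidence P(x) = sum over classes of likelihood * prior
p_x = p_x_given_spam * p_spam + p_x_given_ham * p_ham

# Posterior P(spam | x) via Bayes theorem
p_spam_given_x = p_x_given_spam * p_spam / p_x
print(round(p_spam_given_x, 3))    # about 0.774
```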
7. Bayesian Decision Theory
7.1 Concept
Bayesian Decision Theory is a fundamental statistical approach to the problem of pattern classification. It quantifies the trade-offs between various classification decisions using probability and costs.
7.2 MAP (Maximum A Posteriori) Estimation
To classify an observation $\mathbf{x}$, we select the class that maximizes the posterior probability:
$\hat{c} = \arg\max_{c} P(c \mid \mathbf{x})$
Using Bayes theorem (and ignoring the denominator $P(\mathbf{x})$):
$\hat{c} = \arg\max_{c} P(\mathbf{x} \mid c)\, P(c)$
7.3 Risk and Loss
Sometimes, misclassifying Class A as Class B is more costly than the reverse (e.g., diagnosing a healthy person as sick vs. diagnosing a sick person as healthy).
- Loss Function $\lambda(\alpha_i \mid c_j)$: The cost incurred when taking action $\alpha_i$ (predicting class $c_i$) when the true state of nature is $c_j$ (the true class).
- Expected Risk: The goal of Bayesian Decision Theory is to minimize the expected risk (total cost), $R(\alpha_i \mid \mathbf{x}) = \sum_{j} \lambda(\alpha_i \mid c_j)\, P(c_j \mid \mathbf{x})$.
If we assume a Zero-One Loss function (loss is 0 for correct classification, 1 for error), minimizing risk is equivalent to maximizing the posterior probability (MAP).
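A toy sketch of expected-risk minimization with an asymmetric loss; the posteriors and costs below are assumed values for illustration only:

```python
# Posterior probabilities for one patient (assumed values)
posteriors = {"healthy": 0.7, "sick": 0.3}

# loss[action][true_class]: cost of predicting `action` when the truth is `true_class`
loss = {
    "healthy": {"healthy": 0, "sick": 10},   # missing a sick patient is very costly
    "sick":    {"healthy": 1, "sick": 0},
}

# Expected risk of each action: R(action) = sum over c of loss(action, c) * P(c | x)
risk = {a: sum(loss[a][c] * posteriors[c] for c in posteriors) for a in loss}
print(risk)                      # {'healthy': 3.0, 'sick': 0.7}
print(min(risk, key=risk.get))   # 'sick': minimizing risk overrides the MAP class here
```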
8. Naïve Bayes Classifier
8.1 The "Naïve" Assumption
Calculating the joint likelihood $P(x_1, x_2, \dots, x_n \mid c)$ directly is computationally infeasible for high-dimensional data because it requires an exponential amount of data.
Naïve Bayes makes a strong independence assumption:
It assumes that all features $x_i$ are mutually independent given the class label $c$.
8.2 Mathematical Formulation
Due to the independence assumption:
$P(x_1, x_2, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$
Therefore, the classification rule becomes:
$\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(x_i \mid c)$
8.3 Types of Naïve Bayes Classifiers
The type depends on the assumed distribution of $P(x_i \mid c)$:
- Gaussian Naïve Bayes:
- Used for continuous data.
- Assumes features follow a normal (Gaussian) distribution.
- Parameters learned: Mean $\mu_c$ and Variance $\sigma_c^2$ for each class/feature pair.
- Likelihood: $P(x_i \mid c) = \dfrac{1}{\sqrt{2\pi\sigma_c^2}} \exp\!\left(-\dfrac{(x_i - \mu_c)^2}{2\sigma_c^2}\right)$ (see the from-scratch sketch after this list).
- Multinomial Naïve Bayes:
- Used for discrete counts (e.g., text classification/word counts).
- Features represent counts or frequencies.
- Bernoulli Naïve Bayes:
- Used for binary features (0 or 1).
- Example: Does the word "buy" appear in the email? (Yes/No).
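Below is a minimal from-scratch sketch of Gaussian Naïve Bayes on toy data, estimating the prior, mean, and variance per class and applying the arg-max rule from Section 8.2 (a real library adds variance smoothing and other safeguards):

```python
import numpy as np

# Toy training data: 2 continuous features, binary class labels
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],    # class 0
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.4]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
priors = {c: np.mean(y == c) for c in classes}            # P(c)
means  = {c: X[y == c].mean(axis=0) for c in classes}     # mu per class/feature
vars_  = {c: X[y == c].var(axis=0) for c in classes}      # sigma^2 per class/feature

def log_gaussian(x, mu, var):
    """Log of the Gaussian likelihood P(x_i | c) for each feature."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(x):
    # arg max over classes of log P(c) + sum_i log P(x_i | c)
    scores = {c: np.log(priors[c]) + log_gaussian(x, means[c], vars_[c]).sum()
              for c in classes}
    return max(scores, key=scores.get)

print(predict(np.array([1.1, 2.0])))   # expected: 0
print(predict(np.array([3.1, 0.6])))   # expected: 1
```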
8.4 The Zero-Frequency Problem (Laplace Smoothing)
If a categorical feature value occurs in the test set but was never seen in the training set for a specific class, then $P(x_i \mid c) = 0$. This zeros out the entire probability product.
Solution (Laplace / add-one smoothing): Add a small constant $\alpha$ (usually 1) to the count of every feature value: $P(x_i \mid c) = \dfrac{\text{count}(x_i, c) + \alpha}{\text{count}(c) + \alpha V}$, where $V$ is the number of possible feature values.
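A small sketch of the smoothed estimate, with assumed counts from a hypothetical text classifier:

```python
# Hypothetical counts for the word "lottery" in the spam class
count_word_in_spam = 0        # never seen in training spam messages
total_words_in_spam = 500
vocabulary_size = 10000       # number of distinct feature values (words), V

# Unsmoothed estimate is 0, which would zero out the whole product
p_unsmoothed = count_word_in_spam / total_words_in_spam

# Laplace (add-1) smoothing: add alpha to every count, add alpha*V to the denominator
alpha = 1
p_smoothed = (count_word_in_spam + alpha) / (total_words_in_spam + alpha * vocabulary_size)

print(p_unsmoothed)           # 0.0
print(round(p_smoothed, 6))   # about 9.5e-05, small but non-zero
```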
8.5 Advantages and Disadvantages
- Advantages:
- Very fast training and prediction.
- Works well with high-dimensional data (e.g., text).
- Performs well with small training data.
- Disadvantages:
- The assumption of feature independence is rarely true in real life (e.g., in text, "Hong" is likely followed by "Kong"). However, it often performs well despite this.