Unit 4 - Notes
INT344
Unit 4: Natural Language Processing with Classification Models
This unit focuses on Supervised Machine Learning techniques for text classification, specifically Sentiment Analysis. It covers the transition from raw text to numerical representation and the mathematical foundations of two primary algorithms: Logistic Regression and Naïve Bayes.
1. Extract Features from Text into Numerical Vectors
Machine learning models cannot process raw text strings directly; they require numerical input. Feature extraction is the process of transforming text into a numerical vector representation.
A. Vocabulary and Sparse Representation
- Vocabulary ($V$): The list of all unique words found in the training corpus.
- One-Hot Encoding: A simplistic approach where every word is represented as a vector of length $|V|$. This results in a "sparse" matrix (mostly zeros), which is computationally expensive and inefficient for large vocabularies (a toy illustration follows).
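As a toy illustration (with a made-up 5-word vocabulary), a one-hot vector has length $|V|$ and is almost entirely zeros:

```python
# Toy illustration of one-hot encoding (hypothetical 5-word vocabulary).
vocab = ["i", "am", "happy", "sad", "learning"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length |V| with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("happy"))  # [0, 0, 1, 0, 0] -- mostly zeros, hence "sparse"
```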
B. Feature Extraction with Frequencies
To create a dense representation for classification (specifically for binary sentiment analysis), we map words to their frequencies based on class labels.
- Frequency Dictionary: A mapping that counts how many times a specific word appears in a specific class.
  - Structure: {(word, label): count}
  - Example: {("happy", 1): 50, ("happy", 0): 2} (the word "happy" appears 50 times in positive tweets and 2 times in negative tweets).
- Feature Vector ($x$): For a given text input (e.g., a tweet), we extract a vector of dimension 3 (bias, positive sum, negative sum). A minimal extraction sketch follows this list.
  - $x_0 = 1$: The Bias unit.
  - $x_1$: The sum of the frequencies of the tweet's words in the Positive class.
  - $x_2$: The sum of the frequencies of the tweet's words in the Negative class.
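A minimal Python sketch of the frequency dictionary and the 3-dimensional feature extraction described above (the helper names build_freqs and extract_features are illustrative, and the text is assumed to be already preprocessed):

```python
from collections import defaultdict

def build_freqs(tweets, labels):
    """Build the {(word, label): count} frequency dictionary from a labeled corpus."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in tweet.lower().split():
            freqs[(word, label)] += 1
    return freqs

def extract_features(tweet, freqs):
    """Map a tweet to the 3-dimensional vector [bias, positive sum, negative sum]."""
    x = [1.0, 0.0, 0.0]                      # x0 = 1 is the bias unit
    for word in tweet.lower().split():
        x[1] += freqs.get((word, 1), 0)      # frequency of the word in the positive class
        x[2] += freqs.get((word, 0), 0)      # frequency of the word in the negative class
    return x

# Example usage on a toy corpus
tweets = ["i am happy", "i am sad"]
labels = [1, 0]
freqs = build_freqs(tweets, labels)
print(extract_features("happy happy i", freqs))  # [1.0, 3.0, 1.0]
```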
2. Binary Classifier using Logistic Regression
Logistic Regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. In NLP, it is used for Binary Classification (e.g., Positive vs. Negative, Spam vs. Not Spam).
A. The Sigmoid Function
Unlike Linear Regression, which outputs continuous values (from $-\infty$ to $+\infty$), Logistic Regression must output a probability between 0 and 1. This is achieved using the Sigmoid Function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

- If $z = 0$, $\sigma(z) = 0.5$
- If $z \to +\infty$, $\sigma(z) \to 1$
- If $z \to -\infty$, $\sigma(z) \to 0$
B. The Decision Boundary
We define a threshold (usually 0.5) to classify the output:
- Predict $\hat{y} = 1$ (Positive) if $\sigma(z) \geq 0.5$
- Predict $\hat{y} = 0$ (Negative) if $\sigma(z) < 0.5$
C. Mathematical Definition
The variable $z$ is the dot product of the Weight Vector ($\theta$) and the Feature Vector ($x$):

$$z = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2$$
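As a minimal sketch of how these pieces fit together (reusing the hypothetical extract_features representation from Section 1; the weights below are illustrative, not trained values):

```python
import math

def sigmoid(z):
    """Squash z from (-inf, +inf) into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, theta):
    """Return the predicted probability h = sigmoid(theta . x) for one feature vector."""
    z = sum(t * xi for t, xi in zip(theta, x))    # dot product theta^T x
    return sigmoid(z)

theta = [0.0, 0.002, -0.002]                      # illustrative weights, not trained values
x = [1.0, 350.0, 120.0]                           # [bias, positive sum, negative sum]
h = predict(x, theta)
label = 1 if h >= 0.5 else 0                      # decision boundary at 0.5
print(h, label)
```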
3. Sentiment Analysis with Logistic Regression
Applying Logistic Regression to Sentiment Analysis involves a supervised learning pipeline.
A. The Pipeline
- Preprocessing: Clean text (remove stop words, punctuation, perform stemming/lemmatization).
- Feature Extraction: Convert text to numerical vector (as defined in Section 1).
- Initialization: Initialize weights randomly or with zeros.
- Training: Optimize the weights $\theta$ (via Gradient Descent) to minimize the cost $J(\theta)$.
- Prediction: Apply the learned $\theta$ to new data.
B. Training: Cost Function and Gradient Descent
To train the model, we compare the predicted probability $h(x)$ with the actual label $y$.

- Cost Function ($J(\theta)$): Measures the average error over the entire training set (binary cross-entropy):

  $$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h(x^{(i)})\right)\right]$$

  - If $y = 1$ and the prediction is close to 0, the cost is high.
  - If $y = 0$ and the prediction is close to 1, the cost is high.
- Gradient Descent: Iteratively updates the weights by moving in the opposite direction of the gradient (slope) of the cost function to find the global minimum:

  $$\theta := \theta - \alpha \nabla_\theta J(\theta)$$

  (where $\alpha$ is the learning rate). A minimal training sketch follows below.
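A minimal batch gradient-descent training loop under these definitions (a sketch, not a reference implementation; the 3-dimensional feature vectors are the ones from Section 1):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, alpha=1e-4, num_iters=1000):
    """Batch gradient descent for logistic regression on 3-dimensional feature vectors."""
    m = len(X)
    theta = [0.0, 0.0, 0.0]                       # initialize weights with zeros
    for _ in range(num_iters):
        # Gradient of J(theta): (1/m) * sum over examples of (h - y) * x
        grad = [0.0, 0.0, 0.0]
        for x_i, y_i in zip(X, y):
            h = sigmoid(sum(t * xj for t, xj in zip(theta, x_i)))
            for j in range(3):
                grad[j] += (h - y_i) * x_i[j] / m
        # Update step: move against the gradient, scaled by the learning rate alpha
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta
```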
C. Testing/Prediction
Given a new sentence:
- Extract the features $x$.
- Calculate $z = \theta^T x$.
- Calculate the sigmoid $h = \sigma(z)$.
- Classify based on the 0.5 threshold (see the short usage sketch below).
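Putting the earlier sketches together, classifying one new sentence could look like this (assuming the hypothetical extract_features and predict helpers, the freqs dictionary, and a trained theta are in scope):

```python
# Assumes build_freqs/extract_features (Section 1) and predict (Section 2) are defined,
# and that `freqs` and a trained `theta` already exist.
sentence = "i am happy because i am learning"
x = extract_features(sentence, freqs)            # step 1: features [1, pos_sum, neg_sum]
h = predict(x, theta)                            # steps 2-3: z = theta^T x, then sigmoid(z)
print("Positive" if h >= 0.5 else "Negative")    # step 4: threshold at 0.5
```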
4. Bayes' Rule for Conditional Probabilities
Before understanding Naïve Bayes, one must understand conditional probability and Bayes' Theorem.
A. Conditional Probability
$P(A \mid B)$ is the probability of event A occurring given that event B has already occurred.
B. Bayes' Theorem
Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

- $P(A \mid B)$: Posterior Probability (what we want to calculate).
- $P(B \mid A)$: Likelihood (probability of the predictor given the class).
- $P(A)$: Prior Probability (general probability of the class).
- $P(B)$: Evidence (probability of the predictor).
C. Context in Sentiment Analysis
We want to find the probability that a sentence is Positive given its set of words $W$, i.e. $P(\text{pos} \mid W)$.
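As a small worked example with made-up numbers: suppose 60% of tweets are positive, and "happy" occurs in 25% of positive tweets and in 5% of negative tweets, so $P(\text{"happy"}) = 0.25 \times 0.6 + 0.05 \times 0.4 = 0.17$. Then:

$$P(\text{pos} \mid \text{"happy"}) = \frac{P(\text{"happy"} \mid \text{pos})\,P(\text{pos})}{P(\text{"happy"})} = \frac{0.25 \times 0.6}{0.17} \approx 0.88$$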
5. The Naïve Bayes Classifier
The Naïve Bayes classifier is a probabilistic machine learning model suitable for classification tasks.
A. The "Naïve" Assumption
Calculating the exact joint probability of every combination of words in a sentence is computationally infeasible, because most word combinations never appear in the training data (data sparsity).
- Assumption: The model assumes that the presence of a particular feature (word) in a class is unrelated (independent) to the presence of any other feature.
- While this assumption is rarely true in real language (e.g., "not" affects "happy"), the model performs surprisingly well in practice.
B. The Classification Formula
To classify a sentence with words $w_1, w_2, \dots, w_n$, we compare the probabilities of both classes. For binary classification, we can calculate the ratio

$$\frac{P(\text{pos})}{P(\text{neg})} \prod_{i=1}^{n} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$$

and predict Positive if it is greater than 1, Negative if it is less than 1.

- Prior Ratio: $\dfrac{P(\text{pos})}{P(\text{neg})}$
- Likelihood Ratio: $\displaystyle\prod_{i=1}^{n} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$
6. Sentiment Analysis with Naïve Bayes
Implementation of Naïve Bayes for Sentiment Analysis involves creating a lookup table of probabilities (likelihoods) for every word in the vocabulary.
A. Training Phase (Building the Table)
- Count Frequencies: Count $N_{\text{pos}}$ (total words in the positive class) and $N_{\text{neg}}$ (total words in the negative class).
- Calculate Conditional Probabilities: for every word $w$ in the vocabulary,

  $$P(w \mid \text{pos}) = \frac{\text{freq}(w, \text{pos})}{N_{\text{pos}}}, \qquad P(w \mid \text{neg}) = \frac{\text{freq}(w, \text{neg})}{N_{\text{neg}}}$$
B. Laplace Smoothing
Problem: If a word appears in the test set but was never seen in the "Positive" training set, then $P(w \mid \text{pos}) = 0$. Since we multiply probabilities, the entire product becomes 0.
Solution: Add 1 to the numerator and add the size of the Vocabulary ($V$) to the denominator:

$$P(w \mid \text{class}) = \frac{\text{freq}(w, \text{class}) + 1}{N_{\text{class}} + V}$$
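A small sketch of building the smoothed likelihood table from the {(word, label): count} dictionary of Section 1 (the helper name smoothed_likelihoods is illustrative):

```python
def smoothed_likelihoods(freqs):
    """Compute P(w | class) with Laplace smoothing for every word in the vocabulary."""
    vocab = {word for (word, _label) in freqs}
    V = len(vocab)
    n_pos = sum(count for (word, label), count in freqs.items() if label == 1)
    n_neg = sum(count for (word, label), count in freqs.items() if label == 0)
    # Add-1 smoothing: unseen words get a small, non-zero probability
    p_pos = {w: (freqs.get((w, 1), 0) + 1) / (n_pos + V) for w in vocab}
    p_neg = {w: (freqs.get((w, 0), 0) + 1) / (n_neg + V) for w in vocab}
    return p_pos, p_neg
```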
C. Log Likelihoods
Problem: Multiplying many small probabilities (e.g., values on the order of $10^{-4}$ each) causes Arithmetic Underflow: the product becomes too small to represent in floating-point arithmetic.
Solution: Use Logarithms. Logarithms turn multiplication into addition.
We define the Lambda ($\lambda$) of a word as the log ratio of its class probabilities:

$$\lambda(w) = \log \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}$$
D. The Inference Algorithm
To classify a new sentence using Log Likelihoods:
- Calculate the Log Prior: $\text{logprior} = \log \dfrac{P(\text{pos})}{P(\text{neg})} = \log \dfrac{D_{\text{pos}}}{D_{\text{neg}}}$ (the ratio of positive to negative documents).
- Sum the $\lambda$ scores for each word in the sentence: $\text{score} = \text{logprior} + \sum_{i} \lambda(w_i)$.
- Decision (a minimal sketch follows below):
  - If $\text{score} > 0$: Positive
  - If $\text{score} < 0$: Negative
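A minimal inference sketch tying these steps together (it reuses the hypothetical smoothed_likelihoods helper from part B and assumes the positive/negative document counts are known):

```python
import math

def naive_bayes_predict(sentence, p_pos, p_neg, n_pos_docs, n_neg_docs):
    """Return the log score of a sentence; > 0 means Positive, < 0 means Negative."""
    score = math.log(n_pos_docs / n_neg_docs)             # log prior
    for word in sentence.lower().split():
        if word in p_pos:                                  # ignore words not in the vocabulary
            score += math.log(p_pos[word] / p_neg[word])   # lambda(word)
    return score

# Example usage (toy numbers)
p_pos, p_neg = smoothed_likelihoods(freqs)
score = naive_bayes_predict("i am happy", p_pos, p_neg, n_pos_docs=50, n_neg_docs=50)
print("Positive" if score > 0 else "Negative")
```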
E. Summary of Naïve Bayes vs. Logistic Regression
| Feature | Naïve Bayes | Logistic Regression |
|---|---|---|
| Model Type | Generative: Models how each class generates the data ($P(x \mid y)$). | Discriminative: Learns the boundary between classes ($P(y \mid x)$). |
| Assumptions | Assumes features (words) are independent. | Does not assume independence; weights are learned jointly, capturing feature interactions implicitly through the weighted sum. |
| Data Size | Works well with smaller datasets. | Usually requires more data to generalize well. |
| Speed | Very fast (simple counting). | Slower (requires iterative training via Gradient Descent). |