Unit 4 - Notes
INT344
Unit 4: Natural Language Processing with Classification Models
This unit focuses on Supervised Machine Learning techniques for text classification, specifically Sentiment Analysis. It covers the transition from raw text to numerical representation and the mathematical foundations of two primary algorithms: Logistic Regression and Naïve Bayes.
1. Extract Features from Text into Numerical Vectors
Machine learning models cannot process raw text strings directly; they require numerical input. Feature extraction is the process of transforming text into a numerical vector representation.
A. Vocabulary and Sparse Representation
- Vocabulary ($V$): The list of all unique words found in the training corpus.
- One-Hot Encoding: A simplistic approach where every word is represented as a vector of length $|V|$. This results in a "sparse" matrix (mostly zeros), which is computationally expensive and inefficient for large vocabularies (a toy illustration follows).
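As a toy illustration (with a made-up 5-word vocabulary), a one-hot vector has length $|V|$ and is almost entirely zeros:

```python
# Toy illustration of one-hot encoding (hypothetical 5-word vocabulary).
vocab = ["i", "am", "happy", "sad", "learning"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length |V| with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("happy"))  # [0, 0, 1, 0, 0] -- mostly zeros, hence "sparse"
```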
B. Feature Extraction with Frequencies
To create a dense representation for classification (specifically for binary sentiment analysis), we map words to their frequencies based on class labels.
- Frequency Dictionary: A mapping that counts how many times a specific word appears in a specific class.
  - Structure: {(word, label): count}
  - Example: {("happy", 1): 50, ("happy", 0): 2} (the word "happy" appears 50 times in positive tweets and 2 times in negative tweets).
- Feature Vector ($x$): For a given text input (e.g., a tweet), we extract a vector of dimension 3 (bias, positive sum, negative sum). A minimal extraction sketch follows this list.
  - $x_0 = 1$: The Bias unit.
  - $x_1$: The sum of the frequencies of the tweet's words in the Positive class.
  - $x_2$: The sum of the frequencies of the tweet's words in the Negative class.
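A minimal Python sketch of the frequency dictionary and the 3-dimensional feature extraction described above (the helper names build_freqs and extract_features are illustrative, and the text is assumed to be already preprocessed):

```python
from collections import defaultdict

def build_freqs(tweets, labels):
    """Build the {(word, label): count} frequency dictionary from a labeled corpus."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in tweet.lower().split():
            freqs[(word, label)] += 1
    return freqs

def extract_features(tweet, freqs):
    """Map a tweet to the 3-dimensional vector [bias, positive sum, negative sum]."""
    x = [1.0, 0.0, 0.0]                      # x0 = 1 is the bias unit
    for word in tweet.lower().split():
        x[1] += freqs.get((word, 1), 0)      # frequency of the word in the positive class
        x[2] += freqs.get((word, 0), 0)      # frequency of the word in the negative class
    return x

# Example usage on a toy corpus
tweets = ["i am happy", "i am sad"]
labels = [1, 0]
freqs = build_freqs(tweets, labels)
print(extract_features("happy happy i", freqs))  # [1.0, 3.0, 1.0]
```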
2. Binary Classifier using Logistic Regression
Logistic Regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. In NLP, it is used for Binary Classification (e.g., Positive vs. Negative, Spam vs. Not Spam).
A. The Sigmoid Function
Unlike Linear Regression, which outputs continuous values (from $-\infty$ to $+\infty$), Logistic Regression must output a probability between 0 and 1. This is achieved using the Sigmoid Function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

- If $z = 0$, $\sigma(z) = 0.5$
- If $z \to +\infty$, $\sigma(z) \to 1$
- If $z \to -\infty$, $\sigma(z) \to 0$
B. The Decision Boundary
We define a threshold (usually 0.5) to classify the output:
- Predict $\hat{y} = 1$ (Positive) if $\sigma(z) \geq 0.5$
- Predict $\hat{y} = 0$ (Negative) if $\sigma(z) < 0.5$
C. Mathematical Definition
The variable $z$ is the dot product of the Weight Vector ($\theta$) and the Feature Vector ($x$):

$$z = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2$$
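As a minimal sketch of how these pieces fit together (reusing the hypothetical extract_features representation from Section 1; the weights below are illustrative, not trained values):

```python
import math

def sigmoid(z):
    """Squash z from (-inf, +inf) into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, theta):
    """Return the predicted probability h = sigmoid(theta . x) for one feature vector."""
    z = sum(t * xi for t, xi in zip(theta, x))    # dot product theta^T x
    return sigmoid(z)

theta = [0.0, 0.002, -0.002]                      # illustrative weights, not trained values
x = [1.0, 350.0, 120.0]                           # [bias, positive sum, negative sum]
h = predict(x, theta)
label = 1 if h >= 0.5 else 0                      # decision boundary at 0.5
print(h, label)
```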
3. Sentiment Analysis with Logistic Regression
Applying Logistic Regression to Sentiment Analysis involves a supervised learning pipeline.
A. The Pipeline
- Preprocessing: Clean text (remove stop words, punctuation, perform stemming/lemmatization).
- Feature Extraction: Convert text to numerical vector (as defined in Section 1).
- Initialization: Initialize weights randomly or with zeros.
- Training: Optimize the weights $\theta$ (via Gradient Descent) to minimize the cost $J(\theta)$.
- Prediction: Apply the learned $\theta$ to new data.
B. Training: Cost Function and Gradient Descent
To train the model, we compare the predicted probability $h(x)$ with the actual label $y$.

- Cost Function ($J(\theta)$): Measures the average error over the entire training set (binary cross-entropy):

  $$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h(x^{(i)})\right)\right]$$

  - If $y = 1$ and the prediction is close to 0, the cost is high.
  - If $y = 0$ and the prediction is close to 1, the cost is high.
- Gradient Descent: Iteratively updates the weights by moving in the opposite direction of the gradient (slope) of the cost function to find the global minimum:

  $$\theta := \theta - \alpha \nabla_\theta J(\theta)$$

  (where $\alpha$ is the learning rate). A minimal training sketch follows below.
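A minimal batch gradient-descent training loop under these definitions (a sketch, not a reference implementation; the 3-dimensional feature vectors are the ones from Section 1):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, alpha=1e-4, num_iters=1000):
    """Batch gradient descent for logistic regression on 3-dimensional feature vectors."""
    m = len(X)
    theta = [0.0, 0.0, 0.0]                       # initialize weights with zeros
    for _ in range(num_iters):
        # Gradient of J(theta): (1/m) * sum over examples of (h - y) * x
        grad = [0.0, 0.0, 0.0]
        for x_i, y_i in zip(X, y):
            h = sigmoid(sum(t * xj for t, xj in zip(theta, x_i)))
            for j in range(3):
                grad[j] += (h - y_i) * x_i[j] / m
        # Update step: move against the gradient, scaled by the learning rate alpha
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta
```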
C. Testing/Prediction
Given a new sentence:
- Extract the features $x$.
- Calculate $z = \theta^T x$.
- Calculate the sigmoid $h = \sigma(z)$.
- Classify based on the 0.5 threshold (see the short usage sketch below).
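Putting the earlier sketches together, classifying one new sentence could look like this (assuming the hypothetical extract_features and predict helpers, the freqs dictionary, and a trained theta are in scope):

```python
# Assumes build_freqs/extract_features (Section 1) and predict (Section 2) are defined,
# and that `freqs` and a trained `theta` already exist.
sentence = "i am happy because i am learning"
x = extract_features(sentence, freqs)            # step 1: features [1, pos_sum, neg_sum]
h = predict(x, theta)                            # steps 2-3: z = theta^T x, then sigmoid(z)
print("Positive" if h >= 0.5 else "Negative")    # step 4: threshold at 0.5
```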
4. Bayes' Rule for Conditional Probabilities
Before understanding Naïve Bayes, one must understand conditional probability and Bayes' Theorem.
A. Conditional Probability
$P(A \mid B)$ is the probability of event A occurring given that event B has already occurred.
B. Bayes' Theorem
Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

- $P(A \mid B)$: Posterior Probability (what we want to calculate).
- $P(B \mid A)$: Likelihood (probability of the predictor given the class).
- $P(A)$: Prior Probability (general probability of the class).
- $P(B)$: Evidence (probability of the predictor).
C. Context in Sentiment Analysis
We want to find the probability that a sentence is Positive given its set of words $W$, i.e. $P(\text{pos} \mid W)$.
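As a small worked example with made-up numbers: suppose 60% of tweets are positive, and "happy" occurs in 25% of positive tweets and in 5% of negative tweets, so $P(\text{"happy"}) = 0.25 \times 0.6 + 0.05 \times 0.4 = 0.17$. Then:

$$P(\text{pos} \mid \text{"happy"}) = \frac{P(\text{"happy"} \mid \text{pos})\,P(\text{pos})}{P(\text{"happy"})} = \frac{0.25 \times 0.6}{0.17} \approx 0.88$$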
5. The Naïve Bayes Classifier
The Naïve Bayes classifier is a probabilistic machine learning model suitable for classification tasks.
A. The "Naïve" Assumption
Calculating the exact joint probability of every combination of words in a sentence is computationally infeasible, because most word combinations never appear in the training data (data sparsity).
- Assumption: The model assumes that the presence of a particular feature (word) in a class is unrelated (independent) to the presence of any other feature.
- While this assumption is rarely true in real language (e.g., "not" affects "happy"), the model performs surprisingly well in practice.
B. The Classification Formula
To classify a sentence with words $w_1, w_2, \dots, w_n$, we compare the probabilities of both classes. For binary classification, we can calculate the ratio

$$\frac{P(\text{pos})}{P(\text{neg})} \prod_{i=1}^{n} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$$

and predict Positive if it is greater than 1, Negative if it is less than 1.

- Prior Ratio: $\dfrac{P(\text{pos})}{P(\text{neg})}$
- Likelihood Ratio: $\displaystyle\prod_{i=1}^{n} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$
6. Sentiment Analysis with Naïve Bayes
Implementation of Naïve Bayes for Sentiment Analysis involves creating a lookup table of probabilities (likelihoods) for every word in the vocabulary.
A. Training Phase (Building the Table)
- Count Frequencies: Count $N_{\text{pos}}$ (total words in the positive class) and $N_{\text{neg}}$ (total words in the negative class).
- Calculate Conditional Probabilities: for every word $w$ in the vocabulary,

  $$P(w \mid \text{pos}) = \frac{\text{freq}(w, \text{pos})}{N_{\text{pos}}}, \qquad P(w \mid \text{neg}) = \frac{\text{freq}(w, \text{neg})}{N_{\text{neg}}}$$
B. Laplace Smoothing
Problem: If a word appears in the test set but was never seen in the "Positive" training set, then $P(w \mid \text{pos}) = 0$. Since we multiply probabilities, the entire product becomes 0.
Solution: Add 1 to the numerator and add the size of the Vocabulary ($V$) to the denominator:

$$P(w \mid \text{class}) = \frac{\text{freq}(w, \text{class}) + 1}{N_{\text{class}} + V}$$
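A small sketch of building the smoothed likelihood table from the {(word, label): count} dictionary of Section 1 (the helper name smoothed_likelihoods is illustrative):

```python
def smoothed_likelihoods(freqs):
    """Compute P(w | class) with Laplace smoothing for every word in the vocabulary."""
    vocab = {word for (word, _label) in freqs}
    V = len(vocab)
    n_pos = sum(count for (word, label), count in freqs.items() if label == 1)
    n_neg = sum(count for (word, label), count in freqs.items() if label == 0)
    # Add-1 smoothing: unseen words get a small, non-zero probability
    p_pos = {w: (freqs.get((w, 1), 0) + 1) / (n_pos + V) for w in vocab}
    p_neg = {w: (freqs.get((w, 0), 0) + 1) / (n_neg + V) for w in vocab}
    return p_pos, p_neg
```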
C. Log Likelihoods
Problem: Multiplying many small probabilities (e.g., values on the order of $10^{-4}$ each) causes Arithmetic Underflow: the product becomes too small to represent in floating-point arithmetic.
Solution: Use Logarithms. Logarithms turn multiplication into addition.
We define the Lambda ($\lambda$) of a word as the log ratio of its class probabilities:

$$\lambda(w) = \log \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}$$
D. The Inference Algorithm
To classify a new sentence using Log Likelihoods:
- Calculate the Log Prior: $\text{logprior} = \log \dfrac{P(\text{pos})}{P(\text{neg})} = \log \dfrac{D_{\text{pos}}}{D_{\text{neg}}}$ (the ratio of positive to negative documents).
- Sum the $\lambda$ scores for each word in the sentence: $\text{score} = \text{logprior} + \sum_{i} \lambda(w_i)$.
- Decision (a minimal sketch follows below):
  - If $\text{score} > 0$: Positive
  - If $\text{score} < 0$: Negative
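A minimal inference sketch tying these steps together (it reuses the hypothetical smoothed_likelihoods helper from part B and assumes the positive/negative document counts are known):

```python
import math

def naive_bayes_predict(sentence, p_pos, p_neg, n_pos_docs, n_neg_docs):
    """Return the log score of a sentence; > 0 means Positive, < 0 means Negative."""
    score = math.log(n_pos_docs / n_neg_docs)             # log prior
    for word in sentence.lower().split():
        if word in p_pos:                                  # ignore words not in the vocabulary
            score += math.log(p_pos[word] / p_neg[word])   # lambda(word)
    return score

# Example usage (toy numbers)
p_pos, p_neg = smoothed_likelihoods(freqs)
score = naive_bayes_predict("i am happy", p_pos, p_neg, n_pos_docs=50, n_neg_docs=50)
print("Positive" if score > 0 else "Negative")
```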
E. Summary of Naïve Bayes vs. Logistic Regression
| Feature | Naïve Bayes | Logistic Regression |
|---|---|---|
| Model Type | Generative: Models how each class generates the data ($P(x \mid y)$). | Discriminative: Learns the boundary between classes ($P(y \mid x)$). |
| Assumptions | Assumes features (words) are independent. | Does not assume independence; weights are learned jointly, capturing feature interactions implicitly through the weighted sum. |
| Data Size | Works well with smaller datasets. | Usually requires more data to generalize well. |
| Speed | Very fast (simple counting). | Slower (requires iterative training via Gradient Descent). |