Unit 3 - Notes
INT344
Unit 3: Natural Language Processing with Probabilistic Models
This unit covers the fundamental probabilistic techniques used in Natural Language Processing (NLP) to handle ambiguity, spelling errors, sequence prediction, and semantic representation.
1. Autocorrect
Autocorrect is an application that changes words in a text to their nearest correct spelling equivalents. It relies on finding the word in a vocabulary that is most "similar" to the misspelled word based on specific metrics.
1.1 Minimum Edit Distance
Minimum Edit Distance is a string metric used to quantify how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.
The Three Basic Operations:
- Insertion: Adding a character.
- Deletion: Removing a character.
- Substitution: Replacing one character with another.
Levenshtein Distance:
This is the most common algorithm for calculating edit distance. Costs are usually assigned as follows:
- Insertion/Deletion cost = 1
- Substitution cost = 2 (often viewed as 1 deletion + 1 insertion, though some variations assign it 1).
Dynamic Programming Algorithm:
To calculate the distance between a source string of length m and a target string of length n, we construct a matrix D of size (m+1) x (n+1).
The recurrence relation for cell D[i, j] is:
D[i, j] = min( D[i-1, j] + del_cost, D[i, j-1] + ins_cost, D[i-1, j-1] + sub_cost ), where sub_cost = 0 if source[i] = target[j].
- Base Case: D[0, 0] = 0, D[i, 0] = i * del_cost, D[0, j] = j * ins_cost. The first row represents transforming an empty string into the target (all insertions). The first column represents transforming the source into an empty string (all deletions).
- Result: The value at D[m, n] is the minimum edit distance.
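A minimal Python sketch of this dynamic programming table, using insertion/deletion cost 1 and substitution cost 2 as above (function and variable names are illustrative, not from the course material):

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    m, n = len(source), len(target)
    # D[i][j] = minimum edit distance between source[:i] and target[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: delete every source character
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, n + 1):          # first row: insert every target character
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # substitution costs nothing if the characters already match
            r_cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,      # delete from source
                          D[i][j - 1] + ins_cost,      # insert into source
                          D[i - 1][j - 1] + r_cost)    # substitute (or keep)
    return D[m][n]

print(min_edit_distance("play", "stay"))  # 4: two substitutions at cost 2 each
```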
1.2 Spellchecker to Correct Misspelled Words
A probabilistic spellchecker typically follows a four-step process:
- Identify the Misspelled Word: Check if the word exists in the dictionary.
- Generate Candidates: Create a list of potential correct words. This is done by applying edit operations (edit distance 1 or 2) to the misspelled word.
- Filter Candidates: Keep only the candidates that are actual real words in the language's vocabulary.
- Score Candidates: Find the most likely correction using probabilities.
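A minimal sketch of steps 2 and 3 (candidate generation and filtering), assuming a lowercase English alphabet; the helper names and the toy vocabulary are illustrative:

```python
import string

def edits1(word):
    """All strings one insertion, deletion, or substitution away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    inserts = [left + c + right for left, right in splits for c in letters]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    return set(deletes + inserts + replaces)

def known(candidates, vocab):
    """Step 3: keep only the candidates that are real words."""
    return {w for w in candidates if w in vocab}

vocab = {"spelling", "spell", "spill"}      # toy vocabulary
print(known(edits1("speling"), vocab))      # {'spelling'} via one insertion
```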
The Noisy Channel Model:
To select the best candidate, we use Bayes' Theorem. Let c be the correct word (a candidate) and w be the misspelled word (the observation). We want to find the c that maximizes P(c | w):
P(c | w) = P(w | c) * P(c) / P(w)
Since P(w) is constant for all candidates, we maximize P(w | c) * P(c):
- P(c) (Language Model): The probability that the word c appears in the language (frequency in a corpus). This favors common words.
- P(w | c) (Error Model): The probability that the user types w when they intended to type c. This relies on edit distance statistics (e.g., how often is 'a' typed as 's'?).
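A hedged sketch of step 4 under this noisy channel model; the word counts and error probabilities below are invented placeholders, not real statistics:

```python
# Hypothetical corpus frequencies for the language model P(c)
word_counts = {"their": 120, "there": 300, "the": 5000}
total = sum(word_counts.values())

# Hypothetical error model P(w | c): probability of typing "thier" given the
# intended word c; in practice this comes from edit/confusion statistics.
error_model = {"their": 0.02, "there": 0.001, "the": 0.0001}

def score(candidate, p_error):
    p_c = word_counts[candidate] / total       # language model P(c)
    return p_c * p_error                       # P(c) * P(w | c)

best = max(error_model, key=lambda c: score(c, error_model[c]))
print(best)  # 'their' under these made-up numbers
```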
2. Part of Speech (POS) Tagging and Hidden Markov Models
POS tagging is the process of assigning a grammatical category (Noun, Verb, Adjective, etc.) to each word in a text corpus. This is difficult because many words are ambiguous (e.g., "book" can be a noun or a verb).
2.1 About Markov Chains
A Markov Chain is a stochastic model describing a sequence of possible events.
Key Properties:
- Markov Assumption: The probability of a future state depends only on the current state, not on the sequence of events that preceded it.
- States (Q): A finite set of states (e.g., sunny, rainy).
- Transition Matrix (A): Contains probabilities a_ij of moving from state i to state j.
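A tiny illustration of a transition matrix for the weather example above (the probabilities are invented):

```python
import numpy as np

states = ["sunny", "rainy"]
# A[i][j] = probability of moving from states[i] to states[j]; each row sums to 1
A = np.array([[0.8, 0.2],    # sunny -> sunny, sunny -> rainy
              [0.4, 0.6]])   # rainy -> sunny, rainy -> rainy

# Probability of the sequence sunny -> sunny -> rainy, starting from sunny
p = A[0, 0] * A[0, 1]
print(p)  # 0.16
```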
2.2 Hidden Markov Models (HMMs)
In a Markov Chain, states are visible. In an HMM, the states are hidden (we cannot see them directly), but they produce observations (which we can see).
Components of an HMM in POS Tagging:
- Hidden States (Q): The Part of Speech tags (Noun, Verb, Det, etc.).
- Observations (O): The actual words in the sentence.
- Transition Probabilities (A): The probability P(t_i | t_{i-1}) of one tag following another (e.g., likelihood of a Noun following a Determiner).
- Emission Probabilities (B): The probability P(w_i | t_i) of a specific word being generated given a specific tag (e.g., likelihood that the tag "Verb" emits the word "run").
- Initial Probabilities (π): The probability of a sentence starting with a specific tag.
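A toy two-tag HMM, just to make the shapes of these components concrete (all probabilities are invented):

```python
import numpy as np

tags = ["DET", "NOUN"]                  # hidden states Q
words = ["the", "dog", "run"]           # observation vocabulary O

pi = np.array([0.7, 0.3])               # initial probabilities: P(first tag)
A = np.array([[0.1, 0.9],               # transitions: P(next tag | DET)
              [0.5, 0.5]])              # transitions: P(next tag | NOUN)
B = np.array([[0.9, 0.05, 0.05],        # emissions: P(word | DET)
              [0.1, 0.6, 0.3]])         # emissions: P(word | NOUN)

# P(tag sequence DET NOUN emitting "the dog")
p = pi[0] * B[0, words.index("the")] * A[0, 1] * B[1, words.index("dog")]
print(p)  # 0.7 * 0.9 * 0.9 * 0.6 = 0.3402
```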
2.3 Part-Of-Speech Tags using a Text Corpus
To build an HMM for POS tagging, we need to calculate the Transition (A) and Emission (B) matrices using a labeled text corpus (a large dataset where linguists have already tagged the words).
Calculating Transition Probabilities:
- Formula: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
- C(t_{i-1}, t_i): Count of times tag t_{i-1} is immediately followed by tag t_i.
- C(t_{i-1}): Total count of tag t_{i-1} in the corpus.
Calculating Emission Probabilities:
- Formula: P(w_i | t_i) = C(t_i, w_i) / C(t_i)
- C(t_i, w_i): Count of times tag t_i is associated with word w_i.
- C(t_i): Total count of tag t_i.
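A hedged sketch of estimating these counts from a small tagged corpus (the toy corpus format below is an assumption, not the course dataset):

```python
from collections import defaultdict

# Toy labeled corpus: each sentence is a list of (word, tag) pairs
corpus = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
          [("a", "DET"), ("dog", "NOUN"), ("barks", "VERB")]]

transition_counts = defaultdict(int)   # C(t_{i-1}, t_i)
emission_counts = defaultdict(int)     # C(t_i, w_i)
tag_counts = defaultdict(int)          # C(t_i)

for sentence in corpus:
    prev_tag = "<s>"                   # sentence-start pseudo-tag
    tag_counts[prev_tag] += 1
    for word, tag in sentence:
        transition_counts[(prev_tag, tag)] += 1
        emission_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev_tag = tag

def transition_prob(prev_tag, tag):
    return transition_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def emission_prob(tag, word):
    return emission_counts[(tag, word)] / tag_counts[tag]

print(transition_prob("DET", "NOUN"))  # 2/2 = 1.0 in this toy corpus
print(emission_prob("NOUN", "dog"))    # 2/2 = 1.0
```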
Decoding (Viterbi Algorithm):
Once the HMM is trained (matrices calculated), we use the Viterbi Algorithm to find the most likely sequence of hidden tags for a new sentence. This algorithm uses dynamic programming to avoid enumerating every possible tag sequence, choosing the path that maximizes the product of transition and emission probabilities.
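A compact sketch of Viterbi decoding, reusing the invented two-tag model values from the earlier sketch (observation indices 0 and 1 stand for "the" and "dog" in the toy vocabulary):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence for a list of observation indices."""
    n_states, T = A.shape[0], len(obs)
    prob = np.zeros((n_states, T))              # best path probability ending in state s at step t
    back = np.zeros((n_states, T), dtype=int)   # backpointers

    prob[:, 0] = pi * B[:, obs[0]]              # initialisation
    for t in range(1, T):
        for s in range(n_states):
            scores = prob[:, t - 1] * A[:, s] * B[s, obs[t]]
            back[s, t] = np.argmax(scores)
            prob[s, t] = np.max(scores)

    # Trace the best path backwards from the most probable final state
    path = [int(np.argmax(prob[:, T - 1]))]
    for t in range(T - 1, 0, -1):
        path.insert(0, int(back[path[0], t]))
    return path

tags = ["DET", "NOUN"]
pi = np.array([0.7, 0.3])
A = np.array([[0.1, 0.9], [0.5, 0.5]])
B = np.array([[0.9, 0.05, 0.05], [0.1, 0.6, 0.3]])
print([tags[i] for i in viterbi([0, 1], pi, A, B)])  # ['DET', 'NOUN']
```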
3. Autocomplete and Language Models
Autocomplete systems suggest the completion of a word or the next word in a sentence. This relies on Language Modeling, which assigns probabilities to sequences of words.
3.1 N-gram Language Models
An N-gram is a contiguous sequence of items from a given sample of text.
- Unigram (N=1): Individual words ("I", "love", "coding"). Assumes word independence.
- Bigram (N=2): Pairs of words ("I love", "love coding").
- Trigram (N=3): Triplets ("I love coding").
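A quick illustration of extracting these n-grams from a token list (the helper below is illustrative):

```python
def ngrams(tokens, n):
    """Contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "love", "coding"]
print(ngrams(tokens, 1))  # [('I',), ('love',), ('coding',)]
print(ngrams(tokens, 2))  # [('I', 'love'), ('love', 'coding')]
print(ngrams(tokens, 3))  # [('I', 'love', 'coding')]
```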
Calculating Sequence Probabilities:
The goal is to calculate P(w_1, w_2, ..., w_n). Using the Chain Rule of Probability:
P(w_1, w_2, ..., w_n) = P(w_1) * P(w_2 | w_1) * P(w_3 | w_1, w_2) * ... * P(w_n | w_1, ..., w_{n-1})
Because calculating probabilities with long histories is computationally expensive and suffers from data sparsity, we apply the Markov Assumption to N-grams. For a Bigram model, the probability of a word depends only on the previous word:
P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1})
3.2 Autocomplete Language Model using a Text Corpus
To build an autocomplete system, we calculate probabilities based on counting frequencies in a corpus.
Maximum Likelihood Estimation (MLE):
For a Bigram model, the probability of word w_n following w_{n-1} is:
P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
The Problem of Sparsity (Zero Probability):
If a specific sequence (e.g., "purple elephant") never appears in the training corpus, the probability becomes 0. If this is part of a longer sentence, the probability of the entire sentence becomes 0.
Smoothing (Laplace Smoothing / Add-k):
To fix sparsity, we add a small number (usually 1, denoted as k) to the numerator and adjust the denominator by the vocabulary size (V):
P(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + k) / (C(w_{n-1}) + k * V)
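A minimal sketch of the smoothed bigram estimator on a toy corpus (all values are invented; k = 1 gives Laplace smoothing):

```python
from collections import Counter

corpus = "i love coding i love music i hate bugs".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)                 # vocabulary size

def bigram_prob(prev_word, word, k=1):
    """Add-k smoothed estimate of P(word | prev_word)."""
    return (bigram_counts[(prev_word, word)] + k) / (unigram_counts[prev_word] + k * V)

print(bigram_prob("i", "love"))   # (2 + 1) / (3 + 6) = 0.333...
print(bigram_prob("i", "bugs"))   # unseen bigram, but non-zero: (0 + 1) / (3 + 6)
```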
Autocomplete Implementation:
Given a history (the words typed so far), the system looks for the word w that maximizes P(w | history).
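Reusing bigram_prob and unigram_counts from the sketch above, this last step is a simple argmax over the vocabulary:

```python
def suggest_next(prev_word, vocab, k=1):
    """Return the word that maximizes the smoothed P(word | prev_word)."""
    return max(vocab, key=lambda w: bigram_prob(prev_word, w, k))

print(suggest_next("i", unigram_counts.keys()))   # 'love' in this toy corpus
```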
4. Word Embeddings with Neural Networks
Traditional NLP represented words as discrete atomic symbols (e.g., "hotel" = ID 452). This approach lacks the notion of similarity. Word embeddings solve this by representing words as continuous vectors.
4.1 Word Embeddings
A word embedding is a learned representation where words that have the same meaning have a similar representation. They are dense vectors (typically 50 to 300 dimensions) of real numbers.
One-Hot Encoding vs. Embeddings:
- One-Hot: A vector of length V (vocabulary size) with a single '1' and all other zeros.
- Sparse, high-dimensional.
- No relationship between words (dot product of any two different words is 0).
- Embedding: A vector of length d (e.g., 300).
- Dense, low-dimensional.
- Captures relationships.
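A small numeric contrast between the two representations (the 3-dimensional "embeddings" below are invented for illustration; real ones have hundreds of dimensions):

```python
import numpy as np

# One-hot with V = 4: any two distinct words have dot product 0
hotel = np.array([1, 0, 0, 0])
motel = np.array([0, 1, 0, 0])
print(hotel @ motel)        # 0 -> no notion of similarity

# Dense embeddings (made-up values): related words get overlapping vectors
hotel_e = np.array([0.8, 0.1, 0.3])
motel_e = np.array([0.7, 0.2, 0.3])
print(hotel_e @ motel_e)    # 0.67 -> large overlap for related words
```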
Training with Neural Networks:
Embeddings are typically learned using "self-supervised" learning on large text corpora. The neural network tries to predict a word given its context (or vice versa).
- Word2Vec (Mikolov et al.):
- CBOW (Continuous Bag of Words): Predicts the target word based on context words.
- Skip-gram: Predicts context words given a target word.
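A hedged sketch of training such embeddings with the gensim library (assuming gensim is installed; the toy sentences and parameter values are illustrative):

```python
from gensim.models import Word2Vec

sentences = [["i", "love", "natural", "language", "processing"],
             ["word", "embeddings", "capture", "meaning"]]

# sg=0 -> CBOW (predict target from context); sg=1 -> Skip-gram
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

vec = model.wv["language"]          # 100-dimensional dense vector
print(vec.shape)                    # (100,)
print(model.wv.most_similar("language", topn=2))
```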
4.2 Semantic Meaning of Words
The primary power of embeddings is that the geometric distance between vectors corresponds to semantic similarity.
Properties:
- Similarity: Words like "frog" and "toad" will have vectors that are very close to each other in vector space.
- Analogies: Algebraic operations on vectors produce semantic results.
- The Classic Example: vector("king") - vector("man") + vector("woman") ≈ vector("queen").
- This implies the vector direction for "Gender" or "Royalty" is consistent across the space.
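A tiny numeric illustration with invented 2-dimensional vectors (one axis loosely encoding "royalty", the other "gender"):

```python
import numpy as np

king  = np.array([0.9, 0.1])
queen = np.array([0.9, 0.9])
man   = np.array([0.1, 0.1])
woman = np.array([0.1, 0.9])

result = king - man + woman
print(result)               # [0.9 0.9] -> matches 'queen' in this toy space
```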
Cosine Similarity:
To measure how similar two words (vectors u and v) are, we calculate the cosine of the angle between them:
cos(θ) = (u · v) / (||u|| * ||v||)
- Value is 1 if vectors point in exactly the same direction (identical meaning).
- Value is 0 if vectors are orthogonal (unrelated).
- Value is -1 if vectors are opposite.
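A direct implementation of this formula (the vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

frog = np.array([0.9, 0.2, 0.1])
toad = np.array([0.85, 0.25, 0.15])
car  = np.array([0.1, 0.9, 0.4])

print(cosine_similarity(frog, toad))  # close to 1: similar meaning
print(cosine_similarity(frog, car))   # much lower: unrelated words
```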