Unit 5 - Notes

INT344

Unit 5: Natural Language Processing with Sequence and Attention Models

1. Introduction to Sequence Models

1.1 The Nature of Sequential Data

In traditional feedforward neural networks, inputs (and outputs) are assumed to be independent of each other. However, in Natural Language Processing (NLP), the order of data matters significantly. A sentence is not just a "bag of words"; the semantic meaning is derived from the sequence.

Sequence Models are specialized neural network architectures designed to handle sequential data where:

  1. Inputs and/or outputs are sequences.
  2. Sequence lengths can vary (e.g., a 5-word sentence vs. a 20-word sentence).
  3. Context matters (previous inputs influence the interpretation of the current input).

1.2 Notation

  • x = (x_1, x_2, ..., x_{T_x}): Input sequence (e.g., a sentence).
  • x_t: The input feature vector at time step t.
  • y = (y_1, y_2, ..., y_{T_y}): Output sequence.
  • y_t: The output at time step t.
  • T_x, T_y: The lengths of the input and output sequences, respectively.

2. Recurrent Neural Networks (RNNs)

2.1 Architecture

A Recurrent Neural Network looks like a standard neural network that has been "unrolled" over time. The key differentiator is the Hidden State, which acts as the network's memory.

At every time step t:

  1. The network takes the current input x_t.
  2. It combines it with the hidden state from the previous time step, h_{t-1}.
  3. It produces a new hidden state h_t and an output ŷ_t.

Parameter Sharing: Unlike standard networks that would need separate parameters for every position in a sentence, RNNs share the same weight matrices (W_hx, W_hh, W_hy) across all time steps.

2.2 Mathematical Formulation

The hidden state (activation) at time t is calculated as:

    h_t = g(W_hh · h_{t-1} + W_hx · x_t + b_h)

The output at time t is:

    ŷ_t = g'(W_hy · h_t + b_y)

  • g: Activation function (usually tanh or ReLU).
  • g': Output activation function (sigmoid or softmax).
  • W_hh, W_hx, W_hy: Weight matrices; b_h, b_y: bias vectors.
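
Below is a minimal NumPy sketch of a single RNN step that follows the two equations above directly (the function and variable names are illustrative, not taken from any library):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, W_hy, b_h, b_y):
    """One forward step of a vanilla RNN cell.

    x_t    : input vector at time t, shape (input_dim,)
    h_prev : hidden state from time t-1, shape (hidden_dim,)
    """
    # h_t = tanh(W_hh · h_{t-1} + W_hx · x_t + b_h)
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t + b_h)
    # ŷ_t = softmax(W_hy · h_t + b_y)
    logits = W_hy @ h_t + b_y
    y_hat_t = np.exp(logits - logits.max())
    y_hat_t /= y_hat_t.sum()
    return h_t, y_hat_t

# Toy usage: the SAME weight matrices are reused at every time step
# (parameter sharing); only h carries information forward.
rng = np.random.default_rng(0)
input_dim, hidden_dim, vocab = 8, 16, 10
W_hx = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hy = rng.normal(size=(vocab, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(vocab)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a 5-step input sequence
    h, y_hat = rnn_step(x_t, h, W_hx, W_hh, W_hy, b_h, b_y)
```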

2.3 Limitations of Standard RNNs

Despite their ability to handle sequences, basic RNNs suffer from severe limitations during training via Backpropagation Through Time (BPTT):

The Vanishing Gradient Problem

  • Issue: As gradients are backpropagated through many time steps, they pass through repeated matrix multiplications. If the weights are small (< 1), the gradients shrink exponentially and approach zero.
  • Consequence: The weights in the early layers (beginning of the sentence) are not updated effectively. The model fails to learn long-term dependencies (e.g., remembering a subject from the start of a paragraph to conjugate a verb correctly at the end).

The Exploding Gradient Problem

  • Issue: If weights are large (> 1), gradients can grow exponentially, leading to numerical overflow (NaN).
  • Solution: Gradient Clipping (capping the gradient vector if it exceeds a threshold).
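
As a hedged illustration, gradient clipping in PyTorch can be done with torch.nn.utils.clip_grad_norm_, which rescales the gradients in place when their combined norm exceeds a threshold; the tiny RNN and dummy loss below are placeholders only there to produce gradients:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=32, hidden_size=64, batch_first=True)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 10, 32)          # batch of 4 sequences, 10 time steps each
output, h_n = model(x)
loss = output.pow(2).mean()         # dummy loss, just to generate gradients

optimizer.zero_grad()
loss.backward()
# Cap the global gradient norm at 5.0 to prevent exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```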

3. Long Short-Term Memory (LSTM)

3.1 Overview

LSTMs were introduced (Hochreiter & Schmidhuber, 1997) specifically to solve the vanishing gradient problem. They introduce a more complex internal structure called a Cell that regulates the flow of information.

3.2 Key Concepts

  • Cell State (c_t): The "highway" of information. It runs straight down the entire chain with only minor linear interactions, allowing information to flow unchanged.
  • Gates: Neural network layers (sigmoid activations) that determine what information is added to or removed from the cell state.

3.3 The LSTM Equations (Step-by-Step)

  1. Forget Gate (f_t): Decides what information to throw away from the previous cell state.

    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

    (If the output is 0, forget; if 1, keep.)

  2. Input Gate (i_t) & Candidate Value (c̃_t): Decides what new information to store.

    i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
    c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)

  3. Update Cell State (c_t): Combine old memory (filtered by the forget gate) and new memory (filtered by the input gate), where ⊙ denotes element-wise multiplication.

    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

  4. Output Gate (o_t) & Hidden State (h_t): Decides what to output based on the cell state.

    o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
    h_t = o_t ⊙ tanh(c_t)
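
A minimal NumPy sketch of one LSTM step implementing equations 1-4 above (variable names and the concatenated-weight layout are illustrative choices, not the only possible formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step following equations 1-4.

    Each W_* has shape (hidden_dim, hidden_dim + input_dim) and acts on
    the concatenation [h_{t-1}, x_t].
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # 1. forget gate
    i_t = sigmoid(W_i @ z + b_i)             # 2. input gate
    c_tilde = np.tanh(W_c @ z + b_c)         # 2. candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # 3. update cell state
    o_t = sigmoid(W_o @ z + b_o)             # 4. output gate
    h_t = o_t * np.tanh(c_t)                 # 4. new hidden state
    return h_t, c_t
```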

4. Applications of Sequence Models

  1. One-to-One: Standard NN (Image Classification).
  2. One-to-Many: Image Captioning (Input: Image → Output: "A cat sitting on a chair").
  3. Many-to-One: Sentiment Analysis (Input: "This movie was great" → Output: Rating 5/5).
  4. Many-to-Many (T_x = T_y): Named Entity Recognition, POS Tagging.
  5. Many-to-Many (T_x ≠ T_y): Machine Translation (English to French).

5. POS Tagging and Named Entity Recognition

These are fundamentally Sequence Labeling tasks.

5.1 Part-of-Speech (POS) Tagging

  • Goal: Assign a grammatical category (tag) to every word in a sentence (e.g., Noun, Verb, Adjective).
  • Input: "Time flies like an arrow."
  • Output: [Noun, Verb, Prep, Det, Noun].
  • Model: Usually a Bidirectional LSTM (Bi-LSTM); a minimal sketch follows this list.
    • A forward LSTM reads the sentence from start to end.
    • A backward LSTM reads from end to start.
    • This captures context from both the past (left) and future (right) to disambiguate words (e.g., "bank" as a river bank vs. financial bank).
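
A minimal PyTorch sketch of a Bi-LSTM tagger, assuming a toy vocabulary of 10,000 word ids and 5 tags (all sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True runs a forward and a backward LSTM and
        # concatenates their hidden states at every position
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):                  # (batch, seq_len)
        embedded = self.embed(token_ids)           # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embedded)           # (batch, seq_len, 2*hidden_dim)
        return self.fc(outputs)                    # per-token tag scores

tagger = BiLSTMTagger(vocab_size=10000, embed_dim=100, hidden_dim=128, num_tags=5)
scores = tagger(torch.randint(0, 10000, (1, 5)))   # e.g. "Time flies like an arrow"
predicted_tags = scores.argmax(dim=-1)             # one tag index per word
```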

5.2 Named Entity Recognition (NER)

  • Goal: Locate and classify named entities into categories such as Person, Organization, Location, Time, etc.
  • IOB Tagging Scheme: Used to define boundaries of entities.
    • B-PER: Beginning of a Person entity.
    • I-PER: Inside a Person entity.
    • O: Outside any entity.
  • Example:
    • Input: "Elon Musk works at SpaceX."
    • Output: [B-PER, I-PER, O, O, B-ORG].
  • Architecture: Bi-LSTM is standard, often topped with a CRF (Conditional Random Field) layer to ensure valid tag transitions.
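
A small, self-contained helper (hypothetical, not from any library) showing how IOB tags are grouped back into entity spans, using the example above:

```python
def iob_to_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                  # a new entity begins
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:    # continue the current entity
            current.append(token)
        else:                                     # "O": close any open entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["Elon", "Musk", "works", "at", "SpaceX"]
tags   = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(iob_to_entities(tokens, tags))   # [('Elon Musk', 'PER'), ('SpaceX', 'ORG')]
```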

6. Neural Machine Translation (NMT)

6.1 The Shift from SMT to NMT

Traditional Statistical Machine Translation (SMT) used large phrase tables and separate language models. NMT attempts to build a single neural network that can be jointly tuned to maximize translation performance.

6.2 The Encoder-Decoder Architecture (Seq2Seq)

NMT relies on a Sequence-to-Sequence (seq2seq) model.

  1. Encoder:
    • An RNN (or LSTM/GRU) that takes the input sentence one word at a time.
    • It compresses the entire semantic meaning of the source sentence into a final hidden state (often called the Context Vector).
  2. Decoder:
    • Another RNN that takes the context vector as its initial state.
    • It generates the target sentence one word at a time.
    • The output of the decoder at time t is fed as input at time t+1.

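A compressed PyTorch sketch of the encoder-decoder idea, assuming greedy decoding, placeholder vocabulary sizes, and a placeholder start-of-sequence id; it is meant for intuition, not as a production NMT system:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, max_len=20, sos_id=1):
        # Encoder: compress the whole source sentence into its final (h, c) state
        _, state = self.encoder(self.src_embed(src_ids))
        # Decoder: start from that state and emit one word at a time,
        # feeding each prediction back in as the next input (greedy decoding)
        token = torch.full((src_ids.size(0), 1), sos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            dec_out, state = self.decoder(self.tgt_embed(token), state)
            logits = self.out(dec_out[:, -1])
            token = logits.argmax(dim=-1, keepdim=True)
            outputs.append(token)
        return torch.cat(outputs, dim=1)

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
translation_ids = model(torch.randint(0, 8000, (1, 7)))   # a 7-word source sentence
```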

7. Shortcomings of a Traditional Seq2Seq Model

While revolutionary, the basic Encoder-Decoder architecture has a critical design flaw known as the Information Bottleneck.

7.1 The Bottleneck Problem

  • The Encoder must compress the entire input sequence—regardless of its length—into a fixed-length vector (the final hidden state).
  • If the sentence is long (e.g., 50+ words), the vector cannot preserve all the necessary information from the beginning of the sentence.
  • Consequence:
    • Performance degrades sharply as sentence length increases.
    • The model "forgets" details from the start of the source sentence when generating the target.

8. Introduction to Attention Mechanism

8.1 The Intuition

When a human translates a sentence, they do not memorize the whole sentence, close their eyes, and recite the translation. Instead, they read the source sentence and, while writing each word of the translation, they focus (pay attention) on the specific part of the source sentence that is relevant to the word they are currently writing.

The Attention Mechanism allows the Neural Network to mimic this behavior.

8.2 How Attention Works

Instead of relying on just the final hidden state of the encoder, the Attention mechanism gives the decoder access to all the hidden states of the encoder.

  1. Alignment Scores: At every step of the decoder, the model calculates an alignment score (similarity) between the decoder's current hidden state and every hidden state of the encoder.
  2. Attention Weights: These scores are passed through a Softmax layer to generate weights (probabilities summing to 1). A high weight means "pay close attention to this specific source word."
  3. Context Vector: A weighted sum of the encoder hidden states is calculated:

    c_t = Σ_i α_{t,i} · h_i

    (where α_{t,i} is the attention weight and h_i is the i-th encoder hidden state).
  4. Generation: The decoder uses this dynamic, focused context vector to generate the next word.
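
A minimal NumPy sketch of steps 1-3 using a simple dot-product score (real systems often use additive/Bahdanau or multiplicative/Luong scoring; all shapes and names here are illustrative):

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """decoder_state: (hidden_dim,), encoder_states: (T_x, hidden_dim)."""
    # 1. Alignment scores: similarity between the decoder state and
    #    every encoder hidden state (here a plain dot product)
    scores = encoder_states @ decoder_state            # (T_x,)
    # 2. Attention weights: softmax so the weights sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # 3. Context vector: weighted sum of the encoder hidden states
    context = weights @ encoder_states                 # (hidden_dim,)
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 16))   # 6 source words, hidden_dim = 16
decoder_state = rng.normal(size=16)
context, weights = attention(decoder_state, encoder_states)
```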

8.3 Benefits of Attention

  • Solves the Bottleneck: No need to compress everything into one fixed vector.
  • Handles Long Sequences: Performance does not degrade as sequence length increases.
  • Interpretability: By visualizing the attention weights (Attention Map), we can see exactly which source words the model focused on to generate a specific target word (e.g., aligning "European Economic Area" in English to "zone économique européenne" in French).

Summary: Attention changed NLP from "reading the whole book then writing a summary" to "referencing the relevant page while writing each sentence." This concept paved the way for the Transformer architecture (BERT, GPT), which relies entirely on attention mechanisms ("Attention Is All You Need").