Unit 3 - Notes

CSE472

Unit 3: Deep Learning Sequence Models for NLP

1. Sequential Text Data

In Natural Language Processing (NLP), text is fundamentally sequential. Unlike tabular data or static images, the order of words in a sentence heavily dictates its meaning (e.g., "The dog bit the man" vs. "The man bit the dog").

Time-steps: Each word, sub-word, or character is treated as a token at a specific time-step ( $t$ ).
Contextual Dependency: The interpretation of a token often depends on preceding (and sometimes succeeding) tokens. Long-term dependencies exist when a word refers back to a concept introduced much earlier in the text.
Variable Length: Sequential text data does not have a fixed size. Models must be capable of processing inputs of varying lengths, usually handled via padding and masking during batching.
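The padding and masking mentioned above can be sketched in a few lines of plain Python. The function name `pad_batch` and the choice of `0` as the padding id are assumptions made for illustration; deep learning libraries ship their own utilities for this.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id sequences to the longest length and build a mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


# Hypothetical batch of three sequences with lengths 3, 1 and 2
batch = [[5, 8, 2], [7], [4, 9]]
padded, mask = pad_batch(batch)
```

Every row of `padded` now has length 3, and `mask` records which positions hold real tokens so that the loss (or a pooling step) can skip the padded ones.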

2. Recurrent Neural Networks (RNNs)

Recurrent Neural Networks are the foundational deep learning architecture designed for sequential data.

Architecture: RNNs maintain a "hidden state" ( $h_t$ ) that acts as a memory of the sequence seen so far. At each time-step $t$ , the RNN takes the current input $x_t$ and the previous hidden state $h_{t-1}$ to produce the new hidden state.
Mathematical Representation:
$h_t = \sigma(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$
$y_t = \text{softmax}(W_{hy}h_t + b_y)$
(where $\sigma$ here denotes a generic activation function, typically tanh or ReLU, and the $W$ terms are learnable weight matrices shared across all time-steps).
Unfolding in Time: An RNN can be conceptualized as multiple copies of the same network, each passing a message to a successor.
The Vanishing Gradient Problem: When backpropagating through time (BPTT) over long sequences, the gradient is multiplied at every step by the recurrent weight matrix $W_{hh}$ and the derivative of the activation function. Over many steps this product tends to shrink toward zero (vanishing) or grow without bound (exploding), which makes it very difficult for standard RNNs to learn long-range dependencies.
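To make the recurrence and the vanishing gradient concrete, here is a minimal sketch that uses a scalar RNN (hidden size 1, tanh activation, no output layer). The weights `w_hh=0.5` and `w_xh=1.0` and the constant input are arbitrary values chosen for illustration. For a scalar state, the chain rule gives the gradient of the last hidden state with respect to the initial one as a product of per-step factors $w_{hh}(1 - h_t^2)$.

```python
import math


def rnn_forward(xs, w_hh, w_xh, b_h=0.0, h0=0.0):
    """Scalar RNN: h_t = tanh(w_hh * h_{t-1} + w_xh * x_t + b_h)."""
    hs, h = [], h0
    for x in xs:
        h = math.tanh(w_hh * h + w_xh * x + b_h)
        hs.append(h)
    return hs


def gradient_through_time(hs, w_hh):
    """d h_T / d h_0 as the product over t of w_hh * (1 - h_t^2)."""
    grad = 1.0
    for h in hs:
        grad *= w_hh * (1.0 - h * h)  # tanh'(a_t) = 1 - h_t^2
    return grad


xs = [0.5] * 50                                   # hypothetical constant input
hs = rnn_forward(xs, w_hh=0.5, w_xh=1.0)
grad_short = gradient_through_time(hs[:5], 0.5)   # across 5 steps
grad_long = gradient_through_time(hs, 0.5)        # across 50 steps
```

Each factor here has magnitude at most 0.5, so `grad_long` is many orders of magnitude smaller than `grad_short`: the signal from step 0 has effectively disappeared by step 50.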

3. Long Short-Term Memory Networks (LSTMs)

LSTMs are a specialized variant of RNNs designed specifically to combat the vanishing gradient problem and capture long-range dependencies.

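Cell State ( $C_t$ ): The core innovation of the LSTM. It runs along the entire chain with only element-wise scaling and addition applied to it, which lets gradients flow across many time-steps with far less attenuation than in a standard RNN.
Gates: LSTMs regulate the flow of information using three gates, each composed of a sigmoid layer followed by a pointwise multiplication. In the equations below, $\sigma$ is the logistic sigmoid, whose output in (0, 1) sets how much of each component passes through, and $*$ denotes element-wise multiplication: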
1. Forget Gate ( $f_t$ ): Decides what information to throw away from the previous cell state.
  $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
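2. Input Gate ( $i_t$ ): Decides which values will be updated, while a tanh layer creates a vector of new candidate values ( $\tilde{C}_t$ ). The cell state is then updated by scaling the old state with the forget gate and adding the gated candidates.
  $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
  $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
  $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$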
3. Output Gate ( $o_t$ ): Decides what to output based on the filtered cell state.
  $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
  $h_t = o_t * \tanh(C_t)$
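The gate equations above can be traced with a small sketch that uses a scalar state, so each $W \cdot [h_{t-1}, x_t]$ reduces to two scalar weights. The cell update used here, $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$, is the standard LSTM formulation. The `params` layout and its values are hypothetical, picked so that the forget gate stays open and the input gate stays shut.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with scalar state; p maps gate name -> (w_h, w_x, b)."""
    def pre(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x_t + b   # W . [h_{t-1}, x_t] + b

    f_t = sigmoid(pre("f"))                   # forget gate
    i_t = sigmoid(pre("i"))                   # input gate
    c_tilde = math.tanh(pre("c"))             # candidate values
    c_t = f_t * c_prev + i_t * c_tilde        # cell state update
    o_t = sigmoid(pre("o"))                   # output gate
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


# Hypothetical parameters: forget gate saturated open, input gate nearly shut
params = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
          "c": (0.5, 0.5, 0.0), "o": (0.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for x in [0.3, -0.7, 0.9] * 10:               # 30 time-steps
    h, c = lstm_step(x, h, c, params)
```

After 30 steps the cell state is still close to its initial value of 1.0, which is the behavior that lets an LSTM carry information across long spans when the gates learn to allow it.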

4. Gated Recurrent Units (GRUs)

GRUs are a more recent, simplified variation of LSTMs. They also target the vanishing gradient problem but use a more streamlined architecture with fewer parameters, which generally makes them faster to train and less memory-intensive.

Key Differences from LSTM: GRUs combine the forget and input gates into a single "update gate" and merge the cell state and hidden state.
Gates:
1. Update Gate ( $z_t$ ): Determines how much of the past knowledge needs to be passed along to the future.
2. Reset Gate ( $r_t$ ): Determines how much of the past knowledge to forget.
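Trade-off: LSTMs have more parameters and a separate cell state, which may help on some tasks, but GRUs often perform comparably while requiring fewer computational resources. Neither is consistently better, so the choice is usually made empirically.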
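For reference, the two gates described above can be written in the same notation as the LSTM equations. This is one common formulation, and the symbols $W_z$, $W_r$, $W_h$ and the candidate state $\tilde{h}_t$ are introduced here for illustration. Some references and libraries swap the roles of $z_t$ and $1 - z_t$ in the last line, so check the convention before comparing implementations.

```latex
\begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) \\
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \\
\tilde{h}_t &= \tanh(W_h \cdot [r_t * h_{t-1}, x_t] + b_h) \\
h_t &= (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\end{aligned}
```

With this convention, $z_t$ close to 0 copies the previous hidden state forward almost unchanged, and $r_t$ close to 0 makes the candidate ignore the previous state. There are three weight blocks instead of the LSTM's four, which is where the parameter savings come from.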

5. Bidirectional RNNs (Bi-RNNs)

Standard RNNs process sequences strictly from left to right (past to future). However, in many NLP tasks (where the entire text is available at once), looking at future context is just as important as past context.

Architecture: Bi-RNNs consist of two independent RNNs (or LSTMs/GRUs).
- One processes the sequence forward: $\overrightarrow{h_t}$
- One processes the sequence backward: $\overleftarrow{h_t}$
Output: The representations are usually concatenated at each time step: $h_t = [\overrightarrow{h_t} ; \overleftarrow{h_t}]$ .
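Advantage: At every time step the representation combines left context (from the forward pass) and right context (from the backward pass), which typically improves performance on tagging tasks such as Named Entity Recognition (NER) and on the encoder side of translation models. Because the backward pass needs the full sequence, Bi-RNNs are not suited to left-to-right generation or streaming input.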
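The forward/backward/concatenate pattern can be sketched with a scalar tanh RNN. As a simplification, both directions share the same hypothetical default weights here; in a real Bi-RNN each direction has its own parameters.

```python
import math


def run_rnn(xs, w_hh=0.5, w_xh=1.0):
    """Scalar tanh RNN; returns the hidden state at every position."""
    hs, h = [], 0.0
    for x in xs:
        h = math.tanh(w_hh * h + w_xh * x)
        hs.append(h)
    return hs


def run_birnn(xs):
    forward = run_rnn(xs)                 # reads x_1 ... x_T
    backward = run_rnn(xs[::-1])[::-1]    # reads x_T ... x_1, re-aligned to positions
    # Concatenate the two states at each position: h_t = [fwd_t ; bwd_t]
    return list(zip(forward, backward))


states = run_birnn([0.2, -0.4, 0.9])
changed = run_birnn([0.2, -0.4, -0.9])    # same prefix, different last input
```

Changing only the last input leaves the forward state at position 0 untouched but alters the backward state there, which shows that the concatenated representation at every position sees the whole sequence.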

6. Sequence Modeling Applications

Sequence models can be configured in different ways (Many-to-One, Many-to-Many, One-to-Many) depending on the task.

Text Classification

Definition: Categorizing a sequence of text into one or more predefined labels (e.g., spam detection, topic categorization).
Architecture (Many-to-One): The sequence is fed into an RNN/LSTM step-by-step. The final hidden state $h_n$ (which ideally summarizes the context of the whole document) is passed through a dense (fully connected) layer to output class probabilities: a Softmax activation when exactly one label applies, or a Sigmoid activation per label when several labels can apply at once.
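The classification head described above (dense layer plus softmax on the final hidden state) can be sketched as follows. The hidden state `h_n`, the weights `W`, and the biases `b` are hypothetical values; in practice `h_n` comes from the recurrent network and the weights are learned.

```python
import math


def softmax(logits):
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_n, W, b):
    """Dense layer on the final hidden state followed by softmax."""
    logits = [sum(w * h for w, h in zip(row, h_n)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(logits)


h_n = [0.9, -0.2, 0.4]                         # hypothetical final hidden state
W = [[1.0, 0.0, 0.5], [-1.0, 0.5, 0.0]]        # hypothetical weights: 2 classes x 3 units
b = [0.0, 0.1]
probs = classify(h_n, W, b)
```

The two entries of `probs` sum to 1, and the first class receives the higher probability for these values.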

Sentiment Classification

Definition: A specific sub-field of text classification where the goal is to determine the emotional tone behind a series of words (e.g., Positive, Negative, Neutral).
Challenges: Capturing context is vital (e.g., "The movie was not good"). LSTMs and Bi-LSTMs handle this well because they can link the negation "not" with the adjective "good" even when other words separate them (e.g., "not particularly good").

7. Sequence Training Techniques

Training recurrent networks involves unique challenges regarding stability and computational limits.

Teacher Forcing

Concept: A training strategy used primarily in sequence generation (e.g., machine translation, language modeling).
Mechanism: Instead of feeding the model's own (possibly incorrect) prediction from time-step $t-1$ as the input for time-step $t$ , Teacher Forcing feeds the actual ground truth target from time-step $t-1$ .
Benefits: Prevents early mistakes from compounding, making training faster and more stable.
Drawback (Exposure Bias): During inference (testing), the model does not have access to ground-truth tokens and must rely on its own generated tokens. This discrepancy between training and testing can hurt performance. Scheduled Sampling, which gradually replaces ground-truth inputs with the model's own predictions during training, is often used as a middle ground.
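The difference between the two input regimes can be illustrated with a toy decoder loop. Everything here is hypothetical: `step_fn` stands in for one decoder step (a real decoder would also carry a hidden state), and the lookup table plays the role of a model that makes one mistake after the token "the".

```python
def decode(step_fn, targets, start_token="<s>", teacher_forcing=True):
    """Run a decoder step function for len(targets) steps.

    Returns the inputs that were fed to the model and its predictions.
    """
    inputs, predictions = [], []
    prev = start_token
    for target in targets:
        inputs.append(prev)
        pred = step_fn(prev)
        predictions.append(pred)
        # Teacher forcing feeds the ground truth; free running feeds the prediction
        prev = target if teacher_forcing else pred
    return inputs, predictions


# Hypothetical toy "model": predicts "dog" after "the" although the target is "cat"
table = {"<s>": "the", "the": "dog", "cat": "sat", "dog": "barked", "sat": "</s>"}
step = lambda tok: table.get(tok, "</s>")
targets = ["the", "cat", "sat", "</s>"]

tf_inputs, tf_preds = decode(step, targets, teacher_forcing=True)
fr_inputs, fr_preds = decode(step, targets, teacher_forcing=False)
tf_errors = sum(p != t for p, t in zip(tf_preds, targets))
fr_errors = sum(p != t for p, t in zip(fr_preds, targets))
```

With teacher forcing the single mistake stays isolated (one wrong prediction), because the next input is reset to the ground truth. In free-running mode the wrong token is fed back in and causes a second error, which is the compounding effect that exposure bias refers to.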

Truncated Backpropagation Through Time (TBPTT)

Concept: Standard BPTT unrolls the RNN across the entire sequence length. For very long documents, this requires memory that grows linearly with the sequence length and aggravates vanishing/exploding gradients.
Mechanism: TBPTT processes the sequence in fixed-size chunks of $k$ steps.
- Forward pass: Run through $k$ steps.
- Backward pass: Backpropagate errors only through those $k$ steps.
- Hidden State: The final hidden state of the chunk is detached from the computation graph and passed as the initial hidden state for the next chunk.
Benefits: Makes training on very long or unbounded (streaming) sequences computationally feasible, although gradients can no longer assign credit across more than $k$ steps.

PYTHON

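# Sketch of TBPTT in PyTorch.
# Assumes model, criterion, optimizer and sequence_chunks are defined elsewhere.
hidden = model.init_hidden()
for chunk, target_chunk in sequence_chunks:  # (input, target) pairs of k steps each
    # Detach the hidden state so gradients stop at the chunk boundary instead of
    # flowing all the way back to step 0.
    # (For an LSTM, detach both tensors: hidden = tuple(h.detach() for h in hidden))
    hidden = hidden.detach()

    optimizer.zero_grad()                    # clear gradients from the previous chunk
    output, hidden = model(chunk, hidden)
    loss = criterion(output, target_chunk)

    loss.backward()                          # backpropagates through at most k steps
    optimizer.step()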
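
The loop above consumes chunks of length $k$ together with their targets. One way to produce them for next-token prediction is sketched below; the helper name `make_chunks` is an assumption, and the token ids are arbitrary.

```python
def make_chunks(token_ids, k):
    """Yield (chunk, target_chunk) pairs of at most k steps each."""
    # For next-token prediction the targets are the inputs shifted left by one
    inputs, targets = token_ids[:-1], token_ids[1:]
    for start in range(0, len(inputs), k):
        yield inputs[start:start + k], targets[start:start + k]


# Hypothetical sequence of 8 token ids, split into chunks of k = 3 steps
sequence_chunks = list(make_chunks([10, 11, 12, 13, 14, 15, 16, 17], k=3))
```

The last chunk may be shorter than $k$ when the sequence length is not a multiple of the chunk size. The chunks must be processed in order, because the detached hidden state from one chunk initializes the next.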

8. Evaluation Metrics for Sequence Tasks

The choice of metric heavily depends on the specific NLP sequence task.

For Sequence Classification (Text/Sentiment):
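- Accuracy: The percentage of correctly predicted labels. It is most informative on balanced datasets, since on imbalanced data a model can score highly by always predicting the majority class.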
- Precision, Recall, and F1-Score: Crucial for imbalanced datasets (e.g., detecting rare hate speech). F1-Score is the harmonic mean of Precision and Recall.
- Cross-Entropy Loss: Used during training to measure the divergence between the predicted probability distribution and the true distribution.
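The classification metrics above can be computed directly from the confusion counts. The labels below are made-up data for an imbalanced binary task (2 positives out of 10), chosen to show how accuracy and F1 can diverge.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical imbalanced labels: 2 positives out of 10
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Accuracy comes out at 0.8 even though the model finds only one of the two positive examples. Precision, recall, and F1 are all 0.5, which reflects performance on the rare class more faithfully.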
For Sequence Generation / Language Modeling:
- Perplexity (PPL): The standard metric for language models. It measures how well a probability model predicts a sample. Lower perplexity indicates the model is better at predicting the sequence. It is the exponentiated average negative log-likelihood per token: $\text{PPL} = \exp\left(-\frac{1}{N}\sum_{t=1}^{N} \log p(w_t \mid w_{<t})\right)$ .
- BLEU (Bilingual Evaluation Understudy): Used for machine translation. Measures clipped n-gram precision of the generated sequence against one or more reference sequences, combined with a brevity penalty that discourages overly short outputs.
- ROUGE: Used for text summarization. Focuses on recall of n-grams between the generated summary and the reference.
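A minimal sketch of two of these quantities: perplexity from per-token probabilities, and the clipped n-gram precision that BLEU is built on. The second function is only a building block under simplifying assumptions (single reference, one value of n). Full BLEU combines several n-gram orders and a brevity penalty. The probabilities and sentences are made-up examples.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """exp of the average negative log-likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def clipped_ngram_precision(candidate, reference, n=1):
    """Clipped (modified) n-gram precision against a single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


ppl_uniform = perplexity([0.25] * 8)   # model assigns probability 1/4 to every true token
p1 = clipped_ngram_precision("the the the cat".split(), "the cat sat".split())
```

A model that assigns probability 1/4 to each correct token has perplexity 4, matching the intuition of choosing uniformly among four options. Clipping prevents the repeated "the" from being credited more than once, so the unigram precision is 2/4 rather than 4/4.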
