Unit 4 - Notes

CSE472

Unit 4: Sequence-to-Sequence Models and Attention Mechanisms

1. Encoder–Decoder Architectures for NLP

The encoder-decoder architecture is a foundational framework in Deep Learning for Natural Language Processing (NLP), designed specifically to map variable-length input sequences to variable-length output sequences.

Core Components

  • Encoder: Reads the input sequence step-by-step. Traditionally built using Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), or Gated Recurrent Units (GRUs). The encoder processes the input and compresses the information into a fixed-length hidden state representation, often called the context vector ($c$).
  • Context Vector: The final hidden state of the encoder (or a transformation of it). It acts as an informational bottleneck, encapsulating the semantic meaning of the entire input sequence.
  • Decoder: Another RNN/LSTM/GRU that takes the context vector as its initial hidden state and generates the output sequence one token at a time. The decoder stops generating when it outputs a special end-of-sequence <EOS> token.
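
The three components above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: `rnn_step` is a made-up elementwise recurrence (not an LSTM or GRU), and `scripted_step` is a hypothetical stand-in for a trained decoder cell. It shows two properties from the list: the context vector has the same length regardless of input length, and decoding stops at `<EOS>`.

```python
import math

def rnn_step(hidden, x):
    """Toy recurrent update (assumption): h_new[i] = tanh(0.5 * h[i] + x[i])."""
    return [math.tanh(0.5 * h + xi) for h, xi in zip(hidden, x)]

def encode(embeddings, hidden_size):
    """Fold a variable-length sequence into one fixed-length context vector."""
    hidden = [0.0] * hidden_size
    for x in embeddings:
        hidden = rnn_step(hidden, x)
    return hidden  # final hidden state = context vector

def greedy_decode(context, step_fn, sos="<SOS>", eos="<EOS>", max_len=20):
    """Generate one token at a time until <EOS> (or max_len) is reached."""
    hidden, token, output = context, sos, []
    for _ in range(max_len):
        token, hidden = step_fn(token, hidden)
        if token == eos:
            break
        output.append(token)
    return output

# Hypothetical scripted decoder step standing in for a trained RNN cell.
SCRIPT = {"<SOS>": "Comment", "Comment": "ça", "ça": "va", "va": "?", "?": "<EOS>"}

def scripted_step(prev_token, hidden):
    return SCRIPT[prev_token], hidden

short_context = encode([[0.1, 0.2], [0.3, 0.4]], hidden_size=2)
long_context = encode([[0.1, 0.2]] * 10, hidden_size=2)
translation = greedy_decode(short_context, scripted_step)
```

Both `short_context` and `long_context` have length 2 even though the inputs have 2 and 10 tokens, which is exactly the bottleneck property of the context vector.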

Training Mechanism

Encoder-decoder models are typically trained using Teacher Forcing. During training, the decoder receives the actual target token from the previous time step as input for the current time step, rather than its own predicted token. This stabilizes training and accelerates convergence.
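
A minimal sketch of what teacher forcing changes, assuming a made-up model output with one wrong token. The helper `decoder_inputs` is hypothetical and only lists which token is fed to the decoder at each step; in a real free-running decoder, later predictions would also change once an earlier one is wrong.

```python
def decoder_inputs(target, predicted, teacher_forcing=True, sos="<SOS>"):
    """Return the token fed into the decoder at each time step.

    Step 0 always receives the start token. Step t receives either the
    ground-truth token t-1 (teacher forcing) or the model's own prediction.
    """
    previous = target if teacher_forcing else predicted
    return [sos] + list(previous[:-1])

target = ["Comment", "ça", "va", "?", "<EOS>"]
predicted = ["Comment", "tu", "va", "?", "<EOS>"]  # made-up output with one error

forced = decoder_inputs(target, predicted, teacher_forcing=True)
free = decoder_inputs(target, predicted, teacher_forcing=False)
```

With teacher forcing the wrong token "tu" never enters the decoder, so a single early mistake does not derail the rest of the training sequence.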


2. Sequence-to-Sequence (Seq2Seq) Models for Machine Translation and Summarization

Seq2Seq models are specific implementations of the encoder-decoder architecture. They revolutionized tasks where inputs and outputs do not have a one-to-one mapping (unaligned sequences).

Machine Translation

  • Process: An English sentence (e.g., "How are you?") is passed into the encoder. The resulting context vector initializes the decoder, which generates the French translation (e.g., "Comment ça va ?").
  • Challenge Solved: Solves the problem of differing word orders and sequence lengths between source and target languages.

Text Summarization

  • Abstractive Summarization: Unlike extractive summarization (which copies exact sentences), Seq2Seq models perform abstractive summarization. They "understand" the source document via the encoder and generate a novel, concise summary via the decoder.
  • Handling Vocabulary: Often employs techniques like Pointer-Generator networks to allow the model to either generate a new word from the vocabulary or copy a rare word/named entity directly from the source text.
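
The generate-or-copy choice can be illustrated with the mixture commonly used in pointer-generator networks: a generation probability `p_gen` blends the vocabulary distribution with the attention distribution over source tokens. The sketch below is an assumption-laden toy: `p_gen`, the vocabulary probabilities, the attention weights and the name "Kowalczyk" are all invented, whereas a real model would compute them.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on copies of w)."""
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

vocab_dist = {"the": 0.5, "match": 0.3, "won": 0.2}   # fixed vocabulary only
source_tokens = ["Kowalczyk", "won", "the", "match"]  # "Kowalczyk" is out of vocabulary
attention = [0.7, 0.1, 0.1, 0.1]                      # invented attention weights

dist = final_distribution(0.6, vocab_dist, attention, source_tokens)
```

The out-of-vocabulary name receives probability only through the copy term, which is how rare words and named entities can still appear in the summary.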

3. Attention in Deep NLP

The Bottleneck Problem

In classical Seq2Seq models, forcing the encoder to compress a long sentence into a single, fixed-length context vector leads to catastrophic information loss. The model "forgets" earlier parts of a long sequence by the time it finishes reading it.

The Attention Solution

The Attention mechanism solves this by allowing the decoder to "look back" at the entire input sequence at every step of generation. Instead of a single context vector, the encoder outputs a sequence of hidden states $h_1, \dots, h_n$. The decoder then computes a dynamic context vector $c_t$ for each generation step $t$ by taking a weighted sum of all encoder hidden states.


4. Soft Attention

Attention mechanisms can be categorized into Hard and Soft attention.

  • Hard Attention: The model focuses on a single input word at a time. This involves stochastic processes and is non-differentiable, requiring reinforcement learning to train.
  • Soft Attention: The model computes a continuous, probabilistic weight distribution over all input words. Because it relies on standard differentiable functions (like Softmax), it can be trained easily using standard backpropagation.
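
The difference between the two can be sketched as follows, assuming the attention weights are already given. The function names are illustrative only. Soft attention averages all positions (a smooth function of the weights), while hard attention samples a single position, which is the step that blocks ordinary backpropagation.

```python
import random

def soft_select(weights, values):
    """Soft attention: weighted average over ALL value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def hard_select(weights, values, rng):
    """Hard attention: sample ONE position according to the weights."""
    index = rng.choices(range(len(values)), weights=weights, k=1)[0]
    return values[index]

values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder states
weights = [0.2, 0.3, 0.5]                      # toy attention distribution

soft = soft_select(weights, values)                    # blend of all three states
hard = hard_select(weights, values, random.Random(0))  # exactly one of the states
```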

Mathematical Formulation of Soft Attention

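For a decoder hidden state $s_t$ and encoder hidden states $h_1, \dots, h_n$:

  1. Calculate Alignment Scores: $e_{t,i} = \text{score}(s_t, h_i)$
  2. Calculate Attention Weights: $\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{n} \exp(e_{t,k})}$ (This is the Softmax function)
  3. Compute Context Vector: $c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i$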
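
The three steps can be traced in plain Python. This is a minimal sketch that assumes a dot product as the score function and uses invented two-dimensional states; any other score function can be passed as `score_fn`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(decoder_state, encoder_states, score_fn=dot):
    # Step 1: one alignment score per encoder state.
    scores = [score_fn(decoder_state, h) for h in encoder_states]
    # Step 2: normalise the scores into attention weights.
    weights = softmax(scores)
    # Step 3: context vector = weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return context, weights

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values
decoder_state = [2.0, 0.0]
context, weights = soft_attention(decoder_state, encoder_states)
```

Here the first and third encoder states get equal, larger weights because they align with the decoder state, and the second one is mostly ignored.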

5. Bahdanau and Luong Attention

Two primary variations of soft attention dominate RNN-based Seq2Seq models, differing mainly in how the alignment score is calculated.

Bahdanau Attention (Additive Attention)

Introduced by Bahdanau et al. (2014), this was the first attention mechanism applied to neural Machine Translation.

  • Mechanism: Uses a feed-forward neural network with a single hidden layer to calculate the alignment score.
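  • Score Function: $e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$
    • $W_a$ and $U_a$ are learnable weight matrices and $v_a$ is a learnable weight vector; $s_{t-1}$ is the previous decoder hidden state and $h_i$ is the $i$-th encoder hidden state.
  • Key Trait: It uses the decoder's previous hidden state ($s_{t-1}$) to calculate attention. The resulting context vector is then concatenated with the decoder input to calculate the current hidden state ($s_t$).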
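
A minimal sketch of the additive score, written as $v^\top \tanh(W s_{prev} + U h)$ with plain Python lists. The identity matrices and the state values are placeholders chosen to make the arithmetic easy to follow; in a trained model `W`, `U` and `v` are learned.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_score(prev_decoder_state, encoder_state, W, U, v):
    """Additive (Bahdanau-style) score: v^T tanh(W s_prev + U h)."""
    projected = [a + b for a, b in zip(matvec(W, prev_decoder_state),
                                       matvec(U, encoder_state))]
    return sum(vi * math.tanh(p) for vi, p in zip(v, projected))

# Placeholder parameters (assumptions, not learned values).
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

# W s + U h = [1.0, 0.0], so the score is tanh(1.0) + tanh(0.0).
score = additive_score([0.5, -0.5], [0.5, 0.5], W, U, v)
```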

Luong Attention (Multiplicative / Dot-Product Attention)

Introduced by Luong et al. (2015), this approach simplifies and generalizes the attention calculation.

  • Mechanism: Uses multiplicative operations (dot products), making it computationally faster and more memory-efficient than Bahdanau's additive approach.
  • Score Functions: Luong proposed three ways to compute the score:
    1. Dot: $\text{score}(s_t, h_i) = s_t^\top h_i$
    2. General: $\text{score}(s_t, h_i) = s_t^\top W_a h_i$ (where $W_a$ is a learnable weight matrix)
    3. Concat: $\text{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$
  • Key Trait: It uses the decoder's current hidden state ($s_t$) to calculate the context vector, which is then used to predict the current output word. Luong also introduced Global Attention (attending to all source words) vs. Local Attention (attending to a subset/window of source words).
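
The three score functions can be compared side by side. The sketch below uses plain lists, with `s` as the current decoder state and `h` as one encoder state; the matrices and the vector `v` are arbitrary placeholders, not learned parameters.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def score_dot(s, h):
    """Dot: s^T h (requires equal dimensions, no parameters)."""
    return dot(s, h)

def score_general(s, h, W):
    """General: s^T W h."""
    return dot(s, matvec(W, h))

def score_concat(s, h, W, v):
    """Concat: v^T tanh(W [s; h]), where s + h is list concatenation."""
    return dot(v, [math.tanh(x) for x in matvec(W, s + h)])

s = [1.0, 2.0]
h = [3.0, 4.0]
W_general = [[0.0, 1.0], [1.0, 0.0]]                      # placeholder, swaps h's entries
W_concat = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]   # placeholder, picks s[0] and h[1]
v = [1.0, 1.0]

d = score_dot(s, h)                    # 1*3 + 2*4
g = score_general(s, h, W_general)     # 1*4 + 2*3
c = score_concat(s, h, W_concat, v)    # tanh(1) + tanh(4)
```

The dot variant needs no parameters at all, which is where the speed and memory advantage over the additive form comes from.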

6. Integrating Attention into Encoder–Decoder Networks

Integrating attention changes the flow of information in the decoder step.

Step-by-Step Workflow (Luong style):

  1. Encoder Phase: The input is processed to yield a set of encoder hidden states $H = (h_1, \dots, h_n)$.
  2. Decoder Initial Step: Decoder produces a hidden state $s_t$ based on the previous token and previous hidden state.
  3. Attention Calculation: Compute alignment scores between $s_t$ and all states in $H$.
  4. Softmax: Convert scores to a probability distribution (attention weights, $\alpha_t$).
  5. Context Vector: Multiply weights by $H$ and sum them up to get $c_t$.
  6. Attentional Hidden State: Concatenate $c_t$ and $s_t$, and pass through a linear layer with a $\tanh$ activation to create an attentional hidden state $\tilde{s}_t$.
  7. Prediction: Pass $\tilde{s}_t$ through a dense layer with softmax to predict the vocabulary distribution for the next word.

PYTHON
# Conceptual PyTorch code snippet for Luong (General) Attention
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: [batch, 1, hidden_size]
        # encoder_outputs: [batch, seq_len, hidden_size]
        
        # Apply W to encoder outputs: [batch, seq_len, hidden_size]
        energy = self.W(encoder_outputs) 
        
        # Dot product: [batch, 1, seq_len]
        attention_scores = torch.bmm(decoder_hidden, energy.transpose(1, 2))
        
        # Softmax to get weights: [batch, 1, seq_len]
        attention_weights = F.softmax(attention_scores, dim=2)
        
        # Context vector: [batch, 1, hidden_size]
        context_vector = torch.bmm(attention_weights, encoder_outputs)
        
        return context_vector, attention_weights
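
The snippet above stops at the context vector (steps 3 to 5). Steps 6 and 7 can be sketched as follows. To stay dependency-free this uses plain Python lists instead of PyTorch tensors, and the weight matrices `W_c` and `W_out` are invented placeholders for what would be two learned linear layers.

```python
import math

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentional_step(context, decoder_state, W_c, W_out):
    # Step 6: concatenate [context; decoder_state], project, apply tanh.
    combined = context + decoder_state
    attentional = [math.tanh(x) for x in matvec(W_c, combined)]
    # Step 7: project to vocabulary size and normalise with softmax.
    vocab_probs = softmax(matvec(W_out, attentional))
    return attentional, vocab_probs

context = [0.5, -0.5]        # toy context vector
decoder_state = [1.0, 0.0]   # toy decoder hidden state
W_c = [[0.1, 0.2, 0.3, 0.4],
       [0.4, 0.3, 0.2, 0.1]]                      # hidden_size x (2 * hidden_size)
W_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # vocab_size (3) x hidden_size

attentional, vocab_probs = attentional_step(context, decoder_state, W_c, W_out)
```

The returned `vocab_probs` is a distribution over a three-word toy vocabulary; picking its argmax gives the greedy prediction for the current step.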


7. Evaluation Techniques: BLEU and ROUGE Scores

Evaluating generative NLP models requires specialized metrics, as there are many valid ways to translate or summarize a text.

BLEU (Bilingual Evaluation Understudy)

  • Primary Use: Machine Translation.
  • Mechanism: Measures precision. It counts the overlapping n-grams (unigrams, bigrams, etc.) between the machine-generated translation and one or more human reference translations. Each n-gram count is clipped to the maximum number of times it appears in any single reference, so repeating a correct word does not inflate the score.
  • Brevity Penalty (BP): Since precision can be gamed by generating very short outputs (e.g., generating just "The" if "The" is in the reference), BLEU applies a penalty if the generated text is shorter than the reference text.
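  • Formula: $\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$
    • Where $p_n$ is n-gram precision and $w_n$ is the weight (usually uniform, $w_n = 1/N$).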
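
A minimal sentence-level sketch of the computation, assuming uniform weights, whitespace-tokenised input and no smoothing. BLEU is normally reported at corpus level, and library implementations handle details this toy version ignores, so treat it as an illustration of the formula rather than a reference implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each count is capped by its max count in any reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(candidate, references):
    """BP = 1 if c > r, else exp(1 - r / c), with r the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n  # uniform weights 1/N
    return brevity_penalty(candidate, references) * math.exp(log_mean)

reference = "the cat is on the mat".split()
perfect = bleu(reference, [reference])
repeated = modified_precision(["the"] * 7, [reference], 1)  # clipping example
too_short = brevity_penalty(["the"], [reference])
```

The candidate consisting of seven copies of "the" gets a unigram precision of 2/7 because of clipping, and the one-word candidate is penalised heavily by the brevity penalty.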

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

  • Primary Use: Text Summarization.
  • Mechanism: Measures recall. It evaluates how much of the human-generated reference summary is captured in the machine-generated summary.
  • Variants:
    • ROUGE-N: Measures n-gram overlap (e.g., ROUGE-1 for unigrams, ROUGE-2 for bigrams).
    • ROUGE-L: Measures the Longest Common Subsequence (LCS). It takes sentence structure into account without requiring strict consecutive n-gram matching.
    • ROUGE-S: Skip-bigram co-occurrence.
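
The ROUGE-N and ROUGE-L variants can be sketched as recall-only scores over whitespace tokens. This is a simplified illustration: published ROUGE implementations usually also report precision and an F-measure and may apply stemming, none of which is modelled here. The example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference n-grams that also appear in the candidate."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    if not ref_counts:
        return 0.0
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat was on the mat".split()

r1 = rouge_n_recall(candidate, reference, n=1)  # 5 of 6 reference unigrams matched
r2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 5 reference bigrams matched
rl = rouge_l_recall(candidate, reference)       # LCS "the cat on the mat" has length 5
```

ROUGE-L stays at 5/6 despite the substituted word because the matching tokens keep their order, whereas ROUGE-2 drops to 3/5 since two bigrams around the substitution are broken.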

8. Limitations of Classical Seq2Seq Models

While Seq2Seq models with attention marked a massive leap forward in NLP, classical RNN/LSTM-based implementations have critical limitations:

  1. Sequential Computation Bottleneck: RNNs must process tokens one by one (token $t$ requires the state from token $t-1$). This inherently prevents parallelization across time steps during training, making training on massive datasets extremely slow.
  2. Long-Range Dependencies: Even with LSTMs, GRUs, and Attention, recurrent models struggle to maintain context over very long documents. Gradients can still vanish or explode over hundreds of time steps.
  3. Context Vector Complexity: While Attention solves the bottleneck of the encoder, calculating attention between every decoder step and every encoder step is computationally expensive ($O(n \cdot m)$ complexity, where $n$ and $m$ are source and target lengths).
  4. Lack of Global Context: Standard recurrent models read the sequence strictly left-to-right; bi-directional models add a separate right-to-left pass and combine the two states. In both cases each state is built up incrementally, making it harder to build deeply unified representations of words based on their full surrounding context simultaneously.

Note: These limitations directly motivated the invention of the Transformer architecture (which abandons recurrence entirely in favor of Self-Attention), paving the way for modern Large Language Models.