CSE472

Unit 5: Transformers and Pretrained Language Models

1. The Transformer Architecture

Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., the Transformer reshaped NLP by replacing Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks, with a purely attention-based architecture. Because attention processes all positions of a sequence at once, training can be parallelized heavily, and long-range dependencies are captured more effectively.

1.1 Self-Attention Mechanism

Self-attention allows a model to weigh the importance of different words in a sequence when encoding a particular word. It computes representations by relating different positions of a single sequence.

Query (Q), Key (K), and Value (V): Every input token is linearly projected into three distinct vectors:
- Query: What the current token is looking for.
- Key: What the token contains.
- Value: The actual content/meaning of the token.

Scaled Dot-Product Attention: The attention output is computed by taking the dot product of each Query with all Keys, dividing by $\sqrt{d_k}$, applying a softmax to obtain attention weights, and using those weights to form a weighted sum of the Values.

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

where $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large as $d_k$ increases, which would otherwise push the softmax function into regions with extremely small gradients.
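
As a minimal sketch of the formula above, the following plain-Python version works on lists of row vectors. The function names and the toy matrices are illustrative assumptions, not code from the paper, and a real implementation would use batched tensor operations instead of loops.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of rows).

    Returns (outputs, weights): outputs is n_q x d_v, weights is n_q x n_k.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example (assumed values): two tokens, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, and each row of `output` is the corresponding weighted average of the rows of `V`. Here the first query matches the first key best, so the first output row lies closer to `V[0]` than to `V[1]`.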

1.2 Multi-Head Attention

Instead of computing a single attention function, the Transformer uses Multi-Head Attention.

- The $Q$, $K$, and $V$ vectors are projected into $h$ different lower-dimensional spaces (heads).
- Attention is computed independently in each head.
- The outputs of all heads are concatenated and linearly projected back to the original dimension.
- Benefit: This allows the model to jointly attend to information from different representation subspaces at different positions (e.g., one head might track syntactic relations, while another tracks coreference).
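
The project, attend, concatenate, and project-back steps can be sketched in plain Python as follows. All names here (`multi_head_attention`, `W_q`, `W_o`, and so on) and the toy weights are assumptions chosen for illustration: each head's projection simply selects two of the four input dimensions, whereas a trained model learns dense projection matrices.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum((e / z) * v[j] for e, v in zip(exps, V))
                    for j in range(len(V[0]))])
    return out

def matmul(X, W):
    """Multiply an n x d matrix by a d x m matrix."""
    return [[sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for x in X]

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v hold one (d_model x d_head) matrix per head.

    W_o is the (h * d_head x d_model) output projection.
    """
    # Project the input into each head's subspace and attend independently
    head_outputs = [
        attention(matmul(X, wq), matmul(X, wk), matmul(X, wv))
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concatenate the head outputs along the feature dimension, token by token
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(X))]
    # Project back to d_model
    return matmul(concat, W_o)

# Illustrative setup (assumed values): d_model = 4, h = 2 heads, d_head = 2
select_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
select_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

X = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
projections = [select_first, select_last]
output = multi_head_attention(X, projections, projections, projections, identity)
```

The output has the same shape as the input (3 tokens by 4 features), which is what allows attention layers to be stacked.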

1.3 Positional Encoding

Unlike RNNs, Transformers do not process tokens sequentially, so they have no inherent sense of token order. Positional encodings are added to the input embeddings at the bottom of the encoder and decoder stacks to inject information about the relative or absolute position of the tokens.

The original implementation uses sine and cosine functions of different frequencies:
- $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$
- $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$
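
A small sketch of the two formulas above in plain Python. The function name `positional_encoding` and the sizes in the example are assumptions for illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `i` walks over the even feature indices, so it plays the role of 2i
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even index: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd index: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=4)
```

Position 0 always encodes to alternating 0s and 1s, since $\sin(0) = 0$ and $\cos(0) = 1$. The resulting table is added element-wise to the token embeddings.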

1.4 Transformer Encoder and Decoder Blocks

Encoder Block:
The encoder is composed of a stack of $N$ identical layers ($N = 6$ in the original paper). Each layer has two sub-layers:

- Multi-Head Self-Attention: Each position attends to all positions in the output of the previous layer.
- Position-wise Feed-Forward Network (FFN): Two linear transformations with a ReLU activation in between, applied to each position independently.

Residual Connections & Normalization: Each sub-layer is wrapped in a residual connection followed by Layer Normalization (Add & Norm), so its output is $\text{LayerNorm}(x + \text{Sublayer}(x))$.
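
The Add & Norm step around the feed-forward sub-layer can be sketched for a single token vector as below. This is a simplified illustration: the learnable gain and bias of layer normalization are omitted, and the weights (`W1`, `b1`, `W2`, `b2`) are toy values assumed for the example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: linear, ReLU, linear."""
    hidden = [max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
              for j in range(len(b1))]
    return [sum(hidden[j] * W2[j][k] for j in range(len(hidden))) + b2[k]
            for k in range(len(b2))]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Toy weights (assumed): identity matrices and zero biases, d_model = d_ff = 4
eye = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [0.0] * 4

x = [1.0, 2.0, 3.0, 4.0]
result = add_and_norm(x, lambda v: feed_forward(v, eye, zeros, eye, zeros))
```

The residual path lets the input pass through unchanged alongside the sub-layer output, and the normalization keeps the scale of activations stable from layer to layer.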

Decoder Block:
The decoder also consists of a stack of $N$ identical layers, but each layer has three sub-layers:

- Masked Multi-Head Self-Attention: Prevents each position from attending to subsequent positions (looking into the future), which preserves the auto-regressive property.
- Multi-Head Cross-Attention: The Queries come from the previous decoder layer, while the Keys and Values come from the output of the encoder.
- Position-wise FFN: Same as in the encoder.

As in the encoder, a residual connection and layer normalization surround each sub-layer.
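
To illustrate how the masked self-attention sub-layer blocks access to future positions, here is a minimal sketch that sets the scores of future positions to negative infinity before the softmax. The function names and the uniform toy scores are assumptions for the example.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def causal_attention_weights(scores):
    """Apply a causal mask to an n x n score matrix, then a row-wise softmax."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        # Future positions get -inf, so they receive zero weight after softmax
        masked = [s if keep else float("-inf") for s, keep in zip(scores[i], mask[i])]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

# Toy scores (assumed): every pair of positions scores the same
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
```

With uniform scores, the first position attends only to itself, the second splits its weight over the first two positions, and the third spreads it over all three.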

2. Tokenization Methods

Traditional word-level tokenization suffers from the Out-Of-Vocabulary (OOV) problem, while character-level tokenization results in overly long sequences and loss of semantic meaning. Subword tokenization offers a middle ground.

2.1 Byte-Pair Encoding (BPE)

BPE is a data compression algorithm adapted for NLP.

- Initialize the vocabulary with single characters.
- Count the frequency of all adjacent symbol pairs in the training corpus.
- Merge the most frequent pair into a new single symbol.
- Repeat the process until a predefined vocabulary size is reached.

Usage: GPT-2 and RoBERTa (both use a byte-level variant of BPE).
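
The merge loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the corpus is a hypothetical dictionary of word frequencies, merges never cross word boundaries, training stops after a fixed number of merges instead of a target vocabulary size, and details such as end-of-word markers or byte-level handling are left out.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary."""
    # Start with every word split into single characters
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus (assumed word counts)
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, vocab = learn_bpe(corpus, num_merges=3)
# merges == [("u", "g"), ("u", "n"), ("h", "ug")]
```

In the first round the pair ("u", "g") occurs 20 times (in "hug", "pug", and "hugs"), more than any other pair, so it is merged first. After three merges, "hugs" is represented as the two symbols "hug" and "s".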

2.2 WordPiece

WordPiece was developed at Google and is used in BERT. It is similar to BPE but uses a different merging criterion.

- Initialize the vocabulary with single characters.
- Instead of merging the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training data when added to the vocabulary.
- Words are broken down into subwords, often using a special prefix (like ## in BERT) to denote that a subword continues a larger word (e.g., "playing" -> "play", "##ing").
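
Once a vocabulary exists, BERT-style WordPiece splits each word greedily, always taking the longest matching piece first. The sketch below shows only this tokenization step, not vocabulary training; the function name and the tiny vocabulary are assumptions for the example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subwords."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches, so the whole word maps to the unknown token
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary
vocab = {"play", "un", "##play", "##ing", "##ed", "##able"}

print(wordpiece_tokenize("playing", vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", vocab))         # ['[UNK]']
```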

3. Pretrained Transformer Models

Pretrained models leverage large amounts of unannotated text to learn general language representations, which are then fine-tuned on specific downstream tasks.

3.1 BERT (Bidirectional Encoder Representations from Transformers)

BERT uses a Transformer Encoder-only architecture. It is designed to pretrain deep bidirectional representations by jointly conditioning on both left and right context in all layers.

Pretraining Objectives:

- Masked Language Modeling (MLM): 15% of the input tokens are masked at random. The model's objective is to predict the original vocabulary id of each masked token based purely on its context. This allows the model to learn bidirectional context.
- Next Sentence Prediction (NSP): The model receives pairs of sentences and must predict whether the second sentence is the actual subsequent sentence in the original document (50% of the time) or a random sentence (50% of the time). This teaches the model relationships between sentences (useful for QA and NLI).
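
The MLM data preparation can be sketched as follows. This is a simplified version: every selected token is replaced by `[MASK]`, whereas BERT replaces only 80% of the selected tokens with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged. The function name, the seed, and the example sentence are assumptions.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at masked positions and None elsewhere,
    so the loss is computed only on the masked positions.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    inputs = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return inputs, labels

tokens = ("the quick brown fox jumps over the lazy dog while "
          "the cat sleeps on a warm sunny window sill today").split()
inputs, labels = mask_tokens(tokens)
```

With 20 input tokens and a 15% masking rate, three positions are masked, and `labels` keeps the original token at exactly those three positions.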

3.2 GPT (Generative Pre-trained Transformer)

GPT models (GPT-1, GPT-2, GPT-3, GPT-4) use a Transformer Decoder-only architecture.

Pretraining Objective:

Causal Language Modeling (CLM): Also known as standard autoregressive language modeling. The model is trained to predict the next token in a sequence given all the preceding tokens. It is strictly left-to-right (unidirectional), enforced by the masked self-attention mechanism in the decoder.
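
As a small illustration of the causal objective, the sketch below turns one token sequence into its training targets: at each position the model sees the prefix and must predict the token that follows. The function name and the example sentence are assumptions.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs for left-to-right language modeling."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["the", "cat", "sat", "down"]
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)

# In practice all pairs are trained in one pass: the labels are the inputs
# shifted left by one position, and the causal mask hides future tokens.
inputs, labels = tokens[:-1], tokens[1:]
```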

3.3 T5 (Text-to-Text Transfer Transformer)

T5 uses a standard Encoder-Decoder Transformer architecture.

- Core Philosophy: Every NLP task is cast as a "text-to-text" problem, so both the input and the output are always text strings. Examples include translation ("translate English to German: That is good."), summarization ("summarize: [text]"), and classification ("cola sentence: [text]", with the output "acceptable" or "unacceptable").
- Pretraining: Uses a span-corruption variant of MLM in which each consecutive span of dropped tokens is replaced by a single sentinel token, and the decoder must predict the dropped spans.
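
The span-corruption format can be sketched as below. The spans are chosen by hand here for clarity (in pretraining they are sampled at random), and the sentinel names `<extra_id_0>`, `<extra_id_1>`, ... follow the naming used by the HuggingFace T5 tokenizer; the function name is an assumption.

```python
def span_corrupt(tokens, spans):
    """Build a T5-style (input, target) pair.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # one sentinel replaces the whole span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the decoder must produce the span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```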

4. Transfer Learning and Fine-Tuning for NLP Tasks

Transfer learning in NLP involves taking a model pretrained on a massive dataset and fine-tuning its weights on a smaller, task-specific dataset.

4.1 Fine-Tuning for Text Classification

- Mechanism: In models like BERT, a special classification token ([CLS]) is prepended to every input sequence. The final hidden state corresponding to this [CLS] token is used as the aggregate sequence representation.
- Architecture: A simple linear classifier (Feed-Forward layer + Softmax) is added on top of the [CLS] token's output. The entire pretrained model and the new layer are trained jointly to minimize cross-entropy loss.
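
The classification head itself is small. The sketch below applies a linear layer and a softmax to a made-up [CLS] hidden state and computes the cross-entropy loss; the hidden size of 3, the weights, and the function names are toy assumptions, and in real fine-tuning the encoder producing the hidden state is updated too.

```python
import math

def classify(cls_hidden, W, b):
    """Linear layer plus softmax on the [CLS] vector. W is num_labels x hidden."""
    logits = [sum(h * w for h, w in zip(cls_hidden, row)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[label])

# Toy values (assumed): hidden size 3, two classes
cls_hidden = [0.5, -1.0, 2.0]
W = [[1.0, 0.0, 0.0],   # weights for class 0
     [0.0, 0.0, 1.0]]   # weights for class 1
b = [0.0, 0.0]

probs = classify(cls_hidden, W, b)
loss_if_class_1 = cross_entropy(probs, label=1)
loss_if_class_0 = cross_entropy(probs, label=0)
```

The logits here are 0.5 and 2.0, so the head favors class 1, and the loss is lower when class 1 is the true label.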

4.2 Fine-Tuning for Named Entity Recognition (NER)

- Mechanism: NER is a token classification task. Instead of using the [CLS] token, the final hidden state of every token in the sequence is used.
- Architecture: A classification layer is added on top of the output of every token to predict its entity label (e.g., using IOB format: B-PER, I-PER, O).
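
After the model predicts one IOB tag per token, the tags are usually grouped back into entity spans. A minimal sketch of that post-processing step follows; the function name, the example sentence, and the `LOC` label are assumptions, and an `I-` tag that does not continue an open entity of the same type is simply treated as outside.

```python
def iob_to_spans(tokens, tags):
    """Group per-token IOB tags into (label, text) entity spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag continues the open entity of the same type
            current[1].append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

tokens = ["Barack", "Obama", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
entities = iob_to_spans(tokens, tags)
print(entities)  # [('PER', 'Barack Obama'), ('LOC', 'Paris')]
```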

4.3 Fine-Tuning for Question Answering (QA)

- Mechanism: Extractive QA (as in the SQuAD dataset) requires the model to identify the span of text within a context that answers a question.
- Architecture: The input is structured as [CLS] Question [SEP] Context [SEP]. The model outputs two probability distributions over the context tokens: one for the probability of being the start token of the answer span, and one for being the end token.
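
Given the start and end scores, a common decoding step is to pick the pair with the highest combined score subject to the end not preceding the start. A minimal sketch, with made-up logits and an assumed `max_answer_len` cap:

```python
def best_answer_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) index pair with the highest summed logit.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best_span, best_score = None, float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_span, best_score = (s, e), score
    return best_span

# Toy example (assumed logits) for the question "What is in Paris?"
context = ["The", "Eiffel", "Tower", "is", "in", "Paris"]
start_logits = [4.0, 1.0, 0.5, 0.1, 0.1, 0.2]
end_logits = [0.2, 1.5, 3.5, 0.1, 0.1, 2.0]

start, end = best_answer_span(start_logits, end_logits)
answer = " ".join(context[start:end + 1])
print(answer)  # The Eiffel Tower
```

Searching over pairs matters because taking the two argmax positions independently can produce an end that comes before the start.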

5. HuggingFace Transformers Library

The HuggingFace Transformers library is one of the most widely used toolkits for working with Transformer models. It provides APIs and tools to download, train, and use state-of-the-art pretrained models.

5.1 Core Components

- Tokenizer (AutoTokenizer): Handles the preprocessing of text into numerical token IDs, attention masks, and token type IDs (segment IDs). It ensures the text is formatted exactly as the specific pretrained model expects.
- Model (AutoModel, AutoModelForSequenceClassification, etc.): The neural network architecture with loaded pretrained weights. HuggingFace provides task-specific heads (like ForTokenClassification or ForQuestionAnswering) that automatically add the correct output layers on top of the base transformer.
- Trainer API: A high-level, feature-complete API for PyTorch that abstracts away the complex training loop, handling distributed training, evaluation, logging, and model saving.

5.2 Example: Fine-Tuning for Text Classification (Code Snippet)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# 1. Load the IMDB movie-review dataset (binary sentiment labels)
dataset = load_dataset("imdb")

# 2. Load the tokenizer that matches the pretrained checkpoint and tokenize the data
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    # Pad or truncate every review to the model's maximum sequence length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 3. Load the pretrained encoder with a newly initialized 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 4. Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    # Evaluate at the end of every epoch.
    # Note: newer transformers releases rename this argument to `eval_strategy`.
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

# 5. Initialize the Trainer and train on small random subsets for speed
train_subset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
eval_subset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset,
    eval_dataset=eval_subset,
)

trainer.train()
```
