Unit5 - Subjective Questions
CSE472 • Practice Questions with Detailed Answers
Explain the fundamental shift introduced by the Transformer architecture compared to traditional Recurrent Neural Networks (RNNs) for Sequence-to-Sequence tasks.
The Transformer architecture, introduced by Vaswani et al. in 'Attention Is All You Need', brought a fundamental shift in how sequence data is processed in Deep Learning:
- Elimination of Recurrence: Traditional RNNs and LSTMs process sequences sequentially (token by token), which inherently prevents parallelization across sequence elements. Transformers eliminate recurrence entirely, relying instead on attention mechanisms.
- Parallelization: Because there is no sequential dependency in computing representations for the current token based on the previous one, Transformers allow for massive parallelization during training, drastically reducing training time.
- Global Dependencies: RNNs struggle with long-range dependencies due to vanishing gradients. The self-attention mechanism in Transformers computes dependencies between all tokens in a sequence in a single step, giving the model direct access to the entire context window with a path length of $O(1)$.
- Positional Encoding: Since recurrence is removed, the model has no inherent notion of sequence order. Transformers solve this by injecting Positional Encodings into the input embeddings to provide structural context.
Derive and explain the mathematical formulation of Scaled Dot-Product Self-Attention.
Self-attention allows a model to weigh the importance of different words in a sequence when encoding a particular word. The formula for Scaled Dot-Product Attention is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Components:
- $Q$ (Queries): A matrix representing the current tokens focusing on other tokens.
- $K$ (Keys): A matrix representing the tokens being focused on.
- $V$ (Values): A matrix representing the actual content/features of the tokens.
- $d_k$: The dimension of the key vectors.
Step-by-step process:
- Dot Product: Compute $QK^T$ to get the raw attention scores. This measures the similarity between queries and keys.
- Scaling: Divide by $\sqrt{d_k}$. As $d_k$ grows large, dot products grow large in magnitude, pushing the softmax function into regions with extremely small gradients. Scaling counteracts this.
- Softmax: Apply the softmax function along the last dimension to normalize the scores into a probability distribution (summing to 1).
- Weighted Sum: Multiply the normalized scores by the Value matrix to obtain the final context-aware embeddings.
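The four steps above can be sketched in plain Python. This is a minimal illustration rather than an optimized implementation: the function names and the toy matrices are assumptions made for the example, and real implementations use batched tensor operations.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q and K are lists of d_k-dimensional rows; V has one row per key."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Steps 1 and 2: dot product with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: normalize the scores into a probability distribution.
        w = softmax(scores)
        weights.append(w)
        # Step 4: weighted sum of the value rows.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Toy example: two tokens with 2-dimensional queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of attn sums to 1, and each query places most of its weight on the key it is most similar to.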
What is Multi-Head Attention, and why is it preferred over a single attention function?
Multi-Head Attention is an extension of the self-attention mechanism where the attention process is run multiple times (heads) in parallel.
How it works:
Instead of performing a single attention function with $d_{model}$-dimensional keys, values, and queries, the model linearly projects the queries, keys, and values $h$ times with different, learned linear projections to $d_k$, $d_k$, and $d_v$ dimensions, respectively. Attention is applied to each of these projected versions in parallel, yielding $h$ different $d_v$-dimensional output values. These are concatenated and once again projected to form the final output.
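In the notation of the original paper, this procedure can be written compactly. The per-head projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ and the output projection $W^O$ are the learned linear projections mentioned above:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$
$$\text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$
In the base model of Vaswani et al., $h = 8$ and $d_{model} = 512$, so each head works in $d_k = d_v = d_{model}/h = 64$ dimensions, which keeps the total cost close to that of single-head attention at full dimensionality.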
Why it is preferred:
- Multiple Representation Subspaces: It allows the model to jointly attend to information from different representation subspaces at different positions. For example, one head might learn to attend to grammatical structure (verbs to subjects), while another attends to semantic meaning.
- Prevents Averaging Out: With a single attention head, averaging might obscure subtle relationships. Multi-head attention preserves multiple distinct contexts simultaneously.
Describe the mathematical concept and necessity of Positional Encoding in the Transformer architecture.
Necessity:
Because the Transformer contains no recurrence and no convolution, it processes all tokens simultaneously. To make the model sensitive to the sequence order (which is crucial for language), we must explicitly inject information about the relative or absolute position of the tokens in the sequence. This is done via Positional Encoding, which is added to the input embeddings at the bottom of the encoder and decoder stacks.
Mathematical Concept:
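The original Transformer uses sine and cosine functions of different frequencies:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Where:
- $pos$ is the position of the token in the sequence.
- $i$ is the dimension index.
- $d_{model}$ is the embedding dimension.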
Why this formulation?
These functions allow the model to easily learn to attend by relative positions, because for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. Furthermore, the varying frequencies ensure that each dimension corresponds to a distinct sinusoidal wave, creating a unique positional signature.
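The sinusoidal encoding can be computed in a few lines. The sketch below assumes the base of 10000 from the original paper; the function name and the table sizes are chosen only for illustration.

```python
import math


def positional_encoding(max_len, d_model, base=10000.0):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # k runs over the even dimension indices, i.e. k plays the role of 2i.
        for k in range(0, d_model, 2):
            angle = pos / (base ** (k / d_model))
            pe[pos][k] = math.sin(angle)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)
    return pe


pe = positional_encoding(max_len=50, d_model=8)
```

Position 0 encodes to alternating 0 and 1 (sin 0 and cos 0), and each of the 50 positions receives a distinct vector, which is the unique positional signature described above.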
Explain the structure of a single Transformer Encoder block in detail.
A Transformer Encoder is composed of a stack of identical layers (typically 6 to 24 blocks). A single Encoder block consists of two main sub-layers:
- Multi-Head Self-Attention Mechanism:
- Takes the output of the previous layer (or the input embeddings + positional encoding for the first layer).
- Computes self-attention to allow tokens to interact and gather context from the entire sequence.
- Position-wise Feed-Forward Network (FFN):
- Consists of two linear transformations with a ReLU activation in between: $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$.
- It is applied to each position separately and identically.
Residual Connections & Normalization:
Around each of the two sub-layers, there is a residual (skip) connection followed by Layer Normalization.
- Output of a sub-layer = $\text{LayerNorm}(x + \text{Sublayer}(x))$
- The residual connections help mitigate the vanishing gradient problem in deep networks, while Layer Normalization stabilizes the hidden state dynamics.
How does the Transformer Decoder block differ from the Encoder block, and what is the role of Masked Self-Attention?
The Transformer Decoder block shares a similar structure with the Encoder but introduces crucial differences to handle text generation autoregressively.
Differences:
- Masked Multi-Head Self-Attention: This is the first sub-layer in the decoder. It prevents positions from attending to subsequent positions.
- Role: During training, the decoder takes the target sequence as input. To prevent "cheating" (looking at future tokens to predict the next token), a masking matrix (upper triangular filled with $-\infty$) is applied before the softmax step. This ensures that the prediction for position $i$ can depend only on the known outputs at positions less than $i$.
- Encoder-Decoder Cross-Attention: This is the second sub-layer.
- Role: It allows the decoder to focus on relevant parts of the input sequence. In this layer, the Queries ($Q$) come from the previous decoder layer, while the Keys ($K$) and Values ($V$) come from the output of the final Encoder layer.
- Feed-Forward Network: Similar to the encoder, followed by residual connections and Layer Normalization after every sub-layer.
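The masking step in the first sub-layer can be illustrated with a small sketch. The function names and scores are hypothetical. The mask here lets position i attend to positions up to and including i; this matches the description above because the decoder input is the target sequence shifted right by one, so the input at position i is the output from position i - 1.

```python
import math


def causal_mask(n):
    """n x n additive mask: 0 where attention is allowed, -inf above the diagonal."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Add the mask to the raw scores, then apply softmax to each row."""
    result = []
    for score_row, mask_row in zip(scores, mask):
        z = [s + m for s, m in zip(score_row, mask_row)]
        top = max(z)  # finite, because the diagonal is never masked
        exps = [math.exp(v - top) for v in z]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


scores = [[0.5, 1.0, 2.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]
weights = masked_softmax(scores, causal_mask(3))
```

After masking, the first row puts all of its weight on the first token, and no row assigns any weight to a later position.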
Compare and contrast Byte-Pair Encoding (BPE) and WordPiece tokenization methods.
Both BPE and WordPiece are subword tokenization algorithms used to handle the out-of-vocabulary (OOV) problem by breaking unknown words into known subword units.
Byte-Pair Encoding (BPE):
- Algorithm: BPE starts with a vocabulary of individual characters. It then iteratively counts the frequency of adjacent symbol pairs in the training corpus and merges the most frequent pair into a new single symbol.
- Criterion: Merges are based strictly on frequency of co-occurrence.
- Usage: Commonly used in GPT, GPT-2, and RoBERTa.
WordPiece:
- Algorithm: WordPiece also starts with characters and iteratively builds a vocabulary.
- Criterion: Instead of merging based purely on frequency, WordPiece evaluates the impact of the merge on the likelihood of the training data. It merges the pair that maximizes the language model probability of the training corpus (i.e., it maximizes the score: $\frac{\text{count}(AB)}{\text{count}(A) \times \text{count}(B)}$).
- Usage: Popularized by BERT.
Comparison: While BPE greedily merges the most frequent pairs, WordPiece is more statistically driven, choosing merges that provide the most information gain to the tokenized corpus.
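The BPE merge loop described above can be sketched on a toy corpus. The word frequencies and function names are made up for illustration, and ties between equally frequent pairs are broken by insertion order here, which real tokenizers may handle differently.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word is a tuple of characters mapped to its frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):
    counts = pair_counts(words)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)
```

On this corpus the first two merges build the frequent suffix est, so newest ends up segmented as n, e, w, est. A WordPiece-style variant would keep the same loop but rank pairs by the likelihood-based score instead of the raw count.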
Explain the concept of Transfer Learning in NLP and how Pretrained Language Models utilize this paradigm.
Transfer Learning in NLP involves training a model on a large, general-purpose dataset and then transferring that learned knowledge to a specific downstream task.
The Paradigm:
- Pretraining (Self-Supervised Learning):
- A massive neural network (like a Transformer) is trained on large amounts of unlabelled text (e.g., Wikipedia, BookCorpus, internet scrape).
- The model learns syntax, semantics, facts, and reasoning abilities through objective functions like Masked Language Modeling (predicting missing words) or Causal Language Modeling (predicting the next word).
- This step is highly computationally expensive.
- Fine-Tuning (Supervised Learning):
- The pretrained model is adapted to a specific, narrower task (e.g., sentiment analysis, Named Entity Recognition) using a much smaller, labeled dataset.
- A small task-specific classification head is usually added to the model.
- The weights of the entire model (or just the top layers) are updated with a small learning rate.
Benefits: Transfer learning reduces the need for massive labeled datasets for every specific NLP task, saves computational resources on downstream tasks, and achieves state-of-the-art performance by leveraging deep contextual representations.
Describe the architecture of BERT and elaborate on its Masked Language Modeling (MLM) pretraining objective.
Architecture:
BERT (Bidirectional Encoder Representations from Transformers) is a multi-layer bidirectional Transformer Encoder. Unlike standard directional language models, BERT reads the entire sequence of words at once (both left-to-right and right-to-left), allowing the model to learn deep bidirectional context.
Masked Language Modeling (MLM):
Traditional language models predict the next word, which restricts them to unidirectional context. To train bidirectionally without allowing the model to trivially "see" the target word, BERT uses MLM.
- Process: 15% of the tokens in the input sequence are randomly chosen to be modified before being fed into the model.
- Masking Strategy: Of those 15% chosen tokens:
- 80% are replaced with a special [MASK] token (e.g., "my dog is hairy" -> "my dog is [MASK]").
- 10% are replaced with a random word (forces the model to rely on context, as any word might be wrong).
- 10% are kept unchanged (biases the model towards the actual observed word).
- Objective: The model's final hidden states for the masked tokens are passed through a classification layer to predict the original vocabulary ID using Cross-Entropy Loss.
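The 80/10/10 corruption scheme can be sketched as follows. This is a simplified illustration under stated assumptions: each token is selected independently with probability 0.15 (reference implementations typically select a fixed number of positions per sequence and skip special tokens), and the vocabulary and function name are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Apply BERT-style 80/10/10 corruption; return (corrupted tokens, labels)."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None means "no loss at this position"
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # random replacement
        # otherwise the token is kept unchanged
    return corrupted, labels


vocab = ["my", "dog", "is", "hairy", "cat", "runs"]
tokens = ["my", "dog", "is", "hairy"] * 250
corrupted, labels = mask_tokens(tokens, vocab)
selected = sum(label is not None for label in labels)
```

Only the selected positions carry a label, so the cross-entropy loss is computed on roughly 15% of the tokens, while unselected positions pass through unchanged.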
What is Next Sentence Prediction (NSP) in BERT, and why was it introduced?
Next Sentence Prediction (NSP) is the second pretraining objective used in the original BERT model, alongside Masked Language Modeling (MLM).
Concept:
Many downstream NLP tasks, such as Question Answering (QA) and Natural Language Inference (NLI), require an understanding of the relationship between two sentences. MLM alone does not directly capture inter-sentence relationships. NSP was introduced to teach the model how sentences relate.
Mechanism:
- During pretraining, the model receives pairs of sentences (Sentence A and Sentence B).
- For 50% of the pairs, Sentence B is the actual next sentence that follows Sentence A in the original document (Label: IsNext).
- For the other 50%, Sentence B is a completely random sentence chosen from the corpus (Label: NotNext).
- The sequences are formatted as: [CLS] Sentence A [SEP] Sentence B [SEP].
- The final hidden state of the [CLS] token is passed through a binary classifier to predict whether B follows A.
(Note: Later models like RoBERTa found that NSP might not be strictly necessary if MLM is trained on contiguous text spans, but it was fundamental to the original BERT design.)
Discuss the architecture of GPT and its pretraining objective, Causal Language Modeling.
GPT Architecture:
GPT (Generative Pre-trained Transformer) relies on a Decoder-only Transformer architecture. It discards the encoder block entirely. Because there is no encoder, the encoder-decoder cross-attention layer is removed. GPT consists of a stack of masked multi-head self-attention layers followed by feed-forward neural networks.
Causal Language Modeling (CLM):
- Objective: GPT is trained autoregressively to predict the next token in a sequence given the previous tokens. Mathematically, it maximizes the log-likelihood:
$$L(\theta) = \sum_{t} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$
- Masking: To prevent information flow from future tokens (which would make the prediction trivial), GPT uses causal masking (an upper triangular mask) in its self-attention layers. Token $t$ can only attend to tokens $1$ through $t$.
- Application: Because it learns to predict the next word, GPT is naturally suited for text generation tasks (zero-shot and few-shot generation), unlike BERT which is suited for understanding and classification.
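The training objective can be made concrete with a tiny calculation. The sketch below assumes the model has already produced a probability distribution over a 3-word vocabulary at each step; the numbers and the function name are invented for illustration. Minimizing this average negative log-likelihood is equivalent to maximizing the likelihood objective above.

```python
import math


def causal_lm_loss(step_probs, token_ids):
    """Average negative log-likelihood of each token given its prefix.

    step_probs[t] is the (hypothetical) model distribution over the vocabulary
    for the token that follows position t.
    """
    nll = 0.0
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]  # the next token is the prediction target
        nll -= math.log(step_probs[t][target])
    return nll / (len(token_ids) - 1)


token_ids = [0, 2, 1]
step_probs = [
    [0.1, 0.2, 0.7],    # after token 0, the model gives token 2 probability 0.7
    [0.25, 0.5, 0.25],  # after tokens 0 and 2, token 1 gets probability 0.5
]
loss = causal_lm_loss(step_probs, token_ids)
```

The loss falls as the model assigns more probability to the tokens that actually come next, and reaches zero when every next token is predicted with probability 1.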
Explain the text-to-text framework introduced by the T5 model.
T5 (Text-to-Text Transfer Transformer)
T5 introduced a unified framework that reframes all NLP tasks into a standard "text-to-text" format.
The Framework:
- In models like BERT, different tasks require different architectural modifications (e.g., adding a classification head for sentiment analysis, or token-level heads for NER).
- T5 treats every text processing problem as a sequence-to-sequence task: the model takes text as input and generates text as output.
- Task Prefixes: To distinguish between tasks, a task-specific prefix is added to the input text.
- Translation: Input: "translate English to German: That is good." Output: "Das ist gut."
- Summarization: Input: "summarize: [Text]" Output: "[Summary]"
- Classification: Input: "sentiment: I love this movie!" Output: "positive"
Architecture: T5 utilizes the standard Encoder-Decoder Transformer architecture, which naturally fits this sequence-to-sequence paradigm.
Outline the process of fine-tuning a pretrained BERT model for a Text Classification task.
Fine-tuning BERT for text classification (e.g., sentiment analysis, spam detection) involves minimal architectural changes to the pretrained model.
Process:
- Input Formatting: The input text is tokenized using WordPiece. A special classification token [CLS] is prepended to the start of the sequence, and a [SEP] token is appended at the end.
- Forward Pass: The sequence is passed through the BERT encoder layers. Each token gets a corresponding contextualized hidden state vector.
- Extracting the Representation: For classification, the final hidden state corresponding to the [CLS] token, denoted as $C \in \mathbb{R}^{H}$ (where $H$ is the hidden size, e.g., 768), is used as the aggregate sequence representation.
- Classification Head: A new, randomly initialized linear layer (feed-forward layer) is added on top of the [CLS] token's output:
$$P = \text{softmax}(C W^T)$$
where $W \in \mathbb{R}^{K \times H}$ for $K$ classes.
- Training: The entire model (BERT layers + new linear head) is fine-tuned end-to-end using a standard classification loss (like Cross-Entropy) for a few epochs with a small learning rate (e.g., 2e-5).
How is fine-tuning adapted for Named Entity Recognition (NER) using a Transformer model?
Named Entity Recognition (NER) is a token-level classification task where every word/token in a sentence must be assigned a label (e.g., Person, Organization, Location, or Outside).
Fine-tuning Adaptation:
- Tokenization: The input sequence is tokenized. Because subword tokenizers (like WordPiece or BPE) break words into pieces, the word-level labels must be aligned to the subword tokens. Typically, the first subword of each word gets that word's label, and subsequent subwords get an 'X' label or are ignored when computing the loss.
- Model Architecture: Unlike sequence classification which uses only the [CLS] token, NER utilizes the final hidden state of every token in the sequence.
- Classification Head: A shared dense linear layer is applied to the final hidden state $T_i$ of each token:
$$P_i = \text{softmax}(T_i W^T)$$
where $W \in \mathbb{R}^{K \times H}$ and $K$ is the number of NER tags.
- Loss Function: The loss is calculated as the average Cross-Entropy loss over all valid tokens in the sequence, and the model is fine-tuned end-to-end.
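The label-alignment step is where most NER fine-tuning bugs appear, so a small sketch may help. Everything here is illustrative: toy_tokenize is a hypothetical stand-in for a real WordPiece tokenizer, the tag ids are made up, and the ignore value of -100 is an assumption chosen because it is the default ignore_index of PyTorch's CrossEntropyLoss.

```python
IGNORE = -100  # assumed "ignore" label id, excluded from the loss


def toy_tokenize(word):
    """Hypothetical stand-in for WordPiece: split words longer than 4 characters."""
    if len(word) <= 4:
        return [word]
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]


def align_labels(words, word_labels, tokenize):
    """Give each word's label to its first subword and IGNORE to the rest."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        labels.extend([label] + [IGNORE] * (len(pieces) - 1))
    return tokens, labels


words = ["Angela", "Merkel", "visited", "Paris"]
word_labels = [1, 1, 0, 2]  # hypothetical tag ids: 0 = O, 1 = PER, 2 = LOC
tokens, labels = align_labels(words, word_labels, toy_tokenize)
```

The resulting labels list has one entry per subword token, so it lines up with the per-token outputs of the classification head, and the ignored positions drop out of the averaged loss.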
Detail the architecture modifications and loss formulation for fine-tuning BERT on Extractive Question Answering tasks (e.g., SQuAD).
Extractive Question Answering involves finding the span of text within a given paragraph (context) that answers a given question.
Input Formulation:
The input is packed as a single sequence: [CLS] Question Tokens [SEP] Context Tokens [SEP].
Architecture Modifications:
Instead of outputting a class label, the model must predict two pointers: the Start index and the End index of the answer span within the context.
To do this, two trainable vectors are introduced: $S$ (Start vector) and $E$ (End vector), both of dimension $H$.
Probability Calculation:
For every token $i$ in the sequence with final hidden state $T_i$:
- Start Probability: The dot product between $S$ and $T_i$ is computed, followed by a softmax over all context tokens: $P_i^{start} = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$
- End Probability: Similarly, the dot product between $E$ and $T_i$ is computed: $P_i^{end} = \frac{e^{E \cdot T_i}}{\sum_j e^{E \cdot T_j}}$
Loss Formulation:
The objective is to maximize the log-likelihood of the correct start and end positions. The training loss is the sum of the negative log-likelihoods of the true start and end indices, written here as $y_s$ and $y_e$:
$$\mathcal{L} = -\log P_{y_s}^{start} - \log P_{y_e}^{end}$$
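A minimal sketch of this span loss, using 2-dimensional hidden states so the arithmetic is easy to follow. The vectors and function names are hypothetical; in a real model the hidden states come from BERT's final layer and have dimension 768 or more.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_softmax(scores):
    """Log of the softmax, computed stably with the log-sum-exp trick."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_z for s in scores]


def span_loss(hidden, start_vec, end_vec, start_idx, end_idx):
    """Sum of the negative log-likelihoods of the true start and end positions."""
    start_logp = log_softmax([dot(start_vec, h) for h in hidden])
    end_logp = log_softmax([dot(end_vec, h) for h in hidden])
    return -(start_logp[start_idx] + end_logp[end_idx])


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one hidden state per token
S = [2.0, 0.0]  # start vector
E = [0.0, 2.0]  # end vector
loss_good = span_loss(hidden, S, E, start_idx=0, end_idx=1)
loss_bad = span_loss(hidden, S, E, start_idx=1, end_idx=0)
```

With these numbers the start vector scores token 0 highly and the end vector scores token 1 highly, so the span (0, 1) receives a lower loss than the span (1, 0).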
What is the HuggingFace Transformers library, and what core abstractions does it provide for NLP practitioners?
The HuggingFace Transformers library is an open-source framework that provides thousands of pretrained models for tasks on text, vision, and audio. It has become the de facto standard for NLP.
Core Abstractions:
- AutoModel classes: These provide a generic interface to load a specific architecture (like BERT, GPT) along with its pretrained weights. Variations like AutoModelForSequenceClassification or AutoModelForTokenClassification automatically append the correct output head for specific downstream tasks.
- AutoTokenizer: Tokenization logic varies heavily between models (BPE, WordPiece, SentencePiece). This abstraction automatically downloads and instantiates the correct tokenizer associated with a specific pretrained model name, handling padding, truncation, and special tokens seamlessly.
- Pipeline: A high-level API that abstracts away complex code. It connects a model with its tokenizer and post-processing logic to perform inference in just a few lines of code (e.g., pipeline('sentiment-analysis')).
- Trainer API: A feature-complete training and evaluation loop for PyTorch, optimized for fine-tuning transformer models and handling distributed training, logging, and evaluation.
Distinguish between Encoder-only, Decoder-only, and Encoder-Decoder architectures in the context of pretrained language models, providing an example for each.
Encoder-Only Models:
- Mechanism: Uses only the encoder part of the Transformer. They utilize bidirectional self-attention, meaning every token can attend to all other tokens.
- Strengths: Excellent for Natural Language Understanding (NLU) tasks that require full sequence context, such as text classification, NER, and extractive QA.
- Example: BERT, RoBERTa, ALBERT.
Decoder-Only Models:
- Mechanism: Uses only the decoder part. They utilize masked (causal) self-attention, meaning a token can only attend to previous tokens.
- Strengths: Designed for Natural Language Generation (NLG) via autoregressive next-token prediction. Strong at zero-shot and few-shot generation.
- Example: GPT, GPT-2, GPT-3, LLaMA.
Encoder-Decoder Models:
- Mechanism: Uses both components. The encoder processes the input bidirectionally, and the decoder generates output autoregressively, cross-attending to the encoder's states.
- Strengths: Best for Sequence-to-Sequence tasks where the input and output are both complex text sequences, such as translation, summarization, and paraphrasing.
- Example: T5, BART.
Explain the computational bottleneck of the self-attention mechanism regarding sequence length.
The main computational bottleneck in standard Transformers is the self-attention mechanism, which scales quadratically with respect to the input sequence length.
The Math:
In self-attention, we compute the dot product of every token's Query with every token's Key: $QK^T$.
If the sequence length is $N$ and the hidden dimension is $d$:
- $Q$ is an $N \times d$ matrix.
- $K^T$ is a $d \times N$ matrix.
- The multiplication results in an $N \times N$ attention score matrix.
- The time complexity of this matrix multiplication is $O(N^2 \cdot d)$.
- The spatial complexity (memory footprint) to store the attention map is $O(N^2)$.
Impact:
While linear scaling in $d$ is manageable, the $O(N^2)$ scaling in sequence length means that doubling the sequence length quadruples the compute and memory requirements. This makes it prohibitively expensive to process very long documents (e.g., books, long codebases) using standard self-attention, leading to research into sparse attention and linear transformers (e.g., Longformer, Linformer).
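The quadratic growth is easy to check with a back-of-the-envelope calculation. The sketch below counts only the score-matrix step for a single head and assumes 4 bytes per stored value (32-bit floats); the function name and sizes are illustrative.

```python
def attention_cost(n, d, bytes_per_value=4):
    """Rough cost of computing and storing the score matrix for one head."""
    return {
        "mult_adds": n * n * d,                # one d-length dot product per (query, key) pair
        "score_bytes": n * n * bytes_per_value,  # one stored score per (query, key) pair
    }


short = attention_cost(n=1024, d=64)
long = attention_cost(n=2048, d=64)
ratio = long["mult_adds"] / short["mult_adds"]
```

Doubling the sequence length from 1024 to 2048 gives a ratio of 4 for both compute and memory, while the score matrix for a single head at 1024 tokens already occupies 4 MiB.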
What role do Residual Connections and Layer Normalization play in the Transformer block?
Both Residual Connections and Layer Normalization are critical for training deep Transformer networks effectively.
Residual (Skip) Connections:
- Definition: The input to a sub-layer is added directly to its output: $x + \text{Sublayer}(x)$.
- Role: They create shortcuts for the gradients during backpropagation. This mitigates the vanishing gradient problem, allowing the training of very deep networks. They also ensure that if a layer is unnecessary, the model can easily push the weights of the sublayer to zero, turning it into an identity mapping.
Layer Normalization:
- Definition: Normalizes the inputs across the features (hidden dimensions) for each token independently, ensuring mean 0 and variance 1, followed by learned scale and shift parameters.
- Role: It stabilizes the training process by reducing internal covariate shift, i.e., changes in the distribution of each layer's inputs as training progresses. In Transformers, applying Layer Normalization after the residual connection (Post-LN) or before the sub-layer (Pre-LN) ensures that the magnitudes of activations and gradients remain well-conditioned, which is vital for the stable optimization of the multi-head attention and feed-forward layers.
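The two placements of Layer Normalization can be compared in a short sketch. The names are illustrative, the learned scale and shift default to 1 and 0, and the double function is a hypothetical stand-in for an attention or feed-forward sub-layer.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one token's features to mean 0 and variance 1, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    gamma = gamma or [1.0] * n
    beta = beta or [0.0] * n
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]


def post_ln_sublayer(x, sublayer):
    """Post-LN (original Transformer): LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_sublayer(x, sublayer):
    """Pre-LN variant: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def double(v):
    """Hypothetical stand-in for an attention or feed-forward sub-layer."""
    return [2.0 * t for t in v]


token = [1.0, 2.0, 3.0, 4.0]
y = post_ln_sublayer(token, double)
```

In the Pre-LN form the residual path is left untouched, so a sub-layer that outputs zeros reduces the block to an identity mapping, which illustrates the shortcut role of the skip connection.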
Describe the position-wise Feed-Forward Network within a Transformer block and its significance.
In addition to attention sub-layers, each Encoder and Decoder block in the Transformer contains a fully connected Feed-Forward Network (FFN).
Structure:
The FFN consists of two linear transformations with a ReLU activation in between:
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
Typically, the inner hidden layer has a larger dimensionality than the input/output. For instance, in the original base Transformer, the input dimension $d_{model} = 512$, and the inner layer is expanded to $d_{ff} = 2048$.
Position-wise Application:
The key characteristic is that this exact same FFN is applied independently and identically to each position (token) in the sequence. There is no mixing of information across tokens within this sub-layer (that job is exclusively handled by the self-attention layer).
Significance:
While the self-attention mechanism facilitates interactions and context sharing between tokens, it acts purely as a weighted linear combination of values. The position-wise FFN introduces non-linearity and provides the necessary capacity/parameters to process and transform the aggregated contextual features of each individual token into a richer representation.
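The position-wise property can be demonstrated with a tiny FFN. The weights below are hypothetical, with a model dimension of 2 and an inner dimension of 3 chosen so the arithmetic can be checked by hand.

```python
def ffn(x, W1, b1, W2, b2):
    """Two linear maps with a ReLU in between, applied to a single token vector x."""
    hidden = [
        max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
        for j in range(len(b1))
    ]
    return [
        sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
        for k in range(len(b2))
    ]


# Hypothetical tiny weights: model dimension 2, inner dimension 3.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]

sequence = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]
# The same function is applied to every position independently.
outputs = [ffn(tok, W1, b1, W2, b2) for tok in sequence]
```

The first and third tokens are identical, so they produce identical outputs regardless of their position or neighbours: the FFN never mixes information across tokens.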