Unit 6 - Notes

INT344

Unit 6: Building Models / Case Studies

1. Question Answering: Transfer Learning with State-Of-The-Art Models

Question Answering (QA) is a sub-field of Information Retrieval and NLP concerned with building systems that automatically answer questions posed by humans in natural language.

Transfer Learning in NLP

Before the transformer era, NLP models were trained from scratch for specific tasks. Transfer learning revolutionized this by allowing models to be pre-trained on massive datasets (like Wikipedia) to learn the structure of language, and then fine-tuned on smaller, task-specific datasets (like SQuAD - Stanford Question Answering Dataset).

Key Advantages:

  • Reduced Training Time: Fine-tuning takes significantly less time than training from scratch.
  • Performance on Low-Resource Data: Models can perform well even with limited labeled QA data because they already "understand" language.
  • State-Of-The-Art (SOTA) Evolution:
    • ELMo (2018): Contextualized word embeddings.
    • BERT (2018): Bidirectional transformer; redefined SOTA for QA.
    • RoBERTa/ALBERT: Optimized versions of BERT.
    • T5/GPT-3: Generative models that formulate answers rather than just extracting them.

2. BERT and T5 for Question Answering

BERT (Bidirectional Encoder Representations from Transformers)

BERT is an Encoder-only transformer architecture. It is designed to pre-train deep bidirectional representations from unlabeled text.

  • Mechanism for QA (Extractive QA):
    • BERT treats QA as a span selection problem.
    • Input: [CLS] Question [SEP] Passage [SEP]
    • Output: Two score vectors over the input tokens: one giving the probability that each token is the Start Position of the answer, and one that it is the End Position (see the sketch below).
    • Pros: Highly accurate for factoid questions where the answer exists verbatim in the text.
    • Cons: Cannot generate answers that are not explicitly present in the context.
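
A minimal sketch of this span-selection mechanism with Hugging Face Transformers (the checkpoint name is one public BERT model fine-tuned on SQuAD; any similar checkpoint works):

PYTHON
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# A public BERT checkpoint fine-tuned on SQuAD.
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "What is the capital of France?"
passage = "France is a country in Europe. Its capital is Paris."

# The tokenizer builds the [CLS] question [SEP] passage [SEP] input.
inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One score per token for the start position and one for the end position.
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits)
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(answer)  # expected: "paris"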

T5 (Text-to-Text Transfer Transformer)

T5 allows the use of the same model, loss function, and hyperparameters across all NLP tasks by treating every problem as a text-to-text problem.

  • Mechanism for QA (Generative QA):
    • T5 uses an Encoder-Decoder architecture.
    • Input: question: What is the capital of France? context: France is a country in Europe...
    • Output: Paris, generated token by token (see the sketch below).
    • Pros: Can generate abstractive answers; handles boolean (True/False) questions easily; unified framework.
    • Cons: Slower inference time due to auto-regressive decoding; risk of hallucination (generating plausible but incorrect facts).
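
A sketch of the generative formulation using the public t5-small checkpoint (larger T5 variants answer better; the "question: ... context: ..." prefix matches T5's text-to-text convention):

PYTHON
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Smallest public T5 checkpoint; used here only to illustrate the interface.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task, including QA, is expressed as plain input text.
prompt = ("question: What is the capital of France? "
          "context: France is a country in Europe. Its capital is Paris.")
inputs = tokenizer(prompt, return_tensors="pt")

# The decoder generates the answer auto-regressively, token by token.
output_ids = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # expected: "Paris"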

3. Model for Answering Questions (Architecture Design)

Building a complete QA system usually involves more than just a language model. The standard architecture for Open-Domain QA is the Retriever-Reader pipeline.

The Retriever-Reader Pipeline

  1. The Retriever (Document Selection):

    • Purpose: Scans a massive knowledge base (e.g., all of Wikipedia) to find relevant documents.
    • Traditional Method: TF-IDF / BM25 (Keyword matching).
    • Modern Method: Dense Passage Retrieval (DPR). Uses a dual-encoder architecture to embed questions and documents into the same vector space. Relevance is scored by the dot product between the question and passage embeddings (equivalent to cosine similarity when the vectors are normalized); see the retriever sketch after this list.
  2. The Reader (Answer Extraction/Generation):

    • Purpose: Processes the documents found by the Retriever to find the specific answer.
    • Model: BERT (for extractive) or T5/BART (for generative).
    • Process: The Reader takes the top documents + the Question and outputs the final answer.
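
The retriever stage can be sketched with the public DPR checkpoints in Hugging Face Transformers (the checkpoint names below are the Natural Questions DPR models; in production the passage embeddings would be pre-computed and stored in a vector index such as FAISS):

PYTHON
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

# Dual-encoder: separate encoders for questions and passages.
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = [
    "Paris is the capital and most populous city of France.",
    "The Amazon rainforest covers much of the Amazon basin.",
]

# Questions and passages are embedded into the same vector space.
with torch.no_grad():
    q_emb = q_enc(**q_tok("What is the capital of France?", return_tensors="pt")).pooler_output
    c_emb = c_enc(**c_tok(passages, return_tensors="pt", padding=True)).pooler_output

# Relevance = dot product; the top-scoring passage is handed to the Reader.
scores = (q_emb @ c_emb.T).squeeze(0)
print(passages[scores.argmax().item()])  # expected: the Paris passage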

Conceptual Code Snippet (Hugging Face Transformers)

PYTHON
from transformers import pipeline

# One public RoBERTa checkpoint fine-tuned on SQuAD 2.0.
model_name = "deepset/roberta-base-squad2"

# The pipeline loads both the model and its tokenizer from the checkpoint name.
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'Model conversion allows using models trained in PyTorch with TensorFlow.'
}
res = nlp(QA_input)
print(res)
# Example output (illustrative):
# {'score': 0.98, 'start': 0, 'end': 15, 'answer': 'Model conversion'}


4. Chatbots: Unique Challenges

While QA systems answer single queries, chatbots must maintain a continuous dialogue. This introduces specific challenges that standard transformer models struggle with.

Key Challenges

  1. Long-Term Context Memory:
    • Standard Transformers have a fixed context window (usually 512 or 1024 tokens).
    • As the conversation grows, early parts of the chat are truncated, causing the bot to "forget" the user's name or initial intent.
  2. Consistency and Persona:
    • Models often contradict themselves (e.g., saying "I live in New York" then "I live in Paris" later).
    • Lack of a consistent personality profile.
  3. Generic Responses:
    • Models tend to play it safe to minimize loss, resulting in dull responses like "I don't know" or "That's interesting."
  4. Evaluation Metrics:
    • Standard metrics like BLEU (used for translation) correlate poorly with human judgment of conversation quality. A chatbot can answer correctly but rudely or vaguely.

5. Transformer Models: Challenges and Solutions

The standard Transformer architecture (like BERT/GPT-2) faces computational limits when applied to long sequences (like long chat logs or books).

The Quadratic Complexity Problem

The core issue is the Self-Attention Mechanism.

  • For a sequence of length n, every token attends to every other token.
  • This results in O(n²) complexity for both time and memory.
  • Example: Doubling the context length quadruples the memory usage, as the quick calculation below shows. This makes processing long chat histories prohibitively expensive.
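
A back-of-the-envelope calculation makes this concrete (float32 scores, one attention head, batch size 1):

PYTHON
# Memory needed just for the n x n attention score matrix.
def attn_matrix_mb(n_tokens: int, bytes_per_float: int = 4) -> float:
    return n_tokens * n_tokens * bytes_per_float / 1e6

for n in (512, 1024, 2048, 65536):
    print(f"{n:>6} tokens -> {attn_matrix_mb(n):>10.1f} MB")
# 512 -> 1.0 MB, 1024 -> 4.2 MB, 2048 -> 16.8 MB, 65536 -> ~17,180 MB (~17 GB)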

Solutions and Optimizations

To address the bottleneck, several variations have been proposed:

  1. Sparse Attention (e.g., Longformer, BigBird):
    • Instead of attending to all tokens, tokens only attend to a local window of neighbors and a few global tokens (see the toy mask after this list).
    • Reduces complexity to O(n·w), where w is the fixed window size (effectively linear in n).
  2. Recurrence (e.g., Transformer-XL):
    • Caches hidden states from previous segments to preserve long-term dependencies without recomputing.
  3. Low-Rank Factorization (e.g., Linformer):
    • Approximates the attention matrix using lower-rank matrices.
  4. Hashing (e.g., Reformer):
    • Uses Locality Sensitive Hashing (LSH) to approximate attention.
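
A toy illustration of the sliding-window idea behind Longformer-style sparse attention (NumPy, illustrative only; real implementations use custom kernels and add a few global tokens):

PYTHON
import numpy as np

def local_attention_mask(n: int, window: int) -> np.ndarray:
    """Boolean mask: token i may attend only to tokens within `window` positions."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(n=8, window=2)
print(mask.astype(int))
# Each row has at most 2*window + 1 ones, so masked attention
# costs O(n * window) instead of O(n^2).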

6. Chatbot using a Reformer Model

The Reformer (introduced by Google Research) is known as the "Efficient Transformer." It is specifically designed to handle very long context windows (up to 64,000 tokens) on a single GPU, making it ideal for maintaining long conversational contexts in chatbots.

Key Innovations of the Reformer

1. Locality Sensitive Hashing (LSH) Attention

  • Problem: In standard attention, we compute a query-key score (Q·Kᵀ) for every pair of tokens. Most pairs result in a low score (irrelevant).
  • Solution: LSH groups vectors that are similar into "buckets" using hash functions.
  • Mechanism: The model only computes attention between items falling in the same bucket (similar items).
  • Impact: Changes complexity from O(n²) to O(n log n); a toy bucketing sketch follows below.
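
The bucketing idea can be shown with a toy random-projection hash (illustrative only; the actual Reformer shares the query and key projections and runs several hash rounds to reduce the chance of missing a relevant pair):

PYTHON
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(vectors: np.ndarray, n_hashes: int = 4) -> np.ndarray:
    """Random-projection LSH: nearby vectors tend to land in the same bucket."""
    planes = rng.standard_normal((vectors.shape[1], n_hashes))
    bits = (vectors @ planes) > 0                          # sign of each projection
    return bits.astype(int) @ (2 ** np.arange(n_hashes))  # pack bits into a bucket id

tokens = rng.standard_normal((16, 64))  # 16 token vectors of dimension 64
print(lsh_buckets(tokens))
# Attention is then computed only among tokens sharing a bucket id,
# instead of across all 16 x 16 pairs.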

2. Reversible Residual Layers (RevNets)

  • Problem: To perform backpropagation, standard models must store the activations (values) of every layer in memory. For deep models, this consumes massive RAM.
  • Solution: In a reversible network, the input to a layer can be calculated from its output.
    • Standard residual: y = x + F(x). Reversible residual: y₁ = x₁ + F(x₂), y₂ = x₂ + G(y₁), which inverts exactly as x₂ = y₂ - G(y₁), then x₁ = y₁ - F(x₂).
  • Impact: The model does not need to store activations for all layers. It recomputes them on the fly during the backward pass (see the toy example below). This trades a small amount of compute time for massive memory savings.
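
A toy NumPy example of the reversible residual equations (F and G stand in for the attention and feed-forward sub-layers):

PYTHON
import numpy as np

F = lambda x: np.tanh(x)   # stand-in for the attention sub-layer
G = lambda x: 0.5 * x      # stand-in for the feed-forward sub-layer

def rev_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    # Inputs are recomputed from outputs, so activations need not be stored.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.array([1.0, -2.0]), np.array([0.5, 3.0])
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
print(np.allclose(r1, x1) and np.allclose(r2, x2))  # True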

Implementation in Chatbots

Using a Reformer for a chatbot allows the system to feed the entire conversation history (thousands of turns) into the model without truncation.

Workflow:

  1. Input: Concatenate full User/Bot history.
  2. LSH Attention: Efficiently attends to relevant past parts of the conversation (e.g., recalling the user's name mentioned 500 turns ago).
  3. Generation: Produces a response that is contextually aware of the entire interaction, solving the "Consistency and Memory" challenge (a minimal sketch follows below).
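
A minimal sketch of this workflow, assuming the publicly available Reformer language-model checkpoint google/reformer-crime-and-punishment (a novel-text model, used here only to show the long-context generation loop; a real chatbot would need a Reformer fine-tuned on dialogue data):

PYTHON
from transformers import ReformerModelWithLMHead, ReformerTokenizer

name = "google/reformer-crime-and-punishment"
tokenizer = ReformerTokenizer.from_pretrained(name)
model = ReformerModelWithLMHead.from_pretrained(name)

# Step 1: concatenate the full user/bot history into one long input.
history = "User: Hello, my name is Anna.\nBot: Hi Anna!\nUser: What is my name?\nBot:"
inputs = tokenizer(history, return_tensors="pt")

# Steps 2-3: LSH attention covers the whole history while the model
# generates the reply token by token.
output_ids = model.generate(**inputs, do_sample=True, top_k=50, max_length=100)
print(tokenizer.decode(output_ids[0]))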

Summary Table: Transformer vs. Reformer

Feature              | Standard Transformer               | Reformer
---------------------|------------------------------------|-----------------------------------
Attention Complexity | O(n²) (quadratic)                  | O(n log n) (log-linear)
Memory Usage         | High (stores all activations)      | Low (reversible layers)
Max Context Length   | ~512 - 2,048 tokens                | ~64,000+ tokens
Best Use Case        | Short QA, sentence classification  | Long documents, long chat history