Unit 6 - Notes

INT344

Unit 6: Building Models / Case Studies

1. Question Answering: Transfer Learning with State-Of-The-Art Models

Question Answering (QA) is a sub-field of Information Retrieval and NLP concerned with building systems that automatically answer questions posed by humans in natural language.

Transfer Learning in NLP

Before the transformer era, NLP models were trained from scratch for specific tasks. Transfer learning revolutionized this by allowing models to be pre-trained on massive datasets (like Wikipedia) to learn the structure of language, and then fine-tuned on smaller, task-specific datasets (like SQuAD - Stanford Question Answering Dataset).

Key Advantages:

  • Reduced Training Time: Fine-tuning takes significantly less time than training from scratch.
  • Performance on Low-Resource Data: Models can perform well even with limited labeled QA data because they already "understand" language.
  • State-Of-The-Art (SOTA) Evolution:
    • ELMo (2018): Contextualized word embeddings.
    • BERT (2018): Bidirectional transformer; redefined SOTA for QA.
    • RoBERTa/ALBERT: Optimized versions of BERT.
    • T5/GPT-3: Generative models that formulate answers rather than just extracting them.

2. BERT and T5 for Question Answering

BERT (Bidirectional Encoder Representations from Transformers)

BERT is an Encoder-only transformer architecture. It is designed to pre-train deep bidirectional representations from unlabeled text.

  • Mechanism for QA (Extractive QA):
    • BERT treats QA as a span selection problem.
    • Input: [CLS] Question [SEP] Passage [SEP]
    • Output: Two score vectors over the input tokens: one giving the probability that each token is the Start Position of the answer, and one that it is the End Position (see the sketch below).
    • Pros: Highly accurate for factoid questions where the answer exists verbatim in the text.
    • Cons: Cannot generate answers that are not explicitly present in the context.
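
A minimal sketch of this span-selection mechanism with Hugging Face Transformers (the checkpoint name is one public BERT model fine-tuned on SQuAD; any similar checkpoint works):

PYTHON
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# A public BERT checkpoint fine-tuned on SQuAD.
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "What is the capital of France?"
passage = "France is a country in Europe. Its capital is Paris."

# The tokenizer builds the [CLS] question [SEP] passage [SEP] input.
inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One score per token for the start position and one for the end position.
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits)
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(answer)  # expected: "paris"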

T5 (Text-to-Text Transfer Transformer)

T5 allows the use of the same model, loss function, and hyperparameters across all NLP tasks by treating every problem as a text-to-text problem.

  • Mechanism for QA (Generative QA):
    • T5 uses an Encoder-Decoder architecture.
    • Input: question: What is the capital of France? context: France is a country in Europe...
    • Output: Paris, generated token by token (see the sketch below).
    • Pros: Can generate abstractive answers; handles boolean (True/False) questions easily; unified framework.
    • Cons: Slower inference time due to auto-regressive decoding; risk of hallucination (generating plausible but incorrect facts).
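
A sketch of the generative formulation using the public t5-small checkpoint (larger T5 variants answer better; the "question: ... context: ..." prefix matches T5's text-to-text convention):

PYTHON
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Smallest public T5 checkpoint; used here only to illustrate the interface.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task, including QA, is expressed as plain input text.
prompt = ("question: What is the capital of France? "
          "context: France is a country in Europe. Its capital is Paris.")
inputs = tokenizer(prompt, return_tensors="pt")

# The decoder generates the answer auto-regressively, token by token.
output_ids = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # expected: "Paris"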

3. Model for Answering Questions (Architecture Design)

Building a complete QA system usually involves more than just a language model. The standard architecture for Open-Domain QA is the Retriever-Reader pipeline.

The Retriever-Reader Pipeline

  1. The Retriever (Document Selection):

    • Purpose: Scans a massive knowledge base (e.g., all of Wikipedia) to find relevant documents.
    • Traditional Method: TF-IDF / BM25 (Keyword matching).
    • Modern Method: Dense Passage Retrieval (DPR). Uses a dual-encoder architecture to embed questions and documents into the same vector space. Relevance is scored by the dot product between the question and passage embeddings (equivalent to cosine similarity when the vectors are normalized); see the retriever sketch after this list.
  2. The Reader (Answer Extraction/Generation):

    • Purpose: Processes the documents found by the Retriever to find the specific answer.
    • Model: BERT (for extractive) or T5/BART (for generative).
    • Process: The Reader takes the top documents + the Question and outputs the final answer.
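
The retriever stage can be sketched with the public DPR checkpoints in Hugging Face Transformers (the checkpoint names below are the Natural Questions DPR models; in production the passage embeddings would be pre-computed and stored in a vector index such as FAISS):

PYTHON
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

# Dual-encoder: separate encoders for questions and passages.
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = [
    "Paris is the capital and most populous city of France.",
    "The Amazon rainforest covers much of the Amazon basin.",
]

# Questions and passages are embedded into the same vector space.
with torch.no_grad():
    q_emb = q_enc(**q_tok("What is the capital of France?", return_tensors="pt")).pooler_output
    c_emb = c_enc(**c_tok(passages, return_tensors="pt", padding=True)).pooler_output

# Relevance = dot product; the top-scoring passage is handed to the Reader.
scores = (q_emb @ c_emb.T).squeeze(0)
print(passages[scores.argmax().item()])  # expected: the Paris passage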

Conceptual Code Snippet (Hugging Face Transformers)

PYTHON
from transformers import pipeline

# One public RoBERTa checkpoint fine-tuned on SQuAD 2.0.
model_name = "deepset/roberta-base-squad2"

# The pipeline loads both the model and its tokenizer from the checkpoint name.
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'Model conversion allows using models trained in PyTorch with TensorFlow.'
}
res = nlp(QA_input)
print(res)
# Example output (illustrative):
# {'score': 0.98, 'start': 0, 'end': 15, 'answer': 'Model conversion'}


4. Chatbots: Unique Challenges

While QA systems answer single queries, chatbots must maintain a continuous dialogue. This introduces specific challenges that standard transformer models struggle with.

Key Challenges

  1. Long-Term Context Memory:
    • Standard Transformers have a fixed context window (usually 512 or 1024 tokens).
    • As the conversation grows, early parts of the chat are truncated, causing the bot to "forget" the user's name or initial intent.
  2. Consistency and Persona:
    • Models often contradict themselves (e.g., saying "I live in New York" then "I live in Paris" later).
    • Lack of a consistent personality profile.
  3. Generic Responses:
    • Models tend to play it safe to minimize loss, resulting in dull responses like "I don't know" or "That's interesting."
  4. Evaluation Metrics:
    • Standard metrics like BLEU (used for translation) correlate poorly with human judgment of conversation quality. A chatbot can answer correctly but rudely or vaguely.

5. Transformer Models: Challenges and Solutions

The standard Transformer architecture (like BERT/GPT-2) faces computational limits when applied to long sequences (like long chat logs or books).

The Quadratic Complexity Problem

The core issue is the Self-Attention Mechanism.

  • For a sequence of length n, every token attends to every other token.
  • This results in O(n²) complexity for both time and memory.
  • Example: Doubling the context length quadruples the memory usage, as the quick calculation below shows. This makes processing long chat histories prohibitively expensive.
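
A back-of-the-envelope calculation makes this concrete (float32 scores, one attention head, batch size 1):

PYTHON
# Memory needed just for the n x n attention score matrix.
def attn_matrix_mb(n_tokens: int, bytes_per_float: int = 4) -> float:
    return n_tokens * n_tokens * bytes_per_float / 1e6

for n in (512, 1024, 2048, 65536):
    print(f"{n:>6} tokens -> {attn_matrix_mb(n):>10.1f} MB")
# 512 -> 1.0 MB, 1024 -> 4.2 MB, 2048 -> 16.8 MB, 65536 -> ~17,180 MB (~17 GB)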

Solutions and Optimizations

To address the bottleneck, several variations have been proposed:

  1. Sparse Attention (e.g., Longformer, BigBird):
    • Instead of attending to all tokens, tokens only attend to a local window of neighbors and a few global tokens (see the toy mask after this list).
    • Reduces complexity to O(n·w), where w is the fixed window size (effectively linear in n).
  2. Recurrence (e.g., Transformer-XL):
    • Caches hidden states from previous segments to preserve long-term dependencies without recomputing.
  3. Low-Rank Factorization (e.g., Linformer):
    • Approximates the attention matrix using lower-rank matrices.
  4. Hashing (e.g., Reformer):
    • Uses Locality Sensitive Hashing (LSH) to approximate attention.
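
A toy illustration of the sliding-window idea behind Longformer-style sparse attention (NumPy, illustrative only; real implementations use custom kernels and add a few global tokens):

PYTHON
import numpy as np

def local_attention_mask(n: int, window: int) -> np.ndarray:
    """Boolean mask: token i may attend only to tokens within `window` positions."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(n=8, window=2)
print(mask.astype(int))
# Each row has at most 2*window + 1 ones, so masked attention
# costs O(n * window) instead of O(n^2).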

6. Chatbot using a Reformer Model

The Reformer (introduced by Google Research) is known as the "Efficient Transformer." It is specifically designed to handle very long context windows (up to 64,000 tokens) on a single GPU, making it ideal for maintaining long conversational contexts in chatbots.

Key Innovations of the Reformer

1. Locality Sensitive Hashing (LSH) Attention

  • Problem: In standard attention, we compute a query-key score (Q·Kᵀ) for every pair of tokens. Most pairs result in a low score (irrelevant).
  • Solution: LSH groups vectors that are similar into "buckets" using hash functions.
  • Mechanism: The model only computes attention between items falling in the same bucket (similar items).
  • Impact: Changes complexity from O(n²) to O(n log n); a toy bucketing sketch follows below.
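
The bucketing idea can be shown with a toy random-projection hash (illustrative only; the actual Reformer shares the query and key projections and runs several hash rounds to reduce the chance of missing a relevant pair):

PYTHON
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(vectors: np.ndarray, n_hashes: int = 4) -> np.ndarray:
    """Random-projection LSH: nearby vectors tend to land in the same bucket."""
    planes = rng.standard_normal((vectors.shape[1], n_hashes))
    bits = (vectors @ planes) > 0                          # sign of each projection
    return bits.astype(int) @ (2 ** np.arange(n_hashes))  # pack bits into a bucket id

tokens = rng.standard_normal((16, 64))  # 16 token vectors of dimension 64
print(lsh_buckets(tokens))
# Attention is then computed only among tokens sharing a bucket id,
# instead of across all 16 x 16 pairs.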

2. Reversible Residual Layers (RevNets)

  • Problem: To perform backpropagation, standard models must store the activations (values) of every layer in memory. For deep models, this consumes massive RAM.
  • Solution: In a reversible network, the input to a layer can be calculated from its output.
    • Standard residual: y = x + F(x). Reversible residual: y₁ = x₁ + F(x₂), y₂ = x₂ + G(y₁), which inverts exactly as x₂ = y₂ - G(y₁), then x₁ = y₁ - F(x₂).
  • Impact: The model does not need to store activations for all layers. It recomputes them on the fly during the backward pass (see the toy example below). This trades a small amount of compute time for massive memory savings.
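
A toy NumPy example of the reversible residual equations (F and G stand in for the attention and feed-forward sub-layers):

PYTHON
import numpy as np

F = lambda x: np.tanh(x)   # stand-in for the attention sub-layer
G = lambda x: 0.5 * x      # stand-in for the feed-forward sub-layer

def rev_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    # Inputs are recomputed from outputs, so activations need not be stored.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.array([1.0, -2.0]), np.array([0.5, 3.0])
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
print(np.allclose(r1, x1) and np.allclose(r2, x2))  # True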

Implementation in Chatbots

Using a Reformer for a chatbot allows the system to feed the entire conversation history (thousands of turns) into the model without truncation.

Workflow:

  1. Input: Concatenate full User/Bot history.
  2. LSH Attention: Efficiently attends to relevant past parts of the conversation (e.g., recalling the user's name mentioned 500 turns ago).
  3. Generation: Produces a response that is contextually aware of the entire interaction, solving the "Consistency and Memory" challenge (a minimal sketch follows below).
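
A minimal sketch of this workflow, assuming the publicly available Reformer language-model checkpoint google/reformer-crime-and-punishment (a novel-text model, used here only to show the long-context generation loop; a real chatbot would need a Reformer fine-tuned on dialogue data):

PYTHON
from transformers import ReformerModelWithLMHead, ReformerTokenizer

name = "google/reformer-crime-and-punishment"
tokenizer = ReformerTokenizer.from_pretrained(name)
model = ReformerModelWithLMHead.from_pretrained(name)

# Step 1: concatenate the full user/bot history into one long input.
history = "User: Hello, my name is Anna.\nBot: Hi Anna!\nUser: What is my name?\nBot:"
inputs = tokenizer(history, return_tensors="pt")

# Steps 2-3: LSH attention covers the whole history while the model
# generates the reply token by token.
output_ids = model.generate(**inputs, do_sample=True, top_k=50, max_length=100)
print(tokenizer.decode(output_ids[0]))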

Summary Table: Transformer vs. Reformer

Feature              | Standard Transformer               | Reformer
---------------------|------------------------------------|-----------------------------------
Attention Complexity | O(n²) (quadratic)                  | O(n log n) (log-linear)
Memory Usage         | High (stores all activations)      | Low (reversible layers)
Max Context Length   | ~512 - 2,048 tokens                | ~64,000+ tokens
Best Use Case        | Short QA, sentence classification  | Long documents, long chat history