Unit 6 - Subjective Questions
CSE472 • Practice Questions with Detailed Answers
What are Generative NLP models, and how do they differ from discriminative NLP models?
Generative NLP Models are machine learning models designed to generate natural language text. They learn the joint probability distribution of the data, $P(X, Y)$, or simply the distribution of the text, $P(X)$, in the case of unsupervised generation.
Differences from Discriminative Models:
- Objective: Generative models aim to understand how the data is generated (modeling $P(X, Y)$ or $P(X)$), whereas discriminative models learn the boundary between classes (modeling $P(Y \mid X)$).
- Tasks: Generative models are used for tasks like text generation, translation, and summarization. Discriminative models are typically used for classification tasks like sentiment analysis or spam detection.
- Flexibility: Generative models like GPT can generate entirely new sequences of text, making them highly versatile for open-ended tasks.
Explain the Greedy Search strategy for text generation and discuss its primary limitations.
Greedy Search is the simplest decoding algorithm used in text generation.
Mechanism:
- At each time step $t$, the model calculates the probability distribution over the entire vocabulary for the next word.
- It strictly selects the word with the highest probability: $w_t = \arg\max_{w} P(w \mid w_1, \dots, w_{t-1})$.
- This process continues iteratively until an end-of-sequence token is generated or a maximum length is reached.
Limitations:
- Suboptimal Global Sequences: By only looking at the immediate next step, greedy search can miss a sequence of words that has a higher overall probability but requires selecting a locally sub-optimal word early on.
- Repetitiveness: It often leads to repetitive and highly predictable text, lacking the diversity and natural variation found in human language.
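The mechanism and its first limitation can be sketched with a toy example. The bigram table `TOY_MODEL` and the helper names below are hypothetical and exist only for illustration; a real model would condition on the full prefix rather than on the last token alone.

```python
# Hypothetical toy "model": maps the previous token to next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def next_token_probs(prefix):
    """Return P(w | prefix); this toy model only looks at the last token."""
    return TOY_MODEL[prefix[-1]]


def greedy_decode(start="<s>", eos="</s>", max_len=10):
    """Pick the single most probable token at every step."""
    sequence = [start]
    prob = 1.0
    while len(sequence) < max_len and sequence[-1] != eos:
        probs = next_token_probs(sequence)
        token = max(probs, key=probs.get)  # argmax over the vocabulary
        prob *= probs[token]
        sequence.append(token)
    return sequence, prob


sequence, prob = greedy_decode()
print(sequence, round(prob, 2))
```

In this toy table, greedy search commits to "the" (0.6) and ends with a sequence of probability 0.3, while the sequence starting with the locally less likely "a" would reach 0.4 × 0.9 = 0.36.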
Describe the Beam Search algorithm and how it improves upon Greedy Search. Provide its mathematical intuition.
Beam Search is a heuristic search algorithm that expands upon greedy search by keeping track of multiple hypotheses at each time step.
Mechanism:
- Instead of picking just the top word, Beam Search keeps track of the $k$ most probable partial sequences (where $k$ is the beam width).
- At each step, it expands all $k$ sequences by predicting the next word, resulting in $k \times |V|$ new candidates (where $|V|$ is the vocabulary size).
- It scores these candidates using the joint probability and keeps only the top $k$ sequences for the next step.
Mathematical Intuition:
The goal is to maximize the sequence probability:
$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$
By keeping $k$ beams, it searches a larger subspace than greedy search ($k = 1$), mitigating the risk of missing high-probability global sequences that start with low-probability words.
Improvement over Greedy:
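- Usually finds sequences with higher overall probability than greedy search, although it remains a heuristic and does not guarantee the globally optimal sequence.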
- Reduces ungrammatical or dead-end sentences compared to greedy decoding.
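A minimal sketch of the procedure, assuming a hypothetical bigram table (`TOY_MODEL`) in place of a real language model. Scores are accumulated as log-probabilities, a common choice to avoid numerical underflow on long sequences.

```python
import math

# Hypothetical toy "model": maps the previous token to next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(beam_width=2, start="<s>", eos="</s>", max_len=10):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [(0.0, [start])]  # each hypothesis is (log_probability, tokens)
    for _ in range(max_len):
        candidates = []
        for log_p, seq in beams:
            if seq[-1] == eos:
                candidates.append((log_p, seq))  # finished hypotheses carry over
                continue
            for token, p in TOY_MODEL[seq[-1]].items():
                candidates.append((log_p + math.log(p), seq + [token]))
        # Prune: keep only the top hypotheses by joint log-probability.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == eos for _, seq in beams):
            break
    best_log_p, best_seq = beams[0]
    return best_seq, math.exp(best_log_p)


print(beam_search(beam_width=1))  # behaves like greedy search
print(beam_search(beam_width=2))  # recovers the higher-probability sequence
```

With a beam width of 1 the search follows "the" and ends at probability 0.3; with a width of 2 it keeps "a" alive long enough to find the sequence with probability 0.36.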
Explain the Top-k sampling strategy in generative NLP.
Top-k Sampling is a stochastic decoding strategy used to introduce diversity into text generation while maintaining coherence.
How it works:
- At each generation step, the model outputs a probability distribution over the vocabulary.
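- Instead of considering all words, the vocabulary is filtered to keep only the $k$ most likely next words.
- The probability mass is then redistributed (re-normalized) among these $k$ words: $P'(w) = \frac{P(w)}{\sum_{w' \in V_k} P(w')}$ for $w \in V_k$, where $V_k$ is the set of the $k$ most likely words, and $P'(w) = 0$ otherwise.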
- The next word is then randomly sampled from this re-normalized distribution.
Advantages and Disadvantages:
- Advantage: Prevents the model from sampling highly unlikely "tail" words, ensuring better coherence than pure random sampling.
- Disadvantage: A fixed $k$ can be rigid. In flat distributions, a small $k$ cuts off valid options; in sharp distributions, a large $k$ includes irrelevant words.
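The filter, re-normalize, and sample steps can be sketched as follows. The function names and the example distribution are illustrative assumptions, not part of any particular library.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and re-normalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


def sample_top_k(probs, k, rng=random):
    """Draw one token from the re-normalized top-k distribution."""
    filtered = top_k_filter(probs, k)
    tokens = list(filtered)
    weights = list(filtered.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-word distribution for illustration.
probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "zebra": 0.05}
print(top_k_filter(probs, k=2))
print(sample_top_k(probs, k=2))
```

With $k = 2$, only "cat" and "dog" survive and their probabilities are rescaled to 0.5 / 0.8 and 0.3 / 0.8, so the tail words "car" and "zebra" can never be sampled.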
What is Nucleus (Top-p) sampling, and how does it dynamically address the limitations of Top-k sampling?
Nucleus (Top-p) Sampling is an advanced text generation strategy that filters the vocabulary based on cumulative probability rather than a fixed number of words.
How it works:
- The model sorts the vocabulary by probability in descending order.
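- It computes the cumulative probability and selects the smallest set of words (the "nucleus") whose cumulative probability reaches or exceeds a predefined threshold $p$ (e.g., $p = 0.9$).
- The formula is to find the smallest index $k$ such that: $\sum_{i=1}^{k} P(w_{(i)}) \ge p$, where $w_{(i)}$ is the $i$-th most probable word.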
- The probabilities of the words in this set are re-normalized, and the next word is sampled from them.
Addressing Top-k Limitations:
- Dynamic Vocabulary Size: Unlike Top-k, which always selects $k$ words regardless of the distribution, Top-p adjusts dynamically. If the model is highly confident (sharp distribution), it might only sample from 1 or 2 words. If the model is uncertain (flat distribution), it samples from a larger pool, which balances diversity and coherence better than a fixed cutoff.
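A minimal sketch of nucleus filtering, showing how the size of the candidate set changes with the shape of the distribution. The function name and the two example distributions are assumptions made for illustration.

```python
import random


def top_p_filter(probs, p):
    """Keep the smallest set of top-ranked tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Re-normalize so the nucleus probabilities sum to 1.
    return {token: prob / cumulative for token, prob in nucleus}


# Hypothetical distributions: one sharp (confident), one flat (uncertain).
sharp = {"Paris": 0.92, "Lyon": 0.05, "Rome": 0.02, "Oslo": 0.01}
flat = {"red": 0.3, "blue": 0.25, "green": 0.25, "pink": 0.2}

for name, dist in [("sharp", sharp), ("flat", flat)]:
    nucleus = top_p_filter(dist, p=0.9)
    choice = random.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
    print(name, len(nucleus), choice)
```

For the same threshold, the sharp distribution yields a nucleus of a single word, while the flat one needs all four words to reach the threshold.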
What does it mean for a Large Language Model to be 'instruction-tuned'?
Instruction Tuning is a fine-tuning process applied to pre-trained Large Language Models (LLMs) to align their behavior with user intent.
Key Characteristics:
- Data Format: The model is trained on datasets formatted as instructions (e.g., "Translate this sentence to French: ..." or "Summarize the following text: ...") paired with the correct responses.
- Objective: While base models are trained simply to predict the next word, instruction-tuned models are optimized to follow directions, answer questions directly, and complete specific tasks described in the prompt.
- Zero-Shot Generalization: Instruction tuning significantly improves a model's ability to perform unseen tasks zero-shot, because the model learns the generalized concept of "following an instruction" rather than just continuing a text pattern.
- Techniques: Often involves Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
Describe Large Language Models (LLMs). What makes them 'large' compared to earlier NLP models?
Large Language Models (LLMs) are highly advanced generative NLP models, predominantly based on the Transformer architecture (especially the decoder-only architecture).
What makes them 'Large':
- Parameter Count: LLMs possess billions or even trillions of parameters (weights). Earlier models like ELMo or BERT had at most a few hundred million parameters (BERT-large has about 340 million), whereas models like GPT-3 or LLaMA have tens to hundreds of billions.
- Training Data: They are pre-trained on massive, internet-scale text corpora (terabytes of text), encompassing diverse topics, languages, and coding syntax.
- Computational Scale: Training requires thousands of GPUs/TPUs over several weeks or months.
Emergent Abilities: Because of this immense scale, LLMs demonstrate "emergent abilities"—skills they were not explicitly trained for, such as translating between rare languages, writing executable code, or solving multi-step logic problems.
Discuss how LLMs behave in summarization tasks. What are the common challenges they face?
LLMs in Summarization Tasks:
LLMs excel at both extractive (pulling key sentences) and abstractive (generating new sentences to capture the gist) summarization, largely leaning toward abstractive due to their generative nature.
Behaviors and Capabilities:
- Context Understanding: They can digest long documents and identify the core themes effectively.
- Customization: They can adapt the summary based on instructions (e.g., "Summarize in 3 bullet points" or "Summarize for a 5-year-old").
Common Challenges:
- Hallucination: The model might insert facts or details not present in the source text.
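- Context Window Limits: If a document exceeds the model's token limit, part of it must be truncated, so important information may never reach the model. Even within the limit, models tend to use information in the middle of a long context less reliably than information near the beginning or end (the "lost in the middle" phenomenon).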
- Bias and Spin: LLMs might inadvertently alter the tone or emphasize the wrong points based on their pre-training biases rather than the source text.
Analyze the role of Generative Models in Dialogue Generation.
Dialogue Generation is a primary application for generative LLMs, transforming them into conversational agents (chatbots).
Key Roles and Mechanisms:
- Context Tracking: LLMs maintain conversational context by taking the entire conversation history as the input prompt to generate the next response.
- Persona Adoption: Generative models can be prompted to adopt specific personas (e.g., a helpful assistant, a pirate, a professional tutor) and maintain that tone throughout the dialogue.
- Open-Domain Capability: Unlike traditional rule-based or intent-based chatbots, LLMs can discuss virtually any topic seamlessly without requiring predefined dialogue trees.
Challenges:
- Memory constraints: Managing long-term memory in extended conversations requires workarounds like vector databases or continuous summarization.
- Safety and Toxicity: Without guardrails, models can generate inappropriate, biased, or toxic responses in a conversational setting.
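Context tracking and the memory constraint can be illustrated with a small prompt builder. This is only a sketch: `build_prompt` is a hypothetical helper, and it counts whitespace-separated words as a rough stand-in for a real tokenizer's token count.

```python
def build_prompt(history, user_message, max_words=50):
    """Concatenate the most recent turns, dropping the oldest once the budget is exceeded."""
    turns = history + [("User", user_message)]
    kept, used = [], 0
    for speaker, text in reversed(turns):  # walk from newest to oldest
        cost = len(text.split())
        if kept and used + cost > max_words:
            break  # older turns no longer fit in the context budget
        kept.append(f"{speaker}: {text}")
        used += cost
    return "\n".join(reversed(kept)) + "\nAssistant:"


history = [
    ("User", "My name is Ana and I like chess"),
    ("Assistant", "Nice to meet you Ana"),
]
print(build_prompt(history, "What is my hobby", max_words=9))
```

With the small budget used here, the first user turn is dropped, so a model given this prompt has no way to recall the hobby. This is the situation that workarounds such as summarizing or retrieving older turns try to address.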
How do LLMs tackle reasoning tasks? Explain the 'Chain of Thought' prompting strategy.
LLMs and Reasoning Tasks:
Historically, language models struggled with multi-step reasoning, arithmetic, or logic puzzles because standard prompting forces the model to output the final answer immediately, without producing any intermediate steps.
Chain of Thought (CoT) Prompting:
CoT is a strategy that enables LLMs to solve complex reasoning tasks by mimicking human-like step-by-step thinking.
- Mechanism: Instead of asking for a direct answer, the prompt instructs the model (or provides examples) to break the problem down into intermediate logical steps before reaching the conclusion. For example: "Let's think step by step."
- Why it works: By generating intermediate tokens, the model essentially gets "more compute time" (token generation) to process the logic. Each generated step becomes part of the context, so the final answer is conditioned on the reasoning steps produced before it.
- Impact: CoT drastically improves performance on benchmarks related to math word problems, symbolic reasoning, and common sense reasoning.
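As an illustration, a CoT prompt can be assembled from optional worked examples plus the trigger phrase. The helper `make_cot_prompt` and the "Q:/A:" layout are assumptions for this sketch, not a required format.

```python
def make_cot_prompt(question, examples=()):
    """Build a prompt that asks for intermediate steps before the final answer."""
    parts = []
    for ex_question, ex_reasoning, ex_answer in examples:
        # Few-shot CoT: each example shows the reasoning, then the answer.
        parts.append(f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}.")
    # Zero-shot CoT trigger appended to the new question.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


example = (
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "Tom starts with 3 apples. Buying 2 more gives 3 + 2 = 5.",
    "5",
)
print(make_cot_prompt("A box holds 4 pens. How many pens are in 3 boxes?", [example]))
```

Calling the helper without examples gives the zero-shot variant, which relies on the trigger phrase alone.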
Define Perplexity as an evaluation metric for language models. Provide its mathematical formula.
Perplexity (PPL) is an intrinsic evaluation metric used to measure how well a probability model predicts a sample. In NLP, it measures the model's surprise at seeing the test data.
Definition:
Lower perplexity indicates that the language model assigns a higher probability to the true test data, meaning it is better at predicting the language sequence.
Mathematical Formula:
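For a test sequence of words $W = w_1, w_2, \dots, w_N$, perplexity is the inverse probability of the test set, normalized by the number of words:
$$PPL(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}}$$
Using the chain rule of probability, it can be written as:
$$PPL(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})} \right)^{\frac{1}{N}}$$
Alternatively, it is the exponentiation of the cross-entropy loss: $PPL(W) = e^{H(W)}$, where $H(W) = -\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_1, \dots, w_{i-1})$ is the cross-entropy (measured in nats; with base-2 logarithms the equivalent form is $2^{H(W)}$).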
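A short numerical sketch of the definition, computed through the cross-entropy form. The per-token probabilities below are made-up values standing in for what a model would assign to each word of a test sequence.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probability the model assigned to each observed token."""
    n = len(token_probs)
    cross_entropy = -sum(math.log(p) for p in token_probs) / n  # in nats
    return math.exp(cross_entropy)


# Hypothetical per-token probabilities for two models on the same 4-word text.
uncertain_model = [0.25, 0.25, 0.25, 0.25]
confident_model = [0.9, 0.8, 0.95, 0.85]

print(perplexity(uncertain_model))
print(perplexity(confident_model))
```

A model that assigns probability 0.25 to every word has a perplexity of 4, as if it were choosing uniformly among four words at each step; the more confident model scores lower.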
Discuss the importance and methodologies of Human Judgment measures in evaluating generative NLP models.
Importance of Human Judgment:
Automated metrics (like BLEU, ROUGE, or Perplexity) often fail to capture the true quality of generative text, such as nuance, humor, factual accuracy, and fluency. Human judgment remains the gold standard for evaluating open-ended text generation.
Methodologies:
- Likert Scales: Human annotators rate generated text on a scale (e.g., 1 to 5) across various dimensions like Fluency, Coherence, Relevance, and Factuality.
- Pairwise Comparison (A/B Testing): Annotators are shown two outputs from different models for the same prompt and asked to choose the better one. This is heavily used in systems like Chatbot Arena.
- Error Analysis: Humans specifically look for and categorize errors, such as hallucinations, logical fallacies, or toxic language.
- Drawbacks: Human evaluation is expensive, time-consuming, subjective, and difficult to reproduce at scale.
Compare automated evaluation metrics with human judgment measures for evaluating Large Language Models.
Automated Evaluation Metrics vs. Human Judgment
Automated Metrics (e.g., BLEU, ROUGE, Perplexity):
- Speed and Cost: Extremely fast and virtually free to compute at scale.
- Reproducibility: Highly reproducible and objective (mathematically defined).
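- Drawbacks: BLEU and ROUGE rely mostly on exact n-gram overlap with reference texts, so they cannot penalize subtle factual errors (hallucinations) and often penalize highly creative but perfectly valid paraphrases. Perplexity measures only how likely the text is under the model, not whether it is correct or useful.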
Human Judgment:
- Quality Assessment: Captures the true utility of the text, evaluating semantics, truthfulness, tone, and stylistic nuances.
- Adaptability: Humans can evaluate open-ended tasks (like brainstorming or creative writing) where no reference text exists.
- Drawbacks: Expensive, slow, and prone to inter-annotator disagreement (subjectivity bias).
Conclusion: Best practices in LLM evaluation involve a hybrid approach: using automated metrics for rapid prototyping and human judgment for final benchmarking and RLHF.
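The weakness of overlap-based metrics can be seen in a small sketch. The function below computes clipped unigram precision, a simplified stand-in for the overlap term inside BLEU (real BLEU also uses higher-order n-grams and a brevity penalty); the sentences are invented for illustration.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Share of candidate words that also appear in the reference (with clipping)."""
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand)
    overlap = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    return overlap / len(cand) if cand else 0.0


reference = "the cat sat on the mat"
paraphrase = "a feline rested upon a rug"  # valid meaning, different words
wrong_fact = "the cat sat on the hat"      # one factual error, high overlap

print(unigram_precision(paraphrase, reference))
print(unigram_precision(wrong_fact, reference))
```

The valid paraphrase scores 0 while the factually wrong sentence scores 5/6, which is the kind of case where human judgment is needed.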
What is Explainability in the context of Large Language Models, and why is it challenging?
Explainability in LLMs refers to the ability to understand, interpret, and trace how a model arrived at a specific output or decision.
Why it is challenging:
- Black Box Nature: LLMs consist of billions of parameters with highly non-linear transformations (attention mechanisms, MLPs). Tracing a specific output back to a specific input or training data point is extremely difficult in practice.
- Distributed Representations: Knowledge is not stored in a single neuron; it is distributed across the entire network.
Approaches to Explainability:
- Mechanistic Interpretability: Attempting to reverse-engineer the neural network to find circuits or attention heads responsible for specific tasks.
- Post-hoc Explanations: Asking the model itself to explain its reasoning (e.g., using Chain of Thought), though this can be prone to rationalization rather than genuine explanation.
- Feature Attribution: Techniques like Integrated Gradients to see which input tokens most heavily influenced the output tokens.
Define 'Hallucination' in Large Language Models. Provide examples of how it manifests.
Hallucination in LLMs occurs when the model generates text that is grammatically correct and sounds highly plausible, but is factually incorrect, nonsensical, or unfaithful to the provided source input.
How it manifests:
- Intrinsic Hallucination (Contradiction): The model contradicts the source material provided in the prompt. (e.g., Prompt: "John was born in 1980." Output: "John, who was born in 1990...").
- Extrinsic Hallucination (Fabrication): The model invents "facts" that cannot be verified from the input or reality. (e.g., making up a fake research paper title, author, and DOI when asked for sources).
Causes:
LLMs are fundamentally probabilistic next-token predictors, not knowledge databases. If a highly probable sequence of tokens forms a factual inaccuracy, the model will generate it regardless of ground truth.
Discuss strategies and techniques used to mitigate hallucination in Large Language Models.
Mitigating Hallucination in LLMs is a critical area of research. Key strategies include:
- Retrieval-Augmented Generation (RAG): Instead of relying on the model's internal parametric memory, RAG searches an external database for factual documents and injects them into the prompt. The model is instructed to answer only based on the retrieved context.
- Prompt Engineering: Using strict instructions like "If you do not know the answer, say 'I don't know'" or "Cite your sources from the provided text."
- Self-Consistency and Verification: Generating multiple responses and checking for consensus, or prompting the model to explicitly verify its own previous statements.
- Instruction Tuning / RLHF: Penalizing hallucinated answers during the reinforcement learning phase, teaching the model to express uncertainty rather than fabricate facts.
- Lowering Temperature: In decoding, setting a low temperature (closer to greedy search) reduces randomness, which often correlates with a reduction in creative fabrications.
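The self-consistency idea from the list above can be sketched as a majority vote over several sampled answers. The function name, the agreement threshold, and the sample answers are assumptions for illustration; in practice the answers would come from repeated sampling of the same prompt.

```python
from collections import Counter


def majority_answer(answers, min_agreement=0.5):
    """Return the most common answer, or None if the samples do not agree enough."""
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer if count / len(normalized) > min_agreement else None


# Hypothetical answers sampled from the same prompt.
consistent = ["Paris", "paris ", "Lyon", "Paris", "Marseille"]
inconsistent = ["1912", "1915", "1921"]

print(majority_answer(consistent))
print(majority_answer(inconsistent))
```

Returning `None` when the samples disagree gives the surrounding system a signal to abstain or to fall back to retrieval instead of reporting a possibly fabricated answer.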
Explain the concept of Temperature in text generation and its mathematical impact on the softmax function.
Temperature is a hyperparameter used to control the randomness and creativity of text generated by an LLM.
Mathematical Impact:
In the final layer of an LLM, logits ($z_i$) are converted to probabilities using the softmax function. Temperature ($T$) scales these logits before softmax is applied:
$$P(w_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$
- $T = 1$: Default softmax. The model generates based on its raw learned probabilities.
- $T < 1$ (Low Temperature): As $T \to 0$, the differences between logits are amplified. The probability distribution becomes "sharper," favoring the most likely words. This leads to predictable, safe, and less diverse text (approaching Greedy Search).
- $T > 1$ (High Temperature): The differences between logits are reduced. The probability distribution becomes "flatter" or more uniform. This increases randomness, resulting in more creative, diverse, but potentially nonsensical or hallucinated text.
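The effect of temperature can be checked numerically with a small sketch in which each logit is divided by the temperature before the softmax. The logit values are arbitrary, and subtracting the maximum is only a standard numerical-stability step that does not change the result.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # arbitrary example logits
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures put more of the probability mass on the top logit, and higher temperatures spread it more evenly across the vocabulary.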
Compare Pre-training and Instruction Tuning phases of Large Language Models. What are their respective goals?
Pre-training vs. Instruction Tuning
1. Pre-training:
- Goal: To build foundational language understanding and world knowledge.
- Objective: Self-supervised learning, typically next-token prediction (causal language modeling).
- Data: Massive, unstructured internet text (Common Crawl, Wikipedia, books).
- Result: A "Base Model" that is excellent at completing sentences but poor at answering questions or following formatting commands (e.g., it might answer a question with another question).
2. Instruction Tuning:
- Goal: To adapt the base model to act as a helpful, harmless, and honest assistant.
- Objective: Supervised fine-tuning on instruction-response pairs, often followed by RLHF (Reinforcement Learning from Human Feedback).
- Data: High-quality, human-annotated prompt-response pairs.
- Result: An "Instruct Model" (like ChatGPT) that understands intents, follows constraints, and engages in dialogue.
How does balancing Diversity and Coherence impact the choice of text generation strategies?
Diversity vs. Coherence is the fundamental tradeoff in text generation.
- Coherence: The text must be logically consistent, grammatically correct, and highly relevant to the prompt. Maximizing coherence generally involves deterministic strategies like Greedy Search or Beam Search, which pick the highest probability tokens. However, this often leads to generic, repetitive, and boring text.
- Diversity: The text should be varied, creative, and human-like. Maximizing diversity involves stochastic strategies like random sampling. However, pure random sampling often selects long-tail tokens, leading to nonsensical gibberish (loss of coherence).
Impact on Strategy Choice:
To balance these, hybrid sampling methods are preferred:
- Top-k Sampling: Restricts sampling to the $k$ most probable words, which removes the incoherent long tail while keeping some variety.
- Nucleus (Top-p) Sampling: Dynamically limits diversity based on the model's confidence, which often gives a better balance for open-ended generation like storytelling or dialogue.
What are the specific challenges in evaluating the dialogue generation capabilities of LLMs?
Challenges in Evaluating Dialogue Generation:
- One-to-Many Mapping: In dialogue, there are multiple valid ways to respond to a given prompt. Automated metrics like BLEU score a response by its overlap with one or a few reference responses, but a generated response might be perfectly valid even if it shares zero words with the human reference.
- Context Dependency: A response cannot be evaluated in isolation; it must be judged based on the entire preceding conversational history.
- Persona and Consistency: Evaluating whether the model maintained a consistent personality or remembered facts stated three turns ago is difficult for automated systems.
- Safety and Nuance: Chatbots must be evaluated for nuanced issues like avoiding toxic responses, handling adversarial attacks (jailbreaks), and providing empathetic responses, all of which heavily require expensive human-in-the-loop evaluation.