Unit 6 - Notes

CSE472

Unit 6: Generative NLP and LLMs

1. Introduction to Generative NLP and Large Language Models (LLMs)

Generative NLP Models

Generative Natural Language Processing (NLP) focuses on generating coherent, contextually relevant, and semantically meaningful text. Unlike discriminative models (which classify or predict labels for existing text), generative models learn the probability distribution of a language to predict subsequent tokens (words, subwords, or characters) given a preceding context.

Autoregressive Language Modeling: Most generative NLP models are autoregressive. They predict the probability of the next token $x_t$ conditioned on the previous tokens $x_{<t}$: $P(x_t | x_1, x_2, ..., x_{t-1})$. By the chain rule, the probability of a whole sequence is the product of these per-token conditionals.
Transformer Architecture: Modern generative models predominantly rely on the Decoder-only Transformer architecture (e.g., GPT series). They utilize masked self-attention to ensure that the prediction of a token only depends on previous tokens, preventing information leakage from future tokens.
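
The masking idea can be illustrated with a minimal sketch in plain Python. This is not how a real Transformer is implemented (real models use batched tensor operations); the function names and the attention scores below are made up for illustration.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked-out positions."""
    allowed = [s for s, keep in zip(scores, mask_row) if keep]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if keep else 0.0
            for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Hypothetical raw attention scores for position 1 (the second token).
# Positions 2 and 3 are "future" tokens, so they must receive zero weight.
weights = masked_softmax([2.0, 1.0, 3.0, 0.5], mask[1])
print(weights)
```

Even though position 2 has the highest raw score, it gets zero weight because it lies in the future relative to position 1.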

Large Language Models (LLMs)

LLMs are generative models scaled up massively in terms of parameter count (billions to trillions), training data (terabytes of text), and compute.

Emergent Abilities: As models scale, they exhibit behaviors not explicitly trained for, such as zero-shot translation, logical reasoning, and complex pattern matching.
Foundation Models: These are base LLMs trained on vast, largely web-scraped text corpora using self-supervised learning. While they possess broad knowledge, they usually require further alignment (e.g., fine-tuning) to be useful for specific user-facing tasks.

2. Text Generation Strategies (Decoding Algorithms)

Once a model outputs a probability distribution over the vocabulary for the next token, a strategy is needed to select the actual token. This process is called decoding.

Greedy Search

Mechanism: At each time step, the model selects the single token with the highest predicted probability: $x_t = \arg\max_{x} P(x | x_{<t})$.
Pros: Extremely fast and computationally cheap.
Cons: Often leads to suboptimal, repetitive, or mathematically "safe" but linguistically boring text. It does not look ahead, meaning a high-probability word now might lead to a low-probability sequence later.
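
A minimal sketch of greedy search, assuming a hypothetical toy model in which the next-token distribution depends only on the previous token. The table `TOY_MODEL` and its probabilities are invented for illustration.

```python
# Hypothetical toy model: next-token probabilities given only the last token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}

def greedy_decode(model, start="<s>", end="</s>", max_steps=10):
    """Pick the single most probable next token at every step."""
    seq, prob = [start], 1.0
    for _ in range(max_steps):
        dist = model[seq[-1]]
        token = max(dist, key=dist.get)  # argmax over the vocabulary
        prob *= dist[token]
        seq.append(token)
        if token == end:
            break
    return seq, prob

seq, prob = greedy_decode(TOY_MODEL)
print(seq, prob)  # picks "a" then "x": joint probability about 0.33
```

In this toy table the sequence "b z" has joint probability 0.4 × 0.9 = 0.36, which is higher than the 0.33 that greedy search finds. This is the "does not look ahead" weakness in miniature.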

Beam Search

Mechanism: Instead of keeping just one path, Beam Search keeps track of the top $B$ (beam size) most probable sequences at each step. At step $t$, it expands all $B$ sequences, calculates the joint probabilities, and keeps only the new top $B$ sequences.
Pros: Usually finds higher-probability sequences than greedy search, although it is still an approximate search and does not guarantee the global optimum. Well suited to tasks where the output length is predictable and exactness matters (e.g., Neural Machine Translation).
Cons: In open-ended text generation (like story writing), beam search tends to produce generic, predictable, and highly repetitive text.
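
A minimal sketch of beam search over a hypothetical toy model (the table and its probabilities are invented for illustration). Scores are summed log-probabilities. Production implementations usually add length normalization, which is omitted here.

```python
import math

# Hypothetical toy model: next-token probabilities given only the last token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}

def beam_search(model, beam_size=2, start="<s>", end="</s>", max_steps=10):
    """Keep the beam_size highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]  # (sequence, summed log-probability)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:
                candidates.append((seq, score))  # finished: carry over as-is
                continue
            for token, p in model[seq[-1]].items():
                candidates.append((seq + [token], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams

beams = beam_search(TOY_MODEL, beam_size=2)
best_seq, best_score = beams[0]
print(best_seq, math.exp(best_score))  # "b z" path, joint probability about 0.36
```

With `beam_size=1` this procedure reduces to greedy search and returns the "a x" path (about 0.33). Keeping two beams lets the "b" prefix survive long enough to win.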

Top-k Sampling

Mechanism: Sort the vocabulary by probability. Truncate the list to the top $k$ tokens. Redistribute the probability mass among these $k$ tokens and sample from this restricted distribution.
Pros: Introduces randomness through sampling (the amount of randomness can be further tuned with a separate temperature parameter), which makes the text more diverse and human-like, while preventing the model from picking completely absurd words (the "long tail" of low-probability words).
Cons: $k$ is fixed. If the probability distribution is very flat (many good options), $k=10$ might cut off good words. If the distribution is very sharp (only 1-2 good options), $k=10$ includes bad options, leading to potential gibberish.
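
A minimal sketch of Top-k filtering and sampling. The vocabulary and probabilities are made up, and `top_k_filter` and `sample` are illustrative helper names, not a library API.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample(dist, rng):
    """Draw one token from a {token: probability} distribution."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)  # {'cat': 0.625, 'dog': 0.375}

rng = random.Random(0)  # seeded only to make the sketch reproducible
token = sample(filtered, rng)
print(token)
```

The long-tail token "xylophone" can no longer be sampled, and the remaining mass (0.8) is rescaled so the kept probabilities sum to 1.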

Nucleus (Top-p) Sampling

Mechanism: Instead of a fixed number $k$, Top-p sampling selects the smallest set of tokens whose cumulative probability reaches or exceeds a threshold $p$ (e.g., $p = 0.9$). The probability mass is then redistributed among this "nucleus" of tokens.
Pros: Dynamically adjusts the vocabulary size based on the model's confidence. If the model is confident (sharp distribution), it might only sample from 2 words. If unsure (flat distribution), it might sample from 50 words. This produces highly fluent and diverse text.
Cons: Slightly more expensive than Top-k, since the probabilities must be sorted and cumulatively summed at every step. In practice this overhead is small, and Top-p is widely used as a default for open-ended generation.
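
A minimal sketch of the nucleus filter, using two made-up distributions to show how the nucleus size adapts. The helper name `top_p_filter` is illustrative, and stopping once the cumulative mass reaches `p` (rather than strictly exceeds it) is an assumption of this sketch.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= p:  # assumption: stop as soon as the threshold is reached
            break
    return {tok: prob / cumulative for tok, prob in nucleus}

# Hypothetical sharp distribution: the model is confident.
sharp = top_p_filter({"Paris": 0.95, "Lyon": 0.03, "Nice": 0.02}, p=0.9)
# Hypothetical flat distribution: the model is unsure.
flat = top_p_filter({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}, p=0.9)

print(len(sharp), len(flat))  # 1 4
```

The same threshold keeps one token in the sharp case and all four in the flat case, which a fixed $k$ cannot do.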

3. Instruction-Tuned Large Language Models

Base LLMs are trained merely to predict the next word (e.g., completing a prompt like "The capital of France is" with "Paris"). However, they struggle to follow user commands. To bridge this gap, models undergo Instruction Tuning.

Supervised Fine-Tuning (SFT): The base model is fine-tuned on thousands of high-quality, human-annotated examples of instructions and their desired outputs (e.g., Prompt: Summarize this text. -> Output: [Summary]).
RLHF (Reinforcement Learning from Human Feedback): To further align the model with human preferences (Helpful, Honest, Harmless):
1. The model generates multiple responses to an instruction.
2. Human annotators rank these responses.
3. A Reward Model is trained to predict these human rankings.
4. The LLM is optimized using Reinforcement Learning (typically PPO - Proximal Policy Optimization) to maximize the reward.
Result: The model transitions from a "document completer" to a conversational agent (e.g., ChatGPT, Claude) capable of following complex, multi-constraint instructions.
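
Step 3 can be made concrete with a small sketch. The notes do not specify how the Reward Model is trained. One common formulation, assumed here, is a pairwise (Bradley-Terry style) loss that pushes the reward of the human-preferred response above that of the rejected one. The reward values below are made up.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-diff))."""
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))

# Hypothetical reward-model scores for two responses to the same instruction.
agrees_with_humans = pairwise_preference_loss(2.0, 0.0)     # preferred response scored higher
disagrees_with_humans = pairwise_preference_loss(0.0, 2.0)  # preferred response scored lower
undecided = pairwise_preference_loss(1.0, 1.0)               # equal scores: loss is log(2)

print(agrees_with_humans, undecided, disagrees_with_humans)
```

The loss is small when the reward model ranks the pair the same way the annotators did and grows when it ranks them the other way, so minimizing it teaches the model to reproduce the human rankings.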

4. Model Behaviors in Specific Tasks

Summarization

Behavior: LLMs excel at abstractive summarization (generating new text to summarize content) rather than extractive (copy-pasting sentences).
Challenges: The "lost in the middle" phenomenon: LLMs often attend heavily to the beginning and end of a long prompt but make poor use of the middle context. They can also hallucinate details not present in the source text.

Dialogue Generation

Behavior: Instruction-tuned LLMs can maintain personas, track context over multiple turns, and adjust tone.
Mechanism: Dialogue history is concatenated into a single long prompt fed to the model at each turn.
Challenges: Context window limits. As conversations grow, older context must be truncated or summarized. Over-apologizing or sycophancy (agreeing with the user even when the user is wrong) is a common byproduct of RLHF.
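
A minimal sketch of the concatenate-and-truncate mechanism described above. Counting whitespace-separated words is a stand-in for a real tokenizer, and `build_prompt`, the role labels, and the token budget are all hypothetical.

```python
def build_prompt(system, turns, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep the system message plus as many of the most recent turns as fit."""
    budget = max_tokens - count_tokens(system)
    kept = []
    for role, text in reversed(turns):  # walk from newest to oldest
        line = f"{role}: {text}"
        cost = count_tokens(line)
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(line)
        budget -= cost
    return "\n".join([system] + kept[::-1])  # restore chronological order

system = "You are a helpful tutor."
turns = [
    ("User", "What is beam search?"),
    ("Assistant", "It keeps the top B sequences."),
    ("User", "And greedy search?"),
]
prompt = build_prompt(system, turns, max_tokens=17)
print(prompt)
```

With a budget of 17 "tokens", the oldest user turn no longer fits and is silently dropped. A summarization strategy would instead replace the dropped turns with a short summary line.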

Reasoning Tasks

Behavior: LLMs initially struggled with multi-step math or logic. However, they exhibit strong reasoning capabilities when prompted correctly.
Chain-of-Thought (CoT): Prompting the model to "think step-by-step" makes it write out intermediate steps before the final answer, so it spends more tokens (and therefore more computation) on the problem, much like using a scratchpad. This often substantially improves zero-shot and few-shot reasoning performance on complex tasks.

5. Evaluation Metrics

Evaluating generative text is notoriously difficult because there are infinitely many "correct" ways to answer an open-ended prompt.

Perplexity

Definition: An intrinsic evaluation metric that measures how well a probability model predicts a sample. It quantifies the "surprise" of the model when seeing a set of text.
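Math: Perplexity (PP) is the exponentiated average negative log-likelihood of a sequence $W = w_1, w_2, ..., w_N$:
$PP(W) = P(w_1, w_2, ..., w_N)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, ..., w_{i-1})\right)$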
Interpretation: A lower perplexity indicates the model assigns higher probability to the evaluation text (i.e., it is less surprised and better at modeling the language). However, lower perplexity does not always correlate with factual correctness or human preference.
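
A minimal sketch of the computation, assuming we already have the probability the model assigned to each token of the evaluation text. The probability lists below are made up.

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood of the observed tokens."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model that spreads probability uniformly over 4 choices at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A hypothetical model that is fairly confident about each correct token.
confident = perplexity([0.9, 0.8, 0.95])

print(uniform)    # about 4.0
print(confident)  # lower than the uniform case
```

The uniform case shows a useful intuition: a perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.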

N-Gram and Reference-Based Metrics

BLEU & ROUGE: Traditionally used for translation and summarization by measuring n-gram overlap between the generated text and a human reference.
Limitation for LLMs: These metrics penalize paraphrasing. A model might generate a perfectly good response using completely different words than the reference and still score at or near 0 on BLEU.
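
The paraphrasing problem can be seen with a simplified overlap score. The sketch below computes only clipped n-gram precision, the core ingredient of BLEU. It is not full BLEU (there is no brevity penalty and no averaging over several n-gram orders), and the sentences are made up.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference (clipped)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Clip each n-gram's count by how often it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
copy_score = ngram_precision("the cat sat on the mat", reference)
paraphrase_score = ngram_precision("a feline rested upon a rug", reference)

print(copy_score)        # 1.0
print(paraphrase_score)  # 0.0
```

The paraphrase conveys the same meaning as the reference but shares no words with it, so an overlap-based metric gives it no credit.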

Human Judgment Measures

Because automated metrics fail to capture the nuance of LLM generation, human evaluation is the gold standard.

Likert Scales: Humans rate outputs on a scale (e.g., 1-5) for fluency, coherence, relevance, and factual accuracy.
Side-by-Side (A/B) Testing: Humans are shown two model outputs and asked which is better. Aggregating these pairwise votes gives a model's "Win Rate"; leaderboards such as Chatbot Arena turn such votes into model ratings.
Helpfulness and Harmlessness (H&H): Specific criteria used during RLHF to ensure the model assists the user without generating toxic, biased, or dangerous content.
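
A minimal sketch of how a win rate could be computed from side-by-side votes. The model names and votes are made up, and counting a tie as half a win is an assumption of this sketch rather than a fixed convention.

```python
def win_rate(judgments, model):
    """Fraction of comparisons involving `model` that it won (ties count as half)."""
    score, games = 0.0, 0
    for model_a, model_b, winner in judgments:
        if model not in (model_a, model_b):
            continue
        games += 1
        if winner == model:
            score += 1.0
        elif winner == "tie":
            score += 0.5
    return score / games if games else 0.0

# Hypothetical annotator votes: (first model, second model, winner or "tie").
judgments = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_y"),
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "tie"),
]

print(win_rate(judgments, "model_x"))  # 0.625
print(win_rate(judgments, "model_y"))  # 0.375
```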

6. Explainability and Hallucination in LLMs

Hallucination

Definition: When an LLM generates text that is grammatically correct and sounds highly plausible, but is factually incorrect or nonsensical.
Causes:
- Data Memorization: The model associates concepts statistically without true semantic understanding.
- Source Conflation: Mixing up facts from two distinct entities that share similar names or contexts in the training data.
- Prompt Forcing: If a user implies a false premise, the model (due to RLHF sycophancy) might play along and invent facts to support it.
Mitigation:
- Retrieval-Augmented Generation (RAG): Connecting the LLM to an external knowledge source (e.g., a document index or a search engine). Relevant documents are retrieved first and placed in the prompt, and the model is instructed to ground its answer in those documents.
- Temperature Scaling: Lowering the temperature to reduce randomness in fact-based queries.
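
A minimal sketch of temperature scaling: the logits are divided by a temperature $T$ before the softmax. The logits below are made up.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
cool = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
warm = softmax_with_temperature(logits, temperature=2.0)

print(cool[0], neutral[0], warm[0])  # top token's share shrinks as T grows
```

Lowering the temperature concentrates probability on the top token, so sampling becomes more deterministic. It reduces randomness but does not by itself make the top token factually correct.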

Explainability

LLMs act as "black boxes." Understanding why a model generated a specific token is a major area of research.

Attention Weights Analysis: Looking at which previous tokens the model paid the most "attention" to when generating the current token. However, in models with billions of parameters and dozens of layers, attention does not easily map to human logic.
Mechanistic Interpretability: Reverse-engineering neural networks to find specific circuits or neurons responsible for specific behaviors (e.g., finding the "name-mover head" that copies a subject's name in a sentence).
Post-hoc Explanations: Asking the LLM to explain its own reasoning. Caution: LLMs often suffer from rationalization; they will confidently invent a logical-sounding explanation for a decision that was actually made for entirely different, statistical reasons.
