1 $What is the primary goal of a generative NLP model?$

generative NLP models Easy

A.

To cluster documents into distinct topics

B.

To predict the sentiment of a given sentence

C.

To generate new, coherent text sequences

D.

To extract named entities from text

2 $In autoregressive text generation, how is a sequence typically generated?$

text generation strategies Easy

A.

The sequence is generated backwards from the end to the beginning

B.

All tokens are generated simultaneously in one step

C.

Tokens are selected randomly from the vocabulary

D.

Tokens are generated one by one, conditioned on previously generated tokens

3 $How does the greedy search decoding strategy choose the next token?$

greedy search Easy

A.

It selects the token with the highest probability at each step

B.

It selects the token with the lowest probability to increase diversity

C.

It looks ahead to the end of the sentence before choosing

D.

It randomly samples from the top 10 tokens

4 $What is a common drawback of using greedy search for text generation?$

greedy search Easy

A.

It requires a specialized neural network architecture

B.

It can only generate one word per minute

C.

It often leads to repetitive and highly predictable text

D.

It is too computationally expensive

5 $Which of the following best describes beam search?$

beam search Easy

A.

It keeps track of the top $k$ most probable sequences at each step

B.

It generates text by translating it back and forth between languages

C.

It randomly selects tokens until a complete sentence is formed

D.

It only keeps the single best token at each step

6 $What happens during top-k sampling in text generation?$

top-k Easy

A.

The model samples the next token only from the $k$ most likely tokens

B.

The model selects exactly $k$ tokens at every step

C.

The model generates $k$ entirely different documents

D.

The model removes the top $k$ tokens and samples from the rest

7 $If k = 1 in top-k sampling, the strategy becomes exactly equivalent to which other decoding method?$

top-k Easy

A.

Nucleus sampling

B.

Random sampling

C.

Greedy search

D.

Beam search
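
Worked sketch for the two top-k questions above: a minimal, standard-library illustration of top-k filtering. The logits and the helper names (`softmax`, `top_k_filter`, `sample_top_k`) are hypothetical and chosen only for illustration; real decoders operate on tensors over the full vocabulary.

```python
import math
import random

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top_ids)
    return {i: probs[i] / mass for i in top_ids}

def sample_top_k(probs, k, rng=random):
    """Sample one token id from the renormalized top-k pool."""
    pool = top_k_filter(probs, k)
    ids = list(pool)
    return rng.choices(ids, weights=[pool[i] for i in ids], k=1)[0]

# Hypothetical next-token logits over a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = softmax(logits)

greedy_id = max(range(len(probs)), key=lambda i: probs[i])
top1_pool = top_k_filter(probs, k=1)  # one candidate holding all the mass
top3_pool = top_k_filter(probs, k=3)  # three candidates, renormalized
```

With `k=1` the pool contains only the arg-max token, so sampling from it always returns the same token that greedy search would pick.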

8 $What is another common name for nucleus sampling?$

nucleus sampling Easy

A.

Top-k sampling

B.

Greedy sampling

C.

Bottom-up sampling

D.

Top-p sampling

9 $Unlike top-k sampling which uses a fixed number of tokens, how does nucleus (top-p) sampling determine the pool of candidate tokens?$

nucleus sampling Easy

A.

By filtering out tokens with less than 5 characters

B.

By selecting tokens based on a fixed cumulative probability mass

C.

By picking all tokens that start with the same letter

D.

By dynamically choosing a random number at each step

10 $What does it mean for a large language model to be 'instruction-tuned'?$

instruction-tuned Easy

A.

It contains hard-coded rules for grammar instructions

B.

It is only capable of giving instructions to other AI models

C.

It has been fine-tuned on a dataset of tasks described via natural language instructions

D.

It has been trained solely on programming language instructions

11 $Which of the following is a primary characteristic of Large Language Models (LLMs)?$

large language models Easy

A.

They are exclusively rule-based expert systems

B.

They are trained on massive amounts of textual data using self-supervised learning

C.

They usually have fewer than 1 million parameters

D.

They cannot process input longer than one sentence

12 $When an LLM performs abstractive summarization, what is it primarily doing?$

model behaviors in summarization Easy

A.

Translating the text into a different language to make it shorter

B.

Generating new sentences that convey the core meaning of the source text

C.

Extracting exact sentences from the source text and pasting them together

D.

Identifying the author of the original text

13 $What is a critical requirement for a model to perform well in multi-turn dialogue generation?$

dialogue generation Easy

A.

It must maintain conversational context and memory across multiple interactions

B.

It must only provide answers using greedy search

C.

It must use a different language for every turn

D.

It must ignore the user's previous inputs to save memory

14 $Which prompting technique explicitly encourages an LLM to generate intermediate logical steps to solve complex reasoning tasks?$

reasoning tasks Easy

A.

Greedy prompting

B.

Zero-shot prompting

C.

Negative prompting

D.

Chain-of-Thought (CoT) prompting

15 $Which automated metric is widely used in NLP for evaluating the n-gram overlap between generated text and reference text?$

evaluation metrics Easy

A.

Accuracy

B.

BLEU score

C.

Mean Squared Error (MSE)

D.

F1 score for classification

16 $In the context of evaluating a language model, what does a lower perplexity score indicate?$

perplexity and human judgment measures Easy

A.

The model generates longer text sequences

B.

The model is more confused and performs poorly

C.

The model assigns higher probability to the test data, indicating better performance

D.

The model generates text at a slower speed

17 $Why are human judgment measures often considered essential for evaluating generative LLMs?$

perplexity and human judgment measures Easy

A.

Automated metrics like BLEU or perplexity do not fully capture text fluency, coherence, and safety

B.

Humans can evaluate millions of sentences faster than a computer

C.

Automated metrics perfectly capture semantic meaning but struggle with grammar

D.

Automated metrics are too computationally expensive to calculate

18 $What does the term 'hallucination' refer to in the context of Large Language Models?$

explainability and hallucination in LLMs Easy

A.

A visual glitch that occurs during the training phase of a neural network

B.

The model's ability to generate highly creative works of fiction when prompted

C.

The process of extracting hidden rules from the training data

D.

The model generating factually incorrect, nonsensical, or fabricated information as if it were true

19 $Why is 'explainability' a major challenge in modern Large Language Models?$

explainability and hallucination in LLMs Easy

A.

Because their code is generally kept secret from developers

B.

Because they act as 'black boxes' with billions of parameters, making it hard to trace how a specific output was derived

C.

Because they are rule-based systems with too many explicit rules

D.

Because they only output numerical data instead of text

20 $Which neural network architecture serves as the foundation for modern generative LLMs like GPT?$

generative NLP models Easy

A.

Long Short-Term Memory (LSTM)

B.

Support Vector Machine (SVM)

C.

Transformer Decoder

D.

Convolutional Neural Network (CNN)

21 $Which of the following best describes the core mathematical objective of standard autoregressive generative NLP models during generation?$

generative NLP models Medium

A.

Maximizing the probability of the entire sequence at once using $P(w_1, \dots, w_T)$.

B.

Minimizing the distance between the input sequence embeddings and the output sequence embeddings.

C.

Masking random tokens and predicting them using bidirectional context $P(w_t \mid w_{<t}, w_{>t})$.

D.

Predicting the next token by maximizing the conditional probability $P(w_t \mid w_1, \dots, w_{t-1})$.

22 $When generating text, modifying the 'temperature' parameter alters the probability distribution of the next token. If a temperature is applied, how does it affect the softmax output?$

text generation strategies Medium

A.

It sharpens the distribution, making the model more confident and likely to select the highest-probability tokens.

B.

It adds a constant noise value to all logits before the softmax is computed.

C.

It makes the distribution more uniform, increasing the likelihood of selecting low-probability tokens.

D.

It truncates the distribution by setting all probabilities below a threshold to zero.
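
Worked sketch for the temperature question above: a minimal illustration of how dividing the logits by a temperature before the softmax reshapes the distribution. The logits and the function name `softmax_with_temperature` are assumptions made for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits

sharp = softmax_with_temperature(logits, 0.5)  # T < 1: more peaked
base = softmax_with_temperature(logits, 1.0)   # T = 1: ordinary softmax
flat = softmax_with_temperature(logits, 2.0)   # T > 1: closer to uniform
```

In this sketch the top token's probability shrinks as the temperature rises, while the least likely token's probability grows.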

23 $A generative model is using greedy search to translate a sentence. Why might this strategy fail to find the sequence with the highest overall probability?$

greedy search Medium

A.

Greedy search always selects the longest possible sequence, penalizing shorter valid translations.

B.

Greedy search makes locally optimal choices at each time step without considering future token probabilities.

C.

Greedy search randomly samples tokens, which introduces too much variance.

D.

Greedy search requires an excessive amount of memory to maintain multiple candidate sequences.

24 $In beam search text generation with a beam width of k = 3, what exactly does the algorithm track at time step t?$

beam search Medium

A.

All possible sequences of length $t$, but only evaluating the top 3 nodes at the final step.

B.

The 3 highest probability tokens across the entire vocabulary, independently of previous steps.

C.

The 3 most probable sequences of length $t$, along with their cumulative log-probabilities.

D.

The 3 most probable tokens for the single best sequence generated up to time $t$.
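
Worked sketch for the greedy and beam search questions above: a toy two-step model in which the locally best first token does not lead to the most probable sequence. The table `MODEL` and its probabilities are invented for illustration; they do not come from any real model.

```python
import math

# Hypothetical two-step model: next-token probabilities given the prefix.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy_decode(model, steps=2):
    """Pick the single most probable token at every step."""
    prefix, prob = (), 1.0
    for _ in range(steps):
        token, p = max(model[prefix].items(), key=lambda kv: kv[1])
        prefix, prob = prefix + (token,), prob * p
    return prefix, prob

def beam_search(model, beam_width, steps=2):
    """Keep the beam_width best prefixes, scored by cumulative log-probability."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), score + math.log(p))
            for prefix, score in beams
            for tok, p in model[prefix].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

greedy_seq, greedy_prob = greedy_decode(MODEL)
best_beam_seq, best_beam_score = beam_search(MODEL, beam_width=2)[0]
```

Greedy search commits to `A` and ends with probability 0.33, whereas a beam of width 2 keeps `B` alive and finds `("B", "x")` with probability 0.36. A beam of width 1 reduces to greedy search.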

25 $Why is a length penalty often applied to the scores of candidate sequences during beam search?$

beam search Medium

A.

Because the model's vocabulary size grows exponentially with sequence length.

B.

Because beam search naturally favors generating very long, repetitive sequences.

C.

Because cumulative log-probabilities are negative, making longer sequences naturally score lower.

D.

Because the attention mechanism degrades in performance as the sequence gets longer.

26 $A text generation model applies Top-K sampling with K = 50. If the model predicts a very confident distribution where the top 3 tokens contain nearly all of the probability mass, what is a potential drawback of this Top-K approach?$

top-k Medium

A.

It still allows a small probability of selecting from the remaining 47 tokens, which might be irrelevant.

B.

It will force the model to select a token from the long tail, resulting in gibberish.

C.

It will dynamically reduce $K$ to 3 to match the probability mass.

D.

It will automatically switch to greedy search because the confidence is too high.

27 $In nucleus (Top-p) sampling with p = 0.9, how does the algorithm determine which tokens to keep in the sampling pool?$

nucleus sampling Medium

A.

It selects tokens randomly until the mean probability of the selected tokens is $0.9$.

B.

It sorts tokens by probability and selects the smallest set whose cumulative probability exceeds $0.9$.

C.

It selects the top $90\%$ of tokens in the vocabulary.

D.

It keeps all tokens with an individual probability greater than $0.9$.
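
Worked sketch for the nucleus sampling question above: a minimal filter that sorts tokens by probability and keeps the smallest prefix whose cumulative mass reaches the threshold. The two distributions and the name `nucleus_filter` are hypothetical. The sketch stops at "greater than or equal to" the threshold; some descriptions require strictly exceeding it, which only matters for exact ties.

```python
def nucleus_filter(probs, p):
    """Return the smallest set of token ids whose cumulative probability is >= p,
    with probabilities renormalized over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Hypothetical next-token distributions over a 5-token vocabulary.
peaked = [0.85, 0.08, 0.04, 0.02, 0.01]  # confident model
flat = [0.2, 0.2, 0.2, 0.2, 0.2]         # uncertain model

peaked_pool = nucleus_filter(peaked, p=0.9)  # keeps 2 tokens
flat_pool = nucleus_filter(flat, p=0.9)      # keeps all 5 tokens
```

The pool size adapts to the shape of the distribution, whereas a fixed `k` would keep the same number of candidates in both cases.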

28 $Which of the following describes a scenario where nucleus sampling (Top-p) behaves significantly differently than Top-K sampling?$

nucleus sampling Medium

A.

When the distribution shifts from highly peaked (confident) to very flat (uncertain) across different generation steps.

B.

When the probability distribution is completely uniform across the entire vocabulary.

C.

When the temperature is set to $0$, causing both to collapse to greedy search.

D.

When the generation task is a pure classification task with only two possible output tokens.

29 $What is the primary objective of instruction tuning a Large Language Model compared to standard pre-training?$

instruction-tuned Medium

A.

To teach the model grammar and syntax from raw text corpora.

B.

To convert the model from a causal decoder to a masked language encoder.

C.

To align the model's outputs with human intent by training on (instruction, response) pairs.

D.

To compress the model size by removing redundant attention heads.

30 $When performing few-shot prompting with an LLM, the model successfully adapts to a new task. What mechanism allows this adaptation without updating the model's parameters?$

large language models Medium

A.

Parameter-efficient fine-tuning (PEFT)

B.

Gradient descent

C.

In-context learning

D.

Weight quantization

31 $When utilizing a generative LLM for abstractive summarization, which of the following is the most significant risk compared to extractive summarization?$

model behaviors in summarization Medium

A.

The model might generate fluent but factually incorrect statements not present in the source.

B.

The model will inherently produce a summary longer than the original text.

C.

The model might only copy sentences verbatim without any rephrasing.

D.

The model is unable to process long documents due to vocabulary constraints.

32 $In a multi-turn dialogue generation system, how does an autoregressive LLM typically 'remember' previous conversational turns?$

dialogue generation Medium

A.

By relying exclusively on a static context vector generated during the pre-training phase.

B.

By updating its neural weights via continuous backpropagation after every user turn.

C.

By concatenating previous turns with the current input into a single prompt, up to the context window limit.

D.

By storing previous turns in a separate relational database that modifies the softmax layer.

33 $How does 'Chain-of-Thought' (CoT) prompting improve an LLM's performance on complex mathematical reasoning tasks?$

reasoning tasks Medium

A.

It alters the decoding strategy from Top-p sampling to beam search to guarantee the correct answer.

B.

It forces the model to use an external calculator API to compute the mathematical operations.

C.

It encourages the model to generate intermediate reasoning steps, allocating more computational steps before the final answer.

D.

It instructs the model to ignore intermediate steps and directly output the final number to reduce hallucination.

34 $Which of the following best describes the primary difference between how BLEU and ROUGE evaluate generated text?$

evaluation metrics Medium

A.

BLEU evaluates semantic meaning using embeddings, while ROUGE evaluates exact lexical overlap.

B.

BLEU is based on precision of n-grams, while ROUGE is traditionally based on recall of n-grams.

C.

BLEU measures grammatical correctness, while ROUGE measures factuality.

D.

BLEU is used exclusively for text summarization, while ROUGE is used for machine translation.
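
Worked sketch for the BLEU versus ROUGE question above: the same clipped n-gram overlap divided by the candidate length (precision, the BLEU-style view) or by the reference length (recall, the ROUGE-N-style view). This is only the core counting step under simplifying assumptions: full BLEU also combines several n-gram orders and applies a brevity penalty. The sentences and function names are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_overlap(candidate, reference, n):
    """Count candidate n-grams that also occur in the reference, clipped by reference counts."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    return sum(min(count, ref[g]) for g, count in cand.items())

def ngram_precision(candidate, reference, n=1):
    """Overlap divided by the number of candidate n-grams (BLEU-style)."""
    total = len(ngrams(candidate, n))
    return clipped_overlap(candidate, reference, n) / total if total else 0.0

def ngram_recall(candidate, reference, n=1):
    """Overlap divided by the number of reference n-grams (ROUGE-N-style)."""
    total = len(ngrams(reference, n))
    return clipped_overlap(candidate, reference, n) / total if total else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()
paraphrase = "a feline rested on a rug".split()

precision = ngram_precision(candidate, reference)             # 3 of 3 candidate unigrams match
recall = ngram_recall(candidate, reference)                   # 3 of 6 reference unigrams covered
paraphrase_precision = ngram_precision(paraphrase, reference) # only "on" matches
```

The short candidate scores perfect precision but only half recall, and the paraphrase scores poorly on exact overlap even though a human might judge it acceptable.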

35 $A student evaluates an LLM's response using BLEU and gets a very low score, yet human evaluators rate the response as excellent. What is the most likely reason for this discrepancy?$

evaluation metrics Medium

A.

BLEU scores increase when the generated text is shorter than the reference text, which humans dislike.

B.

BLEU penalizes text that uses synonyms and paraphrasing instead of exact n-gram matches from the reference.

C.

The human evaluators calculated perplexity instead of precision.

D.

The LLM produced a highly repetitive sequence that tricked the human evaluators.

36 $Given a sequence of words W = (w_1, \dots, w_N), perplexity (PP) is defined as PP(W) = P(w_1, \dots, w_N)^{-1/N}. What does a lower perplexity score on a test set indicate about a language model?$

perplexity and human judgment measures Medium

A.

The model aligns better with human ethical judgments and safety guidelines.

B.

The model assigns a higher probability to the test data, indicating it predicts the sequence well.

C.

The model generates text with a wider variety of vocabulary.

D.

The model assigns a lower probability to the test data, indicating it is confused.
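
Worked sketch for the perplexity question above: perplexity computed as the exponential of the average negative log-probability that a model assigns to the observed tokens, which is the standard definition. The per-token probabilities are hypothetical values, not outputs of a real model.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to each token of a test sentence.
confident = [0.5, 0.5, 0.5, 0.5]
confused = [0.1, 0.1, 0.1, 0.1]

ppl_confident = perplexity(confident)  # about 2
ppl_confused = perplexity(confused)    # about 10
```

A model that assigns higher probability to the test tokens gets the lower perplexity. Roughly, a perplexity of 10 means the model is as uncertain as if it were choosing uniformly among 10 tokens at each step.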

37 $Why is perplexity generally considered insufficient on its own for evaluating modern instruction-tuned LLMs?$

perplexity and human judgment measures Medium

A.

Perplexity measures how well the model predicts the next token in a static corpus, but not how helpful, safe, or factually accurate the generated responses are.

B.

Perplexity is bounded between 0 and 1, making it difficult to distinguish between high-performing models.

C.

Perplexity requires a human in the loop to calculate, making it too expensive to use at scale.

D.

Perplexity cannot be mathematically calculated for autoregressive models.

38 $In the context of evaluating LLM hallucination in abstractive summarization, what distinguishes an 'intrinsic hallucination' from an 'extrinsic hallucination'?$

explainability and hallucination in LLMs Medium

A.

Intrinsic hallucination is caused by hyperparameter tuning, while extrinsic hallucination is caused by biased pre-training data.

B.

Intrinsic hallucination directly contradicts information in the source text, while extrinsic hallucination introduces external information that cannot be verified from the source.

C.

Intrinsic hallucination occurs when the model introduces information that is factually false in the real world, while extrinsic hallucination is mathematically invalid logic.

D.

Intrinsic hallucination is a failure in the self-attention mechanism, while extrinsic hallucination is a failure in the feed-forward network.

39 $Which of the following techniques is most commonly used to mitigate factual hallucination in LLMs by grounding the model's responses?$

explainability and hallucination in LLMs Medium

A.

Increasing the temperature parameter during nucleus sampling.

B.

Retrieval-Augmented Generation (RAG).

C.

Decreasing the beam width during beam search decoding.

D.

Applying an absolute length penalty to the generated sequences.

40 $When attempting to explain the predictions of a Transformer-based LLM, researchers often look at attention weights. What is a widely recognized limitation of using attention weights as an explainability tool?$

explainability and hallucination in LLMs Medium

A.

Attention weights are only computed for the final layer, leaving previous layers unexplainable.

B.

Attention weights do not always correlate with feature importance or the actual causal impact on the model's output.

C.

Attention weights can only be extracted from encoder-decoder models, not decoder-only LLMs.

D.

Attention weights are binary and cannot represent the magnitude of importance.

41 $In nucleus sampling (Top-p), the model samples from the smallest set of tokens such that the sum of their probabilities is greater than or equal to p. If, and the probability distribution over the vocabulary for the next token is,,, and, what will be the effective size of the sampling vocabulary ?$

nucleus sampling Hard

A.

The sampling fails because without adding up.

B.

C.

D.

42 $Autoregressive models decoded using standard beam search often exhibit a strong bias towards shorter sequences. To mitigate this, a length penalty is introduced to the objective function: score(Y) = \frac{1}{|Y|^{\alpha}} \sum_{t=1}^{|Y|} \log P(y_t \mid y_{<t}, X). If \alpha = 1, how does this objective theoretically alter the sequence scoring compared to standard beam search?$

beam search Hard

A.

It biases the model exclusively towards sequences that have the highest possible single-token probability, regardless of length.

B.

It squares the log-probability sum, punishing longer sequences exponentially more than standard beam search.

C.

It eliminates the influence of the prior probabilities, acting as a length-invariant constant across all beams.

D.

It normalizes the cumulative log-probability by the sequence length, converting the score to the geometric mean of the token probabilities.
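
Worked sketch for the length-penalty question above, assuming the common form in which the summed token log-probabilities are divided by the sequence length raised to a power `alpha`. The two hypotheses and the function name `sequence_score` are made up for illustration.

```python
import math

def sequence_score(token_probs, alpha=0.0):
    """Sum of token log-probabilities divided by length**alpha.
    alpha=0 gives the raw cumulative log-probability."""
    log_prob = sum(math.log(p) for p in token_probs)
    return log_prob / (len(token_probs) ** alpha)

short_hyp = [0.5, 0.5]            # hypothetical 2-token hypothesis
long_hyp = [0.7, 0.7, 0.7, 0.7]   # hypothetical 4-token hypothesis

raw_short = sequence_score(short_hyp)        # about -1.386
raw_long = sequence_score(long_hyp)          # about -1.427
norm_short = sequence_score(short_hyp, 1.0)  # about -0.693
norm_long = sequence_score(long_hyp, 1.0)    # about -0.357
```

Without normalization the shorter hypothesis wins even though each of its tokens is less probable. With `alpha=1` the score is the mean log-probability, and its exponential is the geometric mean of the token probabilities (0.7 for the longer hypothesis here).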

43 $During text generation, consider a scenario where the next-token distribution is completely uniform across a large vocabulary V. If we transition from Top-k sampling with k = 50 to nucleus sampling with a fixed threshold p, what happens to the size of the restricted vocabulary from which we sample?$

top-k Hard

A.

The size will collapse to 1, effectively becoming greedy search.

B.

The size will remain exactly 50 because uniform distributions bypass the cumulative probability condition.

C.

The size will increase if the vocabulary is large enough, because nucleus sampling dynamically adjusts to the entropy of the uniform distribution.

D.

The size will strictly decrease, regardless of the size of the vocabulary.

44 $Instruction-tuned Large Language Models are often refined using Reinforcement Learning from Human Feedback (RLHF). During the PPO (Proximal Policy Optimization) phase, a Kullback-Leibler (KL) divergence penalty is typically added to the reward. What is the primary analytical purpose of this KL penalty?$

instruction-tuned Hard

A.

To prevent the policy model from moving too far from the original Supervised Fine-Tuned (SFT) model, mitigating 'reward hacking' and catastrophic forgetting.

B.

To enforce syntactic similarity between the generated response and the human-provided reference response.

C.

To maximize the entropy of the generated sequences, ensuring the model maintains a diverse vocabulary.

D.

To decrease the computational overhead of the reward model by bounding the policy gradients.

45 $When evaluating an LLM on a reasoning task using the BLEU score, the resulting score is extremely low, yet human evaluation shows the model's reasoning is perfectly accurate. Which of the following best explains this divergence?$

evaluation metrics Hard

A.

BLEU computes recall rather than precision, which fails to capture the generative completeness of a reasoning path.

B.

BLEU measures exact n-gram overlap; logical reasoning tasks can have multiple valid structural phrasing paths that share no n-grams with the reference.

C.

BLEU heavily penalizes the generation of short chains of thought, even if the final answer is correct.

D.

BLEU requires multiple references to compute the brevity penalty properly, which is impossible in reasoning tasks.

46 $Perplexity (PPL) is a standard evaluation metric for language models, defined as PPL(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i})\right). However, models with lower perplexity on a validation set do not always generate text that humans judge as higher quality. Which phenomenon best explains this paradox?$

perplexity and human judgment measures Hard

A.

Lower perplexity models often suffer from 'exposure bias', preventing them from generating tokens outside the validation set.

B.

Perplexity is evaluated using teacher forcing on human-written text, which does not penalize the model for entering repetitive loops during free-form autoregressive generation.

C.

Human judgment exclusively favors high-entropy distributions, whereas lower perplexity indicates maximum entropy.

D.

Perplexity only measures the precision of the generated text, ignoring recall which is highly valued by human evaluators.

47 $Chain-of-Thought (CoT) prompting significantly improves performance on reasoning tasks compared to standard prompting. From a computational complexity perspective of standard Transformer-based LLMs, why does generating intermediate reasoning steps increase the model's problem-solving capability?$

reasoning tasks Hard

A.

It forces the model to use exact n-gram matching with the prompt, preventing hallucination in reasoning chains.

B.

It allows the model to modify its own internal weights dynamically during the forward pass.

C.

It effectively increases the computational depth allocated to a problem, as each generated token provides another complete forward pass through the model's layers.

D.

It bypasses the self-attention bottleneck by attending only to the prompt and the final answer token.

48 $Consider an autoregressive language model generating a sequence using greedy search. The generated text gets stuck in an infinite loop (e.g., 'I went to the store to the store to the store...'). Which mathematical characteristic of the model's learned distribution most directly contributes to this greedy decoding failure?$

greedy search Hard

A.

The context window is strictly larger than the loop length, preventing attention heads from attending to previous instances of the loop.

B.

The length penalty is set to a negative value, forcing the model to repeat n-grams.

C.

The model forms an absorbing Markov chain state where the high conditional probability of continuing the repeated phrase creates a deterministic local optimum that outscores escaping it.

D.

The token probabilities are perfectly uniformly distributed.

49 $In the context of LLM hallucination, researchers distinguish between 'intrinsic' and 'extrinsic' hallucinations. An LLM generated a summary stating: 'The CEO of OpenAI, Sam Altman, announced a new model in Paris.' If the source text mentioned the announcement but did not state the location, how is this hallucination classified and why is it notoriously difficult to penalize using standard cross-entropy training?$

explainability and hallucination in LLMs Hard

A.

Extrinsic hallucination; cross-entropy maximizes likelihood based on training priors (where announcements often happen in major cities), penalizing the model for factual abstention.

B.

Intrinsic hallucination; the model lacks explicit causal attention heads.

C.

Extrinsic hallucination; standard cross-entropy training cannot be applied to summarization tasks.

D.

Intrinsic hallucination; cross-entropy forces the model to ignore factual contradictions.

50 $A persistent issue in dialogue generation is 'exposure bias'. Which of the following training paradigms is specifically designed to mitigate exposure bias by bridging the gap between training and inference distributions?$

dialogue generation Hard

A.

Teacher Forcing, where the model is strictly trained using the ground-truth previous tokens to stabilize gradients.

B.

Byte-Pair Encoding (BPE), which reduces out-of-vocabulary tokens during inference.

C.

Scheduled Sampling, where the model is increasingly fed its own predictions instead of the ground-truth tokens during training.

D.

Knowledge Distillation, where a smaller model learns from the logits of a larger dialogue model.

51 $In the context of Large Language Models, 'emergent abilities' are capabilities that are not present in smaller models but suddenly appear when the model scale reaches a certain threshold. Which of the following provides the most statistically rigorous critique of emergent abilities as proposed by some recent NLP literature?$

large language models Hard

A.

Emergent abilities only occur in models utilizing Mixture of Experts (MoE) architectures.

B.

The 'emergence' is often an artifact of using non-linear, discontinuous evaluation metrics (like exact match) rather than smooth, continuous metrics (like Brier score or cross-entropy).

C.

Emergent abilities are strictly a result of catastrophic forgetting of simple syntax in favor of complex semantics.

D.

Scaling laws predict that all models will forget reasoning abilities if trained for more than one epoch.

52 $Contrastive Search is a text generation strategy introduced to prevent degeneration in LLMs. Its objective function at step t is formulated to select a token that maximizes model confidence while minimizing what specific component?$

text generation strategies Hard

A.

The absolute length of the generated sequence.

B.

The cosine similarity between the representation of the candidate token and the representations of all previously generated tokens.

C.

The attention weight assigned to the prompt tokens.

D.

The KL divergence between the current token distribution and the uniform distribution.

53 $When leveraging LLMs for abstractive summarization, a phenomenon known as 'lead bias' is frequently observed. If an LLM is heavily exhibiting lead bias, how will this manifest in its outputs, and how does the self-attention mechanism theoretically contribute to it in long documents?$

model behaviors in summarization Hard

A.

The model tends to summarize only the concluding paragraphs; attention mechanism decays over long distances, ignoring early tokens.

B.

The model generates repetitive filler phrases at the lead of the summary; attention heads fail to normalize weights.

C.

The model focuses disproportionately on the beginning of the source text; early tokens act as 'attention sinks' and are heavily attended to across all layers.

D.

The model exclusively extracts verbatim sentences rather than paraphrasing; self-attention strictly enforces exact match routing.

54 $A known empirical issue with beam search in neural machine translation and other generative tasks is the 'beam search curse', where increasing the beam size beyond a certain point degrades the BLEU score. What is the primary cause of this degradation?$

beam search Hard

A.

Larger beams require a negative length penalty, punishing the model for outputting any text.

B.

Larger beams find sequences with higher global log-probability, but the model's probability distribution is poorly calibrated and actually assigns the highest probabilities to overly short, inadequate sequences.

C.

Larger beams cause the model to exceed the context window, truncating the output.

D.

Larger beam sizes force the algorithm into a greedy search paradigm, removing diversity.

55 $Which of the following describes a key architectural difference between Causal Language Models (like GPT-3) and Masked Language Models (like BERT) that fundamentally makes Causal LMs better suited for zero-shot generative prompting?$

generative NLP models Hard

A.

Masked LMs use absolute positional embeddings, whereas Causal LMs use no positional embeddings, allowing infinite text generation.

B.

Causal LMs possess a bidirectional encoder that processes the prompt perfectly before generating text, while Masked LMs can only process text unidirectionally.

C.

Causal LMs optimize the entire sequence probability simultaneously using a CRF layer, allowing coherent long-form generation.

D.

Causal LMs use a strictly lower-triangular causal mask in self-attention, naturally aligning their pre-training objective with left-to-right autoregressive text generation.
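
Worked sketch for the causal-mask question above: a lower-triangular mask applied to attention scores before the row-wise softmax, so that each position can attend only to itself and earlier positions. The all-zero score matrix is a placeholder; real models compute scores from query and key projections, and the function names here are hypothetical.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_attention_weights(scores):
    """Row-wise softmax over scores, with future positions set to -inf first."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        row = [scores[i][j] if mask[i][j] else -math.inf for j in range(n)]
        m = max(row)  # finite, because the diagonal is never masked
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

mask = causal_mask(3)
weights = masked_attention_weights([[0.0] * 3 for _ in range(3)])
```

Every masked (future) position receives exactly zero attention weight, which is what lets the same network be trained on next-token prediction and then used left to right at generation time.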

56 $To explain the output of a generative LLM, researchers often use Integrated Gradients. For a specific generated token y_t, Integrated Gradients computes the path integral of gradients from a baseline input x' to the actual input x. Why is computing Integrated Gradients for generative LLMs significantly more complex than for standard classification models?$

explainability and hallucination in LLMs Hard

A.

Generative models do not have an objective function during inference, so gradients cannot be calculated.

B.

The softmax function at the output layer of an LLM is not differentiable.

C.

Generative LLMs use discrete token inputs; constructing a continuous interpolation path from a baseline token embedding to the input token embedding may pass through regions of the embedding space that correspond to no valid token.

D.

Integrated Gradients can only be applied to CNNs, as self-attention matrices do not have well-defined gradients.

57 $In 'Self-Consistency' decoding for LLM reasoning tasks, multiple distinct reasoning paths are sampled, and the final answer is selected via majority vote. For self-consistency to be effective, which decoding strategy MUST be utilized during the generation of the paths?$

reasoning tasks Hard

A.

A non-deterministic sampling strategy (like temperature sampling with $T > 0$) to ensure diverse reasoning paths are generated.

B.

Greedy search, to ensure the model produces its most confident reasoning path every time.

C.

Contrastive search with a high degeneration penalty.

D.

Beam search with a beam size of 1.

58 $During the creation of an instruction-tuned model, Supervised Fine-Tuning (SFT) is typically performed before RLHF. If the SFT phase trains the model exclusively on examples formatted as User: [Query] Assistant: [Response], and a user at inference prompts the model with System: [Directive], what failure mode is most likely to occur and why?$

instruction-tuned Hard

A.

The model will generate a sequence of [PAD] tokens because the attention mechanism will crash on unrecognized text.

B.

The model will trigger an intrinsic hallucination because System is a reserved keyword in all LLM tokenizers.

C.

Out-of-distribution formatting failure; the model has learned strict structural priors during SFT and may hallucinate a User: tag or generate degraded text when the prompt does not match the exact SFT template.

D.

The model will switch to purely extractive summarization because instructions without a User: prefix are interpreted as documents.

59 $ROUGE-L uses the Longest Common Subsequence (LCS) to evaluate summarization. Let the reference be and the generation be . The length of LCS is 3 (). What is a key mathematical advantage of LCS in ROUGE-L over using purely contiguous n-gram overlaps (like ROUGE-2)?$

evaluation metrics Hard

A.

LCS requires that the matching sequence be perfectly adjacent, heavily penalizing dropped words.

B.

LCS automatically incorporates an exponential brevity penalty, replacing the need for an explicit length threshold.

C.

LCS computes semantic similarity in the embedding space rather than relying on lexical matching.

D.

LCS naturally captures sentence-level structure by identifying in-sequence matches without requiring consecutive n-gram matches, allowing flexibility for insertion of novel words.
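
Worked sketch for the ROUGE-L question above: the classic dynamic-programming computation of the Longest Common Subsequence length, compared with a contiguous bigram overlap. The sentence pair is a hypothetical example chosen for this sketch and is not the pair from the question.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def shared_bigrams(a, b):
    """Number of distinct contiguous bigrams that appear in both token lists."""
    bigrams_a = {tuple(a[i:i + 2]) for i in range(len(a) - 1)}
    bigrams_b = {tuple(b[i:i + 2]) for i in range(len(b) - 1)}
    return len(bigrams_a & bigrams_b)

reference = "the cat sat on the mat".split()
generation = "the cat quietly sat on a mat".split()

lcs = lcs_length(reference, generation)          # the, cat, sat, on, mat
bigram_hits = shared_bigrams(reference, generation)
rouge_l_recall = lcs / len(reference)
```

The inserted words "quietly" and "a" break most bigram matches, but the in-order subsequence still covers five of the six reference tokens.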

60 $A massive LLM is deployed using pipeline parallelism across multiple GPUs. If the model exhibits high latency during autoregressive token generation (decoding phase) compared to the prefill phase (processing the prompt), what is the primary architectural bottleneck causing this difference?$

large language models Hard

A.

The prefill phase operates sequentially on tokens, whereas the decoding phase evaluates all future tokens in parallel.

B.

The decoding phase must read the model weights and the full Key-Value (KV) cache from GPU memory for every single generated token, making it memory-bandwidth-bound, whereas the prefill phase is dominated by highly parallel, compute-bound matrix multiplications.

C.

Autoregressive generation uses backpropagation at every step, whereas the prefill phase only uses a forward pass.

D.

The softmax operation cannot be parallelized across multiple GPUs, meaning decoding must happen on a single CPU node.

Unit 6 - Practice Quiz