Unit 6 - Practice Quiz

CSE472 60 Questions

1 What is the primary goal of a generative NLP model?

generative NLP models Easy
A. To generate new, coherent text sequences
B. To cluster documents into distinct topics
C. To extract named entities from text
D. To predict the sentiment of a given sentence

2 In autoregressive text generation, how is a sequence typically generated?

text generation strategies Easy
A. Tokens are generated one by one, conditioned on previously generated tokens
B. The sequence is generated backwards from the end to the beginning
C. All tokens are generated simultaneously in one step
D. Tokens are selected randomly from the vocabulary

3 How does the greedy search decoding strategy choose the next token?

greedy search Easy
A. It randomly samples from the top 10 tokens
B. It selects the token with the highest probability at each step
C. It selects the token with the lowest probability to increase diversity
D. It looks ahead to the end of the sentence before choosing

4 What is a common drawback of using greedy search for text generation?

greedy search Easy
A. It is too computationally expensive
B. It often leads to repetitive and highly predictable text
C. It can only generate one word per minute
D. It requires a specialized neural network architecture

5 Which of the following best describes beam search?

beam search Easy
A. It generates text by translating it back and forth between languages
B. It only keeps the single best token at each step
C. It keeps track of the top $k$ most probable sequences at each step
D. It randomly selects tokens until a complete sentence is formed

6 What happens during top-k sampling in text generation?

top-k Easy
A. The model generates $k$ entirely different documents
B. The model samples the next token only from the $k$ most likely tokens
C. The model removes the top $k$ tokens and samples from the rest
D. The model selects exactly $k$ tokens at every step

7 If $k = 1$ in top-k sampling, the strategy becomes exactly equivalent to which other decoding method?

top-k Easy
A. Random sampling
B. Greedy search
C. Nucleus sampling
D. Beam search
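As a minimal sketch of the top-k filtering step described in these questions, the snippet below keeps the $k$ most probable tokens, renormalizes, and samples. The distribution and the helper names `top_k_filter` and `sample` are hypothetical and used only for illustration.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample(probs, rng):
    """Draw one token from a {token: probability} mapping."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution, for illustration only.
next_token_probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "xylophone": 0.05}

rng = random.Random(0)
pool = top_k_filter(next_token_probs, k=2)          # only "cat" and "dog" survive
greedy_like = top_k_filter(next_token_probs, k=1)   # pool of size one
print(pool, sample(greedy_like, rng))
```

With a pool of size one there is nothing left to sample, so the draw always returns the single most probable token.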

8 What is another common name for nucleus sampling?

nucleus sampling Easy
A. Bottom-up sampling
B. Greedy sampling
C. Top-p sampling
D. Top-k sampling

9 Unlike top-k sampling which uses a fixed number of tokens, how does nucleus (top-p) sampling determine the pool of candidate tokens?

nucleus sampling Easy
A. By filtering out tokens with less than 5 characters
B. By picking all tokens that start with the same letter
C. By selecting tokens based on a fixed cumulative probability mass
D. By dynamically choosing a random number at each step

10 What does it mean for a large language model to be 'instruction-tuned'?

instruction-tuned Easy
A. It has been trained solely on programming language instructions
B. It is only capable of giving instructions to other AI models
C. It contains hard-coded rules for grammar instructions
D. It has been fine-tuned on a dataset of tasks described via natural language instructions

11 Which of the following is a primary characteristic of Large Language Models (LLMs)?

large language models Easy
A. They usually have fewer than 1 million parameters
B. They are exclusively rule-based expert systems
C. They cannot process input longer than one sentence
D. They are trained on massive amounts of textual data using self-supervised learning

12 When an LLM performs abstractive summarization, what is it primarily doing?

model behaviors in summarization Easy
A. Identifying the author of the original text
B. Generating new sentences that convey the core meaning of the source text
C. Translating the text into a different language to make it shorter
D. Extracting exact sentences from the source text and pasting them together

13 What is a critical requirement for a model to perform well in multi-turn dialogue generation?

dialogue generation Easy
A. It must only provide answers using greedy search
B. It must ignore the user's previous inputs to save memory
C. It must maintain conversational context and memory across multiple interactions
D. It must use a different language for every turn

14 Which prompting technique explicitly encourages an LLM to generate intermediate logical steps to solve complex reasoning tasks?

reasoning tasks Easy
A. Chain-of-Thought (CoT) prompting
B. Negative prompting
C. Zero-shot prompting
D. Greedy prompting

15 Which automated metric is widely used in NLP for evaluating the n-gram overlap between generated text and reference text?

evaluation metrics Easy
A. Mean Squared Error (MSE)
B. Accuracy
C. BLEU score
D. F1 score for classification

16 In the context of evaluating a language model, what does a lower perplexity score indicate?

perplexity and human judgment measures Easy
A. The model generates text at a slower speed
B. The model generates longer text sequences
C. The model is more confused and performs poorly
D. The model assigns higher probability to the test data, indicating better performance

17 Why are human judgment measures often considered essential for evaluating generative LLMs?

perplexity and human judgment measures Easy
A. Automated metrics are too computationally expensive to calculate
B. Automated metrics like BLEU or perplexity do not fully capture text fluency, coherence, and safety
C. Humans can evaluate millions of sentences faster than a computer
D. Automated metrics perfectly capture semantic meaning but struggle with grammar

18 What does the term 'hallucination' refer to in the context of Large Language Models?

explainability and hallucination in LLMs Easy
A. The process of extracting hidden rules from the training data
B. The model generating factually incorrect, nonsensical, or fabricated information as if it were true
C. A visual glitch that occurs during the training phase of a neural network
D. The model's ability to generate highly creative works of fiction when prompted

19 Why is 'explainability' a major challenge in modern Large Language Models?

explainability and hallucination in LLMs Easy
A. Because their code is generally kept secret from developers
B. Because they only output numerical data instead of text
C. Because they are rule-based systems with too many explicit rules
D. Because they act as 'black boxes' with billions of parameters, making it hard to trace how a specific output was derived

20 Which neural network architecture serves as the foundation for modern generative LLMs like GPT?

generative NLP models Easy
A. Convolutional Neural Network (CNN)
B. Long Short-Term Memory (LSTM)
C. Support Vector Machine (SVM)
D. Transformer Decoder

21 Which of the following best describes the core mathematical objective of standard autoregressive generative NLP models during generation?

generative NLP models Medium
A. Predicting the next token by maximizing the conditional probability $P(x_t \mid x_{<t})$.
B. Minimizing the distance between the input sequence embeddings and the output sequence embeddings.
C. Masking random tokens and predicting them using bidirectional context, $P(x_t \mid x_{\setminus t})$.
D. Maximizing the probability of the entire sequence at once using $P(x_1, \dots, x_T)$.

22 When generating text, modifying the 'temperature' parameter alters the probability distribution of the next token. If a temperature is applied, how does it affect the softmax output?

text generation strategies Medium
A. It truncates the distribution by setting all probabilities below $T$ to zero.
B. It adds a constant noise value to all logits before the softmax is computed.
C. It sharpens the distribution, making the model more confident and likely to select the highest-probability tokens.
D. It makes the distribution more uniform, increasing the likelihood of selecting low-probability tokens.
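The effect of temperature on the softmax can be checked numerically. The sketch below divides the logits by a temperature before the softmax; the logits are made-up values, and the function name `softmax_with_temperature` is an assumption for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                  # hypothetical logits
sharp = softmax_with_temperature(logits, 0.5)   # temperature below 1
base = softmax_with_temperature(logits, 1.0)    # plain softmax
flat = softmax_with_temperature(logits, 2.0)    # temperature above 1
print(sharp, base, flat)
```

Temperatures below 1 concentrate mass on the top token, while temperatures above 1 move mass toward the low-probability tokens.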

23 A generative model is using greedy search to translate a sentence. Why might this strategy fail to find the sequence with the highest overall probability?

greedy search Medium
A. Greedy search always selects the longest possible sequence, penalizing shorter valid translations.
B. Greedy search requires an excessive amount of memory to maintain multiple candidate sequences.
C. Greedy search randomly samples tokens, which introduces too much variance.
D. Greedy search makes locally optimal choices at each time step without considering future token probabilities.
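A small worked example can make the gap between step-wise and sequence-level probability concrete. The two-step distribution below is entirely hypothetical; it is chosen so that the most probable first token leads to a less probable full sequence.

```python
# Hypothetical two-step model: P(first token) and P(second token | first token).
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {"A": {"x": 0.5, "y": 0.5}, "B": {"z": 0.9, "w": 0.1}}

def greedy_decode():
    """Pick the most probable token at each step, never looking ahead."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second), FIRST[first] * SECOND[first][second]

def exhaustive_decode():
    """Score every two-token sequence and return the most probable one."""
    candidates = {
        (f, s): FIRST[f] * p for f in FIRST for s, p in SECOND[f].items()
    }
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

print(greedy_decode())      # starts with "A" because 0.6 > 0.4
print(exhaustive_decode())  # finds the sequence starting with "B"
```

Greedy decoding commits to "A" and ends with a sequence probability of about 0.30, while the sequence beginning with "B" reaches about 0.36.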

24 In beam search text generation with a beam width of $k = 3$, what exactly does the algorithm track at time step $t$?

beam search Medium
A. All possible sequences of length $t$, but only evaluating the top 3 nodes at the final step.
B. The 3 highest probability tokens across the entire vocabulary, independently of previous steps.
C. The 3 most probable sequences of length $t$, along with their cumulative log-probabilities.
D. The 3 most probable tokens for the single best sequence generated up to time $t$.
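The bookkeeping in beam search can be sketched in a few lines. This is a minimal illustration, not a production decoder: `next_token_probs` is a hypothetical callback standing in for a model, and `toy_model` is a made-up distribution.

```python
import math

def beam_search(next_token_probs, beam_width, steps):
    """Return the `beam_width` best (sequence, log-probability) pairs.

    `next_token_probs(prefix)` is a hypothetical callback returning a
    {token: probability} mapping for the next token given a tuple prefix.
    """
    beams = [((), 0.0)]                  # (sequence so far, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        # Keep only the top `beam_width` partial sequences of the new length.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_model(prefix):
    """Hypothetical two-step distribution used only for illustration."""
    if not prefix:
        return {"A": 0.6, "B": 0.4}
    return {"x": 0.5, "y": 0.5} if prefix[-1] == "A" else {"z": 0.9, "w": 0.1}

print(beam_search(toy_model, beam_width=1, steps=2))  # behaves like greedy search
print(beam_search(toy_model, beam_width=2, steps=2))  # also keeps the "B" branch
```

With a width of 1 the search follows the greedy path; with a width of 2 the lower-ranked first token survives long enough for its stronger continuation to win.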

25 Why is a length penalty often applied to the scores of candidate sequences during beam search?

beam search Medium
A. Because the model's vocabulary size grows exponentially with sequence length.
B. Because the attention mechanism degrades in performance as the sequence gets longer.
C. Because beam search naturally favors generating very long, repetitive sequences.
D. Because cumulative log-probabilities are negative, making longer sequences naturally score lower.

26 A text generation model applies Top-K sampling with $K = 50$. If the model predicts a very confident distribution where the top 3 tokens contain nearly all of the probability mass, what is a potential drawback of this Top-K approach?

top-k Medium
A. It will dynamically reduce $K$ to 3 to match the probability mass.
B. It will force the model to select a token from the long tail, resulting in gibberish.
C. It will automatically switch to greedy search because the confidence is too high.
D. It still allows a small probability of selecting from the remaining 47 tokens, which might be irrelevant.

27 In nucleus (Top-p) sampling with $p = 0.9$, how does the algorithm determine which tokens to keep in the sampling pool?

nucleus sampling Medium
A. It keeps all tokens with an individual probability greater than $0.9$.
B. It selects tokens randomly until the mean probability of the selected tokens is $0.9$.
C. It selects the top $90\%$ of the vocabulary.
D. It sorts tokens by probability and selects the smallest set whose cumulative probability exceeds $0.9$.
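A minimal sketch of the nucleus filtering step follows. It uses a "cumulative probability reaches $p$" cutoff, which is one common convention (an assumption here; some descriptions use a strict inequality), and both distributions are invented for illustration.

```python
def nucleus_filter(probs, p):
    """Smallest set of top tokens whose cumulative probability reaches p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:               # stop as soon as the mass threshold is met
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical next-token distributions: one confident, one uncertain.
peaked = {"the": 0.92, "a": 0.05, "an": 0.02, "this": 0.01}
flat = {"the": 0.3, "a": 0.25, "an": 0.25, "this": 0.2}

print(len(nucleus_filter(peaked, 0.9)))   # pool shrinks when the model is confident
print(len(nucleus_filter(flat, 0.9)))     # pool grows when the model is uncertain
```

The pool size adapts to the shape of the distribution, whereas a fixed-size top-k pool would be the same in both cases.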

28 Which of the following describes a scenario where nucleus sampling (Top-p) behaves significantly differently than Top-K sampling?

nucleus sampling Medium
A. When the probability distribution is completely uniform across the entire vocabulary.
B. When the temperature is set to $0$, causing both to collapse to greedy search.
C. When the distribution shifts from highly peaked (confident) to very flat (uncertain) across different generation steps.
D. When the generation task is a pure classification task with only two possible output tokens.

29 What is the primary objective of instruction tuning a Large Language Model compared to standard pre-training?

instruction-tuned Medium
A. To teach the model grammar and syntax from raw text corpora.
B. To convert the model from a causal decoder to a masked language encoder.
C. To compress the model size by removing redundant attention heads.
D. To align the model's outputs with human intent by training on (instruction, response) pairs.

30 When performing few-shot prompting with an LLM, the model successfully adapts to a new task. What mechanism allows this adaptation without updating the model's parameters?

large language models Medium
A. Parameter-efficient fine-tuning (PEFT)
B. In-context learning
C. Weight quantization
D. Gradient descent

31 When utilizing a generative LLM for abstractive summarization, which of the following is the most significant risk compared to extractive summarization?

model behaviors in summarization Medium
A. The model is unable to process long documents due to vocabulary constraints.
B. The model will inherently produce a summary longer than the original text.
C. The model might generate fluent but factually incorrect statements not present in the source.
D. The model might only copy sentences verbatim without any rephrasing.

32 In a multi-turn dialogue generation system, how does an autoregressive LLM typically 'remember' previous conversational turns?

dialogue generation Medium
A. By updating its neural weights via continuous backpropagation after every user turn.
B. By storing previous turns in a separate relational database that modifies the softmax layer.
C. By concatenating previous turns with the current input into a single prompt, up to the context window limit.
D. By relying exclusively on a static context vector generated during the pre-training phase.
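One way to picture how earlier turns reach the model is to build the prompt explicitly. The sketch below is an assumption-heavy illustration: the `User:`/`Assistant:` labels, the `build_prompt` helper, and counting tokens by whitespace are all simplifications rather than any particular system's format.

```python
def build_prompt(turns, new_user_message, max_tokens):
    """Concatenate as many recent turns as fit in a token budget.

    Tokens are approximated by whitespace-separated words (an assumption);
    the newest message is always kept, older turns are dropped first.
    """
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    lines.append(f"User: {new_user_message}")
    kept, used = [], 0
    for line in reversed(lines):          # walk from newest to oldest
        cost = len(line.split())
        if kept and used + cost > max_tokens:
            break
        kept.append(line)
        used += cost
    return "\n".join(reversed(kept)) + "\nAssistant:"

history = [("User", "hi there"), ("Assistant", "hello, how can I help?")]
print(build_prompt(history, "what is beam search?", max_tokens=100))
print(build_prompt(history, "what is beam search?", max_tokens=6))
```

When the budget is too small, the oldest turns fall out of the prompt, and the model has no access to them on that turn.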

33 How does 'Chain-of-Thought' (CoT) prompting improve an LLM's performance on complex mathematical reasoning tasks?

reasoning tasks Medium
A. It encourages the model to generate intermediate reasoning steps, allocating more computational steps before the final answer.
B. It instructs the model to ignore intermediate steps and directly output the final number to reduce hallucination.
C. It forces the model to use an external calculator API to compute the mathematical operations.
D. It alters the decoding strategy from Top-p sampling to beam search to guarantee the correct answer.

34 Which of the following best describes the primary difference between how BLEU and ROUGE evaluate generated text?

evaluation metrics Medium
A. BLEU measures grammatical correctness, while ROUGE measures factuality.
B. BLEU evaluates semantic meaning using embeddings, while ROUGE evaluates exact lexical overlap.
C. BLEU is based on precision of n-grams, while ROUGE is traditionally based on recall of n-grams.
D. BLEU is used exclusively for text summarization, while ROUGE is used for machine translation.
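The precision-versus-recall distinction can be seen on a single n-gram order. The sketch below computes clipped n-gram overlap in both directions; it deliberately omits the rest of full BLEU (multiple n-gram orders, brevity penalty) and full ROUGE, and the sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_precision_recall(candidate, reference, n=1):
    """Clipped n-gram overlap as precision (BLEU-style) and recall (ROUGE-N-style)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())          # counts clipped to the reference
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

print(overlap_precision_recall(candidate, reference))
```

Every candidate unigram appears in the reference, so precision is perfect, but the candidate covers only half of the reference, so recall is 0.5.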

35 A student evaluates an LLM's response using BLEU and gets a very low score, yet human evaluators rate the response as excellent. What is the most likely reason for this discrepancy?

evaluation metrics Medium
A. BLEU penalizes text that uses synonyms and paraphrasing instead of exact n-gram matches from the reference.
B. The LLM produced a highly repetitive sequence that tricked the human evaluators.
C. BLEU scores increase when the generated text is shorter than the reference text, which humans dislike.
D. The human evaluators calculated perplexity instead of precision.

36 Given a sequence of words $W = w_1, w_2, \dots, w_N$, perplexity (PP) is defined as $PP(W) = P(w_1, w_2, \dots, w_N)^{-1/N}$. What does a lower perplexity score on a test set indicate about a language model?

perplexity and human judgment measures Medium
A. The model assigns a higher probability to the test data, indicating it predicts the sequence well.
B. The model aligns better with human ethical judgments and safety guidelines.
C. The model generates text with a wider variety of vocabulary.
D. The model assigns a lower probability to the test data, indicating it is confused.
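Assuming the usual definition of perplexity as the exponential of the average negative log-probability per token, the relationship between assigned probability and perplexity can be checked directly. The per-token probabilities below are made up.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    n = len(token_probs)
    avg_neg_log = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log)

# Hypothetical probabilities a model assigned to each token of a test sequence.
confident = [0.5, 0.5, 0.5, 0.5]
unsure = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident))   # about 2: like choosing among 2 options per token
print(perplexity(unsure))      # about 10: like choosing among 10 options per token
```

Higher probability on the test tokens gives lower perplexity, which can be read as a smaller effective number of choices per token.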

37 Why is perplexity generally considered insufficient on its own for evaluating modern instruction-tuned LLMs?

perplexity and human judgment measures Medium
A. Perplexity is bounded between 0 and 1, making it difficult to distinguish between high-performing models.
B. Perplexity measures how well the model predicts the next token in a static corpus, but not how helpful, safe, or factually accurate the generated responses are.
C. Perplexity cannot be mathematically calculated for autoregressive models.
D. Perplexity requires a human in the loop to calculate, making it too expensive to use at scale.

38 In the context of evaluating LLM hallucination in abstractive summarization, what distinguishes an 'intrinsic hallucination' from an 'extrinsic hallucination'?

explainability and hallucination in LLMs Medium
A. Intrinsic hallucination directly contradicts information in the source text, while extrinsic hallucination introduces external information that cannot be verified from the source.
B. Intrinsic hallucination occurs when the model introduces information that is factually false in the real world, while extrinsic hallucination is mathematically invalid logic.
C. Intrinsic hallucination is caused by hyperparameter tuning, while extrinsic hallucination is caused by biased pre-training data.
D. Intrinsic hallucination is a failure in the self-attention mechanism, while extrinsic hallucination is a failure in the feed-forward network.

39 Which of the following techniques is most commonly used to mitigate factual hallucination in LLMs by grounding the model's responses?

explainability and hallucination in LLMs Medium
A. Applying an absolute length penalty to the generated sequences.
B. Increasing the temperature parameter during nucleus sampling.
C. Retrieval-Augmented Generation (RAG).
D. Decreasing the beam width during beam search decoding.

40 When attempting to explain the predictions of a Transformer-based LLM, researchers often look at attention weights. What is a widely recognized limitation of using attention weights as an explainability tool?

explainability and hallucination in LLMs Medium
A. Attention weights do not always correlate with feature importance or the actual causal impact on the model's output.
B. Attention weights are binary and cannot represent the magnitude of importance.
C. Attention weights are only computed for the final layer, leaving previous layers unexplainable.
D. Attention weights can only be extracted from encoder-decoder models, not decoder-only LLMs.

41 In nucleus sampling (Top-$p$), the model samples from the smallest set of tokens such that the sum of their probabilities is greater than or equal to $p$. If , and the probability distribution over the vocabulary for the next token is , , , and , what will be the effective size of the sampling vocabulary ?

nucleus sampling Hard
A.
B. The sampling fails because without adding up.
C.
D.

42 Autoregressive models decoded using standard beam search often exhibit a strong bias towards shorter sequences. To mitigate this, a length penalty is introduced to the objective function: $\text{score}(Y) = \frac{1}{|Y|^{\alpha}} \sum_{t=1}^{|Y|} \log P(y_t \mid y_{<t}, X)$. If $\alpha = 1$, how does this objective theoretically alter the sequence scoring compared to standard beam search?

beam search Hard
A. It eliminates the influence of the prior probabilities, acting as a length-invariant constant across all beams.
B. It squares the log-probability sum, punishing longer sequences exponentially more than standard beam search.
C. It normalizes the cumulative log-probability by the sequence length, converting the score to the geometric mean of the token probabilities.
D. It biases the model exclusively towards sequences that have the highest possible single-token probability, regardless of length.
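Assuming the length-penalized score has the common form of the summed token log-probabilities divided by $|Y|^{\alpha}$, the effect of the exponent can be checked numerically. The token probabilities are invented, and `length_normalized_score` is a hypothetical helper name.

```python
import math

def length_normalized_score(token_probs, alpha=1.0):
    """Sum of token log-probabilities divided by length ** alpha."""
    log_sum = sum(math.log(p) for p in token_probs)
    return log_sum / (len(token_probs) ** alpha)

def geometric_mean(token_probs):
    """n-th root of the product of n token probabilities."""
    return math.prod(token_probs) ** (1.0 / len(token_probs))

short = [0.5, 0.5]                # hypothetical 2-token candidate
long = [0.6, 0.6, 0.6, 0.6]       # hypothetical 4-token candidate

# alpha = 0 recovers the raw summed log-probability, which favors the short one.
print(length_normalized_score(short, alpha=0), length_normalized_score(long, alpha=0))
# alpha = 1 averages per token, so the long candidate now scores higher.
print(length_normalized_score(short), length_normalized_score(long))
# Exponentiating the alpha = 1 score gives the geometric mean of token probabilities.
print(math.exp(length_normalized_score(long)), geometric_mean(long))
```

Under this assumed form, setting the exponent to 1 turns the score into a per-token average in log space.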

43 During text generation, consider a scenario where the vocabulary distribution is completely uniform across a large vocabulary $V$. If we transition from Top-$k$ sampling with $k = 50$ to nucleus sampling with , what happens to the size of the restricted vocabulary from which we sample?

top-k Hard
A. will increase if , because nucleus sampling dynamically adjusts to the entropy of the uniform distribution.
B. will remain exactly 50 because uniform distributions bypass the cumulative probability condition.
C. will collapse to 1, effectively becoming greedy search.
D. will strictly decrease, regardless of the size of .

44 Instruction-tuned Large Language Models are often refined using Reinforcement Learning from Human Feedback (RLHF). During the PPO (Proximal Policy Optimization) phase, a Kullback-Leibler (KL) divergence penalty is typically added to the reward. What is the primary analytical purpose of this KL penalty?

instruction-tuned Hard
A. To prevent the policy model from moving too far from the original Supervised Fine-Tuned (SFT) model, mitigating 'reward hacking' and catastrophic forgetting.
B. To decrease the computational overhead of the reward model by bounding the policy gradients.
C. To enforce syntactic similarity between the generated response and the human-provided reference response.
D. To maximize the entropy of the generated sequences, ensuring the model maintains a diverse vocabulary.

45 When evaluating an LLM on a reasoning task using the BLEU score, the resulting score is extremely low, yet human evaluation shows the model's reasoning is perfectly accurate. Which of the following best explains this divergence?

evaluation metrics Hard
A. BLEU heavily penalizes the generation of short chains of thought, even if the final answer is correct.
B. BLEU requires multiple references to compute the brevity penalty properly, which is impossible in reasoning tasks.
C. BLEU computes recall rather than precision, which fails to capture the generative completeness of a reasoning path.
D. BLEU measures exact n-gram overlap; logical reasoning tasks can have multiple valid structural phrasing paths that share no n-grams with the reference.

46 Perplexity (PPL) is a standard evaluation metric for language models, defined as $\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i})\right)$. However, models with lower perplexity on a validation set do not always generate text that humans judge as higher quality. Which phenomenon best explains this paradox?

perplexity and human judgment measures Hard
A. Perplexity only measures the precision of the generated text, ignoring recall which is highly valued by human evaluators.
B. Perplexity is evaluated using teacher forcing on human-written text, which does not penalize the model for entering repetitive loops during free-form autoregressive generation.
C. Human judgment exclusively favors high-entropy distributions, whereas lower perplexity indicates maximum entropy.
D. Lower perplexity models often suffer from 'exposure bias', preventing them from generating tokens outside the validation set.

47 Chain-of-Thought (CoT) prompting significantly improves performance on reasoning tasks compared to standard prompting. From a computational complexity perspective of standard Transformer-based LLMs, why does generating intermediate reasoning steps increase the model's problem-solving capability?

reasoning tasks Hard
A. It effectively increases the computational depth allocated to a problem, as each generated token provides another complete forward pass through the model's layers.
B. It forces the model to use exact n-gram matching with the prompt, preventing hallucination in reasoning chains.
C. It allows the model to modify its own internal weights dynamically during the forward pass.
D. It bypasses the self-attention bottleneck by attending only to the prompt and the final answer token.

48 Consider an autoregressive language model generating a sequence using greedy search. The generated text gets stuck in an infinite loop (e.g., 'I went to the store to the store to the store...'). Which mathematical characteristic of the model's learned distribution most directly contributes to this greedy decoding failure?

greedy search Hard
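A. The model reaches an effectively absorbing state in which the conditional probability of continuing the repeated phrase, given a context that already contains it, is high enough to create a deterministic local optimum that outscores escaping it.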
B. The length penalty is set to a negative value, forcing the model to repeat n-grams.
C. The context window is strictly larger than the loop length, preventing attention heads from attending to previous instances of the loop.
D. The token probabilities are perfectly uniformly distributed.

49 In the context of LLM hallucination, researchers distinguish between 'intrinsic' and 'extrinsic' hallucinations. An LLM generated a summary stating: 'The CEO of OpenAI, Sam Altman, announced a new model in Paris.' If the source text mentioned the announcement but did not state the location, how is this hallucination classified and why is it notoriously difficult to penalize using standard cross-entropy training?

explainability and hallucination in LLMs Hard
A. Extrinsic hallucination; cross-entropy maximizes likelihood based on training priors (where announcements often happen in major cities), penalizing the model for factual abstention.
B. Intrinsic hallucination; the model lacks explicit causal attention heads.
C. Intrinsic hallucination; cross-entropy forces the model to ignore factual contradictions.
D. Extrinsic hallucination; standard cross-entropy training cannot be applied to summarization tasks.

50 A persistent issue in dialogue generation is 'exposure bias'. Which of the following training paradigms is specifically designed to mitigate exposure bias by bridging the gap between training and inference distributions?

dialogue generation Hard
A. Byte-Pair Encoding (BPE), which reduces out-of-vocabulary tokens during inference.
B. Scheduled Sampling, where the model is increasingly fed its own predictions instead of the ground-truth tokens during training.
C. Teacher Forcing, where the model is strictly trained using the ground-truth previous tokens to stabilize gradients.
D. Knowledge Distillation, where a smaller model learns from the logits of a larger dialogue model.

51 In the context of Large Language Models, 'emergent abilities' are capabilities that are not present in smaller models but suddenly appear when the model scale reaches a certain threshold. Which of the following provides the most statistically rigorous critique of emergent abilities as proposed by some recent NLP literature?

large language models Hard
A. The 'emergence' is often an artifact of using non-linear, discontinuous evaluation metrics (like exact match) rather than smooth, continuous metrics (like Brier score or cross-entropy).
B. Scaling laws predict that all models will forget reasoning abilities if trained for more than one epoch.
C. Emergent abilities are strictly a result of catastrophic forgetting of simple syntax in favor of complex semantics.
D. Emergent abilities only occur in models utilizing Mixture of Experts (MoE) architectures.

52 Contrastive Search is a text generation strategy introduced to prevent degeneration in LLMs. Its objective function at step $t$ is formulated to select a token $v$ that maximizes model confidence while minimizing what specific component?

text generation strategies Hard
A. The KL divergence between the current token distribution and the uniform distribution.
B. The attention weight assigned to the prompt tokens.
C. The cosine similarity between the representation of $v$ and the representations of all previously generated tokens.
D. The absolute length of the generated sequence.

53 When leveraging LLMs for abstractive summarization, a phenomenon known as 'lead bias' is frequently observed. If an LLM is heavily exhibiting lead bias, how will this manifest in its outputs, and how does the self-attention mechanism theoretically contribute to it in long documents?

model behaviors in summarization Hard
A. The model generates repetitive filler phrases at the lead of the summary; attention heads fail to normalize weights.
B. The model exclusively extracts verbatim sentences rather than paraphrasing; self-attention strictly enforces exact match routing.
C. The model focuses disproportionately on the beginning of the source text; early tokens act as 'attention sinks' and are heavily attended to across all layers.
D. The model tends to summarize only the concluding paragraphs; attention mechanism decays over long distances, ignoring early tokens.

54 A known empirical issue with beam search in neural machine translation and other generative tasks is the 'beam search curse', where increasing the beam size beyond a certain point degrades the BLEU score. What is the primary cause of this degradation?

beam search Hard
A. Larger beams find sequences with higher global log-probability, but the model's probability distribution is poorly calibrated and actually assigns the highest probabilities to overly short, inadequate sequences.
B. Larger beam sizes force the algorithm into a greedy search paradigm, removing diversity.
C. Larger beams cause the model to exceed the context window, truncating the output.
D. Larger beams require a negative length penalty, punishing the model for outputting any text.

55 Which of the following describes a key architectural difference between Causal Language Models (like GPT-3) and Masked Language Models (like BERT) that fundamentally makes Causal LMs better suited for zero-shot generative prompting?

generative NLP models Hard
A. Causal LMs optimize the entire sequence probability simultaneously using a CRF layer, allowing coherent long-form generation.
B. Causal LMs use a strictly lower-triangular causal mask in self-attention, naturally aligning their pre-training objective with left-to-right autoregressive text generation.
C. Masked LMs use absolute positional embeddings, whereas Causal LMs use no positional embeddings, allowing infinite text generation.
D. Causal LMs possess a bidirectional encoder that processes the prompt perfectly before generating text, while Masked LMs can only process text unidirectionally.
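The lower-triangular causal mask mentioned in the options can be written out for a tiny sequence. This is a plain-Python sketch with hypothetical helper names; real implementations use tensor libraries and typically add a large negative value to masked scores rather than zeroing exponentials.

```python
import math

def causal_mask(n):
    """n x n lower-triangular mask: position i may attend to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_attention_weights(scores, mask):
    """Row-wise softmax restricted to the positions the mask allows."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

mask = causal_mask(3)
scores = [[0.0] * 3 for _ in range(3)]    # hypothetical uniform attention scores
for row in masked_attention_weights(scores, mask):
    print(row)
```

Each position distributes its attention only over itself and earlier positions, which is the same information available when generating left to right.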

56 To explain the output of a generative LLM, researchers often use Integrated Gradients. For a specific generated token $y_t$, Integrated Gradients computes the path integral of gradients from a baseline input $x'$ to the actual input $x$. Why is computing Integrated Gradients for generative LLMs significantly more complex than for standard classification models?

explainability and hallucination in LLMs Hard
A. Generative LLMs use discrete token inputs; constructing a continuous interpolation path from a baseline token embedding to the input token embedding may pass through regions of the embedding space that correspond to no valid token.
B. Integrated Gradients can only be applied to CNNs, as self-attention matrices do not have well-defined gradients.
C. Generative models do not have an objective function during inference, so gradients cannot be calculated.
D. The softmax function at the output layer of an LLM is not differentiable.

57 In 'Self-Consistency' decoding for LLM reasoning tasks, multiple distinct reasoning paths are sampled, and the final answer is selected via majority vote. For self-consistency to be effective, which decoding strategy MUST be utilized during the generation of the paths?

reasoning tasks Hard
A. Contrastive search with a high degeneration penalty.
B. A non-deterministic sampling strategy (like temperature sampling with $T > 0$) to ensure diverse reasoning paths are generated.
C. Greedy search, to ensure the model produces its most confident reasoning path every time.
D. Beam search with a beam size of 1.
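The voting step of self-consistency is simple to sketch. Here `noisy_solver` is a hypothetical stand-in for sampling one chain of thought from a model and reading off its final answer; the 60% accuracy figure is invented for the illustration.

```python
import random
from collections import Counter

def self_consistency(sample_answer, num_paths, rng):
    """Sample several reasoning paths and return the majority answer and vote counts."""
    answers = [sample_answer(rng) for _ in range(num_paths)]
    votes = Counter(answers)
    return votes.most_common(1)[0][0], votes

def noisy_solver(rng):
    """Hypothetical sampled reasoning path: correct 60% of the time, else a stray answer."""
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "24"])

answer, votes = self_consistency(noisy_solver, num_paths=201, rng=random.Random(0))
print(answer, dict(votes))
```

If every path were produced deterministically, all votes would be identical and the majority vote would add nothing over a single generation.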

58 During the creation of an instruction-tuned model, Supervised Fine-Tuning (SFT) is typically performed before RLHF. If the SFT phase trains the model exclusively on examples formatted as User: [Query] Assistant: [Response], and a user at inference prompts the model with System: [Directive], what failure mode is most likely to occur and why?

instruction-tuned Hard
A. The model will switch to purely extractive summarization because instructions without a User: prefix are interpreted as documents.
B. The model will generate a sequence of [PAD] tokens because the attention mechanism will crash on unrecognized text.
C. The model will trigger an intrinsic hallucination because System is a reserved keyword in all LLM tokenizers.
D. Out-of-distribution formatting failure; the model has learned strict structural priors during SFT and may hallucinate a User: tag or generate degraded text when the prompt does not match the exact SFT template.

59 ROUGE-L uses the Longest Common Subsequence (LCS) to evaluate summarization. Let the reference be and the generation be . The length of LCS is 3 (). What is a key mathematical advantage of LCS in ROUGE-L over using purely contiguous n-gram overlaps (like ROUGE-2)?

evaluation metrics Hard
A. LCS automatically incorporates an exponential brevity penalty, replacing the need for an explicit length threshold.
B. LCS requires that the matching sequence be perfectly adjacent, heavily penalizing dropped words.
C. LCS computes semantic similarity in the embedding space rather than relying on lexical matching.
D. LCS naturally captures sentence-level structure by identifying in-sequence matches without requiring consecutive n-gram matches, allowing flexibility for insertion of novel words.
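The longest common subsequence behind ROUGE-L can be computed with a standard dynamic program. The sentences below are invented for this sketch (they are not the quiz's own example), and the unweighted F1 is an assumption, since ROUGE-L implementations differ in how they weight precision and recall.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """LCS-based precision, recall and F1 (unweighted F1 is an assumption here)."""
    lcs = lcs_length(candidate, reference)
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the cat sat on the mat".split()
candidate = "the cat quietly sat down on a mat".split()

print(lcs_length(candidate, reference))   # in-order matches despite inserted words
print(rouge_l(candidate, reference))
```

The inserted words break almost every bigram match between the two sentences, yet the in-order matches "the cat sat on mat" are still credited.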

60 A massive LLM is deployed using pipeline parallelism across multiple GPUs. If the model exhibits high latency during autoregressive token generation (decoding phase) compared to the prefill phase (processing the prompt), what is the primary architectural bottleneck causing this difference?

large language models Hard
A. The prefill phase operates sequentially on tokens, whereas the decoding phase evaluates all future tokens in parallel.
B. The decoding phase requires reading the full Key-Value (KV) cache from VRAM for every single generated token, which makes it memory-bandwidth-bound, whereas the prefill phase is dominated by highly parallel matrix multiplications.
C. The softmax operation cannot be parallelized across multiple GPUs, meaning decoding must happen on a single CPU node.
D. Autoregressive generation uses backpropagation at every step, whereas the prefill phase only uses a forward pass.