1What is the primary goal of a generative NLP model?
generative NLP models
Easy
A.To generate new, coherent text sequences
B.To cluster documents into distinct topics
C.To extract named entities from text
D.To predict the sentiment of a given sentence
Correct Answer: To generate new, coherent text sequences
Explanation:
Generative NLP models are designed to produce new text, such as sentences or paragraphs, rather than just analyzing or classifying existing text.
2In autoregressive text generation, how is a sequence typically generated?
text generation strategies
Easy
A.Tokens are generated one by one, conditioned on previously generated tokens
B.The sequence is generated backwards from the end to the beginning
C.All tokens are generated simultaneously in one step
D.Tokens are selected randomly from the vocabulary
Correct Answer: Tokens are generated one by one, conditioned on previously generated tokens
Explanation:
Autoregressive generation means the model predicts the next word in a sequence based on all the words it has generated so far.
3How does the greedy search decoding strategy choose the next token?
greedy search
Easy
A.It randomly samples from the top 10 tokens
B.It selects the token with the highest probability at each step
C.It selects the token with the lowest probability to increase diversity
D.It looks ahead to the end of the sentence before choosing
Correct Answer: It selects the token with the highest probability at each step
Explanation:
Greedy search always picks the single most likely next word at every step, without considering future token probabilities.
4What is a common drawback of using greedy search for text generation?
greedy search
Easy
A.It is too computationally expensive
B.It often leads to repetitive and highly predictable text
C.It can only generate one word per minute
D.It requires a specialized neural network architecture
Correct Answer: It often leads to repetitive and highly predictable text
Explanation:
Because greedy search always picks the highest probability word, it can easily get stuck in repetitive loops and lacks the creativity of sampling methods.
5Which of the following best describes beam search?
beam search
Easy
A.It generates text by translating it back and forth between languages
B.It only keeps the single best token at each step
C.It keeps track of the top $k$ most probable sequences at each step
D.It randomly selects tokens until a complete sentence is formed
Correct Answer: It keeps track of the top $k$ most probable sequences at each step
Explanation:
Beam search maintains a 'beam' of the top $k$ most likely partial sequences (hypotheses), expanding them at each step to find a better overall sequence than greedy search.
6What happens during top-k sampling in text generation?
top-k
Easy
A.The model generates entirely different documents
B.The model samples the next token only from the $k$ most likely tokens
C.The model removes the top $k$ tokens and samples from the rest
D.The model selects exactly $k$ tokens at every step
Correct Answer: The model samples the next token only from the $k$ most likely tokens
Explanation:
Top-k sampling restricts the vocabulary to the $k$ words with the highest probabilities, redistributing the probability mass among them before sampling.
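The filtering and renormalization step described above can be sketched in a few lines. This is a minimal illustration, not a library implementation: the function names `top_k_filter` and `sample` and the toy distribution are assumptions made for the example.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

def sample(probs, rng):
    """Draw one token according to the given (already normalized) distribution."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution, for illustration only.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}

filtered = top_k_filter(probs, k=2)   # only "the" and "a" survive
greedy_like = top_k_filter(probs, k=1)  # a single token with probability 1
token = sample(filtered, random.Random(0))
```

With `k=1` the pool contains only the most probable token, so sampling from it always returns that token, which is the same choice greedy search would make.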
7If $k = 1$ in top-k sampling, the strategy becomes exactly equivalent to which other decoding method?
top-k
Easy
A.Random sampling
B.Greedy search
C.Nucleus sampling
D.Beam search
Correct Answer: Greedy search
Explanation:
If $k = 1$, the model restricts its choice to only the single most likely token, which is the exact definition of greedy search.
8What is another common name for nucleus sampling?
nucleus sampling
Easy
A.Bottom-up sampling
B.Greedy sampling
C.Top-p sampling
D.Top-k sampling
Correct Answer: Top-p sampling
Explanation:
Nucleus sampling is widely known as top-p sampling because it selects from the smallest set of tokens whose cumulative probability exceeds a threshold $p$.
9Unlike top-k sampling which uses a fixed number of tokens, how does nucleus (top-p) sampling determine the pool of candidate tokens?
nucleus sampling
Easy
A.By filtering out tokens with less than 5 characters
B.By picking all tokens that start with the same letter
C.By selecting tokens based on a fixed cumulative probability mass
D.By dynamically choosing a random number at each step
Correct Answer: By selecting tokens based on a fixed cumulative probability mass
Explanation:
Nucleus sampling includes the most probable tokens such that the sum of their probabilities reaches the chosen value $p$. The number of tokens in this pool varies at each step.
10What does it mean for a large language model to be 'instruction-tuned'?
instruction-tuned
Easy
A.It has been trained solely on programming language instructions
B.It is only capable of giving instructions to other AI models
C.It contains hard-coded rules for grammar instructions
D.It has been fine-tuned on a dataset of tasks described via natural language instructions
Correct Answer: It has been fine-tuned on a dataset of tasks described via natural language instructions
Explanation:
Instruction tuning is a fine-tuning process where models learn to understand and follow explicit user commands or prompts.
11Which of the following is a primary characteristic of Large Language Models (LLMs)?
large language models
Easy
A.They usually have fewer than 1 million parameters
B.They are exclusively rule-based expert systems
C.They cannot process input longer than one sentence
D.They are trained on massive amounts of textual data using self-supervised learning
Correct Answer: They are trained on massive amounts of textual data using self-supervised learning
Explanation:
LLMs like GPT-3 or LLaMA are characterized by their massive scale in parameters (often billions) and training datasets, utilizing self-supervised learning objectives like next-word prediction.
12When an LLM performs abstractive summarization, what is it primarily doing?
model behaviors in summarization
Easy
A.Identifying the author of the original text
B.Generating new sentences that convey the core meaning of the source text
C.Translating the text into a different language to make it shorter
D.Extracting exact sentences from the source text and pasting them together
Correct Answer: Generating new sentences that convey the core meaning of the source text
Explanation:
Abstractive summarization involves the model writing a new, concise summary in its own words, rather than simply extracting existing sentences (extractive summarization).
13What is a critical requirement for a model to perform well in multi-turn dialogue generation?
dialogue generation
Easy
A.It must only provide answers using greedy search
B.It must ignore the user's previous inputs to save memory
C.It must maintain conversational context and memory across multiple interactions
D.It must use a different language for every turn
Correct Answer: It must maintain conversational context and memory across multiple interactions
Explanation:
In multi-turn dialogue, the model must track what was said previously to provide relevant and coherent responses.
14Which prompting technique explicitly encourages an LLM to generate intermediate logical steps to solve complex reasoning tasks?
reasoning tasks
Easy
A.Chain-of-Thought (CoT) prompting
B.Negative prompting
C.Zero-shot prompting
D.Greedy prompting
Correct Answer: Chain-of-Thought (CoT) prompting
Explanation:
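Chain-of-Thought prompting asks the model to write out intermediate reasoning steps before giving its final answer, which improves performance on multi-step reasoning tasks. Zero-shot prompting only means that no examples are provided, and 'greedy prompting' is not a prompting technique.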
15Which automated metric is widely used in NLP for evaluating the n-gram overlap between generated text and reference text?
evaluation metrics
Easy
A.Mean Squared Error (MSE)
B.Accuracy
C.BLEU score
D.F1 score for classification
Correct Answer: BLEU score
Explanation:
BLEU (Bilingual Evaluation Understudy) is a classic automated metric used to evaluate generated text by measuring n-gram overlap with human reference texts.
16In the context of evaluating a language model, what does a lower perplexity score indicate?
perplexity and human judgment measures
Easy
A.The model generates text at a slower speed
B.The model generates longer text sequences
C.The model is more confused and performs poorly
D.The model assigns higher probability to the test data, indicating better performance
Correct Answer: The model assigns higher probability to the test data, indicating better performance
Explanation:
Perplexity measures how 'surprised' a model is by real text. A lower perplexity means the model is better at predicting the sequence, indicating higher quality.
17Why are human judgment measures often considered essential for evaluating generative LLMs?
perplexity and human judgment measures
Easy
A.Automated metrics are too computationally expensive to calculate
B.Automated metrics like BLEU or perplexity do not fully capture text fluency, coherence, and safety
C.Humans can evaluate millions of sentences faster than a computer
D.Automated metrics perfectly capture semantic meaning but struggle with grammar
Correct Answer: Automated metrics like BLEU or perplexity do not fully capture text fluency, coherence, and safety
Explanation:
Generative tasks are highly subjective. Automated metrics often miss nuances like tone, factual accuracy, coherence, and helpfulness, making human evaluation the gold standard.
18What does the term 'hallucination' refer to in the context of Large Language Models?
explainability and hallucination in LLMs
Easy
A.The process of extracting hidden rules from the training data
B.The model generating factually incorrect, nonsensical, or fabricated information as if it were true
C.A visual glitch that occurs during the training phase of a neural network
D.The model's ability to generate highly creative works of fiction when prompted
Correct Answer: The model generating factually incorrect, nonsensical, or fabricated information as if it were true
Explanation:
In LLMs, a hallucination occurs when the model confidently generates false or unverified information that is not grounded in reality or the provided context.
19Why is 'explainability' a major challenge in modern Large Language Models?
explainability and hallucination in LLMs
Easy
A.Because their code is generally kept secret from developers
B.Because they only output numerical data instead of text
C.Because they are rule-based systems with too many explicit rules
D.Because they act as 'black boxes' with billions of parameters, making it hard to trace how a specific output was derived
Correct Answer: Because they act as 'black boxes' with billions of parameters, making it hard to trace how a specific output was derived
Explanation:
LLMs use deep neural networks with billions of weights. The complex, non-linear interactions make it extremely difficult for humans to understand exactly why a model made a specific prediction.
20Which neural network architecture serves as the foundation for modern generative LLMs like GPT?
generative NLP models
Easy
A.Convolutional Neural Network (CNN)
B.Long Short-Term Memory (LSTM)
C.Support Vector Machine (SVM)
D.Transformer Decoder
Correct Answer: Transformer Decoder
Explanation:
Modern generative models like the GPT (Generative Pre-trained Transformer) family are primarily based on the Transformer architecture, specifically using decoder-only setups for autoregressive text generation.
21Which of the following best describes the core mathematical objective of standard autoregressive generative NLP models during generation?
generative NLP models
Medium
A.Predicting the next token by maximizing the conditional probability $P(x_t \mid x_{<t})$.
B.Minimizing the distance between the input sequence embeddings and the output sequence embeddings.
C.Masking random tokens and predicting them using bidirectional context $P(x_t \mid x_{<t}, x_{>t})$.
D.Maximizing the probability of the entire sequence at once using $P(x_1, x_2, \dots, x_T)$.
Correct Answer: Predicting the next token by maximizing the conditional probability $P(x_t \mid x_{<t})$.
Explanation:
Autoregressive generative models generate text one token at a time, calculating the probability of the next token based strictly on the preceding context tokens.
22When generating text, modifying the 'temperature' parameter alters the probability distribution of the next token. If a temperature $T < 1$ is applied, how does it affect the softmax output?
text generation strategies
Medium
A.It truncates the distribution by setting all probabilities below $T$ to zero.
B.It adds a constant noise value to all logits before the softmax is computed.
C.It sharpens the distribution, making the model more confident and likely to select the highest-probability tokens.
D.It makes the distribution more uniform, increasing the likelihood of selecting low-probability tokens.
Correct Answer: It sharpens the distribution, making the model more confident and likely to select the highest-probability tokens.
Explanation:
Applying a temperature $T < 1$ divides the logits by a number smaller than 1 before the softmax, which exaggerates the differences between the logits, thereby sharpening the probability distribution.
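A small numeric sketch makes the effect concrete. The function name `softmax_with_temperature` and the example logits are assumptions chosen for illustration; the computation itself is the usual softmax applied to logits divided by the temperature.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]

sharp = softmax_with_temperature(logits, 0.5)  # temperature below 1: more peaked
base = softmax_with_temperature(logits, 1.0)   # plain softmax
flat = softmax_with_temperature(logits, 2.0)   # temperature above 1: more uniform
```

The top token's probability is largest in `sharp`, smaller in `base`, and smallest in `flat`, matching the sharpening behavior described in the explanation.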
23A generative model is using greedy search to translate a sentence. Why might this strategy fail to find the sequence with the highest overall probability?
greedy search
Medium
A.Greedy search always selects the longest possible sequence, penalizing shorter valid translations.
B.Greedy search requires an excessive amount of memory to maintain multiple candidate sequences.
C.Greedy search randomly samples tokens, which introduces too much variance.
D.Greedy search makes locally optimal choices at each time step without considering future token probabilities.
Correct Answer: Greedy search makes locally optimal choices at each time step without considering future token probabilities.
Explanation:
Greedy search selects the single most probable token at each step. This local optimization can lead to dead ends where future tokens have very low probabilities, missing a globally optimal sequence.
24In beam search text generation with a beam width of $k = 3$, what exactly does the algorithm track at time step $t$?
beam search
Medium
A.All possible sequences of length $t$, but only evaluating the top 3 nodes at the final step.
B.The 3 highest probability tokens across the entire vocabulary, independently of previous steps.
C.The 3 most probable sequences of length $t$, along with their cumulative log-probabilities.
D.The 3 most probable tokens for the single best sequence generated up to time $t$.
Correct Answer: The 3 most probable sequences of length $t$, along with their cumulative log-probabilities.
Explanation:
Beam search maintains a 'beam' of the top $k$ most probable partial sequences (hypotheses) at every time step, expanding them and then pruning back down to $k$.
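The expand-then-prune loop can be sketched as follows. This is a toy illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a language model that returns next-token probabilities for a prefix, and the sketch ignores end-of-sequence handling and length penalties.

```python
import math

def beam_search(next_token_probs, beam_width, steps):
    """Return the beam_width best (sequence, cumulative log-probability) pairs."""
    beams = [((), 0.0)]  # start with one empty hypothesis
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            # Expand every hypothesis in the beam by every possible next token.
            for token, prob in next_token_probs(seq).items():
                candidates.append((seq + (token,), score + math.log(prob)))
        # Prune back down to the beam width, keeping the highest scores.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_model(prefix):
    """Hypothetical two-step model: the second step depends on the first token."""
    if not prefix:
        return {"A": 0.6, "B": 0.4}
    if prefix[-1] == "A":
        return {"x": 0.55, "y": 0.45}
    return {"x": 0.9, "y": 0.1}

greedy_result = beam_search(toy_model, beam_width=1, steps=2)
beam_result = beam_search(toy_model, beam_width=2, steps=2)
```

With a beam width of 1 the search behaves like greedy decoding and commits to "A" first, ending with ("A", "x") at probability 0.33. With a beam width of 2 the hypothesis starting with "B" survives the first step, and the search finds ("B", "x") at probability 0.36.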
25Why is a length penalty often applied to the scores of candidate sequences during beam search?
beam search
Medium
A.Because the model's vocabulary size grows exponentially with sequence length.
B.Because the attention mechanism degrades in performance as the sequence gets longer.
C.Because beam search naturally favors generating very long, repetitive sequences.
D.Because cumulative log-probabilities are negative, making longer sequences naturally score lower.
Correct Answer: Because cumulative log-probabilities are negative, making longer sequences naturally score lower.
Explanation:
Since probabilities are $\le 1$, their log values are negative (or zero). Summing more negative numbers (for longer sequences) results in a lower total score, so a length penalty normalizes the score to prevent bias against longer outputs.
26A text generation model applies Top-K sampling with $K = 50$. If the model predicts a very confident distribution where the top 3 tokens contain nearly all of the probability mass, what is a potential drawback of this Top-K approach?
top-k
Medium
A.It will dynamically reduce $K$ to 3 to match the probability mass.
B.It will force the model to select a token from the long tail, resulting in gibberish.
C.It will automatically switch to greedy search because the confidence is too high.
D.It still allows a small probability of selecting from the remaining 47 tokens, which might be irrelevant.
Correct Answer: It still allows a small probability of selecting from the remaining 47 tokens, which might be irrelevant.
Explanation:
Top-K always keeps exactly $K$ tokens regardless of the distribution's shape. Even if the top 3 tokens are highly confident, the remaining 47 tokens are kept in the sampling pool, risking the selection of inappropriate words.
27In nucleus (Top-p) sampling with $p = 0.9$, how does the algorithm determine which tokens to keep in the sampling pool?
nucleus sampling
Medium
A.It keeps all tokens with an individual probability greater than $0.9$.
B.It selects tokens randomly until the mean probability of the selected tokens is $0.9$.
C.It selects the top $90\%$ of the vocabulary.
D.It sorts tokens by probability and selects the smallest set whose cumulative probability exceeds $0.9$.
Correct Answer: It sorts tokens by probability and selects the smallest set whose cumulative probability exceeds $0.9$.
Explanation:
Nucleus sampling dynamically adjusts the number of candidate tokens by computing the cumulative distribution of sorted probabilities and cutting off as soon as the sum reaches the threshold $p$.
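The sort-and-cut-off procedure can be sketched as below. The function name `top_p_filter` and both example distributions are assumptions made for illustration; the cutoff uses "reaches or exceeds", as in the explanation above.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(prob for _, prob in kept)
    # Renormalize so the kept probabilities sum to 1 before sampling.
    return {token: prob / total for token, prob in kept}

# Hypothetical distributions: one confident (peaked), one uncertain (flat).
peaked = {"the": 0.92, "a": 0.05, "an": 0.02, "this": 0.01}
flat = {"red": 0.25, "blue": 0.25, "green": 0.25, "pink": 0.25}

peaked_pool = top_p_filter(peaked, p=0.9)  # one token is enough
flat_pool = top_p_filter(flat, p=0.9)      # all four tokens are needed
```

The same threshold yields a pool of one token for the peaked distribution and four tokens for the flat one, which is the dynamic behavior that distinguishes nucleus sampling from a fixed top-k cutoff.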
28Which of the following describes a scenario where nucleus sampling (Top-p) behaves significantly differently than Top-K sampling?
nucleus sampling
Medium
A.When the probability distribution is completely uniform across the entire vocabulary.
B.When the temperature is set to $0$, causing both to collapse to greedy search.
C.When the distribution shifts from highly peaked (confident) to very flat (uncertain) across different generation steps.
D.When the generation task is a pure classification task with only two possible output tokens.
Correct Answer: When the distribution shifts from highly peaked (confident) to very flat (uncertain) across different generation steps.
Explanation:
Nucleus sampling dynamically adjusts the pool size based on the distribution shape (fewer tokens when confident, more when uncertain), whereas Top-K uses a rigid cutoff regardless of uncertainty.
29What is the primary objective of instruction tuning a Large Language Model compared to standard pre-training?
instruction-tuned
Medium
A.To teach the model grammar and syntax from raw text corpora.
B.To convert the model from a causal decoder to a masked language encoder.
C.To compress the model size by removing redundant attention heads.
D.To align the model's outputs with human intent by training on (instruction, response) pairs.
Correct Answer: To align the model's outputs with human intent by training on (instruction, response) pairs.
Explanation:
While pre-training teaches the model to predict the next word on raw text, instruction tuning uses supervised learning on task-specific prompts and expected responses to make the model follow human commands.
30When performing few-shot prompting with an LLM, the model successfully adapts to a new task. What mechanism allows this adaptation without updating the model's parameters?
large language models
Medium
A.Parameter-efficient fine-tuning (PEFT)
B.In-context learning
C.Weight quantization
D.Gradient descent
Correct Answer: In-context learning
Explanation:
In-context learning allows large language models to recognize patterns and perform new tasks simply by observing examples provided in the prompt, without any parameter updates.
31When utilizing a generative LLM for abstractive summarization, which of the following is the most significant risk compared to extractive summarization?
model behaviors in summarization
Medium
A.The model is unable to process long documents due to vocabulary constraints.
B.The model will inherently produce a summary longer than the original text.
C.The model might generate fluent but factually incorrect statements not present in the source.
D.The model might only copy sentences verbatim without any rephrasing.
Correct Answer: The model might generate fluent but factually incorrect statements not present in the source.
Explanation:
Abstractive summarization generates new text, making it highly susceptible to hallucination (creating false facts), whereas extractive summarization only selects existing sentences from the source.
32In a multi-turn dialogue generation system, how does an autoregressive LLM typically 'remember' previous conversational turns?
dialogue generation
Medium
A.By updating its neural weights via continuous backpropagation after every user turn.
B.By storing previous turns in a separate relational database that modifies the softmax layer.
C.By concatenating previous turns with the current input into a single prompt, up to the context window limit.
D.By relying exclusively on a static context vector generated during the pre-training phase.
Correct Answer: By concatenating previous turns with the current input into a single prompt, up to the context window limit.
Explanation:
LLMs are stateless. To maintain context in a dialogue, previous user and assistant turns are concatenated into the prompt provided to the model for the next generation step.
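A minimal sketch of this concatenation is shown below. Everything here is an assumption for illustration: the `build_prompt` helper, the "role: text" layout, and the whitespace word count used as a crude stand-in for a real tokenizer. Real chat systems use model-specific templates and token counts.

```python
def build_prompt(history, user_message, max_tokens):
    """Concatenate the most recent turns that fit into a fixed token budget."""
    turns = history + [("user", user_message)]
    kept, used = [], 0
    # Walk backwards so the newest turns are kept and the oldest are dropped first.
    for role, text in reversed(turns):
        cost = len(text.split())  # crude token estimate (assumption)
        if kept and used + cost > max_tokens:
            break
        kept.append(f"{role}: {text}")
        used += cost
    return "\n".join(reversed(kept)) + "\nassistant:"

history = [
    ("user", "Hi there"),
    ("assistant", "Hello! How can I help?"),
]

full_prompt = build_prompt(history, "What is beam search?", max_tokens=50)
truncated_prompt = build_prompt(history, "What is beam search?", max_tokens=5)
```

With a generous budget the prompt contains all three turns. With a budget of 5 only the newest user message fits, so the model would no longer "remember" the earlier turns.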
33How does 'Chain-of-Thought' (CoT) prompting improve an LLM's performance on complex mathematical reasoning tasks?
reasoning tasks
Medium
A.It encourages the model to generate intermediate reasoning steps, allocating more computational steps before the final answer.
B.It instructs the model to ignore intermediate steps and directly output the final number to reduce hallucination.
C.It forces the model to use an external calculator API to compute the mathematical operations.
D.It alters the decoding strategy from Top-p sampling to beam search to guarantee the correct answer.
Correct Answer: It encourages the model to generate intermediate reasoning steps, allocating more computational steps before the final answer.
Explanation:
Chain-of-Thought prompting elicits step-by-step reasoning from the model. Because the model is autoregressive, generating intermediate tokens effectively gives it more 'computational time' to arrive at the correct final answer.
34Which of the following best describes the primary difference between how BLEU and ROUGE evaluate generated text?
evaluation metrics
Medium
A.BLEU measures grammatical correctness, while ROUGE measures factuality.
B.BLEU evaluates semantic meaning using embeddings, while ROUGE evaluates exact lexical overlap.
C.BLEU is based on precision of n-grams, while ROUGE is traditionally based on recall of n-grams.
D.BLEU is used exclusively for text summarization, while ROUGE is used for machine translation.
Correct Answer: BLEU is based on precision of n-grams, while ROUGE is traditionally based on recall of n-grams.
Explanation:
BLEU focuses on precision (how much of the generated text appears in the reference), making it suited for translation. ROUGE focuses on recall (how much of the reference is captured in the generation), making it suited for summarization.
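The precision and recall distinction can be illustrated with plain n-gram counting. This sketch is deliberately simplified and is not a full BLEU or ROUGE implementation: it omits BLEU's brevity penalty and its combination of several n-gram orders, and the function names are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(candidate, reference, n):
    """Return (matched, candidate_total, reference_total) n-gram counts."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum((cand & ref).values())  # clipped matches
    return matched, sum(cand.values()), sum(ref.values())

def ngram_precision(candidate, reference, n=1):
    """BLEU-style view: what fraction of the candidate appears in the reference?"""
    matched, cand_total, _ = overlap(candidate, reference, n)
    return matched / cand_total if cand_total else 0.0

def ngram_recall(candidate, reference, n=1):
    """ROUGE-style view: what fraction of the reference is covered by the candidate?"""
    matched, _, ref_total = overlap(candidate, reference, n)
    return matched / ref_total if ref_total else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

precision = ngram_precision(candidate, reference)
recall = ngram_recall(candidate, reference)
```

Here every candidate word appears in the reference, so unigram precision is 1.0, but the candidate covers only half of the reference, so unigram recall is 0.5.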
35A student evaluates an LLM's response using BLEU and gets a very low score, yet human evaluators rate the response as excellent. What is the most likely reason for this discrepancy?
evaluation metrics
Medium
A.BLEU penalizes text that uses synonyms and paraphrasing instead of exact n-gram matches from the reference.
B.The LLM produced a highly repetitive sequence that tricked the human evaluators.
C.BLEU scores increase when the generated text is shorter than the reference text, which humans dislike.
D.The human evaluators calculated perplexity instead of precision.
Correct Answer: BLEU penalizes text that uses synonyms and paraphrasing instead of exact n-gram matches from the reference.
Explanation:
N-gram based metrics like BLEU rely on exact string matching. A perfectly valid and fluent paraphrase will receive a low BLEU score because it doesn't share the exact vocabulary with the reference.
36Given a sequence of words $W = w_1, w_2, \dots, w_N$, perplexity (PP) is defined as $PP(W) = P(w_1, w_2, \dots, w_N)^{-1/N}$. What does a lower perplexity score on a test set indicate about a language model?
perplexity and human judgment measures
Medium
A.The model assigns a higher probability to the test data, indicating it predicts the sequence well.
B.The model aligns better with human ethical judgments and safety guidelines.
C.The model generates text with a wider variety of vocabulary.
D.The model assigns a lower probability to the test data, indicating it is confused.
Correct Answer: The model assigns a higher probability to the test data, indicating it predicts the sequence well.
Explanation:
Perplexity is the inverse probability of the test set, normalized by the number of words. A lower perplexity means the model assigned a higher probability to the actual data, indicating better predictive performance.
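As a rough numeric check, perplexity can be computed from the per-token probabilities a model assigned to the actual test tokens. The function name `perplexity` and the two probability lists are assumptions for illustration; the sequence probability is taken as the product of the per-token conditional probabilities, and the computation is done in log space to avoid underflow.

```python
import math

def perplexity(token_probs):
    """Inverse probability of the sequence, normalized by its length."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities assigned to each token of the same test sentence.
confident_model = [0.5, 0.5, 0.5, 0.5]
confused_model = [0.1, 0.1, 0.1, 0.1]

pp_confident = perplexity(confident_model)
pp_confused = perplexity(confused_model)
```

The model that assigns probability 0.5 to every token has a perplexity of about 2, while the one that assigns 0.1 has a perplexity of about 10, so higher assigned probability means lower perplexity.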
37Why is perplexity generally considered insufficient on its own for evaluating modern instruction-tuned LLMs?
perplexity and human judgment measures
Medium
A.Perplexity is bounded between 0 and 1, making it difficult to distinguish between high-performing models.
B.Perplexity measures how well the model predicts the next token in a static corpus, but not how helpful, safe, or factually accurate the generated responses are.
C.Perplexity cannot be mathematically calculated for autoregressive models.
D.Perplexity requires a human in the loop to calculate, making it too expensive to use at scale.
Correct Answer: Perplexity measures how well the model predicts the next token in a static corpus, but not how helpful, safe, or factually accurate the generated responses are.
Explanation:
While perplexity measures the statistical likelihood of text, it does not assess alignment with human intent, factuality, coherence of long outputs, or adherence to safety guardrails.
38In the context of evaluating LLM hallucination in abstractive summarization, what distinguishes an 'intrinsic hallucination' from an 'extrinsic hallucination'?
explainability and hallucination in LLMs
Medium
A.Intrinsic hallucination directly contradicts information in the source text, while extrinsic hallucination introduces external information that cannot be verified from the source.
B.Intrinsic hallucination occurs when the model introduces information that is factually false in the real world, while extrinsic hallucination is mathematically invalid logic.
C.Intrinsic hallucination is caused by hyperparameter tuning, while extrinsic hallucination is caused by biased pre-training data.
D.Intrinsic hallucination is a failure in the self-attention mechanism, while extrinsic hallucination is a failure in the feed-forward network.
Correct Answer: Intrinsic hallucination directly contradicts information in the source text, while extrinsic hallucination introduces external information that cannot be verified from the source.
Explanation:
Intrinsic hallucinations contradict the provided source material. Extrinsic hallucinations add details not found in the source text (which might be true or false, but are unverified by the source).
39Which of the following techniques is most commonly used to mitigate factual hallucination in LLMs by grounding the model's responses?
explainability and hallucination in LLMs
Medium
A.Applying an absolute length penalty to the generated sequences.
B.Increasing the temperature parameter during nucleus sampling.
C.Retrieval-Augmented Generation (RAG).
D.Decreasing the beam width during beam search decoding.
Correct Answer: Retrieval-Augmented Generation (RAG).
Explanation:
RAG retrieves relevant, factual documents from an external knowledge base and includes them in the model's prompt, grounding the generation in verified facts and reducing hallucination.
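The retrieve-then-prompt idea can be sketched with a toy retriever. This is only an illustration under loud assumptions: the `retrieve` and `build_grounded_prompt` helpers, the word-overlap scoring, and the prompt wording are all hypothetical. Practical systems typically retrieve with embedding similarity over a vector index rather than word overlap.

```python
def retrieve(query, documents, top_n=1):
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_n]

def build_grounded_prompt(query, documents):
    """Place the retrieved passages in the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Hypothetical knowledge base.
documents = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
]

prompt = build_grounded_prompt("What is the capital of France?", documents)
```

The resulting prompt carries the passage about Paris, so the model can answer from the supplied text instead of relying only on what it memorized during training.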
40When attempting to explain the predictions of a Transformer-based LLM, researchers often look at attention weights. What is a widely recognized limitation of using attention weights as an explainability tool?
explainability and hallucination in LLMs
Medium
A.Attention weights do not always correlate with feature importance or the actual causal impact on the model's output.
B.Attention weights are binary and cannot represent the magnitude of importance.
C.Attention weights are only computed for the final layer, leaving previous layers unexplainable.
D.Attention weights can only be extracted from encoder-decoder models, not decoder-only LLMs.
Correct Answer: Attention weights do not always correlate with feature importance or the actual causal impact on the model's output.
Explanation:
Research has shown that high attention weights do not strictly imply that a token was the causal reason for a specific prediction, as information is mixed across many layers and heads (a debate known as 'Attention is not Explanation').
41In nucleus sampling (Top-$p$), the model samples from the smallest set of tokens such that the sum of their probabilities is greater than or equal to $p$. If the most probable next token alone has probability $0.92$ and $0.92 \ge p$, what will be the effective size of the sampling vocabulary?
nucleus sampling
Hard
A.
B.The sampling fails because without adding up.
C.
D.
Correct Answer: 1
Explanation:
In nucleus sampling, we sort the tokens by probability in descending order and iteratively add them to the sampling pool until the cumulative probability meets or exceeds $p$. Since the first token alone has a probability of $0.92$, which is already $\ge p$, the sampling pool will consist of only that token, so its size is 1.
42Autoregressive models decoded using standard beam search often exhibit a strong bias towards shorter sequences. To mitigate this, a length penalty is introduced to the objective function: $s(Y) = \frac{1}{|Y|^{\alpha}} \sum_{t=1}^{|Y|} \log P(y_t \mid y_{<t})$. If $\alpha = 1$, how does this objective theoretically alter the sequence scoring compared to standard beam search?
beam search
Hard
A.It eliminates the influence of the prior probabilities, acting as a length-invariant constant across all beams.
B.It squares the log-probability sum, punishing longer sequences exponentially more than standard beam search.
C.It normalizes the cumulative log-probability by the sequence length, converting the score to the geometric mean of the token probabilities.
D.It biases the model exclusively towards sequences that have the highest possible single-token probability, regardless of length.
Correct Answer: It normalizes the cumulative log-probability by the sequence length, converting the score to the geometric mean of the token probabilities.
Explanation:
When $\alpha = 1$, dividing the sum of the log-probabilities by the sequence length $|Y|$ computes the arithmetic mean of the log-probabilities. Because $\exp\left(\frac{1}{|Y|} \sum_{t} \log p_t\right) = \left(\prod_{t} p_t\right)^{1/|Y|}$, this is mathematically equivalent to optimizing the geometric mean of the token probabilities, removing the inherent additive bias against longer sequences.
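A quick numeric sketch shows how the normalization changes the ranking. The function name `normalized_score`, the exponent parameter `alpha`, and the two hypothetical sequences are assumptions for illustration; setting `alpha=0` reproduces the unnormalized sum of log-probabilities.

```python
import math

def normalized_score(token_probs, alpha=1.0):
    """Sum of token log-probabilities divided by length ** alpha."""
    log_sum = sum(math.log(p) for p in token_probs)
    return log_sum / (len(token_probs) ** alpha)

# Hypothetical per-token probabilities for a short and a long candidate.
short_seq = [0.5, 0.5]
long_seq = [0.6, 0.6, 0.6, 0.6]

raw_short = normalized_score(short_seq, alpha=0.0)   # about -1.39
raw_long = normalized_score(long_seq, alpha=0.0)     # about -2.04
norm_short = normalized_score(short_seq, alpha=1.0)  # about -0.69
norm_long = normalized_score(long_seq, alpha=1.0)    # about -0.51

# Exponentiating the alpha=1 score recovers the geometric mean token probability.
geo_mean_long = math.exp(norm_long)
```

Without normalization the short sequence wins even though each of its tokens is less likely. With `alpha=1` the long sequence wins, and its exponentiated score equals its geometric mean token probability of 0.6.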
43During text generation, consider a scenario where the vocabulary distribution is completely uniform across a large vocabulary . If we transition from Top- sampling with to nucleus sampling with , what happens to the size of the restricted vocabulary from which we sample?
top-k
Hard
A.It will increase if $p \cdot |V| > 50$, because nucleus sampling dynamically adjusts to the entropy of the uniform distribution.
B.It will remain exactly 50 because uniform distributions bypass the cumulative probability condition.
C.It will collapse to 1, effectively becoming greedy search.
D.It will strictly decrease, regardless of the size of $|V|$.
Correct Answer: It will increase if $p \cdot |V| > 50$, because nucleus sampling dynamically adjusts to the entropy of the uniform distribution.
Explanation:
For a uniform distribution, each token has probability $1/|V|$. To reach a cumulative probability of $p$, nucleus sampling must select $\lceil p \cdot |V| \rceil$ tokens. If $|V|$ is large (e.g., 10,000), this is a sizable fraction of the whole vocabulary, which is much greater than $k = 50$. Top-$p$ is advantageous precisely because it expands the candidate pool for high-entropy distributions.
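The pool sizes can be checked directly. In the sketch below, the threshold `p = 0.9` is an assumed value chosen for illustration, and the function names are hypothetical; exact fractions are used to avoid floating-point rounding at the boundary.

```python
from fractions import Fraction
import math

def nucleus_pool_size_uniform(vocab_size, p):
    """Tokens needed to reach cumulative mass p when every token has prob 1/vocab_size."""
    return math.ceil(Fraction(str(p)) * vocab_size)

def top_k_pool_size(vocab_size, k):
    """Top-k keeps k tokens regardless of the shape of the distribution."""
    return min(k, vocab_size)

vocab_size = 10_000
print(top_k_pool_size(vocab_size, k=50))             # 50
print(nucleus_pool_size_uniform(vocab_size, p=0.9))  # 9000
```

Top-k stays fixed at 50 however flat the distribution is, while the nucleus grows with the vocabulary.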
44Instruction-tuned Large Language Models are often refined using Reinforcement Learning from Human Feedback (RLHF). During the PPO (Proximal Policy Optimization) phase, a Kullback-Leibler (KL) divergence penalty is typically added to the reward. What is the primary analytical purpose of this KL penalty?
instruction-tuned
Hard
A.To prevent the policy model from moving too far from the original Supervised Fine-Tuned (SFT) model, mitigating 'reward hacking' and catastrophic forgetting.
B.To decrease the computational overhead of the reward model by bounding the policy gradients.
C.To enforce syntactic similarity between the generated response and the human-provided reference response.
D.To maximize the entropy of the generated sequences, ensuring the model maintains a diverse vocabulary.
Correct Answer: To prevent the policy model from moving too far from the original Supervised Fine-Tuned (SFT) model, mitigating 'reward hacking' and catastrophic forgetting.
Explanation:
In RLHF, optimizing purely on the reward model often leads to 'reward hacking', where the model exploits flaws in the reward function to get high scores while generating nonsensical or degraded text. The KL divergence penalty ensures the updated policy stays close to the SFT model's distribution, preserving language fluency and stability.
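One common way to express this penalty is to subtract a scaled, sample-based KL estimate from the reward-model score. The sketch below assumes that form; the coefficient `beta`, the function name, and all numbers are hypothetical and not taken from any specific RLHF implementation.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus beta times a per-sample KL estimate.

    The KL estimate is the sum over generated tokens of
    log pi_policy(token) - log pi_sft(token).
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

# A response the policy loves but the SFT model finds very unlikely (drifted far).
drifted = kl_penalized_reward(2.0, [-0.1, -0.1], [-3.0, -3.0])
# A response both models find similarly likely (stayed close).
close = kl_penalized_reward(1.5, [-0.5, -0.5], [-0.5, -0.5])
print(drifted, close)
```

Here the drifted response has the higher raw reward (2.0 versus 1.5) but ends up with the lower shaped reward, which is the intended pull back toward the SFT distribution.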
45When evaluating an LLM on a reasoning task using the BLEU score, the resulting score is extremely low, yet human evaluation shows the model's reasoning is perfectly accurate. Which of the following best explains this divergence?
evaluation metrics
Hard
A.BLEU heavily penalizes the generation of short chains of thought, even if the final answer is correct.
B.BLEU requires multiple references to compute the brevity penalty properly, which is impossible in reasoning tasks.
C.BLEU computes recall rather than precision, which fails to capture the generative completeness of a reasoning path.
D.BLEU measures exact n-gram overlap; logical reasoning tasks can have multiple valid structural phrasing paths that share no n-grams with the reference.
Correct Answer: BLEU measures exact n-gram overlap; logical reasoning tasks can have multiple valid structural phrasing paths that share no n-grams with the reference.
Explanation:
BLEU is an n-gram precision-based metric. It relies on exact lexical overlap between the generated text and the reference. In reasoning, a model can arrive at the correct logic using entirely different synonyms or sentence structures, resulting in a low BLEU score despite high functional correctness.
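The clipped n-gram precision at the core of BLEU can be sketched as follows. The two sentences are hypothetical, and this covers only the precision term (no brevity penalty or smoothing), so it is an illustration rather than a full BLEU implementation.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return clipped / total

reference = "the total is twelve because three times four equals twelve"
candidate = "multiplying 3 by 4 gives 12 so the answer is 12"

print(ngram_precision(candidate, reference, 1))  # 2 of 11 unigrams match
print(ngram_precision(candidate, reference, 2))  # 0.0, no shared bigrams
```

Both sentences express the same correct reasoning, yet the bigram precision is zero, which drives an unsmoothed BLEU score to zero.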
46Perplexity (PPL) is a standard evaluation metric for language models, defined as $\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i})\right)$. However, models with lower perplexity on a validation set do not always generate text that humans judge as higher quality. Which phenomenon best explains this paradox?
perplexity and human judgment measures
Hard
A.Perplexity only measures the precision of the generated text, ignoring recall which is highly valued by human evaluators.
B.Perplexity is evaluated using teacher forcing on human-written text, which does not penalize the model for entering repetitive loops during free-form autoregressive generation.
D.Lower perplexity models often suffer from 'exposure bias', preventing them from generating tokens outside the validation set.
Correct Answer: Perplexity is evaluated using teacher forcing on human-written text, which does not penalize the model for entering repetitive loops during free-form autoregressive generation.
Explanation:
Perplexity is calculated over fixed, ground-truth human text (teacher forcing). A model can assign high probabilities to the next correct token given perfect context (low PPL), but during actual generation (where it consumes its own predictions), it may fall into degenerative, repetitive loops that human evaluators rate poorly.
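As a rough sketch, perplexity under teacher forcing depends only on the probabilities the model assigns to each ground-truth token given the true prefix. The probability lists below are hypothetical stand-ins for those per-token values.

```python
import math

def perplexity(gold_token_probs):
    """exp of the average negative log-probability assigned to the gold tokens."""
    n = len(gold_token_probs)
    return math.exp(-sum(math.log(p) for p in gold_token_probs) / n)

# Each value is P(gold token | gold prefix): the context is always human-written,
# so nothing the model would generate on its own enters the computation.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # about 4: like guessing among 4 options
print(perplexity([0.9, 0.8, 0.95]))          # close to 1: a confident model
```

Nothing in this computation involves the model consuming its own outputs, which is why a low value says little about loops during free-running generation.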
47Chain-of-Thought (CoT) prompting significantly improves performance on reasoning tasks compared to standard prompting. From a computational complexity perspective of standard Transformer-based LLMs, why does generating intermediate reasoning steps increase the model's problem-solving capability?
reasoning tasks
Hard
A.It effectively increases the computational depth allocated to a problem, as each generated token provides another complete forward pass through the model's layers.
B.It forces the model to use exact n-gram matching with the prompt, preventing hallucination in reasoning chains.
C.It allows the model to modify its own internal weights dynamically during the forward pass.
D.It bypasses the self-attention bottleneck by attending only to the prompt and the final answer token.
Correct Answer: It effectively increases the computational depth allocated to a problem, as each generated token provides another complete forward pass through the model's layers.
Explanation:
Transformers have a fixed computational budget per generated token (determined by the number of layers and hidden size). By generating intermediate reasoning steps (tokens), the model performs multiple forward passes before outputting the final answer, effectively expanding the total computation applied to solve the specific query.
48Consider an autoregressive language model generating a sequence using greedy search. The generated text gets stuck in an infinite loop (e.g., 'I went to the store to the store to the store...'). Which mathematical characteristic of the model's learned distribution most directly contributes to this greedy decoding failure?
greedy search
Hard
A.The model forms an absorbing Markov chain state where the high conditional probability of repeating the phrase creates a deterministic local optimum that outscores escaping it.
B.The length penalty is set to a negative value, forcing the model to repeat n-grams.
C.The context window is strictly larger than the loop length, preventing attention heads from attending to previous instances of the loop.
D.The token probabilities are perfectly uniformly distributed.
Correct Answer: The model forms an absorbing Markov chain state where the high conditional probability of repeating the phrase creates a deterministic local optimum that outscores escaping it.
Explanation:
In greedy search, the model strictly chooses the highest-probability ($\arg\max$) token at each step. If a generated phrase strongly conditions the model to predict the same phrase again, it creates a high-probability 'sink' or local optimum. Because greedy search does not explore less probable tokens that might break the cycle, it becomes trapped in an infinite loop.
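The trap can be reproduced with a toy model. The sketch below uses a hypothetical bigram table (a real LLM conditions on the whole prefix, so this is only an analogy), in which the most likely continuation of each token leads back into the same cycle.

```python
def greedy_decode(next_token_probs, start, steps):
    """Always append the single most probable next token."""
    output = [start]
    for _ in range(steps):
        distribution = next_token_probs[output[-1]]
        output.append(max(distribution, key=distribution.get))
    return output

# Hypothetical conditional distributions P(next | current).
toy_model = {
    "to":    {"the": 0.9, "buy": 0.1},
    "the":   {"store": 0.8, "park": 0.2},
    "store": {"to": 0.6, ".": 0.4},   # ending the sentence is plausible but never argmax
}

print(" ".join(greedy_decode(toy_model, "to", 6)))
# to the store to the store to
```

The end-of-sentence token has probability 0.4 after "store", so a sampling strategy would escape quickly, but greedy search never selects it.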
49In the context of LLM hallucination, researchers distinguish between 'intrinsic' and 'extrinsic' hallucinations. An LLM generated a summary stating: 'The CEO of OpenAI, Sam Altman, announced a new model in Paris.' If the source text mentioned the announcement but did not state the location, how is this hallucination classified and why is it notoriously difficult to penalize using standard cross-entropy training?
explainability and hallucination in LLMs
Hard
A.Extrinsic hallucination; cross-entropy maximizes likelihood based on training priors (where announcements often happen in major cities), penalizing the model for factual abstention.
B.Intrinsic hallucination; the model lacks explicit causal attention heads.
C.Intrinsic hallucination; cross-entropy forces the model to ignore factual contradictions.
D.Extrinsic hallucination; standard cross-entropy training cannot be applied to summarization tasks.
Correct Answer: Extrinsic hallucination; cross-entropy maximizes likelihood based on training priors (where announcements often happen in major cities), penalizing the model for factual abstention.
Explanation:
Extrinsic hallucination occurs when the model introduces details not present in (but not strictly contradicting) the source. Standard cross-entropy objective trains the model to replicate the target distribution, which often involves the model relying on its parametric memory (priors) to 'fill in' plausible details, making it hard to train the model to output 'I don't know' or remain strictly faithful to the source.
50A persistent issue in dialogue generation is 'exposure bias'. Which of the following training paradigms is specifically designed to mitigate exposure bias by bridging the gap between training and inference distributions?
dialogue generation
Hard
A.Byte-Pair Encoding (BPE), which reduces out-of-vocabulary tokens during inference.
B.Scheduled Sampling, where the model is increasingly fed its own predictions instead of the ground-truth tokens during training.
C.Teacher Forcing, where the model is strictly trained using the ground-truth previous tokens to stabilize gradients.
D.Knowledge Distillation, where a smaller model learns from the logits of a larger dialogue model.
Correct Answer: Scheduled Sampling, where the model is increasingly fed its own predictions instead of the ground-truth tokens during training.
Explanation:
Exposure bias arises because models are trained with teacher forcing (seeing perfect ground-truth context) but must generate autoregressively using their own potentially flawed predictions at inference. Scheduled sampling slowly replaces ground-truth tokens with the model's own predicted tokens during training, simulating the inference environment.
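The input-mixing step of scheduled sampling can be sketched as below. The linear decay is one of several possible schedules, and the function names and toy token lists are hypothetical; a real training loop would produce `model_prev` from the model's own predictions at each position.

```python
import random

def teacher_forcing_prob(step, total_steps):
    """Linearly decay the chance of feeding the ground-truth token from 1.0 to 0.0."""
    return max(0.0, 1.0 - step / total_steps)

def choose_inputs(gold_prev, model_prev, tf_prob, rng):
    """Per position, feed the gold token with probability tf_prob, else the model's token."""
    return [g if rng.random() < tf_prob else m for g, m in zip(gold_prev, model_prev)]

rng = random.Random(0)
gold_prev = ["how", "are", "you"]
model_prev = ["how", "is", "it"]   # the model's own (imperfect) predictions

early = choose_inputs(gold_prev, model_prev, teacher_forcing_prob(0, 100), rng)
late = choose_inputs(gold_prev, model_prev, teacher_forcing_prob(100, 100), rng)
print(early)  # all gold tokens: pure teacher forcing
print(late)   # all model tokens: matches the inference condition
```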
51In the context of Large Language Models, 'emergent abilities' are capabilities that are not present in smaller models but suddenly appear when the model scale reaches a certain threshold. Which of the following provides the most statistically rigorous critique of emergent abilities as proposed by some recent NLP literature?
large language models
Hard
A.The 'emergence' is often an artifact of using non-linear, discontinuous evaluation metrics (like exact match) rather than smooth, continuous metrics (like Brier score or cross-entropy).
B.Scaling laws predict that all models will forget reasoning abilities if trained for more than one epoch.
C.Emergent abilities are strictly a result of catastrophic forgetting of simple syntax in favor of complex semantics.
D.Emergent abilities only occur in models utilizing Mixture of Experts (MoE) architectures.
Correct Answer: The 'emergence' is often an artifact of using non-linear, discontinuous evaluation metrics (like exact match) rather than smooth, continuous metrics (like Brier score or cross-entropy).
Explanation:
Recent research suggests that apparent 'sharp' emergent abilities are often mirages created by the choice of metric. If a task is measured by a strict threshold (e.g., exact match), performance appears to jump suddenly. When measured by continuous metrics (like token log-probabilities), the improvement is shown to be smooth and predictable across scales.
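A simple calculation shows how the metric alone can create a jump. The sketch assumes, purely for illustration, that an answer has 10 tokens and that token errors are independent, so exact match equals per-token accuracy raised to the answer length.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token is right, assuming independent token errors."""
    return per_token_accuracy ** answer_length

# Per-token accuracy improves smoothly with (hypothetical) model scale...
for accuracy in [0.5, 0.7, 0.9, 0.99]:
    # ...while exact match stays near zero and then rises sharply.
    print(accuracy, round(exact_match_rate(accuracy, answer_length=10), 4))
```

The same steady improvement looks flat and then abrupt when scored all-or-nothing.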
52Contrastive Search is a text generation strategy introduced to prevent degeneration in LLMs. Its objective function at step $t$ is formulated to select a token that maximizes model confidence while minimizing what specific component?
text generation strategies
Hard
A.The KL divergence between the current token distribution and the uniform distribution.
B.The attention weight assigned to the prompt tokens.
C.The cosine similarity between the representation of the candidate token and the representations of all previously generated tokens.
D.The absolute length of the generated sequence.
Correct Answer: The cosine similarity between the representation of the candidate token and the representations of all previously generated tokens.
Explanation:
Contrastive search evaluates candidate tokens by balancing the model's predicted probability (confidence) against a degeneration penalty. This penalty is calculated as the maximum cosine similarity between the hidden representation of the candidate token and the hidden representations of the preceding tokens, actively discouraging repetition.
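The trade-off can be sketched as a weighted difference between confidence and the degeneration penalty. The two-dimensional hidden vectors, the probabilities, and the weight `alpha = 0.6` below are hypothetical; in practice the candidates would be limited to the model's top-ranked tokens.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_score(prob, candidate_hidden, previous_hiddens, alpha=0.6):
    """(1 - alpha) * confidence  -  alpha * max cosine similarity to earlier tokens."""
    penalty = max(cosine(candidate_hidden, h) for h in previous_hiddens)
    return (1 - alpha) * prob - alpha * penalty

previous_hiddens = [[1.0, 0.0], [0.8, 0.6]]

# A confident candidate whose representation duplicates an earlier token.
repeat_score = contrastive_score(0.7, [1.0, 0.0], previous_hiddens)
# A less confident candidate that is dissimilar to the context so far.
novel_score = contrastive_score(0.3, [0.0, 1.0], previous_hiddens)
print(repeat_score, novel_score)
```

The repeated candidate is more probable but scores lower once the similarity penalty is applied, so the novel token is selected.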
53When leveraging LLMs for abstractive summarization, a phenomenon known as 'lead bias' is frequently observed. If an LLM is heavily exhibiting lead bias, how will this manifest in its outputs, and how does the self-attention mechanism theoretically contribute to it in long documents?
model behaviors in summarization
Hard
A.The model generates repetitive filler phrases at the lead of the summary; attention heads fail to normalize weights.
B.The model exclusively extracts verbatim sentences rather than paraphrasing; self-attention strictly enforces exact match routing.
C.The model focuses disproportionately on the beginning of the source text; early tokens act as 'attention sinks' and are heavily attended to across all layers.
D.The model tends to summarize only the concluding paragraphs; attention mechanism decays over long distances, ignoring early tokens.
Correct Answer: The model focuses disproportionately on the beginning of the source text; early tokens act as 'attention sinks' and are heavily attended to across all layers.
Explanation:
Lead bias refers to the tendency to extract or summarize information predominantly from the beginning of a document. In Transformers, initial tokens (often including the prompt or first sentence) act as 'attention sinks'—they accumulate massive attention scores across layers because the softmax needs a place to dump probability mass when subsequent tokens aren't highly relevant, cementing the model's focus on the start of the text.
54A known empirical issue with beam search in neural machine translation and other generative tasks is the 'beam search curse', where increasing the beam size beyond a certain point degrades the BLEU score. What is the primary cause of this degradation?
beam search
Hard
A.Larger beams find sequences with higher global log-probability, but the model's probability distribution is poorly calibrated and actually assigns the highest probabilities to overly short, inadequate sequences.
B.Larger beam sizes force the algorithm into a greedy search paradigm, removing diversity.
C.Larger beams cause the model to exceed the context window, truncating the output.
D.Larger beams require a negative length penalty, punishing the model for outputting any text.
Correct Answer: Larger beams find sequences with higher global log-probability, but the model's probability distribution is poorly calibrated and actually assigns the highest probabilities to overly short, inadequate sequences.
Explanation:
Beam search is an approximate search for the maximum likelihood sequence. A wider beam does a better job of finding the true maximum likelihood sequence. However, neural sequence models are often poorly calibrated and assign the highest overall probability to short, generic, or truncated sequences. A smaller beam size inadvertently acts as a regularizer against this flaw.
55Which of the following describes a key architectural difference between Causal Language Models (like GPT-3) and Masked Language Models (like BERT) that fundamentally makes Causal LMs better suited for zero-shot generative prompting?
generative NLP models
Hard
A.Causal LMs optimize the entire sequence probability simultaneously using a CRF layer, allowing coherent long-form generation.
B.Causal LMs use a lower-triangular causal mask in self-attention, naturally aligning their pre-training objective with left-to-right autoregressive text generation.
C.Masked LMs use absolute positional embeddings, whereas Causal LMs use no positional embeddings, allowing infinite text generation.
D.Causal LMs possess a bidirectional encoder that processes the prompt perfectly before generating text, while Masked LMs can only process text unidirectionally.
Correct Answer: Causal LMs use a lower-triangular causal mask in self-attention, naturally aligning their pre-training objective with left-to-right autoregressive text generation.
Explanation:
Causal LMs use a masked self-attention mechanism where each token can attend only to itself and to previous tokens (lower-triangular mask). This left-to-right autoregressive pre-training exactly matches the inference mechanism of reading a prompt and predicting the next word, enabling powerful zero-shot generation. Masked LMs rely on bidirectional context and are trained to fill in blanks, making standard left-to-right generation unnatural for them.
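The mask itself is easy to write down. In this minimal sketch, entry `[i][j]` is 1 when position `i` may attend to position `j`; the function name is hypothetical.

```python
def causal_mask(sequence_length):
    """Lower-triangular mask: position i may attend to positions 0..i (itself included)."""
    return [[1 if j <= i else 0 for j in range(sequence_length)]
            for i in range(sequence_length)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

A Masked LM such as BERT would correspond to an all-ones matrix here, since every position can see both left and right context.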
56To explain the output of a generative LLM, researchers often use Integrated Gradients. For a specific generated token $y_t$, Integrated Gradients computes the path integral of gradients from a baseline input $x'$ to the actual input $x$. Why is computing Integrated Gradients for generative LLMs significantly more complex than for standard classification models?
explainability and hallucination in LLMs
Hard
A.Generative LLMs use discrete token inputs; constructing a continuous interpolation path from a baseline token embedding to the input token embedding may pass through regions of the embedding space that correspond to no valid token.
B.Integrated Gradients can only be applied to CNNs, as self-attention matrices do not have well-defined gradients.
C.Generative models do not have an objective function during inference, so gradients cannot be calculated.
D.The softmax function at the output layer of an LLM is not differentiable.
Correct Answer: Generative LLMs use discrete token inputs; constructing a continuous interpolation path from a baseline token embedding to the input token embedding may pass through regions of the embedding space that correspond to no valid token.
Explanation:
Integrated Gradients requires interpolating inputs along a straight line from a baseline to the actual input and accumulating the gradients. Because language relies on discrete tokens, interpolating in the continuous embedding space traverses meaningless latent areas that the model never saw during training, complicating interpretation.
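The interpolate-and-accumulate procedure can be sketched on a toy differentiable function. Here `f` is a hypothetical stand-in for the model's score of the generated token as a function of a continuous input vector, and the baseline is all zeros; every intermediate `point` is the kind of in-between vector that, for an LLM, would match no real token embedding.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG with a midpoint Riemann sum along the straight-line path."""
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        a = (s + 0.5) / steps  # interpolation coefficient in (0, 1)
        point = [b + a * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

def f(v):
    return sum(c * c for c in v)       # toy scalar output

def grad_f(v):
    return [2 * c for c in v]          # its analytic gradient

x = [1.0, -2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
print(attributions)                    # approximately [1.0, 4.0, 0.25]
print(sum(attributions), f(x) - f(baseline))
```

The attributions sum to the difference between the output at the input and at the baseline, which is the completeness property that makes the method attractive.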
57In 'Self-Consistency' decoding for LLM reasoning tasks, multiple distinct reasoning paths are sampled, and the final answer is selected via majority vote. For self-consistency to be effective, which decoding strategy MUST be utilized during the generation of the paths?
reasoning tasks
Hard
A.Contrastive search with a high degeneration penalty.
B.A non-deterministic sampling strategy (like temperature sampling with $T > 0$) to ensure diverse reasoning paths are generated.
C.Greedy search, to ensure the model produces its most confident reasoning path every time.
D.Beam search with a beam size of 1.
Correct Answer: A non-deterministic sampling strategy (like temperature sampling with $T > 0$) to ensure diverse reasoning paths are generated.
Explanation:
Self-consistency relies on generating multiple different reasoning paths to see if they converge on the same answer. If a deterministic strategy like greedy decoding or beam search of size 1 is used, the model will output the exact same sequence every time, rendering majority voting useless. Temperature sampling injects the necessary diversity.
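The voting step is simple to sketch. The answer lists below are hypothetical final answers extracted from five reasoning paths; the sampling itself is not shown.

```python
from collections import Counter

def majority_answer(final_answers):
    """Return the most frequent final answer across the sampled reasoning paths."""
    return Counter(final_answers).most_common(1)[0][0]

# Diverse paths from temperature sampling: occasional errors get outvoted.
sampled_answers = ["18", "18", "26", "18", "17"]
print(majority_answer(sampled_answers))  # 18

# Deterministic decoding repeats one path, so the vote adds no information.
greedy_answers = ["26"] * 5
print(majority_answer(greedy_answers))   # 26
```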
58During the creation of an instruction-tuned model, Supervised Fine-Tuning (SFT) is typically performed before RLHF. If the SFT phase trains the model exclusively on examples formatted as User: [Query] Assistant: [Response], and a user at inference prompts the model with System: [Directive], what failure mode is most likely to occur and why?
instruction-tuned
Hard
A.The model will switch to purely extractive summarization because instructions without a User: prefix are interpreted as documents.
B.The model will generate a sequence of [PAD] tokens because the attention mechanism will crash on unrecognized text.
C.The model will trigger an intrinsic hallucination because System is a reserved keyword in all LLM tokenizers.
D.Out-of-distribution formatting failure; the model has learned strict structural priors during SFT and may hallucinate a User: tag or generate degraded text when the prompt does not match the exact SFT template.
Correct Answer: Out-of-distribution formatting failure; the model has learned strict structural priors during SFT and may hallucinate a User: tag or generate degraded text when the prompt does not match the exact SFT template.
Explanation:
Instruction tuning aggressively shifts the model's distribution to expect highly specific prompt templates. If inference prompts do not match the SFT template (e.g., introducing a System: tag when only User: and Assistant: were trained), the input becomes out-of-distribution, often leading the model to hallucinate the expected formatting tags or fail to understand the instruction boundary.
59ROUGE-L uses the Longest Common Subsequence (LCS) to evaluate summarization. Let the reference be $X$ and the generation be $Y$, and suppose the length of their LCS is 3. What is a key mathematical advantage of LCS in ROUGE-L over using purely contiguous n-gram overlaps (like ROUGE-2)?
evaluation metrics
Hard
A.LCS automatically incorporates an exponential brevity penalty, replacing the need for an explicit length threshold.
B.LCS requires that the matching sequence be perfectly adjacent, heavily penalizing dropped words.
C.LCS computes semantic similarity in the embedding space rather than relying on lexical matching.
D.LCS naturally captures sentence-level structure by identifying in-sequence matches without requiring consecutive n-gram matches, allowing flexibility for insertion of novel words.
Correct Answer: LCS naturally captures sentence-level structure by identifying in-sequence matches without requiring consecutive n-gram matches, allowing flexibility for insertion of novel words.
Explanation:
The Longest Common Subsequence does not require the matching words to be strictly contiguous, only that they appear in the same relative order. This gives ROUGE-L an advantage over n-gram metrics (which require strict adjacency), as it credits the model for maintaining the correct overall flow and structure of the sentence, even if new words are inserted.
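The difference from contiguous matching can be checked with a standard dynamic-programming LCS. The two sentences are hypothetical examples chosen so that the LCS has length 3; they are not the ones from the question.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

reference = "the cat sat on the mat".split()
generation = "the cat quietly sat down".split()

lcs = lcs_length(reference, generation)         # "the cat sat" in order
print(lcs)                                      # 3
print(lcs / len(reference))                     # ROUGE-L recall: 0.5
print(lcs / len(generation))                    # ROUGE-L precision: 0.6
print(len(bigrams(reference) & bigrams(generation)))  # only 1 shared bigram
```

The inserted word "quietly" breaks the bigram "cat sat" but leaves the in-order match intact, so the LCS still credits all three words.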
60A massive LLM is deployed using pipeline parallelism across multiple GPUs. If the model exhibits high latency during autoregressive token generation (decoding phase) compared to the prefill phase (processing the prompt), what is the primary architectural bottleneck causing this difference?
large language models
Hard
A.The prefill phase operates sequentially on tokens, whereas the decoding phase evaluates all future tokens in parallel.
B.The decoding phase requires loading the full Key-Value (KV) cache from VRAM for every single generated token, severely bottlenecking memory bandwidth compared to the highly parallel matrix multiplications in the prefill phase.
C.The softmax operation cannot be parallelized across multiple GPUs, meaning decoding must happen on a single CPU node.
D.Autoregressive generation uses backpropagation at every step, whereas the prefill phase only uses a forward pass.
Correct Answer: The decoding phase requires loading the full Key-Value (KV) cache from VRAM for every single generated token, severely bottlenecking memory bandwidth compared to the highly parallel matrix multiplications in the prefill phase.
Explanation:
During the prefill phase, the entire prompt is processed simultaneously, allowing highly efficient, compute-bound dense matrix multiplications. During autoregressive decoding, tokens are generated one by one. Generating each token requires fetching the model weights and the entire historical KV cache for the sequence from memory, making the generation phase highly memory-bandwidth bound rather than compute-bound.
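A rough, roofline-style estimate illustrates the compute-bound versus memory-bound contrast. The sketch considers a single square weight matrix in 16-bit precision and counts only weight traffic; it ignores activations and KV cache reads (which add further memory traffic during decoding), and the sizes are hypothetical.

```python
def arithmetic_intensity(num_tokens, d_model, bytes_per_weight=2):
    """FLOPs performed per byte of weights read for one d_model x d_model projection."""
    flops = 2 * num_tokens * d_model * d_model       # one multiply-add per weight per token
    weight_bytes = bytes_per_weight * d_model * d_model
    return flops / weight_bytes

# Prefill: a 1024-token prompt reuses each loaded weight 1024 times.
print(arithmetic_intensity(num_tokens=1024, d_model=4096))  # 1024.0
# Decoding: one new token per step, so each loaded weight is used once.
print(arithmetic_intensity(num_tokens=1, d_model=4096))     # 1.0
```

At about one FLOP per byte, the decode step spends most of its time waiting on memory reads rather than on arithmetic.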