1Which foundational 2017 paper introduced the Transformer architecture?
transformer architecture
Easy
A.Attention Is All You Need
B.Deep Residual Learning for Image Recognition
C.BERT: Pre-training of Deep Bidirectional Transformers
D.Language Models are Few-Shot Learners
Correct Answer: Attention Is All You Need
Explanation:
The Transformer architecture was introduced in the seminal paper 'Attention Is All You Need' by Vaswani et al. in 2017.
2What is the primary function of the self-attention mechanism in a Transformer?
self-attention
Easy
A.To translate words from a source language to a target language word-by-word
B.To compress the input sequence into a fixed-length vector using recurrence
C.To mask out unknown words in the input text
D.To relate different positions of a single sequence to compute a representation of the sequence
Correct Answer: To relate different positions of a single sequence to compute a representation of the sequence
Explanation:
Self-attention allows the model to look at other words in the same input sequence to gather context and better understand the relationships between words.
3Why does the Transformer use multi-head attention instead of a single attention function?
multi-head attention
Easy
A.It replaces the need for tokenization in NLP tasks
B.It reduces the number of parameters in the model
C.It allows the model to jointly attend to information from different representation subspaces at different positions
D.It prevents the model from overfitting on small datasets
Correct Answer: It allows the model to jointly attend to information from different representation subspaces at different positions
Explanation:
Multi-head attention runs multiple attention mechanisms in parallel, allowing the model to focus on different aspects of the text (e.g., syntax, semantics) simultaneously.
4Why is positional encoding necessary in a Transformer model?
positional encoding
Easy
A.Because it normalizes the input embeddings to have a mean of zero
B.Because Transformers process sequences in parallel and lack inherent notions of sequence order
C.Because it is required to calculate the softmax function properly
D.Because it reduces the memory footprint of the attention mechanism
Correct Answer: Because Transformers process sequences in parallel and lack inherent notions of sequence order
Explanation:
Unlike RNNs, Transformers process all tokens simultaneously. Positional encodings are added to embeddings to inject information about the relative or absolute position of the tokens.
5Which component is present in a standard Transformer decoder block but NOT in an encoder block?
transformer encoder and decoder blocks
Easy
A.Layer Normalization
B.Masked Multi-Head Self-Attention
C.Feed-Forward Neural Network
D.Multi-Head Self-Attention
Correct Answer: Masked Multi-Head Self-Attention
Explanation:
The decoder contains a masked self-attention layer to prevent positions from attending to subsequent (future) positions during text generation.
6What is the primary advantage of subword tokenization methods over traditional word-level tokenization?
tokenization methods
Easy
A.They are strictly rule-based and do not require training on a corpus
B.They ensure that every word is represented by a single, unique integer ID
C.They help handle Out-Of-Vocabulary (OOV) words by breaking them into known subwords
D.They completely eliminate the need for an embedding layer
Correct Answer: They help handle Out-Of-Vocabulary (OOV) words by breaking them into known subwords
Explanation:
Subword tokenization breaks rare or unknown words into smaller, frequent chunks (like prefixes or suffixes), effectively reducing the OOV problem while maintaining a reasonable vocabulary size.
7How does the Byte-Pair Encoding (BPE) algorithm initially build its vocabulary?
Byte-Pair Encoding and WordPiece
Easy
A.It hashes every word in the training corpus into a fixed-size vector space
B.It starts with whole sentences and breaks them down using linguistic rules
C.It starts with a vocabulary of individual characters and iteratively merges the most frequent adjacent pairs
D.It uses a pre-defined dictionary from WordNet to select valid tokens
Correct Answer: It starts with a vocabulary of individual characters and iteratively merges the most frequent adjacent pairs
Explanation:
BPE is a data compression algorithm adapted for tokenization. It begins with individual characters and merges the most frequently occurring adjacent pairs until a target vocabulary size is reached.
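The merge loop described above can be sketched in a few lines of Python. This is a toy illustration, not a production tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) and the tiny word-count corpus are assumptions made for this example, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_counts, num_merges):
    """Start from single characters and apply up to `num_merges` merges."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Hypothetical toy corpus: word -> frequency.
merges, segmented = train_bpe({"aab": 3, "ab": 2}, num_merges=2)
```

In practice the loop stops when the vocabulary (initial characters plus one new symbol per merge) reaches the target size rather than after a fixed number of merges.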
8What does 'pretraining' mean in the context of large language models?
pretrained transformer models
Easy
A.Testing a model on a validation set to tune its hyperparameters
B.Hardcoding linguistic rules into the model before deployment
C.Training a model explicitly on human feedback and supervised labels from the very beginning
D.Training a model on a large corpus of unlabeled text using a self-supervised objective before adapting it to specific tasks
Correct Answer: Training a model on a large corpus of unlabeled text using a self-supervised objective before adapting it to specific tasks
Explanation:
Pretraining involves exposing the model to vast amounts of text data using tasks like predicting hidden words, allowing it to learn general language patterns before being fine-tuned.
9What does the acronym BERT stand for?
BERT
Easy
A.Basic Entity Recognition from Text
B.Binary Encoded Recurrent Transformers
C.Bidirectional Evaluation of Recurrent Transformers
D.Bidirectional Encoder Representations from Transformers
Correct Answer: Bidirectional Encoder Representations from Transformers
Explanation:
BERT stands for Bidirectional Encoder Representations from Transformers, reflecting its architecture (Transformer encoder) and its bidirectional context processing.
10The GPT (Generative Pre-trained Transformer) architecture is primarily based on which part of the original Transformer?
GPT
Easy
A.The Cross-Attention Mechanism
B.The Decoder
C.Only the Positional Encodings
D.The Encoder
Correct Answer: The Decoder
Explanation:
GPT models use a decoder-only architecture, which includes masked self-attention to generate text autoregressively (predicting the next token based on previous ones).
11What is the defining characteristic of the T5 (Text-to-Text Transfer Transformer) framework?
T5
Easy
A.It casts every NLP task as a text-to-text problem, where both input and output are strings
B.It processes audio and text simultaneously in a single model
C.It uses reinforcement learning instead of cross-entropy loss
D.It uses only an encoder architecture and cannot generate text
Correct Answer: It casts every NLP task as a text-to-text problem, where both input and output are strings
Explanation:
T5 reformulates all NLP tasks (like translation, summarization, and classification) into a unified text-to-text format, taking text as input and producing text as output.
12In Masked Language Modeling (MLM), what is the model trained to do?
masked language modeling
Easy
A.Predict the very next sentence in a document
B.Translate a masked sentence into another language
C.Generate an entire paragraph from a single prompt word
D.Predict the original identity of tokens that have been randomly replaced with a [MASK] token
Correct Answer: Predict the original identity of tokens that have been randomly replaced with a [MASK] token
Explanation:
In MLM (used by BERT), a percentage of input tokens are masked, and the model must use the surrounding bidirectional context to predict what the masked tokens are.
13What is the objective of the Next Sentence Prediction (NSP) task used during BERT's pretraining?
next sentence prediction
Easy
A.To generate the subsequent sentence given a starting phrase
B.To predict the exact word count of the following sentence
C.To predict whether Sentence B logically and sequentially follows Sentence A in the original document
D.To classify the grammatical correctness of the next sentence
Correct Answer: To predict whether Sentence B logically and sequentially follows Sentence A in the original document
Explanation:
NSP is a binary classification task where the model learns to understand sentence relationships by predicting if two sentences appear consecutively in the source text.
14Which of the following best describes Causal Language Modeling (CLM)?
causal language modeling
Easy
A.Predicting a masked word using both left and right context
B.Finding the root cause of grammatical errors in a sentence
C.Classifying whether a text expresses a positive or negative sentiment
D.Predicting the next token in a sequence given only the previous tokens (left-to-right)
Correct Answer: Predicting the next token in a sequence given only the previous tokens (left-to-right)
Explanation:
Causal Language Modeling, also known as autoregressive modeling, trains a model to predict the next word in a sequence using only the past (causal) context. This is how GPT models are trained.
15What is a major benefit of using transfer learning in NLP?
transfer learning for NLP tasks
Easy
A.It allows a model pretrained on a massive corpus to achieve high performance on downstream tasks with relatively little labeled data
B.It completely removes the need to tokenize text
C.It eliminates the need for computational resources like GPUs
D.It guarantees that the model will never produce biased or toxic outputs
Correct Answer: It allows a model pretrained on a massive corpus to achieve high performance on downstream tasks with relatively little labeled data
Explanation:
Transfer learning allows the rich language understanding gained during pretraining on massive datasets to be transferred to specific, smaller tasks through fine-tuning.
16When fine-tuning BERT for a sequence classification task (like sentiment analysis), which special token's final hidden state is typically passed to the classification layer?
fine-tuning for text classification
Easy
A.[SEP]
B.[MASK]
C.[PAD]
D.[CLS]
Correct Answer: [CLS]
Explanation:
In BERT, the [CLS] (classification) token is prepended to the input sequence. Its final hidden state aggregates sequence-level information and is commonly used as the input to a classification head.
17Named Entity Recognition (NER) using Transformers is typically modeled as what type of task?
named entity recognition
Easy
A.Sequence-to-sequence generation
B.Document clustering
C.Token classification
D.Next sentence prediction
Correct Answer: Token classification
Explanation:
NER involves assigning a label (like Person, Organization, or Location) to each individual word/token in a sequence, making it a token classification task.
18In an extractive Question Answering task using a model like BERT, what does the model output?
question answering
Easy
A.A newly generated sentence summarizing the answer
B.The start and end token indices of the answer span within the provided context
C.A simple 'Yes' or 'No' based on the question
D.The Wikipedia URL containing the answer
Correct Answer: The start and end token indices of the answer span within the provided context
Explanation:
In extractive QA, the model identifies the exact segment of text that answers the question by predicting a start pointer and an end pointer over the context sequence.
19What is the primary purpose of the HuggingFace Transformers library?
HuggingFace Transformers
Easy
A.To provide open-source APIs, tools, and pretrained weights for state-of-the-art NLP models
B.To serve as a cloud storage database for raw text datasets
C.To replace Python's standard string manipulation functions
D.To provide a proprietary operating system for machine learning servers
Correct Answer: To provide open-source APIs, tools, and pretrained weights for state-of-the-art NLP models
Explanation:
HuggingFace Transformers is a highly popular open-source library that simplifies downloading, utilizing, and fine-tuning state-of-the-art pretrained transformer models.
20In the context of the Transformer architecture, what is passed through the Feed-Forward Neural Network in an encoder block?
transformer architecture
Easy
A.The final predictions of the decoder
B.Only the positional encodings
C.The output of the multi-head self-attention layer (after add and norm)
D.The raw, un-tokenized input string
Correct Answer: The output of the multi-head self-attention layer (after add and norm)
Explanation:
Inside an encoder block, the representations first pass through a multi-head self-attention mechanism, are added and normalized, and then pass through a position-wise Feed-Forward Neural Network.
21In the scaled dot-product attention mechanism, why are the dot products scaled by the inverse square root of the dimension of the key vectors ($1/\sqrt{d_k}$)?
self-attention
Medium
A.To prevent the softmax gradients from vanishing
B.To reduce the computational complexity of the dot product
C.To match the dimensions of the queries and keys
D.To normalize the probabilities to sum to 1
Correct Answer: To prevent the softmax gradients from vanishing
Explanation:
When the dimensionality of the key vectors is large, the dot products can grow very large in magnitude, pushing the softmax function into regions where gradients are extremely small. Scaling by $1/\sqrt{d_k}$ counteracts this effect.
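As a rough illustration of where the scaling enters, here is a minimal pure-Python sketch of scaled dot-product attention for a single query. The function names and the list-of-lists representation are assumptions for this example; real implementations operate on batched tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(query, keys, values):
    """Attend from one query vector over a list of key/value vectors."""
    d_k = len(query)
    # Dot product of the query with every key, divided by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return output, weights

out, attn = scaled_dot_product_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

Without the division by $\sqrt{d_k}$, the scores grow with the key dimension and the softmax weights saturate toward a one-hot vector.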
22What is the primary advantage of using Multi-Head Attention over a single attention function in a Transformer?
multi-head attention
Medium
A.It reduces the total number of parameters in the attention mechanism
B.It removes the need for positional encodings
C.It allows the model to jointly attend to information from different representation subspaces at different positions
D.It decreases the time complexity of the self-attention operation
Correct Answer: It allows the model to jointly attend to information from different representation subspaces at different positions
Explanation:
Multi-head attention projects the queries, keys, and values into multiple different subspaces. This enables the model to focus on different types of relationships (e.g., syntactic vs. semantic) simultaneously.
23How are sine and cosine functions utilized for positional encoding in the original Transformer architecture?
positional encoding
Medium
A.They are learned continuously during the training process via backpropagation
B.Different frequencies of sine and cosine are used for different dimensions of the positional encoding vector
C.Sine functions encode odd sequence positions and cosine functions encode even sequence positions
D.They modulate the attention weights directly before applying the softmax function
Correct Answer: Different frequencies of sine and cosine are used for different dimensions of the positional encoding vector
Explanation:
The original Transformer uses sine and cosine functions of different frequencies, where sine is used for even dimensions and cosine for odd dimensions. This allows the model to easily learn to attend by relative positions.
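A minimal sketch of the encoding for a single position, following the formulas from the original paper ($PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$). The function name is an assumption, and an even `d_model` is assumed for simplicity.

```python
import math

def sinusoidal_positional_encoding(pos, d_model):
    """Return the d_model-dimensional encoding for one position (d_model even)."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each sine/cosine pair shares one frequency; wavelengths grow
        # geometrically as the dimension index increases.
        angle = pos / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension i
        encoding.append(math.cos(angle))  # odd dimension i + 1
    return encoding

pe_first = sinusoidal_positional_encoding(0, 8)
pe_fifth = sinusoidal_positional_encoding(4, 8)
```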
24In a standard Transformer decoder block, what is the purpose of the masked multi-head attention layer?
transformer encoder and decoder blocks
Medium
A.To prevent the decoder from attending to future tokens in the target sequence during training
B.To regularize the model by randomly dropping attention weights like dropout
C.To mask out padding tokens from the input sequence
D.To compute the alignment between the encoder output and the target sequence
Correct Answer: To prevent the decoder from attending to future tokens in the target sequence during training
Explanation:
During training, the masked attention layer ensures that the prediction for position $i$ can depend only on the known outputs at positions less than $i$, preserving the autoregressive property.
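One common way to implement this is a lower-triangular mask applied to the attention scores before the softmax. The sketch below assumes the decoder input is already shifted right, so letting row $i$ see columns $j \le i$ of the shifted input corresponds to seeing only earlier outputs; the function names are hypothetical.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, with disallowed positions set to -inf."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
first_row = masked_softmax([1.0, 2.0, 3.0], mask[0])
last_row = masked_softmax([1.0, 2.0, 3.0], mask[2])
```

Masked positions receive exactly zero attention weight, so no information from future tokens reaches the prediction.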
25When training a Byte-Pair Encoding (BPE) tokenizer, how are new subwords formed?
Byte-Pair Encoding and WordPiece
Medium
A.By iteratively merging the most frequently occurring pair of adjacent symbols
B.By combining characters that maximize the likelihood of the training data language model
C.By replacing rare words with an <UNK> token directly
D.By splitting words based on linguistic morphological rules
Correct Answer: By iteratively merging the most frequently occurring pair of adjacent symbols
Explanation:
BPE is a data compression technique adapted for NLP. It initializes the vocabulary with individual characters and iteratively merges the most frequent adjacent pairs of symbols to create new subword tokens.
26Which components make up the input representation for a given token in the BERT model?
BERT
Medium
A.Word Embeddings, Character Embeddings, and Mask Embeddings
B.Token Embeddings, Syntactic Embeddings, and Position Embeddings
C.Byte-Pair Embeddings and Absolute Position Embeddings
D.Token Embeddings, Segment Embeddings, and Position Embeddings
Correct Answer: Token Embeddings, Segment Embeddings, and Position Embeddings
Explanation:
The input representation for BERT is constructed by summing the corresponding token embedding (representing the subword), segment embedding (representing sentence A or B), and position embedding (representing the sequence position).
27During the pretraining phase of BERT using Masked Language Modeling (MLM), how are the selected tokens replaced?
masked language modeling
Medium
A.100% replaced with the [MASK] token
B.80% kept unchanged, 10% with [MASK], 10% with a random token
C.80% with [MASK], 10% with a random token, 10% kept unchanged
D.50% with [MASK] and 50% with a random token
Correct Answer: 80% with [MASK], 10% with a random token, 10% kept unchanged
Explanation:
To mitigate the mismatch between pretraining (where [MASK] is seen) and fine-tuning (where it isn't), BERT selects 15% of tokens for prediction. Of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged.
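The 80/10/10 rule can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: it selects each token independently with probability 0.15 instead of sampling a fixed number of positions, and the function name, the `rng` argument, and the toy vocabulary are assumptions.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

rng = random.Random(0)
sentence = ["the", "cat", "sat", "on", "the", "mat"]
toy_vocab = ["dog", "ran", "hat"]
corrupted, labels = mask_tokens(sentence, toy_vocab, rng)
```

The loss is computed only at positions where the label is not `None`, including the positions whose token was left unchanged.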
28Which of the following best describes the architecture of the Generative Pre-trained Transformer (GPT) models?
GPT
Medium
A.Autoregressive decoder-only architecture
B.Bidirectional encoder-only architecture
C.Masked language model architecture
D.Encoder-decoder architecture with cross-attention
Correct Answer: Autoregressive decoder-only architecture
Explanation:
GPT models are built using only the decoder part of the original Transformer architecture. They generate text autoregressively, predicting the next token based on previous tokens.
29In causal language modeling, what objective function is the model primarily optimizing?
causal language modeling
Medium
A.Predicting randomly masked words within a bidirectional context
B.Maximizing the likelihood of predicting the next token given all previous tokens in the sequence
C.Classifying whether the second half of a sequence logically follows the first half
D.Minimizing the reconstruction error of a corrupted input sequence
Correct Answer: Maximizing the likelihood of predicting the next token given all previous tokens in the sequence
Explanation:
Causal Language Modeling (CLM) trains the model to predict the next token based exclusively on the preceding context. This strictly left-to-right approach is what makes it 'causal'.
30How does the Text-to-Text Transfer Transformer (T5) format its input and output for different NLP tasks?
T5
Medium
A.Every task is cast as feeding a text sequence as input and generating a new text sequence as output
B.Tasks are differentiated by adding task-specific classification heads on top of the encoder
C.It uses unique segment embeddings for each NLP task type
D.It requires replacing the decoder block for every specific task like translation or summarization
Correct Answer: Every task is cast as feeding a text sequence as input and generating a new text sequence as output
Explanation:
T5 uses a unified text-to-text framework. Whether the task is translation, classification, or regression, both the input and the target are represented as text strings.
31When fine-tuning a BERT model for text classification, which vector is typically passed to the final classification layer?
fine-tuning for text classification
Medium
A.The final hidden state of the [SEP] token
B.The final hidden state corresponding to the [CLS] token
C.The concatenation of the first and last token's hidden states
D.The average pooling of all tokens' final hidden states
Correct Answer: The final hidden state corresponding to the [CLS] token
Explanation:
In BERT, the special [CLS] (classification) token is added to the beginning of every sequence. Its final hidden state aggregates sequence-wide information and is conventionally used for sentence-level classification.
32In the HuggingFace Transformers library, what is the role of the AutoModelForSequenceClassification class?
HuggingFace Transformers
Medium
A.It automatically performs hyperparameter tuning for classification metrics
B.It automatically instantiates a base model with a sequence classification head on top
C.It searches the model hub for the best performing model for text classification
D.It dynamically changes the vocabulary size based on the classification task
Correct Answer: It automatically instantiates a base model with a sequence classification head on top
Explanation:
The AutoModelForSequenceClassification class loads a pretrained base model and automatically attaches an untrained classification head matching the specified number of labels.
33In BERT's Next Sentence Prediction (NSP) task, what defines a positive pair during pretraining?
next sentence prediction
Medium
A.Two sentences that are semantically similar according to cosine similarity
B.Two sentences extracted from different documents but sharing the same topic
C.Two sentences where one is a direct paraphrase of the other
D.Two sentences that appear consecutively in the original corpus
Correct Answer: Two sentences that appear consecutively in the original corpus
Explanation:
For the NSP task, BERT takes pairs of sentences. A positive pair (label IsNext) consists of two sentences that are actually sequential in the source text, helping the model learn sentence relationships.
34How is fine-tuning adapted for a Token Classification task like Named Entity Recognition (NER) using a transformer?
named entity recognition
Medium
A.The output embeddings are clustered to find entity boundaries
B.A classification head is applied to the final hidden state of every individual token in the sequence
C.A single classification head is applied only to the [CLS] token
D.The model is trained to generate the entity labels sequentially as text
Correct Answer: A classification head is applied to the final hidden state of every individual token in the sequence
Explanation:
NER requires classifying each word/token as part of an entity or not. Therefore, the dense classification layer is applied to the final hidden states of all tokens in the sequence, not just the [CLS] token.
35When applying a transformer model to Extractive Question Answering (like SQuAD), how is the answer extracted from the context?
question answering
Medium
A.By predicting two probabilities for each token in the context: one for being the start of the answer span and one for being the end
B.By autoregressively generating the text of the answer word-by-word
C.By ranking multiple choice answers based on cosine similarity with the question
D.By classifying the context into predefined answer categories
Correct Answer: By predicting two probabilities for each token in the context: one for being the start of the answer span and one for being the end
Explanation:
In extractive QA, the model does not generate new text. Instead, it places a classification head on top of the context tokens to predict the start and end positions of the answer span.
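Given per-token start and end scores, the answer span is typically chosen as the valid pair (start no later than end) with the highest combined score. The sketch below assumes the model has already produced two lists of logits over the context tokens; the function name and the `max_answer_len` cap are hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) token indices maximizing start_logit + end_logit."""
    best, best_score = None, float("-inf")
    for start, start_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# Hypothetical logits for a four-token context.
span = best_span([0.1, 2.0, 0.3, 0.0], [0.0, 0.1, 1.5, 0.2])
```

The answer text is then recovered by slicing the context tokens from the start index through the end index.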
36Which component exists in a Transformer Decoder block but is ABSENT from a Transformer Encoder block?
Explanation:
The decoder block contains an additional multi-head attention layer that performs cross-attention over the output of the encoder stack. The encoder only contains self-attention.
37What is the main benefit of using transfer learning in modern NLP?
transfer learning for NLP tasks
Medium
A.Pretraining on massive unlabelled corpora allows models to learn rich language representations, reducing the need for large labeled datasets during fine-tuning
B.It completely eliminates the need for computational resources during the fine-tuning phase
C.It ensures the model cannot overfit on the downstream task regardless of the training time
D.It allows language models to translate between languages without any task-specific data
Correct Answer: Pretraining on massive unlabelled corpora allows models to learn rich language representations, reducing the need for large labeled datasets during fine-tuning
Explanation:
Transfer learning works by initializing models with weights learned from self-supervised tasks on huge datasets. This means they already understand grammar and semantics, requiring far less labeled data to learn a specific task.
38How does WordPiece tokenization differ primarily from Byte-Pair Encoding (BPE)?
Byte-Pair Encoding and WordPiece
Medium
A.WordPiece cannot handle out-of-vocabulary words and assigns them to an <UNK> token
B.WordPiece operates on complete words rather than subwords or characters
C.WordPiece relies strictly on linguistic rules such as stemming and lemmatization
D.WordPiece chooses symbol pairs that maximize the likelihood of the training data given the language model, rather than just frequency
Correct Answer: WordPiece chooses symbol pairs that maximize the likelihood of the training data given the language model, rather than just frequency
Explanation:
While BPE purely merges the most frequent pairs, WordPiece evaluates pairs based on how much merging them increases the likelihood of the training data under a unigram language model.
39In the standard Transformer architecture, what specific type of neural network follows the attention mechanisms within both encoder and decoder layers?
transformer architecture
Medium
A.A Convolutional Neural Network (CNN) to capture local n-gram features
B.A Recurrent Neural Network (RNN) to process the sequential output
C.A Position-wise Feed-Forward Network applied identically to each position separately
D.A Max-Pooling layer to reduce the dimensionality of the sequence
Correct Answer: A Position-wise Feed-Forward Network applied identically to each position separately
Explanation:
Every encoder and decoder block contains a Position-wise Feed-Forward Network (FFN), consisting of two linear transformations with a ReLU activation in between, applied to each token position independently.
40Given an input sequence of length $n$ and hidden dimension $d$, what is the computational time complexity of the self-attention operation?
self-attention
Medium
A.
B.
C.
D.
Correct Answer: $O(n^2 \cdot d)$
Explanation:
In self-attention, computing the attention matrix requires taking the dot product of the query and key vectors for every pair of tokens. This results in $n^2$ combinations, each taking $O(d)$ operations, leading to a complexity of $O(n^2 \cdot d)$.
41In the standard scaled dot-product attention, the dot products of queries and keys are divided by $\sqrt{d_k}$. What is the primary theoretical justification for this specific scaling factor?
self-attention
Hard
A.It ensures that the computational complexity of the attention mechanism reduces from to .
B.It prevents the variance of the dot product from scaling linearly with $d_k$, which would otherwise push the softmax function into regions with extremely small gradients.
C.It normalizes the self-attention weights so that they sum to exactly 1 across the sequence length.
D.It guarantees that the attention scores are bounded within the range before passing through the softmax function.
Correct Answer: It prevents the variance of the dot product from scaling linearly with $d_k$, which would otherwise push the softmax function into regions with extremely small gradients.
Explanation:
Assuming the elements of the query and key vectors are independent random variables with mean 0 and variance 1, their dot product has a mean of 0 and a variance of $d_k$. Scaling by $1/\sqrt{d_k}$ brings the variance back to 1, preventing the softmax output from approaching a hard max, which would result in vanishing gradients during training.
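The variance claim can be checked empirically with a small simulation. The sketch below draws query and key components from a standard normal distribution (one distribution that satisfies the mean-0, variance-1 assumption) and estimates the variance of their dot product; the function name and the trial count are arbitrary choices for this example.

```python
import random

def dot_product_variance(d_k, trials, rng):
    """Estimate Var(q . k) for q, k with i.i.d. standard normal components."""
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

rng = random.Random(0)
d_k = 64
raw_variance = dot_product_variance(d_k, trials=2000, rng=rng)
# Dividing each dot product by sqrt(d_k) divides the variance by d_k.
scaled_variance = raw_variance / d_k
```

The raw estimate should land close to `d_k` and the scaled estimate close to 1, up to sampling noise.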
42Consider a Multi-Head Attention layer with $h$ heads and a model dimension of $d_{\text{model}}$. If $d_k = d_v = d_{\text{model}}/h$, how many trainable weights (excluding biases) are present in the query, key, value, and output projection matrices combined for this single layer?
multi-head attention
Hard
A.
B.
C.
D.
Correct Answer: $4d_{\text{model}}^2$
Explanation:
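The projections for Q, K, and V each require a weight matrix of size $d_{\text{model}} \times d_{\text{model}}$ (since concatenating $h$ heads of size $d_k$ reconstructs $d_{\text{model}}$), totaling $3d_{\text{model}}^2$. The final output projection is also $d_{\text{model}} \times d_{\text{model}}$. Thus, the total is $4d_{\text{model}}^2$ parameters.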
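The count can be verified by summing the per-head matrices explicitly. The helper below is a hypothetical illustration; the sizes 512 and 8 are the base-model values from the original Transformer paper, used here only as an example.

```python
def attention_projection_weights(d_model, num_heads):
    """Count Q, K, V and output projection weights (no biases) for one layer."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads  # d_k = d_v = d_model / h
    # Each head has three d_model x d_head matrices (for Q, K, V).
    qkv = num_heads * 3 * d_model * d_head
    # The output projection maps the concatenated heads back to d_model.
    output = (num_heads * d_head) * d_model
    return qkv + output

total = attention_projection_weights(d_model=512, num_heads=8)
```

Note that the result does not depend on the number of heads, since the head size shrinks as the head count grows.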
43Which of the following mathematical properties of the sinusoidal positional encodings proposed by Vaswani et al. is crucial for allowing the model to easily learn to attend by relative positions?
positional encoding
Hard
A.The positional encoding vectors are orthogonal to all token embeddings.
B.The norm of the positional encoding vector decays exponentially as the sequence position increases.
C.The dot product between and is strictly zero for any .
D.The positional encoding for any fixed offset $k$, $PE_{pos+k}$, can be represented as a linear function of $PE_{pos}$.
Correct Answer: The positional encoding for any fixed offset $k$, $PE_{pos+k}$, can be represented as a linear function of $PE_{pos}$.
Explanation:
Because of the properties of sine and cosine addition formulas, for any fixed offset $k$, $PE_{pos+k}$ can be computed as a linear transformation (a rotation matrix) of $PE_{pos}$. This allows the attention mechanism to easily learn relative positional relationships regardless of absolute position.
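To make the rotation explicit, consider one sine/cosine pair with frequency $\omega_i = 1/10000^{2i/d_{\text{model}}}$ (the symbol $\omega_i$ is introduced here only for brevity). The angle-addition identities give:

```latex
\begin{aligned}
\sin(\omega_i (pos + k)) &= \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k) \\
\cos(\omega_i (pos + k)) &= \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k)
\end{aligned}
\quad\Longrightarrow\quad
\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}
```

The matrix depends only on the offset $k$ and the frequency $\omega_i$, not on $pos$, which is why the same linear map works at every absolute position.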
44In standard Transformer architectures, Layer Normalization can be applied either before the sub-layers (Pre-LN) or after the sub-layers (Post-LN). Which of the following best characterizes the impact of choosing Pre-LN over Post-LN?
transformer encoder and decoder blocks
Hard
A.Pre-LN restricts the gradients from propagating deeply into the network, acting similarly to a bottleneck layer.
B.Pre-LN eliminates the need for positional encodings, whereas Post-LN requires them for convergence.
C.Pre-LN provides better gradient flow near the output layer, but requires a strict learning rate warmup to avoid early divergence.
D.Pre-LN generally results in more stable training and removes the strict necessity for learning rate warmup, though it may slightly degrade final performance compared to Post-LN.
Correct Answer: Pre-LN generally results in more stable training and removes the strict necessity for learning rate warmup, though it may slightly degrade final performance compared to Post-LN.
Explanation:
In Post-LN (used in the original Transformer), the gradients of parameters near the output layer can be large at initialization, which makes training unstable at high learning rates and necessitates learning rate warmup. Pre-LN keeps the scale of gradients more uniform across layers, allowing training without warmup, though sometimes yielding slightly lower final performance.
45While both Byte-Pair Encoding (BPE) and WordPiece are subword tokenization algorithms, their merging criteria differ. BPE merges pairs based on frequency. What criterion does WordPiece primarily use to determine which subword pair to merge?
Byte-Pair Encoding and WordPiece
Hard
A.It merges the pair that maximizes the likelihood of the training data when trained as a unigram language model.
B.It merges the pair that yields the largest mutual information, calculated as the frequency of the pair divided by the product of individual frequencies.
C.It randomly selects a pair to merge based on a temperature parameter to inject noise into the tokenization process.
D.It merges pairs based on morphological suffixes and prefixes derived from a predefined language-specific rule set.
Correct Answer: It merges the pair that yields the largest mutual information, calculated as the frequency of the pair divided by the product of individual frequencies.
Explanation:
WordPiece scores pairs based on the formula $\text{score}(a, b) = \frac{\text{freq}(ab)}{\text{freq}(a) \times \text{freq}(b)}$. This measures how much more likely the pair is to appear together compared to what would be expected if they were independent, essentially maximizing the likelihood of the training data.
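A small comparison of the two merge criteria, using a toy corpus invented for illustration. This is a sketch of the scoring step only, not a full tokenizer trainer.

```python
from collections import Counter

# Toy corpus (hypothetical): each word is a tuple of its current symbols
corpus = (
    [("h", "##u", "##g")] * 10
    + [("p", "##u", "##g")] * 5
    + [("p", "##u", "##n")] * 12
    + [("b", "##u", "##n")] * 4
    + [("h", "##u", "##g", "##s")] * 5
)

symbol_freq = Counter()
pair_freq = Counter()
for word in corpus:
    symbol_freq.update(word)
    pair_freq.update(zip(word, word[1:]))   # adjacent symbol pairs

def wordpiece_score(pair):
    """Pair frequency divided by the product of the individual frequencies."""
    a, b = pair
    return pair_freq[pair] / (symbol_freq[a] * symbol_freq[b])

bpe_choice = max(pair_freq, key=pair_freq.get)          # most frequent pair
wordpiece_choice = max(pair_freq, key=wordpiece_score)  # highest-scoring pair
print(bpe_choice)        # ('##u', '##g')
print(wordpiece_choice)  # ('##g', '##s')
```

BPE picks the most frequent pair even though `##u` is common everywhere, while the WordPiece score favors `##g` + `##s` because `##s` almost never appears without `##g` before it.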
46In BERT's Masked Language Modeling (MLM), 15% of the tokens are chosen for prediction. Of these chosen tokens, 80% are replaced with [MASK], 10% with a random word, and 10% kept unchanged. What is the primary reason for keeping 10% of the selected tokens unchanged?
masked language modeling
Hard
A.To bias the model towards retaining its original word embeddings in the lower layers.
B.To mitigate the discrepancy between pre-training and fine-tuning, teaching the model that the observed token might actually be the correct token and it should maintain its contextual representation.
C.To prevent the model from assigning zero probability to the actual input token, which acts as a regularization mechanism similar to label smoothing.
D.To force the model to rely solely on the target token's embedding rather than the surrounding context.
Correct Answer: To mitigate the discrepancy between pre-training and fine-tuning, teaching the model that the observed token might actually be the correct token and it should maintain its contextual representation.
Explanation:
During fine-tuning, the [MASK] token never appears. By keeping the token unchanged 10% of the time (and replacing it with a random token 10% of the time), BERT is forced to build a robust contextual representation for every token, as it never knows which tokens have been tampered with.
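The 80/10/10 rule can be sketched as follows. This is a simplified illustration: it selects each token independently with probability 0.15 and ignores special tokens, which is an assumption made for brevity and not necessarily how any particular implementation samples positions.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)        # not selected: left alone, no loss
            labels.append(None)
            continue
        labels.append(tok)               # selected: the model must predict the original
        r = rng.random()
        if r < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the original token
    return corrupted, labels

rng = random.Random(0)
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted, labels)
```

Because a selected token may appear unchanged in the input, the model cannot treat "no [MASK] here" as a signal that the token is trustworthy.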
47Which of the following pretrained models completely removed the Next Sentence Prediction (NSP) objective and demonstrated empirically that removing it actually improved performance on downstream tasks?
next sentence prediction
Hard
A.ALBERT
B.BART
C.DeBERTa
D.RoBERTa
Correct Answer: RoBERTa
Explanation:
The authors of RoBERTa found that removing the Next Sentence Prediction (NSP) objective used in BERT matched or slightly improved downstream performance. They dropped NSP and trained on full-length sequences of contiguous sentences (which may cross document boundaries) using only Masked Language Modeling.
48During autoregressive generation in causal language models, a 'KV cache' is often utilized. If a model generates a sequence of length $N$, how does the memory complexity of the KV cache scale with respect to $N$?
causal language modeling
Hard
A.
B.
C.
D.
Correct Answer: $O(N)$
Explanation:
The KV cache stores the Key and Value vectors for all previously generated tokens to avoid recomputing them. Since each token adds a fixed-size vector for its Keys and Values across all layers, the total memory required for the cache grows linearly, $O(N)$, with the sequence length.
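A back-of-the-envelope calculation makes the linear growth concrete. The model shape below is a hypothetical configuration chosen for illustration, and the function assumes 2 bytes per stored value (16-bit precision).

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for cached Keys and Values of one sequence."""
    # 2 tensors (K and V) per layer, one head_dim vector per head per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)   # hypothetical model shape

per_token = kv_cache_bytes(1, **cfg)
print(per_token)                               # 524288 bytes (512 KiB) per token
print(kv_cache_bytes(4096, **cfg) / 2**30)     # 2.0 GiB for 4096 tokens
```

Every additional token costs the same fixed amount, so doubling the sequence length doubles the cache.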
49T5 (Text-to-Text Transfer Transformer) uses a unique unsupervised pre-training objective known as 'span corruption'. How does T5 handle the targets for masked spans during pre-training compared to BERT?
T5
Hard
A.T5 replaces contiguous spans with a single unique sentinel token and the decoder must generate only the corrupted spans delimited by those sentinel tokens.
B.T5 uses a discriminator network to predict whether a span was corrupted, similar to ELECTRA, rather than generating the tokens.
C.T5 reconstructs the entire original text in the decoder, penalizing outputs that deviate from the uncorrupted input sequence.
D.T5 replaces spans with multiple [MASK] tokens and decodes them sequentially, whereas BERT predicts them independently.
Correct Answer: T5 replaces contiguous spans with a single unique sentinel token and the decoder must generate only the corrupted spans delimited by those sentinel tokens.
Explanation:
In T5's span corruption, contiguous dropped tokens are replaced by a single unique sentinel token (e.g., <extra_id_0>). The decoder's target sequence consists only of these sentinel tokens followed by the missing span tokens, rather than reconstructing the full original sequence.
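How the encoder input and decoder target are built can be sketched as below. The span positions are passed in by hand here; how spans are actually sampled is left out, and the helper name `span_corrupt` is made up for this example.

```python
def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index pairs, end exclusive."""
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])   # keep the text before the span
        inputs.append(sentinel)               # one sentinel replaces the whole span
        targets.append(sentinel)              # target: sentinel, then the dropped tokens
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))    # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(targets))   # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The target is much shorter than the original sentence, which is the practical difference from reconstructing the full input.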
50When fine-tuning BERT for text classification, it is standard practice to pass the final hidden state of the [CLS] token to a classification head. If one extracts the [CLS] embedding from a pre-trained BERT without any fine-tuning, why is it generally a poor sentence representation for tasks like semantic textual similarity?
fine-tuning for text classification
Hard
A.During pre-training, the [CLS] token is specifically optimized only as an indicator for the Masked Language Modeling task, ignoring sentence-level semantics.
B.The [CLS] representation is heavily biased towards the Next Sentence Prediction (NSP) objective, acting essentially as a binary feature rather than capturing a continuous semantic space.
C.The attention mechanism is masked such that the [CLS] token cannot attend to the rest of the sequence unless fine-tuning unmasks it.
D.The [CLS] token embedding is deterministically initialized to zeros and only receives gradients during the fine-tuning phase.
Correct Answer: The [CLS] representation is heavily biased towards the Next Sentence Prediction (NSP) objective, acting essentially as a binary feature rather than capturing a continuous semantic space.
Explanation:
In a vanilla pretrained BERT, the [CLS] token's primary role is to aggregate information for the Next Sentence Prediction (NSP) binary classification task. Without fine-tuning (or sentence-level training such as that used for Sentence-BERT), this representation does not inherently form a smooth semantic space useful for measuring cosine similarity between sentences.
51When fine-tuning a transformer model for Named Entity Recognition (NER), subword tokenization often splits a single word into multiple tokens (e.g., 'Washington' -> 'Wash', '##ington'). What is the standard practice in HuggingFace for assigning labels to these subwords to compute the loss correctly?
named entity recognition
Hard
A.Assign B-LOC to the first subword and I-LOC to all subsequent subwords of the same word.
B.Assign the exact same target label to all subwords of the word to reinforce the gradient.
C.Assign the target label (e.g., B-LOC) to the first subword and assign a special ignore index (e.g., -100) to the remaining subwords.
D.Sum the logits of all subwords corresponding to a word before applying the softmax and calculating the loss.
Correct Answer: Assign the target label (e.g., B-LOC) to the first subword and assign a special ignore index (e.g., -100) to the remaining subwords.
Explanation:
The standard practice in sequence labeling with subwords is to calculate the loss only on the first subword of each word. Trailing subwords are labeled -100, which is the default ignore_index of PyTorch's CrossEntropyLoss, so they are skipped in the loss and words that happen to be split into many subwords are not over-weighted.
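The alignment step can be sketched without any library. The `word_ids` list below mimics what a fast tokenizer's `word_ids()` method returns (one word index per token, `None` for special tokens); the label ids and the helper name `align_labels` are hypothetical.

```python
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss

def align_labels(word_ids, word_labels):
    """Label the first subword of each word; ignore special tokens and trailing subwords."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(IGNORE_INDEX)       # special token or continuation subword
        else:
            aligned.append(word_labels[wid])   # first subword carries the word's label
        previous = wid
    return aligned

# Tokens: [CLS] Wash ##ington is nice [SEP]
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [5, 0, 0]    # hypothetical ids: 5 = B-LOC, 0 = O
print(align_labels(word_ids, word_labels))   # [-100, 5, -100, 0, 0, -100]
```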
52In Extractive Question Answering tasks with unanswerable questions (like SQuAD 2.0), how does a standard BERT-based architecture signal that a question cannot be answered from the given context?
question answering
Hard
A.By generating an empty string through a specialized decoder head attached to the final layer.
B.By pointing both the predicted start and end spans to the [CLS] token at index 0.
C.By outputting start and end logits where the start index is strictly greater than the end index.
D.By setting all token probabilities in the context to precisely zero using a specialized thresholding function.
Correct Answer: By pointing both the predicted start and end spans to the [CLS] token at index 0.
Explanation:
In BERT models fine-tuned on SQuAD 2.0, the [CLS] token is used as a fallback. If the model determines the answer is not in the text, it maximizes the probability of the start and end indices both pointing to the [CLS] token.
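At prediction time, a common way to use this convention is to compare the score of the [CLS] "null" span against the best real span. The sketch below assumes index 0 is [CLS] and treats every other index as a candidate; the function name, the `max_len` limit, and the `null_threshold` parameter are assumptions for illustration (real pipelines also restrict candidates to context tokens).

```python
def predict_span(start_logits, end_logits, max_len=30, null_threshold=0.0):
    """Return the best (start, end) span, or None if the question looks unanswerable."""
    null_score = start_logits[0] + end_logits[0]   # both pointers on [CLS]
    best_score, best_span = float("-inf"), None
    for s in range(1, len(start_logits)):
        # end must not precede start, and the span length is capped
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    if best_span is None or null_score - best_score > null_threshold:
        return None
    return best_span

print(predict_span([5, 0, 1, 0], [5, 0, 0, 1]))   # None: the [CLS] span wins
print(predict_span([0, 1, 6, 0], [0, 0, 1, 7]))   # (2, 3)
```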
53Consider a Transformer block with sequence length $N$ and hidden dimension $d$. The attention mechanism computes $\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$. If $N$ is significantly larger than $d$, which operation becomes the primary computational bottleneck?
transformer architecture
Hard
A.The Softmax activation function, scaling as $O(N^2)$.
B.The Feed-Forward Network (FFN) sub-layer, scaling as $O(N d^2)$.
C.The computation of the attention scores $QK^T$, scaling as $O(N^2 d)$.
D.The linear projections to form Q, K, and V, scaling as $O(N d^2)$.
Correct Answer: The computation of the attention scores $QK^T$, scaling as $O(N^2 d)$.
Explanation:
The dot product $QK^T$ requires multiplying an $N \times d$ matrix by a $d \times N$ matrix, which takes $O(N^2 d)$ operations. When $N \gg d$, the quadratic dependency on sequence length makes this step the primary computational bottleneck of the Transformer.
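A rough operation count shows where the crossover happens. Here `n` stands for the sequence length and `d` for the hidden dimension; the counts use 2 floating-point operations per multiply-add, and the FFN width of 4d is an assumption (a common choice, not something stated in the question).

```python
def attention_scores_flops(n, d):
    # (n x d) @ (d x n): n*n outputs, each a length-d dot product
    return 2 * n * n * d

def qkv_projection_flops(n, d):
    # three (n x d) @ (d x d) products
    return 3 * 2 * n * d * d

def ffn_flops(n, d, d_ff=None):
    # two linear layers, d -> d_ff -> d (d_ff = 4d assumed)
    d_ff = 4 * d if d_ff is None else d_ff
    return 2 * 2 * n * d * d_ff

d = 512
for n in (512, 4096, 32768):
    print(n, attention_scores_flops(n, d), qkv_projection_flops(n, d), ffn_flops(n, d))
```

Doubling `n` quadruples the score computation but only doubles the projections and the FFN, so for long sequences the quadratic term dominates. Multiplying the attention weights by the values costs the same order again.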
54BERT relies on Token Embeddings, Segment (Token Type) Embeddings, and Positional Embeddings. If you are fine-tuning BERT for a single-sequence classification task (e.g., Sentiment Analysis), how are the Segment Embeddings typically handled?
BERT
Hard
A.They are dynamically generated based on the length of the sequence using a sinusoidal function.
B.A single segment ID (usually 0) is passed for all tokens in the sequence, mapping to a learned vector that is added to the token embeddings.
C.They are omitted entirely, and the model architecture is adjusted to bypass the segment embedding addition.
D.A constant vector of ones is added to all token embeddings to indicate a single segment.
Correct Answer: A single segment ID (usually 0) is passed for all tokens in the sequence, mapping to a learned vector that is added to the token embeddings.
Explanation:
For single-sequence tasks, BERT still expects segment IDs as part of its architecture. By convention, a tensor of zeros is passed (indicating segment A), which retrieves a learned embedding that is uniformly added to all tokens in the sequence.
55When adapting a large pre-trained language model to a specific task using Low-Rank Adaptation (LoRA), which of the following is true regarding the model's parameters?
transfer learning for NLP tasks
Hard
A.LoRA freezes the original pre-trained weights and injects trainable rank decomposition matrices, maintaining the same inference latency as the base model when merged.
B.LoRA introduces bottleneck layers between every transformer block, increasing the depth of the model and slightly increasing inference latency.
C.LoRA fine-tunes only the final layer of the network while applying a sparse mask to the gradients of the earlier layers.
D.LoRA completely replaces the self-attention weights with smaller matrices, significantly reducing the memory required for the forward pass during inference.
Correct Answer: LoRA freezes the original pre-trained weights and injects trainable rank decomposition matrices, maintaining the same inference latency as the base model when merged.
Explanation:
LoRA introduces trainable matrices $A$ and $B$ such that $\Delta W = BA$. During training, the base weights $W_0$ are frozen. For deployment, $BA$ can be explicitly computed and added to the original weights ($W = W_0 + BA$), meaning there is zero extra latency during inference.
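The merge step can be verified on a tiny example. This sketch uses plain Python lists and rank 1; the matrix names `W0`, `A`, and `B` and all values are illustrative, and the scaling factor that LoRA implementations typically apply to the update is omitted.

```python
def matmul(a, b):
    """Matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Elementwise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

W0 = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]   # frozen 3x3 base weight
B = [[1.0], [0.0], [2.0]]     # 3x1, trainable
A = [[0.5, -1.0, 0.0]]        # 1x3, trainable (rank 1)
x = [[1.0], [2.0], [3.0]]     # input as a column vector

# Training-time view: base path plus a separate low-rank path
unmerged = add(matmul(W0, x), matmul(B, matmul(A, x)))
# Deployment view: fold the update into a single weight matrix
merged = matmul(add(W0, matmul(B, A)), x)

print(unmerged)   # [[3.5], [11.0], [2.0]]
print(merged)     # [[3.5], [11.0], [2.0]]
```

After merging, inference uses one matrix of the original shape, so the forward pass costs the same as the base model.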
56In the HuggingFace generate() method, when both top_k and top_p (nucleus sampling) parameters are provided (e.g., top_k=50, top_p=0.95), in what order are these filtering operations mathematically applied to the logits?
HuggingFace Transformers
Hard
A.HuggingFace raises an exception because top_k and top_p are mutually exclusive generation constraints.
B.top_k filtering is applied first, completely truncating the vocabulary to 50 tokens, and then top_p filtering is applied within that restricted set.
C.They are applied independently to two copies of the logits, and the intersection of the valid tokens is used.
D.top_p filtering is applied first, followed by top_k filtering on the remaining probabilities.
Correct Answer: top_k filtering is applied first, completely truncating the vocabulary to 50 tokens, and then top_p filtering is applied within that restricted set.
Explanation:
In HuggingFace's implementation of the generation mixins (specifically the LogitsProcessors), TopKLogitsWarper is instantiated and applied before TopPLogitsWarper. Therefore, the vocabulary is first truncated to $k$ tokens, and nucleus sampling is subsequently applied over this reduced distribution.
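The effect of the ordering can be seen in a small stand-alone sketch. This is not the library's code: `top_k_then_top_p` is a hypothetical helper that mirrors the described order (truncate to the top k, renormalize, then take the nucleus).

```python
import math

def top_k_then_top_p(logits, top_k, top_p):
    """Return the token ids that survive top-k followed by top-p filtering."""
    # Step 1: keep the top_k highest logits (ids sorted by descending logit)
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Step 2: renormalize over the survivors, then keep the smallest prefix
    # whose cumulative probability reaches top_p
    z = sum(math.exp(logits[i]) for i in kept)
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += math.exp(logits[i]) / z
        if cumulative >= top_p:
            break
    return nucleus

logits = [math.log(p) for p in (0.4, 0.3, 0.2, 0.1)]
print(top_k_then_top_p(logits, top_k=4, top_p=0.5))   # [0, 1]
print(top_k_then_top_p(logits, top_k=2, top_p=0.5))   # [0]
```

With `top_k=2` the two surviving tokens are renormalized to roughly 0.57 and 0.43, so the first token alone already covers the 0.5 nucleus. The same `top_p` therefore selects a different set depending on what top-k removed first.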
57In the cross-attention sub-layer of a standard Transformer Decoder, how are the Query (Q), Key (K), and Value (V) matrices derived?
transformer decoder blocks
Hard
A.Q and K are derived from the final hidden states of the Encoder, while V is derived from the previous decoder layer.
B.Q, K, and V are derived from the Encoder, but a look-ahead mask is applied to prevent the decoder from seeing future tokens.
C.Q, K, and V are all derived from the output of the previous decoder layer.
D.Q is derived from the previous decoder layer, while K and V are derived from the final hidden states of the Encoder.
Correct Answer: Q is derived from the previous decoder layer, while K and V are derived from the final hidden states of the Encoder.
Explanation:
In cross-attention (or Encoder-Decoder attention), the decoder "queries" the encoder's representations. Therefore, the Queries come from the decoder's previous layer, while the Keys and Values are projections of the encoder's final output.
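A single-head sketch of this data flow in plain Python. The projection matrices are set to the identity purely to keep the example small, and masking, multiple heads, and the output projection are left out.

```python
import math

def matmul(a, b):
    """Matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    Q = matmul(decoder_states, Wq)   # queries come from the decoder
    K = matmul(encoder_states, Wk)   # keys come from the encoder output
    V = matmul(encoder_states, Wv)   # values come from the encoder output
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])   # (tgt_len x src_len)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

dec = [[1.0, 0.0], [0.0, 1.0]]                 # 2 target positions
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # 3 source positions
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = cross_attention(dec, enc, identity, identity, identity)
print(len(out), len(weights[0]))   # 2 3
```

Each target position gets one row of weights over the source positions, so the output has the decoder's length while mixing the encoder's values.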
58Unlike BPE which starts with characters and builds up, the Unigram Language Model tokenizer starts with a large vocabulary and prunes it down. What metric does the Unigram algorithm use to decide which tokens to remove at each iteration?
tokenization methods
Hard
A.It removes the tokens whose removal results in the smallest increase in the overall negative log-likelihood of the training data.
B.It removes the tokens that have the lowest absolute frequency in the training corpus.
C.It removes tokens that are not valid morphological roots according to a predefined dictionary.
D.It removes the longest tokens first to enforce a bias towards shorter subwords and characters.
Correct Answer: It removes the tokens whose removal results in the smallest increase in the overall negative log-likelihood of the training data.
Explanation:
The Unigram tokenization algorithm iteratively evaluates how much the training data's likelihood would drop if a token were removed from the vocabulary. It prunes the bottom $p\%$ of tokens that cause the least degradation to the likelihood.
59GPT-3 popularized the concept of 'in-context learning' (few-shot prompting) where the model performs tasks without gradient updates. From a representational perspective, how does the model 'learn' a task dynamically during inference?
GPT
Hard
A.By executing an internal discrete search algorithm over its vocabulary space prompted by a specialized meta-token.
B.By relying on the self-attention mechanism to route the representations of the provided examples to modulate the hidden states of the target query.
C.By temporarily caching gradients and updating only the Layer Normalization parameters using a lightweight background thread.
D.By shifting the positional encodings of the input prompt so that the context behaves as updated weight matrices.
Correct Answer: By relying on the self-attention mechanism to route the representations of the provided examples to modulate the hidden states of the target query.
Explanation:
During in-context learning, no parameters are updated. The "learning" happens entirely through activations via self-attention: the model dynamically aggregates information from the context examples to appropriately form the output distribution for the new prompt.
60In the context of the Effective Receptive Field (ERF) of deep neural networks, how does the theoretical receptive field of a token at the very first layer of a Transformer compare to a standard CNN?
pretrained transformer models
Hard
A.A Transformer's receptive field scales logarithmically with depth, while a CNN's scales linearly.
B.Transformers and CNNs both have identical local receptive fields initially, but Transformers expand theirs via positional encodings.
C.A Transformer has a global receptive field encompassing the entire sequence, whereas a CNN has a strictly local receptive field constrained by kernel size.
D.A Transformer has a local receptive field of fixed size in the first layer, just like a CNN.
Correct Answer: A Transformer has a global receptive field encompassing the entire sequence, whereas a CNN has a strictly local receptive field constrained by kernel size.
Explanation:
Because of the self-attention mechanism, every token computes attention scores with every other token in the sequence (in unmasked models like BERT) or with all past tokens (in masked models like GPT), starting at layer 1. The theoretical receptive field is therefore global from the first layer, whereas CNNs must stack many layers to expand theirs.