Unit 5 - Practice Quiz

CSE472 60 Questions

1 Which foundational 2017 paper introduced the Transformer architecture?

transformer architecture Easy
A. Attention Is All You Need
B. Deep Residual Learning for Image Recognition
C. BERT: Pre-training of Deep Bidirectional Transformers
D. Language Models are Few-Shot Learners

2 What is the primary function of the self-attention mechanism in a Transformer?

self-attention Easy
A. To translate words from a source language to a target language word-by-word
B. To compress the input sequence into a fixed-length vector using recurrence
C. To mask out unknown words in the input text
D. To relate different positions of a single sequence to compute a representation of the sequence

3 Why does the Transformer use multi-head attention instead of a single attention function?

multi-head attention Easy
A. It replaces the need for tokenization in NLP tasks
B. It reduces the number of parameters in the model
C. It allows the model to jointly attend to information from different representation subspaces at different positions
D. It prevents the model from overfitting on small datasets

4 Why is positional encoding necessary in a Transformer model?

positional encoding Easy
A. Because it normalizes the input embeddings to have a mean of zero
B. Because Transformers process sequences in parallel and lack inherent notions of sequence order
C. Because it is required to calculate the softmax function properly
D. Because it reduces the memory footprint of the attention mechanism

5 Which component is present in a standard Transformer decoder block but NOT in an encoder block?

transformer encoder and decoder blocks Easy
A. Layer Normalization
B. Masked Multi-Head Self-Attention
C. Feed-Forward Neural Network
D. Multi-Head Self-Attention

6 What is the primary advantage of subword tokenization methods over traditional word-level tokenization?

tokenization methods Easy
A. They are strictly rule-based and do not require training on a corpus
B. They ensure that every word is represented by a single, unique integer ID
C. They help handle Out-Of-Vocabulary (OOV) words by breaking them into known subwords
D. They completely eliminate the need for an embedding layer

7 How does the Byte-Pair Encoding (BPE) algorithm initially build its vocabulary?

Byte-Pair Encoding and WordPiece Easy
A. It hashes every word in the training corpus into a fixed-size vector space
B. It starts with whole sentences and breaks them down using linguistic rules
C. It starts with a vocabulary of individual characters and iteratively merges the most frequent adjacent pairs
D. It uses a pre-defined dictionary from WordNet to select valid tokens

8 What does 'pretraining' mean in the context of large language models?

pretrained transformer models Easy
A. Testing a model on a validation set to tune its hyperparameters
B. Hardcoding linguistic rules into the model before deployment
C. Training a model explicitly on human feedback and supervised labels from the very beginning
D. Training a model on a large corpus of unlabeled text using a self-supervised objective before adapting it to specific tasks

9 What does the acronym BERT stand for?

BERT Easy
A. Basic Entity Recognition from Text
B. Binary Encoded Recurrent Transformers
C. Bidirectional Evaluation of Recurrent Transformers
D. Bidirectional Encoder Representations from Transformers

10 The GPT (Generative Pre-trained Transformer) architecture is primarily based on which part of the original Transformer?

GPT Easy
A. The Cross-Attention Mechanism
B. The Decoder
C. Only the Positional Encodings
D. The Encoder

11 What is the defining characteristic of the T5 (Text-to-Text Transfer Transformer) framework?

T5 Easy
A. It casts every NLP task as a text-to-text problem, where both input and output are strings
B. It processes audio and text simultaneously in a single model
C. It uses reinforcement learning instead of cross-entropy loss
D. It uses only an encoder architecture and cannot generate text

12 In Masked Language Modeling (MLM), what is the model trained to do?

masked language modeling Easy
A. Predict the very next sentence in a document
B. Translate a masked sentence into another language
C. Generate an entire paragraph from a single prompt word
D. Predict the original identity of tokens that have been randomly replaced with a [MASK] token

13 What is the objective of the Next Sentence Prediction (NSP) task used during BERT's pretraining?

next sentence prediction Easy
A. To generate the subsequent sentence given a starting phrase
B. To predict the exact word count of the following sentence
C. To predict whether Sentence B logically and sequentially follows Sentence A in the original document
D. To classify the grammatical correctness of the next sentence

14 Which of the following best describes Causal Language Modeling (CLM)?

causal language modeling Easy
A. Predicting a masked word using both left and right context
B. Finding the root cause of grammatical errors in a sentence
C. Classifying whether a text expresses a positive or negative sentiment
D. Predicting the next token in a sequence given only the previous tokens (left-to-right)

15 What is a major benefit of using transfer learning in NLP?

transfer learning for NLP tasks Easy
A. It allows a model pretrained on a massive corpus to achieve high performance on downstream tasks with relatively little labeled data
B. It completely removes the need to tokenize text
C. It eliminates the need for computational resources like GPUs
D. It guarantees that the model will never produce biased or toxic outputs

16 When fine-tuning BERT for a sequence classification task (like sentiment analysis), which special token's final hidden state is typically passed to the classification layer?

fine-tuning for text classification Easy
A. [SEP]
B. [MASK]
C. [PAD]
D. [CLS]

17 Named Entity Recognition (NER) using Transformers is typically modeled as what type of task?

named entity recognition Easy
A. Sequence-to-sequence generation
B. Document clustering
C. Token classification
D. Next sentence prediction

18 In an extractive Question Answering task using a model like BERT, what does the model output?

question answering Easy
A. A newly generated sentence summarizing the answer
B. The start and end token indices of the answer span within the provided context
C. A simple 'Yes' or 'No' based on the question
D. The Wikipedia URL containing the answer

19 What is the primary purpose of the HuggingFace Transformers library?

HuggingFace Transformers Easy
A. To provide open-source APIs, tools, and pretrained weights for state-of-the-art NLP models
B. To serve as a cloud storage database for raw text datasets
C. To replace Python's standard string manipulation functions
D. To provide a proprietary operating system for machine learning servers

20 In the context of the Transformer architecture, what is passed through the Feed-Forward Neural Network in an encoder block?

transformer architecture Easy
A. The final predictions of the decoder
B. Only the positional encodings
C. The output of the multi-head self-attention layer (after add and norm)
D. The raw, un-tokenized input string

21 In the scaled dot-product attention mechanism, why are the dot products scaled by the inverse square root of the dimension of the key vectors ($d_k$)?

self-attention Medium
A. To prevent the softmax gradients from vanishing
B. To reduce the computational complexity of the dot product
C. To match the dimensions of the queries and keys
D. To normalize the probabilities to sum to 1

22 What is the primary advantage of using Multi-Head Attention over a single attention function in a Transformer?

multi-head attention Medium
A. It reduces the total number of parameters in the attention mechanism
B. It removes the need for positional encodings
C. It allows the model to jointly attend to information from different representation subspaces at different positions
D. It decreases the time complexity of the self-attention operation

23 How are sine and cosine functions utilized for positional encoding in the original Transformer architecture?

positional encoding Medium
A. They are learned continuously during the training process via backpropagation
B. Different frequencies of sine and cosine are used for different dimensions of the positional encoding vector
C. Sine functions encode odd sequence positions and cosine functions encode even sequence positions
D. They modulate the attention weights directly before applying the softmax function
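For reference, the fixed sinusoidal scheme from the original Transformer paper can be sketched in a few lines. This is a minimal illustration, not a library implementation: the function name `sinusoidal_positional_encoding` and the plain-list representation are assumptions made for this example.

```python
import math

def sinusoidal_positional_encoding(pos, d_model):
    """Return the encoding vector for one position.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pe = []
    # `i` steps over the even dimension indices, so i / d_model is 2i' / d_model.
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))      # even dimension
        if i + 1 < d_model:
            pe.append(math.cos(angle))  # odd dimension, same frequency
    return pe

# Each pair of dimensions shares one frequency; frequencies fall as i grows.
vector = sinusoidal_positional_encoding(pos=1, d_model=4)
```

Each sine/cosine pair uses its own wavelength, so every dimension pair oscillates at a different rate along the sequence.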

24 In a standard Transformer decoder block, what is the purpose of the masked multi-head attention layer?

transformer encoder and decoder blocks Medium
A. To prevent the decoder from attending to future tokens in the target sequence during training
B. To regularize the model by randomly dropping attention weights like dropout
C. To mask out padding tokens from the input sequence
D. To compute the alignment between the encoder output and the target sequence
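As a rough illustration of look-ahead masking, the sketch below applies a row-wise softmax to a square score matrix after setting every score for a later position to negative infinity. The function name `causal_attention_weights` and the nested-list input are assumptions for this example; real implementations operate on batched tensors.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with positions j > i masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions receive -inf, so their softmax weight is exactly 0.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With uniform scores, position i spreads its weight evenly over positions 0..i.
w = causal_attention_weights([[0.0, 0.0, 0.0] for _ in range(3)])
```

Position 0 can only attend to itself, position 1 to positions 0 and 1, and so on, which is what keeps training consistent with left-to-right generation.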

25 When training a Byte-Pair Encoding (BPE) tokenizer, how are new subwords formed?

Byte-Pair Encoding and WordPiece Medium
A. By iteratively merging the most frequently occurring pair of adjacent symbols
B. By combining characters that maximize the likelihood of the training data language model
C. By replacing rare words with an <UNK> token directly
D. By splitting words based on linguistic morphological rules
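The frequency-based merge loop of BPE can be sketched on a toy corpus as follows. This is a simplified illustration under stated assumptions: the helper names (`most_frequent_pair`, `merge_pair`, `train_bpe`) and the word-frequency dictionary are invented for this example, and end-of-word markers and byte-level handling are left out.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are broken in favor of the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Start from single characters and apply `num_merges` merges."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, words = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2)
```

On this toy corpus the first two merges build the subword `est` out of `e`, `s`, and `t`, because those adjacent pairs occur most often.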

26 Which components make up the input representation for a given token in the BERT model?

BERT Medium
A. Word Embeddings, Character Embeddings, and Mask Embeddings
B. Token Embeddings, Syntactic Embeddings, and Position Embeddings
C. Byte-Pair Embeddings and Absolute Position Embeddings
D. Token Embeddings, Segment Embeddings, and Position Embeddings

27 During the pretraining phase of BERT using Masked Language Modeling (MLM), how are the selected tokens replaced?

masked language modeling Medium
A. 100% replaced with the [MASK] token
B. 80% kept unchanged, 10% with [MASK], 10% with a random token
C. 80% with [MASK], 10% with a random token, 10% kept unchanged
D. 50% with [MASK] and 50% with a random token
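The BERT-style corruption procedure can be sketched as below, assuming the 15% selection rate and the 80/10/10 split of the BERT setup. The function name `mask_tokens`, the string tokens, and the use of `None` for positions that are not predicted are assumptions for this illustration.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted inputs, labels); labels are None where no prediction is made."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                       # token not selected for prediction
        labels[i] = tok                    # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

rng = random.Random(0)
inputs, labels = mask_tokens(["the", "cat", "sat"] * 100, ["the", "cat", "sat", "dog"], rng)
```

Only the selected positions contribute to the loss; every other position passes through unchanged and carries no label.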

28 Which of the following best describes the architecture of the Generative Pre-trained Transformer (GPT) models?

GPT Medium
A. Autoregressive decoder-only architecture
B. Bidirectional encoder-only architecture
C. Masked language model architecture
D. Encoder-decoder architecture with cross-attention

29 In causal language modeling, what objective function is the model primarily optimizing?

causal language modeling Medium
A. Predicting randomly masked words within a bidirectional context
B. Maximizing the likelihood of predicting the next token given all previous tokens in the sequence
C. Classifying whether the second half of a sequence logically follows the first half
D. Minimizing the reconstruction error of a corrupted input sequence

30 How does the Text-to-Text Transfer Transformer (T5) format its input and output for different NLP tasks?

T5 Medium
A. Every task is cast as feeding a text sequence as input and generating a new text sequence as output
B. Tasks are differentiated by adding task-specific classification heads on top of the encoder
C. It uses unique segment embeddings for each NLP task type
D. It requires replacing the decoder block for every specific task like translation or summarization

31 When fine-tuning a BERT model for text classification, which vector is typically passed to the final classification layer?

fine-tuning for text classification Medium
A. The final hidden state of the [SEP] token
B. The final hidden state corresponding to the [CLS] token
C. The concatenation of the first and last token's hidden states
D. The average pooling of all tokens' final hidden states

32 In the HuggingFace Transformers library, what is the role of the AutoModelForSequenceClassification class?

HuggingFace Transformers Medium
A. It automatically performs hyperparameter tuning for classification metrics
B. It automatically instantiates a base model with a sequence classification head on top
C. It searches the model hub for the best performing model for text classification
D. It dynamically changes the vocabulary size based on the classification task

33 In BERT's Next Sentence Prediction (NSP) task, what defines a positive pair during pretraining?

next sentence prediction Medium
A. Two sentences that are semantically similar according to cosine similarity
B. Two sentences extracted from different documents but sharing the same topic
C. Two sentences where one is a direct paraphrase of the other
D. Two sentences that appear consecutively in the original corpus

34 How is fine-tuning adapted for a Token Classification task like Named Entity Recognition (NER) using a transformer?

named entity recognition Medium
A. The output embeddings are clustered to find entity boundaries
B. A classification head is applied to the final hidden state of every individual token in the sequence
C. A single classification head is applied only to the [CLS] token
D. The model is trained to generate the entity labels sequentially as text

35 When applying a transformer model to Extractive Question Answering (like SQuAD), how is the answer extracted from the context?

question answering Medium
A. By predicting two probabilities for each token in the context: one for being the start of the answer span and one for being the end
B. By autoregressively generating the text of the answer word-by-word
C. By ranking multiple choice answers based on cosine similarity with the question
D. By classifying the context into predefined answer categories

36 Which component exists in a Transformer Decoder block but is ABSENT from a Transformer Encoder block?

transformer encoder and decoder blocks Medium
A. Feed-Forward Neural Network
B. Layer Normalization
C. Self-Attention mechanism
D. Encoder-Decoder Cross-Attention mechanism

37 What is the main benefit of using transfer learning in modern NLP?

transfer learning for NLP tasks Medium
A. Pretraining on massive unlabeled corpora allows models to learn rich language representations, reducing the need for large labeled datasets during fine-tuning
B. It completely eliminates the need for computational resources during the fine-tuning phase
C. It ensures the model cannot overfit on the downstream task regardless of the training time
D. It allows language models to translate between languages without any task-specific data

38 How does WordPiece tokenization differ primarily from Byte-Pair Encoding (BPE)?

Byte-Pair Encoding and WordPiece Medium
A. WordPiece cannot handle out-of-vocabulary words and assigns them to an <UNK> token
B. WordPiece operates on complete words rather than subwords or characters
C. WordPiece relies strictly on linguistic rules such as stemming and lemmatization
D. WordPiece chooses symbol pairs that maximize the likelihood of the training data given the language model, rather than just frequency

39 In the standard Transformer architecture, what specific type of neural network follows the attention mechanisms within both encoder and decoder layers?

transformer architecture Medium
A. A Convolutional Neural Network (CNN) to capture local n-gram features
B. A Recurrent Neural Network (RNN) to process the sequential output
C. A Position-wise Feed-Forward Network applied identically to each position separately
D. A Max-Pooling layer to reduce the dimensionality of the sequence

40 Given an input sequence of length $n$ and hidden dimension $d$, what is the computational time complexity of the self-attention operation?

self-attention Medium
A.
B.
C.
D.

41 In the standard scaled dot-product attention, the dot products of queries and keys are divided by $\sqrt{d_k}$. What is the primary theoretical justification for this specific scaling factor?

self-attention Hard
A. It ensures that the computational complexity of the attention mechanism reduces from to .
B. It prevents the variance of the dot product from scaling linearly with $d_k$, which would otherwise push the softmax function into regions with extremely small gradients.
C. It normalizes the self-attention weights so that they sum to exactly 1 across the sequence length.
D. It guarantees that the attention scores are bounded within the range before passing through the softmax function.
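The variance argument behind the scaling factor can be written out as a short derivation. It assumes, as an idealization, that the components of a query $q$ and a key $k$ are independent random variables with mean 0 and variance 1.

```latex
\begin{aligned}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i, \qquad
\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1
\;\Longrightarrow\;
\operatorname{Var}(q \cdot k) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k, \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1 .
\end{aligned}
```

Under these assumptions the unscaled scores have standard deviation $\sqrt{d_k}$, so for large $d_k$ the softmax saturates; dividing by $\sqrt{d_k}$ keeps the scores at unit variance regardless of dimension.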

42 Consider a Multi-Head Attention layer with heads and a model dimension of . If , how many trainable weights (excluding biases) are present in the query, key, value, and output projection matrices combined for this single layer?

multi-head attention Hard
A.
B.
C.
D.

43 Which of the following mathematical properties of the sinusoidal positional encodings proposed by Vaswani et al. is crucial for allowing the model to easily learn to attend by relative positions?

positional encoding Hard
A. The positional encoding vectors are orthogonal to all token embeddings.
B. The norm of the positional encoding vector decays exponentially as the sequence position increases.
C. The dot product between $PE_{pos}$ and $PE_{pos+k}$ is strictly zero for any $k \neq 0$.
D. The positional encoding for any fixed offset $k$, $PE_{pos+k}$, can be represented as a linear function of $PE_{pos}$.
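The relative-position property of sinusoidal encodings follows from the angle-addition identities. The block below works it out for a single sine/cosine pair with frequency $\omega_i$; the symbols $\omega_i$, $pos$, and $k$ are notation chosen for this sketch.

```latex
\begin{pmatrix}
\sin\bigl(\omega_i (pos + k)\bigr) \\
\cos\bigl(\omega_i (pos + k)\bigr)
\end{pmatrix}
=
\begin{pmatrix}
\cos(\omega_i k) & \sin(\omega_i k) \\
-\sin(\omega_i k) & \cos(\omega_i k)
\end{pmatrix}
\begin{pmatrix}
\sin(\omega_i\, pos) \\
\cos(\omega_i\, pos)
\end{pmatrix}
```

The matrix depends only on the offset $k$ and not on $pos$, so shifting by a fixed offset is the same linear map (a rotation) at every position.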

44 In standard Transformer architectures, Layer Normalization can be applied either before the sub-layers (Pre-LN) or after the sub-layers (Post-LN). Which of the following best characterizes the impact of choosing Pre-LN over Post-LN?

transformer encoder and decoder blocks Hard
A. Pre-LN restricts the gradients from propagating deeply into the network, acting similarly to a bottleneck layer.
B. Pre-LN eliminates the need for positional encodings, whereas Post-LN requires them for convergence.
C. Pre-LN provides better gradient flow near the output layer, but requires a strict learning rate warmup to avoid early divergence.
D. Pre-LN generally results in more stable training and removes the strict necessity for learning rate warmup, though it may slightly degrade final performance compared to Post-LN.

45 While both Byte-Pair Encoding (BPE) and WordPiece are subword tokenization algorithms, their merging criteria differ. BPE merges pairs based on frequency. What criterion does WordPiece primarily use to determine which subword pair to merge?

Byte-Pair Encoding and WordPiece Hard
A. It merges the pair that maximizes the likelihood of the training data when trained as a unigram language model.
B. It merges the pair that yields the largest mutual information, calculated as the frequency of the pair divided by the product of individual frequencies.
C. It randomly selects a pair to merge based on a temperature parameter to inject noise into the tokenization process.
D. It merges pairs based on morphological suffixes and prefixes derived from a predefined language-specific rule set.

46 In BERT's Masked Language Modeling (MLM), 15% of the tokens are chosen for prediction. Of these chosen tokens, 80% are replaced with [MASK], 10% with a random word, and 10% kept unchanged. What is the primary reason for keeping 10% of the selected tokens unchanged?

masked language modeling Hard
A. To bias the model towards retaining its original word embeddings in the lower layers.
B. To mitigate the discrepancy between pre-training and fine-tuning, teaching the model that the observed token might actually be the correct token and it should maintain its contextual representation.
C. To prevent the model from assigning zero probability to the actual input token, which acts as a regularization mechanism similar to label smoothing.
D. To force the model to rely solely on the target token's embedding rather than the surrounding context.

47 Which of the following pretrained models completely removed the Next Sentence Prediction (NSP) objective and demonstrated empirically that removing it actually improved performance on downstream tasks?

next sentence prediction Hard
A. ALBERT
B. BART
C. DeBERTa
D. RoBERTa

48 During autoregressive generation in causal language models, a 'KV cache' is often utilized. If a model generates a sequence of length $n$, how does the memory complexity of the KV cache scale with respect to $n$?

causal language modeling Hard
A.
B.
C.
D.

49 T5 (Text-to-Text Transfer Transformer) uses a unique unsupervised pre-training objective known as 'span corruption'. How does T5 handle the targets for masked spans during pre-training compared to BERT?

T5 Hard
A. T5 replaces contiguous spans with a single unique sentinel token and the decoder must generate only the corrupted spans delimited by those sentinel tokens.
B. T5 uses a discriminator network to predict whether a span was corrupted, similar to ELECTRA, rather than generating the tokens.
C. T5 reconstructs the entire original text in the decoder, penalizing outputs that deviate from the uncorrupted input sequence.
D. T5 replaces spans with multiple [MASK] tokens and decodes them sequentially, whereas BERT predicts them independently.

50 When fine-tuning BERT for text classification, it is standard practice to pass the final hidden state of the [CLS] token to a classification head. If one extracts the [CLS] embedding from a pre-trained BERT without any fine-tuning, why is it generally a poor sentence representation for tasks like semantic textual similarity?

fine-tuning for text classification Hard
A. During pre-training, the [CLS] token is specifically optimized only as an indicator for the Masked Language Modeling task, ignoring sentence-level semantics.
B. The [CLS] representation is heavily biased towards the Next Sentence Prediction (NSP) objective, acting essentially as a binary feature rather than capturing a continuous semantic space.
C. The attention mechanism is masked such that the [CLS] token cannot attend to the rest of the sequence unless fine-tuning unmasks it.
D. The [CLS] token embedding is deterministically initialized to zeros and only receives gradients during the fine-tuning phase.

51 When fine-tuning a transformer model for Named Entity Recognition (NER), subword tokenization often splits a single word into multiple tokens (e.g., 'Washington' -> 'Wash', '##ington'). What is the standard practice in HuggingFace for assigning labels to these subwords to compute the loss correctly?

named entity recognition Hard
A. Assign B-LOC to the first subword and I-LOC to all subsequent subwords of the same word.
B. Assign the exact same target label to all subwords of the word to reinforce the gradient.
C. Assign the target label (e.g., B-LOC) to the first subword and assign a special ignore index (e.g., -100) to the remaining subwords.
D. Sum the logits of all subwords corresponding to a word before applying the softmax and calculating the loss.
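One common labeling convention for subword tokens can be sketched without any library dependency. The `word_ids` list below mimics the per-token word indices that fast tokenizers expose (with `None` for special tokens), and -100 is the default `ignore_index` of PyTorch's cross-entropy loss. The function name `align_labels` and the integer label ids are assumptions for this example.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label only the first subword of each word; ignore the rest in the loss."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)      # special tokens such as [CLS], [SEP]
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword carries the word's label
        else:
            aligned.append(ignore_index)      # later subwords are skipped by the loss
        previous = wid
    return aligned

# [CLS] Wash ##ington is [SEP], with hypothetical ids B-LOC = 1 and O = 0.
labels = align_labels([None, 0, 0, 1, None], [1, 0])
```

Positions marked with the ignore index contribute nothing to the loss, so each word is counted once regardless of how many pieces the tokenizer produced.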

52 In Extractive Question Answering tasks with unanswerable questions (like SQuAD 2.0), how does a standard BERT-based architecture signal that a question cannot be answered from the given context?

question answering Hard
A. By generating an empty string through a specialized decoder head attached to the final layer.
B. By pointing both the predicted start and end spans to the [CLS] token at index 0.
C. By outputting start and end logits where the start index is strictly greater than the end index.
D. By setting all token probabilities in the context to precisely zero using a specialized thresholding function.

53 Consider a Transformer block with sequence length $n$ and hidden dimension $d$. The attention mechanism computes $\mathrm{softmax}(QK^\top / \sqrt{d_k})V$. If $n$ is significantly larger than $d$, which operation becomes the primary computational bottleneck?

transformer architecture Hard
A. The Softmax activation function, scaling as $O(n^2)$.
B. The Feed-Forward Network (FFN) sub-layer, scaling as $O(n d^2)$.
C. The computation of the attention scores $QK^\top$, scaling as $O(n^2 d)$.
D. The linear projections to form Q, K, and V, scaling as $O(n d^2)$.

54 BERT relies on Token Embeddings, Segment (Token Type) Embeddings, and Positional Embeddings. If you are fine-tuning BERT for a single-sequence classification task (e.g., Sentiment Analysis), how are the Segment Embeddings typically handled?

BERT Hard
A. They are dynamically generated based on the length of the sequence using a sinusoidal function.
B. A single segment ID (usually 0) is passed for all tokens in the sequence, mapping to a learned vector that is added to the token embeddings.
C. They are omitted entirely, and the model architecture is adjusted to bypass the segment embedding addition.
D. A constant vector of ones is added to all token embeddings to indicate a single segment.

55 When adapting a large pre-trained language model to a specific task using Low-Rank Adaptation (LoRA), which of the following is true regarding the model's parameters?

transfer learning for NLP tasks Hard
A. LoRA freezes the original pre-trained weights and injects trainable rank decomposition matrices, maintaining the same inference latency as the base model when merged.
B. LoRA introduces bottleneck layers between every transformer block, increasing the depth of the model and slightly increasing inference latency.
C. LoRA fine-tunes only the final layer of the network while applying a sparse mask to the gradients of the earlier layers.
D. LoRA completely replaces the self-attention weights with smaller matrices, significantly reducing the memory required for the forward pass during inference.
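The low-rank update used by LoRA can be stated compactly. In the block below, $W_0$ is a frozen pre-trained weight matrix and $B$, $A$ are the trainable low-rank factors; the dimension names $d$, $k$, $r$ are notation for this sketch, and the optional scaling constant applied to $BA$ is omitted.

```latex
h = W_0 x + B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
\\[4pt]
W_{\text{merged}} = W_0 + B A
\quad\Longrightarrow\quad
h = W_{\text{merged}}\, x
```

Because $BA$ has the same shape as $W_0$, the two can be added once after training, leaving a single matrix multiplication at inference time.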

56 In the HuggingFace generate() method, when both top_k and top_p (nucleus sampling) parameters are provided (e.g., top_k=50, top_p=0.95), in what order are these filtering operations mathematically applied to the logits?

HuggingFace Transformers Hard
A. HuggingFace raises an exception because top_k and top_p are mutually exclusive generation constraints.
B. top_k filtering is applied first, completely truncating the vocabulary to 50 tokens, and then top_p filtering is applied within that restricted set.
C. They are applied independently to two copies of the logits, and the intersection of the valid tokens is used.
D. top_p filtering is applied first, followed by top_k filtering on the remaining probabilities.
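Sequential top-k and top-p filtering can be illustrated with plain Python. This is a simplified sketch, not the HuggingFace implementation: the function name `top_k_top_p_filter` is an assumption, and it returns surviving token indices instead of modified logits.

```python
import math

def top_k_top_p_filter(logits, top_k, top_p):
    """Return token indices that survive top-k filtering followed by top-p."""
    # Step 1 (top-k): keep the k highest-scoring tokens.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = order[:top_k]

    # Step 2 (top-p): renormalize over the kept tokens only, then keep the
    # smallest prefix whose cumulative probability reaches top_p.
    m = max(logits[i] for i in kept)
    exps = [math.exp(logits[i] - m) for i in kept]
    total = sum(exps)
    survivors, cumulative = [], 0.0
    for idx, e in zip(kept, exps):
        survivors.append(idx)
        cumulative += e / total
        if cumulative >= top_p:
            break
    return survivors

# Probabilities 0.5, 0.3, 0.15, 0.05 expressed as logits.
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
survivors = top_k_top_p_filter(logits, top_k=3, top_p=0.8)
```

The nucleus is computed on probabilities renormalized over the top-k set, which is why the order of the two filters matters.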

57 In the cross-attention sub-layer of a standard Transformer Decoder, how are the Query (Q), Key (K), and Value (V) matrices derived?

transformer decoder blocks Hard
A. Q and K are derived from the final hidden states of the Encoder, while V is derived from the previous decoder layer.
B. Q, K, and V are derived from the Encoder, but a look-ahead mask is applied to prevent the decoder from seeing future tokens.
C. Q, K, and V are all derived from the output of the previous decoder layer.
D. Q is derived from the previous decoder layer, while K and V are derived from the final hidden states of the Encoder.

58 Unlike BPE which starts with characters and builds up, the Unigram Language Model tokenizer starts with a large vocabulary and prunes it down. What metric does the Unigram algorithm use to decide which tokens to remove at each iteration?

tokenization methods Hard
A. It removes the tokens whose removal results in the smallest increase in the overall negative log-likelihood of the training data.
B. It removes the tokens that have the lowest absolute frequency in the training corpus.
C. It removes tokens that are not valid morphological roots according to a predefined dictionary.
D. It removes the longest tokens first to enforce a bias towards shorter subwords and characters.

59 GPT-3 popularized the concept of 'in-context learning' (few-shot prompting) where the model performs tasks without gradient updates. From a representational perspective, how does the model 'learn' a task dynamically during inference?

GPT Hard
A. By executing an internal discrete search algorithm over its vocabulary space prompted by a specialized meta-token.
B. By relying on the self-attention mechanism to route the representations of the provided examples to modulate the hidden states of the target query.
C. By temporarily caching gradients and updating only the Layer Normalization parameters using a lightweight background thread.
D. By shifting the positional encodings of the input prompt so that the context behaves as updated weight matrices.

60 In the context of the Effective Receptive Field (ERF) of deep neural networks, how does the theoretical receptive field of a token at the very first layer of a Transformer compare to a standard CNN?

pretrained transformer models Hard
A. A Transformer's receptive field scales logarithmically with depth, while a CNN's scales linearly.
B. Transformers and CNNs both have identical local receptive fields initially, but Transformers expand theirs via positional encodings.
C. A Transformer has a global receptive field encompassing the entire sequence, whereas a CNN has a strictly local receptive field constrained by kernel size.
D. A Transformer has a local receptive field of size in the first layer, just like a CNN.