1 $Which foundational 2017 paper introduced the Transformer architecture?$

transformer architecture Easy

A.

Attention Is All You Need

B.

BERT: Pre-training of Deep Bidirectional Transformers

C.

Deep Residual Learning for Image Recognition

D.

Language Models are Few-Shot Learners

2 $What is the primary function of the self-attention mechanism in a Transformer?$

self-attention Easy

A.

To translate words from a source language to a target language word-by-word

B.

To mask out unknown words in the input text

C.

To compress the input sequence into a fixed-length vector using recurrence

D.

To relate different positions of a single sequence to compute a representation of the sequence

3 $Why does the Transformer use multi-head attention instead of a single attention function?$

multi-head attention Easy

A.

It prevents the model from overfitting on small datasets

B.

It allows the model to jointly attend to information from different representation subspaces at different positions

C.

It replaces the need for tokenization in NLP tasks

D.

It reduces the number of parameters in the model

4 $Why is positional encoding necessary in a Transformer model?$

positional encoding Easy

A.

Because it reduces the memory footprint of the attention mechanism

B.

Because it is required to calculate the softmax function properly

C.

Because Transformers process sequences in parallel and lack inherent notions of sequence order

D.

Because it normalizes the input embeddings to have a mean of zero

5 $Which component is present in a standard Transformer decoder block but NOT in an encoder block?$

transformer encoder and decoder blocks Easy

A.

Masked Multi-Head Self-Attention

B.

Layer Normalization

C.

Multi-Head Self-Attention

D.

Feed-Forward Neural Network

6 $What is the primary advantage of subword tokenization methods over traditional word-level tokenization?$

tokenization methods Easy

A.

They are strictly rule-based and do not require training on a corpus

B.

They help handle Out-Of-Vocabulary (OOV) words by breaking them into known subwords

C.

They completely eliminate the need for an embedding layer

D.

They ensure that every word is represented by a single, unique integer ID

7 $How does the Byte-Pair Encoding (BPE) algorithm initially build its vocabulary?$

Byte-Pair Encoding and WordPiece Easy

A.

It hashes every word in the training corpus into a fixed-size vector space

B.

It starts with whole sentences and breaks them down using linguistic rules

C.

It uses a pre-defined dictionary from WordNet to select valid tokens

D.

It starts with a vocabulary of individual characters and iteratively merges the most frequent adjacent pairs

8 $What does 'pretraining' mean in the context of large language models?$

pretrained transformer models Easy

A.

Training a model on a large corpus of unlabeled text using a self-supervised objective before adapting it to specific tasks

B.

Testing a model on a validation set to tune its hyperparameters

C.

Training a model explicitly on human feedback and supervised labels from the very beginning

D.

Hardcoding linguistic rules into the model before deployment

9 $What does the acronym BERT stand for?$

BERT Easy

A.

Binary Encoded Recurrent Transformers

B.

Bidirectional Encoder Representations from Transformers

C.

Bidirectional Evaluation of Recurrent Transformers

D.

Basic Entity Recognition from Text

10 $The GPT (Generative Pre-trained Transformer) architecture is primarily based on which part of the original Transformer?$

GPT Easy

A.

The Cross-Attention Mechanism

B.

The Encoder

C.

The Decoder

D.

Only the Positional Encodings

11 $What is the defining characteristic of the T5 (Text-to-Text Transfer Transformer) framework?$

T5 Easy

A.

It processes audio and text simultaneously in a single model

B.

It casts every NLP task as a text-to-text problem, where both input and output are strings

C.

It uses reinforcement learning instead of cross-entropy loss

D.

It uses only an encoder architecture and cannot generate text

12 $In Masked Language Modeling (MLM), what is the model trained to do?$

masked language modeling Easy

A.

Predict the original identity of tokens that have been randomly replaced with a [MASK] token

B.

Translate a masked sentence into another language

C.

Predict the very next sentence in a document

D.

Generate an entire paragraph from a single prompt word

13 $What is the objective of the Next Sentence Prediction (NSP) task used during BERT's pretraining?$

next sentence prediction Easy

A.

To generate the subsequent sentence given a starting phrase

B.

To classify the grammatical correctness of the next sentence

C.

To predict whether Sentence B logically and sequentially follows Sentence A in the original document

D.

To predict the exact word count of the following sentence

14 $Which of the following best describes Causal Language Modeling (CLM)?$

causal language modeling Easy

A.

Predicting a masked word using both left and right context

B.

Classifying whether a text expresses a positive or negative sentiment

C.

Finding the root cause of grammatical errors in a sentence

D.

Predicting the next token in a sequence given only the previous tokens (left-to-right)

15 $What is a major benefit of using transfer learning in NLP?$

transfer learning for NLP tasks Easy

A.

It eliminates the need for computational resources like GPUs

B.

It guarantees that the model will never produce biased or toxic outputs

C.

It allows a model pretrained on a massive corpus to achieve high performance on downstream tasks with relatively little labeled data

D.

It completely removes the need to tokenize text

16 $When fine-tuning BERT for a sequence classification task (like sentiment analysis), which special token's final hidden state is typically passed to the classification layer?$

fine-tuning for text classification Easy

A.

[SEP]

B.

[CLS]

C.

[PAD]

D.

[MASK]

17 $Named Entity Recognition (NER) using Transformers is typically modeled as what type of task?$

named entity recognition Easy

A.

Next sentence prediction

B.

Sequence-to-sequence generation

C.

Token classification

D.

Document clustering

18 $In an extractive Question Answering task using a model like BERT, what does the model output?$

question answering Easy

A.

A simple 'Yes' or 'No' based on the question

B.

A newly generated sentence summarizing the answer

C.

The Wikipedia URL containing the answer

D.

The start and end token indices of the answer span within the provided context

19 $What is the primary purpose of the HuggingFace Transformers library?$

HuggingFace Transformers Easy

A.

To provide open-source APIs, tools, and pretrained weights for state-of-the-art NLP models

B.

To serve as a cloud storage database for raw text datasets

C.

To provide a proprietary operating system for machine learning servers

D.

To replace Python's standard string manipulation functions

20 $In the context of the Transformer architecture, what is passed through the Feed-Forward Neural Network in an encoder block?$

transformer architecture Easy

A.

The final predictions of the decoder

B.

The output of the multi-head self-attention layer (after add and norm)

C.

Only the positional encodings

D.

The raw, un-tokenized input string

21 $In the scaled dot-product attention mechanism, why are the dot products scaled by the inverse square root of the dimension of the key vectors (d_k)?$

self-attention Medium

A.

To reduce the computational complexity of the dot product

B.

To prevent the softmax gradients from vanishing

C.

To normalize the probabilities to sum to 1

D.

To match the dimensions of the queries and keys

22 $What is the primary advantage of using Multi-Head Attention over a single attention function in a Transformer?$

multi-head attention Medium

A.

It reduces the total number of parameters in the attention mechanism

B.

It removes the need for positional encodings

C.

It decreases the time complexity of the self-attention operation

D.

It allows the model to jointly attend to information from different representation subspaces at different positions

23 $How are sine and cosine functions utilized for positional encoding in the original Transformer architecture?$

positional encoding Medium

A.

Different frequencies of sine and cosine are used for different dimensions of the positional encoding vector

B.

Sine functions encode odd sequence positions and cosine functions encode even sequence positions

C.

They modulate the attention weights directly before applying the softmax function

D.

They are learned continuously during the training process via backpropagation

24 $In a standard Transformer decoder block, what is the purpose of the masked multi-head attention layer?$

transformer encoder and decoder blocks Medium

A.

To mask out padding tokens from the input sequence

B.

To prevent the decoder from attending to future tokens in the target sequence during training

C.

To compute the alignment between the encoder output and the target sequence

D.

To regularize the model by randomly dropping attention weights like dropout
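
As a rough illustration of how a look-ahead mask works, the sketch below builds a lower-triangular mask and applies it to a small score matrix. The helper names `causal_mask` and `apply_mask` are hypothetical, and real implementations operate on tensors rather than nested lists.

```python
def causal_mask(n):
    """Return an n x n mask where mask[i][j] is True if position i may attend to j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def apply_mask(scores, mask, neg=float("-inf")):
    """Replace disallowed scores with -inf so a later softmax gives them zero weight."""
    return [
        [s if ok else neg for s, ok in zip(row, mask_row)]
        for row, mask_row in zip(scores, mask)
    ]


# Made-up attention scores for a 3-token target sequence.
scores = [[0.2, 1.5, 0.3], [0.9, 0.1, 2.0], [0.4, 0.6, 0.8]]
masked = apply_mask(scores, causal_mask(3))
for row in masked:
    print(row)
```

Each row keeps only the scores for its own position and earlier ones, so the token at position i cannot draw on positions after i.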

25 $When training a Byte-Pair Encoding (BPE) tokenizer, how are new subwords formed?$

Byte-Pair Encoding and WordPiece Medium

A.

By replacing rare words with an <UNK> token directly

B.

By combining characters that maximize the likelihood of the training data language model

C.

By splitting words based on linguistic morphological rules

D.

By iteratively merging the most frequently occurring pair of adjacent symbols
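
The BPE training loop can be sketched in a few lines. This is a minimal toy version, assuming a made-up four-word corpus and hypothetical helper names; production tokenizers add end-of-word markers, byte-level fallbacks, and far faster pair counting.

```python
from collections import Counter


def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (invented for illustration): word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}  # start from single characters

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

The learned merge list, applied in order, is what later segments unseen words into known subwords.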

26 $Which components make up the input representation for a given token in the BERT model?$

BERT Medium

A.

Byte-Pair Embeddings and Absolute Position Embeddings

B.

Word Embeddings, Character Embeddings, and Mask Embeddings

C.

Token Embeddings, Syntactic Embeddings, and Position Embeddings

D.

Token Embeddings, Segment Embeddings, and Position Embeddings

27 $During the pretraining phase of BERT using Masked Language Modeling (MLM), how are the selected tokens replaced?$

masked language modeling Medium

A.

80% kept unchanged, 10% with [MASK], 10% with a random token

B.

100% replaced with the [MASK] token

C.

80% with [MASK], 10% with a random token, 10% kept unchanged

D.

50% with [MASK] and 50% with a random token
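
A minimal sketch of BERT-style token corruption follows. It assumes the 15% selection rate reported for BERT and the 80/10/10 split among selected tokens; the function name `mask_tokens` and the tiny vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15, mask_token="[MASK]"):
    """Return (corrupted tokens, labels); a label is None where no prediction is made."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)      # not selected: left alone, no loss
            labels.append(None)
            continue
        labels.append(tok)             # selected: the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(mask_token)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep unchanged
    return corrupted, labels


rng = random.Random(0)
sentence = "the cat sat on the mat".split()
print(mask_tokens(sentence, vocab=["dog", "ran", "hat"], rng=rng))
```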

28 $Which of the following best describes the architecture of the Generative Pre-trained Transformer (GPT) models?$

GPT Medium

A.

Encoder-decoder architecture with cross-attention

B.

Masked language model architecture

C.

Bidirectional encoder-only architecture

D.

Autoregressive decoder-only architecture

29 $In causal language modeling, what objective function is the model primarily optimizing?$

causal language modeling Medium

A.

Minimizing the reconstruction error of a corrupted input sequence

B.

Maximizing the likelihood of predicting the next token given all previous tokens in the sequence

C.

Classifying whether the second half of a sequence logically follows the first half

D.

Predicting randomly masked words within a bidirectional context

30 $How does the Text-to-Text Transfer Transformer (T5) format its input and output for different NLP tasks?$

T5 Medium

A.

Every task is cast as feeding a text sequence as input and generating a new text sequence as output

B.

Tasks are differentiated by adding task-specific classification heads on top of the encoder

C.

It requires replacing the decoder block for every specific task like translation or summarization

D.

It uses unique segment embeddings for each NLP task type

31 $When fine-tuning a BERT model for text classification, which vector is typically passed to the final classification layer?$

fine-tuning for text classification Medium

A.

The final hidden state corresponding to the [CLS] token

B.

The final hidden state of the [SEP] token

C.

The concatenation of the first and last token's hidden states

D.

The average pooling of all tokens' final hidden states

32 $In the HuggingFace Transformers library, what is the role of the AutoModelForSequenceClassification class?$

HuggingFace Transformers Medium

A.

It automatically performs hyperparameter tuning for classification metrics

B.

It searches the model hub for the best performing model for text classification

C.

It automatically instantiates a base model with a sequence classification head on top

D.

It dynamically changes the vocabulary size based on the classification task

33 $In BERT's Next Sentence Prediction (NSP) task, what defines a positive pair during pretraining?$

next sentence prediction Medium

A.

Two sentences extracted from different documents but sharing the same topic

B.

Two sentences that appear consecutively in the original corpus

C.

Two sentences that are semantically similar according to cosine similarity

D.

Two sentences where one is a direct paraphrase of the other

34 $How is fine-tuning adapted for a Token Classification task like Named Entity Recognition (NER) using a transformer?$

named entity recognition Medium

A.

The output embeddings are clustered to find entity boundaries

B.

The model is trained to generate the entity labels sequentially as text

C.

A single classification head is applied only to the [CLS] token

D.

A classification head is applied to the final hidden state of every individual token in the sequence

35 $When applying a transformer model to Extractive Question Answering (like SQuAD), how is the answer extracted from the context?$

question answering Medium

A.

By classifying the context into predefined answer categories

B.

By ranking multiple choice answers based on cosine similarity with the question

C.

By autoregressively generating the text of the answer word-by-word

D.

By predicting two probabilities for each token in the context: one for being the start of the answer span and one for being the end

36 $Which component exists in a Transformer Decoder block but is ABSENT from a Transformer Encoder block?$

transformer encoder and decoder blocks Medium

A.

Layer Normalization

B.

Self-Attention mechanism

C.

Encoder-Decoder Cross-Attention mechanism

D.

Feed-Forward Neural Network

37 $What is the main benefit of using transfer learning in modern NLP?$

transfer learning for NLP tasks Medium

A.

It allows language models to translate between languages without any task-specific data

B.

It ensures the model cannot overfit on the downstream task regardless of the training time

C.

Pretraining on massive unlabeled corpora allows models to learn rich language representations, reducing the need for large labeled datasets during fine-tuning

D.

It completely eliminates the need for computational resources during the fine-tuning phase

38 $How does WordPiece tokenization differ primarily from Byte-Pair Encoding (BPE)?$

Byte-Pair Encoding and WordPiece Medium

A.

WordPiece chooses symbol pairs that maximize the likelihood of the training data given the language model, rather than just frequency

B.

WordPiece relies strictly on linguistic rules such as stemming and lemmatization

C.

WordPiece operates on complete words rather than subwords or characters

D.

WordPiece cannot handle out-of-vocabulary words and assigns them to an <UNK> token

39 $In the standard Transformer architecture, what specific type of neural network follows the attention mechanisms within both encoder and decoder layers?$

transformer architecture Medium

A.

A Position-wise Feed-Forward Network applied identically to each position separately

B.

A Max-Pooling layer to reduce the dimensionality of the sequence

C.

A Recurrent Neural Network (RNN) to process the sequential output

D.

A Convolutional Neural Network (CNN) to capture local n-gram features

40 $Given an input sequence of length n and hidden dimension d, what is the computational time complexity of the self-attention operation?$

self-attention Medium

A.

B.

C.

D.

41 $In the standard scaled dot-product attention, the dot products of queries and keys are divided by sqrt(d_k). What is the primary theoretical justification for this specific scaling factor?$

self-attention Hard

A.

It ensures that the computational complexity of the attention mechanism reduces from to .

B.

It prevents the variance of the dot product from scaling linearly with d_k, which would otherwise push the softmax function into regions with extremely small gradients.

C.

It normalizes the self-attention weights so that they sum to exactly 1 across the sequence length.

D.

It guarantees that the attention scores are bounded within the range before passing through the softmax function.
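
The effect of the scaling factor on variance can be checked empirically. The sketch below assumes query and key components are independent with zero mean and unit variance (the assumption used in the original Transformer paper's footnote); the function name and trial count are arbitrary choices.

```python
import random
import statistics


def dot_product_variance(d_k, scale, trials=1000, seed=0):
    """Estimate Var(q . k), optionally divided by sqrt(d_k), for standard-normal q and k."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        samples.append(dot / d_k ** 0.5 if scale else dot)
    return statistics.pvariance(samples)


for d_k in (16, 64, 256):
    raw = dot_product_variance(d_k, scale=False)
    scaled = dot_product_variance(d_k, scale=True)
    print(f"d_k={d_k}: raw variance ~ {raw:.1f}, scaled variance ~ {scaled:.2f}")
```

Under these assumptions the unscaled variance tracks d_k, while the scaled variance stays near 1 regardless of dimension.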

42 $Consider a Multi-Head Attention layer with h heads and a model dimension of d_model. If d_k = d_v = d_model / h, how many trainable weights (excluding biases) are present in the query, key, value, and output projection matrices combined for this single layer?$

multi-head attention Hard

A.

B.

C.

D.
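
A small counting sketch may help with this kind of question. It assumes the common convention that each head has dimension d_model / h for keys and values; the function name is hypothetical, and the example values 512 and 8 are the base configuration from the original Transformer paper, not values taken from this quiz.

```python
def attention_projection_weights(d_model, num_heads):
    """Count weights in W_Q, W_K, W_V and W_O for one layer, excluding biases."""
    assert d_model % num_heads == 0
    d_head = d_model // num_heads                   # assumed: d_k = d_v = d_model / h
    per_head_qkv = 3 * d_model * d_head             # one head's W_Q, W_K and W_V
    output_proj = (num_heads * d_head) * d_model    # W_O maps concatenated heads back
    return num_heads * per_head_qkv + output_proj


print(attention_projection_weights(512, 8))
print(attention_projection_weights(512, 1))
```

Under this convention the total does not depend on the number of heads, because splitting into heads only partitions the same projection matrices.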

43 $Which of the following mathematical properties of the sinusoidal positional encodings proposed by Vaswani et al. is crucial for allowing the model to easily learn to attend by relative positions?$

positional encoding Hard

A.

The dot product between PE(pos) and PE(pos+k) is strictly zero for any nonzero offset k.

B.

The norm of the positional encoding vector decays exponentially as the sequence position increases.

C.

The positional encoding vectors are orthogonal to all token embeddings.

D.

The positional encoding for any fixed offset k, PE(pos+k), can be represented as a linear function of PE(pos).
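
The relative-position property of sinusoidal encodings can be verified numerically. The sketch below implements the encoding from Vaswani et al. and a hypothetical `shift` helper that applies a rotation depending only on the offset k, not on the position; it assumes an even `d_model`.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) with a linear transform that depends only on k."""
    out = []
    for i in range(0, d_model, 2):
        w = 1 / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        out.append(s * math.cos(w * k) + c * math.sin(w * k))  # sin(a + b)
        out.append(c * math.cos(w * k) - s * math.sin(w * k))  # cos(a + b)
    return out


d_model = 8
shifted = shift(positional_encoding(5, d_model), 3, d_model)
direct = positional_encoding(8, d_model)
print(max(abs(x - y) for x, y in zip(shifted, direct)))
```

Each sine/cosine pair is rotated by a fixed angle, which is exactly the angle-addition identity applied per frequency.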

44 $In standard Transformer architectures, Layer Normalization can be applied either before the sub-layers (Pre-LN) or after the sub-layers (Post-LN). Which of the following best characterizes the impact of choosing Pre-LN over Post-LN?$

transformer encoder and decoder blocks Hard

A.

Pre-LN restricts the gradients from propagating deeply into the network, acting similarly to a bottleneck layer.

B.

Pre-LN eliminates the need for positional encodings, whereas Post-LN requires them for convergence.

C.

Pre-LN generally results in more stable training and removes the strict necessity for learning rate warmup, though it may slightly degrade final performance compared to Post-LN.

D.

Pre-LN provides better gradient flow near the output layer, but requires a strict learning rate warmup to avoid early divergence.

45 $While both Byte-Pair Encoding (BPE) and WordPiece are subword tokenization algorithms, their merging criteria differ. BPE merges pairs based on frequency. What criterion does WordPiece primarily use to determine which subword pair to merge?$

Byte-Pair Encoding and WordPiece Hard

A.

It merges the pair that maximizes the likelihood of the training data when trained as a unigram language model.

B.

It merges pairs based on morphological suffixes and prefixes derived from a predefined language-specific rule set.

C.

It merges the pair that yields the largest mutual information, calculated as the frequency of the pair divided by the product of individual frequencies.

D.

It randomly selects a pair to merge based on a temperature parameter to inject noise into the tokenization process.

46 $In BERT's Masked Language Modeling (MLM), 15% of the tokens are chosen for prediction. Of these chosen tokens, 80% are replaced with [MASK], 10% with a random word, and 10% kept unchanged. What is the primary reason for keeping 10% of the selected tokens unchanged?$

masked language modeling Hard

A.

To force the model to rely solely on the target token's embedding rather than the surrounding context.

B.

To mitigate the discrepancy between pre-training and fine-tuning, teaching the model that the observed token might actually be the correct token and it should maintain its contextual representation.

C.

To prevent the model from assigning zero probability to the actual input token, which acts as a regularization mechanism similar to label smoothing.

D.

To bias the model towards retaining its original word embeddings in the lower layers.

47 $Which of the following pretrained models completely removed the Next Sentence Prediction (NSP) objective and demonstrated empirically that removing it actually improved performance on downstream tasks?$

next sentence prediction Hard

A.

DeBERTa

B.

RoBERTa

C.

BART

D.

ALBERT

48 $During autoregressive generation in causal language models, a 'KV cache' is often utilized. If a model generates a sequence of length n, how does the memory complexity of the KV cache scale with respect to n?$

causal language modeling Hard

A.

B.

C.

D.
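
A back-of-the-envelope estimate shows how KV cache size depends on sequence length. The sketch assumes standard multi-head attention with one key and one value vector cached per token, per head, per layer; the function name and the 12-layer, 12-head configuration are hypothetical.

```python
def kv_cache_bytes(seq_len, num_layers, num_heads, head_dim, bytes_per_value=2):
    """Memory for the cached keys and values of one sequence (factor 2 = keys + values)."""
    return 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_value


# Hypothetical small model, 16-bit values.
for n in (1024, 2048, 4096):
    size = kv_cache_bytes(n, num_layers=12, num_heads=12, head_dim=64)
    print(f"n={n}: {size / 2**20:.0f} MiB")
```

Sequence length appears as a single multiplicative factor, so doubling n doubles the cache.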

49 $T5 (Text-to-Text Transfer Transformer) uses a unique unsupervised pre-training objective known as 'span corruption'. How does T5 handle the targets for masked spans during pre-training compared to BERT?$

T5 Hard

A.

T5 replaces contiguous spans with a single unique sentinel token and the decoder must generate only the corrupted spans delimited by those sentinel tokens.

B.

T5 uses a discriminator network to predict whether a span was corrupted, similar to ELECTRA, rather than generating the tokens.

C.

T5 replaces spans with multiple [MASK] tokens and decodes them sequentially, whereas BERT predicts them independently.

D.

T5 reconstructs the entire original text in the decoder, penalizing outputs that deviate from the uncorrupted input sequence.

50 $When fine-tuning BERT for text classification, it is standard practice to pass the final hidden state of the [CLS] token to a classification head. If one extracts the [CLS] embedding from a pre-trained BERT without any fine-tuning, why is it generally a poor sentence representation for tasks like semantic textual similarity?$

fine-tuning for text classification Hard

A.

The [CLS] representation is heavily biased towards the Next Sentence Prediction (NSP) objective, acting essentially as a binary feature rather than capturing a continuous semantic space.

B.

The [CLS] token embedding is deterministically initialized to zeros and only receives gradients during the fine-tuning phase.

C.

During pre-training, the [CLS] token is specifically optimized only as an indicator for the Masked Language Modeling task, ignoring sentence-level semantics.

D.

The attention mechanism is masked such that the [CLS] token cannot attend to the rest of the sequence unless fine-tuning unmasks it.

51 $When fine-tuning a transformer model for Named Entity Recognition (NER), subword tokenization often splits a single word into multiple tokens (e.g., 'Washington' -> 'Wash', '##ington'). What is the standard practice in HuggingFace for assigning labels to these subwords to compute the loss correctly?$

named entity recognition Hard

A.

Assign the exact same target label to all subwords of the word to reinforce the gradient.

B.

Sum the logits of all subwords corresponding to a word before applying the softmax and calculating the loss.

C.

Assign B-LOC to the first subword and I-LOC to all subsequent subwords of the same word.

D.

Assign the target label (e.g., B-LOC) to the first subword and assign a special ignore index (e.g., -100) to the remaining subwords.
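
Label alignment for subword tokens can be sketched as follows. The `word_ids` list mirrors the kind of output a fast tokenizer's `word_ids()` method provides (one word index per token, `None` for special tokens), but here it is written by hand; the label ids and the helper `align_labels` are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels):
    """Label only the first subword of each word; ignore special tokens and continuations."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Tokens: [CLS] Wash ##ington is nice [SEP]
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [5, 0, 0]  # hypothetical label ids: 5 = B-LOC, 0 = O
aligned = align_labels(word_ids, word_labels)
print(aligned)
```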

52 $In Extractive Question Answering tasks with unanswerable questions (like SQuAD 2.0), how does a standard BERT-based architecture signal that a question cannot be answered from the given context?$

question answering Hard

A.

By outputting start and end logits where the start index is strictly greater than the end index.

B.

By pointing both the predicted start and end spans to the [CLS] token at index 0.

C.

By generating an empty string through a specialized decoder head attached to the final layer.

D.

By setting all token probabilities in the context to precisely zero using a specialized thresholding function.

53 $Consider a Transformer block with sequence length n and hidden dimension d. The attention mechanism computes softmax(QK^T / sqrt(d_k)) V. If n is significantly larger than d, which operation becomes the primary computational bottleneck?$

transformer architecture Hard

A.

The Softmax activation function, scaling as O(n^2).

B.

The Feed-Forward Network (FFN) sub-layer, scaling as O(n · d^2).

C.

The computation of the attention scores, scaling as O(n^2 · d).

D.

The linear projections to form Q, K, and V, scaling as O(n · d^2).

54 $BERT relies on Token Embeddings, Segment (Token Type) Embeddings, and Positional Embeddings. If you are fine-tuning BERT for a single-sequence classification task (e.g., Sentiment Analysis), how are the Segment Embeddings typically handled?$

BERT Hard

A.

A single segment ID (usually 0) is passed for all tokens in the sequence, mapping to a learned vector that is added to the token embeddings.

B.

They are dynamically generated based on the length of the sequence using a sinusoidal function.

C.

A constant vector of ones is added to all token embeddings to indicate a single segment.

D.

They are omitted entirely, and the model architecture is adjusted to bypass the segment embedding addition.

55 $When adapting a large pre-trained language model to a specific task using Low-Rank Adaptation (LoRA), which of the following is true regarding the model's parameters?$

transfer learning for NLP tasks Hard

A.

LoRA completely replaces the self-attention weights with smaller matrices, significantly reducing the memory required for the forward pass during inference.

B.

LoRA freezes the original pre-trained weights and injects trainable rank decomposition matrices, maintaining the same inference latency as the base model when merged.

C.

LoRA introduces bottleneck layers between every transformer block, increasing the depth of the model and slightly increasing inference latency.

D.

LoRA fine-tunes only the final layer of the network while applying a sparse mask to the gradients of the earlier layers.
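
The merge step in LoRA can be illustrated with tiny matrices. This is a simplified sketch: it uses a rank-1 update with made-up values, omits the alpha/r scaling factor that LoRA normally applies, and relies on hypothetical list-based helpers instead of a tensor library.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen pretrained weight
B = [[0.5], [0.0], [-1.0]]                               # 3 x r, trainable (r = 1)
A = [[1.0, 2.0, 0.0]]                                    # r x 3, trainable
x = [[1.0], [2.0], [3.0]]                                # column-vector input

# Training-time view: frozen path plus low-rank path, W x + B (A x).
two_path = add(matmul(W, x), matmul(B, matmul(A, x)))

# Inference-time view: fold the update into W once, then do a single matmul.
W_merged = add(W, matmul(B, A))
merged = matmul(W_merged, x)

print(two_path, merged)
```

Because the merged matrix has the same shape as W, the forward pass after merging costs the same as in the base model.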

56 $In the HuggingFace generate() method, when both top_k and top_p (nucleus sampling) parameters are provided (e.g., top_k=50, top_p=0.95), in what order are these filtering operations mathematically applied to the logits?$

HuggingFace Transformers Hard

A.

top_p filtering is applied first, followed by top_k filtering on the remaining probabilities.

B.

They are applied independently to two copies of the logits, and the intersection of the valid tokens is used.

C.

HuggingFace raises an exception because top_k and top_p are mutually exclusive generation constraints.

D.

top_k filtering is applied first, completely truncating the vocabulary to 50 tokens, and then top_p filtering is applied within that restricted set.
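
The interaction between the two filters can be sketched in plain Python. This is a simplified illustration of sequential filtering, not the library's actual implementation; the function name and the example distribution are hypothetical.

```python
import math


def top_k_then_top_p(logits, top_k, top_p):
    """Return the token ids that survive top-k filtering followed by top-p filtering."""
    # Step 1: keep only the top_k highest-scoring token ids.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Step 2: renormalise over the survivors, then keep the smallest prefix
    # whose cumulative probability reaches top_p.
    weights = [math.exp(logits[i] - logits[ranked[0]]) for i in ranked]
    total = sum(weights)
    kept, cumulative = [], 0.0
    for i, w in zip(ranked, weights):
        kept.append(i)
        cumulative += w / total
        if cumulative >= top_p:
            break
    return set(kept)


# Made-up distribution over five tokens.
probs = [0.5, 0.3, 0.1, 0.06, 0.04]
logits = [math.log(p) for p in probs]
survivors = top_k_then_top_p(logits, top_k=3, top_p=0.85)
print(survivors)
```

In this example, the renormalised mass of tokens 0 and 1 already exceeds 0.85 after the top-k cut, so token 2 is dropped. Applying the same top_p to the full distribution would have needed tokens 0, 1, and 2, so the order of the two filters changes the result.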

57 $In the cross-attention sub-layer of a standard Transformer Decoder, how are the Query (Q), Key (K), and Value (V) matrices derived?$

transformer encoder and decoder blocks Hard

A.

Q is derived from the previous decoder layer, while K and V are derived from the final hidden states of the Encoder.

B.

Q and K are derived from the final hidden states of the Encoder, while V is derived from the previous decoder layer.

C.

Q, K, and V are derived from the Encoder, but a look-ahead mask is applied to prevent the decoder from seeing future tokens.

D.

Q, K, and V are all derived from the output of the previous decoder layer.

58 $Unlike BPE which starts with characters and builds up, the Unigram Language Model tokenizer starts with a large vocabulary and prunes it down. What metric does the Unigram algorithm use to decide which tokens to remove at each iteration?$

tokenization methods Hard

A.

It removes the longest tokens first to enforce a bias towards shorter subwords and characters.

B.

It removes tokens that are not valid morphological roots according to a predefined dictionary.

C.

It removes the tokens that have the lowest absolute frequency in the training corpus.

D.

It removes the tokens whose removal results in the smallest increase in the overall negative log-likelihood of the training data.

59 $GPT-3 popularized the concept of 'in-context learning' (few-shot prompting) where the model performs tasks without gradient updates. From a representational perspective, how does the model 'learn' a task dynamically during inference?$

GPT Hard

A.

By relying on the self-attention mechanism to route the representations of the provided examples to modulate the hidden states of the target query.

B.

By shifting the positional encodings of the input prompt so that the context behaves as updated weight matrices.

C.

By executing an internal discrete search algorithm over its vocabulary space prompted by a specialized meta-token.

D.

By temporarily caching gradients and updating only the Layer Normalization parameters using a lightweight background thread.

60 $In the context of the Effective Receptive Field (ERF) of deep neural networks, how does the theoretical receptive field of a token at the very first layer of a Transformer compare to a standard CNN?$

pretrained transformer models Hard

A.

A Transformer has a local receptive field of size in the first layer, just like a CNN.

B.

Transformers and CNNs both have identical local receptive fields initially, but Transformers expand theirs via positional encodings.

C.

A Transformer's receptive field scales logarithmically with depth, while a CNN's scales linearly.

D.

A Transformer has a global receptive field encompassing the entire sequence, whereas a CNN has a strictly local receptive field constrained by kernel size.

Unit 5 - Practice Quiz