1. What are the two primary components of an encoder-decoder model in NLP?
encoder–decoder architectures for NLP
Easy
A.A Transformer and an Optimizer
B.A Convolutional layer and a Pooling layer
C.An Encoder and a Decoder
D.A Generator and a Discriminator
Correct Answer: An Encoder and a Decoder
Explanation:
An encoder-decoder model consists of an encoder that processes the input sequence and a decoder that generates the output sequence.
2. In a basic encoder-decoder architecture, what does the encoder produce to pass information to the decoder?
encoder–decoder architectures for NLP
Easy
A.A fixed-size context vector
B.A one-hot encoded matrix
C.A sparse dependency tree
D.A continuous stream of tokens
Correct Answer: A fixed-size context vector
Explanation:
The encoder compresses the input sequence into a fixed-size context vector (or hidden state), which summarizes the input for the decoder.
3. Which of the following neural network types was classically used to build encoders and decoders for sequential text data?
encoder–decoder architectures for NLP
Easy
A.Radial Basis Function Networks (RBFNs)
B.Generative Adversarial Networks (GANs)
C.Recurrent Neural Networks (RNNs)
D.Convolutional Neural Networks (CNNs)
Correct Answer: Recurrent Neural Networks (RNNs)
Explanation:
Recurrent Neural Networks, including LSTMs and GRUs, were the standard choice for processing sequential data in classical encoder-decoder architectures.
4. What is the primary function of a sequence-to-sequence (seq2seq) model in machine translation?
sequence-to-sequence models for machine translation and summarization
Easy
A.To classify the language of the input sentence
B.To predict the sentiment of the input sentence
C.To cluster similar words together
D.To map a sequence of words in one language to a sequence in another language
Correct Answer: To map a sequence of words in one language to a sequence in another language
Explanation:
In machine translation, seq2seq models read an input sequence in the source language and generate a corresponding sequence in the target language.
5. When a sequence-to-sequence model is used for text summarization, how does the output sequence typically compare to the input sequence?
sequence-to-sequence models for machine translation and summarization
Easy
A.It is a translated version of the input sequence
B.It is a much longer and more detailed sequence
C.It is a shorter sequence that captures the main points of the input
D.It is an exact copy of the input sequence
Correct Answer: It is a shorter sequence that captures the main points of the input
Explanation:
Text summarization involves condensing a long text into a shorter summary while retaining the most important information.
6. What primary problem with basic seq2seq models does the attention mechanism solve?
attention in deep NLP
Easy
A.The information bottleneck of using a single fixed-size context vector
B.The vanishing gradient problem in CNNs
C.The lack of word embeddings
D.The inability to process numeric data
Correct Answer: The information bottleneck of using a single fixed-size context vector
Explanation:
Attention allows the model to look at all hidden states of the encoder, avoiding the bottleneck of compressing a long sequence into a single fixed-size vector.
7. What does the attention mechanism allow the decoder to do during generation?
attention in deep NLP
Easy
A.Translate words without any training data
B.Generate multiple tokens at the exact same time
C.Ignore the encoder completely
D.Focus on different, relevant parts of the input sequence for each output step
Correct Answer: Focus on different, relevant parts of the input sequence for each output step
Explanation:
Attention computes a weight for each input token, allowing the decoder to 'attend' to the most relevant input words when generating each output word.
8. Which of the following best describes 'soft attention'?
soft attention
Easy
A.It applies a hard threshold to remove low-frequency words.
B.It calculates a weighted average of all input hidden states.
C.It selects only one single input word to focus on deterministically.
D.It randomly drops attention weights to prevent overfitting.
Correct Answer: It calculates a weighted average of all input hidden states.
Explanation:
Soft attention assigns a probability (weight) to every input token and computes a weighted sum, making the entire process smooth and differentiable.
9. In soft attention, what must the attention weights for a given decoding step sum up to?
soft attention
Easy
A.The size of the hidden layer
B.The length of the input sequence
C.$0$
D.$1$
Correct Answer: $1$
Explanation:
Because the attention weights represent a probability distribution over the input sequence, they are normalized (usually via softmax) to sum to exactly $1$.
10. Bahdanau attention is also commonly referred to by which of the following names?
Bahdanau and Luong attention
Easy
A.Multiplicative attention
B.Additive attention
C.Dot-product attention
D.Self-attention
Correct Answer: Additive attention
Explanation:
Bahdanau attention calculates the alignment score using a feed-forward neural network with a $\tanh$ activation, in which the projected decoder and encoder states are added together, which is why it is known as additive attention.
11. How does Luong attention (multiplicative attention) typically calculate the alignment score?
Bahdanau and Luong attention
Easy
A.By adding the encoder and decoder states together
B.By using a Convolutional Neural Network
C.By computing the Euclidean distance between words
D.By taking the dot product between the decoder hidden state and encoder hidden states
Correct Answer: By taking the dot product between the decoder hidden state and encoder hidden states
Explanation:
Luong attention calculates alignment scores by using multiplicative operations, such as a simple dot product between the decoder and encoder hidden states.
12. Which attention mechanism fundamentally introduced the concept of aligning and translating jointly using a feed-forward network to score alignments?
Bahdanau and Luong attention
Easy
A.Scaled Dot-Product attention
B.Bahdanau attention
C.Luong attention
D.Multi-Head attention
Correct Answer: Bahdanau attention
Explanation:
Dzmitry Bahdanau et al. introduced this alignment mechanism in their seminal paper, framing translation as joint aligning and translating using an additive scoring function.
13. When integrating attention into an encoder-decoder network, what function is typically applied to the alignment scores to obtain the final attention weights?
integrating attention into encoder–decoder networks
Easy
A.ReLU
B.Tanh
C.Sigmoid
D.Softmax
Correct Answer: Softmax
Explanation:
The softmax function is applied to the raw alignment scores to convert them into a valid probability distribution, ensuring the weights sum to $1$.
14. In an attention-equipped seq2seq model, what is the 'context vector' composed of?
integrating attention into encoder–decoder networks
Easy
A.A randomly initialized vector
B.The sum of all word embeddings in the target language
C.A weighted sum of the encoder's hidden states
D.The very last hidden state of the decoder
Correct Answer: A weighted sum of the encoder's hidden states
Explanation:
The context vector is dynamically generated for each decoding step by taking a weighted sum of the encoder's hidden states, using the attention weights.
15. Why is evaluating sequence-to-sequence generation tasks (like translation) fundamentally harder than evaluating classification tasks?
evaluation techniques
Easy
A.There is usually only one correct answer in text generation.
B.Accuracy metrics can only be applied to numerical data.
C.Classification models don't use loss functions.
D.Text generation can have multiple valid and correct outputs for the same input.
Correct Answer: Text generation can have multiple valid and correct outputs for the same input.
Explanation:
In text generation, a single sentence can be correctly translated or summarized in many different ways, making simple accuracy metrics ineffective.
16. Which evaluation metric is most heavily used for Machine Translation and relies on modified n-gram precision?
BLEU and ROUGE scores
Easy
A.BLEU
B.Perplexity
C.ROUGE
D.F1-Score
Correct Answer: BLEU
Explanation:
BLEU (Bilingual Evaluation Understudy) measures how many n-grams in the candidate translation match the reference translations, focusing primarily on precision.
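As a rough illustration of the "modified" (clipped) n-gram precision mentioned above, the sketch below counts candidate n-grams and caps each count at the maximum number of times that n-gram occurs in any single reference. The function names `ngrams` and `modified_precision` are hypothetical helpers written for this example, not part of any library, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped
    by its maximum count in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# A degenerate candidate that repeats one common word seven times.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))  # clipped count 2 out of 7
```

Without clipping, every "the" in the candidate would count as a match and the unigram precision would be 7/7; clipping limits the credit to the two occurrences found in the first reference.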
17. Which metric is commonly used for evaluating Text Summarization and relies heavily on n-gram recall?
BLEU and ROUGE scores
Easy
A.Accuracy
B.Word Error Rate (WER)
C.BLEU
D.ROUGE
Correct Answer: ROUGE
Explanation:
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) focuses on recall, measuring how much of the human reference summary is captured by the model's summary.
18. What is the purpose of the 'Brevity Penalty' (BP) in the BLEU score calculation?
BLEU and ROUGE scores
Easy
A.To penalize candidate translations that are too short compared to the reference.
B.To penalize models that take too long to translate.
C.To penalize the use of rare words.
D.To penalize candidate translations that are excessively long.
Correct Answer: To penalize candidate translations that are too short compared to the reference.
Explanation:
Since BLEU is precision-based, a model could achieve perfect precision by outputting a single correct word. The Brevity Penalty prevents this by lowering the score for translations shorter than the reference.
19. What is a major limitation of standard seq2seq models that do NOT use attention?
limitations of classical seq2seq models
Easy
A.They cannot handle any sequence data.
B.They require paired training data.
C.Performance drops significantly when processing long input sequences.
D.They can only generate one word at a time.
Correct Answer: Performance drops significantly when processing long input sequences.
Explanation:
Without attention, standard seq2seq models must compress the entire input into a single fixed-size vector, causing them to 'forget' early parts of long sequences.
20. Why do classical RNN-based seq2seq models suffer from slow training speeds compared to modern architectures like Transformers?
limitations of classical seq2seq models
Easy
A.They must process tokens sequentially, which prevents parallelization.
B.They require massive amounts of memory for attention weights.
C.They can only be trained on CPUs.
D.They use too many convolutional filters.
Correct Answer: They must process tokens sequentially, which prevents parallelization.
Explanation:
RNNs inherently process data step-by-step. To compute the hidden state at step $t$, the state at step $t-1$ must be known, making it impossible to parallelize training across the sequence.
21. In a basic RNN-based encoder-decoder architecture without attention, how does the encoder transfer information to the decoder?
encoder–decoder architectures for NLP
Medium
A.By sharing the same weight matrices for both encoding and decoding.
B.By passing its final hidden state as the initial context vector to the decoder.
C.By using a continuous feedback loop between the encoder and decoder.
D.By passing all its hidden states simultaneously to the decoder.
Correct Answer: By passing its final hidden state as the initial context vector to the decoder.
Explanation:
The standard basic encoder-decoder architecture compresses the entire input sequence into a single context vector, which is the final hidden state of the encoder RNN, and passes it to initialize the decoder.
22. When generating an output sequence in an encoder-decoder model during inference, what is typically used as the input to the decoder at time step $t$?
encoder–decoder architectures for NLP
Medium
A.The entire original input sequence.
B.The actual ground-truth token from time step $t-1$.
C.The predicted token from time step $t-1$.
D.A fixed context vector re-encoded at every step.
Correct Answer: The predicted token from time step $t-1$.
Explanation:
During inference, ground-truth labels are not available, so the model feeds its own prediction from the previous time step back into the network as the current input.
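The feedback loop described above can be sketched with a toy greedy decoder. Everything here is an assumption made for illustration: `TOY_TABLE` is a hypothetical lookup table standing in for a trained decoder, and it conditions only on the previous token, whereas a real decoder would also carry a hidden state and the encoder context.

```python
def greedy_decode(next_token_scores, start_token, end_token, max_len=10):
    """Autoregressive decoding: each step consumes the previous prediction."""
    output = []
    prev = start_token
    for _ in range(max_len):
        scores = next_token_scores(prev)       # dict: token -> score
        prev = max(scores, key=scores.get)     # the model's own prediction
        if prev == end_token:
            break
        output.append(prev)                    # fed back in on the next pass
    return output

# Hypothetical toy "decoder": a fixed table instead of a trained network.
TOY_TABLE = {
    "<s>": {"le": 0.7, "chat": 0.2, "</s>": 0.1},
    "le": {"chat": 0.8, "le": 0.1, "</s>": 0.1},
    "chat": {"dort": 0.6, "</s>": 0.4},
    "dort": {"</s>": 0.9, "chat": 0.1},
}

def toy_scores(prev_token):
    return TOY_TABLE[prev_token]

print(greedy_decode(toy_scores, "<s>", "</s>"))
```

No ground-truth token appears anywhere in the loop: the only input at each step is what the model produced at the step before.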
23. Why is Teacher Forcing commonly used during the training of sequence-to-sequence models for machine translation?
sequence-to-sequence models for machine translation and summarization
Medium
A.It allows the model to learn without requiring any target output data.
B.It stabilizes and speeds up training by feeding the true previous target token as input to the decoder.
C.It prevents the model from relying on the encoder context.
D.It automatically corrects the weights of the encoder using external dictionaries.
Correct Answer: It stabilizes and speeds up training by feeding the true previous target token as input to the decoder.
Explanation:
Teacher forcing provides the actual ground-truth token from the previous step as input to the current step. This prevents early mistakes in the sequence from cascading and ruining the context, thus improving training convergence.
24. In the context of seq2seq models for text summarization, what issue often arises when using a standard model without a pointer-generator network?
sequence-to-sequence models for machine translation and summarization
Medium
A.The model fails to process sequences longer than 10 tokens.
B.The model struggles to accurately reproduce out-of-vocabulary (OOV) words like proper nouns.
C.The model perfectly memorizes the source text but cannot generate new words.
D.The model generates summaries that are longer than the original text.
Correct Answer: The model struggles to accurately reproduce out-of-vocabulary (OOV) words like proper nouns.
Explanation:
Standard seq2seq models generate words from a fixed vocabulary. Pointer-generator networks are often added to allow the model to copy specific, out-of-vocabulary words directly from the source text.
25. Which search strategy is generally preferred during inference in seq2seq models to balance computational efficiency and translation quality?
sequence-to-sequence models for machine translation and summarization
Medium
A.Greedy Search
B.Beam Search
C.Exhaustive Search
D.Random Sampling
Correct Answer: Beam Search
Explanation:
Beam search keeps track of the top-$k$ most probable partial sequences, offering a good compromise between the computationally cheap but sub-optimal greedy search and the computationally intractable exhaustive search.
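A minimal sketch of beam search follows, under simplifying assumptions: the "model" is a hypothetical lookup table (`TOY_PROBS`) keyed on the last token only, scores are plain summed log-probabilities with no length normalization, and all probabilities are non-zero. With a beam width of 1 the procedure reduces to greedy search.

```python
import math

def beam_search(next_token_probs, start_token, end_token, beam_width=2, max_len=10):
    """Keep the beam_width highest log-probability partial sequences per step."""
    beams = [([start_token], 0.0)]      # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_token_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if max_len was hit
    return max(finished, key=lambda item: item[1])

# Hypothetical toy distribution in which the greedy first choice is a trap.
TOY_PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0}, "y": {"</s>": 1.0},
    "z": {"</s>": 1.0}, "w": {"</s>": 1.0},
}

def toy_probs(tokens):
    return TOY_PROBS[tokens[-1]]

print(beam_search(toy_probs, "<s>", "</s>", beam_width=1)[0])  # greedy path
print(beam_search(toy_probs, "<s>", "</s>", beam_width=2)[0])  # wider beam
```

In this toy setup the greedy path commits to "a" (probability 0.6) and ends with a total probability of 0.33, while the width-2 beam keeps "b" alive and finds the sequence with probability 0.36.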
26. What primary problem does the introduction of attention mechanisms solve in deep NLP?
attention in deep NLP
Medium
A.The need for Teacher Forcing during the training phase.
B.The information bottleneck caused by compressing long input sequences into a fixed-length context vector.
C.The inability of RNNs to process discrete token inputs.
D.The vanishing gradient problem in the decoder network.
Correct Answer: The information bottleneck caused by compressing long input sequences into a fixed-length context vector.
Explanation:
Attention allows the decoder to look at all encoder hidden states dynamically, removing the need to compress all source information into a single, fixed-size vector.
27. In the context of attention, the alignment score is calculated between which two components?
attention in deep NLP
Medium
A.The current decoder hidden state and the encoder hidden states.
B.The current decoder hidden state and all other decoder hidden states.
C.The input word embeddings and the output word embeddings.
D.The current encoder hidden state and the previous encoder hidden state.
Correct Answer: The current decoder hidden state and the encoder hidden states.
Explanation:
The alignment score determines how much focus the current decoder step should place on each of the encoder's hidden states.
28. Which mathematical operation is applied to the raw alignment scores to produce soft attention weights?
soft attention
Medium
A.Softmax function
B.Argmax operation
C.Sigmoid function
D.ReLU activation
Correct Answer: Softmax function
Explanation:
The softmax function converts the raw alignment scores into a probability distribution (soft attention weights) that sums to 1.
29. How does soft attention differ from hard attention computationally?
soft attention
Medium
A.Soft attention cannot be used with sequence-to-sequence models.
B.Soft attention has higher variance in gradients than hard attention.
C.Soft attention selects exactly one input word, whereas hard attention averages them.
D.Soft attention is fully differentiable, while hard attention requires reinforcement learning techniques to train.
Correct Answer: Soft attention is fully differentiable, while hard attention requires reinforcement learning techniques to train.
Explanation:
Soft attention computes a weighted average over all states and is fully differentiable, allowing standard backpropagation. Hard attention makes discrete choices, requiring techniques like REINFORCE.
30. In Bahdanau (Additive) attention, how is the alignment score typically computed?
Bahdanau and Luong attention
Medium
A.By using a feed-forward neural network with a single hidden layer.
B.By taking the dot product of the encoder and decoder hidden states.
C.By passing the states through a multi-head self-attention block.
D.By computing the cosine similarity between encoder and decoder states.
Correct Answer: By using a feed-forward neural network with a single hidden layer.
Explanation:
Bahdanau attention concatenates the decoder hidden state and the encoder hidden state, passes them through a linear layer with a $\tanh$ activation, and multiplies by a weight vector to get the score.
31. Which of the following best describes the scoring function used in Luong's general (multiplicative) attention?
Bahdanau and Luong attention
Medium
A.
B.
C.
D.
Correct Answer: $\text{score}(h_t, h_s) = h_t^\top W_a h_s$
Explanation:
Luong's general attention uses a bilinear (multiplicative) scoring function where the decoder hidden state $h_t$ and encoder hidden state $h_s$ are multiplied with a learned weight matrix $W_a$.
32. A key difference between Bahdanau and Luong attention mechanisms lies in when the context vector is used. How does Luong's global attention utilize the context vector?
Bahdanau and Luong attention
Medium
A.It concatenates the context vector with the decoder's current hidden state to compute the final attentional hidden state.
B.It uses the context vector to predict the previous hidden state.
C.It feeds the context vector exclusively to the encoder to update embeddings.
D.It uses the context vector only to compute the next encoder state.
Correct Answer: It concatenates the context vector with the decoder's current hidden state to compute the final attentional hidden state.
Explanation:
In Luong attention, the context vector is calculated using the current decoder hidden state, and then concatenated with it to produce an attentional hidden state for predicting the output. Bahdanau uses the previous hidden state to compute the context vector.
33. After computing the attention weights in an encoder-decoder network, how is the context vector generated?
integrating attention into encoder–decoder networks
Medium
A.By taking the dot product of the attention weights and the decoder hidden state.
B.By calculating the unweighted average of the encoder hidden states.
C.By computing a weighted sum of the encoder hidden states using the attention weights.
D.By applying a max-pooling operation over the encoder hidden states.
Correct Answer: By computing a weighted sum of the encoder hidden states using the attention weights.
Explanation:
The context vector is formed by multiplying each encoder hidden state by its corresponding attention weight and summing the results.
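One decoding step of this computation can be sketched in plain Python. The sketch assumes dot-product alignment scores and tiny hand-picked vectors; the helper names `softmax`, `dot`, and `attention_context` are hypothetical and chosen only for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_context(decoder_state, encoder_states):
    """Dot-product scores -> softmax weights -> weighted sum of encoder states."""
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Illustrative values only: three encoder states of dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attention_context(decoder_state, encoder_states)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The first and third encoder states receive the same raw score and therefore the same weight, and the weights sum to 1 because they come out of a softmax.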
34. When integrating attention into an RNN-based seq2seq model, what is the impact on the model's computational complexity per decoding step with respect to the input sequence length $n$?
integrating attention into encoder–decoder networks
Medium
A.The complexity becomes .
B.The complexity becomes independent of $n$.
C.The complexity becomes .
D.The complexity becomes .
Correct Answer: The complexity becomes $O(n)$.
Explanation:
At each decoding step, the attention mechanism must compute scores and a weighted sum over all $n$ encoder hidden states, resulting in $O(n)$ complexity per output token.
35. Which of the following is a significant drawback of $n$-gram based evaluation metrics like BLEU and ROUGE?
evaluation techniques
Medium
A.They cannot evaluate models that use attention mechanisms.
B.They penalize models for outputting sequences of different lengths than the reference.
C.They require computationally expensive neural network forward passes to evaluate.
D.They fail to account for semantic similarity and synonyms if exact word matches do not occur.
Correct Answer: They fail to account for semantic similarity and synonyms if exact word matches do not occur.
Explanation:
$n$-gram metrics rely on exact lexical overlap. If a model generates a perfectly valid synonym that isn't in the reference text, it receives no credit for that word.
36. In the BLEU metric, what is the purpose of the Brevity Penalty (BP)?
BLEU and ROUGE scores
Medium
A.To penalize candidate translations that use too many low-frequency words.
B.To penalize candidate translations that are too long compared to the reference.
C.To penalize candidate translations that are shorter than the reference translation.
D.To penalize candidate translations that contain grammatically incorrect $n$-grams.
Correct Answer: To penalize candidate translations that are shorter than the reference translation.
Explanation:
Because BLEU relies on precision, a very short output (e.g., just one highly confident word) could achieve an artificially high score. The Brevity Penalty reduces the score of overly short translations to prevent gaming the metric.
37. While BLEU focuses primarily on precision, ROUGE is typically designed to emphasize which metric for tasks like summarization?
BLEU and ROUGE scores
Medium
A.Accuracy
B.Recall
C.F1-Score
D.Specificity
Correct Answer: Recall
Explanation:
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) fundamentally focuses on recall, measuring how many of the $n$-grams in the human reference summaries are captured by the model-generated summary.
38. What does ROUGE-L specifically measure when evaluating a generated text sequence?
BLEU and ROUGE scores
Medium
A.The Longest Common Subsequence (LCS) between the candidate and reference texts.
B.The semantic distance using word embeddings of length $L$.
C.The average precision of all $n$-grams up to length $L$.
D.The overlap of unigrams and bigrams combined.
Correct Answer: The Longest Common Subsequence (LCS) between the candidate and reference texts.
Explanation:
ROUGE-L uses the Longest Common Subsequence (LCS), which accounts for sentence-level structure similarity naturally without requiring consecutive matches.
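To make the LCS idea concrete, here is a small sketch that computes the LCS length by dynamic programming and derives recall and precision from it. It is a simplification written for this example: tokens are whitespace-split, a single reference is assumed, and a plain F1 is reported rather than the weighted F-measure used in some ROUGE implementations. The names `lcs_length` and `rouge_l` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(candidate, reference):
    """Return (recall, precision, f1) based on the LCS length."""
    lcs = lcs_length(candidate, reference)
    recall = lcs / len(reference) if reference else 0.0
    precision = lcs / len(candidate) if candidate else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return recall, precision, f1

reference = "the cat sat on the mat".split()
candidate = "the cat was on a mat".split()
print(lcs_length(candidate, reference), rouge_l(candidate, reference))
```

Here the common subsequence "the cat on mat" is matched even though its words are not adjacent in either sentence, which is the property that distinguishes ROUGE-L from fixed-length n-gram overlap.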
39. Which of the following is a primary limitation of classical seq2seq models without attention?
limitations of classical seq2seq models
Medium
A.They require hand-crafted features for syntactic parsing.
B.They cannot be trained using standard backpropagation through time (BPTT).
C.They suffer from an information bottleneck when encoding long sequences.
D.They cannot generate text in different languages.
Correct Answer: They suffer from an information bottleneck when encoding long sequences.
Explanation:
Classical seq2seq models must compress the entire input sequence into a single fixed-length context vector, causing a severe loss of information for long input sequences.
40. Due to the sequential nature of classical RNN-based seq2seq models, which of the following computational bottlenecks occurs during training?
limitations of classical seq2seq models
Medium
A.The impossibility of computing gradients for the decoder network.
B.The inability to parallelize operations across time steps.
C.The excessive memory consumption caused by large attention matrices.
D.The need to invert large vocabulary matrices at every step.
Correct Answer: The inability to parallelize operations across time steps.
Explanation:
RNNs must process tokens sequentially because computation at step $t$ depends on step $t-1$. This means computations cannot be parallelized across the sequence length, leading to slower training compared to non-sequential models.
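The dependency between consecutive steps is visible in even the smallest recurrence. The sketch below uses a scalar Elman-style update with hypothetical weights `w_in` and `w_rec`; real RNN cells use vectors and matrices, but the loop structure is the same.

```python
import math

def rnn_forward(inputs, w_in, w_rec, h0=0.0):
    """Scalar recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    states = []
    h = h0
    for x in inputs:
        # h on the right-hand side is the state from the previous iteration,
        # so iteration t cannot start before iteration t-1 has finished.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, 0.0], w_in=1.0, w_rec=1.0)
print([round(s, 4) for s in states])
```

Because each pass through the loop reads the value written by the previous pass, the time dimension has to be processed in order, whatever hardware is available.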
41. In a classical encoder-decoder architecture without attention, the entire input sequence is compressed into a fixed-length context vector. From an information-theoretic and optimization perspective, which of the following best describes the primary consequence of this architectural constraint on long sequences?
limitations of classical seq2seq models
Hard
A.The fixed-length vector strictly limits the vocabulary size the decoder can generate, causing an increase in out-of-vocabulary (OOV) errors for sequences longer than the context vector dimensionality.
B.The lack of attention reduces the time complexity of the decoding phase, but exponentially increases the space complexity required for the hidden state.
C.The context vector causes the decoder to overfit on the beginning of the sequence, completely ignoring the latter half of the input tokens during the generation phase.
D.The model suffers from the information bottleneck problem and aggravated vanishing gradients during backpropagation through time (BPTT), leading to a rapid decay in the decoder's ability to recall early encoder tokens.
Correct Answer: The model suffers from the information bottleneck problem and aggravated vanishing gradients during backpropagation through time (BPTT), leading to a rapid decay in the decoder's ability to recall early encoder tokens.
Explanation:
Classical seq2seq models compress all input into a fixed-length vector, creating an information bottleneck. For long sequences, early tokens are 'forgotten' because their influence diminishes as BPTT suffers from vanishing gradients over many timesteps.
42. Consider the alignment models in Bahdanau and Luong attention mechanisms. Let $h_s$ be the encoder hidden state and $h_t$ be the decoder hidden state. Which of the following correctly identifies the fundamental mathematical difference in how the alignment score is computed?
Bahdanau and Luong attention
Hard
A.Bahdanau computes scores using $h_t$ (current decoder state) via a dot product, whereas Luong computes scores using $h_{t-1}$ via an additive feed-forward network.
B.Bahdanau uses an additive feed-forward network $v_a^\top \tanh(W_a [h_{t-1}; h_s])$, whereas Luong evaluates multiplicative functions such as $h_t^\top W_a h_s$.
C.Bahdanau calculates the context vector using hard attention sampling, whereas Luong uses deterministic soft attention based on the cosine similarity between $h_t$ and $h_s$.
D.Bahdanau requires computing a self-attention matrix over $h_s$ before comparing with $h_t$, whereas Luong directly computes $h_t^\top h_s$ without trainable weights.
Correct Answer: Bahdanau uses an additive feed-forward network $v_a^\top \tanh(W_a [h_{t-1}; h_s])$, whereas Luong evaluates multiplicative functions such as $h_t^\top W_a h_s$.
Explanation:
Bahdanau (additive) attention uses a one-hidden-layer feed-forward network applied to the previous decoder state $h_{t-1}$ and encoder state $h_s$. Luong (multiplicative) attention generally uses the current decoder state $h_t$ and employs dot-product, general ($h_t^\top W_a h_s$), or concatenation formulations.
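The additive and multiplicative scoring functions can be placed side by side in a short sketch. All weights below are hypothetical placeholders (identity matrices and a vector of ones) rather than trained parameters, and the additive form is written as two separate projections that are summed, which is equivalent to one linear layer applied to the concatenated states.

```python
import math

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def additive_score(dec_state, enc_state, W_dec, W_enc, v):
    """Bahdanau-style: v^T tanh(W_dec * dec_state + W_enc * enc_state)."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W_dec, dec_state), matvec(W_enc, enc_state))]
    return dot(v, hidden)

def dot_score(dec_state, enc_state):
    """Luong 'dot': dec_state^T enc_state."""
    return dot(dec_state, enc_state)

def general_score(dec_state, enc_state, W):
    """Luong 'general': dec_state^T W enc_state."""
    return dot(dec_state, matvec(W, enc_state))

# Placeholder parameters for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
ones = [1.0, 1.0]
dec, enc = [1.0, 0.0], [0.0, 1.0]
print(additive_score(dec, enc, identity, identity, ones))
print(dot_score(dec, enc), general_score(dec, enc, identity))
```

With an identity weight matrix the general score collapses to the plain dot product, which shows that the learned matrix is the only thing separating the two multiplicative variants.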
43. In the computation of the BLEU score, the Brevity Penalty (BP) is used to penalize short translations. Suppose a candidate translation has length $c = 10$, and there are three reference translations, of which the two closest in length to $c$ have lengths $9$ and $11$. If the effective reference length $r$ is chosen as the closest reference length to $c$ (with ties broken by selecting the shorter length), what is the value of the BP?
BLEU and ROUGE scores
Hard
A.
B.
C.
D.
Correct Answer: $BP = 1$
Explanation:
The closest reference lengths to $c = 10$ are $9$ and $11$ (both are 1 unit away). The tie-breaking rule selects the shorter length, so $r = 9$. Because $c > r$ (10 > 9), the Brevity Penalty formula yields $BP = 1$.
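The tie-breaking rule and the penalty itself are short enough to check in code. This is a sketch under the usual definition, BP = 1 when the candidate is longer than the effective reference and exp(1 - r/c) otherwise. The function names are hypothetical, and the third reference length (14) is an arbitrary illustrative value, chosen only so that it is farther from the candidate length than the other two.

```python
import math

def effective_reference_length(candidate_len, reference_lens):
    """Closest reference length to the candidate; ties go to the shorter one."""
    return min(reference_lens, key=lambda r: (abs(r - candidate_len), r))

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

c = 10
r = effective_reference_length(c, [9, 11, 14])  # 14 is an illustrative value
print(r, brevity_penalty(c, r))

# A shorter candidate against the same reference is penalized.
print(brevity_penalty(8, 9))
```

Sorting on the pair (distance, length) implements "closest, then shorter" in a single `min` call.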
44. A candidate summary contains 0 matches for 4-grams against the reference summary, but has non-zero matches for 1-gram, 2-gram, and 3-gram. When calculating the standard un-smoothed BLEU-4 score (using a uniform weight distribution $w_n = \frac{1}{4}$), what will be the resulting BLEU-4 score?
BLEU and ROUGE scores
Hard
A.The BLEU-4 score will be exactly 0, because the geometric mean calculation involves multiplying the precisions, and a 0 precision for 4-grams nullifies the entire score.
B.The BLEU-4 score will be the arithmetic average of the non-zero n-gram precisions, bypassing the 4-gram score.
C.The BLEU-4 score will compute the geometric mean of only the 1-gram, 2-gram, and 3-gram precisions, scaled by a constant factor.
D.The BLEU-4 score evaluates to a highly penalized but non-zero value, as standard BLEU automatically applies add-one smoothing to zero counts.
Correct Answer: The BLEU-4 score will be exactly 0, because the geometric mean calculation involves multiplying the precisions, and a 0 precision for 4-grams nullifies the entire score.
Explanation:
Standard BLEU calculates the geometric mean of n-gram precisions: $\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$. If any $p_n = 0$, $\log p_n$ is undefined (it tends to negative infinity), resulting in a geometric mean of 0. Smoothing techniques must be explicitly applied to avoid this.
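The collapse to zero can be reproduced with a few lines. The sketch assumes uniform weights and takes the n-gram precisions as given inputs; the precision values in the example are made up for illustration, and `bleu_from_precisions` is a hypothetical helper, not a library function.

```python
import math

def bleu_from_precisions(precisions, brevity_penalty=1.0):
    """Unsmoothed BLEU: BP * exp(sum_n w_n * log p_n) with uniform weights."""
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) diverges to -inf, so the geometric mean is 0
    weight = 1.0 / len(precisions)
    return brevity_penalty * math.exp(
        sum(weight * math.log(p) for p in precisions))

# Illustrative precisions for 1-grams to 4-grams.
print(bleu_from_precisions([0.8, 0.6, 0.4, 0.0]))  # zero 4-gram precision
print(bleu_from_precisions([0.8, 0.6, 0.4, 0.2]))  # all precisions non-zero
```

The explicit zero check stands in for the limit of the exponential as one log term goes to negative infinity; without it, `math.log(0)` would raise an error instead of returning a score.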
45. In a soft attention mechanism, a temperature parameter $\tau$ can be introduced into the softmax function: $\alpha_i = \frac{\exp(e_i / \tau)}{\sum_j \exp(e_j / \tau)}$. What is the effect on the expected context vector $c$ as $\tau \to 0$?
soft attention
Hard
A.The attention distribution sharpens to a one-hot vector (ArgMax), meaning $c$ closely approximates the single encoder hidden state with the highest alignment score, simulating hard attention.
B.The attention distribution approaches a uniform distribution, making $c$ an unweighted average of all encoder hidden states.
C.The attention distribution becomes infinitely flat, causing the gradient of $c$ with respect to the encoder states to explode.
D.The softmax function becomes undefined, requiring the use of the REINFORCE algorithm to sample from the resulting probability distribution.
Correct Answer: The attention distribution sharpens to a one-hot vector (ArgMax), meaning $c$ closely approximates the single encoder hidden state with the highest alignment score, simulating hard attention.
Explanation:
As the temperature approaches 0, the softmax function acts more like an argmax function. The highest score gets a probability approaching 1, and all others approach 0, effectively mirroring a deterministic 'hard' selection.
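The sharpening effect is easy to observe numerically. The sketch below applies a temperature-scaled softmax to a fixed set of made-up alignment scores; the function name `softmax_with_temperature` and the score values are assumptions for illustration.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Softmax of scores / temperature, computed in a numerically stable way."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # illustrative alignment scores
for temperature in (100.0, 1.0, 0.1, 0.01):
    weights = softmax_with_temperature(scores, temperature)
    print(temperature, [round(w, 4) for w in weights])
```

At high temperature the weights are close to uniform, and as the temperature shrinks almost all of the mass moves to the highest-scoring position, so the weighted sum approaches that single encoder state.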
46. Which of the following describes the key structural difference in how the context vector $c_t$ is utilized to update the decoder's hidden state and predict the next word between the standard Bahdanau and standard Luong (global) architectures?
integrating attention into encoder–decoder networks
Hard
A.Bahdanau processes $c_t$ through an additional Bidirectional RNN layer in the decoder, whereas Luong uses a standard Unidirectional RNN.
B.Bahdanau concatenates $c_t$ with the target input token before it passes through the decoder RNN, while Luong computes the decoder RNN state first and then concatenates it with $c_t$ to form an attentional hidden state.
C.Bahdanau uses $c_t$ exclusively to initialize the first hidden state of the decoder, while Luong recomputes $c_t$ at every time step.
D.Bahdanau sums $c_t$ with the decoder's cell state for LSTM variants, whereas Luong concatenates $c_t$ only at the final softmax layer.
Correct Answer: Bahdanau concatenates $c_t$ with the target input token before it passes through the decoder RNN, while Luong computes the decoder RNN state first and then concatenates it with $c_t$ to form an attentional hidden state.
Explanation:
In Bahdanau attention, the alignment is computed using the previous decoder state, and the resulting context vector is concatenated with the current input to feed into the decoder RNN. In Luong attention, the current decoder state is computed first, used to find the context vector, and then concatenated with it to produce the final attentional vector for prediction.
47Seq2Seq models often suffer from 'exposure bias' during inference. Which of the following best defines this problem and identifies a common technique used to mitigate it?
sequence-to-sequence models for machine translation and summarization
Hard
A.The model is trained to predict the next token given the ground-truth previous token (teacher forcing), but at inference must rely on its own possibly erroneous predictions; mitigated by Scheduled Sampling.
B.The decoder is exposed to too much context from the encoder, causing vanishing gradients; mitigated by Truncated Backpropagation Through Time (TBPTT).
C.The attention weights become overly focused on a single token, limiting translation diversity; mitigated by applying a Coverage Penalty.
D.The model is exposed to out-of-vocabulary words during inference; mitigated by using Pointer-Generator Networks.
Correct Answer: The model is trained to predict the next token given the ground-truth previous token (teacher forcing), but at inference must rely on its own possibly erroneous predictions; mitigated by Scheduled Sampling.
Explanation:
Exposure bias occurs because the model never sees its own mistakes during training (due to teacher forcing), leading to compounding errors during auto-regressive inference. Scheduled sampling slowly replaces ground truth tokens with the model's own predictions during training to bridge this gap.
48Assume an input sequence of length $n$, an output sequence of length $m$, and hidden states of dimension $d$. What is the overall asymptotic time complexity of computing the standard soft attention alignments (e.g., Luong dot-product attention) across the entire decoding process?
attention in deep NLP
Hard
A.
B.
C.
D.
Correct Answer: $O(n \cdot m \cdot d)$
Explanation:
For each of the $m$ decoding steps, the attention mechanism calculates an alignment score with all $n$ encoder states. A dot product between two $d$-dimensional vectors takes $O(d)$. Therefore, the total time complexity is $O(n \cdot m \cdot d)$.
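The count can be made concrete by tallying the scalar multiplications directly. This is a rough sketch, assuming a source length `n`, a target length `m` and a hidden size `d` (names chosen here for illustration) and plain dot-product scoring; the state values are arbitrary placeholders.

```python
def dot_attention_scores(decoder_states, encoder_states):
    """Return (scores, multiply_count) for dot-product alignment scoring."""
    multiplies = 0
    scores = []
    for s in decoder_states:            # one pass per decoding step (m of them)
        row = []
        for h in encoder_states:        # one score per encoder state (n of them)
            row.append(sum(a * b for a, b in zip(s, h)))  # d multiplications
            multiplies += len(s)
        scores.append(row)
    return scores, multiplies

n, m, d = 5, 3, 4  # hypothetical sizes
encoder_states = [[0.1 * (i + j) for j in range(d)] for i in range(n)]
decoder_states = [[0.2 * (i - j) for j in range(d)] for i in range(m)]
scores, multiplies = dot_attention_scores(decoder_states, encoder_states)
print(len(scores), len(scores[0]), multiplies)  # 3 5 60
```

The multiplication count equals `n * m * d`, matching the three nested levels of work: decoding steps, encoder states, and vector dimensions.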
49Luong introduced a 'local attention' mechanism to reduce the computational cost of global attention. In the 'predictive' alignment local attention model (local-p), how is the aligned position determined?
Bahdanau and Luong attention
Hard
A.It is assumed to be strictly monotonic, meaning $p_t = t$ for all decoding steps.
B.It is predicted by passing the current decoder state $h_t$ through a dense layer with a sigmoid activation, scaled by the source sentence length $S$: $p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t))$.
C.It is computed by finding the moving average of the previous attention distributions.
D.It is selected via a hard ArgMax over the global alignment scores, making the model non-differentiable.
Correct Answer: It is predicted by passing the current decoder state $h_t$ through a dense layer with a sigmoid activation, scaled by the source sentence length $S$: $p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t))$.
Explanation:
In Luong's local-p attention, the model predicts an aligned position $p_t$ for the current target word. It uses a predictive function to map the decoder state $h_t$ to a position in the source sequence, $p_t \in [0, S]$, around which a window is placed to compute local attention.
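The position prediction and the window around it can be sketched as follows. This is an illustrative sketch only: the weights `W_p` and `v_p` are hypothetical placeholders rather than trained parameters, and the Gaussian weighting with standard deviation `D / 2` follows the description in Luong et al. as best understood here.

```python
import math

def predict_position(h_t, W_p, v_p, source_len):
    """p_t = S * sigmoid(v_p . tanh(W_p h_t)); the result lies in (0, S)."""
    hidden = [math.tanh(sum(w * h for w, h in zip(row, h_t))) for row in W_p]
    logit = sum(v * x for v, x in zip(v_p, hidden))
    return source_len / (1.0 + math.exp(-logit))

def gaussian_window(p_t, source_len, D):
    """Gaussian factors for integer positions inside [p_t - D, p_t + D]."""
    sigma = D / 2.0
    lo = max(0, math.ceil(p_t - D))
    hi = min(source_len - 1, math.floor(p_t + D))
    return {s: math.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))
            for s in range(lo, hi + 1)}

h_t = [0.3, -0.2]                   # hypothetical decoder state
W_p = [[0.5, 0.1], [-0.4, 0.8]]     # hypothetical projection weights
v_p = [1.0, -1.0]
p_t = predict_position(h_t, W_p, v_p, source_len=10)
print(p_t, gaussian_window(p_t, source_len=10, D=2))
```

In the full model these Gaussian factors would multiply the ordinary alignment weights of the positions inside the window, so attention stays concentrated near the predicted position.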
50When initializing a unidirectional decoder's hidden state from a bidirectional LSTM (BiLSTM) encoder, a dimension mismatch occurs if both use hidden dimension $d$. Which of the following is the standard rigorous mathematical approach to initialize the decoder's initial state $s_0$?
encoder–decoder architectures for NLP
Hard
A.Concatenate the final forward state $\overrightarrow{h}_T$ and the final backward state $\overleftarrow{h}_1$, and pass the concatenated $2d$-dimensional vector through a linear projection layer parameterized by $W \in \mathbb{R}^{d \times 2d}$.
B.Directly assign the final forward state to the decoder: $s_0 = \overrightarrow{h}_T$, completely ignoring the backward state to preserve causality.
C.Take the element-wise average of all forward and backward hidden states from the encoder: $s_0 = \frac{1}{2T} \sum_{t=1}^{T} (\overrightarrow{h}_t + \overleftarrow{h}_t)$.
D.Use the context vector generated by a zero-initialized attention query to map the encoder states into the $d$-dimensional decoder state.
Correct Answer: Concatenate the final forward state $\overrightarrow{h}_T$ and the final backward state $\overleftarrow{h}_1$, and pass the concatenated $2d$-dimensional vector through a linear projection layer parameterized by $W \in \mathbb{R}^{d \times 2d}$.
Explanation:
A BiLSTM encoder produces a final forward state at step $T$ and a final backward state at step 1. These are usually concatenated to form a vector of size $2d$. To initialize a unidirectional decoder of size $d$, this vector is multiplied by a learnable weight matrix $W \in \mathbb{R}^{d \times 2d}$ to project it down to dimension $d$.
51Consider a candidate translation: 'the the the the'. There are two reference translations: Ref 1: 'the cat is on the mat' and Ref 2: 'there is a cat on the mat'. Using the modified n-gram precision for BLEU, what is the modified unigram precision for this candidate?
evaluation techniques
Hard
A.
B.
C.
D.
Correct Answer: $\frac{2}{4} = 0.5$
Explanation:
Modified precision clips the count of each candidate word by its maximum frequency in any single reference sentence. 'the' appears twice in Ref 1 and once in Ref 2. The max reference count is 2. The candidate has 4 'the's, so the clipped count is $\min(4, 2) = 2$. The modified unigram precision is $\frac{2}{4} = 0.5$.
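The clipping step can be reproduced in a few lines. This is a minimal sketch, assuming whitespace tokenization and unigrams only; the function name is illustrative and not taken from any BLEU library.

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clip each candidate word count by its maximum count in any one reference."""
    cand_counts = Counter(candidate.split())
    ref_counts = [Counter(ref.split()) for ref in references]
    clipped = 0
    for word, count in cand_counts.items():
        max_ref = max(rc[word] for rc in ref_counts)  # best single-reference count
        clipped += min(count, max_ref)
    return clipped / sum(cand_counts.values())

candidate = "the the the the"
references = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_unigram_precision(candidate, references))  # 0.5
```

Without clipping, every 'the' in the candidate would count as a match and the precision would be a misleading 1.0.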
52In beam search decoding for seq2seq models, sequences of different lengths must be compared. Since log-probabilities are negative, longer sequences naturally have lower scores. To counteract this, length normalization is applied. Which of the following formulations is the standard Google Neural Machine Translation (GNMT) length penalty?
sequence-to-sequence models for machine translation and summarization
Hard
A.
B.
C.
D.
Correct Answer: $lp(Y) = \frac{(5 + |Y|)^\alpha}{(5 + 1)^\alpha}$
Explanation:
The GNMT length penalty is defined as $lp(Y) = \frac{(5 + |Y|)^\alpha}{(5 + 1)^\alpha}$, where $|Y|$ is the current sequence length and $\alpha$ is a tunable parameter (usually between 0.6 and 0.7). This provides a smoothed normalization that doesn't penalize extremely short or long sentences too harshly compared to raw division by length.
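A small numeric check shows how the penalty rescales hypothesis scores. This is a sketch of the length-normalization term only (GNMT's separate coverage penalty is left out), and the function names and log-probabilities are made up for illustration.

```python
def gnmt_length_penalty(length, alpha=0.6):
    """lp(Y) = ((5 + |Y|) / (5 + 1)) ** alpha."""
    return ((5 + length) / 6) ** alpha

def normalized_score(log_prob, length, alpha=0.6):
    """Beam score after dividing the summed log-probability by lp(Y)."""
    return log_prob / gnmt_length_penalty(length, alpha)

# Hypothetical hypotheses: (summed log-probability, length in tokens)
for log_prob, length in [(-4.0, 4), (-6.0, 10)]:
    print(length, round(normalized_score(log_prob, length), 3))
```

With `alpha = 0` the penalty is 1 for every length, which recovers the unnormalized score; larger values of `alpha` favor longer hypotheses more strongly.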
53A classical seq2seq model generates abstractive summaries but consistently struggles with Out-of-Vocabulary (OOV) entities (e.g., rare names) present in the source text. Which architectural extension mathematically defines a generation probability $p_{\text{gen}}$ to dynamically choose between sampling from the vocabulary distribution and the attention distribution?
limitations of classical seq2seq models
Hard
A.Transformer with relative position encodings
B.Local-m Attention Networks
C.Byte-Pair Encoding (BPE) subword tokenizers
D.Pointer-Generator Networks
Correct Answer: Pointer-Generator Networks
Explanation:
Pointer-Generator Networks calculate a generation probability $p_{\text{gen}} \in [0, 1]$ at each step. This scalar acts as a soft switch between generating a word from the fixed vocabulary and 'pointing' to (copying) a word from the source sequence using the attention distribution, which lets the model emit OOV words that appear in the source.
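The mixing of the two distributions can be sketched with plain dictionaries. This is an illustrative sketch, assuming the final probability of a word is the generation probability times its vocabulary probability plus the remaining mass times the total attention on source positions holding that word; the tokens and numbers are invented.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix the vocabulary distribution with the copy (attention) distribution."""
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy mass is added per source position, so repeated tokens accumulate
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

vocab_dist = {"the": 0.5, "striker": 0.3, "scored": 0.2}  # fixed vocabulary
source_tokens = ["Zorblat", "scored", "twice"]            # "Zorblat" is OOV
attention = [0.7, 0.2, 0.1]
dist = final_distribution(0.4, vocab_dist, attention, source_tokens)
print(dist)
```

The out-of-vocabulary token receives probability only through the copy term, which is how the model can still emit it.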
54When applying sequence-to-sequence models to abstractive summarization, models frequently repeat the same phrases. To address this, a 'coverage penalty' is often added to the loss function. If $a_i^t$ is the attention weight for source token $i$ at decoder step $t$, and the coverage vector is $c^t = \sum_{t'=0}^{t-1} a^{t'}$, which of the following is the standard formulation of the coverage loss added at step $t$?
sequence-to-sequence models for machine translation and summarization
Hard
A.
B.
C.
D.
Correct Answer: $\text{covloss}_t = \sum_i \min(a_i^t, c_i^t)$
Explanation:
The standard coverage loss, introduced by See et al., penalizes overlap between the current attention distribution $a^t$ and the historical coverage $c^t$. By penalizing $\min(a_i^t, c_i^t)$, it discourages the model from attending again to tokens that already have a high accumulated coverage.
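The effect of the penalty is easy to see on toy attention distributions. This is a minimal sketch, assuming the coverage vector starts at zero and accumulates the attention of all earlier steps; the attention values are invented for illustration.

```python
def coverage_losses(attention_steps):
    """Per-step coverage loss: sum_i min(a_i^t, c_i^t), with c^0 = 0."""
    coverage = [0.0] * len(attention_steps[0])
    losses = []
    for a_t in attention_steps:
        losses.append(sum(min(a, c) for a, c in zip(a_t, coverage)))
        # Accumulate the current attention into the coverage vector
        coverage = [c + a for c, a in zip(coverage, a_t)]
    return losses

repeated = [[0.9, 0.1], [0.9, 0.1]]  # attends to the same token twice
spread = [[0.9, 0.1], [0.1, 0.9]]    # moves on to the other token
print(coverage_losses(repeated), coverage_losses(spread))
```

Attending to the same source token twice incurs a much larger loss at the second step than shifting attention to a token that has not been covered yet.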
55While BLEU focuses on precision, ROUGE scores evaluate recall. ROUGE-L utilizes the Longest Common Subsequence (LCS). What is a known limitation of standard ROUGE-L compared to ROUGE-W (Weighted LCS)?
BLEU and ROUGE scores
Hard
A.ROUGE-L calculates precision instead of recall, making it redundant when evaluated alongside BLEU-4.
B.ROUGE-L strictly requires n-grams to be contiguous, thereby failing to capture sequence-level similarity if a single word is inserted.
C.ROUGE-L assigns the same score to a candidate that matches a reference with spatial gaps as it does to a candidate with consecutive matches of the same length.
D.ROUGE-L cannot scale beyond sentence-level evaluation, causing it to crash on multi-document summarization tasks.
Correct Answer: ROUGE-L assigns the same score to a candidate that matches a reference with spatial gaps as it does to a candidate with consecutive matches of the same length.
Explanation:
ROUGE-L uses the length of the LCS. If a candidate matches 4 words in a row, or 4 words spread far apart, the LCS length is still 4. ROUGE-W improves upon this by applying a weighting scheme that explicitly rewards consecutive spatial matches.
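The limitation can be demonstrated with a standard dynamic-programming LCS. This sketch covers only the LCS length that ROUGE-L builds on, not the full F-measure, and the token sequences are invented for illustration.

```python
def lcs_length(candidate, reference):
    """Length of the longest common subsequence of two token lists."""
    rows, cols = len(candidate) + 1, len(reference) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if candidate[i - 1] == reference[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

reference = "a b c d e f g".split()
consecutive = "a b c d x y z".split()  # four matches in a row
scattered = "a x b y c z d".split()    # four matches separated by gaps
print(lcs_length(consecutive, reference), lcs_length(scattered, reference))  # 4 4
```

Both candidates have the same length and the same LCS length, so ROUGE-L cannot tell them apart, whereas a weighted LCS would favor the consecutive match.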
56In Luong's 'input-feeding' approach to attention integration, how is the attentional vector $\tilde{h}_t$ structurally passed to subsequent time steps to ensure the network maintains a history of past alignment decisions?
integrating attention into encoder–decoder networks
Hard
A.$\tilde{h}_t$ is multiplied by a learned decay matrix and fed as the initial state for the final softmax classifier at step $t+1$.
B.$\tilde{h}_t$ is added to the cell state $c_t$ before being passed to the next LSTM step $t+1$.
C.$\tilde{h}_t$ replaces the encoder's original context vector and is passed purely through the residual connections of the network.
D.$\tilde{h}_t$ is concatenated with the next target word input at step $t+1$ and fed into the decoder RNN.
Correct Answer: $\tilde{h}_t$ is concatenated with the next target word input at step $t+1$ and fed into the decoder RNN.
Explanation:
In Luong's input-feeding approach, the attentional hidden state $\tilde{h}_t$ computed at time $t$ is concatenated with the input at time $t+1$ (the embedding of the previous predicted word) before being fed into the RNN. This allows the model to 'remember' past alignment decisions.
57Contrasting soft attention and hard attention, soft attention computes a deterministic weighted average of encoder states, making it differentiable. Hard attention samples a single state. From a mathematical optimization perspective, how must a hard attention mechanism be trained?
soft attention
Hard
A.Using standard Backpropagation Through Time (BPTT) with the reparameterization trick on the categorical distribution.
B.Using reinforcement learning techniques, such as the REINFORCE algorithm, to maximize an expected reward since sampling from a categorical distribution is non-differentiable.
C.Using standard Backpropagation by approximating the argmax operation with a straight-through estimator exclusively.
D.Using a purely unsupervised Expectation-Maximization (EM) algorithm to maximize the lower bound of the attention marginal likelihood.
Correct Answer: Using reinforcement learning techniques, such as the REINFORCE algorithm, to maximize an expected reward since sampling from a categorical distribution is non-differentiable.
Explanation:
Hard attention relies on stochastically sampling a specific location, which breaks the differentiability required for standard backpropagation. It is typically trained using policy gradient methods like REINFORCE, treating the selection as an action and the loss as a negative reward.
58The METEOR metric was designed to fix some of the flaws in the BLEU score. Which of the following describes a specific computational phase in METEOR that structurally handles lexical variations ignored by BLEU?
evaluation techniques
Hard
A.METEOR applies a TF-IDF weighting scheme over the BLEU unigram matches, de-emphasizing highly frequent function words like 'the'.
B.METEOR maps candidate words to reference words using exact match, stem match (via Porter stemmer), and synonym match (via WordNet), maximizing the alignment score.
D.METEOR calculates a Brevity Penalty based on the harmonic mean of lengths, effectively penalizing synonyms that have more characters.
Correct Answer: METEOR maps candidate words to reference words using exact match, stem match (via Porter stemmer), and synonym match (via WordNet), maximizing the alignment score.
Explanation:
METEOR explicitly accounts for linguistic variation by establishing alignments between candidate and reference words in stages: exact word matching, stem matching (using a stemmer), and synonym matching (using lexical databases like WordNet). BLEU only uses exact n-gram matching.
59In the Bahdanau attention model, the attention alignment function heavily relies on the previous decoder hidden state $s_{t-1}$. If one were to replace the unidirectional RNN in the decoder with a bidirectional RNN (BiRNN) for sequence generation, why would the Bahdanau mechanism theoretically break or become illogical in an auto-regressive context?
Bahdanau and Luong attention
Hard
A.The previous decoder state $s_{t-1}$ would mathematically cancel out the backward hidden state, resulting in a constant context vector $c_t$.
B.The alignment function cannot accept non-linear concatenations produced by BiRNNs without blowing up the dimensionality of the alignment parameters.
D.A BiRNN requires knowledge of future generated tokens to compute the backward pass, violating the auto-regressive property of generating sequences one token at a time.
Correct Answer: A BiRNN requires knowledge of future generated tokens to compute the backward pass, violating the auto-regressive property of generating sequences one token at a time.
Explanation:
Auto-regressive sequence generation inherently prevents the use of bidirectional RNNs in the decoder because computing the backward RNN states requires access to future tokens, which have not yet been generated. Thus, the premise of using $s_{t-1}$ in a bidirectional decoder is impossible.
60Self-attention (intra-attention) differs mathematically from classical seq2seq attention. In a standard seq2seq attention model translating a sentence of length $n$ to length $m$, what is the size of the attention weight matrix at a single decoding time step $t$, and what is the size of the complete attention weight matrix for self-attention over the source sentence?
attention in deep NLP
Hard
A.Seq2seq step : ; Self-attention:
B.Seq2seq step : ; Self-attention:
C.Seq2seq step : ; Self-attention:
D.Seq2seq step : ; Self-attention:
Correct Answer: Seq2seq step $t$: $1 \times n$; Self-attention: $n \times n$
Explanation:
In classical seq2seq attention, at a single decoder step $t$, the decoder state attends to all encoder states, producing an attention vector of size $1 \times n$. Self-attention compares every token in the sequence of length $n$ to every other token, producing a weight matrix of size $n \times n$.