Unit 4 - Practice Quiz

CSE472 60 Questions

1 What are the two primary components of an encoder-decoder model in NLP?

encoder–decoder architectures for NLP Easy
A. A Transformer and an Optimizer
B. A Convolutional layer and a Pooling layer
C. An Encoder and a Decoder
D. A Generator and a Discriminator

2 In a basic encoder-decoder architecture, what does the encoder produce to pass information to the decoder?

encoder–decoder architectures for NLP Easy
A. A fixed-size context vector
B. A one-hot encoded matrix
C. A sparse dependency tree
D. A continuous stream of tokens

3 Which of the following neural network types was classically used to build encoders and decoders for sequential text data?

encoder–decoder architectures for NLP Easy
A. Radial Basis Function Networks (RBFNs)
B. Generative Adversarial Networks (GANs)
C. Recurrent Neural Networks (RNNs)
D. Convolutional Neural Networks (CNNs)

4 What is the primary function of a sequence-to-sequence (seq2seq) model in machine translation?

sequence-to-sequence models for machine translation and summarization Easy
A. To classify the language of the input sentence
B. To predict the sentiment of the input sentence
C. To cluster similar words together
D. To map a sequence of words in one language to a sequence in another language

5 When a sequence-to-sequence model is used for text summarization, how does the output sequence typically compare to the input sequence?

sequence-to-sequence models for machine translation and summarization Easy
A. It is a translated version of the input sequence
B. It is a much longer and more detailed sequence
C. It is a shorter sequence that captures the main points of the input
D. It is an exact copy of the input sequence

6 What primary problem with basic seq2seq models does the attention mechanism solve?

attention in deep NLP Easy
A. The information bottleneck of using a single fixed-size context vector
B. The vanishing gradient problem in CNNs
C. The lack of word embeddings
D. The inability to process numeric data

7 What does the attention mechanism allow the decoder to do during generation?

attention in deep NLP Easy
A. Translate words without any training data
B. Generate multiple tokens at the exact same time
C. Ignore the encoder completely
D. Focus on different, relevant parts of the input sequence for each output step

8 Which of the following best describes 'soft attention'?

soft attention Easy
A. It applies a hard threshold to remove low-frequency words.
B. It calculates a weighted average of all input hidden states.
C. It selects only one single input word to focus on deterministically.
D. It randomly drops attention weights to prevent overfitting.

9 In soft attention, what must the attention weights for a given decoding step sum up to?

soft attention Easy
A. The size of the hidden layer
B. The length of the input sequence
C. $0$
D. $1$

10 Bahdanau attention is also commonly referred to by which of the following names?

Bahdanau and Luong attention Easy
A. Multiplicative attention
B. Additive attention
C. Dot-product attention
D. Self-attention

11 How does Luong attention (multiplicative attention) typically calculate the alignment score?

Bahdanau and Luong attention Easy
A. By adding the encoder and decoder states together
B. By using a Convolutional Neural Network
C. By computing the Euclidean distance between words
D. By taking the dot product between the decoder hidden state and encoder hidden states

12 Which attention mechanism fundamentally introduced the concept of aligning and translating jointly using a feed-forward network to score alignments?

Bahdanau and Luong attention Easy
A. Scaled Dot-Product attention
B. Bahdanau attention
C. Luong attention
D. Multi-Head attention

13 When integrating attention into an encoder-decoder network, what function is typically applied to the alignment scores to obtain the final attention weights?

integrating attention into encoder–decoder networks Easy
A. ReLU
B. Tanh
C. Sigmoid
D. Softmax

14 In an attention-equipped seq2seq model, what is the 'context vector' composed of?

integrating attention into encoder–decoder networks Easy
A. A randomly initialized vector
B. The sum of all word embeddings in the target language
C. A weighted sum of the encoder's hidden states
D. The very last hidden state of the decoder
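
The pipeline that questions 13 and 14 refer to (alignment scores, then softmax, then a weighted sum) can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the dot-product scoring, the function name `soft_attention`, and the toy vectors are not taken from the quiz.

```python
import numpy as np

def soft_attention(decoder_state, encoder_states):
    """Dot-product scores -> softmax weights -> weighted sum of encoder states."""
    scores = encoder_states @ decoder_state           # one score per source position
    scores = scores - scores.max()                    # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax: non-negative, sums to 1
    context = weights @ encoder_states                # weighted sum of encoder states
    return weights, context

# Hypothetical toy values: 3 source positions, hidden size 2.
encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder_state = np.array([0.5, 2.0])

weights, context = soft_attention(decoder_state, encoder_states)
print(weights.sum())   # the weights form a probability distribution
print(context)         # context vector has the same size as one encoder state
```

The context vector is recomputed at every decoding step, because the decoder state (and therefore the weights) changes from step to step.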

15 Why is evaluating sequence-to-sequence generation tasks (like translation) fundamentally harder than evaluating classification tasks?

evaluation techniques Easy
A. There is usually only one correct answer in text generation.
B. Accuracy metrics can only be applied to numerical data.
C. Classification models don't use loss functions.
D. Text generation can have multiple valid and correct outputs for the same input.

16 Which evaluation metric is most heavily used for Machine Translation and relies on modified n-gram precision?

BLEU and ROUGE scores Easy
A. BLEU
B. Perplexity
C. ROUGE
D. F1-Score

17 Which metric is commonly used for evaluating Text Summarization and relies heavily on n-gram recall?

BLEU and ROUGE scores Easy
A. Accuracy
B. Word Error Rate (WER)
C. BLEU
D. ROUGE

18 What is the purpose of the 'Brevity Penalty' (BP) in the BLEU score calculation?

BLEU and ROUGE scores Easy
A. To penalize candidate translations that are too short compared to the reference.
B. To penalize models that take too long to translate.
C. To penalize the use of rare words.
D. To penalize candidate translations that are excessively long.
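
As a rough illustration of the brevity penalty, the commonly cited definition is BP = 1 when the candidate is longer than the reference and exp(1 - r/c) otherwise, where c is the candidate length and r the effective reference length. The function name and the example lengths below are assumptions made for this sketch.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BLEU brevity penalty: 1 if the candidate is long enough, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0                       # an empty candidate gets no credit
    if candidate_len > reference_len:
        return 1.0                       # long candidates are not penalized here
    return math.exp(1.0 - reference_len / candidate_len)

# Hypothetical lengths: a candidate half as long as the reference is penalized.
print(brevity_penalty(12, 10))
print(brevity_penalty(5, 10))
```

Overly long candidates are already punished by the n-gram precision terms, which is why the penalty only acts on short ones.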

19 What is a major limitation of standard seq2seq models that do NOT use attention?

limitations of classical seq2seq models Easy
A. They cannot handle any sequence data.
B. They require paired training data.
C. Performance drops significantly when processing long input sequences.
D. They can only generate one word at a time.

20 Why do classical RNN-based seq2seq models suffer from slow training speeds compared to modern architectures like Transformers?

limitations of classical seq2seq models Easy
A. They must process tokens sequentially, which prevents parallelization.
B. They require massive amounts of memory for attention weights.
C. They can only be trained on CPUs.
D. They use too many convolutional filters.

21 In a basic RNN-based encoder-decoder architecture without attention, how does the encoder transfer information to the decoder?

encoder–decoder architectures for NLP Medium
A. By sharing the same weight matrices for both encoding and decoding.
B. By passing its final hidden state as the initial context vector to the decoder.
C. By using a continuous feedback loop between the encoder and decoder.
D. By passing all its hidden states simultaneously to the decoder.

22 When generating an output sequence in an encoder-decoder model during inference, what is typically used as the input to the decoder at time step $t$?

encoder–decoder architectures for NLP Medium
A. The entire original input sequence.
B. The actual ground-truth token from time step $t-1$.
C. The predicted token from time step $t-1$.
D. A fixed context vector re-encoded at every step.

23 Why is Teacher Forcing commonly used during the training of sequence-to-sequence models for machine translation?

sequence-to-sequence models for machine translation and summarization Medium
A. It allows the model to learn without requiring any target output data.
B. It stabilizes and speeds up training by feeding the true previous target token as input to the decoder.
C. It prevents the model from relying on the encoder context.
D. It automatically corrects the weights of the encoder using external dictionaries.

24 In the context of seq2seq models for text summarization, what issue often arises when using a standard model without a pointer-generator network?

sequence-to-sequence models for machine translation and summarization Medium
A. The model fails to process sequences longer than 10 tokens.
B. The model struggles to accurately reproduce out-of-vocabulary (OOV) words like proper nouns.
C. The model perfectly memorizes the source text but cannot generate new words.
D. The model generates summaries that are longer than the original text.

25 Which search strategy is generally preferred during inference in seq2seq models to balance computational efficiency and translation quality?

sequence-to-sequence models for machine translation and summarization Medium
A. Greedy Search
B. Beam Search
C. Exhaustive Search
D. Random Sampling
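
To make the trade-off between greedy and beam search concrete, here is a minimal sketch over a toy next-token table. `TOY_MODEL` and its probabilities are invented for illustration; a real decoder would condition on the full prefix and the encoder output.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the previous token.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.5, "a": 0.25, "b": 0.25},
    "b": {"</s>": 0.9, "a": 0.05, "b": 0.05},
}

def beam_search(model, beam_width, max_len=5):
    """Keep the beam_width best partial hypotheses by total log-probability."""
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            # Hypotheses that emitted the end token leave the active beam.
            (finished if tokens[-1] == "</s>" else beams).append((score, tokens))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[0])

greedy = beam_search(TOY_MODEL, beam_width=1)   # beam width 1 is greedy search
wider = beam_search(TOY_MODEL, beam_width=2)
print(greedy[1], math.exp(greedy[0]))
print(wider[1], math.exp(wider[0]))
```

In this toy table, greedy search commits to the locally best first token and ends with probability 0.30, while a beam of width 2 recovers the sequence with probability 0.36.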

26 What primary problem does the introduction of attention mechanisms solve in deep NLP?

attention in deep NLP Medium
A. The need for Teacher Forcing during the training phase.
B. The information bottleneck caused by compressing long input sequences into a fixed-length context vector.
C. The inability of RNNs to process discrete token inputs.
D. The vanishing gradient problem in the decoder network.

27 In the context of attention, the alignment score is calculated between which two components?

attention in deep NLP Medium
A. The current decoder hidden state and the encoder hidden states.
B. The current decoder hidden state and all other decoder hidden states.
C. The input word embeddings and the output word embeddings.
D. The current encoder hidden state and the previous encoder hidden state.

28 Which mathematical operation is applied to the raw alignment scores to produce soft attention weights?

soft attention Medium
A. Softmax function
B. Argmax operation
C. Sigmoid function
D. ReLU activation

29 How does soft attention differ from hard attention computationally?

soft attention Medium
A. Soft attention cannot be used with sequence-to-sequence models.
B. Soft attention has higher variance in gradients than hard attention.
C. Soft attention selects exactly one input word, whereas hard attention averages them.
D. Soft attention is fully differentiable, while hard attention requires reinforcement learning techniques to train.

30 In Bahdanau (Additive) attention, how is the alignment score typically computed?

Bahdanau and Luong attention Medium
A. By using a feed-forward neural network with a single hidden layer.
B. By taking the dot product of the encoder and decoder hidden states.
C. By passing the states through a multi-head self-attention block.
D. By computing the cosine similarity between encoder and decoder states.

31 Which of the following best describes the scoring function used in Luong's general (multiplicative) attention?

Bahdanau and Luong attention Medium
A.
B.
C.
D.

32 A key difference between Bahdanau and Luong attention mechanisms lies in when the context vector is used. How does Luong's global attention utilize the context vector?

Bahdanau and Luong attention Medium
A. It concatenates the context vector with the decoder's current hidden state to compute the final attentional hidden state.
B. It uses the context vector to predict the previous hidden state.
C. It feeds the context vector exclusively to the encoder to update embeddings.
D. It uses the context vector only to compute the next encoder state.

33 After computing the attention weights in an encoder-decoder network, how is the context vector generated?

integrating attention into encoder–decoder networks Medium
A. By taking the dot product of the attention weights and the decoder hidden state.
B. By calculating the unweighted average of the encoder hidden states.
C. By computing a weighted sum of the encoder hidden states using the attention weights.
D. By applying a max-pooling operation over the encoder hidden states.

34 When integrating attention into an RNN-based seq2seq model, what is the impact on the model's computational complexity per decoding step with respect to the input sequence length ?

integrating attention into encoder–decoder networks Medium
A. The complexity becomes .
B. The complexity becomes independent of .
C. The complexity becomes .
D. The complexity becomes .

35 Which of the following is a significant drawback of $n$-gram based evaluation metrics like BLEU and ROUGE?

evaluation techniques Medium
A. They cannot evaluate models that use attention mechanisms.
B. They penalize models for outputting sequences of different lengths than the reference.
C. They require computationally expensive neural network forward passes to evaluate.
D. They fail to account for semantic similarity and synonyms if exact word matches do not occur.

36 In the BLEU metric, what is the purpose of the Brevity Penalty (BP)?

BLEU and ROUGE scores Medium
A. To penalize candidate translations that use too many low-frequency words.
B. To penalize candidate translations that are too long compared to the reference.
C. To penalize candidate translations that are shorter than the reference translation.
D. To penalize candidate translations that contain grammatically incorrect $n$-grams.

37 While BLEU focuses primarily on precision, ROUGE is typically designed to emphasize which metric for tasks like summarization?

BLEU and ROUGE scores Medium
A. Accuracy
B. Recall
C. F1-Score
D. Specificity

38 What does ROUGE-L specifically measure when evaluating a generated text sequence?

BLEU and ROUGE scores Medium
A. The Longest Common Subsequence (LCS) between the candidate and reference texts.
B. The semantic distance using word embeddings of length $L$.
C. The average precision of all $n$-grams up to length $L$.
D. The overlap of unigrams and bigrams combined.
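
For reference, a longest-common-subsequence computation of the kind ROUGE-L builds on can be sketched with standard dynamic programming. The helper names and the example sentences are assumptions for this sketch; real ROUGE implementations add tokenization, stemming options, and an F-measure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """Return (recall, precision) based on the LCS length."""
    lcs = lcs_length(candidate, reference)
    return lcs / len(reference), lcs / len(candidate)

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(lcs_length(candidate, reference))   # 5: "the cat on the mat"
print(rouge_l(candidate, reference))
```

Because a subsequence does not have to be contiguous, the single substituted word costs only one match rather than breaking every n-gram that spans it.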

39 Which of the following is a primary limitation of classical seq2seq models without attention?

limitations of classical seq2seq models Medium
A. They require hand-crafted features for syntactic parsing.
B. They cannot be trained using standard backpropagation through time (BPTT).
C. They suffer from an information bottleneck when encoding long sequences.
D. They cannot generate text in different languages.

40 Due to the sequential nature of classical RNN-based seq2seq models, which of the following computational bottlenecks occurs during training?

limitations of classical seq2seq models Medium
A. The impossibility of computing gradients for the decoder network.
B. The inability to parallelize operations across time steps.
C. The excessive memory consumption caused by large attention matrices.
D. The need to invert large vocabulary matrices at every step.

41 In a classical encoder-decoder architecture without attention, the entire input sequence is compressed into a fixed-length context vector. From an information-theoretic and optimization perspective, which of the following best describes the primary consequence of this architectural constraint on long sequences?

limitations of classical seq2seq models Hard
A. The fixed-length vector strictly limits the vocabulary size the decoder can generate, causing an increase in out-of-vocabulary (OOV) errors for sequences longer than the context vector dimensionality.
B. The lack of attention reduces the time complexity of the decoding phase from to , but exponentially increases the space complexity required for the hidden state.
C. The context vector causes the decoder to overfit on the beginning of the sequence, completely ignoring the latter half of the input tokens during the generation phase.
D. The model suffers from the information bottleneck problem and aggravated vanishing gradients during backpropagation through time (BPTT), leading to a rapid decay in the decoder's ability to recall early encoder tokens.

42 Consider the alignment models in Bahdanau and Luong attention mechanisms. Let $h_s$ be the encoder hidden state and $h_t$ be the decoder hidden state. Which of the following correctly identifies the fundamental mathematical difference in how the alignment score is computed?

Bahdanau and Luong attention Hard
A. Bahdanau computes scores using $h_t$ (current decoder state) via a dot product, whereas Luong computes scores using $h_{t-1}$ via an additive feed-forward network.
B. Bahdanau uses an additive feed-forward network $v_a^\top \tanh(W_1 h_{t-1} + W_2 h_s)$, whereas Luong evaluates multiplicative functions such as $h_t^\top W_a h_s$.
C. Bahdanau calculates the context vector using hard attention sampling, whereas Luong uses deterministic soft attention based on the cosine similarity between $h_s$ and $h_t$.
D. Bahdanau requires computing a self-attention matrix over $h_s$ before comparing with $h_t$, whereas Luong directly computes $h_t^\top h_s$ without trainable weights.
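
The two families of scoring functions can be compared side by side in NumPy. This is a sketch with randomly initialized, untrained parameters; the names `W1`, `W2`, `v`, and `W_a` are illustrative assumptions, and in Bahdanau's formulation the decoder state fed to the scorer is the previous one.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # hypothetical hidden size
h_s = rng.normal(size=d)                # one encoder hidden state
h_t = rng.normal(size=d)                # one decoder hidden state

# Additive (Bahdanau-style): a one-hidden-layer feed-forward scorer.
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)
additive_score = v @ np.tanh(W1 @ h_t + W2 @ h_s)

# Multiplicative (Luong-style): plain dot product, or "general" with a weight matrix.
W_a = rng.normal(size=(d, d))
dot_score = h_t @ h_s
general_score = h_t @ W_a @ h_s

print(additive_score, dot_score, general_score)
```

All three produce a single scalar per (decoder state, encoder state) pair; the dot-product form has no trainable parameters, which is why it is the cheapest of the three.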

43 In the computation of the BLEU score, the Brevity Penalty (BP) is used to penalize short translations. Suppose a candidate translation has length , and there are three reference translations with lengths , , and . If the effective reference length is chosen as the closest reference length to (with ties broken by selecting the shorter length), what is the value of the BP?

BLEU and ROUGE scores Hard
A.
B.
C.
D.

44 A candidate summary contains 0 matches for 4-grams against the reference summary, but has non-zero matches for 1-gram, 2-gram, and 3-gram. When calculating the standard un-smoothed BLEU-4 score (using a uniform weight distribution $w_n = 1/4$), what will be the resulting BLEU-4 score?

BLEU and ROUGE scores Hard
A. The BLEU-4 score will be exactly 0, because the geometric mean calculation involves multiplying the precisions, and a 0 precision for 4-grams nullifies the entire score.
B. The BLEU-4 score will be the arithmetic average of the non-zero n-gram precisions, bypassing the 4-gram score.
C. The BLEU-4 score will compute the geometric mean of only the 1-gram, 2-gram, and 3-gram precisions, scaled by .
D. The BLEU-4 score evaluates to a highly penalized but non-zero value, as standard BLEU automatically applies add-one smoothing to zero counts.
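
The effect of a zero precision on an unsmoothed geometric mean can be checked numerically. The sketch below assumes uniform weights and takes the n-gram precisions as given; the function name and the precision values are made up for illustration.

```python
import math

def bleu_from_precisions(precisions, brevity_penalty=1.0):
    """Unsmoothed BLEU: BP * exp(sum_n w_n * log p_n) with uniform weights w_n."""
    if min(precisions) == 0:
        return 0.0                      # log(0) is -inf, so the product collapses to 0
    weight = 1.0 / len(precisions)
    log_mean = sum(weight * math.log(p) for p in precisions)
    return brevity_penalty * math.exp(log_mean)

# Hypothetical precisions for 1-grams to 4-grams.
no_4gram_match = bleu_from_precisions([0.8, 0.5, 0.25, 0.0])
all_match = bleu_from_precisions([0.8, 0.5, 0.25, 0.125])
print(no_4gram_match)
print(all_match)
```

This sensitivity is the reason sentence-level BLEU is usually computed with some smoothing scheme, while corpus-level BLEU aggregates counts before taking the mean.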

45 In a soft attention mechanism, a temperature parameter can be introduced into the softmax function: . What is the effect on the expected context vector as ?

soft attention Hard
A. The attention distribution sharpens to a one-hot vector (ArgMax), meaning closely approximates the single encoder hidden state with the highest alignment score, simulating hard attention.
B. The attention distribution approaches a uniform distribution, making an unweighted average of all encoder hidden states.
C. The attention distribution becomes infinitely flat, causing the gradient of with respect to the encoder states to explode.
D. The softmax function becomes undefined, requiring the use of the REINFORCE algorithm to sample from the resulting probability distribution.
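
Both limiting behaviours of a temperature-scaled softmax are easy to observe numerically. The sketch below divides the scores by a temperature `tau` before the softmax; the function name and the score values are assumptions made for illustration.

```python
import numpy as np

def softmax_with_temperature(scores, tau):
    """Softmax over scores / tau, shifted by the max for numerical stability."""
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

scores = [2.0, 1.0, 0.5]                          # hypothetical alignment scores
sharp = softmax_with_temperature(scores, 0.01)    # very small temperature
flat = softmax_with_temperature(scores, 100.0)    # very large temperature
print(sharp)   # nearly all mass on the highest-scoring position
print(flat)    # close to a uniform distribution
```

With a small temperature the context vector is dominated by one encoder state; with a large temperature it approaches a plain average of all encoder states.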

46 Which of the following describes the key structural difference in how the context vector is utilized to update the decoder's hidden state and predict the next word between the standard Bahdanau and standard Luong (global) architectures?

integrating attention into encoder–decoder networks Hard
A. Bahdanau processes $c_t$ through an additional Bidirectional RNN layer in the decoder, whereas Luong uses a standard Unidirectional RNN.
B. Bahdanau concatenates $c_t$ with the target input token before it passes through the decoder RNN, while Luong computes the decoder RNN state first and then concatenates it with $c_t$ to form an attentional hidden state.
C. Bahdanau uses $c_t$ exclusively to initialize the first hidden state of the decoder, while Luong recomputes $c_t$ at every time step.
D. Bahdanau sums $c_t$ with the decoder's cell state for LSTM variants, whereas Luong concatenates $c_t$ only at the final softmax layer.

47 Seq2Seq models often suffer from 'exposure bias' during inference. Which of the following best defines this problem and identifies a common technique used to mitigate it?

sequence-to-sequence models for machine translation and summarization Hard
A. The model is trained to predict the next token given the ground-truth previous token (teacher forcing), but at inference must rely on its own possibly erroneous predictions; mitigated by Scheduled Sampling.
B. The decoder is exposed to too much context from the encoder, causing vanishing gradients; mitigated by Truncated Backpropagation Through Time (TBPTT).
C. The attention weights become overly focused on a single token, limiting translation diversity; mitigated by applying a Coverage Penalty.
D. The model is exposed to out-of-vocabulary words during inference; mitigated by using Pointer-Generator Networks.

48 Assume an input sequence of length , an output sequence of length , and hidden states of dimension . What is the overall asymptotic time complexity of computing the standard soft attention alignments (e.g., Luong dot-product attention) across the entire decoding process?

attention in deep NLP Hard
A.
B.
C.
D.

49 Luong introduced a 'local attention' mechanism to reduce the computational cost of global attention. In the 'predictive' alignment local attention model (local-p), how is the aligned position $p_t$ determined?

Bahdanau and Luong attention Hard
A. It is assumed to be strictly monotonic, meaning $p_t = t$ for all decoding steps.
B. It is predicted by passing the current decoder state through a dense layer with a sigmoid activation, scaled by the source sentence length $S$: $p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t))$.
C. It is computed by finding the moving average of the previous attention distributions .
D. It is selected via a hard ArgMax over the global alignment scores, making the model non-differentiable.

50 When initializing a unidirectional decoder's hidden state from a bidirectional LSTM (BiLSTM) encoder, a dimension mismatch occurs if both use hidden dimension $d$. Which of the following is the standard rigorous mathematical approach to initialize the decoder's initial state $s_0$?

encoder–decoder architectures for NLP Hard
A. Concatenate the final forward state $\overrightarrow{h}_T$ and the final backward state $\overleftarrow{h}_1$, and pass the concatenated $2d$-dimensional vector through a linear projection layer parameterized by $W \in \mathbb{R}^{d \times 2d}$.
B. Directly assign the final forward state to the decoder: $s_0 = \overrightarrow{h}_T$, completely ignoring the backward state to preserve causality.
C. Take the element-wise average of all forward and backward hidden states from the encoder: $s_0 = \frac{1}{2T} \sum_{t=1}^{T} (\overrightarrow{h}_t + \overleftarrow{h}_t)$.
D. Use the context vector generated by a zero-initialized attention query to map the encoder states into the $d$-dimensional decoder state.

51 Consider a candidate translation: 'the the the the'. There are two reference translations: Ref 1: 'the cat is on the mat' and Ref 2: 'there is a cat on the mat'. Using the modified n-gram precision for BLEU, what is the modified unigram precision for this candidate?

evaluation techniques Hard
A.
B.
C.
D.
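
The clipping step behind modified n-gram precision can be worked through on the exact sentences from question 51. The sketch assumes whitespace tokenization; the function name `modified_precision` is illustrative and returns the clipped count and the total count separately.

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clip each candidate n-gram count by its maximum count in any single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], count)

    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total

candidate = "the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(modified_precision(candidate, references))   # (2, 4)
```

The word "the" appears at most twice in any single reference, so only two of the four candidate occurrences are credited. Note that "there" is a different token and does not count as a match.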

52 In beam search decoding for seq2seq models, sequences of different lengths must be compared. Since log-probabilities are negative, longer sequences naturally have lower scores. To counteract this, length normalization is applied. Which of the following formulations is the standard Google Neural Machine Translation (GNMT) length penalty?

sequence-to-sequence models for machine translation and summarization Hard
A.
B.
C.
D.
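
For orientation, the length penalty described in the GNMT paper is usually written as lp(Y) = (5 + |Y|)^α / (5 + 1)^α, with the hypothesis log-probability divided by it. The sketch below assumes that form; the function names and the default α = 0.6 are choices made for illustration, and the paper's additional coverage term is omitted.

```python
def gnmt_length_penalty(length, alpha=0.6):
    """lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha."""
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def normalized_score(log_prob, length, alpha=0.6):
    """Length-normalized beam score: log P(Y|X) / lp(Y)."""
    return log_prob / gnmt_length_penalty(length, alpha)

# Hypothetical hypotheses: the longer one has a lower raw log-probability.
short = normalized_score(-4.0, length=4)
long_ = normalized_score(-6.0, length=12)
print(short, long_)
```

Setting `alpha=0` disables normalization entirely, and larger values favour longer hypotheses more strongly.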

53 A classical seq2seq model generates abstractive summaries but consistently struggles with Out-of-Vocabulary (OOV) entities (e.g., rare names) present in the source text. Which architectural extension mathematically defines a generation probability to dynamically choose between sampling from the vocabulary distribution and the attention distribution?

limitations of classical seq2seq models Hard
A. Transformer with relative position encodings
B. Local-m Attention Networks
C. Byte-Pair Encoding (BPE) subword tokenizers
D. Pointer-Generator Networks
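
The mixing of a vocabulary distribution with an attention (copy) distribution through a generation probability can be sketched with plain dictionaries. All names and numbers below are hypothetical; in a trained model `p_gen`, the vocabulary distribution, and the attention weights would all be produced by the network at each step.

```python
from collections import defaultdict

def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on source copies of w."""
    final = defaultdict(float)
    for word, prob in vocab_dist.items():
        final[word] += p_gen * prob
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight   # copy mass, also for OOV tokens
    return dict(final)

# Hypothetical step: "Zuzana" and "Brno" are not in the fixed vocabulary.
vocab_dist = {"the": 0.5, "visited": 0.3, "<unk>": 0.2}
source_tokens = ["Zuzana", "visited", "Brno"]
attention = [0.7, 0.2, 0.1]

dist = final_distribution(0.4, vocab_dist, attention, source_tokens)
print(dist)
```

Words that occur in the source can receive probability even when the fixed vocabulary has no entry for them, which is what lets the model reproduce rare names.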

54 When applying sequence-to-sequence models to abstractive summarization, models frequently repeat the same phrases. To address this, a 'coverage penalty' is often added to the loss function. If $a_i^t$ is the attention weight for source token $i$ at decoder step $t$, and the coverage vector is $c^t = \sum_{t'=0}^{t-1} a^{t'}$, which of the following is the standard formulation of the coverage loss added at step $t$?

sequence-to-sequence models for machine translation and summarization Hard
A.
B.
C.
D.
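
A coverage loss of the kind used with pointer-generator summarizers is commonly written as the sum over source positions of the minimum of the current attention weight and the accumulated coverage. The sketch below assumes that formulation; the function name and the attention matrix are invented for illustration.

```python
import numpy as np

def coverage_losses(attention):
    """attention has shape (T, n): one attention distribution per decoder step."""
    coverage = np.zeros(attention.shape[1])   # sum of attention from earlier steps
    losses = []
    for a_t in attention:
        losses.append(float(np.minimum(a_t, coverage).sum()))
        coverage = coverage + a_t             # update after computing the step loss
    return losses

# Hypothetical attention: step 1 re-attends to the same tokens as step 0.
attention = np.array([
    [0.9, 0.1, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9],
])
print(coverage_losses(attention))
```

The repeated distribution at step 1 incurs the maximum penalty, while step 2, which moves to a previously unattended token, is penalized only for the small overlap.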

55 While BLEU focuses on precision, ROUGE scores evaluate recall. ROUGE-L utilizes the Longest Common Subsequence (LCS). What is a known limitation of standard ROUGE-L compared to ROUGE-W (Weighted LCS)?

BLEU and ROUGE scores Hard
A. ROUGE-L calculates precision instead of recall, making it redundant when evaluated alongside BLEU-4.
B. ROUGE-L strictly requires n-grams to be contiguous, thereby failing to capture sequence-level similarity if a single word is inserted.
C. ROUGE-L assigns the same score to a candidate that matches a reference with spatial gaps as it does to a candidate with consecutive matches of the same length.
D. ROUGE-L cannot scale beyond sentence-level evaluation, causing it to crash on multi-document summarization tasks.

56 In Luong's 'input-feeding' approach to attention integration, how is the attentional vector $\tilde{h}_t$ structurally passed to subsequent time steps to ensure the network maintains a history of past alignment decisions?

integrating attention into encoder–decoder networks Hard
A. $\tilde{h}_t$ is multiplied by a learned decay matrix and fed as the initial state for the final softmax classifier at step $t+1$.
B. $\tilde{h}_t$ is added to the cell state before being passed to the next LSTM step $t+1$.
C. $\tilde{h}_t$ replaces the encoder's original context vector and is passed purely through the residual connections of the network.
D. $\tilde{h}_t$ is concatenated with the next target word input at step $t+1$ and fed into the decoder RNN.

57 Contrasting soft attention and hard attention, soft attention computes a deterministic weighted average of encoder states, making it differentiable. Hard attention samples a single state. From a mathematical optimization perspective, how must a hard attention mechanism be trained?

soft attention Hard
A. Using standard Backpropagation Through Time (BPTT) with the reparameterization trick on the categorical distribution.
B. Using reinforcement learning techniques, such as the REINFORCE algorithm, to maximize an expected reward since sampling from a categorical distribution is non-differentiable.
C. Using standard Backpropagation by approximating the argmax operation with a straight-through estimator exclusively.
D. Using a purely unsupervised Expectation-Maximization (EM) algorithm to maximize the lower bound of the attention marginal likelihood.

58 The METEOR metric was designed to fix some of the flaws in the BLEU score. Which of the following describes a specific computational phase in METEOR that structurally handles lexical variations ignored by BLEU?

evaluation techniques Hard
A. METEOR applies a TF-IDF weighting scheme over the BLEU unigram matches, de-emphasizing highly frequent function words like 'the'.
B. METEOR maps candidate words to reference words using exact match, stem match (via Porter stemmer), and synonym match (via WordNet), maximizing the alignment score.
C. METEOR completely replaces n-gram matching with character-level Levenshtein distance, natively capturing morphological variants.
D. METEOR calculates a Brevity Penalty based on the harmonic mean of lengths, effectively penalizing synonyms that have more characters.

59 In the Bahdanau attention model, the attention alignment function heavily relies on the previous decoder hidden state $s_{t-1}$. If one were to replace the unidirectional RNN in the decoder with a bidirectional RNN (BiRNN) for sequence generation, why would the Bahdanau mechanism theoretically break or become illogical in an auto-regressive context?

Bahdanau and Luong attention Hard
A. The previous decoder state would mathematically cancel out the backward hidden state, resulting in a constant context vector .
B. The alignment function cannot accept non-linear concatenations produced by BiRNNs without blowing up the dimensionality of .
C. BiRNNs inherently compute hard attention, fundamentally contradicting Bahdanau's soft attention paradigm.
D. A BiRNN requires knowledge of future generated tokens to compute the backward pass, violating the auto-regressive property of generating sequences one token at a time.

60 Self-attention (intra-attention) differs mathematically from classical seq2seq attention. In a standard seq2seq attention model translating a sentence of length to length , what is the size of the attention weight matrix at a single decoding time step , and what is the size of the complete attention weight matrix for self-attention over the source sentence?

attention in deep NLP Hard
A. Seq2seq step : ; Self-attention:
B. Seq2seq step : ; Self-attention:
C. Seq2seq step : ; Self-attention:
D. Seq2seq step : ; Self-attention:
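
The shapes involved in the two kinds of attention can be inspected directly. The sketch below uses hypothetical sizes (`n` source tokens, hidden size `d`) and unscaled dot-product scores without learned projections, which is a simplification of real self-attention layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 5, 4                              # hypothetical source length and hidden size
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))              # encoder states, one row per source token
s_t = rng.normal(size=d)                 # decoder state at a single step

step_weights = softmax(H @ s_t)          # seq2seq attention at one decoding step
self_weights = softmax(H @ H.T)          # self-attention over the source sentence

print(step_weights.shape)   # (5,): one weight per source token
print(self_weights.shape)   # (5, 5): every source token attends to every source token
```

Stacking the per-step weights over all decoding steps would give a matrix with one row per target position, whereas self-attention is always square in the source length.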