1 $What are the two primary components of an encoder-decoder model in NLP?$

encoder–decoder architectures for NLP Easy

A.

An Encoder and a Decoder

B.

A Transformer and an Optimizer

C.

A Generator and a Discriminator

D.

A Convolutional layer and a Pooling layer

2 $In a basic encoder-decoder architecture, what does the encoder produce to pass information to the decoder?$

encoder–decoder architectures for NLP Easy

A.

A fixed-size context vector

B.

A sparse dependency tree

C.

A continuous stream of tokens

D.

A one-hot encoded matrix

3 $Which of the following neural network types was classically used to build encoders and decoders for sequential text data?$

encoder–decoder architectures for NLP Easy

A.

Convolutional Neural Networks (CNNs)

B.

Recurrent Neural Networks (RNNs)

C.

Generative Adversarial Networks (GANs)

D.

Radial Basis Function Networks (RBFNs)

4 $What is the primary function of a sequence-to-sequence (seq2seq) model in machine translation?$

sequence-to-sequence models for machine translation and summarization Easy

A.

To cluster similar words together

B.

To classify the language of the input sentence

C.

To predict the sentiment of the input sentence

D.

To map a sequence of words in one language to a sequence in another language

5 $When a sequence-to-sequence model is used for text summarization, how does the output sequence typically compare to the input sequence?$

sequence-to-sequence models for machine translation and summarization Easy

A.

It is a translated version of the input sequence

B.

It is a much longer and more detailed sequence

C.

It is an exact copy of the input sequence

D.

It is a shorter sequence that captures the main points of the input

6 $What primary problem with basic seq2seq models does the attention mechanism solve?$

attention in deep NLP Easy

A.

The information bottleneck of using a single fixed-size context vector

B.

The lack of word embeddings

C.

The vanishing gradient problem in CNNs

D.

The inability to process numeric data

7 $What does the attention mechanism allow the decoder to do during generation?$

attention in deep NLP Easy

A.

Generate multiple tokens at the exact same time

B.

Ignore the encoder completely

C.

Translate words without any training data

D.

Focus on different, relevant parts of the input sequence for each output step

8 $Which of the following best describes 'soft attention'?$

soft attention Easy

A.

It applies a hard threshold to remove low-frequency words.

B.

It calculates a weighted average of all input hidden states.

C.

It selects only one single input word to focus on deterministically.

D.

It randomly drops attention weights to prevent overfitting.

9 $In soft attention, what must the attention weights for a given decoding step sum up to?$

soft attention Easy

A.

$1$

B.

The size of the hidden layer

C.

$0$

D.

The length of the input sequence

10 $Bahdanau attention is also commonly referred to by which of the following names?$

Bahdanau and Luong attention Easy

A.

Dot-product attention

B.

Self-attention

C.

Additive attention

D.

Multiplicative attention

11 $How does Luong attention (multiplicative attention) typically calculate the alignment score?$

Bahdanau and Luong attention Easy

A.

By taking the dot product between the decoder hidden state and encoder hidden states

B.

By using a Convolutional Neural Network

C.

By adding the encoder and decoder states together

D.

By computing the Euclidean distance between words

12 $Which attention mechanism fundamentally introduced the concept of aligning and translating jointly using a feed-forward network to score alignments?$

Bahdanau and Luong attention Easy

A.

Multi-Head attention

B.

Scaled Dot-Product attention

C.

Luong attention

D.

Bahdanau attention

13 $When integrating attention into an encoder-decoder network, what function is typically applied to the alignment scores to obtain the final attention weights?$

integrating attention into encoder–decoder networks Easy

A.

Sigmoid

B.

ReLU

C.

Softmax

D.

Tanh

14 $In an attention-equipped seq2seq model, what is the 'context vector' composed of?$

integrating attention into encoder–decoder networks Easy

A.

A weighted sum of the encoder's hidden states

B.

The very last hidden state of the decoder

C.

A randomly initialized vector

D.

The sum of all word embeddings in the target language
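The attention step that questions 13 and 14 refer to can be sketched in a few lines. This is a minimal illustration, assuming dot-product alignment scores and plain Python lists; the names `attend`, `encoder_states`, and `decoder_state` and the toy values are hypothetical, not taken from the quiz.

```python
import math

def softmax(scores):
    """Turn raw alignment scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    Assumption: alignment scores are plain dot products.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted sum of the encoder hidden states.
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context

# Hypothetical toy values: three encoder states of dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.5]
weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

The weights are recomputed at every decoding step, so each output token gets its own context vector.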

15 $Why is evaluating sequence-to-sequence generation tasks (like translation) fundamentally harder than evaluating classification tasks?$

evaluation techniques Easy

A.

Classification models don't use loss functions.

B.

Text generation can have multiple valid and correct outputs for the same input.

C.

There is usually only one correct answer in text generation.

D.

Accuracy metrics can only be applied to numerical data.

16 $Which evaluation metric is most heavily used for Machine Translation and relies on modified n-gram precision?$

BLEU and ROUGE scores Easy

A.

Perplexity

B.

F1-Score

C.

ROUGE

D.

BLEU

17 $Which metric is commonly used for evaluating Text Summarization and relies heavily on n-gram recall?$

BLEU and ROUGE scores Easy

A.

Word Error Rate (WER)

B.

ROUGE

C.

Accuracy

D.

BLEU

18 $What is the purpose of the 'Brevity Penalty' (BP) in the BLEU score calculation?$

BLEU and ROUGE scores Easy

A.

To penalize the use of rare words.

B.

To penalize models that take too long to translate.

C.

To penalize candidate translations that are excessively long.

D.

To penalize candidate translations that are too short compared to the reference.

19 $What is a major limitation of standard seq2seq models that do NOT use attention?$

limitations of classical seq2seq models Easy

A.

They can only generate one word at a time.

B.

Performance drops significantly when processing long input sequences.

C.

They cannot handle any sequence data.

D.

They require paired training data.

20 $Why do classical RNN-based seq2seq models suffer from slow training speeds compared to modern architectures like Transformers?$

limitations of classical seq2seq models Easy

A.

They use too many convolutional filters.

B.

They can only be trained on CPUs.

C.

They must process tokens sequentially, which prevents parallelization.

D.

They require massive amounts of memory for attention weights.

21 $In a basic RNN-based encoder-decoder architecture without attention, how does the encoder transfer information to the decoder?$

encoder–decoder architectures for NLP Medium

A.

By sharing the same weight matrices for both encoding and decoding.

B.

By using a continuous feedback loop between the encoder and decoder.

C.

By passing all its hidden states simultaneously to the decoder.

D.

By passing its final hidden state as the initial context vector to the decoder.

22 $When generating an output sequence in an encoder-decoder model during inference, what is typically used as the input to the decoder at time step t?$

encoder–decoder architectures for NLP Medium

A.

The entire original input sequence.

B.

The actual ground-truth token from time step $t-1$.

C.

The predicted token from time step $t-1$.

D.

A fixed context vector re-encoded at every step.

23 $Why is Teacher Forcing commonly used during the training of sequence-to-sequence models for machine translation?$

sequence-to-sequence models for machine translation and summarization Medium

A.

It automatically corrects the weights of the encoder using external dictionaries.

B.

It prevents the model from relying on the encoder context.

C.

It allows the model to learn without requiring any target output data.

D.

It stabilizes and speeds up training by feeding the true previous target token as input to the decoder.

24 $In the context of seq2seq models for text summarization, what issue often arises when using a standard model without a pointer-generator network?$

sequence-to-sequence models for machine translation and summarization Medium

A.

The model generates summaries that are longer than the original text.

B.

The model struggles to accurately reproduce out-of-vocabulary (OOV) words like proper nouns.

C.

The model fails to process sequences longer than 10 tokens.

D.

The model perfectly memorizes the source text but cannot generate new words.

25 $Which search strategy is generally preferred during inference in seq2seq models to balance computational efficiency and translation quality?$

sequence-to-sequence models for machine translation and summarization Medium

A.

Greedy Search

B.

Beam Search

C.

Exhaustive Search

D.

Random Sampling

26 $What primary problem does the introduction of attention mechanisms solve in deep NLP?$

attention in deep NLP Medium

A.

The inability of RNNs to process discrete token inputs.

B.

The information bottleneck caused by compressing long input sequences into a fixed-length context vector.

C.

The need for Teacher Forcing during the training phase.

D.

The vanishing gradient problem in the decoder network.

27 $In the context of attention, the alignment score is calculated between which two components?$

attention in deep NLP Medium

A.

The current decoder hidden state and the encoder hidden states.

B.

The current decoder hidden state and all other decoder hidden states.

C.

The current encoder hidden state and the previous encoder hidden state.

D.

The input word embeddings and the output word embeddings.

28 $Which mathematical operation is applied to the raw alignment scores to produce soft attention weights?$

soft attention Medium

A.

ReLU activation

B.

Softmax function

C.

Argmax operation

D.

Sigmoid function

29 $How does soft attention differ from hard attention computationally?$

soft attention Medium

A.

Soft attention has higher variance in gradients than hard attention.

B.

Soft attention is fully differentiable, while hard attention requires reinforcement learning techniques to train.

C.

Soft attention cannot be used with sequence-to-sequence models.

D.

Soft attention selects exactly one input word, whereas hard attention averages them.

30 $In Bahdanau (Additive) attention, how is the alignment score typically computed?$

Bahdanau and Luong attention Medium

A.

By computing the cosine similarity between encoder and decoder states.

B.

By taking the dot product of the encoder and decoder hidden states.

C.

By passing the states through a multi-head self-attention block.

D.

By using a feed-forward neural network with a single hidden layer.

31 $Which of the following best describes the scoring function used in Luong's general (multiplicative) attention?$

Bahdanau and Luong attention Medium

A.

B.

C.

D.

32 $A key difference between Bahdanau and Luong attention mechanisms lies in when the context vector is used. How does Luong's global attention utilize the context vector?$

Bahdanau and Luong attention Medium

A.

It concatenates the context vector with the decoder's current hidden state to compute the final attentional hidden state.

B.

It feeds the context vector exclusively to the encoder to update embeddings.

C.

It uses the context vector to predict the previous hidden state.

D.

It uses the context vector only to compute the next encoder state.

33 $After computing the attention weights in an encoder-decoder network, how is the context vector generated?$

integrating attention into encoder–decoder networks Medium

A.

By calculating the unweighted average of the encoder hidden states.

B.

By computing a weighted sum of the encoder hidden states using the attention weights.

C.

By taking the dot product of the attention weights and the decoder hidden state.

D.

By applying a max-pooling operation over the encoder hidden states.

34 $When integrating attention into an RNN-based seq2seq model, what is the impact on the model's computational complexity per decoding step with respect to the input sequence length?$

integrating attention into encoder–decoder networks Medium

A.

The complexity becomes independent of the input sequence length.

B.

The complexity becomes .

C.

The complexity becomes .

D.

The complexity becomes .

35 $Which of the following is a significant drawback of n-gram based evaluation metrics like BLEU and ROUGE?$

evaluation techniques Medium

A.

They cannot evaluate models that use attention mechanisms.

B.

They require computationally expensive neural network forward passes to evaluate.

C.

They penalize models for outputting sequences of different lengths than the reference.

D.

They fail to account for semantic similarity and synonyms if exact word matches do not occur.

36 $In the BLEU metric, what is the purpose of the Brevity Penalty (BP)?$

BLEU and ROUGE scores Medium

A.

To penalize candidate translations that are too long compared to the reference.

B.

To penalize candidate translations that are shorter than the reference translation.

C.

To penalize candidate translations that use too many low-frequency words.

D.

To penalize candidate translations that contain grammatically incorrect n-grams.

37 $While BLEU focuses primarily on precision, ROUGE is typically designed to emphasize which metric for tasks like summarization?$

BLEU and ROUGE scores Medium

A.

Specificity

B.

F1-Score

C.

Recall

D.

Accuracy

38 $What does ROUGE-L specifically measure when evaluating a generated text sequence?$

BLEU and ROUGE scores Medium

A.

The overlap of unigrams and bigrams combined.

B.

The average precision of all n-grams up to a fixed maximum length.

C.

The Longest Common Subsequence (LCS) between the candidate and reference texts.

D.

The semantic distance using fixed-length word embeddings.
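For reference, a longest common subsequence (LCS) can be computed with a standard dynamic program. The sketch below assumes whitespace tokenization and reports LCS-based recall and precision separately; the function names and the example sentences are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def lcs_recall_precision(candidate, reference):
    """LCS-based recall and precision (assumes whitespace tokenization)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(cand) if cand else 0.0
    return recall, precision

# Hypothetical example pair.
recall, precision = lcs_recall_precision(
    "the cat sat on the mat", "the cat is on the mat"
)
print(recall, precision)
```

A subsequence does not have to be contiguous, which is why a single substituted word lowers the LCS length by one instead of splitting the match.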

39 $Which of the following is a primary limitation of classical seq2seq models without attention?$

limitations of classical seq2seq models Medium

A.

They require hand-crafted features for syntactic parsing.

B.

They cannot generate text in different languages.

C.

They cannot be trained using standard backpropagation through time (BPTT).

D.

They suffer from an information bottleneck when encoding long sequences.

40 $Due to the sequential nature of classical RNN-based seq2seq models, which of the following computational bottlenecks occurs during training?$

limitations of classical seq2seq models Medium

A.

The inability to parallelize operations across time steps.

B.

The need to invert large vocabulary matrices at every step.

C.

The excessive memory consumption caused by large attention matrices.

D.

The impossibility of computing gradients for the decoder network.

41 $In a classical encoder-decoder architecture without attention, the entire input sequence is compressed into a fixed-length context vector. From an information-theoretic and optimization perspective, which of the following best describes the primary consequence of this architectural constraint on long sequences?$

limitations of classical seq2seq models Hard

A.

The model suffers from the information bottleneck problem and aggravated vanishing gradients during backpropagation through time (BPTT), leading to a rapid decay in the decoder's ability to recall early encoder tokens.

B.

The lack of attention reduces the time complexity of the decoding phase, but exponentially increases the space complexity required for the hidden state.

C.

The fixed-length vector strictly limits the vocabulary size the decoder can generate, causing an increase in out-of-vocabulary (OOV) errors for sequences longer than the context vector dimensionality.

D.

The context vector causes the decoder to overfit on the beginning of the sequence, completely ignoring the latter half of the input tokens during the generation phase.

42 $Consider the alignment models in Bahdanau and Luong attention mechanisms. Let h_s be the encoder hidden state and h_t be the decoder hidden state. Which of the following correctly identifies the fundamental mathematical difference in how the alignment score is computed?$

Bahdanau and Luong attention Hard

A.

Bahdanau uses an additive feed-forward network, whereas Luong evaluates multiplicative functions such as $h_t^\top W_a h_s$.

B.

Bahdanau calculates the context vector using hard attention sampling, whereas Luong uses deterministic soft attention based on the cosine similarity between $h_s$ and $h_t$.

C.

Bahdanau computes scores using the current decoder state via a dot product, whereas Luong computes scores using the previous decoder state via an additive feed-forward network.

D.

Bahdanau requires computing a self-attention matrix over the encoder states before comparing with the decoder state, whereas Luong directly computes the score without trainable weights.
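The common scoring functions can be written side by side for comparison. This is a minimal sketch with plain Python lists; the function names, the weight matrices `W`, `W_dec`, `W_enc`, the vector `v`, and the toy values are all hypothetical placeholders for learned parameters.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]

def dot_score(dec, enc):
    """Multiplicative "dot" score: dec^T enc (no trainable weights)."""
    return dot(dec, enc)

def general_score(dec, enc, W):
    """Multiplicative "general" score: dec^T W enc."""
    return dot(dec, matvec(W, enc))

def additive_score(dec, enc, W_dec, W_enc, v):
    """Additive score: v^T tanh(W_dec dec + W_enc enc).

    In the Bahdanau formulation, dec is the previous decoder state.
    """
    hidden = [
        math.tanh(a + b)
        for a, b in zip(matvec(W_dec, dec), matvec(W_enc, enc))
    ]
    return dot(v, hidden)

# Toy values; identity matrices stand in for learned weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
dec, enc = [1.0, 2.0], [0.5, -1.0]
print(dot_score(dec, enc))
print(general_score(dec, enc, identity))
print(additive_score(dec, enc, identity, identity, [1.0, 1.0]))
```

The additive form passes both states through a small feed-forward layer before reducing to a scalar, while the multiplicative forms reduce directly with a (possibly weighted) inner product.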

43 $In the computation of the BLEU score, the Brevity Penalty (BP) is used to penalize short translations. Suppose a candidate translation has length, and there are three reference translations with lengths,, and . If the effective reference length is chosen as the closest reference length to (with ties broken by selecting the shorter length), what is the value of the BP?$

BLEU and ROUGE scores Hard

A.

B.

C.

D.
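The brevity penalty rule can be sketched as follows, assuming the definition from the original BLEU formulation (1 when the candidate is longer than the effective reference length, otherwise exp(1 - r/c)) and the tie-breaking rule stated in the question. The lengths used below are hypothetical and are not the values from the question.

```python
import math

def effective_reference_length(candidate_len, reference_lens):
    """Pick the reference length closest to the candidate length.

    Ties are broken by choosing the shorter reference.
    """
    return min(reference_lens, key=lambda r: (abs(r - candidate_len), r))

def brevity_penalty(candidate_len, reference_lens):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    if candidate_len == 0:
        return 0.0  # an empty candidate gets no credit
    r = effective_reference_length(candidate_len, reference_lens)
    if candidate_len > r:
        return 1.0
    return math.exp(1 - r / candidate_len)

# Hypothetical lengths for illustration.
print(brevity_penalty(9, [8, 11, 14]))   # candidate longer than closest reference
print(brevity_penalty(10, [12, 15]))     # candidate shorter than closest reference
```

The penalty never exceeds 1, so it can only lower the score of a candidate that is shorter than its effective reference.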

44 $A candidate summary contains 0 matches for 4-grams against the reference summary, but has non-zero matches for 1-gram, 2-gram, and 3-gram. When calculating the standard un-smoothed BLEU-4 score (using a uniform weight distribution), what will be the resulting BLEU-4 score?$

BLEU and ROUGE scores Hard

A.

The BLEU-4 score will be exactly 0, because the geometric mean calculation involves multiplying the precisions, and a 0 precision for 4-grams nullifies the entire score.

B.

The BLEU-4 score will compute the geometric mean of only the 1-gram, 2-gram, and 3-gram precisions, scaled by .

C.

The BLEU-4 score will be the arithmetic average of the non-zero n-gram precisions, bypassing the 4-gram score.

D.

The BLEU-4 score evaluates to a highly penalized but non-zero value, as standard BLEU automatically applies add-one smoothing to zero counts.
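How un-smoothed BLEU combines the n-gram precisions can be sketched as a weighted geometric mean computed in log space. The function name and the precision values below are hypothetical; no smoothing is applied, which is an explicit assumption matching the question.

```python
import math

def bleu_from_precisions(precisions, brevity_penalty=1.0):
    """Un-smoothed BLEU: BP * exp(sum_n w_n * log p_n) with uniform weights."""
    if any(p == 0 for p in precisions):
        # log(0) is -infinity, so the geometric mean collapses to 0.
        return 0.0
    weight = 1.0 / len(precisions)
    log_mean = sum(weight * math.log(p) for p in precisions)
    return brevity_penalty * math.exp(log_mean)

# Hypothetical precisions for 1-grams through 4-grams.
print(bleu_from_precisions([0.8, 0.5, 0.25, 0.0]))
print(bleu_from_precisions([0.8, 0.5, 0.25, 0.1]))
```

Smoothed variants exist precisely to avoid this collapse on short segments, but they are a separate design choice and not part of the standard corpus-level definition.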

45 $In a soft attention mechanism, a temperature parameter τ can be introduced into the softmax function: α_i = exp(e_i / τ) / Σ_j exp(e_j / τ). What is the effect on the expected context vector c as τ → 0?$

soft attention Hard

A.

The softmax function becomes undefined, requiring the use of the REINFORCE algorithm to sample from the resulting probability distribution.

B.

The attention distribution becomes infinitely flat, causing the gradient of $c$ with respect to the encoder states to explode.

C.

The attention distribution approaches a uniform distribution, making $c$ an unweighted average of all encoder hidden states.

D.

The attention distribution sharpens to a one-hot vector (ArgMax), meaning $c$ closely approximates the single encoder hidden state with the highest alignment score, simulating hard attention.
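The effect of a softmax temperature can be checked numerically. The sketch below assumes the temperature divides the alignment scores before normalization; the function name, the scores, and the temperature values are hypothetical.

```python
import math

def softmax_with_temperature(scores, tau):
    """Softmax over scores / tau, with max-subtraction for stability."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical alignment scores for three source positions.
scores = [2.0, 1.0, 0.5]
for tau in (10.0, 1.0, 0.1, 0.01):
    weights = softmax_with_temperature(scores, tau)
    print(tau, [round(w, 3) for w in weights])
```

Running the loop shows the two limiting behaviours: large temperatures flatten the weights toward equal values, and small temperatures concentrate nearly all the weight on the highest-scoring position.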

46 $Which of the following describes the key structural difference in how the context vector is utilized to update the decoder's hidden state and predict the next word between the standard Bahdanau and standard Luong (global) architectures?$

integrating attention into encoder–decoder networks Hard

A.

Bahdanau concatenates the context vector with the target input token before it passes through the decoder RNN, while Luong computes the decoder RNN state first and then concatenates it with the context vector to form an attentional hidden state.

B.

Bahdanau sums the context vector with the decoder's cell state for LSTM variants, whereas Luong concatenates it only at the final softmax layer.

C.

Bahdanau uses the context vector exclusively to initialize the first hidden state of the decoder, while Luong recomputes it at every time step.

D.

Bahdanau processes the context vector through an additional Bidirectional RNN layer in the decoder, whereas Luong uses a standard Unidirectional RNN.

47 $Seq2Seq models often suffer from 'exposure bias' during inference. Which of the following best defines this problem and identifies a common technique used to mitigate it?$

sequence-to-sequence models for machine translation and summarization Hard

A.

The attention weights become overly focused on a single token, limiting translation diversity; mitigated by applying a Coverage Penalty.

B.

The model is trained to predict the next token given the ground-truth previous token (teacher forcing), but at inference must rely on its own possibly erroneous predictions; mitigated by Scheduled Sampling.

C.

The decoder is exposed to too much context from the encoder, causing vanishing gradients; mitigated by Truncated Backpropagation Through Time (TBPTT).

D.

The model is exposed to out-of-vocabulary words during inference; mitigated by using Pointer-Generator Networks.

48 $Assume an input sequence of length, an output sequence of length, and hidden states of dimension . What is the overall asymptotic time complexity of computing the standard soft attention alignments (e.g., Luong dot-product attention) across the entire decoding process?$

attention in deep NLP Hard

A.

B.

C.

D.

49 $Luong introduced a 'local attention' mechanism to reduce the computational cost of global attention. In the 'predictive' alignment local attention model (local-p), how is the aligned position p_t determined?$

Bahdanau and Luong attention Hard

A.

It is computed by finding the moving average of the previous attention distributions.

B.

It is selected via a hard ArgMax over the global alignment scores, making the model non-differentiable.

C.

It is assumed to be strictly monotonic, meaning $p_t = t$ for all decoding steps.

D.

It is predicted by passing the current decoder state through a dense layer with a sigmoid activation, scaled by the source sentence length $S$: $p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t))$.

50 $When initializing a unidirectional decoder's hidden state from a bidirectional LSTM (BiLSTM) encoder, a dimension mismatch occurs if both use hidden dimension d. Which of the following is the standard rigorous mathematical approach to initialize the decoder's initial state?$

encoder–decoder architectures for NLP Hard

A.

Concatenate the final forward state and the final backward state, and pass the concatenated $2d$-dimensional vector through a linear projection layer parameterized by $W \in \mathbb{R}^{d \times 2d}$.

B.

Use the context vector generated by a zero-initialized attention query to map the encoder states into the $d$-dimensional decoder state.

C.

Directly assign the final forward state to the decoder, completely ignoring the backward state to preserve causality.

D.

Take the element-wise average of all forward and backward hidden states from the encoder.

51 $Consider a candidate translation: 'the the the the'. There are two reference translations: Ref 1: 'the cat is on the mat' and Ref 2: 'there is a cat on the mat'. Using the modified n-gram precision for BLEU, what is the modified unigram precision for this candidate?$

evaluation techniques Hard

A.

B.

C.

D.
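Modified (clipped) n-gram precision can be computed directly from token counts. The sketch below assumes whitespace tokenization and clips each candidate n-gram count by its maximum count in any single reference; the function name is hypothetical.

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision as used in BLEU (whitespace tokens assumed)."""
    def ngrams(tokens):
        return Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )

    cand_counts = ngrams(candidate.split())
    # For each n-gram, the largest count observed in any one reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split()).items():
            max_ref[gram] = max(max_ref[gram], count)

    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

precision = modified_precision(
    "the the the the",
    ["the cat is on the mat", "there is a cat on the mat"],
)
print(precision)
```

Clipping is what stops a degenerate candidate from earning credit for repeating a common word more often than any reference uses it.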

52 $In beam search decoding for seq2seq models, sequences of different lengths must be compared. Since log-probabilities are negative, longer sequences naturally have lower scores. To counteract this, length normalization is applied. Which of the following formulations is the standard Google Neural Machine Translation (GNMT) length penalty?$

sequence-to-sequence models for machine translation and summarization Hard

A.

B.

C.

D.
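For reference, the length penalty described in the GNMT system paper has the form lp(Y) = (5 + |Y|)^α / (5 + 1)^α, and the beam score divides the sequence log-probability by it. The sketch below is a minimal illustration; the function names and the default `alpha=0.6` are assumptions for the example, since α is a tunable hyperparameter.

```python
def gnmt_length_penalty(length, alpha=0.6):
    """lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha."""
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def normalized_score(log_prob, length, alpha=0.6):
    """Beam score with length normalization: log P(Y|X) / lp(Y)."""
    return log_prob / gnmt_length_penalty(length, alpha)

# Hypothetical hypotheses with the same total log-probability.
print(normalized_score(-6.0, 3))
print(normalized_score(-6.0, 10))
```

With `alpha=0` the penalty is 1 for every length, which recovers the un-normalized score; larger values favour longer hypotheses more strongly.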

53 $A classical seq2seq model generates abstractive summaries but consistently struggles with Out-of-Vocabulary (OOV) entities (e.g., rare names) present in the source text. Which architectural extension mathematically defines a generation probability to dynamically choose between sampling from the vocabulary distribution and the attention distribution?$

limitations of classical seq2seq models Hard

A.

Byte-Pair Encoding (BPE) subword tokenizers

B.

Local-m Attention Networks

C.

Transformer with relative position encodings

D.

Pointer-Generator Networks

54 $When applying sequence-to-sequence models to abstractive summarization, models frequently repeat the same phrases. To address this, a 'coverage penalty' is often added to the loss function. If a_i^t is the attention weight for source token i at decoder step t, and the coverage vector is c^t = Σ_{t' < t} a^{t'}, which of the following is the standard formulation of the coverage loss added at step t?$

sequence-to-sequence models for machine translation and summarization Hard

A.

B.

C.

D.
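The coverage mechanism from the pointer-generator summarization work can be sketched as follows: the coverage vector accumulates the attention distributions of all previous decoder steps, and the per-step loss sums the element-wise minimum of the current attention and the coverage so far. The function name and the attention values are hypothetical.

```python
def coverage_losses(attention_per_step):
    """Per-step coverage loss: sum_i min(a_i^t, c_i^t).

    c^t is the sum of attention distributions from steps before t.
    """
    num_source_tokens = len(attention_per_step[0])
    coverage = [0.0] * num_source_tokens
    losses = []
    for attention in attention_per_step:
        losses.append(sum(min(a, c) for a, c in zip(attention, coverage)))
        # Update coverage only after the loss for this step is computed.
        coverage = [c + a for c, a in zip(coverage, attention)]
    return losses

# Hypothetical attention over two source tokens for three decoder steps.
steps = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
print(coverage_losses(steps))
```

Attending again to an already-covered token is penalized, while shifting attention to a token that has received little attention so far costs almost nothing.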

55 $While BLEU focuses on precision, ROUGE scores evaluate recall. ROUGE-L utilizes the Longest Common Subsequence (LCS). What is a known limitation of standard ROUGE-L compared to ROUGE-W (Weighted LCS)?$

BLEU and ROUGE scores Hard

A.

ROUGE-L assigns the same score to a candidate that matches a reference with spatial gaps as it does to a candidate with consecutive matches of the same length.

B.

ROUGE-L cannot scale beyond sentence-level evaluation, causing it to crash on multi-document summarization tasks.

C.

ROUGE-L strictly requires n-grams to be contiguous, thereby failing to capture sequence-level similarity if a single word is inserted.

D.

ROUGE-L calculates precision instead of recall, making it redundant when evaluated alongside BLEU-4.

56 $In Luong's 'input-feeding' approach to attention integration, how is the attentional vector structurally passed to subsequent time steps to ensure the network maintains a history of past alignment decisions?$

integrating attention into encoder–decoder networks Hard

A.

The attentional vector is added to the cell state before being passed to the next LSTM step.

B.

The attentional vector replaces the encoder's original context vector and is passed purely through the residual connections of the network.

C.

The attentional vector is concatenated with the target word input at the next time step and fed into the decoder RNN.

D.

The attentional vector is multiplied by a learned decay matrix and fed as the initial state for the final softmax classifier at the next time step.

57 $Contrasting soft attention and hard attention, soft attention computes a deterministic weighted average of encoder states, making it differentiable. Hard attention samples a single state. From a mathematical optimization perspective, how must a hard attention mechanism be trained?$

soft attention Hard

A.

Using standard Backpropagation by approximating the argmax operation with a straight-through estimator exclusively.

B.

Using reinforcement learning techniques, such as the REINFORCE algorithm, to maximize an expected reward since sampling from a categorical distribution is non-differentiable.

C.

Using standard Backpropagation Through Time (BPTT) with the reparameterization trick on the categorical distribution.

D.

Using a purely unsupervised Expectation-Maximization (EM) algorithm to maximize the lower bound of the attention marginal likelihood.

58 $The METEOR metric was designed to fix some of the flaws in the BLEU score. Which of the following describes a specific computational phase in METEOR that structurally handles lexical variations ignored by BLEU?$

evaluation techniques Hard

A.

METEOR maps candidate words to reference words using exact match, stem match (via Porter stemmer), and synonym match (via WordNet), maximizing the alignment score.

B.

METEOR completely replaces n-gram matching with character-level Levenshtein distance, natively capturing morphological variants.

C.

METEOR applies a TF-IDF weighting scheme over the BLEU unigram matches, de-emphasizing highly frequent function words like 'the'.

D.

METEOR calculates a Brevity Penalty based on the harmonic mean of lengths, effectively penalizing synonyms that have more characters.

59 $In the Bahdanau attention model, the attention alignment function heavily relies on the previous decoder hidden state. If one were to replace the unidirectional RNN in the decoder with a bidirectional RNN (BiRNN) for sequence generation, why would the Bahdanau mechanism theoretically break or become illogical in an auto-regressive context?$

Bahdanau and Luong attention Hard

A.

The previous decoder state would mathematically cancel out the backward hidden state, resulting in a constant context vector.

B.

The alignment function cannot accept non-linear concatenations produced by BiRNNs without blowing up the dimensionality.

C.

A BiRNN requires knowledge of future generated tokens to compute the backward pass, violating the auto-regressive property of generating sequences one token at a time.

D.

BiRNNs inherently compute hard attention, fundamentally contradicting Bahdanau's soft attention paradigm.

60 $Self-attention (intra-attention) differs mathematically from classical seq2seq attention. In a standard seq2seq attention model translating a sentence of length to length, what is the size of the attention weight matrix at a single decoding time step, and what is the size of the complete attention weight matrix for self-attention over the source sentence?$

attention in deep NLP Hard

A.

Seq2seq step :; Self-attention:

B.

Seq2seq step :; Self-attention:

C.

Seq2seq step :; Self-attention:

D.

Seq2seq step :; Self-attention:

Unit 4 - Practice Quiz