1. Which of the following is the best example of sequential text data?
sequential text data
Easy
A.A single, isolated pixel value in a digital image
B.A single categorical variable like 'Color'
C.A sentence in natural language where word order matters
D.A randomized, unordered bag of words
Correct Answer: A sentence in natural language where word order matters
Explanation:
Sequential text data refers to data where the order of elements (like words in a sentence) is crucial for understanding its meaning.
2. What is the primary purpose of a Recurrent Neural Network (RNN)?
recurrent neural networks
Easy
A.To compress tabular data into lower dimensions
B.To process static images using spatial convolutional filters
C.To cluster unlabeled data points into groups
D.To process sequential data by maintaining an internal hidden state
Correct Answer: To process sequential data by maintaining an internal hidden state
Explanation:
RNNs are designed specifically to handle sequential data by using a hidden state that captures information from previous time steps.
3. In a standard RNN, the hidden state at time step $t$ depends on:
recurrent neural networks
Easy
A.All future inputs from time $t+1$ to the end of the sequence
B.Only the current input at time $t$
C.Only the previous hidden state at time $t-1$
D.The current input at time $t$ and the hidden state at time $t-1$
Correct Answer: The current input at time $t$ and the hidden state at time $t-1$
Explanation:
At any time step $t$, an RNN cell takes both the new input at $t$ and the hidden state from the previous step $t-1$ to compute the new hidden state.
4. Which major issue in standard RNNs does the Long Short-Term Memory (LSTM) network aim to solve?
long short term memory networks
Easy
A.The need for massive amounts of labeled data
B.Vanishing and exploding gradients during training
C.The inability to process multi-channel images
D.The extremely high inference speed of standard RNNs
Correct Answer: Vanishing and exploding gradients during training
Explanation:
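LSTMs were introduced to mitigate the unstable gradients of standard RNNs, above all the vanishing gradient problem, allowing networks to learn long-term dependencies in sequences. Exploding gradients can still occur in practice and are commonly handled with gradient clipping.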
5. Which of the following components helps an LSTM cell control the flow of information?
long short term memory networks
Easy
A.Convolutional filters (like kernels)
B.Gates (such as forget, input, and output gates)
C.Softmax activation functions applied to every weight
D.Max-pooling layers
Correct Answer: Gates (such as forget, input, and output gates)
Explanation:
An LSTM uses a gating mechanism (forget, input, and output gates) to regulate what information is kept, updated, or discarded from the cell state.
6. How does a Gated Recurrent Unit (GRU) primarily differ from an LSTM in terms of architecture?
gated recurrent units
Easy
A.A GRU does not use any gates and relies solely on convolutions
B.A GRU simplifies the architecture by combining the forget and input gates into a single update gate
C.A GRU uses a separate memory cell state distinct from its hidden state
D.A GRU has significantly more gates than an LSTM
Correct Answer: A GRU simplifies the architecture by combining the forget and input gates into a single update gate
Explanation:
GRUs have a simpler architecture than LSTMs because they merge the forget and input gates into a single update gate, and they do not have a separate cell state.
7. Which of the following gates is explicitly present in a GRU?
gated recurrent units
Easy
A.Forget gate
B.Output gate
C.Cell gate
D.Update gate
Correct Answer: Update gate
Explanation:
A GRU features two main gates: the update gate and the reset gate. It lacks the separate output and forget gates found in LSTMs.
8. What is the main advantage of a Bidirectional RNN over a standard unidirectional RNN?
bidirectional RNNs
Easy
A.It has access to both past and future context for a given time step
B.It trains twice as fast due to parallel processing
C.It requires exactly half the number of learnable parameters
D.It does not require backpropagation to update its weights
Correct Answer: It has access to both past and future context for a given time step
Explanation:
Bidirectional RNNs process data in both forward and backward directions, allowing the model to understand the context of a word based on both the words that precede it and the words that follow it.
9. How are the hidden states typically combined at each time step in a Bidirectional RNN?
bidirectional RNNs
Easy
A.They are subtracted from each other to find the difference
B.The forward and backward hidden states are concatenated or summed together
C.Only the forward hidden state is kept for final predictions
D.The backward state completely overwrites the forward state
Correct Answer: The forward and backward hidden states are concatenated or summed together
Explanation:
In a Bidirectional RNN, the outputs of the forward and backward passes at time step $t$ are usually concatenated (or sometimes summed) to form the combined output representation for that step.
10. Which of the following is a classic application of sequence modeling in NLP?
sequence modeling applications
Easy
A.Machine translation
B.K-means clustering of pixels
C.Predicting housing prices from tabular data
D.Image segmentation
Correct Answer: Machine translation
Explanation:
Machine translation involves translating a sequence of words in one language to a sequence of words in another, making it a classic sequence modeling task.
11. In sequence modeling, generating a shorter, concise version of a long document is known as:
sequence modeling applications
Easy
A.Topic modeling
B.Named entity recognition
C.Text summarization
D.Part-of-speech tagging
Correct Answer: Text summarization
Explanation:
Text summarization is the task of automatically generating a shorter version of a text that retains its most important information.
12. What is the primary goal of sentiment classification?
sentiment classification
Easy
A.To translate text from English to French
B.To determine the emotional tone or polarity (e.g., positive, negative) of a text
C.To correct grammatical errors in a sentence
D.To predict the next word in a sequence of text
Correct Answer: To determine the emotional tone or polarity (e.g., positive, negative) of a text
Explanation:
Sentiment classification focuses on analyzing text to determine whether the expressed sentiment is positive, negative, or neutral.
13. When using a standard RNN for sentence-level sentiment classification, which output is typically used to make the final prediction?
sentiment classification
Easy
A.The hidden state of the first time step
B.The output of the forget gate
C.The hidden state of the final time step
D.The randomly initialized word embeddings
Correct Answer: The hidden state of the final time step
Explanation:
For text classification tasks like sentiment analysis, the hidden state at the final time step is typically used because it has encoded information from the entire sequence.
14. Assigning an incoming email to either a 'Spam' or 'Not Spam' folder is an example of:
text classification
Easy
A.Text classification
B.Speech recognition
C.Machine translation
D.Text generation
Correct Answer: Text classification
Explanation:
Spam filtering is a binary text classification problem where a piece of text (the email) is assigned a categorical label.
15. In the context of preparing data for sequence models, what is 'padding'?
sequence training techniques
Easy
A.Removing the most frequent words from the text corpus
B.Increasing the size of the neural network's hidden layers
C.Adding dummy tokens to sequences so that all sequences in a batch have the same length
D.Multiplying the learning rate by a constant factor during training
Correct Answer: Adding dummy tokens to sequences so that all sequences in a batch have the same length
Explanation:
Neural networks process data in batches, which require uniform tensor dimensions. Padding adds a special token (like <PAD>) to shorter sequences to match the length of the longest sequence in the batch.
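A minimal sketch of the padding step in plain Python. The helper name `pad_batch` and the returned mask are illustrative assumptions, not any particular library's API; real frameworks usually pad integer token IDs rather than strings.

```python
PAD = "<PAD>"  # assumed padding token, matching the explanation above

def pad_batch(sequences, pad_token=PAD):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_token] * (max_len - len(seq)) for seq in sequences]
    # The mask marks real tokens (1) and padding (0) so a model can ignore padding.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

batch = [["the", "cat", "sat"], ["hello"], ["a", "b"]]
padded, mask = pad_batch(batch)
```

Here every row of `padded` has length 3, and `mask` records which positions hold real tokens.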
16. What does the 'teacher forcing' technique do during the training of an RNN for sequence generation?
teacher forcing
Easy
A.It randomizes the inputs to make the model more robust
B.It feeds the actual ground truth output from the previous time step as the input to the current time step
C.It forces the model to memorize the entire training dataset by removing dropout
D.It uses a larger, pre-trained model to teach a smaller model
Correct Answer: It feeds the actual ground truth output from the previous time step as the input to the current time step
Explanation:
Teacher forcing improves training speed and stability by providing the model with the correct previous token (the ground truth), rather than the model's own potentially incorrect prediction.
17. Why is gradient clipping often used when training sequence models like RNNs?
sequence training techniques
Easy
A.To increase the learning rate automatically when it gets too low
B.To skip the backpropagation step entirely for efficiency
C.To prevent exploding gradients by capping them at a maximum threshold
D.To force the model to converge to exactly zero
Correct Answer: To prevent exploding gradients by capping them at a maximum threshold
Explanation:
RNNs are prone to exploding gradients during backpropagation through time. Gradient clipping bounds the gradients to a maximum value to prevent the network weights from updating too drastically.
18. What is Truncated Backpropagation Through Time (TBPTT)?
truncated backpropagation through time
Easy
A.A technique to completely remove backpropagation from the training loop
B.An algorithm used to randomly initialize the weights of an RNN
C.A method to train the network exclusively on future time steps
D.A method to limit the number of time steps gradients are propagated backward to save memory and compute
Correct Answer: A method to limit the number of time steps gradients are propagated backward to save memory and compute
Explanation:
TBPTT unrolls the RNN for a fixed number of steps rather than the entire sequence, making training on very long sequences computationally feasible.
19. Why is standard Backpropagation Through Time (BPTT) problematic for extremely long sequences?
truncated backpropagation through time
Easy
A.It requires too much memory and suffers heavily from vanishing or exploding gradients
B.It is impossible to write in modern deep learning frameworks
C.It only works on image datasets, not text
D.It ignores the sequential nature of the text completely
Correct Answer: It requires too much memory and suffers heavily from vanishing or exploding gradients
Explanation:
Standard BPTT unrolls the network for the entire sequence length. For very long sequences, this consumes vast amounts of memory and exacerbates gradient issues.
20. Which metric is most commonly used to evaluate the overall correctness of a binary text classification model (e.g., positive vs. negative sentiment)?
evaluation metrics for sequence tasks
Easy
A.Accuracy or F1-score
B.Mean Squared Error (MSE)
C.Word Error Rate (WER)
D.Bilingual Evaluation Understudy (BLEU)
Correct Answer: Accuracy or F1-score
Explanation:
For classification tasks, metrics like Accuracy (percentage of correct predictions) or F1-score (harmonic mean of precision and recall) are standard.
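As a small worked example of these two metrics, the sketch below computes accuracy and F1 from scratch for a toy set of binary labels. The function name `accuracy_and_f1` and the label values are hypothetical, chosen only for illustration.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

With these labels there are 2 true positives, 1 false positive, and 1 false negative, so accuracy and F1 both come out to 2/3.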
21. Why is treating sentences merely as a bag-of-words insufficient for modeling sequential text data in complex NLP tasks?
sequential text data
Medium
A.It ignores the structural order and contextual dependency of words.
B.It requires significantly more memory compared to recurrent models.
C.It inherently suffers from the vanishing gradient problem.
D.It cannot be used to handle out-of-vocabulary words.
Correct Answer: It ignores the structural order and contextual dependency of words.
Explanation:
Bag-of-words models represent text as an unordered collection of words, discarding sequence and grammar. Sequential text data models are necessary to capture the structural order and context, which dictate meaning.
22. In a standard Recurrent Neural Network (RNN), the hidden state at time step $t$, denoted as $h_t$, is calculated using which of the following inputs?
recurrent neural networks
Medium
A.Only the current input
B.The previous input and the current hidden state
C.The current input and the previous hidden state
D.The current input and the previous output
Correct Answer: The current input and the previous hidden state
Explanation:
An RNN updates its hidden state by applying a function (often involving a $\tanh$ activation) to a linear combination of the current input $x_t$ and the previous hidden state $h_{t-1}$.
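The update can be sketched in plain Python as $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$. The weight names `W_xh`, `W_hh`, `b_h` and all numeric values are assumptions made for this illustration, not taken from the quiz.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        pre = b_h[i]
        pre += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t

# Toy sizes: input dimension 2, hidden dimension 2 (arbitrary values).
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.0]

h = [0.0, 0.0]  # initial hidden state
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the same weights are reused at every step
```

The loop makes the recurrence explicit: each new hidden state is computed from the current input and the hidden state produced one step earlier.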
23. What is the primary mathematical cause of the vanishing gradient problem when training standard RNNs on long sequences using Backpropagation Through Time?
recurrent neural networks
Medium
A.The use of ReLU activation functions in the output layer
B.Overfitting to the temporal dimension due to a lack of training data
C.Repeated multiplication of weight matrices with eigenvalues of magnitude less than 1
D.The gradients being truncated too early during backpropagation
Correct Answer: Repeated multiplication of weight matrices with eigenvalues of magnitude less than 1
Explanation:
During backpropagation through time, gradients are computed via the chain rule, resulting in repeated multiplication by the recurrent weight matrix (together with activation derivatives, which are at most 1 for $\tanh$). If the eigenvalues of this matrix have magnitude less than 1, the gradients shrink exponentially with the number of time steps, causing them to vanish.
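A scalar toy case makes the exponential effect concrete. The sketch below assumes a linear recurrence $h_t = w\,h_{t-1}$ with no activation, a deliberate simplification in which the "weight matrix" is a single number `w`.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad

vanishing = gradient_through_time(0.9, 100)  # |w| < 1: shrinks toward zero
exploding = gradient_through_time(1.1, 100)  # |w| > 1: grows without bound
```

After 100 steps the first gradient is smaller than 1e-4 while the second exceeds 1e4, even though the two weights differ by only 0.2.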
24. Which gate in a Long Short-Term Memory (LSTM) network is specifically responsible for deciding what information to discard from the internal cell state?
long short term memory networks
Medium
A.Forget gate
B.Update gate
C.Input gate
D.Output gate
Correct Answer: Forget gate
Explanation:
The forget gate outputs a value between 0 and 1 for each number in the cell state $C_{t-1}$. A 0 means completely discard the information, while a 1 means keep it entirely.
25. An LSTM cell computes the new cell state $C_t$ using the old cell state $C_{t-1}$, the forget gate $f_t$, the input gate $i_t$, and the candidate cell state $\tilde{C}_t$. Which equation correctly represents this update?
long short term memory networks
Medium
A.
B.
C.
D.
Correct Answer: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
Explanation:
The new cell state $C_t$ is formed by pointwise multiplying the old state $C_{t-1}$ by the forget gate $f_t$, and adding it to the new candidate values $\tilde{C}_t$ scaled by how much we decided to update them via the input gate $i_t$.
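The update described above, $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$, can be sketched elementwise in plain Python. The function name and the gate values are made up for illustration; in a real LSTM the gates come from sigmoid layers and the candidate from a $\tanh$ layer.

```python
def lstm_cell_state_update(c_prev, f_t, i_t, c_tilde):
    """Elementwise C_t = f_t * C_{t-1} + i_t * C~_t."""
    return [f * c + i * g for c, f, i, g in zip(c_prev, f_t, i_t, c_tilde)]

c_prev = [2.0, -1.0, 0.5]
f_t = [1.0, 0.0, 0.5]        # keep, forget, half-keep
i_t = [0.0, 1.0, 0.5]        # ignore candidate, accept it, half-accept
c_tilde = [0.7, 0.3, -0.5]   # candidate values, in (-1, 1) as after tanh
c_t = lstm_cell_state_update(c_prev, f_t, i_t, c_tilde)
```

The first component is carried over unchanged, the second is overwritten by the candidate, and the third blends the old value and the candidate equally.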
26. How does the internal cell state in an LSTM help mitigate the vanishing gradient problem?
long short term memory networks
Medium
A.It provides a more direct, uninterrupted gradient flow path through addition operations.
B.It resets the gradients to 1 at each time step to prevent shrinkage.
C.It relies exclusively on orthogonal weight initialization to maintain gradient scale.
D.It uses a step function to artificially boost gradients at every time step.
Correct Answer: It provides a more direct, uninterrupted gradient flow path through addition operations.
Explanation:
The cell state acts like a conveyor belt. Its update is additive, with the previous state only scaled elementwise by the forget gate, so gradients can flow back through time with little modification when the forget gate stays close to 1. This substantially reduces the vanishing gradient issue.
27. Compared to an LSTM, which of the following accurately describes a structural simplification made by a Gated Recurrent Unit (GRU)?
gated recurrent units
Medium
A.A GRU adds an extra memory gate to replace the cell state.
B.A GRU relies exclusively on peephole connections.
C.A GRU eliminates the recurrent hidden state entirely.
D.A GRU combines the forget and input gates into a single update gate.
Correct Answer: A GRU combines the forget and input gates into a single update gate.
Explanation:
A GRU simplifies the LSTM architecture by merging the cell state and hidden state, and combining the input and forget gates into a single "update gate".
28. If a GRU's update gate $z_t$ is close to 1, what does this mathematically imply about the updated hidden state $h_t$?
gated recurrent units
Medium
A.The hidden state will be completely reset to zero.
B.The new hidden state will heavily rely on the previous hidden state $h_{t-1}$.
C.The network will ignore the previous hidden state entirely.
D.The new hidden state will heavily rely on the candidate hidden state $\tilde{h}_t$.
Correct Answer: The new hidden state will heavily rely on the previous hidden state $h_{t-1}$.
Explanation:
In a GRU, the update equation is $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$. If $z_t$ is close to 1, the model retains the old hidden state $h_{t-1}$ and ignores the new candidate state $\tilde{h}_t$. Note that some references swap the roles of $z_t$ and $1 - z_t$; the answer here follows the convention shown.
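A short sketch of this interpolation, assuming the convention $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$ that the explanation relies on. The function name and the numbers are hypothetical.

```python
def gru_interpolate(h_prev, h_tilde, z_t):
    """Elementwise h_t = z_t * h_prev + (1 - z_t) * h_tilde (assumed convention)."""
    return [z * hp + (1.0 - z) * hc for hp, hc, z in zip(h_prev, h_tilde, z_t)]

h_prev = [0.8, -0.4]    # previous hidden state
h_tilde = [-0.2, 0.6]   # candidate hidden state
mostly_keep = gru_interpolate(h_prev, h_tilde, [0.99, 0.99])    # z_t near 1
mostly_update = gru_interpolate(h_prev, h_tilde, [0.01, 0.01])  # z_t near 0
```

With $z_t$ near 1 the result stays close to `h_prev`; with $z_t$ near 0 it moves almost entirely to `h_tilde`.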
29. In a BiLSTM used for Named Entity Recognition, how is the final contextual representation for the $t$-th word in a sentence typically obtained?
bidirectional RNNs
Medium
A.By averaging the hidden states of all surrounding words in a fixed window.
B.By subtracting $\overleftarrow{h}_t$ from $\overrightarrow{h}_t$.
C.By concatenating the forward hidden state $\overrightarrow{h}_t$ and backward hidden state $\overleftarrow{h}_t$.
D.By using only the forward hidden state for the first half of the sequence.
Correct Answer: By concatenating the forward hidden state $\overrightarrow{h}_t$ and backward hidden state $\overleftarrow{h}_t$.
Explanation:
Bidirectional RNNs process the sequence in both directions. The representation of a token at step $t$ is formed by concatenating the hidden state $\overrightarrow{h}_t$ from the forward RNN and the hidden state $\overleftarrow{h}_t$ from the backward RNN, capturing both past and future context.
30. Why are Bidirectional RNNs generally unsuitable for real-time autoregressive text generation tasks?
bidirectional RNNs
Medium
A.They cannot use LSTM or GRU cells, leading to poor memory retention.
B.They require access to future tokens in the sequence which have not yet been generated.
C.They have a strict limitation on the vocabulary size they can output.
D.They suffer from exploding gradients significantly more than standard RNNs.
Correct Answer: They require access to future tokens in the sequence which have not yet been generated.
Explanation:
Autoregressive generation predicts the next word based only on previous words. A bidirectional model requires reading the sequence backwards (from future to past), which is impossible in real-time generation where future tokens do not exist yet.
31. Which specific sequence modeling architecture is most appropriate for a Machine Translation task?
sequence modeling applications
Medium
A.One-to-one
B.Many-to-one
C.Many-to-many (Encoder-Decoder)
D.One-to-many
Correct Answer: Many-to-many (Encoder-Decoder)
Explanation:
Machine Translation requires mapping an input sequence of variable length to an output sequence of variable length, which is best handled by a Many-to-many (Encoder-Decoder or Seq2Seq) architecture.
32. In a standard encoder-decoder architecture for sequence-to-sequence tasks, what is the primary role of the context vector?
sequence modeling applications
Medium
A.It applies dropout to prevent overfitting during decoding.
B.It directly predicts the final output class of the sequence.
C.It forces the decoder to output the exact tokens of the input sequence.
D.It encapsulates the information from the entire input sequence to initialize the decoder.
Correct Answer: It encapsulates the information from the entire input sequence to initialize the decoder.
Explanation:
The context vector is the final hidden state of the encoder. Its purpose is to compress and summarize the entire input sequence into a fixed-size representation that the decoder uses to begin generating the output.
33. When using a unidirectional RNN for document-level sentiment classification, which hidden state is typically passed to the final dense classification layer?
sentiment classification
Medium
A.The hidden state of the first time step.
B.A randomly sampled hidden state from the sequence.
C.The hidden state of the final time step.
D.The hidden state with the highest gradient magnitude.
Correct Answer: The hidden state of the final time step.
Explanation:
In a many-to-one architecture like sentiment analysis, the hidden state at the final time step theoretically contains the compressed information (context) of the entire sequence, making it the appropriate input for the classification layer.
34. To improve a text classification RNN's robust understanding of a sentence and avoid the information bottleneck of just using the final hidden state, one could apply:
C.Global max pooling or average pooling over all hidden states of the sequence.
D.The raw word embeddings directly concatenated to the final output.
Correct Answer: Global max pooling or average pooling over all hidden states of the sequence.
Explanation:
Applying pooling (max or average) across the hidden states of all time steps allows the model to capture the most important features from the entire sequence, alleviating the bottleneck of relying solely on the final hidden state.
35. What is a prominent and simple technique used to prevent the exploding gradient problem when training deep sequence models?
sequence training techniques
Medium
A.Teacher forcing
B.Adding more recurrent layers
C.Gradient clipping
D.Label smoothing
Correct Answer: Gradient clipping
Explanation:
Gradient clipping prevents gradients from growing too large by capping them at a maximum threshold after backpropagation and before the weight update, which directly mitigates the exploding gradient problem.
36. Which regularization technique is specifically adapted for recurrent connections in sequence models by applying the exact same dropout mask across all time steps?
sequence training techniques
Medium
A.Batch Normalization
B.Standard Dropout
C.Variational Dropout
D.L2 Regularization
Correct Answer: Variational Dropout
Explanation:
Variational Dropout (proposed by Y. Gal and Z. Ghahramani) applies the same dropout mask at every time step for both inputs and recurrent states, which properly regularizes the RNN without disrupting its ability to retain long-term memory.
37. During the training of a sequence generation model using Teacher Forcing, what input is fed to the decoder at time step $t$?
teacher forcing
Medium
A.The model's own predicted token from time step $t-1$
B.A random token sampled from the vocabulary
C.The ground truth token from time step $t-1$
D.The context vector from the encoder
Correct Answer: The ground truth token from time step $t-1$
Explanation:
Teacher forcing is a training strategy where the ground truth (actual) previous token is fed as the input to the next time step, regardless of what the model actually predicted at the previous step. This stabilizes and speeds up training.
38. What is the primary computational advantage of using Truncated Backpropagation Through Time (TBPTT) over standard BPTT for very long sequences?
truncated backpropagation through time
Medium
A.It limits the number of time steps the gradient flows backward, significantly reducing memory usage and computation time.
B.It entirely eliminates the vanishing gradient problem for all recurrent architectures.
C.It automatically tunes the learning rate hyperparameters of the RNN.
D.It allows the model to process sequences of infinite length without losing any past information.
Correct Answer: It limits the number of time steps the gradient flows backward, significantly reducing memory usage and computation time.
Explanation:
Standard BPTT requires keeping all time steps in memory to calculate gradients. TBPTT truncates this backward pass to a fixed number of steps, which keeps memory constraints manageable and speeds up updates.
39. In TBPTT, when a long document is split into chunks of length $k$, what happens to the hidden states and gradients at the boundary between one chunk and the next?
truncated backpropagation through time
Medium
A.The hidden state is reset to zero at the start of every chunk to keep chunks entirely independent.
B.The hidden state and the gradients are both passed through to the previous chunk until the start of the document.
C.The hidden state is passed forward to the next chunk to retain context, but gradients are stopped from flowing backward into the previous chunk.
D.The hidden state is discarded, but the gradients flow backward across all chunk boundaries.
Correct Answer: The hidden state is passed forward to the next chunk to retain context, but gradients are stopped from flowing backward into the previous chunk.
Explanation:
In TBPTT, to maintain the forward flow of information, hidden states are carried over between chunks. However, to save computation and memory, the backward pass (gradients) is truncated and does not cross the chunk boundaries.
40. Which evaluation metric, commonly used for sequence generation tasks like Machine Translation, computes the geometric mean of n-gram precision multiplied by a brevity penalty?
evaluation metrics for sequence tasks
Medium
A.Perplexity
B.ROUGE
C.F1-Score
D.BLEU
Correct Answer: BLEU
Explanation:
The BLEU (Bilingual Evaluation Understudy) score evaluates generated text by calculating the precision of matching n-grams against reference texts, applying a brevity penalty to penalize overly short translations.
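To make the two ingredients concrete, here is a simplified single-reference, sentence-level sketch with no smoothing. It is not a replacement for an established BLEU implementation: the function name `simple_bleu` is hypothetical, and the default `max_n=2` is an assumption chosen for short toy sentences (standard BLEU uses n-grams up to 4).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    geo_mean = math.exp(sum(log_precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * geo_mean

reference = "the cat is on the mat".split()
perfect = simple_bleu(reference, reference)
short = simple_bleu("the cat".split(), reference)
```

The candidate "the cat" has perfect unigram and bigram precision, yet its score is pulled down to $e^{-2}$ by the brevity penalty because it is only 2 tokens long against a 6-token reference.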
41. In a vanilla Recurrent Neural Network (RNN) with hidden state $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, the vanishing gradient problem occurs during Backpropagation Through Time (BPTT). Mathematically, what is the primary condition that causes the gradient to vanish exponentially with respect to the sequence length?
recurrent neural networks
Hard
A.The input sequences have a variance approaching zero, causing the derivative to saturate.
B.The Frobenius norm of the input weight matrix $W_{xh}$ exceeds the sequence length $T$.
C.The spectral radius (largest absolute eigenvalue) of the weight matrix $W_{hh}$ is less than 1.
D.The dominant eigenvalue of the Jacobian matrix $\partial h_t / \partial h_{t-1}$ is strictly greater than 1.
Correct Answer: The spectral radius (largest absolute eigenvalue) of the weight matrix $W_{hh}$ is less than 1.
Explanation:
During BPTT, gradients are computed via repeated multiplication of the Jacobian matrix $\partial h_t / \partial h_{t-1}$. Each Jacobian is the recurrent weight matrix $W_{hh}$ scaled by the $\tanh$ derivatives, which are at most 1. If the largest singular value of $W_{hh}$ is less than 1 (which also implies a spectral radius below 1), these repeated multiplications cause the gradient to decay exponentially, leading to the vanishing gradient problem.
42. Consider an LSTM cell where the forget gate $f_t$ is artificially clamped to $1$ (vector of ones) and the input gate $i_t$ is clamped to $0$ (vector of zeros) for all time steps $t$. Assuming the initial cell state $C_0 = 0$, how does the LSTM behave?
long short term memory networks
Hard
A.The cell state $C_t$ will remain $0$ for all $t$, and the hidden state $h_t$ will only depend on the output gate $o_t$.
B.The cell state will accumulate gradients linearly over time, leading to exploding gradients.
C.The network behaves identically to a standard vanilla RNN, rendering the gating mechanisms useless.
D.The hidden state $h_t$ will be entirely dependent on the current input $x_t$.
Correct Answer: The cell state $C_t$ will remain $0$ for all $t$, and the hidden state $h_t$ will only depend on the output gate $o_t$.
Explanation:
The cell state update is $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. If $f_t = 1$ and $i_t = 0$, then $C_t = C_{t-1}$. Since $C_0 = 0$, $C_t$ remains $0$ forever. The hidden state is $h_t = o_t \odot \tanh(C_t)$, which simplifies to $h_t = o_t \odot \tanh(0)$. Thus, $h_t$ evaluates to zero regardless of $o_t$, effectively halting information flow.
43. Assume a GRU and an LSTM both process an input of dimension $d$ and have a hidden state of dimension $h$. Ignoring biases for simplicity, what is the exact ratio of the number of trainable weight parameters in the GRU to the number of trainable weight parameters in the LSTM?
gated recurrent units
Hard
A.
B.
C.
D.
Correct Answer: $3/4$
Explanation:
An LSTM has 4 sets of weights (for the forget gate, input gate, output gate, and cell state candidate), each of size $h \times (d + h)$. A GRU has 3 sets of weights (update gate, reset gate, and hidden state candidate), each of size $h \times (d + h)$. Therefore, the ratio of parameters (excluding biases) is exactly $3/4$.
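The count can be checked with a few lines of arithmetic. The helper name and the sizes `d = 128`, `h = 256` below are arbitrary choices for illustration; the ratio does not depend on them.

```python
def recurrent_weight_count(num_blocks, input_dim, hidden_dim):
    """Weights for `num_blocks` gate/candidate blocks, each hidden_dim x (input_dim + hidden_dim), biases ignored."""
    return num_blocks * hidden_dim * (input_dim + hidden_dim)

d, h = 128, 256  # illustrative input and hidden sizes
lstm_weights = recurrent_weight_count(4, d, h)  # forget, input, output, candidate
gru_weights = recurrent_weight_count(3, d, h)   # update, reset, candidate
ratio = gru_weights / lstm_weights
```

Because both counts share the factor `hidden_dim * (input_dim + hidden_dim)`, it cancels and the ratio reduces to 3/4 for any choice of sizes.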
44. Why is a Bidirectional RNN (BiRNN) fundamentally unsuitable for standard autoregressive causal language modeling (e.g., predicting the next word $x_{t+1}$ given $x_1, \dots, x_t$)?
bidirectional RNNs
Hard
A.The concatenated hidden state dimension is too large for the softmax layer.
B.BiRNNs cannot handle variable-length sequences during inference.
C.BPTT cannot be applied to the backward pass in a real-time sequential data stream.
D.The backward RNN requires access to future tokens, which leaks the target information ($x_{t+1}$) during the prediction at step $t$.
Correct Answer: The backward RNN requires access to future tokens, which leaks the target information ($x_{t+1}$) during the prediction at step $t$.
Explanation:
Causal language modeling relies on predicting the future based strictly on the past. A BiRNN processes the sequence in both directions, meaning the backward representation at step $t$ has already seen $x_{t+1}, x_{t+2}$, etc. This violates the causality constraint required for autoregressive generation.
45. In sequence-to-sequence training, 'exposure bias' occurs when a model is trained with Teacher Forcing but tested in a free-running mode. Which of the following best describes how 'Scheduled Sampling' aims to mitigate this specific issue?
teacher forcing
Hard
A.During training, it stochastically decides whether to feed the ground-truth previous token or the model's own previous prediction to the next step, with the probability of using ground-truth decaying over time.
B.It applies dropout to the decoder's recurrent connections with a probability that decreases over time.
C.It randomly masks out tokens in the input sequence to force the model to rely on its hidden state.
D.It dynamically alters the loss function from Cross-Entropy to Reinforcement Learning (e.g., REINFORCE) as training progresses.
Correct Answer: During training, it stochastically decides whether to feed the ground-truth previous token or the model's own previous prediction to the next step, with the probability of using ground-truth decaying over time.
Explanation:
Scheduled Sampling bridges the gap between training and inference by gradually shifting the model inputs during training from the gold-standard tokens (Teacher Forcing) to the model's own predictions, allowing the model to learn how to recover from its own mistakes.
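The per-step decision can be sketched as follows. The linear decay schedule and the function names are assumptions made for this illustration; other decay schedules are possible, and a real decoder would pass token IDs rather than the placeholder strings used here.

```python
import random

def teacher_forcing_prob(step, decay_steps):
    """Probability of feeding the ground truth, decayed linearly from 1.0 to 0.0 (assumed schedule)."""
    return max(0.0, 1.0 - step / decay_steps)

def choose_decoder_input(gold_prev, predicted_prev, p_gold, rng):
    """Feed the gold token with probability p_gold, otherwise the model's own prediction."""
    return gold_prev if rng.random() < p_gold else predicted_prev

rng = random.Random(0)
# Start of training: p_gold = 1.0, which is pure teacher forcing.
early = [choose_decoder_input("gold", "pred", teacher_forcing_prob(0, 1000), rng) for _ in range(5)]
# End of the schedule: p_gold = 0.0, which is fully free-running.
late = [choose_decoder_input("gold", "pred", teacher_forcing_prob(1000, 1000), rng) for _ in range(5)]
```

Between these two extremes the decoder sees a mix of gold tokens and its own predictions, which is what exposes it to its own mistakes during training.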
46. In a standard implementation of Truncated Backpropagation Through Time (TBPTT) configured with forward step $k_1$ and backward step $k_2$ (where $k_1 = k_2$), what happens to the hidden state and the computational graph at the boundary between chunks?
truncated backpropagation through time
Hard
A.The hidden state is passed forward to the next chunk, but it is detached from the computational graph, preventing gradients from flowing back into the previous chunk.
B.The hidden state is passed forward, and gradients are accumulated indefinitely, effectively making it equivalent to full BPTT.
C.The hidden state is reset to zero, and the computational graph is retained to allow gradient flow.
D.Both the hidden state and the computational graph are discarded, treating each chunk as an independent sequence.
Correct Answer: The hidden state is passed forward to the next chunk, but it is detached from the computational graph, preventing gradients from flowing back into the previous chunk.
Explanation:
In TBPTT, the forward pass continues seamlessly by transferring the hidden state to the next chunk (maintaining statefulness). However, to limit memory and computational cost, the hidden state is detached from the autodiff graph (e.g., h.detach() in PyTorch), stopping the backward pass at the chunk boundary.
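The chunking and state carry-over can be sketched without an autodiff library. The toy scalar recurrence below is an assumption standing in for a real RNN, and plain Python floats carry no graph, so the detach step appears only as a comment marking where it would go.

```python
def split_into_chunks(sequence, k):
    """Split a long sequence into consecutive chunks of at most k steps."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

def run_chunk(chunk, h0, w=0.5):
    """Toy scalar recurrence h_t = w * h_{t-1} + x_t, standing in for an RNN forward pass."""
    h = h0
    for x_t in chunk:
        h = w * h + x_t
    return h

sequence = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
h = 0.0
for chunk in split_into_chunks(sequence, k=3):
    h = run_chunk(chunk, h)  # the hidden state is carried into the next chunk
    # In an autodiff framework, the loss for this chunk would be backpropagated
    # here and the carried state detached (e.g. h = h.detach() in PyTorch), so
    # gradients stop at the chunk boundary.

h_full = run_chunk(sequence, 0.0)  # forward pass over the whole sequence at once
```

The chunked forward pass ends in exactly the same hidden state as the unchunked one; truncation changes only how far gradients travel backward, not what the model sees going forward.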
47. To prevent exploding gradients in deep RNNs, practitioners use gradient clipping. Consider global norm clipping versus value clipping. Why is global norm clipping generally preferred over value-based clipping for sequence models?
sequence training techniques
Hard
A.Global norm clipping guarantees that the loss will decrease monotonically, whereas value clipping does not.
B.Value clipping cannot prevent the hidden state from exploding during the forward pass.
C.Value clipping requires calculating the norm of the gradient, which is computationally expensive for large sequence models.
D.Global norm clipping scales all parameter gradients by the same factor, preserving the overall direction of the gradient vector.
Correct Answer: Global norm clipping scales all parameter gradients by the same factor, preserving the overall direction of the gradient vector.
Explanation:
Global norm clipping shrinks the entire gradient vector $g$ if its norm $\|g\|$ exceeds a threshold $\tau$, multiplying it by $\tau / \|g\|$. This preserves the direction of the gradient in the parameter space. Value clipping clips each gradient component individually, which can drastically alter the direction of the update vector, destabilizing training.
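The difference in direction can be checked on a toy two-component gradient. This is a sketch in plain Python; the helper names and the threshold of 5.0 are illustrative assumptions, not part of any library.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the whole vector by one factor when its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

def clip_by_value(grads, limit):
    """Clamp each component independently to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

grads = [30.0, 4.0]
norm_clipped = clip_by_global_norm(grads, max_norm=5.0)
value_clipped = clip_by_value(grads, limit=5.0)  # [5.0, 4.0]

same_direction = cosine(grads, norm_clipped)    # stays at 1 up to rounding
bent_direction = cosine(grads, value_clipped)   # noticeably below 1
```

Value clipping leaves the small component untouched while cutting the large one, so the update points somewhere else; norm clipping shortens the vector without turning it.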
48When designing an RNN for document-level text classification, taking the final hidden state $h_T$ can create a bottleneck. If you instead apply max-pooling over the sequence of hidden states $h_1, \dots, h_T$, what distinct representational advantage does this provide over the final hidden state?
text classification
Hard
A.It reduces the dimensionality of the hidden state before passing it to the linear classification layer.
B.It completely eliminates the vanishing gradient problem for the early tokens in the document.
C.It forces the RNN to behave like a Bag-of-Words model, discarding positional information entirely.
D.It captures the most salient features (highest activation values) across the entire sequence, regardless of their position, mitigating the bias towards recent tokens.
Correct Answer: It captures the most salient features (highest activation values) across the entire sequence, regardless of their position, mitigating the bias towards recent tokens.
Explanation:
Max-pooling across the time dimension extracts the maximum value for each feature channel over the entire sequence. This allows the model to identify strong signals (salient features) anywhere in the text, avoiding the standard RNN bottleneck where the final state heavily favors the most recent (end of sequence) tokens.
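A toy example makes the contrast concrete. The hidden states below are made-up numbers chosen so that the strongest activation occurs at the first time step; `max_pool_over_time` is an illustrative helper name.

```python
# Hypothetical hidden states for a 3-step sequence with 3 feature channels.
hidden_states = [
    [0.9, -0.2, 0.1],  # h_1: strong signal in channel 0 early in the document
    [0.1, 0.3, 0.0],   # h_2
    [0.0, 0.1, 0.2],   # h_3: the final hidden state
]

def max_pool_over_time(states):
    """Take the maximum of each feature channel across all time steps."""
    return [max(channel) for channel in zip(*states)]

final_state = hidden_states[-1]
pooled = max_pool_over_time(hidden_states)  # [0.9, 0.3, 0.2]
```

The pooled vector has the same dimensionality as a single hidden state, but it keeps the early 0.9 activation that is absent from the final state.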
49For a sequence $w_1, \dots, w_N$, the perplexity (PPL) of a language model is defined as $\text{PPL} = P(w_1, \dots, w_N)^{-1/N}$. If the model calculates the average categorical cross-entropy loss $\mathcal{L}$ using natural logarithms, what is the exact mathematical relationship between $\mathcal{L}$ and PPL?
evaluation metrics for sequence tasks
Hard
A.
B.
C.
D.
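Correct Answer: $\text{PPL} = e^{\mathcal{L}}$
Explanation:
Perplexity is the inverse probability of the test set, normalized by the number of words. Since the loss is the negative log-likelihood (base $e$) averaged over tokens, $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_{<i})$. Therefore, $\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_{<i})\right)$, which simplifies to $\text{PPL} = e^{\mathcal{L}}$.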
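The identity can be verified numerically. The token probabilities below are hypothetical values standing in for a model's per-token predictions; both routes to perplexity should agree.

```python
import math

# Hypothetical per-token probabilities assigned by a model to a 4-token sequence.
token_probs = [0.25, 0.5, 0.125, 0.5]
n = len(token_probs)

# Route 1: exponentiate the average negative log-likelihood (natural log).
avg_nll = -sum(math.log(p) for p in token_probs) / n
ppl_from_loss = math.exp(avg_nll)

# Route 2: inverse sequence probability, normalized by length.
ppl_from_definition = math.prod(token_probs) ** (-1.0 / n)
```

If the loss were computed with base-2 logarithms instead, the matching relationship would use a base of 2 rather than $e$.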
50In a variant of the LSTM known as 'LSTM with peephole connections' (Gers & Schmidhuber, 2000), the gates are allowed to inspect the cell state. Specifically, which cell state is used to compute the forget gate $f_t$ and input gate $i_t$, versus the output gate $o_t$?
long short term memory networks
Hard
A. and use ; uses .
B.All three gates use .
C. and use ; uses .
D.All three gates use .
Correct Answer: $f_t$ and $i_t$ use $c_{t-1}$; $o_t$ uses $c_t$.
Explanation:
In peephole LSTMs, the forget and input gates are calculated before the cell state is updated, so they must use the previous cell state $c_{t-1}$. The output gate is calculated after the cell state is updated, so it is allowed to 'peek' at the current, newly updated cell state $c_t$.
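The ordering of the computation can be sketched with a scalar cell. This is a simplified illustration under stated assumptions: all weights are scalars, every parameter is set to 0.5, and the parameter names (`pf`, `pi`, `po` for the peephole weights) are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def peephole_lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step with peephole connections."""
    # Forget and input gates peek at the previous cell state c_{t-1}.
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["pf"] * c_prev + p["bf"])
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["pi"] * c_prev + p["bi"])
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])
    c = f * c_prev + i * g  # cell state is updated here
    # Output gate peeks at the freshly updated cell state c_t.
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["po"] * c + p["bo"])
    h = o * math.tanh(c)
    return h, c

names = ("wf", "uf", "pf", "bf", "wi", "ui", "pi", "bi",
         "wg", "ug", "bg", "wo", "uo", "po", "bo")
params = {name: 0.5 for name in names}  # hypothetical parameter values

h1, c1 = peephole_lstm_step(1.0, 0.2, 0.3, params)
# Removing the output peephole changes h_t but cannot change c_t.
h2, c2 = peephole_lstm_step(1.0, 0.2, 0.3, dict(params, po=0.0))
```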
51In a GRU, the candidate hidden state is computed as $\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$. If the reset gate $r_t$ approaches a zero vector, what is the structural implication for the sequence model at time $t$?
gated recurrent units
Hard
A.The network architecture degrades into a strictly linear transformation of the input sequence.
B.The candidate state acts as if it is reading the first symbol of a new sequence, ignoring previous hidden states.
C.The update gate $z_t$ is forced to 1, causing the hidden state to remain exactly $h_{t-1}$.
D.The GRU completely forgets its entire history, and the final hidden state is determined solely by $x_t$.
Correct Answer: The candidate state acts as if it is reading the first symbol of a new sequence, ignoring previous hidden states.
Explanation:
When the reset gate $r_t$ is near 0, the term $U (r_t \odot h_{t-1})$ vanishes. The candidate hidden state $\tilde{h}_t$ relies only on the current input $x_t$. This allows the model to drop irrelevant history and essentially treat $x_t$ as the start of a new context or sequence.
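A scalar sketch shows the effect, assuming the common candidate form $\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$ without a bias term. The weights `w` and `u` and the input values are arbitrary illustrative numbers.

```python
import math

def gru_candidate(x_t, h_prev, r_t, w=0.8, u=1.5):
    """Scalar GRU candidate state; w and u are hypothetical weights."""
    return math.tanh(w * x_t + u * (r_t * h_prev))

x_t = 0.4

# Reset gate open: the candidate depends on the previous hidden state.
with_history_a = gru_candidate(x_t, h_prev=0.9, r_t=1.0)
with_history_b = gru_candidate(x_t, h_prev=-0.9, r_t=1.0)

# Reset gate closed: very different histories give the same candidate.
reset_a = gru_candidate(x_t, h_prev=0.9, r_t=0.0)
reset_b = gru_candidate(x_t, h_prev=-0.9, r_t=0.0)
```

Note that only the candidate ignores history here. The new hidden state still mixes the candidate with $h_{t-1}$ through the update gate, which is why the "forgets its entire history" option overstates the effect.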
52Why is the ReLU activation function rarely used in vanilla Recurrent Neural Networks compared to $\tanh$, despite ReLU's success in mitigating vanishing gradients in deep Feedforward and Convolutional networks?
recurrent neural networks
Hard
A.ReLU completely prevents gradient flow for negative inputs, rendering BPTT impossible across multiple time steps.
B.ReLU causes the hidden state to become non-differentiable at exactly zero, which crashes recurrent autodifferentiation engines.
C.Because RNNs reuse the same weight matrix at every time step, the unbounded positive output of ReLU often leads to exponentially exploding activations and gradients.
D.The memory requirements for caching ReLU activations across time steps are significantly higher than for $\tanh$.
Correct Answer: Because RNNs reuse the same weight matrix at every time step, the unbounded positive output of ReLU often leads to exponentially exploding activations and gradients.
Explanation:
Unlike feedforward networks that use different weights for each layer, an RNN multiplies the same recurrent weight matrix $W_{hh}$ repeatedly. With an unbounded activation function like ReLU ($f(x) = x$ for $x > 0$), values can easily compound and grow without bound (exploding activations/gradients) if the largest eigenvalue magnitude of $W_{hh}$ is greater than 1. $\tanh$ squashes values to $(-1, 1)$, keeping activations bounded.
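The compounding effect shows up even in a one-dimensional recurrence with no input. In this sketch the recurrent weight `w = 1.5`, the initial state, and the step count are arbitrary choices made for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def run_recurrence(activation, w=1.5, h0=1.0, steps=50):
    """Apply h <- activation(w * h) repeatedly with the same weight w."""
    h = h0
    for _ in range(steps):
        h = activation(w * h)
    return h

relu_final = run_recurrence(relu)       # grows like 1.5 ** 50
tanh_final = run_recurrence(math.tanh)  # stays inside (-1, 1)
```

With ReLU the state is multiplied by 1.5 at every step, so it grows geometrically; with $\tanh$ the same weight produces a state that settles at a bounded value.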
53A $k$-th order Markov model assumes the probability of the next word depends only on the previous $k$ words. How does an unrolled standard RNN theoretically bypass this $k$-th order Markov assumption for sequential text data?
sequential text data
Hard
A.It does not bypass it; an RNN is mathematically equivalent to a 1st-order Markov model on the input space.
B.By updating its recurrent weights dynamically based on the length of the input sequence.
C.By utilizing attention mechanisms that provide direct access to all past words.
D.By maintaining a continuous hidden state vector that acts as a lossy compression of the entire unbounded history $x_1, \dots, x_t$.
Correct Answer: By maintaining a continuous hidden state vector that acts as a lossy compression of the entire unbounded history $x_1, \dots, x_t$.
Explanation:
An RNN's hidden state $h_t$ is a recursive function of $x_t$ and $h_{t-1}$. Because $h_{t-1}$ contains information about $x_{t-1}$, and so on back to $x_1$, $h_t$ theoretically holds information from arbitrarily far back (the entire sequence up to $t$), breaking the strict finite-window limitation of a $k$-th order Markov model.
54Assume a sequence of length $T = 1000$. You train an RNN using Truncated BPTT with forward chunk size $k_1 = 50$ and backward window $k_2 = 50$. What is the total number of unrolled time steps processed during the backward passes for one complete epoch of this single sequence?
truncated backpropagation through time
Hard
A.$1000$
B.$2500$
C.$50000$
D.$50$
Correct Answer: $1000$
Explanation:
The sequence of length $1000$ is divided into $1000 / 50 = 20$ chunks. For each chunk, the backward pass unrolls for $k_2 = 50$ steps. Therefore, the total number of backward steps processed is $20 \times 50 = 1000$. It matches the sequence length, ensuring linear complexity overall.
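The count can be reproduced with a short loop. The values $T = 1000$ and $k_1 = k_2 = 50$ are assumed here because they are consistent with the stated answer of $1000$; the function name is illustrative, and any trailing partial chunk is ignored for simplicity.

```python
def tbptt_backward_steps(seq_len, k1, k2):
    """Count unrolled backward steps when a backward pass runs every k1 steps
    and each pass reaches back at most k2 steps (full chunks only)."""
    total = 0
    for end in range(k1, seq_len + 1, k1):
        total += min(k2, end)  # cannot unroll past the start of the sequence
    return total

num_chunks = 1000 // 50
total_steps = tbptt_backward_steps(seq_len=1000, k1=50, k2=50)
```

When $k_2 > k_1$ the backward windows overlap and the total exceeds the sequence length, which is why the equal-size setting is the one that gives linear cost.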
55In the BLEU score calculation for sequence tasks, the Brevity Penalty (BP) is introduced to penalize short translations. Let $c$ be the length of the candidate translation and $r$ be the effective reference corpus length. Under what exact condition is $BP = 1$?
evaluation metrics for sequence tasks
Hard
A. only
B.
C.
D. only
Correct Answer: $c \ge r$
Explanation:
The Brevity Penalty in BLEU is defined as $BP = 1$ if $c > r$, and $BP = e^{1 - r/c}$ if $c \le r$. At $c = r$ the exponential evaluates to $e^0 = 1$, so no penalty is applied (multiplier of 1) as long as the candidate length $c$ is at least as long as the reference length $r$.
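The piecewise definition translates directly into code. This sketch assumes a non-empty candidate ($c > 0$); the function name is chosen for the example.

```python
import math

def brevity_penalty(c, r):
    """BLEU brevity penalty for candidate length c and reference length r (c > 0)."""
    if c > r:
        return 1.0
    return math.exp(1.0 - r / c)

equal_length = brevity_penalty(10, 10)  # exp(0), so no penalty
longer = brevity_penalty(12, 10)        # no penalty
shorter = brevity_penalty(8, 10)        # exp(-0.25), penalized
```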
56In a Hierarchical Attention Network (HAN) used for document-level sentiment classification, the architecture consists of word-level and sentence-level encoders. Which of the following best represents the sequence of operations for generating the final document vector $v$?
B.Word embedding → Sentence-level BiGRU → Word Attention → Word-level BiGRU → Document Attention.
C.Word embedding → Word-level BiGRU → Word Attention → Sentence-level BiGRU → Document Attention.
D.Word embedding → Word Attention → Word-level BiGRU → Sentence Attention → Sentence-level BiGRU.
Correct Answer: Word embedding → Word-level BiGRU → Word Attention → Sentence-level BiGRU → Document Attention.
Explanation:
The HAN processes data hierarchically: First, words are embedded and passed through a word-level BiGRU. Then, word-level attention aggregates them into sentence vectors. These sentence vectors are passed through a sentence-level BiGRU. Finally, document-level attention aggregates the sentence hidden states into a single document vector.
57When performing Named Entity Recognition (NER) on a sequence of tokens $x_1, \dots, x_T$ using a Bidirectional RNN, the output for token $x_t$ relies on the concatenated state $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$. If an entity spans from $x_t$ to $x_{t+2}$, how does $h_t$ capture the dependency on $x_{t+2}$?
bidirectional RNNs
Hard
A.Through a subsequent CRF layer, since the BiRNN hidden states themselves cannot capture future dependencies.
B.Through the forward hidden state $\overrightarrow{h}_t$, which has processed $x_{t+2}$ before reaching $x_t$.
C.Through the backward hidden state $\overleftarrow{h}_t$, which processes sequences from $T$ down to $1$ and thus incorporates $x_{t+2}$ before reaching $x_t$.
D.By employing teacher forcing during inference to feed $x_{t+2}$ into $h_t$.
Correct Answer: Through the backward hidden state $\overleftarrow{h}_t$, which processes sequences from $T$ down to $1$ and thus incorporates $x_{t+2}$ before reaching $x_t$.
Explanation:
The backward RNN processes tokens in reverse order ($x_T \to x_1$). Therefore, when computing $\overleftarrow{h}_t$, the backward RNN has already processed tokens $x_{t+2}$ and $x_{t+1}$. This future context is thus successfully captured in the concatenated state at step $t$.
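A scalar bidirectional sketch confirms which half of the concatenated state sees the future token. The weights, the toy inputs, and the helper names are assumptions for illustration; a real tagger would use vector states and learned parameters.

```python
import math

def run_rnn(xs, w=0.7, u=0.5):
    """Scalar tanh RNN; returns the hidden state after each input."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def birnn_states(xs):
    """Pair forward and backward states so index t refers to the same token."""
    forward = run_rnn(xs)
    backward = run_rnn(xs[::-1])[::-1]  # run right-to-left, then realign
    return list(zip(forward, backward))

tokens = [0.1, 0.4, -0.3, 0.8, 0.2]
altered = tokens[:3] + [-0.8] + tokens[4:]  # change only the token at index t + 2

t = 1
fwd_a, bwd_a = birnn_states(tokens)[t]
fwd_b, bwd_b = birnn_states(altered)[t]
```

Changing the token two positions ahead leaves the forward state at `t` untouched and alters only the backward state.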
58In the standard sequence-to-sequence (Seq2Seq) model without attention (e.g., Cho et al., 2014), the entire source sentence is compressed into a single context vector $c$. How is this context vector applied during the decoding phase?
sequence modeling applications
Hard
A.It is used to mask out out-of-vocabulary words during the final softmax projection.
B.It is only used to initialize the first hidden state of the decoder $h_0$.
C.It replaces the word embedding input at every time step in the decoder.
D.It is provided as an additional input to the decoder at every time step $t$, alongside the previous target token and previous hidden state.
Correct Answer: It is provided as an additional input to the decoder at every time step $t$, alongside the previous target token and previous hidden state.
Explanation:
In the original RNN Encoder-Decoder architecture (Cho et al., 2014), the static context vector $c$ is fed into the decoder at every single time step. The decoder's hidden state update function is formulated as $h_t = f(h_{t-1}, y_{t-1}, c)$, ensuring the source information is explicitly available throughout generation.
59While LSTMs successfully mitigate the vanishing gradient problem through their additive cell state $c_t$, they are still susceptible to exploding gradients. Mathematically, why does the LSTM design not prevent exploding gradients?
long short term memory networks
Hard
A.The input gate can take negative values, flipping the sign of the gradients during BPTT.
B.The forget gate can exceed 1, causing exponential growth of the cell state over time.
C.The backpropagation through the output gate and the $\tanh$ activation involves matrix multiplications with the recurrent weight matrices at each time step, which can compound and explode.
D.The additive cell state requires gradients to be summed over time, and the sum of gradients will always approach infinity for large $T$.
Correct Answer: The backpropagation through the output gate and the $\tanh$ activation involves matrix multiplications with the recurrent weight matrices at each time step, which can compound and explode.
Explanation:
Although the cell state update provides an uninterrupted gradient path (preventing vanishing gradients), the gradients still flow through the nonlinear gates (input, output, forget) at each step. These gate derivatives involve standard matrix multiplications with the recurrent weights of each gate. If the spectral radius of these matrices is large, the gradients traversing through the gates can still compound and explode.
60When training sequence models with Cross-Entropy loss, 'Label Smoothing' is often applied by converting the hard target distribution (one-hot vector) into a soft target distribution. What is the primary theoretical justification for using Label Smoothing in autoregressive sequence models?
sequence training techniques
Hard
A.It allows the model to bypass the need for an attention mechanism by implicitly modeling word similarities.
B.It prevents the model from predicting the <EOS> token too early in sequence generation.
C.It completely eliminates exposure bias by allowing the model to sample incorrect tokens during Teacher Forcing.
D.It prevents the softmax logits from growing infinitely large, which reduces overconfidence and improves generalization (regularization).
Correct Answer: It prevents the softmax logits from growing infinitely large, which reduces overconfidence and improves generalization (regularization).
Explanation:
With one-hot targets, the Cross-Entropy loss is minimized only when the logit for the correct class approaches infinity compared to others, leading to overconfident models and overfitting. Label Smoothing assigns a small probability to incorrect classes, encouraging finite logits, thereby acting as a regularizer and improving generalization.
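The effect on the optimal logits can be seen in a small numeric sketch. It assumes one common smoothing variant (the true class gets $1 - \epsilon$ and the remaining mass is split evenly over the other classes), with $\epsilon = 0.1$ and four classes; the helper names and logit values are illustrative.

```python
import math

def smooth_labels(num_classes, target, eps=0.1):
    """True class gets 1 - eps; eps is shared equally by the other classes."""
    off = eps / (num_classes - 1)
    return [1.0 - eps if k == target else off for k in range(num_classes)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_dist, logits):
    probs = softmax(logits)
    return -sum(q * math.log(p) for q, p in zip(target_dist, probs) if q > 0)

hard = [1.0, 0.0, 0.0, 0.0]
smooth = smooth_labels(num_classes=4, target=0)

moderate = [math.log(27.0), 0.0, 0.0, 0.0]  # softmax equals the smoothed target
confident = [20.0, 0.0, 0.0, 0.0]           # extremely large correct-class logit

hard_prefers_confident = cross_entropy(hard, confident) < cross_entropy(hard, moderate)
smooth_prefers_moderate = cross_entropy(smooth, moderate) < cross_entropy(smooth, confident)
```

Under one-hot targets the loss keeps falling as the correct logit grows, while under smoothed targets the loss is lowest at a finite logit gap and rises again when the model becomes overconfident.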