Unit 3 - Practice Quiz

CSE472 60 Questions

1 Which of the following is the best example of sequential text data?

sequential text data Easy
A. A single, isolated pixel value in a digital image
B. A single categorical variable like 'Color'
C. A sentence in natural language where word order matters
D. A randomized, unordered bag of words

2 What is the primary purpose of a Recurrent Neural Network (RNN)?

recurrent neural networks Easy
A. To compress tabular data into lower dimensions
B. To process static images using spatial convolutional filters
C. To cluster unlabeled data points into groups
D. To process sequential data by maintaining an internal hidden state

3 In a standard RNN, the hidden state at time step $t$ depends on:

recurrent neural networks Easy
A. All future inputs from time $t+1$ to the end of the sequence
B. Only the current input at time $t$
C. Only the previous hidden state at time $t-1$
D. The current input at time $t$ and the hidden state at time $t-1$

4 Which major issue in standard RNNs does the Long Short-Term Memory (LSTM) network aim to solve?

long short term memory networks Easy
A. The need for massive amounts of labeled data
B. Vanishing and exploding gradients during training
C. The inability to process multi-channel images
D. The extremely high inference speed of standard RNNs

5 Which of the following components helps an LSTM cell control the flow of information?

long short term memory networks Easy
A. Convolutional filters (like kernels)
B. Gates (such as forget, input, and output gates)
C. Softmax activation functions applied to every weight
D. Max-pooling layers

6 How does a Gated Recurrent Unit (GRU) primarily differ from an LSTM in terms of architecture?

gated recurrent units Easy
A. A GRU does not use any gates and relies solely on convolutions
B. A GRU simplifies the architecture by combining the forget and input gates into a single update gate
C. A GRU uses a separate memory cell state distinct from its hidden state
D. A GRU has significantly more gates than an LSTM

7 Which of the following gates is explicitly present in a GRU?

gated recurrent units Easy
A. Forget gate
B. Output gate
C. Cell gate
D. Update gate

8 What is the main advantage of a Bidirectional RNN over a standard unidirectional RNN?

bidirectional RNNs Easy
A. It has access to both past and future context for a given time step
B. It trains twice as fast due to parallel processing
C. It requires exactly half the number of learnable parameters
D. It does not require backpropagation to update its weights

9 How are the hidden states typically combined at each time step in a Bidirectional RNN?

bidirectional RNNs Easy
A. They are subtracted from each other to find the difference
B. The forward and backward hidden states are concatenated or summed together
C. Only the forward hidden state is kept for final predictions
D. The backward state completely overwrites the forward state

10 Which of the following is a classic application of sequence modeling in NLP?

sequence modeling applications Easy
A. Machine translation
B. K-means clustering of pixels
C. Predicting housing prices from tabular data
D. Image segmentation

11 In sequence modeling, generating a shorter, concise version of a long document is known as:

sequence modeling applications Easy
A. Topic modeling
B. Named entity recognition
C. Text summarization
D. Part-of-speech tagging

12 What is the primary goal of sentiment classification?

sentiment classification Easy
A. To translate text from English to French
B. To determine the emotional tone or polarity (e.g., positive, negative) of a text
C. To correct grammatical errors in a sentence
D. To predict the next word in a sequence of text

13 When using a standard RNN for sentence-level sentiment classification, which output is typically used to make the final prediction?

sentiment classification Easy
A. The hidden state of the first time step
B. The output of the forget gate
C. The hidden state of the final time step
D. The randomly initialized word embeddings

14 Assigning an incoming email to either a 'Spam' or 'Not Spam' folder is an example of:

text classification Easy
A. Text classification
B. Speech recognition
C. Machine translation
D. Text generation

15 In the context of preparing data for sequence models, what is 'padding'?

sequence training techniques Easy
A. Removing the most frequent words from the text corpus
B. Increasing the size of the neural network's hidden layers
C. Adding dummy tokens to sequences so that all sequences in a batch have the same length
D. Multiplying the learning rate by a constant factor during training
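
As a rough illustration of padding, the sketch below right-pads lists of token IDs to the longest length in the batch and builds a matching mask. The function name `pad_batch` and the pad ID `0` are assumptions made for this example and are not part of the quiz.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one.

    Returns the padded batch and a mask (1 = real token, 0 = padding).
    `pad_id=0` is an assumed convention for this sketch.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


batch = [[5, 8, 2], [7], [3, 9]]
padded, mask = pad_batch(batch)
print(padded)  # [[5, 8, 2], [7, 0, 0], [3, 9, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

The mask is what lets a loss function or pooling step ignore the dummy positions later on.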

16 What does the 'teacher forcing' technique do during the training of an RNN for sequence generation?

teacher forcing Easy
A. It randomizes the inputs to make the model more robust
B. It feeds the actual ground truth output from the previous time step as the input to the current time step
C. It forces the model to memorize the entire training dataset by removing dropout
D. It uses a larger, pre-trained model to teach a smaller model

17 Why is gradient clipping often used when training sequence models like RNNs?

sequence training techniques Easy
A. To increase the learning rate automatically when it gets too low
B. To skip the backpropagation step entirely for efficiency
C. To prevent exploding gradients by capping them at a maximum threshold
D. To force the model to converge to exactly zero

18 What is Truncated Backpropagation Through Time (TBPTT)?

truncated backpropagation through time Easy
A. A technique to completely remove backpropagation from the training loop
B. An algorithm used to randomly initialize the weights of an RNN
C. A method to train the network exclusively on future time steps
D. A method to limit the number of time steps gradients are propagated backward to save memory and compute

19 Why is standard Backpropagation Through Time (BPTT) problematic for extremely long sequences?

truncated backpropagation through time Easy
A. It requires too much memory and suffers heavily from vanishing or exploding gradients
B. It is impossible to write in modern deep learning frameworks
C. It only works on image datasets, not text
D. It ignores the sequential nature of the text completely

20 Which metric is most commonly used to evaluate the overall correctness of a binary text classification model (e.g., positive vs. negative sentiment)?

evaluation metrics for sequence tasks Easy
A. Accuracy or F1-score
B. Mean Squared Error (MSE)
C. Word Error Rate (WER)
D. Bilingual Evaluation Understudy (BLEU)

21 Why is treating sentences merely as a bag-of-words insufficient for modeling sequential text data in complex NLP tasks?

sequential text data Medium
A. It ignores the structural order and contextual dependency of words.
B. It requires significantly more memory compared to recurrent models.
C. It inherently suffers from the vanishing gradient problem.
D. It cannot be used to handle out-of-vocabulary words.

22 In a standard Recurrent Neural Network (RNN), the hidden state at time step $t$, denoted as $h_t$, is calculated using which of the following inputs?

recurrent neural networks Medium
A. Only the current input
B. The previous input and the current hidden state
C. The current input and the previous hidden state
D. The current input and the previous output

23 What is the primary mathematical cause of the vanishing gradient problem when training standard RNNs on long sequences using Backpropagation Through Time?

recurrent neural networks Medium
A. The use of ReLU activation functions in the output layer
B. Overfitting to the temporal dimension due to a lack of training data
C. Repeated multiplication by weight matrices whose eigenvalues have magnitude less than 1
D. The gradients being truncated too early during backpropagation
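
The effect of repeated multiplication can be seen in a deliberately simplified scalar case. The sketch below assumes a linear recurrence $h_t = w \, h_{t-1}$ with no nonlinearity, so the gradient reaching a step `steps` positions back is scaled by `w ** steps`; the function name `gradient_scale` is hypothetical.

```python
def gradient_scale(w, steps):
    """Factor applied to a gradient after flowing back `steps` time steps
    through the scalar linear recurrence h_t = w * h_{t-1}."""
    scale = 1.0
    for _ in range(steps):
        scale *= w  # one multiplication by the recurrent weight per step
    return scale


for w in (0.9, 1.0, 1.1):
    # |w| < 1 shrinks the factor toward zero, |w| > 1 blows it up
    print(w, gradient_scale(w, 100))
```

In a real RNN the scalar `w` is replaced by a product of Jacobian matrices, but the same shrink-or-grow behaviour applies to their dominant eigenvalues.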

24 Which gate in a Long Short-Term Memory (LSTM) network is specifically responsible for deciding what information to discard from the internal cell state?

long short term memory networks Medium
A. Forget gate
B. Update gate
C. Input gate
D. Output gate

25 An LSTM cell computes the new cell state $c_t$ using the old cell state $c_{t-1}$, the forget gate $f_t$, the input gate $i_t$, and the candidate cell state $\tilde{c}_t$. Which equation correctly represents this update?

long short term memory networks Medium
A.
B.
C.
D.

26 How does the internal cell state in an LSTM help mitigate the vanishing gradient problem?

long short term memory networks Medium
A. It provides a more direct, uninterrupted gradient flow path through addition operations.
B. It resets the gradients to 1 at each time step to prevent shrinkage.
C. It relies exclusively on orthogonal weight initialization to maintain gradient scale.
D. It uses a step function to artificially boost gradients at every time step.

27 Compared to an LSTM, which of the following accurately describes a structural simplification made by a Gated Recurrent Unit (GRU)?

gated recurrent units Medium
A. A GRU adds an extra memory gate to replace the cell state.
B. A GRU relies exclusively on peephole connections.
C. A GRU eliminates the recurrent hidden state entirely.
D. A GRU combines the forget and input gates into a single update gate.

28 If a GRU's update gate $z_t$ is close to 1, what does this mathematically imply about the updated hidden state $h_t$?

gated recurrent units Medium
A. The hidden state will be completely reset to zero.
B. The new hidden state will heavily rely on the previous hidden state $h_{t-1}$.
C. The network will ignore the previous hidden state entirely.
D. The new hidden state will heavily rely on the candidate hidden state $\tilde{h}_t$.

29 In a BiLSTM used for Named Entity Recognition, how is the final contextual representation for the $t$-th word in a sentence typically obtained?

bidirectional RNNs Medium
A. By averaging the hidden states of all surrounding words in a fixed window.
B. By subtracting $\overleftarrow{h}_t$ from $\overrightarrow{h}_t$.
C. By concatenating the forward hidden state $\overrightarrow{h}_t$ and backward hidden state $\overleftarrow{h}_t$.
D. By using only the forward hidden state for the first half of the sequence.

30 Why are Bidirectional RNNs generally unsuitable for real-time autoregressive text generation tasks?

bidirectional RNNs Medium
A. They cannot use LSTM or GRU cells, leading to poor memory retention.
B. They require access to future tokens in the sequence which have not yet been generated.
C. They have a strict limitation on the vocabulary size they can output.
D. They suffer from exploding gradients significantly more than standard RNNs.

31 Which specific sequence modeling architecture is most appropriate for a Machine Translation task?

sequence modeling applications Medium
A. One-to-one
B. Many-to-one
C. Many-to-many (Encoder-Decoder)
D. One-to-many

32 In a standard encoder-decoder architecture for sequence-to-sequence tasks, what is the primary role of the context vector?

sequence modeling applications Medium
A. It applies dropout to prevent overfitting during decoding.
B. It directly predicts the final output class of the sequence.
C. It forces the decoder to output the exact tokens of the input sequence.
D. It encapsulates the information from the entire input sequence to initialize the decoder.

33 When using a unidirectional RNN for document-level sentiment classification, which hidden state is typically passed to the final dense classification layer?

sentiment classification Medium
A. The hidden state of the first time step.
B. A randomly sampled hidden state from the sequence.
C. The hidden state of the final time step.
D. The hidden state with the highest gradient magnitude.

34 To give a text classification RNN a more robust representation of a sentence and avoid the information bottleneck of using only the final hidden state, one could apply:

text classification Medium
A. Only the cell state $c_T$ while completely ignoring $h_T$.
B. A one-hot encoded bag-of-words vector exclusively.
C. Global max pooling or average pooling over all hidden states of the sequence.
D. The raw word embeddings directly concatenated to the final output.

35 What is a prominent and simple technique used to prevent the exploding gradient problem when training deep sequence models?

sequence training techniques Medium
A. Teacher forcing
B. Adding more recurrent layers
C. Gradient clipping
D. Label smoothing

36 Which regularization technique is specifically adapted for recurrent connections in sequence models by applying the exact same dropout mask across all time steps?

sequence training techniques Medium
A. Batch Normalization
B. Standard Dropout
C. Variational Dropout
D. L2 Regularization

37 During the training of a sequence generation model using Teacher Forcing, what input is fed to the decoder at time step $t$?

teacher forcing Medium
A. The model's own predicted token from time step $t-1$
B. A random token sampled from the vocabulary
C. The ground truth token from time step $t-1$
D. The context vector from the encoder

38 What is the primary computational advantage of using Truncated Backpropagation Through Time (TBPTT) over standard BPTT for very long sequences?

truncated backpropagation through time Medium
A. It limits the number of time steps the gradient flows backward, significantly reducing memory usage and computation time.
B. It entirely eliminates the vanishing gradient problem for all recurrent architectures.
C. It automatically tunes the learning rate hyperparameters of the RNN.
D. It allows the model to process sequences of infinite length without losing any past information.

39 In TBPTT, when a long document is split into chunks of length $k$, what happens to the hidden states and gradients at the boundary between one chunk and the next?

truncated backpropagation through time Medium
A. The hidden state is reset to zero at the start of every chunk to keep chunks entirely independent.
B. The hidden state and the gradients are both passed through to the previous chunk until the start of the document.
C. The hidden state is passed forward to the next chunk to retain context, but gradients are stopped from flowing backward into the previous chunk.
D. The hidden state is discarded, but the gradients flow backward across all chunk boundaries.

40 Which evaluation metric, commonly used for sequence generation tasks like Machine Translation, computes the geometric mean of n-gram precision multiplied by a brevity penalty?

evaluation metrics for sequence tasks Medium
A. Perplexity
B. ROUGE
C. F1-Score
D. BLEU

41 In a vanilla Recurrent Neural Network (RNN) with hidden state $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, the vanishing gradient problem occurs during Backpropagation Through Time (BPTT). Mathematically, what is the primary condition that causes the gradient to vanish exponentially with respect to the sequence length?

recurrent neural networks Hard
A. The input sequences have a variance approaching zero, causing the derivative to saturate.
B. The Frobenius norm of the input weight matrix $W_{xh}$ exceeds the sequence length $T$.
C. The spectral radius (largest absolute eigenvalue) of the recurrent weight matrix $W_{hh}$ is less than 1.
D. The dominant eigenvalue of the Jacobian matrix $\partial h_t / \partial h_{t-1}$ is strictly greater than 1.

42 Consider an LSTM cell where the forget gate $f_t$ is artificially clamped to $1$ (vector of ones) and the input gate $i_t$ is clamped to $0$ (vector of zeros) for all time steps $t$. Assuming the initial cell state $c_0 = 0$, how does the LSTM behave?

long short term memory networks Hard
A. The cell state $c_t$ will remain $0$ for all $t$, and the hidden state $h_t$ will only depend on the output gate $o_t$.
B. The cell state will accumulate gradients linearly over time, leading to exploding gradients.
C. The network behaves identically to a standard vanilla RNN, rendering the gating mechanisms useless.
D. The hidden state $h_t$ will be entirely dependent on the current input $x_t$.
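
The clamped-gate scenario can be traced by hand with the commonly used LSTM update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ and $h_t = o_t \odot \tanh(c_t)$. The sketch below assumes that formulation; the helper names and the candidate and output-gate values are made up for illustration.

```python
import math


def lstm_cell_update(c_prev, f, i, c_tilde):
    """Element-wise cell-state update c_t = f * c_{t-1} + i * c_tilde."""
    return [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]


def lstm_hidden(c, o):
    """Hidden state h_t = o * tanh(c_t)."""
    return [ok * math.tanh(ck) for ok, ck in zip(o, c)]


c = [0.0, 0.0]                      # initial cell state c_0 = 0
for _ in range(5):
    c_tilde = [0.7, -0.3]           # arbitrary candidate values (hypothetical)
    c = lstm_cell_update(c, f=[1.0, 1.0], i=[0.0, 0.0], c_tilde=c_tilde)

h = lstm_hidden(c, o=[0.9, 0.2])    # arbitrary output-gate values (hypothetical)
print(c)  # [0.0, 0.0]
print(h)  # [0.0, 0.0]
```

With the input gate closed, the candidate never enters the cell, and with the forget gate fully open, the initial zero is simply carried forward.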

43 Assume a GRU and an LSTM both process an input of dimension $d$ and have a hidden state of dimension $h$. Ignoring biases for simplicity, what is the exact ratio of the number of trainable weight parameters in the GRU to the number of trainable weight parameters in the LSTM?

gated recurrent units Hard
A.
B.
C.
D.
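
One way to check a parameter-count ratio like this is to count the weight blocks directly. The sketch below assumes the textbook formulations, in which each gate or candidate has one input-to-hidden matrix and one hidden-to-hidden matrix, with no biases, peepholes, or projection layers; the function names are hypothetical.

```python
from fractions import Fraction


def lstm_weight_count(d, h):
    """4 blocks (input, forget, output gates + candidate), each with an
    input-to-hidden (h x d) and a hidden-to-hidden (h x h) matrix."""
    return 4 * (h * d + h * h)


def gru_weight_count(d, h):
    """3 blocks (update, reset gates + candidate), same two matrices each."""
    return 3 * (h * d + h * h)


for d, h in [(10, 20), (300, 512)]:
    ratio = Fraction(gru_weight_count(d, h), lstm_weight_count(d, h))
    print(d, h, ratio)  # the ratio is 3/4 for both settings
```

Because both counts share the factor `h * d + h * h`, the ratio does not depend on the chosen dimensions under these assumptions.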

44 Why is a Bidirectional RNN (BiRNN) fundamentally unsuitable for standard autoregressive causal language modeling (e.g., predicting the next word $w_{t+1}$ given $w_1, \dots, w_t$)?

bidirectional RNNs Hard
A. The concatenated hidden state dimension is too large for the softmax layer.
B. BiRNNs cannot handle variable-length sequences during inference.
C. BPTT cannot be applied to the backward pass in a real-time sequential data stream.
D. The backward RNN requires access to future tokens, which leaks the target information ($w_{t+1}$) during the prediction at step $t$.

45 In sequence-to-sequence training, 'exposure bias' occurs when a model is trained with Teacher Forcing but tested in a free-running mode. Which of the following best describes how 'Scheduled Sampling' aims to mitigate this specific issue?

teacher forcing Hard
A. During training, it stochastically decides whether to feed the ground-truth previous token or the model's own previous prediction to the next step, with the probability of using ground-truth decaying over time.
B. It applies dropout to the decoder's recurrent connections with a probability that decreases over time.
C. It randomly masks out tokens in the input sequence to force the model to rely on its hidden state.
D. It dynamically alters the loss function from Cross-Entropy to Reinforcement Learning (e.g., REINFORCE) as training progresses.
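
The stochastic choice between ground truth and model prediction can be sketched in a few lines. The linear decay schedule, the function names, and the placeholder tokens `"gold"` and `"pred"` are all assumptions for illustration; other decay shapes are equally plausible.

```python
import random


def teacher_forcing_prob(step, total_steps):
    """Probability of feeding the ground-truth token, decayed linearly
    from 1.0 to 0.0 over training (an assumed schedule)."""
    return max(0.0, 1.0 - step / total_steps)


def choose_decoder_input(gold_prev, model_prev, p_gold, rng):
    """Pick the previous-token input for the next decoder step."""
    return gold_prev if rng.random() < p_gold else model_prev


rng = random.Random(0)  # seeded so repeated runs make the same choices
for step in (0, 500, 1000):
    p = teacher_forcing_prob(step, total_steps=1000)
    token = choose_decoder_input("gold", "pred", p, rng)
    print(step, p, token)
```

Early in training the decoder sees almost only ground truth; late in training it mostly consumes its own predictions, which is closer to the free-running test condition.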

46 In a standard implementation of Truncated Backpropagation Through Time (TBPTT) configured with forward step $k_1$ and backward step $k_2$ (where $k_1 = k_2$), what happens to the hidden state and the computational graph at the boundary between chunks?

truncated backpropagation through time Hard
A. The hidden state is passed forward to the next chunk, but it is detached from the computational graph, preventing gradients from flowing back into the previous chunk.
B. The hidden state is passed forward, and gradients are accumulated indefinitely, effectively making it equivalent to full BPTT.
C. The hidden state is reset to zero, and the computational graph is retained to allow gradient flow.
D. Both the hidden state and the computational graph are discarded, treating each chunk as an independent sequence.

47 To prevent exploding gradients in deep RNNs, practitioners use gradient clipping. Consider global norm clipping versus value clipping. Why is global norm clipping generally preferred over value-based clipping for sequence models?

sequence training techniques Hard
A. Global norm clipping guarantees that the loss will decrease monotonically, whereas value clipping does not.
B. Value clipping cannot prevent the hidden state from exploding during the forward pass.
C. Value clipping requires calculating the norm of the gradient, which is computationally expensive for large sequence models.
D. Global norm clipping scales all parameter gradients by the same factor, preserving the overall direction of the gradient vector.
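
The difference between the two clipping styles is easy to see on a two-element gradient. The following is a minimal pure-Python sketch; the function names are hypothetical, and real frameworks apply the same idea across all parameter tensors at once.

```python
import math


def clip_by_value(grads, limit):
    """Clamp each gradient component independently to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


def clip_by_global_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = [3.0, 4.0]                       # L2 norm is 5.0
print(clip_by_value(grads, 2.5))         # [2.5, 2.5]  (3:4 ratio is lost)
print(clip_by_global_norm(grads, 2.5))   # [1.5, 2.0]  (3:4 ratio is kept)
```

Value clipping flattens the components to the same magnitude here, while norm clipping shortens the vector without rotating it.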

48 When designing an RNN for document-level text classification, taking the final hidden state $h_T$ can create a bottleneck. If you instead apply max-pooling over the sequence of hidden states $h_1, \dots, h_T$, what distinct representational advantage does this provide over the final hidden state?

text classification Hard
A. It reduces the dimensionality of the hidden state before passing it to the linear classification layer.
B. It completely eliminates the vanishing gradient problem for the early tokens in the document.
C. It forces the RNN to behave like a Bag-of-Words model, discarding positional information entirely.
D. It captures the most salient features (highest activation values) across the entire sequence, regardless of their position, mitigating the bias towards recent tokens.
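
Max-pooling over time can be illustrated with a toy sequence of hidden states. The values below are invented, and `max_pool` is a hypothetical helper that takes the element-wise maximum across time steps.

```python
def max_pool(hidden_states):
    """Element-wise max over time: one value per hidden dimension."""
    return [max(dim) for dim in zip(*hidden_states)]


# Three time steps, hidden size 2 (made-up activations)
states = [[0.9, -0.2],
          [0.1, 0.0],
          [0.2, 0.5]]

print(states[-1])        # final hidden state: [0.2, 0.5]
print(max_pool(states))  # [0.9, 0.5]
```

The strong activation `0.9` from the first step survives pooling even though it has faded by the final step.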

49 For a sequence $w_1, \dots, w_N$, the perplexity (PPL) of a language model is defined as $\mathrm{PPL} = P(w_1, \dots, w_N)^{-1/N}$. If the model calculates the average categorical cross-entropy loss $\mathcal{L}$ using natural logarithms, what is the exact mathematical relationship between $\mathcal{L}$ and PPL?

evaluation metrics for sequence tasks Hard
A.
B.
C.
D.
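
The link between average cross-entropy and perplexity can be checked numerically. The sketch below assumes the usual definition of perplexity as the inverse geometric mean of the per-token probabilities; the probabilities and function names are invented for illustration.

```python
import math


def avg_cross_entropy(token_probs):
    """Mean negative natural-log likelihood of the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity_from_probs(token_probs):
    """PPL as the inverse geometric mean of the per-token probabilities."""
    product = math.prod(token_probs)
    return product ** (-1.0 / len(token_probs))


# Probabilities the model assigned to the correct token at each position
probs = [0.5, 0.25, 0.125, 0.5]
loss = avg_cross_entropy(probs)
print(math.exp(loss), perplexity_from_probs(probs))  # the two values agree
```

Exponentiating the natural-log loss recovers the same number as the direct definition, up to floating-point error.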

50 In a variant of the LSTM known as 'LSTM with peephole connections' (Gers & Schmidhuber, 2000), the gates are allowed to inspect the cell state. Specifically, which cell state is used to compute the forget gate $f_t$ and input gate $i_t$, versus the output gate $o_t$?

long short term memory networks Hard
A. and use ; uses .
B. All three gates use .
C. and use ; uses .
D. All three gates use .

51 In a GRU, the candidate hidden state is computed as $\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$. If the reset gate $r_t$ approaches a zero vector, what is the structural implication for the sequence model at time $t$?

gated recurrent units Hard
A. The network architecture degrades into a strictly linear transformation of the input sequence.
B. The candidate state acts as if it is reading the first symbol of a new sequence, ignoring previous hidden states.
C. The update gate $z_t$ is forced to 1, causing the hidden state to remain exactly $h_{t-1}$.
D. The GRU completely forgets its entire history, and the final hidden state is determined solely by $x_t$.

52 Why is the ReLU activation function rarely used in vanilla Recurrent Neural Networks compared to $\tanh$, despite ReLU's success in mitigating vanishing gradients in deep Feedforward and Convolutional networks?

recurrent neural networks Hard
A. ReLU completely prevents gradient flow for negative inputs, rendering BPTT impossible across multiple time steps.
B. ReLU causes the hidden state to become non-differentiable at exactly zero, which crashes recurrent autodifferentiation engines.
C. Because RNNs reuse the same weight matrix at every time step, the unbounded positive output of ReLU often leads to exponentially exploding activations and gradients.
D. The memory requirements for caching ReLU activations across time steps are significantly higher than for $\tanh$.

53 A $k$-th order Markov model assumes the probability of the next word depends only on the previous $k$ words. How does an unrolled standard RNN theoretically bypass this $k$-th order Markov assumption for sequential text data?

sequential text data Hard
A. It does not bypass it; an RNN is mathematically equivalent to a 1st-order Markov model on the input space.
B. By updating its recurrent weights dynamically based on the length of the input sequence.
C. By utilizing attention mechanisms that provide direct access to all past words.
D. By maintaining a continuous hidden state vector that acts as a lossy compression of the entire unbounded history $w_1, \dots, w_t$.

54 Assume a sequence of length . You train an RNN using Truncated BPTT with forward chunk size and backward window . What is the total number of unrolled time steps processed during the backward passes for one complete epoch of this single sequence?

truncated backpropagation through time Hard
A. $1000$
B. $2500$
C. $50000$
D. $50$
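
Counts of this kind can be worked out with a small helper. The sketch below assumes the TBPTT($k_1$, $k_2$) convention in which a backward pass is triggered every $k_1$ forward steps and unrolls at most $k_2$ steps; the function name and the numbers used are hypothetical and are not the values from the question.

```python
def tbptt_backward_steps(seq_len, k1, k2):
    """Total unrolled steps over all backward passes in TBPTT(k1, k2).

    A backward pass runs after every k1 forward steps and unrolls at most
    k2 steps (fewer when fewer than k2 steps have been seen so far).
    """
    total = 0
    for t in range(k1, seq_len + 1, k1):
        total += min(k2, t)
    return total


# Hypothetical settings, chosen only to illustrate the arithmetic
print(tbptt_backward_steps(seq_len=120, k1=30, k2=30))  # 120
print(tbptt_backward_steps(seq_len=120, k1=10, k2=30))  # 330
```

When $k_1 = k_2$ each time step is backpropagated through exactly once; when $k_1 < k_2$ the backward windows overlap and the total grows.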

55 In the BLEU score calculation for sequence tasks, the Brevity Penalty (BP) is introduced to penalize short translations. Let $c$ be the length of the candidate translation and $r$ be the effective reference corpus length. Under what exact condition is $\mathrm{BP} = 1$?

evaluation metrics for sequence tasks Hard
A. only
B.
C.
D. only
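
The brevity penalty itself is a short piecewise function. The sketch below follows the usual BLEU definition, with `c` as the candidate length and `r` as the effective reference length; the function name is hypothetical, and returning `0.0` for an empty candidate is an assumption of this sketch.

```python
import math


def brevity_penalty(c, r):
    """BLEU brevity penalty for candidate length c and reference length r."""
    if c == 0:
        return 0.0                   # assumed handling of an empty candidate
    if c > r:
        return 1.0                   # long enough: no penalty
    return math.exp(1.0 - r / c)     # candidate no longer than reference


for c, r in [(12, 10), (10, 10), (5, 10)]:
    print(c, r, brevity_penalty(c, r))
```

Note that the exponential branch also evaluates to exactly 1 when `c == r`, so a candidate is only penalized when it is strictly shorter than the reference.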

56 In a Hierarchical Attention Network (HAN) used for document-level sentiment classification, the architecture consists of word-level and sentence-level encoders. Which of the following best represents the sequence of operations for generating the final document vector $v$?

sentiment classification Hard
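A. Word embedding $\rightarrow$ Word-level BiGRU $\rightarrow$ Sentence Attention $\rightarrow$ Sentence-level BiGRU $\rightarrow$ Document Attention.
B. Word embedding $\rightarrow$ Sentence-level BiGRU $\rightarrow$ Word Attention $\rightarrow$ Word-level BiGRU $\rightarrow$ Document Attention.
C. Word embedding $\rightarrow$ Word-level BiGRU $\rightarrow$ Word Attention $\rightarrow$ Sentence-level BiGRU $\rightarrow$ Document Attention.
D. Word embedding $\rightarrow$ Word Attention $\rightarrow$ Word-level BiGRU $\rightarrow$ Sentence Attention $\rightarrow$ Sentence-level BiGRU.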

57 When performing Named Entity Recognition (NER) on a sequence of tokens $x_1, \dots, x_T$ using a Bidirectional RNN, the output for token $x_t$ relies on the concatenated state $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$. If an entity spans from $x_t$ to $x_{t+k}$, how does $h_t$ capture the dependency on $x_{t+k}$?

bidirectional RNNs Hard
A. Through a subsequent CRF layer, since the BiRNN hidden states themselves cannot capture future dependencies.
B. Through the forward hidden state $\overrightarrow{h}_t$, which has processed $x_{t+k}$ before reaching $x_t$.
C. Through the backward hidden state $\overleftarrow{h}_t$, which processes the sequence from $T$ down to $1$ and thus incorporates $x_{t+k}$ before reaching $x_t$.
D. By employing teacher forcing during inference to feed $x_{t+k}$ into $h_t$.

58 In the standard sequence-to-sequence (Seq2Seq) model without attention (e.g., Cho et al., 2014), the entire source sentence is compressed into a single context vector $c$. How is this context vector applied during the decoding phase?

sequence modeling applications Hard
A. It is used to mask out out-of-vocabulary words during the final softmax projection.
B. It is only used to initialize the first hidden state of the decoder $s_0$.
C. It replaces the word embedding input at every time step in the decoder.
D. It is provided as an additional input to the decoder at every time step $t$, alongside the previous target token and previous hidden state.

59 While LSTMs successfully mitigate the vanishing gradient problem through their additive cell state $c_t$, they are still susceptible to exploding gradients. Mathematically, why does the LSTM design not prevent exploding gradients?

long short term memory networks Hard
A. The input gate can take negative values, flipping the sign of the gradients during BPTT.
B. The forget gate can exceed 1, causing exponential growth of the cell state over time.
C. The backpropagation through the output gate and the $\tanh$ activation involves matrix multiplications with the recurrent weight matrices at each time step, which can compound and explode.
D. The additive cell state requires gradients to be summed over time, and the sum of gradients will always approach infinity for large $T$.

60 When training sequence models with Cross-Entropy loss, 'Label Smoothing' is often applied by converting the hard target distribution (one-hot vector) into a soft target distribution. What is the primary theoretical justification for using Label Smoothing in autoregressive sequence models?

sequence training techniques Hard
A. It allows the model to bypass the need for an attention mechanism by implicitly modeling word similarities.
B. It prevents the model from predicting the <EOS> token too early in sequence generation.
C. It completely eliminates exposure bias by allowing the model to sample incorrect tokens during Teacher Forcing.
D. It prevents the softmax logits from growing infinitely large, which reduces overconfidence and improves generalization (regularization).