1 $Which of the following is the best example of sequential text data?$

sequential text data Easy

A.

A randomized, unordered bag of words

B.

A sentence in natural language where word order matters

C.

A single, isolated pixel value in a digital image

D.

A single categorical variable like 'Color'

2 $What is the primary purpose of a Recurrent Neural Network (RNN)?$

recurrent neural networks Easy

A.

To process sequential data by maintaining an internal hidden state

B.

To process static images using spatial convolutional filters

C.

To compress tabular data into lower dimensions

D.

To cluster unlabeled data points into groups

3 $In a standard RNN, the hidden state $h_t$ at time step $t$ depends on:$

recurrent neural networks Easy

A.

The current input at time $t$ and the hidden state at time $t-1$

B.

Only the current input at time $t$

C.

All future inputs from time $t+1$ to the end of the sequence

D.

Only the previous hidden state at time $t-1$
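
For reference, the recurrence this question describes can be sketched in a few lines. This is a minimal illustration, assuming a tanh activation and hypothetical names (`rnn_step`, `W_xh`, `W_hh`, `b_h`) with toy weights that do not come from the quiz.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        pre = b_h[i]
        pre += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))        # current input
        pre += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))  # previous hidden state
        h_t.append(math.tanh(pre))
    return h_t

# Toy sizes (assumed): 2-dimensional input, 2-dimensional hidden state.
W_xh = [[0.5, -0.2], [0.1, 0.3]]
W_hh = [[0.4, 0.0], [0.0, 0.4]]
b_h = [0.0, 0.0]

h = [0.0, 0.0]  # initial hidden state
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```

Each call reads only the current input and the previous hidden state, so anything the network knows about earlier tokens has to be carried in `h`.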

4 $Which major issue in standard RNNs does the Long Short-Term Memory (LSTM) network aim to solve?$

long short term memory networks Easy

A.

Vanishing and exploding gradients during training

B.

The extremely high inference speed of standard RNNs

C.

The inability to process multi-channel images

D.

The need for massive amounts of labeled data

5 $Which of the following components helps an LSTM cell control the flow of information?$

long short term memory networks Easy

A.

Convolutional filters (like kernels)

B.

Max-pooling layers

C.

Softmax activation functions applied to every weight

D.

Gates (such as forget, input, and output gates)

6 $How does a Gated Recurrent Unit (GRU) primarily differ from an LSTM in terms of architecture?$

gated recurrent units Easy

A.

A GRU does not use any gates and relies solely on convolutions

B.

A GRU has significantly more gates than an LSTM

C.

A GRU uses a separate memory cell state distinct from its hidden state

D.

A GRU simplifies the architecture by combining the forget and input gates into a single update gate

7 $Which of the following gates is explicitly present in a GRU?$

gated recurrent units Easy

A.

Output gate

B.

Cell gate

C.

Forget gate

D.

Update gate

8 $What is the main advantage of a Bidirectional RNN over a standard unidirectional RNN?$

bidirectional RNNs Easy

A.

It has access to both past and future context for a given time step

B.

It does not require backpropagation to update its weights

C.

It requires exactly half the number of learnable parameters

D.

It trains twice as fast due to parallel processing

9 $How are the hidden states typically combined at each time step in a Bidirectional RNN?$

bidirectional RNNs Easy

A.

They are subtracted from each other to find the difference

B.

The forward and backward hidden states are concatenated or summed together

C.

The backward state completely overwrites the forward state

D.

Only the forward hidden state is kept for final predictions

10 $Which of the following is a classic application of sequence modeling in NLP?$

sequence modeling applications Easy

A.

Machine translation

B.

Image segmentation

C.

K-means clustering of pixels

D.

Predicting housing prices from tabular data

11 $In sequence modeling, generating a shorter, concise version of a long document is known as:$

sequence modeling applications Easy

A.

Named entity recognition

B.

Topic modeling

C.

Text summarization

D.

Part-of-speech tagging

12 $What is the primary goal of sentiment classification?$

sentiment classification Easy

A.

To predict the next word in a sequence of text

B.

To translate text from English to French

C.

To determine the emotional tone or polarity (e.g., positive, negative) of a text

D.

To correct grammatical errors in a sentence

13 $When using a standard RNN for sentence-level sentiment classification, which output is typically used to make the final prediction?$

sentiment classification Easy

A.

The hidden state of the first time step

B.

The hidden state of the final time step

C.

The output of the forget gate

D.

The randomly initialized word embeddings

14 $Assigning an incoming email to either a 'Spam' or 'Not Spam' folder is an example of:$

text classification Easy

A.

Machine translation

B.

Speech recognition

C.

Text classification

D.

Text generation

15 $In the context of preparing data for sequence models, what is 'padding'?$

sequence training techniques Easy

A.

Multiplying the learning rate by a constant factor during training

B.

Increasing the size of the neural network's hidden layers

C.

Removing the most frequent words from the text corpus

D.

Adding dummy tokens to sequences so that all sequences in a batch have the same length
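
As an illustration of padding, the sketch below right-pads a batch of token-id lists to a common length and also returns a mask marking the real tokens. The function name `pad_batch` and the choice of `0` as the padding id are assumptions made for this example.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding, so a loss can skip padded positions.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

batch = [[5, 8, 2], [7], [4, 9]]
padded, mask = pad_batch(batch)
```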

16 $What does the 'teacher forcing' technique do during the training of an RNN for sequence generation?$

teacher forcing Easy

A.

It randomizes the inputs to make the model more robust

B.

It uses a larger, pre-trained model to teach a smaller model

C.

It feeds the actual ground truth output from the previous time step as the input to the current time step

D.

It forces the model to memorize the entire training dataset by removing dropout

17 $Why is gradient clipping often used when training sequence models like RNNs?$

sequence training techniques Easy

A.

To increase the learning rate automatically when it gets too low

B.

To prevent exploding gradients by capping them at a maximum threshold

C.

To force the model to converge to exactly zero

D.

To skip the backpropagation step entirely for efficiency

18 $What is Truncated Backpropagation Through Time (TBPTT)?$

truncated backpropagation through time Easy

A.

A method to train the network exclusively on future time steps

B.

A method to limit the number of time steps gradients are propagated backward to save memory and compute

C.

A technique to completely remove backpropagation from the training loop

D.

An algorithm used to randomly initialize the weights of an RNN

19 $Why is standard Backpropagation Through Time (BPTT) problematic for extremely long sequences?$

truncated backpropagation through time Easy

A.

It is impossible to write in modern deep learning frameworks

B.

It only works on image datasets, not text

C.

It ignores the sequential nature of the text completely

D.

It requires too much memory and suffers heavily from vanishing or exploding gradients

20 $Which metric is most commonly used to evaluate the overall correctness of a binary text classification model (e.g., positive vs. negative sentiment)?$

evaluation metrics for sequence tasks Easy

A.

Bilingual Evaluation Understudy (BLEU)

B.

Word Error Rate (WER)

C.

Accuracy or F1-score

D.

Mean Squared Error (MSE)
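
A small sketch of how accuracy and F1 are computed for binary labels may help when comparing these metrics. The helper name `accuracy_and_f1` and the toy labels are assumptions for illustration only.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 of the positive class) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
```

Accuracy counts all correct predictions, while F1 balances precision and recall on the positive class, which is why F1 is often reported alongside accuracy when classes are imbalanced.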

21 $Why is treating sentences merely as a bag-of-words insufficient for modeling sequential text data in complex NLP tasks?$

sequential text data Medium

A.

It ignores the structural order and contextual dependency of words.

B.

It inherently suffers from the vanishing gradient problem.

C.

It cannot be used to handle out-of-vocabulary words.

D.

It requires significantly more memory compared to recurrent models.

22 $In a standard Recurrent Neural Network (RNN), the hidden state at time step $t$, denoted as $h_t$, is calculated using which of the following inputs?$

recurrent neural networks Medium

A.

The current input and the previous hidden state

B.

The current input and the previous output

C.

Only the current input

D.

The previous input and the current hidden state

23 $What is the primary mathematical cause of the vanishing gradient problem when training standard RNNs on long sequences using Backpropagation Through Time?$

recurrent neural networks Medium

A.

Overfitting to the temporal dimension due to a lack of training data

B.

Repeated multiplication by weight matrices whose eigenvalues have magnitude less than 1

C.

The gradients being truncated too early during backpropagation

D.

The use of ReLU activation functions in the output layer

24 $Which gate in a Long Short-Term Memory (LSTM) network is specifically responsible for deciding what information to discard from the internal cell state?$

long short term memory networks Medium

A.

Input gate

B.

Forget gate

C.

Output gate

D.

Update gate

25 $An LSTM cell computes the new cell state $c_t$ using the old cell state $c_{t-1}$, the forget gate $f_t$, the input gate $i_t$, and the candidate cell state $\tilde{c}_t$. Which equation correctly represents this update?$

long short term memory networks Medium

A.

B.

C.

D.
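
For reference, the commonly used LSTM formulation updates the cell state elementwise as "forget gate times old cell state, plus input gate times candidate". The sketch below assumes that standard form; the function name `lstm_cell_state` and the toy values are hypothetical.

```python
def lstm_cell_state(c_prev, f_t, i_t, c_tilde):
    """Elementwise update: c_t = f_t * c_prev + i_t * c_tilde."""
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]

c_prev = [1.0, -2.0]
candidate = [0.5, 0.5]

# Forget gate open, input gate closed: the old cell state is carried over unchanged.
keep_all = lstm_cell_state(c_prev, f_t=[1.0, 1.0], i_t=[0.0, 0.0], c_tilde=candidate)

# Forget gate closed, input gate open: the cell state is replaced by the candidate.
overwrite = lstm_cell_state(c_prev, f_t=[0.0, 0.0], i_t=[1.0, 1.0], c_tilde=candidate)
```

Because the old cell state enters through an addition rather than a repeated matrix product, gradients can flow along it with less shrinkage.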

26 $How does the internal cell state in an LSTM help mitigate the vanishing gradient problem?$

long short term memory networks Medium

A.

It provides a more direct, uninterrupted gradient flow path through addition operations.

B.

It relies exclusively on orthogonal weight initialization to maintain gradient scale.

C.

It uses a step function to artificially boost gradients at every time step.

D.

It resets the gradients to 1 at each time step to prevent shrinkage.

27 $Compared to an LSTM, which of the following accurately describes a structural simplification made by a Gated Recurrent Unit (GRU)?$

gated recurrent units Medium

A.

A GRU combines the forget and input gates into a single update gate.

B.

A GRU adds an extra memory gate to replace the cell state.

C.

A GRU eliminates the recurrent hidden state entirely.

D.

A GRU relies exclusively on peephole connections.

28 $If a GRU's update gate $z_t$ is close to 1, what does this mathematically imply about the updated hidden state $h_t$?$

gated recurrent units Medium

A.

The network will ignore the previous hidden state entirely.

B.

The new hidden state will heavily rely on the candidate hidden state $\tilde{h}_t$.

C.

The hidden state will be completely reset to zero.

D.

The new hidden state will heavily rely on the previous hidden state $h_{t-1}$.

29 $In a BiLSTM used for Named Entity Recognition, how is the final contextual representation for the $i$-th word in a sentence typically obtained?$

bidirectional RNNs Medium

A.

By subtracting $\overleftarrow{h}_i$ from $\overrightarrow{h}_i$.

B.

By averaging the hidden states of all surrounding words in a fixed window.

C.

By concatenating the forward hidden state $\overrightarrow{h}_i$ and backward hidden state $\overleftarrow{h}_i$.

D.

By using only the forward hidden state for the first half of the sequence.

30 $Why are Bidirectional RNNs generally unsuitable for real-time autoregressive text generation tasks?$

bidirectional RNNs Medium

A.

They cannot use LSTM or GRU cells, leading to poor memory retention.

B.

They suffer from exploding gradients significantly more than standard RNNs.

C.

They have a strict limitation on the vocabulary size they can output.

D.

They require access to future tokens in the sequence which have not yet been generated.

31 $Which specific sequence modeling architecture is most appropriate for a Machine Translation task?$

sequence modeling applications Medium

A.

Many-to-many (Encoder-Decoder)

B.

One-to-many

C.

Many-to-one

D.

One-to-one

32 $In a standard encoder-decoder architecture for sequence-to-sequence tasks, what is the primary role of the context vector?$

sequence modeling applications Medium

A.

It forces the decoder to output the exact tokens of the input sequence.

B.

It encapsulates the information from the entire input sequence to initialize the decoder.

C.

It directly predicts the final output class of the sequence.

D.

It applies dropout to prevent overfitting during decoding.

33 $When using a unidirectional RNN for document-level sentiment classification, which hidden state is typically passed to the final dense classification layer?$

sentiment classification Medium

A.

A randomly sampled hidden state from the sequence.

B.

The hidden state of the final time step.

C.

The hidden state of the first time step.

D.

The hidden state with the highest gradient magnitude.

34 $To give a text classification RNN a more robust representation of a sentence and avoid the information bottleneck of using only the final hidden state, one could apply:$

text classification Medium

A.

The raw word embeddings directly concatenated to the final output.

B.

Only the cell state $c_T$ while completely ignoring the hidden state $h_T$.

C.

Global max pooling or average pooling over all hidden states of the sequence.

D.

A one-hot encoded bag-of-words vector exclusively.

35 $What is a prominent and simple technique used to prevent the exploding gradient problem when training deep sequence models?$

sequence training techniques Medium

A.

Gradient clipping

B.

Teacher forcing

C.

Adding more recurrent layers

D.

Label smoothing

36 $Which regularization technique is specifically adapted for recurrent connections in sequence models by applying the exact same dropout mask across all time steps?$

sequence training techniques Medium

A.

Variational Dropout

B.

L2 Regularization

C.

Batch Normalization

D.

Standard Dropout

37 $During the training of a sequence generation model using Teacher Forcing, what input is fed to the decoder at time step $t$?$

teacher forcing Medium

A.

The context vector from the encoder

B.

A random token sampled from the vocabulary

C.

The model's own predicted token from time step $t-1$

D.

The ground truth token from time step $t-1$

38 $What is the primary computational advantage of using Truncated Backpropagation Through Time (TBPTT) over standard BPTT for very long sequences?$

truncated backpropagation through time Medium

A.

It limits the number of time steps the gradient flows backward, significantly reducing memory usage and computation time.

B.

It entirely eliminates the vanishing gradient problem for all recurrent architectures.

C.

It allows the model to process sequences of infinite length without losing any past information.

D.

It automatically tunes the learning rate hyperparameters of the RNN.

39 $In TBPTT, when a long document is split into fixed-length chunks, what happens to the hidden states and gradients at the boundary between one chunk and the next?$

truncated backpropagation through time Medium

A.

The hidden state is reset to zero at the start of every chunk to keep chunks entirely independent.

B.

The hidden state is discarded, but the gradients flow backward across all chunk boundaries.

C.

The hidden state is passed forward to the next chunk to retain context, but gradients are stopped from flowing backward into the previous chunk.

D.

The hidden state and the gradients are both passed through to the previous chunk until the start of the document.

40 $Which evaluation metric, commonly used for sequence generation tasks like Machine Translation, computes the geometric mean of n-gram precision multiplied by a brevity penalty?$

evaluation metrics for sequence tasks Medium

A.

ROUGE

B.

F1-Score

C.

Perplexity

D.

BLEU

41 $In a vanilla Recurrent Neural Network (RNN) with hidden state $h_t$, the vanishing gradient problem occurs during Backpropagation Through Time (BPTT). Mathematically, what is the primary condition that causes the gradient to vanish exponentially with respect to the sequence length $T$?$

recurrent neural networks Hard

A.

The input sequences have a variance approaching zero, causing the derivative to saturate.

B.

The dominant eigenvalue of the Jacobian matrix is strictly greater than 1.

C.

The spectral radius (largest absolute eigenvalue) of the recurrent weight matrix $W_{hh}$ is less than 1.

D.

The Frobenius norm of the input weight matrix $W_{xh}$ exceeds the sequence length $T$.

42 $Consider an LSTM cell where the forget gate $f_t$ is artificially clamped to $1$ (vector of ones) and the input gate $i_t$ is clamped to $0$ (vector of zeros) for all time steps $t$. Assuming the initial cell state $c_0 = 0$, how does the LSTM behave?$

long short term memory networks Hard

A.

The cell state $c_t$ will remain $0$ for all $t$, and the hidden state $h_t$ will only depend on the output gate $o_t$.

B.

The hidden state $h_t$ will be entirely dependent on the current input $x_t$.

C.

The cell state will accumulate gradients linearly over time, leading to exploding gradients.

D.

The network behaves identically to a standard vanilla RNN, rendering the gating mechanisms useless.

43 $Assume a GRU and an LSTM both process an input of dimension $d$ and have a hidden state of dimension $h$. Ignoring biases for simplicity, what is the exact ratio of the number of trainable weight parameters in the GRU to the number of trainable weight parameters in the LSTM?$

gated recurrent units Hard

A.

B.

C.

D.
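
The parameter comparison can be checked by counting weights directly. This sketch assumes the standard formulations without peephole connections: an LSTM has four weight blocks (three gates plus the candidate) and a GRU has three (two gates plus the candidate), each block holding one input matrix and one recurrent matrix. The function names are hypothetical.

```python
from fractions import Fraction

def gated_rnn_weights(input_dim, hidden_dim, n_blocks):
    """Weights (no biases) for n_blocks of [hidden x input] + [hidden x hidden] matrices."""
    return n_blocks * (hidden_dim * input_dim + hidden_dim * hidden_dim)

def gru_to_lstm_ratio(input_dim, hidden_dim):
    gru = gated_rnn_weights(input_dim, hidden_dim, n_blocks=3)   # update, reset, candidate
    lstm = gated_rnn_weights(input_dim, hidden_dim, n_blocks=4)  # forget, input, output, candidate
    return Fraction(gru, lstm)

ratio = gru_to_lstm_ratio(input_dim=100, hidden_dim=256)
```

Under these assumptions the per-block size cancels, so the ratio does not depend on the chosen dimensions.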

44 $Why is a Bidirectional RNN (BiRNN) fundamentally unsuitable for standard autoregressive causal language modeling (e.g., predicting the next word $x_{t+1}$ given $x_1, \dots, x_t$)?$

bidirectional RNNs Hard

A.

The backward RNN requires access to future tokens, which leaks the target information ($x_{t+1}$) during the prediction at step $t$.

B.

BiRNNs cannot handle variable-length sequences during inference.

C.

The concatenated hidden state dimension is too large for the softmax layer.

D.

BPTT cannot be applied to the backward pass in a real-time sequential data stream.

45 $In sequence-to-sequence training, 'exposure bias' occurs when a model is trained with Teacher Forcing but tested in a free-running mode. Which of the following best describes how 'Scheduled Sampling' aims to mitigate this specific issue?$

teacher forcing Hard

A.

It dynamically alters the loss function from Cross-Entropy to Reinforcement Learning (e.g., REINFORCE) as training progresses.

B.

It applies dropout to the decoder's recurrent connections with a probability that decreases over time.

C.

During training, it stochastically decides whether to feed the ground-truth previous token or the model's own previous prediction to the next step, with the probability of using ground-truth decaying over time.

D.

It randomly masks out tokens in the input sequence to force the model to rely on its hidden state.
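
The idea behind scheduled sampling can be sketched as a per-step coin flip between the ground-truth token and the model's own prediction. The helper names and the linear decay are assumptions chosen for illustration; other decay schedules are possible.

```python
import random

def choose_decoder_input(gold_prev, model_prev, teacher_prob, rng):
    """Feed the ground-truth previous token with probability teacher_prob, else the model's own."""
    return gold_prev if rng.random() < teacher_prob else model_prev

def linear_decay(step, total_steps, floor=0.0):
    """Probability of using ground truth, decaying linearly from 1.0 towards `floor`."""
    return max(floor, 1.0 - step / total_steps)

rng = random.Random(0)  # seeded so the run is repeatable
picks = [
    choose_decoder_input("gold", "pred", linear_decay(step, 10), rng)
    for step in range(10)
]
```

Early in training the decoder mostly sees ground-truth inputs, and later it increasingly sees its own predictions, which is closer to the free-running conditions at test time.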

46 $In a standard implementation of Truncated Backpropagation Through Time (TBPTT) configured with a forward step size and a backward step size, what happens to the hidden state and the computational graph at the boundary between chunks?$

truncated backpropagation through time Hard

A.

The hidden state is passed forward, and gradients are accumulated indefinitely, effectively making it equivalent to full BPTT.

B.

The hidden state is passed forward to the next chunk, but it is detached from the computational graph, preventing gradients from flowing back into the previous chunk.

C.

Both the hidden state and the computational graph are discarded, treating each chunk as an independent sequence.

D.

The hidden state is reset to zero, and the computational graph is retained to allow gradient flow.

47 $To prevent exploding gradients in deep RNNs, practitioners use gradient clipping. Consider global norm clipping versus value clipping. Why is global norm clipping generally preferred over value-based clipping for sequence models?$

sequence training techniques Hard

A.

Value clipping requires calculating the norm of the gradient, which is computationally expensive for large sequence models.

B.

Value clipping cannot prevent the hidden state from exploding during the forward pass.

C.

Global norm clipping scales all parameter gradients by the same factor, preserving the overall direction of the gradient vector.

D.

Global norm clipping guarantees that the loss will decrease monotonically, whereas value clipping does not.
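
The difference between the two clipping styles is easy to see on a toy gradient. This is a minimal sketch on a flat list of floats; the function names and the threshold values are assumptions for the example.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total  # one shared factor for every component
    return [g * scale for g in grads]

def clip_by_value(grads, limit):
    """Clamp each component independently to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

grads = [3.0, 40.0]
norm_clipped = clip_by_global_norm(grads, max_norm=5.0)
value_clipped = clip_by_value(grads, limit=5.0)
```

With norm clipping the ratio between the two components stays the same, so the update still points the same way. Value clipping only shrinks the large component, which changes the direction.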

48 $When designing an RNN for document-level text classification, taking the final hidden state can create a bottleneck. If you instead apply max-pooling over the sequence of hidden states, what distinct representational advantage does this provide over the final hidden state?$

text classification Hard

A.

It reduces the dimensionality of the hidden state before passing it to the linear classification layer.

B.

It forces the RNN to behave like a Bag-of-Words model, discarding positional information entirely.

C.

It captures the most salient features (highest activation values) across the entire sequence, regardless of their position, mitigating the bias towards recent tokens.

D.

It completely eliminates the vanishing gradient problem for the early tokens in the document.
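
A toy comparison of max-pooling over time against taking the final hidden state is sketched below. The hidden-state values are invented for illustration, and `max_pool_over_time` is a hypothetical helper name.

```python
def max_pool_over_time(hidden_states):
    """Take the per-dimension maximum across all time steps."""
    return [max(column) for column in zip(*hidden_states)]

# Rows are time steps, columns are hidden dimensions (made-up values).
hidden_states = [
    [0.9, -0.1, 0.0],   # strong feature in dimension 0 at the first step
    [0.1, 0.2, -0.3],
    [0.0, 0.1, 0.4],
]

pooled = max_pool_over_time(hidden_states)
final = hidden_states[-1]
```

Here the strong early activation in dimension 0 survives in `pooled` but is absent from `final`.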

49 $For a sequence $w_1, \dots, w_N$, the perplexity (PPL) of a language model is defined as $\text{PPL} = P(w_1, \dots, w_N)^{-1/N}$. If the model calculates the average categorical cross-entropy loss $\mathcal{L}$ using natural logarithms, what is the exact mathematical relationship between $\mathcal{L}$ and PPL?$

evaluation metrics for sequence tasks Hard

A.

B.

C.

D.
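
The link between average cross-entropy and perplexity can be verified numerically. The sketch assumes `token_probs` holds the probability the model assigned to each correct token, and compares the exponentiated mean negative log-likelihood with the inverse N-th root of the sequence probability. All names and values are illustrative.

```python
import math

def mean_nll(token_probs):
    """Average cross-entropy in nats: the mean of -ln p over the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity_from_loss(token_probs):
    """Perplexity as the exponential of the average natural-log loss."""
    return math.exp(mean_nll(token_probs))

def perplexity_from_definition(token_probs):
    """Perplexity as P(w_1..w_N) ** (-1/N), with P the product of token probabilities."""
    sequence_prob = 1.0
    for p in token_probs:
        sequence_prob *= p
    return sequence_prob ** (-1.0 / len(token_probs))

probs = [0.25, 0.5, 0.125, 0.25]
ppl_a = perplexity_from_loss(probs)
ppl_b = perplexity_from_definition(probs)
```

Both routes give the same value up to floating-point error, because the mean of the logs is the log of the geometric mean.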

50 $In a variant of the LSTM known as 'LSTM with peephole connections' (Gers & Schmidhuber, 2000), the gates are allowed to inspect the cell state. Specifically, which cell state is used to compute the forget gate $f_t$ and input gate $i_t$, versus the output gate $o_t$?$

long short term memory networks Hard

A.

and use; uses .

B.

All three gates use .

C.

All three gates use .

D.

and use; uses .

51 $In a GRU, the candidate hidden state is computed as $\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$. If the reset gate $r_t$ approaches a zero vector, what is the structural implication for the sequence model at time $t$?$

gated recurrent units Hard

A.

The update gate $z_t$ is forced to 1, causing the hidden state to remain exactly $h_{t-1}$.

B.

The GRU completely forgets its entire history, and the final hidden state is determined solely by $x_t$.

C.

The candidate state acts as if it is reading the first symbol of a new sequence, ignoring previous hidden states.

D.

The network architecture degrades into a strictly linear transformation of the input sequence.

52 $Why is the ReLU activation function rarely used in vanilla Recurrent Neural Networks compared to $\tanh$, despite ReLU's success in mitigating vanishing gradients in deep Feedforward and Convolutional networks?$

recurrent neural networks Hard

A.

The memory requirements for caching ReLU activations across time steps are significantly higher than for $\tanh$.

B.

Because RNNs reuse the same weight matrix at every time step, the unbounded positive output of ReLU often leads to exponentially exploding activations and gradients.

C.

ReLU completely prevents gradient flow for negative inputs, rendering BPTT impossible across multiple time steps.

D.

ReLU causes the hidden state to become non-differentiable at exactly zero, which crashes recurrent autodifferentiation engines.

53 $A $k$-th order Markov model assumes the probability of the next word depends only on the previous $k$ words. How does an unrolled standard RNN theoretically bypass this $k$-th order Markov assumption for sequential text data?$

sequential text data Hard

A.

By updating its recurrent weights dynamically based on the length of the input sequence.

B.

By utilizing attention mechanisms that provide direct access to all past words.

C.

By maintaining a continuous hidden state vector that acts as a lossy compression of the entire unbounded history $(x_1, \dots, x_t)$.

D.

It does not bypass it; an RNN is mathematically equivalent to a 1st-order Markov model on the input space.

54 $Assume a sequence of length . You train an RNN using Truncated BPTT with forward chunk size and backward window . What is the total number of unrolled time steps processed during the backward passes for one complete epoch of this single sequence?$

truncated backpropagation through time Hard

A.

$50000$

B.

$2500$

C.

$1000$

D.

$50$

55 $In the BLEU score calculation for sequence tasks, the Brevity Penalty (BP) is introduced to penalize short translations. Let $c$ be the length of the candidate translation and $r$ be the effective reference corpus length. Under what exact condition is ?$

evaluation metrics for sequence tasks Hard

A.

only

B.

only

C.

D.
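
For reference, the brevity penalty from the original BLEU definition can be written as a short function. The name `brevity_penalty` and the guard for an empty candidate are choices made for this sketch.

```python
import math

def brevity_penalty(c, r):
    """BLEU brevity penalty: 1 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if c == 0:
        return 0.0  # assumption: an empty candidate gets the maximum penalty
    return 1.0 if c > r else math.exp(1.0 - r / c)

longer = brevity_penalty(c=12, r=10)   # candidate longer than reference
equal = brevity_penalty(c=10, r=10)    # same length: exp(0)
shorter = brevity_penalty(c=5, r=10)   # candidate half the reference length
```

Only candidates shorter than the reference are penalized. At equal length the exponential branch evaluates to 1, so no penalty applies there either.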

56 $In a Hierarchical Attention Network (HAN) used for document-level sentiment classification, the architecture consists of word-level and sentence-level encoders. Which of the following best represents the sequence of operations for generating the final document vector?$

sentiment classification Hard

A.

Word embedding $\to$ Sentence-level BiGRU $\to$ Word Attention $\to$ Word-level BiGRU $\to$ Document Attention.

B.

Word embedding $\to$ Word-level BiGRU $\to$ Sentence Attention $\to$ Sentence-level BiGRU $\to$ Document Attention.

C.

Word embedding $\to$ Word-level BiGRU $\to$ Word Attention $\to$ Sentence-level BiGRU $\to$ Document Attention.

D.

Word embedding $\to$ Word Attention $\to$ Word-level BiGRU $\to$ Sentence Attention $\to$ Sentence-level BiGRU.

57 $When performing Named Entity Recognition (NER) on a sequence of tokens $x_1, \dots, x_T$ using a Bidirectional RNN, the output for token $x_t$ relies on the concatenated state $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$. If an entity spans from $x_t$ to $x_{t+k}$, how does $h_t$ capture the dependency on $x_{t+k}$?$

bidirectional RNNs Hard

A.

Through a subsequent CRF layer, since the BiRNN hidden states themselves cannot capture future dependencies.

B.

Through the backward hidden state $\overleftarrow{h}_t$, which processes sequences from $T$ down to $1$ and thus incorporates $x_{t+k}$ before reaching $x_t$.

C.

By employing teacher forcing during inference to feed $x_{t+k}$ into $h_t$.

D.

Through the forward hidden state $\overrightarrow{h}_t$, which has processed $x_{t+k}$ before reaching $x_t$.

58 $In the standard sequence-to-sequence (Seq2Seq) model without attention (e.g., Cho et al., 2014), the entire source sentence is compressed into a single context vector $c$. How is this context vector applied during the decoding phase?$

sequence modeling applications Hard

A.

It is provided as an additional input to the decoder at every time step, alongside the previous target token and previous hidden state.

B.

It replaces the word embedding input at every time step in the decoder.

C.

It is used to mask out out-of-vocabulary words during the final softmax projection.

D.

It is only used to initialize the first hidden state of the decoder.

59 $While LSTMs successfully mitigate the vanishing gradient problem through their additive cell state, they are still susceptible to exploding gradients. Mathematically, why does the LSTM design not prevent exploding gradients?$

long short term memory networks Hard

A.

The forget gate can exceed 1, causing exponential growth of the cell state over time.

B.

The input gate can take negative values, flipping the sign of the gradients during BPTT.

C.

The backpropagation through the output gate and the activation involves matrix multiplications with the recurrent weight matrices at each time step, which can compound and explode.

D.

The additive cell state requires gradients to be summed over time, and the sum of gradients will always approach infinity for large $T$.

60 $When training sequence models with Cross-Entropy loss, 'Label Smoothing' is often applied by converting the hard target distribution (one-hot vector) into a soft target distribution. What is the primary theoretical justification for using Label Smoothing in autoregressive sequence models?$

sequence training techniques Hard

A.

It prevents the softmax logits from growing infinitely large, which reduces overconfidence and improves generalization (regularization).

B.

It completely eliminates exposure bias by allowing the model to sample incorrect tokens during Teacher Forcing.

C.

It allows the model to bypass the need for an attention mechanism by implicitly modeling word similarities.

D.

It prevents the model from predicting the <EOS> token too early in sequence generation.

Unit 3 - Practice Quiz