1. What is the primary characteristic of a Sequence Model compared to a standard Feedforward Neural Network?
A. It assumes all inputs are independent of each other
B. It processes inputs of fixed length only
C. It takes the order of inputs into account and can handle variable-length inputs
D. It uses only Convolutional layers
Correct Answer: It takes the order of inputs into account and can handle variable-length inputs
Explanation: Sequence models, like RNNs, are designed to handle sequential data where the current output depends on previous inputs, allowing for variable-length input sequences.
2. Which of the following data types is best suited for a Sequence Model?
A. Static image classification
B. Tabular housing price data
C. Sentiment analysis of movie reviews
D. Iris flower categorization
Correct Answer: Sentiment analysis of movie reviews
Explanation: Sentiment analysis involves text, which is sequential data where the order of words matters, making it ideal for sequence models.
3. In a Recurrent Neural Network (RNN), what is the function of the 'hidden state'?
A. To store the final output class
B. To act as a memory that captures information about previous time steps
C. To reset the network weights after every epoch
D. To visualize the attention weights
Correct Answer: To act as a memory that captures information about previous time steps
Explanation: The hidden state in an RNN passes information from one time step to the next, effectively acting as the network's memory of the sequence seen so far.
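As a concrete illustration of the hidden state acting as memory, here is a minimal NumPy sketch of one simple-RNN step; the dimensions and random weights are assumptions for illustration only.

import numpy as np

# Minimal sketch of a simple RNN step: the hidden state h carries information
# from all previous time steps. Sizes and weights are illustrative assumptions.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                      # initial hidden state: empty memory
for x_t in rng.normal(size=(5, input_size)):   # a toy 5-step input sequence
    h = np.tanh(W_x @ x_t + W_h @ h + b)       # new memory mixes the input with the old memory
print(h)                                       # summary of everything seen so far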
4. What is the phenomenon called when gradients become extremely small during backpropagation through time in an RNN, preventing the weights from updating?
A. Exploding Gradient
B. Vanishing Gradient
C. Gradient Clipping
D. Overfitting
Correct Answer: Vanishing Gradient
Explanation: The Vanishing Gradient problem occurs when gradients shrink exponentially as they propagate back through many time steps, causing earlier layers to stop learning.
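A quick numeric sketch (toy values, purely illustrative) of why repeated multiplication of per-step derivatives shrinks or blows up the gradient over long sequences:

import numpy as np

# Multiplying 50 per-step derivatives of 0.9 leaves almost nothing of the gradient,
# while the same chain with derivatives of 1.1 explodes instead.
print(np.prod(np.full(50, 0.9)))   # ~0.005 -> vanishing
print(np.prod(np.full(50, 1.1)))   # ~117   -> exploding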
5. Which algorithm is typically used to train Recurrent Neural Networks?
A. Standard Backpropagation
B. Backpropagation Through Time (BPTT)
C. K-Means Clustering
D. Random Forest
Correct Answer: Backpropagation Through Time (BPTT)
Explanation: BPTT is a specific variation of backpropagation used for RNNs, where the network is unrolled for all time steps, and errors are propagated back through the sequence.
6. Which activation function is most commonly used for the hidden state in a simple RNN to help regulate values?
A. ReLU
B. Softmax
C. Tanh
D. Linear
Correct Answer: Tanh
Explanation: The Tanh (Hyperbolic Tangent) function is commonly used in RNN hidden states because it keeps values between -1 and 1, preventing the hidden state from blowing up too quickly.
7. What is the primary architectural solution designed to solve the Vanishing Gradient problem in standard RNNs?
A. Convolutional Neural Network (CNN)
B. Long Short-Term Memory (LSTM)
C. Perceptron
D. Autoencoder
Correct Answer: Long Short-Term Memory (LSTM)
Explanation: LSTMs introduce a cell state and gating mechanisms specifically designed to maintain long-term dependencies and mitigate the vanishing gradient problem.
8. In an LSTM unit, which gate is responsible for deciding what information to discard from the cell state?
A. Input Gate
B. Output Gate
C. Forget Gate
D. Update Gate
Correct Answer: Forget Gate
Explanation: The Forget Gate uses a sigmoid layer over the previous hidden state and current input to decide which information to remove (values near 0) or keep (values near 1) in the cell state.
9. What represents the 'long-term memory' component in an LSTM architecture?
A. Hidden State (h_t)
B. Cell State (C_t)
C. Input Gate
D. Output Gate
Correct Answer: Cell State (C_t)
Explanation: The Cell State runs down the entire chain of the LSTM with only minor linear interactions, acting as the conveyor belt for long-term information.
10. In an LSTM, what is the range of values output by the sigmoid activation function used in gates?
A. -1 to 1
B. 0 to 1
C. -infinity to +infinity
D. 0 to 100
Correct Answer: 0 to 1
Explanation: Sigmoid functions output values between 0 and 1, which effectively act as a switch or percentage to let information pass through or block it.
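The gates and the cell-state update from the last few questions can be summarized in one NumPy sketch of a single LSTM step; all weights, dimensions, and inputs here are toy assumptions, not a trained model.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate (0..1): what to discard from the cell state
    i = sigmoid(W["i"] @ z + b["i"])        # input gate (0..1): how much new information to admit
    C_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell state: proposed new values
    C = f * C_prev + i * C_tilde            # cell state C_t: the long-term memory
    o = sigmoid(W["o"] @ z + b["o"])        # output gate (0..1)
    h = o * np.tanh(C)                      # hidden state h_t: the working memory
    return h, C

# Toy usage with assumed sizes and random weights.
hid, inp = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hid, hid + inp)) for k in "fico"}
b = {k: np.zeros(hid) for k in "fico"}
h, C = lstm_step(rng.normal(size=inp), np.zeros(hid), np.zeros(hid), W, b)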
11. Which task involves assigning a grammatical category (like Noun, Verb, Adjective) to every word in a sentence?
A. Named Entity Recognition
B. Part-of-Speech (POS) Tagging
C. Machine Translation
D. Sentiment Analysis
Correct Answer: Part-of-Speech (POS) Tagging
Explanation: POS Tagging is the process of marking up a word in a text as corresponding to a particular part of speech based on its definition and context.
12. Named Entity Recognition (NER) is primarily concerned with identifying:
A. Grammatical errors
B. Sentiment of the text
C. Real-world objects like people, organizations, and locations
D. The translation of the text
Correct Answer: Real-world objects like people, organizations, and locations
Explanation: NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories.
13. What type of Sequence problem is POS Tagging?
A. One-to-One
B. One-to-Many
C. Many-to-One
D. Many-to-Many (Synced)
Correct Answer: Many-to-Many (Synced)
Explanation: In POS tagging, a sequence of words (input) is mapped to a sequence of tags (output) of the same length, making it a Many-to-Many synced problem.
14. In the context of NER, what does the 'BIO' or 'IOB' tagging scheme stand for?
A. Binary-Input-Output
B. Beginning-Inside-Outside
C. Basic-Input-Operation
D. Backward-Inward-Onward
Correct Answer: Beginning-Inside-Outside
Explanation: BIO stands for Beginning (first token of an entity), Inside (subsequent tokens of an entity), and Outside (not part of an entity), a common format for tagging spans of text.
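A hypothetical sentence tagged with the BIO scheme, purely as an illustration of the format:

# B- marks the first token of an entity, I- a continuation, O a token outside any entity.
tokens = ["Tim", "Cook", "visited", "New", "York", "yesterday"]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]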
15. What is the core architecture used in Neural Machine Translation (NMT) before the introduction of Attention?
A. Encoder-Only
B. Encoder-Decoder (Seq2Seq)
C. Decoder-Only
D. Random Forest
Correct Answer: Encoder-Decoder (Seq2Seq)
Explanation: Traditional NMT uses an Encoder-Decoder architecture where the encoder processes the source language and the decoder generates the target language.
16. In a traditional Seq2Seq model, what is the role of the Encoder?
A. To generate the output sequence
B. To calculate the loss function
C. To compress the input sequence into a fixed-length context vector
D. To visualize the data
Correct Answer: To compress the input sequence into a fixed-length context vector
Explanation: The encoder reads the input sequence and summarizes the information into a final hidden state, often called the context vector, which is passed to the decoder.
17. What is the 'Context Vector' in a traditional RNN-based Encoder-Decoder model?
A. The first hidden state of the encoder
B. The last hidden state of the encoder
C. The average of all input vectors
D. The weights of the output layer
Correct Answer: The last hidden state of the encoder
Explanation: In a standard Seq2Seq model, the final hidden state of the encoder RNN represents the context vector containing the summary of the entire input sequence.
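A minimal tf.keras sketch of this idea, where the encoder LSTM's final states serve as the context that initializes the decoder; the vocabulary size and layer dimensions are assumptions for illustration.

import tensorflow as tf

# Encoder: its final hidden/cell state is the fixed-length context vector.
encoder_inputs = tf.keras.Input(shape=(None,))
enc = tf.keras.layers.Embedding(input_dim=5000, output_dim=64)(encoder_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(128, return_state=True)(enc)
context = [state_h, state_c]

# Decoder: generates the target sequence starting from that context.
decoder_inputs = tf.keras.Input(shape=(None,))
dec = tf.keras.layers.Embedding(input_dim=5000, output_dim=64)(decoder_inputs)
dec = tf.keras.layers.LSTM(128, return_sequences=True)(dec, initial_state=context)
decoder_outputs = tf.keras.layers.Dense(5000, activation="softmax")(dec)

model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)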
18. Which of the following is a major bottleneck of the traditional Seq2Seq model?
A. It cannot handle numeric data
B. It requires too much RAM
C. Performance degrades significantly for long sentences due to the fixed-length context vector
D. It can only translate into English
Correct Answer: Performance degrades significantly for long sentences due to the fixed-length context vector
Explanation: Compressing a long sentence into a single fixed-length vector causes information loss, making it difficult for the decoder to generate accurate translations for long sequences.
19. What is 'Teacher Forcing' in the context of training sequence models?
A. Forcing the model to stop training early
B. Using the actual ground truth output from the previous time step as input for the current step during training
C. Using the model's predicted output as input for the next step during training
D. Manually setting the weights of the network
Correct Answer: Using the actual ground truth output from the previous time step as input for the current step during training
Explanation: Teacher forcing speeds up convergence by feeding the correct previous token to the decoder during training, rather than the token the model predicted (which might be wrong).
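In data-preparation terms, teacher forcing means the decoder's input sequence is the ground-truth target shifted right by one token. A toy sketch with assumed integer token ids:

import numpy as np

START = 1                                   # assumed start-of-sequence id
target = np.array([7, 3, 9, 2])             # ground-truth target tokens (2 = assumed end-of-sequence id)
decoder_input  = np.concatenate([[START], target[:-1]])  # [1 7 3 9]: fed to the decoder
decoder_target = target                                   # [7 3 9 2]: compared with the decoder output
# At step t the decoder sees the *true* token t-1, not its own (possibly wrong) prediction.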
20. Which search strategy explores multiple possible output sequences simultaneously to find the most likely translation?
A. Greedy Search
B. Beam Search
C. Linear Search
D. Binary Search
Correct Answer: Beam Search
Explanation: Beam search maintains a 'beam' of the top 'k' most probable partial sequences at each step, offering a better approximation of the global best sequence than Greedy Search.
21. The Attention Mechanism was primarily introduced to solve which problem?
A. Overfitting in CNNs
B. The information bottleneck of the fixed-length context vector in NMT
C. Slow training of Linear Regression
D. The inability of RNNs to process images
Correct Answer: The information bottleneck of the fixed-length context vector in NMT
Explanation: Attention allows the decoder to look at different parts of the input sequence dynamically, removing the reliance on a single static context vector.
22. How does the Attention Mechanism calculate the context vector for each time step in the decoder?
A. By taking the last state of the encoder only
B. By computing a weighted sum of all encoder hidden states
C. By randomly selecting an encoder state
D. By averaging all input words
Correct Answer: By computing a weighted sum of all encoder hidden states
Explanation: Attention computes a weighted sum of encoder states, where the weights represent the importance (alignment) of each encoder state for the current decoder step.
23. In Attention, what do the 'alignment scores' (or attention weights) represent?
A. The error rate of the model
B. How relevant a specific input word is to the word currently being generated
C. The magnitude of the gradient
D. The number of hidden layers
Correct Answer: How relevant a specific input word is to the word currently being generated
Explanation: Alignment scores indicate how much focus the decoder should place on a specific encoder hidden state (input word) when generating the current output word.
24. What mathematical function is typically applied to alignment scores to convert them into probabilities that sum to 1?
A. ReLU
B. Sigmoid
C. Softmax
D. Tanh
Correct Answer: Softmax
Explanation: The Softmax function normalizes the raw attention scores into a probability distribution, ensuring the attention weights sum to 1.
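The last three questions condense into one NumPy sketch of dot-product attention for a single decoder step; the dimensions and random states are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))    # 6 source positions, hidden size 8
decoder_state  = rng.normal(size=8)         # current decoder hidden state h_t

scores  = encoder_states @ decoder_state            # alignment scores score(h_t, h_s)
weights = np.exp(scores) / np.exp(scores).sum()     # softmax: attention weights that sum to 1
context = weights @ encoder_states                  # weighted sum of all encoder states
print(weights.sum(), context.shape)                 # ~1.0, (8,)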
25. In the context of RNNs, what is 'Backpropagation Through Time' (BPTT)?
A. A method to predict future stock prices
B. Unfolding the RNN across time steps and applying backpropagation
C. Training the network in reverse order
D. Using future data to predict past data
Correct Answer: Unfolding the RNN across time steps and applying backpropagation
Explanation: BPTT involves unrolling the recurrent network for the duration of the sequence and calculating gradients across this unrolled graph.
26. Which of the following is NOT a gate in a standard LSTM?
A. Forget Gate
B. Input Gate
C. Output Gate
D. Attention Gate
Correct Answer: Attention Gate
Explanation: Standard LSTMs have Forget, Input, and Output gates. Attention is a separate mechanism usually applied externally to the RNN layers.
27. What is the shape of the input data for a basic RNN layer in Keras/TensorFlow?
A. (Batch Size, Features)
B. (Batch Size, Timesteps, Features)
C. (Timesteps, Batch Size)
D. (Features, Labels)
Correct Answer: (Batch Size, Timesteps, Features)
Explanation: RNNs require 3D input tensors: the number of samples (batch size), the length of the sequence (timesteps), and the dimensionality of the data at each step (features).
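A minimal sketch of the 3-D input convention in tf.keras; the sizes below are arbitrary example values.

import numpy as np
import tensorflow as tf

x = np.random.rand(32, 10, 8).astype("float32")  # (batch=32, timesteps=10, features=8)
layer = tf.keras.layers.SimpleRNN(16)            # 16 hidden units
print(layer(x).shape)                            # (32, 16): one final hidden state per sample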
28. Why are Bidirectional RNNs (BiRNNs) useful?
A. They train faster than standard RNNs
B. They allow the network to have context from both the past and the future
C. They use fewer parameters
D. They eliminate the need for backpropagation
Correct Answer: They allow the network to have context from both the past and the future
Explanation: BiRNNs process the sequence in both forward and backward directions, providing the current state with context from both preceding and succeeding words.
29. In a Many-to-One sequence model (e.g., Sentiment Analysis), where is the output typically taken?
A. At every time step
B. At the first time step
C. At the last time step
D. Randomly sampled
Correct Answer: At the last time step
Explanation: For Many-to-One tasks like classification, the final hidden state after processing the whole sequence is used to generate the single output label.
30. What does the 'candidate cell state' in an LSTM do?
A. It decides what to forget
B. It proposes new values that could be added to the state
C. It outputs the final prediction
D. It clears the memory
Correct Answer: It proposes new values that could be added to the state
Explanation: The candidate cell state (computed with a Tanh layer) is a vector of new candidate values that might be added to the cell state, regulated by the input gate.
31. Which issue leads to the 'Exploding Gradient' problem?
Correct Answer: Derivatives larger than 1 being multiplied repeatedly during backpropagation
Explanation: If the derivatives are larger than 1, repeated multiplication during backpropagation causes the gradients to grow exponentially, leading to instability.
32. A solution to the Exploding Gradient problem is:
A. Gradient Clipping
B. Using ReLU
C. Increasing the learning rate
D. Removing the hidden layer
Correct Answer: Gradient Clipping
Explanation: Gradient Clipping involves capping the gradients at a specific threshold value to prevent them from becoming too large during training.
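In tf.keras, clipping can be requested directly on the optimizer; the learning rate and threshold below are example values, not recommendations.

import tensorflow as tf

# clipnorm rescales each gradient tensor so its L2 norm never exceeds 1.0;
# clipvalue=0.5 would instead cap every individual gradient component at +/-0.5.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)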
33. In an Attention model, the vector c_t is often referred to as:
A. The Forget Vector
B. The Context Vector
C. The Bias Vector
D. The Noise Vector
Correct Answer: The Context Vector
Explanation: The context vector c_t is the dynamic summary of the input sequence tailored for the specific decoding time step t via attention weights.
34. Sequence-to-Sequence models are most commonly associated with:
A. Image Segmentation
B. Text Summarization
C. Linear Regression
D. Cluster Analysis
Correct Answer: Text Summarization
Explanation: Text summarization transforms an input sequence (text) into an output sequence (summary), fitting the Seq2Seq paradigm perfectly.
35. What is 'Global Attention'?
A. Attention applied to a single word
B. Attention that considers all hidden states of the encoder
C. Attention that considers only a window of hidden states
D. Attention applied without weights
Correct Answer: Attention that considers all hidden states of the encoder
Explanation: Global attention computes the alignment scores against all encoder hidden states, as opposed to Local attention which focuses on a small window.
36. Which of the following describes 'Greedy Decoding'?
A. Choosing the word with the highest probability at each step immediately
B. Considering all possible future sequences
C. Choosing a random word based on distribution
D. Waiting until the end to choose words
Correct Answer: Choosing the word with the highest probability at each step immediately
Explanation: Greedy decoding selects the token with the highest probability at the current step without considering how this choice impacts future probabilities.
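Greedy decoding reduces to an argmax at every step. A toy sketch over assumed per-step probability distributions:

import numpy as np

step_probs = np.array([[0.1, 0.6, 0.2, 0.1],   # assumed distributions over a 4-token vocabulary
                       [0.3, 0.3, 0.4, 0.0],
                       [0.7, 0.1, 0.1, 0.1]])
print(step_probs.argmax(axis=1))               # [1 2 0]: best token per step, with no look-ahead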
37. In POS tagging, if a word is ambiguous (e.g., 'book' can be a noun or verb), how does an RNN resolve it?
A. It flips a coin
B. It uses the context provided by surrounding words stored in the hidden state
C. It always picks the most common usage
D. It cannot resolve ambiguity
Correct Answer: It uses the context provided by surrounding words stored in the hidden state
Explanation: The RNN's hidden state captures context (e.g., 'read a book' vs 'book a flight'), allowing it to disambiguate based on the surrounding sequence.
38. What is the typical loss function for a multi-class classification problem like POS Tagging or NMT?
A. Mean Squared Error (MSE)
B. Categorical Cross-Entropy
C. Hinge Loss
D. Absolute Error
Correct Answer: Categorical Cross-Entropy
Explanation: Cross-entropy loss is the standard for multi-class classification tasks where the model outputs a probability distribution over the vocabulary/tags.
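For a single time step, categorical cross-entropy is just the negative log-probability assigned to the true class. A toy numeric sketch with an assumed 3-tag distribution:

import numpy as np

p = np.array([0.1, 0.7, 0.2])   # assumed predicted distribution over 3 tags
true_tag = 1                    # index of the correct tag
print(-np.log(p[true_tag]))     # ~0.357: small when the model is confident and correct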
39. What does GRU stand for?
A. Gated Recurrent Unit
B. General Regression Unit
C. Global Recurrent Update
D. Gradient Rectified Unit
Correct Answer: Gated Recurrent Unit
Explanation: GRU is a simplified variation of LSTM that merges the cell state and hidden state and uses fewer gates.
40. The 'Input Gate' in an LSTM is usually controlled by which activation function?
A. Tanh
B. Sigmoid
C. ReLU
D. Linear
Correct Answer: Sigmoid
Explanation: The gate controller uses a Sigmoid function to output values between 0 (block) and 1 (pass), determining how much new information enters the state.
41. What visual tool is often used to interpret what an Attention model has learned?
A. Pie Chart
B. Attention Heatmap
C. Scatter Plot
D. Histogram
Correct Answer: Attention Heatmap
Explanation: Heatmaps visualize the alignment matrix, showing which input words correspond strongly to which output words (e.g., in translation).
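A minimal matplotlib sketch of such a heatmap, using a random matrix as a stand-in for a learned alignment matrix:

import numpy as np
import matplotlib.pyplot as plt

attn = np.random.rand(4, 5)        # rows: output (target) words, columns: input (source) words
plt.imshow(attn, cmap="viridis")
plt.xlabel("input (source) position")
plt.ylabel("output (target) position")
plt.colorbar()
plt.show()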
42. Which limitation of RNNs prevents parallelization during training?
A. Large memory footprint
B. Sequential dependency of the hidden state
C. Use of sigmoid functions
D. Complex loss functions
Correct Answer: Sequential dependency of the hidden state
Explanation: Because the state at time t depends on the state at time t-1, computations must happen sequentially, preventing parallel processing across time steps on GPUs.
43. In a sequence model, 'padding' is used to:
A. Increase the learning rate
B. Make all sequences in a batch the same length
C. Remove stopwords
D. Add noise to the data
Correct Answer: Make all sequences in a batch the same length
Explanation: Since neural networks require fixed-size tensor inputs for batching, shorter sequences are padded (usually with zeros) to match the length of the longest sequence.
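A minimal sketch with Keras' pad_sequences utility on toy integer sequences:

import tensorflow as tf

batch = [[5, 8, 2], [7, 1], [4, 9, 3, 6]]
print(tf.keras.preprocessing.sequence.pad_sequences(batch, padding="post"))
# [[5 8 2 0]
#  [7 1 0 0]
#  [4 9 3 6]]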
44. Which of these is a 'many-to-one' application of sequence models?
A. Machine Translation
B. Video Captioning
C. Music Generation
D. Sentiment Classification
Correct Answer: Sentiment Classification
Explanation: The input is a sequence of words (many); the output is a single sentiment label (one).
45. In an NER task, identifying 'Apple' as an Organization rather than a Fruit relies on:
A. The spelling of the word
B. The capitalization
C. The contextual information in the sequence
D. The length of the word
Correct Answer: The contextual information in the sequence
Explanation: While capitalization helps, the context (e.g., 'Apple released a new phone' vs 'I ate an apple') is crucial for correct entity classification.
46. Why is the traditional Encoder-Decoder model often described as having 'amnesia'?
A. It forgets the weights after training
B. It struggles to retain information from the beginning of a long sequence at the decoding stage
C. It cannot learn new words
D. It uses a forget gate
Correct Answer: It struggles to retain information from the beginning of a long sequence at the decoding stage
Explanation: Standard Seq2Seq models rely on the final hidden state. For long sentences, information from the start of the sentence is often diluted or 'forgotten' by the time the vector is formed.
47. In the attention equation score(h_t, h_s), what are h_t and h_s?
A. Input and Output Gates
B. Decoder hidden state and Encoder hidden state
C. Weight and Bias
D. Learning rate and Loss
Correct Answer: Decoder hidden state and Encoder hidden state
Explanation: The score function calculates the similarity/compatibility between the current decoder state (h_t) and a specific encoder state (h_s).
48. Which mechanism allows a model to focus on 'local' parts of the input sequence based on the current decoding step?
A. Max Pooling
B. Attention Mechanism
C. Dropout
D. Batch Normalization
Correct Answer: Attention Mechanism
Explanation: Attention allows the model to dynamically weigh and focus on specific parts (local areas) of the input sequence relevant to the current prediction.
49. In sequence labeling, what does the output layer usually consist of?
A. A single neuron
B. A Softmax layer over the tag set for each time step
C. A linear regression layer
D. A clustering algorithm
Correct Answer: A Softmax layer over the tag set for each time step
Explanation: For tasks like POS or NER, the model outputs a probability distribution over all possible tags for every word in the sequence.
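A minimal tf.keras sketch of a sequence-labeling model whose output is a softmax over the tag set at every time step; the vocabulary size, tag count, and layer widths are assumptions for illustration.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Dense(20, activation="softmax"),   # 20 assumed tags, applied per time step
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")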
50. Which of the following best describes the 'Seq2Seq' mapping?