1. What is the primary characteristic of a Sequence Model compared to a standard Feedforward Neural Network?
A. It assumes all inputs are independent of each other
B. It processes inputs of fixed length only
C. It takes the order of inputs into account and can handle variable-length inputs
D. It uses only Convolutional layers
Correct Answer: It takes the order of inputs into account and can handle variable-length inputs
Explanation: Sequence models, like RNNs, are designed to handle sequential data where the current output depends on previous inputs, allowing for variable-length input sequences.
2. Which of the following data types is best suited for a Sequence Model?
A. Static image classification
B. Tabular housing price data
C. Sentiment analysis of movie reviews
D. Iris flower categorization
Correct Answer: Sentiment analysis of movie reviews
Explanation: Sentiment analysis involves text, which is sequential data where the order of words matters, making it ideal for sequence models.
3. In a Recurrent Neural Network (RNN), what is the function of the 'hidden state'?
A. To store the final output class
B. To act as a memory that captures information about previous time steps
C. To reset the network weights after every epoch
D. To visualize the attention weights
Correct Answer: To act as a memory that captures information about previous time steps
Explanation: The hidden state in an RNN passes information from one time step to the next, effectively acting as the network's memory of the sequence seen so far.
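As a concrete illustration of the hidden state acting as memory, here is a minimal NumPy sketch of one simple-RNN step; the dimensions and random weights are assumptions for illustration only.

import numpy as np

# Minimal sketch of a simple RNN step: the hidden state h carries information
# from all previous time steps. Sizes and weights are illustrative assumptions.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                      # initial hidden state: empty memory
for x_t in rng.normal(size=(5, input_size)):   # a toy 5-step input sequence
    h = np.tanh(W_x @ x_t + W_h @ h + b)       # new memory mixes the input with the old memory
print(h)                                       # summary of everything seen so far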
4. What is the phenomenon called when gradients become extremely small during backpropagation through time in an RNN, preventing the weights from updating?
A. Exploding Gradient
B. Vanishing Gradient
C. Gradient Clipping
D. Overfitting
Correct Answer: Vanishing Gradient
Explanation: The Vanishing Gradient problem occurs when gradients shrink exponentially as they propagate back through many time steps, causing earlier layers to stop learning.
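A quick numeric sketch (toy values, purely illustrative) of why repeated multiplication of per-step derivatives shrinks or blows up the gradient over long sequences:

import numpy as np

# Multiplying 50 per-step derivatives of 0.9 leaves almost nothing of the gradient,
# while the same chain with derivatives of 1.1 explodes instead.
print(np.prod(np.full(50, 0.9)))   # ~0.005 -> vanishing
print(np.prod(np.full(50, 1.1)))   # ~117   -> exploding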
5. Which algorithm is typically used to train Recurrent Neural Networks?
A. Standard Backpropagation
B. Backpropagation Through Time (BPTT)
C. K-Means Clustering
D. Random Forest
Correct Answer: Backpropagation Through Time (BPTT)
Explanation: BPTT is a specific variation of backpropagation used for RNNs, where the network is unrolled for all time steps, and errors are propagated back through the sequence.
6. Which activation function is most commonly used for the hidden state in a simple RNN to help regulate values?
A. ReLU
B. Softmax
C. Tanh
D. Linear
Correct Answer: Tanh
Explanation: The Tanh (Hyperbolic Tangent) function is commonly used in RNN hidden states because it keeps values between -1 and 1, preventing the hidden state from blowing up too quickly.
7. What is the primary architectural solution designed to solve the Vanishing Gradient problem in standard RNNs?
A. Convolutional Neural Network (CNN)
B. Long Short-Term Memory (LSTM)
C. Perceptron
D. Autoencoder
Correct Answer: Long Short-Term Memory (LSTM)
Explanation: LSTMs introduce a cell state and gating mechanisms specifically designed to maintain long-term dependencies and mitigate the vanishing gradient problem.
8. In an LSTM unit, which gate is responsible for deciding what information to discard from the cell state?
A. Input Gate
B. Output Gate
C. Forget Gate
D. Update Gate
Correct Answer: Forget Gate
Explanation: The Forget Gate uses a sigmoid layer over the previous hidden state and current input to decide which information to remove (values near 0) or keep (values near 1) in the cell state.
9. What represents the 'long-term memory' component in an LSTM architecture?
A. Hidden State (h_t)
B. Cell State (C_t)
C. Input Gate
D. Output Gate
Correct Answer: Cell State (C_t)
Explanation: The Cell State runs down the entire chain of the LSTM with only minor linear interactions, acting as the conveyor belt for long-term information.
10. In an LSTM, what is the range of values output by the sigmoid activation function used in gates?
A. -1 to 1
B. 0 to 1
C. -infinity to +infinity
D. 0 to 100
Correct Answer: 0 to 1
Explanation: Sigmoid functions output values between 0 and 1, which effectively act as a switch or percentage to let information pass through or block it.
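The gates and the cell-state update from the last few questions can be summarized in one NumPy sketch of a single LSTM step; all weights, dimensions, and inputs here are toy assumptions, not a trained model.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate (0..1): what to discard from the cell state
    i = sigmoid(W["i"] @ z + b["i"])        # input gate (0..1): how much new information to admit
    C_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell state: proposed new values
    C = f * C_prev + i * C_tilde            # cell state C_t: the long-term memory
    o = sigmoid(W["o"] @ z + b["o"])        # output gate (0..1)
    h = o * np.tanh(C)                      # hidden state h_t: the working memory
    return h, C

# Toy usage with assumed sizes and random weights.
hid, inp = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hid, hid + inp)) for k in "fico"}
b = {k: np.zeros(hid) for k in "fico"}
h, C = lstm_step(rng.normal(size=inp), np.zeros(hid), np.zeros(hid), W, b)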
11. Which task involves assigning a grammatical category (like Noun, Verb, Adjective) to every word in a sentence?
A. Named Entity Recognition
B. Part-of-Speech (POS) Tagging
C. Machine Translation
D. Sentiment Analysis
Correct Answer: Part-of-Speech (POS) Tagging
Explanation: POS Tagging is the process of marking up a word in a text as corresponding to a particular part of speech based on its definition and context.
12. Named Entity Recognition (NER) is primarily concerned with identifying:
A. Grammatical errors
B. Sentiment of the text
C. Real-world objects like people, organizations, and locations
D. The translation of the text
Correct Answer: Real-world objects like people, organizations, and locations
Explanation: NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories.
13. What type of Sequence problem is POS Tagging?
A. One-to-One
B. One-to-Many
C. Many-to-One
D. Many-to-Many (Synced)
Correct Answer: Many-to-Many (Synced)
Explanation: In POS tagging, a sequence of words (input) is mapped to a sequence of tags (output) of the same length, making it a Many-to-Many synced problem.
14. In the context of NER, what does the 'BIO' or 'IOB' tagging scheme stand for?
A. Binary-Input-Output
B. Beginning-Inside-Outside
C. Basic-Input-Operation
D. Backward-Inward-Onward
Correct Answer: Beginning-Inside-Outside
Explanation: BIO stands for Beginning (first token of an entity), Inside (subsequent tokens of an entity), and Outside (not part of an entity), a common format for tagging spans of text.
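A hypothetical sentence tagged with the BIO scheme, purely as an illustration of the format:

# B- marks the first token of an entity, I- a continuation, O a token outside any entity.
tokens = ["Tim", "Cook", "visited", "New", "York", "yesterday"]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]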
15. What is the core architecture used in Neural Machine Translation (NMT) before the introduction of Attention?
A. Encoder-Only
B. Encoder-Decoder (Seq2Seq)
C. Decoder-Only
D. Random Forest
Correct Answer: Encoder-Decoder (Seq2Seq)
Explanation: Traditional NMT uses an Encoder-Decoder architecture where the encoder processes the source language and the decoder generates the target language.
16. In a traditional Seq2Seq model, what is the role of the Encoder?
A. To generate the output sequence
B. To calculate the loss function
C. To compress the input sequence into a fixed-length context vector
D. To visualize the data
Correct Answer: To compress the input sequence into a fixed-length context vector
Explanation: The encoder reads the input sequence and summarizes the information into a final hidden state, often called the context vector, which is passed to the decoder.
17. What is the 'Context Vector' in a traditional RNN-based Encoder-Decoder model?
A. The first hidden state of the encoder
B. The last hidden state of the encoder
C. The average of all input vectors
D. The weights of the output layer
Correct Answer: The last hidden state of the encoder
Explanation: In a standard Seq2Seq model, the final hidden state of the encoder RNN represents the context vector containing the summary of the entire input sequence.
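A minimal tf.keras sketch of this idea, where the encoder LSTM's final states serve as the context that initializes the decoder; the vocabulary size and layer dimensions are assumptions for illustration.

import tensorflow as tf

# Encoder: its final hidden/cell state is the fixed-length context vector.
encoder_inputs = tf.keras.Input(shape=(None,))
enc = tf.keras.layers.Embedding(input_dim=5000, output_dim=64)(encoder_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(128, return_state=True)(enc)
context = [state_h, state_c]

# Decoder: generates the target sequence starting from that context.
decoder_inputs = tf.keras.Input(shape=(None,))
dec = tf.keras.layers.Embedding(input_dim=5000, output_dim=64)(decoder_inputs)
dec = tf.keras.layers.LSTM(128, return_sequences=True)(dec, initial_state=context)
decoder_outputs = tf.keras.layers.Dense(5000, activation="softmax")(dec)

model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)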
18. Which of the following is a major bottleneck of the traditional Seq2Seq model?
A. It cannot handle numeric data
B. It requires too much RAM
C. Performance degrades significantly for long sentences due to the fixed-length context vector
D. It can only translate into English
Correct Answer: Performance degrades significantly for long sentences due to the fixed-length context vector
Explanation: Compressing a long sentence into a single fixed-length vector causes information loss, making it difficult for the decoder to generate accurate translations for long sequences.
19. What is 'Teacher Forcing' in the context of training sequence models?
A. Forcing the model to stop training early
B. Using the actual ground truth output from the previous time step as input for the current step during training
C. Using the model's predicted output as input for the next step during training
D. Manually setting the weights of the network
Correct Answer: Using the actual ground truth output from the previous time step as input for the current step during training
Explanation: Teacher forcing speeds up convergence by feeding the correct previous token to the decoder during training, rather than the token the model predicted (which might be wrong).
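In data-preparation terms, teacher forcing means the decoder's input sequence is the ground-truth target shifted right by one token. A toy sketch with assumed integer token ids:

import numpy as np

START = 1                                   # assumed start-of-sequence id
target = np.array([7, 3, 9, 2])             # ground-truth target tokens (2 = assumed end-of-sequence id)
decoder_input  = np.concatenate([[START], target[:-1]])  # [1 7 3 9]: fed to the decoder
decoder_target = target                                   # [7 3 9 2]: compared with the decoder output
# At step t the decoder sees the *true* token t-1, not its own (possibly wrong) prediction.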
20. Which search strategy explores multiple possible output sequences simultaneously to find the most likely translation?
A. Greedy Search
B. Beam Search
C. Linear Search
D. Binary Search
Correct Answer: Beam Search
Explanation: Beam search maintains a 'beam' of the top 'k' most probable partial sequences at each step, offering a better approximation of the global best sequence than Greedy Search.
21. The Attention Mechanism was primarily introduced to solve which problem?
A. Overfitting in CNNs
B. The information bottleneck of the fixed-length context vector in NMT
C. Slow training of Linear Regression
D. The inability of RNNs to process images
Correct Answer: The information bottleneck of the fixed-length context vector in NMT
Explanation: Attention allows the decoder to look at different parts of the input sequence dynamically, removing the reliance on a single static context vector.
22. How does the Attention Mechanism calculate the context vector for each time step in the decoder?
A. By taking the last state of the encoder only
B. By computing a weighted sum of all encoder hidden states
C. By randomly selecting an encoder state
D. By averaging all input words
Correct Answer: By computing a weighted sum of all encoder hidden states
Explanation: Attention computes a weighted sum of encoder states, where the weights represent the importance (alignment) of each encoder state for the current decoder step.
23. In Attention, what do the 'alignment scores' (or attention weights) represent?
A. The error rate of the model
B. How relevant a specific input word is to the word currently being generated
C. The magnitude of the gradient
D. The number of hidden layers
Correct Answer: How relevant a specific input word is to the word currently being generated
Explanation: Alignment scores indicate how much focus the decoder should place on a specific encoder hidden state (input word) when generating the current output word.
24. What mathematical function is typically applied to alignment scores to convert them into probabilities that sum to 1?
A. ReLU
B. Sigmoid
C. Softmax
D. Tanh
Correct Answer: Softmax
Explanation: The Softmax function normalizes the raw attention scores into a probability distribution, ensuring the attention weights sum to 1.
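The last three questions condense into one NumPy sketch of dot-product attention for a single decoder step; the dimensions and random states are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))    # 6 source positions, hidden size 8
decoder_state  = rng.normal(size=8)         # current decoder hidden state h_t

scores  = encoder_states @ decoder_state            # alignment scores score(h_t, h_s)
weights = np.exp(scores) / np.exp(scores).sum()     # softmax: attention weights that sum to 1
context = weights @ encoder_states                  # weighted sum of all encoder states
print(weights.sum(), context.shape)                 # ~1.0, (8,)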
25. In the context of RNNs, what is 'Backpropagation Through Time' (BPTT)?
A. A method to predict future stock prices
B. Unfolding the RNN across time steps and applying backpropagation
C. Training the network in reverse order
D. Using future data to predict past data
Correct Answer: Unfolding the RNN across time steps and applying backpropagation
Explanation: BPTT involves unrolling the recurrent network for the duration of the sequence and calculating gradients across this unrolled graph.
26. Which of the following is NOT a gate in a standard LSTM?
A. Forget Gate
B. Input Gate
C. Output Gate
D. Attention Gate
Correct Answer: Attention Gate
Explanation: Standard LSTMs have Forget, Input, and Output gates. Attention is a separate mechanism usually applied externally to the RNN layers.
27. What is the shape of the input data for a basic RNN layer in Keras/TensorFlow?
A. (Batch Size, Features)
B. (Batch Size, Timesteps, Features)
C. (Timesteps, Batch Size)
D. (Features, Labels)
Correct Answer: (Batch Size, Timesteps, Features)
Explanation: RNNs require 3D input tensors: the number of samples (batch size), the length of the sequence (timesteps), and the dimensionality of the data at each step (features).
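A minimal sketch of the 3-D input convention in tf.keras; the sizes below are arbitrary example values.

import numpy as np
import tensorflow as tf

x = np.random.rand(32, 10, 8).astype("float32")  # (batch=32, timesteps=10, features=8)
layer = tf.keras.layers.SimpleRNN(16)            # 16 hidden units
print(layer(x).shape)                            # (32, 16): one final hidden state per sample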
28. Why are Bidirectional RNNs (BiRNNs) useful?
A. They train faster than standard RNNs
B. They allow the network to have context from both the past and the future
C. They use fewer parameters
D. They eliminate the need for backpropagation
Correct Answer: They allow the network to have context from both the past and the future
Explanation: BiRNNs process the sequence in both forward and backward directions, providing the current state with context from both preceding and succeeding words.
29. In a Many-to-One sequence model (e.g., Sentiment Analysis), where is the output typically taken?
A. At every time step
B. At the first time step
C. At the last time step
D. Randomly sampled
Correct Answer: At the last time step
Explanation: For Many-to-One tasks like classification, the final hidden state after processing the whole sequence is used to generate the single output label.
30. What does the 'candidate cell state' in an LSTM do?
A. It decides what to forget
B. It proposes new values that could be added to the state
C. It outputs the final prediction
D. It clears the memory
Correct Answer: It proposes new values that could be added to the state
Explanation: The candidate cell state (computed with a Tanh layer) is a vector of new candidate values that might be added to the cell state, regulated by the input gate.
31. Which issue leads to the 'Exploding Gradient' problem?
Correct Answer: Derivatives larger than 1 being multiplied repeatedly during backpropagation
Explanation: If the derivatives are larger than 1, repeated multiplication during backpropagation causes the gradients to grow exponentially, leading to instability.
32. A solution to the Exploding Gradient problem is:
A. Gradient Clipping
B. Using ReLU
C. Increasing the learning rate
D. Removing the hidden layer
Correct Answer: Gradient Clipping
Explanation: Gradient Clipping involves capping the gradients at a specific threshold value to prevent them from becoming too large during training.
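In tf.keras, clipping can be requested directly on the optimizer; the learning rate and threshold below are example values, not recommendations.

import tensorflow as tf

# clipnorm rescales each gradient tensor so its L2 norm never exceeds 1.0;
# clipvalue=0.5 would instead cap every individual gradient component at +/-0.5.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)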
33. In an Attention model, the vector c_t is often referred to as:
A. The Forget Vector
B. The Context Vector
C. The Bias Vector
D. The Noise Vector
Correct Answer: The Context Vector
Explanation: The context vector c_t is the dynamic summary of the input sequence tailored for the specific decoding time step t via attention weights.
34. Sequence-to-Sequence models are most commonly associated with:
A. Image Segmentation
B. Text Summarization
C. Linear Regression
D. Cluster Analysis
Correct Answer: Text Summarization
Explanation: Text summarization transforms an input sequence (text) into an output sequence (summary), fitting the Seq2Seq paradigm perfectly.
35. What is 'Global Attention'?
A. Attention applied to a single word
B. Attention that considers all hidden states of the encoder
C. Attention that considers only a window of hidden states
D. Attention applied without weights
Correct Answer: Attention that considers all hidden states of the encoder
Explanation: Global attention computes the alignment scores against all encoder hidden states, as opposed to Local attention which focuses on a small window.
36. Which of the following describes 'Greedy Decoding'?
A. Choosing the word with the highest probability at each step immediately
B. Considering all possible future sequences
C. Choosing a random word based on distribution
D. Waiting until the end to choose words
Correct Answer: Choosing the word with the highest probability at each step immediately
Explanation: Greedy decoding selects the token with the highest probability at the current step without considering how this choice impacts future probabilities.
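Greedy decoding reduces to an argmax at every step. A toy sketch over assumed per-step probability distributions:

import numpy as np

step_probs = np.array([[0.1, 0.6, 0.2, 0.1],   # assumed distributions over a 4-token vocabulary
                       [0.3, 0.3, 0.4, 0.0],
                       [0.7, 0.1, 0.1, 0.1]])
print(step_probs.argmax(axis=1))               # [1 2 0]: best token per step, with no look-ahead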
37. In POS tagging, if a word is ambiguous (e.g., 'book' can be a noun or verb), how does an RNN resolve it?
A. It flips a coin
B. It uses the context provided by surrounding words stored in the hidden state
C. It always picks the most common usage
D. It cannot resolve ambiguity
Correct Answer: It uses the context provided by surrounding words stored in the hidden state
Explanation: The RNN's hidden state captures context (e.g., 'read a book' vs 'book a flight'), allowing it to disambiguate based on the surrounding sequence.
38. What is the typical loss function for a multi-class classification problem like POS Tagging or NMT?
A. Mean Squared Error (MSE)
B. Categorical Cross-Entropy
C. Hinge Loss
D. Absolute Error
Correct Answer: Categorical Cross-Entropy
Explanation: Cross-entropy loss is the standard for multi-class classification tasks where the model outputs a probability distribution over the vocabulary/tags.
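For a single time step, categorical cross-entropy is just the negative log-probability assigned to the true class. A toy numeric sketch with an assumed 3-tag distribution:

import numpy as np

p = np.array([0.1, 0.7, 0.2])   # assumed predicted distribution over 3 tags
true_tag = 1                    # index of the correct tag
print(-np.log(p[true_tag]))     # ~0.357: small when the model is confident and correct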
39. What does GRU stand for?
A. Gated Recurrent Unit
B. General Regression Unit
C. Global Recurrent Update
D. Gradient Rectified Unit
Correct Answer: Gated Recurrent Unit
Explanation: GRU is a simplified variation of LSTM that merges the cell state and hidden state and uses fewer gates.
40. The 'Input Gate' in an LSTM is usually controlled by which activation function?
A. Tanh
B. Sigmoid
C. ReLU
D. Linear
Correct Answer: Sigmoid
Explanation: The gate controller uses a Sigmoid function to output values between 0 (block) and 1 (pass), determining how much new information enters the state.
41. What visual tool is often used to interpret what an Attention model has learned?
A. Pie Chart
B. Attention Heatmap
C. Scatter Plot
D. Histogram
Correct Answer: Attention Heatmap
Explanation: Heatmaps visualize the alignment matrix, showing which input words correspond strongly to which output words (e.g., in translation).
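A minimal matplotlib sketch of such a heatmap, using a random matrix as a stand-in for a learned alignment matrix:

import numpy as np
import matplotlib.pyplot as plt

attn = np.random.rand(4, 5)        # rows: output (target) words, columns: input (source) words
plt.imshow(attn, cmap="viridis")
plt.xlabel("input (source) position")
plt.ylabel("output (target) position")
plt.colorbar()
plt.show()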
42. Which limitation of RNNs prevents parallelization during training?
A. Large memory footprint
B. Sequential dependency of the hidden state
C. Use of sigmoid functions
D. Complex loss functions
Correct Answer: Sequential dependency of the hidden state
Explanation: Because the state at time t depends on the state at time t-1, computations must happen sequentially, preventing parallel processing across time steps on GPUs.
43. In a sequence model, 'padding' is used to:
A. Increase the learning rate
B. Make all sequences in a batch the same length
C. Remove stopwords
D. Add noise to the data
Correct Answer: Make all sequences in a batch the same length
Explanation: Since neural networks require fixed-size tensor inputs for batching, shorter sequences are padded (usually with zeros) to match the length of the longest sequence.
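A minimal sketch with Keras' pad_sequences utility on toy integer sequences:

import tensorflow as tf

batch = [[5, 8, 2], [7, 1], [4, 9, 3, 6]]
print(tf.keras.preprocessing.sequence.pad_sequences(batch, padding="post"))
# [[5 8 2 0]
#  [7 1 0 0]
#  [4 9 3 6]]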
44. Which of these is a 'many-to-one' application of sequence models?
A. Machine Translation
B. Video Captioning
C. Music Generation
D. Sentiment Classification
Correct Answer: Sentiment Classification
Explanation: The input is a sequence of words (many); the output is a single sentiment label (one).
45. In an NER task, identifying 'Apple' as an Organization rather than a Fruit relies on:
A. The spelling of the word
B. The capitalization
C. The contextual information in the sequence
D. The length of the word
Correct Answer: The contextual information in the sequence
Explanation: While capitalization helps, the context (e.g., 'Apple released a new phone' vs 'I ate an apple') is crucial for correct entity classification.
46. Why is the traditional Encoder-Decoder model often described as having 'amnesia'?
A. It forgets the weights after training
B. It struggles to retain information from the beginning of a long sequence at the decoding stage
C. It cannot learn new words
D. It uses a forget gate
Correct Answer: It struggles to retain information from the beginning of a long sequence at the decoding stage
Explanation: Standard Seq2Seq models rely on the final hidden state. For long sentences, information from the start of the sentence is often diluted or 'forgotten' by the time the vector is formed.
47. In the attention equation score(h_t, h_s), what are h_t and h_s?
A. Input and Output Gates
B. Decoder hidden state and Encoder hidden state
C. Weight and Bias
D. Learning rate and Loss
Correct Answer: Decoder hidden state and Encoder hidden state
Explanation: The score function calculates the similarity/compatibility between the current decoder state (h_t) and a specific encoder state (h_s).
48. Which mechanism allows a model to focus on 'local' parts of the input sequence based on the current decoding step?
A. Max Pooling
B. Attention Mechanism
C. Dropout
D. Batch Normalization
Correct Answer: Attention Mechanism
Explanation: Attention allows the model to dynamically weigh and focus on specific parts (local areas) of the input sequence relevant to the current prediction.
49. In sequence labeling, what does the output layer usually consist of?
A. A single neuron
B. A Softmax layer over the tag set for each time step
C. A linear regression layer
D. A clustering algorithm
Correct Answer: A Softmax layer over the tag set for each time step
Explanation: For tasks like POS or NER, the model outputs a probability distribution over all possible tags for every word in the sequence.
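A minimal tf.keras sketch of a sequence-labeling model whose output is a softmax over the tag set at every time step; the vocabulary size, tag count, and layer widths are assumptions for illustration.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Dense(20, activation="softmax"),   # 20 assumed tags, applied per time step
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")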
50. Which of the following best describes the 'Seq2Seq' mapping?