Unit 3 - Subjective Questions
CSE472 • Practice Questions with Detailed Answers
Explain the characteristics of sequential text data and discuss why traditional feedforward neural networks struggle to process it effectively.
Characteristics of Sequential Text Data:
- Temporal/Sequential Ordering: In text, the order of words carries meaning (e.g., "Dog bites man" vs. "Man bites dog"). Sequential data has a defined order where current data points depend on previous ones.
- Variable Length: Sentences, paragraphs, and documents naturally come in varying lengths, unlike fixed-size inputs required by traditional models.
- Contextual Dependencies: Words often depend on distant words in the same sequence to resolve ambiguities (e.g., pronouns referring back to a noun mentioned earlier).
Why Feedforward Neural Networks Struggle:
- Fixed Input Size: Feedforward networks require a fixed-dimensional input vector, making it difficult to process sequences of arbitrary lengths.
- No Memory of Past Inputs: They process each input independently. There is no mechanism to maintain state or "memory" of previous inputs in the sequence, which is essential for understanding text.
- Parameter Inefficiency: If we try to feed a whole sequence by concatenating words, the model would require an enormous number of parameters and would not share learned features across different positions in the sequence.
Describe the architecture of a standard Recurrent Neural Network (RNN). Provide the mathematical equations for computing the hidden state and the output at a given time step $t$.
Architecture of a Standard RNN:
A Recurrent Neural Network (RNN) is designed to handle sequential data by introducing a hidden state that acts as a memory. At each time step $t$, the RNN takes an input $x_t$ and the hidden state $h_{t-1}$ from the previous time step to produce a new hidden state $h_t$ and, optionally, an output $y_t$. The same weights are shared across all time steps.
Mathematical Equations:
Hidden State Update:
The hidden state at time $t$ is computed as:
$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
Where:
- $W_{hh}$ is the weight matrix for the hidden state.
- $W_{xh}$ is the weight matrix for the input.
- $b_h$ is the bias term for the hidden state.
- $f$ is an activation function, typically $\tanh$ or ReLU.
Output Computation:
The output at time $t$ (if required by the task) is computed as:
$$y_t = g(W_{hy} h_t + b_y)$$
Where:
- $W_{hy}$ is the weight matrix for the output.
- $b_y$ is the bias term for the output.
- $g$ is an activation function like softmax (for classification).
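The hidden state update and the output computation can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the parameter names (`W_xh`, `W_hh`, `W_hy`, `b_h`, `b_y`), the layer sizes, and the choice of tanh with a softmax output are assumptions made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h, W_hy, b_y):
    """One time step of a vanilla RNN: tanh hidden update, softmax output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    logits = W_hy @ h_t + b_y
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    y_t = exp / exp.sum()
    return h_t, y_t

rng = np.random.default_rng(0)
input_dim, hidden_dim, num_classes = 4, 3, 2  # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_hy = rng.normal(scale=0.1, size=(num_classes, hidden_dim))
b_y = np.zeros(num_classes)

h = np.zeros(hidden_dim)                      # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):   # a sequence of 5 input vectors
    h, y = rnn_step(x_t, h, W_xh, W_hh, b_h, W_hy, b_y)
```

The same five parameter arrays are reused at every iteration of the loop, which is what weight sharing across time steps means in practice.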
Explain the vanishing and exploding gradient problems in traditional RNNs. How do they affect the learning of long-term dependencies in sequence modeling?
Vanishing and Exploding Gradient Problems:
During training, RNNs use the Backpropagation Through Time (BPTT) algorithm, which computes gradients by applying the chain rule repeatedly, backwards through time.
- Vanishing Gradients: If the weights in the network are small (specifically, if the largest singular value of the recurrent weight matrix is less than 1), the gradients shrink exponentially as they are propagated backward through time.
- Exploding Gradients: Conversely, if the weights are large (largest singular value greater than 1), the gradients can grow exponentially, leading to numerical instability and NaN values.
Effect on Long-Term Dependencies:
Because of the vanishing gradient problem, the error signal from later time steps has almost faded by the time it reaches earlier time steps, so the shared weights receive little information about how early inputs influence late outputs. Consequently, the RNN struggles to learn connections between distant elements in a sequence (long-term dependencies). For example, in the sentence "I grew up in France... I speak fluent [French]", a standard RNN might fail to connect "France" to "French" if the gap is too long.
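The exponential shrinking and growth can be seen in a small numerical sketch. It makes a simplifying assumption: the recurrence is treated as purely linear (the activation derivative is ignored), and the recurrent matrix is built so that every singular value equals `scale`. The function name and sizes are illustrative.

```python
import numpy as np

def backprop_norms(scale, steps=50, dim=8, seed=0):
    """Norm of a gradient vector after repeated multiplication by W^T,
    where every singular value of W equals `scale`."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
    W = scale * Q
    grad = np.ones(dim) / np.sqrt(dim)                # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        grad = W.T @ grad                             # one step back in time
        norms.append(float(np.linalg.norm(grad)))
    return norms

vanishing = backprop_norms(0.9)   # norm after 50 steps is 0.9**50, about 0.005
exploding = backprop_norms(1.1)   # norm after 50 steps is 1.1**50, about 117
```

With a real tanh RNN the activation derivative (at most 1) shrinks the gradient further, so vanishing is the more common failure in practice.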
Detail the architecture of a Long Short-Term Memory (LSTM) network. Explain the role of the forget, input, and output gates along with their respective mathematical equations.
LSTM Architecture:
LSTMs are an advanced RNN architecture designed to overcome the vanishing gradient problem. They achieve this using a "cell state" ($C_t$) that runs straight down the entire chain with only minor linear interactions, and a hidden state ($h_t$). Information flow is regulated by three gates: forget, input, and output.
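1. Forget Gate: Decides what information to throw away from the cell state.
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
2. Input Gate: Decides what new information to store in the cell state. It has two parts: a sigmoid layer deciding which values to update ($i_t$), and a tanh layer creating candidate values ($\tilde{C}_t$).
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
Cell State Update: The old cell state $C_{t-1}$ is updated to the new cell state $C_t$.
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
3. Output Gate: Decides what to output based on the filtered cell state.
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$
(Note: $\sigma$ represents the sigmoid activation function, $\odot$ denotes element-wise multiplication, and $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input.)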
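The three gates and the cell state update can be sketched as a single NumPy step. This assumes the common formulation in which each gate applies its own weight matrix to the concatenation of the previous hidden state and the current input; the dictionary keys (`W_f`, `W_i`, `W_c`, `W_o` and matching biases) and the sizes are illustrative names, not taken from a particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p holds one weight matrix and bias per gate."""
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(p["W_f"] @ concat + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_i"] @ concat + p["b_i"])      # input gate
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])  # candidate values
    c_t = f_t * c_prev + i_t * c_tilde               # additive cell state update
    o_t = sigmoid(p["W_o"] @ concat + p["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                         # filtered cell state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # illustrative sizes
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x_t, h, c, params)
```

The line computing `c_t` is the only place where the old cell state enters, and it does so through an element-wise product and a sum rather than through a squashing nonlinearity.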
Compare and contrast traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
Comparison between RNNs and LSTMs:
Architecture Complexity:
- RNNs: Have a very simple repeating module, typically consisting of a single $\tanh$ or ReLU layer.
- LSTMs: Have a much more complex repeating module containing four interacting layers (three sigmoid gates and one $\tanh$ layer).
Memory Mechanism:
- RNNs: Only possess a hidden state ($h_t$), which is heavily overwritten at each time step, making it hard to retain old information.
- LSTMs: Introduce a separate cell state ($C_t$) that acts as an "information highway," allowing gradients to flow largely unchanged while the forget gate stays close to 1, coupled with a hidden state ($h_t$).
Handling of Gradients:
- RNNs: Highly susceptible to the vanishing and exploding gradient problems, and therefore often fail to learn long-term dependencies.
- LSTMs: The additive nature of the cell state update mitigates the vanishing gradient problem, allowing the network to capture long-term dependencies far more effectively.
Computational Cost:
- RNNs: Faster to train per epoch due to fewer parameters.
- LSTMs: Computationally heavier and require more memory due to the multiple gates and weight matrices.
Explain the architecture of a Gated Recurrent Unit (GRU). Provide the mathematical formulations for its update and reset gates.
Gated Recurrent Unit (GRU) Architecture:
A GRU is a streamlined variation of the LSTM. It aims to solve the vanishing gradient problem but uses fewer parameters. Unlike LSTMs, GRUs do not have a separate cell state; they merge the cell state and hidden state into a single hidden state ($h_t$). They use two gates instead of three: the update gate and the reset gate.
Mathematical Formulations:
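1. Update Gate ($z_t$):
Determines how much of the past knowledge needs to be passed along into the future. It acts similarly to a combination of the forget and input gates in an LSTM.
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$
2. Reset Gate ($r_t$):
Determines how much of the past information to forget.
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$
3. Candidate Hidden State ($\tilde{h}_t$):
Computes the new candidate memory content, utilizing the reset gate to drop irrelevant past information.
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$
4. Final Hidden State Update ($h_t$):
Interpolates between the previous hidden state and the new candidate state using the update gate.
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
(Note: some references swap the roles of $z_t$ and $1 - z_t$ in the final update; the two conventions are equivalent up to relabeling the gate.)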
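A single GRU step can be sketched in NumPy as below. The sketch assumes the convention in which the update gate weights the previous hidden state and its complement weights the candidate (some references use the opposite labeling). The parameter names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step: two gates, one candidate, one interpolation."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(p["W_z"] @ concat + p["b_z"])        # update gate
    r_t = sigmoid(p["W_r"] @ concat + p["b_r"])        # reset gate
    gated = np.concatenate([r_t * h_prev, x_t])        # reset applied to the past state
    h_tilde = np.tanh(p["W_h"] @ gated + p["b_h"])     # candidate hidden state
    return z_t * h_prev + (1.0 - z_t) * h_tilde        # interpolate old and new

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # illustrative sizes
params = {}
for name in "zrh":
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{name}"] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h = gru_step(x_t, h, params)
```

Under this convention, an update gate saturated near 1 copies the previous hidden state forward almost unchanged, which is how a GRU can carry information across many steps.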
Compare LSTMs and GRUs. In what scenarios might one be preferred over the other for sequence modeling?
LSTMs vs. GRUs:
Similarities:
- Both are designed to mitigate the vanishing gradient problem of standard RNNs.
- Both use gating mechanisms to control the flow of information through time.
Differences:
- Gates: LSTMs have three gates (Input, Forget, Output), whereas GRUs have two gates (Update, Reset).
- State: LSTMs maintain an internal cell state ($C_t$) separate from the exposed hidden state ($h_t$). GRUs merge these into a single hidden state ($h_t$).
- Complexity: GRUs have fewer tensor operations and parameters, making them faster to train and less memory-intensive.
When to Prefer Which:
- Prefer GRU: When computational resources are limited, fast training is desired, or the dataset is relatively small (less prone to overfitting due to fewer parameters).
- Prefer LSTM: When dealing with very complex sequences where strictly separating the memory (cell state) from the output (hidden state) is beneficial, and when plenty of training data and computational resources are available.
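The difference in size can be made concrete with a rough parameter count. The sketch assumes each gate or candidate layer has one weight matrix over the concatenated previous hidden state and input, plus one bias vector; the function name and the example dimensions are illustrative.

```python
def recurrent_param_count(input_dim, hidden_dim, num_blocks):
    """Parameters of a recurrent layer made of `num_blocks` gate/candidate layers,
    each with a (hidden_dim x (hidden_dim + input_dim)) matrix and a bias vector."""
    per_block = hidden_dim * (hidden_dim + input_dim) + hidden_dim
    return num_blocks * per_block

input_dim, hidden_dim = 300, 128  # e.g. 300-dimensional embeddings, 128 hidden units
rnn_params = recurrent_param_count(input_dim, hidden_dim, 1)   # 54912
gru_params = recurrent_param_count(input_dim, hidden_dim, 3)   # 164736
lstm_params = recurrent_param_count(input_dim, hidden_dim, 4)  # 219648
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width. Some libraries keep separate input and recurrent bias vectors, so the counts they report can be slightly higher.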
What are Bidirectional RNNs? Explain their architecture and how they capture context from both past and future states in sequential text data.
Bidirectional RNNs (BiRNNs):
A Bidirectional RNN is an architecture designed to capture sequential context from both the past (left-to-right) and the future (right-to-left) simultaneously.
Architecture & Mechanism:
- A BiRNN consists of two independent RNN layers (which can be standard RNNs, LSTMs, or GRUs).
- Forward RNN: Processes the sequence in the forward direction from time step $1$ to $T$, producing a sequence of forward hidden states $\overrightarrow{h}_1, \dots, \overrightarrow{h}_T$.
- Backward RNN: Processes the sequence in the reverse direction from time step $T$ to $1$, producing a sequence of backward hidden states $\overleftarrow{h}_T, \dots, \overleftarrow{h}_1$.
- Combination: At any given time step $t$, the forward and backward hidden states are concatenated (or sometimes added) to form the final hidden state representation for that token: $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$
Why it is effective:
In natural language, the meaning of a word often depends on the words that follow it, not just the words that precede it. BiRNNs ensure that the representation $h_t$ of each token contains contextual information from the entire sequence.
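The forward pass, backward pass, and concatenation can be sketched with two small tanh RNNs in NumPy. The helper names `run_rnn` and `birnn`, the weight shapes, and the random weights are assumptions for illustration only.

```python
import numpy as np

def run_rnn(inputs, W_xh, W_hh, b_h):
    """Run a tanh RNN over `inputs` and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    return np.stack(states)

def birnn(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states for each time step."""
    forward = run_rnn(inputs, *fwd_params)
    backward = run_rnn(inputs[::-1], *bwd_params)[::-1]  # flip back to input order
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
T, D, H = 5, 4, 3  # sequence length, input size, hidden size (illustrative)

def make_params():
    return (rng.normal(scale=0.5, size=(H, D)),
            rng.normal(scale=0.5, size=(H, H)),
            np.zeros(H))

fwd, bwd = make_params(), make_params()
inputs = rng.normal(size=(T, D))
out = birnn(inputs, fwd, bwd)  # shape (T, 2 * H)
```

Row 0 of `out` already depends on the last input through its backward half, which a forward-only RNN cannot achieve.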
Discuss a specific NLP application where Bidirectional RNNs significantly outperform unidirectional RNNs, and explain the reasoning behind this.
Application: Named Entity Recognition (NER)
Named Entity Recognition is the task of identifying and classifying entities (like Person, Organization, Location) within a text.
Why BiRNNs Outperform Unidirectional RNNs in NER:
- In a unidirectional RNN, the model only has access to the words prior to the current word.
- Consider the sentence: "Washington is a beautiful place to live." vs. "Washington went to the store."
- When processing the word "Washington", a forward-only RNN has no prior context to determine if it's a Location or a Person.
- A Bidirectional RNN, however, looks ahead. In the first sentence, it sees "is a beautiful place", strongly hinting at a Location. In the second, it sees "went to the store", hinting at a Person.
- By capturing context from both the left (past) and the right (future) simultaneously, BiRNNs provide a much richer representation, resulting in significantly higher accuracy for token-level classification tasks like NER or Part-of-Speech (POS) tagging.
Describe how sequence models can be applied to text classification tasks. Explain the typical pipeline from raw text to class probabilities.
Sequence Models for Text Classification:
Text classification involves assigning a predefined category to a sequence of text (e.g., spam detection, topic categorization).
Typical Pipeline:
- Tokenization: The raw text is split into smaller units (tokens), such as words or subwords.
- Embedding Layer: Each token is mapped to a dense, fixed-size vector representation (using pre-trained embeddings like Word2Vec/GloVe or learned end-to-end). This converts the text into a sequence of vectors: $x_1, x_2, \dots, x_T$.
- Sequence Modeling (RNN/LSTM/GRU): The sequence of vectors is fed into the sequence model one time step at a time. The model updates its hidden state at each step.
- Aggregation/Pooling: To classify the whole sequence, we need a single vector representation. Commonly, the final hidden state is used, as it theoretically summarizes the entire sequence. Alternatively, max-pooling or average-pooling across all hidden states can be used.
- Fully Connected Layer & Output: The aggregated sequence vector is passed through a dense (feedforward) layer followed by a softmax (for multi-class) or sigmoid (for binary) activation function to output the final class probabilities.
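The pipeline steps above can be sketched end to end in NumPy. Everything here is an illustrative assumption: the toy vocabulary, whitespace tokenization, random untrained weights, and the choice of the final hidden state as the sequence summary. The output is therefore a valid probability vector but not a meaningful prediction.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(text, vocab, E, W_xh, W_hh, b_h, W_out, b_out):
    tokens = text.lower().split()                             # tokenization
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]  # token -> index
    h = np.zeros(W_hh.shape[0])
    for i in ids:
        x_t = E[i]                                            # embedding lookup
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)              # sequence modeling
    return softmax(W_out @ h + b_out)                         # final state -> dense -> softmax

vocab = {"<unk>": 0, "free": 1, "prize": 2, "meeting": 3, "tomorrow": 4}  # toy vocabulary
rng = np.random.default_rng(0)
embed_dim, hidden_dim, num_classes = 8, 6, 2
E = rng.normal(size=(len(vocab), embed_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_out = rng.normal(scale=0.1, size=(num_classes, hidden_dim))
b_out = np.zeros(num_classes)

probs = classify("Free prize tomorrow", vocab, E, W_xh, W_hh, b_h, W_out, b_out)
```

Replacing the final hidden state with a mean or max over all hidden states would give the pooling variants mentioned in the aggregation step.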
How is sentiment classification formulated as a sequence modeling problem? Discuss the architecture choices for an RNN-based sentiment classifier.
Formulation:
Sentiment classification is formulated as a "Many-to-One" sequence modeling problem. The input is a sequence of words (a sentence or a document), and the output is a single discrete label indicating the sentiment polarity (e.g., Positive, Negative, Neutral).
Architecture Choices for an RNN-based Sentiment Classifier:
- Input Representation: Words are converted into embeddings (e.g., 300-dimensional GloVe vectors) to capture semantic meaning.
- Sequence Processing:
- A Bidirectional LSTM (BiLSTM) is highly recommended. It captures context from both directions, which is crucial for sentiment (e.g., "The movie was not good at all" - "not" negates "good").
- Context Aggregation:
- Take the final hidden state of the forward and backward passes and concatenate them.
- Alternatively, use an Attention Mechanism over all the RNN hidden states to let the model focus on the most sentiment-bearing words (like "amazing" or "terrible").
- Classifier Head: The summarized context vector is passed through one or two dense layers with Dropout for regularization, ending with a softmax/sigmoid layer to output the sentiment probability.
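The attention option for context aggregation can be sketched as a weighted average of hidden states. This is a deliberately simplified form, assuming a single learned scoring vector `v` (a hypothetical parameter name) and dot-product scores; practical attention layers often add a small feedforward network before scoring.

```python
import numpy as np

def attention_pool(hidden_states, v):
    """Weighted average of hidden states; weights come from a softmax over scores."""
    scores = hidden_states @ v                 # one scalar score per time step
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()          # softmax over time steps
    context = weights @ hidden_states          # weighted sum of hidden states
    return context, weights

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(7, 6))        # 7 time steps, 6 hidden units (illustrative)
v = rng.normal(size=6)                         # scoring vector, learned in a real model
context, weights = attention_pool(hidden_states, v)
```

Time steps with larger scores contribute more to `context`, which is how the classifier can emphasize sentiment-bearing words. With all scores equal, the result reduces to plain average pooling.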
What is Teacher Forcing in the context of training sequence models? Explain its advantages and potential drawbacks.
Teacher Forcing:
Teacher forcing is a training strategy used in sequence-to-sequence models (or autoregressive sequence generation). Instead of feeding the model's own predicted output from time step $t-1$ as the input to time step $t$, teacher forcing provides the actual ground truth target $y_{t-1}$ from the training dataset as the input for time step $t$.
Advantages:
- Faster Convergence: By feeding the correct sequence at every step, the model learns the correct mapping faster. Without it, early in training, the model would predict garbage, feed that garbage back into itself, and compound errors, making learning extremely slow.
- Stability: It prevents early mistakes from cascading through the sequence and destroying the gradient signal.
Drawbacks (Exposure Bias):
- Mismatch between Training and Inference: During inference (testing), the ground truth is not available, so the model must rely on its own previous predictions. Because it was never exposed to its own mistakes during training, it can become brittle and compound errors severely when generating sequences in the real world.
Explain the concept of Exposure Bias in sequence training. How is it related to Teacher Forcing, and how can it be mitigated?
Exposure Bias:
Exposure bias occurs in autoregressive sequence generation models when there is a discrepancy between how the model is trained and how it is used during inference.
Relation to Teacher Forcing:
During training with Teacher Forcing, the model is always given the true previous token to predict the next token. However, during inference, the true tokens are unavailable, so the model must consume its own previously generated tokens. Because the model was never exposed to its own incorrect predictions during training, a single mistake during inference can send the model into an unseen state space, leading to a cascade of errors. This over-reliance on perfect past context is exposure bias.
Mitigation Strategies:
- Scheduled Sampling: Gradually transition from using true previous tokens (Teacher Forcing) to using the model's own predictions as training progresses.
- Sequence-level Training (Reinforcement Learning): Train the model using metrics like BLEU or ROUGE over the entire generated sequence, allowing the model to explore and learn from its own generated sequences.
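The choice between the true previous token and the model's own prediction can be sketched as a single probability. With probability 1.0 this is pure teacher forcing, with 0.0 the model always consumes its own output, and scheduled sampling lowers the probability over the course of training. The function `decoder_inputs` and the stand-in predictor `always_the` are hypothetical names used only for this illustration.

```python
import random

def decoder_inputs(targets, predict_next, teacher_forcing_prob, start_token="<s>", seed=0):
    """Return the token fed to the decoder at each step."""
    rng = random.Random(seed)
    inputs, prev = [], start_token
    for target in targets:
        inputs.append(prev)
        predicted = predict_next(prev)
        use_truth = rng.random() < teacher_forcing_prob
        prev = target if use_truth else predicted  # next step's input
    return inputs

def always_the(prev_token):
    """Hypothetical stand-in for a poorly trained decoder step."""
    return "the"

targets = ["the", "cat", "sat"]
forced = decoder_inputs(targets, always_the, 1.0)  # ['<s>', 'the', 'cat']
free = decoder_inputs(targets, always_the, 0.0)    # ['<s>', 'the', 'the']
```

In the free-running case the model's own mistake ("the" instead of "cat") becomes its next input, which is exactly the situation it never sees under pure teacher forcing.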
Describe the Backpropagation Through Time (BPTT) algorithm used in sequence training. What are its computational challenges?
Backpropagation Through Time (BPTT):
BPTT is the standard algorithm for training Recurrent Neural Networks. It is an extension of standard backpropagation applied to sequential data.
- The RNN is "unrolled" through time, converting it into a deep feedforward network where each time step corresponds to a layer.
- The forward pass computes the outputs and loss for all time steps.
- In the backward pass, gradients are computed by applying the chain rule starting from the last time step back to the first. Gradients for the shared weights are summed across all time steps.
Computational Challenges:
- Memory Intensive: BPTT requires storing the hidden states and inputs for every time step in the sequence to compute the gradients. For long sequences (e.g., thousands of words), memory use grows linearly with sequence length and can become prohibitive.
- Vanishing/Exploding Gradients: Because the chain rule multiplies by the recurrent weight matrix once per time step, gradients tend to either shrink toward zero (preventing learning) or grow without bound (causing instability).
- Slow Processing: The entire sequence must be processed before a single weight update can occur.
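The unrolled forward pass and the summed backward pass can be sketched for a scalar RNN. The model here is an assumption chosen to keep the arithmetic visible: one hidden unit, a tanh activation, weights `w` (recurrent) and `u` (input), and a squared-error loss on the final hidden state only.

```python
import math

def forward(w, u, xs):
    """Unroll the scalar RNN and keep every hidden state (needed for the backward pass)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def bptt(w, u, xs, y):
    """Loss 0.5 * (h_T - y)^2 and its gradients with respect to the shared weights."""
    hs = forward(w, u, xs)
    loss = 0.5 * (hs[-1] - y) ** 2
    dw = du = 0.0
    dh = hs[-1] - y                      # dLoss / dh_T
    for t in range(len(xs), 0, -1):      # walk from the last step back to the first
        da = dh * (1.0 - hs[t] ** 2)     # through tanh
        dw += da * hs[t - 1]             # shared weight: contributions are summed
        du += da * xs[t - 1]
        dh = da * w                      # pass the gradient to h_{t-1}
    return loss, dw, du

loss, dw, du = bptt(0.5, 0.8, [1.0, -0.5, 0.3, 0.9], 0.2)
```

The list `hs` holds one entry per time step, which is the memory cost described above, and the repeated `dh = da * w` is the repeated multiplication that makes gradients vanish or explode.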
Explain Truncated Backpropagation Through Time (TBPTT). How does it solve the challenges of standard BPTT while maintaining the ability to learn sequences?
Truncated Backpropagation Through Time (TBPTT):
TBPTT is a modified version of the BPTT algorithm designed to handle very long sequences efficiently.
Mechanism:
Instead of unrolling the RNN for the entire length of a massive sequence, the sequence is divided into smaller chunks (or "windows") of a fixed length $k$.
- Forward Pass: The RNN processes a chunk of $k$ time steps. The hidden state at the end of the chunk is saved and passed as the initial hidden state for the next chunk.
- Backward Pass: Once the forward pass for the chunk is complete, backpropagation is performed, but the gradients are only propagated backward for a fixed number of steps (often $k$).
How it Solves BPTT Challenges:
- Memory Efficiency: Since gradients are only tracked for $k$ steps, memory usage is strictly bounded and much lower than full BPTT.
- Faster Weight Updates: Weights are updated after every chunk, rather than waiting for the entire document to finish, speeding up convergence.
- Mitigates Gradient Issues: Cutting off the backward flow after a fixed number of steps bounds how many times the gradient is multiplied by the recurrent weights, so it cannot explode or vanish over arbitrarily long spans. This does not remove the underlying problem within a chunk; it only limits how far it can compound.
Trade-off: The model cannot easily learn dependencies that span longer than the truncation length $k$.
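The chunking mechanism can be sketched without an autodiff library by showing only where the truncation and the weight update would occur. The scalar `step` function, its fixed weights, and the name `run_truncated` are illustrative assumptions; the comments mark what a real training loop would do at each point.

```python
import math

def step(x_t, h, w=0.5, u=0.8):
    """Scalar tanh RNN step; w and u are illustrative fixed weights."""
    return math.tanh(w * h + u * x_t)

def run_truncated(sequence, k, h0=0.0):
    """Process `sequence` in chunks of length k, carrying the hidden state across chunks."""
    h, chunk_lengths = h0, []
    for start in range(0, len(sequence), k):
        chunk = sequence[start:start + k]
        # Truncation point: an autodiff framework would treat h as a constant here,
        # so gradients from this chunk never flow into earlier chunks.
        for x_t in chunk:
            h = step(x_t, h)
        chunk_lengths.append(len(chunk))
        # A backward pass over this chunk and a weight update would go here.
    return h, chunk_lengths

sequence = [0.1 * i for i in range(10)]
final_h, lengths = run_truncated(sequence, k=4)  # lengths == [4, 4, 2]
```

The forward computation is identical to processing the whole sequence at once; only the backward pass is shortened, which is why the hidden state can still carry information across chunk boundaries.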
Discuss the evaluation metrics commonly used for text classification tasks. Provide the mathematical formulas for Precision, Recall, and F1-Score.
Evaluation Metrics for Text Classification:
When classifying text (e.g., identifying spam, sentiment analysis), accuracy alone can be misleading, especially with imbalanced datasets. Therefore, Precision, Recall, and F1-score are heavily utilized.
Let:
- TP = True Positives
- TN = True Negatives
- FP = False Positives
- FN = False Negatives
1. Precision: Measures the accuracy of the positive predictions. Out of all instances the model predicted as positive, how many were actually positive?
$$\text{Precision} = \frac{TP}{TP + FP}$$
2. Recall (Sensitivity): Measures the ability of the model to find all the positive instances. Out of all actual positive instances, how many did the model correctly identify?
$$\text{Recall} = \frac{TP}{TP + FN}$$
3. F1-Score: The harmonic mean of Precision and Recall. It provides a single metric that balances both concerns, which is especially useful when you need to strike a balance between false positives and false negatives.
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
4. Accuracy: The ratio of correctly predicted observations to the total observations.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
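The four metrics can be computed directly from the confusion-matrix counts. The function name and the example counts below are made up for illustration; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Illustrative counts: 40 TP, 45 TN, 10 FP, 5 FN
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
```

For these counts precision is 40/50, recall is 40/45, F1 is 80/95, and accuracy is 85/100, so the F1 score sits between precision and recall, closer to the lower of the two.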
Explain the Perplexity metric used in sequence modeling. How is it related to cross-entropy loss, and what does a lower perplexity indicate?
Perplexity in Sequence Modeling:
Perplexity is a standard evaluation metric for language models (which are generative sequence models). It measures how well a probability model predicts a sample. Intuitively, it can be read as the effective number of equally likely words the model is choosing among at each time step.
Mathematical Formulation:
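For a sequence of $N$ words, $W = (w_1, w_2, \dots, w_N)$, perplexity is the inverse probability of the test set, normalized by the number of words:
$$PP(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}}$$
Relationship to Cross-Entropy Loss:
Perplexity is mathematically equivalent to the exponentiation of the categorical cross-entropy loss (calculated using base $e$).
If $H$ is the cross-entropy loss per word:
$$H = -\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_1, \dots, w_{i-1}), \qquad PP(W) = e^{H}$$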
What a lower perplexity indicates:
A lower perplexity means the model assigns a higher probability to the true sequence of words. It indicates that the model is "less surprised" by the sequence, meaning it has learned a better representation of the language. Therefore, a lower perplexity score signifies a better-performing language model.
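The link between cross-entropy and perplexity can be checked numerically. The sketch assumes we already have the probability the model assigned to each true token; the function name and the example probabilities are illustrative.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the true tokens."""
    n = len(token_probs)
    cross_entropy = -sum(math.log(p) for p in token_probs) / n  # natural log, per token
    return math.exp(cross_entropy)

uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])  # approximately 4
confident_model = perplexity([0.9, 0.8, 0.95, 0.85])      # close to 1
```

A model that always spreads its probability evenly over four words has a perplexity of about 4, matching the "number of choices" intuition, while a model that assigns high probability to the true words approaches the minimum value of 1.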
Discuss the difference between Many-to-One and Many-to-Many RNN architectures. Provide examples of NLP tasks that utilize each configuration.
1. Many-to-One Architecture:
- Mechanism: The RNN processes a sequence of inputs over multiple time steps, but only outputs a single prediction at the very end of the sequence. The final hidden state is typically used to generate this output.
- Examples:
- Sentiment Classification: Processing an entire movie review (sequence of words) and outputting a single label (Positive/Negative).
- Text Categorization: Classifying a news article into "Sports", "Politics", etc.
2. Many-to-Many Architecture:
- Mechanism: The RNN processes a sequence of inputs and generates an output at multiple time steps. This can be strictly aligned (an output for every single input step) or misaligned (an encoder-decoder setup where the entire input is read before output begins).
- Examples:
- Aligned: Part-of-Speech (POS) tagging or Named Entity Recognition (NER), where every word in the input sequence receives a tag.
- Misaligned (Encoder-Decoder): Machine Translation (e.g., English to French), where the input sequence length differs from the output sequence length.
Discuss the role of word embeddings when used as input features for Deep Learning sequence models in NLP.
Role of Word Embeddings in Sequence Models:
Before feeding text into an RNN, LSTM, or GRU, the discrete text tokens must be converted into numerical format. While one-hot encoding is possible, it is highly inefficient and lacks semantic meaning. Word embeddings (like Word2Vec, GloVe, or dynamically learned embedding layers) are crucial for the following reasons:
- Dense Representation: Embeddings convert words into dense, continuous vectors of fixed size (e.g., 100 to 300 dimensions). This drastically reduces the dimensionality compared to one-hot encoding (which scales with vocabulary size).
- Semantic Similarity: Embeddings capture semantic and syntactic relationships. Words with similar meanings (e.g., "king" and "queen", "run" and "walk") are mapped to nearby points in the vector space. This allows the sequence model to generalize better to rare words, and to words absent from the task's training data when pre-trained embeddings cover them.
- Continuous Input for Gradients: Embeddings are continuous, trainable vectors, so backpropagation can make small adjustments to them and fine-tune the representations for the downstream task (like sentiment analysis). A fixed one-hot encoding offers nothing to adjust.
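The relationship between one-hot encoding and an embedding layer, and the idea of nearby vectors, can be sketched in NumPy. The three-word vocabulary and the 2-dimensional embedding values are made up for illustration and do not come from any trained model.

```python
import numpy as np

vocab = {"king": 0, "queen": 1, "apple": 2}
E = np.array([[0.90, 0.80],    # made-up embedding for "king"
              [0.85, 0.82],    # made-up embedding for "queen"
              [-0.70, 0.10]])  # made-up embedding for "apple"

def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# An embedding lookup is the same as multiplying a one-hot vector by E,
# but it skips the vocabulary-sized vector entirely.
lookup = E[vocab["queen"]]
via_one_hot = one_hot(vocab["queen"], len(vocab)) @ E

sim_king_queen = cosine(E[vocab["king"]], E[vocab["queen"]])
sim_king_apple = cosine(E[vocab["king"]], E[vocab["apple"]])
```

With these values "king" is far closer to "queen" than to "apple" in cosine similarity, which is the property that lets a sequence model treat related words similarly.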
Explain the technique of Gradient Clipping in sequence training. Why is it necessary when training recurrent neural networks?
Gradient Clipping:
Gradient clipping is a practical technique used during the training of neural networks, particularly RNNs, to deal with the exploding gradient problem.
Mechanism:
Before applying the gradients to update the network weights, their norms are computed. If the norm of the gradient vector exceeds a predefined threshold (e.g., 5.0), the gradient vector is scaled down (clipped) so that its norm is exactly equal to the threshold, while maintaining its direction.
Why it is Necessary for RNNs:
- Because RNNs share weights across all time steps, Backpropagation Through Time (BPTT) multiplies by the same recurrent weight matrix once per time step. If the weights are large, the gradients can grow exponentially (exploding gradients).
- Exploding gradients result in massive weight updates that destroy the model's learned parameters, often causing the loss function to spike to NaN (Not a Number).
- Gradient clipping ensures that the update step remains within a reasonable boundary, keeping the training process stable even when the loss surface contains very steep cliffs.
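Clipping by norm can be sketched in a few lines of NumPy. The sketch assumes the norm is taken over all parameter gradients jointly (often called the global norm); the function name and the example gradients are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]  # same direction, smaller length
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
```

Here every component is multiplied by 5/13, so the clipped gradient has norm 5 and points in the same direction as the original. Gradients whose norm is already below the threshold pass through unchanged.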