Unit4 - Subjective Questions
CSE472 • Practice Questions with Detailed Answers
Explain the standard Encoder-Decoder architecture used in Natural Language Processing.
Encoder-Decoder Architecture:
The encoder-decoder architecture is a standard framework for sequence-to-sequence (seq2seq) tasks in NLP.
1. Encoder:
- The encoder processes the input sequence step-by-step.
- It is typically a Recurrent Neural Network (RNN), LSTM, or GRU.
- At each time step $t$, it updates its hidden state $h_t$.
- After reading the entire sequence, the final hidden state (or a transformation of it) becomes the context vector $c$.
2. Context Vector:
- This vector is a fixed-length mathematical summary of the entire input sequence.
- It acts as the initial hidden state for the decoder.
3. Decoder:
- The decoder is another RNN that generates the output sequence one token at a time.
- Its initial hidden state is initialized with the context vector $c$.
- At each time step $t$, it computes its hidden state $s_t$ and predicts the probability distribution for the next word using a softmax layer: $P(y_t \mid y_{<t}, c) = \text{softmax}(W_o s_t + b_o)$.
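As a rough illustration of the encoder side described above, the sketch below runs a tiny tanh RNN over toy input vectors and takes the final hidden state as the context vector. The weights, dimensions, and helper names (`matvec`, `rnn_step`, `encode`) are invented for illustration; a real model would learn its parameters and would typically use an LSTM or GRU cell.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(W_x, W_h, x_t, h_prev):
    """One vanilla RNN update: h_t = tanh(W_x x_t + W_h h_prev)."""
    a = matvec(W_x, x_t)
    b = matvec(W_h, h_prev)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def encode(W_x, W_h, inputs):
    """Run the encoder over all inputs and return every hidden state."""
    h = [0.0] * len(W_h)  # initial hidden state of zeros
    states = []
    for x_t in inputs:
        h = rnn_step(W_x, W_h, x_t, h)
        states.append(h)
    return states

# Toy, hand-picked weights (hidden size 2, input size 2); assumptions only.
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.4, 0.0], [0.0, 0.4]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = encode(W_x, W_h, inputs)
context = states[-1]  # fixed-length summary handed to the decoder
print(len(states), context)
```

Whatever the input length, `context` keeps the same size, which is exactly the property that later becomes a bottleneck for long sequences.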
Describe how sequence-to-sequence models are applied to Machine Translation. What are the key steps involved?
Machine Translation using Seq2Seq Models:
Machine translation involves converting text from a source language to a target language. Seq2Seq models handle this effectively by mapping varying-length input sequences to varying-length output sequences.
Key Steps:
- Tokenization & Embedding: The source sentence is tokenized into words or subwords and converted into dense vector embeddings.
- Encoding: The source embeddings are fed into the Encoder RNN/LSTM one by one. The network compresses the semantic meaning of the sentence into a fixed-length context vector.
- Decoding: The Decoder receives the context vector and a special start-of-sequence token (e.g., <SOS>).
- Generation: The decoder predicts the first target word, which is then fed back as input to the next time step of the decoder. During training, Teacher Forcing is often used instead, feeding the ground-truth word rather than the prediction.
- Termination: The process continues until the decoder outputs an end-of-sequence token (<EOS>).
Advantages: It captures the global context of the sentence, allowing for better translations of idioms and complex grammatical structures compared to traditional phrase-based systems.
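The decoding, generation, and termination steps listed above can be sketched as a greedy loop. This is a minimal sketch under strong assumptions: `TOY_NEXT_TOKEN` is a hypothetical lookup table standing in for a trained decoder, which in practice would also condition on the context vector and its own hidden state.

```python
SOS, EOS = "<SOS>", "<EOS>"

# Hypothetical stand-in for a trained decoder: maps the previous token
# to the most likely next token.
TOY_NEXT_TOKEN = {
    SOS: "le",
    "le": "chat",
    "chat": "dort",
    "dort": EOS,
}

def greedy_decode(next_token, max_len=10):
    """Generate tokens one at a time until <EOS> or max_len is reached."""
    output = []
    prev = SOS
    for _ in range(max_len):
        token = next_token[prev]
        if token == EOS:
            break
        output.append(token)
        prev = token  # feed the prediction back in as the next input
    return output

print(greedy_decode(TOY_NEXT_TOKEN))
```

The `max_len` cap is a common safeguard so that generation stops even if the model never emits the end-of-sequence token.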
Discuss the application of sequence-to-sequence models in text summarization. Differentiate between abstractive and extractive summarization in this context.
Seq2Seq for Text Summarization:
Text summarization aims to produce a concise and fluent summary of a longer document.
Application:
- The source document is fed into the encoder, generating a semantic representation.
- The decoder generates a summary token-by-token. Attention mechanisms are almost always necessary here due to the long length of source documents.
Extractive vs. Abstractive Summarization:
- Extractive Summarization: Selects important sentences or phrases directly from the source text and pieces them together. While Seq2Seq can be used to score and select sentences, it is less common for pure extraction.
- Abstractive Summarization: Generates new sentences that capture the core meaning, potentially using words not present in the original text. Seq2Seq models excel here, as the encoder learns the semantic meaning and the decoder acts as a natural language generator to write the summary from scratch.
Challenges in Abstractive Summarization:
Handling out-of-vocabulary (OOV) words and maintaining factual consistency (hallucination) are major challenges in applying basic Seq2Seq to summarization.
What are the primary limitations of classical sequence-to-sequence models without attention?
Classical Seq2Seq models (those relying solely on a single context vector) suffer from several critical limitations:
The Information Bottleneck:
- The encoder must compress the entire input sequence into a single, fixed-length vector (the final hidden state).
- For long sentences, this vector struggles to retain all necessary information, leading to degraded performance (rapid drop in translation quality as sentence length increases).
Vanishing Gradient Problem:
- Even with LSTMs or GRUs, processing very long sequences causes gradients to vanish, making it difficult for the model to learn long-range dependencies between the beginning of the input and the output.
Lack of Alignment:
- Classical models lack an explicit mechanism to align source and target words. In human translation, a translator focuses on specific parts of the source text when generating a specific target word. Classical Seq2Seq forces the decoder to rely on the same global context for every generated word.
Explain the concept of 'Attention' in deep NLP. Why was it introduced?
Attention in Deep NLP:
Attention is a mechanism that allows a neural network to focus on specific parts of the input sequence when generating a specific part of the output sequence, mimicking human cognitive attention.
Why it was introduced:
- To solve the information bottleneck of classical Seq2Seq models. Instead of forcing the encoder to compress an entire sentence into a single vector, the encoder passes all its hidden states (one for each input token) to the decoder.
- At each decoding step, the attention mechanism calculates a set of attention weights (probabilities) that represent how relevant each input token is to the current output being generated.
- It then takes a weighted sum of the encoder hidden states to create a dynamic context vector specifically tailored for that decoding step.
- This drastically improves performance on long sequences and complex tasks like machine translation.
Provide the mathematical formulation for computing Soft Attention in an encoder-decoder network.
In Soft Attention, the dynamic context vector $c_t$ for the decoder at time step $t$ is computed as a weighted sum of all encoder hidden states.
Let the encoder hidden states be $h_1, h_2, \dots, h_T$.
Let the decoder hidden state at time $t$ be $s_t$.
Step 1: Alignment Score
Calculate an alignment score $e_{t,i}$ between the current decoder state $s_t$ (or $s_{t-1}$) and each encoder state $h_i$. Using a general scoring function $\text{score}(\cdot, \cdot)$:
$$e_{t,i} = \text{score}(s_t, h_i)$$
Step 2: Attention Weights (Softmax)
Convert the scores into probabilities (attention weights) using the softmax function:
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}$$
Note that $\sum_{i=1}^{T} \alpha_{t,i} = 1$.
Step 3: Context Vector
Compute the context vector as the weighted sum of the encoder hidden states:
$$c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$$
This context vector is then used alongside the decoder state to predict the next word.
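The three steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the dot product is assumed as the scoring function, and the toy vectors and function names (`softmax`, `soft_attention`) are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_attention(decoder_state, encoder_states, score=dot):
    """Return (weights, context) for one decoding step."""
    # Step 1: one alignment score per encoder state.
    scores = [score(decoder_state, h) for h in encoder_states]
    # Step 2: normalize the scores into attention weights.
    weights = softmax(scores)
    # Step 3: weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = soft_attention(decoder_state, encoder_states)
print(weights, context)
```

With these toy values the first and third encoder states receive the same score, so they end up with equal weights, and the second state is down-weighted.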
Distinguish between Soft Attention and Hard Attention mechanisms.
Soft Attention vs. Hard Attention:
Soft Attention:
- Mechanism: Computes a probability distribution over all input tokens and takes a weighted sum of all encoder hidden states.
- Differentiability: It is fully differentiable, meaning the entire model can be trained end-to-end using standard backpropagation.
- Computation: Can be computationally expensive for very long sequences because it calculates weights for every input token at every decoding step.
Hard Attention:
- Mechanism: Selects exactly one (or a few discrete) input tokens to focus on at each step. The attention weight is $1$ for the selected token and $0$ for all others.
- Differentiability: It is non-differentiable because making a discrete choice breaks the gradient flow.
- Training: Requires reinforcement learning techniques (like the REINFORCE algorithm) or Monte Carlo sampling to train.
- Advantage: Less computational cost at inference time since it only processes one encoder state, but much harder to train.
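To make the contrast concrete, the sketch below computes a context vector both ways from the same attention distribution. It assumes hard attention is implemented by sampling a single position from that distribution, which is one common formulation; the function names and toy values are illustrative only.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_context(weights, states):
    """Soft attention: blend every encoder state by its weight."""
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]

def hard_context(weights, states, rng):
    """Hard attention: sample one position and return only that state.

    The sampling step is a discrete choice, so gradients cannot flow
    through it with ordinary backpropagation.
    """
    index = rng.choices(range(len(states)), weights=weights, k=1)[0]
    return index, states[index]

states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = softmax([2.0, 0.5, 0.1])
rng = random.Random(0)  # seeded only to make the demo repeatable

print(soft_context(weights, states))
print(hard_context(weights, states, rng))
```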
Explain the Bahdanau Attention mechanism (Additive Attention) in detail, including its alignment score equation.
Bahdanau Attention (Additive Attention):
Introduced by Dzmitry Bahdanau et al., this is widely regarded as the first attention mechanism applied to NLP (specifically neural machine translation).
Core Idea:
It uses a feed-forward neural network to calculate the alignment score, learning the alignment jointly with the translation model.
Alignment Score Equation:
The score $e_{t,i}$ (how well the inputs around position $i$ and the output at position $t$ match) is calculated using the previous decoder hidden state $s_{t-1}$ and the encoder hidden state $h_i$:
$$e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$$
Where:
- $W_a$ and $U_a$ are weight matrices.
- $v_a$ is a weight vector.
- $\tanh$ introduces non-linearity.
Process:
- Calculate $e_{t,i}$ for all encoder states $h_i$.
- Apply Softmax to get weights $\alpha_{t,i}$.
- Compute context vector $c_t = \sum_i \alpha_{t,i} h_i$.
- The context vector $c_t$ is concatenated with the input word embedding at step $t$ to compute the new decoder state $s_t$.
Because it adds the linearly transformed states before applying the activation function, it is known as Additive Attention.
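A small sketch of the additive score may help show where the "addition" happens. The parameter names (`W_a`, `U_a`, `v_a`) follow the usual convention for this mechanism, but the values and dimensions below are toy assumptions rather than learned weights.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def additive_score(v_a, W_a, U_a, s_prev, h_i):
    """Additive score: v_a . tanh(W_a s_prev + U_a h_i)."""
    projected_s = matvec(W_a, s_prev)   # decoder state in attention space
    projected_h = matvec(U_a, h_i)      # encoder state in attention space
    hidden = [math.tanh(a + b) for a, b in zip(projected_s, projected_h)]
    return sum(v * x for v, x in zip(v_a, hidden))

# Toy parameters: attention size 2, decoder size 2, encoder size 3.
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
v_a = [1.0, -1.0]

s_prev = [0.5, 0.5]
encoder_states = [[0.5, -0.5, 0.0], [-0.5, 0.5, 1.0]]

scores = [additive_score(v_a, W_a, U_a, s_prev, h) for h in encoder_states]
print(scores)
```

Because the two states are projected separately before being summed, the encoder and decoder do not need to share a hidden size.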
Explain the Luong Attention mechanism (Multiplicative Attention). How does it differ from Bahdanau attention in terms of state usage?
Luong Attention (Multiplicative Attention):
Introduced by Thang Luong et al., it is a refinement of the Bahdanau attention mechanism, offering different scoring functions and variations in how the context vector is used.
State Usage Difference:
- Bahdanau: Uses the previous decoder hidden state $s_{t-1}$ to calculate attention, which then helps compute the current decoder state $s_t$.
- Luong: First computes the current decoder hidden state $s_t$ using the standard RNN/LSTM update. It then uses this current state $s_t$ alongside the encoder states to compute the attention weights and the context vector $c_t$.
Process in Luong:
- Decoder steps forward to get $s_t$.
- Calculate alignment scores $e_{t,i} = \text{score}(s_t, h_i)$.
- Apply Softmax to get $\alpha_{t,i}$.
- Compute context vector $c_t = \sum_i \alpha_{t,i} h_i$.
- Concatenate $c_t$ and $s_t$ to create an attentional hidden state $\tilde{s}_t = \tanh(W_c [c_t; s_t])$, which is used to predict the output word.
What are the three different scoring functions proposed in Luong Attention?
Luong et al. proposed three different ways to calculate the alignment score $\text{score}(s_t, h_i)$ (or $e_{t,i}$) between the decoder hidden state $s_t$ and the encoder hidden state $h_i$:
- Dot Product:
$$\text{score}(s_t, h_i) = s_t^\top h_i$$
This is the simplest and fastest method, but it requires the encoder and decoder hidden states to have the exact same dimensions.
- General (Multiplicative):
$$\text{score}(s_t, h_i) = s_t^\top W_a h_i$$
Here, $W_a$ is a trainable weight matrix. This allows the encoder and decoder to have different dimensions and introduces learned parameters to project the states into a common space before the dot product.
- Concat (Similar to Bahdanau):
$$\text{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$$
This concatenates the states and passes them through a linear layer and a $\tanh$ activation, followed by a dot product with a weight vector $v_a$.
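The three scoring functions can be compared side by side in a short sketch. The function names, the matrix `W_c` used for the concat variant, and all numeric values are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def score_dot(s_t, h_i):
    """Dot: requires len(s_t) == len(h_i)."""
    return dot(s_t, h_i)

def score_general(s_t, h_i, W_a):
    """General: s_t . (W_a h_i); W_a maps encoder size to decoder size."""
    return dot(s_t, matvec(W_a, h_i))

def score_concat(s_t, h_i, W_c, v_a):
    """Concat: v_a . tanh(W_c [s_t; h_i])."""
    joined = s_t + h_i  # list concatenation, i.e. [s_t; h_i]
    return dot(v_a, [math.tanh(x) for x in matvec(W_c, joined)])

s_t = [1.0, 2.0]
h_i = [3.0, 4.0]
W_swap = [[0.0, 1.0], [1.0, 0.0]]   # toy matrix that swaps coordinates
W_c = [[0.1, 0.1, 0.1, 0.1]]        # toy projection to a size-1 space
v_a = [2.0]

print(score_dot(s_t, h_i))
print(score_general(s_t, h_i, W_swap))
print(score_concat(s_t, h_i, W_c, v_a))
```

The dot variant has no parameters at all, which is why it is the cheapest; the other two trade extra parameters for flexibility.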
Compare Bahdanau Attention and Luong Attention mechanisms.
Comparison between Bahdanau and Luong Attention:
Alignment Calculation State:
- Bahdanau: Uses the previous decoder state $s_{t-1}$ to compute attention.
- Luong: Uses the current decoder state $s_t$ to compute attention.
Scoring Function:
- Bahdanau: Exclusively uses an Additive scoring function: $v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$.
- Luong: Offers multiple scoring functions, most notably the Multiplicative (General) $s_t^\top W_a h_i$ and Dot product $s_t^\top h_i$.
Architecture Pathway:
- Bahdanau: Context vector is concatenated with the decoder input, which then goes into the decoder RNN.
- Luong: Context vector is calculated after the decoder RNN step, concatenated with the decoder output to form an attentional state, which is passed to the softmax classifier.
Types (Global vs. Local):
- Bahdanau is strictly a "global" attention mechanism (looks at all encoder states).
- Luong introduced the concept of "local" attention, where the model only looks at a small window of encoder states, improving efficiency.
Describe the process of integrating attention into an encoder-decoder network during the decoding phase.
Integrating Attention into the Decoder:
During decoding, the attention mechanism acts as a bridge between the encoder and decoder. The step-by-step integration at time step $t$ is:
- State Acquisition: The decoder uses its current state (or previous state, depending on the architecture) and fetches all the hidden states from the encoder.
- Scoring: The attention layer calculates an alignment score between the decoder state and each encoder state using a scoring function (e.g., dot product or additive).
- Weights: The scores are passed through a Softmax layer to generate a probability distribution (attention weights). These weights highlight which input words are most relevant right now.
- Context Vector Formulation: A context vector is created by taking the weighted sum of the encoder hidden states, with each state scaled by its attention weight.
- Combination: The context vector is concatenated with the decoder's state. This combined vector contains both the generated context history and the targeted source information.
- Output Generation: The concatenated vector is passed through a dense layer with a softmax activation to output the probability distribution for the next word.
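The six steps above can be traced end to end in a compact sketch. It assumes a dot-product score and a single output matrix `W_out` applied to the concatenated vector; these choices, the function name, and the toy values are illustrative assumptions, not a specific published architecture.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decode_step_with_attention(s_t, encoder_states, W_out):
    """Return (attention weights, vocabulary distribution) for one step."""
    scores = [dot(s_t, h) for h in encoder_states]          # scoring
    weights = softmax(scores)                               # weights
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]                         # context vector
    combined = context + s_t                                # concatenation
    logits = [dot(row, combined) for row in W_out]          # dense layer
    return weights, softmax(logits)                         # output softmax

encoder_states = [[1.0, 0.0], [0.0, 1.0]]
s_t = [0.0, 3.0]
# Toy output layer for a 3-word vocabulary (rows = words).
W_out = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

weights, probs = decode_step_with_attention(s_t, encoder_states, W_out)
print(weights, probs)
```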
What is the BLEU score? Explain how modified n-gram precision is calculated in BLEU.
BLEU (Bilingual Evaluation Understudy):
BLEU is a widely used metric for evaluating the quality of text generated by machines, particularly in machine translation. It measures how similar the machine's output is to one or more human reference translations.
Modified N-gram Precision:
Standard precision would just count how many n-grams from the candidate (machine) translation appear in the reference. However, this can be easily gamed by repeating words (e.g., "the the the the").
To prevent this, BLEU uses modified n-gram precision:
- Count the maximum number of times a specific n-gram appears in any single reference translation. This is the maximum reference count.
- Count the occurrences of that n-gram in the candidate translation.
- Clip the candidate count: the count cannot exceed the maximum reference count.
- Sum these clipped counts for all unique n-grams in the candidate, and divide by the total number of n-grams in the candidate.
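Formula for modified unigram precision:
$$p_1 = \frac{\sum_{w \in \text{candidate}} \min\big(\text{Count}_{\text{cand}}(w),\ \text{MaxRefCount}(w)\big)}{\sum_{w \in \text{candidate}} \text{Count}_{\text{cand}}(w)}$$
where the sums run over the unique unigrams $w$ in the candidate.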
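The clipping procedure described above can be sketched for any n-gram order. The helper names are illustrative, and tokenization is assumed to be simple whitespace splitting.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Maximum count of each n-gram in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count by the maximum reference count.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(modified_precision(candidate, references, 1))  # 2/7, not 7/7
```

Without clipping, the degenerate candidate would score a perfect unigram precision; with clipping, "the" is credited at most twice because no single reference contains it more than twice.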
Explain the Brevity Penalty in the context of the BLEU score calculation. Why is it necessary?
Brevity Penalty (BP) in BLEU:
Why it is necessary:
BLEU relies heavily on precision (how many of the generated words are correct). A model could cheat by outputting a very short sentence containing only a few highly probable "safe" words (e.g., translating a 10-word sentence into "I am"). The precision would be 100%, but the translation is terrible because it lacks recall. Since BLEU does not use standard recall (because there can be multiple valid references), it introduces the Brevity Penalty to penalize short translations.
Calculation:
The Brevity Penalty scales the overall BLEU score down if the candidate translation length ($c$) is shorter than the reference translation length ($r$).
- If the candidate length $c$ is greater than or equal to the reference $r$, $BP = 1$ (no penalty).
- If $c < r$, the penalty is exponentially applied: $BP = e^{1 - r/c}$. The final BLEU score is multiplied by $BP$.
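A minimal sketch of the penalty follows. Returning 0 for an empty candidate is an assumption made here to avoid division by zero, not something the text above specifies.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if the candidate is long enough, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0  # assumed convention for an empty candidate
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

print(brevity_penalty(10, 10))  # no penalty
print(brevity_penalty(2, 10))   # a 2-word output for a 10-word reference
```

In the "I am" example, a 2-word candidate against a 10-word reference is scaled by `exp(-4)`, so even perfect precision yields a very low final score.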
What are ROUGE scores in NLP evaluation? Differentiate between ROUGE-N and ROUGE-L.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
ROUGE is a set of metrics primarily used for evaluating automatic summarization and machine translation models. While BLEU focuses heavily on precision, ROUGE traditionally emphasizes recall—measuring how much of the human reference summary is captured by the machine-generated summary.
ROUGE-N:
- Measures the overlap of N-grams between the system and reference summaries.
- For example, ROUGE-1 measures unigram (single word) overlap, and ROUGE-2 measures bigram overlap.
- The recall is calculated as the number of overlapping N-grams divided by the total number of N-grams in the reference summary.
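ROUGE-N recall as described above can be sketched with multiset counts. This is a simplified single-reference version with whitespace tokenization; the function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n):
    """Overlapping n-grams divided by the n-grams in the reference."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref_counts & cand_counts).values())
    return overlap / total

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 unigrams
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 bigrams
```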
ROUGE-L:
- Based on the Longest Common Subsequence (LCS).
- It finds the longest sequence of words that appear in both the generated and reference text in the same order, though not necessarily consecutively.
- It captures sentence structure better than ROUGE-N because it automatically incorporates longest in-sequence matches without requiring predefined n-gram lengths.
- ROUGE-L provides precision, recall, and an F1-score based on the LCS.
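A sketch of ROUGE-L using the standard dynamic-programming LCS is shown below. It assumes a balanced F1 (equal weight on precision and recall); other weightings are possible, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(candidate, reference):
    """Return (precision, recall, f1) based on the LCS."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the cat sat on the mat".split()
candidate = "the cat was on the mat today".split()
print(rouge_l(candidate, reference))
```

Here the LCS is "the cat on the mat" (5 tokens): the words match in order even though "was" and "sat" differ in the middle.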
Compare BLEU and ROUGE scores. In which NLP tasks is each predominantly used and why?
BLEU vs. ROUGE:
Core Metric Emphasis:
- BLEU: Heavily emphasizes Precision (with a Brevity Penalty to compensate for recall). It checks if the words generated by the machine are valid by looking at references.
- ROUGE: Traditionally emphasizes Recall (though F1 scores are commonly used now). It checks how much of the vital information from the human reference was captured by the machine.
Predominant Use Cases:
- BLEU -> Machine Translation: In translation, generating accurate/precise words is critical. Outputting unnecessary or wrong words changes the meaning drastically. Precision is key.
- ROUGE -> Text Summarization: In summarization, the primary goal is to ensure all key points (gists) of the original text are included. Omitting key information is worse than phrasing it slightly differently, making Recall more important.
Calculation Mechanics:
- BLEU uses modified n-gram precision and a brevity penalty.
- ROUGE uses n-gram recall (ROUGE-N), longest common subsequence (ROUGE-L), or skip-bigrams (ROUGE-S).
What is Teacher Forcing, and why is it used during the training of sequence-to-sequence models?
Teacher Forcing:
Teacher forcing is a training strategy for recurrent neural networks that use sequence-to-sequence architectures.
Mechanism:
- During the decoding phase, instead of feeding the model's own predicted output from time step $t-1$ as the input for time step $t$, the actual ground-truth target word from the training dataset is fed into the network as input.
Why it is used:
- Faster Convergence: If the model makes a mistake early in a sequence (e.g., predicting the wrong first word), feeding that mistake back in causes errors to compound. The model spends time exploring irrelevant state spaces. Teacher forcing corrects the model immediately, leading to much faster and more stable training.
- Parallelization: In some architectures (like Transformers, though less so in RNNs), providing the ground truth allows calculations for all time steps to be processed in parallel.
Drawback:
Exposure Bias: During inference (testing), the ground truth is unavailable, so the model must rely on its own predictions. If it hasn't learned to recover from its own small mistakes during training, performance can drop. This is often mitigated by techniques like Scheduled Sampling.
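The difference between teacher forcing, free-running decoding, and a scheduled-sampling style mix comes down to which token is chosen as each decoder input. The sketch below makes that choice explicit. It is a simplification: `predictions` is a hypothetical, precomputed list of model outputs, whereas in a real training loop each prediction would depend on the inputs chosen so far.

```python
import random

SOS = "<SOS>"

def decoder_inputs(target, predictions, teacher_forcing_ratio, rng):
    """Choose the decoder input for every time step.

    target:      ground-truth tokens y_1 .. y_T
    predictions: model outputs for steps 1 .. T (hypothetical, precomputed)
    With probability teacher_forcing_ratio the ground-truth previous
    token is used; otherwise the model's own previous prediction is used.
    """
    inputs = [SOS]  # the first input is always the start token
    for t in range(1, len(target)):
        use_truth = rng.random() < teacher_forcing_ratio
        inputs.append(target[t - 1] if use_truth else predictions[t - 1])
    return inputs

target = ["le", "chat", "dort"]
predictions = ["un", "chien", "dort"]  # toy model outputs with early errors
rng = random.Random(0)

print(decoder_inputs(target, predictions, 1.0, rng))  # pure teacher forcing
print(decoder_inputs(target, predictions, 0.0, rng))  # free running
print(decoder_inputs(target, predictions, 0.5, rng))  # mixed
```

Scheduled Sampling can be viewed as gradually lowering `teacher_forcing_ratio` over training so that the model is increasingly exposed to its own mistakes.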
Discuss the evaluation challenges in generative NLP tasks and how automated metrics address or fail to address them.
Evaluation Challenges in Generative NLP:
Generative tasks like translation and summarization have no single "correct" answer. A sentence can be rewritten in multiple ways while preserving meaning.
Challenges:
- Exact Match Failure: Standard accuracy metrics (like exact string match) fail because a perfectly valid translation might use a synonym (e.g., "big" vs. "large").
- Semantic Similarity: Measuring whether two sentences mean the same thing mathematically is difficult.
- Fluency vs. Adequacy: A model might generate perfectly fluent English that has nothing to do with the source text, or an accurate translation that is grammatically terrible.
How Automated Metrics Address Them:
- BLEU/ROUGE: Address the exact match failure by looking at n-gram overlaps. By allowing multiple reference texts, they account for varied phrasing.
How They Fail:
- Lack of Semantic Understanding: BLEU and ROUGE rely on lexical overlap (word matching). They penalize synonyms heavily (e.g., predicting "automobile" when the reference says "car" yields a score of 0 for that word).
- Word Order: While higher-order n-grams help, metrics can still give decent scores to slightly garbled sentences.
- Modern Solutions: To address these failures, evaluation increasingly relies on metrics that go beyond exact lexical matching. METEOR adds stemming and synonym matching, while neural metrics like BERTScore compare contextual embeddings to measure semantic similarity.
Explain the concept of ROUGE-S (Skip-Bigram Co-occurrence). How does it differ from standard ROUGE-2?
ROUGE-S (Skip-Bigram Co-occurrence):
ROUGE-S measures the overlap of skip-bigrams between a candidate text and a reference text.
Concept:
- A skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps (other words) between them.
- For example, in the sentence "I have a red car", the skip-bigrams include ("I", "have"), ("I", "a"), ("I", "red"), ("I", "car"), ("have", "a"), etc.
- ROUGE-S calculates the number of skip-bigrams that appear in both the reference and the generated summary.
Difference from ROUGE-2:
- ROUGE-2 only counts adjacent bigrams. If the reference is "red car" and the model outputs "red fast car", ROUGE-2 yields 0 overlap.
- ROUGE-S captures long-distance dependencies and allows for word insertions. In the same example, the skip-bigram ("red", "car") is preserved, giving the model partial credit for maintaining the correct relative order of key words.
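The "red car" example can be checked with a short sketch. It computes skip-bigram recall against a single reference with no limit on the gap size; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens):
    """All in-order word pairs, with any number of words in between."""
    return Counter(combinations(tokens, 2))

def adjacent_bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def overlap_recall(cand_counts, ref_counts):
    """Shared items divided by the number of items in the reference."""
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    return sum((ref_counts & cand_counts).values()) / total

reference = "red car".split()
candidate = "red fast car".split()

rouge_2 = overlap_recall(adjacent_bigrams(candidate), adjacent_bigrams(reference))
rouge_s = overlap_recall(skip_bigrams(candidate), skip_bigrams(reference))
print(rouge_2, rouge_s)
```

The inserted word "fast" breaks the only adjacent bigram, so the ROUGE-2 overlap is zero, while the skip-bigram ("red", "car") still matches.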
Derive the basic structural flow of how a translation is generated in a Sequence-to-Sequence model with Attention, from input to final probability distribution.
Structural Flow of Seq2Seq with Attention:
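- Input Representation: Let the input sentence be tokenized into $x_1, x_2, \dots, x_T$. Each token is mapped to an embedding vector.
- Encoder Phase: The encoder RNN processes embeddings to produce hidden states:
$$h_i = \text{RNN}_{\text{enc}}(x_i, h_{i-1}), \quad i = 1, \dots, T$$
- Decoder Phase (at step $t$): Let the previous generated word be $y_{t-1}$ and previous decoder state be $s_{t-1}$.
- Attention Scoring: Compute score between $s_{t-1}$ and each $h_i$ (using Bahdanau as example):
$$e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$$
- Attention Weights: Normalize scores using Softmax:
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}$$
- Context Vector: Compute weighted sum of encoder states:
$$c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$$
- Decoder State Update: The new decoder state is calculated using the context vector, previous word, and previous state:
$$s_t = \text{RNN}_{\text{dec}}(s_{t-1}, [y_{t-1}; c_t])$$
- Output Probability: Pass the state (and sometimes context) through a dense layer to get vocabulary probabilities:
$$P(y_t \mid y_{<t}, x) = \text{softmax}(W_o [s_t; c_t] + b_o)$$
The word with the highest probability is selected as $y_t$.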