Unit 2 - Subjective Questions
CSE472 • Practice Questions with Detailed Answers
Define Vector Space Models (VSM) in the context of Natural Language Processing and explain their primary purpose.
Vector Space Models (VSMs) are algebraic models used to represent text documents or individual words as vectors of identifiers (e.g., index terms or frequencies) in a multi-dimensional space.
Primary Purpose:
- Mathematical Representation: They allow the conversion of textual data into a numerical format that machine learning algorithms can process.
- Similarity Measurement: By representing words or documents as vectors, VSMs enable the computation of distances or angles between them (e.g., using Cosine Similarity) to quantify semantic or syntactic similarity.
- Information Retrieval: VSMs form the foundational theory behind search engines, where queries and documents are mapped to the same space to find the closest match.
Distinguish between sparse vector representations and dense word embeddings. Provide an example of each.
In NLP, word representations can be broadly categorized into sparse and dense vectors:
1. Sparse Vector Representations:
- Definition: Vectors that have a very high dimensionality (often equal to the vocabulary size) and contain mostly zeros.
- Characteristics: High memory footprint and no notion of semantic similarity (distinct one-hot vectors are orthogonal, so every pair of words looks equally unrelated).
- Example: One-hot encoding, TF-IDF, Bag-of-Words (BoW).
2. Dense Word Embeddings:
- Definition: Vectors with a relatively low, fixed dimensionality (e.g., 50 to 300 dimensions) where most elements are non-zero floating-point numbers.
- Characteristics: Computationally efficient, captures semantic and syntactic relationships (similar words have similar vectors).
- Example: Word2Vec, GloVe, FastText.
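The contrast can be illustrated with a minimal sketch. The three-word vocabulary and the 2-dimensional dense vectors below are hand-picked assumptions for illustration, not trained embeddings:

```python
import numpy as np

vocab = ["dog", "puppy", "car"]

# Sparse: one-hot vectors, one dimension per vocabulary word.
one_hot = np.eye(len(vocab))

# Dense: hand-picked 2-d vectors, purely illustrative (not trained).
dense = np.array([
    [0.9, 0.1],   # dog
    [0.8, 0.2],   # puppy
    [0.1, 0.9],   # car
])

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

one_hot_sim = cosine(one_hot[0], one_hot[1])   # dog vs puppy, one-hot
dense_sim_close = cosine(dense[0], dense[1])   # dog vs puppy, dense
dense_sim_far = cosine(dense[0], dense[2])     # dog vs car, dense
```

With one-hot vectors every pair of distinct words has similarity zero, while the dense vectors can place "dog" closer to "puppy" than to "car".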
Explain the concept of dense word embeddings and discuss two major advantages they offer over traditional count-based representations.
Dense Word Embeddings map words or phrases from a vocabulary to vectors of real numbers in a low-dimensional continuous space. Unlike count-based models, these representations are learned using neural networks or matrix factorization techniques over large corpora.
Advantages over traditional count-based representations:
- Dimensionality Reduction: Traditional representations like one-hot encoding scale with the vocabulary size $|V|$ (one dimension per word), whereas dense embeddings have a fixed, much smaller dimensionality (e.g., $300$). This drastically reduces memory usage and computational cost.
- Semantic Generalization: Dense embeddings capture distributed representations. Words with similar meanings are mapped to proximate points in the vector space, allowing models to generalize better to unseen phrases (e.g., recognizing that "dog" and "puppy" are related).
Describe the Word2Vec framework. What are the two primary architectures introduced in this framework?
Word2Vec is a popular framework developed by Mikolov et al. at Google for learning dense word embeddings from large datasets. It uses shallow, two-layer neural networks trained to reconstruct linguistic contexts of words.
The framework operates on the principle of the distributional hypothesis: "You shall know a word by the company it keeps."
Two Primary Architectures:
- Continuous Bag-of-Words (CBOW): This architecture predicts a target word based on its surrounding context words. The context words' vectors are averaged or summed to predict the target.
- Skip-Gram: This architecture does the exact opposite of CBOW. It uses a single target word to predict the surrounding context words within a specified window size.
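The difference between the two architectures shows up in how training examples are built from the same sentence. The following is a minimal sketch; the helper names `cbow_examples` and `skipgram_pairs` are hypothetical and not part of any Word2Vec library:

```python
def cbow_examples(tokens, window):
    """Return (context_words, target_word) examples, as CBOW would consume them."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, target))
    return examples

def skipgram_pairs(tokens, window):
    """Return (target_word, context_word) pairs, as Skip-Gram would consume them."""
    pairs = []
    for context, target in cbow_examples(tokens, window):
        pairs.extend((target, c) for c in context)
    return pairs

tokens = "the quick brown fox jumps".split()
cbow = cbow_examples(tokens, window=1)
sg = skipgram_pairs(tokens, window=1)
```

CBOW produces one example per position (many context words, one target), whereas Skip-Gram expands each position into one pair per context word.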
Explain the Continuous Bag-of-Words (CBOW) model in detail, including its objective function.
Continuous Bag-of-Words (CBOW) is a Word2Vec architecture designed to predict a central target word given its surrounding context words within a window of size $c$.
Architecture Details:
- Input Layer: Takes the one-hot encoded vectors of the context words.
- Hidden Layer: Projects the inputs into a dense latent space (the embedding layer). The vectors of the context words are averaged to form a single hidden representation.
- Output Layer: Uses a softmax function over the vocabulary to output the probability distribution of the target word.
Objective Function:
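The goal of CBOW is to maximize the conditional probability of the target word $w_t$ given its context $w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}$ (the $c$ words on either side).
Mathematically, the loss function (negative log-likelihood) to be minimized over a corpus of size $T$ is:
$$J = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c})$$
where $P(w_t \mid \cdot)$ is computed using the softmax function over the dot product of the hidden state and the output word vectors.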
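A single CBOW forward pass and its negative log-likelihood can be sketched as follows. The matrix names `W_in` and `W_out`, the toy sizes, and the random initialization are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # output (target) embeddings

def cbow_loss(context_ids, target_id):
    """Negative log-likelihood of one target word given its context word ids."""
    h = W_in[context_ids].mean(axis=0)        # hidden layer: average of context embeddings
    scores = W_out @ h                        # one score per vocabulary word
    scores -= scores.max()                    # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
    return -np.log(probs[target_id]), probs

loss, probs = cbow_loss(context_ids=[0, 1, 3, 4], target_id=2)
```

Indexing `W_in` by word id is equivalent to multiplying a one-hot vector by the embedding matrix, which is why the hidden layer acts as an embedding lookup.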
Describe the Skip-Gram model architecture and formulate its objective function.
Skip-Gram is a Word2Vec model that predicts the surrounding context words given a single central target word.
Architecture Details:
- Input Layer: A one-hot encoded vector representing the target word $w_t$.
- Hidden Layer: Acts as a lookup table to retrieve the dense embedding of the target word.
- Output Layer: Multiple softmax classifiers (or approximated versions) that predict the context words within a window of size $c$.
Objective Function:
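The goal is to maximize the average log probability of context words given the target word. Over a sequence of training words $w_1, w_2, \dots, w_T$, the objective is to maximize:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$
where $c$ is the training context size. The probability is defined using the softmax function:
$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$
where $v_w$ and $v'_w$ are the "input" and "output" vector representations of word $w$, and $W$ is the vocabulary size.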
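The Skip-Gram objective with a full softmax can be evaluated directly on a toy sequence. In this sketch the arrays `v_in` and `v_out` stand for the "input" and "output" vectors; their sizes and random values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W, dim = 5, 3                                   # vocabulary size and embedding size (toy values)
v_in = rng.normal(scale=0.1, size=(W, dim))     # "input" vectors
v_out = rng.normal(scale=0.1, size=(W, dim))    # "output" vectors

def log_p(context_id, target_id):
    """Log-probability of a context word given the target word, full softmax."""
    scores = v_out @ v_in[target_id]
    scores = scores - scores.max()              # shift for numerical stability
    return scores[context_id] - np.log(np.exp(scores).sum())

def skipgram_objective(word_ids, c):
    """Average log-probability of the words within c positions of each target."""
    total = 0.0
    T = len(word_ids)
    for t in range(T):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < T:
                total += log_p(word_ids[t + j], word_ids[t])
    return total / T

objective = skipgram_objective([0, 1, 2, 3, 4, 1, 0], c=2)
```

Every call to `log_p` sums over the whole vocabulary, which is the cost that Negative Sampling and Hierarchical Softmax are designed to avoid.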
Compare the CBOW and Skip-Gram architectures of Word2Vec in terms of learning capabilities and performance.
While both CBOW and Skip-Gram belong to the Word2Vec family, they have distinct learning behaviors:
1. Prediction Goal:
- CBOW: Predicts the target word from the context.
- Skip-Gram: Predicts the context from the target word.
2. Training Speed:
- CBOW: Generally trains faster because it averages the context words into a single hidden vector, resulting in fewer network updates per training sample.
- Skip-Gram: Slower to train since it creates multiple training pairs for each window.
3. Performance on Word Frequencies:
- CBOW: Performs better for frequently occurring words and has slightly better accuracy for syntactic tasks.
- Skip-Gram: Works very well with small amounts of training data and handles rare words or phrases much better than CBOW.
Explain the concept of Negative Sampling in the context of training Word2Vec models. Why is it necessary?
Negative Sampling is an optimization technique used to train Word2Vec models efficiently. It is specifically designed to avoid computing the computationally expensive softmax denominator.
Why it is necessary:
In the standard Skip-Gram or CBOW model, calculating the softmax probability requires summing over the entire vocabulary (which can be millions of words). This makes computing the gradients and updating weights extremely slow.
How it works:
Instead of updating all weights in the vocabulary for every training example, Negative Sampling simplifies the problem into binary classification (using logistic regression).
- For a given context, the model takes the true pair (target, true context) and labels it as $1$ (positive sample).
- It then randomly samples $k$ words from a noise distribution (treated as words not in the context) and pairs them with the target, labeling them as $0$ (negative samples).
- The objective becomes maximizing the (sigmoid of the) dot product for the positive pair while minimizing it for the negative pairs. Typically, $k$ is between $5$ and $20$ for small datasets, and between $2$ and $5$ for large datasets.
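The per-example loss can be sketched as below. The function name `negative_sampling_loss` and the random vectors are assumptions for illustration; a real implementation would draw the negatives from a noise distribution over the vocabulary rather than generating random vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_target, u_context, u_negatives):
    """Loss for one (target, context) pair plus k sampled negative words."""
    positive = np.log(sigmoid(u_context @ v_target))               # true pair: label 1
    negative = np.log(sigmoid(-(u_negatives @ v_target))).sum()    # sampled pairs: label 0
    return -(positive + negative)

rng = np.random.default_rng(0)
dim, k = 8, 5
v_target = rng.normal(size=dim)
u_context = rng.normal(size=dim)
u_negatives = rng.normal(size=(k, dim))        # stand-ins for k sampled noise words

loss_random = negative_sampling_loss(v_target, u_context, u_negatives)
# A well-fit configuration: context aligned with the target, negatives opposed to it.
loss_fitted = negative_sampling_loss(v_target, 3 * v_target,
                                     -3 * np.tile(v_target, (k, 1)))
```

Each update touches only $k + 1$ output vectors instead of the entire vocabulary, which is where the speed-up comes from.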
Describe the Global Vectors for Word Representation (GloVe) model. How does it utilize the word co-occurrence matrix?
GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm developed by Stanford for obtaining vector representations for words.
Core Concept:
Unlike Word2Vec, which is a predictive model that uses local context windows, GloVe is a count-based model. It relies heavily on global statistical information of the corpus. The fundamental idea is that the ratio of co-occurrence probabilities of two words with various probe words contains the semantic information.
Utilization of the Co-occurrence Matrix:
- Matrix Construction: GloVe first builds a massive term-term co-occurrence matrix $X$, where each entry $X_{ij}$ represents how many times word $j$ occurs in the context of word $i$.
- Dimensionality Reduction: The raw matrix is too sparse and large. GloVe formulates a weighted least squares objective to factorize this matrix.
- Learning Embeddings: The model learns word vectors such that the dot product of two word vectors equals the logarithm of their probability of co-occurrence.
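The matrix construction step can be sketched with a symmetric context window. This is a simplified illustration that stores raw, unweighted counts in a dictionary; the helper name `cooccurrence_counts` and the toy corpus are assumptions:

```python
from collections import Counter

def cooccurrence_counts(sentences, window):
    """Count how often word j appears within `window` positions of word i."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [
    "ice is cold and solid".split(),
    "steam is hot and gas".split(),
]
X = cooccurrence_counts(corpus, window=2)
```

Only nonzero entries are stored, which matters in practice because the full term-term matrix is extremely sparse.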
Formulate the mathematical objective function of the GloVe model and explain its components.
The GloVe objective function aims to minimize the weighted least squares loss between the dot product of two word vectors and the logarithm of their co-occurrence counts.
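The cost function is formulated as:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
Explanation of Components:
- $V$: Vocabulary size.
- $X_{ij}$: The number of times word $j$ occurs in the context of word $i$.
- $w_i, \tilde{w}_j$: The embedding vectors for the main word $i$ and the context word $j$.
- $b_i, \tilde{b}_j$: Scalar biases for the main word and context word.
- $f(X_{ij})$: A weighting function designed to prevent rare co-occurrences from disproportionately influencing the cost, while also capping the influence of highly frequent pairs (like "the" and "and"). It is typically defined as:
$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
(where $x_{\max}$ is a cutoff count and $\alpha$ is often set to $0.75$).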
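The cost can be evaluated on a small dense matrix as a sanity check. In this sketch the default cutoff `x_max=100.0`, the toy counts, and the random vectors are assumptions chosen for illustration:

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function: grows as (x / x_max)^alpha, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, w, w_tilde, b, b_tilde):
    """Weighted least-squares cost summed over nonzero co-occurrence counts."""
    i, j = np.nonzero(X)                 # zero counts get zero weight, so they are skipped
    x = X[i, j]
    pred = np.sum(w[i] * w_tilde[j], axis=1) + b[i] + b_tilde[j]
    return float(np.sum(weight(x) * (pred - np.log(x)) ** 2))

rng = np.random.default_rng(0)
V, dim = 4, 3
X = np.array([[0, 2, 1, 0],
              [2, 0, 5, 1],
              [1, 5, 0, 3],
              [0, 1, 3, 0]], dtype=float)
cost = glove_cost(X, rng.normal(size=(V, dim)), rng.normal(size=(V, dim)),
                  np.zeros(V), np.zeros(V))
```

Skipping zero entries also avoids evaluating the logarithm of zero, which is undefined.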
Compare and contrast Word2Vec and GloVe embeddings in terms of underlying principles and training mechanisms.
1. Underlying Principles:
- Word2Vec: A predictive model based on shallow feed-forward neural networks. It learns embeddings by explicitly trying to predict a word from its context (CBOW) or context from a word (Skip-Gram).
- GloVe: A count-based, matrix factorization model. It explicitly leverages global statistical properties (word co-occurrence counts) of the entire corpus to learn embeddings.
2. Training Mechanism:
- Word2Vec: Scans through the corpus sequentially using a sliding window. It updates the vector weights incrementally using stochastic gradient descent (SGD) on individual target-context pairs.
- GloVe: First scans the entire corpus to build a massive global co-occurrence matrix. Then, it optimizes the embeddings using matrix factorization techniques (weighted least squares) on this pre-computed matrix.
3. Performance Considerations:
- Both generally yield similar high-quality downstream NLP performance when optimally tuned, but GloVe makes better use of global corpus statistics, whereas Word2Vec focuses strictly on local context windows.
How do word embeddings capture semantic similarity? Explain how Cosine Similarity is used to measure it.
Capturing Semantic Similarity:
Word embeddings capture semantics based on the distributional hypothesis (words appearing in similar contexts have similar meanings). During training, words that frequently share the same surrounding context are forced to have similar latent representations, placing them close to each other in the continuous vector space.
Using Cosine Similarity:
To quantify how similar two words are, we measure the angle between their respective embedding vectors. Cosine similarity calculates the cosine of this angle.
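Mathematically, for two vectors $\mathbf{A}$ and $\mathbf{B}$:
$$\cos\theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}$$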
- A value of $1$ means the vectors point in the exact same direction (highly similar).
- A value of $0$ means they are orthogonal (unrelated).
- A value of $-1$ means they point in opposite directions.
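The three cases can be checked with a short, dependency-free sketch. The input vectors are arbitrary examples:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])          # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])    # perpendicular
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])    # opposite direction
```

Because the result depends only on direction, scaling a vector (as in the first case) does not change its similarity to another vector.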
Explain how word embeddings exhibit analogy relationships. Provide a mathematical example using classic vector offsets.
Analogy Relationships in Embeddings:
One of the most striking features of well-trained dense word embeddings (like Word2Vec and GloVe) is that they capture syntactic and semantic regularities as linear offsets in the vector space. This means the relationship between pairs of words can be represented as a translation vector.
Mathematical Example:
The classic analogy is identifying gender relationships: "Man is to King as Woman is to Queen."
In the vector space, the directional vector from Man to Woman is roughly parallel to the vector from King to Queen.
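Using vector arithmetic:
$$\vec{v}_{\text{King}} - \vec{v}_{\text{Man}} \approx \vec{v}_{\text{Queen}} - \vec{v}_{\text{Woman}}$$
Rearranging to solve for the analogy:
$$\vec{v}_{\text{King}} - \vec{v}_{\text{Man}} + \vec{v}_{\text{Woman}} \approx \vec{v}_{\text{Queen}}$$
By computing the vector $\vec{v}_{\text{King}} - \vec{v}_{\text{Man}} + \vec{v}_{\text{Woman}}$ and finding the closest word vector in the vocabulary using Cosine Similarity (excluding the three input words), a well-trained model will typically output "Queen".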
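The lookup can be sketched with a toy vocabulary. The 2-dimensional vectors below are hand-made so that one axis roughly encodes "royalty" and the other "gender"; they are assumptions for illustration, not trained embeddings, and the helper name `analogy` is hypothetical:

```python
import numpy as np

# Hand-made vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Illustrative only.
vectors = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
    "apple": np.array([-0.8, 0.5]),
}

def analogy(a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    query = vectors[b] - vectors[a] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):              # the query words themselves are excluded
            continue
        score = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

result = analogy("man", "king", "woman")
```

Excluding the input words matters: without that step, the nearest neighbor of the query vector is often one of the inputs themselves.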
Why is it important to visualize embedding spaces? Name two common algorithms used for this purpose.
Importance of Visualizing Embedding Spaces:
- Model Debugging & Validation: Visualization allows researchers to verify if the model has actually learned meaningful semantic clusters (e.g., ensuring animals are clustered together away from vehicles).
- Discovering Relationships: It helps uncover hidden biases (like gender or racial bias) or interesting analogy relationships captured by the vector math.
- Interpretability: Since embeddings exist in high-dimensional spaces (e.g., 300 dimensions), human brains cannot comprehend them directly. Dimensionality reduction allows us to plot them in 2D or 3D spaces to make the "black box" of deep learning more intuitive.
Common Algorithms Used:
- PCA (Principal Component Analysis)
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
Explain how Principal Component Analysis (PCA) can be used to visualize high-dimensional word embeddings.
Principal Component Analysis (PCA) is a linear dimensionality reduction technique used to project high-dimensional word embeddings into a lower-dimensional space (typically 2D or 3D) for visualization.
How it works for embeddings:
- Centering the Data: The mean of the word vectors is calculated and subtracted from every vector to center the data around the origin.
- Covariance Matrix: The algorithm computes the covariance matrix of the centered embeddings to understand how the dimensions vary together.
- Eigen Decomposition: It calculates the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors represent the directions of maximum variance (the "principal components").
- Projection: To visualize the embeddings in 2D, the original vectors are projected onto the top two eigenvectors (those with the highest eigenvalues).
Result: PCA helps reveal global structures in the embedding space, such as parallel analogy vectors, though it may struggle to cleanly separate complex, non-linear local clusters.
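The four steps map directly onto a few lines of NumPy. This is a minimal sketch; the random matrix stands in for real word embeddings, and the helper name `pca_project` is hypothetical:

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project row vectors onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)        # 1. center the data
    cov = np.cov(centered, rowvar=False)                   # 2. covariance of the dimensions
    eigenvalues, eigenvectors = np.linalg.eigh(cov)        # 3. eigendecomposition
    order = np.argsort(eigenvalues)[::-1][:n_components]   # largest eigenvalues first
    return centered @ eigenvectors[:, order]               # 4. project

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 50))    # 20 hypothetical words, 50 dimensions
points_2d = pca_project(embeddings)
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit reordering.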
Describe t-Distributed Stochastic Neighbor Embedding (t-SNE) and explain why it is favored over PCA for visualizing word embeddings.
t-SNE is a non-linear dimensionality reduction technique heavily used for visualizing high-dimensional datasets, including word embeddings.
How it works:
t-SNE converts high-dimensional Euclidean distances between word vectors into conditional probabilities representing similarities. It then defines a similar probability distribution in a low-dimensional space (2D or 3D) and minimizes the Kullback-Leibler (KL) divergence between the two distributions using gradient descent. It uses a Student's t-distribution in the low-dimensional space to alleviate the "crowding problem."
Why it is favored over PCA:
- Local Structure Preservation: PCA is a linear algorithm that preserves global data variance. t-SNE is non-linear and specifically designed to preserve local neighborhood structures.
- Clustering: t-SNE is much better at separating words into distinct, visually interpretable semantic clusters (e.g., grouping all colors tightly together, separate from numbers). PCA often results in a massive, overlapping blob where local clusters are difficult to distinguish.
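The quantity that t-SNE minimizes can be evaluated directly. The sketch below only computes the objective for two candidate 2D maps and does not run the gradient descent; it also uses a single fixed `sigma` for all points, which is a simplification (the actual algorithm tunes a bandwidth per point from a perplexity setting). All data is synthetic:

```python
import numpy as np

def pairwise_sq_dists(X):
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_affinities(X, sigma=1.0):
    """Symmetrized Gaussian similarities in the original space."""
    logits = -pairwise_sq_dists(X) / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)                 # a point is not its own neighbor
    cond = np.exp(logits - logits.max(axis=1, keepdims=True))
    cond /= cond.sum(axis=1, keepdims=True)           # each row is a conditional distribution
    return (cond + cond.T) / (2 * len(X))

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) similarities in the low-dimensional map."""
    inv = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()

def kl_divergence(P, Q):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
# Two well-separated clusters of five points each in 10 dimensions.
X = np.vstack([rng.normal(0, 0.1, size=(5, 10)), rng.normal(3, 0.1, size=(5, 10))])
P = high_dim_affinities(X)

good_map = np.vstack([rng.normal(0, 0.1, size=(5, 2)), rng.normal(5, 0.1, size=(5, 2))])
bad_map = good_map[[0, 5, 1, 6, 2, 7, 3, 8, 4, 9]]    # mixes the two clusters
Q_good, Q_bad = low_dim_affinities(good_map), low_dim_affinities(bad_map)
kl_good, kl_bad = kl_divergence(P, Q_good), kl_divergence(P, Q_bad)
```

A map that keeps each cluster's neighbors together scores a lower KL divergence than one that mixes them, which is the sense in which t-SNE preserves local structure.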
Discuss the limitations of static word embeddings like Word2Vec and GloVe.
While revolutionary, static word embeddings like Word2Vec and GloVe have several significant limitations:
- Polysemy and Homonymy (Context-Independence): They generate a single, fixed vector for every word in the vocabulary. Therefore, they cannot distinguish between different meanings of the same word based on context. For example, the vector for "bank" is exactly the same in "river bank" and "bank account", effectively creating a blended, confusing representation.
- Out-of-Vocabulary (OOV) Problem: If a word does not appear in the training corpus, the model cannot assign it a vector. They do not utilize subword information (unlike FastText) to guess embeddings for rare, misspelled, or morphologically rich unseen words.
- Memory Intensive at Scale: They require maintaining a massive lookup table containing a vector for every distinct word in the vocabulary, which can be computationally taxing to store and deploy in memory-constrained environments.
Explain the role of the 'window size' hyperparameter in training Word2Vec models. How does varying it affect the learned embeddings?
The window size ($c$) is a crucial hyperparameter in Word2Vec that dictates how many context words to the left and right of the target word are considered during training.
Effect of varying window size:
- Small Window Size (e.g., $2$):
- The model focuses on the immediate neighbors of the target word.
- The learned embeddings capture syntactic similarity (e.g., recognizing parts of speech). Words that are functionally interchangeable in a sentence (like "dog" and "cat") will have highly similar embeddings.
- Large Window Size (e.g., up to $10$):
- The model incorporates a broader context, capturing the general topic or domain.
- The learned embeddings capture semantic or topical similarity (e.g., "dog" might be clustered closer to words like "bark", "leash", or "veterinarian").
Define Hierarchical Softmax. How does it improve the efficiency of training word embedding models?
Hierarchical Softmax is an alternative to the standard softmax function and Negative Sampling, used to make the training of Word2Vec models computationally efficient.
How it works:
Instead of evaluating the probability of a target word against all words in the vocabulary (an $O(|V|)$ operation for a vocabulary of size $|V|$), Hierarchical Softmax represents the vocabulary as a binary tree (typically a Huffman tree). The leaves of the tree are the actual words, and the internal nodes contain trainable weights.
Efficiency Improvement:
To calculate the probability of a word, the model only needs to trace the path from the root of the tree to the leaf corresponding to that word. At each internal node, it performs a binary logistic regression to decide whether to go left or right.
This reduces the computational complexity of the output probability calculation from $O(|V|)$ to $O(\log_2 |V|)$, where $|V|$ is the vocabulary size, vastly speeding up the training process for large vocabularies.
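The idea can be sketched with a small Huffman tree built from word frequencies. The frequencies, the random node vectors, and the helper names `build_huffman_paths` and `word_probability` are assumptions for illustration:

```python
import heapq
import itertools
import math
import random

def build_huffman_paths(frequencies):
    """Map each word to its root-to-leaf path as (inner_node, branch) steps."""
    counter = itertools.count()            # tie-breaker so the heap never compares nodes
    heap = [(freq, next(counter), word) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    children = {}                          # inner node -> (left child, right child)
    inner_ids = itertools.count()
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        node = ("inner", next(inner_ids))
        children[node] = (left, right)
        heapq.heappush(heap, (f1 + f2, next(counter), node))
    root = heap[0][2]

    paths = {}
    def walk(node, path):
        if node in children:
            left, right = children[node]
            walk(left, path + [(node, +1)])    # +1 encodes "go left"
            walk(right, path + [(node, -1)])   # -1 encodes "go right"
        else:
            paths[node] = path
    walk(root, [])
    return paths

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_probability(path, hidden, node_vectors):
    """Product of binary logistic decisions along the path to the word."""
    prob = 1.0
    for node, branch in path:
        score = sum(a * b for a, b in zip(node_vectors[node], hidden))
        prob *= sigmoid(branch * score)
    return prob

frequencies = {"the": 50, "dog": 10, "cat": 9, "barks": 4, "meows": 3, "veterinarian": 1}
paths = build_huffman_paths(frequencies)

random.seed(0)
dim = 4
hidden = [random.gauss(0, 1) for _ in range(dim)]
node_vectors = {node: [random.gauss(0, 1) for _ in range(dim)]
                for path in paths.values() for node, _ in path}
total = sum(word_probability(p, hidden, node_vectors) for p in paths.values())
```

Because the two branch probabilities at every internal node sum to one, the leaf probabilities form a valid distribution without ever summing over the vocabulary. With a Huffman tree, frequent words such as "the" also get the shortest paths.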
Outline the end-to-end pipeline for training and evaluating a Word2Vec model on a custom text corpus.
End-to-End Pipeline for Word2Vec:
1. Data Collection and Preprocessing:
- Tokenization: Split the raw text into individual words or subwords.
- Normalization: Convert to lowercase, remove punctuation, special characters, and optionally filter out stop words (though sometimes stop words help provide context).
- Subsampling: Downsample highly frequent words (e.g., "the", "is") to accelerate training and improve the representation of less frequent words.
2. Vocabulary Construction:
- Build a dictionary mapping every unique word to an integer index. Set a minimum frequency threshold (e.g., min_count=5) to discard extremely rare words.
3. Model Configuration & Training:
- Hyperparameters: Choose the architecture (CBOW vs. Skip-Gram), vector dimensionality (e.g., 300), window size (e.g., 5), and optimization strategy (Negative Sampling vs. Hierarchical Softmax).
- Training: Pass the context-target pairs through the neural network and update the embedding weight matrix using Stochastic Gradient Descent (SGD).
4. Evaluation:
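- Intrinsic Evaluation: Test the model on word analogy tasks (e.g., $\vec{v}_{\text{King}} - \vec{v}_{\text{Man}} + \vec{v}_{\text{Woman}} \approx \vec{v}_{\text{Queen}}$) and calculate the Spearman correlation between the model's similarity scores and human-annotated word similarity datasets (e.g., WordSim-353).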
- Visualization: Plot the learned vectors using t-SNE or PCA to inspect semantic clustering.
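Steps 1 and 2 of the pipeline can be sketched with the standard library alone. The normalizer below is deliberately simple, the corpus is a toy example, and the helper names `preprocess` and `build_vocab` are hypothetical; training itself would normally be delegated to an existing Word2Vec implementation:

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase the text and keep alphabetic tokens only (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(sentences, min_count):
    """Map each word seen at least `min_count` times to an integer index."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = sorted((w for w, c in counts.items() if c >= min_count),
                  key=lambda w: (-counts[w], w))     # most frequent first, ties alphabetical
    return {word: index for index, word in enumerate(kept)}, counts

raw_corpus = [
    "The dog barks. The dog runs!",
    "The cat meows; the cat sleeps.",
    "A veterinarian treats the dog and the cat.",
]
sentences = [preprocess(line) for line in raw_corpus]
vocab, counts = build_vocab(sentences, min_count=2)
```

With `min_count=2`, words that occur only once (such as "veterinarian") are dropped from the vocabulary, mirroring the minimum frequency threshold described in step 2.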