1What is the primary function of a vector space model in Natural Language Processing?
vector space models
Easy
A.To translate text directly from one language to another.
B.To represent words strictly as strings of text.
C.To convert audio signals into textual data.
D.To represent words or documents as vectors of numerical values.
Correct Answer: To represent words or documents as vectors of numerical values.
Explanation:
Vector space models provide a way to represent text algebraically by mapping words or documents to vectors of real numbers, allowing computers to process natural language mathematically.
2In a vector space model, what does the spatial distance between two word vectors typically represent?
vector space models
Easy
A.The historical age of the words.
B.The difference in their character lengths.
C.Their semantic similarity or relatedness.
D.Their alphabetical order in a dictionary.
Correct Answer: Their semantic similarity or relatedness.
Explanation:
In vector space models, words that share similar meanings or contexts are mapped to vectors that are close to each other in the vector space.
3What is a key advantage of dense word embeddings over traditional sparse representations like one-hot encoding?
dense word embeddings
Easy
A.They do not require any training data.
B.They represent every word as a vector composed almost entirely of zeros.
C.They capture semantic meaning in continuous, lower-dimensional vectors.
D.They require infinite memory to store.
Correct Answer: They capture semantic meaning in continuous, lower-dimensional vectors.
Explanation:
Unlike one-hot encoding (which is sparse and high-dimensional), dense embeddings use lower-dimensional continuous vectors where the values capture semantic relationships.
4Compared to sparse vectors, dense vectors typically have:
dense word embeddings
Easy
A.Only binary values (0s and 1s).
B.Variable lengths depending on the word size.
C.Lower dimensionality and continuous non-zero values.
D.Higher dimensionality and mostly zero values.
Correct Answer: Lower dimensionality and continuous non-zero values.
Explanation:
Dense embeddings are fixed-length, low-dimensional vectors (often between 50 and 300 dimensions) where most elements are non-zero real numbers.
5Which of the following is a popular shallow neural network framework developed for learning word embeddings?
Word2Vec
Easy
A.YOLO
B.Transformer
C.ResNet
D.Word2Vec
Correct Answer: Word2Vec
Explanation:
Word2Vec is a highly popular technique introduced by researchers at Google for learning dense vector representations of words.
6What are the two main architectures introduced in the Word2Vec model?
Word2Vec
Easy
A.CNN and RNN
B.CBOW and Skip-Gram
C.PCA and t-SNE
D.LSTM and GRU
Correct Answer: CBOW and Skip-Gram
Explanation:
Word2Vec consists of two distinct model architectures for generating embeddings: Continuous Bag-of-Words (CBOW) and Skip-Gram.
7Word2Vec relies on the idea that words appearing in similar contexts tend to have similar meanings. What is this concept called?
Word2Vec
Easy
A.The Turing test
B.The distributional hypothesis
C.The Markov assumption
D.The bag-of-words hypothesis
Correct Answer: The distributional hypothesis
Explanation:
The distributional hypothesis states that words occurring in similar contexts tend to have similar semantic meanings, which forms the basis for algorithms like Word2Vec.
8What is the main objective of the Continuous Bag-of-Words (CBOW) architecture?
CBOW
Easy
A.To predict the surrounding context words given a central target word.
B.To classify a text document into predefined categories.
C.To predict the next sentence in a document.
D.To predict a target word given its surrounding context words.
Correct Answer: To predict a target word given its surrounding context words.
Explanation:
CBOW learns embeddings by taking a window of surrounding context words as input and predicting the central target word.
9In the CBOW architecture, how is the context typically represented before predicting the target word?
CBOW
Easy
A.By taking the average or sum of the context word vectors.
B.By multiplying all context word vectors together.
C.By randomly selecting one context word.
D.By ignoring all words except the first one in the sequence.
Correct Answer: By taking the average or sum of the context word vectors.
Explanation:
The CBOW model combines the representations of the context words (often by averaging or summing them) into a single continuous representation to predict the target word.
10What does the Skip-Gram model aim to predict during training?
Skip-Gram models
Easy
A.A central target word given the surrounding context words.
B.The surrounding context words given a central target word.
C.The grammatical structure of the sentence.
D.The overall sentiment of the entire sentence.
Correct Answer: The surrounding context words given a central target word.
Explanation:
The Skip-Gram model flips the CBOW approach; it takes a single target word as input and tries to predict the words that appear in its surrounding context window.
11Which Word2Vec architecture is generally considered to work better with small amounts of training data and represents rare words well?
Skip-Gram models
Easy
A.One-hot encoding
B.CBOW
C.TF-IDF
D.Skip-Gram
Correct Answer: Skip-Gram
Explanation:
Skip-Gram treats each context-target pair as a new observation, making it better at capturing representations for infrequent words compared to CBOW.
12What does the acronym "GloVe" stand for in the context of NLP?
GloVe embeddings
Easy
A.Generative Language Output Vector Engine
B.Generalized Lexical Vocabulary Embeddings
C.Global Vectors for Word Representation
D.Global Vocabulary Extraction
Correct Answer: Global Vectors for Word Representation
Explanation:
GloVe stands for Global Vectors for Word Representation, an embedding method developed by researchers at Stanford.
13What kind of statistical information does GloVe primarily use to train its embeddings?
GloVe embeddings
Easy
A.Global word co-occurrence matrices
B.Character n-gram counts
C.Dependency parse trees
D.Bilingual translation pairs
Correct Answer: Global word co-occurrence matrices
Explanation:
GloVe leverages global statistical information by constructing a large matrix of word co-occurrence counts from the training corpus.
14Which of the following best describes the core mathematical approach of GloVe?
GloVe embeddings
Easy
A.It relies entirely on human-annotated semantic dictionaries.
B.It combines the benefits of local context window methods with global matrix factorization.
C.It creates sparse vectors strictly using Term Frequency-Inverse Document Frequency (TF-IDF).
D.It only uses a simple recurrent neural network to predict the next word.
Correct Answer: It combines the benefits of local context window methods with global matrix factorization.
Explanation:
GloVe bridges the gap between traditional matrix factorization techniques (like LSA) and local context window methods (like Skip-Gram) to capture both global statistics and local semantics.
15Which mathematical metric is most commonly used to compute the similarity between two word embeddings?
capturing semantic similarity
Easy
A.Euclidean distance
B.Cosine similarity
C.Manhattan distance
D.Jaccard index
Correct Answer: Cosine similarity
Explanation:
Cosine similarity measures the cosine of the angle between two vectors. It is the standard metric used in NLP to evaluate how semantically similar two word embeddings are, independent of their magnitude.
16If two words are highly semantically similar, their cosine similarity score will be closest to:
capturing semantic similarity
Easy
A.1
B.-1
C.100
D.0
Correct Answer: 1
Explanation:
A cosine similarity of 1 indicates that the vectors point in the exact same direction, meaning the words are highly similar in meaning.
17Word embeddings are famous for capturing algebraic analogies. Which classic vector equation best demonstrates this?
analogy relationships
Easy
A.
B.
C.
D.
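Correct Answer: $v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$
Explanation:
This famous equation shows that subtracting the vector for 'man' from the vector for 'king' and adding the vector for 'woman' yields a vector whose nearest neighbor is 'queen', illustrating that embeddings can capture gender-related semantic analogies.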
18By solving analogy tests (e.g., 'Paris is to France as Tokyo is to X'), we are primarily testing a word embedding model's ability to capture:
analogy relationships
Easy
A.Document length and structure.
B.Syntactic and semantic relationships.
C.Alphabetical sorting and casing.
D.Sentence and paragraph boundaries.
Correct Answer: Syntactic and semantic relationships.
Explanation:
Analogy tasks are designed to evaluate how well the embedding space organizes words according to real-world syntactic (grammar) and semantic (meaning) relationships.
19Why are techniques like PCA and t-SNE commonly used with word embeddings?
visualizing embedding spaces using PCA or t-SNE
Easy
A.To translate embeddings into a different language.
B.To increase the dimensionality of the vectors for better accuracy.
C.To reduce the high-dimensional vectors to 2D or 3D for human visualization.
D.To convert textual embeddings directly into speech.
Correct Answer: To reduce the high-dimensional vectors to 2D or 3D for human visualization.
Explanation:
Word embeddings typically have 100-300 dimensions, which humans cannot visualize. Dimensionality reduction techniques like PCA and t-SNE project them into 2D or 3D space so we can plot and inspect them.
20Which visualization technique is specifically well-known for preserving local data structures, making it highly effective for clustering similar word vectors together in a 2D scatter plot?
visualizing embedding spaces using PCA or t-SNE
Easy
Correct Answer: t-SNE
Explanation:
t-SNE is a non-linear dimensionality reduction technique that is highly effective at keeping similar instances close together in lower dimensions, making it well suited to visualizing word clusters.
21In a standard Vector Space Model, what is the primary consequence of representing a vocabulary with one-hot encoded vectors?
vector space models
Medium
A.The matrix of all word vectors forms a dense representation, requiring storage overall.
B.The dot product of any two distinct word vectors will be exactly 1.
C.The vectors inherently capture semantic relationships through their distance in the vector space.
D.The dot product of any two distinct word vectors evaluates to 0, making it impossible to measure semantic similarity directly.
Correct Answer: The dot product of any two distinct word vectors evaluates to 0, making it impossible to measure semantic similarity directly.
Explanation:
One-hot vectors are mutually orthogonal. Because they only contain a single '1' and '0's elsewhere, the dot product of two different words is always 0. This prevents the model from capturing any intrinsic similarity between different words.
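The orthogonality described above can be checked with a minimal sketch. The three-word vocabulary and the helper names `one_hot` and `dot` are illustrative assumptions, not part of the quiz.

```python
def one_hot(index, vocab_size):
    """Return a one-hot list with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["cat", "dog", "car"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Distinct words never share the position of their single 1.
cat_dog = dot(vectors["cat"], vectors["dog"])
cat_car = dot(vectors["cat"], vectors["car"])
cat_cat = dot(vectors["cat"], vectors["cat"])
```

Here 'cat' is exactly as dissimilar to 'dog' as it is to 'car', which is why one-hot vectors cannot express graded semantic similarity.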
22When measuring the similarity between two document vectors in a vector space model, why is cosine similarity typically preferred over Euclidean distance?
vector space models
Medium
A.Cosine similarity is mathematically faster to compute because it does not require a dot product.
B.Cosine similarity automatically penalizes terms that appear frequently across all documents in a corpus.
C.Euclidean distance is sensitive to the magnitude (length) of the vectors, meaning documents with similar content but different lengths may appear highly dissimilar.
D.Euclidean distance cannot be applied to vectors containing negative values.
Correct Answer: Euclidean distance is sensitive to the magnitude (length) of the vectors, meaning documents with similar content but different lengths may appear highly dissimilar.
Explanation:
Cosine similarity measures the angle between vectors, disregarding their magnitude. This ensures that a long document and a short document with the same word distribution are considered similar, which Euclidean distance would fail to do.
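As a small illustration of this point, the sketch below compares a short document with a version that is ten times longer but has the same word distribution. The count vectors are made-up toy data and the function names are assumptions for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both must be non-zero)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy term-count vectors: same proportions, different document lengths.
short_doc = [1, 2, 0, 1]
long_doc = [10 * count for count in short_doc]

cos = cosine_similarity(short_doc, long_doc)
dist = euclidean_distance(short_doc, long_doc)
```

The cosine similarity is 1 (up to floating-point error) because the two vectors point in the same direction, while the Euclidean distance is large purely because of the difference in length.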
23Which of the following best describes the core mechanism by which dense word embeddings learn meaning?
dense word embeddings
Medium
A.They optimize character-level features to predict the morphological structure of words.
B.They map words to predefined semantic categories provided by linguists in a centralized dictionary.
C.They adjust continuous vector weights to maximize the probability of words appearing in similar local contexts based on the distributional hypothesis.
D.They are trained to reconstruct a sparse TF-IDF matrix using Singular Value Decomposition.
Correct Answer: They adjust continuous vector weights to maximize the probability of words appearing in similar local contexts based on the distributional hypothesis.
Explanation:
Dense embeddings (like Word2Vec) rely on the distributional hypothesis ('a word is characterized by the company it keeps'), adjusting weights via a neural network to predict surrounding contexts, thereby encoding semantic meaning.
24If the dimensionality of a dense word embedding is chosen to be excessively large relative to the vocabulary size and dataset, what is the most likely outcome?
dense word embeddings
Medium
A.The embeddings will naturally degrade into one-hot encoded vectors.
B.The model may overfit to the training corpus and fail to capture broad semantic similarities.
C.The model will perfectly generalize to out-of-vocabulary words.
D.The training process will become a convex optimization problem.
Correct Answer: The model may overfit to the training corpus and fail to capture broad semantic similarities.
Explanation:
While higher dimensions capture more nuanced features, an excessively large dimensionality can lead to overfitting, where the model memorizes specific contexts from the training set rather than learning generalizable semantic representations.
25What is the primary computational purpose of using Negative Sampling in Word2Vec training?
Word2Vec
Medium
A.To replace the computationally expensive softmax operation over the entire vocabulary with a set of binary logistic regression tasks.
B.To intentionally inject noise into the input vectors, acting as a form of regularization like dropout.
C.To remove negative or toxic words from the training corpus to prevent bias.
D.To invert the vectors of antonyms so they point in opposite directions in the vector space.
Correct Answer: To replace the computationally expensive softmax operation over the entire vocabulary with a set of binary logistic regression tasks.
Explanation:
Updating the weights for the entire vocabulary (via a full softmax) for every training example is highly inefficient. Negative sampling approximates this by updating weights for the true context word and only a small number of randomly sampled 'negative' words.
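A minimal sketch of the per-example negative sampling loss, assuming the usual formulation: one logistic term for the true context word plus one logistic term per sampled negative. The toy vectors and the function names are illustrative assumptions, and real implementations also draw the negatives from a noise distribution, which is omitted here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center, positive, negatives):
    """Loss for one (center, context) pair with k sampled negatives.

    Only k + 1 output vectors are touched, instead of the whole vocabulary
    as a full softmax would require.
    """
    loss = -math.log(sigmoid(dot(center, positive)))
    for negative in negatives:
        loss -= math.log(sigmoid(-dot(center, negative)))
    return loss


# Toy 2-dimensional vectors, for illustration only.
center = [0.5, 0.1]
true_context = [0.4, 0.2]
sampled_negatives = [[-0.3, 0.1], [0.0, -0.5]]

loss = negative_sampling_loss(center, true_context, sampled_negatives)
```

The loss shrinks as the center vector aligns with the true context vector and moves away from the sampled negatives.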
26In Word2Vec, subsampling of frequent words is often employed. How does this technique improve the resulting embeddings?
Word2Vec
Medium
A.It balances the dataset by artificially generating synonyms for rare words.
B.It discards rare words that appear fewer than 5 times, preventing the vocabulary from becoming too large.
C.It probabilistically discards highly frequent words (like 'the', 'is'), preventing them from dominating the training time and allowing the model to focus on more informative co-occurrences.
D.It strictly enforces the model to assign larger magnitudes to the vectors of frequent words.
Correct Answer: It probabilistically discards highly frequent words (like 'the', 'is'), preventing them from dominating the training time and allowing the model to focus on more informative co-occurrences.
Explanation:
Highly frequent words provide less semantic value. Subsampling reduces their frequency in the training data, speeding up training and improving the quality of representations for less frequent, more meaningful words.
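The discard rule can be sketched as follows, using the formula given in the original Word2Vec paper, where a word with relative frequency f is dropped with probability 1 - sqrt(t / f). The threshold value and the example frequencies are assumptions for illustration, and released implementations use a slightly different variant.

```python
import math


def discard_probability(relative_frequency, threshold=1e-5):
    """Probability of dropping one occurrence of a word during training.

    Words rarer than the threshold are never dropped (probability 0).
    """
    return max(0.0, 1.0 - math.sqrt(threshold / relative_frequency))


# Hypothetical relative frequencies: a stop word and a rare content word.
p_the = discard_probability(0.05)
p_rare = discard_probability(1e-6)
```

With these numbers, most occurrences of the frequent word are skipped while the rare word is always kept.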
27In the Continuous Bag-of-Words (CBOW) model, how is the hidden layer representation constructed from the context window before predicting the target word?
CBOW
Medium
A.By applying a recurrent neural network (RNN) sequentially over the context words.
B.By taking the average (or sum) of the input vector representations of the context words.
C.By computing the dot product between all pairs of context words in the window.
D.By concatenating the context word vectors into a single long vector.
Correct Answer: By taking the average (or sum) of the input vector representations of the context words.
Explanation:
In CBOW, the input to the prediction layer is a single vector, which is calculated as the average (or sum) of the dense embeddings of the context words within the window. It is called 'bag-of-words' because the order of context words is lost in this averaging.
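The averaging step can be sketched in a few lines. The 2-dimensional embedding values below are made up for illustration, and `average_context` is a hypothetical helper name.

```python
# Toy input embeddings (values chosen only for illustration).
embeddings = {
    "quick": [1.0, 0.0],
    "brown": [0.0, 1.0],
    "jumps": [1.0, 1.0],
    "over": [2.0, 2.0],
}


def average_context(context_words, embeddings):
    """Average the embeddings of the context words into one hidden vector."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in context_words:
        for i, value in enumerate(embeddings[word]):
            total[i] += value
    return [value / len(context_words) for value in total]


hidden = average_context(["quick", "brown", "jumps", "over"], embeddings)
shuffled = average_context(["over", "jumps", "brown", "quick"], embeddings)
```

Both calls return the same vector, which is the 'bag-of-words' property: the order of the context words does not affect the hidden representation.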
28Suppose you are training a CBOW model. If the sentence is 'The quick brown fox jumps over the lazy dog' and the current target word is 'fox' with a window size of 2, what are the input context words?
Correct Answer: 'quick', 'brown', 'jumps', 'over'
Explanation:
A window size of 2 means taking the 2 words immediately preceding the target word and the 2 words immediately following it. For 'fox', the preceding words are 'quick' and 'brown', and the following words are 'jumps' and 'over'.
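The window extraction for this example can be sketched as below. Whitespace tokenization and the helper name `context_window` are simplifying assumptions.

```python
def context_window(tokens, target_index, window):
    """Return up to `window` tokens on each side of the target token."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right


tokens = "The quick brown fox jumps over the lazy dog".split()
context = context_window(tokens, tokens.index("fox"), window=2)
# context == ['quick', 'brown', 'jumps', 'over']
```

Near the start or end of a sentence the window is simply truncated, so fewer context words are returned.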
29How does the training objective of the Skip-Gram model fundamentally differ from that of the CBOW model?
B.Skip-Gram predicts surrounding context words given a single target word, whereas CBOW predicts a target word given a set of context words.
C.Skip-Gram predicts a target word given a set of context words, whereas CBOW predicts context words given a target word.
D.Skip-Gram computes word co-occurrence matrices explicitly, whereas CBOW relies on a shallow neural network.
Correct Answer: Skip-Gram predicts surrounding context words given a single target word, whereas CBOW predicts a target word given a set of context words.
Explanation:
The Skip-Gram architecture takes a single target word as input and attempts to predict the words in its surrounding context window. CBOW does the exact opposite.
30Given the sentence 'She loves deep learning deeply', and using a Skip-Gram model with a window size of 1, how many training pairs are generated when 'deep' is the target word?
B.3 pairs: ('loves', 'deep'), ('learning', 'deep'), and ('deeply', 'deep')
C.2 pairs: ('deep', 'loves') and ('deep', 'learning')
D.1 pair: ('deep', 'learning')
Correct Answer: 2 pairs: ('deep', 'loves') and ('deep', 'learning')
Explanation:
With a window size of 1, the context consists of 1 word to the left ('loves') and 1 word to the right ('learning'). Since Skip-Gram predicts context given the target, the pairs (target, context) are ('deep', 'loves') and ('deep', 'learning').
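Pair generation for this sentence can be sketched as follows. Whitespace tokenization and the function name `skipgram_pairs` are assumptions made for the example.

```python
def skipgram_pairs(tokens, window):
    """Generate (target, context) pairs for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "She loves deep learning deeply".split()
pairs = skipgram_pairs(tokens, window=1)
deep_pairs = [pair for pair in pairs if pair[0] == "deep"]
# deep_pairs == [('deep', 'loves'), ('deep', 'learning')]
```

Note that 'deeply' is a different token from 'deep' and lies outside the window, so it contributes no pair for this target.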
31When comparing Skip-Gram and CBOW on a large corpus, what is a well-documented empirical advantage of the Skip-Gram model?
Skip-Gram models
Medium
A.It requires much less memory because it limits the context to one word.
B.It trains significantly faster than CBOW because it averages input vectors.
C.It tends to produce better quality representations for rare or infrequent words.
D.It naturally groups antonyms together while pushing synonyms apart.
Correct Answer: It tends to produce better quality representations for rare or infrequent words.
Explanation:
Because Skip-Gram treats each target-context word pair as a separate observation, every occurrence of a rare word produces its own gradient updates. CBOW averages context words, smoothing over the presence of rare words, which makes it faster but slightly worse for infrequent terms.
32The Global Vectors (GloVe) model captures semantics primarily by modeling which of the following statistical properties of the corpus?
GloVe embeddings
Medium
A.The ratio of co-occurrence probabilities of two words with various probe words.
B.The absolute frequency count of single words sorted in descending order.
C.The probability of the target word given the continuous bag of context words.
D.The Singular Value Decomposition (SVD) of a term-document matrix.
Correct Answer: The ratio of co-occurrence probabilities of two words with various probe words.
Explanation:
GloVe is designed on the principle that the ratio of co-occurrence probabilities (rather than raw probabilities) is what fundamentally encodes meaning and distinguishes words from one another in a global co-occurrence matrix.
33In the GloVe objective function, a weighting function $f(X_{ij})$ is applied to the squared error term. What is a crucial property of this weighting function?
GloVe embeddings
Medium
A.It ensures that zero co-occurrences ($X_{ij} = 0$) result in a zero weight, and it caps the weight of highly frequent co-occurrences to avoid over-weighting.
B.It strictly assigns a weight of 0 to the most frequent words to prevent them from dominating the loss.
C.It is heavily exponentially weighted for rare co-occurrences to give them more importance.
D.It applies a softmax distribution to normalize all co-occurrences into probabilities that sum to 1.
Correct Answer: It ensures that zero co-occurrences ($X_{ij} = 0$) result in a zero weight, and it caps the weight of highly frequent co-occurrences to avoid over-weighting.
Explanation:
The GloVe weighting function $f$ limits the influence of extremely common co-occurrences (like 'the' and 'and') by capping $f(X_{ij})$ at a maximum value, and guarantees that $f(0) = 0$ so that zero co-occurrences, for which $\log X_{ij}$ diverges, contribute nothing to the loss.
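A minimal sketch of such a weighting function, using the form and default constants reported in the GloVe paper (a cutoff of 100 and an exponent of 0.75). The function and parameter names are assumptions for this example.

```python
def glove_weight(cooccurrence, x_max=100.0, alpha=0.75):
    """Weight applied to the squared error for one co-occurrence count.

    Grows as (x / x_max) ** alpha below the cutoff and is capped at 1 above it.
    """
    if cooccurrence < x_max:
        return (cooccurrence / x_max) ** alpha
    return 1.0


w_zero = glove_weight(0.0)      # zero co-occurrence: contributes nothing
w_small = glove_weight(10.0)    # rare pair: down-weighted
w_huge = glove_weight(10000.0)  # very frequent pair: capped
```

Because the weight is exactly 0 at a count of 0, the undefined logarithm for unseen pairs never enters the loss.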
34How does GloVe inherently bridge the gap between matrix factorization methods (like LSA) and shallow window-based methods (like Word2Vec)?
GloVe embeddings
Medium
A.It uses a local context window to compute co-occurrences, but trains its vectors by optimizing a global log-bilinear regression over the resulting co-occurrence matrix.
B.It computes the global Term-Frequency Inverse Document Frequency (TF-IDF) and feeds it directly into a Continuous Bag-of-Words model.
C.It processes local context windows sequentially using an RNN, but initializes the weights with an LSA matrix.
D.It performs Singular Value Decomposition (SVD) at every iteration step of a Skip-Gram training loop.
Correct Answer: It uses a local context window to compute co-occurrences, but trains its vectors by optimizing a global log-bilinear regression over the resulting co-occurrence matrix.
Explanation:
GloVe accumulates co-occurrence counts from local context windows across the whole corpus into a large matrix, then learns vectors such that their dot product (plus bias terms) approximates the logarithm of the co-occurrence count. This combines global statistical information with local window-based insights.
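For reference, the weighted least-squares objective from the GloVe paper can be written as below. The notation follows that paper rather than this quiz: $V$ is the vocabulary size, $X_{ij}$ the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ the word and context vectors, $b_i$ and $\tilde{b}_j$ their biases, and $f$ the weighting function.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

Each term is a regression of the (bias-adjusted) dot product onto the log co-occurrence count, which is the sense in which the model is log-bilinear.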
35If two word vectors $u$ and $v$ have been $L_2$-normalized so that $\|u\| = 1$ and $\|v\| = 1$, how is their Euclidean distance $d$ algebraically related to their cosine similarity $\cos\theta$?
capturing semantic similarity
Medium
A.
B.
C.
D.
Correct Answer: $d^2 = 2(1 - \cos\theta)$
Explanation:
For normalized vectors, the squared Euclidean distance is $d^2 = \|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v$. Since $\|u\| = \|v\| = 1$, this simplifies to $d^2 = 2 - 2\,u \cdot v$. Cosine similarity for normalized vectors is just the dot product, making $d^2 = 2(1 - \cos\theta)$.
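The identity can be checked numerically with a short sketch. The two starting vectors are arbitrary illustrative values and `normalize` is a hypothetical helper.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit L2 norm."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])

# For unit vectors the cosine similarity is just the dot product.
cos_sim = sum(a * b for a, b in zip(u, v))
squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
```

Up to floating-point error, `squared_distance` equals `2 - 2 * cos_sim`, so ranking neighbors by Euclidean distance or by cosine similarity gives the same order once vectors are normalized.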
36A significant limitation of standard Word2Vec and GloVe embeddings in capturing semantic similarity in real-world applications is their handling of Out-Of-Vocabulary (OOV) words. Why does this limitation exist?
capturing semantic similarity
Medium
A.They only map whole words seen during training to a fixed dictionary of vectors, making them incapable of inferring vectors for unseen words based on subwords.
B.They dynamically assign a random vector to new words at inference, causing catastrophic forgetting of existing similarities.
C.They rely on part-of-speech tags, meaning OOV words lack syntactic features necessary for vector computation.
D.They use absolute positional encodings that break when a text exceeds the maximum sequence length seen during training.
Correct Answer: They only map whole words seen during training to a fixed dictionary of vectors, making them incapable of inferring vectors for unseen words based on subwords.
Explanation:
Traditional Word2Vec and GloVe are word-level models with a fixed vocabulary dictionary. They do not utilize subword or character information (unlike FastText), so any word not present in the training vocabulary cannot have a vector assigned to it.
37To solve the analogy task 'man is to king as woman is to X' using word embeddings, which vector arithmetic operation is conventionally used to find the target vector for X?
analogy relationships
Medium
A.
B.
C.
D.
Correct Answer: $v_X \approx v_{\text{king}} - v_{\text{man}} + v_{\text{woman}}$
Explanation:
The relationship 'man to king' translates to a vector offset $v_{\text{king}} - v_{\text{man}}$. Adding this offset to $v_{\text{woman}}$ yields the expected target for X (queen), hence $v_X \approx v_{\text{king}} - v_{\text{man}} + v_{\text{woman}}$.
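The offset arithmetic can be sketched on hand-built toy vectors. The three dimensions (roughly 'royalty', 'male', 'female') and all values are assumptions made so the example is easy to follow; trained embeddings are not this clean, and the match is only approximate in practice.

```python
import math

# Hand-built toy embeddings: [royalty, male, female]. Not trained values.
embeddings = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' via the offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are conventionally excluded from the candidates.
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


answer = analogy("man", "king", "woman", embeddings)
```

Excluding the query words matters: with real embeddings the nearest vector to the target is often one of the inputs themselves.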
38What mathematical property of dense embedding spaces allows them to successfully resolve syntactic analogies (e.g., walk:walking :: jump:jumping)?
analogy relationships
Medium
A.Relationships are captured as consistent linear translations (offsets) spanning across the vector space.
B.Syntactic analogies rely on exact string matching functions built into the embedding lookup mechanism.
C.The embeddings cluster all verbs into a single distinct hypersphere in the vector space.
D.The vectors are strictly normalized to orthogonal axes based on their morphological suffixes.
Correct Answer: Relationships are captured as consistent linear translations (offsets) spanning across the vector space.
Explanation:
The offset between a base word and its continuous form (e.g., $v_{\text{walking}} - v_{\text{walk}}$) represents a morphological/syntactic relationship. In models like Word2Vec, this geometric offset tends to be consistent across different verb pairs, appearing as a linear translation in the space.
39When visualizing a 300-dimensional word embedding space in 2D, how does t-SNE generally differ from Principal Component Analysis (PCA) in its treatment of the data?
visualizing embedding spaces using PCA or t-SNE
Medium
A.t-SNE preserves the global variance of the data strictly linearly, whereas PCA focuses on minimizing the distance between nearby points non-linearly.
B.t-SNE models probabilities of neighborhood distances to preserve local cluster structures non-linearly, whereas PCA linearly projects data to maximize global variance.
C.PCA is computationally much slower but yields a deterministic mapping, while t-SNE is faster but random.
D.PCA requires embeddings to be converted to one-hot vectors, while t-SNE works natively on dense floating-point vectors.
Correct Answer: t-SNE models probabilities of neighborhood distances to preserve local cluster structures non-linearly, whereas PCA linearly projects data to maximize global variance.
Explanation:
PCA is a linear dimensionality reduction technique that prioritizes preserving large-scale variance (global structure). t-SNE is non-linear: it minimizes the Kullback-Leibler divergence between the pairwise similarity distributions in the high-dimensional and low-dimensional spaces, which makes it excel at keeping similar points close together (local structure).
40When applying t-SNE to visualize word embeddings, a crucial hyperparameter is 'perplexity'. What does perplexity effectively control in the t-SNE algorithm?
visualizing embedding spaces using PCA or t-SNE
Medium
A.The trade-off between the number of dimensions in the output space (e.g., 2D vs 3D).
B.The learning rate of the gradient descent used to optimize the KL divergence.
C.The balance between local and global aspects of the data, acting as a soft measure of the number of effective nearest neighbors for each point.
D.The number of iterations before the algorithm terminates.
Correct Answer: The balance between local and global aspects of the data, acting as a soft measure of the number of effective nearest neighbors for each point.
Explanation:
In t-SNE, perplexity dictates the variance of the Gaussian distribution used to measure similarities in the high-dimensional space. Practically, it represents a smooth measure of the effective number of nearest neighbors, determining how heavily local vs global topology is weighted.
41In high-dimensional term-document vector space models leveraging tf-idf, what is the primary consequence of applying Latent Semantic Analysis (LSA) with a truncated Singular Value Decomposition (SVD) on the preservation of cosine similarity between rare words?
vector space models
Hard
A.The cosine similarity between rare words often becomes highly distorted or artificially inflated due to projection into dimensions dominated by frequent word variance.
B.Rare word vectors become strictly orthogonal to all other vectors because they are entirely relegated to the discarded singular values.
C.LSA guarantees the preservation of exact pairwise Euclidean distances for rare words, effectively bypassing the curse of dimensionality.
D.The truncated SVD acts as a perfect regularization mechanism, increasing the cosine similarity of rare words strictly proportionally to their true semantic overlap.
Correct Answer: The cosine similarity between rare words often becomes highly distorted or artificially inflated due to projection into dimensions dominated by frequent word variance.
Explanation:
LSA uses truncated SVD to optimally reconstruct the matrix in terms of the Frobenius norm. Because top singular vectors are primarily driven by the variance of frequent words, rare words suffer from structural distortion and can mistakenly be projected into overlapping regions, inflating their similarity.
42Count-based Vector Space Models often utilize Positive Pointwise Mutual Information (PPMI). Why does standard PPMI introduce a systematic bias in word representations, and how is it mathematically mitigated in practice?
vector space models
Hard
A.It biases towards infrequent words because low-probability events yield extremely high PMI values; mitigated by context distribution smoothing, such as raising context probabilities to the power of 0.75.
B.It biases towards frequent word pairs; mitigated by applying a logarithmic scaling factor to all raw co-occurrence counts prior to PPMI calculation.
C.It suffers from linear dependence on document length; mitigated by normalizing all term vectors strictly by their norm prior to calculating the probability distribution.
D.It assigns negative infinity to zero co-occurrences, destroying the vector space; mitigated by replacing all zero counts with the expected co-occurrence probability.
Correct Answer: It biases towards infrequent words because low-probability events yield extremely high PMI values; mitigated by context distribution smoothing, such as raising context probabilities to the power of 0.75.
Explanation:
PMI naturally overestimates the association of rare events because the joint probability in the numerator is bounded, while the denominator (product of marginals) can be exceptionally small. Raising the context frequency to a fractional power (like 0.75) increases the effective probability of rare words, reducing this bias.
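The effect of context distribution smoothing can be sketched on a tiny co-occurrence table. The counts are invented for illustration, and the `ppmi` function is an assumed minimal implementation rather than a reference one.

```python
import math
from collections import Counter


def ppmi(pair_counts, alpha=1.0):
    """PPMI for each (word, context) pair; alpha < 1 smooths context probabilities."""
    total = sum(pair_counts.values())
    word_counts, context_counts = Counter(), Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n
    smoothed_total = sum(n ** alpha for n in context_counts.values())

    scores = {}
    for (word, context), n in pair_counts.items():
        p_joint = n / total
        p_word = word_counts[word] / total
        # Raising counts to alpha < 1 shifts probability mass towards rare contexts.
        p_context = context_counts[context] ** alpha / smoothed_total
        scores[(word, context)] = max(0.0, math.log(p_joint / (p_word * p_context)))
    return scores


# Invented toy counts: 'absinthe' is a very rare context word.
pair_counts = {
    ("drink", "water"): 50,
    ("drink", "the"): 100,
    ("eat", "the"): 100,
    ("drink", "absinthe"): 1,
}
plain = ppmi(pair_counts, alpha=1.0)
smoothed = ppmi(pair_counts, alpha=0.75)
```

In this toy table the single ('drink', 'absinthe') observation gets a positive score without smoothing, while with an exponent of 0.75 the rare context's probability is raised enough that the score drops.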
43Polysemy poses a challenge for standard dense word embeddings because a single vector must represent multiple meanings. Under the linear superposition hypothesis proposed by Arora et al. (2018), how are multiple distinct senses of a word theoretically represented in a single dense vector?
dense word embeddings
Hard
A.The word vector is approximately a linear combination of the underlying sense vectors, and sparse coding can recover the individual senses provided the sense vectors are sufficiently uncorrelated and isotropically distributed.
B.The multiple senses occupy mutually orthogonal subspaces governed by the eigenvalues of the context matrix, allowing recovery via Principal Component Analysis (PCA).
C.The word vector is strictly the geometric centroid of context vectors, meaning the dominant sense entirely overrides minor senses, requiring non-linear manifold unraveling to recover.
D.The word vector represents a probabilistic mixture where individual senses can only be isolated by computing the gradient of the loss function with respect to context words.
Correct Answer: The word vector is approximately a linear combination of the underlying sense vectors, and sparse coding can recover the individual senses provided the sense vectors are sufficiently uncorrelated and isotropically distributed.
Explanation:
Theoretical work by Arora et al. demonstrates that word embeddings act as a linear superposition (a weighted sum) of their distinct senses. Because vectors are sparse over semantic concepts, techniques like sparse dictionary learning can successfully disentangle the original sense vectors.
44Which of the following mathematical properties best explains why dense word embeddings organically develop structural features allowing for linear algebraic operations (e.g., $v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$) without explicit semantic supervision?
dense word embeddings
Hard
A.The softmax function introduces a strict orthogonality constraint between syntactically distant words, aligning them along the primary axes of $\mathbb{R}^d$.
B.The optimization objective effectively factorizes a shifted log-co-occurrence matrix, meaning the inner product of vectors approximates the log probability, thus turning multiplicative probabilities into additive vector operations.
C.The continuous Bag-of-Words assumption forces all syntactically similar words to collapse into single eigenvectors of the co-occurrence matrix.
D.The embeddings inherently enforce a manifold where Euclidean distance is strictly equal to the inverse of raw co-occurrence frequency.
Correct Answer: The optimization objective effectively factorizes a shifted log-co-occurrence matrix, meaning the inner product of vectors approximates the log probability, thus turning multiplicative probabilities into additive vector operations.
Explanation:
Implicit matrix factorization proofs for embedding algorithms (like SGNS) show they factorize a variant of a Pointwise Mutual Information (PMI) matrix. By operating in log-probability space, multiplicative relationships in context distributions translate naturally to linear (additive/subtractive) relationships in the embedding space.
45In Word2Vec, negative sampling replaces the computationally expensive full softmax. If $k$ negative samples are drawn per positive sample, what is the asymptotic relationship between the Word2Vec negative sampling objective and the Pointwise Mutual Information (PMI) matrix as the embedding dimension $d \to \infty$?
Word2Vec
Hard
A.The dot product converges to an inflated PMI value, magnifying the similarity of rare words.
B.The dot product converges exactly to the Positive PMI matrix, $\max(\text{PMI}(w, c), 0)$.
C.The dot product converges to the log-likelihood of the marginal probability scaled by $k$.
D.The dot product converges to a shifted PMI matrix: $\vec{w} \cdot \vec{c} = \text{PMI}(w, c) - \log k$.
Correct Answer: The dot product converges to a shifted PMI matrix: $\vec{w} \cdot \vec{c} = \text{PMI}(w, c) - \log k$.
Explanation:
Levy and Goldberg (2014) proved that the global minimum of the Skip-gram with Negative Sampling (SGNS) objective corresponds to implicitly factoring a shifted PMI matrix, where each cell is exactly $\text{PMI}(w, c) - \log k$.
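The shifted-PMI result can be sanity-checked numerically for a single word-context pair. The sketch below assumes the per-pair SGNS objective in the form used by Levy and Goldberg, count(w,c) * log sigmoid(x) + k * count(w) * count(c) / |D| * log sigmoid(-x), where x is the dot product; the counts and function names are hypothetical, and bisection is simply a convenient way to locate the stationary point.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_local_gradient(x, n_wc, n_w, n_c, total, k):
    """Derivative in x of n_wc*log(sigmoid(x)) + k*(n_w*n_c/total)*log(sigmoid(-x))."""
    return n_wc * (1.0 - sigmoid(x)) - k * (n_w * n_c / total) * sigmoid(x)

def optimal_dot_product(n_wc, n_w, n_c, total, k, lo=-30.0, hi=30.0):
    """Bisection on the gradient, which is strictly decreasing in x."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if sgns_local_gradient(mid, n_wc, n_w, n_c, total, k) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical counts for one (word, context) pair in a corpus of 10,000 pairs.
n_wc, n_w, n_c, total, k = 50, 200, 100, 10_000, 5
pmi = math.log(n_wc * total / (n_w * n_c))
x_star = optimal_dot_product(n_wc, n_w, n_c, total, k)
print(x_star, pmi - math.log(k))
```

The two printed values agree up to floating-point error, which matches the claim that each cell of the implicitly factorized matrix equals PMI minus log k when the dot product is free to take any value.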
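46During Word2Vec training, frequent words are subsampled: each occurrence of a word $w_i$ is discarded with probability $P(w_i) = 1 - \sqrt{t / f(w_i)}$, where $f(w_i)$ is the word's relative frequency and $t$ is a small threshold. Beyond simply reducing training time, what is the theoretical effect of this subsampling mechanism on the network's learning dynamics?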
Word2Vec
Hard
A.It enforces a strict sparsity constraint on the resulting word vectors, causing them to approximate one-hot encodings for high-frequency stop words.
B.It shifts the loss function from cross-entropy to mean squared error for high-frequency target words.
C.It mathematically normalizes the context vectors to unit length, preventing gradient explosion in the hidden layer.
D.It effectively expands the dynamic context window size by skipping over uninformative frequent words, capturing longer-range dependencies for rare words.
Correct Answer: It effectively expands the dynamic context window size by skipping over uninformative frequent words, capturing longer-range dependencies for rare words.
Explanation:
When a frequent word is dropped during subsampling, the words around it are pulled closer together in the training sequence. This effectively stretches the context window over a larger physical distance in the original text, allowing the model to capture longer-range semantic relations.
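The window-stretching effect can be made concrete with a small sketch. It assumes the discard probability 1 - sqrt(t / f) from the original Word2Vec paper; the helper names, the threshold default, and the toy sentence are illustrative choices, and the survivors are picked by hand (stop words dropped) so the example is deterministic.

```python
import math
import random

def discard_prob(freq, t=1e-5):
    """Discard probability 1 - sqrt(t / f), clipped at 0 for words rarer than t."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, freqs, t=1e-5, rng=random):
    """Return (original_position, token) pairs that survive subsampling."""
    return [(i, w) for i, w in enumerate(tokens)
            if rng.random() >= discard_prob(freqs[w], t)]

def original_span(kept, center, window):
    """Distance in the ORIGINAL text covered by a +/- window around kept[center]."""
    lo = max(0, center - window)
    hi = min(len(kept) - 1, center + window)
    return kept[hi][0] - kept[lo][0]

tokens = ["the", "quick", "of", "fox", "jumps", "the", "of", "lazy", "the", "dog"]
# Deterministic stand-in for a subsampling run in which every stop word was dropped.
kept = [(i, w) for i, w in enumerate(tokens) if w not in {"the", "of"}]
print(kept)
print(original_span(kept, center=2, window=2))  # window of 2 around "jumps"
```

A window of 2 on each side of "jumps" would cover 4 positions in the raw text, but after the frequent words are removed the same window reaches from "quick" to "dog", 8 positions apart in the original sentence.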
47In the Continuous Bag-of-Words (CBOW) model, the hidden layer representation is the average (or sum) of the context word vectors. Given this architecture, how does the backpropagation gradient behave when updating the context vectors for a single training step?
CBOW
Hard
A.The gradient is heavily weighted towards context words nearest to the target word due to a positional decay function in the averaging layer.
B.The gradient calculated from the loss with respect to the hidden state is distributed identically to all context words in the window, ignoring their relative distances to the target.
C.Only the context word vector most similar to the target word receives a non-zero gradient, acting as an implicit max-pooling mechanism.
D.The gradient propagates exclusively to the negative samples, while positive context words are updated purely via momentum.
Correct Answer: The gradient calculated from the loss with respect to the hidden state is distributed identically to all context words in the window, ignoring their relative distances to the target.
Explanation:
Because the hidden state is a simple arithmetic average (or sum) of the context vectors, the derivative of the hidden state with respect to any individual context vector is uniform (or scaled by a constant $1/C$, where $C$ is the number of context words). Therefore, the same gradient from the output layer is passed equally to all context word embeddings.
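A finite-difference check makes the uniform gradient visible. The sketch below is a toy setup under stated assumptions: a full-softmax CBOW loss with averaging, three made-up 2-dimensional context vectors, and a made-up 4-word output matrix `U`; none of these values come from a trained model.

```python
import math

def cbow_loss(context_vecs, U, target):
    """Negative log-likelihood of `target` given the averaged context vectors."""
    C, d = len(context_vecs), len(context_vecs[0])
    h = [sum(v[j] for v in context_vecs) / C for j in range(d)]
    logits = [sum(u[j] * h[j] for j in range(d)) for u in U]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

def numeric_grad(context_vecs, U, target, i, eps=1e-6):
    """Central-difference gradient of the loss w.r.t. context vector i."""
    grad = []
    for j in range(len(context_vecs[i])):
        plus = [list(v) for v in context_vecs]
        minus = [list(v) for v in context_vecs]
        plus[i][j] += eps
        minus[i][j] -= eps
        grad.append((cbow_loss(plus, U, target) - cbow_loss(minus, U, target)) / (2 * eps))
    return grad

context_vecs = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8]]          # hypothetical inputs
U = [[0.1, 0.2], [-0.3, 0.5], [0.7, -0.6], [0.0, 0.4]]         # hypothetical output matrix
grads = [numeric_grad(context_vecs, U, target=2, i=i) for i in range(3)]
for g in grads:
    print(g)
```

All three context vectors receive the same gradient even though their values differ, because each contributes to the hidden state with the same weight 1/C.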
48Consider a CBOW model predicting a target word from a context window of size $C$ (total words). If the vocabulary size is $V$ and the embedding dimension is $d$, which of the following describes the complexity of computing the forward pass for a single target word using the standard softmax function?
CBOW
Hard
A.$O(C \cdot V \cdot d)$, because each context word must independently compute a full softmax over the vocabulary.
B.$O(V^2)$, because the model must compute a full covariance matrix between the target word and the vocabulary.
C.$O(C \cdot d + V \cdot d)$, because the model averages $C$ vectors of size $d$ and then computes the dot product of the hidden state with all $V$ output vectors.
D.$O(d \log V)$, due to the required use of a binary tree for the softmax calculation.
Correct Answer: $O(C \cdot d + V \cdot d)$, because the model averages $C$ vectors of size $d$ and then computes the dot product of the hidden state with all $V$ output vectors.
Explanation:
The forward pass involves fetching and summing/averaging $C$ context vectors (taking $O(C \cdot d)$ operations) and then multiplying the resulting $d$-dimensional hidden vector by the $V \times d$ output matrix to produce the logits for standard softmax (taking $O(V \cdot d)$ operations).
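The operation count can be verified by instrumenting a toy forward pass. This is a sketch with made-up matrices and a hypothetical `cbow_forward` helper; it counts one operation per accumulated vector component, which is the quantity the complexity argument tracks.

```python
import math

def cbow_forward(context_ids, W_in, W_out):
    """Full-softmax CBOW forward pass; returns (probabilities, accumulate-op count)."""
    d = len(W_in[0])
    ops = 0
    h = [0.0] * d
    for idx in context_ids:            # C lookups, d additions each
        for j in range(d):
            h[j] += W_in[idx][j]
            ops += 1
    h = [x / len(context_ids) for x in h]
    logits = []
    for row in W_out:                  # V output vectors, d multiply-adds each
        s = 0.0
        for j in range(d):
            s += row[j] * h[j]
            ops += 1
        logits.append(s)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    z_sum = sum(exps)
    return [e / z_sum for e in exps], ops

V, d = 6, 3
W_in = [[0.1 * (i + j) for j in range(d)] for i in range(V)]    # hypothetical weights
W_out = [[0.05 * (i - j) for j in range(d)] for i in range(V)]  # hypothetical weights
context_ids = [0, 2, 4, 5]
probs, ops = cbow_forward(context_ids, W_in, W_out)
print(ops, len(context_ids) * d + V * d)
```

The counter equals C*d + V*d, and for realistic vocabularies the V*d term from the output layer dominates.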
49The Skip-Gram with Negative Sampling (SGNS) objective utilizes a noise distribution for drawing negative samples, often chosen as the unigram distribution raised to the power of $3/4$. What is the theoretical motivation for this specific fractional exponent?
Skip-Gram models
Hard
A.It mathematically guarantees that the objective function becomes strictly convex, ensuring convergence to a global minimum.
B.It exactly matches the Zipfian distribution of natural language, converting a heavy-tailed distribution into a uniform distribution.
C.It dampens the sampling probability of highly frequent words while proportionally increasing the likelihood of sampling rare words, improving the gradient signal for rare terms.
D.It normalizes the embedding space by forcing the norm of the gradient vectors to decay at a rate of $t^{-3/4}$ over time.
Correct Answer: It dampens the sampling probability of highly frequent words while proportionally increasing the likelihood of sampling rare words, improving the gradient signal for rare terms.
Explanation:
Raising the unigram distribution to a fractional power flattens the distribution. This empirical heuristic specifically addresses the imbalance between frequent words (which would otherwise dominate negative samples) and rare words, ensuring rare words are sampled as negatives more often than their raw frequency would dictate.
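The flattening is easy to see on a three-word vocabulary. The counts below are invented for illustration, and `noise_distribution` is a hypothetical helper that raises each count to the given power and renormalizes.

```python
def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, renormalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()}

counts = {"the": 1000, "cat": 100, "zygote": 1}   # hypothetical corpus counts
raw = noise_distribution(counts, power=1.0)       # plain unigram distribution
smooth = noise_distribution(counts, power=0.75)
for w in counts:
    print(f"{w:8s} raw={raw[w]:.5f} smoothed={smooth[w]:.5f}")
```

The most frequent word loses probability mass while the rare word is sampled several times more often than under the raw unigram distribution, yet the ordering of the words is unchanged.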
50When training a Skip-gram model using Hierarchical Softmax rather than Negative Sampling, the output vocabulary is organized into a Huffman tree. If a target word $w$ is located at depth $L$ in this tree, how is the gradient distributed to the output representation matrices during a single backpropagation step?
Skip-Gram models
Hard
A.Updates bypass the internal nodes and directly modify the leaf node of $w$ using a regularization term.
B.Updates are applied to the embedding vectors of all words in the vocabulary, inversely weighted by their tree distance to $w$.
C.Updates are applied uniformly to all leaf nodes that share a common ancestor with $w$ at depth $L - 1$.
D.Updates are applied exclusively to the internal node vectors along the path from the root to the leaf node corresponding to $w$.
Correct Answer: Updates are applied exclusively to the internal node vectors along the path from the root to the leaf node corresponding to $w$.
Explanation:
Hierarchical softmax replaces the output word vectors with internal node vectors in a binary tree. Predicting a word involves making binary decisions along the path to its leaf. Therefore, during backpropagation, gradients are only calculated for and applied to the internal nodes on that specific path.
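A single update step can be sketched as follows. The tree, node vectors, and path are hypothetical, and the branch convention (code 1 means the branch taken has probability sigmoid of the score) is an assumption of this sketch; real implementations may use the opposite labeling.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_probability(h, path, node_vecs):
    """Product of binary decision probabilities along the root-to-leaf path."""
    p = 1.0
    for node_id, code in path:
        s = sigmoid(sum(a * b for a, b in zip(node_vecs[node_id], h)))
        p *= s if code == 1 else 1.0 - s
    return p

def hs_step(h, path, node_vecs, lr=0.1):
    """One SGD step on -log p(word | h); only internal nodes on the path are updated."""
    touched = []
    for node_id, code in path:
        v = node_vecs[node_id]
        g = sigmoid(sum(a * b for a, b in zip(v, h))) - code  # d(-log p)/d(score)
        node_vecs[node_id] = [a - lr * g * b for a, b in zip(v, h)]
        touched.append(node_id)
    return touched

node_vecs = {0: [0.1, 0.2], 1: [-0.3, 0.4], 2: [0.5, -0.1], 3: [0.2, 0.2]}
before = {n: list(v) for n, v in node_vecs.items()}
h = [1.0, 0.5]                 # hidden vector for the input word (made up)
path = [(0, 1), (2, 0)]        # hypothetical path for a word at depth 2
p_before = path_probability(h, path, node_vecs)
touched = hs_step(h, path, node_vecs)
p_after = path_probability(h, path, node_vecs)
print(touched, p_before, p_after)
```

Only nodes 0 and 2 change; nodes 1 and 3, which lie off the path, keep their vectors, and the probability assigned to the word increases after the step.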
51Assume a Skip-Gram model is trained on a sufficiently large corpus where two distinct target words, $w_a$ and $w_b$, never directly co-occur in any window, but their distribution of context words is perfectly identical. How will their resulting embedding vectors $v_a$ and $v_b$ relate geometrically in the converged vector space?
Skip-Gram models
Hard
A.They will be highly similar (i.e., cosine similarity near 1) because they optimize for the exact same context predictions, pulling their vectors to the same region.
B.They will be placed at opposite poles of the embedding space to maximize their margin in the softmax denominator.
C.They will be perfectly orthogonal because they never co-occur as target-context pairs in the corpus.
D.Their relationship will be strictly arbitrary because Skip-Gram cannot establish relationships without direct co-occurrence.
Correct Answer: They will be highly similar (i.e., cosine similarity near 1) because they optimize for the exact same context predictions, pulling their vectors to the same region.
Explanation:
Skip-Gram models second-order co-occurrence. Since $w_a$ and $w_b$ share identical context words, the model receives identical gradient updates pushing $v_a$ and $v_b$ to maximize dot products with the same set of context vectors. Thus, they will converge to nearly identical points in the vector space.
52The GloVe objective function is defined as $J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$. What is the critical structural role of the bias terms $b_i$ and $\tilde{b}_j$ in this formulation?
GloVe embeddings
Hard
A.They absorb the independent marginal frequencies of the words $i$ and $j$, isolating the pure correlation (PMI) within the dot product $w_i^\top \tilde{w}_j$.
B.They break the inherent symmetry between the target word matrix $W$ and the context matrix $\tilde{W}$.
C.They dynamically control the shape of the weighting function $f(X_{ij})$ during training to prevent zero-counts from producing infinite loss.
D.They act as a regularization mechanism to strictly limit the norm of the embedding vectors.
Correct Answer: They absorb the independent marginal frequencies of the words $i$ and $j$, isolating the pure correlation (PMI) within the dot product $w_i^\top \tilde{w}_j$.
Explanation:
The GloVe derivation links the dot product $w_i^\top \tilde{w}_j$ to the log co-occurrence probability $\log P_{ij}$. Since $\log P_{ij} = \log X_{ij} - \log X_i$, and we want the dot product to capture relationships independent of raw unigram frequency, the biases $b_i$ and $\tilde{b}_j$ are introduced specifically to absorb the $\log X_i$ and $\log X_j$ marginal terms.
53GloVe models semantic relationships based on the ratio of co-occurrence probabilities $P_{ik} / P_{jk}$. To transition from a mapping function $F$ to the final GloVe objective, the authors enforce homomorphism between vector addition and scalar multiplication. What specific mathematical constraint does this impose on $F$?
GloVe embeddings
Hard
A.$F$ must be a normalized sigmoid function to ensure the ratios represent valid probability densities.
B.$F$ must be a logarithmic mapping, converting polynomial distributions into uniform linear manifolds.
C.$F$ must utilize a Fourier transform kernel to project ratios into a complex-valued Hilbert space.
D.$F$ must take the form of an exponential function acting on the dot product of $w_i$ and $\tilde{w}_k$, resulting in $F(w_i^\top \tilde{w}_k) = \exp(w_i^\top \tilde{w}_k) = P_{ik}$.
Correct Answer: $F$ must take the form of an exponential function acting on the dot product of $w_i$ and $\tilde{w}_k$, resulting in $F(w_i^\top \tilde{w}_k) = \exp(w_i^\top \tilde{w}_k) = P_{ik}$.
Explanation:
To respect the linear structure of the vector space, the function needs to satisfy $F\left((w_i - w_j)^\top \tilde{w}_k\right) = F(w_i^\top \tilde{w}_k) / F(w_j^\top \tilde{w}_k)$. The exponential function natively satisfies this homomorphism, meaning $F = \exp$, which directly leads to the relation $w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$.
54If the GloVe weighting function $f(X_{ij})$ were replaced by a uniform constant $f(X_{ij}) = 1$ for all $X_{ij} > 0$, what would be the most severe degradation observed in the resulting embeddings?
GloVe embeddings
Hard
A.The model would overfit heavily to rare, noisy co-occurrences, treating an event seen once identically to an event seen ten thousand times.
C.The model would entirely ignore long-range semantic analogies, focusing strictly on local syntax.
D.The model would collapse into a trivial solution where all vectors are zero.
Correct Answer: The model would overfit heavily to rare, noisy co-occurrences, treating an event seen once identically to an event seen ten thousand times.
Explanation:
The weighting function in GloVe is crucial for dampening the impact of rare co-occurrences. Since least-squares regression is highly sensitive to variance, a uniform weight would assign the same importance to a noisy word pair seen exactly once as to a highly reliable, frequent word pair, severely degrading the semantic quality of the embeddings.
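The contrast between the two weighting schemes can be sketched directly. The weighting function below uses the form and defaults reported in the GloVe paper (x_max = 100, exponent 3/4); the helper names and the toy vectors are hypothetical, and both pairs are constructed to have the same prediction error of 1.0 in log space.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: (x / x_max)^alpha below x_max, capped at 1 above it."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def uniform_weight(x):
    return 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij, weight_fn):
    """Weighted squared error of one (i, j) cell of the co-occurrence matrix."""
    pred = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return weight_fn(x_ij) * (pred - math.log(x_ij)) ** 2

w_i, w_j = [1.0, 0.0], [1.0, 0.0]   # dot product = 1.0 (made-up vectors)
rare, frequent = 1.0, 10_000.0      # co-occurrence counts
for name, fn in [("uniform", uniform_weight), ("glove", glove_weight)]:
    loss_rare = pair_loss(w_i, w_j, 0.0, 0.0, rare, fn)
    loss_freq = pair_loss(w_i, w_j, math.log(frequent), 0.0, frequent, fn)
    print(name, loss_rare, loss_freq)
```

With uniform weights the pair seen once contributes as much loss as the pair seen ten thousand times, whereas the GloVe weighting scales the rare pair's contribution down to about 3% of the frequent pair's.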
55When evaluating semantic similarity using cosine similarity on embeddings trained via Word2Vec or GloVe, one often encounters the 'hubness' problem. What causes this phenomenon in high-dimensional embedding spaces?
capturing semantic similarity
Hard
A.The curse of dimensionality dictates that vectors located near the mean of the space have a high probability of becoming nearest neighbors to a disproportionately large number of other vectors.
B.The cosine similarity metric mathematically fails in dimensions greater than 256, returning identical scores for uncorrelated vectors.
C.The occurrence of out-of-vocabulary words pushes all known vectors into a single hyper-sphere.
D.The optimization objective inherently forces vectors into a single orthogonal basis, preventing clustered similarity metrics.
Correct Answer: The curse of dimensionality dictates that vectors located near the mean of the space have a high probability of becoming nearest neighbors to a disproportionately large number of other vectors.
Explanation:
Hubness is a known artifact in high-dimensional spaces where certain points (hubs), usually located closer to the center of the data distribution, are spuriously identified as the nearest neighbors for a vast number of other points, degrading retrieval and similarity tasks.
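One common way to quantify hubness is the k-occurrence count: how many times each point appears in the k-nearest-neighbor lists of the other points. The sketch below computes it with cosine similarity on synthetic Gaussian vectors; the data, dimensions, and seed are arbitrary stand-ins for real embeddings, so the exact counts carry no meaning beyond illustration.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def k_occurrence(vectors, k):
    """counts[j] = number of other points that have j among their k nearest neighbors."""
    n = len(vectors)
    counts = [0] * n
    for i in range(n):
        sims = [(cosine(vectors[i], vectors[j]), j) for j in range(n) if j != i]
        sims.sort(reverse=True)
        for _, j in sims[:k]:
            counts[j] += 1
    return counts

rng = random.Random(0)
# Synthetic stand-in for embeddings: 100 points in 50 dimensions with a non-zero mean.
vectors = [[rng.gauss(0.5, 1.0) for _ in range(50)] for _ in range(100)]
counts = k_occurrence(vectors, k=5)
print("mean:", sum(counts) / len(counts), "max:", max(counts), "never a neighbor:", counts.count(0))
```

The mean count is always k, so a maximum far above k together with many zero counts indicates that a few hubs dominate the neighbor lists.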
56A simple but strong baseline for computing sentence similarity is to take the arithmetic mean of constituent word embeddings. Under Arora et al.'s random walk model of sentence generation, what is the theoretical justification for this 'continuous bag-of-words' sentence embedding?
capturing semantic similarity
Hard
A.Words in a sentence are drawn from a uniform distribution independent of syntax, making the arithmetic mean structurally equivalent to a recurrent neural network.
B.The sentence generation process is modeled as a random walk driven by a slow-drifting discourse vector $c_t$, where the probability of emitting a word $w$ is proportional to $\exp(\langle c_t, v_w \rangle)$.
C.The arithmetic mean maximizes the mutual information between the context words and the exact syntactical sequence.
D.Averaging cancels out word-specific noise entirely due to the strict mutual orthogonality of all word vectors in a GloVe space.
Correct Answer: The sentence generation process is modeled as a random walk driven by a slow-drifting discourse vector $c_t$, where the probability of emitting a word $w$ is proportional to $\exp(\langle c_t, v_w \rangle)$.
Explanation:
Arora et al. theoretically justify averaging word embeddings by assuming a generative model where a slow-moving discourse vector drives a random walk in the embedding space. Under this generative model, the maximum a posteriori (MAP) estimate of the discourse vector is approximated by the average of the word embeddings.
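The averaging baseline itself takes only a few lines. The 2-dimensional embeddings below are invented for illustration (one axis loosely "animal", the other "vehicle"), and `sentence_vector` is a hypothetical helper; real use would load pretrained vectors instead.

```python
import math

def sentence_vector(tokens, emb):
    """Arithmetic mean of the word vectors of the tokens."""
    d = len(next(iter(emb.values())))
    return [sum(emb[t][j] for t in tokens) / len(tokens) for j in range(d)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy embeddings, not taken from any trained model.
emb = {
    "the": [0.5, 0.5], "cat": [1.0, 0.0], "dog": [0.9, 0.1],
    "car": [0.0, 1.0], "truck": [0.1, 0.9],
}
s_cat = sentence_vector(["the", "cat"], emb)
s_dog = sentence_vector(["the", "dog"], emb)
s_truck = sentence_vector(["the", "truck"], emb)
print(cosine(s_cat, s_dog), cosine(s_cat, s_truck))
```

The averaged vectors for "the cat" and "the dog" are much closer to each other than either is to "the truck", even though word order and syntax are ignored.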
57The widely cited vector arithmetic for analogies (e.g., $v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$) assumes that semantic offsets form consistent linear structures. Under what specific condition does the Skip-Gram model natively guarantee this strict, translation-invariant linear structure?
analogy relationships
Hard
A.When the negative sampling parameter is set to $0$, removing all noise from the objective function.
B.When the Pointwise Mutual Information (PMI) matrix is exactly low-rank and differences in the log-probabilities of context words between pairs are strictly constant.
C.When the context window is infinite, reducing the model to a standard Singular Value Decomposition.
D.When word frequencies follow a uniform distribution rather than a Zipfian distribution.
Correct Answer: When the Pointwise Mutual Information (PMI) matrix is exactly low-rank and differences in the log-probabilities of context words between pairs are strictly constant.
Explanation:
Analogies work in vector space because the relation vectors correspond to differences in the log co-occurrence probabilities. Theoretical derivations show that exact linear invariance holds if the PMI matrix is low-rank such that the relationship (the ratio of context probabilities) between "man/woman" is identical to "king/queen" across all contexts.
58Evaluation of word analogies ($a : a^* :: b : b^*$) often utilizes either 3CosAdd or 3CosMul. Why does the 3CosMul objective frequently yield superior practical results on complex analogy tasks compared to the additive baseline $\arg\max_{b^*} \left[ \cos(b^*, b) - \cos(b^*, a) + \cos(b^*, a^*) \right]$?
analogy relationships
Hard
A.3CosMul structurally mimics a non-linear manifold projection, effectively upgrading word embeddings to contextualized representations.
B.3CosMul treats similarities multiplicatively, which mitigates the risk of a single marginally high cosine similarity term completely dominating the outcome.
C.3CosMul computes the exact geometric median of the target vectors rather than the arithmetic mean.
D.3CosMul enforces strict normalization on all vectors, ensuring the target word does not deviate from the unit hypersphere.
Correct Answer: 3CosMul treats similarities multiplicatively, which mitigates the risk of a single marginally high cosine similarity term completely dominating the outcome.
Explanation:
In 3CosAdd, a very high cosine similarity between the candidate $b^*$ and one of the positive terms (like $b$) can overshadow a poor similarity with $a^*$. 3CosMul amplifies differences by multiplying the similarities, heavily penalizing candidates that only have a high similarity with one of the target contexts but not the other.
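Both scoring rules can be sketched side by side. The version below follows the formulation in Levy and Goldberg (2014), where cosines are shifted to [0, 1] before multiplying and a small epsilon guards the denominator; the 3-dimensional embeddings (axes loosely "royal", "male", "female") and the `solve_analogy` helper are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(emb, a, a_star, b, method="add", eps=1e-3):
    """Return the word b* that best completes a : a* :: b : b*."""
    best_word, best_score = None, -math.inf
    for word, vec in emb.items():
        if word in (a, a_star, b):       # question words are excluded as answers
            continue
        s_b, s_a, s_as = cosine(vec, emb[b]), cosine(vec, emb[a]), cosine(vec, emb[a_star])
        if method == "add":              # 3CosAdd
            score = s_b - s_a + s_as
        else:                            # 3CosMul on cosines shifted to [0, 1]
            s_b, s_a, s_as = ((s + 1.0) / 2.0 for s in (s_b, s_a, s_as))
            score = s_b * s_as / (s_a + eps)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical toy embeddings, not from a trained model.
emb = {
    "king": [1.0, 1.0, 0.0], "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0], "woman": [0.0, 0.0, 1.0],
    "prince": [0.8, 0.9, 0.0],
}
print(solve_analogy(emb, "man", "woman", "king", method="add"))
print(solve_analogy(emb, "man", "woman", "king", method="mul"))
```

On this toy vocabulary both rules select "queen"; the two objectives tend to diverge only when one similarity term is large enough to dominate the additive sum.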
59When visualizing a 300-dimensional Word2Vec embedding space using t-SNE, a researcher notices that the resulting 2D plot shows overly dense, shattered clusters that do not align with known continuous semantic fields. Which hyperparameter of t-SNE is most likely responsible for this artifact, and why?
visualizing embedding spaces using PCA or t-SNE
Hard
A.Early exaggeration. If set too low, the attractive forces between all clusters pull them into a single indistinguishable mass.
B.Perplexity. If set too low, the algorithm heavily prioritizes strictly local variations, artificially shattering broader semantic manifolds into disconnected micro-clusters.
C.Number of iterations. If set too low, the algorithm terminates before the clusters can merge into a continuous space.
D.Learning rate. If set too high, the gradient descent bounces out of optimal global minima.
Correct Answer: Perplexity. If set too low, the algorithm heavily prioritizes strictly local variations, artificially shattering broader semantic manifolds into disconnected micro-clusters.
Explanation:
Perplexity in t-SNE acts as a smooth measure of the effective number of nearest neighbors. If perplexity is set too low (e.g., 2 or 5 for thousands of points), t-SNE ignores global structure and solely preserves immediate local distances, which artificially breaks continuous spaces into fragmented, tight clusters.
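The reading of perplexity as an effective neighbor count can be checked on a single point. The sketch below computes the Gaussian conditional distribution that t-SNE places over a point's neighbors and its perplexity 2^H; the squared distances and bandwidths are made up, and in t-SNE itself the bandwidth is tuned per point so that the perplexity matches the user's setting.

```python
import math

def conditional_probs(sq_dists, sigma):
    """Gaussian neighbor probabilities for one point, given squared distances to the others."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_dists]
    z = sum(weights)
    return [w / z for w in weights]

def perplexity(probs):
    """2 ** Shannon entropy (in bits) of the neighbor distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy

sq_dists = [1.0, 4.0, 9.0, 16.0]      # hypothetical squared distances to 4 neighbors
for sigma in (0.3, 1.0, 3.0, 100.0):
    print(sigma, perplexity(conditional_probs(sq_dists, sigma)))
```

A narrow bandwidth gives a perplexity near 1, so only the single closest neighbor matters, while a wide bandwidth approaches 4 and treats all neighbors alike; a very low perplexity setting therefore forces each point to attend to only a handful of neighbors.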
60Both PCA and t-SNE are standard tools used to visualize embedding spaces. If semantic relationships are structurally represented by parallel linear offsets (e.g., gender vectors) in $\mathbb{R}^d$, why might t-SNE theoretically fail to visually preserve these parallel analogy relationships compared to PCA?
visualizing embedding spaces using PCA or t-SNE
Hard
A.PCA strictly maximizes local entropy, which coincidentally aligns with the mathematical formulation of Skip-Gram offset vectors.
B.t-SNE is a non-linear manifold learning technique that preserves local pairwise distances but often distorts global geometric structure and distances, failing to map parallel structures consistently.
C.t-SNE computes exact Euclidean distances rather than cosine similarity, which misinterprets angular relationships in the high-dimensional space.
D.t-SNE relies on Singular Value Decomposition (SVD), which orthogonalizes all input variables, inherently destroying parallel lines.
Correct Answer: t-SNE is a non-linear manifold learning technique that preserves local pairwise distances but often distorts global geometric structure and distances, failing to map parallel structures consistently.
Explanation:
PCA is a linear projection, so offsets that are parallel in the original space remain parallel in the projected plot (although their length may shrink). t-SNE uses a non-linear optimization to map high-dimensional local neighborhoods to low-dimensional space. While excellent for clustering, t-SNE routinely distorts global geometry, meaning lines that are parallel in high dimensions will likely bend, intersect, or diverge in the 2D projection.
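The linearity argument can be verified with any fixed projection matrix. The sketch below uses an arbitrary 2x3 matrix as a stand-in for the top two PCA axes (PCA additionally centers the data, which cancels out when taking differences); the embeddings and the matrix are invented for illustration.

```python
def project(vec, matrix):
    """Apply a linear map given as a list of rows."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def offset(u, v):
    return [a - b for a, b in zip(u, v)]

# Hypothetical 3-D embeddings with an identical "gender" offset in both pairs.
emb = {
    "man": [1.0, 0.0, 0.2], "woman": [1.0, 1.0, 0.2],
    "king": [3.0, 0.0, 1.5], "queen": [3.0, 1.0, 1.5],
}
P = [[0.6, 0.3, -0.2], [0.1, -0.5, 0.7]]   # arbitrary linear projection to 2-D

proj = {w: project(v, P) for w, v in emb.items()}
off_common = offset(proj["woman"], proj["man"])
off_royal = offset(proj["queen"], proj["king"])
print(off_common, off_royal)
```

The two projected offsets coincide, because a linear map sends equal difference vectors to equal difference vectors; a non-linear embedding such as t-SNE offers no such guarantee.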