1 $What is the primary function of a vector space model in Natural Language Processing?$

vector space models Easy

A.

To translate text directly from one language to another.

B.

To represent words strictly as strings of text.

C.

To represent words or documents as vectors of numerical values.

D.

To convert audio signals into textual data.

2 $In a vector space model, what does the spatial distance between two word vectors typically represent?$

vector space models Easy

A.

Their semantic similarity or relatedness.

B.

Their alphabetical order in a dictionary.

C.

The difference in their character lengths.

D.

The historical age of the words.

3 $What is a key advantage of dense word embeddings over traditional sparse representations like one-hot encoding?$

dense word embeddings Easy

A.

They represent every word as a vector composed almost entirely of zeros.

B.

They do not require any training data.

C.

They capture semantic meaning in continuous, lower-dimensional vectors.

D.

They require infinite memory to store.

4 $Compared to sparse vectors, dense vectors typically have:$

dense word embeddings Easy

A.

Variable lengths depending on the word size.

B.

Only binary values (0s and 1s).

C.

Lower dimensionality and continuous non-zero values.

D.

Higher dimensionality and mostly zero values.

5 $Which of the following is a popular shallow neural network framework developed for learning word embeddings?$

Word2Vec Easy

A.

ResNet

B.

YOLO

C.

Transformer

D.

Word2Vec

6 $What are the two main architectures introduced in the Word2Vec model?$

Word2Vec Easy

A.

CBOW and Skip-Gram

B.

PCA and t-SNE

C.

LSTM and GRU

D.

CNN and RNN

7 $Word2Vec relies on the idea that words appearing in similar contexts tend to have similar meanings. What is this concept called?$

Word2Vec Easy

A.

The Markov assumption

B.

The distributional hypothesis

C.

The Turing test

D.

The bag-of-words hypothesis

8 $What is the main objective of the Continuous Bag-of-Words (CBOW) architecture?$

CBOW Easy

A.

To classify a text document into predefined categories.

B.

To predict the next sentence in a document.

C.

To predict a target word given its surrounding context words.

D.

To predict the surrounding context words given a central target word.

9 $In the CBOW architecture, how is the context typically represented before predicting the target word?$

CBOW Easy

A.

By multiplying all context word vectors together.

B.

By taking the average or sum of the context word vectors.

C.

By ignoring all words except the first one in the sequence.

D.

By randomly selecting one context word.

10 $What does the Skip-Gram model aim to predict during training?$

Skip-Gram models Easy

A.

The grammatical structure of the sentence.

B.

The overall sentiment of the entire sentence.

C.

The surrounding context words given a central target word.

D.

A central target word given the surrounding context words.

11 $Which Word2Vec architecture is generally considered to work better with small amounts of training data and represents rare words well?$

Skip-Gram models Easy

A.

One-hot encoding

B.

Skip-Gram

C.

TF-IDF

D.

CBOW

12 $What does the acronym "GloVe" stand for in the context of NLP?$

GloVe embeddings Easy

A.

Global Vocabulary Extraction

B.

Global Vectors for Word Representation

C.

Generalized Lexical Vocabulary Embeddings

D.

Generative Language Output Vector Engine

13 $What kind of statistical information does GloVe primarily use to train its embeddings?$

GloVe embeddings Easy

A.

Global word co-occurrence matrices

B.

Bilingual translation pairs

C.

Character n-gram counts

D.

Dependency parse trees

14 $Which of the following best describes the core mathematical approach of GloVe?$

GloVe embeddings Easy

A.

It combines the benefits of local context window methods with global matrix factorization.

B.

It creates sparse vectors strictly using Term Frequency-Inverse Document Frequency (TF-IDF).

C.

It relies entirely on human-annotated semantic dictionaries.

D.

It only uses a simple recurrent neural network to predict the next word.

15 $Which mathematical metric is most commonly used to compute the similarity between two word embeddings?$

capturing semantic similarity Easy

A.

Cosine similarity

B.

Manhattan distance

C.

Jaccard index

D.

Euclidean distance

16 $If two words are highly semantically similar, their cosine similarity score will be closest to:$

capturing semantic similarity Easy

A.

1

B.

0

C.

100

D.

-1
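
As a quick illustration of how cosine similarity behaves at its extremes, the sketch below computes it for three hand-picked pairs. The helper name `cosine_similarity` and the vectors are assumptions made up for this example, not values from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d vectors, chosen by hand for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

Vectors pointing the same way score near the top of the range, unrelated (orthogonal) vectors score zero, and opposite vectors score at the bottom of the range, regardless of vector length.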

17 $Word embeddings are famous for capturing algebraic analogies. Which classic vector equation best demonstrates this?$

analogy relationships Easy

A.

B.

C.

D.

18 $By solving analogy tests (e.g., 'Paris is to France as Tokyo is to X'), we are primarily testing a word embedding model's ability to capture:$

analogy relationships Easy

A.

Sentence and paragraph boundaries.

B.

Alphabetical sorting and casing.

C.

Syntactic and semantic relationships.

D.

Document length and structure.

19 $Why are techniques like PCA and t-SNE commonly used with word embeddings?$

visualizing embedding spaces using PCA or t-SNE Easy

A.

To translate embeddings into a different language.

B.

To convert textual embeddings directly into speech.

C.

To reduce the high-dimensional vectors to 2D or 3D for human visualization.

D.

To increase the dimensionality of the vectors for better accuracy.

20 $Which visualization technique is specifically well-known for preserving local data structures, making it highly effective for clustering similar word vectors together in a 2D scatter plot?$

visualizing embedding spaces using PCA or t-SNE Easy

A.

K-Means Clustering

B.

Linear Regression

C.

Decision Trees

D.

t-SNE (t-Distributed Stochastic Neighbor Embedding)

21 $In a standard Vector Space Model, what is the primary consequence of representing a vocabulary of size V using one-hot encoded vectors?$

vector space models Medium

A.

The matrix of all word vectors forms a dense representation, requiring storage overall.

B.

The vectors inherently capture semantic relationships through their distance in the vector space.

C.

The dot product of any two distinct word vectors will be exactly 1.

D.

The dot product of any two distinct word vectors evaluates to 0, making it impossible to measure semantic similarity directly.
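
The behaviour of one-hot vectors under the dot product can be checked directly. This is a minimal sketch with a made-up three-word vocabulary; `one_hot` and `dot` are helper names assumed for the example.

```python
def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical vocabulary: two related words and one unrelated word.
vocab = ["cat", "dog", "car"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

cat_dog = dot(vectors["cat"], vectors["dog"])
cat_car = dot(vectors["cat"], vectors["car"])
cat_cat = dot(vectors["cat"], vectors["cat"])
```

Both cross-word products come out identical, so the representation cannot say that "cat" is any closer to "dog" than to "car".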

22 $When measuring the similarity between two document vectors in a vector space model, why is cosine similarity typically preferred over Euclidean distance?$

vector space models Medium

A.

Euclidean distance is sensitive to the magnitude (length) of the vectors, meaning documents with similar content but different lengths may appear highly dissimilar.

B.

Cosine similarity automatically penalizes terms that appear frequently across all documents in a corpus.

C.

Euclidean distance cannot be applied to vectors containing negative values.

D.

Cosine similarity is mathematically faster to compute because it does not require a dot product.

23 $Which of the following best describes the core mechanism by which dense word embeddings learn meaning?$

dense word embeddings Medium

A.

They optimize character-level features to predict the morphological structure of words.

B.

They map words to predefined semantic categories provided by linguists in a centralized dictionary.

C.

They are trained to reconstruct a sparse TF-IDF matrix using Singular Value Decomposition.

D.

They adjust continuous vector weights to maximize the probability of words appearing in similar local contexts based on the distributional hypothesis.

24 $If the dimensionality of a dense word embedding is chosen to be excessively large relative to the vocabulary size and dataset, what is the most likely outcome?$

dense word embeddings Medium

A.

The embeddings will naturally degrade into one-hot encoded vectors.

B.

The training process will become a convex optimization problem.

C.

The model will perfectly generalize to out-of-vocabulary words.

D.

The model may overfit to the training corpus and fail to capture broad semantic similarities.

25 $What is the primary computational purpose of using Negative Sampling in Word2Vec training?$

Word2Vec Medium

A.

To remove negative or toxic words from the training corpus to prevent bias.

B.

To intentionally inject noise into the input vectors, acting as a form of regularization like dropout.

C.

To replace the computationally expensive softmax operation over the entire vocabulary with a set of binary logistic regression tasks.

D.

To invert the vectors of antonyms so they point in opposite directions in the vector space.
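
To make the cost argument concrete, here is a rough sketch of the per-example negative sampling loss, assuming the usual form: one logistic term for the observed context word plus one term per sampled noise word. The function names and the 2-d vectors are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center, positive_context, negative_contexts):
    """-log sigma(u_pos . v) - sum over negatives of log sigma(-u_neg . v)."""
    loss = -math.log(sigmoid(dot(positive_context, center)))
    for negative in negative_contexts:
        loss -= math.log(sigmoid(-dot(negative, center)))
    return loss

# Hypothetical vectors: one centre word, one true context word, two noise words.
center = [1.0, 0.0]
well_separated = negative_sampling_loss(center, [2.0, 0.0], [[-2.0, 0.0], [-1.0, 0.5]])
poorly_separated = negative_sampling_loss(center, [-2.0, 0.0], [[2.0, 0.0], [1.0, 0.5]])
```

The loop runs once per sampled negative, so the work per training pair depends on the number of negatives rather than on the vocabulary size.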

26 $In Word2Vec, subsampling of frequent words is often employed. How does this technique improve the resulting embeddings?$

Word2Vec Medium

A.

It strictly enforces the model to assign larger magnitudes to the vectors of frequent words.

B.

It discards rare words that appear fewer than 5 times, preventing the vocabulary from becoming too large.

C.

It balances the dataset by artificially generating synonyms for rare words.

D.

It probabilistically discards highly frequent words (like 'the', 'is'), preventing them from dominating the training time and allowing the model to focus on more informative co-occurrences.

27 $In the Continuous Bag-of-Words (CBOW) model with a fixed context window size, how is the hidden layer representation constructed before predicting the target word?$

CBOW Medium

A.

By taking the average (or sum) of the input vector representations of the context words.

B.

By computing the dot product between all pairs of context words in the window.

C.

By applying a recurrent neural network (RNN) sequentially over the context words.

D.

By concatenating the context word vectors into a single long vector.

28 $Suppose you are training a CBOW model. If the sentence is 'The quick brown fox jumps over the lazy dog', and the current target word is 'fox' with a window size of . What are the input context words?$

CBOW Medium

A.

['The', 'quick', 'jumps', 'over']

B.

['quick', 'brown', 'jumps', 'over']

C.

['brown', 'jumps']

D.

['quick', 'brown', 'fox', 'jumps', 'over']
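
Context selection for CBOW can be sketched as a slice on each side of the target position. The helper `context_words` and its `window` parameter are names assumed for this illustration; the snippet computes the context of 'fox' for two different window sizes.

```python
def context_words(tokens, target_index, window):
    """Return up to `window` tokens on each side of the target, excluding the target."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right

sentence = "The quick brown fox jumps over the lazy dog".split()
fox = sentence.index("fox")

window_1 = context_words(sentence, fox, 1)
window_2 = context_words(sentence, fox, 2)
```

The target word itself never appears in its own context, and the context grows symmetrically as the window widens.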

29 $How does the training objective of the Skip-Gram model fundamentally differ from that of the CBOW model?$

Skip-Gram models Medium

A.

Skip-Gram computes word co-occurrence matrices explicitly, whereas CBOW relies on a shallow neural network.

B.

Skip-Gram predicts surrounding context words given a single target word, whereas CBOW predicts a target word given a set of context words.

C.

Skip-Gram uses hierarchical softmax, whereas CBOW strictly requires negative sampling.

D.

Skip-Gram predicts a target word given a set of context words, whereas CBOW predicts context words given a target word.

30 $Given the sentence 'She loves deep learning deeply', and using a Skip-Gram model with a window size of, how many training pairs are generated when 'deep' is the target word ?$

Skip-Gram models Medium

A.

1 pair: ('deep', 'learning')

B.

2 pairs: ('deep', 'loves') and ('deep', 'learning')

C.

4 pairs: ('She', 'deep'), ('loves', 'deep'), ('learning', 'deep'), and ('deeply', 'deep')

D.

3 pairs: ('loves', 'deep'), ('learning', 'deep'), and ('deeply', 'deep')
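
Skip-Gram training pairs can be enumerated in a few lines. This is a minimal sketch assuming pairs are written as (target, context); `skipgram_pairs` is a hypothetical helper, and two window sizes are shown for comparison.

```python
def skipgram_pairs(tokens, target_index, window):
    """List (target, context) pairs for one target position."""
    target = tokens[target_index]
    start = max(0, target_index - window)
    stop = min(len(tokens), target_index + window + 1)
    return [(target, tokens[j]) for j in range(start, stop) if j != target_index]

sentence = "She loves deep learning deeply".split()
deep = sentence.index("deep")

pairs_window_1 = skipgram_pairs(sentence, deep, 1)
pairs_window_2 = skipgram_pairs(sentence, deep, 2)
```

Each context word inside the window contributes exactly one pair, so the pair count is at most twice the window size and shrinks near sentence boundaries.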

31 $When comparing Skip-Gram and CBOW on a large corpus, what is a well-documented empirical advantage of the Skip-Gram model?$

Skip-Gram models Medium

A.

It tends to produce better-quality representations for rare or infrequent words.

B.

It trains significantly faster than CBOW because it averages input vectors.

C.

It naturally groups antonyms together while pushing synonyms apart.

D.

It requires much less memory because it limits the context to one word.

32 $The Global Vectors (GloVe) model captures semantics primarily by modeling which of the following statistical properties of the corpus?$

GloVe embeddings Medium

A.

The ratio of co-occurrence probabilities of two words with various probe words.

B.

The Singular Value Decomposition (SVD) of a term-document matrix.

C.

The probability of the target word given the continuous bag of context words.

D.

The absolute frequency count of single words sorted in descending order.

33 $In the GloVe objective function, a weighting function is applied to the squared error term. What is a crucial property of this weighting function?$

GloVe embeddings Medium

A.

It ensures that zero co-occurrences ($X_{ij} = 0$) result in a zero weight, and it caps the weight of highly frequent co-occurrences to avoid over-weighting.

B.

It strictly assigns a weight of 0 to the most frequent words to prevent them from dominating the loss.

C.

It is heavily exponentially weighted for rare co-occurrences to give them more importance.

D.

It applies a softmax distribution to normalize all co-occurrences into probabilities that sum to 1.
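
The shape of the GloVe weighting function is easy to inspect numerically. The sketch below assumes the commonly cited form, a power law capped at 1, with the defaults `x_max = 100` and `alpha = 0.75` reported by the GloVe authors; the function name `glove_weight` is chosen for this example.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight for a co-occurrence count x: (x / x_max) ** alpha below the cap, else 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

weight_zero = glove_weight(0)        # pair never observed
weight_rare = glove_weight(1)        # pair observed once
weight_cap = glove_weight(100)       # pair at the cap
weight_huge = glove_weight(10000)    # pair far above the cap
```

Unobserved pairs contribute nothing to the loss, rare pairs are down-weighted, and every count at or above `x_max` receives the same weight.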

34 $How does GloVe inherently bridge the gap between matrix factorization methods (like LSA) and shallow window-based methods (like Word2Vec)?$

GloVe embeddings Medium

A.

It performs Singular Value Decomposition (SVD) at every iteration step of a Skip-Gram training loop.

B.

It computes the global Term-Frequency Inverse Document Frequency (TF-IDF) and feeds it directly into a Continuous Bag-of-Words model.

C.

It uses a local context window to compute co-occurrences, but trains its vectors by optimizing a global log-bilinear regression over the resulting co-occurrence matrix.

D.

It processes local context windows sequentially using an RNN, but initializes the weights with an LSA matrix.

35 $If two word vectors u and v have been L_2-normalized so that \lVert u \rVert_2 = 1 and \lVert v \rVert_2 = 1, how is their Euclidean distance \lVert u - v \rVert_2 algebraically related to their cosine similarity \cos(u, v)?$

capturing semantic similarity Medium

A.

B.

C.

D.
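
The relation can be derived by expanding the squared distance. In this sketch the two unit-length vectors are written $u$ and $v$, symbols chosen here for convenience.

```latex
\begin{aligned}
\lVert u - v \rVert_2^2
  &= \lVert u \rVert_2^2 + \lVert v \rVert_2^2 - 2\, u^\top v \\
  &= 2 - 2\cos(u, v)
  \qquad \text{since } \lVert u \rVert_2 = \lVert v \rVert_2 = 1 \text{ implies } u^\top v = \cos(u, v), \\
\lVert u - v \rVert_2
  &= \sqrt{2\,\bigl(1 - \cos(u, v)\bigr)} .
\end{aligned}
```

For unit vectors, ranking neighbours by Euclidean distance and by cosine similarity therefore gives the same order.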

36 $A significant limitation of standard Word2Vec and GloVe embeddings in capturing semantic similarity in real-world applications is their handling of Out-Of-Vocabulary (OOV) words. Why does this limitation exist?$

capturing semantic similarity Medium

A.

They dynamically assign a random vector to new words at inference, causing catastrophic forgetting of existing similarities.

B.

They rely on part-of-speech tags, meaning OOV words lack syntactic features necessary for vector computation.

C.

They only map whole words seen during training to a fixed dictionary of vectors, making them incapable of inferring vectors for unseen words based on subwords.

D.

They use absolute positional encodings that break when a text exceeds the maximum sequence length seen during training.

37 $To solve the analogy task 'man is to king as woman is to X' using word embeddings, which vector arithmetic operation is conventionally used to find the target vector for X?$

analogy relationships Medium

A.

B.

C.

D.
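
The offset arithmetic behind this kind of analogy can be tried on toy data. The sketch below uses hand-made 2-d vectors (axis 0 loosely "royalty", axis 1 loosely "gender"), which are assumptions for illustration and not real embeddings; `analogy` is a hypothetical helper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hand-made toy embeddings, for illustration only.
toy = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}

def analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' via the nearest neighbour of b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are excluded, as is conventional in analogy evaluation.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

answer = analogy("man", "king", "woman", toy)
```

With real embeddings the offset only lands near the expected word, which is why the nearest neighbour is taken rather than an exact match.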

38 $What mathematical property of dense embedding spaces allows them to successfully resolve syntactic analogies (e.g., walk:walking :: jump:jumping)?$

analogy relationships Medium

A.

Syntactic analogies rely on exact string matching functions built into the embedding lookup mechanism.

B.

Relationships are captured as consistent linear translations (offsets) spanning across the vector space.

C.

The vectors are strictly normalized to orthogonal axes based on their morphological suffixes.

D.

The embeddings cluster all verbs into a single distinct hypersphere in the vector space.

39 $When visualizing a 300-dimensional word embedding space in 2D, how does t-SNE generally differ from Principal Component Analysis (PCA) in its treatment of the data?$

visualizing embedding spaces using PCA or t-SNE Medium

A.

t-SNE models probabilities of neighborhood distances to preserve local cluster structures non-linearly, whereas PCA linearly projects data to maximize global variance.

B.

t-SNE preserves the global variance of the data strictly linearly, whereas PCA focuses on minimizing the distance between nearby points non-linearly.

C.

PCA is computationally much slower but yields a deterministic mapping, while t-SNE is faster but random.

D.

PCA requires embeddings to be converted to one-hot vectors, while t-SNE works natively on dense floating-point vectors.

40 $When applying t-SNE to visualize word embeddings, a crucial hyperparameter is 'perplexity'. What does perplexity effectively control in the t-SNE algorithm?$

visualizing embedding spaces using PCA or t-SNE Medium

A.

The balance between local and global aspects of the data, acting as a soft measure of the number of effective nearest neighbors for each point.

B.

The trade-off between the number of dimensions in the output space (e.g., 2D vs 3D).

C.

The learning rate of the gradient descent used to optimize the KL divergence.

D.

The number of iterations before the algorithm terminates.

41 $In high-dimensional term-document vector space models leveraging tf-idf, what is the primary consequence of applying Latent Semantic Analysis (LSA) with a truncated Singular Value Decomposition (SVD) on the preservation of cosine similarity between rare words?$

vector space models Hard

A.

Rare word vectors become strictly orthogonal to all other vectors because they are entirely relegated to the discarded singular values.

B.

The truncated SVD acts as a perfect regularization mechanism, increasing the cosine similarity of rare words strictly proportionally to their true semantic overlap.

C.

LSA guarantees the preservation of exact pairwise Euclidean distances for rare words, effectively bypassing the curse of dimensionality.

D.

The cosine similarity between rare words often becomes highly distorted or artificially inflated due to projection into dimensions dominated by frequent word variance.

42 $Count-based Vector Space Models often utilize Positive Pointwise Mutual Information (PPMI). Why does standard PPMI introduce a systematic bias in word representations, and how is it mathematically mitigated in practice?$

vector space models Hard

A.

It suffers from linear dependence on document length; mitigated by normalizing all term vectors strictly by their norm prior to calculating the probability distribution.

B.

It biases towards frequent word pairs; mitigated by applying a logarithmic scaling factor to all raw co-occurrence counts prior to PPMI calculation.

C.

It biases towards infrequent words because low-probability events yield extremely high PMI values; mitigated by context distribution smoothing, such as raising context probabilities to the power $\alpha = 0.75$.

D.

It assigns negative infinity to zero co-occurrences, destroying the vector space; mitigated by replacing all zero counts with the expected co-occurrence probability.

43 $Polysemy poses a challenge for standard dense word embeddings because a single vector must represent multiple meanings. Under the linear superposition hypothesis proposed by Arora et al. (2018), how are multiple distinct senses of a word theoretically represented in a single dense vector in \mathbb{R}^d?$

dense word embeddings Hard

A.

The multiple senses occupy mutually orthogonal subspaces governed by the eigenvalues of the context matrix, allowing recovery via Principal Component Analysis (PCA).

B.

The word vector is approximately a linear combination of the underlying sense vectors, and sparse coding can recover the individual senses provided the sense vectors are sufficiently uncorrelated and isotropically distributed.

C.

The word vector represents a probabilistic mixture where individual senses can only be isolated by computing the gradient of the loss function with respect to context words.

D.

The word vector is strictly the geometric centroid of context vectors, meaning the dominant sense entirely overrides minor senses, requiring non-linear manifold unraveling to recover.

44 $Which of the following mathematical properties best explains why dense word embeddings organically develop structural features allowing for linear algebraic operations (e.g., king - man + woman \approx queen) without explicit semantic supervision?$

dense word embeddings Hard

A.

The continuous Bag-of-Words assumption forces all syntactically similar words to collapse into single eigenvectors of the co-occurrence matrix.

B.

The softmax function introduces a strict orthogonality constraint between syntactically distant words, aligning them along the primary axes of $\mathbb{R}^d$.

C.

The embeddings inherently enforce a manifold where Euclidean distance is strictly equal to the inverse of raw co-occurrence frequency.

D.

The optimization objective effectively factorizes a shifted log-co-occurrence matrix, meaning the inner product of vectors approximates the log probability, thus turning multiplicative probabilities into additive vector operations.

45 $In Word2Vec, negative sampling replaces the computationally expensive full softmax. If k negative samples are drawn per positive sample, what is the asymptotic relationship between the Word2Vec negative sampling objective and the Pointwise Mutual Information (PMI) matrix as the embedding dimension d \to \infty?$

Word2Vec Hard

A.

The dot product converges to, magnifying the similarity of rare words.

B.

The dot product converges exactly to the Positive PMI matrix, $\max(\mathrm{PMI}(w, c), 0)$.

C.

The dot product converges to the log-likelihood of the marginal probability scaled by .

D.

The dot product converges to a shifted PMI matrix: $\mathrm{PMI}(w, c) - \log k$, where $k$ is the number of negative samples.

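46 $During Word2Vec training, each occurrence of a frequent word w_i is discarded with probability P(w_i) = 1 - \sqrt{t / f(w_i)}, where f(w_i) is the word's relative frequency and t is a chosen threshold. Beyond simply reducing training time, what is the theoretical effect of this subsampling mechanism on the network's learning dynamics?$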

Word2Vec Hard

A.

It shifts the loss function from cross-entropy to mean squared error for high-frequency target words.

B.

It enforces a strict sparsity constraint on the resulting word vectors, causing them to approximate one-hot encodings for high-frequency stop words.

C.

It mathematically normalizes the context vectors to unit length, preventing gradient explosion in the hidden layer.

D.

It effectively expands the dynamic context window size by skipping over uninformative frequent words, capturing longer-range dependencies for rare words.
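
A rough feel for how aggressively frequent words are dropped comes from evaluating the discard probability in the form given by Mikolov et al. (2013), $1 - \sqrt{t / f(w)}$. The threshold `1e-5` is the order of magnitude suggested there and is treated as an assumption here, as are the example frequencies and the function name.

```python
import math

def discard_probability(relative_frequency, threshold=1e-5):
    """Probability of dropping one occurrence of a word, clipped at zero."""
    return max(0.0, 1.0 - math.sqrt(threshold / relative_frequency))

# Hypothetical relative frequencies (fraction of all tokens in the corpus).
p_stop_word = discard_probability(0.05)    # a very common word, e.g. 5% of tokens
p_rare_word = discard_probability(1e-6)    # a word rarer than the threshold
```

Words rarer than the threshold are never dropped, while most occurrences of very common words are removed before context windows are formed, so the surviving neighbours of a rare word can come from further away in the original text.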

47 $In the Continuous Bag-of-Words (CBOW) model, the hidden layer representation is the average (or sum) of the context word vectors. Given this architecture, how does the backpropagation gradient behave when updating the context vectors for a single training step?$

CBOW Hard

A.

The gradient calculated from the loss with respect to the hidden state is distributed identically to all context words in the window, ignoring their relative distances to the target.

B.

Only the context word vector most similar to the target word receives a non-zero gradient, acting as an implicit max-pooling mechanism.

C.

The gradient propagates exclusively to the negative samples, while positive context words are updated purely via momentum.

D.

The gradient is heavily weighted towards context words nearest to the target word due to a positional decay function in the averaging layer.

48 $Consider a CBOW model predicting a target word from a context window of size m (2m total words). If the vocabulary size is V and the embedding dimension is d, which of the following describes the complexity of computing the forward pass for a single target word using the standard softmax function?$

CBOW Hard

A.

$O(m \cdot V \cdot d)$, because each context word must independently compute a full softmax over the vocabulary.

B.

$O(d \log V)$, due to the required use of a binary tree for the softmax calculation.

C.

$O(m \cdot d + V \cdot d)$, because the model averages $2m$ vectors of size $d$ and then computes the dot product of the hidden state with all $V$ output vectors.

D.

$O(V^2)$, because the model must compute a full covariance matrix between the target word and the vocabulary.
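
The cost of a full-softmax CBOW forward pass can be read off a direct implementation. This is a minimal pure-Python sketch with a made-up model of vocabulary size 4 and embedding dimension 2; `cbow_forward` and the weight values are assumptions for illustration.

```python
import math

def cbow_forward(context_ids, input_vectors, output_vectors):
    """Average the context vectors, score every vocabulary word, apply softmax."""
    dim = len(input_vectors[0])
    # Averaging step: one pass over the context words, each of size `dim`.
    hidden = [
        sum(input_vectors[i][k] for i in context_ids) / len(context_ids)
        for k in range(dim)
    ]
    # Scoring step: one dot product of size `dim` per vocabulary word.
    scores = [sum(h * o for h, o in zip(hidden, out)) for out in output_vectors]
    # Softmax over the whole vocabulary (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tiny model: 4 words, 2 dimensions.
input_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
output_vectors = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = cbow_forward([0, 1], input_vectors, output_vectors)
```

The averaging step touches only the context words, while the scoring step touches every row of the output matrix, which is why the vocabulary term dominates for realistic vocabulary sizes.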

49 $The Skip-Gram with Negative Sampling (SGNS) objective utilizes a noise distribution for drawing negative samples, often chosen as the unigram distribution raised to the power of 3/4. What is the theoretical motivation for this specific fractional exponent?$

Skip-Gram models Hard

A.

It dampens the sampling probability of highly frequent words while proportionally increasing the likelihood of sampling rare words, improving the gradient signal for rare terms.

B.

It normalizes the embedding space by forcing the norm of the gradient vectors to decay at a fixed rate over time.

C.

It mathematically guarantees that the objective function becomes strictly convex, ensuring convergence to a global minimum.

D.

It exactly matches the Zipfian distribution of natural language, converting a heavy-tailed distribution into a uniform distribution.
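
The effect of the fractional exponent on the noise distribution can be seen by comparing it with the plain unigram distribution. The word counts below are invented for illustration, and `noise_distribution` is a hypothetical helper name.

```python
def noise_distribution(counts, power=0.75):
    """Normalize count ** power into a probability distribution over words."""
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

# Hypothetical corpus counts spanning several orders of magnitude.
counts = {"the": 10000, "embedding": 100, "zipfian": 1}

unigram = noise_distribution(counts, power=1.0)
smoothed = noise_distribution(counts, power=0.75)
```

Raising counts to a power below 1 moves probability mass from the most frequent word towards the rarer ones, without flattening the distribution all the way to uniform.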

50 $When training a Skip-gram model using Hierarchical Softmax rather than Negative Sampling, the output vocabulary is organized into a Huffman tree. If a target word w is located at depth L in this tree, how is the gradient distributed to the output representation matrices during a single backpropagation step?$

Skip-Gram models Hard

A.

Updates bypass the internal nodes and directly modify the leaf node of $w$ using a regularization term.

B.

Updates are applied uniformly to all leaf nodes that share a common ancestor with $w$ at depth $L$.

C.

Updates are applied exclusively to the internal node vectors along the path from the root to the leaf node corresponding to $w$.

D.

Updates are applied to the embedding vectors of all words in the vocabulary, inversely weighted by their tree distance to $w$.

51 $Assume a Skip-Gram model is trained on a sufficiently large corpus where two distinct target words, w_a and w_b, never directly co-occur in any window, but their distribution of context words is perfectly identical. How will their resulting embedding vectors v_a and v_b relate geometrically in the converged vector space?$

Skip-Gram models Hard

A.

They will be placed at opposite poles of the embedding space to maximize their margin in the softmax denominator.

B.

Their relationship will be strictly arbitrary because Skip-Gram cannot establish relationships without direct co-occurrence.

C.

They will be highly similar (i.e., cosine similarity near 1) because they optimize for the exact same context predictions, pulling their vectors to the same region.

D.

They will be perfectly orthogonal because they never co-occur as target-context pairs in the corpus.

52 $The GloVe objective function is defined as J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2. What is the critical structural role of the bias terms b_i and \tilde{b}_j in this formulation?$

GloVe embeddings Hard

A.

They absorb the independent marginal frequencies of the words $i$ and $j$, isolating the pure correlation (PMI) within the dot product $w_i^\top \tilde{w}_j$.

B.

They dynamically control the shape of the weighting function during training to prevent zero-counts from producing infinite loss.

C.

They act as a regularization mechanism to strictly limit the norm of the embedding vectors.

D.

They break the inherent symmetry between the target word matrix $W$ and the context matrix $\tilde{W}$.

53 $GloVe models semantic relationships based on the ratio of co-occurrence probabilities P_{ik} / P_{jk}. To transition from a mapping function F to the final GloVe objective, the authors enforce homomorphism between vector addition and scalar multiplication. What specific mathematical constraint does this impose on F?$

GloVe embeddings Hard

A.

$F$ must be a logarithmic mapping, converting polynomial distributions into uniform linear manifolds.

B.

$F$ must take the form of an exponential function acting on the dot product of $w_i$ and $\tilde{w}_k$, resulting in $w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i$.

C.

$F$ must be a normalized sigmoid function to ensure the ratios represent valid probability densities.

D.

$F$ must utilize a Fourier transform kernel to project ratios into a complex-valued Hilbert space.

54 $If the GloVe weighting function f(X_{ij}) were replaced by a uniform constant for all co-occurrence counts, what would be the most severe degradation observed in the resulting embeddings?$

GloVe embeddings Hard

A.

The model would entirely ignore long-range semantic analogies, focusing strictly on local syntax.

B.

The model would overfit heavily to rare, noisy co-occurrences, treating an event seen once identically to an event seen ten thousand times.

C.

The model would collapse into a trivial solution where all vectors are zero.

D.

The model's embedding vectors would exhibit infinite covariance, rendering cosine similarity meaningless.

55 $When evaluating semantic similarity using cosine similarity on embeddings trained via Word2Vec or GloVe, one often encounters the 'hubness' problem. What causes this phenomenon in high-dimensional embedding spaces?$

capturing semantic similarity Hard

A.

The occurrence of out-of-vocabulary words pushes all known vectors into a single hyper-sphere.

B.

The curse of dimensionality dictates that vectors located near the mean of the space have a high probability of becoming nearest neighbors to a disproportionately large number of other vectors.

C.

The optimization objective inherently forces vectors into a single orthogonal basis, preventing clustered similarity metrics.

D.

The cosine similarity metric mathematically fails in dimensions greater than 256, returning identical scores for uncorrelated vectors.

56 $A simple but strong baseline for computing sentence similarity is to take the arithmetic mean of constituent word embeddings. Under Arora et al.'s random walk model of sentence generation, what is the theoretical justification for this 'continuous bag-of-words' sentence embedding?$

capturing semantic similarity Hard

A.

Words in a sentence are drawn from a uniform distribution independent of syntax, making the arithmetic mean structurally equivalent to a recurrent neural network.

B.

Averaging cancels out word-specific noise entirely due to the strict mutual orthogonality of all word vectors in a GloVe space.

C.

The sentence generation process is modeled as a random walk driven by a slow-drifting discourse vector $c_t$, where the probability of emitting a word $w$ is proportional to $\exp(\langle c_t, v_w \rangle)$.

D.

The arithmetic mean maximizes the mutual information between the context words and the exact syntactical sequence.

57 $The widely cited vector arithmetic for analogies (e.g., king - man + woman \approx queen) assumes that semantic offsets form consistent linear structures. Under what specific condition does the Skip-Gram model natively guarantee this strict linear translation invariant structure?$

analogy relationships Hard

A.

When the negative sampling parameter is set to $0$, removing all noise from the objective function.

B.

When the Pointwise Mutual Information (PMI) matrix is exactly low-rank and differences in the log-probabilities of context words between pairs are strictly constant.

C.

When the context window is infinite, reducing the model to a standard Singular Value Decomposition.

D.

When word frequencies follow a uniform distribution rather than a Zipfian distribution.

58 $Evaluation of word analogies (a : a^* :: b : b^*) often utilizes either 3CosAdd or 3CosMul. Why does the 3CosMul objective frequently yield superior practical results on complex analogy tasks compared to the additive baseline?$

analogy relationships Hard

A.

3CosMul computes the exact geometric median of the target vectors rather than the arithmetic mean.

B.

3CosMul structurally mimics a non-linear manifold projection, effectively upgrading word embeddings to contextualized representations.

C.

3CosMul treats similarities multiplicatively, which mitigates the risk of a single marginally high cosine similarity term completely dominating the outcome.

D.

3CosMul enforces strict normalization on all vectors, ensuring the target word does not deviate from the unit hypersphere.
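
The two scoring rules can be compared side by side. This sketch follows the commonly cited formulation for an analogy a : a* :: b : ?, in which 3CosMul shifts cosines into [0, 1] and adds a small epsilon to the denominator; the toy 2-d vectors, the epsilon value `1e-3`, and all function names are assumptions for illustration.

```python
import math

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))

def three_cos_add(x, a, a_star, b):
    """Additive score: similar to b and a*, dissimilar to a."""
    return cosine(x, b) - cosine(x, a) + cosine(x, a_star)

def three_cos_mul(x, a, a_star, b, eps=1e-3):
    """Multiplicative score on cosines shifted from [-1, 1] into [0, 1]."""
    def shift(s):
        return (s + 1.0) / 2.0
    return shift(cosine(x, b)) * shift(cosine(x, a_star)) / (shift(cosine(x, a)) + eps)

# Hand-made toy embeddings, for illustration only.
toy = {
    "man": [0.0, 1.0], "woman": [0.0, -1.0],
    "king": [1.0, 1.0], "queen": [1.0, -1.0], "apple": [-1.0, 0.0],
}
a, a_star, b = toy["man"], toy["king"], toy["woman"]
candidates = ["queen", "apple"]

best_add = max(candidates, key=lambda w: three_cos_add(toy[w], a, a_star, b))
best_mul = max(candidates, key=lambda w: three_cos_mul(toy[w], a, a_star, b))
```

On this toy data both rules agree; they tend to diverge on real embeddings when one of the three similarity terms is much larger than the others.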

59 $When visualizing a 300-dimensional Word2Vec embedding space using t-SNE, a researcher notices that the resulting 2D plot shows overly dense, shattered clusters that do not align with known continuous semantic fields. Which hyperparameter of t-SNE is most likely responsible for this artifact, and why?$

visualizing embedding spaces using PCA or t-SNE Hard

A.

Number of iterations. If set too low, the algorithm terminates before the clusters can merge into a continuous space.

B.

Learning rate. If set too high, the gradient descent bounces out of optimal global minima.

C.

Perplexity. If set too low, the algorithm heavily prioritizes strictly local variations, artificially shattering broader semantic manifolds into disconnected micro-clusters.

D.

Early exaggeration. If set too low, the attractive forces between all clusters pull them into a single indistinguishable mass.

60 $Both PCA and t-SNE are standard tools used to visualize embedding spaces. If semantic relationships are structurally represented by parallel linear offsets (e.g., gender vectors) in the original embedding space, why might t-SNE theoretically fail to visually preserve these parallel analogy relationships compared to PCA?$

visualizing embedding spaces using PCA or t-SNE Hard

A.

t-SNE is a non-linear manifold learning technique that preserves local pairwise distances but often distorts global geometric structure and distances, failing to map parallel structures consistently.

B.

t-SNE computes exact Euclidean distances rather than cosine similarity, which misinterprets angular relationships in the high-dimensional space.

C.

PCA strictly maximizes local entropy, which coincidentally aligns with the mathematical formulation of Skip-Gram offset vectors.

D.

t-SNE relies on Singular Value Decomposition (SVD), which orthogonalizes all input variables, inherently destroying parallel lines.

Unit 2 - Practice Quiz