1. What is the fundamental representation of a word in a Vector Space Model (VSM)?
A. A linked list
B. A binary tree structure
C. A vector of real numbers
D. A scalar integer
Correct Answer: A vector of real numbers
Explanation:
In Vector Space Models, words are mapped to vectors of real numbers within a continuous vector space.
2. Which of the following is a primary advantage of dense word vectors over one-hot encoding?
A. They capture semantic relationships
B. They are strictly binary
C. They are easier to calculate manually
D. They use more memory
Correct Answer: They capture semantic relationships
Explanation:
Dense vectors place similar words close to each other in the vector space, capturing semantic meaning, whereas one-hot vectors are orthogonal and equidistant.
3. In the context of the Continuous Bag-of-Words (CBOW) model, what is the input to the neural network?
A. A random noise vector
B. The center word
C. The context words
D. The entire document
Correct Answer: The context words
Explanation:
CBOW predicts the current (center) word based on the surrounding context words.
4. How is 'cosine similarity' calculated between two word vectors, A and B?
A. Cross product of A and B
B. Euclidean distance between A and B
C. Dot product of A and B divided by the product of their magnitudes
D. Sum of elements in A minus sum of elements in B
Correct Answer: Dot product of A and B divided by the product of their magnitudes
Explanation:
Cosine similarity measures the cosine of the angle between two vectors, calculated as (A . B) / (||A|| * ||B||).
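A minimal NumPy sketch of this formula (the two vectors below are made-up toy values, not trained embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: (A . B) / (||A|| * ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "word vectors"
king = np.array([0.8, 0.1, 0.5])
queen = np.array([0.7, 0.3, 0.6])
print(cosine_similarity(king, queen))  # close to 1.0 for similar directions
```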
5. If two word vectors have a cosine similarity of 1, what does this imply?
A. The words are unrelated
B. The vectors are orthogonal
C. The vectors point in exactly the same direction
D. The words represent opposite meanings
Correct Answer: The vectors point in exactly the same direction
Explanation:
A cosine similarity of 1 implies the angle between the vectors is 0 degrees, meaning they are identical in orientation (highly similar).
6. Which architecture predicts the surrounding words given a center word?
A. Continuous Bag-of-Words (CBOW)
B. Principal Component Analysis
C. Latent Dirichlet Allocation
D. Skip-gram
Correct Answer: Skip-gram
Explanation:
Skip-gram is the inverse of CBOW; it takes the center word as input and tries to predict the context words.
7. What mathematical technique is commonly used to visualize high-dimensional word vectors in two dimensions?
A. Linear Regression
B. Logistic Regression
C. Fourier Transform
D. Principal Component Analysis (PCA)
Correct Answer: Principal Component Analysis (PCA)
Explanation:
PCA is a dimensionality reduction technique used to project high-dimensional data into lower dimensions (like 2D) while preserving variance.
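A sketch of how such a 2-D projection might be done with scikit-learn and matplotlib; the embedding matrix and vocabulary here are random stand-ins, not a trained model:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 300))      # stand-in for 1000 word vectors
words = [f"word_{i}" for i in range(1000)]     # stand-in vocabulary

coords = PCA(n_components=2).fit_transform(embeddings)  # project 300-D -> 2-D

plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in zip(words[:10], coords[:10]):        # label a few points
    plt.annotate(word, (x, y))
plt.show()
```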
8. In vector arithmetic for analogies, what result is expected for vector('King') - vector('Man') + vector('Woman')?
A. Monarch
B. Princess
C. Prince
D. Queen
Correct Answer: Queen
Explanation:
This is a classic example of semantic relationships in vector space, where arithmetic operations preserve gender and royalty relationships.
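A toy illustration of answering the analogy by nearest-neighbor search; the four vectors are invented for the example, and in practice a trained embedding table would be used instead:

```python
import numpy as np

# Invented 4-dimensional vectors standing in for trained embeddings
vocab = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = vocab["king"] - vocab["man"] + vocab["woman"]
# Exclude the query words, then return the most similar remaining word
candidates = {w: v for w, v in vocab.items() if w not in ("king", "man", "woman")}
print(max(candidates, key=lambda w: cosine(candidates[w], target)))  # queen
```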
9. What is the role of the 'window size' hyperparameter in CBOW?
A. It sets the learning rate
B. It determines the number of dimensions in the vector
C. It determines the number of epochs
D. It defines how many neighbors to consider as context
Correct Answer: It defines how many neighbors to consider as context
Explanation:
The window size determines the scope of the context (e.g., 2 words to the left and 2 to the right) used to predict the center word.
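A small sketch of how a window size of 2 turns a sentence into (context, center) training pairs; the helper name is hypothetical:

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, center_word) pairs for the given window size."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, center

sentence = "the quick brown fox jumps over the lazy dog".split()
for context, center in cbow_pairs(sentence, window=2):
    print(center, "<-", context)
```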
10. Why are vector space models useful for information retrieval/document search?
A. They eliminate the need for indexing
B. They can match queries to documents based on semantic similarity
C. They allow exact keyword matching only
D. They work best with images
Correct Answer: They can match queries to documents based on semantic similarity
Explanation:
VSMs allow calculating similarity between a query vector and document vectors, retrieving relevant results even if exact keywords differ.
11. When transforming word vectors from one language to another (e.g., English to French) using a linear mapping, what are we trying to learn?
A. A decision tree
B. A clustering algorithm
C. A rotation matrix
D. A binary classifier
Correct Answer: A rotation matrix
Explanation:
Cross-lingual embedding alignment often involves learning a transformation matrix (often a rotation matrix) that maps the source vector space to the target vector space.
12. In PCA, the first principal component is the direction that maximizes what?
A. The number of clusters
B. The error rate
C. The cosine similarity
D. The variance of the data
Correct Answer: The variance of the data
Explanation:
PCA identifies the axes (principal components) along which the data varies the most to preserve information.
13. Which of the following best describes the 'bag-of-words' model assumption?
A. Word order is critical for meaning
B. Word order is ignored, only frequency counts matter
C. Grammar rules are strictly enforced
D. Dependencies between words are preserved
Correct Answer: Word order is ignored, only frequency counts matter
Explanation:
Bag-of-Words treats a text as an unordered collection of words, disregarding grammar and word order.
14. In the CBOW model, how are the input context vectors usually handled before passing to the hidden layer?
A. Only the first word is used
B. They are averaged or summed
C. They are concatenated
D. They are multiplied
Correct Answer: They are averaged or summed
Explanation:
In standard CBOW, the vectors of the context words are averaged or summed to create a single projection layer input.
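A minimal sketch of this projection step, assuming a randomly initialized input embedding matrix and a handful of hypothetical context-word indices:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # input embeddings
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))  # output weights

context_ids = [12, 45, 301, 77]          # indices of the context words
hidden = W_in[context_ids].mean(axis=0)  # average the context vectors
scores = hidden @ W_out                  # one raw score per vocabulary word
print(hidden.shape, scores.shape)        # (300,) (10000,)
```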
15. What does a cosine similarity of 0 indicate between two word vectors?
A. They are opposite
B. They are identical
C. They are orthogonal (unrelated)
D. One is a scalar multiple of the other
Correct Answer: They are orthogonal (unrelated)
Explanation:
A cosine similarity of 0 means the angle is 90 degrees (orthogonal), implying no correlation in the vector space.
16. Which loss function is typically minimized when aligning two vector spaces (X and Y) via a transformation matrix R?
A. Hinge loss
B. Frobenius norm of (XR - Y)
C. Cross-entropy loss
D. Accuracy score
Correct Answer: Frobenius norm of (XR - Y)
Explanation:
The goal is to minimize the distance between the transformed source vectors (XR) and the target vectors (Y), typically measured with the Frobenius norm, the matrix analogue of Euclidean distance.
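A minimal gradient-descent sketch of this objective; X and Y are random stand-ins for aligned bilingual word vectors, and the learning rate and iteration count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50
X = rng.normal(size=(n, d))   # source-language vectors (stand-ins)
Y = rng.normal(size=(n, d))   # vectors of their translations (stand-ins)

R = np.eye(d)                 # transformation matrix to learn
lr = 1e-3
for _ in range(200):
    diff = X @ R - Y
    loss = (diff ** 2).sum() / n      # squared Frobenius norm (averaged)
    grad = 2 * (X.T @ diff) / n       # gradient of the loss w.r.t. R
    R -= lr * grad
print(loss)
```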
17. Deep Learning vector models like Word2Vec are often referred to as:
A. Hierarchical clusters
B. Prediction-based embeddings
C. Sparse embeddings
D. Count-based matrices
Correct Answer: Prediction-based embeddings
Explanation:
Word2Vec models learn embeddings by training a network to predict words (or context) rather than counting co-occurrences directly.
18. If 'Apple' and 'Pear' are close in vector space, this indicates:
A. Morphological similarity
B. Phonetic similarity
C. Syntactic similarity
D. Semantic similarity
Correct Answer: Semantic similarity
Explanation:
Proximity in vector space usually implies that words share similar contexts and meanings (semantics).
19. When visualizing word vectors, why can't we simply plot the 300-dimensional vectors directly?
A. Human visual perception is limited to 2 or 3 dimensions
B. The vectors become binary
C. It would take too long to render
D. Computers cannot store 300 dimensions
Correct Answer: Human visual perception is limited to 2 or 3 dimensions
Explanation:
We need dimensionality reduction like PCA because humans cannot visually conceptualize spaces higher than 3 dimensions.
20. Which of the following is NOT a benefit of using Vector Space Models in Machine Translation?
A. Guaranteeing grammatically perfect sentences
B. Improving alignment of synonyms
C. Handling rare words via similarity
D. Mapping entire languages without parallel corpora (unsupervised)
Correct Answer: Guaranteeing grammatically perfect sentences
Explanation:
While VSMs help with word alignment and meaning transfer, they do not inherently guarantee the grammatical correctness of the generated sentence.
21. In a Word2Vec model, the dimension of the hidden layer corresponds to:
A. The number of training documents
B. The size of the word embedding vector
C. The window size
D. The vocabulary size
Correct Answer: The size of the word embedding vector
Explanation:
The weight matrix between the input and hidden layers stores the word embeddings, so the size of the hidden layer is the embedding dimension.
22. To perform document search using word vectors, how might one represent a whole document?
A. By using the vector of the first word only
B. By taking the average (centroid) of all word vectors in the document
C. By concatenating all vectors into one giant vector
D. By summing the ASCII values of characters
Correct Answer: By taking the average (centroid) of all word vectors in the document
Explanation:
A common baseline approach is to average the word vectors of all words in the document to get a single document vector.
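A small sketch of the centroid approach; the embedding lookup below is a random placeholder for a real pretrained table:

```python
import numpy as np

def doc_vector(tokens, embeddings, dim=300):
    """Average (centroid) of the vectors of the known words in a document."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=300) for w in "cat dog pet food stock market".split()}

doc = "the cat and the dog eat pet food".split()
print(doc_vector(doc, embeddings).shape)  # (300,)
```

The resulting document vectors can then be ranked against a query vector with cosine similarity.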
23. What is the 'Curse of Dimensionality' in the context of NLP?
A. The time it takes to train a model
B. The inability to use PCA
C. Having too few dimensions to represent meaning
D. Data becoming sparse and distance metrics becoming less meaningful in very high dimensions
Correct Answer: Data becoming sparse and distance metrics becoming less meaningful in very high dimensions
Explanation:
In extremely high-dimensional spaces (like raw vocabulary size), data points are sparse, and traditional distance metrics can behave counterintuitively.
24. Which algebraic structure is used to transform word vectors from a source language space to a target language space?
A. A transformation matrix
B. A scalar
C. A vector
D. A tensor of rank 3
Correct Answer: A transformation matrix
Explanation:
Multiplying the source-language vectors X by a transformation matrix R (computing XR) maps them from the source space into the target space.
25. In PCA, what are 'eigenvalues' used for?
A. To label the axes
B. To determine the direction of axes
C. To quantify the variance explained by each principal component
D. To calculate the dot product
Correct Answer: To quantify the variance explained by each principal component
Explanation:
Eigenvalues indicate the magnitude of variance captured by their corresponding eigenvectors (principal components).
26. Which word pair would likely have the highest Euclidean distance in a well-trained vector space?
A. Car - Automobile
B. Happy - Joyful
C. Frog - Toad
D. Computer - Sandwich
Correct Answer: Computer - Sandwich
Explanation:
'Computer' and 'Sandwich' are semantically unrelated, so they would be far apart in the vector space compared to synonyms.
27. The output layer of a standard CBOW model typically uses which activation function to generate probabilities?
A. Softmax
B. Tanh
C. Sigmoid
D. ReLU
Correct Answer: Softmax
Explanation:
Softmax is used to convert the raw output scores into a probability distribution over the vocabulary.
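A numerically stable softmax sketch in NumPy (the input scores are arbitrary example values):

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Turn raw scores into a probability distribution over the vocabulary."""
    shifted = scores - scores.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # probabilities that sum to 1.0
```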
28. What is the main limitation of using Euclidean distance for word vectors compared to Cosine similarity?
A. It cannot handle negative numbers
B. It is computationally harder
C. It is sensitive to the magnitude (length) of the vectors
D. It only works in 2D
Correct Answer: It is sensitive to the magnitude (length) of the vectors
Explanation:
Euclidean distance is affected by vector length (frequency of words), while Cosine similarity focuses on the angle (orientation/meaning), making Cosine often preferred.
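A quick demonstration of the difference: scaling a vector changes its Euclidean distance to the original but not its cosine similarity (the vector values are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, ten times the magnitude

print(np.linalg.norm(a - b))                                   # large distance
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # 1.0 (same direction)
```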
29. Which concept explains why 'Paris' is to 'France' as 'Tokyo' is to 'Japan' in vector space?
A. Orthogonality
B. Linear substructures / Parallelism
C. Singular Value Decomposition
D. One-hot encoding
Correct Answer: Linear substructures / Parallelism
Explanation:
The relationship vectors (Country -> Capital) tend to be parallel and of similar length, allowing for linear algebraic analogies.
30. How does PCA reduce dimensions?
A. By averaging all data points to zero
B. By deleting the last 50 columns of data
C. By projecting data onto new axes that minimize information loss
D. By removing words with fewer than 3 letters
Correct Answer: By projecting data onto new axes that minimize information loss
Explanation:
PCA constructs new orthogonal axes (principal components) and projects data onto the top k axes that retain the most variance.
31. In the context of relationships between words, 'distributional semantics' suggests that:
A. Words are defined by their spelling
B. Words are unrelated entities
C. Words that appear in similar contexts have similar meanings
D. Words are defined by their dictionary definitions
Correct Answer: Words that appear in similar contexts have similar meanings
Explanation:
This is the core hypothesis behind VSMs: 'You shall know a word by the company it keeps'.
32. When training CBOW, what is the 'target'?
A. The next sentence
B. The sentiment of the sentence
C. The center word
D. The part of speech
Correct Answer: The center word
Explanation:
The objective of CBOW is to correctly predict the center word given the context words.
33. What happens to the vectors of synonyms (e.g., 'huge' and 'enormous') during training?
A. They become orthogonal
B. They move closer together
C. They move infinitely far apart
D. One replaces the other
Correct Answer: They move closer together
Explanation:
Since synonyms appear in similar contexts, the training process adjusts their vectors to be spatially proximal.
34. If you want to visualize a 1000-word subset of your vocabulary using PCA, what is the shape of the input matrix?
A. 1000 x 2
B. 2 x 2
C. 1000 x Dimension_of_Embedding
D. Dimension_of_Embedding x 1000
Correct Answer: 1000 x Dimension_of_Embedding
Explanation:
The input data matrix for PCA consists of N samples (1000 words) by D features (the embedding dimension).
35. In cross-lingual information retrieval, query translation can be achieved by:
A. Re-training the model from scratch
B. Using a dictionary lookup only
C. Ignoring the language difference
D. Multiplying the query vector by a transformation matrix
Correct Answer: Multiplying the query vector by a transformation matrix
Explanation:
The query vector in the source language is projected into the target language space using the learned transformation matrix.
36. What is a 'context window'?
A. The software used to view the code
B. The number of words before and after a target word
C. The graphical user interface
D. The time limit for training
Correct Answer: The number of words before and after a target word
Explanation:
The context window defines the span of text surrounding a target word used to learn dependencies.
37. Which of the following is NOT a step in performing PCA?
A. Standardizing the data
B. Applying a Softmax function
C. Calculating the covariance matrix
D. Computing eigenvectors and eigenvalues
Correct Answer: Applying a Softmax function
Explanation:
Softmax is an activation function for neural networks; PCA involves covariance, eigenvectors, and projection.
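A from-scratch NumPy sketch of those PCA steps (mean-centering, covariance, eigendecomposition, projection), using random stand-in word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))          # stand-in word vectors

X_centered = X - X.mean(axis=0)           # 1. mean-center the data
cov = np.cov(X_centered, rowvar=False)    # 2. covariance matrix (300 x 300)
eigvals, eigvecs = np.linalg.eigh(cov)    # 3. eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]         #    sort by variance explained
top2 = eigvecs[:, order[:2]]
coords = X_centered @ top2                # 4. project onto the top 2 components
print(coords.shape)                       # (1000, 2)
```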
38. If a word vector has a magnitude of 1, it is called:
A. A complex vector
B. A sparse vector
C. A binary vector
D. A normalized vector
Correct Answer: A normalized vector
Explanation:
A vector with a length (L2 norm) of 1 is normalized (unit vector).
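For example, dividing a vector by its L2 norm produces a unit (normalized) vector:

```python
import numpy as np

v = np.array([3.0, 4.0])
unit = v / np.linalg.norm(v)   # divide by the L2 norm (here 5.0)
print(np.linalg.norm(unit))    # 1.0
```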
39. Which approach is generally faster to train: CBOW or Skip-gram?
A. Skip-gram
B. CBOW
C. Neither is trainable
D. They are exactly the same
Correct Answer: CBOW
Explanation:
CBOW is generally faster to train than Skip-gram because it treats the entire context as one observation, whereas Skip-gram treats each context-target pair as a new observation.
40. To capture dependencies between words that are far apart in a sentence, one should:
A. Use one-hot encoding
B. Increase the window size
C. Decrease the window size
D. Set window size to 0
Correct Answer: Increase the window size
Explanation:
A larger window size captures broader context and longer-range dependencies, though it may introduce more noise.
41. The 'Manifold Hypothesis' in NLP suggests that:
A. High-dimensional language data lies on a lower-dimensional manifold
B. All words are equidistant
C. Vectors must be 3D
D. Language is flat
Correct Answer: High-dimensional language data lies on a lower-dimensional manifold
Explanation:
This hypothesis justifies dimensionality reduction, suggesting that real-world data points cluster on a lower-dimensional surface embedded in the high-dimensional space.
42. When performing vector arithmetic for 'Paris - France + Italy', the result is likely closest to:
A. Germany
B. Rome
C. Pizza
D. London
Correct Answer: Rome
Explanation:
This operation transfers the relationship 'Capital of' from France to Italy.
43. What is the dimensionality of the transformation matrix R used to map a source space of dimension D to a target space of dimension D?
A. D x 1
B. 2D x 2D
C. 1 x D
D. D x D
Correct Answer: D x D
Explanation:
To map a D-dimensional vector to another D-dimensional vector via linear transformation, a D x D matrix is required.
44. Which vector operation is primarily used to measure the relevance of a document to a search query in VSM?
A. Scalar Multiplication
B. Vector Subtraction
C. Cosine Similarity
D. Vector Addition
Correct Answer: Cosine Similarity
Explanation:
Relevance is usually determined by how close (similar) the document vector is to the query vector.
45. Sparse vectors (like Bag-of-Words) are characterized by:
A. Mostly zero values
B. Negative numbers only
C. Mostly non-zero values
D. Complex numbers
Correct Answer: Mostly zero values
Explanation:
In a large vocabulary, a single document contains only a few unique words, resulting in vectors with mostly zeros.
46. Word embeddings capture which type of relationships?
A. Neither
B. Only syntactic
C. Only semantic
D. Both syntactic and semantic
Correct Answer: Both syntactic and semantic
Explanation:
Good embeddings capture semantic meanings (King-Queen) and syntactic rules (walk-walking, swim-swimming).
47. Before applying PCA, it is standard practice to:
A. Square the data
B. Invert the data
C. Mean-center the data
D. Randomize the data
Correct Answer: Mean-center the data
Explanation:
PCA is computed from the data's covariance structure; subtracting the mean of each feature (mean-centering) ensures the principal components describe variance around the data's centroid rather than around the origin.
48. In the analogy 'A is to B as C is to D', which equation represents the relationship in vector space?
A. B / A = D / C
B. B - A = D - C
C. B × A = D × C
D. B + A = D + C
Correct Answer: B - A = D - C
Explanation:
The difference vector (relationship) between B and A should be roughly the same as the difference between D and C.
49. Why might we use PCA on word vectors before performing clustering?
A. To convert vectors to text
B. To remove noise and reduce computational cost
C. To increase the number of dimensions
D. To translate the language
Correct Answer: To remove noise and reduce computational cost
Explanation:
Reducing dimensions helps remove variance that may be noise and speeds up clustering algorithms (like K-Means).
50. Which technique allows checking if the transformation matrix between two languages is accurate?
A. Calculating the determinant
B. Checking the accuracy of translation on a hold-out dictionary
C. Measuring the vector length
D. Checking if the matrix is square
Correct Answer: Checking the accuracy of translation on a hold-out dictionary
Explanation:
Evaluation involves applying the transformation to known words (not in the training set) and checking if the nearest neighbor in the target space is the correct translation.
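A sketch of that evaluation as nearest-neighbor translation accuracy; all argument names are hypothetical and stand for the held-out source vectors, the full target-language embedding matrix, the learned matrix R, and the row indices of the correct translations:

```python
import numpy as np

def translation_accuracy(X_test, Y_all, R, gold_indices):
    """Fraction of held-out source words whose nearest neighbor in the
    target space (after applying R) is the correct translation."""
    projected = X_test @ R
    proj = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    targets = Y_all / np.linalg.norm(Y_all, axis=1, keepdims=True)
    nearest = (proj @ targets.T).argmax(axis=1)   # cosine nearest neighbor
    return float((nearest == np.asarray(gold_indices)).mean())
```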