1. What is the fundamental representation of a word in a Vector Space Model (VSM)?
A. A scalar integer
B. A vector of real numbers
C. A binary tree structure
D. A linked list
Correct Answer: A vector of real numbers
Explanation: In Vector Space Models, words are mapped to vectors of real numbers within a continuous vector space.
2. Which of the following is a primary advantage of dense word vectors over one-hot encoding?
A. They capture semantic relationships
B. They use more memory
C. They are easier to calculate manually
D. They are strictly binary
Correct Answer: They capture semantic relationships
Explanation: Dense vectors place similar words close to each other in the vector space, capturing semantic meaning, whereas one-hot vectors are orthogonal and equidistant.
3. In the context of the Continuous Bag-of-Words (CBOW) model, what is the input to the neural network?
A. The center word
B. The context words
C. The entire document
D. A random noise vector
Correct Answer: The context words
Explanation: CBOW predicts the current (center) word based on the surrounding context words.
4. How is 'cosine similarity' calculated between two word vectors, A and B?
A. Dot product of A and B divided by the product of their magnitudes
B. Euclidean distance between A and B
C. Sum of elements in A minus sum of elements in B
D. Cross product of A and B
Correct Answer: Dot product of A and B divided by the product of their magnitudes
Explanation: Cosine similarity measures the cosine of the angle between two vectors, calculated as (A . B) / (||A|| * ||B||).
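To make the formula concrete, here is a minimal NumPy sketch of cosine similarity; the two three-dimensional vectors are toy values chosen for illustration, not embeddings from a trained model.

    import numpy as np

    def cosine_similarity(a, b):
        # (A . B) / (||A|| * ||B||)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # toy 3-dimensional "embeddings", for illustration only
    v_king = np.array([0.8, 0.3, 0.1])
    v_queen = np.array([0.7, 0.4, 0.1])
    print(cosine_similarity(v_king, v_queen))  # close to 1.0 for similar directions
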
5. If two word vectors have a cosine similarity of 1, what does this imply?
A. The words are unrelated
B. The words represent opposite meanings
C. The vectors point in exactly the same direction
D. The vectors are orthogonal
Correct Answer: The vectors point in exactly the same direction
Explanation: A cosine similarity of 1 implies the angle between the vectors is 0 degrees, meaning they are identical in orientation (highly similar).
6. Which architecture predicts the surrounding words given a center word?
A. Continuous Bag-of-Words (CBOW)
B. Skip-gram
C. Principal Component Analysis
D. Latent Dirichlet Allocation
Correct Answer: Skip-gram
Explanation: Skip-gram is the inverse of CBOW; it takes the center word as input and tries to predict the context words.
7. What mathematical technique is commonly used to visualize high-dimensional word vectors in two dimensions?
A. Linear Regression
B. Principal Component Analysis (PCA)
C. Logistic Regression
D. Fourier Transform
Correct Answer: Principal Component Analysis (PCA)
Explanation: PCA is a dimensionality reduction technique used to project high-dimensional data into lower dimensions (like 2D) while preserving variance.
8. In vector arithmetic for analogies, what result is expected for vector('King') - vector('Man') + vector('Woman')?
A. Prince
B. Queen
C. Princess
D. Monarch
Correct Answer: Queen
Explanation: This is a classic example of semantic relationships in vector space, where arithmetic operations preserve gender and royalty relationships.
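A minimal sketch of how such an analogy query can be answered in code, assuming a small hand-made dictionary of toy vectors; real systems would use vectors from a trained model, but the nearest-neighbor search (excluding the query words) works the same way.

    import numpy as np

    def nearest(target, vocab, exclude):
        # return the word whose vector has the highest cosine similarity to `target`
        best, best_sim = None, -1.0
        for word, vec in vocab.items():
            if word in exclude:
                continue
            sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
            if sim > best_sim:
                best, best_sim = word, sim
        return best

    # toy embeddings invented for illustration
    vocab = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "man":   np.array([0.5, 0.1, 0.1]),
        "woman": np.array([0.5, 0.1, 0.9]),
        "queen": np.array([0.9, 0.8, 0.9]),
    }
    analogy = vocab["king"] - vocab["man"] + vocab["woman"]
    print(nearest(analogy, vocab, exclude={"king", "man", "woman"}))  # expected: "queen"
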
9. What is the role of the 'window size' hyperparameter in CBOW?
A. It determines the number of dimensions in the vector
B. It defines how many neighbors to consider as context
C. It sets the learning rate
D. It determines the number of epochs
Correct Answer: It defines how many neighbors to consider as context
Explanation: The window size determines the scope of the context (e.g., 2 words to the left and 2 to the right) used to predict the center word.
10. Why are vector space models useful for information retrieval/document search?
A. They allow exact keyword matching only
B. They can match queries to documents based on semantic similarity
C. They eliminate the need for indexing
D. They work best with images
Correct Answer: They can match queries to documents based on semantic similarity
Explanation: VSMs allow calculating similarity between a query vector and document vectors, retrieving relevant results even if exact keywords differ.
11. When transforming word vectors from one language to another (e.g., English to French) using a linear mapping, what are we trying to learn?
A. A rotation matrix
B. A binary classifier
C. A clustering algorithm
D. A decision tree
Correct Answer: A rotation matrix
Explanation: Cross-lingual embedding alignment involves learning a transformation matrix (often constrained to be a rotation) that maps the source vector space to the target vector space.
12. In PCA, the first principal component is the direction that maximizes what?
A. The error rate
B. The variance of the data
C. The number of clusters
D. The cosine similarity
Correct Answer: The variance of the data
Explanation: PCA identifies the axes (principal components) along which the data varies the most to preserve information.
13. Which of the following best describes the 'bag-of-words' model assumption?
A. Word order is critical for meaning
B. Grammar rules are strictly enforced
C. Word order is ignored, only frequency counts matter
D. Dependencies between words are preserved
Correct Answer: Word order is ignored, only frequency counts matter
Explanation: Bag-of-Words treats a text as an unordered collection of words, disregarding grammar and word order.
14. In the CBOW model, how are the input context vectors usually handled before passing to the hidden layer?
A. They are concatenated
B. They are averaged or summed
C. They are multiplied
D. Only the first word is used
Correct Answer: They are averaged or summed
Explanation: In standard CBOW, the vectors of the context words are averaged or summed to create a single projection-layer input.
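As a sketch of this projection step, the snippet below runs a single CBOW forward pass with randomly initialized toy weight matrices (the sizes and word indices are made up): it averages the context embeddings and applies a softmax over the vocabulary, as described in later questions.

    import numpy as np

    vocab_size, embed_dim = 10, 4                  # toy sizes for illustration
    W_in = np.random.rand(vocab_size, embed_dim)   # input embedding matrix
    W_out = np.random.rand(embed_dim, vocab_size)  # output weight matrix

    context_ids = [1, 2, 4, 5]                     # indices of the context words
    h = W_in[context_ids].mean(axis=0)             # average the context vectors (projection layer)
    scores = h @ W_out                             # one raw score per vocabulary word
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax: probability of each word being the center
    predicted_center = probs.argmax()
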
15. What does a cosine similarity of 0 indicate between two word vectors?
A. They are identical
B. They are opposite
C. They are orthogonal (unrelated)
D. One is a scalar multiple of the other
Correct Answer: They are orthogonal (unrelated)
Explanation: A cosine similarity of 0 means the angle is 90 degrees (orthogonal), implying no correlation in the vector space.
16. Which loss function is typically minimized when aligning two vector spaces (X and Y) via a transformation matrix R?
A. Cross-entropy loss
B. Frobenius norm of (XR - Y)
C. Hinge loss
D. Accuracy score
Correct Answer: Frobenius norm of (XR - Y)
Explanation: The goal is to minimize the distance between the transformed source vectors (XR) and the target vectors (Y), often using the Frobenius norm (the Euclidean distance equivalent for matrices).
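A minimal sketch of solving for R under this objective, using ordinary least squares on synthetic data; the shapes and matrices are invented for illustration, and real alignment methods often add an orthogonality constraint or use gradient descent instead.

    import numpy as np

    # X holds source-language vectors, Y the vectors of their translations (rows are aligned pairs)
    X = np.random.rand(100, 4)                 # 100 word pairs, 4-dimensional toy embeddings
    R_true = np.random.rand(4, 4)
    Y = X @ R_true                             # synthetic targets, so a perfect R exists

    # least-squares solution to min_R ||XR - Y||_F
    R, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(np.linalg.norm(X @ R - Y))           # Frobenius norm of the residual, near zero here
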
17. Deep Learning vector models like Word2Vec are often referred to as:
A. Sparse embeddings
B. Prediction-based embeddings
C. Count-based matrices
D. Hierarchical clusters
Correct Answer: Prediction-based embeddings
Explanation: Word2Vec models learn embeddings by training a network to predict words (or context) rather than counting co-occurrences directly.
18. If 'Apple' and 'Pear' are close in vector space, this indicates:
A. Syntactic similarity
B. Semantic similarity
C. Phonetic similarity
D. Morphological similarity
Correct Answer: Semantic similarity
Explanation: Proximity in vector space usually implies that words share similar contexts and meanings (semantics).
19. When visualizing word vectors, why can't we simply plot the 300-dimensional vectors directly?
A. Computers cannot store 300 dimensions
B. Human visual perception is limited to 2 or 3 dimensions
C. The vectors become binary
D. It would take too long to render
Correct Answer: Human visual perception is limited to 2 or 3 dimensions
Explanation: We need dimensionality reduction like PCA because humans cannot visually conceptualize spaces higher than 3 dimensions.
20. Which of the following is NOT a benefit of using Vector Space Models in Machine Translation?
A. Handling rare words via similarity
B. Mapping entire languages without parallel corpora (unsupervised)
C. Guaranteeing grammatically correct output sentences
Correct Answer: Guaranteeing grammatically correct output sentences
Explanation: While VSMs help with word alignment and meaning transfer, they do not inherently guarantee the grammatical correctness of the generated sentence.
21. In a Word2Vec model, the dimension of the hidden layer corresponds to:
A. The vocabulary size
B. The size of the word embedding vector
C. The number of training documents
D. The window size
Correct Answer: The size of the word embedding vector
Explanation: The hidden layer size equals the embedding dimension; the weights between the input layer and the hidden layer constitute the actual word embeddings.
22. To perform document search using word vectors, how might one represent a whole document?
A. By using the vector of the first word only
B. By taking the average (centroid) of all word vectors in the document
C. By summing the ASCII values of characters
D. By concatenating all vectors into one giant vector
Correct Answer: By taking the average (centroid) of all word vectors in the document
Explanation: A common baseline approach is to average the word vectors of all words in the document to get a single document vector.
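A minimal sketch of this centroid approach applied to search, with a toy embedding dictionary invented for illustration: a query centroid is compared to two document centroids by cosine similarity.

    import numpy as np

    def centroid(tokens, embeddings):
        # document vector = average of the word vectors (skipping unknown tokens)
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        return np.mean(vecs, axis=0)

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # toy 3-dimensional embeddings, for illustration only
    emb = {"cat": np.array([0.9, 0.1, 0.0]), "dog": np.array([0.8, 0.2, 0.0]),
           "stock": np.array([0.0, 0.1, 0.9]), "market": np.array([0.1, 0.0, 0.8])}

    pets_doc = centroid(["cat", "dog"], emb)
    finance_doc = centroid(["stock", "market"], emb)
    query = centroid(["dog"], emb)
    print(cosine(query, pets_doc), cosine(query, finance_doc))  # the pets document scores higher
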
23. What is the 'Curse of Dimensionality' in the context of NLP?
A. Having too few dimensions to represent meaning
B. Data becoming sparse and distance metrics becoming less meaningful in very high dimensions
C. The inability to use PCA
D. The time it takes to train a model
Correct Answer: Data becoming sparse and distance metrics becoming less meaningful in very high dimensions
Explanation: In extremely high-dimensional spaces (like raw vocabulary size), data points are sparse, and traditional distance metrics can behave counterintuitively.
24. Which algebraic structure is used to transform word vectors from a source language space to a target language space?
A. A scalar
B. A vector
C. A transformation matrix
D. A tensor of rank 3
Correct Answer: A transformation matrix
Explanation: Multiplying by the transformation matrix (X * R) maps vectors from the source space X to the target space Y.
25. In PCA, what are 'eigenvalues' used for?
A. To determine the direction of axes
B. To quantify the variance explained by each principal component
C. To calculate the dot product
D. To label the axes
Correct Answer: To quantify the variance explained by each principal component
Explanation: Eigenvalues indicate the magnitude of variance captured by their corresponding eigenvectors (principal components).
26. Which word pair would likely have the highest Euclidean distance in a well-trained vector space?
A. Car - Automobile
B. Frog - Toad
C. Computer - Sandwich
D. Happy - Joyful
Correct Answer: Computer - Sandwich
Explanation: 'Computer' and 'Sandwich' are semantically unrelated, so they would be far apart in the vector space compared to synonyms.
27. The output layer of a standard CBOW model typically uses which activation function to generate probabilities?
A. ReLU
B. Sigmoid
C. Softmax
D. Tanh
Correct Answer: Softmax
Explanation: Softmax is used to convert the raw output scores into a probability distribution over the vocabulary.
28. What is the main limitation of using Euclidean distance for word vectors compared to Cosine similarity?
A. It is computationally harder
B. It is sensitive to the magnitude (length) of the vectors
C. It cannot handle negative numbers
D. It only works in 2D
Correct Answer: It is sensitive to the magnitude (length) of the vectors
Explanation: Euclidean distance is affected by vector length (frequency of words), while Cosine similarity focuses on the angle (orientation/meaning), making Cosine often preferred.
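The contrast can be seen with two vectors that point in the same direction but differ in length (toy numbers, purely illustrative): the Euclidean distance grows with the magnitude gap, while the cosine similarity stays at 1.

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = 10 * a                       # same direction, ten times the magnitude

    euclidean = np.linalg.norm(a - b)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(euclidean)                 # large: driven by the length difference
    print(cosine)                    # 1.0: the angle between the vectors is 0
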
29. Which concept explains why 'Paris' is to 'France' as 'Tokyo' is to 'Japan' in vector space?
A. One-hot encoding
B. Linear substructures / Parallelism
C. Orthogonality
D. Singular Value Decomposition
Correct Answer: Linear substructures / Parallelism
Explanation: The relationship vectors (Country -> Capital) tend to be parallel and of similar length, allowing for linear algebraic analogies.
30. How does PCA reduce dimensions?
A. By deleting the last 50 columns of data
B. By projecting data onto new axes that minimize information loss
C. By averaging all data points to zero
D. By removing words with fewer than 3 letters
Correct Answer: By projecting data onto new axes that minimize information loss
Explanation: PCA constructs new orthogonal axes (principal components) and projects data onto the top k axes that retain the most variance.
31. In the context of relationships between words, 'distributional semantics' suggests that:
A. Words are defined by their spelling
B. Words are defined by their dictionary definitions
C. Words that appear in similar contexts have similar meanings
D. Words are unrelated entities
Correct Answer: Words that appear in similar contexts have similar meanings
Explanation: This is the core hypothesis behind VSMs: 'You shall know a word by the company it keeps.'
32. When training CBOW, what is the 'target'?
A. The next sentence
B. The sentiment of the sentence
C. The center word
D. The part of speech
Correct Answer: The center word
Explanation: The objective of CBOW is to correctly predict the center word given the context words.
33. What happens to the vectors of synonyms (e.g., 'huge' and 'enormous') during training?
A. They become orthogonal
B. They move closer together
C. They move infinitely far apart
D. One replaces the other
Correct Answer: They move closer together
Explanation: Since synonyms appear in similar contexts, the training process adjusts their vectors to be spatially proximal.
34. If you want to visualize a 1000-word subset of your vocabulary using PCA, what is the shape of the input matrix?
A. 1000 x 2
B. 1000 x Dimension_of_Embedding
C. Dimension_of_Embedding x 1000
D. 2 x 2
Correct Answer: 1000 x Dimension_of_Embedding
Explanation: The input data matrix for PCA consists of N samples (1000 words) by D features (the embedding dimension).
35. In cross-lingual information retrieval, query translation can be achieved by:
A. Multiplying the query vector by a transformation matrix
B. Re-training the model from scratch
C. Using a dictionary lookup only
D. Ignoring the language difference
Correct Answer: Multiplying the query vector by a transformation matrix
Explanation: The query vector in the source language is projected into the target language space using the learned transformation matrix.
36. What is a 'context window'?
A. The software used to view the code
B. The number of words before and after a target word
C. The graphical user interface
D. The time limit for training
Correct Answer: The number of words before and after a target word
Explanation: The context window defines the span of text surrounding a target word used to learn dependencies.
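A small sketch of how context windows can be extracted from a token list (the helper name and the sentence are made up): each center word is paired with up to window-size words on each side.

    def context_pairs(tokens, window=2):
        # pair each center word with up to `window` words on its left and right
        pairs = []
        for i, center in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            pairs.append((left + right, center))
        return pairs

    for context, center in context_pairs("the quick brown fox jumps".split(), window=2):
        print(context, "->", center)
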
37. Which of the following is NOT a step in performing PCA?
A. Standardizing the data
B. Calculating the covariance matrix
C. Computing eigenvectors and eigenvalues
D. Applying a Softmax function
Correct Answer: Applying a Softmax function
Explanation: Softmax is an activation function for neural networks; PCA involves covariance, eigenvectors, and projection.
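A minimal from-scratch sketch of those PCA steps, run on a random matrix shaped like the 1000 x embedding-dimension input from question 34 (the data here is random noise, used only to show the mechanics).

    import numpy as np

    X = np.random.rand(1000, 300)              # e.g. 1000 words x 300-dimensional embeddings

    X_centered = X - X.mean(axis=0)            # 1) mean-center the data
    cov = np.cov(X_centered, rowvar=False)     # 2) covariance matrix (300 x 300)
    eigvals, eigvecs = np.linalg.eigh(cov)     # 3) eigenvalues/eigenvectors (ascending order)
    top2 = eigvecs[:, -2:]                     # keep the two components with the largest eigenvalues
    X_2d = X_centered @ top2                   # 4) project -> shape (1000, 2), ready to plot
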
38. If a word vector has a magnitude of 1, it is called:
A. A normalized vector
B. A sparse vector
C. A binary vector
D. A complex vector
Correct Answer: A normalized vector
Explanation: A vector with a length (L2 norm) of 1 is normalized (unit vector).
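A one-step sketch of normalizing a vector to unit length with NumPy (the example vector is arbitrary):

    import numpy as np

    v = np.array([3.0, 4.0])
    unit = v / np.linalg.norm(v)     # divide by the L2 norm (here 5.0)
    print(np.linalg.norm(unit))      # 1.0 -> a normalized (unit) vector
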
39. Which approach is generally faster to train: CBOW or Skip-gram?
A. CBOW
B. Skip-gram
C. They are exactly the same
D. Neither is trainable
Correct Answer: CBOW
Explanation: CBOW is generally faster to train than Skip-gram because it treats the entire context as one observation, whereas Skip-gram treats each context-target pair as a new observation.
40. To capture dependencies between words that are far apart in a sentence, one should:
A. Decrease the window size
B. Increase the window size
C. Set window size to 0
D. Use one-hot encoding
Correct Answer: Increase the window size
Explanation: A larger window size captures broader context and longer-range dependencies, though it may introduce more noise.
41. The 'Manifold Hypothesis' in NLP suggests that:
A. Language is flat
B. High-dimensional language data lies on a lower-dimensional manifold
C. All words are equidistant
D. Vectors must be 3D
Correct Answer: High-dimensional language data lies on a lower-dimensional manifold
Explanation: This hypothesis justifies dimensionality reduction, suggesting that real-world data points cluster on a lower-dimensional surface embedded in the high-dimensional space.
42. When performing vector arithmetic for 'Paris - France + Italy', the result is likely closest to:
A. London
B. Rome
C. Germany
D. Pizza
Correct Answer: Rome
Explanation: This operation transfers the relationship 'Capital of' from France to Italy.
43. What is the dimensionality of the transformation matrix R used to map a source space of dimension D to a target space of dimension D?
A. D x 1
B. 1 x D
C. D x D
D. 2D x 2D
Correct Answer: D x D
Explanation: To map a D-dimensional vector to another D-dimensional vector via linear transformation, a D x D matrix is required.
44. Which vector operation is primarily used to measure the relevance of a document to a search query in VSM?
A. Vector Addition
B. Scalar Multiplication
C. Cosine Similarity
D. Vector Subtraction
Correct Answer: Cosine Similarity
Explanation: Relevance is usually determined by how close (similar) the document vector is to the query vector.
45. Sparse vectors (like Bag-of-Words) are characterized by:
A. Mostly zero values
B. Mostly non-zero values
C. Complex numbers
D. Negative numbers only
Correct Answer: Mostly zero values
Explanation: In a large vocabulary, a single document contains only a few unique words, resulting in vectors with mostly zeros.
46. Word embeddings capture which type of relationships?
47. Before applying PCA, it is standard practice to:
A. Mean-center the data
B. Square the data
C. Randomize the data
D. Invert the data
Correct Answer: Mean-center the data
Explanation: PCA computes directions of variance around the mean, so mean-centering ensures the first principal component passes through the center of the data cloud rather than being pulled toward the origin.
48. In the analogy 'A is to B as C is to D', which equation represents the relationship in vector space?
A. B - A = D - C
B. B + A = D + C
C. B * A = D * C
D. B / A = D / C
Correct Answer: B - A = D - C
Explanation: The difference vector (relationship) between B and A should be roughly the same as the difference between D and C.
49. Why might we use PCA on word vectors before performing clustering?
A. To increase the number of dimensions
B. To remove noise and reduce computational cost
C. To convert vectors to text
D. To translate the language
Correct Answer: To remove noise and reduce computational cost
Explanation: Reducing dimensions helps remove variance that may be noise and speeds up clustering algorithms (like K-Means).
50. Which technique allows checking if the transformation matrix between two languages is accurate?
A. Checking the accuracy of translation on a hold-out dictionary
B. Measuring the vector length
C. Checking if the matrix is square
D. Calculating the determinant
Correct Answer: Checking the accuracy of translation on a hold-out dictionary
Explanation: Evaluation involves applying the transformation to known words (not in the training set) and checking if the nearest neighbor in the target space is the correct translation.
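A rough sketch of such an evaluation loop; the function and variable names are hypothetical, and the vector dictionaries would come from pre-trained embeddings plus a held-out bilingual lexicon.

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def translation_accuracy(R, src_vecs, tgt_vecs, gold):
        # src_vecs: {source_word: vector}, tgt_vecs: {target_word: vector},
        # gold: {source_word: correct_target_word} from the hold-out dictionary
        correct = 0
        for word, vec in src_vecs.items():
            projected = vec @ R                                        # map into the target space
            nearest = max(tgt_vecs, key=lambda t: cosine(projected, tgt_vecs[t]))
            correct += int(nearest == gold[word])
        return correct / len(src_vecs)
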