1. What is the fundamental representation of a word in a Vector Space Model (VSM)?
A. A linked list
B. A binary tree structure
C. A vector of real numbers
D. A scalar integer
Correct Answer: A vector of real numbers
Explanation:
In Vector Space Models, words are mapped to vectors of real numbers within a continuous vector space.
2. Which of the following is a primary advantage of dense word vectors over one-hot encoding?
A. They capture semantic relationships
B. They are strictly binary
C. They are easier to calculate manually
D. They use more memory
Correct Answer: They capture semantic relationships
Explanation:
Dense vectors place similar words close to each other in the vector space, capturing semantic meaning, whereas one-hot vectors are orthogonal and equidistant.
3. In the context of the Continuous Bag-of-Words (CBOW) model, what is the input to the neural network?
A. A random noise vector
B. The center word
C. The context words
D. The entire document
Correct Answer: The context words
Explanation:
CBOW predicts the current (center) word based on the surrounding context words.
4. How is 'cosine similarity' calculated between two word vectors, A and B?
A. Cross product of A and B
B. Euclidean distance between A and B
C. Dot product of A and B divided by the product of their magnitudes
D. Sum of elements in A minus sum of elements in B
Correct Answer: Dot product of A and B divided by the product of their magnitudes
Explanation:
Cosine similarity measures the cosine of the angle between two vectors, calculated as (A . B) / (||A|| * ||B||).
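A minimal NumPy sketch of this formula (the two vectors below are made-up toy values, not trained embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: (A . B) / (||A|| * ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "word vectors"
king = np.array([0.8, 0.1, 0.5])
queen = np.array([0.7, 0.3, 0.6])
print(cosine_similarity(king, queen))  # close to 1.0 for similar directions
```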
5. If two word vectors have a cosine similarity of 1, what does this imply?
A. The words are unrelated
B. The vectors are orthogonal
C. The vectors point in exactly the same direction
D. The words represent opposite meanings
Correct Answer: The vectors point in exactly the same direction
Explanation:
A cosine similarity of 1 implies the angle between the vectors is 0 degrees, meaning they are identical in orientation (highly similar).
6. Which architecture predicts the surrounding words given a center word?
A. Continuous Bag-of-Words (CBOW)
B. Principal Component Analysis
C. Latent Dirichlet Allocation
D. Skip-gram
Correct Answer: Skip-gram
Explanation:
Skip-gram is the inverse of CBOW; it takes the center word as input and tries to predict the context words.
7. What mathematical technique is commonly used to visualize high-dimensional word vectors in two dimensions?
A. Linear Regression
B. Logistic Regression
C. Fourier Transform
D. Principal Component Analysis (PCA)
Correct Answer: Principal Component Analysis (PCA)
Explanation:
PCA is a dimensionality reduction technique used to project high-dimensional data into lower dimensions (like 2D) while preserving variance.
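A sketch of how such a 2-D projection might be done with scikit-learn and matplotlib; the embedding matrix and vocabulary here are random stand-ins, not a trained model:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 300))      # stand-in for 1000 word vectors
words = [f"word_{i}" for i in range(1000)]     # stand-in vocabulary

coords = PCA(n_components=2).fit_transform(embeddings)  # project 300-D -> 2-D

plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in zip(words[:10], coords[:10]):        # label a few points
    plt.annotate(word, (x, y))
plt.show()
```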
8. In vector arithmetic for analogies, what result is expected for vector('King') - vector('Man') + vector('Woman')?
A. Monarch
B. Princess
C. Prince
D. Queen
Correct Answer: Queen
Explanation:
This is a classic example of semantic relationships in vector space, where arithmetic operations preserve gender and royalty relationships.
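A toy illustration of answering the analogy by nearest-neighbor search; the four vectors are invented for the example, and in practice a trained embedding table would be used instead:

```python
import numpy as np

# Invented 4-dimensional vectors standing in for trained embeddings
vocab = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = vocab["king"] - vocab["man"] + vocab["woman"]
# Exclude the query words, then return the most similar remaining word
candidates = {w: v for w, v in vocab.items() if w not in ("king", "man", "woman")}
print(max(candidates, key=lambda w: cosine(candidates[w], target)))  # queen
```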
9. What is the role of the 'window size' hyperparameter in CBOW?
A. It sets the learning rate
B. It determines the number of dimensions in the vector
C. It determines the number of epochs
D. It defines how many neighbors to consider as context
Correct Answer: It defines how many neighbors to consider as context
Explanation:
The window size determines the scope of the context (e.g., 2 words to the left and 2 to the right) used to predict the center word.
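A small sketch of how a window size of 2 turns a sentence into (context, center) training pairs; the helper name is hypothetical:

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, center_word) pairs for the given window size."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, center

sentence = "the quick brown fox jumps over the lazy dog".split()
for context, center in cbow_pairs(sentence, window=2):
    print(center, "<-", context)
```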
10. Why are vector space models useful for information retrieval/document search?
A. They eliminate the need for indexing
B. They can match queries to documents based on semantic similarity
C. They allow exact keyword matching only
D. They work best with images
Correct Answer: They can match queries to documents based on semantic similarity
Explanation:
VSMs allow calculating similarity between a query vector and document vectors, retrieving relevant results even if exact keywords differ.
11. When transforming word vectors from one language to another (e.g., English to French) using a linear mapping, what are we trying to learn?
A. A decision tree
B. A clustering algorithm
C. A rotation matrix
D. A binary classifier
Correct Answer: A rotation matrix
Explanation:
Cross-lingual embedding alignment often involves learning a transformation matrix (often a rotation matrix) that maps the source vector space to the target vector space.
12. In PCA, the first principal component is the direction that maximizes what?
A. The number of clusters
B. The error rate
C. The cosine similarity
D. The variance of the data
Correct Answer: The variance of the data
Explanation:
PCA identifies the axes (principal components) along which the data varies the most to preserve information.
13. Which of the following best describes the 'bag-of-words' model assumption?
A. Word order is critical for meaning
B. Word order is ignored, only frequency counts matter
C. Grammar rules are strictly enforced
D. Dependencies between words are preserved
Correct Answer: Word order is ignored, only frequency counts matter
Explanation:
Bag-of-Words treats a text as an unordered collection of words, disregarding grammar and word order.
14. In the CBOW model, how are the input context vectors usually handled before passing to the hidden layer?
A. Only the first word is used
B. They are averaged or summed
C. They are concatenated
D. They are multiplied
Correct Answer: They are averaged or summed
Explanation:
In standard CBOW, the vectors of the context words are averaged or summed to create a single projection layer input.
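A minimal sketch of this projection step, assuming a randomly initialized input embedding matrix and a handful of hypothetical context-word indices:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # input embeddings
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))  # output weights

context_ids = [12, 45, 301, 77]          # indices of the context words
hidden = W_in[context_ids].mean(axis=0)  # average the context vectors
scores = hidden @ W_out                  # one raw score per vocabulary word
print(hidden.shape, scores.shape)        # (300,) (10000,)
```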
15. What does a cosine similarity of 0 indicate between two word vectors?
A. They are opposite
B. They are identical
C. They are orthogonal (unrelated)
D. One is a scalar multiple of the other
Correct Answer: They are orthogonal (unrelated)
Explanation:
A cosine similarity of 0 means the angle is 90 degrees (orthogonal), implying no correlation in the vector space.
16. Which loss function is typically minimized when aligning two vector spaces (X and Y) via a transformation matrix R?
A. Hinge loss
B. Frobenius norm of (XR - Y)
C. Cross-entropy loss
D. Accuracy score
Correct Answer: Frobenius norm of (XR - Y)
Explanation:
The goal is to minimize the distance between the transformed source vectors (XR) and the target vectors (Y), typically measured with the Frobenius norm, the matrix analogue of Euclidean distance.
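A minimal gradient-descent sketch of this objective; X and Y are random stand-ins for aligned bilingual word vectors, and the learning rate and iteration count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50
X = rng.normal(size=(n, d))   # source-language vectors (stand-ins)
Y = rng.normal(size=(n, d))   # vectors of their translations (stand-ins)

R = np.eye(d)                 # transformation matrix to learn
lr = 1e-3
for _ in range(200):
    diff = X @ R - Y
    loss = (diff ** 2).sum() / n      # squared Frobenius norm (averaged)
    grad = 2 * (X.T @ diff) / n       # gradient of the loss w.r.t. R
    R -= lr * grad
print(loss)
```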
17. Deep Learning vector models like Word2Vec are often referred to as:
A. Hierarchical clusters
B. Prediction-based embeddings
C. Sparse embeddings
D. Count-based matrices
Correct Answer: Prediction-based embeddings
Explanation:
Word2Vec models learn embeddings by training a network to predict words (or context) rather than counting co-occurrences directly.
18. If 'Apple' and 'Pear' are close in vector space, this indicates:
A. Morphological similarity
B. Phonetic similarity
C. Syntactic similarity
D. Semantic similarity
Correct Answer: Semantic similarity
Explanation:
Proximity in vector space usually implies that words share similar contexts and meanings (semantics).
19. When visualizing word vectors, why can't we simply plot the 300-dimensional vectors directly?
A. Human visual perception is limited to 2 or 3 dimensions
B. The vectors become binary
C. It would take too long to render
D. Computers cannot store 300 dimensions
Correct Answer: Human visual perception is limited to 2 or 3 dimensions
Explanation:
We need dimensionality reduction like PCA because humans cannot visually conceptualize spaces higher than 3 dimensions.
20. Which of the following is NOT a benefit of using Vector Space Models in Machine Translation?
A. Guaranteeing grammatically perfect sentences
B. Improving alignment of synonyms
C. Handling rare words via similarity
D. Mapping entire languages without parallel corpora (unsupervised)
Correct Answer: Guaranteeing grammatically perfect sentences
Explanation:
While VSMs help with word alignment and meaning transfer, they do not inherently guarantee the grammatical correctness of the generated sentence.
21. In a Word2Vec model, the dimension of the hidden layer corresponds to:
A. The number of training documents
B. The size of the word embedding vector
C. The window size
D. The vocabulary size
Correct Answer: The size of the word embedding vector
Explanation:
The weight matrix between the input and hidden layers stores the word embeddings, so the size of the hidden layer is the embedding dimension.
22. To perform document search using word vectors, how might one represent a whole document?
A. By using the vector of the first word only
B. By taking the average (centroid) of all word vectors in the document
C. By concatenating all vectors into one giant vector
D. By summing the ASCII values of characters
Correct Answer: By taking the average (centroid) of all word vectors in the document
Explanation:
A common baseline approach is to average the word vectors of all words in the document to get a single document vector.
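A small sketch of the centroid approach; the embedding lookup below is a random placeholder for a real pretrained table:

```python
import numpy as np

def doc_vector(tokens, embeddings, dim=300):
    """Average (centroid) of the vectors of the known words in a document."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=300) for w in "cat dog pet food stock market".split()}

doc = "the cat and the dog eat pet food".split()
print(doc_vector(doc, embeddings).shape)  # (300,)
```

The resulting document vectors can then be ranked against a query vector with cosine similarity.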
23. What is the 'Curse of Dimensionality' in the context of NLP?
A. The time it takes to train a model
B. The inability to use PCA
C. Having too few dimensions to represent meaning
D. Data becoming sparse and distance metrics becoming less meaningful in very high dimensions
Correct Answer: Data becoming sparse and distance metrics becoming less meaningful in very high dimensions
Explanation:
In extremely high-dimensional spaces (like raw vocabulary size), data points are sparse, and traditional distance metrics can behave counterintuitively.
24. Which algebraic structure is used to transform word vectors from a source language space to a target language space?
A. A transformation matrix
B. A scalar
C. A vector
D. A tensor of rank 3
Correct Answer: A transformation matrix
Explanation:
Multiplying the source-language vectors X by a transformation matrix R (computing XR) maps them from the source space into the target space.
25. In PCA, what are 'eigenvalues' used for?
A. To label the axes
B. To determine the direction of axes
C. To quantify the variance explained by each principal component
D. To calculate the dot product
Correct Answer: To quantify the variance explained by each principal component
Explanation:
Eigenvalues indicate the magnitude of variance captured by their corresponding eigenvectors (principal components).
26. Which word pair would likely have the highest Euclidean distance in a well-trained vector space?
A. Car - Automobile
B. Happy - Joyful
C. Frog - Toad
D. Computer - Sandwich
Correct Answer: Computer - Sandwich
Explanation:
'Computer' and 'Sandwich' are semantically unrelated, so they would be far apart in the vector space compared to synonyms.
27. The output layer of a standard CBOW model typically uses which activation function to generate probabilities?
A. Softmax
B. Tanh
C. Sigmoid
D. ReLU
Correct Answer: Softmax
Explanation:
Softmax is used to convert the raw output scores into a probability distribution over the vocabulary.
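A numerically stable softmax sketch in NumPy (the input scores are arbitrary example values):

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Turn raw scores into a probability distribution over the vocabulary."""
    shifted = scores - scores.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # probabilities that sum to 1.0
```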
28. What is the main limitation of using Euclidean distance for word vectors compared to Cosine similarity?
A. It cannot handle negative numbers
B. It is computationally harder
C. It is sensitive to the magnitude (length) of the vectors
D. It only works in 2D
Correct Answer: It is sensitive to the magnitude (length) of the vectors
Explanation:
Euclidean distance is affected by vector length (frequency of words), while Cosine similarity focuses on the angle (orientation/meaning), making Cosine often preferred.
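A quick demonstration of the difference: scaling a vector changes its Euclidean distance to the original but not its cosine similarity (the vector values are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, ten times the magnitude

print(np.linalg.norm(a - b))                                   # large distance
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # 1.0 (same direction)
```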
29. Which concept explains why 'Paris' is to 'France' as 'Tokyo' is to 'Japan' in vector space?
A. Orthogonality
B. Linear substructures / Parallelism
C. Singular Value Decomposition
D. One-hot encoding
Correct Answer: Linear substructures / Parallelism
Explanation:
The relationship vectors (Country -> Capital) tend to be parallel and of similar length, allowing for linear algebraic analogies.
30. How does PCA reduce dimensions?
A. By averaging all data points to zero
B. By deleting the last 50 columns of data
C. By projecting data onto new axes that minimize information loss
D. By removing words with fewer than 3 letters
Correct Answer: By projecting data onto new axes that minimize information loss
Explanation:
PCA constructs new orthogonal axes (principal components) and projects data onto the top k axes that retain the most variance.
31. In the context of relationships between words, 'distributional semantics' suggests that:
A. Words are defined by their spelling
B. Words are unrelated entities
C. Words that appear in similar contexts have similar meanings
D. Words are defined by their dictionary definitions
Correct Answer: Words that appear in similar contexts have similar meanings
Explanation:
This is the core hypothesis behind VSMs: 'You shall know a word by the company it keeps'.
32. When training CBOW, what is the 'target'?
A. The next sentence
B. The sentiment of the sentence
C. The center word
D. The part of speech
Correct Answer: The center word
Explanation:
The objective of CBOW is to correctly predict the center word given the context words.
33. What happens to the vectors of synonyms (e.g., 'huge' and 'enormous') during training?
A. They become orthogonal
B. They move closer together
C. They move infinitely far apart
D. One replaces the other
Correct Answer: They move closer together
Explanation:
Since synonyms appear in similar contexts, the training process adjusts their vectors to be spatially proximal.
34. If you want to visualize a 1000-word subset of your vocabulary using PCA, what is the shape of the input matrix?
A. 1000 x 2
B. 2 x 2
C. 1000 x Dimension_of_Embedding
D. Dimension_of_Embedding x 1000
Correct Answer: 1000 x Dimension_of_Embedding
Explanation:
The input data matrix for PCA consists of N samples (1000 words) by D features (the embedding dimension).
35. In cross-lingual information retrieval, query translation can be achieved by:
A. Re-training the model from scratch
B. Using a dictionary lookup only
C. Ignoring the language difference
D. Multiplying the query vector by a transformation matrix
Correct Answer: Multiplying the query vector by a transformation matrix
Explanation:
The query vector in the source language is projected into the target language space using the learned transformation matrix.
36. What is a 'context window'?
A. The software used to view the code
B. The number of words before and after a target word
C. The graphical user interface
D. The time limit for training
Correct Answer: The number of words before and after a target word
Explanation:
The context window defines the span of text surrounding a target word used to learn dependencies.
37. Which of the following is NOT a step in performing PCA?
A. Standardizing the data
B. Applying a Softmax function
C. Calculating the covariance matrix
D. Computing eigenvectors and eigenvalues
Correct Answer: Applying a Softmax function
Explanation:
Softmax is an activation function for neural networks; PCA involves covariance, eigenvectors, and projection.
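A from-scratch NumPy sketch of those PCA steps (mean-centering, covariance, eigendecomposition, projection), using random stand-in word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))          # stand-in word vectors

X_centered = X - X.mean(axis=0)           # 1. mean-center the data
cov = np.cov(X_centered, rowvar=False)    # 2. covariance matrix (300 x 300)
eigvals, eigvecs = np.linalg.eigh(cov)    # 3. eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]         #    sort by variance explained
top2 = eigvecs[:, order[:2]]
coords = X_centered @ top2                # 4. project onto the top 2 components
print(coords.shape)                       # (1000, 2)
```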
38. If a word vector has a magnitude of 1, it is called:
A. A complex vector
B. A sparse vector
C. A binary vector
D. A normalized vector
Correct Answer: A normalized vector
Explanation:
A vector with a length (L2 norm) of 1 is normalized (unit vector).
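For example, dividing a vector by its L2 norm produces a unit (normalized) vector:

```python
import numpy as np

v = np.array([3.0, 4.0])
unit = v / np.linalg.norm(v)   # divide by the L2 norm (here 5.0)
print(np.linalg.norm(unit))    # 1.0
```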
39. Which approach is generally faster to train: CBOW or Skip-gram?
A. Skip-gram
B. CBOW
C. Neither is trainable
D. They are exactly the same
Correct Answer: CBOW
Explanation:
CBOW is generally faster to train than Skip-gram because it treats the entire context as one observation, whereas Skip-gram treats each context-target pair as a new observation.
40. To capture dependencies between words that are far apart in a sentence, one should:
A. Use one-hot encoding
B. Increase the window size
C. Decrease the window size
D. Set window size to 0
Correct Answer: Increase the window size
Explanation:
A larger window size captures broader context and longer-range dependencies, though it may introduce more noise.
41. The 'Manifold Hypothesis' in NLP suggests that:
A. High-dimensional language data lies on a lower-dimensional manifold
B. All words are equidistant
C. Vectors must be 3D
D. Language is flat
Correct Answer: High-dimensional language data lies on a lower-dimensional manifold
Explanation:
This hypothesis justifies dimensionality reduction, suggesting that real-world data points cluster on a lower-dimensional surface embedded in the high-dimensional space.
42. When performing vector arithmetic for 'Paris - France + Italy', the result is likely closest to:
A. Germany
B. Rome
C. Pizza
D. London
Correct Answer: Rome
Explanation:
This operation transfers the relationship 'Capital of' from France to Italy.
43. What is the dimensionality of the transformation matrix R used to map a source space of dimension D to a target space of dimension D?
A. D x 1
B. 2D x 2D
C. 1 x D
D. D x D
Correct Answer: D x D
Explanation:
To map a D-dimensional vector to another D-dimensional vector via linear transformation, a D x D matrix is required.
44. Which vector operation is primarily used to measure the relevance of a document to a search query in VSM?
A. Scalar Multiplication
B. Vector Subtraction
C. Cosine Similarity
D. Vector Addition
Correct Answer: Cosine Similarity
Explanation:
Relevance is usually determined by how close (similar) the document vector is to the query vector.
45. Sparse vectors (like Bag-of-Words) are characterized by:
A. Mostly zero values
B. Negative numbers only
C. Mostly non-zero values
D. Complex numbers
Correct Answer: Mostly zero values
Explanation:
In a large vocabulary, a single document contains only a few unique words, resulting in vectors with mostly zeros.
46. Word embeddings capture which type of relationships?
A. Neither
B. Only syntactic
C. Only semantic
D. Both syntactic and semantic
Correct Answer: Both syntactic and semantic
Explanation:
Good embeddings capture semantic meanings (King-Queen) and syntactic rules (walk-walking, swim-swimming).
47. Before applying PCA, it is standard practice to:
A. Square the data
B. Invert the data
C. Mean-center the data
D. Randomize the data
Correct Answer: Mean-center the data
Explanation:
PCA is computed from the data's covariance structure; subtracting the mean of each feature (mean-centering) ensures the principal components describe variance around the data's centroid rather than around the origin.
48. In the analogy 'A is to B as C is to D', which equation represents the relationship in vector space?
A. B / A = D / C
B. B - A = D - C
C. B × A = D × C
D. B + A = D + C
Correct Answer: B - A = D - C
Explanation:
The difference vector (relationship) between B and A should be roughly the same as the difference between D and C.
49. Why might we use PCA on word vectors before performing clustering?
A. To convert vectors to text
B. To remove noise and reduce computational cost
C. To increase the number of dimensions
D. To translate the language
Correct Answer: To remove noise and reduce computational cost
Explanation:
Reducing dimensions helps remove variance that may be noise and speeds up clustering algorithms (like K-Means).
50. Which technique allows checking if the transformation matrix between two languages is accurate?
A. Calculating the determinant
B. Checking the accuracy of translation on a hold-out dictionary
C. Measuring the vector length
D. Checking if the matrix is square
Correct Answer: Checking the accuracy of translation on a hold-out dictionary
Explanation:
Evaluation involves applying the transformation to known words (not in the training set) and checking if the nearest neighbor in the target space is the correct translation.
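A sketch of that evaluation as nearest-neighbor translation accuracy; all argument names are hypothetical and stand for the held-out source vectors, the full target-language embedding matrix, the learned matrix R, and the row indices of the correct translations:

```python
import numpy as np

def translation_accuracy(X_test, Y_all, R, gold_indices):
    """Fraction of held-out source words whose nearest neighbor in the
    target space (after applying R) is the correct translation."""
    projected = X_test @ R
    proj = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    targets = Y_all / np.linalg.norm(Y_all, axis=1, keepdims=True)
    nearest = (proj @ targets.T).argmax(axis=1)   # cosine nearest neighbor
    return float((nearest == np.asarray(gold_indices)).mean())
```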