Unit 2 - Notes

INT344

Unit 2: Vector Space Models

1. Vector Space Models (VSM) Overview

Vector Space Models are algebraic models used for representing text documents (and by extension, individual words) as vectors of identifiers. They form the foundation of modern NLP by translating linguistic elements into numerical formats that machine learning algorithms can process.

Fundamental Concept

In a Vector Space Model, a vocabulary of size V forms a V-dimensional space. Words or documents are represented as points (vectors) within this hyperspace.

  • One-Hot Encoding (Sparse Vectors):
    • Each word is a vector of zeros with a single 1 at the index corresponding to that word.
    • Limitation: Orthogonal vectors. No notion of similarity (e.g., the dot product between "car" and "automobile" is 0). High dimensionality and sparsity.
  • Word Embeddings (Dense Vectors):
    • Words are mapped to vectors of real numbers in a lower-dimensional space (e.g., 50 to 300 dimensions).
    • Advantage: Captures semantic relationships.
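
As a minimal illustration of the contrast (toy vectors invented for this example, not taken from the notes): one-hot vectors of distinct words always have zero similarity, while dense vectors can place related words close together.

```python
import numpy as np

# Toy vocabulary; indices are arbitrary.
vocab = ["car", "automobile", "banana"]

# One-hot (sparse) representation: one dimension per word.
one_hot = np.eye(len(vocab))
print(one_hot[0] @ one_hot[1])   # car . automobile = 0.0 -> looks "unrelated"

# Hand-picked dense vectors for illustration only (real embeddings are learned).
dense = {
    "car":        np.array([0.90, 0.10, 0.00]),
    "automobile": np.array([0.85, 0.15, 0.05]),
    "banana":     np.array([0.00, 0.20, 0.90]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(dense["car"], dense["automobile"]))  # close to 1 (similar)
print(cosine(dense["car"], dense["banana"]))      # much smaller (dissimilar)
```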

Capturing Semantic Meaning

The primary goal of modern VSMs (like Word2Vec, GloVe) is to capture semantics—the meaning of words.

  • Distributional Hypothesis: The core linguistic theory driving VSMs. It states that "words that occur in the same contexts tend to have similar meanings."
  • Contextual Similarity: If "coffee" and "tea" both appear frequently near words like "drink", "hot", "cup", and "morning", the model places their vectors close together in the vector space.
  • Result: The geometric distance between vectors correlates with semantic similarity.

2. Continuous Bag-of-Words (CBOW) Model

CBOW is one of the two main architectures (alongside Skip-Gram) used in the Word2Vec family of algorithms to generate dense word embeddings.

Architecture

The objective of CBOW is to predict a target word (center word) based on its context words (surrounding words).

  1. Input Layer:
    • Input consists of context words (defined by a window size, e.g., 2 words before and 2 words after).
    • These words are represented as one-hot vectors.
  2. Hidden Layer (Projection Layer):
    • The one-hot vectors are multiplied by an input weight matrix (W).
    • Since inputs are one-hot, this acts as a lookup.
    • Crucially, the vectors of all context words are averaged in this layer. This "averaging" implies the order of context words does not matter (hence "Bag-of-Words").
  3. Output Layer:
    • The averaged vector is multiplied by an output weight matrix (W').
    • A Softmax activation function is applied to generate a probability distribution over the entire vocabulary.
    • The word with the highest probability is the predicted target word.
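
A rough numpy sketch of a single CBOW forward pass following the three layers above; the vocabulary size, embedding size, and variable names (W_in, W_out) are illustrative assumptions, not from the notes.

```python
import numpy as np

V, N = 5000, 100                 # vocabulary size, embedding dimension (illustrative)
rng = np.random.default_rng(0)
W_in  = rng.normal(scale=0.01, size=(V, N))   # input weight matrix W (lookup table)
W_out = rng.normal(scale=0.01, size=(N, V))   # output weight matrix W'

def cbow_forward(context_ids):
    """Predict a probability distribution over the vocabulary from context word ids."""
    # 1. Multiplying a one-hot vector by W_in is just a row lookup;
    #    average the context embeddings to get the projection layer.
    h = W_in[context_ids].mean(axis=0)        # shape (N,)
    # 2. Score every vocabulary word, then apply softmax.
    scores = h @ W_out                        # shape (V,)
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs

probs = cbow_forward(np.array([10, 42, 77, 103]))  # four context word ids
print(probs.argmax())                              # id of the predicted target word
```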

Sliding Window Mechanism

  • The model moves a window across the entire text corpus.
  • Example Sentence: "The quick brown fox jumps."
  • Window size: 1.
  • Step 1: Context: ["The", "brown"], Target: "quick"
  • Step 2: Context: ["quick", "fox"], Target: "brown"
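
A small helper showing how such (context, target) pairs could be generated; this is a sketch, not an implementation prescribed by the notes.

```python
def context_target_pairs(tokens, window=1):
    """Yield (context_words, target_word) for each position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right          # edge positions get truncated contexts
        if context:
            yield context, target

sentence = "The quick brown fox jumps".split()
for context, target in context_target_pairs(sentence, window=1):
    print(context, "->", target)
# ['quick'] -> The
# ['The', 'brown'] -> quick
# ['quick', 'fox'] -> brown
# ...
```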

Mathematical Objective

The model learns by minimizing the loss (typically Cross-Entropy Loss) between the predicted probability distribution and the actual one-hot vector of the target word. Backpropagation updates the weights (W and W'), and the rows of the learned input matrix W are typically taken as the word embeddings.
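
For a single (context, target) pair, this objective can be written as follows; the notation (h for the averaged context vector, u_j for the output vector of word j, t for the index of the true target word) is assumed here for illustration.

```latex
% Softmax output for word j, given the averaged context vector h
% (u_j is the output vector for word j, i.e. a column of W'; V is the vocabulary size):
\hat{y}_j = \frac{\exp(u_j^{\top} h)}{\sum_{k=1}^{V} \exp(u_k^{\top} h)}

% Cross-entropy loss against the one-hot target (t = index of the true target word):
L = -\log \hat{y}_t
```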


3. Relationships and Dependencies between Words

Once the embeddings are trained via models like CBOW, the resulting vectors capture complex relationships.

Relationships between Words (Vector Arithmetic)

Dense vectors allow for algebraic operations that reflect semantic reasoning.

  • Analogy (Parallelogram Structure):
    The most famous property is the ability to solve analogies using vector subtraction and addition, e.g. vec("king") - vec("man") + vec("woman") ≈ vec("queen").

    This suggests the vector difference vec("king") - vec("man") encodes the concept of "royalty" or "leadership," while vec("man") - vec("woman") encodes "gender."

  • Cosine Similarity:
    To measure the relationship between two words, we use Cosine Similarity rather than Euclidean distance (it normalizes for vector length): cos(θ) = (A · B) / (||A|| ||B||).

    • 1: Perfect similarity (vectors point in the same direction, e.g., near-synonyms).
    • 0: No relationship (orthogonal vectors).
    • -1: Vectors point in opposite directions (sometimes interpreted as opposite meanings).
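
A compact sketch of the two operations above, cosine similarity and analogy solving by vector arithmetic; the dictionary `embeddings` (word → numpy vector) is assumed to exist, e.g., from a trained CBOW model.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||); ranges from -1 to 1."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c, embeddings):
    """Return the word d such that a : b :: c : d, i.e. d is closest to b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):            # exclude the query words themselves
            continue
        score = cosine_similarity(target, vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Usage (assuming `embeddings` maps words to numpy vectors):
# solve_analogy("man", "king", "woman", embeddings)   # ideally returns "queen"
```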

Capturing Dependencies

VSMs capture dependencies through co-occurrence statistics.

  • Syntactic Dependencies: From co-occurrence statistics alone, the model picks up regularities such as which verbs tend to follow which nouns and that adjectives typically precede the nouns they modify.
  • Semantic Dependencies:
    • Hypernymy/Hyponymy: "Car" is close to "Vehicle".
    • Meronymy: "Hand" is associated with "Finger".
    • These dependencies are encoded implicitly. While the model doesn't "know" grammar rules, the spatial arrangement of vectors respects grammatical and semantic classes.

4. Visualizing Relationships: Principal Component Analysis (PCA)

Word vectors usually exist in high-dimensional space (e.g., 300 dimensions). Humans cannot visualize this. Dimensionality reduction techniques like PCA are used to project these vectors into 2D or 3D for analysis.

PCA Overview

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called Principal Components.

The Process for Word Vectors

  1. Mean-normalize the Data: Subtract the mean of each dimension so the word vectors are centered at 0.
  2. Covariance Matrix: Compute the covariance matrix to understand how dimensions vary with respect to each other.
  3. Eigen-decomposition: Calculate Eigenvectors (directions of spread) and Eigenvalues (magnitude of spread) from the covariance matrix.
  4. Sort and Select: Sort Eigenvalues in descending order.
    • PC1 (x-axis): The Eigenvector with the highest Eigenvalue (captures the most variance).
    • PC2 (y-axis): The Eigenvector with the second highest Eigenvalue.
  5. Projection: Dot product the original high-dimensional data with the selected Eigenvectors to get 2D coordinates.
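
The steps above translate directly into numpy; this sketch assumes `vectors` is an (n_words, n_dims) matrix of word embeddings.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project word vectors onto the top principal components."""
    # 1. Mean-center each dimension.
    X = vectors - vectors.mean(axis=0)
    # 2. Covariance matrix of the dimensions.
    cov = np.cov(X, rowvar=False)
    # 3. Eigen-decomposition (eigh, since covariance matrices are symmetric).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort eigenvectors by descending eigenvalue and keep the top components.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]        # shape (n_dims, n_components)
    # 5. Project the centered data onto the principal components.
    return X @ components                      # shape (n_words, n_components)

# Usage: coords = pca_project(vectors)  ->  coords[:, 0] is PC1, coords[:, 1] is PC2
```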

Interpretation

When plotted on a 2D graph:

  • Clustering: Words like "apple", "banana", "pear" will cluster together (Fruits).
  • Orientation: Parallel offsets between related pairs (e.g., country-capital pairs such as France-Paris and Italy-Rome) indicate that relational structure is preserved.

5. Transform Word Vectors

Transforming word vectors often refers to aligning two different vector spaces. This is essential when comparing embeddings trained on different corpora or different languages.

The Problem

If you train Word2Vec on English text and French text separately, the vectors for "Cat" (English) and "Chat" (French) will not naturally align, because the random initializations and optimization paths were different. The two spaces have roughly the same internal geometry, but they are arbitrarily rotated relative to each other.

Linear Transformation (The Procrustes Problem)

To map a Source Space X to a Target Space Y:

  1. Identify a set of Anchor Words (pairs of words known to be translations, e.g., "one"-"un", "two"-"deux").
  2. Find a transformation matrix R (a Rotation Matrix) such that XR ≈ Y.
  3. The goal is to minimize the Frobenius norm ||XR - Y||_F.
  4. Once R is found, you can apply it to any word vector in the source language to find its location in the target-language space.
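
When R is constrained to be orthogonal, one standard way to obtain it is the closed-form solution of the orthogonal Procrustes problem via SVD (gradient descent on the squared Frobenius norm is a common alternative). A sketch, assuming X and Y are (n_anchors, d) matrices of stacked anchor-word vectors:

```python
import numpy as np

def learn_rotation(X, Y):
    """Orthogonal R minimizing ||X R - Y||_F, via SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt                              # shape (d, d)

def alignment_loss(X, Y, R):
    """Squared Frobenius norm of the residual, averaged over anchor pairs."""
    return float(np.linalg.norm(X @ R - Y, ord="fro") ** 2 / len(X))
```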

6. Applications: Machine Translation and Document Search

Machine Translation (Cross-Lingual Embeddings)

Using the transformation method described above:

  1. Training: Train monolingual embeddings for Language A and Language B.
  2. Alignment: Learn the rotation matrix R using a small dictionary of seed words.
  3. Translation: To translate a new word from Language A:
    • Compute the transformed vector vR, where v is the word's vector in Language A.
    • Find the nearest neighbor to vR in Language B's vector space.
    • This allows for translation of words not in the training dictionary (Zero-shot learning).
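
Putting the steps together, a hypothetical translation routine; the dictionaries `source_emb` and `target_emb` (word → vector) and the matrix `R` are assumed to come from the alignment step above.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def translate(word, source_emb, target_emb, R):
    """Map a source-language word into the target space and return its nearest neighbor."""
    v = source_emb[word] @ R                   # transformed vector vR
    return max(target_emb, key=lambda w: cosine(v, target_emb[w]))

# Usage: translate("cat", english_emb, french_emb, R)   # ideally returns "chat"
```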

Document Search (Information Retrieval)

VSMs improve search by moving beyond keyword matching (lexical search) to semantic search.

  • Document Vectors: A document can be represented as a vector. A simple baseline is the centroid: the average of all word vectors in the document.
  • Query Processing: The user's search query is also converted into a vector (average of query word vectors).
  • Ranking: Calculate the Cosine Similarity between the Query Vector and all Document Vectors.
  • Benefits:
    • Handles synonyms (searching for "automobile" returns documents containing "car").
    • Ranking is based on conceptual relevance, not just term frequency.
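
A minimal semantic-search sketch along these lines, using the centroid baseline; `embeddings` (word → vector) and plain-string documents are assumptions for illustration.

```python
import numpy as np

def doc_vector(text, embeddings):
    """Centroid baseline: average the vectors of the words we have embeddings for."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query, documents, embeddings, top_k=3):
    """Rank documents by cosine similarity between the query vector and document vectors."""
    q = doc_vector(query, embeddings)
    if q is None:
        return []
    scored = []
    for doc in documents:
        d = doc_vector(doc, embeddings)
        if d is not None:
            scored.append((cosine(q, d), doc))
    return [doc for _, doc in sorted(scored, reverse=True)[:top_k]]
```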