CSE472

Unit 2: Word Embeddings and Vector Representations

1. Vector Space Models (VSMs)

Vector Space Models represent text (words, sentences, or documents) as vectors of numbers in a shared multi-dimensional space. VSMs form the foundational mathematical framework for Natural Language Processing (NLP), allowing computers to process natural language using linear algebra and machine learning.

Traditional Sparse Representations

Before modern deep learning embeddings, VSMs relied on sparse, count-based representations:

One-Hot Encoding: Each word is represented by a binary vector with a length equal to the vocabulary size ($|V|$). The vector has a 1 at the index corresponding to the word and 0s everywhere else.
- Limitations: Extreme sparsity, massive memory requirements, and mutually orthogonal vectors (the dot product of any two distinct words is always 0, so no relationship or similarity between words can be measured).
Bag-of-Words (BoW) & TF-IDF: Represents documents by the counts or weighted frequencies of words.
- Limitations: Fails to capture word order (syntax) and context (semantics).
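
The orthogonality limitation of one-hot encoding can be seen in a minimal sketch. The three-word vocabulary and the helper name `one_hot` are illustrative assumptions, not part of any library.

```python
import numpy as np

vocab = ["cat", "dog", "car"]  # toy vocabulary (illustrative)
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

cat, dog, car = one_hot("cat"), one_hot("dog"), one_hot("car")

# Distinct words are always orthogonal, so "cat" looks no closer
# to "dog" than it does to "car".
print(cat @ dog, cat @ car, cat @ cat)
```

Only the dot product of a word with itself is non-zero, which is why one-hot vectors carry no notion of similarity.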

2. Dense Word Embeddings

Dense word embeddings are a paradigm shift from sparse vectors. Instead of representing a word as an isolated entity in a high-dimensional sparse space, dense embeddings represent words as continuous, real-valued vectors in a lower-dimensional space (typically 50 to 300 dimensions).

The Distributional Hypothesis

Dense embeddings are largely built upon the Distributional Hypothesis (Firth, 1957): "You shall know a word by the company it keeps." Words that appear in similar contexts tend to have similar meanings.

Advantages of Dense Embeddings

Dimensionality Reduction: Reduces the feature space from $|V|$ (often >100,000) to $d$ (e.g., 300), which mitigates the "curse of dimensionality."
Generalization: Continuous values allow machine learning models to generalize better across synonyms.
Semantic Meaning: Vector proximity correlates with semantic similarity.

3. Word2Vec Models

Developed by Tomas Mikolov et al. at Google (2013), Word2Vec is a predictive neural network framework used to learn dense word embeddings. It consists of two primary architectures: CBOW and Skip-Gram. Both are shallow, two-layer neural networks trained to reconstruct linguistic contexts.

Continuous Bag-of-Words (CBOW)

CBOW aims to predict a target word given its surrounding context words.

Input: Context words within a specific window size (e.g., 2 words before and 2 words after the target).
Hidden Layer: A linear layer that averages the input context vectors.
Output: A softmax layer predicting the probability distribution of the target word over the entire vocabulary.
Characteristics:
- Faster to train than Skip-Gram.
- Performs slightly better on frequent words.
- Averages out the context, smoothing over distributional statistics.
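
The CBOW forward pass described above can be sketched in a few lines of NumPy. This is a simplified illustration with a full softmax and randomly initialized weights. The sizes and the names `W_in`, `W_out`, and `cbow_forward` are assumptions, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4  # toy sizes (assumptions)
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # input embeddings
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))  # output weights

def cbow_forward(context_ids):
    """Average the context embeddings, then score every vocabulary word."""
    hidden = W_in[context_ids].mean(axis=0)     # shape: (embed_dim,)
    scores = hidden @ W_out                     # shape: (vocab_size,)
    exp_scores = np.exp(scores - scores.max())  # numerically stable softmax
    return exp_scores / exp_scores.sum()

# Word ids of two words before and two words after the target.
probs = cbow_forward([1, 2, 4, 5])
predicted_id = int(probs.argmax())
```

After training, the rows of `W_in` would serve as the word embeddings.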

Skip-Gram Model

Skip-Gram reverses the CBOW objective; it aims to predict surrounding context words given a single target word.

Input: The target word.
Hidden Layer: A linear layer that maps the target word to its dense embedding.
Output: A softmax over the vocabulary, applied once per context position with shared output weights, predicting each surrounding word.
Characteristics:
- Slower to train than CBOW.
- Performs significantly better on infrequent or rare words because the target word is not averaged with others.
- Captures finer-grained semantic details.
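
Skip-Gram training data is usually expressed as (target, context) pairs. The sketch below shows one way to generate them from a tokenized sentence. The function name `skipgram_pairs` and the whitespace tokenization are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and every neighbour within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Each pair becomes one training example: the model sees the target word and is asked to predict the context word.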

Training Optimizations

Computing the softmax over a massive vocabulary is computationally expensive. Word2Vec uses two main optimization tricks:

Negative Sampling: Instead of updating weights for all $V$ words in the vocabulary, the model updates weights for the true context word (positive sample) and a small number of randomly chosen noise words (negative samples).
Hierarchical Softmax: Replaces the flat softmax layer with a binary tree representation of the vocabulary, reducing the cost of each prediction from $O(|V|)$ to $O(\log_2 |V|)$.
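
As a rough illustration of negative sampling, the loss for a single (target, context) pair can be written as below. This follows the commonly cited skip-gram negative-sampling objective. The function and variable names are assumptions, and drawing the noise words from the vocabulary is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_target, u_context, u_negatives):
    """Loss for one (target, context) pair with k sampled noise words.

    v_target:    (d,)   input vector of the target word
    u_context:   (d,)   output vector of the true context word
    u_negatives: (k, d) output vectors of the sampled noise words
    """
    # Push the true pair's score up ...
    positive_term = -np.log(sigmoid(u_context @ v_target))
    # ... and push the scores of the k noise words down.
    negative_term = -np.log(sigmoid(-(u_negatives @ v_target))).sum()
    return positive_term + negative_term

rng = np.random.default_rng(0)
d, k = 8, 5
v = rng.normal(size=d)
loss_aligned = negative_sampling_loss(v, v, -np.tile(v, (k, 1)))
loss_misaligned = negative_sampling_loss(v, -v, np.tile(v, (k, 1)))
```

Only `k + 1` output vectors are touched per training pair, rather than all $|V|$ of them, which is where the speed-up comes from.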

4. GloVe Embeddings (Global Vectors)

Developed by Pennington, Socher, and Manning at Stanford (2014), GloVe combines the best of both worlds: global matrix factorization (like Latent Semantic Analysis) and local context window methods (like Word2Vec).

The Mechanics of GloVe

Co-occurrence Matrix: GloVe first builds a global word-word co-occurrence matrix $X$, where cell $X_{ij}$ represents how many times word $j$ appears in the context of word $i$ across the entire corpus.
Co-occurrence Probabilities: The core intuition of GloVe is that the ratio of co-occurrence probabilities between words encodes meaning.
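Objective Function: GloVe trains word vectors so that the dot product of a word vector and a context vector, plus their bias terms, approximates the logarithm of the pair's co-occurrence count $X_{ij}$. This gives a weighted least-squares objective: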
$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$
Where $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are their bias terms, and $f(X_{ij})$ is a weighting function that down-weights rare co-occurrences and caps the influence of highly frequent word pairs.
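
A single term of the objective $J$ can be sketched as follows. The weighting function uses the form and default values reported in the GloVe paper ($x_{\max} = 100$, $\alpha = 0.75$). These values do not come from these notes, and the function names are illustrative.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): grows as (x / x_max)^alpha, then is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One term of J: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2, for X_ij > 0."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# A pair whose dot product already equals log(X_ij) contributes no loss.
perfect_fit = glove_pair_loss([1.0, 0.0], [math.log(50), 0.0], 0.0, 0.0, 50)
```

Because $f(0) = 0$, pairs that never co-occur drop out of the sum, so only the non-zero entries of $X$ need to be visited during training.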

5. Capturing Semantic Similarity and Analogy Relationships

Semantic Similarity

Because embeddings map semantic meaning to spatial coordinates, similarity is measured geometrically. The most common measure is Cosine Similarity, which is the cosine of the angle between two vectors and ignores their magnitude.

$\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{||A|| \times ||B||}$

Value of 1: Vectors point in the exact same direction (highly similar).
Value of 0: Vectors are orthogonal (unrelated).
Value of -1: Vectors point in opposite directions.
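
The three reference values can be checked with a direct NumPy translation of the formula. The helper name `cosine_similarity` and the toy vectors are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity([1, 2], [2, 4])        # same direction, different magnitude
orthogonal = cosine_similarity([1, 0], [0, 3])  # perpendicular vectors
opposite = cosine_similarity([1, 2], [-1, -2])  # opposite directions
```

Note that `[1, 2]` and `[2, 4]` score as identical even though their lengths differ, since only the direction matters.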

Analogy Relationships

A remarkable property of dense embeddings (especially Word2Vec and GloVe) is their ability to capture linear relational substructures. This allows for arithmetic operations on vectors to solve analogies.

Classic Example: "Man is to King as Woman is to X"
$\vec{v}_{\text{King}} - \vec{v}_{\text{Man}} + \vec{v}_{\text{Woman}} \approx \vec{v}_{\text{Queen}}$

This works because the spatial offset (direction and distance) between "Man" and "Woman" is roughly parallel to the offset between "King" and "Queen", encoding the semantic concept of gender.
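
The vector arithmetic can be illustrated with hand-made 2-D vectors. These are not trained embeddings: the values below are chosen by hand so that one axis loosely stands for "royalty" and the other for "gender". The helper `solve_analogy` is an assumption. It excludes the three input words from the candidates, as is common when evaluating analogies.

```python
import numpy as np

# Hand-made toy vectors (NOT trained): axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.1, 1.0]),
    "woman": np.array([0.1, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def solve_analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to b - a + c, excluding a, b and c."""
    query = vectors[b] - vectors[a] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# "man is to king as woman is to ?"
answer = solve_analogy("man", "king", "woman", toy_vectors)
print(answer)
```

With real embeddings the result is only approximately equal to the expected vector, so the answer is taken to be the nearest neighbour rather than an exact match.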

6. Visualizing Embedding Spaces

Since word embeddings typically have 50 to 300 dimensions, they cannot be plotted directly. Dimensionality reduction techniques are required to project these vectors into 2D or 3D spaces.

Principal Component Analysis (PCA)

Mechanism: A linear algorithm that projects data onto a lower-dimensional space while maximizing the variance of the projected data.
Pros: Fast, deterministic, preserves global data structure and large-scale distances.
Cons: Struggles to capture complex, non-linear relationships in the embedding space.
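
A minimal PCA projection can be written with NumPy's singular value decomposition. This is a sketch rather than a replacement for a library implementation. The random matrix merely stands in for real word vectors, and the name `pca_project` is an assumption.

```python
import numpy as np

def pca_project(vectors, n_components=2):
    """Project the rows of `vectors` onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(20, 50))  # stand-in for real word vectors
points_2d = pca_project(fake_embeddings)     # shape: (20, 2)
```

The first output column always carries at least as much variance as the second, which is the sense in which PCA "maximizes variance" in the projection.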

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Mechanism: A non-linear algorithm that calculates probability distributions of neighborhoods in both the high-dimensional and low-dimensional spaces, minimizing the Kullback-Leibler (KL) divergence between them.
Pros: Exceptionally good at preserving local structure (close words remain close), making it the preferred method for revealing clusters in NLP embeddings.
Cons: Computationally expensive, stochastic (results vary between runs unless the random seed is fixed), sensitive to the perplexity setting, and does not reliably preserve global distances.

Code Example: Visualizing with t-SNE

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Assume 'word_vectors' is a 2D numpy array of shape (num_words, embedding_dim)
# and 'words' is a list of strings corresponding to the vectors.

def visualize_embeddings(word_vectors, words):
    word_vectors = np.asarray(word_vectors)

    # scikit-learn requires perplexity to be smaller than the number of samples,
    # so cap it for small word lists.
    perplexity = min(30, len(words) - 1)

    # Initialize t-SNE (reduce to 2 dimensions); random_state makes runs repeatable
    tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity)

    # Fit and transform the high-dimensional vectors to 2D
    vectors_2d = tsne.fit_transform(word_vectors)

    # Plot all points in one call, then label each point with its word
    plt.figure(figsize=(12, 8))
    plt.scatter(vectors_2d[:, 0], vectors_2d[:, 1], marker='o', color='blue')
    for (x, y), word in zip(vectors_2d, words):
        plt.annotate(word, (x, y), xytext=(5, 2), textcoords='offset points', ha='left', va='bottom')

    plt.title("t-SNE Visualization of Word Embeddings")
    plt.grid(True)
    plt.show()
```
