Unit 4 - Notes
Unit 4: Dimensionality Reduction and Representation Learning
1. Need for Dimensionality Reduction
Dimensionality reduction is the process of transforming data from a high-dimensional space into a lower-dimensional space while retaining as much meaningful property of the original data as possible.
Why do we need it?
- The Curse of Dimensionality: As the number of features (dimensions) increases, the volume of the feature space increases exponentially. This makes data sparse, causing distance metrics (like Euclidean distance) to lose their usefulness, which severely degrades the performance of machine learning algorithms.
- Computational Efficiency: High-dimensional data requires massive computational power and memory for training and inference. Reducing dimensions significantly speeds up algorithms.
- Data Visualization: Humans can only perceive data in 2D or 3D. Dimensionality reduction allows us to plot and visually analyze high-dimensional datasets.
- Noise Reduction: Real-world data often contains irrelevant features and random noise. Dimensionality reduction helps filter out this noise by forcing the model to capture only the most prominent, underlying structures (signals).
- Mitigating Multicollinearity: Highly correlated features can make models unstable. Dimensionality reduction techniques combine redundant features into single underlying representations.
2. Linear Dimensionality Reduction: Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is the most widely used linear dimensionality reduction technique. It transforms the original correlated variables into a new set of uncorrelated variables called Principal Components (PCs).
Geometric Intuition
Imagine a cloud of data points in a 3D space shaped like a flat cigar.
- Maximizing Variance: PCA looks for the direction (axis) along which the data varies the most. This direction becomes the First Principal Component (PC1). It captures the maximum amount of information (variance).
- Orthogonality: Next, PCA looks for a second direction that captures the maximum remaining variance, with the strict constraint that it must be exactly perpendicular (orthogonal) to PC1. This is PC2.
- Projection: To reduce dimensions (e.g., from 3D to 2D), the original data points are projected orthogonally onto the plane defined by PC1 and PC2. The dimension with the lowest variance (the thickness of the cigar) is discarded, resulting in minimal information loss.
Mathematical Outline
- Standardization: Center the data (mean = 0) and scale it (variance = 1).
- Covariance Matrix: Compute the covariance matrix to understand how features vary together.
- Eigendecomposition: Calculate the eigenvalues and eigenvectors of the covariance matrix.
- Eigenvectors determine the directions of the new feature space (the PCs).
- Eigenvalues determine the magnitude of variance explained by each PC.
- Sort and Select: Sort eigenvectors by descending eigenvalues and choose the top components.
Code Example (PCA in Python)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# X is our high-dimensional data (e.g., shape = [100, 50])
X = np.random.rand(100, 50)
# 1. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 2. Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Original shape: {X_scaled.shape}")
print(f"Reduced shape: {X_pca.shape}")
print(f"Explained Variance Ratio: {pca.explained_variance_ratio_}")
3. Manifold Learning Overview
While linear techniques like PCA are powerful, they fail when data has complex, non-linear relationships.
The Manifold Hypothesis
The Manifold Hypothesis states that high-dimensional data in the real world actually lies on (or near) a lower-dimensional, non-linear manifold embedded within the high-dimensional space.
- Analogy: Imagine a sheet of paper (a 2D manifold). If you crumple it up, it now exists in 3D space. Linear PCA would just draw a line through the crumpled ball. Manifold learning attempts to mathematically "uncrumple" the paper to reveal its true 2D structure.
The Swiss Roll Problem
A classic example is the "Swiss Roll" dataset—a 2D plane rolled up in 3D space. If two points are close to each other in 3D Euclidean distance, they might actually be on different layers of the roll (far apart on the actual 2D surface). Manifold learning algorithms calculate distances along the surface of the manifold (geodesic distance) rather than cutting through the empty space.
4. Non-Linear Dimensionality Reduction Concepts
t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is a non-linear technique specifically designed for data visualization in 2D or 3D.
- Conceptual Intuition: t-SNE focuses on preserving local structures. It ensures that points that are close to each other in the high-dimensional space remain close in the low-dimensional space.
- How it Works:
- High-Dimensional Space: It computes the probability that two points are neighbors based on a Gaussian distribution. Points close together have a high probability; points far apart have a near-zero probability.
- Low-Dimensional Space: It creates a similar probability distribution for the points in the new low-dimensional space, but uses a Student's t-distribution (which has heavier tails than a Gaussian). The heavy tails prevent the "crowding problem," allowing moderately distant points to be pushed further apart to create clear clusters.
- Optimization: It minimizes the difference between these two probability distributions using a metric called Kullback-Leibler (KL) divergence via gradient descent.
- Limitations: It does not preserve global data structure well (the distance between two distinct clusters in a t-SNE plot is not always meaningful), and it is computationally heavy for very large datasets.
UMAP (Uniform Manifold Approximation and Projection)
UMAP is a modern, mathematically rigorous dimensionality reduction technique based on Riemannian geometry and algebraic topology.
- Conceptual Intuition: Like t-SNE, UMAP is excellent for visualization, but it is designed to preserve both local and global structures.
- How it Works:
- It constructs a high-dimensional graph (a fuzzy simplicial complex) representing the data.
- It optimizes a low-dimensional graph to be as structurally similar to the high-dimensional graph as possible, minimizing a cost function called Cross-Entropy.
- Advantages over t-SNE:
- Speed: UMAP is significantly faster than t-SNE, scaling well to massive datasets.
- Global Structure: The distances between clusters in UMAP are much more meaningful than in t-SNE.
- Generality: UMAP can reduce data to any number of dimensions (e.g., 10D, 50D) for use as a preprocessing step for machine learning models, whereas t-SNE is strictly for 2D/3D visualization.
5. Autoencoder Intuition
Autoencoders are a neural network-based approach to representation learning and non-linear dimensionality reduction.
What is an Autoencoder?
An autoencoder is an unsupervised artificial neural network trained to copy its input to its output. Because it is forced to do this through a restricted "bottleneck," it must learn a compressed, efficient representation of the data.
Architecture
- Encoder: A neural network that compresses the high-dimensional input into a lower-dimensional latent space representation .
- Bottleneck (Latent Space): The middle layer of the network with the smallest number of nodes. This forces the network to discard noise and keep only the most essential features required to reconstruct the data.
- Decoder: A neural network that takes the compressed representation and attempts to reconstruct the original input .
Reconstruction Loss
The network is trained by minimizing the Reconstruction Loss—the difference between the original input and the reconstructed output . Commonly used loss functions include Mean Squared Error (MSE) for continuous data or Binary Cross-Entropy for probabilistic data.
Key Insights
- Connection to PCA: If an autoencoder uses only linear activation functions and a single hidden layer (the bottleneck), the latent space it learns spans the exact same principal subspace as PCA.
- Non-Linearity: By introducing non-linear activation functions (like ReLU, Sigmoid) and multiple deep layers, autoencoders can learn highly complex, non-linear manifolds, making them far more powerful than PCA for complex data like images and audio.