Unit 2: Matrix Decompositions for Representation Learning
1. Eigen Decomposition
Eigen decomposition is a fundamental concept in linear algebra that breaks down a matrix into its constituent eigenvalues and eigenvectors. It provides a deep insight into the properties of a linear transformation, revealing the directions along which the transformation acts purely as a scaling operation.
1.1. Eigenvalues and Eigenvectors
For a square matrix A ∈ ℝⁿˣⁿ, a non-zero vector v ∈ ℝⁿ is an eigenvector of A if it satisfies the following equation for some scalar λ ∈ ℝ:
Av = λv
- Eigenvector (v): A vector whose direction is unchanged when the linear transformation A is applied to it. The vector is only scaled.
- Eigenvalue (λ): The scalar factor by which the eigenvector v is stretched or shrunk. A negative eigenvalue indicates a reversal of direction.
The equation can be rewritten as:
(A - λI)v = 0
For a non-trivial solution (v ≠ 0) to exist, the matrix (A - λI) must be singular, meaning its determinant must be zero:
det(A - λI) = 0
This is called the characteristic equation, and its roots are the eigenvalues λ of matrix A.
1.2. The Decomposition Formula
If an n x n matrix A has n linearly independent eigenvectors, it is called diagonalizable. It can be decomposed into the form:
A = PDP⁻¹
- P: An n x n matrix whose columns are the eigenvectors of A.
P = [v₁, v₂, ..., vₙ]
- D: An n x n diagonal matrix whose diagonal entries are the corresponding eigenvalues of A:
D = diag(λ₁, λ₂, ..., λₙ)
- P⁻¹: The inverse of matrix P.
Geometric Interpretation: The decomposition A = PDP⁻¹ can be interpreted as a change of basis. Applying the transformation A to a vector x (y = Ax) is equivalent to:
- P⁻¹x: Changing the basis of x from the standard basis to the basis of eigenvectors.
- DP⁻¹x: Scaling the new coordinates along the eigenvector axes by the corresponding eigenvalues.
- PDP⁻¹x: Changing the basis back to the standard basis.
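The decomposition and its change-of-basis reading can be checked numerically. A minimal NumPy sketch (the 2 x 2 matrix is an arbitrary illustrative example):

```python
import numpy as np

# Illustrative diagonalizable matrix (arbitrary, not from the notes).
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# np.linalg.eig returns eigenvalues and a matrix whose columns are eigenvectors.
eigvals, P = np.linalg.eig(A)
D = np.diag(eigvals)

# Reconstruct A = P D P^{-1}.
A_rebuilt = P @ D @ np.linalg.inv(P)
print(np.allclose(A, A_rebuilt))  # True

# Applying A to x equals: change basis (P⁻¹x), scale (D·), change back (P·).
x = np.array([1.0, -2.0])
y_direct = A @ x
y_steps = P @ (D @ (np.linalg.inv(P) @ x))
print(np.allclose(y_direct, y_steps))  # True
```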
1.3. Special Case: Symmetric Matrices
If A is a symmetric matrix (A = Aᵀ), its eigendecomposition has special properties:
- All eigenvalues are real.
- Eigenvectors corresponding to distinct eigenvalues are orthogonal.
- It is always diagonalizable.
This means the matrix P of eigenvectors can be chosen to be an orthogonal matrix (its columns are orthonormal). For an orthogonal matrix, P⁻¹ = Pᵀ. The decomposition simplifies to:
A = PDPᵀ
This is a key result used in Principal Component Analysis.
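For the symmetric case, NumPy's `np.linalg.eigh` exploits these properties directly, returning real eigenvalues and orthonormal eigenvectors. A small check with an arbitrary symmetric matrix:

```python
import numpy as np

# A symmetric example matrix (arbitrary, for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is specialized for symmetric matrices.
eigvals, P = np.linalg.eigh(A)

# P is orthogonal, so P⁻¹ = Pᵀ and A = P D Pᵀ.
print(np.allclose(P.T @ P, np.eye(2)))             # True
print(np.allclose(A, P @ np.diag(eigvals) @ P.T))  # True
```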
1.4. Limitations in Machine Learning
While powerful, eigendecomposition has significant limitations that restrict its direct application in many ML scenarios:
- Square Matrices Only: Eigendecomposition is only defined for square matrices. Most real-world data matrices (e.g., user-item ratings, image datasets) are rectangular (m ≠ n).
- Existence Not Guaranteed: Not all square matrices are diagonalizable. A matrix must have n linearly independent eigenvectors for the decomposition to exist.
- Non-Orthogonal Basis: For non-symmetric matrices, the eigenvectors are generally not orthogonal. This makes them a less convenient basis for representing data, as the axes are not independent.
These limitations motivate the need for a more general decomposition that works for any matrix: the Singular Value Decomposition (SVD).
2. Singular Value Decomposition (SVD)
SVD is a generalization of eigendecomposition to any m x n matrix. It is arguably the most important matrix factorization in data science and machine learning, providing the foundation for dimensionality reduction, data compression, and noise reduction.
2.1. The Decomposition Formula
Any real matrix A ∈ ℝᵐˣⁿ can be decomposed into three matrices:
A = UΣVᵀ
- U: An m x m orthogonal matrix. Its columns are the left-singular vectors, which are the orthonormal eigenvectors of the matrix AAᵀ.
- Σ (Sigma): An m x n rectangular diagonal matrix. Its diagonal entries, σ₁, σ₂, ..., σᵣ, are the singular values of A, where r is the rank of A. They are non-negative and ordered by magnitude: σ₁ ≥ σ₂ ≥ ... ≥ σᵣ ≥ 0. The singular values are the square roots of the non-zero eigenvalues of both AᵀA and AAᵀ.
- V: An n x n orthogonal matrix. Its columns are the right-singular vectors, which are the orthonormal eigenvectors of the matrix AᵀA.
- Vᵀ: The transpose of V.
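A quick NumPy check of the formula and the eigenvector relationships above (the 4 x 2 matrix is arbitrary illustrative data):

```python
import numpy as np

# A rectangular example matrix (m=4, n=2); values are arbitrary.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))

# full_matrices=True returns the full m x m U and n x n V.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)  # (4, 4) (2,) (2, 2)

# Rebuild the m x n Σ from the singular values and reconstruct A.
Sigma = np.zeros((4, 2))
np.fill_diagonal(Sigma, s)
print(np.allclose(A, U @ Sigma @ Vt))  # True

# Singular values are the square roots of the eigenvalues of AᵀA.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s**2, eigvals))  # True
```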
2.2. Geometric Interpretation
SVD states that any linear transformation can be broken down into three fundamental operations:
- A Rotation/Reflection (Vᵀ): The Vᵀ matrix rotates or reflects the input vectors in the n-dimensional domain space without changing their lengths.
- A Scaling (Σ): The Σ matrix scales the rotated vectors along the new axes. Some dimensions may be stretched (σᵢ > 1), some shrunk (σᵢ < 1), and some eliminated (σᵢ = 0). It also handles the change in dimensionality from n to m.
- Another Rotation/Reflection (U): The U matrix rotates or reflects the scaled vectors into their final positions in the m-dimensional codomain space.

2.3. Low-Rank Approximation (Eckart-Young-Mirsky Theorem)
The power of SVD lies in its ability to provide the best possible low-rank approximation of a matrix. The singular values in Σ are ordered by importance. The larger the singular value, the more variance/information it captures about the data.
To get the best rank-k approximation of A, we can simply keep the first k singular values and their corresponding singular vectors:
Aₖ = UₖΣₖVₖᵀ
- Uₖ: First k columns of U (m x k matrix).
- Σₖ: Top-left k x k submatrix of Σ (k x k diagonal matrix).
- Vₖᵀ: First k rows of Vᵀ (k x n matrix).
The theorem states that Aₖ is the closest rank-k matrix to A in terms of the Frobenius norm:
Aₖ = argmin_{rank(B)=k} ||A - B||_F
This property is central to applications like image compression and dimensionality reduction (PCA).
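The truncation can be sketched in a few lines of NumPy; the Frobenius error of the best rank-k approximation equals the square root of the sum of the discarded squared singular values, which the sketch verifies (the 6 x 5 matrix is arbitrary illustrative data):

```python
import numpy as np

# Best rank-k approximation via truncated SVD (Eckart-Young-Mirsky).
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 5))  # stand-in for a data/image matrix

def rank_k_approx(A, k):
    """Keep the top-k singular triplets: A_k = U_k Σ_k V_kᵀ."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A2 = rank_k_approx(A, 2)
print(np.linalg.matrix_rank(A2))  # 2

# Frobenius error equals sqrt(σ₃² + σ₄² + σ₅²) for the discarded values.
_, s, _ = np.linalg.svd(A)
err = np.linalg.norm(A - A2, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[2:]**2))))  # True
```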
3. Principal Component Analysis (PCA)
PCA is an unsupervised dimensionality reduction technique that transforms data into a new coordinate system. The new axes, called principal components, are ordered such that the first component captures the most variance in the data, the second captures the most remaining variance, and so on.
Let X be an N x d data matrix (N samples, d features), which has been mean-centered (the mean of each column is subtracted from that column).
3.1. Geometric Perspective: Maximizing Variance
PCA seeks to find a set of orthogonal directions (unit vectors wᵢ) that maximize the variance of the projected data.
- First Principal Component (PC1): The direction w₁ that maximizes the variance of the data projected onto it.
  - The projection of the data matrix X onto w₁ is Xw₁.
  - The variance of the projected data is Var(Xw₁) = (1/N) * ||Xw₁||² = w₁ᵀ(XᵀX/N)w₁.
  - The term (XᵀX/N) is the covariance matrix of the data, C.
  - The optimization problem is: argmax_{||w₁||=1} w₁ᵀCw₁.
  - The solution is that w₁ is the eigenvector of the covariance matrix C corresponding to the largest eigenvalue. This eigenvalue represents the amount of variance captured by PC1.
- Subsequent Principal Components: The k-th principal component wₖ is the direction, orthogonal to all previous components w₁, ..., w_{k-1}, that captures the maximum remaining variance. This corresponds to the eigenvector of C with the k-th largest eigenvalue.
3.2. Optimization Perspective: Minimizing Reconstruction Error
An equivalent view of PCA is finding a lower-dimensional subspace that minimizes the sum of squared distances from the original data points to their projections on that subspace. This is the reconstruction error.
This perspective connects directly to the SVD's low-rank approximation property. Minimizing reconstruction error is equivalent to finding the best rank-k approximation of the data matrix X. The subspace that achieves this is spanned by the first k principal components.
3.3. PCA Algorithm
- Standardize Data: Let the original data be D.
  a. Compute the mean of each feature (column).
  b. Subtract the mean from each column to get the mean-centered matrix X.
  c. (Optional but recommended) Divide each column by its standard deviation for feature scaling.
- Choose Method:
  a. Eigendecomposition of the Covariance Matrix:
     i. Calculate the covariance matrix: C = (1/(N-1)) * XᵀX.
     ii. Perform eigendecomposition: C = PDPᵀ.
     iii. The columns of P are the principal components (eigenvectors). The diagonal elements of D are the variances (eigenvalues).
  b. SVD of the Data Matrix (more numerically stable):
     i. Perform SVD on the mean-centered data: X = UΣVᵀ.
     ii. The columns of V are the principal components. These are the same vectors found via eigendecomposition of the covariance matrix.
- Select Principal Components:
  - Sort the eigenvalues in descending order and choose the top k eigenvectors (principal components) corresponding to them.
  - A common way to choose k is to select enough components to explain a certain percentage (e.g., 95%) of the total variance. The proportion of variance explained by component i is λᵢ / Σλⱼ.
- Project Data:
  - Form a projection matrix W whose columns are the top k principal components.
  - The new, lower-dimensional representation of the data is given by Z = XW. Z is an N x k matrix.
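The SVD route of the algorithm can be sketched as follows. The helper name `pca` and the synthetic correlated dataset are illustrative assumptions, not part of the notes:

```python
import numpy as np

def pca(D, k):
    """PCA via SVD of the mean-centered data (a minimal sketch)."""
    X = D - D.mean(axis=0)                            # mean-center
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # SVD of X
    eigvals = s**2 / (len(D) - 1)                     # variances per component
    W = Vt[:k].T                                      # top-k principal components
    Z = X @ W                                         # project: N x k
    explained = eigvals[:k].sum() / eigvals.sum()     # fraction of variance kept
    return Z, W, explained

# Arbitrary synthetic data: 100 samples, 3 features, two of them correlated.
rng = np.random.default_rng(2)
t = rng.standard_normal((100, 1))
D = np.hstack([t, 2 * t, rng.standard_normal((100, 1))]) \
    + rng.standard_normal((100, 3)) * 0.1

Z, W, explained = pca(D, k=2)
print(Z.shape)  # (100, 2)
print(explained > 0.95)  # two components should capture almost all variance
```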
Representation Learning: PCA learns a new basis for the data where the features are linearly uncorrelated and ordered by importance (variance). This new representation often makes subsequent machine learning tasks (like clustering or classification) easier and more efficient.
4. Linear Discriminant Analysis (LDA)
LDA is a supervised dimensionality reduction technique used primarily for classification problems. Unlike PCA, which finds directions of maximum variance, LDA finds directions that maximize the separability between classes.
4.1. Core Idea
The goal is to project the data onto a lower-dimensional space such that:
- The distance between the means of the projected classes is maximized.
- The variance (or "scatter") within each projected class is minimized.
This creates a new feature space where classes are as distinct as possible.
4.2. Mathematical Formulation
LDA uses two key scatter matrices:
- Within-Class Scatter Matrix (S_W): Measures how spread out the data is within each class. It is the sum of the (unnormalized) covariance matrices of the individual classes.
  S_W = Σ_{c=1}^{C} Σ_{xᵢ in Class c} (xᵢ - μ_c)(xᵢ - μ_c)ᵀ
  where C is the number of classes and μ_c is the mean vector for class c.
- Between-Class Scatter Matrix (S_B): Measures how spread out the class means are from the overall data mean.
  S_B = Σ_{c=1}^{C} N_c (μ_c - μ)(μ_c - μ)ᵀ
  where N_c is the number of samples in class c and μ is the overall mean of all data.
Fisher's Criterion: LDA aims to find a projection matrix W that maximizes the ratio of the determinant of the between-class scatter to the determinant of the within-class scatter in the projected space.
J(W) = det(WᵀS_BW) / det(WᵀS_WW)
Solution: This is a generalized eigenvalue problem. The columns of the optimal projection matrix W are the eigenvectors of the matrix S_W⁻¹S_B corresponding to the largest eigenvalues.
4.3. LDA Algorithm
- Compute the mean vector for each class c (μ_c) and the overall mean vector μ.
- Compute the within-class scatter matrix S_W and the between-class scatter matrix S_B.
- Solve the generalized eigenvalue problem for the matrix S_W⁻¹S_B to find its eigenvalues and eigenvectors.
- Sort the eigenvectors in descending order based on their corresponding eigenvalues.
- Select the top k eigenvectors to form the projection matrix W. For a C-class problem, k can be at most C-1.
- Transform the data into the new k-dimensional space: Z = XW.
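The steps above can be sketched with NumPy. The two Gaussian classes are made-up illustrative data, and the helper `lda` (which assumes S_W is invertible) is not a standard API:

```python
import numpy as np

def lda(X, y, k):
    """Fisher LDA projection (minimal sketch; assumes S_W is invertible)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)          # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)            # between-class scatter
    # Eigenvectors of S_W⁻¹S_B, sorted by descending eigenvalue.
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs.real[:, order[:k]]
    return X @ W, W

# Two arbitrary Gaussian classes separated along the first feature.
rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((50, 2)) + [4, 0],
               rng.standard_normal((50, 2))])
y = np.array([0] * 50 + [1] * 50)

Z, W = lda(X, y, k=1)  # for C=2 classes, at most C-1 = 1 component
print(Z.shape)  # (100, 1)
```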
4.4. PCA vs. LDA
| Feature | Principal Component Analysis (PCA) | Linear Discriminant Analysis (LDA) |
|---|---|---|
| Supervision | Unsupervised | Supervised |
| Goal | Maximize variance of the entire dataset. | Maximize separability between classes. |
| Input Data | Does not use class labels. | Requires class labels. |
| Projection | Finds directions that best represent the overall data structure. | Finds directions that best discriminate between classes. |
| Max Components | d (number of original features) | C-1 (number of classes minus one) |
| Use Case | General dimensionality reduction, data visualization, compression. | Pre-processing for classification tasks. |
5. Applications of Matrix Factorization in Recommendation Systems
Matrix factorization is the core technique behind latent factor models, a highly successful approach to building recommendation systems.
5.1. The Problem: The Ratings Matrix
- We have a set of m users and n items.
- User interactions (e.g., ratings on a scale of 1-5) are stored in an m x n matrix R.
- This matrix is typically very sparse, as most users have only rated a small fraction of the available items.
- Goal: Predict the missing values in R to recommend new items to users.
5.2. Latent Factor Model
The core idea is to assume that both users and items can be characterized by a small number of k latent (hidden) factors.
- For movies, factors might be "amount of comedy," "level of action," "oriented towards children," etc.
- For users, factors represent their preference for each of these underlying attributes.
5.3. Matrix Factorization Approach
We aim to find two lower-rank matrices whose product approximates the original ratings matrix R:
R ≈ P Qᵀ
- P: An m x k user-factor matrix. Row p_u is a k-dimensional vector representing user u's preferences in the latent space.
- Q: An n x k item-factor matrix. Row q_i is a k-dimensional vector representing item i's attributes in the latent space.
- The predicted rating r̂_ui for user u on item i is simply the dot product of their latent vectors: r̂_ui = p_u ⋅ q_i = p_uᵀq_i.
5.4. Learning the Latent Factors
Since P and Q are unknown, we must learn them from the available ratings. This is framed as an optimization problem.
Objective Function: We want to minimize the error on the ratings we do know. A common choice is the sum of squared errors, with regularization to prevent overfitting.
L(P, Q) = Σ_{(u,i) in K} (r_ui - p_uᵀq_i)² + λ(||P||²_F + ||Q||²_F)
- K: The set of (u, i) pairs for which a rating r_ui is known.
- λ: A regularization parameter to control the magnitude of the factor vectors, preventing them from becoming too large and overfitting the training data.
- ||.||²_F: The squared Frobenius norm.
Optimization Algorithms:
- Stochastic Gradient Descent (SGD):
  - Iterate through each known rating r_ui.
  - Calculate the prediction error: e_ui = r_ui - p_uᵀq_i.
  - Update the user and item vectors by moving them slightly in the direction that reduces the error:
    p_u ← p_u + α * (e_ui * q_i - λ * p_u)
    q_i ← q_i + α * (e_ui * p_u - λ * q_i)
    where α is the learning rate.
  - This is a simple, fast, and popular method.
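The SGD loop can be sketched as below. The tiny ratings list, the dimensions, and the hyperparameters (α, λ, epoch count) are arbitrary choices for illustration:

```python
import numpy as np

# (user, item, rating) triples for the known entries K; arbitrary toy data.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
m, n, k = 3, 3, 2
alpha, lam = 0.02, 0.01

rng = np.random.default_rng(4)
P = rng.standard_normal((m, k)) * 0.1  # user factors
Q = rng.standard_normal((n, k)) * 0.1  # item factors

for epoch in range(1000):
    for u, i, r in ratings:
        e = r - P[u] @ Q[i]                  # prediction error e_ui
        pu = P[u].copy()                     # use the old p_u in both updates
        P[u] += alpha * (e * Q[i] - lam * P[u])
        Q[i] += alpha * (e * pu - lam * Q[i])

# Training error on the known ratings should now be small.
rmse = np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings]))
print(rmse < 0.5)
```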
- Alternating Least Squares (ALS):
  - The objective function is not convex in both P and Q together. However, if we fix one matrix, the problem becomes a standard quadratic (least squares) problem for the other.
  - ALS alternates between these two steps until convergence:
    - Fix P, solve for Q: each q_i can be solved for independently using least squares.
    - Fix Q, solve for P: each p_u can be solved for independently using least squares.
  - ALS is deterministic and easy to parallelize, making it suitable for large-scale distributed systems.
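A minimal ALS sketch, assuming a small dense matrix in which zeros mark missing ratings (the data, rank k, regularization λ, and sweep count are illustrative):

```python
import numpy as np

# Toy ratings matrix; 0 marks a missing entry (illustrative convention).
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 2.0, 5.0]])
M = R > 0  # mask of observed ratings
m, n, k, lam = 3, 3, 2, 0.1

rng = np.random.default_rng(5)
P = rng.standard_normal((m, k))
Q = rng.standard_normal((n, k))

for sweep in range(20):
    # Fix Q: each p_u is an independent ridge-regression solve.
    for u in range(m):
        Qu = Q[M[u]]  # factors of the items user u has rated
        P[u] = np.linalg.solve(Qu.T @ Qu + lam * np.eye(k), Qu.T @ R[u, M[u]])
    # Fix P: each q_i is an independent ridge-regression solve.
    for i in range(n):
        Pi = P[M[:, i]]  # factors of the users who rated item i
        Q[i] = np.linalg.solve(Pi.T @ Pi + lam * np.eye(k), Pi.T @ R[M[:, i], i])

# Error on the observed entries should be small after a few sweeps.
err = np.sqrt(np.mean((R[M] - (P @ Q.T)[M]) ** 2))
print(err < 0.5)
```

Each inner solve is a small k x k linear system, which is why ALS parallelizes so easily: all p_u (and all q_i) can be computed independently.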