Unit 2 - Notes

INT255

Unit 2: Matrix Decompositions for Representation Learning

1. Eigen Decomposition

Eigen decomposition is a fundamental concept in linear algebra that breaks down a matrix into its constituent eigenvalues and eigenvectors. It provides a deep insight into the properties of a linear transformation, revealing the directions along which the transformation acts purely as a scaling operation.

1.1. Eigenvalues and Eigenvectors

For a square matrix A ∈ ℝⁿˣⁿ, a non-zero vector v ∈ ℝⁿ is an eigenvector of A if it satisfies the following equation for some scalar λ ∈ ℝ:

TEXT
Av = λv

  • Eigenvector (v): A vector whose direction is unchanged when the linear transformation A is applied to it. The vector is only scaled.
  • Eigenvalue (λ): The scalar factor by which the eigenvector v is stretched or shrunk. A negative eigenvalue indicates a reversal of direction.

The equation can be rewritten as:

TEXT
(A - λI)v = 0

For a non-trivial solution (v ≠ 0) to exist, the matrix (A - λI) must be singular, meaning its determinant must be zero:

TEXT
det(A - λI) = 0

This is called the characteristic equation, and its roots are the eigenvalues λ of matrix A.
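As a concrete check of these definitions, the following sketch (assuming NumPy, which the notes don't prescribe) verifies Av = λv and the characteristic equation for a small matrix:

```python
import numpy as np

# 2x2 example: trace 7, determinant 10, so the characteristic
# polynomial is l^2 - 7l + 10 = (l - 5)(l - 2).
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# np.linalg.eig returns the eigenvalues and the eigenvectors as columns.
eigvals, eigvecs = np.linalg.eig(A)

# Each pair satisfies Av = lambda * v.
for i in range(len(eigvals)):
    v = eigvecs[:, i]
    assert np.allclose(A @ v, eigvals[i] * v)

# det(A - lambda*I) = 0 at each eigenvalue (the characteristic equation).
for lam in eigvals:
    assert abs(np.linalg.det(A - lam * np.eye(2))) < 1e-9
```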

1.2. The Decomposition Formula

If an n x n matrix A has n linearly independent eigenvectors, it is called diagonalizable. It can be decomposed into the form:

TEXT
A = PDP⁻¹

  • P: An n x n matrix whose columns are the eigenvectors of A. P = [v₁, v₂, ..., vₙ]
  • D: An n x n diagonal matrix whose diagonal entries are the corresponding eigenvalues of A.
    TEXT
          [λ₁ 0  ... 0 ]
      D = [0  λ₂ ... 0 ]
          [⋮  ⋮  ⋱  ⋮ ]
          [0  0  ... λₙ]
      
  • P⁻¹: The inverse of matrix P.

Geometric Interpretation: The decomposition A = PDP⁻¹ can be interpreted as a change of basis. Applying the transformation A to a vector x (y = Ax) is equivalent to:

  1. P⁻¹x: Changing the basis of x from the standard basis to the basis of eigenvectors.
  2. DP⁻¹x: Scaling the new coordinates along the eigenvector axes by the corresponding eigenvalues.
  3. PDP⁻¹x: Changing the basis back to the standard basis.
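The three-step change-of-basis view above can be checked numerically; a minimal sketch (assuming NumPy):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eigvals, P = np.linalg.eig(A)   # columns of P are eigenvectors of A
D = np.diag(eigvals)

# Reconstruct A from its eigendecomposition.
assert np.allclose(A, P @ D @ np.linalg.inv(P))

# y = Ax as three steps in the eigenbasis:
x = np.array([1.0, -2.0])
step1 = np.linalg.inv(P) @ x    # 1. coordinates of x in the eigenbasis
step2 = D @ step1               # 2. scale each coordinate by its eigenvalue
step3 = P @ step2               # 3. back to the standard basis
assert np.allclose(step3, A @ x)
```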

1.3. Special Case: Symmetric Matrices

If A is a symmetric matrix (A = Aᵀ), its eigendecomposition has special properties:

  1. All eigenvalues are real.
  2. Eigenvectors corresponding to distinct eigenvalues are orthogonal.
  3. It is always diagonalizable.

This means the matrix P of eigenvectors can be chosen to be an orthogonal matrix (its columns are orthonormal). For an orthogonal matrix, P⁻¹ = Pᵀ. The decomposition simplifies to:

TEXT
A = PDPᵀ

This is a key result used in Principal Component Analysis.
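For symmetric matrices, NumPy's specialized routine `np.linalg.eigh` returns real eigenvalues and an orthonormal eigenvector matrix directly; a small sketch (library choice is an assumption):

```python
import numpy as np

# A symmetric matrix (A == A.T).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is specialized for symmetric/Hermitian matrices.
eigvals, P = np.linalg.eigh(A)

# P is orthogonal, so its inverse is its transpose.
assert np.allclose(P.T @ P, np.eye(2))

# A = P D P^T, the simplified decomposition above.
assert np.allclose(A, P @ np.diag(eigvals) @ P.T)
```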

1.4. Limitations in Machine Learning

While powerful, eigendecomposition has significant limitations that restrict its direct application in many ML scenarios:

  1. Square Matrices Only: Eigendecomposition is only defined for square matrices. Most real-world data matrices (e.g., user-item ratings, image datasets) are rectangular (m ≠ n).
  2. Existence Not Guaranteed: Not all square matrices are diagonalizable. A matrix must have n linearly independent eigenvectors for the decomposition to exist.
  3. Non-Orthogonal Basis: For non-symmetric matrices, the eigenvectors are generally not orthogonal. This makes them a less convenient basis for representing data, since the coordinate axes are not mutually perpendicular.

These limitations motivate the need for a more general decomposition that works for any matrix: the Singular Value Decomposition (SVD).


2. Singular Value Decomposition (SVD)

SVD is a generalization of eigendecomposition to any m x n matrix. It is arguably the most important matrix factorization in data science and machine learning, providing the foundation for dimensionality reduction, data compression, and noise reduction.

2.1. The Decomposition Formula

Any real matrix A ∈ ℝᵐˣⁿ can be decomposed into three matrices:

TEXT
A = UΣVᵀ

  • U: An m x m orthogonal matrix. Its columns are the left-singular vectors, which are the orthonormal eigenvectors of the matrix AAᵀ.
  • Σ (Sigma): An m x n rectangular diagonal matrix. Its diagonal entries, σ₁, σ₂, ..., σᵣ, are the singular values of A, where r is the rank of A. They are non-negative and ordered by magnitude: σ₁ ≥ σ₂ ≥ ... ≥ σᵣ ≥ 0. The singular values are the square roots of the non-zero eigenvalues of both AᵀA and AAᵀ.
  • V: An n x n orthogonal matrix. Its columns are the right-singular vectors, which are the orthonormal eigenvectors of the matrix AᵀA. Vᵀ is its transpose.
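These relationships can be verified directly on a rectangular matrix; a sketch assuming NumPy (the random test matrix is illustrative):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(4, 3))  # rectangular: m=4, n=3

# full_matrices=True gives U (4x4), s (3,), Vt (3x3).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the 4x3 rectangular-diagonal Sigma and reconstruct A.
Sigma = np.zeros((4, 3))
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)

# Singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]  # descending
assert np.allclose(s**2, eigvals)
```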

2.2. Geometric Interpretation

SVD states that any linear transformation can be broken down into three fundamental operations:

  1. A Rotation/Reflection (Vᵀ): The Vᵀ matrix rotates or reflects the input vectors in the n-dimensional domain space without changing their lengths.
  2. A Scaling (Σ): The Σ matrix scales the rotated vectors along the new axes. Some dimensions may be stretched (σᵢ > 1), some shrunk (σᵢ < 1), and some eliminated (σᵢ = 0). It also handles the change in dimensionality from n to m.
  3. Another Rotation/Reflection (U): The U matrix rotates or reflects the scaled vectors into their final positions in the m-dimensional codomain space.

[Figure: SVD geometry: rotation/reflection (Vᵀ), scaling (Σ), rotation/reflection (U)]

2.3. Low-Rank Approximation (Eckart-Young-Mirsky Theorem)

The power of SVD lies in its ability to provide the best possible low-rank approximation of a matrix. The singular values in Σ are ordered by importance. The larger the singular value, the more variance/information it captures about the data.

To get the best rank-k approximation of A, we can simply keep the first k singular values and their corresponding singular vectors:

TEXT
Aₖ = UₖΣₖVₖᵀ

  • Uₖ: First k columns of U (m x k matrix).
  • Σₖ: Top-left k x k submatrix of Σ (k x k diagonal matrix).
  • Vₖᵀ: First k rows of Vᵀ (k x n matrix).

The theorem states that Aₖ is the closest rank-k matrix to A in terms of the Frobenius norm:

TEXT
Aₖ = argmin_{rank(B)=k} ||A - B||_F

This property is central to applications like image compression and dimensionality reduction (PCA).
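The truncation and the resulting Frobenius error can be checked numerically; a sketch assuming NumPy (the random matrix and rank k=2 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Truncated SVD: keep the k largest singular values and their vectors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.linalg.matrix_rank(A_k) == k

# The Frobenius error of the best rank-k approximation equals the
# square root of the sum of the discarded squared singular values.
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:]**2)))
```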


3. Principal Component Analysis (PCA)

PCA is an unsupervised dimensionality reduction technique that transforms data into a new coordinate system. The new axes, called principal components, are ordered such that the first component captures the most variance in the data, the second captures the most remaining variance, and so on.

Let X be an N x d data matrix (N samples, d features), which has been mean-centered (the mean of each column is subtracted from that column).

3.1. Geometric Perspective: Maximizing Variance

PCA seeks to find a set of orthogonal directions (unit vectors wᵢ) that maximize the variance of the projected data.

  • First Principal Component (PC1): The direction w₁ that maximizes the variance of the data projected onto it.

    • The projection of the data matrix X onto w₁ is Xw₁.
    • The variance of the projected data is Var(Xw₁) = (1/N) * ||Xw₁||² = w₁ᵀ(XᵀX/N)w₁.
    • The term (XᵀX/N) is the covariance matrix of the data, C.
    • The optimization problem is: argmax_{||w₁||=1} w₁ᵀCw₁.
    • The solution is that w₁ is the eigenvector of the covariance matrix C corresponding to the largest eigenvalue. This eigenvalue represents the amount of variance captured by PC1.
  • Subsequent Principal Components: The k-th principal component wₖ is the direction, orthogonal to all previous components w₁, ..., w_{k-1}, that captures the maximum remaining variance. This corresponds to the eigenvector of C with the k-th largest eigenvalue.
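A small numerical check of the variance-maximization claim (assuming NumPy; the anisotropic toy data is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic data
X = X - X.mean(axis=0)                                    # mean-center

C = X.T @ X / len(X)                  # covariance matrix (1/N convention)
eigvals, eigvecs = np.linalg.eigh(C)  # eigh: ascending eigenvalues
w1 = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

# Variance of the projection onto w1 equals the top eigenvalue ...
assert np.isclose(np.var(X @ w1), eigvals[-1])
# ... and is at least the variance along any other unit direction.
u = rng.normal(size=3)
u /= np.linalg.norm(u)
assert np.var(X @ u) <= np.var(X @ w1) + 1e-12
```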

3.2. Optimization Perspective: Minimizing Reconstruction Error

An equivalent view of PCA is finding a lower-dimensional subspace that minimizes the sum of squared distances from the original data points to their projections on that subspace. This is the reconstruction error.

This perspective connects directly to the SVD's low-rank approximation property. Minimizing reconstruction error is equivalent to finding the best rank-k approximation of the data matrix X. The subspace that achieves this is spanned by the first k principal components.

3.3. PCA Algorithm

  1. Standardize Data: Let the original data be D.
    a. Compute the mean of each feature (column).
    b. Subtract the mean from each column to get the mean-centered matrix X.
    c. (Optional but recommended) Divide each column by its standard deviation for feature scaling.

  2. Choose Method:
    a. Eigendecomposition of the Covariance Matrix:
    i. Calculate the covariance matrix: C = (1/(N-1)) * XᵀX. (The 1/N convention used in Section 3.1 scales the eigenvalues by a constant factor but leaves the eigenvectors, and hence the principal components, unchanged.)
    ii. Perform eigendecomposition: C = PDPᵀ.
    iii. The columns of P are the principal components (eigenvectors). The diagonal elements of D are the variances (eigenvalues).
    b. SVD of the Data Matrix (more numerically stable):
    i. Perform SVD on the mean-centered data: X = UΣVᵀ.
    ii. The columns of V are the principal components. These are the same vectors found via eigendecomposition of the covariance matrix.

  3. Select Principal Components:

    • Sort the eigenvalues in descending order and choose the top k eigenvectors (principal components) corresponding to them.
    • A common way to choose k is to select enough components to explain a certain percentage (e.g., 95%) of the total variance. The proportion of variance explained by component i is λᵢ / Σλⱼ.
  4. Project Data:

    • Form a projection matrix W whose columns are the top k principal components.
    • The new, lower-dimensional representation of the data is given by Z = XW. Z is an N x k matrix.

Representation Learning: PCA learns a new basis for the data where the features are linearly uncorrelated and ordered by importance (variance). This new representation often makes subsequent machine learning tasks (like clustering or classification) easier and more efficient.
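The algorithm above (SVD variant, step 2b) can be sketched end to end; this is a minimal illustration assuming NumPy, and the function name `pca` and toy data are hypothetical:

```python
import numpy as np

def pca(D, k):
    """PCA via SVD of the mean-centered data matrix."""
    X = D - D.mean(axis=0)                  # step 1: mean-center
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:k].T                            # step 3: top-k principal components
    explained = s**2 / np.sum(s**2)         # proportion of variance per PC
    Z = X @ W                               # step 4: project to N x k
    return Z, W, explained[:k]

rng = np.random.default_rng(3)
D = rng.normal(size=(200, 5))
Z, W, ratios = pca(D, k=2)
assert Z.shape == (200, 2)
# The principal components (columns of W) are orthonormal.
assert np.allclose(W.T @ W, np.eye(2))
```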


4. Linear Discriminant Analysis (LDA)

LDA is a supervised dimensionality reduction technique used primarily for classification problems. Unlike PCA, which finds directions of maximum variance, LDA finds directions that maximize the separability between classes.

4.1. Core Idea

The goal is to project the data onto a lower-dimensional space such that:

  • The distance between the means of the projected classes is maximized.
  • The variance (or "scatter") within each projected class is minimized.

This creates a new feature space where classes are as distinct as possible.

4.2. Mathematical Formulation

LDA uses two key scatter matrices:

  1. Within-Class Scatter Matrix (S_W): Measures how spread out the data is within each class. It is the sum of the per-class scatter matrices (unnormalized covariance matrices).

    TEXT
        S_W = Σ_{c=1}^{C} Σ_{xᵢ in Class c} (xᵢ - μ_c)(xᵢ - μ_c)ᵀ
        

    where C is the number of classes and μ_c is the mean vector for class c.

  2. Between-Class Scatter Matrix (S_B): Measures how spread out the class means are from the overall data mean.

    TEXT
        S_B = Σ_{c=1}^{C} N_c (μ_c - μ)(μ_c - μ)ᵀ
        

    where N_c is the number of samples in class c and μ is the overall mean of all data.

Fisher's Criterion: LDA aims to find a projection matrix W that maximizes the ratio of the determinant of the between-class scatter to the determinant of the within-class scatter in the projected space.

TEXT
J(W) = det(WᵀS_BW) / det(WᵀS_WW)

Solution: This is a generalized eigenvalue problem. The columns of the optimal projection matrix W are the eigenvectors of the matrix S_W⁻¹S_B corresponding to the largest eigenvalues.

4.3. LDA Algorithm

  1. Compute the mean vector for each class c (μ_c) and the overall mean vector μ.
  2. Compute the within-class scatter matrix S_W and the between-class scatter matrix S_B.
  3. Solve the generalized eigenvalue problem for the matrix S_W⁻¹S_B to find its eigenvalues and eigenvectors.
  4. Sort the eigenvectors in descending order based on their corresponding eigenvalues.
  5. Select the top k eigenvectors to form the projection matrix W. For a C-class problem, k can be at most C-1.
  6. Transform the data into the new k-dimensional space: Z = XW.
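The six steps above can be sketched directly; a minimal illustration assuming NumPy (the function name `lda` and the two-class toy data are hypothetical, and S_W is assumed invertible):

```python
import numpy as np

def lda(X, y, k):
    """Fisher LDA: top-k eigenvectors of S_W^{-1} S_B."""
    mu = X.mean(axis=0)                        # step 1: overall mean
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):                     # step 2: scatter matrices
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * diff @ diff.T
    # steps 3-4: eigendecomposition of S_W^{-1} S_B, sorted descending
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real             # step 5: projection matrix
    return X @ W, W                            # step 6: transformed data

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
Z, W = lda(X, y, k=1)                          # C=2 classes -> at most 1 component
assert Z.shape == (100, 1)
```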

4.4. PCA vs. LDA

  • Supervision: PCA is unsupervised; LDA is supervised.
  • Goal: PCA maximizes the variance of the entire dataset; LDA maximizes separability between classes.
  • Input Data: PCA does not use class labels; LDA requires class labels.
  • Projection: PCA finds directions that best represent the overall data structure; LDA finds directions that best discriminate between classes.
  • Max Components: PCA yields up to d (the number of original features); LDA yields at most C-1 (the number of classes minus one).
  • Use Case: PCA for general dimensionality reduction, data visualization, and compression; LDA as pre-processing for classification tasks.

5. Applications of Matrix Factorization in Recommendation Systems

Matrix factorization is the core technique behind latent factor models, a highly successful approach to building recommendation systems.

5.1. The Problem: The Ratings Matrix

  • We have a set of m users and n items.
  • User interactions (e.g., ratings on a scale of 1-5) are stored in an m x n matrix R.
  • This matrix is typically very sparse, as most users have only rated a small fraction of the available items.
  • Goal: Predict the missing values in R to recommend new items to users.

5.2. Latent Factor Model

The core idea is to assume that both users and items can be characterized by a small number of k latent (hidden) factors.

  • For movies, factors might be "amount of comedy," "level of action," "oriented towards children," etc.
  • For users, factors represent their preference for each of these underlying attributes.

5.3. Matrix Factorization Approach

We aim to find two lower-rank matrices whose product approximates the original ratings matrix R:

TEXT
R ≈ P Qᵀ

  • P: An m x k user-factor matrix. Row p_u is a k-dimensional vector representing user u's preferences in the latent space.
  • Q: An n x k item-factor matrix. Row q_i is a k-dimensional vector representing item i's attributes in the latent space.
  • The predicted rating r̂_ui for user u on item i is simply the dot product of their latent vectors: r̂_ui = p_u ⋅ q_i = p_uᵀq_i.

5.4. Learning the Latent Factors

Since P and Q are unknown, we must learn them from the available ratings. This is framed as an optimization problem.

Objective Function: We want to minimize the error on the ratings we do know. A common choice is the sum of squared errors, with regularization to prevent overfitting.

TEXT
L(P, Q) = Σ_{(u,i) in K} (r_ui - p_uᵀq_i)² + λ(||P||²_F + ||Q||²_F)

  • K: The set of (u, i) pairs for which a rating r_ui is known.
  • λ: A regularization parameter to control the magnitude of the factor vectors, preventing them from becoming too large and overfitting the training data.
  • ||.||²_F: The Frobenius norm.

Optimization Algorithms:

  1. Stochastic Gradient Descent (SGD):

    • Iterate through each known rating r_ui.
    • Calculate the prediction error: e_ui = r_ui - p_uᵀq_i.
    • Update the user and item vectors by moving them slightly in the direction that reduces the error:
      TEXT
            p_u ← p_u + α * (e_ui * q_i - λ * p_u)
            q_i ← q_i + α * (e_ui * p_u - λ * q_i)
            
    • α is the learning rate.
    • This is a simple, fast, and popular method.
  2. Alternating Least Squares (ALS):

    • The objective function is not convex in both P and Q together. However, if we fix one matrix, the problem becomes a standard quadratic (least squares) problem for the other.
    • ALS alternates between these two steps until convergence:
      1. Fix P, solve for Q: Each q_i can be solved for independently using least squares.
      2. Fix Q, solve for P: Each p_u can be solved for independently using least squares.
    • ALS is deterministic and easy to parallelize, making it suitable for large-scale distributed systems.
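The SGD updates above can be sketched on a toy problem; this assumes NumPy, and the function name `factorize`, the hyperparameter values, and the 4x3 ratings matrix are all hypothetical:

```python
import numpy as np

def factorize(ratings, k=2, alpha=0.02, lam=0.02, epochs=500, seed=0):
    """SGD for R ~= P Q^T on the known entries (NaN marks missing ratings)."""
    m, n = ratings.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(m, k))     # user-factor matrix
    Q = rng.normal(scale=0.1, size=(n, k))     # item-factor matrix
    known = [(u, i) for u in range(m) for i in range(n)
             if not np.isnan(ratings[u, i])]   # the set K of observed pairs
    for _ in range(epochs):
        for u, i in known:
            e = ratings[u, i] - P[u] @ Q[i]    # prediction error e_ui
            pu = P[u].copy()                   # use the old p_u in both updates
            P[u] += alpha * (e * Q[i] - lam * P[u])
            Q[i] += alpha * (e * pu - lam * Q[i])
    return P, Q

# Hypothetical 4-user x 3-item ratings; NaN = unobserved.
R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [1.0, 1.0, 5.0],
              [np.nan, 1.0, 4.0]])
P, Q = factorize(R, k=2)
R_hat = P @ Q.T                                # predictions for every (u, i) pair
mask = ~np.isnan(R)
assert np.mean((R[mask] - R_hat[mask])**2) < 1.0   # fits the known entries
```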