Unit 2 - Subjective Questions
INT255 • Practice Questions with Detailed Answers
Define Eigen decomposition. Explain the geometric interpretation of eigenvectors and eigenvalues in the context of linear transformations.
Eigen decomposition is a way of factoring a square matrix into a set of its eigenvectors and eigenvalues. Specifically, for a square matrix A, if there exists a non-zero vector v and a scalar λ such that A v = λ v, then v is an eigenvector of A and λ is its corresponding eigenvalue.
Geometric Interpretation:
- Eigenvector (v): An eigenvector of a linear transformation is a special vector that, when the transformation is applied, only changes in magnitude, not in direction. It lies along the same span (or points in the opposite direction if λ < 0).
- Eigenvalue (λ): The eigenvalue associated with an eigenvector represents the factor by which the eigenvector is scaled during the linear transformation. If λ > 1, the vector is stretched; if 0 < λ < 1, it's shrunk; if λ < 0, its direction is reversed and it's scaled by |λ|.
In essence, eigenvectors define the "axes" or principal directions along which a linear transformation acts merely as a scaling operation, and eigenvalues quantify the extent of that scaling.
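The defining property A v = λ v is easy to check numerically. A minimal NumPy sketch, using a small symmetric matrix made up for illustration:

```python
import numpy as np

# Hypothetical 2x2 symmetric matrix for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eigh returns eigenvalues (ascending) and orthonormal eigenvectors
eigvals, eigvecs = np.linalg.eigh(A)

# Check the defining property A v = lambda v for each eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)
```

For this matrix the eigenvalues are 1 and 3: the transformation scales one eigenvector direction by 1 and stretches the other by 3.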
Discuss the primary limitations of Eigen decomposition when applied to general machine learning problems, particularly concerning matrix properties.
The primary limitations of Eigen decomposition in general machine learning problems stem from the strict requirements on the matrix A:
- Square Matrices Only: Eigen decomposition is only defined for square matrices (n × n). Many datasets in machine learning are represented as rectangular matrices (e.g., an m × n matrix of m samples and n features, where m ≠ n). This severely limits its direct applicability.
- Symmetry Requirement for Orthogonal Eigenvectors: For the eigenvectors to form an orthogonal (or orthonormal) basis, which is highly desirable for many applications (e.g., basis transformations), the matrix must be symmetric (A = A^T). While some important matrices in ML (like covariance matrices) are symmetric, many others are not.
- Existence of Real Eigenvalues/Eigenvectors: Not all square matrices have real eigenvalues and eigenvectors. Some might have complex eigenvalues, which complicates interpretation and use in real-world data analysis where real-valued data is prevalent.
- Non-diagonalizability: A matrix might not be diagonalizable, meaning it might not have a full set of linearly independent eigenvectors. This means it cannot be expressed in the form A = Q Λ Q^(-1), where Λ is a diagonal matrix of eigenvalues and Q holds the eigenvectors as columns.
Define Singular Value Decomposition (SVD). List and briefly describe the components it decomposes a matrix into.
Singular Value Decomposition (SVD) is a powerful matrix factorization technique that decomposes any rectangular matrix A (of dimensions m × n) into three matrices: A = U Σ V^T
Where:
- U: An m × m orthogonal matrix whose columns are the left singular vectors of A. These vectors form an orthonormal basis for the column space of A.
- Σ: An m × n diagonal matrix (or effectively diagonal, containing zeros outside the main diagonal) whose diagonal entries σ_i are the singular values of A. The singular values are always real and non-negative, and are typically ordered in decreasing magnitude. These values represent the "strength" or "importance" of the corresponding singular vectors.
- V^T: The transpose of an n × n orthogonal matrix V. The columns of V (or rows of V^T) are the right singular vectors of A. These vectors form an orthonormal basis for the row space of A.
SVD provides a stable and general factorization for any matrix, overcoming many limitations of Eigen decomposition.
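A small NumPy sketch of the decomposition, using an arbitrary 3 × 2 matrix chosen only to show the shapes of the three factors:

```python
import numpy as np

# Illustrative 3x2 rectangular matrix
A = np.arange(6, dtype=float).reshape(3, 2)

U, s, Vt = np.linalg.svd(A)      # full SVD: U is 3x3, s has 2 values, Vt is 2x2
Sigma = np.zeros_like(A)
Sigma[:2, :2] = np.diag(s)       # embed the singular values on the diagonal

assert np.allclose(U @ Sigma @ Vt, A)              # A = U Sigma V^T
assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)  # non-negative, decreasing
```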
Explain the geometric intuition behind Singular Value Decomposition (SVD). How can it be visualized?
The geometric intuition behind SVD is that any linear transformation (represented by a matrix A) can be decomposed into a sequence of three fundamental geometric operations:
- Rotation/Reflection: The matrix V^T rotates or reflects the original basis vectors in the input space (R^n). This aligns the axes of the input space with the directions where the transformation will have its most significant effect.
- Scaling: The diagonal matrix Σ scales these rotated axes. The singular values (σ_1 ≥ σ_2 ≥ …) quantify the amount of stretching or shrinking along these new principal directions. Larger singular values correspond to directions of greater variance or importance.
- Rotation/Reflection: The matrix U then rotates or reflects the scaled vectors into their final orientation in the output space (R^m).
Visualization: Imagine a unit sphere in the n-dimensional input space. When a matrix A transforms this sphere, it deforms it into an ellipsoid in the m-dimensional output space. The SVD reveals the principal axes of this ellipsoid. The right singular vectors (columns of V) are the original orthogonal directions in the input space that map to the principal semi-axes of the ellipsoid. The left singular vectors (columns of U) are the directions of these principal semi-axes in the output space. The singular values (diagonal elements of Σ) are the lengths of these principal semi-axes.
Derive the relationship between Singular Value Decomposition (SVD) and Eigen decomposition. Specifically, how can singular values and singular vectors be found through Eigen decomposition?
The SVD is intimately related to Eigen decomposition, particularly through the matrices A^T A and A A^T.
Given the SVD of a matrix A: A = U Σ V^T
Relating to A^T A (for Right Singular Vectors and Singular Values):
Consider the product A^T A: A^T A = (U Σ V^T)^T (U Σ V^T) = V Σ^T U^T U Σ V^T = V (Σ^T Σ) V^T
Since U is orthogonal, U^T U = I (the identity matrix). Also, Σ^T Σ is an n × n diagonal matrix whose non-zero entries are σ_i^2.
This equation is an Eigen decomposition of A^T A. Therefore:
- The columns of V (right singular vectors of A) are the eigenvectors of A^T A.
- The diagonal entries of Σ^T Σ (the σ_i^2) are the eigenvalues of A^T A. Thus, the singular values of A are the square roots of the eigenvalues of A^T A: σ_i = √λ_i.
Relating to A A^T (for Left Singular Vectors and Singular Values):
Consider the product A A^T: A A^T = (U Σ V^T)(U Σ V^T)^T = U Σ V^T V Σ^T U^T = U (Σ Σ^T) U^T
Since V is orthogonal, V^T V = I. Also, Σ Σ^T is an m × m diagonal matrix whose non-zero entries are σ_i^2.
This equation is an Eigen decomposition of A A^T. Therefore:
- The columns of U (left singular vectors of A) are the eigenvectors of A A^T.
- The diagonal entries of Σ Σ^T are the eigenvalues of A A^T. Thus, the singular values of A are also the square roots of the (non-zero) eigenvalues of A A^T: σ_i = √λ_i.
This shows that SVD can be derived by performing Eigen decomposition on the covariance-like matrices A^T A and A A^T.
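The relationship σ_i = √λ_i can be verified numerically; a short NumPy sketch (the random matrix is chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))      # arbitrary rectangular matrix

s = np.linalg.svd(A, compute_uv=False)        # singular values (decreasing)
eig_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]   # eigenvalues of A^T A (descending)

# Singular values are square roots of the eigenvalues of A^T A
assert np.allclose(s, np.sqrt(eig_AtA))

# A A^T shares the same non-zero eigenvalues
eig_AAt = np.linalg.eigvalsh(A @ A.T)[::-1][:3]
assert np.allclose(s, np.sqrt(np.clip(eig_AAt, 0, None)))
```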
Explain the concept of low-rank approximation using SVD. How is it beneficial for data compression and noise reduction?
Low-rank approximation using SVD involves constructing a matrix A_k that is an approximation of the original matrix A, but with a significantly smaller rank k. It is achieved by keeping only the k largest singular values and their corresponding left and right singular vectors. Mathematically, if A = U Σ V^T, then A_k = U_k Σ_k V_k^T, where U_k contains the first k columns of U, Σ_k is the k × k diagonal matrix of the top k singular values, and V_k^T contains the first k rows of V^T.
Benefits:
- Data Compression: By retaining only a few singular values and vectors, the amount of data required to represent the matrix is drastically reduced. Instead of storing m × n elements, we store m × k elements for U_k, k elements for Σ_k, and k × n elements for V_k^T. For small k, this is a significant saving. This is particularly useful in image compression or storing large datasets.
- Noise Reduction: Real-world data often contains noise, which tends to manifest in directions associated with smaller singular values. By truncating the SVD and discarding these smaller singular values, we effectively remove or significantly reduce the impact of noise, leading to a "cleaner" representation of the underlying data structure. This is because the largest singular values typically capture the most significant, robust patterns in the data, while smaller ones often correspond to minor variations or noise.
The Eckart-Young theorem states that A_k is the best rank-k approximation to A in terms of the Frobenius norm.
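A sketch of the idea on synthetic data (a made-up rank-2 "signal" plus small noise), showing that the truncated reconstruction is closer to the underlying signal than the noisy matrix is:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic "signal + noise" matrix: rank-2 structure plus small noise
signal = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
noisy = signal + 0.01 * rng.standard_normal((50, 40))

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncation

# The rank-2 reconstruction recovers the signal better than the noisy matrix
err_k = np.linalg.norm(A_k - signal)
err_full = np.linalg.norm(noisy - signal)
assert err_k < err_full
```

Discarding the small singular values here throws away mostly noise, which is exactly the denoising effect described above.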
Describe the objective of Principal Component Analysis (PCA) from a geometric perspective. How does it achieve dimensionality reduction?
From a geometric perspective, the objective of Principal Component Analysis (PCA) is to find a new set of orthogonal axes (principal components) that capture the maximum variance in the data. Imagine a cloud of data points in a high-dimensional space.
Geometric Objective:
- First Principal Component: Find the direction (a vector) in the original high-dimensional space along which the data points exhibit the greatest variance. This direction becomes the first principal component.
- Subsequent Principal Components: Find a second direction, orthogonal to the first, that captures the maximum remaining variance. Continue this process for subsequent components, each orthogonal to the previous ones and capturing the maximal variance in the residual data.
Dimensionality Reduction:
PCA achieves dimensionality reduction by projecting the high-dimensional data onto a lower-dimensional subspace spanned by the top principal components. Since these principal components are chosen to capture the most variance, projecting data onto them minimizes the loss of information (variance) while reducing the number of dimensions. The data points, when projected onto these new axes, are best separated and their intrinsic structure is preserved as much as possible with fewer dimensions. This also means minimizing the reconstruction error when projecting back to the original space.
Derive the first principal component using an optimization approach, specifically by maximizing variance. Assume the data is centered.
Let our dataset be X, an n × d matrix where n is the number of samples and d is the number of features. Assume X is centered (the mean of each feature is zero). Our goal is to find a direction vector w (a unit vector, ||w|| = 1) such that the variance of the projected data along this direction is maximized.
- Projection: The projection of each data point x_i onto the direction w is z_i = w^T x_i.
- Variance of Projected Data: Since the data is centered, the mean of the projected data is 0. The variance of the projected data is given by: Var(z) = (1/n) Σ_i (w^T x_i)^2
This can be written in matrix form as: Var(z) = (1/n) w^T X^T X w
Let C = (1/n) X^T X be the covariance matrix of the data (since X is centered). So, we want to maximize w^T C w.
- Optimization Problem: We want to maximize w^T C w subject to the constraint w^T w = 1 (because w must be a unit vector).
We can use a Lagrange multiplier to solve this: L(w, λ) = w^T C w − λ (w^T w − 1)
- Differentiate and Set to Zero: Take the derivative with respect to w and set it to zero: ∂L/∂w = 2 C w − 2 λ w = 0, which gives C w = λ w
- Interpretation: This is the eigenvalue equation! It means that w must be an eigenvector of the covariance matrix C, and λ is its corresponding eigenvalue.
- Maximizing Variance: To maximize the variance, we substitute C w = λ w: w^T C w = w^T (λ w) = λ w^T w
Since w^T w = 1, we have Var(z) = λ. Therefore, to maximize the variance, we must choose the eigenvector w that corresponds to the largest eigenvalue λ.
Thus, the first principal component is the eigenvector of the covariance matrix with the largest eigenvalue, representing the direction of maximum variance in the data.
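The result can be checked numerically on made-up 2-D data: the variance of the projection onto the top eigenvector equals the largest eigenvalue, and no other unit direction does better:

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated 2-D toy data (made up for the example)
X = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)               # center the data

C = X.T @ X / (len(X) - 1)           # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
w1 = eigvecs[:, -1]                  # eigenvector with the largest eigenvalue

# Variance of the projection w^T x equals the top eigenvalue
proj_var = np.var(X @ w1, ddof=1)
assert np.isclose(proj_var, eigvals[-1])

# No other unit direction yields a larger projected variance (spot check)
for _ in range(100):
    w = rng.standard_normal(2)
    w /= np.linalg.norm(w)
    assert np.var(X @ w, ddof=1) <= eigvals[-1] + 1e-9
```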
Explain the relationship between PCA and SVD. How can SVD be used to perform PCA?
PCA and SVD are very closely related, and SVD provides a robust and numerically stable way to compute PCA. In fact, PCA can be seen as a specific application of SVD to the centered data matrix.
Let X be an n × d data matrix where n is the number of samples and d is the number of features, and assume X is centered (the mean of each column is zero). The covariance matrix is C = X^T X / (n − 1) (or X^T X / n for the population covariance).
From the PCA derivation, the principal components are the eigenvectors of the covariance matrix .
Now, consider the SVD of the centered data matrix X: X = U Σ V^T
From the relationship between SVD and Eigen decomposition, we know:
- The columns of V (right singular vectors of X) are the eigenvectors of X^T X.
- The squares of the singular values (σ_i^2) are proportional to the eigenvalues of C (specifically, σ_i^2 / (n − 1) are the eigenvalues of C).
Therefore:
- Principal Components: The principal components (the directions of maximum variance) are precisely the right singular vectors (columns of V) of the centered data matrix X.
- Variances Explained: The variance explained by each principal component is proportional to the square of its corresponding singular value (i.e., λ_i = σ_i^2 / (n − 1)).
How SVD is used for PCA:
To perform PCA using SVD:
- Center the Data: Subtract the mean of each feature from its respective column in the data matrix X.
- Compute SVD: Compute the SVD of the centered data matrix: X = U Σ V^T.
- Extract Principal Components: The columns of V are the principal components. They are already sorted by the amount of variance they explain (corresponding to the decreasing singular values in Σ).
- Project Data: To reduce dimensionality, select the first k columns of V (the top k principal components), denoted as V_k. The projected data in the new lower-dimensional space is Z = X V_k.
SVD is often preferred over directly computing the Eigen decomposition of the covariance matrix because it is more numerically stable, especially for large matrices, and can handle rectangular data matrices directly without explicitly forming X^T X (which can be numerically problematic for very wide or very tall matrices).
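The steps above can be sketched as a small function (the data is random and purely illustrative):

```python
import numpy as np

def pca_svd(X, k):
    """PCA of an (n x d) data matrix via SVD of the centered matrix."""
    Xc = X - X.mean(axis=0)                             # 1. center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # 2. SVD
    components = Vt[:k]                                 # 3. top-k principal components
    explained_var = s[:k] ** 2 / (len(X) - 1)           # variance along each PC
    return Xc @ components.T, components, explained_var # 4. projected data

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
Z, V_k, var = pca_svd(X, k=2)

assert Z.shape == (200, 2)
# The explained variances match the top eigenvalues of the covariance matrix
C = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(C)[::-1]
assert np.allclose(var, eigvals[:2])
```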
Discuss the criteria and methods for choosing the optimal number of principal components (k) in PCA for dimensionality reduction.
Choosing the optimal number of principal components (k) is crucial in PCA, balancing data compression/noise reduction with information retention. Several methods are commonly used:
- Scree Plot (Elbow Method):
  - Plot the eigenvalues (or percentage of variance explained) in decreasing order against the number of principal components.
  - Look for an "elbow" in the plot, where the rate of decrease of eigenvalues sharply changes. This point suggests that additional components contribute less significant variance and might primarily capture noise.
- Cumulative Explained Variance:
  - Calculate the cumulative sum of the percentage of variance explained by each principal component.
  - Choose k such that a desired percentage of total variance is retained (e.g., 90% or 95%). This is a common and quantitative approach.
- Kaiser's Criterion:
  - Retain only those principal components whose eigenvalues are greater than 1 (if using the correlation matrix) or greater than the average eigenvalue (if using the covariance matrix with standardized features). The rationale is that if an eigenvalue is less than 1, that component explains less variance than a single original standardized variable.
- Cross-Validation:
  - Treat k as a hyperparameter and evaluate the performance of a downstream machine learning model (e.g., classifier, regressor) using cross-validation with varying k. Select the k that yields the best model performance.
- Domain Knowledge/Interpretability:
  - Sometimes, domain experts might suggest a reasonable number of components based on prior knowledge of the data and what underlying factors are expected. The chosen components should ideally be interpretable if that's a goal.
- Computational Constraints:
  - In cases of very high-dimensional data, practical computational limits might dictate a smaller k, even if more variance could theoretically be captured.
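The cumulative-explained-variance rule can be sketched as follows, on synthetic data with three strong directions embedded in 10 dimensions (function name, threshold, and data are made up for the example):

```python
import numpy as np

def choose_k(X, threshold=0.95):
    """Smallest k whose principal components retain `threshold` of total variance."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    ratios = s**2 / np.sum(s**2)        # fraction of variance per component
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(4)
# Low-rank data: 3 strong directions embedded in 10-D, plus small noise
X = rng.standard_normal((300, 3)) @ rng.standard_normal((3, 10))
X += 0.01 * rng.standard_normal((300, 10))

k = choose_k(X, threshold=0.95)
# The three signal directions carry essentially all of the variance
assert 1 <= k <= 3
```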
Compare and contrast Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), highlighting their primary objectives and when each is preferred.
PCA and LDA are both dimensionality reduction techniques, but they operate with different objectives and assumptions:
Principal Component Analysis (PCA):
- Objective: To find directions (principal components) that maximize the total variance in the data. It seeks to capture the most significant data variability regardless of class labels. It is an unsupervised technique.
- Perspective: Looks at the overall structure of the data and aims to project it onto a lower-dimensional subspace where the data points are most spread out.
- What it does: Identifies a new set of orthogonal axes (principal components) that explain the maximum amount of variance in the dataset.
- Input: Requires only the data matrix X.
- Use Cases: Data compression, noise reduction, visualization, feature engineering when class labels are unknown or irrelevant to the dimensionality reduction goal.
- Preference: Preferred when the goal is to discover underlying patterns, reduce data complexity, or when class separability is not the primary concern, or when labels are unavailable.
Linear Discriminant Analysis (LDA):
- Objective: To find directions (linear discriminants) that maximize the separability between known classes while minimizing the variance within each class. It is a supervised technique.
- Perspective: Seeks a projection that maximizes the ratio of between-class variance to within-class variance.
- What it does: Identifies a new set of axes that best separate the different classes in the dataset.
- Input: Requires both the data matrix X and the class labels y.
- Use Cases: Classification preprocessing, feature extraction for classification tasks, face recognition, medical diagnosis.
- Preference: Preferred when the goal is to enhance class separability and the dataset has known class labels. It often leads to better classification performance than PCA if the assumption of normal distribution and equal covariance matrices holds.
Key Differences Summarized:
- Supervised vs. Unsupervised: LDA is supervised (uses class labels), PCA is unsupervised (does not use labels).
- Objective: PCA maximizes total variance; LDA maximizes class separability.
- Max Components: PCA can have up to d components (the number of original features); LDA can have at most C − 1 components, where C is the number of classes.
- Assumptions: LDA makes assumptions about data distribution (e.g., Gaussian, equal covariances), while PCA is non-parametric.
In essence, PCA is about finding the best data representation in terms of variance, while LDA is about finding the best data representation for classification.
Explain the core idea behind Linear Discriminant Analysis (LDA) and describe the concepts of within-class scatter matrix (S_W) and between-class scatter matrix (S_B).
The core idea behind Linear Discriminant Analysis (LDA) is to find a linear transformation that projects high-dimensional data onto a lower-dimensional space, such that the data points of different classes are well-separated, while data points belonging to the same class are clustered together. Unlike PCA which focuses on maximizing total variance, LDA explicitly considers class labels to find features that are most discriminant.
To achieve this, LDA aims to:
- Maximize the distance between the means of different classes in the projected space.
- Minimize the variance within each class in the projected space.
These goals are quantified using scatter matrices:
- Within-Class Scatter Matrix (S_W):
  - S_W measures the scatter (variance) of samples within each individual class. It quantifies how spread out the data points are around their respective class means.
  - A small S_W indicates that samples within each class are tightly clustered.
  - It is calculated by summing the scatter matrices of each class.
  - For C classes, S_W = Σ_k S_k, where S_k = Σ_{x_i in class k} (x_i − μ_k)(x_i − μ_k)^T, and μ_k is the mean of class k.
- Between-Class Scatter Matrix (S_B):
  - S_B measures the scatter of the class means around the overall mean of the entire dataset. It quantifies how well-separated the different class means are from each other.
  - A large S_B indicates that the class means are far apart, implying good class separability.
  - It is calculated based on the difference between each class mean and the global mean of all data points.
  - S_B = Σ_k n_k (μ_k − μ)(μ_k − μ)^T, where n_k is the number of samples in class k, μ_k is the mean of class k, and μ is the global mean of all data.
LDA seeks to find a projection matrix W that maximizes the ratio of between-class scatter to within-class scatter, J(W) = |W^T S_B W| / |W^T S_W W|. The columns of W are the linear discriminants, which are the eigenvectors corresponding to the largest eigenvalues of S_W^(-1) S_B.
Describe the steps involved in performing Linear Discriminant Analysis (LDA) for a classification task.
Performing Linear Discriminant Analysis (LDA) typically involves the following steps:
- Calculate Class Means:
  - For each class k (out of C classes), calculate the mean vector μ_k of the samples belonging to that class. Each μ_k is a d-dimensional vector if the original data has d features.
- Calculate Global Mean:
  - Calculate the overall mean vector μ of all samples across all classes in the dataset.
- Compute Within-Class Scatter Matrix (S_W):
  - For each class k, compute its scatter matrix S_k = Σ_{x_i in class k} (x_i − μ_k)(x_i − μ_k)^T.
  - Sum these individual class scatter matrices to get the total within-class scatter matrix: S_W = Σ_k S_k.
- Compute Between-Class Scatter Matrix (S_B):
  - For each class k, compute the contribution of that class to the between-class scatter: n_k (μ_k − μ)(μ_k − μ)^T, where n_k is the number of samples in class k.
  - Sum these contributions to get the total between-class scatter matrix: S_B = Σ_k n_k (μ_k − μ)(μ_k − μ)^T.
- Solve the Generalized Eigenvalue Problem:
  - Find the eigenvectors w and eigenvalues λ that satisfy the equation: S_B w = λ S_W w
  - This is equivalent to solving for the eigenvectors of S_W^(-1) S_B. Note that S_W must be invertible. If it's singular, techniques like the pseudo-inverse or regularization are used.
- Select Linear Discriminants:
  - Order the eigenvectors (linear discriminants) based on their corresponding eigenvalues in decreasing order. The eigenvectors with the largest eigenvalues are the most discriminant.
  - Select the top k eigenvectors (typically k ≤ C − 1, where C is the number of classes) to form the projection matrix W.
- Project Data:
  - Transform the original d-dimensional data points into the new k-dimensional subspace using the projection matrix: Y = X W.
The projected data can then be used for classification tasks (e.g., with a simple classifier like k-NN or SVM) as it emphasizes class separability.
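The steps above can be sketched in NumPy (the two Gaussian classes are made up for the example, and the function name is illustrative):

```python
import numpy as np

def lda_fit(X, y, k):
    """Linear discriminants via the eigenproblem for S_W^(-1) S_B."""
    d = X.shape[1]
    mu = X.mean(axis=0)                           # global mean
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                    # class mean
        S_W += (Xc - mu_c).T @ (Xc - mu_c)        # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)          # between-class scatter
    # Eigenvectors of S_W^(-1) S_B, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:k]]             # projection matrix W (d x k)

rng = np.random.default_rng(5)
# Two made-up Gaussian classes separated along the first axis
X = np.vstack([rng.standard_normal((100, 4)) + [4, 0, 0, 0],
               rng.standard_normal((100, 4))])
y = np.array([0] * 100 + [1] * 100)

W = lda_fit(X, y, k=1)        # at most C - 1 = 1 discriminant for two classes
Z = X @ W                     # project onto the discriminant
# The projection separates the classes: means differ by much more than the spread
gap = abs(Z[:100].mean() - Z[100:].mean())
assert gap > 2.5 * max(Z[:100].std(), Z[100:].std())
```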
Discuss a significant limitation of Linear Discriminant Analysis (LDA) and explain why it can be problematic in certain real-world datasets.
A significant limitation of Linear Discriminant Analysis (LDA) is its inherent assumption that the covariance matrices of all classes are equal (or very similar). This is often referred to as homoscedasticity.
Why this is problematic in real-world datasets:
- Non-Spherical/Differing Variances: In many real-world datasets, the distribution of features for different classes can vary significantly. Some classes might be tightly clustered (small covariance), while others might be widely spread out (large covariance). Additionally, the shape of the clusters (e.g., spherical vs. elongated) might differ across classes, implying non-equal covariance matrices.
- Suboptimal Projections: If the class covariance matrices are substantially different, LDA's assumption of a shared within-class scatter matrix (S_W) might not accurately represent the true within-class variability. Consequently, the projection found by LDA (which optimizes for the ratio of between-class to within-class scatter based on this averaged S_W) might not be optimal for separating classes with distinct covariance structures.
- Performance Degradation: When the equal covariance assumption is violated, the decision boundaries learned by LDA (which are linear) may not effectively separate the classes, leading to suboptimal classification performance. More complex models like Quadratic Discriminant Analysis (QDA), which allow each class to have its own covariance matrix, would be more appropriate in such scenarios, but they come with a higher risk of overfitting due to more parameters.
- Sensitivity to Outliers: Like many mean-based methods, LDA can be sensitive to outliers, which can distort the class means and covariance estimates, affecting the calculated scatter matrices.
Explain how matrix factorization is utilized in collaborative filtering for recommendation systems. Provide a high-level overview of the process.
Matrix factorization is a fundamental technique in collaborative filtering for recommendation systems. The core idea is to decompose the sparse user-item interaction matrix into two lower-rank matrices representing latent features for users and items.
High-level Overview:
- User-Item Interaction Matrix (R): A large, sparse matrix is constructed where rows represent users, columns represent items, and entries represent a user's explicit (e.g., ratings) or implicit (e.g., purchases, views) interaction with an item. Most entries are missing (unknown) because a user interacts with only a small fraction of available items.
- Factorization into Latent Features: Matrix factorization aims to approximate this matrix by multiplying two much smaller, dense matrices, R ≈ P Q:
  - P: A user-feature matrix (m × f, where m is the number of users and f is the number of latent features). Each row p_u represents user u's preferences across the latent features.
  - Q: An item-feature matrix (f × n, where n is the number of items). Each column q_i represents item i's characteristics across the latent features.
  The number of latent features f is chosen to be significantly smaller than m or n.
- Latent Features: These latent features are not predefined but learned from the data. They can represent abstract concepts like "genre preference," "action-packed vs. romantic," "intellectual vs. casual," or other underlying dimensions that explain user preferences and item attributes.
- Prediction and Recommendation: Once P and Q are learned, the rating or preference of user u for item i can be predicted by taking the dot product of their respective latent feature vectors: r̂_ui = p_u^T q_i
  For any user u, items they haven't interacted with yet are scored, and the top-scoring items are recommended.
- Learning (Optimization): The matrices P and Q are learned by minimizing a loss function (e.g., squared error) between the predicted ratings r̂_ui and the actual known ratings r_ui, usually with regularization terms to prevent overfitting to the sparse known data: min Σ_{(u,i) known} (r_ui − p_u^T q_i)^2 + λ (||p_u||^2 + ||q_i||^2)
  Optimization algorithms like Stochastic Gradient Descent (SGD) or Alternating Least Squares (ALS) are commonly used to solve this.
This approach effectively discovers the underlying factors that drive user preferences and item similarities, leading to highly personalized recommendations.
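A minimal SGD sketch of this optimization on a tiny made-up ratings matrix (hyperparameters and data are illustrative, not tuned; here the item factors are stored row-wise, so predictions are P Q^T):

```python
import numpy as np

def factorize(R, mask, f=2, lr=0.02, reg=0.01, epochs=2000, seed=0):
    """Learn P (users x f) and Q (items x f) by SGD on the observed entries."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = 0.1 * rng.standard_normal((m, f))
    Q = 0.1 * rng.standard_normal((n, f))
    users, items = np.nonzero(mask)
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = R[u, i] - P[u] @ Q[i]              # error on a known rating
            P[u] += lr * (err * Q[i] - reg * P[u])   # gradient step + L2 penalty
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

# Tiny made-up ratings matrix; 0 marks a missing entry
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)
mask = R > 0

P, Q = factorize(R, mask)
pred = P @ Q.T               # predictions for ALL entries, including missing ones
# Known entries are fit closely after training
assert np.max(np.abs((pred - R)[mask])) < 0.5
```

The unobserved entries of `pred` are the model's inferred preferences and would be ranked to produce recommendations.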
Describe the advantages of using SVD-based matrix factorization for personalized recommendations, specifically mentioning the role of latent features.
SVD-based matrix factorization (and its variations) offers several significant advantages for personalized recommendations, primarily due to its ability to uncover latent features:
- Identification of Latent Features: SVD naturally decomposes the user-item interaction matrix into meaningful, albeit abstract, "latent features" or "factors." These factors represent underlying dimensions that explain why users prefer certain items. For example, in movies, latent features might correspond to genre, actors, director, mood, or target audience. These features are not explicitly defined but are learned from the data.
- Handling Sparsity: Real-world recommendation matrices are extremely sparse (most users interact with very few items). SVD-based methods excel at generalizing from the limited observed interactions to fill in the missing entries. By reducing the dimensionality, it effectively finds a lower-rank approximation that captures the main patterns, implicitly inferring unknown preferences.
- Personalization: By representing each user and item as a vector in a shared latent feature space, SVD allows for highly personalized recommendations. A user's preference for an item is estimated by the similarity (e.g., dot product) between their respective latent feature vectors. This directly measures how well an item's latent attributes match a user's latent preferences.
- Scalability (with modifications): While a naive SVD on a very large, sparse matrix can be computationally expensive, modern matrix factorization techniques inspired by SVD (like Funk SVD, Asymmetric SVD, etc.) use iterative optimization methods (e.g., SGD, ALS) that efficiently learn the user and item latent factor matrices even for massive datasets.
- Addressing Cold Start (partially): While not a complete solution, latent features can help with the cold start problem to some extent. For new users, recommendations can be made if they rate a few items, as their latent feature vector can be quickly estimated. For new items, if they have some metadata, they can be placed into the latent feature space, allowing existing users to receive recommendations. If completely new and no information is available, it's still challenging.
- Improved Accuracy: By capturing the latent structure in the data, SVD-based models can often provide more accurate and relevant recommendations compared to simpler methods like neighborhood-based collaborative filtering, especially for diverse preferences.
In essence, latent features act as a powerful intermediate representation that effectively models complex user-item relationships, leading to robust and insightful recommendations.
Define the concept of an orthogonal matrix. Explain its significance in the context of Eigen decomposition and SVD.
An orthogonal matrix Q is a square matrix whose columns (and rows) are orthonormal vectors. This means:
- Each column vector has a Euclidean norm of 1 (||q_i|| = 1).
- Any two distinct column vectors are orthogonal (their dot product is 0: q_i^T q_j = 0 for i ≠ j).
Mathematically, an orthogonal matrix satisfies the property Q^T Q = Q Q^T = I, where I is the identity matrix. This implies that Q^(-1) = Q^T.
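These properties are easy to verify for the orthogonal factors returned by an SVD (the matrix is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(A)

I_m, I_n = np.eye(4), np.eye(3)
assert np.allclose(U.T @ U, I_m) and np.allclose(U @ U.T, I_m)   # U orthogonal
assert np.allclose(Vt @ Vt.T, I_n)                               # V orthogonal

# Orthogonal maps preserve lengths: ||Q x|| = ||x||
x = rng.standard_normal(4)
assert np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x))
```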
Significance:
- Eigen decomposition: For a symmetric matrix A, its Eigen decomposition is A = Q Λ Q^T, where Q is an orthogonal matrix whose columns are the eigenvectors of A, and Λ is a diagonal matrix of eigenvalues. The orthogonality of Q is crucial because it ensures that the eigenvectors form an orthonormal basis, meaning they are mutually independent and span the entire space without redundancy. This simplifies transformations and ensures numerical stability.
- Singular Value Decomposition (SVD): In A = U Σ V^T, both U and V are orthogonal matrices. Their orthogonality is fundamental because:
  - U contains the left singular vectors, forming an orthonormal basis for the column space of A. This means the vectors in U are perfectly aligned to describe the transformed data's directions.
  - V contains the right singular vectors, forming an orthonormal basis for the row space of A. These vectors define the principal directions in the original data space.
  - The orthogonality guarantees that rotations and reflections are preserved (no shear or skewing) when applying these transformations, making the decomposition geometrically interpretable as a sequence of rotation, scaling, and another rotation. It also ensures that information is preserved during these transformations and back-transformations, since U^T U = I and V^T V = I. This is vital for applications like low-rank approximation, where we want to accurately reconstruct the original data from its reduced representation.
Explain why Eigen decomposition is generally not suitable for dimensionality reduction of rectangular data matrices, and how SVD addresses this limitation.
Eigen decomposition is primarily defined for square matrices. Its factorization A = Q Λ Q^(-1) requires A to be n × n. Many real-world datasets in machine learning are represented by rectangular matrices (e.g., m samples and n features, where m ≠ n). If we have a data matrix X of size m × n, we cannot directly apply Eigen decomposition to X.
Even if we try to work around this by forming the covariance matrix X^T X (which is square, n × n) or X X^T (which is square, m × m), this has limitations:
- Computational Cost: For very large m or n, forming X^T X or X X^T explicitly can be computationally very expensive and memory-intensive, potentially leading to numerical instability due to precision errors.
- Loss of Information/Focus: While X^T X provides information about feature variance and covariance (used in PCA), directly working with X itself is often more desirable for general matrix factorization.
How SVD Addresses This Limitation:
Singular Value Decomposition (SVD) is specifically designed to work with any rectangular matrix A (of size m × n). It decomposes A into A = U Σ V^T, where:
- U is m × m, Σ is m × n, and V^T is n × n.
- Crucially, SVD does not require the input matrix to be square or symmetric.
- It directly provides an orthonormal basis for both the row space (via V) and the column space (via U) of the original, possibly rectangular, data matrix.
This general applicability makes SVD immensely powerful for dimensionality reduction of any dataset. For instance, in PCA, instead of computing the covariance matrix and then its Eigen decomposition, one can directly compute the SVD of the centered data matrix. The right singular vectors (columns of V) provide the principal components, and the singular values determine the variance explained by each component. This approach is numerically more stable and efficient for large, rectangular data.
Describe two real-world applications of matrix factorization techniques (beyond recommendation systems) in machine learning or data analysis.
Beyond recommendation systems, matrix factorization techniques, particularly SVD, have wide-ranging applications:
- Image Compression and Denoising:
- Application: SVD is highly effective for compressing and denoising images. An image can be represented as a matrix (or multiple matrices for color channels). By performing SVD and using a low-rank approximation, one can reconstruct the image using significantly fewer values.
- How it works: For an image matrix A, we compute its SVD A = UΣVᵀ. By keeping only the top k singular values (and the corresponding left and right singular vectors), we get an approximated matrix A_k = U_k Σ_k V_kᵀ. The largest singular values typically capture the most significant features and structures of the image, while smaller ones often correspond to noise or fine details. Truncating these smaller singular values leads to a compressed (fewer components needed for storage) and denoised (noise-related components are discarded) version of the image.
- Topic Modeling (e.g., Latent Semantic Analysis - LSA):
- Application: In Natural Language Processing (NLP), matrix factorization is used to uncover hidden thematic structures (topics) within a collection of documents. This is the core idea behind Latent Semantic Analysis (LSA).
- How it works: A document-term matrix is constructed, where rows represent documents and columns represent terms (words), and entries represent term importance (e.g., raw frequency or a TF-IDF weight). This matrix is often very high-dimensional and sparse. Applying SVD to this matrix (A = UΣVᵀ) helps in:
- Dimensionality Reduction: Reducing the number of dimensions from thousands of terms to a few hundred or fewer latent "topics" (captured by the top k singular values and vectors).
- Semantic Grouping: The latent features (topics) captured by SVD can group semantically related words and documents, even if they don't explicitly share common words. For instance, documents about "cars" and "automobiles" might be linked by a common latent topic. This allows for better document retrieval, clustering, and summarization by focusing on the underlying meaning rather than just keywords.
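Both applications rest on the same truncated-SVD step. The following is a minimal NumPy sketch with a synthetic low-rank "image" (the sizes, rank, and noise level are illustrative assumptions); it also checks the known identity that the reconstruction error equals the energy of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical "image": a rank-2 pattern plus a little noise.
pattern = np.outer(rng.normal(size=60), rng.normal(size=40)) \
        + np.outer(rng.normal(size=60), rng.normal(size=40))
A = pattern + 0.01 * rng.normal(size=(60, 40))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the top-k singular triplets
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Storage: k*(m + n + 1) numbers instead of m*n.
compressed = k * (60 + 40 + 1)
original = 60 * 40

# Frobenius reconstruction error = sqrt of the discarded singular values' energy.
err = np.linalg.norm(A - A_k, 'fro')
print(compressed, original)              # far fewer numbers stored
print(err, np.sqrt(np.sum(s[k:]**2)))    # these two agree
```

For LSA the matrix would instead be a document-term matrix, but the truncation step is identical.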
When might PCA fail or provide suboptimal results, and what are its inherent limitations?
While PCA is a powerful dimensionality reduction technique, it has certain limitations and can provide suboptimal results in specific scenarios:
- Linearity Assumption: PCA assumes that the principal components are linear combinations of the original features. If the underlying data structure is inherently non-linear (e.g., data lying on a manifold), PCA may fail to capture the true low-dimensional structure. Kernel PCA is an extension that addresses this.
- Variance Maximization Focus: PCA's objective is solely to maximize variance. It does not consider class labels (it is unsupervised). If the directions of maximum variance do not align with the directions that best separate different classes, PCA might project data into a lower-dimensional space where classes are less separable, leading to poor performance in classification tasks. In such cases, Linear Discriminant Analysis (LDA), which is supervised, is often preferred.
- Sensitivity to Scale: PCA is sensitive to the scaling of the features. Features with larger variances will naturally contribute more to the first principal components. It's often crucial to standardize (scale) features before applying PCA to prevent features with larger numerical ranges from dominating the principal components purely due to their magnitude.
- Interpretability Issues: While principal components offer directions of maximum variance, their interpretability can be challenging. Each component is a linear combination of all original features, making it difficult to assign a simple meaning to a specific principal component, especially when many features contribute.
- Outlier Sensitivity: PCA is based on covariance calculations, which are sensitive to outliers. Outliers can significantly skew the calculated principal components, leading to an inaccurate representation of the underlying data structure.
- Loss of Information (potentially): While PCA aims to minimize information loss in terms of variance, some information might still be lost, especially if the discarded components contain subtle but important signals (e.g., relevant for a specific task but not contributing much to overall variance).
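The sensitivity-to-scale limitation is easy to demonstrate numerically. A minimal NumPy sketch with synthetic data (feature scales and sample size are arbitrary assumptions): a pure-noise feature measured in large units hijacks the first principal component unless the features are standardized first.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# Two informative features on similar scales, one noise feature in huge units.
f1 = rng.normal(0, 1, n)
f2 = f1 + 0.1 * rng.normal(0, 1, n)     # strongly correlated with f1
f3 = rng.normal(0, 1000, n)             # pure noise, but enormous variance
X = np.column_stack([f1, f2, f3])

def first_pc(M):
    """First principal component via SVD of the centered matrix."""
    Mc = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Vt[0]

# Without scaling: the large-variance noise feature dominates PC1.
print(np.abs(first_pc(X)))              # loading on f3 is near 1

# After standardization: PC1 recovers the shared f1/f2 direction.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.abs(first_pc(Xs)))             # loadings on f1 and f2 dominate
```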