1. What is the fundamental equation relating a matrix A, an eigenvector v, and its corresponding eigenvalue λ?
Eigen decomposition and its limitations in ML
Easy
A. Av = λv
B. Av = v + λ
C. A^-1 v = λv
D. Av = λ^2 v
Correct Answer: Av = λv
Explanation:
An eigenvector of a matrix A is a non-zero vector v that, when multiplied by A, results in a scaled version of itself: Av = λv. The scaling factor λ is the eigenvalue.
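The defining equation can be checked numerically. A minimal sketch with NumPy (the matrix values here are illustrative, not taken from the quiz):

```python
import numpy as np

# A small symmetric matrix (illustrative values).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns eigenvalues w and eigenvectors as the COLUMNS of V.
w, V = np.linalg.eig(A)

# Verify the defining equation A v = lambda v for each eigenpair.
for i in range(len(w)):
    assert np.allclose(A @ V[:, i], w[i] * V[:, i])
```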
2. A major limitation of eigen decomposition is that it can only be applied to what kind of matrices?
Eigen decomposition and its limitations in ML
Easy
A. Zero matrices
B. Rectangular matrices
C. Square matrices
D. Identity matrices
Correct Answer: Square matrices
Explanation:
Eigen decomposition is defined only for square (n × n) matrices, which limits its direct application in many machine learning scenarios where data matrices are often rectangular.
3. In the eigen decomposition of a matrix A as A = PDP^-1, what does the diagonal matrix D contain?
Eigen decomposition and its limitations in ML
Easy
A. The eigenvalues of A
B. The eigenvectors of A
C. The singular values of A
D. The inverse of A
Correct Answer: The eigenvalues of A
Explanation:
The eigen decomposition of a matrix A factors it into A = PDP^-1, where P is the matrix whose columns are the eigenvectors of A and D is a diagonal matrix containing the corresponding eigenvalues.
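The factorization can be verified directly. A sketch, assuming a small diagonalizable matrix with distinct eigenvalues (illustrative values):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

w, P = np.linalg.eig(A)   # eigenvalues in w, eigenvectors in the columns of P
D = np.diag(w)            # D: diagonal matrix of eigenvalues

# Reconstruct A from its eigen decomposition A = P D P^-1.
A_rec = P @ D @ np.linalg.inv(P)
assert np.allclose(A, A_rec)
```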
4. What do the eigenvectors of a covariance matrix represent in the context of data?
Eigen decomposition and its limitations in ML
Easy
A. The mean of the data
B. The number of data points
C. The directions of maximum variance in the data
D. The median of the data
Correct Answer: The directions of maximum variance in the data
Explanation:
The eigenvectors of a covariance matrix point in the directions of the greatest variance (spread) of the data. The corresponding eigenvalues indicate the magnitude of this variance. This is the core idea behind PCA.
5. Singular Value Decomposition (SVD) can be applied to which type of matrices?
Singular value decomposition (SVD)
Easy
A. Only diagonal matrices
B. Only square matrices
C. Only symmetric matrices
D. Any rectangular matrix
Correct Answer: Any rectangular matrix
Explanation:
Unlike eigen decomposition, SVD is a general factorization that can be applied to any rectangular matrix, making it more widely applicable in machine learning.
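A quick sketch of this generality: NumPy factors a rectangular matrix that has no eigen decomposition at all.

```python
import numpy as np

# A 4 x 3 rectangular matrix: eigen decomposition is undefined, SVD is not.
A = np.arange(12, dtype=float).reshape(4, 3)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert U.shape == (4, 3) and s.shape == (3,) and Vt.shape == (3, 3)
# Singular values are non-negative and sorted in descending order.
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)
# The factorization reconstructs A exactly: A = U diag(s) V^T.
assert np.allclose(A, U @ np.diag(s) @ Vt)
```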
6. In the SVD of a matrix A = UΣV^T, what does the matrix Σ contain?
Singular value decomposition (SVD)
Easy
A. Eigenvectors
B. Class labels
C. Singular values
D. Eigenvalues
Correct Answer: Singular values
Explanation:
The SVD of a matrix A is given by A = UΣV^T. The matrix Σ is a diagonal matrix containing the singular values of A, which are always non-negative and are ordered from largest to smallest.
7. What property do the matrices U and V have in the SVD of a matrix A = UΣV^T?
Singular value decomposition (SVD)
Easy
A. They are zero matrices.
B. They are diagonal matrices.
C. They are inverse matrices of each other.
D. They are orthogonal matrices.
Correct Answer: They are orthogonal matrices.
Explanation:
In SVD, both U and V are orthogonal matrices. This means their columns are orthonormal vectors, and their transpose is equal to their inverse (U^-1 = U^T and V^-1 = V^T).
8. What do larger singular values in SVD generally represent?
Singular value decomposition (SVD)
Easy
A. Less important information in the matrix
B. The dimensions of the matrix
C. More important information or structure in the matrix
D. Noise in the data
Correct Answer: More important information or structure in the matrix
Explanation:
The magnitude of the singular values indicates their importance. Larger singular values correspond to directions that capture more of the variance or "energy" of the data in the matrix.
9. What is the primary goal of Principal Component Analysis (PCA)?
Principal component analysis (PCA) from a geometric and optimization perspective
Easy
A. To predict a continuous target variable
B. To reduce the dimensionality of the data while preserving the most variance
C. To classify data points into different groups
D. To find the mean of the dataset
Correct Answer: To reduce the dimensionality of the data while preserving the most variance
Explanation:
PCA is a dimensionality reduction technique used to transform a high-dimensional dataset into a lower-dimensional one by finding new uncorrelated variables (principal components) that capture the maximum variance.
10. Geometrically, what does PCA find?
Principal component analysis (PCA) from a geometric and optimization perspective
Easy
A. The outliers in the dataset
B. The shortest path between data points
C. The clusters present in the data
D. A new coordinate system where axes point in the directions of maximum variance
Correct Answer: A new coordinate system where axes point in the directions of maximum variance
Explanation:
Geometrically, PCA performs a rotation of the original coordinate system to a new one. The axes of this new system, called principal components, are aligned with the directions of maximum variance in the data.
11. The first principal component (PC1) is the direction that...
Principal component analysis (PCA) from a geometric and optimization perspective
Easy
A. Points towards the origin.
B. Is parallel to one of the original axes.
C. Maximizes the variance of the projected data.
D. Minimizes the variance of the projected data.
Correct Answer: Maximizes the variance of the projected data.
Explanation:
From an optimization perspective, the first principal component is defined as the direction (a linear combination of the original features) along which the projected data has the largest possible variance.
12. Principal components are calculated as the eigenvectors of which matrix?
Principal component analysis (PCA) from a geometric and optimization perspective
Easy
A. The original data matrix
B. The covariance matrix of the data
C. The inverse of the data matrix
D. The identity matrix
Correct Answer: The covariance matrix of the data
Explanation:
The principal components are the eigenvectors of the data's covariance matrix (or the correlation matrix, if the data is standardized). The eigenvalues indicate the amount of variance captured by each component.
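This recipe (covariance matrix, then eigenvectors) can be run by hand. A sketch on synthetic correlated data, where the first principal component should land near the diagonal direction [1, 1]/sqrt(2):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: the second feature is a noisy copy of the first.
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)          # mean-center first
C = np.cov(Xc, rowvar=False)     # covariance matrix of the data
w, V = np.linalg.eigh(C)         # eigh: the symmetric-matrix routine

# The eigenvector with the largest eigenvalue is the first principal
# component; here it should point close to [1, 1] / sqrt(2).
pc1 = V[:, np.argmax(w)]
assert abs(abs(pc1 @ np.array([1.0, 1.0])) / np.sqrt(2) - 1.0) < 0.01
```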
13. Are the principal components found by PCA correlated with each other?
Principal component analysis (PCA) from a geometric and optimization perspective
Easy
A. Only the first two components are correlated.
B. It depends on the dataset.
C. No, they are uncorrelated (orthogonal).
D. Yes, they are highly correlated.
Correct Answer: No, they are uncorrelated (orthogonal).
Explanation:
By construction, the principal components are orthogonal to each other. This means that in the new feature space, the resulting variables are linearly uncorrelated, which is a desirable property.
14. What is the primary goal of Linear Discriminant Analysis (LDA)?
Linear discriminant analysis (LDA)
Easy
A. To reduce dimensions by ignoring class labels
B. To cluster unlabeled data
C. To find a lower-dimensional space that maximizes the separability between classes
D. To maximize the variance within each class
Correct Answer: To find a lower-dimensional space that maximizes the separability between classes
Explanation:
LDA is a supervised dimensionality reduction technique. Its main objective is to project data onto a lower-dimensional space in a way that maximizes the distance between the means of different classes while minimizing the variance within each class.
15. How does LDA differ fundamentally from PCA?
Linear discriminant analysis (LDA)
Easy
A. There is no fundamental difference.
B. LDA is supervised (uses class labels), while PCA is unsupervised.
C. LDA is unsupervised, while PCA is supervised.
D. LDA always finds more dimensions than PCA.
Correct Answer: LDA is supervised (uses class labels), while PCA is unsupervised.
Explanation:
The key difference is that LDA is a supervised algorithm because it uses the class labels of the data to find the best projection for classification. PCA is unsupervised and only considers the variance of the data, ignoring any class labels.
16. To achieve good class separation, LDA aims to maximize the ratio of...
Linear discriminant analysis (LDA)
Easy
A. between-class variance to total variance.
B. within-class variance to between-class variance.
C. total variance to within-class variance.
D. between-class variance to within-class variance.
Correct Answer: between-class variance to within-class variance.
Explanation:
The optimization objective of LDA is to find a projection that maximizes the distance between the centers of the different classes (between-class variance) and simultaneously minimizes the spread of data within each class (within-class variance).
17. What is a common application of LDA?
Linear discriminant analysis (LDA)
Easy
A. Pre-processing for classification tasks
B. Recommending products to users
C. Anomaly detection
D. Data compression for storage
Correct Answer: Pre-processing for classification tasks
Explanation:
Because LDA explicitly tries to model the difference between classes, it is often used as a dimensionality reduction step before applying a classification model, such as in face recognition or text classification.
18. In the context of a recommendation system, what does the user-item interaction matrix typically contain?
Applications of matrix factorization in recommendation systems
Easy
A. Ratings that users have given to items
B. The number of items in stock
C. User demographics
D. Item prices
Correct Answer: Ratings that users have given to items
Explanation:
A user-item interaction matrix is a common way to represent user preferences. The rows represent users, the columns represent items, and the cells contain the ratings users have given to items (or a 1 if they've interacted with it, 0 otherwise).
19. When we factorize a user-item matrix into two smaller matrices (a user-feature matrix and an item-feature matrix), what do the "features" represent?
Applications of matrix factorization in recommendation systems
Easy
A. The number of users and items
B. Latent (hidden) features that describe users and items
C. Explicit features like genre or price
D. The original ratings
Correct Answer: Latent (hidden) features that describe users and items
Explanation:
Matrix factorization for recommendation systems learns latent features. These are not explicitly defined but are abstract characteristics that the model discovers to explain the observed ratings. For example, a latent feature for movies might correspond to the "amount of action" or "suitability for children".
20. How can matrix factorization be used to predict a rating for an item a user has not yet seen?
Applications of matrix factorization in recommendation systems
Easy
A. By taking the dot product of the user's latent feature vector and the item's latent feature vector
B. By finding the average rating of that item
C. By copying the rating from the most similar user
D. It cannot be used for prediction, only for data compression.
Correct Answer: By taking the dot product of the user's latent feature vector and the item's latent feature vector
Explanation:
After decomposing the user-item matrix into a user-feature matrix P and an item-feature matrix Q, the predicted rating for user i and item j is calculated by taking the dot product of the user's latent vector (row i of P) and the item's latent vector (row j of Q). This reconstructs the original matrix, filling in the missing values.
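The prediction step above can be sketched with hypothetical latent factors (the values below are made up for illustration):

```python
import numpy as np

# Hypothetical latent factors: 3 users, 4 items, k = 2 latent features.
P = np.array([[1.0, 0.5],   # user latent vectors (one row per user)
              [0.2, 1.0],
              [0.8, 0.8]])
Q = np.array([[1.0, 0.0],   # item latent vectors (one row per item)
              [0.0, 1.0],
              [0.5, 0.5],
              [1.0, 1.0]])

# Predicted rating for user i, item j is the dot product p_i . q_j,
# so the full predicted matrix is P Q^T (including unseen pairs).
R_hat = P @ Q.T
assert R_hat.shape == (3, 4)
assert np.isclose(R_hat[0, 3], P[0] @ Q[3])  # user 0, item 3: 1.0*1.0 + 0.5*1.0
```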
21. A data matrix X is tall and thin (n × p with n > p). Why can't we directly compute the eigen decomposition of X?
Eigen decomposition and its limitations in ML
Medium
A. The matrix does not have a full set of linearly independent columns.
B. Eigen decomposition is computationally too expensive for tall matrices.
C. The matrix must be symmetric for eigen decomposition.
D. Eigen decomposition is only defined for square matrices.
Correct Answer: Eigen decomposition is only defined for square matrices.
Explanation:
Eigen decomposition is a factorization of a matrix into its eigenvectors and eigenvalues, defined as A = PDP^-1. This definition is strictly for square matrices, as non-square matrices do not have eigenvalues in the usual sense. To analyze a non-square matrix like X, we often use SVD or compute the eigen decomposition of the square covariance matrix X^T X.
22. If v is an eigenvector of a matrix A with eigenvalue λ, what is the corresponding eigenvalue for the matrix A^3?
Eigen decomposition and its limitations in ML
Medium
A. It cannot be determined without knowing A.
B. λ^3
C. 3λ
D. λ
Correct Answer: λ^3
Explanation:
By definition, Av = λv. If we apply A again, we get A^2 v = A(λv) = λ(Av) = λ^2 v. Applying it a third time gives A^3 v = λ^3 v. Thus, v is an eigenvector of A^3 with eigenvalue λ^3.
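A numeric sanity check of this identity, on an illustrative triangular matrix whose eigenvalues sit on the diagonal:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

w, V = np.linalg.eig(A)
lam, v = w[0], V[:, 0]

# If A v = lam v, then A^3 v = lam^3 v.
A3 = np.linalg.matrix_power(A, 3)
assert np.allclose(A3 @ v, lam**3 * v)
```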
23. For a real symmetric matrix, what is the geometric relationship between eigenvectors corresponding to distinct (different) eigenvalues?
Eigen decomposition and its limitations in ML
Medium
A. They form an acute angle.
B. They are orthogonal.
C. They are parallel.
D. There is no guaranteed relationship.
Correct Answer: They are orthogonal.
Explanation:
A key property of real symmetric matrices is that their eigenvectors corresponding to distinct eigenvalues are mutually orthogonal. This property is fundamental to Principal Component Analysis (PCA), where the eigenvectors of the symmetric covariance matrix form an orthogonal basis representing the principal components.
24. Eigen decomposition is often applied to a covariance matrix in machine learning. What is a significant limitation of this approach if the features have vastly different scales (e.g., one feature in meters and another in kilometers)?
Eigen decomposition and its limitations in ML
Medium
A. The eigenvalues become negative, which is not interpretable.
B. The eigenvector corresponding to the feature with the largest scale will dominate the analysis.
C. The computation of eigenvectors becomes numerically unstable.
D. The covariance matrix becomes non-symmetric, making decomposition impossible.
Correct Answer: The eigenvector corresponding to the feature with the largest scale will dominate the analysis.
Explanation:
Eigen decomposition of a covariance matrix is sensitive to the scale of the features. A feature with a much larger variance (due to scale) will dominate the first principal component, as PCA (which uses this decomposition) seeks to maximize variance. This can mask the contribution of other important but smaller-scale features. Standardization is typically required to mitigate this.
25. Given the SVD of an m × n matrix A as A = UΣV^T, where Σ is an m × n diagonal matrix, the singular values in Σ are the square roots of the non-zero eigenvalues of which matrix?
Singular value decomposition (SVD)
Medium
A. A^T A
B. A + A^T
C. A itself
D. A^-1
Correct Answer: A^T A
Explanation:
The SVD is closely related to the eigen decompositions of A^T A and AA^T. Specifically, A^T A = VΣ^T ΣV^T, which is the eigen decomposition of A^T A. The diagonal entries of Σ^T Σ are the squared singular values (σ_i^2), which are the eigenvalues of A^T A. Therefore, the singular values are the square roots of the eigenvalues of A^T A.
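This relationship is easy to confirm numerically, assuming an arbitrary random rectangular matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))

s = np.linalg.svd(A, compute_uv=False)      # singular values, descending
evals = np.linalg.eigvalsh(A.T @ A)[::-1]   # eigenvalues of A^T A, descending

# sigma_i = sqrt(lambda_i(A^T A)); clip guards tiny negative round-off.
assert np.allclose(s, np.sqrt(np.clip(evals, 0.0, None)))
```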
26. You perform SVD on a 1000 × 500 matrix A. What is the maximum possible number of non-zero singular values?
Singular value decomposition (SVD)
Medium
A. 500
B. 1000
C. 250
D. 1500
Correct Answer: 500
Explanation:
The number of non-zero singular values of a matrix is equal to its rank. For an m × n matrix, the rank is at most min(m, n). In this case, m = 1000 and n = 500, so the maximum possible rank (and thus the maximum number of non-zero singular values) is min(1000, 500) = 500.
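The rank-counting argument can be sketched on a small matrix whose rank is known by construction (two rank-one terms):

```python
import numpy as np

rng = np.random.default_rng(2)
# A 6 x 4 matrix built from two rank-one terms has rank 2.
A = (np.outer(rng.normal(size=6), rng.normal(size=4))
     + np.outer(rng.normal(size=6), rng.normal(size=4)))

s = np.linalg.svd(A, compute_uv=False)
assert s.shape == (4,)              # min(m, n) singular values are reported
assert np.sum(s > 1e-10) == 2       # but only rank-many of them are non-zero
```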
27. In the context of low-rank approximation, truncating the SVD of a matrix A to keep the top k singular values gives a matrix A_k. What optimization problem does A_k solve?
Singular value decomposition (SVD)
Medium
A. It maximizes the determinant of A_k.
B. It minimizes the Frobenius norm ||A - B||_F among all rank-k matrices B.
C. It minimizes the sum of the singular values of A_k.
D. It ensures that A_k is an orthogonal matrix.
Correct Answer: It minimizes the Frobenius norm ||A - B||_F among all rank-k matrices B.
Explanation:
The Eckart-Young-Mirsky theorem states that the best rank-k approximation of a matrix A, in the sense of minimizing the Frobenius norm (or the spectral norm), is obtained by truncating the SVD. That is, A_k is the closest rank-k matrix to A.
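A sketch of the truncation, checking the known identity that the Frobenius error equals the root-sum-square of the discarded singular values (random illustrative matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncated SVD

# Frobenius error of the truncation = sqrt(sum of the discarded sigma_i^2);
# by Eckart-Young-Mirsky, no rank-k matrix can do better.
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```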
28. If a square matrix A is invertible, what is the relationship between the singular values of A and those of its inverse A^-1?
Singular value decomposition (SVD)
Medium
A. The singular values of A^-1 are the negatives of the singular values of A.
B. The singular values of A^-1 are the same as the singular values of A.
C. The singular values of A^-1 are the reciprocals of the singular values of A.
D. There is no direct relationship between them.
Correct Answer: The singular values of A^-1 are the reciprocals of the singular values of A.
Explanation:
If A = UΣV^T, then its inverse is A^-1 = VΣ^-1 U^T. The singular values of A^-1 are the diagonal entries of Σ^-1, which are 1/σ_i, where σ_i are the singular values of A. This holds because an invertible matrix must have all non-zero singular values.
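A two-line numerical check on an illustrative invertible matrix:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # invertible: det = 5

s_A = np.linalg.svd(A, compute_uv=False)
s_inv = np.linalg.svd(np.linalg.inv(A), compute_uv=False)

# Singular values of A^-1 are the reciprocals of those of A; the descending
# sort order reverses, hence the flip before comparing.
assert np.allclose(s_inv, (1.0 / s_A)[::-1])
```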
29. From a geometric perspective, what do the principal components of a dataset represent?
Principal component analysis (PCA) from a geometric and optimization perspective
Medium
A. A sequence of orthogonal directions that capture the maximum variance in the data.
B. The vectors pointing from the origin to the densest regions of the data.
C. The axes of the original feature space.
D. Directions that best separate the different classes in the data.
Correct Answer: A sequence of orthogonal directions that capture the maximum variance in the data.
Explanation:
Geometrically, PCA performs a rotation of the coordinate system. The first principal component is the direction in which the data varies the most. The second principal component is the direction orthogonal to the first that captures the most remaining variance, and so on. These components form a new orthogonal basis for the data.
30. PCA can be viewed as an optimization problem where we seek to minimize the reconstruction error. What does this reconstruction error physically represent?
Principal component analysis (PCA) from a geometric and optimization perspective
Medium
A. The number of data points misclassified by the projection.
B. The variance of the data projected onto the last principal component.
C. The total variance of the original dataset.
D. The sum of squared distances from each data point to its projection onto the principal component subspace.
Correct Answer: The sum of squared distances from each data point to its projection onto the principal component subspace.
Explanation:
Minimizing the reconstruction error (the error between the original data points and their compressed-and-decompressed versions) is equivalent to maximizing the variance of the projected data. This error is precisely the sum of the squared Euclidean distances from the original points to their lower-dimensional projections.
31. You apply PCA to a dataset and find the eigenvalues of the covariance matrix are [10, 8, 0.1, 0.05]. What does this suggest about the dimensionality of your data?
Principal component analysis (PCA) from a geometric and optimization perspective
Medium
A. The features are completely uncorrelated.
B. The data is uniformly distributed in a 4-dimensional space.
C. The data can be effectively represented in 2 dimensions with minimal information loss.
D. The data requires all 4 dimensions for an accurate representation.
Correct Answer: The data can be effectively represented in 2 dimensions with minimal information loss.
Explanation:
The eigenvalues of the covariance matrix represent the variance captured by each corresponding principal component. The large drop-off after the second eigenvalue (from 8 to 0.1) indicates that the first two components capture the vast majority of the variance in the data (18 of the total 18.15, about 99%). The remaining two components contribute very little (a total of 0.15). This implies the data is intrinsically low-dimensional and can be reduced to 2D.
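The explained-variance arithmetic from the explanation, spelled out with the question's eigenvalues:

```python
import numpy as np

evals = np.array([10.0, 8.0, 0.1, 0.05])   # eigenvalues from the question

# Cumulative fraction of total variance captured by the first k components.
explained = np.cumsum(evals) / evals.sum()
# The first two components already capture about 99% of the variance.
assert explained[1] > 0.99
```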
32. What is the primary reason for mean-centering the data (subtracting the mean of each feature) before performing PCA?
Principal component analysis (PCA) from a geometric and optimization perspective
Medium
A. To reduce the number of principal components needed.
B. To make the data matrix invertible.
C. To ensure all eigenvalues are positive.
D. To ensure the first principal component describes the direction of maximum variance, not the mean of the data.
Correct Answer: To ensure the first principal component describes the direction of maximum variance, not the mean of the data.
Explanation:
PCA calculates the eigenvectors of the covariance matrix. The formula for covariance involves the mean of the data. If the data is not centered at the origin, the first principal component might simply point from the origin to the center of the data cloud, rather than capturing the direction of maximum variance within the data cloud. Mean-centering ensures that the analysis focuses solely on the variance.
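The failure mode described above can be demonstrated with SVD on a synthetic data cloud placed far from the origin (values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
t = rng.normal(size=n)
# Data cloud centered far from the origin at [100, 100], with its true
# direction of maximum variance along [1, -1].
X = np.array([100.0, 100.0]) + np.column_stack([t, -t + 0.1 * rng.normal(size=n)])

# Without centering, the top right-singular vector of X points at the mean.
v_raw = np.linalg.svd(X, full_matrices=False)[2][0]
# After centering, it recovers the true variance direction.
v_cen = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[2][0]

assert abs(abs(v_raw @ np.array([1.0, 1.0])) / np.sqrt(2) - 1) < 0.01
assert abs(abs(v_cen @ np.array([1.0, -1.0])) / np.sqrt(2) - 1) < 0.01
```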
33. What is the primary objective of Linear Discriminant Analysis (LDA) in the context of dimensionality reduction?
Linear discriminant analysis (LDA)
Medium
A. To find a projection that maximizes the separation between classes.
B. To find a projection that minimizes the within-class variance, regardless of class separation.
C. To find a projection that maximizes the variance of the entire dataset.
D. To find a projection that makes the features uncorrelated.
Correct Answer: To find a projection that maximizes the separation between classes.
Explanation:
Unlike PCA, which is an unsupervised method that maximizes total variance, LDA is a supervised method that explicitly uses class labels. Its goal is to find a lower-dimensional space where the classes are as well-separated as possible. It achieves this by maximizing the ratio of between-class scatter (variance between class means) to within-class scatter (variance within each class).
34. You are working on a classification problem with 4 distinct classes. What is the maximum number of dimensions you can reduce your data to using LDA?
Linear discriminant analysis (LDA)
Medium
A. 3
B. Dependent on the number of features.
C. 4
D. 2
Correct Answer: 3
Explanation:
The number of linear discriminants (the dimensions of the output space) in LDA is at most C - 1, where C is the number of classes. This is because with C classes, there are at most C - 1 degrees of freedom for the positions of the class means relative to each other. For 4 classes, the maximum number of dimensions is 4 - 1 = 3.
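The C - 1 cap comes from the rank of the between-class scatter matrix. A sketch on synthetic 4-class data (the class layout is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
C, p, n_per = 4, 10, 30   # 4 classes, 10 features, 30 samples per class

means = rng.normal(size=(C, p))          # class means in general position
X = np.vstack([m + 0.1 * rng.normal(size=(n_per, p)) for m in means])
y = np.repeat(np.arange(C), n_per)

# Between-class scatter S_B = sum_c n_c (mu_c - mu)(mu_c - mu)^T.
mu = X.mean(axis=0)
S_B = sum(n_per * np.outer(X[y == c].mean(axis=0) - mu,
                           X[y == c].mean(axis=0) - mu)
          for c in range(C))

# The weighted centered class means sum to zero, so rank(S_B) <= C - 1,
# capping the number of useful LDA directions at 3 here.
assert np.linalg.matrix_rank(S_B) == C - 1
```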
35. LDA finds its projection vectors by solving a generalized eigenvalue problem of the form S_B w = λ S_W w. What do S_B and S_W represent?
Linear discriminant analysis (LDA)
Medium
A. S_B is the sample covariance matrix and S_W is the identity matrix.
B. S_B is the data matrix X and S_W is its transpose X^T.
C. S_B is the between-class scatter matrix and S_W is the within-class scatter matrix.
D. S_B is the between-class scatter matrix and S_W is the total scatter matrix.
Correct Answer: S_B is the between-class scatter matrix and S_W is the within-class scatter matrix.
Explanation:
The core of LDA is to find a projection that maximizes the ratio of between-class scatter to within-class scatter. This optimization problem is mathematically equivalent to solving the generalized eigenvalue problem S_B w = λ S_W w, or S_W^-1 S_B w = λw. The eigenvectors w are the linear discriminants.
36. Under what condition would PCA and LDA produce very similar results for dimensionality reduction in a classification task?
Linear discriminant analysis (LDA)
Medium
A. When the number of features is much larger than the number of samples.
B. PCA and LDA can never produce similar results because their objectives are fundamentally different.
C. When the direction of maximum variance in the data also happens to be the direction that best separates the classes.
D. When the data is perfectly balanced across all classes.
Correct Answer: When the direction of maximum variance in the data also happens to be the direction that best separates the classes.
Explanation:
PCA seeks the direction of maximum variance, while LDA seeks the direction of maximum class separability. If the primary source of variance in the dataset is the difference between the classes themselves, then the direction that maximizes variance (PC1) will likely be very similar to the direction that maximizes class separation (LD1).
37. In a recommendation system based on matrix factorization, we decompose a user-item rating matrix R into two lower-rank matrices P (users) and Q (items). What is the main purpose of this decomposition?
Applications of matrix factorization in recommendation systems
Medium
A. To find the exact, original ratings for every user-item pair.
B. To learn latent features for users and items that can be used to predict missing ratings.
C. To reduce the storage space of the rating matrix by a guaranteed factor.
D. To identify the most popular items across all users.
Correct Answer: To learn latent features for users and items that can be used to predict missing ratings.
Explanation:
The factorization learns a low-dimensional representation (latent features) for each user (rows of P) and each item (rows of Q). The core idea is that user preferences and item attributes can be described by these latent factors. The dot product of a user's latent vector and an item's latent vector then gives a prediction for a missing rating.
38. A key challenge in matrix factorization for recommendation systems is the sparsity of the user-item matrix. How do algorithms typically handle the many missing entries during the training process?
Applications of matrix factorization in recommendation systems
Medium
A. They treat all missing entries as a rating of zero.
B. They remove all users and items with too many missing ratings.
C. They calculate the prediction error and update the model parameters only on the observed ratings.
D. They fill the missing entries with the global average rating before factorization.
Correct Answer: They calculate the prediction error and update the model parameters only on the observed ratings.
Explanation:
A crucial aspect of matrix factorization algorithms like Alternating Least Squares (ALS) or Stochastic Gradient Descent (SGD) is that they only evaluate the loss function over the known ratings. The objective is to find latent factors that best reconstruct the observed part of the matrix, and then use these factors to generalize and predict the unobserved ratings.
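A minimal SGD sketch of this idea, under assumed toy data and hyperparameters (learning rate, regularization, and the rating triples are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n_users, n_items, k = 5, 6, 2
# Observed ratings only, as (user, item, rating) triples; everything
# else in the 5 x 6 matrix is treated as missing, not as zero.
obs = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (1, 5, 1.0),
       (2, 0, 1.0), (2, 3, 4.0), (3, 4, 2.0), (4, 5, 5.0)]

P = 0.1 * rng.normal(size=(n_users, k))
Q = 0.1 * rng.normal(size=(n_items, k))
lr, reg = 0.05, 0.01

# SGD: the loss (and hence every gradient step) touches observed cells only.
for _ in range(500):
    for i, j, r in obs:
        err = r - P[i] @ Q[j]
        p_old = P[i].copy()
        P[i] += lr * (err * Q[j] - reg * P[i])
        Q[j] += lr * (err * p_old - reg * Q[j])

mse = np.mean([(r - P[i] @ Q[j]) ** 2 for i, j, r in obs])
assert mse < 0.1   # the observed entries are fit closely
```

Predictions for the unobserved cells then come from the same factors, e.g. `P[3] @ Q[0]`, which never appeared in the training loop.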
39. In a matrix factorization model, we have a user latent vector u and two item latent vectors, v_A and v_B. Assuming ratings are predicted by the dot product, which item would be recommended to the user and why?
Applications of matrix factorization in recommendation systems
Medium
A. Both items, because they have positive values in their latent vectors.
B. Item A, because its dot product with the user vector is higher.
C. Item B, because its dot product with the user vector is higher.
D. Neither item, because the user vector is not a unit vector.
Correct Answer: The item whose dot product with the user vector is higher.
Explanation:
The predicted rating is calculated by the dot product. For item A, the predicted rating is u · v_A; for item B, it is u · v_B. The model predicts a higher preference for whichever item yields the larger dot product, so that item would be recommended.
40. What is a common method to prevent overfitting in matrix factorization models for recommendation systems?
Applications of matrix factorization in recommendation systems
Medium
A. Initializing the user and item matrices with random noise from a uniform distribution.
B. Using only users who have rated a large number of items.
C. Increasing the number of latent factors (k) until the training error is zero.
D. Adding a regularization term (e.g., L2 regularization) to the loss function.
Correct Answer: Adding a regularization term (e.g., L2 regularization) to the loss function.
Explanation:
Overfitting occurs when the model learns the training data too well, including its noise, and fails to generalize to unseen data. Regularization is a standard technique to combat this. By adding a penalty term to the loss function (e.g., λ(||P||_F^2 + ||Q||_F^2)), we discourage the model from learning excessively large values in the latent factor matrices, promoting a simpler and more generalizable model.
41. A non-symmetric matrix A is not guaranteed to be diagonalizable. In the context of a machine learning model where A represents state transitions in a discrete-time dynamical system, what is the primary implication if its eigen decomposition does not exist?
Eigen decomposition and its limitations in ML
Hard
A. The system is inherently unstable and the state will always diverge to infinity.
B. The Jordan Normal Form must be used to understand the system's dynamics, which may involve transformations beyond simple scaling along eigenvector directions (e.g., shear transformations).
C. The model's long-term behavior cannot be analyzed using powers of A.
D. The matrix must be singular, meaning its determinant is zero.
Correct Answer: The Jordan Normal Form must be used to understand the system's dynamics, which may involve transformations beyond simple scaling along eigenvector directions (e.g., shear transformations).
Explanation:
If a matrix is not diagonalizable, it lacks a full set of linearly independent eigenvectors. However, its behavior can still be fully characterized by the Jordan Normal Form. This decomposition reveals that the system's evolution may not be simple scaling along axes, but can also include shearing effects represented by Jordan blocks, which are crucial for understanding the system's true dynamics.
42. Given the SVD of a matrix A = UΣV^T, where A is m × n and rank(A) = r, the pseudoinverse is A^+ = VΣ^+ U^T. What is the precise geometric interpretation of the operator represented by the projection matrix AA^+?
Singular value decomposition (SVD)
Hard
A. It is an orthogonal projection from R^m onto the column space of A (the range of A).
B. It is the identity operator on the column space of A.
C. It is an orthogonal projection from R^n onto the row space of A (the range of A^T).
D. It is an orthogonal projection from R^m onto the null space of A^T (the left null space of A).
Correct Answer: It is an orthogonal projection from R^m onto the column space of A (the range of A).
Explanation:
The matrix AA^+ simplifies to UΣΣ^+ U^T. The matrix ΣΣ^+ is an m × m diagonal matrix with ones in the first r diagonal positions and zeros elsewhere, so AA^+ = U_r U_r^T, where U_r holds the first r columns of U. This is the standard form for an orthogonal projection matrix onto the subspace spanned by the first r columns of U, which is, by definition, the column space of A.
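The projector properties (symmetric, idempotent, fixes the column space) can be checked directly on an illustrative random matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(5, 2))      # full column rank (almost surely)

P_col = A @ np.linalg.pinv(A)    # the m x m operator A A^+

# An orthogonal projector is symmetric and idempotent...
assert np.allclose(P_col, P_col.T)
assert np.allclose(P_col @ P_col, P_col)
# ...and it fixes any vector already in the column space of A.
b = A @ rng.normal(size=2)
assert np.allclose(P_col @ b, b)
```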
43Consider a dataset where the covariance matrix has eigenvalues . You apply a linear transformation to the data , where is an orthogonal matrix (). What are the principal components and their corresponding variances (eigenvalues) for the transformed data ?
Principal component analysis (PCA) from a geometric and optimization perspective
Hard
A.The principal components are the eigenvectors of , and the variances are unchanged.
B.The principal components are the columns of , and the variances are the diagonal elements of .
C.The principal components are rotated, but the variances (eigenvalues) remain unchanged.
D.Both the principal components and their variances change unpredictably.
Correct Answer: The principal components are rotated, but the variances (eigenvalues) remain unchanged.
Explanation:
The covariance matrix of the new data is . Since is orthogonal, is similar to . Similar matrices have the same eigenvalues. Therefore, the variances of the principal components (the eigenvalues) remain exactly the same. The new principal components themselves (the eigenvectors of ) are rotated versions of the original ones.
Incorrect! Try again.
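The similarity argument can be verified numerically (a sketch with random data; the orthogonal Q is drawn via a QR decomposition):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))
C = np.cov(X, rowvar=False)

# random orthogonal matrix Q from a QR decomposition
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Y = X @ Q
C_Y = np.cov(Y, rowvar=False)            # equals Qᵀ C Q up to rounding

evals_C = np.sort(np.linalg.eigvalsh(C))
evals_CY = np.sort(np.linalg.eigvalsh(C_Y))
assert np.allclose(evals_C, evals_CY)    # spectra match: variances unchanged
```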
44. In Fisher's LDA, we maximize J(w) = (wᵀS_Bw)/(wᵀS_Ww), where S_B is the between-class scatter and S_W is the within-class scatter. If S_W is singular, which can happen in high-dimensional settings (p > N), what is the most robust and standard procedure to make LDA applicable?
Linear discriminant analysis (LDA)
Hard
A.Add a small multiple of the identity matrix to S_W (i.e., use S_W + εI) to make it invertible, a form of regularization.
B.Use the Moore-Penrose pseudoinverse of S_W to solve the generalized eigenvalue problem.
C.The problem is unsolvable as the Fisher criterion is undefined.
D.First apply PCA to reduce dimensionality to a subspace where S_W becomes non-singular, then apply LDA in that subspace.
Correct Answer: First apply PCA to reduce dimensionality to a subspace where S_W becomes non-singular, then apply LDA in that subspace.
Explanation:
When the number of features p is greater than the number of samples N, the data lies in a subspace of dimension at most N − 1, causing S_W to be singular. The standard and most principled approach is to first use PCA to project the data into the (N − C)-dimensional subspace where class structure is preserved and S_W is non-singular (C is the number of classes). After this pre-processing step, LDA can be successfully applied. This two-stage process is often called PCA+LDA (the approach behind Fisherfaces).
Incorrect! Try again.
45. In a recommendation system using matrix factorization, the objective function is L = Σ_{(u,i)∈observed} (r_{ui} − p_uᵀq_i)² + λ(Σ_u‖p_u‖² + Σ_i‖q_i‖²). If a user u has rated only one item i with a rating of 5, and the regularization parameter λ is very large (approaching infinity), what will the learned latent vector p_u converge to?
Applications of matrix factorization in recommendation systems
Hard
A.A vector with a very large norm, pointing in the same direction as q_i.
B.A near-zero vector.
C.A vector with a very large norm, orthogonal to q_i.
D.The exact zero vector.
Correct Answer: The exact zero vector.
Explanation:
The objective function balances reconstruction error with a regularization penalty on the magnitudes of the latent vectors. As λ → ∞, the penalty term dominates the objective function. To minimize the total loss, the model must make this penalty term as small as possible. The only way to do this is to force ‖p_u‖² to zero, which means p_u must converge to the zero vector, regardless of the reconstruction error for the single rating.
Incorrect! Try again.
46. A real symmetric matrix is always diagonalizable by an orthogonal matrix. If a real matrix A is known to be diagonalizable, but is not symmetric, what can we definitively conclude about its eigenvectors?
Eigen decomposition and its limitations in ML
Hard
A.The matrix of eigenvectors can always be chosen to be an orthogonal matrix.
B.The eigenvectors corresponding to distinct real eigenvalues are orthogonal.
C.The eigenvectors are linearly independent but may not be orthogonal.
D.The eigenvectors are guaranteed to be orthogonal.
Correct Answer: The eigenvectors are linearly independent but may not be orthogonal.
Explanation:
The condition for a matrix to be diagonalizable is that it must possess a full set of linearly independent eigenvectors that can form a basis for the vector space. Orthogonality of eigenvectors is a stronger condition guaranteed only for normal matrices (AAᵀ = AᵀA), a class which includes symmetric matrices but not all diagonalizable matrices. Therefore, for a general non-symmetric diagonalizable matrix, we can only be sure of linear independence.
Incorrect! Try again.
47. According to the Eckart-Young-Mirsky theorem, the best rank-k approximation of a matrix in the Frobenius norm is A_k = Σ_{i=1}^{k} σ_i u_i v_iᵀ. If A is a square, invertible matrix of size n×n with singular values σ₁ > σ₂ > … > σ_n > 0, what is the exact Frobenius norm of the error of the best rank-(n−1) approximation, ‖A − A_{n−1}‖_F?
Singular value decomposition (SVD)
Hard
A.σ_{n−1}
B.σ_n
C.√(σ_{n−1}² + σ_n²)
D.0
Correct Answer: σ_n
Explanation:
The error of the best rank-k approximation is given by the Frobenius norm of the matrix formed by the discarded singular components. In this case, A − A_{n−1} = σ_n u_n v_nᵀ. The Frobenius norm of a rank-1 matrix σuvᵀ is σ‖u‖‖v‖. Since u_n and v_n are unit vectors, the norm is simply the remaining singular value σ_n. More formally, ‖A − A_k‖_F = √(σ_{k+1}² + … + σ_n²). For k = n − 1, this sum only has one term, σ_n², so the error is σ_n.
Incorrect! Try again.
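A small numerical confirmation (random 4×4 matrix; the rank-(n−1) truncation is built directly from the SVD):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
U, s, Vt = np.linalg.svd(A)

# best rank-(n-1) approximation: drop the smallest singular value
A_k = U[:, :n-1] @ np.diag(s[:n-1]) @ Vt[:n-1, :]
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, s[-1])            # Frobenius error equals sigma_n
```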
48. You perform PCA on a dataset with D features and N samples (N > D). You find that the last k eigenvalues of the data's covariance matrix are exactly zero (λ_{D−k+1} = … = λ_D = 0). What is the most precise geometric interpretation of this result?
Principal component analysis (PCA) from a geometric and optimization perspective
Hard
A.The data points lie perfectly within a (D − k)-dimensional affine subspace of the original D-dimensional space.
B.The first D − k components capture all the information, and the last k can be discarded with no information loss.
C.The dataset contains categorical features that were improperly encoded.
D.The dataset has k features that are pure noise.
Correct Answer: The data points lie perfectly within a (D − k)-dimensional affine subspace of the original D-dimensional space.
Explanation:
An eigenvalue of the covariance matrix represents the variance of the data along the corresponding eigenvector's direction. A zero eigenvalue means there is zero variance in that direction. This is only possible if all data points are confined to a hyperplane (an affine subspace) orthogonal to that eigenvector. If there are k zero eigenvalues, the data lies in the intersection of k such hyperplanes, which defines an affine subspace of dimension D − k. This indicates perfect multicollinearity among the features.
Incorrect! Try again.
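This is easy to demonstrate: below, a fourth feature is constructed as an exact linear combination of the others (coefficients chosen arbitrarily), so the data lies in a 3-dimensional affine subspace of ℝ⁴ and the covariance matrix has exactly one zero eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
# 3 free features; the 4th is a linear combination plus a constant offset
Z = rng.standard_normal((100, 3))
X = np.column_stack([Z, Z @ np.array([1.0, -2.0, 0.5]) + 7.0])

evals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))
assert np.isclose(evals[0], 0.0, atol=1e-8)   # one zero eigenvalue
assert evals[1] > 1e-6                         # other directions have variance
```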
49. Consider a binary classification problem where the two classes have identical, spherical covariance matrices (i.e., Σ₁ = Σ₂ = σ²I) and their means are μ₁ and μ₂. In this specific scenario, the optimal projection vector w found by LDA is parallel to which vector?
Linear discriminant analysis (LDA)
Hard
A.A vector orthogonal to the vector connecting the class means.
B.The vector connecting the class means, μ₂ − μ₁.
C.The direction is undefined because the within-class scatter matrix is proportional to the identity matrix.
D.The first principal component of the combined dataset.
Correct Answer: The vector connecting the class means, μ₂ − μ₁.
Explanation:
The LDA projection vector is the leading eigenvector of S_W⁻¹S_B. For two classes, S_B ∝ (μ₂ − μ₁)(μ₂ − μ₁)ᵀ. The within-class scatter is S_W ∝ σ²I. Thus, S_W⁻¹S_B ∝ (μ₂ − μ₁)(μ₂ − μ₁)ᵀ. This is a rank-1 matrix, and its only eigenvector with a non-zero eigenvalue must lie in the direction of the vector that spans its column space, which is μ₂ − μ₁.
Incorrect! Try again.
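A sketch with hypothetical 2-D means (values chosen for illustration): with spherical within-class scatter, the leading eigenvector of S_W⁻¹S_B is parallel to μ₂ − μ₁.

```python
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
Sw = 2.0 * np.eye(2)                         # spherical within-class scatter
Sb = np.outer(mu2 - mu1, mu2 - mu1)

M = np.linalg.solve(Sw, Sb)                  # Sw^{-1} Sb (symmetric here)
evals, evecs = np.linalg.eigh(M)
w = evecs[:, -1]                             # leading eigenvector

# w is parallel to mu2 - mu1: the 2-D cross product vanishes
d = mu2 - mu1
assert np.isclose(w[0] * d[1] - w[1] * d[0], 0.0, atol=1e-12)
```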
50. When using Alternating Least Squares (ALS) for matrix factorization, the algorithm alternates between solving for user factors P (given item factors Q) and item factors Q (given P). Why is this approach particularly well-suited for large-scale distributed computation compared to simultaneous Stochastic Gradient Descent (SGD)?
Applications of matrix factorization in recommendation systems
Hard
A.ALS requires significantly fewer iterations to converge than SGD.
B.Each step of ALS involves solving for user (or item) factors independently, which are embarrassingly parallel subproblems.
C.SGD cannot handle the sparse rating matrix, while ALS is specifically designed for it.
D.ALS is guaranteed to find the global minimum of the non-convex problem, whereas SGD is not.
Correct Answer: Each step of ALS involves solving for user (or item) factors independently, which are embarrassingly parallel subproblems.
Explanation:
The key advantage of ALS is its parallel structure. When the item factors Q are fixed, the objective function decouples into independent quadratic (ridge-regression) minimization problems, one for each user factor p_u. This means all user factors can be computed in parallel. The same holds true when solving for item factors. This property allows ALS to scale effectively on distributed computing platforms like Spark, where the independent computations can be spread across many machines, making it more efficient than the inherently sequential updates of SGD in such environments.
Incorrect! Try again.
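The decoupling can be made concrete. This sketch (toy sizes, made-up ratings) implements one ALS half-step: each loop iteration is an independent ridge-regression solve for one user, so the loop body could run on separate machines.

```python
import numpy as np

def als_user_step(R, mask, Q, lam):
    """One ALS half-step: an independent ridge problem per user.
    R: ratings (n_users x n_items), mask: observed entries, Q: item factors."""
    k = Q.shape[1]
    P = np.zeros((R.shape[0], k))
    for u in range(R.shape[0]):          # users are independent -> parallel
        obs = mask[u]
        Qo = Q[obs]                      # factors of items user u rated
        A = Qo.T @ Qo + lam * np.eye(k)
        b = Qo.T @ R[u, obs]
        P[u] = np.linalg.solve(A, b)     # closed-form ridge solution
    return P

rng = np.random.default_rng(4)
R = rng.uniform(1, 5, size=(6, 8))
mask = rng.random((6, 8)) < 0.5          # sparse observation pattern
Q = rng.standard_normal((8, 2))
P = als_user_step(R, mask, Q, lam=0.1)
assert P.shape == (6, 2)
```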
51. A Markov chain is described by a row-stochastic transition matrix P. The Perron-Frobenius theorem guarantees that its largest eigenvalue is 1. What is the significance of the corresponding left eigenvector π, which satisfies πᵀP = πᵀ?
Eigen decomposition and its limitations in ML
Hard
A.It is the stationary distribution of the Markov chain, describing the long-term probability of being in each state.
B.It represents the initial distribution of the states.
C.It is always a uniform distribution, indicating all states are equally likely in the long run.
D.It is a vector of all ones, which is the right eigenvector for the eigenvalue 1.
Correct Answer: It is the stationary distribution of the Markov chain, describing the long-term probability of being in each state.
Explanation:
The equation πᵀP = πᵀ defines the stationary distribution of a Markov chain. It means that if the probability distribution over states is given by the vector π, then after one transition step, the new distribution will still be π. This vector represents the equilibrium state of the system, where the probability of being in any given state becomes constant over time.
Incorrect! Try again.
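A sketch with a hypothetical two-state chain: the left eigenvector of P for eigenvalue 1 (computed as a right eigenvector of Pᵀ), normalized to sum to 1, matches the long-run distribution obtained by iterating the chain.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])               # row-stochastic transition matrix

# left eigenvector for eigenvalue 1 = right eigenvector of P transposed
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()                        # normalize to a distribution

assert np.allclose(pi @ P, pi)            # stationary: pi P = pi
# agrees with the long-run distribution from repeated transitions
assert np.allclose(np.linalg.matrix_power(P, 100)[0], pi)
```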
52. A very large, dense matrix has a singular value spectrum that decays exponentially fast (e.g., σ_k ∝ e^{−αk} for some α > 0). What is the most important practical implication of this property?
Singular value decomposition (SVD)
Hard
A.The matrix columns are nearly orthogonal, making it well-conditioned.
B.The matrix represents a chaotic system with high intrinsic dimensionality.
C.The matrix can be accurately approximated by a matrix of very low rank, enabling significant data compression and faster computations.
D.The matrix is nearly singular and numerically difficult to invert.
Correct Answer: The matrix can be accurately approximated by a matrix of very low rank, enabling significant data compression and faster computations.
Explanation:
A rapid decay in singular values signifies that most of the matrix's 'energy' or information is captured by the first few singular values and vectors. This means that a truncated SVD A_k = U_kΣ_kV_kᵀ, for a small k, will be an excellent approximation of the full matrix A. This is the basis for techniques like PCA and data compression, as the high-dimensional matrix can be effectively represented by the much smaller matrices U_k, Σ_k, and V_k.
Incorrect! Try again.
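A sketch that constructs a matrix with an exponentially decaying spectrum (decay rate chosen for illustration) and shows that a rank-10 truncation already reconstructs it to better than 1% relative error:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.exp(-0.5 * np.arange(n))          # exponentially decaying spectrum
A = U @ np.diag(s) @ V.T

k = 10                                    # keep only the top 10 components
A_k = U[:, :k] @ np.diag(s[:k]) @ V[:, :k].T
rel_err = np.linalg.norm(A - A_k, 'fro') / np.linalg.norm(A, 'fro')
assert rel_err < 1e-2                     # rank 10 out of 50 suffices
```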
53. How does Probabilistic PCA (PPCA) fundamentally differ from standard PCA in its formulation and assumptions?
Principal component analysis (PCA) from a geometric and optimization perspective
Hard
A.PPCA allows for non-orthogonal principal components, while standard PCA enforces orthogonality.
B.Standard PCA is an iterative algorithm while PPCA has a closed-form solution.
C.PPCA maximizes data likelihood under a generative latent variable model with isotropic Gaussian noise, while standard PCA is the deterministic, zero-noise limit of this model.
D.Standard PCA minimizes the L2 reconstruction error, whereas PPCA minimizes the L1 reconstruction error, making it more robust to outliers.
Correct Answer: PPCA maximizes data likelihood under a generative latent variable model with isotropic Gaussian noise, while standard PCA is the deterministic, zero-noise limit of this model.
Explanation:
Standard PCA is an algorithm that finds a low-dimensional projection maximizing variance. PPCA is a generative model that assumes observed data is generated from lower-dimensional latent variables via a linear map plus Gaussian noise. PPCA's parameters are fit by maximizing data likelihood. This probabilistic framing allows for handling missing data naturally and provides a measure of uncertainty. Standard PCA can be recovered as a special case of PPCA when the variance of the noise term approaches zero.
Incorrect! Try again.
54. For a multi-class classification problem with C classes and D features (D ≥ C), what is the maximum rank of the between-class scatter matrix S_B, and what is the direct consequence of this for the dimensionality reduction performed by LDA?
Linear discriminant analysis (LDA)
Hard
A.The rank is at most N − 1, where N is the number of samples.
B.The rank is at most C, so LDA can project to at most C dimensions.
C.The rank is at most C − 1, so LDA can project to at most C − 1 dimensions.
D.The rank is at most D − 1, so LDA can project to at most D − 1 dimensions.
Correct Answer: The rank is at most C − 1, so LDA can project to at most C − 1 dimensions.
Explanation:
The between-class scatter matrix S_B = Σ_c N_c(μ_c − μ)(μ_c − μ)ᵀ is formed by a sum of outer products of the C vectors (μ_c − μ), where μ_c are the class means and μ is the overall mean. These vectors are linearly dependent (they sum to zero when weighted by the class sizes N_c) and thus span a subspace of dimension at most C − 1. Since LDA finds eigenvectors of S_W⁻¹S_B, and the rank of this product is limited by the rank of S_B, there can be at most C − 1 non-zero eigenvalues and thus at most C − 1 useful discriminant directions.
Incorrect! Try again.
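The rank cap can be checked directly. This sketch (synthetic data, 4 classes in 10 dimensions) builds S_B from class means and confirms its rank is C − 1 = 3, not C:

```python
import numpy as np

rng = np.random.default_rng(6)
C, D, n_per = 4, 10, 30                   # 4 classes, 10 features
X = np.vstack([rng.standard_normal((n_per, D)) + rng.normal(0, 5, D)
               for _ in range(C)])
labels = np.repeat(np.arange(C), n_per)

mu = X.mean(axis=0)
S_B = np.zeros((D, D))
for c in range(C):
    diff = X[labels == c].mean(axis=0) - mu
    S_B += n_per * np.outer(diff, diff)

# rank is C - 1, because the weighted mean deviations sum to zero
assert np.linalg.matrix_rank(S_B) == C - 1
```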
55. In collaborative filtering, a common first step before applying SVD is to fill missing ratings in the user-item matrix R. How does the naive strategy of imputing missing values with the global mean rating bias the resulting model, especially for users with very few ratings?
Applications of matrix factorization in recommendation systems
Hard
A.It causes the model to shrink all predictions towards the global mean, an effect identical to L2 regularization.
B.It has no significant effect as SVD is robust to such imputations.
C.It correctly centers the data, leading to a more accurate model.
D.It strongly biases the latent factor vectors of sparse users toward a 'generic' profile that primarily reflects average rating behavior, obscuring their unique tastes.
Correct Answer: It strongly biases the latent factor vectors of sparse users toward a 'generic' profile that primarily reflects average rating behavior, obscuring their unique tastes.
Explanation:
Mean imputation forces the factorization model to learn latent factors that can reconstruct the global mean for a large number of entries. For a sparse user with only a few ratings, the imputed mean values will dominate their row in the matrix. Consequently, their learned latent vector will be optimized to represent an 'average' user, not their specific, sparse preferences. This washes out individual taste and leads to generic recommendations for these users.
Incorrect! Try again.
56. You are given a 2D dataset with two classes that form two long, thin, parallel clusters. The direction of maximum variance for the combined data is along the length of the clusters, while the direction that best separates them is orthogonal to their length. If you must reduce the data to 1 dimension for classification, which statement is most accurate?
Principal component analysis (PCA) from a geometric and optimization perspective
Hard
A.PCA will perform better because it captures the global structure of the data.
B.LDA will perform much better because it will find the projection that maximizes the separation between the class means.
C.Both will perform equally well as they will identify the same primary axis.
D.Neither will be effective; a non-linear method like kernel PCA is required.
Correct Answer: LDA will perform much better because it will find the projection that maximizes the separation between the class means.
Explanation:
This is a classic scenario highlighting the difference between unsupervised PCA and supervised LDA. PCA will find the direction of maximum variance, which is along the clusters' length. Projecting onto this axis will cause the two classes to overlap completely, making classification impossible. LDA, being supervised, will ignore the direction of high variance and instead find the direction that maximizes class separability. This direction is orthogonal to the clusters' length, and projecting onto it will perfectly separate the two classes.
Incorrect! Try again.
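This scenario is easy to simulate (cluster shapes and offsets chosen for illustration): the first principal component follows the clusters' long axis, so projecting onto it mixes the classes, while projecting onto the orthogonal axis (the LDA direction here) separates them.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
# two long thin clusters, elongated along x, separated along y
c0 = np.column_stack([rng.normal(0, 10, n), rng.normal(0.0, 0.3, n)])
c1 = np.column_stack([rng.normal(0, 10, n), rng.normal(3.0, 0.3, n)])
X = np.vstack([c0, c1])

# PCA's first component is the direction of max variance: the x axis
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = evecs[:, -1]
assert abs(pc1[0]) > abs(pc1[1])          # dominated by the x direction

# class separation along pc1 vs along the y axis (the LDA direction here)
gap_pca = abs((c0 @ pc1).mean() - (c1 @ pc1).mean())
gap_lda = abs(c0[:, 1].mean() - c1[:, 1].mean())
assert gap_lda > gap_pca                  # LDA's axis separates the classes
```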
57. For a tall matrix A ∈ ℝ^{m×n} with m ≫ n, computing the full SVD (A = UΣVᵀ with U ∈ ℝ^{m×m}) is inefficient due to the size of U. How does the 'Thin SVD' (or 'Economy SVD') provide a more efficient but still exact representation?
Singular value decomposition (SVD)
Hard
A.Thin SVD sets all singular values below a threshold to zero, yielding a low-rank approximation.
B.Thin SVD computes the SVD of the smaller n×n matrix AᵀA to avoid dealing with the large dimension m.
C.Thin SVD only computes the first n columns of U (as U_n ∈ ℝ^{m×n}) and the top-left n×n block of Σ, which is sufficient to perfectly reconstruct A.
D.Thin SVD is an iterative algorithm that approximates the SVD, while full SVD is a direct method.
Correct Answer: Thin SVD only computes the first n columns of U (as U_n ∈ ℝ^{m×n}) and the top-left n×n block of Σ, which is sufficient to perfectly reconstruct A.
Explanation:
In the SVD of a tall matrix (m > n), there are at most n non-zero singular values. In the product UΣVᵀ, the last m − n columns of U are always multiplied by zero rows of Σ. The Thin SVD avoids computing these unnecessary columns of U and the zero rows of Σ. It produces U_n ∈ ℝ^{m×n}, Σ_n ∈ ℝ^{n×n}, and V ∈ ℝ^{n×n} such that A = U_nΣ_nVᵀ. This decomposition is exact (not an approximation) and more memory- and computationally-efficient.
Incorrect! Try again.
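In NumPy, the thin SVD is obtained with `full_matrices=False`; a quick check that the shapes shrink while the reconstruction stays exact:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((1000, 20))       # tall matrix, m >> n

# thin SVD: U is m x n instead of m x m
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert U.shape == (1000, 20) and s.shape == (20,) and Vt.shape == (20, 20)

# still an exact reconstruction, not an approximation
assert np.allclose(U @ np.diag(s) @ Vt, A)
```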
58. The standard Power Iteration algorithm finds the eigenvector corresponding to the eigenvalue with the largest magnitude. How can this method be adapted to find the eigenvalue of a matrix A that is closest to a specific target value μ?
Eigen decomposition and its limitations in ML
Hard
A.By applying Power Iteration to the matrix A⁻¹ and taking the reciprocal of the result.
B.By applying Power Iteration to the matrix A − μI.
C.It is not possible; Power Iteration is fundamentally limited to finding the dominant eigenvalue.
D.By applying Power Iteration to the matrix (A − μI)⁻¹, an approach known as Inverse Iteration with a shift.
Correct Answer: By applying Power Iteration to the matrix (A − μI)⁻¹, an approach known as Inverse Iteration with a shift.
Explanation:
The eigenvalues of the matrix (A − μI)⁻¹ are 1/(λ_i − μ), where λ_i are the eigenvalues of A. If λ_j is the eigenvalue of A closest to the shift μ, then |λ_j − μ| will be the smallest among all |λ_i − μ|. Consequently, its reciprocal, 1/|λ_j − μ|, will be the largest. Therefore, applying Power Iteration to the shifted-inverse matrix will cause it to converge to the eigenvector corresponding to this dominant eigenvalue, which is the same eigenvector corresponding to the desired eigenvalue λ_j of the original matrix A.
Incorrect! Try again.
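A minimal sketch of shifted inverse iteration on a diagonal matrix (eigenvalues chosen so the result is obvious; a production implementation would factor A − μI and solve rather than invert):

```python
import numpy as np

A = np.diag([1.0, 4.0, 10.0])             # known eigenvalues for clarity
mu = 3.5                                   # target: closest eigenvalue is 4

M = np.linalg.inv(A - mu * np.eye(3))      # shifted inverse
v = np.ones(3)
for _ in range(50):                        # plain power iteration on M
    v = M @ v
    v = v / np.linalg.norm(v)

lam = v @ A @ v                            # Rayleigh quotient recovers lambda
assert np.isclose(lam, 4.0)
```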
59. From an optimization perspective, PCA can be derived by finding a low-dimensional representation Z of data X and a transformation matrix W that minimizes the reconstruction error ‖X − ZWᵀ‖²_F. What essential constraint must be placed on the columns of W for the solution to be the standard PCA projection?
Principal component analysis (PCA) from a geometric and optimization perspective
Hard
A.W must be a lower triangular matrix to ensure a unique solution.
B.The columns of W must form an orthonormal set (WᵀW = I).
C.No constraints are needed; ordinary least squares minimization automatically yields the principal components.
D.The rows of W must form an orthonormal set (WWᵀ = I).
Correct Answer: The columns of W must form an orthonormal set (WᵀW = I).
Explanation:
Without constraints, the problem is ill-posed because one could arbitrarily scale up the latent factors Z and scale down the transformation matrix W without changing the product ZWᵀ. To obtain a unique solution that corresponds to principal components, we require the basis vectors (the columns of W) to be orthonormal. This constraint forces the solution for W to be the matrix whose columns are the top k eigenvectors of the covariance matrix of X, which is the definition of the principal components.
Incorrect! Try again.
60. In modern matrix factorization models, the prediction for a rating is often modeled as r̂_{ui} = μ + b_u + b_i + p_uᵀq_i. What is the primary motivation for explicitly modeling the global bias μ, user bias b_u, and item bias b_i?
Applications of matrix factorization in recommendation systems
Hard
A.To ensure the latent factors p_u and q_i have a zero mean and unit variance.
B.It is a form of regularization that is more effective at preventing overfitting than a simple L2 penalty.
C.To make the overall optimization problem convex, guaranteeing a global minimum.
D.To account for systematic rating tendencies (e.g., some users are consistently harsh raters, some items are universally popular) so that the latent factors can model true user-item preference interactions.
Correct Answer: To account for systematic rating tendencies (e.g., some users are consistently harsh raters, some items are universally popular) so that the latent factors can model true user-item preference interactions.
Explanation:
Different users have different rating baselines, and different items have different average ratings. These are significant sources of variance in the data. By explicitly modeling these main effects with bias terms, we allow the more complex interaction term, p_uᵀq_i, to focus on modeling the residual signal: the specific affinity of a user for an item that isn't explained by their general tendencies. This separation of concerns leads to a more accurate and interpretable model.