1. What is the primary goal of dimensionality reduction in machine learning?
Need for Dimensionality Reduction
Easy
A.To automatically label unlabeled data
B.To increase the number of features in a dataset
C.To change continuous variables into categorical variables
D.To reduce the number of features while retaining the most important information
Correct Answer: To reduce the number of features while retaining the most important information
Explanation:
Dimensionality reduction aims to map high-dimensional data into a lower-dimensional space without losing significant underlying patterns or information.
2. Which term describes the phenomenon where data becomes exceedingly sparse as the number of features increases, negatively impacting model performance?
Need for Dimensionality Reduction
Easy
A.Feature Explosion
B.The Curse of Dimensionality
C.The Manifold Hypothesis
D.Overfitting Factor
Correct Answer: The Curse of Dimensionality
Explanation:
The 'Curse of Dimensionality' refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces, primarily because the volume of the space increases so fast that the available data becomes sparse.
3. How does dimensionality reduction generally affect the computational time of machine learning algorithms?
Need for Dimensionality Reduction
Easy
A.It decreases training time by reducing the amount of data to process.
B.It has no effect on training time.
C.It significantly increases training time.
D.It makes the training time unpredictable.
Correct Answer: It decreases training time by reducing the amount of data to process.
Explanation:
By reducing the number of features (dimensions), algorithms have fewer calculations to perform, leading to faster training times and lower computational costs.
4. What mathematical property does Principal Component Analysis (PCA) aim to maximize when finding new axes?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Easy
A.Number of outliers
B.Correlation between features
C.Variance of the data
D.Sparsity of the data
Correct Answer: Variance of the data
Explanation:
PCA projects data onto new axes (principal components) sequentially, such that each axis captures the maximum possible variance of the data.
5. In PCA, the principal components are geometrically related in what specific way?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Easy
A.They are randomly oriented.
B.They are parallel to each other.
C.They are orthogonal (perpendicular) to each other.
D.They form a 45-degree angle with the original axes.
Correct Answer: They are orthogonal (perpendicular) to each other.
Explanation:
Principal components are orthogonal to each other, meaning they represent completely uncorrelated dimensions in the newly transformed feature space.
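A minimal NumPy sketch of these two properties (variance-ordered, mutually orthogonal axes). The 2-D data, the mixing matrix, and the seed are arbitrary assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data, stretched and sheared by an arbitrary matrix.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)                      # centre the data

cov = np.cov(Xc, rowvar=False)               # 2x2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                        # coordinates along each principal axis
variances = scores.var(axis=0, ddof=1)       # variance captured by each axis
gram = eigvecs.T @ eigvecs                   # identity matrix if the axes are orthonormal
```

Here `variances` matches the sorted eigenvalues, and `gram` is the identity up to rounding, which is the orthogonality the explanation refers to.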
6. What type of transformation does PCA perform on the data?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Easy
A.Non-linear transformation
B.Linear transformation
C.Polynomial transformation
D.Logarithmic transformation
Correct Answer: Linear transformation
Explanation:
PCA is a linear dimensionality reduction technique. It applies linear transformations to project the original data into a lower-dimensional linear subspace.
7. Which principal component captures the most information (variance) from the original dataset?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Easy
A.The first principal component
B.The last principal component
C.The median principal component
D.The second principal component
Correct Answer: The first principal component
Explanation:
By definition, the first principal component is chosen to capture the maximum possible variance of the original data.
8. What is the core assumption of the 'Manifold Hypothesis' in representation learning?
Manifold Learning Overview
Easy
A.Higher dimensions always provide better separation of data classes.
B.All datasets are completely linear and require simple scaling.
C.High-dimensional data actually lies on or near a lower-dimensional manifold embedded within the high-dimensional space.
D.Data can only be modeled properly using neural networks.
Correct Answer: High-dimensional data actually lies on or near a lower-dimensional manifold embedded within the high-dimensional space.
Explanation:
The manifold hypothesis states that many real-world high-dimensional datasets actually lie on low-dimensional manifolds embedded within the higher-dimensional space.
9. Which of the following is a classic "toy dataset" often used to demonstrate manifold learning techniques?
Manifold Learning Overview
Easy
A.The Titanic dataset
B.The Swiss Roll dataset
C.The Iris dataset
D.The Boston Housing dataset
Correct Answer: The Swiss Roll dataset
Explanation:
The Swiss Roll is a classic 3D dataset where a 2D plane is rolled up into a spiral shape. It is perfectly suited for demonstrating how non-linear algorithms can unroll a manifold.
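One way to generate such a dataset is sketched below with NumPy. The helper `make_swiss_roll` and its constants are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def make_swiss_roll(n_points=1000, seed=0):
    """Sample a flat 2-D sheet (t, height) and roll it into a 3-D spiral."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1 + 2 * rng.random(n_points))   # position along the spiral
    height = 21 * rng.random(n_points)                  # position across the sheet
    x = t * np.cos(t)                                   # rolling happens in the x-z plane
    z = t * np.sin(t)
    return np.column_stack([x, height, z]), t

points, t = make_swiss_roll()
```

The intrinsic coordinates are `(t, height)`; a manifold learner that "unrolls" the data is effectively trying to recover them from `points`.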
10. Why might a data scientist choose a manifold learning algorithm over PCA?
Manifold Learning Overview
Easy
A.Because PCA only captures linear relationships, while manifold learning can capture complex, non-linear structures.
B.Because PCA is a supervised learning technique.
C.Because manifold learning increases the dimensions, solving the curse of dimensionality.
D.Because PCA is always slower than manifold learning algorithms.
Correct Answer: Because PCA only captures linear relationships, while manifold learning can capture complex, non-linear structures.
Explanation:
Linear techniques like PCA fail when data lies on a non-linear manifold. Manifold learning algorithms are specifically designed to uncover these complex, non-linear relationships.
11. How does t-SNE (t-Distributed Stochastic Neighbor Embedding) treat local versus global structure in the data?
A.It prioritizes preserving global structure over local structure.
B.It prioritizes preserving local structure (keeping similar data points close together).
C.It perfectly preserves the exact pairwise distances of all points globally.
D.It ignores both structures and places points randomly.
Correct Answer: It prioritizes preserving local structure (keeping similar data points close together).
Explanation:
t-SNE models the probability of points being neighbors. It heavily prioritizes local structure, meaning it focuses on grouping similar data points closely together in the reduced space.
The 't' stands for the Student's t-distribution, which is used in the low-dimensional space to compute the similarity between points. This helps solve the 'crowding problem'.
14. What does UMAP stand for?
UMAP – Conceptual
Easy
A.Universal Machine Algorithm Predictor
B.Unified Model for Automated Processing
C.Uniform Manifold Approximation and Projection
D.Unsupervised Mapping and Partitioning
Correct Answer: Uniform Manifold Approximation and Projection
Explanation:
UMAP stands for Uniform Manifold Approximation and Projection, representing its mathematical foundations in Riemannian geometry and algebraic topology.
15. Which is a widely recognized advantage of UMAP over t-SNE?
UMAP – Conceptual
Easy
A.UMAP is generally faster and preserves global structure better than t-SNE.
B.UMAP can only reduce data to a single dimension.
C.UMAP requires labeled data to function.
D.UMAP is strictly a linear model.
Correct Answer: UMAP is generally faster and preserves global structure better than t-SNE.
Explanation:
Compared to t-SNE, UMAP provides faster execution times on large datasets and tends to preserve more of the global structure (the relationships between distinct clusters) of the data.
16. Like t-SNE, UMAP is fundamentally what type of dimensionality reduction technique?
UMAP is a non-linear dimensionality reduction algorithm based on manifold learning, used for visualizing and organizing high-dimensional data.
17. What are the two main functional components of an Autoencoder?
Autoencoder Intuition
Easy
A.Actor and Critic
B.Convolution and Pooling
C.Generator and Discriminator
D.Encoder and Decoder
Correct Answer: Encoder and Decoder
Explanation:
An autoencoder consists of an Encoder (which compresses the input into a latent representation) and a Decoder (which reconstructs the input from that representation).
18. In an Autoencoder used for dimensionality reduction, what is the 'bottleneck'?
Autoencoder Intuition
Easy
A.The hidden layer with the smallest number of nodes, representing the compressed data
B.The final output layer of the decoder
C.The input layer with the highest number of nodes
D.The loss function used for optimization
Correct Answer: The hidden layer with the smallest number of nodes, representing the compressed data
Explanation:
The bottleneck is the central hidden layer with the lowest dimensionality. It forces the network to learn a compressed representation (latent space) of the input data.
19. What is the primary objective of a basic Autoencoder during training?
Autoencoder Intuition
Easy
A.To maximize the variance of the input data
B.To predict future values in a sequence
C.To classify data into predefined categories
D.To reconstruct its own input data at the output layer
Correct Answer: To reconstruct its own input data at the output layer
Explanation:
Autoencoders are trained using a reconstruction loss. The goal is to make the output as close as possible to the original input.
20. An Autoencoder learns a representation without requiring manual labels. This falls under which category of machine learning?
Because it uses the input data itself as the target output without requiring external human-annotated labels, autoencoders are a form of unsupervised (often called self-supervised) learning.
21. Which of the following best describes the geometric impact of the 'Curse of Dimensionality' on distance-based algorithms like k-Nearest Neighbors?
Need for Dimensionality Reduction
Medium
A.As dimensions increase, the Euclidean distance between points approaches zero.
B.As dimensions increase, the ratio of the distance to the nearest neighbor to the distance of the farthest neighbor approaches 1.
C.As dimensions increase, the variance of the data linearly decreases.
D.As dimensions increase, all points converge to a single spatial coordinate.
Correct Answer: As dimensions increase, the ratio of the distance to the nearest neighbor to the distance of the farthest neighbor approaches 1.
Explanation:
In high-dimensional spaces, the volume grows exponentially with the number of dimensions, so pairwise distances concentrate and data points become nearly equidistant from one another. This makes distance metrics like Euclidean distance lose their discriminative power.
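The effect can be checked with a small simulation. This is a sketch under assumed settings (uniform points in the unit hypercube, 500 points, a single random query); the function name is hypothetical:

```python
import numpy as np

def nearest_to_farthest_ratio(dim, n_points=500, seed=0):
    """Ratio of nearest to farthest Euclidean distance from one random query."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))       # uniform points in the unit hypercube
    query = rng.random(dim)
    dists = np.linalg.norm(data - query, axis=1)
    return dists.min() / dists.max()

# The ratio is small in low dimensions and moves toward 1 as dim grows.
ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
```

A ratio near 1 means the nearest and farthest neighbors are almost the same distance away, which is why k-NN loses its ability to discriminate.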
22. How does dimensionality reduction help in mitigating the Hughes phenomenon in machine learning models?
Need for Dimensionality Reduction
Medium
A.By converting continuous numerical features into discrete categories.
B.By replacing missing values with the mean of the principal components.
C.By reducing the feature space, thereby decreasing the likelihood of overfitting when the number of training samples is limited.
D.By increasing the number of features to artificially boost training accuracy.
Correct Answer: By reducing the feature space, thereby decreasing the likelihood of overfitting when the number of training samples is limited.
Explanation:
The Hughes phenomenon states that with a fixed number of training samples, predictive power initially increases as dimensions increase but then degrades. Dimensionality reduction keeps the feature space small, preventing overfitting.
23. If a dataset has highly collinear features, what is the primary benefit of applying a dimensionality reduction technique?
Need for Dimensionality Reduction
Medium
A.It increases the absolute magnitude of the feature weights.
B.It completely removes the need to normalize or scale the data before modeling.
C.It eliminates collinearity by creating orthogonal or independent representations of the original features.
D.It allows the model to map the data to a higher-dimensional space to find a linear boundary.
Correct Answer: It eliminates collinearity by creating orthogonal or independent representations of the original features.
Explanation:
Techniques like PCA project the original correlated variables into a new set of linearly uncorrelated variables (principal components), solving the multicollinearity problem.
24. Geometrically, what does the first principal component (PC1) represent in PCA?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Medium
A.The vector that is orthogonal to the direction of maximum variance.
B.The direction in space along which the data has the minimum variance.
C.The direction in space that minimizes the orthogonal projection distance of the data points to that line.
D.The centroid of the data points in the original high-dimensional space.
Correct Answer: The direction in space that minimizes the orthogonal projection distance of the data points to that line.
Explanation:
Geometrically, maximizing the variance of the projected data points is mathematically equivalent to minimizing the mean squared orthogonal distance from the data points to the projection line.
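The equivalence follows from the Pythagorean theorem: for centered data, variance along a line plus mean squared orthogonal distance to that line is a constant. A numerical sketch on made-up 2-D data (the mixing matrix and angle grid are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
Xc = X - X.mean(axis=0)

def variance_and_residual(angle):
    """For the unit direction at `angle`: (variance along it, mean squared orthogonal distance)."""
    u = np.array([np.cos(angle), np.sin(angle)])
    along = Xc @ u                               # signed projection lengths
    residual = Xc - np.outer(along, u)           # part orthogonal to the line
    return np.mean(along**2), np.mean(np.sum(residual**2, axis=1))

angles = np.linspace(0.0, np.pi, 180, endpoint=False)
results = np.array([variance_and_residual(a) for a in angles])
total = np.mean(np.sum(Xc**2, axis=1))           # does not depend on the direction
best_by_variance = angles[np.argmax(results[:, 0])]
best_by_residual = angles[np.argmin(results[:, 1])]
```

Because each row of `results` sums to `total`, the angle that maximizes the first column is the one that minimizes the second.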
25. Suppose you apply PCA to a 3D dataset shaped exactly like a hollow sphere. What will the eigenvalues of the first three principal components look like?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Medium
A.The first eigenvalue will be much larger than the other two.
B.The first eigenvalue will be zero, and the other two will be equal.
C.All three eigenvalues will be approximately equal.
D.The first two eigenvalues will be large and equal, and the third will be zero.
Correct Answer: All three eigenvalues will be approximately equal.
Explanation:
A sphere has equal variance in all directions. Therefore, PCA will find that the variance (represented by the eigenvalues) is approximately equal along any three mutually orthogonal axes.
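A quick check, assuming points sampled uniformly on the unit sphere (obtained here by normalizing Gaussian samples; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Uniform points on the unit sphere: normalise isotropic Gaussian samples.
g = rng.normal(size=(20000, 3))
sphere = g / np.linalg.norm(g, axis=1, keepdims=True)

cov = np.cov(sphere, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending order
```

For a unit sphere each eigenvalue should come out close to 1/3, so no direction is preferred.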
26. Why is it crucial to mean-center and scale data (e.g., using standard scaling) before applying PCA?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Medium
A.Because PCA is sensitive to the scale of the features; variables with larger scales will dominate the principal components.
B.Because PCA cannot be computed on matrices containing negative numbers.
C.To ensure that the covariance matrix is an identity matrix.
D.To convert non-linear relationships into linear ones before finding orthogonal vectors.
Correct Answer: Because PCA is sensitive to the scale of the features; variables with larger scales will dominate the principal components.
Explanation:
PCA seeks directions of maximum variance. If features are on vastly different scales, the feature with the largest scale will artificially exhibit the highest variance, dominating the first principal component.
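The sketch below illustrates this with two hypothetical, independent features on very different scales (the names `income` and `age` and their distributions are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
income = rng.normal(50_000, 15_000, size=n)   # large numeric scale
age = rng.normal(40, 12, size=n)              # small numeric scale
X = np.column_stack([income, age])

def first_component(data):
    """Unit vector of the leading principal axis of `data`."""
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[0]

pc1_raw = first_component(X)                                      # dominated by income
pc1_scaled = first_component((X - X.mean(axis=0)) / X.std(axis=0))  # both features weigh in
```

Without scaling, `pc1_raw` is almost exactly the income axis; after standardization both features receive equal weight in `pc1_scaled`.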
27. If a dataset forms a highly curved 'Swiss Roll' shape in 3D space, why might PCA perform poorly when reducing it to 2D?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Medium
A.PCA will automatically convert the continuous values into discrete clusters, destroying the roll structure.
B.PCA requires the data to have exactly zero variance in the third dimension.
C.PCA relies on gradient descent, which gets stuck in local minima on curved surfaces.
D.PCA attempts to map the data to a 2D Euclidean subspace, which will overlap points from different layers of the roll.
Correct Answer: PCA attempts to map the data to a 2D Euclidean subspace, which will overlap points from different layers of the roll.
Explanation:
PCA is a linear technique. It projects data onto flat planes (subspaces). When unrolling a Swiss Roll, a non-linear transformation is needed; a flat projection will crush the rolled layers on top of each other.
28. What is the fundamental assumption underlying manifold learning algorithms?
Manifold Learning Overview
Medium
A.High-dimensional data actually lies on or near a lower-dimensional, potentially non-linear surface embedded within the high-dimensional space.
B.High-dimensional data consists purely of independent, uniformly distributed random variables.
C.Any high-dimensional dataset can be losslessly compressed into exactly two dimensions.
D.High-dimensional data can be accurately clustered using only Euclidean distance from the origin.
Correct Answer: High-dimensional data actually lies on or near a lower-dimensional, potentially non-linear surface embedded within the high-dimensional space.
Explanation:
The manifold hypothesis posits that real-world high-dimensional data (like images or speech) is highly structured and concentrates near a lower-dimensional topological space (a manifold).
29. In the context of manifold learning, how does 'geodesic distance' differ from 'Euclidean distance'?
Manifold Learning Overview
Medium
A.Geodesic distance is only applicable to linear subspaces, whereas Euclidean distance is used for curved manifolds.
B.Geodesic distance is the straight-line distance in the high-dimensional space, while Euclidean distance is the shortest path along the manifold.
C.Geodesic distance measures the shortest path along the curved surface of the manifold, whereas Euclidean distance cuts straight through the ambient space.
D.There is no difference; they are mathematically equivalent in all dimensionality reduction techniques.
Correct Answer: Geodesic distance measures the shortest path along the curved surface of the manifold, whereas Euclidean distance cuts straight through the ambient space.
Explanation:
Euclidean distance measures the direct 'bird-flight' straight line between two points. Geodesic distance measures the distance an ant would have to walk while staying strictly on the curved surface (the manifold) to get from one point to another.
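A circle is the simplest curved manifold on which to compare the two. The helper below is a hypothetical illustration: the chord is the Euclidean distance, the shorter arc is the geodesic distance.

```python
import math

def euclidean_and_geodesic(theta_a, theta_b, radius=1.0):
    """Distances between two points on a circle, given by their angles in radians."""
    ax, ay = radius * math.cos(theta_a), radius * math.sin(theta_a)
    bx, by = radius * math.cos(theta_b), radius * math.sin(theta_b)
    chord = math.hypot(ax - bx, ay - by)             # straight line through the plane
    gap = abs(theta_a - theta_b) % (2 * math.pi)
    arc = radius * min(gap, 2 * math.pi - gap)       # shortest walk along the circle
    return chord, arc

chord, arc = euclidean_and_geodesic(0.0, math.pi)    # two opposite points
```

For opposite points on a unit circle the chord has length 2 while the arc has length π, so the "ant" walks farther than the "bird" flies.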
30. Which of the following is a primary reason why manifold learning techniques are preferred over linear methods for complex datasets?
Manifold Learning Overview
Medium
A.They calculate projections much faster than linear methods using singular value decomposition.
B.They map the data to a higher-dimensional space where classes are linearly separable.
C.They are able to capture and unroll non-linear relationships that linear projections would overlap or distort.
D.They guarantee a globally convex optimization problem with no hyperparameters.
Correct Answer: They are able to capture and unroll non-linear relationships that linear projections would overlap or distort.
Explanation:
Manifold learning focuses on preserving the local neighborhood structures of data on curved surfaces, allowing it to "unroll" non-linear shapes (like a Swiss Roll) successfully, unlike linear methods like PCA.
31. How does t-SNE solve the 'crowding problem' that often occurs in standard SNE?
Non-Linear Dimensionality Reduction Concepts - t-SNE
Medium
A.By strictly penalizing the distance between faraway points using an L1 regularization term.
B.By performing PCA first and restricting the t-SNE projection to the top 2 principal components.
C.By utilizing a Student's t-distribution with one degree of freedom in the low-dimensional space to model similarities.
D.By using a Gaussian distribution in the low-dimensional space and a Cauchy distribution in the high-dimensional space.
Correct Answer: By utilizing a Student's t-distribution with one degree of freedom in the low-dimensional space to model similarities.
Explanation:
The crowding problem refers to points clumping together in the center of the map. t-SNE solves this by using a heavy-tailed Student's t-distribution in the low-dimensional space, which pushes moderately distant points further apart.
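The difference in tail weight is easy to see by evaluating both unnormalized kernels at a few map distances. This is a sketch; the function names and the chosen distances are assumptions:

```python
import math

def gaussian_similarity(distance):
    """Unnormalised Gaussian kernel, as in the low-dimensional map of standard SNE."""
    return math.exp(-distance**2)

def student_t_similarity(distance):
    """Unnormalised Student-t kernel with one degree of freedom, as in t-SNE."""
    return 1.0 / (1.0 + distance**2)

distances = (0.5, 1.0, 2.0, 4.0)
table = [(d, gaussian_similarity(d), student_t_similarity(d)) for d in distances]
```

At larger distances the Student-t value stays orders of magnitude above the Gaussian one, so moderately dissimilar points can sit much farther apart in the map while still matching their target similarity.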
32. In t-SNE, the cost function minimizes the Kullback-Leibler (KL) divergence between two probability distributions. What do these distributions represent?
Non-Linear Dimensionality Reduction Concepts - t-SNE
Medium
A.The variance of the high-dimensional features and the variance of the low-dimensional projection.
B.The pairwise similarities of data points in the high-dimensional space and the low-dimensional space.
C.The Euclidean distances and the Cosine similarities of the data points.
D.The prior and posterior probabilities of the cluster centroids.
Correct Answer: The pairwise similarities of data points in the high-dimensional space and the low-dimensional space.
Explanation:
t-SNE converts high-dimensional Euclidean distances into conditional probabilities representing similarities, does the same in the low-dimensional space, and minimizes the KL divergence between these two distributions to ensure the low-dimensional map reflects the high-dimensional structure.
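A simplified sketch of this cost is shown below. It assumes one fixed Gaussian bandwidth `sigma` for every point, whereas t-SNE tunes a bandwidth per point from the perplexity; all function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Matrix of squared Euclidean distances between rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff**2, axis=-1)

def high_dim_affinities(X, sigma=1.0):
    """Symmetrised joint distribution P from Gaussian kernels (fixed sigma)."""
    w = np.exp(-pairwise_sq_dists(X) / (2 * sigma**2))
    np.fill_diagonal(w, 0.0)
    cond = w / w.sum(axis=1, keepdims=True)      # conditional p_{j|i}
    return (cond + cond.T) / (2 * len(X))

def low_dim_affinities(Y):
    """Joint distribution Q from the Student-t kernel with one degree of freedom."""
    w = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(w, 0.0)
    return w / w.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), skipping entries where P is zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                    # hypothetical high-dimensional data
Y = rng.normal(size=(30, 2))                     # a random, unoptimised 2-D map
P, Q = high_dim_affinities(X), low_dim_affinities(Y)
cost = kl_divergence(P, Q)
```

t-SNE would then move the rows of `Y` by gradient descent to reduce `cost`.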
33. What is the role of the 'perplexity' parameter in t-SNE?
Non-Linear Dimensionality Reduction Concepts - t-SNE
Medium
A.It dictates the number of iterations the gradient descent algorithm will run.
B.It defines the degrees of freedom for the Student's t-distribution used in the low-dimensional space.
C.It acts as a continuous measure of the effective number of nearest neighbors each point has, balancing attention between local and global aspects.
D.It sets the learning rate for the KL divergence optimization.
Correct Answer: It acts as a continuous measure of the effective number of nearest neighbors each point has, balancing attention between local and global aspects.
Explanation:
Perplexity loosely defines how many neighbors t-SNE considers for a specific point. A lower perplexity focuses heavily on local variations, while a higher perplexity considers more of the global structure.
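Perplexity is defined as 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution. The sketch below computes it for one point with made-up squared distances, for a narrow and a wide Gaussian bandwidth (the values are assumptions):

```python
import numpy as np

def perplexity(sq_dists_to_others, sigma):
    """Perplexity 2**H of the conditional distribution p_{j|i} for one point."""
    w = np.exp(-sq_dists_to_others / (2 * sigma**2))
    p = w / w.sum()
    p = p[p > 0]                                  # avoid log2(0)
    entropy_bits = -np.sum(p * np.log2(p))
    return 2.0 ** entropy_bits

# Hypothetical squared distances from one point to 50 others.
sq = np.linspace(0.1, 25.0, 50)
narrow = perplexity(sq, sigma=0.5)    # only the closest few points matter
wide = perplexity(sq, sigma=50.0)     # nearly all 50 points count as neighbours
```

In t-SNE the user fixes the target perplexity and the algorithm searches for the per-point bandwidth that achieves it.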
34. When interpreting a t-SNE plot, which of the following statements is generally true regarding the distances between distinct clusters?
Non-Linear Dimensionality Reduction Concepts - t-SNE
Medium
A.Clusters that are close together in the t-SNE plot are mathematically guaranteed to be close in high-dimensional space.
B.The distance between clusters accurately reflects their absolute distance in the high-dimensional space.
C.The size (spread) of a cluster in a t-SNE plot is proportional to its variance in the high-dimensional space.
D.The global distances between distant clusters in a t-SNE plot are often meaningless and highly dependent on initialization.
Correct Answer: The global distances between distant clusters in a t-SNE plot are often meaningless and highly dependent on initialization.
Explanation:
t-SNE is designed to preserve local structure. It adapts its notion of distance to local density. As a result, the distance between distant clusters and the visual spread of clusters in the 2D plot do not reliably reflect high-dimensional global distances or variances.
35. Conceptually, how does UMAP handle the balance between local and global structure compared to t-SNE?
UMAP – Conceptual
Medium
A.UMAP requires users to explicitly label global clusters, making it a supervised learning algorithm.
B.UMAP completely ignores global structure to compute projections faster than t-SNE.
C.UMAP preserves global structure by using linear combinations of features, whereas t-SNE uses non-linear combinations.
D.UMAP uses cross-entropy as a cost function and different initialization, often preserving more global structure than t-SNE.
Correct Answer: UMAP uses cross-entropy as a cost function and different initialization, often preserving more global structure than t-SNE.
Explanation:
UMAP relies on algebraic topology and Riemannian geometry. Its use of a different cost function (cross-entropy instead of KL divergence) and graph-based initialization (like Laplacian Eigenmaps) allows it to capture a better balance of local and global structure compared to t-SNE.
36. Which two main hyperparameters control the geometry of the UMAP projection?
UMAP – Conceptual
Medium
A.Number of neighbors (n_neighbors) and Minimum distance (min_dist).
B.Number of components and Maximum iterations.
C.Perplexity and Epsilon.
D.Learning rate and Momentum.
Correct Answer: Number of neighbors (n_neighbors) and Minimum distance (min_dist).
Explanation:
n_neighbors controls the balance between local and global structure (similar to perplexity in t-SNE), and min_dist controls how tightly UMAP packs points together in the low-dimensional space.
37. If you increase the min_dist parameter in UMAP, what visual effect will it likely have on the resulting embedding?
UMAP – Conceptual
Medium
A.The embedding will look more spread out, preventing points from bunching up too closely.
B.The algorithm will fall back to performing standard Principal Component Analysis.
C.The number of dimensions of the output will increase.
D.The clusters will become extremely tight, with points overlapping each other.
Correct Answer: The embedding will look more spread out, preventing points from bunching up too closely.
Explanation:
The min_dist parameter dictates the minimum distance apart that points are allowed to be in the low-dimensional representation. Increasing it forces points to spread out, preserving broad topological structure rather than packing points into dense clusters.
38. In a standard autoencoder used for dimensionality reduction, what is the primary purpose of the 'bottleneck' layer?
Autoencoder Intuition
Medium
A.To classify the input data into predefined categories directly without a decoder.
B.To increase the dimensionality of the data so that non-linear relations become linearly separable.
C.To force the network to learn a compressed, lower-dimensional latent representation of the input data.
D.To introduce noise into the input data to prevent the model from overfitting.
Correct Answer: To force the network to learn a compressed, lower-dimensional latent representation of the input data.
Explanation:
The bottleneck layer has fewer nodes than the input layer. By passing information through this restriction and attempting to reconstruct the original input, the network is forced to learn a compressed, salient representation (latent space) of the data.
39. If an autoencoder has strictly linear activation functions and uses Mean Squared Error (MSE) loss, its bottleneck layer will learn a representation that spans the same subspace as which other technique?
Autoencoder Intuition
Medium
A.K-Means Clustering
B.t-SNE
C.UMAP
D.Principal Component Analysis (PCA)
Correct Answer: Principal Component Analysis (PCA)
Explanation:
A linear autoencoder trained with MSE minimizes the same reconstruction error as PCA. Therefore, the weights of the linear autoencoder will span the exact same principal subspace as PCA (though not necessarily orthogonal like standard PCA components).
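A rough NumPy sketch of this relationship: a two-layer linear autoencoder trained by plain gradient descent, compared against the PCA reconstruction error. The data, initialization, learning rate, and step count are hand-picked assumptions, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 2
X = rng.normal(size=(n, d)) * np.array([2.0, 1.5, 0.5, 0.3, 0.1])
X = X - X.mean(axis=0)                       # centred, so no bias terms are needed

def mse(X, W_enc, W_dec):
    """Mean squared reconstruction error per sample."""
    return np.mean(np.sum((X @ W_enc @ W_dec - X) ** 2, axis=1))

W_enc = 0.1 * rng.normal(size=(d, k))        # encoder weights (d -> k)
W_dec = 0.1 * rng.normal(size=(k, d))        # decoder weights (k -> d)
initial_loss = mse(X, W_enc, W_dec)

lr = 0.01                                    # hand-picked step size
for _ in range(3000):
    Z = X @ W_enc                            # bottleneck codes
    R = Z @ W_dec - X                        # reconstruction residual
    grad_dec = 2.0 / n * Z.T @ R
    grad_enc = 2.0 / n * X.T @ R @ W_dec.T
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec
final_loss = mse(X, W_enc, W_dec)

# PCA baseline: the error of the best rank-k projection equals the sum of the
# discarded covariance eigenvalues (1/n normalisation, to match `mse`).
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1]
pca_loss = eigvals[k:].sum()
```

`pca_loss` is a lower bound for any rank-`k` linear reconstruction, so `final_loss` can only approach it from above as training converges.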
40. Why is the reconstruction loss essential in training an autoencoder?
Autoencoder Intuition
Medium
A.It calculates the distance between the encoded representations to maximize cluster separation.
B.It measures the classification error of the labels provided in the dataset.
C.It ensures that the weights of the encoder and decoder remain orthogonal.
D.It quantifies how well the decoder can rebuild the original input from the compressed latent representation.
Correct Answer: It quantifies how well the decoder can rebuild the original input from the compressed latent representation.
Explanation:
Autoencoders are trained in an unsupervised manner by setting the target output to be equal to the input. The reconstruction loss (e.g., MSE between input and output) acts as the penalty guiding the network to retain as much useful information as possible in the bottleneck.
41. Let $D_{\max}$ and $D_{\min}$ be the maximum and minimum Euclidean distances from a randomly selected query point to a set of points uniformly distributed in a $d$-dimensional hypercube. As the dimension $d \to \infty$, what is the limiting behavior of the relative contrast $\frac{D_{\max} - D_{\min}}{D_{\min}}$?
Need for Dimensionality Reduction
Hard
A.It diverges to $\infty$.
B.It converges to $0$.
C.It converges to a constant strictly dependent on the data's variance.
D.It converges to $1$.
Correct Answer: It converges to $0$.
Explanation:
Under broad conditions, the distances to the nearest and furthest neighbor become relatively identical as dimensions increase (a phenomenon known as distance concentration). This collapses the metric space's contrast, making distance-based algorithms like k-NN fail in high dimensions.
42. Consider a $d$-dimensional multivariate Gaussian distribution with a zero mean and an identity covariance matrix. As $d$ becomes very large, where is the vast majority of the probability mass located geometrically?
Need for Dimensionality Reduction
Hard
A.Highly concentrated strictly at the origin, which is the mode of the distribution.
B.In a thin spherical shell (annulus) at a distance of approximately $\sqrt{d}$ from the origin.
C.Uniformly distributed throughout the $d$-dimensional hypercube enclosing the distribution.
D.Along the axes of the standard basis vectors, forming a sparse star-like shape.
Correct Answer: In a thin spherical shell (annulus) at a distance of approximately $\sqrt{d}$ from the origin.
Explanation:
Due to the geometry of high-dimensional spaces, the volume of a hypersphere scales such that the probability mass concentrates in a thin annulus (the "Gaussian soap bubble" effect) at radius $\sqrt{d}$. The origin has the highest density, but virtually zero volume.
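A short simulation makes the shell visible. It assumes 2,000 samples from a standard Gaussian in 1,000 dimensions (both numbers arbitrary) and measures how far each sample lies from the origin:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
samples = rng.standard_normal((2000, d))     # draws from N(0, I_d)
radii = np.linalg.norm(samples, axis=1)      # distance of each draw from the origin

mean_radius = radii.mean()                   # expected to sit near sqrt(d)
spread = radii.std()                         # expected to be small relative to sqrt(d)
```

The radii cluster tightly around $\sqrt{d} \approx 31.6$, and essentially no sample lands near the origin even though the density is highest there.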
43. You apply Principal Component Analysis (PCA) on a data matrix $X$. If you deliberately do NOT mean-center the data prior to computing the covariance matrix surrogate $X^\top X$, how does the first principal component (PC1) geometrically behave?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Hard
A.The matrix becomes rank-deficient, preventing the extraction of PC1.
B.PC1 remains mathematically identical to the centered case, but the corresponding eigenvalue will be uniformly shifted by the magnitude of the mean.
C.PC1 will be strictly orthogonal to the mean vector of the dataset.
D.PC1 will point approximately towards the mean vector of the dataset, effectively capturing the distance from the origin rather than the direction of maximum variance.
Correct Answer: PC1 will point approximately towards the mean vector of the dataset, effectively capturing the distance from the origin rather than the direction of maximum variance.
Explanation:
If data is not centered, the uncentered covariance matrix is heavily dominated by the mean vector. Consequently, the first eigenvector will tend to align with the mean, failing to capture the true axes of maximum variation.
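The sketch below shows the effect on a made-up 2-D cloud that sits far from the origin and is elongated perpendicular to its mean vector (the offset, noise scales, and helper name are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
offset = np.array([10.0, 10.0])
# Noise with standard deviation 1.0 along (1, -1) and 0.2 along (1, 1).
noise = rng.normal(size=(500, 2)) * np.array([1.0, 0.2])
rotation = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2)
X = offset + noise @ rotation

def leading_direction(M):
    """First right singular vector of M, i.e. the top eigenvector of M.T @ M."""
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[0]

pc1_uncentred = leading_direction(X)
pc1_centred = leading_direction(X - X.mean(axis=0))
unit_mean = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
```

Without centering, `pc1_uncentred` lines up with `unit_mean`; after centering, `pc1_centred` follows the true direction of spread, which here is nearly orthogonal to the mean.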
44. Let $X = U \Sigma V^\top$ be the thin Singular Value Decomposition (SVD) of a centered data matrix $X$. How can the ratio of variance explained by the first $k$ principal components be mathematically expressed?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Hard
A.
B.
C.
D.
Correct Answer: $\frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{j} \sigma_j^2}$
Explanation:
The variance along a principal component is given by the corresponding eigenvalue of the covariance matrix. Since $\lambda_i \propto \sigma_i^2$, where $\sigma_i$ are the singular values of $X$, the explained variance ratio uses the squares of the singular values.
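A numerical check of this relationship on made-up data, assuming the sample covariance uses the usual $1/(n-1)$ normalization (the helper name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

singular_values = np.linalg.svd(Xc, compute_uv=False)   # returned in descending order
eigenvalues = singular_values**2 / (len(Xc) - 1)        # covariance eigenvalues

def explained_variance_ratio(k):
    """Share of total variance captured by the first k principal components."""
    return np.sum(singular_values[:k] ** 2) / np.sum(singular_values**2)

cov_eigs = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
```

The normalization constant cancels in the ratio, which is why only the squared singular values appear.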
45. Which of the following generative assumptions mathematically guarantees that standard PCA will recover the true underlying lower-dimensional representation of a dataset?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Hard
A.The data lies on an isometric Riemannian manifold where geodesic distance is proportional to Euclidean distance.
B.The data is generated from a lower-dimensional affine subspace with added isotropic Gaussian noise.
C.The dataset has extremely high variance along non-linear curves, and its covariance matrix is strictly full rank.
D.The features of the dataset are strictly independent and follow a continuous uniform distribution.
Correct Answer: The data is generated from a lower-dimensional affine subspace with added isotropic Gaussian noise.
Explanation:
PCA implicitly relies on a linear generative model. Probabilistic PCA shows that PCA optimally recovers the true subspace if the data is generated by a linear combination of independent Gaussian variables plus isotropic (spherical) Gaussian noise.
46If the covariance matrix $C$ of a dataset has a condition number $\kappa(C) \to \infty$, what is the strict geometric implication for PCA?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Hard
A.The dataset lies entirely within a lower-dimensional hyperplane, meaning at least one principal component will capture exactly zero variance.
B.The data contains extreme, infinitely distant outliers that permanently distort the principal components.
C.The intrinsic dimensionality of the data is strictly equal to its extrinsic dimensionality.
D.The dataset is perfectly isotropic (spherical), making all principal directions equally valid.
Correct Answer: The dataset lies entirely within a lower-dimensional hyperplane, meaning at least one principal component will capture exactly zero variance.
Explanation:
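A condition number approaching infinity implies $\lambda_{\min} \to 0$, because $\kappa = \lambda_{\max} / \lambda_{\min}$ and $\lambda_{\max}$ is finite for a finite dataset. Geometrically, this means there is zero variance along at least one axis, so the data is completely contained within a lower-dimensional hyperplane.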
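A minimal sketch of the degenerate case, assuming three hypothetical 2-D points that lie exactly on the line $y = 2x$; the closed-form eigenvalue helper is written for this example only.

```python
import math

def sym_2x2_eigenvalues(a, b, d):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, d]]."""
    half_trace = (a + d) / 2.0
    det = a * d - b * b
    root = math.sqrt(max(half_trace * half_trace - det, 0.0))
    return half_trace + root, half_trace - root

# Toy data lying exactly on the line y = 2x.
xs = [-1.0, 0.0, 1.0]
pts = [(x, 2.0 * x) for x in xs]
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n

# Sample covariance entries (divide by n - 1).
cxx = sum((x - mx) ** 2 for x, _ in pts) / (n - 1)
cxy = sum((x - mx) * (y - my) for x, y in pts) / (n - 1)
cyy = sum((y - my) ** 2 for _, y in pts) / (n - 1)

# The smallest eigenvalue vanishes, so the second PC captures no variance.
lam_max, lam_min = sym_2x2_eigenvalues(cxx, cxy, cyy)
```

All of the variance sits in `lam_max`; the ratio `lam_max / lam_min` is unbounded because the data occupies a 1-D subspace of the 2-D space.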
47In Kernel PCA using an RBF (Gaussian) kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, how does the dimension of the mapped feature space fundamentally affect the extraction of principal components?
Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition)
Hard
A.The feature space is infinite-dimensional, but the maximum number of non-zero principal components is strictly bounded by the number of data points $n$.
B.The feature space dimension is explicitly computed, restricting Kernel PCA to be performed only in time.
C.The number of principal components extracted is directly determined by the hyperparameter $\gamma$, regardless of dataset size.
D.The feature space maps the data to a maximum of dimensions, allowing polynomial time extraction of components.
Correct Answer: The feature space is infinite-dimensional, but the maximum number of non-zero principal components is strictly bounded by the number of data points $n$.
Explanation:
The RBF kernel maps data into an infinite-dimensional Hilbert space. However, because the data only spans a subspace of at most $n$ dimensions (where $n$ is the number of samples), Kernel PCA leverages the kernel trick to extract at most $n$ non-zero components without explicitly calculating the infinite-dimensional mapping.
48Manifold learning relies heavily on neighborhood graphs. Which of the following is a critical edge-case vulnerability when defining neighborhoods using $k$-nearest neighbors ($k$-NN) as opposed to an $\epsilon$-radius graph?
Manifold Learning Overview
Hard
A.$k$-NN can falsely connect distant regions of the manifold (short-circuiting) if the sampling density varies wildly, whereas $\epsilon$-radius graphs might yield disconnected components in sparse regions.
B.$\epsilon$-radius graphs guarantee a fully connected graph regardless of data density, while $k$-NN naturally isolates outliers.
C.$k$-NN graphs are always undirected by default, while $\epsilon$-radius graphs are naturally directed due to asymmetric distance calculations.
D.$k$-NN graphs fail completely on non-convex manifolds, whereas $\epsilon$-radius graphs are mathematically invariant to manifold convexity.
Correct Answer: $k$-NN can falsely connect distant regions of the manifold (short-circuiting) if the sampling density varies wildly, whereas $\epsilon$-radius graphs might yield disconnected components in sparse regions.
Explanation:
In $k$-NN, a point in a very sparse region must still connect to $k$ neighbors, which can bridge gaps between separate parts of the manifold (short-circuiting). Conversely, an $\epsilon$-radius graph connects only points within a distance $\epsilon$, which avoids such forced long edges but can create disconnected subgraphs where data is sparse.
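Both failure modes show up on a tiny 1-D example. The following is a sketch under assumptions: the five points, `k=2`, `eps=0.5`, and the helper names are chosen purely for illustration.

```python
def knn_edges(points, k):
    """Undirected edges linking each 1-D point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: abs(points[j] - p))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def radius_edges(points, eps):
    """Undirected edges linking every pair of points at distance <= eps."""
    return {(i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if abs(points[i] - points[j]) <= eps}

def count_components(n, edges):
    """Number of connected components, via union-find."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# Two well-separated groups: {0.0, 0.1, 0.2} and {5.0, 5.1}.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
knn_components = count_components(len(points), knn_edges(points, k=2))
eps_components = count_components(len(points), radius_edges(points, eps=0.5))
```

With `k=2`, the point at 5.0 has only one close neighbor, so its second edge jumps across the gap and the $k$-NN graph becomes a single component; the $\epsilon$-radius graph keeps the two groups apart.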
49Consider Isomap and Locally Linear Embedding (LLE). Which of the following statements mathematically contrasts their core structure preservation objectives?
Manifold Learning Overview
Hard
A.Isomap assumes the manifold is globally linear and applies SVD, whereas LLE models the manifold using a global high-degree polynomial function.
B.Isomap seeks to globally preserve approximated geodesic distances using Multidimensional Scaling, whereas LLE seeks to preserve local barycentric coordinates used to reconstruct each point from its neighbors.
C.Isomap minimizes a cross-entropy loss based on topological neighborhood probabilities, whereas LLE maximizes the variance of the projected data subject to orthogonality constraints.
D.Isomap preserves the exact Euclidean distances for all pairs in the dataset, whereas LLE only preserves Euclidean distances for the $k$-nearest neighbors.
Correct Answer: Isomap seeks to globally preserve approximated geodesic distances using Multidimensional Scaling, whereas LLE seeks to preserve local barycentric coordinates used to reconstruct each point from its neighbors.
Explanation:
Isomap builds a global distance matrix based on shortest paths (geodesics) and applies MDS to preserve these global distances. LLE takes a local approach, characterizing the local geometry of each neighborhood by linear coefficients (barycentric coordinates) and preserving these coefficients in the lower-dimensional space.
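The geodesic step of Isomap can be sketched on points sampled from a half circle. This is an illustration only: the sampling, `k=2`, and the helper names are assumptions, and the MDS step that follows in Isomap is omitted.

```python
import heapq
import math

def knn_graph(points, k):
    """Symmetric k-NN graph as a list of {neighbor: distance} dicts."""
    n = len(points)
    adj = [dict() for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            d = math.dist(points[i], points[j])
            adj[i][j] = d
            adj[j][i] = d
    return adj

def shortest_path_length(adj, source, target):
    """Dijkstra's algorithm on the neighborhood graph."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, math.inf):
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf

# 50 points on a unit half circle, from angle 0 to pi.
n = 50
arc = [(math.cos(math.pi * i / (n - 1)), math.sin(math.pi * i / (n - 1)))
       for i in range(n)]
graph = knn_graph(arc, k=2)

# Graph distance approximates the arc length (pi); the straight line is 2.
geodesic = shortest_path_length(graph, 0, n - 1)
euclidean = math.dist(arc[0], arc[-1])
```

Isomap would feed the graph distances into MDS, so the two endpoints end up roughly $\pi$ apart in the embedding rather than 2 apart.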
50In Spectral Embedding (Laplacian Eigenmaps), the objective is to minimize $\sum_{i,j} W_{ij} \|y_i - y_j\|^2$ subject to $Y^\top D Y = I$. Mathematically, this is equivalent to finding the eigenvectors associated with:
Manifold Learning Overview
Hard
A.The smallest non-zero eigenvalues of the normalized graph Laplacian $L_{\mathrm{rw}} = D^{-1} L$.
B.The largest eigenvalues of the adjacency matrix $W$.
C.The largest eigenvalues of the unnormalized graph Laplacian $L = D - W$.
D.The smallest eigenvalues of the local covariance matrix .
Correct Answer: The smallest non-zero eigenvalues of the normalized graph Laplacian $L_{\mathrm{rw}} = D^{-1} L$.
Explanation:
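The objective function can be rewritten as a trace minimization problem: $\min_Y \operatorname{tr}(Y^\top L Y)$ with $L = D - W$ (up to a constant factor of 2). Subject to $Y^\top D Y = I$, this becomes a generalized eigenvalue problem $L y = \lambda D y$, which is equivalent to finding the eigenvectors for the smallest non-zero eigenvalues of the random walk normalized Laplacian $L_{\mathrm{rw}} = D^{-1} L$.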
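The link between the pairwise objective and the Laplacian quadratic form can be verified numerically. The sketch below assumes a hypothetical 3-node path graph and a 1-D embedding `y`; the helper names are illustrative.

```python
def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W for a symmetric weight matrix."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def pairwise_objective(W, y):
    """Sum over all ordered pairs of W_ij * (y_i - y_j)^2."""
    n = len(W)
    return sum(W[i][j] * (y[i] - y[j]) ** 2
               for i in range(n) for j in range(n))

def quadratic_form(L, y):
    """y^T L y for a 1-D embedding y."""
    n = len(L)
    return sum(y[i] * L[i][j] * y[j] for i in range(n) for j in range(n))

# Path graph 0 - 1 - 2 with unit weights.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
y = [0.0, 1.0, 3.0]

lhs = pairwise_objective(W, y)
rhs = 2.0 * quadratic_form(graph_laplacian(W), y)
```

The factor of 2 appears because the double sum visits every edge twice; it does not change the minimizer.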
51In t-SNE, the gradient of the Kullback-Leibler divergence with respect to the mapped points $y_i$ dictates the dynamics of the embedding. What primarily governs the repulsive force between two points $y_i$ and $y_j$ in this low-dimensional space?
Non-Linear Dimensionality Reduction Concepts - t-SNE
Hard
A.The heavy-tailed Student t-distribution similarity $q_{ij}$, which causes the repulsion to act universally across all points based on low-dimensional proximity, regardless of their high-dimensional distance.
B.The exact Euclidean distance in the high-dimensional space, which explicitly pushes distant points infinitely far apart.
C.The perplexity parameter , which directly scales the repulsive strength inversely proportional to the local data density.
D.The degree of freedom of the Student t-distribution, which acts as a hard threshold, clipping the repulsion at a fixed radial distance.
Correct Answer: The heavy-tailed Student t-distribution similarity $q_{ij}$, which causes the repulsion to act universally across all points based on low-dimensional proximity, regardless of their high-dimensional distance.
Explanation:
The t-SNE gradient is proportional to $\sum_j (p_{ij} - q_{ij})(1 + \|y_i - y_j\|^2)^{-1}(y_i - y_j)$. The $-q_{ij}$ term represents the repulsive force. Because $q_{ij}$ evaluates the Student-t density in the low-dimensional space, points that are close in the embedding naturally repel each other, ensuring they don't crowd together into a single point.
52Why is t-SNE theoretically prone to creating artificial, tight clusters when applied to a dataset consisting entirely of uniformly distributed random noise?
Non-Linear Dimensionality Reduction Concepts - t-SNE
Hard
A.The heavy tails of the Student-t distribution cause all points to be repelled equally, forcing them into a strict, crystalline lattice structure.
B.Uniform noise has an intrinsic dimensionality of zero, which t-SNE handles by projecting all points onto the surface of a hypersphere.
C.The KL divergence penalty is asymmetric; it heavily penalizes placing high-dimensional neighbors far apart, but is highly lenient if distant noise points are placed close together, allowing arbitrary clumps to form.
D.The Gaussian kernel used in the high-dimensional space mathematically forces the overall variance of the random noise to collapse to exactly zero.
Correct Answer: The KL divergence penalty is asymmetric; it heavily penalizes placing high-dimensional neighbors far apart, but is highly lenient if distant noise points are placed close together, allowing arbitrary clumps to form.
Explanation:
t-SNE minimizes $KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$. If $p_{ij}$ is large (points are close in high dimension), $q_{ij}$ must be large to avoid a huge penalty. However, if $p_{ij}$ is small (distant points), a large $q_{ij}$ yields only a minimal penalty. This asymmetry means t-SNE doesn't aggressively fix false positives, allowing random noise to arbitrarily group into visual clusters.
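The asymmetry is easy to see on a single pair. In this sketch the probabilities 0.4 and 0.001 are made-up values and `kl_term` is a helper introduced only for the example.

```python
import math

def kl_term(p, q):
    """Contribution p * log(p / q) of one pair to KL(P || Q)."""
    return p * math.log(p / q)

# True neighbors placed far apart in the map: large penalty.
missed_neighbor = kl_term(p=0.4, q=0.001)

# Distant points placed close together in the map: near-zero contribution.
false_neighbor = kl_term(p=0.001, q=0.4)
```

An individual term can be slightly negative, as `false_neighbor` is here; only the full sum over all pairs is guaranteed to be non-negative.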
53If you naively set the perplexity parameter in t-SNE to a value strictly greater than the total number of data points $N$, what mathematical failure occurs during the calculation of the high-dimensional similarities $p_{j|i}$?
Non-Linear Dimensionality Reduction Concepts - t-SNE
Hard
A.The similarities evaluate to exactly 1 for all pairs, causing the KL divergence loss function to equal zero before optimization begins.
B.The algorithm defaults to standard PCA behavior because the probability matrix immediately becomes a perfectly sparse diagonal matrix.
C.The variance converges to exactly zero, transforming the neighborhood graph into a completely disconnected set of nodes.
D.The binary search for the variance $\sigma_i^2$ fails to converge because the required Shannon entropy exceeds the maximum theoretically possible entropy of the discrete distribution ($\log_2(N - 1)$).
Correct Answer: The binary search for the variance $\sigma_i^2$ fails to converge because the required Shannon entropy exceeds the maximum theoretically possible entropy of the discrete distribution ($\log_2(N - 1)$).
Explanation:
Perplexity is defined as $2^{H(P_i)}$, where $H(P_i)$ is the Shannon entropy of the conditional distribution $P_i$. The maximum possible entropy for a distribution over $N - 1$ neighbors is $\log_2(N - 1)$. Therefore, a perplexity greater than $N - 1$ implies an entropy target that is mathematically impossible to achieve, causing the variance search to fail.
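The upper bound can be confirmed with a few lines. This is a sketch assuming a hypothetical dataset of 5 points, so each conditional distribution ranges over 4 neighbors; `perplexity` is a helper written for this example.

```python
import math

def perplexity(probs):
    """Perplexity 2**H of a discrete distribution (H measured in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

n_points = 5

# The uniform distribution over the N - 1 neighbors maximizes entropy.
uniform = [1.0 / (n_points - 1)] * (n_points - 1)
max_perplexity = perplexity(uniform)

# Any non-uniform distribution has strictly lower perplexity.
peaked = perplexity([0.7, 0.1, 0.1, 0.1])
```

No choice of bandwidth can push the perplexity above `max_perplexity`, which equals $N - 1$, so a larger target leaves the search with nothing to converge to.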
54In Barnes-Hut t-SNE, the computational complexity is reduced from $O(N^2)$ to $O(N \log N)$. Which mathematical approximation fundamentally enables this massive reduction in complexity?
Non-Linear Dimensionality Reduction Concepts - t-SNE
Hard
A.Ignoring the repulsive forces entirely for points that are strictly outside the predefined -nearest neighbors graph.
B.Replacing the Student t-distribution with a uniform distribution for points whose distance exceeds a hyperparameter threshold .
C.Grouping spatially distant points into a single center of mass using a spatial tree (quadtree/octree) to compute aggregate repulsive forces.
D.Using a truncated Singular Value Decomposition (SVD) on the high-dimensional probability matrix before gradient descent optimization.
Correct Answer: Grouping spatially distant points into a single center of mass using a spatial tree (quadtree/octree) to compute aggregate repulsive forces.
Explanation:
Barnes-Hut t-SNE relies on the Barnes-Hut approximation used in N-body simulations. It builds a quadtree (in 2D) and approximates the repulsive force of a distant cluster of points by treating them as a single point at their center of mass, eliminating the need to compute pairwise repulsions for every single distant point.
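The center-of-mass idea can be sketched without building the tree. The example below is a simplification under assumptions: it approximates only the sum of Student-t kernel values for one distant cell (not the full force vector), and the points are made up for illustration.

```python
def kernel(p, q):
    """Unnormalized Student-t similarity (1 + ||p - q||^2)^-1 in 2-D."""
    return 1.0 / (1.0 + (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

query = (0.0, 0.0)
far_cell = [(10.1, 10.1), (9.9, 9.9), (10.1, 9.9), (9.9, 10.1)]

# Exact: one kernel evaluation per point in the cell.
exact = sum(kernel(query, p) for p in far_cell)

# Approximation: one evaluation at the cell's center of mass,
# weighted by the number of points in the cell.
center = (sum(p[0] for p in far_cell) / len(far_cell),
          sum(p[1] for p in far_cell) / len(far_cell))
approx = len(far_cell) * kernel(query, center)

relative_error = abs(exact - approx) / exact
```

The error stays small because the cell is narrow relative to its distance from the query point; a spatial tree applies the same substitution only to cells that satisfy such a size-to-distance criterion.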
55UMAP utilizes a Cross-Entropy loss function to optimize its low-dimensional embeddings, setting it apart from t-SNE. How does this specific loss function fundamentally allow UMAP to better preserve global structure?
UMAP – Conceptual
Hard
A.It eliminates repulsive forces entirely, allowing global distances to be determined strictly by a deterministic eigenvalue decomposition.
B.It uses a symmetric Gaussian distribution in both the high and low-dimensional spaces, avoiding the geometric distortion introduced by heavy tails.
C.It contains a term $(1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}}$ which explicitly penalizes placing high-dimensional distant points close together in the low-dimensional space.
D.It enforces a strict metric constraint that the low-dimensional Euclidean distances must perfectly match the high-dimensional geodesic distances.
Correct Answer: It contains a term $(1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}}$ which explicitly penalizes placing high-dimensional distant points close together in the low-dimensional space.
Explanation:
Unlike t-SNE, which uses KL divergence (primarily penalizing false negatives), UMAP's fuzzy set cross-entropy loss includes a second term, $(1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}}$, that penalizes false positives (when $p_{ij}$ is small but $q_{ij}$ is large). This active penalty for squashing distant points together helps UMAP retain better global topology.
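A single false-positive pair shows the difference between the two losses. The values `p = 0.001` and `q = 0.9` and the helper names below are assumptions chosen for illustration.

```python
import math

def attractive_term(p, q):
    """p * log(p / q): large when true neighbors are placed far apart."""
    return p * math.log(p / q)

def repulsive_term(p, q):
    """(1 - p) * log((1 - p) / (1 - q)): large when non-neighbors are close."""
    return (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

# Far apart in high dimension, almost touching in the embedding.
p, q = 0.001, 0.9

kl_only = attractive_term(p, q)
cross_entropy = attractive_term(p, q) + repulsive_term(p, q)
```

The KL-style term barely registers this pair, while the cross-entropy assigns it a substantial penalty through the second term.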
56UMAP's theoretical foundation relies on the assumption that data is uniformly distributed across a Riemannian manifold. Since real-world data density is highly variable, how does UMAP mathematically enforce this uniform assumption?
UMAP – Conceptual
Hard
A.By performing an initial manifold unrolling using a global algorithm like Isomap to evenly distribute the data in the ambient space.
B.By defining a custom Riemannian metric around each data point where the distance to its nearest neighbor is locally normalized to be constant.
C.By deliberately injecting uniform Gaussian noise into the high-dimensional dataset prior to computing the topological simplicial complex.
D.By assuming the ambient space is inherently hyperbolic and mapping all points onto the surface of a Poincaré disk.
Correct Answer: By defining a custom Riemannian metric around each data point where the distance to its nearest neighbor is locally normalized to be constant.
Explanation:
To assume uniform distribution on the manifold, UMAP warps the notion of distance based on local density. It subtracts the distance to the first nearest neighbor ($\rho_i$) and scales the local metric. Thus, in dense regions, the metric "stretches" space, and in sparse regions, it "shrinks" space, making the data appear uniformly distributed under this custom local metric.
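The local normalization can be sketched as a weight function of the form $\exp(-\max(0, d - \rho_i) / \sigma_i)$. In this sketch the scale `sigma` is passed in by hand as an assumption; in practice it is chosen per point by a search, which is omitted here, and the distances are made up.

```python
import math

def membership_weights(neighbor_distances, sigma):
    """Weights exp(-max(0, d - rho) / sigma), rho = nearest-neighbor distance."""
    rho = min(neighbor_distances)
    return [math.exp(-max(0.0, d - rho) / sigma) for d in neighbor_distances]

# A point in a dense region and a point in a sparse region whose
# neighbor distances differ by a factor of 100.
dense = membership_weights([0.01, 0.02, 0.03], sigma=0.01)
sparse = membership_weights([1.0, 2.0, 3.0], sigma=1.0)
```

After subtracting $\rho_i$ and rescaling, both points see the same pattern of neighbor weights, and the nearest neighbor always receives weight 1.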
57What role does the min_dist hyperparameter play in UMAP's low-dimensional optimization, and how does it manifest mathematically in the low-dimensional probability function $q_{ij}$?
UMAP – Conceptual
Hard
A.It acts as a hard threshold in the high-dimensional space; any points closer than min_dist are permanently merged into a single topological simplex.
B.It defines a plateau in the low-dimensional similarity function $q_{ij}$, controlling how tightly points are allowed to pack together before repulsion scales sharply.
C.It strictly determines the learning rate decay schedule during the Stochastic Gradient Descent optimization of the cross-entropy loss.
D.It sets the absolute minimum Euclidean distance required between any two disconnected topological components in the final embedding graph.
Correct Answer: It defines a plateau in the low-dimensional similarity function $q_{ij}$, controlling how tightly points are allowed to pack together before repulsion scales sharply.
Explanation:
The min_dist parameter in UMAP controls the aesthetic tightness of clusters. Mathematically, it determines the parameters $a$ and $b$ in the low-dimensional similarity function $q_{ij} = (1 + a \|y_i - y_j\|^{2b})^{-1}$, which are fitted so that the curve approximates a flat plateau at distances less than min_dist followed by a smooth decay. This prevents points from overlapping completely and controls the local crowding.
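The plateau can be sketched as a piecewise target curve next to the smooth family that approximates it. The target form below (flat up to `min_dist`, then exponential decay scaled by a `spread` parameter) reflects one common description of the fitting step and should be treated as an assumption; the values of `a` and `b` are illustrative, not fitted.

```python
import math

def target_similarity(d, min_dist, spread=1.0):
    """Piecewise target: flat at 1 up to min_dist, exponential decay beyond."""
    if d <= min_dist:
        return 1.0
    return math.exp(-(d - min_dist) / spread)

def smooth_similarity(d, a, b):
    """Differentiable family (1 + a * d^(2b))^-1 used in the embedding."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))

inside = target_similarity(0.05, min_dist=0.1)    # on the plateau
outside = target_similarity(1.1, min_dist=0.1)    # one unit past the plateau

# a and b here are placeholders; a fitting routine would choose them
# so that smooth_similarity tracks target_similarity.
example = smooth_similarity(0.5, a=1.5, b=0.9)
```

A larger `min_dist` widens the plateau, so points can sit further apart while still counting as maximally similar, which yields looser clusters.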
58Consider a shallow Linear Autoencoder (a single hidden layer, linear activations, trained to minimize MSE) and standard PCA applied to the same dataset. Which of the following accurately describes the mathematical relationship between the Autoencoder's bottleneck weights $W$ and the PCA principal components?
Autoencoder Intuition
Hard
A.The Autoencoder implicitly enforces an $L_1$ norm constraint during gradient descent, resulting in a sparse subspace unlike the dense PCA components.
B.The row space of $W$ spans the exact same principal subspace as the top-$k$ PCA components, but the individual rows of $W$ are not necessarily orthogonal or variance-ordered.
C.The Autoencoder will inevitably converge to arbitrary local minima, causing $W$ to capture a subspace completely orthogonal to the PCA components.
D.The rows of $W$ will mathematically converge to the exact same ordered and strictly orthogonal eigenvectors found by the PCA covariance matrix.
Correct Answer: The row space of $W$ spans the exact same principal subspace as the top-$k$ PCA components, but the individual rows of $W$ are not necessarily orthogonal or variance-ordered.
Explanation:
A linear autoencoder minimizing MSE learns to project data onto the same optimal low-dimensional subspace as PCA (the principal subspace). However, without explicit orthogonality constraints, the weight vectors in the bottleneck form an arbitrary (often non-orthogonal) basis for that exact same subspace.
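One way to see why the basis is not unique is the short derivation below. The symbols are introduced here as assumptions: $V_k$ holds the top-$k$ principal directions as orthonormal columns, and $A$ is any invertible $k \times k$ matrix.

```latex
\[
W_{\text{enc}} = A V_k^\top, \qquad W_{\text{dec}} = V_k A^{-1}
\quad\Longrightarrow\quad
W_{\text{dec}} W_{\text{enc}} = V_k A^{-1} A V_k^\top = V_k V_k^\top .
\]
```

Every such pair yields the same reconstruction $V_k V_k^\top x$ as PCA and therefore the same MSE, yet the rows of $W_{\text{enc}}$ are orthonormal only when $A$ is orthogonal.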
59According to manifold learning theory, when you train a Denoising Autoencoder (DAE) by corrupting inputs $x$ to $\tilde{x}$ and reconstructing them, what is the geometric interpretation of the learned vector field $r(\tilde{x}) - \tilde{x}$, where $r$ is the reconstruction function?
Autoencoder Intuition
Hard
A.It approximates the score function (gradient of the log-density, $\nabla_x \log p(x)$) of the data distribution, effectively pointing corrupted points toward the highest density regions on the manifold.
B.It maps out a set of strictly orthogonal basis functions that define the null space (the zero-variance directions) of the data manifold.
C.It calculates the exact geodesics of the manifold by strictly minimizing the path integral of the Euclidean distance between $x$ and $\tilde{x}$.
D.It perfectly aligns with the principal eigenvectors of the global covariance matrix, always pointing along the manifold's flattest linear dimensions.
Correct Answer: It approximates the score function (gradient of the log-density, $\nabla_x \log p(x)$) of the data distribution, effectively pointing corrupted points toward the highest density regions on the manifold.
Explanation:
Alain and Bengio (2014) showed mathematically that a DAE trained with small Gaussian corruption learns a vector field proportional to the score function of the data distribution. This means the reconstruction error vector points in the direction of steepest ascent of the log-probability density, pushing points back onto the data manifold.
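The relationship can be checked in closed form for 1-D Gaussian data, where the MSE-optimal denoiser is the posterior mean. This sketch is an assumption-laden toy case (zero-mean Gaussian data, Gaussian corruption, hypothetical variances); it is not a trained network.

```python
def optimal_denoiser(x_noisy, data_var, noise_var):
    """Posterior mean E[x | x_noisy] for x ~ N(0, data_var), Gaussian noise."""
    return x_noisy * data_var / (data_var + noise_var)

def smoothed_score(x_noisy, data_var, noise_var):
    """Gradient of log N(x_noisy; 0, data_var + noise_var)."""
    return -x_noisy / (data_var + noise_var)

data_var, noise_var = 4.0, 0.25
x_noisy = 3.0

# Reconstruction displacement r(x) - x of the ideal denoiser.
displacement = optimal_denoiser(x_noisy, data_var, noise_var) - x_noisy

# Noise variance times the score of the noise-smoothed density.
scaled_score = noise_var * smoothed_score(x_noisy, data_var, noise_var)
```

The two quantities coincide, and the displacement is negative for a positive input, so the denoiser pulls the point back toward the mode at zero.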
60A Contractive Autoencoder (CAE) introduces a penalty term $\lambda \|J_f(x)\|_F^2$, where $J_f(x)$ is the Jacobian matrix of the encoder activations with respect to the input. What is the fundamental representation learning goal of this specific regularization?
Autoencoder Intuition
Hard
A.To align the bottleneck activations with a standard Gaussian uniform distribution by minimizing the Kullback-Leibler divergence to a prior.
B.To force the learned representation to be insensitive to small local variations in the input space, ensuring features only change along the manifold while remaining flat in orthogonal directions.
C.To strictly force the weights of the network to become sparse, mathematically minimizing the number of active hidden neurons for any given input.
D.To ensure the decoder generates an output whose dimensionality is strictly greater than the input, encouraging an overcomplete and entangled representation.
Correct Answer: To force the learned representation to be insensitive to small local variations in the input space, ensuring features only change along the manifold while remaining flat in orthogonal directions.
Explanation:
The Jacobian penalty penalizes the derivative of the hidden representations with respect to the input. This forces the encoder to be robust (contractive) to small perturbations around the training data. The autoencoder only retains sensitivity to directions along the data manifold (to reconstruct the data), effectively mapping out the manifold structure.
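For a single sigmoid encoder layer the penalty has a simple closed form, since the Jacobian of $h = \sigma(Wx + b)$ is $\operatorname{diag}(h(1 - h))\,W$. The sketch below assumes that encoder form and uses made-up weights; the weighting factor in front of the penalty is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def contractive_penalty(W, b, x):
    """Squared Frobenius norm of dh/dx for h = sigmoid(W x + b)."""
    penalty = 0.0
    for row, bias in zip(W, b):
        h = sigmoid(sum(w * xi for w, xi in zip(row, x)) + bias)
        # Row j of the Jacobian is h_j * (1 - h_j) * W[j].
        penalty += (h * (1.0 - h)) ** 2 * sum(w * w for w in row)
    return penalty

W = [[1.0, 0.0],
     [0.0, 2.0]]
b = [0.0, 0.0]

penalty_at_origin = contractive_penalty(W, b, [0.0, 0.0])
penalty_saturated = contractive_penalty(W, b, [10.0, 10.0])
```

The penalty is largest where the units are most sensitive (around zero pre-activation) and nearly vanishes once they saturate, which is how the regularizer pushes the encoder toward locally flat, contractive behavior.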