1 $What is the primary goal of dimensionality reduction in machine learning?$

Need for Dimensionality Reduction Easy

A.

To automatically label unlabeled data

B.

To increase the number of features in a dataset

C.

To change continuous variables into categorical variables

D.

To reduce the number of features while retaining the most important information

2 $Which term describes the phenomenon where data becomes exceedingly sparse as the number of features increases, negatively impacting model performance?$

Need for Dimensionality Reduction Easy

A.

Feature Explosion

B.

The Curse of Dimensionality

C.

The Manifold Hypothesis

D.

Overfitting Factor

3 $How does dimensionality reduction generally affect the computational time of machine learning algorithms?$

Need for Dimensionality Reduction Easy

A.

It decreases training time by reducing the amount of data to process.

B.

It has no effect on training time.

C.

It significantly increases training time.

D.

It makes the training time unpredictable.

4 $What mathematical property does Principal Component Analysis (PCA) aim to maximize when finding new axes?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Easy

A.

Number of outliers

B.

Correlation between features

C.

Variance of the data

D.

Sparsity of the data

5 $In PCA, the principal components are geometrically related in what specific way?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Easy

A.

They are randomly oriented.

B.

They are parallel to each other.

C.

They are orthogonal (perpendicular) to each other.

D.

They form a 45-degree angle with the original axes.

6 $What type of transformation does PCA perform on the data?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Easy

A.

Non-linear transformation

B.

Linear transformation

C.

Polynomial transformation

D.

Logarithmic transformation

7 $Which principal component captures the most information (variance) from the original dataset?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Easy

A.

The first principal component

B.

The last principal component

C.

The median principal component

D.

The second principal component
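
The geometric claims in questions 4 to 7 can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic dataset (the array `X`, the per-axis scales, and the seed are illustrative choices, not part of the quiz). It computes PCA through the SVD of the centered data and prints the share of variance per component and the product of the axes with themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-feature data with deliberately unequal spread per axis (illustrative assumption).
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])

# PCA is a linear transformation: center, then rotate onto the right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = S**2 / (len(X) - 1)            # variance along each principal axis
explained_ratio = explained_variance / explained_variance.sum()
scores = Xc @ Vt.T                                   # data expressed in the new axes

print(explained_ratio)         # ordered from largest to smallest share
print(np.round(Vt @ Vt.T, 6))  # identity matrix: the axes are mutually orthogonal
```

Because the singular values come back in descending order, the first row of `Vt` is the direction of largest variance.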

8 $What is the core assumption of the 'Manifold Hypothesis' in representation learning?$

Manifold Learning Overview Easy

A.

Higher dimensions always provide better separation of data classes.

B.

All datasets are completely linear and require simple scaling.

C.

High-dimensional data actually lies on or near a lower-dimensional manifold embedded within the high-dimensional space.

D.

Data can only be modeled properly using neural networks.

9 $Which of the following is a classic "toy dataset" often used to demonstrate manifold learning techniques?$

Manifold Learning Overview Easy

A.

The Titanic dataset

B.

The Swiss Roll dataset

C.

The Iris dataset

D.

The Boston Housing dataset

10 $Why might a data scientist choose a manifold learning algorithm over PCA?$

Manifold Learning Overview Easy

A.

Because PCA only captures linear relationships, while manifold learning can capture complex, non-linear structures.

B.

Because PCA is a supervised learning technique.

C.

Because manifold learning increases the dimensions, solving the curse of dimensionality.

D.

Because PCA is always slower than manifold learning algorithms.

11 $What is the most common use case for t-SNE (t-Distributed Stochastic Neighbor Embedding)?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Easy

A.

Compressing images for storage

B.

Visualizing high-dimensional data in 2D or 3D space

C.

Extracting features for linear regression models

D.

Predicting future stock prices

12 $Does t-SNE prioritize preserving local or global structure of the data?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Easy

A.

It prioritizes preserving global structure over local structure.

B.

It prioritizes preserving local structure (keeping similar data points close together).

C.

It perfectly preserves the exact pairwise distances of all points globally.

D.

It ignores both structures and places points randomly.

13 $What does the 't' in t-SNE stand for?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Easy

A.

Time-series

B.

Tensor

C.

Student's t-distribution

D.

Transformational

14 $What does UMAP stand for?$

UMAP – Conceptual Easy

A.

Universal Machine Algorithm Predictor

B.

Unified Model for Automated Processing

C.

Uniform Manifold Approximation and Projection

D.

Unsupervised Mapping and Partitioning

15 $Which is a widely recognized advantage of UMAP over t-SNE?$

UMAP – Conceptual Easy

A.

UMAP is generally faster and preserves global structure better than t-SNE.

B.

UMAP can only reduce data to a single dimension.

C.

UMAP requires labeled data to function.

D.

UMAP is strictly a linear model.

16 $Like t-SNE, UMAP is fundamentally what type of dimensionality reduction technique?$

UMAP – Conceptual Easy

A.

Supervised classification technique

B.

Decision tree-based feature selection

C.

Linear dimensionality reduction

D.

Non-linear dimensionality reduction (manifold learning)

17 $What are the two main functional components of an Autoencoder?$

Autoencoder Intuition Easy

A.

Actor and Critic

B.

Convolution and Pooling

C.

Generator and Discriminator

D.

Encoder and Decoder

18 $In an Autoencoder used for dimensionality reduction, what is the 'bottleneck'?$

Autoencoder Intuition Easy

A.

The hidden layer with the smallest number of nodes, representing the compressed data

B.

The final output layer of the decoder

C.

The input layer with the highest number of nodes

D.

The loss function used for optimization

19 $What is the primary objective of a basic Autoencoder during training?$

Autoencoder Intuition Easy

A.

To maximize the variance of the input data

B.

To predict future values in a sequence

C.

To classify data into predefined categories

D.

To reconstruct its own input data at the output layer

20 $An Autoencoder learns a representation without requiring manual labels. This falls under which category of machine learning?$

Autoencoder Intuition Easy

A.

Reinforcement Learning

B.

Supervised Learning

C.

Active Learning

D.

Unsupervised (or Self-Supervised) Learning

21 $Which of the following best describes the geometric impact of the 'Curse of Dimensionality' on distance-based algorithms like k-Nearest Neighbors?$

Need for Dimensionality Reduction Medium

A.

As dimensions increase, the Euclidean distance between points approaches zero.

B.

As dimensions increase, the ratio of the distance to the nearest neighbor to the distance to the farthest neighbor approaches 1.

C.

As dimensions increase, the variance of the data linearly decreases.

D.

As dimensions increase, all points converge to a single spatial coordinate.

22 $How does dimensionality reduction help in mitigating the Hughes phenomenon in machine learning models?$

Need for Dimensionality Reduction Medium

A.

By converting continuous numerical features into discrete categories.

B.

By replacing missing values with the mean of the principal components.

C.

By reducing the feature space, thereby decreasing the likelihood of overfitting when the number of training samples is limited.

D.

By increasing the number of features to artificially boost training accuracy.

23 $If a dataset has highly collinear features, what is the primary benefit of applying a dimensionality reduction technique?$

Need for Dimensionality Reduction Medium

A.

It increases the absolute magnitude of the feature weights.

B.

It completely removes the need to normalize or scale the data before modeling.

C.

It eliminates collinearity by creating orthogonal or independent representations of the original features.

D.

It allows the model to map the data to a higher-dimensional space to find a linear boundary.
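
A small numerical illustration of the collinearity point in question 23, assuming NumPy and two synthetic features where the second is almost a multiple of the first (both the data and the noise level are made up for the example). After projecting onto the principal axes, the correlation between the new coordinates is zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
# Second feature is nearly a multiple of the first, so the two are highly collinear.
X = np.column_stack([a, 2.0 * a + 0.05 * rng.normal(size=1000)])

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T   # coordinates along the principal axes

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]
print(corr_before)   # close to 1
print(corr_after)    # close to 0
```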

24 $Geometrically, what does the first principal component (PC1) represent in PCA?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Medium

A.

The vector that is orthogonal to the direction of maximum variance.

B.

The direction in space along which the data has the minimum variance.

C.

The direction of the line that minimizes the sum of squared orthogonal distances from the (centered) data points to it.

D.

The centroid of the data points in the original high-dimensional space.

25 $Suppose you apply PCA to a 3D dataset shaped exactly like a hollow sphere. What will the eigenvalues of the first three principal components look like?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Medium

A.

The first eigenvalue will be much larger than the other two.

B.

The first eigenvalue will be zero, and the other two will be equal.

C.

All three eigenvalues will be approximately equal.

D.

The first two eigenvalues will be large and equal, and the third will be zero.

26 $Why is it crucial to mean-center and scale data (e.g., using standard scaling) before applying PCA?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Medium

A.

Because PCA is sensitive to the scale of the features; variables with larger scales will dominate the principal components.

B.

Because PCA cannot be computed on matrices containing negative numbers.

C.

To ensure that the covariance matrix is an identity matrix.

D.

To convert non-linear relationships into linear ones before finding orthogonal vectors.

27 $If a dataset forms a highly curved 'Swiss Roll' shape in 3D space, why might PCA perform poorly when reducing it to 2D?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Medium

A.

PCA will automatically convert the continuous values into discrete clusters, destroying the roll structure.

B.

PCA requires the data to have exactly zero variance in the third dimension.

C.

PCA relies on gradient descent, which gets stuck in local minima on curved surfaces.

D.

PCA attempts to map the data to a 2D Euclidean subspace, which will overlap points from different layers of the roll.

28 $What is the fundamental assumption underlying manifold learning algorithms?$

Manifold Learning Overview Medium

A.

High-dimensional data actually lies on or near a lower-dimensional, potentially non-linear surface embedded within the high-dimensional space.

B.

High-dimensional data consists purely of independent, uniformly distributed random variables.

C.

Any high-dimensional dataset can be losslessly compressed into exactly two dimensions.

D.

High-dimensional data can be accurately clustered using only Euclidean distance from the origin.

29 $In the context of manifold learning, how does 'geodesic distance' differ from 'Euclidean distance'?$

Manifold Learning Overview Medium

A.

Geodesic distance is only applicable to linear subspaces, whereas Euclidean distance is used for curved manifolds.

B.

Geodesic distance is the straight-line distance in the high-dimensional space, while Euclidean distance is the shortest path along the manifold.

C.

Geodesic distance measures the shortest path along the curved surface of the manifold, whereas Euclidean distance cuts straight through the ambient space.

D.

There is no difference; they are mathematically equivalent in all dimensionality reduction techniques.
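
To make the distinction in question 29 concrete, here is a minimal sketch assuming NumPy and a unit semicircle as a stand-in for a curved one-dimensional manifold (the curve and the sample count are illustrative). The geodesic length is approximated by summing short hops between consecutive samples, which is the same idea neighborhood-graph methods use.

```python
import numpy as np

# Points sampled along a unit semicircle, a simple one-dimensional curved manifold.
theta = np.linspace(0.0, np.pi, 1001)
curve = np.column_stack([np.cos(theta), np.sin(theta)])

start, end = curve[0], curve[-1]
euclidean = np.linalg.norm(end - start)   # straight line through the ambient plane

# Geodesic approximated by summing short hops between consecutive samples.
geodesic = np.linalg.norm(np.diff(curve, axis=0), axis=1).sum()

print(euclidean)  # the diameter, 2 up to rounding
print(geodesic)   # close to pi, half the circumference
```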

30 $Which of the following is a primary reason why manifold learning techniques are preferred over linear methods for complex datasets?$

Manifold Learning Overview Medium

A.

They calculate projections much faster than linear methods using singular value decomposition.

B.

They map the data to a higher-dimensional space where classes are linearly separable.

C.

They are able to capture and unroll non-linear relationships that linear projections would overlap or distort.

D.

They guarantee a globally convex optimization problem with no hyperparameters.

31 $How does t-SNE solve the 'crowding problem' that often occurs in standard SNE?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Medium

A.

By strictly penalizing the distance between faraway points using an L1 regularization term.

B.

By performing PCA first and restricting the t-SNE projection to the top 2 principal components.

C.

By utilizing a Student's t-distribution with one degree of freedom in the low-dimensional space to model similarities.

D.

By using a Gaussian distribution in the low-dimensional space and a Cauchy distribution in the high-dimensional space.
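
The heavy-tail argument behind question 31 can be seen by comparing the two unnormalized low-dimensional kernels directly. This is a sketch assuming NumPy; the distances in `d` are arbitrary illustrative values, and normalization over all pairs is omitted.

```python
import numpy as np

d = np.array([0.5, 1.0, 2.0, 4.0])   # low-dimensional distances (illustrative values)

gaussian = np.exp(-d**2)             # unnormalized Gaussian similarity, as in the original SNE
student_t = 1.0 / (1.0 + d**2)       # unnormalized Student-t similarity, one degree of freedom

for dist, g, t in zip(d, gaussian, student_t):
    print(f"d={dist:.1f}  gaussian={g:.2e}  student_t={t:.2e}")
```

At moderate and large distances the Student-t kernel still assigns a non-negligible similarity, so moderately distant points can sit farther apart in the map without a large penalty.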

32 $In t-SNE, the cost function minimizes the Kullback-Leibler (KL) divergence between two probability distributions. What do these distributions represent?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Medium

A.

The variance of the high-dimensional features and the variance of the low-dimensional projection.

B.

The pairwise similarities of data points in the high-dimensional space and the low-dimensional space.

C.

The Euclidean distances and the Cosine similarities of the data points.

D.

The prior and posterior probabilities of the cluster centroids.

33 $What is the role of the 'perplexity' parameter in t-SNE?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Medium

A.

It dictates the number of iterations the gradient descent algorithm will run.

B.

It defines the degrees of freedom for the Student's t-distribution used in the low-dimensional space.

C.

It acts as a continuous measure of the effective number of nearest neighbors each point has, balancing attention between local and global aspects.

D.

It sets the learning rate for the KL divergence optimization.

34 $When interpreting a t-SNE plot, which of the following statements is generally true regarding the distances between distinct clusters?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Medium

A.

Clusters that are close together in the t-SNE plot are mathematically guaranteed to be close in high-dimensional space.

B.

The distance between clusters accurately reflects their absolute distance in the high-dimensional space.

C.

The size (spread) of a cluster in a t-SNE plot is proportional to its variance in the high-dimensional space.

D.

The global distances between distant clusters in a t-SNE plot are often meaningless and highly dependent on initialization.

35 $Conceptually, how does UMAP handle the balance between local and global structure compared to t-SNE?$

UMAP – Conceptual Medium

A.

UMAP requires users to explicitly label global clusters, making it a supervised learning algorithm.

B.

UMAP completely ignores global structure to compute projections faster than t-SNE.

C.

UMAP preserves global structure by using linear combinations of features, whereas t-SNE uses non-linear combinations.

D.

UMAP uses cross-entropy as a cost function and different initialization, often preserving more global structure than t-SNE.

36 $Which two main hyperparameters control the geometry of the UMAP projection?$

UMAP – Conceptual Medium

A.

Number of neighbors (n_neighbors) and Minimum distance (min_dist).

B.

Number of components and Maximum iterations.

C.

Perplexity and Epsilon.

D.

Learning rate and Momentum.

37 $If you increase the min_dist parameter in UMAP, what visual effect will it likely have on the resulting embedding?$

UMAP – Conceptual Medium

A.

The embedding will look more spread out, preventing points from bunching up too closely.

B.

The algorithm will fall back to performing standard Principal Component Analysis.

C.

The number of dimensions of the output will increase.

D.

The clusters will become extremely tight, with points overlapping each other.

38 $In a standard autoencoder used for dimensionality reduction, what is the primary purpose of the 'bottleneck' layer?$

Autoencoder Intuition Medium

A.

To classify the input data into predefined categories directly without a decoder.

B.

To increase the dimensionality of the data so that non-linear relations become linearly separable.

C.

To force the network to learn a compressed, lower-dimensional latent representation of the input data.

D.

To introduce noise into the input data to prevent the model from overfitting.

39 $If an autoencoder has strictly linear activation functions and uses Mean Squared Error (MSE) loss, its bottleneck layer will learn a representation that spans the same subspace as which other technique?$

Autoencoder Intuition Medium

A.

K-Means Clustering

B.

t-SNE

C.

UMAP

D.

Principal Component Analysis (PCA)

40 $Why is the reconstruction loss essential in training an autoencoder?$

Autoencoder Intuition Medium

A.

It calculates the distance between the encoded representations to maximize cluster separation.

B.

It measures the classification error of the labels provided in the dataset.

C.

It ensures that the weights of the encoder and decoder remain orthogonal.

D.

It quantifies how well the decoder can rebuild the original input from the compressed latent representation.

41 $Let \(D_{\max}\) and \(D_{\min}\) be the maximum and minimum Euclidean distances from a randomly selected query point to a set of points uniformly distributed in a \(d\)-dimensional hypercube. As the dimension \(d \to \infty\), what is the limiting behavior of the relative contrast \(\frac{D_{\max} - D_{\min}}{D_{\min}}\)?$

Need for Dimensionality Reduction Hard

A.

It diverges to $\infty$.

B.

It converges to $0$.

C.

It converges to a constant strictly dependent on the data's variance.

D.

It converges to $1$.
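
The behavior asked about in question 41 can be simulated. The sketch below assumes NumPy and takes relative contrast to mean (D_max - D_min) / D_min, the usual definition; the function name `relative_contrast`, the sample size, and the seed are hypothetical choices for illustration.

```python
import numpy as np

def relative_contrast(dim, n_points=1000, seed=0):
    """(D_max - D_min) / D_min for one random query against uniform points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dist = np.linalg.norm(points - query, axis=1)
    return (dist.max() - dist.min()) / dist.min()

for dim in (2, 10, 100, 1000):
    print(dim, relative_contrast(dim))
```

The printed contrast shrinks as `dim` grows, which is why nearest and farthest neighbors become hard to tell apart for distance-based methods.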

42 $Consider a \(d\)-dimensional multivariate Gaussian distribution with a zero mean and an identity covariance matrix. As \(d\) becomes very large, where is the vast majority of the probability mass located geometrically?$

Need for Dimensionality Reduction Hard

A.

Highly concentrated strictly at the origin, which is the mode of the distribution.

B.

In a thin spherical shell (annulus) at a distance of approximately $\sqrt{d}$ from the origin.

C.

Uniformly distributed throughout the $d$-dimensional hypercube enclosing the distribution.

D.

Along the axes of the standard basis vectors, forming a sparse star-like shape.
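
A quick empirical look at where standard Gaussian samples fall as the dimension grows, as a sketch assuming NumPy (the helper `norm_stats`, the sample count, and the seed are illustrative). It reports the mean and the standard deviation of the sample norms next to the square root of the dimension.

```python
import numpy as np

def norm_stats(d, n_samples=2000, seed=0):
    """Mean and standard deviation of ||z|| for z drawn from N(0, I_d)."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(rng.standard_normal(size=(n_samples, d)), axis=1)
    return norms.mean(), norms.std()

for d in (1, 10, 100, 1000):
    mean_norm, std_norm = norm_stats(d)
    # Columns: dimension, sqrt(dimension), mean norm, spread of the norm
    print(d, round(np.sqrt(d), 2), round(mean_norm, 2), round(std_norm, 2))
```

The mean norm grows with the dimension while the spread stays below one, so relative to the radius the samples occupy an increasingly thin shell.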

43 $You apply Principal Component Analysis (PCA) on a data matrix \(X\). If you deliberately do NOT mean-center the data prior to computing the covariance matrix surrogate, how does the first principal component (PC1) geometrically behave?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Hard

A.

The matrix becomes rank-deficient, preventing the extraction of PC1.

B.

PC1 remains mathematically identical to the centered case, but the corresponding eigenvalue will be uniformly shifted by the magnitude of the mean.

C.

PC1 will be strictly orthogonal to the mean vector of the dataset.

D.

PC1 will point approximately towards the mean vector of the dataset, effectively capturing the distance from the origin rather than the direction of maximum variance.
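
The effect of skipping mean-centering can be reproduced on synthetic data. This is a minimal sketch assuming NumPy; the dataset (most variance along the second axis, cloud shifted far from the origin along the first) and the helper `first_direction` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Most variance lies along the second axis, but the cloud sits far from the origin along the first.
X = rng.normal(size=(500, 2)) * np.array([0.5, 2.0]) + np.array([50.0, 0.0])

def first_direction(M):
    """Leading right singular vector of M."""
    return np.linalg.svd(M, full_matrices=False)[2][0]

mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
pc1_uncentered = first_direction(X)
pc1_centered = first_direction(X - X.mean(axis=0))

print(abs(pc1_uncentered @ mean_dir))  # close to 1: aligned with the mean vector
print(abs(pc1_centered @ mean_dir))    # close to 0: aligned with the high-variance axis
```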

44 $Let \(X = U \Sigma V^\top\) be the thin Singular Value Decomposition (SVD) of a centered data matrix \(X\). How can the ratio of variance explained by the first \(k\) principal components be mathematically expressed?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Hard

A.

B.

C.

D.

45 $Which of the following generative assumptions mathematically guarantees that standard PCA will recover the true underlying lower-dimensional representation of a dataset?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Hard

A.

The data lies on an isometric Riemannian manifold where geodesic distance is proportional to Euclidean distance.

B.

The data is generated from a lower-dimensional affine subspace with added isotropic Gaussian noise.

C.

The dataset has extremely high variance along non-linear curves, and its covariance matrix is strictly full rank.

D.

The features of the dataset are strictly independent and follow a continuous uniform distribution.

46 $If the covariance matrix of a dataset has a condition number, what is the strict geometric implication for PCA?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Hard

A.

The dataset lies entirely within a lower-dimensional hyperplane, meaning at least one principal component will capture exactly zero variance.

B.

The data contains extreme, infinitely distant outliers that permanently distort the principal components.

C.

The intrinsic dimensionality of the data is strictly equal to its extrinsic dimensionality.

D.

The dataset is perfectly isotropic (spherical), making all principal directions equally valid.

47 $In Kernel PCA using an RBF (Gaussian) kernel, how does the dimension of the mapped feature space fundamentally affect the extraction of principal components?$

Linear Dimensionality Reduction Techniques (PCA – Geometric Intuition) Hard

A.

The feature space is infinite-dimensional, but the maximum number of non-zero principal components is strictly bounded by the number of data points $N$.

B.

The feature space dimension is explicitly computed, restricting Kernel PCA to be performed only in time.

C.

The number of principal components extracted is directly determined by the hyperparameter, regardless of dataset size.

D.

The feature space maps the data to a maximum of dimensions, allowing polynomial time extraction of components.
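
The bound on the number of kernel principal components can be checked on a small example. The sketch below assumes NumPy and builds the RBF kernel matrix by hand; the sizes `n` and `d` and the width parameter `gamma` are illustrative values. The implicit feature map is never formed, only the `n` by `n` kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 20, 5, 0.5            # illustrative sizes; gamma is the RBF width parameter
X = rng.normal(size=(n, d))

# RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)

# Center in feature space, then eigendecompose the n x n matrix.
H = np.eye(n) - np.ones((n, n)) / n
eigvals = np.linalg.eigvalsh(H @ K @ H)

n_nonzero = int((eigvals > 1e-10 * eigvals.max()).sum())
print(K.shape, n_nonzero)   # n x n matrix, so at most n components (n - 1 after centering)
```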

48 $Manifold learning relies heavily on neighborhood graphs. Which of the following is a critical edge-case vulnerability when defining neighborhoods using \(k\)-nearest neighbors (\(k\)-NN) as opposed to an \(\epsilon\)-radius graph?$

Manifold Learning Overview Hard

A.

$k$-NN can falsely connect distant regions of the manifold (short-circuiting) if the sampling density varies wildly, whereas $\epsilon$-radius graphs might yield disconnected components in sparse regions.

B.

$\epsilon$-radius graphs guarantee a fully connected graph regardless of data density, while $k$-NN naturally isolates outliers.

C.

$k$-NN graphs are always undirected by default, while $\epsilon$-radius graphs are naturally directed due to asymmetric distance calculations.

D.

$k$-NN graphs fail completely on non-convex manifolds, whereas $\epsilon$-radius graphs are mathematically invariant to manifold convexity.

49 $Consider Isomap and Locally Linear Embedding (LLE). Which of the following statements mathematically contrasts their core structure preservation objectives?$

Manifold Learning Overview Hard

A.

Isomap assumes the manifold is globally linear and applies SVD, whereas LLE models the manifold using a global high-degree polynomial function.

B.

Isomap seeks to globally preserve approximated geodesic distances using Multidimensional Scaling, whereas LLE seeks to preserve local barycentric coordinates used to reconstruct each point from its neighbors.

C.

Isomap minimizes a cross-entropy loss based on topological neighborhood probabilities, whereas LLE maximizes the variance of the projected data subject to orthogonality constraints.

D.

Isomap preserves the exact Euclidean distances for all pairs in the dataset, whereas LLE only preserves Euclidean distances for the $k$-nearest neighbors.

50 $In Spectral Embedding (Laplacian Eigenmaps), the objective is to minimize \(\sum_{i,j} W_{ij} \lVert y_i - y_j \rVert^2\) subject to \(Y^\top D Y = I\). Mathematically, this is equivalent to finding the eigenvectors associated with:$

Manifold Learning Overview Hard

A.

The smallest non-zero eigenvalues of the normalized graph Laplacian $D^{-1} L$.

B.

The largest eigenvalues of the adjacency matrix $W$.

C.

The largest eigenvalues of the unnormalized graph Laplacian $L = D - W$.

D.

The smallest eigenvalues of the local covariance matrix.

51 $In t-SNE, the gradient of the Kullback-Leibler divergence with respect to the mapped points dictates the dynamics of the embedding. What primarily governs the repulsive force between two points \(y_i\) and \(y_j\) in this low-dimensional space?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Hard

A.

The heavy-tailed Student t-distribution, which causes the repulsion to act universally across all points based on low-dimensional proximity, regardless of their high-dimensional distance.

B.

The exact Euclidean distance in the high-dimensional space, which explicitly pushes distant points infinitely far apart.

C.

The perplexity parameter, which directly scales the repulsive strength inversely proportional to the local data density.

D.

The degree of freedom of the Student t-distribution, which acts as a hard threshold, clipping the repulsion at a fixed radial distance.

52 $Why is t-SNE theoretically prone to creating artificial, tight clusters when applied to a dataset consisting entirely of uniformly distributed random noise?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Hard

A.

The heavy tails of the Student-t distribution cause all points to be repelled equally, forcing them into a strict, crystalline lattice structure.

B.

Uniform noise has an intrinsic dimensionality of zero, which t-SNE handles by projecting all points onto the surface of a hypersphere.

C.

The KL divergence penalty is asymmetric; it heavily penalizes placing high-dimensional neighbors far apart, but is highly lenient if distant noise points are placed close together, allowing arbitrary clumps to form.

D.

The Gaussian kernel used in the high-dimensional space mathematically forces the overall variance of the random noise to collapse to exactly zero.
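
The asymmetry of the KL divergence can be illustrated with two single-pair contributions. This is a sketch assuming NumPy; the function `kl_term` and the probability values are hypothetical and are not normalized over a full dataset.

```python
import numpy as np

def kl_term(p, q):
    """Contribution p * log(p / q) of a single pair to the KL divergence."""
    return p * np.log(p / q)

# Case 1: true neighbors (large p) placed far apart in the embedding (small q).
neighbors_split = kl_term(p=0.8, q=0.01)
# Case 2: non-neighbors (small p) placed close together in the embedding (large q).
strangers_merged = kl_term(p=0.01, q=0.8)

print(neighbors_split)   # large positive cost
print(strangers_merged)  # small in magnitude
```

Individual terms can be negative; only the sum over all pairs of normalized distributions is guaranteed to be non-negative.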

53 $If you naively set the perplexity parameter in t-SNE to a value strictly greater than the total number of data points, what mathematical failure occurs during the calculation of the high-dimensional similarities \(p_{j|i}\)?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Hard

A.

The similarities evaluate to exactly 1 for all pairs, causing the KL divergence loss function to equal zero before optimization begins.

B.

The algorithm defaults to standard PCA behavior because the probability matrix immediately becomes a perfectly sparse diagonal matrix.

C.

The variance converges to exactly zero, transforming the neighborhood graph into a completely disconnected set of nodes.

D.

The binary search for the variance fails to converge because the required Shannon entropy exceeds the maximum theoretically possible entropy of the discrete distribution ($\log_2(N - 1)$ bits for $N$ data points).
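
The entropy ceiling behind question 53 follows from the definition of perplexity as two raised to the Shannon entropy in bits. The sketch below assumes NumPy; the helper `perplexity`, the point count, and the example distributions are illustrative.

```python
import numpy as np

def perplexity(p):
    """Perplexity 2**H(p) of a discrete distribution p, with H measured in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return 2.0 ** (-(p * np.log2(p)).sum())

n_points = 50
uniform = np.full(n_points - 1, 1.0 / (n_points - 1))   # equal weight on every other point
peaked = np.array([0.9] + [0.1 / (n_points - 2)] * (n_points - 2))

print(perplexity(uniform))  # n_points - 1, the largest achievable value
print(perplexity(peaked))   # much smaller
```

No choice of kernel bandwidth can push the conditional distribution past the uniform case, so a requested perplexity above that ceiling cannot be matched.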

54 $In Barnes-Hut t-SNE, the computational complexity is reduced from \(O(N^2)\) to \(O(N \log N)\). Which mathematical approximation fundamentally enables this massive reduction in complexity?$

Non-Linear Dimensionality Reduction Concepts - t-SNE Hard

A.

Ignoring the repulsive forces entirely for points that are strictly outside the predefined $k$-nearest neighbors graph.

B.

Replacing the Student t-distribution with a uniform distribution for points whose distance exceeds a hyperparameter threshold .

C.

Grouping spatially distant points into a single center of mass using a spatial tree (quadtree/octree) to compute aggregate repulsive forces.

D.

Using a truncated Singular Value Decomposition (SVD) on the high-dimensional probability matrix before gradient descent optimization.

55 $UMAP utilizes a Cross-Entropy loss function to optimize its low-dimensional embeddings, setting it apart from t-SNE. How does this specific loss function fundamentally allow UMAP to better preserve global structure?$

UMAP – Conceptual Hard

A.

It eliminates repulsive forces entirely, allowing global distances to be determined strictly by a deterministic eigenvalue decomposition.

B.

It uses a symmetric Gaussian distribution in both the high and low-dimensional spaces, avoiding the geometric distortion introduced by heavy tails.

C.

It contains a term which explicitly penalizes placing high-dimensional distant points close together in the low-dimensional space.

D.

It enforces a strict metric constraint that the low-dimensional Euclidean distances must perfectly match the high-dimensional geodesic distances.

56 $UMAP's theoretical foundation relies on the assumption that data is uniformly distributed across a Riemannian manifold. Since real-world data density is highly variable, how does UMAP mathematically enforce this uniform assumption?$

UMAP – Conceptual Hard

A.

By performing an initial manifold unrolling using a global algorithm like Isomap to evenly distribute the data in the ambient space.

B.

By defining a custom Riemannian metric around each data point where the distance to its nearest neighbor is locally normalized to be constant.

C.

By deliberately injecting uniform Gaussian noise into the high-dimensional dataset prior to computing the topological simplicial complex.

D.

By assuming the ambient space is inherently hyperbolic and mapping all points onto the surface of a Poincaré disk.

57 $What role does the min_dist hyperparameter play in UMAP's low-dimensional optimization, and how does it manifest mathematically in the low-dimensional probability function ?$

UMAP – Conceptual Hard

A.

It acts as a hard threshold in the high-dimensional space; any points closer than min_dist are permanently merged into a single topological simplex.

B.

It defines a plateau in the low-dimensional distance function, controlling how tightly points are allowed to pack together before repulsion scales sharply.

C.

It strictly determines the learning rate decay schedule during the Stochastic Gradient Descent optimization of the cross-entropy loss.

D.

It sets the absolute minimum Euclidean distance required between any two disconnected topological components in the final embedding graph.
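
One way to picture the plateau described in question 57 is the piecewise target curve that, as far as I understand, UMAP fits its smooth low-dimensional similarity function to. The sketch below assumes NumPy; the function name `target_membership` is hypothetical, and treating `spread` as 1.0 is an assumption rather than something stated in the quiz.

```python
import numpy as np

def target_membership(dist, min_dist, spread=1.0):
    """Piecewise target: flat at 1 up to min_dist, exponential decay beyond it."""
    dist = np.asarray(dist, dtype=float)
    return np.where(dist <= min_dist, 1.0, np.exp(-(dist - min_dist) / spread))

d = np.array([0.0, 0.05, 0.1, 0.5, 1.0])
print(target_membership(d, min_dist=0.1))   # plateau ends early: points may pack tightly
print(target_membership(d, min_dist=0.5))   # wider plateau: embedding looks more spread out
```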

58 $Consider a shallow Linear Autoencoder (a single hidden layer, linear activations, trained to minimize MSE) and standard PCA applied to the same dataset. Which of the following accurately describes the mathematical relationship between the Autoencoder's bottleneck weights \(W\) and the PCA principal components?$

Autoencoder Intuition Hard

A.

The Autoencoder implicitly enforces an $L_1$ norm constraint during gradient descent, resulting in a sparse subspace unlike the dense PCA components.

B.

The row space of $W$ spans the exact same principal subspace as the top-$k$ PCA components, but the individual rows of $W$ are not necessarily orthogonal or variance-ordered.

C.

The Autoencoder will inevitably converge to arbitrary local minima, causing $W$ to capture a subspace completely orthogonal to the PCA components.

D.

The rows of $W$ will mathematically converge to the exact same ordered and strictly orthogonal eigenvectors found by the PCA covariance matrix.
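
The non-uniqueness of linear autoencoder weights can be shown without training anything. This is a sketch assuming NumPy and synthetic data: it builds an encoder by mixing the top PCA directions with an arbitrary invertible matrix `A` (a made-up choice) and pairs it with the pseudo-inverse as decoder. The reconstruction matches PCA exactly even though the encoder rows are not orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)
k = 2

# PCA solution: orthonormal top-k directions from the SVD.
Vk = np.linalg.svd(Xc, full_matrices=False)[2][:k]   # shape (k, 5)
pca_recon = Xc @ Vk.T @ Vk

# Linear autoencoder with encoder W = A @ Vk and decoder pinv(W):
# any invertible k x k matrix A yields the same reconstruction.
A = np.array([[2.0, 1.0], [0.0, 0.5]])
W = A @ Vk
ae_recon = Xc @ W.T @ np.linalg.pinv(W).T

print(np.allclose(pca_recon, ae_recon))   # same reconstruction, same subspace
print(np.round(W @ W.T, 3))               # not the identity: rows of W are not orthonormal
```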

59 $According to manifold learning theory, when you train a Denoising Autoencoder (DAE) by corrupting inputs \(x\) to \(\tilde{x}\) and reconstructing them, what is the geometric interpretation of the learned vector field \(r(\tilde{x}) - \tilde{x}\), where \(r\) denotes the reconstruction function?$

Autoencoder Intuition Hard

A.

It approximates the score function (gradient of the log-density, $\nabla_x \log p(x)$) of the data distribution, effectively pointing corrupted points toward the highest density regions on the manifold.

B.

It maps out a set of strictly orthogonal basis functions that define the null space (the zero-variance directions) of the data manifold.

C.

It calculates the exact geodesics of the manifold by strictly minimizing the path integral of the Euclidean distance between $x$ and $\tilde{x}$.

D.

It perfectly aligns with the principal eigenvectors of the global covariance matrix, always pointing along the manifold's flattest linear dimensions.

60 $A Contractive Autoencoder (CAE) introduces a penalty term \(\lambda \lVert J_f(x) \rVert_F^2\), where \(J_f(x)\) is the Jacobian matrix of the encoder activations with respect to the input. What is the fundamental representation learning goal of this specific regularization?$

Autoencoder Intuition Hard

A.

To align the bottleneck activations with a standard Gaussian distribution by minimizing the Kullback-Leibler divergence to a prior.

B.

To force the learned representation to be insensitive to small local variations in the input space, ensuring features only change along the manifold while remaining flat in orthogonal directions.

C.

To strictly force the weights of the network to become sparse, mathematically minimizing the number of active hidden neurons for any given input.

D.

To ensure the decoder generates an output whose dimensionality is strictly greater than the input, encouraging an overcomplete and entangled representation.
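
For a concrete view of the contractive penalty, the sketch below computes the squared Frobenius norm of the encoder Jacobian in closed form. It assumes NumPy and a single-layer sigmoid encoder, which is an architecture chosen for illustration; the names `encoder` and `contractive_penalty` and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder(x, W, b):
    """Single-layer sigmoid encoder h = sigmoid(W x + b) (an assumed architecture)."""
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the encoder Jacobian dh/dx evaluated at x."""
    h = encoder(x, W, b)
    # For a sigmoid layer the Jacobian is diag(h * (1 - h)) @ W.
    J = (h * (1.0 - h))[:, None] * W
    return (J ** 2).sum()

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=4)
print(contractive_penalty(x, W, b))
```

A small value means the hidden code barely moves when the input is perturbed, which is the local insensitivity the regularizer rewards.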

Unit 4 - Practice Quiz