1 $What is the primary difference between internal and external clustering validation metrics?$

Internal and External Clustering Validation Metrics Easy

A.

Internal metrics use ground truth labels, while external metrics do not.

B.

Both internal and external metrics require ground truth labels.

C.

Neither metric type uses ground truth labels.

D.

External metrics use ground truth labels, while internal metrics do not.

2 $Which of the following is a common example of an external clustering validation metric?$

Internal and External Clustering Validation Metrics Easy

A.

Silhouette Score

B.

Adjusted Rand Index (ARI)

C.

Davies-Bouldin Index

D.

Within-Cluster Sum of Squares (WCSS)

3 $Internal clustering validation metrics typically evaluate the quality of clusters based on which two properties?$

Internal and External Clustering Validation Metrics Easy

A.

Accuracy and Precision

B.

True Positives and False Negatives

C.

Stability and Randomness

D.

Cohesion and Separation

4 $What is the range of possible values for the Silhouette Score?$

Silhouette Score and Cohesion–Separation Intuition Easy

A.

to

B.

$0$ to $1$

C.

$-1$ to $1$

D.

$0$ to $100$

5 $In the context of clustering evaluation, what does "cohesion" measure?$

Silhouette Score and Cohesion–Separation Intuition Easy

A.

How distinct different clusters are from each other

B.

How closely related objects are within the same cluster

C.

The computation time of the algorithm

D.

The accuracy of the labels assigned

6 $What does a Silhouette Score near indicate about a specific data point?$

Silhouette Score and Cohesion–Separation Intuition Easy

A.

It has been assigned to the wrong cluster.

B.

It is perfectly clustered.

C.

It is an outlier that should be deleted.

D.

It lies exactly on the boundary between two clusters.

7 $What is the primary goal of stability-based evaluation in clustering?$

Stability-Based Evaluation Easy

A.

To assign clear, human-readable labels to unlabelled data

B.

To convert high-dimensional data into a 2D plot

C.

To measure the speed and memory usage of the clustering algorithm

D.

To measure how consistent the clustering results are when the data is perturbed or resampled

8 $Which statistical technique is commonly used to test cluster stability by repeatedly drawing random samples with replacement from the dataset?$

Stability-Based Evaluation Easy

A.

Principal Component Analysis (PCA)

B.

Linear Regression

C.

Bootstrapping

D.

One-hot encoding

9 $Why is interpretability often a major challenge in unsupervised learning?$

Interpretability Challenges in Unsupervised Learning Easy

A.

Unsupervised models only output binary $0$ or $1$ values.

B.

The datasets used are always too small to generalize.

C.

There are no predefined ground truth labels to give context to the discovered patterns.

D.

The algorithms are generally too simple to produce meaningful results.

10 $Which of the following best describes the role of "domain expertise" in unsupervised learning?$

Interpretability Challenges in Unsupervised Learning Easy

A.

Using knowledge of a specific field (e.g., medicine or marketing) to assign meaning to clusters

B.

The ability to write advanced Python code for neural networks

C.

Choosing the cloud computing platform for the model

D.

The mathematical understanding of convergence proofs

11 $In topic modeling (a form of unsupervised learning), why might a human struggle to interpret a discovered topic?$

Interpretability Challenges in Unsupervised Learning Easy

A.

Topic models only work on numerical image data.

B.

The model forces all words to have the same frequency.

C.

The words grouped together might not form a coherent theme to a human reader.

D.

The algorithm always translates the words into a different language.

12 $Which of the following is a classic real-world application of unsupervised anomaly detection?$

Real-World Case Studies Easy

A.

Translating a text document from English to French

B.

Credit card fraud detection

C.

Generating realistic human faces

D.

Predicting tomorrow's exact stock market prices

13 $Unsupervised learning is commonly used in recommendation systems to achieve which goal?$

Real-World Case Studies Easy

A.

Determine the raw material cost of a product

B.

Calculate the exact shipping cost of an item

C.

Enforce password complexity rules

D.

Discover latent groups of users with similar preferences

14 $In bioinformatics, clustering algorithms are frequently used for what purpose?$

Real-World Case Studies Easy

A.

Grouping genes with similar expression patterns

B.

Designing hospital building layouts

C.

Diagnosing broken medical equipment

D.

Predicting a patient's exact time of death

15 $What is a major ethical risk when unsupervised learning models discover patterns in demographic data?$

Ethical Considerations in Pattern Discovery Easy

A.

The model might inadvertently discover and amplify historical biases or discrimination.

B.

The model will permanently delete the original dataset.

C.

The model will automatically charge users for data access.

D.

The model will encrypt the data so no one can read it.

16 $Why is data privacy a significant concern in unsupervised pattern discovery?$

Ethical Considerations in Pattern Discovery Easy

A.

Unsupervised algorithms are designed to sell data to third parties.

B.

Data privacy laws do not apply to machine learning.

C.

Unsupervised models cannot be protected by standard passwords.

D.

Clustering might infer sensitive underlying traits about individuals that were not explicitly stated.

17 $An unsupervised algorithm clusters banking customers by zip code, which inadvertently groups them by race, leading to unfair loan rejections. This is an example of:$

Ethical Considerations in Pattern Discovery Easy

A.

Algorithmic bias and proxy discrimination

B.

Dimensionality expansion

C.

Perfect cluster cohesion

D.

Optimal feature selection

18 $Which dimensionality reduction technique is most famous for successfully visualizing the MNIST handwritten digit dataset in 2D or 3D spaces?$

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Easy

A.

Linear Regression

B.

K-Means Clustering

C.

t-SNE (t-Distributed Stochastic Neighbor Embedding)

D.

Apriori Algorithm

19 $In a customer segmentation case study, what do the resulting distinct clusters typically represent?$

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Easy

A.

Groups of customers with similar purchasing behaviors or demographic profiles

B.

Randomly distributed anomalies in a database system

C.

The specific names and home addresses of the customers

D.

The exact time of day a physical retail store should open

20 $When visually plotting the MNIST dataset using t-SNE, what does a dense cluster of points of the same color typically represent?$

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Easy

A.

Images that took the exact same amount of time for the user to draw

B.

The mathematical formula for converting text to speech

C.

Images that are completely unrelated to one another

D.

Images of the same handwritten digit that share structural similarities

21 $Which of the following metrics is most appropriate for evaluating a clustering algorithm when ground-truth labels are available, and you want to account for chance agreement?$

Internal and External Clustering Validation Metrics Medium

A.

Silhouette Coefficient

B.

Dunn Index

C.

Adjusted Rand Index (ARI)

D.

Davies-Bouldin Index

22 $The Dunn Index evaluates clustering quality based on cluster compactness and separation. If $d_{\min}$ is the minimum inter-cluster distance and $d_{\max}$ is the maximum intra-cluster distance, how is the Dunn Index interpreted?$

Internal and External Clustering Validation Metrics Medium

A.

A lower Dunn Index indicates better clustering because it minimizes .

B.

A Dunn Index close to zero indicates optimal clustering.

C.

The Dunn Index must be exactly 1 for a perfectly stable cluster structure.

D.

A higher Dunn Index indicates better clustering, meaning clusters are well-separated and compact.
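
The ratio described in this question can be sketched in a few lines of standard-library Python. This is an illustrative version only: it assumes the common "closest pair between clusters over widest pair within a cluster" definition, and the helper name `dunn_index` and the toy points are made up for the example.

```python
from itertools import combinations
from math import dist

def dunn_index(clusters):
    """Smallest between-cluster distance divided by largest cluster diameter.

    `clusters` is a list of clusters, each a list of coordinate tuples.
    Assumes at least two clusters and at least one cluster containing two
    distinct points (otherwise the diameter would be zero).
    """
    # Separation: closest pair of points taken from two different clusters.
    min_inter = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Compactness: widest pair of points inside any single cluster.
    max_intra = max(
        dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_inter / max_intra

tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # compact clusters, far apart
loose = [[(0, 0), (0, 4)], [(5, 0), (5, 4)]]     # wide clusters, closer together

dunn_tight = dunn_index(tight)   # 10 / 1
dunn_loose = dunn_index(loose)   # 5 / 4
```

Comparing `dunn_tight` with `dunn_loose` shows how the index responds when clusters become wider and closer together.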

23 $When using the Davies-Bouldin (DB) Index to evaluate clustering performance, which of the following scenarios represents the best clustering result?$

Internal and External Clustering Validation Metrics Medium

A.

A highly negative DB index

B.

The lowest possible positive DB index

C.

The highest possible positive DB index

D.

A DB index exactly equal to 1

24 $Let $a$ be the mean intra-cluster distance for a sample, and $b$ be the mean nearest-cluster distance. The silhouette score is given by $s = \frac{b - a}{\max(a, b)}$. What does a score of indicate?$

Silhouette Score and Cohesion–Separation Intuition Medium

A.

The sample is located on or very close to the decision boundary between two clusters.

B.

The clustering algorithm failed to converge.

C.

The sample is perfectly clustered at the center of its own cluster.

D.

The sample is misclassified and belongs to a different cluster.
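
For readers who want to experiment with the per-sample silhouette, here is a small sketch using only the standard library. It assumes the usual definition, with `a` as the mean distance to the other members of the sample's own cluster and `b` as the smallest mean distance to any other cluster. The function name `silhouette` and the one-dimensional toy data are hypothetical, and points are assumed to be distinct.

```python
from math import dist
from statistics import mean

def silhouette(point, own_cluster, other_clusters):
    """Silhouette value of `point`, which must be a member of `own_cluster`."""
    others_in_own = [p for p in own_cluster if p != point]
    if not others_in_own:            # singleton cluster: 0 by convention
        return 0.0
    a = mean(dist(point, p) for p in others_in_own)                   # cohesion
    b = min(mean(dist(point, p) for p in c) for c in other_clusters)  # separation
    return (b - a) / max(a, b)

# A point deep inside a well-separated cluster.
s_core = silhouette((0.0,), [(0.0,), (1.0,), (2.0,)],
                    [[(10.0,), (11.0,), (12.0,)]])

# A point equally far from its own cluster mate and from the other cluster.
s_boundary = silhouette((5.0,), [(3.0,), (5.0,)], [[(7.0,)]])

# A point grouped with a distant cluster although it sits next to another one.
s_wrong = silhouette((9.0,), [(0.0,), (1.0,), (9.0,)],
                     [[(10.0,), (11.0,)]])
```

The three values land near the top, the middle, and the bottom of the score's range respectively.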

25 $In the context of cluster cohesion and separation, how does K-Means optimize these two intuitive properties?$

Silhouette Score and Cohesion–Separation Intuition Medium

A.

It explicitly maximizes separation without affecting cohesion.

B.

It uses a penalty term to balance cohesion and separation equally during gradient descent.

C.

It minimizes the within-cluster sum of squares (cohesion), which mathematically maximizes the between-cluster sum of squares (separation) for a fixed dataset.

D.

It maximizes the silhouette score at each iteration.

26 $If a dataset yields a highly negative average Silhouette Score, what is the most likely geometric interpretation of the clusters?$

Silhouette Score and Cohesion–Separation Intuition Medium

A.

The number of clusters is perfectly optimal.

B.

The clusters are perfectly spherical and well-separated.

C.

The clusters represent dense, arbitrary shapes similar to those found by DBSCAN.

D.

The clusters are highly overlapping, and most data points have been assigned to the wrong clusters.

27 $Stability-based evaluation involves repeatedly clustering perturbed versions of the dataset. Which parameter is often determined using this method?$

Stability-Based Evaluation Medium

A.

The maximum number of iterations

B.

The optimal number of clusters, $k$

C.

The distance metric to be used

D.

The learning rate of the algorithm

28 $When comparing two clustering partitions produced during stability-based evaluation via bootstrapping, which metric is commonly used to quantify the agreement between the two partitions?$

Stability-Based Evaluation Medium

A.

Principal Component Variance

B.

Silhouette Score

C.

Jaccard Coefficient

D.

Within-Cluster Sum of Squares (WCSS)
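
One way to compare two partitions without worrying about how the clusters are numbered is to count point pairs. The sketch below is an assumption-laden illustration: it computes a pair-counting Jaccard coefficient for two labelings of the same points (in a resampling setting, that would be the points common to both resamples), and `pair_jaccard` is a hypothetical helper name.

```python
from itertools import combinations

def pair_jaccard(labels_a, labels_b):
    """Jaccard coefficient over point pairs that are co-clustered.

    A pair counts toward the intersection when both labelings put the two
    points together, and toward the union when at least one of them does.
    """
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
    union = both + only_a + only_b
    return both / union if union else 1.0

run_1 = [0, 0, 0, 1, 1, 1]
run_2 = [1, 1, 1, 0, 0, 0]   # same grouping, cluster ids swapped
run_3 = [0, 0, 1, 1, 2, 2]   # a different grouping

agree_relabelled = pair_jaccard(run_1, run_2)
agree_different = pair_jaccard(run_1, run_3)
```

Because only co-membership is compared, swapping cluster ids between runs does not lower the score, while a genuinely different grouping does.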

29 $A data scientist applies stability-based evaluation by adding small amounts of Gaussian noise to the dataset. If the resulting clusters change drastically, what does this imply about the original clustering?$

Stability-Based Evaluation Medium

A.

The original clusters represent genuine, deep topological structures in the data.

B.

The clustering model is overfitting to the specific noise or outliers in the original dataset.

C.

The dataset has too few features for unsupervised learning.

D.

The distance metric is too robust to outliers.

30 $Why is interpreting the principal components generated by PCA often challenging in high-dimensional domains like genomics?$

Interpretability Challenges in Unsupervised Learning Medium

A.

PCA components are completely random vectors.

B.

PCA inherently introduces non-linear distortions to the data.

C.

PCA discards the features with the highest variance, removing important information.

D.

Each principal component is a linear combination of potentially all original features, making it hard to assign a single semantic meaning.

31 $When interpreting cluster centroids obtained from K-Means on a dataset with standardized features (mean 0, variance 1), what does a centroid value of $1.5$ for a specific feature signify?$

Interpretability Challenges in Unsupervised Learning Medium

A.

This feature contributed 1.5 times more to the clustering distance than other features.

B.

The feature has an absolute value of 1.5 in the original raw data.

C.

The points in this cluster have an average value for this feature that is 1.5 standard deviations above the global mean.

D.

The cluster spans a distance of 1.5 units across this feature.

32 $Which of the following describes a common approach to improving the interpretability of an autoencoder's latent space representation?$

Interpretability Challenges in Unsupervised Learning Medium

A.

Using exclusively linear activation functions in all layers.

B.

Increasing the dimensionality of the latent space to exceed the input space.

C.

Applying sparsity constraints (e.g., L1 regularization) to the latent activations.

D.

Training the autoencoder without any reconstruction loss.

33 $In a real-world anomaly detection case study for credit card fraud, an unsupervised Isolation Forest model is deployed. What is the most significant practical limitation of relying solely on this unsupervised approach?$

Real-World Case Studies Medium

A.

It scales exponentially with the number of transactions, making it unusable in real-time.

B.

It may flag rare but legitimate transactions as anomalies, leading to a high false-positive rate.

C.

It can only detect fraud types that have been explicitly labeled in the past.

D.

It requires a strictly linear relationship between transaction features.

34 $When performing topic modeling (e.g., using LDA) on a large corpus of news articles, how is the quality of the unsupervised topics usually evaluated in a real-world setting?$

Real-World Case Studies Medium

A.

Through human evaluation of topic coherence (e.g., seeing if the top words in a topic make semantic sense together).

B.

By checking if the perplexity score is perfectly zero.

C.

By ensuring every document belongs to exactly one topic with a probability of 1.0.

D.

By calculating the exact Silhouette score of the text embeddings.

35 $An unsupervised clustering algorithm groups job applicants based on resume text. It ends up creating a cluster that predominantly contains female applicants, despite 'gender' being removed from the data. What ethical issue does this highlight?$

Ethical Considerations in Pattern Discovery Medium

A.

Data privacy was violated during the text tokenization phase.

B.

The algorithm failed to minimize the within-cluster variance.

C.

The clustering model was under-fitted and requires more epochs.

D.

The presence of proxy variables (e.g., women's colleges, specific clubs) implicitly captured the sensitive attribute.

36 $Why is 'reinforcement of historical bias' a significant concern in Unsupervised Learning algorithms used for pattern discovery?$

Ethical Considerations in Pattern Discovery Medium

A.

Unsupervised models extract patterns inherent in the data; if the historical data reflects societal biases, the model will identify and potentially codify those biases as objective clusters.

B.

Unsupervised algorithms are programmed to intentionally alter data distributions.

C.

Unsupervised models require a perfectly uniform distribution of classes to function ethically.

D.

Because the algorithms rely on labels, biased labels will immediately corrupt the model.

37 $In the context of clustering user data for a targeted advertising system, which of the following poses the greatest risk to user privacy (deanonymization)?$

Ethical Considerations in Pattern Discovery Medium

A.

Standardizing the features to have a mean of zero.

B.

Applying PCA to reduce the data from 100 dimensions to 10 dimensions before clustering.

C.

Allowing the algorithm to form 'micro-clusters' consisting of only one or two individuals.

D.

Using a small number of very large, general clusters.

38 $When applying t-SNE to visualize the 784-dimensional MNIST handwritten digit dataset in 2D, a data scientist notices that the distance between the cluster of '0's and the cluster of '1's is very large. How should this distance be interpreted?$

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Medium

A.

t-SNE primarily preserves local neighborhood structures; global distances between distinct clusters in the 2D plot are not strictly meaningful or proportional to true distances.

B.

t-SNE preserves global distances perfectly, so this represents the exact Euclidean distance in the 784D space.

C.

The large distance implies that the perplexity parameter was set too low.

D.

It strictly indicates that '0's and '1's are the most visually dissimilar digits in the entire dataset.

39 $In a customer segmentation case study, a dataset contains 'Age' (ranging 18-80) and 'Annual Income' (ranging $20,000-$150,000). Before applying K-Means clustering, the data scientist forgets to scale the data. What is the most likely consequence?$

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Medium

A.

The algorithm will automatically normalize the distances internally.

B.

K-Means will fail to converge entirely.

C.

The clusters will be determined almost entirely by 'Annual Income', as its scale and variance are vastly larger, dominating the Euclidean distance.

D.

The clusters will be determined almost entirely by 'Age' because its variance is mathematically harder to compute.
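
The effect of feature scale on Euclidean distance can be checked numerically. The sketch below uses made-up customer records and a hypothetical helper `zscore_columns`; it only compares distances before and after standardization and does not run K-Means itself.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical customers as (age in years, annual income in dollars).
customers = [(22, 30_000), (25, 32_000), (61, 31_000), (65, 33_000),
             (30, 120_000), (58, 125_000)]

def zscore_columns(rows):
    """Standardize each column to mean 0 and population variance 1."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

# Two customers 39 years apart in age but only $1,000 apart in income.
young, old = customers[0], customers[2]
raw_gap = dist(young, old)
income_share = (young[1] - old[1]) ** 2 / raw_gap ** 2   # share of squared distance

scaled = zscore_columns(customers)
scaled_age_gap = abs(scaled[0][0] - scaled[2][0])
scaled_income_gap = abs(scaled[0][1] - scaled[2][1])
```

On the raw data, `income_share` shows how much of the squared distance between these two customers comes from income alone; after standardization the age gap is the larger of the two.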

40 $When reducing the dimensionality of the MNIST dataset to 2D for visualization, a researcher compares PCA and UMAP. The PCA plot shows overlapping classes, while the UMAP plot shows clearly distinct islands for each digit. What explains this difference?$

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Medium

A.

PCA attempts to maximize global variance linearly, which cannot capture the non-linear manifold of the digits, whereas UMAP captures non-linear local neighborhood relationships.

B.

PCA is a non-linear technique, making it prone to overlapping classes, while UMAP is strictly linear.

C.

PCA removes the mean of the data, which destroys the structural information of images.

D.

UMAP utilizes supervised labels during its default projection, whereas PCA is purely unsupervised.

41 $Suppose you are evaluating a clustering algorithm using the Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI). The ground truth consists of roughly equal-sized clusters. The algorithm degenerates and places every single data point into its own individual cluster (i.e., $N$ clusters for $N$ points). How will the Homogeneity, Completeness, and ARI behave in this edge case?$

Internal and External Clustering Validation Metrics Hard

A.

Homogeneity approaches 0, Completeness = 1, ARI = 0

B.

Homogeneity approaches 0, Completeness approaches 0, ARI approaches 0

C.

Homogeneity = 1, Completeness approaches 0, ARI approaches -1

D.

Homogeneity = 1, Completeness approaches 0, ARI approaches 0
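
Edge cases like this one are easy to probe with a small pair-counting implementation. The following is a sketch of the standard contingency-table form of the Adjusted Rand Index written with the standard library; the function name and the toy labels are assumptions for illustration, and a library implementation would normally be preferred in practice.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI computed from pair counts in the contingency table."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_truth = sum(comb(c, 2) for c in Counter(truth).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_truth * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = (sum_truth + sum_pred) / 2
    if max_index == expected:      # both labelings trivial; treat as identical
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = [0] * 5 + [1] * 5 + [2] * 5
singletons = list(range(len(truth)))   # every point in its own cluster

ari_perfect = adjusted_rand_index(truth, truth)
ari_singletons = adjusted_rand_index(truth, singletons)
```

With all-singleton predictions no pair of points is ever co-clustered, so both the observed and the chance-expected pair agreement are zero.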

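42 $The Calinski-Harabasz (CH) Index is defined as $\mathrm{CH} = \frac{\mathrm{tr}(B_k)/(k-1)}{\mathrm{tr}(W_k)/(N-k)}$, where $B_k$ and $W_k$ are the between-cluster and within-cluster dispersion matrices for $k$ clusters and $N$ points. What is the mathematical vulnerability of the CH Index when evaluating algorithms like DBSCAN that can produce an arbitrary number of clusters along with noise points, assuming noise points are assigned to a single 'noise' cluster?$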

Internal and External Clustering Validation Metrics Hard

A.

The factor causes the CH index to become negative when noise points exceed valid cluster points.

B.

The CH index strictly monotonically increases as the number of clusters $k$ approaches the number of points $N$, favoring absolute fragmentation.

C.

The inclusion of noise points inflates disproportionately, heavily penalizing dense, well-separated clusters.

D.

The CH index assumes spherical clusters and uses the global centroid; non-convex clusters or a widely dispersed 'noise' cluster will artificially inflate the within-cluster dispersion $\mathrm{tr}(W_k)$, severely dropping the CH score.

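43 $Consider the Davies-Bouldin (DB) Index, defined as $\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$, where $\sigma_i$ is the scatter of cluster $i$ and $d(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$. If you apply a clustering algorithm to a high-dimensional dataset where the distance metric suffers from the 'curse of dimensionality' (i.e., all pairwise distances converge to a similar value), what is the asymptotic behavior of the DB Index?$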

Internal and External Clustering Validation Metrics Hard

A.

It approaches infinity because the centroid distances ($d(c_i, c_j)$) converge to 0.

B.

It becomes exactly 1 for all possible cluster assignments regardless of the data distribution.

C.

It converges to a constant ratio dependent only on the arbitrary cluster assignment sizes, losing its discriminative power.

D.

It converges to 0 because the scatter ($\sigma_i$) approaches 0 in high dimensions.

44 $Which of the following scenarios describes a theoretical flaw when using the Fowlkes-Mallows Index (FMI) to compare two clusterings of highly imbalanced ground truth data?$

Internal and External Clustering Validation Metrics Hard

A.

FMI is unaffected by true negatives, meaning it completely ignores the vast majority of point pairs that correctly do not belong to the same cluster.

B.

FMI converges to the Jaccard Index as the cluster sizes become increasingly imbalanced.

C.

FMI strictly requires an equal number of predicted clusters and ground truth clusters to be mathematically defined.

D.

FMI penalizes false negatives more than false positives, causing it to favor over-segmentation.

45 $The Silhouette coefficient for a data point $i$ is $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$. Suppose you cluster a dataset consisting of two perfectly concentric circles using DBSCAN, which correctly identifies the inner circle as Cluster 1 and the outer circle as Cluster 2. What will be the general characteristic of the Silhouette scores for the points in Cluster 2 (the outer circle)?$

Silhouette Score and Cohesion–Separation Intuition Hard

A.

They will be close to +1 because the clusters are perfectly separated topologically.

B.

They will be close to 0 or negative because the mean distance to points in the inner circle ($b(i)$) can be smaller than the mean distance to points across the outer circle ($a(i)$).

C.

They will be undefined because DBSCAN does not use centroids for cluster assignment.

D.

They will fluctuate uniformly between -1 and +1 depending strictly on the density parameter $\varepsilon$.

46 $A researcher is optimizing the hyperparameter $k$ in K-Means by maximizing the average Silhouette score. The dataset fundamentally consists of three clusters: one highly dense spherical cluster of 10,000 points, and two sparse, elongated clusters of 100 points each. How might Silhouette maximization mislead the researcher?$

Silhouette Score and Cohesion–Separation Intuition Hard

A.

It will likely choose $k = 2$ by grouping the two sparse clusters together to minimize intra-cluster distance penalties associated with elongated shapes.

B.

It will bias towards splitting the massive dense cluster into multiple sub-clusters because maximizing the global average silhouette heavily weights the dense cluster's internal cohesion.

C.

It will inherently fail to compute because Silhouette cannot handle clusters with differing sample sizes.

D.

It will prefer $k = 3$ but force the sparse clusters to merge, leaving one cluster empty.

47 $By convention, if a cluster contains only a single data point (a singleton), its Silhouette score is set to 0. If this convention were instead evaluated mathematically using the standard formula without overriding, what logical paradox would occur?$

Silhouette Score and Cohesion–Separation Intuition Hard

A.

The denominator would become negative, invalidating the metric.

B.

The equation would perfectly compute to 0 naturally without needing a convention.

C.

The $b(i)$ term would equal 0, making the numerator negative and yielding $s(i) = -1$.

D.

The $a(i)$ term (mean intra-cluster distance) would be undefined or zero, causing division by zero if $b(i)$ is also zero, or incorrectly evaluating to $s(i) = 1$.

48 $When using stability-based evaluation to determine the optimal number of clusters, you repeatedly subsample the data and measure the agreement (e.g., using Adjusted Rand Index) between the clusterings. In a dataset drawn from a completely uniform distribution with no true clusters, what is the expected behavior of the stability curve as the number of clusters $k$ increases from 2 upward?$

Stability-Based Evaluation Hard

A.

Stability will remain near 1.0 for all $k$, indicating that uniform data is perfectly stable.

B.

Stability will linearly increase as $k$ grows, eventually reaching 1.0.

C.

Stability will oscillate predictably between -1 and 1 depending on whether $k$ is even or odd.

D.

Stability will be consistently low (near 0) across all $k$ because the arbitrary cluster boundaries will shift wildly with different subsamples.

49 $A modeler applies a stability-based method to evaluate a K-Means clustering model. They bootstrap the dataset $B$ times, cluster each sample, and calculate the pairwise Jaccard coefficient of the cluster assignments. Why might bootstrapping introduce a pessimistic bias (underestimating true stability) compared to subsampling without replacement in this specific context?$

Stability-Based Evaluation Hard

A.

Bootstrapping inherently reduces the dimensionality of the dataset, distorting the distance metric.

B.

Bootstrapping changes the total number of points in each sample, making Jaccard coefficients mathematically impossible to compute.

C.

Bootstrapping creates duplicate data points, which shifts K-Means centroids toward dense duplicated regions and alters boundaries more drastically than mere subsetting.

D.

Bootstrapping ensures every original point appears exactly once across all samples, preventing proper variance estimation.
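
The structural difference between the two resampling schemes can be seen by drawing indices directly. This sketch uses only Python's `random` module with a fixed seed; the dataset size and the 50% subsample fraction are arbitrary choices for illustration, and no clustering is performed.

```python
import random

rng = random.Random(0)      # fixed seed so the sketch is repeatable
n = 10_000
indices = range(n)

# Bootstrap: n draws with replacement, so some indices repeat and others never appear.
bootstrap = [rng.randrange(n) for _ in indices]

# Subsample: half of the indices drawn without replacement, so none repeat.
subsample = rng.sample(indices, k=n // 2)

unique_in_bootstrap = len(set(bootstrap)) / n
duplicates_in_subsample = len(subsample) - len(set(subsample))
```

For large `n` the fraction of distinct points in a bootstrap sample is close to $1 - 1/e \approx 0.632$; the remaining draws are repeats of points that are already present.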

50 $Consider evaluating cluster stability via a Prediction Strength metric. The dataset is split into training and test sets; clusters are found on both. Test points are then assigned to the nearest training cluster centroid. What represents a critical failure mode of this specific evaluation strategy when applied to clusters with highly irregular, non-convex shapes?$

Stability-Based Evaluation Hard

A.

Non-convex clusters always have overlapping training and test subsets, violating the independence assumption of the metric.

B.

The test set will always contain out-of-distribution points, making prediction strength naturally inflate to 1.

C.

Prediction strength requires computing the determinant of the covariance matrix, which is singular for non-convex shapes.

D.

Prediction strength assumes clusters are best represented by global centroids; nearest-centroid assignment will incorrectly classify test points of non-convex clusters, yielding falsely low stability.

51 $To improve interpretability in generative unsupervised models, researchers often use a $\beta$-VAE to enforce disentangled representations. Mathematically, this is achieved by scaling the Kullback-Leibler (KL) divergence term in the ELBO by $\beta > 1$. What is the primary theoretical trade-off encountered when enforcing this interpretability constraint?$

Interpretability Challenges in Unsupervised Learning Hard

A.

It increases the dimensionality of the latent space to infinity, causing the 'curse of dimensionality'.

B.

It severely degrades the reconstruction quality (the likelihood term) because the model is forced to prioritize matching an isotropic Gaussian prior over capturing complex data variance.

C.

It converts the unsupervised learning problem into a supervised one, requiring labeled data for convergence.

D.

It forces the latent space to become highly correlated, leading to mode collapse.

52 $A common post-hoc method to interpret a black-box clustering algorithm is to train a surrogate decision tree predicting the cluster labels from the input features. If the underlying clustering algorithm is Spectral Clustering applied to concentric rings (a non-linear manifold), what is the most likely interpretability challenge faced by the surrogate decision tree?$

Interpretability Challenges in Unsupervised Learning Hard

A.

The decision tree will achieve 100% fidelity but will have only two leaf nodes, providing no useful information.

B.

The decision tree will require an impractically deep structure with many orthogonal axis-aligned splits to approximate the circular boundaries, reducing human interpretability and risking poor fidelity.

C.

Spectral Clustering outputs categorical cluster centers, which cannot be used as target labels for a standard decision tree.

D.

The tree will perfectly capture the eigenvectors, forcing the user to interpret Laplacian matrices rather than original features.

53 $In Principal Component Analysis (PCA), the principal components are linear combinations of the original features, which aids interpretability via 'loadings'. In contrast, a deep Autoencoder with non-linear activation functions typically lacks this interpretability. Which mathematical property strictly present in PCA is absent in standard Autoencoders, making the latter's latent space harder to interpret?$

Interpretability Challenges in Unsupervised Learning Hard

A.

The use of a bottleneck layer to compress information.

B.

Strict orthogonality and hierarchical variance maximization of the latent dimensions.

C.

The differentiability of the latent variables with respect to the input.

D.

Minimization of reconstruction error.

54 $An unsupervised model is used to segment neighborhoods for targeted marketing. To ensure fairness, the data scientists explicitly remove 'Race' and 'Income' from the dataset. However, an external audit reveals the clusters are still highly correlated with race. Which fundamental phenomenon of unsupervised learning causes this ethical failure?$

Ethical Considerations in Pattern Discovery Hard

A.

Mode collapse, where the clustering algorithm ignores all features except the one with the highest variance.

B.

The curse of dimensionality, which mathematically biases Euclidean distances towards minority groups.

C.

Redundant Encoding (or Proxy Variables), where remaining features like 'Zip Code' or 'Purchasing Habits' perfectly reconstruct the omitted protected attributes.

D.

Simpson's Paradox, where trends appear in different groups of data but disappear when combined.

55 $To address fairness in clustering, algorithms like 'Fair K-Means' introduce constraints into the objective function. If $C_k$ is the set of points in cluster $k$, $G$ is a protected demographic group, and $N$ is the total dataset size, which constraint represents the concept of 'Disparate Impact' mitigation (demographic parity) in Fair K-Means?$

Ethical Considerations in Pattern Discovery Hard

A.

B.

C.

D.

56 $In a real-world cybersecurity application for anomaly detection, an Isolation Forest is chosen over distance-based methods like K-Nearest Neighbors (KNN). Given a dataset with features where normal traffic forms a dense hyper-sphere and anomalies are sparsely distributed, what is the theoretical justification for this choice?$

Real-World Case Studies Hard

A.

Isolation Forests project the data into a 2D space using eigenvectors, inherently filtering out high-dimensional noise.

B.

Isolation Forests compute pairwise Euclidean distances in time, bypassing the computational cost of KNN in high dimensions.

C.

Distance-based methods fail because the ratio of the distance to the nearest neighbor over the distance to the farthest neighbor approaches 1 in high dimensions, making anomalies indistinguishable from normal points.

D.

KNN requires normalized data to function, whereas network traffic data is strictly categorical.

57 $When analyzing single-cell RNA sequencing data (a real-world clustering application), researchers frequently utilize Louvain community detection on a K-Nearest Neighbor (KNN) graph rather than K-Means clustering. Which property of single-cell data makes Louvain biologically more meaningful in this context?$

Real-World Case Studies Hard

A.

K-Means requires labeled data to initialize centroids, which is unavailable in single-cell discovery.

B.

Cells often differentiate along continuous trajectories (pseudotime) forming complex, non-Euclidean manifolds; Louvain on a KNN graph captures this topological connectivity rather than assuming spherical clusters.

C.

The Louvain algorithm naturally handles the interpretation of missing gene expression by imputing zero values during its modularity optimization step.

D.

Single-cell data lies in a low-dimensional Euclidean space where K-Means suffers from centroid collapse.

58 $When applying t-SNE to the MNIST dataset to visualize digit clusters, the 'perplexity' hyperparameter balances attention between local and global aspects of the data. If a researcher mistakenly sets the perplexity to be equal to $N$ (the total number of data points), what will the resulting visualization look like?$

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Hard

A.

It will degenerate into a single, uninformative, spherical blob of points with almost no local cluster structure preserved.

B.

It will produce 10 perfectly separated, infinitesimal points, one for each digit.

C.

It will exactly reproduce the output of the first two Principal Components of PCA.

D.

It will cause a division by zero error in the conditional probability distribution step, halting computation.

59 $You are performing customer segmentation using a Gaussian Mixture Model (GMM). Your dataset includes 'Age' and 'Annual Income'. Income is exponentially distributed and skewed, containing extreme outliers. If you use a full, untied covariance matrix for the GMM, what is the most severe mathematical risk you face during Expectation-Maximization (EM)?$

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Hard

A.

The algorithm will strictly enforce diagonal covariance matrices due to the exponential distribution of income.

B.

The covariance matrix of a cluster assigned to a single outlier could become singular (determinant approaches 0), causing the likelihood to approach infinity (a singularity).

C.

The posterior probabilities (responsibilities) will all collapse to exactly 0.5, halting the algorithm.

D.

The EM algorithm will perfectly fit a single Gaussian to the entire dataset, ignoring the clusters.

60 $When applying UMAP to the MNIST handwritten digits dataset, changing the distance metric from Euclidean to Cosine significantly alters the topological embedding. What underlying morphological property of the MNIST digits is fundamentally ignored by the Cosine distance compared to Euclidean distance?$

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Hard

A.

The rotational variance of the digits, because Cosine distance is perfectly rotation invariant.

B.

The spatial location of the pixels; Cosine distance treats the image as a bag-of-pixels.

C.

The negative pixel values, because Cosine distance requires strictly positive vectors.

D.

The total intensity/brightness (ink volume) of the digit, as Cosine distance normalizes the magnitude of the feature vectors.
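
How the two distance metrics treat vector magnitude and pixel position can be checked on toy vectors. The five-element "images" below are invented for the example and `cosine_distance` is a hypothetical helper; real MNIST images have 784 pixels, so this only illustrates the arithmetic.

```python
from math import dist, sqrt

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

faint = [0.0, 0.2, 0.4, 0.2, 0.0]     # a light stroke
bold = [2 * x for x in faint]         # same stroke with twice the intensity
shifted = [0.2, 0.4, 0.2, 0.0, 0.0]   # same stroke moved one pixel to the left

same_shape_cos = cosine_distance(faint, bold)
same_shape_euc = dist(faint, bold)
shifted_cos = cosine_distance(faint, shifted)
```

Doubling every pixel leaves the cosine distance at essentially zero while the Euclidean distance is clearly positive, whereas moving the stroke by one pixel changes the cosine distance noticeably.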

Unit 6 - Practice Quiz