Unit 6 - Practice Quiz

INT396 60 Questions

1 What is the primary difference between internal and external clustering validation metrics?

Internal and External Clustering Validation Metrics Easy
A. Both internal and external metrics require ground truth labels.
B. Internal metrics use ground truth labels, while external metrics do not.
C. External metrics use ground truth labels, while internal metrics do not.
D. Neither metric type uses ground truth labels.

2 Which of the following is a common example of an external clustering validation metric?

Internal and External Clustering Validation Metrics Easy
A. Adjusted Rand Index (ARI)
B. Within-Cluster Sum of Squares (WCSS)
C. Silhouette Score
D. Davies-Bouldin Index

3 Internal clustering validation metrics typically evaluate the quality of clusters based on which two properties?

Internal and External Clustering Validation Metrics Easy
A. Accuracy and Precision
B. Cohesion and Separation
C. True Positives and False Negatives
D. Stability and Randomness

4 What is the range of possible values for the Silhouette Score?

Silhouette Score and Cohesion–Separation Intuition Easy
A. $0$ to $100$
B. to
C. $-1$ to $1$
D. $0$ to $1$

5 In the context of clustering evaluation, what does "cohesion" measure?

Silhouette Score and Cohesion–Separation Intuition Easy
A. The accuracy of the labels assigned
B. The computation time of the algorithm
C. How distinct different clusters are from each other
D. How closely related objects are within the same cluster

6 What does a Silhouette Score near indicate about a specific data point?

Silhouette Score and Cohesion–Separation Intuition Easy
A. It has been assigned to the wrong cluster.
B. It is an outlier that should be deleted.
C. It is perfectly clustered.
D. It lies exactly on the boundary between two clusters.

7 What is the primary goal of stability-based evaluation in clustering?

Stability-Based Evaluation Easy
A. To measure how consistent the clustering results are when the data is perturbed or resampled
B. To measure the speed and memory usage of the clustering algorithm
C. To assign clear, human-readable labels to unlabelled data
D. To convert high-dimensional data into a 2D plot

8 Which statistical technique is commonly used to test cluster stability by repeatedly drawing random samples with replacement from the dataset?

Stability-Based Evaluation Easy
A. Linear Regression
B. Principal Component Analysis (PCA)
C. One-hot encoding
D. Bootstrapping

9 Why is interpretability often a major challenge in unsupervised learning?

Interpretability Challenges in Unsupervised Learning Easy
A. The datasets used are always too small to generalize.
B. There are no predefined ground truth labels to give context to the discovered patterns.
C. Unsupervised models only output binary $0$ or $1$ values.
D. The algorithms are generally too simple to produce meaningful results.

10 Which of the following best describes the role of "domain expertise" in unsupervised learning?

Interpretability Challenges in Unsupervised Learning Easy
A. The mathematical understanding of convergence proofs
B. The ability to write advanced Python code for neural networks
C. Choosing the cloud computing platform for the model
D. Using knowledge of a specific field (e.g., medicine or marketing) to assign meaning to clusters

11 In topic modeling (a form of unsupervised learning), why might a human struggle to interpret a discovered topic?

Interpretability Challenges in Unsupervised Learning Easy
A. Topic models only work on numerical image data.
B. The words grouped together might not form a coherent theme to a human reader.
C. The model forces all words to have the same frequency.
D. The algorithm always translates the words into a different language.

12 Which of the following is a classic real-world application of unsupervised anomaly detection?

Real-World Case Studies Easy
A. Credit card fraud detection
B. Translating a text document from English to French
C. Generating realistic human faces
D. Predicting tomorrow's exact stock market prices

13 Unsupervised learning is commonly used in recommendation systems to achieve which goal?

Real-World Case Studies Easy
A. Enforce password complexity rules
B. Discover latent groups of users with similar preferences
C. Calculate the exact shipping cost of an item
D. Determine the raw material cost of a product

14 In bioinformatics, clustering algorithms are frequently used for what purpose?

Real-World Case Studies Easy
A. Predicting a patient's exact time of death
B. Diagnosing broken medical equipment
C. Grouping genes with similar expression patterns
D. Designing hospital building layouts

15 What is a major ethical risk when unsupervised learning models discover patterns in demographic data?

Ethical Considerations in Pattern Discovery Easy
A. The model will permanently delete the original dataset.
B. The model might inadvertently discover and amplify historical biases or discrimination.
C. The model will encrypt the data so no one can read it.
D. The model will automatically charge users for data access.

16 Why is data privacy a significant concern in unsupervised pattern discovery?

Ethical Considerations in Pattern Discovery Easy
A. Unsupervised algorithms are designed to sell data to third parties.
B. Unsupervised models cannot be protected by standard passwords.
C. Clustering might infer sensitive underlying traits about individuals that were not explicitly stated.
D. Data privacy laws do not apply to machine learning.

17 An unsupervised algorithm clusters banking customers by zip code, which inadvertently groups them by race, leading to unfair loan rejections. This is an example of:

Ethical Considerations in Pattern Discovery Easy
A. Algorithmic bias and proxy discrimination
B. Perfect cluster cohesion
C. Optimal feature selection
D. Dimensionality expansion

18 Which dimensionality reduction technique is most famous for successfully visualizing the MNIST handwritten digit dataset in 2D or 3D spaces?

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Easy
A. K-Means Clustering
B. Linear Regression
C. t-SNE (t-Distributed Stochastic Neighbor Embedding)
D. Apriori Algorithm

19 In a customer segmentation case study, what do the resulting distinct clusters typically represent?

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Easy
A. The exact time of day a physical retail store should open
B. Groups of customers with similar purchasing behaviors or demographic profiles
C. The specific names and home addresses of the customers
D. Randomly distributed anomalies in a database system

20 When visually plotting the MNIST dataset using t-SNE, what does a dense cluster of points of the same color typically represent?

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Easy
A. Images that are completely unrelated to one another
B. The mathematical formula for converting text to speech
C. Images of the same handwritten digit that share structural similarities
D. Images that took the exact same amount of time for the user to draw

21 Which of the following metrics is most appropriate for evaluating a clustering algorithm when ground-truth labels are available, and you want to account for chance agreement?

Internal and External Clustering Validation Metrics Medium
A. Dunn Index
B. Adjusted Rand Index (ARI)
C. Silhouette Coefficient
D. Davies-Bouldin Index

22 The Dunn Index evaluates clustering quality based on cluster compactness and separation. It is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. How is the Dunn Index interpreted?

Internal and External Clustering Validation Metrics Medium
A. A Dunn Index close to zero indicates optimal clustering.
B. The Dunn Index must be exactly 1 for a perfectly stable cluster structure.
C. A higher Dunn Index indicates better clustering, meaning clusters are well-separated and compact.
D. A lower Dunn Index indicates better clustering because it minimizes .
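
As a rough illustration of the ratio described in this question, the sketch below computes one common variant of the Dunn Index: the smallest point-to-point distance between different clusters divided by the largest point-to-point distance inside any cluster. The function name `dunn_index` and the toy data are hypothetical, and other variants (for example, centroid-based distances) exist.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """clusters: list of clusters, each a list of coordinate tuples."""
    # Smallest distance between two points that belong to different clusters.
    min_inter = min(
        math.dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    )
    # Largest distance between two points in the same cluster (its diameter).
    max_intra = max(
        math.dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_inter / max_intra

# Hypothetical toy data: compact, far-apart clusters vs. wide, close clusters.
tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
loose = [[(0.0, 0.0), (0.0, 4.0)], [(5.0, 0.0), (5.0, 4.0)]]

print(dunn_index(tight), dunn_index(loose))
```

The compact, well-separated configuration gets the larger value. The sketch assumes at least two clusters and at least one cluster with two or more distinct points.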

23 When using the Davies-Bouldin (DB) Index to evaluate clustering performance, which of the following scenarios represents the best clustering result?

Internal and External Clustering Validation Metrics Medium
A. A DB index exactly equal to 1
B. The lowest possible positive DB index
C. A highly negative DB index
D. The highest possible positive DB index

24 Let $a$ be the mean intra-cluster distance for a sample, and $b$ be the mean nearest-cluster distance. The silhouette score is given by $s = \frac{b - a}{\max(a, b)}$. What does a score of indicate?

Silhouette Score and Cohesion–Separation Intuition Medium
A. The sample is misclassified and belongs to a different cluster.
B. The sample is perfectly clustered at the center of its own cluster.
C. The clustering algorithm failed to converge.
D. The sample is located on or very close to the decision boundary between two clusters.
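
To make the silhouette computation concrete, here is a minimal sketch for a single sample, assuming Euclidean distance. It takes the mean distance to the other members of the sample's own cluster and the smallest mean distance to any other cluster, then forms their normalized difference. The function name `silhouette_sample` and the 1-D toy points are hypothetical.

```python
import math

def silhouette_sample(point, own_cluster, other_clusters):
    """Silhouette value of `point`.

    own_cluster: the other members of the point's cluster (may be empty).
    other_clusters: list of the remaining clusters, each a list of points.
    """
    if not own_cluster:
        # Singleton cluster: the score is set to 0 by convention.
        return 0.0
    # Mean distance to the rest of the point's own cluster.
    a = sum(math.dist(point, p) for p in own_cluster) / len(own_cluster)
    # Mean distance to the nearest other cluster.
    b = min(
        sum(math.dist(point, p) for p in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (b - a) / max(a, b)

# Hypothetical 1-D examples.
well_inside = silhouette_sample((0.0,), [(1.0,)], [[(10.0,), (11.0,)]])
on_boundary = silhouette_sample((5.0,), [(0.0,)], [[(10.0,)]])

print(well_inside, on_boundary)
```

The first sample sits close to its own cluster and far from the other, so its value is close to 1. The second is equidistant from both clusters, so its value is 0. The sketch does not guard against the degenerate case where both mean distances are zero.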

25 In the context of cluster cohesion and separation, how does K-Means optimize these two intuitive properties?

Silhouette Score and Cohesion–Separation Intuition Medium
A. It maximizes the silhouette score at each iteration.
B. It minimizes the within-cluster sum of squares (cohesion), which mathematically maximizes the between-cluster sum of squares (separation) for a fixed dataset.
C. It explicitly maximizes separation without affecting cohesion.
D. It uses a penalty term to balance cohesion and separation equally during gradient descent.
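
The relationship between within-cluster and between-cluster sums of squares can be checked numerically. The sketch below, using hypothetical 1-D data and a hypothetical assignment into two clusters, verifies that the total sum of squares equals the within-cluster part plus the between-cluster part, so for a fixed dataset lowering one raises the other.

```python
from statistics import mean

# Hypothetical 1-D data already assigned to two clusters.
clusters = [[1.0, 2.0, 3.0], [10.0, 12.0]]
all_points = [x for c in clusters for x in c]
grand_mean = mean(all_points)

# Total sum of squares around the global mean.
tss = sum((x - grand_mean) ** 2 for x in all_points)
# Within-cluster sum of squares (cohesion).
wcss = sum((x - mean(c)) ** 2 for c in clusters for x in c)
# Between-cluster sum of squares (separation), weighted by cluster size.
bcss = sum(len(c) * (mean(c) - grand_mean) ** 2 for c in clusters)

print(tss, wcss + bcss)
```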

26 If a dataset yields a highly negative average Silhouette Score, what is the most likely geometric interpretation of the clusters?

Silhouette Score and Cohesion–Separation Intuition Medium
A. The clusters are perfectly spherical and well-separated.
B. The clusters are highly overlapping, and most data points have been assigned to the wrong clusters.
C. The number of clusters is perfectly optimal.
D. The clusters represent dense, arbitrary shapes similar to those found by DBSCAN.

27 Stability-based evaluation involves repeatedly clustering perturbed versions of the dataset. Which parameter is often determined using this method?

Stability-Based Evaluation Medium
A. The learning rate of the algorithm
B. The maximum number of iterations
C. The distance metric to be used
D. The optimal number of clusters, $k$

28 When comparing two clustering partitions produced during stability-based evaluation via bootstrapping, which metric is commonly used to quantify the agreement between the two partitions?

Stability-Based Evaluation Medium
A. Within-Cluster Sum of Squares (WCSS)
B. Principal Component Variance
C. Silhouette Score
D. Jaccard Coefficient
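
One way to compare two partitions of the same points is a pair-based Jaccard coefficient: count the pairs of points placed together in both partitions, and divide by the pairs placed together in at least one. The sketch below is a minimal version under that assumption. The function names and label lists are hypothetical, and in a bootstrap setting the two label lists would cover only the points common to both samples.

```python
from itertools import combinations

def co_clustered_pairs(labels):
    """Set of index pairs (i, j) that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }

def partition_jaccard(labels_a, labels_b):
    pairs_a = co_clustered_pairs(labels_a)
    pairs_b = co_clustered_pairs(labels_b)
    union = pairs_a | pairs_b
    # Two all-singleton partitions share no pairs; treat them as identical.
    return len(pairs_a & pairs_b) / len(union) if union else 1.0

run_1 = [0, 0, 0, 1, 1, 1]
run_2 = [1, 1, 1, 0, 0, 0]   # same grouping, label names swapped
run_3 = [0, 0, 1, 1, 2, 2]   # a genuinely different grouping

print(partition_jaccard(run_1, run_2), partition_jaccard(run_1, run_3))
```

Because the comparison works on pairs rather than label values, relabeling the clusters does not change the score.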

29 A data scientist applies stability-based evaluation by adding small amounts of Gaussian noise to the dataset. If the resulting clusters change drastically, what does this imply about the original clustering?

Stability-Based Evaluation Medium
A. The original clusters represent genuine, deep topological structures in the data.
B. The dataset has too few features for unsupervised learning.
C. The distance metric is too robust to outliers.
D. The clustering model is overfitting to the specific noise or outliers in the original dataset.

30 Why is interpreting the principal components generated by PCA often challenging in high-dimensional domains like genomics?

Interpretability Challenges in Unsupervised Learning Medium
A. PCA inherently introduces non-linear distortions to the data.
B. PCA components are completely random vectors.
C. PCA discards the features with the highest variance, removing important information.
D. Each principal component is a linear combination of potentially all original features, making it hard to assign a single semantic meaning.

31 When interpreting cluster centroids obtained from K-Means on a dataset with standardized features (mean 0, variance 1), what does a centroid value of $1.5$ for a specific feature signify?

Interpretability Challenges in Unsupervised Learning Medium
A. This feature contributed 1.5 times more to the clustering distance than other features.
B. The points in this cluster have an average value for this feature that is 1.5 standard deviations above the global mean.
C. The cluster spans a distance of 1.5 units across this feature.
D. The feature has an absolute value of 1.5 in the original raw data.
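
Reading a standardized centroid amounts to undoing the z-score transform: raw value = global mean + centroid value × global standard deviation. The sketch below works through this with hypothetical feature values and a hypothetical cluster assignment.

```python
from statistics import mean, pstdev

# Hypothetical raw values of one feature for five customers.
raw = [20.0, 30.0, 40.0, 50.0, 60.0]
mu, sigma = mean(raw), pstdev(raw)

# Standardize to mean 0 and variance 1.
z = [(x - mu) / sigma for x in raw]

# Suppose one cluster contains the last two customers (assumed assignment).
centroid_z = mean(z[3:])

# The centroid is `centroid_z` standard deviations above the global mean.
centroid_raw = mu + centroid_z * sigma

print(round(centroid_z, 3), centroid_raw)
```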

32 Which of the following describes a common approach to improving the interpretability of an autoencoder's latent space representation?

Interpretability Challenges in Unsupervised Learning Medium
A. Increasing the dimensionality of the latent space to exceed the input space.
B. Applying sparsity constraints (e.g., L1 regularization) to the latent activations.
C. Training the autoencoder without any reconstruction loss.
D. Using exclusively linear activation functions in all layers.

33 In a real-world anomaly detection case study for credit card fraud, an unsupervised Isolation Forest model is deployed. What is the most significant practical limitation of relying solely on this unsupervised approach?

Real-World Case Studies Medium
A. It requires a strictly linear relationship between transaction features.
B. It can only detect fraud types that have been explicitly labeled in the past.
C. It scales exponentially with the number of transactions, making it unusable in real-time.
D. It may flag rare but legitimate transactions as anomalies, leading to a high false-positive rate.

34 When performing topic modeling (e.g., using LDA) on a large corpus of news articles, how is the quality of the unsupervised topics usually evaluated in a real-world setting?

Real-World Case Studies Medium
A. By ensuring every document belongs to exactly one topic with a probability of 1.0.
B. Through human evaluation of topic coherence (e.g., seeing if the top words in a topic make semantic sense together).
C. By calculating the exact Silhouette score of the text embeddings.
D. By checking if the perplexity score is perfectly zero.

35 An unsupervised clustering algorithm groups job applicants based on resume text. It ends up creating a cluster that predominantly contains female applicants, despite 'gender' being removed from the data. What ethical issue does this highlight?

Ethical Considerations in Pattern Discovery Medium
A. The clustering model was under-fitted and requires more epochs.
B. Data privacy was violated during the text tokenization phase.
C. The algorithm failed to minimize the within-cluster variance.
D. The presence of proxy variables (e.g., women's colleges, specific clubs) implicitly captured the sensitive attribute.

36 Why is 'reinforcement of historical bias' a significant concern in Unsupervised Learning algorithms used for pattern discovery?

Ethical Considerations in Pattern Discovery Medium
A. Unsupervised models extract patterns inherent in the data; if the historical data reflects societal biases, the model will identify and potentially codify those biases as objective clusters.
B. Because the algorithms rely on labels, biased labels will immediately corrupt the model.
C. Unsupervised models require a perfectly uniform distribution of classes to function ethically.
D. Unsupervised algorithms are programmed to intentionally alter data distributions.

37 In the context of clustering user data for a targeted advertising system, which of the following poses the greatest risk to user privacy (deanonymization)?

Ethical Considerations in Pattern Discovery Medium
A. Using a small number of very large, general clusters.
B. Allowing the algorithm to form 'micro-clusters' consisting of only one or two individuals.
C. Standardizing the features to have a mean of zero.
D. Applying PCA to reduce the data from 100 dimensions to 10 dimensions before clustering.

38 When applying t-SNE to visualize the 784-dimensional MNIST handwritten digit dataset in 2D, a data scientist notices that the distance between the cluster of '0's and the cluster of '1's is very large. How should this distance be interpreted?

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Medium
A. It strictly indicates that '0's and '1's are the most visually dissimilar digits in the entire dataset.
B. t-SNE preserves global distances perfectly, so this represents the exact Euclidean distance in the 784D space.
C. t-SNE primarily preserves local neighborhood structures; global distances between distinct clusters in the 2D plot are not strictly meaningful or proportional to true distances.
D. The large distance implies that the perplexity parameter was set too low.

39 In a customer segmentation case study, a dataset contains 'Age' (ranging 18-80) and 'Annual Income' (ranging $20,000-$150,000). Before applying K-Means clustering, the data scientist forgets to scale the data. What is the most likely consequence?

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Medium
A. K-Means will fail to converge entirely.
B. The clusters will be determined almost entirely by 'Age' because its variance is mathematically harder to compute.
C. The clusters will be determined almost entirely by 'Annual Income', as its scale and variance are vastly larger, dominating the Euclidean distance.
D. The algorithm will automatically normalize the distances internally.
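
The effect of unscaled features on Euclidean distance can be seen with three hypothetical customers. In the sketch below, min-max scaling with the ranges quoted in the question is used as one possible remedy; standardization would be another.

```python
import math

# (age, annual income) for three hypothetical customers.
a = (25, 50_000)
b = (70, 50_500)   # very different age, almost the same income
c = (26, 60_000)   # almost the same age, different income

# Raw distances: the income axis dominates.
d_ab = math.dist(a, b)
d_ac = math.dist(a, c)

def scale(p):
    """Min-max scale using Age 18-80 and Income 20,000-150,000."""
    age, income = p
    return ((age - 18) / (80 - 18), (income - 20_000) / (150_000 - 20_000))

# Scaled distances: both features contribute on a comparable scale.
s_ab = math.dist(scale(a), scale(b))
s_ac = math.dist(scale(a), scale(c))

print(d_ab < d_ac, s_ab < s_ac)
```

On the raw data, customer `a` looks closer to `b` despite a 45-year age gap. After scaling, `a` is closer to `c`.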

40 When reducing the dimensionality of the MNIST dataset to 2D for visualization, a researcher compares PCA and UMAP. The PCA plot shows overlapping classes, while the UMAP plot shows clearly distinct islands for each digit. What explains this difference?

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Medium
A. UMAP utilizes supervised labels during its default projection, whereas PCA is purely unsupervised.
B. PCA removes the mean of the data, which destroys the structural information of images.
C. PCA attempts to maximize global variance linearly, which cannot capture the non-linear manifold of the digits, whereas UMAP captures non-linear local neighborhood relationships.
D. PCA is a non-linear technique, making it prone to overlapping classes, while UMAP is strictly linear.

41 Suppose you are evaluating a clustering algorithm using the Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI). The ground truth consists of roughly equal-sized clusters. The algorithm degenerates and places every single data point into its own individual cluster (i.e., $N$ clusters for $N$ points). How will the Homogeneity, Completeness, and ARI behave in this edge case?

Internal and External Clustering Validation Metrics Hard
A. Homogeneity = 1, Completeness approaches 0, ARI approaches -1
B. Homogeneity approaches 0, Completeness approaches 0, ARI approaches 0
C. Homogeneity = 1, Completeness approaches 0, ARI approaches 0
D. Homogeneity approaches 0, Completeness = 1, ARI = 0
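
The ARI part of this edge case can be checked with a small pair-counting sketch. It covers only ARI, not Homogeneity or Completeness. The function name `adjusted_rand_index` and the label lists are hypothetical, and the sketch does not handle the degenerate case where the denominator is zero.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    n = len(truth)
    # Pairs placed together in both labelings (from the contingency table).
    pairs_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs placed together in each labeling separately.
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred).values())
    # Agreement expected by chance, and the maximum attainable agreement.
    expected = pairs_truth * pairs_pred / comb(n, 2)
    max_index = (pairs_truth + pairs_pred) / 2
    return (pairs_both - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
singletons = list(range(len(truth)))   # every point in its own cluster

print(adjusted_rand_index(truth, singletons), adjusted_rand_index(truth, truth))
```

With all-singleton predictions no pair is ever grouped together, so both the observed and the chance-expected agreement are zero and the index is 0.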

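42 The Calinski-Harabasz (CH) Index is defined as $CH = \frac{SS_B / (k - 1)}{SS_W / (N - k)}$, where $SS_B$ is the between-cluster dispersion, $SS_W$ is the within-cluster dispersion, $k$ is the number of clusters, and $N$ is the number of points. What is the mathematical vulnerability of the CH Index when evaluating algorithms like DBSCAN that can produce an arbitrary number of clusters along with noise points, assuming noise points are assigned to a single 'noise' cluster?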

Internal and External Clustering Validation Metrics Hard
A. The CH index assumes spherical clusters and uses the global centroid; non-convex clusters or a widely dispersed 'noise' cluster will artificially inflate the within-cluster dispersion $SS_W$, severely dropping the CH score.
B. The CH index strictly monotonically increases as $k$ approaches $N$, favoring absolute fragmentation.
C. The inclusion of noise points inflates disproportionately, heavily penalizing dense, well-separated clusters.
D. The factor $\frac{N - k}{k - 1}$ causes the CH index to become negative when noise points exceed valid cluster points.

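43 Consider the Davies-Bouldin (DB) Index, defined as $DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{s_i + s_j}{d_{ij}}$, where $s_i$ is the scatter of cluster $i$ and $d_{ij}$ is the distance between the centroids of clusters $i$ and $j$. If you apply a clustering algorithm to a high-dimensional dataset where the distance metric suffers from the 'curse of dimensionality' (i.e., all pairwise distances converge to a similar value), what is the asymptotic behavior of the DB Index?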

Internal and External Clustering Validation Metrics Hard
A. It converges to a constant ratio dependent only on the arbitrary cluster assignment sizes, losing its discriminative power.
B. It approaches infinity because the centroid distances ($d_{ij}$) converge to 0.
C. It becomes exactly 1 for all possible cluster assignments regardless of the data distribution.
D. It converges to 0 because the scatter ($s_i$) approaches 0 in high dimensions.

44 Which of the following scenarios describes a theoretical flaw when using the Fowlkes-Mallows Index (FMI) to compare two clusterings of highly imbalanced ground truth data?

Internal and External Clustering Validation Metrics Hard
A. FMI converges to the Jaccard Index as the cluster sizes become increasingly imbalanced.
B. FMI is unaffected by true negatives, meaning it completely ignores the vast majority of point pairs that correctly do not belong to the same cluster.
C. FMI strictly requires an equal number of predicted clusters and ground truth clusters to be mathematically defined.
D. FMI penalizes false negatives more than false positives, causing it to favor over-segmentation.
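
The Fowlkes-Mallows Index is computed from pair counts as $\frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$. The sketch below counts all four pair types for two hypothetical labelings and shows that the true-negative count never enters the formula. The function names are illustrative, and the case where a labeling has no co-clustered pairs (zero denominator) is not handled.

```python
from itertools import combinations
from math import sqrt

def pair_counts(truth, pred):
    """Count point pairs as TP, FP, FN, TN with respect to co-membership."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def fowlkes_mallows(truth, pred):
    tp, fp, fn, _tn = pair_counts(truth, pred)   # TN is counted but unused
    return tp / sqrt((tp + fp) * (tp + fn))

truth = [0, 0, 0, 0, 1, 1]
pred = [0, 0, 1, 1, 2, 2]

print(pair_counts(truth, pred), fowlkes_mallows(truth, pred))
```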

45 The Silhouette coefficient for a data point $i$ is $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$. Suppose you cluster a dataset consisting of two perfectly concentric circles using DBSCAN, which correctly identifies the inner circle as Cluster 1 and the outer circle as Cluster 2. What will be the general characteristic of the Silhouette scores for the points in Cluster 2 (the outer circle)?

Silhouette Score and Cohesion–Separation Intuition Hard
A. They will be close to +1 because the clusters are perfectly separated topologically.
B. They will be undefined because DBSCAN does not use centroids for cluster assignment.
C. They will be close to 0 or negative because the mean distance to points in the inner circle ($b(i)$) can be smaller than the mean distance to points across the outer circle ($a(i)$).
D. They will fluctuate uniformly between -1 and +1 depending strictly on the density parameter $\epsilon$.

46 A researcher is optimizing the hyperparameter $k$ in K-Means by maximizing the average Silhouette score. The dataset fundamentally consists of three clusters: one highly dense spherical cluster of 10,000 points, and two sparse, elongated clusters of 100 points each. How might Silhouette maximization mislead the researcher?

Silhouette Score and Cohesion–Separation Intuition Hard
A. It will likely choose $k = 2$ by grouping the two sparse clusters together to minimize intra-cluster distance penalties associated with elongated shapes.
B. It will prefer $k = 3$ but force the sparse clusters to merge, leaving one cluster empty.
C. It will bias towards splitting the massive dense cluster into multiple sub-clusters because maximizing the global average silhouette heavily weights the dense cluster's internal cohesion.
D. It will inherently fail to compute because Silhouette cannot handle clusters with differing sample sizes.

47 By convention, if a cluster contains only a single data point (a singleton), its Silhouette score is set to 0. If this convention were instead evaluated mathematically using the standard formula without overriding, what logical paradox would occur?

Silhouette Score and Cohesion–Separation Intuition Hard
A. The term $a(i)$ (mean intra-cluster distance) would be undefined or zero, causing division by zero if $b(i)$ is also zero, or incorrectly evaluating to $s(i) = 1$.
B. The equation would perfectly compute to 0 naturally without needing a convention.
C. The term $b(i)$ would equal 0, making the numerator negative and yielding $s(i) = -1$.
D. The denominator would become negative, invalidating the metric.

48 When using stability-based evaluation to determine the optimal number of clusters $k$, you repeatedly subsample the data and measure the agreement (e.g., using Adjusted Rand Index) between the clusterings. In a dataset drawn from a completely uniform distribution with no true clusters, what is the expected behavior of the stability curve as $k$ increases from 2 upward?

Stability-Based Evaluation Hard
A. Stability will be consistently low (near 0) across all $k$ because the arbitrary cluster boundaries will shift wildly with different subsamples.
B. Stability will linearly increase as $k$ grows, eventually reaching 1.0.
C. Stability will remain near 1.0 for all $k$, indicating that uniform data is perfectly stable.
D. Stability will oscillate predictably between -1 and 1 depending on whether $k$ is even or odd.

49 A modeler applies a stability-based method to evaluate a K-Means clustering model. They bootstrap the dataset multiple times, cluster each sample, and calculate the pairwise Jaccard coefficient of the cluster assignments. Why might bootstrapping introduce a pessimistic bias (underestimating true stability) compared to subsampling without replacement in this specific context?

Stability-Based Evaluation Hard
A. Bootstrapping changes the total number of points in each sample, making Jaccard coefficients mathematically impossible to compute.
B. Bootstrapping ensures every original point appears exactly once across all samples, preventing proper variance estimation.
C. Bootstrapping creates duplicate data points, which shifts K-Means centroids toward dense duplicated regions and alters boundaries more drastically than mere subsetting.
D. Bootstrapping inherently reduces the dimensionality of the dataset, distorting the distance metric.

50 Consider evaluating cluster stability via a Prediction Strength metric. The dataset is split into training and test sets; clusters are found on both. Test points are then assigned to the nearest training cluster centroid. What represents a critical failure mode of this specific evaluation strategy when applied to clusters with highly irregular, non-convex shapes?

Stability-Based Evaluation Hard
A. Prediction strength requires computing the determinant of the covariance matrix, which is singular for non-convex shapes.
B. The test set will always contain out-of-distribution points, making prediction strength naturally inflate to 1.
C. Prediction strength assumes clusters are best represented by global centroids; nearest-centroid assignment will incorrectly classify test points of non-convex clusters, yielding falsely low stability.
D. Non-convex clusters always have overlapping training and test subsets, violating the independence assumption of the metric.

51 To improve interpretability in generative unsupervised models, researchers often use a $\beta$-VAE to enforce disentangled representations. Mathematically, this is achieved by scaling the Kullback-Leibler (KL) divergence term in the ELBO by a factor $\beta > 1$. What is the primary theoretical trade-off encountered when enforcing this interpretability constraint?

Interpretability Challenges in Unsupervised Learning Hard
A. It severely degrades the reconstruction quality (the likelihood term) because the model is forced to prioritize matching an isotropic Gaussian prior over capturing complex data variance.
B. It converts the unsupervised learning problem into a supervised one, requiring labeled data for convergence.
C. It forces the latent space to become highly correlated, leading to mode collapse.
D. It increases the dimensionality of the latent space to infinity, causing the 'curse of dimensionality'.

52 A common post-hoc method to interpret a black-box clustering algorithm is to train a surrogate decision tree predicting the cluster labels from the input features. If the underlying clustering algorithm is Spectral Clustering applied to concentric rings (a non-linear manifold), what is the most likely interpretability challenge faced by the surrogate decision tree?

Interpretability Challenges in Unsupervised Learning Hard
A. The decision tree will require an impractically deep structure with many orthogonal axis-aligned splits to approximate the circular boundaries, reducing human interpretability and risking poor fidelity.
B. The tree will perfectly capture the eigenvectors, forcing the user to interpret Laplacian matrices rather than original features.
C. The decision tree will achieve 100% fidelity but will have only two leaf nodes, providing no useful information.
D. Spectral Clustering outputs categorical cluster centers, which cannot be used as target labels for a standard decision tree.

53 In Principal Component Analysis (PCA), the principal components are linear combinations of the original features, which aids interpretability via 'loadings'. In contrast, a deep Autoencoder with non-linear activation functions typically lacks this interpretability. Which mathematical property strictly present in PCA is absent in standard Autoencoders, making the latter's latent space harder to interpret?

Interpretability Challenges in Unsupervised Learning Hard
A. The use of a bottleneck layer to compress information.
B. Minimization of reconstruction error.
C. The differentiability of the latent variables with respect to the input.
D. Strict orthogonality and hierarchical variance maximization of the latent dimensions.

54 An unsupervised model is used to segment neighborhoods for targeted marketing. To ensure fairness, the data scientists explicitly remove 'Race' and 'Income' from the dataset. However, an external audit reveals the clusters are still highly correlated with race. Which fundamental phenomenon of unsupervised learning causes this ethical failure?

Ethical Considerations in Pattern Discovery Hard
A. Simpson's Paradox, where trends appear in different groups of data but disappear when combined.
B. Redundant Encoding (or Proxy Variables), where remaining features like 'Zip Code' or 'Purchasing Habits' perfectly reconstruct the omitted protected attributes.
C. Mode collapse, where the clustering algorithm ignores all features except the one with the highest variance.
D. The curse of dimensionality, which mathematically biases Euclidean distances towards minority groups.

55 To address fairness in clustering, algorithms like 'Fair K-Means' introduce constraints into the objective function. If $C_j$ is the set of points in cluster $j$, $P$ is a protected demographic group, and $N$ is the total dataset size, which constraint represents the concept of 'Disparate Impact' mitigation (demographic parity) in Fair K-Means?

Ethical Considerations in Pattern Discovery Hard
A.
B.
C.
D.

56 In a real-world cybersecurity application for anomaly detection, an Isolation Forest is chosen over distance-based methods like K-Nearest Neighbors (KNN). Given a dataset with a large number of features where normal traffic forms a dense hyper-sphere and anomalies are sparsely distributed, what is the theoretical justification for this choice?

Real-World Case Studies Hard
A. KNN requires normalized data to function, whereas network traffic data is strictly categorical.
B. Distance-based methods fail because the ratio of the distance to the nearest neighbor over the distance to the farthest neighbor approaches 1 in high dimensions, making anomalies indistinguishable from normal points.
C. Isolation Forests project the data into a 2D space using eigenvectors, inherently filtering out high-dimensional noise.
D. Isolation Forests compute pairwise Euclidean distances in time, bypassing the computational cost of KNN in high dimensions.

57 When analyzing single-cell RNA sequencing data (a real-world clustering application), researchers frequently utilize Louvain community detection on a K-Nearest Neighbor (KNN) graph rather than K-Means clustering. Which property of single-cell data makes Louvain biologically more meaningful in this context?

Real-World Case Studies Hard
A. K-Means requires labeled data to initialize centroids, which is unavailable in single-cell discovery.
B. Cells often differentiate along continuous trajectories (pseudotime) forming complex, non-Euclidean manifolds; Louvain on a KNN graph captures this topological connectivity rather than assuming spherical clusters.
C. The Louvain algorithm naturally handles the interpretation of missing gene expression by imputing zero values during its modularity optimization step.
D. Single-cell data lies in a low-dimensional Euclidean space where K-Means suffers from centroid collapse.

58 When applying t-SNE to the MNIST dataset to visualize digit clusters, the 'perplexity' hyperparameter balances attention between local and global aspects of the data. If a researcher mistakenly sets the perplexity to be equal to $N$ (the total number of data points), what will the resulting visualization look like?

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Hard
A. It will exactly reproduce the output of the first two Principal Components of PCA.
B. It will cause a division by zero error in the conditional probability distribution step, halting computation.
C. It will produce 10 perfectly separated, infinitesimal points, one for each digit.
D. It will degenerate into a single, uninformative, spherical blob of points with almost no local cluster structure preserved.

59 You are performing customer segmentation using a Gaussian Mixture Model (GMM). Your dataset includes 'Age' and 'Annual Income'. Income is exponentially distributed and skewed, containing extreme outliers. If you use a full, untied covariance matrix for the GMM, what is the most severe mathematical risk you face during Expectation-Maximization (EM)?

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Hard
A. The covariance matrix of a cluster assigned to a single outlier could become singular (determinant approaches 0), causing the likelihood to approach infinity (a singularity).
B. The algorithm will strictly enforce diagonal covariance matrices due to the exponential distribution of income.
C. The EM algorithm will perfectly fit a single Gaussian to the entire dataset, ignoring the clusters.
D. The posterior probabilities (responsibilities) will all collapse to exactly 0.5, halting the algorithm.
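
The singularity risk in this question can be illustrated in one dimension, which is a deliberate simplification of the full-covariance case. If a Gaussian component is centered exactly on a single outlier and its variance shrinks, the log-density at that point grows without bound. The function name and the values below are hypothetical.

```python
import math

def gaussian_log_density(x, mean, variance):
    """Log of the 1-D normal density N(x | mean, variance)."""
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)

outlier = 1_000_000.0   # hypothetical extreme income value

# Component centered on the outlier, with ever smaller variance.
variances = (1.0, 1e-3, 1e-6, 1e-9)
log_densities = [gaussian_log_density(outlier, outlier, v) for v in variances]

print(log_densities)
```

Each reduction in variance raises the log-density, which is why practical implementations typically add a small regularization term to the covariance.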

60 When applying UMAP to the MNIST handwritten digits dataset, changing the distance metric from Euclidean to Cosine significantly alters the topological embedding. What underlying morphological property of the MNIST digits is fundamentally ignored by the Cosine distance compared to Euclidean distance?

Case Study: Visualizing handwritten digits (MNIST) or customer segmentation data Hard
A. The total intensity/brightness (ink volume) of the digit, as Cosine distance normalizes the magnitude of the feature vectors.
B. The negative pixel values, because Cosine distance requires strictly positive vectors.
C. The rotational variance of the digits, because Cosine distance is perfectly rotation invariant.
D. The spatial location of the pixels; Cosine distance treats the image as a bag-of-pixels.
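
The difference between the two metrics can be seen on a pair of hypothetical flattened images that share the same stroke pattern but differ in overall intensity. The sketch below uses made-up four-pixel vectors and a hand-written `cosine_distance` helper rather than any particular library's implementation.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

faint = [0.0, 0.2, 0.4, 0.2]      # hypothetical pixel intensities
bold = [2 * x for x in faint]     # same pattern, twice the intensity

cos_d = cosine_distance(faint, bold)
euc_d = math.dist(faint, bold)

print(cos_d, euc_d)
```

Cosine distance treats the two vectors as essentially identical because only their direction matters, while Euclidean distance registers the difference in magnitude.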