1Which of the following best describes the agglomerative approach to hierarchical clustering?
Hierarchical clustering: Agglomerative vs. Divisive
Easy
A.A density-based approach that identifies core points.
B.A bottom-up approach that starts with each data point as its own cluster.
C.A centroid-based approach that requires specifying $k$ in advance.
D.A top-down approach that starts with all data points in a single cluster.
Correct Answer: A bottom-up approach that starts with each data point as its own cluster.
Explanation:
Agglomerative clustering is a bottom-up approach where each data point starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
2Divisive hierarchical clustering operates by:
Hierarchical clustering: Agglomerative vs. Divisive
Easy
A.Starting with all points in one cluster and recursively splitting them.
B.Merging the closest pairs of clusters sequentially.
C.Identifying dense regions separated by sparse regions.
D.Assigning points to the nearest centroid.
Correct Answer: Starting with all points in one cluster and recursively splitting them.
Explanation:
Divisive clustering is a top-down method. It begins with all data points in a single cluster and iteratively splits clusters until each point forms its own cluster.
3In the initial step of agglomerative hierarchical clustering on a dataset with $n$ points, how many clusters are there?
Hierarchical clustering: Agglomerative vs. Divisive
Easy
A.$1$
B.
C.
D.
Correct Answer: $n$
Explanation:
In agglomerative clustering, every individual data point initially represents its own cluster, resulting in $n$ clusters for $n$ points.
4Which linkage method defines the distance between two clusters as the shortest distance between any two points in the clusters?
Single linkage computes the distance between two clusters based on the minimum distance between any single point in the first cluster and any single point in the second.
5Average linkage is based on which of the following metrics?
A.It calculates the average of distances between all pairs of points across two clusters.
B.It uses the closest pair of points.
C.It explicitly minimizes the sum of squared errors.
D.It uses the furthest pair of points.
Correct Answer: It calculates the average of distances between all pairs of points across two clusters.
Explanation:
Average linkage computes the distance between two clusters as the average of all pairwise distances between points in cluster A and points in cluster B.
8Which linkage method is most prone to the 'chaining' effect, where clusters end up being long and uncompact?
A.It can discover clusters of arbitrary shapes.
B.It requires the user to specify the number of clusters in advance.
C.It forces every single point into a cluster.
D.It only works well with linearly separable data.
Correct Answer: It can discover clusters of arbitrary shapes.
Explanation:
Because it groups points based on dense regions separated by sparse areas, DBSCAN is highly effective at finding non-spherical, arbitrarily shaped clusters.
14In DBSCAN, what defines a 'core point'?
e-neighborhood, MinPts, Noise and border points
Easy
A.A point that has zero neighbors.
B.A point that belongs to multiple clusters.
C.A point that has at least MinPts points within its $\epsilon$-neighborhood.
D.A point that is the furthest from the center of a cluster.
Correct Answer: A point that has at least MinPts points within its $\epsilon$-neighborhood.
Explanation:
A core point is a data point situated in a dense region, meaning it has a minimum number of points (MinPts) within a specific radius ($\epsilon$).
15How does DBSCAN classify a 'border point'?
e-neighborhood, MinPts, Noise and border points
Easy
A.It does not belong to any $\epsilon$-neighborhood.
B.It has more than MinPts points within its $\epsilon$-neighborhood.
C.It is randomly assigned to a cluster boundary.
D.It has fewer than MinPts points within its $\epsilon$-neighborhood, but falls within the neighborhood of a core point.
Correct Answer: It has fewer than MinPts points within its $\epsilon$-neighborhood, but falls within the neighborhood of a core point.
Explanation:
A border point is not dense enough to be a core point itself, but it is reachable from a core point because it lies within that core point's $\epsilon$-radius.
16What is a 'noise point' in the context of DBSCAN?
e-neighborhood, MinPts, Noise and border points
Easy
A.A point that is neither a core point nor a border point.
B.A point that is the exact center of a cluster.
C.A point that connects two distinct clusters.
D.A point that satisfies the MinPts condition perfectly.
Correct Answer: A point that is neither a core point nor a border point.
Explanation:
Noise points (or outliers) in DBSCAN are points that do not have enough neighbors to be core points and do not fall within the neighborhood of any core point.
17In DBSCAN, what does the parameter $\epsilon$ (epsilon) represent?
e-neighborhood, MinPts, Noise and border points
Easy
A.The threshold for hierarchical cluster merging.
B.The maximum distance (radius) used to define the neighborhood of a point.
C.The total number of clusters to form.
D.The minimum number of points required to form a cluster.
Correct Answer: The maximum distance (radius) used to define the neighborhood of a point.
Explanation:
$\epsilon$ (epsilon) specifies the radius of the neighborhood around a given point used to calculate local density.
18Unlike Hierarchical clustering and DBSCAN, what must be provided as an explicit input parameter to the standard k-Means algorithm?
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Easy
A.The linkage criteria
B.The radius $\epsilon$
C.The number of clusters $k$
D.The minimum number of points (MinPts)
Correct Answer: The number of clusters $k$
Explanation:
Standard k-Means requires the user to pre-define $k$, the number of clusters to be formed, before the algorithm runs.
19Which clustering algorithm inherently identifies outliers and explicitly leaves them unclustered?
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Easy
A.k-Means
B.Ward's Hierarchical
C.DBSCAN
D.Agglomerative Hierarchical
Correct Answer: DBSCAN
Explanation:
DBSCAN categorizes data points into core, border, and noise. Noise points are identified as outliers and are not assigned to any cluster.
20Which clustering method generates a tree-like hierarchy of clusters that does not require an initial assumption about the number of clusters?
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Easy
A.k-Means
B.Hierarchical clustering
C.K-Medoids
D.DBSCAN
Correct Answer: Hierarchical clustering
Explanation:
Hierarchical clustering builds a hierarchy (represented by a dendrogram) of cluster merges or splits, allowing the number of clusters to be chosen after the tree is built.
21In the context of hierarchical clustering on a dataset with $n$ points, which of the following accurately describes the initial and final states of the Divisive approach?
Hierarchical clustering: Agglomerative vs. Divisive
Medium
A.Initial state: $n$ clusters containing 1 point each; Final state: 1 cluster containing $n$ points
B.Initial state: $k$ clusters; Final state: $n$ clusters containing 1 point each
C.Initial state: 1 cluster containing $n$ points; Final state: $n$ clusters containing 1 point each
D.Initial state: 1 cluster containing $n$ points; Final state: $k$ clusters based on density
Correct Answer: Initial state: 1 cluster containing $n$ points; Final state: $n$ clusters containing 1 point each
Explanation:
Divisive hierarchical clustering is a top-down approach. It starts with all data points in a single macro-cluster and recursively splits them until each point forms its own individual cluster.
22Which linkage method defines the distance between two clusters as the maximum distance between any single point in the first cluster and any single point in the second cluster, tending to produce tightly bound, spherical clusters?
Linkage methods (single, complete, average, Ward)
Medium
A.Average linkage
B.Single linkage
C.Ward's linkage
D.Complete linkage
Correct Answer: Complete linkage
Explanation:
Complete linkage uses the maximum distance between points in two clusters to define cluster distance. This avoids the chaining effect seen in single linkage and tends to produce highly compact clusters.
23An analyst notices that their hierarchical clustering results suffer from the "chaining effect," where clusters are stretched out into long, thin bands. Which linkage method was most likely used?
Linkage methods (single, complete, average, Ward)
Medium
A.Ward's method
B.Complete linkage
C.Average linkage
D.Single linkage
Correct Answer: Single linkage
Explanation:
Single linkage defines cluster distance based on the closest pair of points between two clusters. This can cause clusters to merge if just two points are close, leading to long, elongated "chains" of data points.
24Unlike single, complete, and average linkage which rely strictly on pairwise distances between points, Ward's method decides which clusters to merge by minimizing what metric?
Linkage methods (single, complete, average, Ward)
Medium
A.The between-cluster sum of squared errors
B.The maximum distance between cluster centroids
C.The within-cluster sum of squared errors (WCSS)
D.The median distance between all cluster points
Correct Answer: The within-cluster sum of squared errors (WCSS)
Explanation:
Ward's method evaluates the total within-cluster variance. At each step, it merges the two clusters that result in the smallest increase in the total within-cluster sum of squares.
25When interpreting a dendrogram generated by agglomerative clustering, what does the vertical height (on the y-axis) of a horizontal merge line represent?
Dendrogram interpretation
Medium
A.The variance explained by the merging of the two clusters
B.The distance or dissimilarity threshold at which the two clusters were merged
C.The number of data points in the newly formed cluster
D.The density of the newly formed cluster
Correct Answer: The distance or dissimilarity threshold at which the two clusters were merged
Explanation:
In a standard dendrogram, the y-axis represents the distance (or dissimilarity measure). A horizontal line connecting two branches indicates the distance threshold at which those two clusters were deemed similar enough to merge.
26If you draw a horizontal cut line across a dendrogram at a specific height $h$, how is the number of resulting clusters determined?
Dendrogram interpretation
Medium
A.By summing the data points below height $h$
B.By calculating the ratio of total height to $h$
C.By counting the number of vertical lines that intersect the horizontal cut line
D.By counting the number of horizontal merge lines exactly at height $h$
Correct Answer: By counting the number of vertical lines that intersect the horizontal cut line
Explanation:
Cutting a dendrogram with a horizontal line creates a partition of the data. Each vertical line that intersects this cut line represents a distinct cluster at that specific distance threshold.
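The cut rule can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses 1-D points, single linkage, and hypothetical helper names (`single_linkage_merge_heights`, `clusters_at_cut`), none of which come from the quiz. For a linkage with monotone merge heights, the number of vertical lines crossed by a cut at height $h$ equals the number of points minus the number of merges at or below $h$.

```python
def single_linkage_merge_heights(points):
    """Naive agglomerative single linkage on 1-D points.

    Returns the merge heights in the order the merges happen.
    """
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: closest pair across the two clusters.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
    return heights


def clusters_at_cut(heights, n_points, h):
    """Number of clusters left when the dendrogram is cut at height h."""
    merges_at_or_below = sum(1 for m in heights if m <= h)
    return n_points - merges_at_or_below


# Hypothetical data: two tight groups and one distant point.
points = [1.0, 1.5, 2.0, 10.0, 10.4, 25.0]
heights = single_linkage_merge_heights(points)

for h in (0.1, 3.0, 10.0, 20.0):
    print(h, clusters_at_cut(heights, len(points), h))
```

Raising the cut height can only keep or reduce the cluster count, which is why a cut placed inside a tall gap between consecutive merge heights gives the same partition anywhere in that gap.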
27In the DBSCAN algorithm, what is the primary condition for a data point to be classified as a "core point"?
Density-based clustering: DBSCAN fundamentals
Medium
A.It must have a distance less than $\epsilon$ to all other points in the dataset.
B.It must be reachable from another core point and have fewer than MinPts neighbors.
C.It must have at least MinPts points (including itself) within its $\epsilon$-neighborhood.
D.It must be exactly in the geometric center of a cluster.
Correct Answer: It must have at least MinPts points (including itself) within its $\epsilon$-neighborhood.
Explanation:
A core point in DBSCAN is defined strictly by density: the $\epsilon$-neighborhood surrounding the point must contain at least a specified minimum number of points, denoted as MinPts.
28Consider a point $p$ in a dataset analyzed via DBSCAN. Point $p$ has fewer than MinPts points in its $\epsilon$-neighborhood, but it falls within the $\epsilon$-neighborhood of a core point $q$. How will DBSCAN classify point $p$?
Noise and border points
Medium
A.As a core point
B.As an outlier
C.As a noise point
D.As a border point
Correct Answer: As a border point
Explanation:
A border point does not have enough points in its own neighborhood to be a core point, but it is reachable from (i.e., falls within the $\epsilon$-neighborhood of) a core point. Thus, it becomes part of the cluster.
29If DBSCAN finishes running and a point is neither a core point nor reachable from any core point in the dataset, what is its final designation?
Noise and border points
Medium
A.Noise point
B.Border point
C.Centroid
D.Isolated core point
Correct Answer: Noise point
Explanation:
Points that are not core points and do not fall within the $\epsilon$-neighborhood of any core point are considered isolated in low-density regions and are classified as noise (or outliers).
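The three designations (core, border, noise) follow mechanically from neighborhood counts. Here is a minimal sketch, assuming Euclidean distance, neighborhood counts that include the point itself, and a hypothetical helper `classify_points`; it labels points only and does not build clusters, so it is not a full DBSCAN implementation.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    A point's eps-neighborhood includes the point itself.
    """
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = []
    for i, nb in enumerate(neighbors):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in nb):
            # Not dense enough itself, but inside a core point's neighborhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Hypothetical data: a small square, one point just beside it, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.2, 0.5), (5.0, 5.0)]

labels = classify_points(points, eps=0.8, min_pts=4)
print(labels)

# Shrinking eps with min_pts fixed removes core points in this example.
labels_small_eps = classify_points(points, eps=0.6, min_pts=4)
print(labels_small_eps)
```

In this toy dataset the smaller radius no longer covers the diagonal of the square, so no point reaches the `min_pts` threshold and every point ends up as noise.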
30If you decrease the value of $\epsilon$ (epsilon) while keeping MinPts constant in DBSCAN, what is the most likely effect on the clustering output?
e-neighborhood, MinPts
Medium
A.Clusters will merge together into larger, single clusters.
B.More points will be classified as noise, and existing clusters may split.
C.Fewer points will be classified as noise.
D.The number of core points will increase.
Correct Answer: More points will be classified as noise, and existing clusters may split.
Explanation:
Decreasing $\epsilon$ shrinks the neighborhood size. This makes it harder for points to satisfy the MinPts condition, reducing the number of core points, breaking up clusters, and increasing the number of noise points.
31You are given a dataset containing two concentric ring-shaped clusters. Which clustering algorithm is best suited to correctly identify these two non-linear clusters?
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Medium
A.Divisive clustering with complete linkage
B.Agglomerative clustering with Ward's linkage
C.k-Means clustering
D.DBSCAN
Correct Answer: DBSCAN
Explanation:
DBSCAN can find arbitrarily shaped clusters (like concentric rings) because it forms clusters based on local density connectivity. k-Means and Ward's method favor spherical, convex clusters and would fail here.
32Which of the following statements accurately compares k-Means and DBSCAN regarding their handling of outliers?
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Medium
A.DBSCAN incorporates outliers into the nearest border points, whereas k-Means drops them from the dataset.
B.Both algorithms are highly robust to outliers because they use median distances.
C.k-Means forces all points into clusters, shifting centroids due to outliers, whereas DBSCAN ignores isolated points as noise.
D.Both algorithms assign a specific "noise" label to outliers.
Correct Answer: k-Means forces all points into clusters, shifting centroids due to outliers, whereas DBSCAN ignores isolated points as noise.
Explanation:
k-Means assigns every single point to a cluster, meaning outliers can skew the centroid calculations. DBSCAN explicitly classifies points in low-density regions as noise, entirely excluding them from clusters.
33When comparing the time complexity of clustering algorithms for a large dataset of $n$ points, which of the following is generally true?
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Medium
A.k-Means is generally faster than standard agglomerative hierarchical clustering.
B.Divisive clustering is always faster than k-Means because it splits data linearly.
C.Standard agglomerative hierarchical clustering is generally faster than k-Means.
D.DBSCAN is always the slowest algorithm regardless of indexing.
Correct Answer: k-Means is generally faster than standard agglomerative hierarchical clustering.
Explanation:
k-Means has a near-linear time complexity relative to the number of points, making it highly scalable. Standard hierarchical clustering requires computing a pairwise distance matrix and updating it after every merge, resulting in a time complexity of $O(n^2 \log n)$ or $O(n^3)$.
34In DBSCAN, what happens to the clustering model if MinPts is set to 1?
e-neighborhood, MinPts
Medium
A.Every point will be classified as a core point, and points within $\epsilon$ of each other will form clusters.
B.Every point will be classified as noise regardless of the $\epsilon$ value.
C.The algorithm will behave exactly like k-Means.
D.The algorithm will fail to execute due to a division by zero error.
Correct Answer: Every point will be classified as a core point, and points within $\epsilon$ of each other will form clusters.
Explanation:
If MinPts = 1, every single point satisfies the core point condition (since a point includes itself in its neighborhood). Consequently, any two points within distance $\epsilon$ of each other will merge into the same cluster, turning DBSCAN into a simple single-linkage component algorithm.
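With MinPts = 1 the clusters reduce to connected components of the graph that links points at distance at most $\epsilon$. The following sketch illustrates that special case only; the function name `eps_components` and the sample points are hypothetical, and Euclidean distance is assumed.

```python
from math import dist


def eps_components(points, eps):
    """Connected components of the graph linking points at distance <= eps.

    Returns one integer label per point. With MinPts = 1 every point is a
    core point, so these components coincide with the DBSCAN clusters.
    """
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j, q in enumerate(points):
                if labels[j] == -1 and dist(points[i], q) <= eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical data: a chain of three points and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
labels = eps_components(points, eps=1.0)
print(labels)
```

The isolated point receives its own label rather than a noise label, matching the statement that no point can be noise when MinPts = 1.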
35Which of the following is a major drawback of standard Agglomerative Hierarchical Clustering?
Hierarchical clustering: Agglomerative vs. Divisive
Medium
A.Once a merge is performed, it cannot be undone in subsequent steps.
B.It can only identify spherical clusters.
C.It randomly initializes centroids, leading to non-deterministic results.
D.It requires the user to specify the number of clusters before running the algorithm.
Correct Answer: Once a merge is performed, it cannot be undone in subsequent steps.
Explanation:
Agglomerative clustering is a greedy algorithm. Once two clusters are merged at a particular step, that decision is permanent for the remainder of the hierarchy building process.
36Which algorithm does NOT require the user to explicitly declare the desired number of clusters beforehand, but instead infers the number of clusters from the data's properties?
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Medium
A.DBSCAN
B.k-Means
C.k-Medoids
D.Spectral Clustering
Correct Answer: DBSCAN
Explanation:
Unlike k-Means or k-Medoids, which require the hyperparameter $k$, DBSCAN automatically determines the number of clusters based on the density parameters $\epsilon$ and MinPts.
37You observe a dendrogram where the longest vertical branches without any horizontal merges occur between a height of 10 and 25. What does this gap suggest about the dataset?
Dendrogram interpretation
Medium
A.The algorithm encountered a local minimum between heights 10 and 25.
B.The clusters formed above height 25 are completely identical to those below height 10.
C.The data has a high amount of noise points.
D.Cutting the dendrogram anywhere between 10 and 25 will yield a highly stable and natural clustering partition.
Correct Answer: Cutting the dendrogram anywhere between 10 and 25 will yield a highly stable and natural clustering partition.
Explanation:
A large vertical gap in a dendrogram indicates a significant jump in dissimilarity required to merge the next set of clusters. Cutting within this large gap generally represents a stable number of natural clusters.
38In the formal definition of DBSCAN, what is the relationship between "Directly Density-Reachable" and "Density-Reachable"?
Density-based clustering: DBSCAN fundamentals
Medium
A.Directly density-reachable applies only to noise points, while density-reachable applies to core points.
B.They are synonymous terms describing identical spatial relationships.
C.Density-reachable is the transitive closure of directly density-reachable.
D.Density-reachable is a symmetric relationship, while directly density-reachable is not.
Correct Answer: Density-reachable is the transitive closure of directly density-reachable.
Explanation:
Point $q$ is density-reachable from point $p$ if there is a chain of points $p_1, \dots, p_n$ where $p_1 = p$ and $p_n = q$, such that each $p_{i+1}$ is directly density-reachable from $p_i$.
39Suppose you are computing the distance between Cluster A (containing 3 points) and Cluster B (containing 4 points) using Average Linkage. How many pairwise distance calculations are averaged to find the distance between A and B?
Linkage methods (single, complete, average, Ward)
Medium
A.7
B.12
C.1
D.2
Correct Answer: 12
Explanation:
Average linkage computes the distance between two clusters as the average of all pairwise distances between points in the first cluster and points in the second cluster. Here, $3 \times 4 = 12$ pairwise distances are averaged.
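The pair count and the three pairwise linkage rules can be compared side by side. This is a small sketch with made-up 2-D coordinates and hypothetical function names, assuming Euclidean distance.

```python
from itertools import product
from math import dist


def pairwise_distances(cluster_a, cluster_b):
    """All |A| * |B| distances between a point of A and a point of B."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]


def linkage_distances(cluster_a, cluster_b):
    """Single, complete, and average linkage distances between two clusters."""
    d = pairwise_distances(cluster_a, cluster_b)
    return {
        "single": min(d),            # closest cross-cluster pair
        "complete": max(d),          # farthest cross-cluster pair
        "average": sum(d) / len(d),  # mean over every cross-cluster pair
    }


# Hypothetical clusters: 3 points in A, 4 points in B.
cluster_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
cluster_b = [(4.0, 0.0), (5.0, 0.0), (4.0, 1.0), (5.0, 1.0)]

pairs = pairwise_distances(cluster_a, cluster_b)
result = linkage_distances(cluster_a, cluster_b)
print(len(pairs))
print(result)
```

Because the average is taken over the same set of distances whose minimum and maximum define single and complete linkage, the average-linkage value always lies between the other two.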
40A border point $p$ in DBSCAN is within the $\epsilon$-neighborhood of two different core points, $q_1$ and $q_2$, which belong to two entirely separate clusters. How does DBSCAN handle the assignment of point $p$?
Noise and border points
Medium
A.It assigns $p$ to both clusters simultaneously, creating an overlapping clustering.
B.It assigns $p$ to whichever cluster's core point discovers it first.
C.It classifies $p$ as a noise point because it cannot break the tie.
D.It duplicates $p$, putting one copy in $q_1$'s cluster and one in $q_2$'s cluster.
Correct Answer: It assigns $p$ to whichever cluster's core point discovers it first.
Explanation:
DBSCAN partitions data into non-overlapping clusters. If a border point is reachable from core points in different clusters, its cluster assignment depends on the order in which the points are processed in the dataset.
41An exact divisive hierarchical clustering algorithm on a dataset of $n$ points requires evaluating all possible bipartite splits of a cluster at each step. What is the worst-case time complexity of creating the first split in this exact divisive approach, assuming no heuristic optimizations like DIANA are used?
Hierarchical clustering: Agglomerative vs. Divisive
Hard
A.
B.
C.
D.
Correct Answer: $O(2^n)$
Explanation:
To make the first split in an exact divisive clustering approach, we must consider all possible ways to partition $n$ objects into two non-empty subsets. The number of such partitions is $2^{n-1} - 1$, leading to exponential time complexity, which is why heuristics like DIANA are used in practice.
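The count of two-way splits can be confirmed by brute force for small $n$. The generator below is an illustrative sketch (the name `bipartitions` is hypothetical); it pins the first item to the left group so that each unordered split is produced exactly once.

```python
from itertools import combinations


def bipartitions(items):
    """Yield every split of `items` into two non-empty, unordered groups."""
    items = list(items)
    first, rest = items[0], items[1:]
    # Keep `first` in the left group so mirrored splits are not repeated.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = [first, *extra]
            right = [x for x in rest if x not in extra]
            if right:  # skip the split that leaves the right group empty
                yield left, right


# Count the splits for a range of small dataset sizes.
counts = {n: sum(1 for _ in bipartitions(range(n))) for n in range(2, 11)}
for n, c in counts.items():
    print(n, c, 2 ** (n - 1) - 1)
```

The count roughly doubles with each added point, so exhaustive evaluation is only feasible for very small datasets.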
42Ward's linkage method aims to minimize the total within-cluster variance. When deciding to merge clusters $A$ and $B$ with centroids $\mu_A$ and $\mu_B$, the increase in the Sum of Squared Errors (SSE), denoted as $\Delta(A, B)$, is proportional to the squared Euclidean distance between their centroids. Which of the following defines the exact cost $\Delta(A, B)$?
Linkage methods (single, complete, average, Ward)
Hard
A.
B.
C.
D.
Correct Answer: $\Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|} \|\mu_A - \mu_B\|^2$
Explanation:
Ward's method merges the two clusters that result in the smallest increase in total within-cluster variance. This increase is mathematically represented by the squared distance between the cluster centroids, weighted by the harmonic mean-like term of their sizes: $\frac{|A|\,|B|}{|A| + |B|}$.
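The closed-form merge cost can be compared against a direct SSE computation. This is a sketch with hypothetical helper names and made-up points, assuming squared Euclidean distance; it checks that $\mathrm{SSE}(A \cup B) - \mathrm{SSE}(A) - \mathrm{SSE}(B)$ matches the size-weighted squared centroid distance.

```python
def centroid(cluster):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dim = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / len(cluster) for k in range(dim))


def sse(cluster):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(cluster)
    return sum(sum((p[k] - c[k]) ** 2 for k in range(len(c))) for p in cluster)


def ward_cost(a, b):
    """Closed-form SSE increase: |A||B| / (|A| + |B|) * ||mu_A - mu_B||^2."""
    ca, cb = centroid(a), centroid(b)
    squared_gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_gap


# Hypothetical clusters of sizes 2 and 3.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 1.0), (6.0, 1.0), (7.0, 1.0)]

direct_increase = sse(a + b) - sse(a) - sse(b)
formula_increase = ward_cost(a, b)
print(direct_increase, formula_increase)
```

The closed form needs only cluster sizes and centroids, so a Ward implementation can rank candidate merges without recomputing any cluster's SSE from scratch.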
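43Consider the Lance-Williams update formula: $d(k, i \cup j) = \alpha_i\, d(k, i) + \alpha_j\, d(k, j) + \beta\, d(i, j) + \gamma\, |d(k, i) - d(k, j)|$. For Complete Linkage, what are the specific values of $\alpha_i$, $\alpha_j$, $\beta$, and $\gamma$?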
Linkage methods (single, complete, average, Ward)
Hard
A.
B.
C.
D.
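Correct Answer: $\alpha_i = \alpha_j = \frac{1}{2}$, $\beta = 0$, $\gamma = \frac{1}{2}$
Explanation:
Complete linkage defines the distance as $d(k, i \cup j) = \max(d(k, i), d(k, j))$. Using the identity $\max(a, b) = \frac{a + b}{2} + \frac{|a - b|}{2}$, the Lance-Williams coefficients must be $\alpha_i = \alpha_j = \frac{1}{2}$, $\beta = 0$, and $\gamma = \frac{1}{2}$.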
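The coefficient choice can be verified numerically. The sketch below assumes the four-term form of the Lance-Williams update; the function names are hypothetical, and the sample distances are chosen so that the floating-point arithmetic is exact.

```python
def lance_williams(d_ki, d_kj, d_ij, alpha_i, alpha_j, beta, gamma):
    """Distance from cluster k to the merged cluster (i union j)."""
    return (alpha_i * d_ki + alpha_j * d_kj
            + beta * d_ij + gamma * abs(d_ki - d_kj))


def complete_linkage_update(d_ki, d_kj, d_ij):
    """Lance-Williams update with alpha_i = alpha_j = 1/2, beta = 0, gamma = 1/2."""
    return lance_williams(d_ki, d_kj, d_ij, 0.5, 0.5, 0.0, 0.5)


# Hypothetical (d(k,i), d(k,j), d(i,j)) triples.
samples = [(3.0, 7.0, 2.0), (7.0, 3.0, 2.0), (4.0, 4.0, 1.0), (0.5, 9.5, 6.0)]
updates = [complete_linkage_update(*s) for s in samples]

for (d_ki, d_kj, _), u in zip(samples, updates):
    print(d_ki, d_kj, u, max(d_ki, d_kj))
```

Flipping the sign of `gamma` to -0.5 turns the same expression into `min(d_ki, d_kj)`, which is the single-linkage update.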
44Single Linkage clustering can be directly derived from a Minimum Spanning Tree (MST) of the data points. If a dataset has $n$ points and all pairwise distances are distinct, how can one obtain the exact $k$ clusters produced by Single Linkage from the MST?
Linkage methods (single, complete, average, Ward)
Hard
A.By finding the $k$ subtrees with the minimum total edge weight.
B.By removing all edges in the MST with weights greater than the median edge weight, recursively $k$ times.
C.By removing the $k-1$ edges in the MST that have the largest weights.
D.It is impossible; Single Linkage handles graph components differently than Kruskal's or Prim's algorithms.
Correct Answer: By removing the $k-1$ edges in the MST that have the largest weights.
Explanation:
Single Linkage clustering corresponds to building an MST (e.g., via Kruskal's algorithm). Removing the $k-1$ longest edges in the MST naturally partitions the graph into exactly $k$ connected components, which are identical to the clusters formed by Single Linkage.
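The MST construction can be sketched in a few lines. This is an illustration under assumptions rather than a reference implementation: it uses Kruskal's algorithm with a simple union-find, Euclidean distance, hypothetical function names, and sample points chosen so that all pairwise distances are distinct.

```python
from itertools import combinations
from math import dist


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def mst_edges(points):
    """Kruskal's algorithm; returns MST edges (weight, i, j) in increasing weight."""
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    parent = list(range(len(points)))
    tree = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def single_linkage_clusters(points, k):
    """Cluster labels obtained by dropping the k-1 heaviest MST edges."""
    tree = mst_edges(points)
    kept = tree[: len(tree) - (k - 1)]
    parent = list(range(len(points)))
    for _, i, j in kept:
        parent[find(parent, i)] = find(parent, j)
    roots = [find(parent, i) for i in range(len(points))]
    ids = {}  # relabel roots as 0, 1, 2, ... in order of first appearance
    return [ids.setdefault(r, len(ids)) for r in roots]


# Hypothetical points on a line with distinct pairwise distances.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0), (12.5, 0.0), (30.0, 0.0)]
labels = single_linkage_clusters(points, k=3)
print(labels)
```

Since Kruskal's algorithm adds edges in increasing weight, keeping all but the last $k-1$ MST edges is the same as stopping the agglomeration after $n-k$ merges.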
45You calculate the cophenetic correlation coefficient to evaluate an agglomerative clustering output. Which of the following scenarios would theoretically yield a cophenetic correlation coefficient of strictly $1.0$?
Dendrogram interpretation
Hard
A.The original distance matrix exactly satisfies the ultrametric inequality for all triplets of points.
B.The data points lie exactly on a one-dimensional straight line.
C.The original distance matrix strictly satisfies the triangle inequality for all points.
D.Ward's linkage is used on a dataset where all variables follow a standard normal distribution.
Correct Answer: The original distance matrix exactly satisfies the ultrametric inequality for all triplets of points.
Explanation:
The cophenetic correlation coefficient measures how faithfully a dendrogram preserves pairwise distances. A coefficient of $1.0$ implies a perfect linear relationship between original distances and dendrogram merge heights. This perfect tree-like structure is only possible if the original distances form a perfect ultrametric space ($d(x, z) \le \max(d(x, y), d(y, z))$).
46A researcher is analyzing a dendrogram generated by a hierarchical clustering algorithm. They observe 'reversals' (or 'inversions'), where a parent node merges at a lower distance height than its child nodes. Which of the following explains this anomaly?
Dendrogram interpretation
Hard
A.Ward's linkage was used, causing the objective function to contract heavily in the initial merges.
B.The data contains highly dense clusters intermixed with uniform noise, causing scaling issues on the y-axis.
C.The distance metric used was non-Euclidean, such as Cosine distance or Manhattan distance.
D.The researcher used a linkage method that violates the space-conserving Lance-Williams constraints, such as Centroid or Median linkage.
Correct Answer: The researcher used a linkage method that violates the space-conserving Lance-Williams constraints, such as Centroid or Median linkage.
Explanation:
Reversals in a dendrogram indicate a lack of monotonicity. Linkage methods like Centroid and Median are not strictly monotonic, meaning the distance between merged clusters can be smaller than the distance between the clusters that formed them, leading to crossover branches in the dendrogram.
47In a dendrogram produced by Complete Linkage, a very long vertical line segment with no horizontal merges branching off indicates:
Dendrogram interpretation
Hard
A.That the clusters merged at the bottom of the vertical line are highly chained and space-contracting.
B.A substantial range of distance thresholds (cutting heights) over which the number of clusters and their compositions remain unchanged.
C.A violation of the ultrametric inequality during the agglomeration steps.
D.That the dataset contains extreme outliers that were forced to merge at a localized point.
Correct Answer: A substantial range of distance thresholds (cutting heights) over which the number of clusters and their compositions remain unchanged.
Explanation:
The vertical axis of a dendrogram represents the merge distance. A long vertical line means that one must increase the distance threshold significantly before the next merge occurs. Thus, cutting the dendrogram anywhere along this large vertical gap yields the same stable set of clusters.
48In the context of DBSCAN, let $G = (V, E)$ be an undirected graph where the vertices are data points. An edge exists between $u$ and $v$ if $d(u, v) \le \epsilon$ and at least one of $u$, $v$ is a core point. How can the resulting DBSCAN clusters be rigorously defined in graph-theoretic terms?
Density-based clustering: DBSCAN fundamentals, e-neighborhood, MinPts
Hard
A.They are the strongly connected components of the entire graph including noise points.
B.They correspond to all maximal cliques in $G$ of size at least MinPts.
C.They are the strictly bipartite subgraphs where the two sets are core points and border points respectively.
D.They are the connected components of the subgraph induced strictly by the core points, with border points appended to any adjacent core-point component.
Correct Answer: They are the connected components of the subgraph induced strictly by the core points, with border points appended to any adjacent core-point component.
Explanation:
In DBSCAN, core points that are within $\epsilon$ of each other form a connected component (density-connected). Border points do not connect to other border points directly to form clusters; rather, they are assigned to the cluster of a core point they are $\epsilon$-reachable from.
49In DBSCAN, density-reachability and density-connectedness are foundational concepts. Which of the following statements strictly holds regarding their mathematical properties?
Density-based clustering: DBSCAN fundamentals, e-neighborhood, MinPts
Hard
A.Both density-reachability and density-connectedness are strictly symmetric and transitive equivalence relations.
B.Density-reachability is symmetric but not transitive, whereas density-connectedness is transitive but not symmetric.
C.Neither density-reachability nor density-connectedness exhibit symmetry in datasets containing varying densities.
D.Density-reachability is transitive but generally asymmetric, whereas density-connectedness is both symmetric and transitive.
Correct Answer: Density-reachability is transitive but generally asymmetric, whereas density-connectedness is both symmetric and transitive.
Explanation:
Density-reachability is asymmetric because a border point can be reached from a core point, but a core point cannot be reached from a border point (since the border point's $\epsilon$-neighborhood doesn't contain MinPts points). Density-connectedness is defined symmetrically (two points are connected if both are reachable from a third core point) and is transitive.
50Assume a dataset in a 10,000-dimensional space where points are distributed uniformly at random. When applying DBSCAN to this dataset, which of the following phenomena severely hinders its effectiveness due to the Curse of Dimensionality?
Density-based clustering: DBSCAN fundamentals, e-neighborhood, MinPts
Hard
A.The density-reachability condition becomes symmetric for all points, collapsing the distinction between core and border points.
B.The distance from a point to its nearest neighbor approaches the distance to its farthest neighbor, making it nearly impossible to distinguish dense regions from sparse regions using a fixed $\epsilon$.
D.All points automatically become core points regardless of $\epsilon$ because the volume of the $\epsilon$-sphere expands to encompass the entire space exponentially fast.
Correct Answer: The distance from a point to its nearest neighbor approaches the distance to its farthest neighbor, making it nearly impossible to distinguish dense regions from sparse regions using a fixed $\epsilon$.
Explanation:
In very high-dimensional spaces, the relative difference between the maximum and minimum distances between pairs of points converges to 0. Consequently, any given $\epsilon$ will either encompass almost all points (making everything one cluster) or almost no points (making everything noise).
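Distance concentration is easy to observe empirically. The sketch below is a rough illustration with assumed settings: uniform points in the unit hypercube, a single random query point, 200 samples, a fixed seed, and 1,000 dimensions instead of 10,000 to keep the run short. The helper name `relative_contrast` is hypothetical.

```python
import random
from math import dist


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point
    to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


low = relative_contrast(dim=2)
high = relative_contrast(dim=1000)
print("2-D contrast:", low)
print("1000-D contrast:", high)
```

When the contrast is small, there is only a narrow band of $\epsilon$ values between "almost no neighbors" and "almost every point is a neighbor", which is the difficulty the explanation describes.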
51Which of the following is an inherent source of non-determinism in the standard DBSCAN algorithm regarding border points?
Noise and border points
Hard
A.The algorithm randomly selects border points to merge density-reachable clusters, leading to different final cluster counts.
B.A border point that falls within the -neighborhood of core points belonging to two distinct clusters will be assigned to whichever cluster is processed first.
C.If a border point's distance to a core point is exactly equal to , floating-point instability will randomly assign it to the noise category.
D.Border points frequently fluctuate between being classified as noise and border points depending on the random initialization of the algorithm's KD-tree.
Correct Answer: A border point that falls within the -neighborhood of core points belonging to two distinct clusters will be assigned to whichever cluster is processed first.
Explanation:
While core points and noise points are deterministically identified, a border point that is density-reachable from multiple clusters (i.e., lies on the boundary between two clusters) is assigned to the cluster that discovers it first. Thus, cluster assignment for these specific border points depends on data traversal order.
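The order dependence can be reproduced with a small hand-rolled DBSCAN. This is a minimal sketch for 1-D data, not a reference implementation; the function names, the sample points, and the parameters `eps=1.1`, `min_pts=4` are assumptions chosen so that the point 2.0 is a border point reachable from both clusters.

```python
def dbscan(points, eps, min_pts):
    """Minimal 1-D DBSCAN; returns one label per point (-1 = noise)."""
    n = len(points)
    neighbors = [[j for j in range(n) if abs(points[i] - points[j]) <= eps]
                 for i in range(n)]  # each neighborhood includes the point itself
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue  # already claimed: the first cluster to reach a point keeps it
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                stack.extend(neighbors[j])
        cluster += 1
    return labels

def same_cluster(points, labels, a, b):
    return labels[points.index(a)] == labels[points.index(b)]

points = [0.0, 0.3, 0.6, 1.0, 2.0, 3.0, 3.4, 3.7, 4.0]  # 2.0 is the border point
forward = dbscan(points, eps=1.1, min_pts=4)
backward_points = points[::-1]
backward = dbscan(backward_points, eps=1.1, min_pts=4)

print(same_cluster(points, forward, 2.0, 1.0))            # border joins the left cluster
print(same_cluster(backward_points, backward, 2.0, 3.0))  # border joins the right cluster
```

Only the membership of the shared border point changes between the two runs; the core points of each cluster stay together regardless of traversal order.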
52Consider a dataset clustered by DBSCAN with parameters $\epsilon$ and $MinPts$. Point $P$ is identified as a border point. If a single noise point located far away from all clusters is removed from the dataset, and DBSCAN is re-run with the identical parameters, which of the following is guaranteed to be true regarding Point $P$?
Noise and border points
Hard
A.Point $P$ could be upgraded to a core point because the global density of the dataset decreases.
B.Point $P$ will still be identified as a border point, or it could potentially become a noise point if the distance matrix recalculation shifts the space.
C.Point $P$ will definitely remain a border point, as the removal of a distant noise point cannot affect the $\epsilon$-neighborhoods of the core points near $P$.
D.Point $P$ will turn into a noise point, because DBSCAN relies on global average density to determine relative border distances.
Correct Answer: Point $P$ will definitely remain a border point, as the removal of a distant noise point cannot affect the $\epsilon$-neighborhoods of the core points near $P$.
Explanation:
DBSCAN computes density purely locally using the $\epsilon$-neighborhood. Removing a noise point that is strictly outside the $\epsilon$-neighborhood of all core and border points has zero impact on the neighborhood counts of any other points. Therefore, $P$'s status is completely unaffected.
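The locality argument can be checked directly. The sketch below classifies 1-D points as core, border, or noise before and after deleting a distant noise point; the helper `classify`, the sample coordinates, and the parameter values are hypothetical.

```python
def classify(points, eps, min_pts):
    """Label each 1-D point as 'core', 'border', or 'noise'."""
    def neighborhood(i):
        return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

    core = {i for i in range(len(points)) if len(neighborhood(i)) >= min_pts}
    roles = {}
    for i, p in enumerate(points):
        if i in core:
            roles[p] = "core"
        elif any(j in core for j in neighborhood(i)):
            roles[p] = "border"  # not dense itself, but within eps of a core point
        else:
            roles[p] = "noise"
    return roles

# 2.0 plays the role of the border point; 50.0 is the distant noise point.
with_noise = classify([0.0, 0.3, 0.6, 1.0, 2.0, 50.0], eps=1.1, min_pts=4)
without_noise = classify([0.0, 0.3, 0.6, 1.0, 2.0], eps=1.1, min_pts=4)

print(with_noise[2.0], without_noise[2.0])
```

Because every neighborhood count is computed from points within `eps` only, deleting the distant point leaves every remaining label unchanged.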
53Let $D$ be a dataset. You run DBSCAN with $MinPts = 1$ and an arbitrary $\epsilon > 0$. Which of the following correctly describes the resulting classifications of points?
Noise and border points
Hard
A.The classification will perfectly mirror Single Linkage clustering cut at height $\epsilon$, resulting in a mix of core and border points.
B.All points are classified as core points; there are absolutely no border points or noise points.
C.All points are classified as noise points.
D.All points are classified as border points, and there are no core points.
Correct Answer: All points are classified as core points; there are absolutely no border points or noise points.
Explanation:
If $MinPts = 1$, every point satisfies the core point condition because a point is always in its own $\epsilon$-neighborhood (thus having at least 1 point in its neighborhood). Since every point is a core point, it is impossible for border points or noise points to exist.
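A short check of the core-point condition makes this concrete. This sketch assumes the usual convention that a point counts itself in its own neighborhood; the function `core_flags` and the scattered sample points are hypothetical.

```python
def core_flags(points, eps, min_pts):
    """True where a 1-D point's eps-neighborhood (itself included) has >= min_pts points."""
    return [sum(1 for q in points if abs(p - q) <= eps) >= min_pts for p in points]

scattered = [0.0, 10.0, 25.0, 70.0]  # hypothetical points, all far apart
print(core_flags(scattered, eps=0.5, min_pts=1))  # every point counts itself
print(core_flags(scattered, eps=0.5, min_pts=2))  # no point has a second neighbor
```

With `min_pts=1` even fully isolated points pass the test, so each one becomes a core point (and, if it has no neighbors, a singleton cluster).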
54Suppose you have a dataset with a very large number $N$ of dense data points in low-dimensional $\mathbb{R}^d$. You require a clustering algorithm that can theoretically operate within $O(N)$ or $O(N \log N)$ memory and time limits. Which of the following approaches is most feasible without utilizing severe dataset subsampling?
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Hard
A.Agglomerative Hierarchical Clustering with Ward's linkage, using a Lance-Williams matrix update.
B.Agglomerative Hierarchical Clustering with Single Linkage, using a generic distance matrix.
C.DBSCAN, heavily leveraging a spatial index such as an R-tree or KD-tree.
D.Divisive Hierarchical Clustering using an exact bipartite split algorithm.
Correct Answer: DBSCAN, heavily leveraging a spatial index such as an R-tree or KD-tree.
Explanation:
Standard Agglomerative Hierarchical clustering requires computing an $N \times N$ distance matrix, which requires $O(N^2)$ memory and at least $O(N^2)$ time, making it intractable for very large $N$. DBSCAN, however, can be optimized using spatial indexing structures (like KD-trees in low dimensions), reducing average time to $O(N \log N)$ and memory to $O(N)$.
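A back-of-the-envelope calculation shows why the quadratic matrix is the bottleneck. The numbers below are assumptions for illustration only: a dataset size of one million points, 3 coordinates per point, and 8-byte floating-point values.

```python
def full_matrix_bytes(n, bytes_per_value=8):
    """Storage for a full n x n pairwise distance matrix."""
    return n * n * bytes_per_value

def condensed_matrix_bytes(n, bytes_per_value=8):
    """Storage for the upper triangle only: n * (n - 1) / 2 entries."""
    return n * (n - 1) // 2 * bytes_per_value

n = 1_000_000  # hypothetical dataset size
print(full_matrix_bytes(n) / 1e12, "TB for the full matrix")
print(condensed_matrix_bytes(n) / 1e12, "TB for the condensed upper triangle")
print(n * 3 * 8 / 1e6, "MB for the raw points themselves, assuming 3 coordinates")
```

Even the condensed form is several terabytes at this size, while the raw coordinates plus an index that grows linearly with the number of points fit comfortably in memory.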
55You are tasked with clustering a 2D dataset consisting of a dense inner ring and a sparse, expanding outer ring of points (concentric circles of varying densities). Which of the following algorithms and configurations will fundamentally fail to isolate the two rings as distinct clusters?
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Hard
A.Single Linkage Hierarchical Clustering, because it suffers from the chaining effect and cannot separate non-convex shapes.
B.OPTICS (Ordering Points To Identify the Clustering Structure), because it strictly inherits DBSCAN's inability to handle varying densities.
C.k-Means with $k = 2$, assuming the centroids are initialized using k-Means++.
D.DBSCAN, because a single fixed $\epsilon$ cannot simultaneously accommodate the dense inner ring (without merging it into noise) and the sparse outer ring (without classifying it as noise).
Correct Answer: k-Means with $k = 2$, assuming the centroids are initialized using k-Means++.
Explanation:
While standard DBSCAN struggles with varying densities, k-Means fundamentally fails on all non-convex shapes (like concentric rings) regardless of density, as it assumes isotropic, spherical clusters separated linearly by Voronoi boundaries. Single linkage can extract concentric shapes if the gap is wide enough. Thus, k-Means is guaranteed to fail in isolating the topological structure of rings.
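The failure of k-Means on concentric rings can be demonstrated with plain Lloyd iterations. This is a minimal sketch under stated assumptions: the rings are noise-free circles of radius 1 and 5, and the initial centroids are fixed by hand with one on each ring (a deliberately favorable start, standing in for k-Means++, which is not implemented here).

```python
import math

def ring(radius, n, ring_id):
    """n evenly spaced points on a circle, tagged with the ring they came from."""
    return [(radius * math.cos(2 * math.pi * t / n),
             radius * math.sin(2 * math.pi * t / n), ring_id) for t in range(n)]

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations; returns the final cluster index of each point."""
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centroids)),
                      key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2)
                  for x, y, _ring in points]
        updated = []
        for c, old in enumerate(centroids):
            members = [(x, y) for (x, y, _ring), lab in zip(points, labels) if lab == c]
            if members:
                updated.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels

points = ring(1.0, 40, "inner") + ring(5.0, 40, "outer")
# Assumed initialization: one centroid placed on each ring.
labels = kmeans(points, centroids=[(1.0, 0.0), (5.0, 0.0)])
for c in (0, 1):
    rings_in_cluster = {r for (_, _, r), lab in zip(points, labels) if lab == c}
    print(c, sorted(rings_in_cluster))
```

With two centroids the decision boundary is a straight line, and no straight line can put the whole inner ring on one side and the whole surrounding outer ring on the other, so at least one cluster always mixes points from both rings.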
56Inductive clustering models can readily assign out-of-sample data points to existing clusters without re-running the algorithm on the entire combined dataset. Transductive models generally lack this native capability. Categorize k-Means, DBSCAN, and Standard Agglomerative Hierarchical clustering based on these definitions.
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Hard
A.All three are inherently Transductive; Inductive extensions require supervised learning.
B.k-Means and DBSCAN are Inductive. Agglomerative Hierarchical is Transductive.
C.k-Means is Inductive. DBSCAN and Agglomerative Hierarchical are Transductive.
D.Agglomerative Hierarchical is Inductive. k-Means and DBSCAN are Transductive.
Correct Answer: k-Means is Inductive. DBSCAN and Agglomerative Hierarchical are Transductive.
Explanation:
k-Means provides a global mathematical model (the centroids) which can trivially classify a new point in $O(kd)$ time (one distance computation per centroid, for $k$ centroids in $d$ dimensions). DBSCAN and Hierarchical clustering define clusters based on the specific interconnected topology of the input points; adding a new point can change the entire cluster structure (e.g., bridging two clusters), making them natively transductive.
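The inductive step for k-Means amounts to a nearest-centroid lookup. The sketch below assumes the centroids are already available from an earlier fit; the function name `predict` and the centroid values are hypothetical.

```python
def predict(point, centroids):
    """Index of the nearest centroid: k distance evaluations, no refit needed."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))

centroids = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical centroids from an earlier fit
print(predict((1.0, 2.0), centroids))
print(predict((9.0, 8.5), centroids))
```

Nothing comparable exists for standard DBSCAN or agglomerative clustering, because their output is a labeling of the training points rather than a compact model that can be queried.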
57The MacNaughton-Smith algorithm is a divisive hierarchical approach. Unlike the computationally prohibitive exact divisive method, how does it primarily construct the sequence of splitting?
Hierarchical clustering: Agglomerative vs. Divisive
Hard
A.By calculating the eigenvectors of the unnormalized Laplacian matrix and splitting at the median.
B.By starting with all points in one cluster, selecting the point with the highest average dissimilarity to the rest as a 'splinter group', and iteratively moving points to it.
C.By repeatedly running k-Means with $k = 2$ and picking the cluster with the highest variance to split next.
D.By recursively removing the longest edge in the dataset's Minimum Spanning Tree.
Correct Answer: By starting with all points in one cluster, selecting the point with the highest average dissimilarity to the rest as a 'splinter group', and iteratively moving points to it.
Explanation:
The MacNaughton-Smith algorithm (often associated with DIANA - DIvisive ANAlysis) avoids exponential complexity by using a heuristic: it finds the object with the largest average dissimilarity to all others in the cluster, initiates a splinter group, and greedily moves objects that are closer to the splinter group than to the main group.
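One splitting step of this heuristic can be sketched as follows. This is an illustrative reading of the description above, not a reference implementation of DIANA; the function names, the precomputed dissimilarity matrix `dist`, and the 1-D sample values are assumptions.

```python
def mean_dissimilarity(i, group, dist):
    """Average dissimilarity between object i and the other members of group."""
    others = [j for j in group if j != i]
    return sum(dist[i][j] for j in others) / len(others) if others else 0.0

def splinter_split(cluster, dist):
    """One divisive step: split a list of object indices into (splinter, remainder)."""
    remainder = list(cluster)
    # Seed the splinter group with the object farthest, on average, from the rest.
    seed = max(remainder, key=lambda i: mean_dissimilarity(i, remainder, dist))
    remainder.remove(seed)
    splinter = [seed]
    while len(remainder) > 1:
        # Positive gain: the object is closer on average to the splinter group.
        gains = {i: mean_dissimilarity(i, remainder, dist)
                    - mean_dissimilarity(i, splinter, dist) for i in remainder}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        remainder.remove(best)
        splinter.append(best)
    return splinter, remainder

values = [0.0, 0.2, 0.4, 9.0, 9.3]  # hypothetical 1-D objects
dist = [[abs(a - b) for b in values] for a in values]
splinter, remainder = splinter_split(range(len(values)), dist)
print(sorted(splinter), sorted(remainder))
```

Each step evaluates only average dissimilarities, so the split is found greedily instead of enumerating all $2^{n-1} - 1$ possible bipartitions.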
58Consider a dataset clustered effectively by k-Means, DBSCAN, and Complete Linkage Hierarchical Clustering using Euclidean distance. If the dataset undergoes an affine transformation where the x-axis is scaled by a large constant factor, which algorithm's underlying logic will be fundamentally robust to this transformation, assuming parameters are unadjusted?
Comparison: k-Means vs. Hierarchical vs. DBSCAN
Hard
A.Complete Linkage Hierarchical, as relative maximal distances are preserved.
B.DBSCAN, as density remains topologically equivalent despite the scaling.
C.None of the algorithms are robust to non-uniform scaling when using standard Euclidean distance.
D.k-Means, because the centroids simply stretch along the x-axis.
Correct Answer: None of the algorithms are robust to non-uniform scaling when using standard Euclidean distance.
Explanation:
Non-uniform scaling (scaling one axis drastically more than others) severely warps the Euclidean distance metric. Spherical clusters become highly elongated ellipses, breaking k-Means' isotropic assumption. Neighborhoods in DBSCAN and pairwise distances in Hierarchical clustering are completely distorted, requiring metric adjustments (like Mahalanobis distance) or data standardization to recover.
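A three-point example shows how stretching one axis reorders Euclidean neighbors, which is the quantity all three algorithms rely on. The points and the scaling factor of 1000 are hypothetical.

```python
import math

def nearest(query, candidates):
    """Name of the candidate closest to query under Euclidean distance."""
    return min(candidates, key=lambda name: math.dist(query, candidates[name]))

def scale_x(point, factor):
    return (point[0] * factor, point[1])

query = (0.0, 0.0)
candidates = {"b": (1.0, 0.0), "c": (0.0, 2.0)}
factor = 1000  # hypothetical scaling factor
scaled = {name: scale_x(p, factor) for name, p in candidates.items()}

print(nearest(query, candidates))               # b is 1 away, c is 2 away
print(nearest(scale_x(query, factor), scaled))  # b is now 1000 away, c still 2 away
```

Once nearest-neighbor relations change, centroid assignments, $\epsilon$-neighborhoods, and merge orders can all change with them.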
59A data scientist decides to construct a k-distance graph to determine the optimal $\epsilon$ for DBSCAN, setting $k$ according to the chosen $MinPts$. They plot the sorted k-distances of all points in descending order. If the graph exhibits two distinct, sharp 'knees' (inflection points) separated by a long plateau, what topological characteristic does the dataset likely possess?
Density-based clustering: DBSCAN fundamentals, e-neighborhood, MinPts
Hard
A.The dataset contains at least two distinct clusters of significantly different densities.
B.The dataset is essentially uniformly distributed without any cluster structures.
C.The dataset consists exclusively of noise points randomly scattered around a single core point.
D.The dataset suffers from the curse of dimensionality, rendering the distance metric useless.
Correct Answer: The dataset contains at least two distinct clusters of significantly different densities.
Explanation:
The k-distance plot shows the distance from each point to its $k$-th nearest neighbor. A sharp knee indicates a transition from a sparse region (noise) to a dense region. Two distinct knees with a plateau between them imply there are multiple threshold distance levels characterizing different cluster densities in the dataset.
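The values behind such a plot are simple to compute. The sketch below uses a brute-force search on hypothetical 1-D data containing a tight cluster (spacing 0.1), a looser cluster (spacing 1.0), and one isolated point; the function name `k_distances` and the choice `k=1` are illustrative.

```python
def k_distances(points, k):
    """Distance from each 1-D point to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        others = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result, reverse=True)

dense = [0.0, 0.1, 0.2, 0.3]
sparse = [10.0, 11.0, 12.0, 13.0]
noise = [100.0]
curve = k_distances(dense + sparse + noise, k=1)
print(curve)
```

The sorted curve drops from the isolated point to a plateau near 1.0 and then again to a plateau near 0.1; each drop is a knee, and each plateau suggests a candidate $\epsilon$ for one density level.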
60Average Linkage clustering (UPGMA) avoids the chaining effect of Single Linkage and the extreme sensitivity to outliers of Complete Linkage. What is a strict mathematical requirement for the distance metric between points for Average Linkage to produce an ultrametric tree without reversals?
Linkage methods (single, complete, average, Ward)
Hard
A.The distance metric must be Euclidean, as Ward's linkage is the only alternative for non-Euclidean spaces.
B.The distance metric must be identical to the Pearson correlation coefficient.
C.The distance metric must satisfy $d(x, y) \ge 0$ and $d(x, y) = d(y, x)$, and UPGMA will always guarantee monotonicity.
D.There is no requirement; UPGMA is space-conserving and inherently monotonic regardless of the strict metric properties of the underlying dissimilarity matrix.
Correct Answer: The distance metric must satisfy $d(x, y) \ge 0$ and $d(x, y) = d(y, x)$, and UPGMA will always guarantee monotonicity.
Explanation:
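Average Linkage (UPGMA) uses Lance-Williams parameters $\alpha_i = \frac{n_i}{n_i + n_j}$, $\alpha_j = \frac{n_j}{n_i + n_j}$, $\beta = 0$, $\gamma = 0$. Because $\alpha_i + \alpha_j + \beta = 1$ and $\gamma = 0$, the linkage is strictly space-conserving and monotonic (avoids reversals) as long as the base dissimilarities are non-negative and symmetric. It does not strictly require the triangle inequality.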
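The Lance-Williams form of average linkage can be checked against the direct definition. This is a minimal sketch on hypothetical 1-D clusters; `upgma_update` and `average_linkage` are illustrative names, and the update shown is the standard recurrence with $\beta = \gamma = 0$.

```python
def upgma_update(d_ik, d_jk, n_i, n_j):
    """Lance-Williams update for average linkage (beta = gamma = 0)."""
    alpha_i = n_i / (n_i + n_j)
    alpha_j = n_j / (n_i + n_j)
    return alpha_i * d_ik + alpha_j * d_jk

def average_linkage(a, b):
    """Direct definition: mean of all pairwise distances between 1-D clusters."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

ci, cj, ck = [0.0, 1.0], [4.0], [10.0]  # hypothetical clusters
via_update = upgma_update(average_linkage(ci, ck), average_linkage(cj, ck),
                          len(ci), len(cj))
direct = average_linkage(ci + cj, ck)
print(via_update, direct)
```

The updated distance is a size-weighted average of the two previous distances, so it always lies between them, which is the property behind the absence of reversals.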