Unit3 - Subjective Questions
INT396 • Practice Questions with Detailed Answers
Define Hierarchical Clustering and differentiate between Agglomerative and Divisive approaches.
Hierarchical Clustering is an unsupervised learning algorithm that groups similar objects into groups called clusters by creating a hierarchy of clusters. It does not require pre-specifying the number of clusters to be generated.
Differences between Agglomerative and Divisive:
- Agglomerative (Bottom-Up):
- Starts with each data point as its own individual cluster.
- At each step, it merges the two most similar (closest) clusters.
- Continues merging until all points are grouped into a single root cluster.
- Computationally less expensive than divisive.
- Divisive (Top-Down):
- Starts with all data points in a single, massive cluster.
- At each step, it splits the most heterogeneous cluster into two smaller clusters.
- Continues splitting until each data point forms its own cluster.
- Computationally highly complex ( for all possible splits), making it less common in practice.
Explain the four common linkage methods used in hierarchical clustering: Single, Complete, Average, and Ward's linkage.
Linkage methods determine how the distance between two clusters is calculated in hierarchical clustering.
- Single Linkage (Minimum Linkage): The distance between two clusters is defined as the shortest distance between any two points in the respective clusters. It tends to produce long, "chain-like" clusters (chaining effect).
- Complete Linkage (Maximum Linkage): The distance between two clusters is defined as the longest distance between any two points in the respective clusters. It tends to produce highly compact, spherical clusters but is sensitive to outliers.
- Average Linkage: The distance between two clusters is defined as the average distance between all pairs of points in the respective clusters. It strikes a balance between single and complete linkage, avoiding extreme chaining while being more robust to outliers.
- Ward's Linkage: Unlike distance-based linkages, Ward's method minimizes the total within-cluster variance. At each step, it merges the pair of clusters that leads to the minimum increase in total within-cluster sum of squares (variance).
What is a dendrogram? Explain how it is interpreted to determine the optimal number of clusters.
Dendrogram: A dendrogram is a tree-like diagram that records the sequences of merges or splits in hierarchical clustering. The x-axis represents the data points, and the y-axis represents the distance (or dissimilarity) at which the clusters were merged (or split).
Interpretation and Determining Clusters:
- Height of Branches: The vertical height where two branches merge indicates the distance between the two clusters being combined. A greater height indicates less similarity.
- Cutting the Dendrogram: To form flat clusters, a horizontal line is drawn across the dendrogram at a specific height.
- Optimal Number of Clusters: The optimal number is often found by identifying the largest vertical distance that can be traversed without intersecting any horizontal merge lines. Drawing a horizontal threshold line through this vertical space cuts a certain number of branches; this number represents the optimal number of clusters .
Provide a detailed comparison between k-Means, Hierarchical, and DBSCAN clustering algorithms.
1. k-Means Clustering:
- Approach: Centroid-based.
- Parameter: Requires (number of clusters) in advance.
- Shape: Assumes spherical clusters; performs poorly on non-linear or arbitrary shapes.
- Noise/Outliers: Highly sensitive to outliers.
- Complexity: Generally , very fast for large datasets.
2. Hierarchical Clustering:
- Approach: Connectivity-based (Bottom-up or Top-down).
- Parameter: Does not require in advance; uses a dendrogram.
- Shape: Depending on linkage, usually prefers convex shapes (except single linkage).
- Noise/Outliers: Sensitivity depends on linkage (Complete is sensitive, Average/Ward's are robust).
- Complexity: or , slow for large datasets.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
- Approach: Density-based.
- Parameter: Requires (radius) and MinPts (minimum points).
- Shape: Excellent at finding clusters of arbitrary shapes.
- Noise/Outliers: Robust; explicitly identifies and ignores outliers as "noise".
- Complexity: with spatial indexing, good for medium-to-large datasets.
Explain the fundamental concepts of -neighborhood and MinPts in the context of DBSCAN.
DBSCAN relies on two primary user-defined parameters to define "density":
-
-neighborhood (Epsilon-neighborhood):
- Definition: For a given data point , its -neighborhood is the set of all data points that lie within a distance of from .
- Mathematically: .
- Significance: It defines the local radius within which the algorithm searches for neighboring points to evaluate density.
-
MinPts (Minimum Points):
- Definition: The minimum number of points required to form a dense region (a valid cluster).
- Significance: If a point has at least MinPts in its -neighborhood (including the point itself), that neighborhood is considered dense, and the point acts as a seed for a cluster.
Define and distinguish between Core points, Border points, and Noise points in DBSCAN.
In DBSCAN, based on and MinPts, every data point is classified into one of three categories:
- Core Point:
- A point is a core point if there are at least MinPts number of points (including the point itself) in its -neighborhood.
- These points form the interior of a density-based cluster.
- Border Point:
- A point is a border point if it has fewer than MinPts within its -neighborhood, but it lies within the -neighborhood of a Core Point.
- These points form the edge (boundary) of a cluster.
- Noise Point (Outlier):
- A point is a noise point if it is neither a core point nor a border point.
- It does not belong to any cluster and is treated as an outlier.
Formulate the mathematical expressions for Single Linkage and Complete Linkage. Why is Single Linkage prone to the chaining effect?
Mathematical Formulation:
Let and be two clusters. Let be the distance between point and point .
- Single Linkage:
- Complete Linkage:
The Chaining Effect in Single Linkage:
Single linkage evaluates the distance between two clusters based ONLY on their two closest points. Because of this, two massive, distinct clusters can be merged simply because a single pair of points (one from each cluster) happen to be close to each other. This causes points to link together in long, extended "chains" rather than forming compact, spherical clusters, making it highly sensitive to noise points located between actual clusters.
Explain Ward's Linkage method in agglomerative hierarchical clustering. How does its objective function differ from Single and Complete linkage?
Ward's Linkage Method:
Ward's method is a variance-minimizing approach. At each step of the agglomerative clustering process, it merges the two clusters that result in the minimum increase in the total within-cluster sum of squares (variance).
Objective Function:
The Error Sum of Squares (ESS) for a cluster is defined as:
where is the centroid of cluster .
When merging clusters and into , the increase in ESS (the "distance" between and ) is:
Differences from Single/Complete Linkage:
While Single and Complete linkages evaluate distance based on individual point-to-point comparisons (min and max distances respectively), Ward's method evaluates distances based on the geometric centroid and statistical variance of the clusters. This forces the algorithm to create highly compact, evenly sized, spherical clusters.
Describe the step-by-step algorithm for Agglomerative Hierarchical Clustering.
The Agglomerative Hierarchical Clustering algorithm follows a bottom-up approach:
- Initialization: Treat each data point as a single cluster. If there are data points, you start with clusters.
- Distance Matrix Computation: Calculate the distance (e.g., Euclidean, Manhattan) between all pairs of clusters to form a distance matrix.
- Find the Closest Clusters: Identify the two clusters that are closest to each other based on a chosen linkage criterion (Single, Complete, Average, or Ward's).
- Merge: Merge the two closest clusters into a single new cluster.
- Update Distance Matrix: Recalculate the distances between the new cluster and all remaining clusters using the chosen linkage method.
- Repeat: Repeat steps 3, 4, and 5 until all data points are merged into a single root cluster containing all points.
- Dendrogram: The history of merges is plotted as a dendrogram, which can be cut to yield the desired number of clusters.
Discuss the main advantages and limitations of Hierarchical Clustering.
Advantages:
- No need to pre-specify : Unlike k-Means, it does not require the user to guess the number of clusters in advance.
- Interpretability: The dendrogram provides an excellent visual representation of the data's hierarchy and similarity structure.
- Deterministic: It always produces the same clusters for a given dataset and linkage method, unlike k-Means which depends on random initialization.
Limitations:
- Computational Complexity: Standard agglomerative clustering has a time complexity of and space complexity of , making it unsuitable for large datasets.
- Irreversibility: Once a merge (or split) is performed, it cannot be undone. If an early merge is poor, the algorithm cannot correct it later.
- Sensitivity to Noise/Linkage: Methods like Single Linkage are highly sensitive to noise and outliers (chaining effect).
Provide a step-by-step explanation of how the DBSCAN algorithm works.
The DBSCAN algorithm operates step-by-step to find high-density regions:
- Parameter Definition: Choose the values for (radius) and MinPts (minimum points).
- Point Selection: Randomly select an unvisited data point from the dataset and mark it as visited.
- Neighborhood Retrieval: Find all points within the -neighborhood of .
- Core Point Check:
- If the neighborhood contains at least MinPts, is marked as a Core Point, and a new cluster is created.
- If the neighborhood has fewer than MinPts, is temporarily marked as Noise.
- Cluster Expansion: If is a core point, add all points in its -neighborhood to cluster . For each new point added:
- If it is unvisited, mark it as visited and find its -neighborhood.
- If its neighborhood has at least MinPts, add those points to cluster as well. (This expands the cluster recursively).
- If the point was previously marked as Noise, change its status to a Border Point belonging to cluster .
- Iteration: Repeat steps 2-5 for all unvisited points in the dataset until all points have been visited.
- Termination: Points remaining marked as Noise do not belong to any cluster.
What are the advantages and disadvantages of using DBSCAN for clustering?
Advantages:
- Arbitrary Shapes: Can discover clusters of any geometric shape (e.g., moons, concentric circles), unlike k-Means which is restricted to spherical clusters.
- No Need to Specify : Does not require the number of clusters to be specified a priori.
- Robust to Outliers: Explicitly identifies and isolates noise/outliers, preventing them from skewing the cluster centers.
Disadvantages:
- Parameter Sensitivity: Highly sensitive to the choice of and MinPts. Small changes in can drastically alter the clustering results.
- Varying Densities: Struggles significantly with datasets containing clusters of widely varying densities because a single -MinPts combination cannot capture all densities.
- High Dimensionality: Performance degrades in high-dimensional spaces due to the "curse of dimensionality," making distance metrics (and thus ) less meaningful.
Explain the concepts of 'Directly Density-Reachable', 'Density-Reachable', and 'Density-Connected' in DBSCAN.
DBSCAN relies on formal mathematical definitions of reachability to expand clusters:
-
Directly Density-Reachable: A point is directly density-reachable from point if:
- is a core point.
- is in the -neighborhood of ().
(Note: This relationship is not symmetric if is a border point).
-
Density-Reachable: A point is density-reachable from if there is a chain of points where and , such that each is directly density-reachable from .
- This allows the algorithm to transitively connect core points and expand the cluster.
-
Density-Connected: Two points and are density-connected if there exists a core point such that both and are density-reachable from .
- This handles border points. Since a border point cannot reach another point (it isn't a core point), two border points on opposite sides of a cluster are considered part of the same cluster because they are density-connected through the core points in the middle.
How does Average Linkage strike a balance between Single and Complete Linkage in Hierarchical Clustering?
Average linkage computes the distance between two clusters as the arithmetic mean of the distances between all pairs of points, one from each cluster:
Striking a Balance:
- Against Single Linkage: Single linkage only looks at the closest pair of points, leading to chaining. Average linkage considers all pairs, so a single close pair of points won't force a merge if the rest of the clusters are far apart. This mitigates the chaining effect.
- Against Complete Linkage: Complete linkage only looks at the farthest pair, making it overly sensitive to a single outlier at the edge of a cluster. Average linkage averages out these extreme distances, making the algorithm much more robust to outliers than complete linkage.
- By considering the entire global structure of the clusters being compared, average linkage provides a stable compromise between the extremes of min and max distance.
Why is DBSCAN often preferred over k-Means when dealing with datasets that have non-linear or arbitrary shapes?
k-Means uses a centroid-based approach, measuring the Euclidean distance from a central point (the mean). This inherently assumes that clusters are convex and roughly spherical. If presented with non-linear shapes (like two concentric rings or interlocking moon shapes), k-Means will split them incorrectly because it groups points based on proximity to a central coordinate.
DBSCAN's Preference:
DBSCAN, being density-based, connects points based on local density (points packed closely together). It builds clusters by linking together -neighborhoods. As long as the dense region is continuous, DBSCAN will traverse the entire shape, regardless of whether it is a line, a ring, an S-shape, or any other arbitrary geometry. Thus, it can successfully map out the actual structure of non-linear datasets.
Explain the Divisive Hierarchical Clustering process and why it is computationally more expensive than Agglomerative Clustering.
Divisive Hierarchical Clustering (DIANA - Divisive Analysis):
- Starts with all data points grouped into a single, massive cluster.
- At each step, it identifies the cluster with the highest heterogeneity (largest diameter or variance).
- It splits this cluster into two smaller, distinct clusters.
- This process of splitting is repeated recursively until all points form their own individual clusters.
Computational Expense:
To split a cluster of points into two sub-clusters, there are theoretically possible ways to make the split. Finding the optimal split at the very first step requires an immense number of calculations (exponential time complexity ). While heuristics can be used to optimize this, exact divisive clustering remains computationally prohibitive for large datasets compared to agglomerative clustering, which simply computes an distance matrix initially.
Derive or provide the mathematical representations of cluster distances for Single, Complete, and Average linkage.
Let and be two distinct clusters. Let be a point in and be a point in . Let be the distance (e.g., Euclidean distance) between and .
-
Single Linkage (Minimum distance):
This represents the closest distance between any two elements of the clusters. -
Complete Linkage (Maximum distance):
This represents the maximum distance between any two elements of the clusters. -
Average Linkage (Mean distance):
where and are the number of elements in and respectively. This is the sum of all pairwise distances divided by the total number of pairs.
In a dendrogram, what does the vertical height of a horizontal line connecting two branches represent? Explain with an example.
In a dendrogram, the vertical height at which a horizontal line connects two branches represents the distance (or dissimilarity) between the two clusters being merged at that step.
Explanation & Example:
If the y-axis represents Euclidean distance, and two distinct clusters (say, Cluster A and Cluster B) are merged by a horizontal line that sits at a y-value of 4.5, it means that the distance between Cluster A and Cluster B, calculated according to the chosen linkage criteria (like Single or Complete linkage), is exactly 4.5 units.
- Low vertical heights indicate that the clusters being merged are very similar (close together).
- High vertical heights indicate that the clusters being merged are dissimilar (far apart).
A very long vertical line with no horizontal branches indicates a significant jump in dissimilarity, often implying that merging these groups combines distinctly different data structures.
What happens to the DBSCAN clustering output if the value of is chosen too large or too small?
The (epsilon) parameter defines the radius of neighborhood around a point. Its selection is critical for DBSCAN's success:
- If is chosen too large:
- The neighborhood radius becomes so vast that the majority of data points will fall within each other's neighborhoods.
- This causes multiple distinct clusters to merge together. In the worst-case scenario, the entire dataset will be grouped into a single, massive cluster.
- Noise points and outliers will be incorrectly absorbed into clusters.
- If is chosen too small:
- The neighborhood radius is too restrictive, meaning very few points will satisfy the MinPts condition to become core points.
- As a result, a large portion of the dataset will be classified as Noise.
- Meaningful, valid clusters will be fragmented into many smaller, insignificant clusters or ignored entirely.
Contrast how k-Means and DBSCAN handle outliers and noise in a dataset.
k-Means Handling of Outliers:
- k-Means forces every data point to belong to exactly one of the clusters, including extreme outliers.
- Because the centroid is calculated as the arithmetic mean of all points in the cluster, outliers disproportionately pull the centroid away from the true center of the data mass.
- This distorts cluster boundaries and heavily degrades the overall quality of the clustering.
DBSCAN Handling of Outliers:
- DBSCAN intrinsically models outliers through its density parameters ( and MinPts).
- Points that lie in low-density regions (failing to meet MinPts and not reachable from any core point) are explicitly labeled as Noise.
- Noise points are completely excluded from cluster formation, meaning they have zero impact on the shape or structure of the valid clusters. This makes DBSCAN highly robust to noisy datasets.