Unit 3 - Notes

INT396 7 min read

Unit 3: Hierarchical & Density-Based Clustering

1. Hierarchical Clustering

Hierarchical clustering is a family of unsupervised learning algorithms that build a hierarchy of clusters. Unlike k-Means, it does not require the user to pre-specify the number of clusters (). Instead, it outputs a tree-based representation of the data, allowing users to choose the desired number of clusters after the clustering process has occurred.

Agglomerative vs. Divisive Clustering

There are two primary approaches to building a hierarchical cluster:

Agglomerative Clustering (Bottom-Up)

This is the most common approach.

  • Initialization: Every single data point starts as its own individual cluster (if there are points, there are clusters).
  • Iteration: At each step, the algorithm computes the distance between all pairs of clusters and merges the two closest clusters into a single cluster.
  • Termination: The process repeats until all data points are merged into a single macroscopic cluster containing the entire dataset.
  • Complexity: Generally time complexity, making it computationally expensive for large datasets.

Divisive Clustering (Top-Down)

This is the inverse of the agglomerative approach.

  • Initialization: All data points start in one giant, single cluster.
  • Iteration: At each step, the algorithm splits the most heterogeneous cluster into two smaller clusters.
  • Termination: The process repeats until every data point is in its own singleton cluster.
  • Complexity: Can be more complex than agglomerative because finding the optimal split of a cluster at each step is computationally difficult ( in the worst naive case, though heuristics are used).

Linkage Methods

In agglomerative clustering, the decision of which clusters to merge relies on a distance metric (e.g., Euclidean, Manhattan) and a linkage criterion. The linkage determines how the distance between two clusters (sets of points) is calculated.

  1. Single Linkage (Minimum Linkage)

    • Definition: The distance between two clusters is defined as the shortest distance between any two points in the respective clusters.
    • Characteristics: Tends to produce long, elongated clusters. Highly susceptible to the "chaining effect," where a few noisy points can bridge two distinct clusters.
  2. Complete Linkage (Maximum Linkage)

    • Definition: The distance between two clusters is the longest distance between any two points in the respective clusters.
    • Characteristics: Tends to produce tightly bound, compact, and spherical clusters. Sensitive to outliers, as a single outlier can drastically increase the maximum distance.
  3. Average Linkage

    • Definition: The distance between two clusters is the average of the distances between all pairs of points in the two clusters.
    • Characteristics: A compromise between Single and Complete linkage. Less sensitive to outliers and does not suffer from the chaining effect.
  4. Ward's Method (Minimum Variance)

    • Definition: Instead of measuring distance directly, Ward's method minimizes the total within-cluster variance. At each step, it merges the two clusters that result in the smallest increase in the total sum of squared errors (SSE).
    • Characteristics: Highly effective for quantitative data with Euclidean distance. It tends to create clusters of relatively equal sizes and is the default in many software packages (like scikit-learn).

Dendrogram Interpretation

A dendrogram is a tree-like diagram that records the sequences of merges (agglomerative) or splits (divisive).

  • X-Axis: Represents the individual data points (or indices of the data).
  • Y-Axis: Represents the distance (or dissimilarity) at which the clusters were merged.
  • Reading the Dendrogram:
    • Each inverted U-shape represents a merge.
    • The height of the vertical lines represents the distance between the two clusters being merged. A tall vertical line indicates that the clusters merged were far apart.
  • Determining the Number of Clusters ():
    • To find clusters, draw a horizontal line across the dendrogram. The number of vertical lines intersected by the horizontal line is the number of clusters.
    • Rule of Thumb: Look for the longest vertical lines that are not intersected by horizontal lines from other merges. Drawing a line through this wide gap often yields the optimal number of natural clusters.

2. Density-Based Clustering: DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed together (points with many nearby neighbors) and marks points as outliers if they lie alone in low-density regions.

Unlike k-Means or Hierarchical clustering, DBSCAN can discover clusters of arbitrary shapes and is inherently robust to noise.

Fundamentals & Key Parameters

DBSCAN requires two primary parameters to define "density":

  1. (Epsilon or eps): The maximum radius of the neighborhood around a data point. Two points are considered neighbors if the distance between them is less than or equal to .
  2. MinPts (Minimum Points): The minimum number of points required within the -neighborhood of a data point for that point to be considered a "dense" region (or a core point). This count includes the point itself.

Point Classifications

Based on and MinPts, DBSCAN classifies every data point into one of three categories:

  1. Core Points:
    • A point is a core point if there are at least MinPts points within its -neighborhood.
    • These points form the dense interior of a cluster.
  2. Border Points:
    • A point is a border point if it has fewer than MinPts within its -neighborhood, but it lies within the -neighborhood of a Core point.
    • These points form the edge (or border) of a cluster.
  3. Noise (Outliers):
    • A point is a noise point if it is neither a core point nor a border point.
    • These points are isolated and do not belong to any cluster.

The DBSCAN Algorithm Workflow

  1. Pick an arbitrary unvisited point in the dataset.
  2. Check its -neighborhood:
    • If it contains MinPts, a new cluster is started, and the point is marked as a Core point.
    • If it does not contain MinPts, the point is temporarily labeled as Noise (it may later be updated to a Border point if it falls within the neighborhood of a subsequently discovered Core point).
  3. Expand the cluster: For a newly discovered Core point, all points in its -neighborhood are added to the cluster. The algorithm then checks the -neighborhood of these newly added points. If they are also Core points, their neighbors are added. This chain reaction continues until the cluster's density drops below the threshold (no more reachable Core points).
  4. Repeat: Once a cluster is fully expanded, the algorithm proceeds to the next unvisited point in the dataset, repeating until all points are visited.

3. Comparison: k-Means vs. Hierarchical vs. DBSCAN

Feature k-Means Hierarchical DBSCAN
Algorithm Type Centroid-based / Partitioning Connectivity-based / Tree-based Density-based
Required Parameters (Number of clusters) Linkage method, distance metric (radius), MinPts
Cluster Shape Assumes spherical, convex clusters Depends on linkage (Single = elongated, Ward = spherical) Arbitrary shapes (highly flexible)
Handling Outliers Sensitive (forces outliers into clusters, pulling centroids) Sensitive (depends on linkage, complete/average are better) Robust (explicitly identifies and isolates noise/outliers)
Scalability (Time) (Very fast) standard, optimized (Slow) with spatial indexing, worst case
Interpretability Moderate (Cluster centers) High (Dendrogram provides visual hierarchy) Moderate (Based on density reachability)
Determinism Non-deterministic (depends on random initial centroid placement) Deterministic Deterministic (except for edge cases of border points shared by two clusters)

When to use which?

  • Use k-Means when: The dataset is large, clusters are expected to be roughly spherical and equally sized, and computational speed is a priority.
  • Use Hierarchical when: You have a smaller dataset, you want a deep understanding of the data's topology, or you don't know the exact number of clusters a priori and want to visually inspect a dendrogram.
  • Use DBSCAN when: The data contains noise/outliers, clusters are irregularly shaped (e.g., moons, concentric circles), and you do not know the number of clusters in advance.

Reference Implementation Snippet (scikit-learn)

PYTHON
from sklearn.cluster import AgglomerativeClustering, DBSCAN

# Hierarchical (Agglomerative)
# Ward linkage is default; distance_threshold can be used instead of n_clusters
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_hc = hierarchical.fit_predict(X)

# DBSCAN
# eps is epsilon, min_samples is MinPts
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X) # Returns -1 for Noise points