Unit 2 - Notes

INT396 7 min read

Unit 2: Partition-Based Clustering

1. Clustering Fundamentals and Assumptions

Clustering is a fundamental unsupervised machine learning technique aimed at grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.

Core Objectives

  • Maximize Intra-cluster Similarity: Data points within the same cluster should be as close to each other as possible.
  • Minimize Inter-cluster Similarity: Data points belonging to different clusters should be as far apart as possible.

Key Assumptions of Partition-Based Clustering

Partition-based algorithms (like k-Means) rely heavily on certain geometric and statistical assumptions about the underlying data:

  1. Spherical Clusters: The algorithm assumes clusters are convex and isotropic (spherical). It struggles with elongated, crescent-shaped, or irregular cluster boundaries.
  2. Similar Variance/Density: It assumes all clusters have roughly the same variance. If one cluster is highly dispersed and another is extremely dense, the algorithm may split the larger cluster to minimize variance.
  3. Similar Size: It often assumes clusters contain a roughly comparable number of data points.
  4. Linear Separability: Since it uses distance metrics (typically Euclidean) from a centroid, it implicitly assumes clusters are linearly separable.

2. Hard vs. Soft Clustering

Hard Clustering

In hard clustering, every data point is assigned to exactly one cluster. The boundaries between clusters are strict.

  • Example: standard k-Means.
  • Representation: A point belongs to cluster . The assignment is binary: .
  • Pros: Computationally efficient, easy to interpret.
  • Cons: Struggles with overlapping boundaries or points sitting exactly between two centroids.

Soft (Fuzzy) Clustering

In soft clustering, a data point can belong to multiple clusters with varying degrees of membership or probability.

  • Example: Fuzzy C-Means (FCM), Gaussian Mixture Models (GMM).
  • Representation: A point has a probability/weight of belonging to cluster , where and .
  • Pros: More robust for ambiguous data points near cluster boundaries; provides a measure of uncertainty.
  • Cons: Computationally more intensive; harder to interpret categorical boundaries.

3. k-Means Algorithm

The k-Means algorithm partitions a dataset of observations into predefined distinct, non-overlapping clusters.

Objective Function

The goal of k-Means is to minimize the Within-Cluster Sum of Squares (WCSS), also known as Inertia.
The objective function is defined as:

Where:

  • is the total number of clusters.
  • is the -th cluster.
  • is a data point belonging to .
  • is the centroid (mean) of the points in cluster .
  • is the squared Euclidean distance between the point and the centroid.

Initialization Strategies

1. Random Initialization (Forgy Method)

  • Process: Randomly select data points from the dataset to serve as the initial centroids.
  • Drawbacks: Highly susceptible to the "initialization trap." Poor initial placement can lead the algorithm to converge on a suboptimal local minimum, resulting in poor clustering.

2. k-Means++

k-Means++ is a smart initialization technique designed to spread out the initial centroids, leading to faster convergence and better overall solutions.

  • Algorithm Steps:
    1. Choose one centroid uniformly at random from the data points.
    2. For each data point , compute , the distance between and the nearest previously chosen centroid.
    3. Choose the next centroid from the data points such that the probability of choosing is proportional to . (Points further away are more likely to be chosen).
    4. Repeat steps 2 and 3 until centroids have been selected.
    5. Proceed with the standard k-Means algorithm.

Convergence and Limitations

k-Means typically uses Lloyd's Algorithm, an iterative expectation-maximization process:

  1. Assignment Step: Assign each point to the nearest centroid.
  2. Update Step: Recalculate the centroids as the mean of all points assigned to that cluster.
    • Convergence: The algorithm is guaranteed to converge in a finite number of steps (WCSS decreases monotonically). However, it is only guaranteed to find a local optimum, not the global optimum.
    • Limitations:
    • Requires specifying (number of clusters) in advance.
    • Highly sensitive to outliers (since it minimizes squared errors).
    • Fails on non-spherical or differing-density clusters.

4. k-Medoids (PAM) vs. k-Means

k-Medoids, specifically the Partitioning Around Medoids (PAM) algorithm, is a robust alternative to k-Means.

Feature k-Means k-Medoids (PAM)
Center Definition Centroid: The mathematical mean of the points in the cluster (may not be an actual data point). Medoid: An actual data point in the dataset that minimizes the sum of distances to other points in the cluster.
Distance Metric Typically relies on Squared Euclidean distance. Can use any distance metric (Manhattan, Cosine, etc.).
Outlier Sensitivity High: Outliers heavily skew the mathematical mean. Low: Medoids are highly robust to outliers and noise.
Computation Fast and efficient ( where is iterations). Slower and more computationally expensive (). Better suited for small to medium datasets.

5. Data Standardization and Scaling Impact

Partition-based clustering relies entirely on distance metrics (usually Euclidean) to determine similarities. Consequently, feature scaling is mandatory.

  • The Problem: If one feature has a range of 0 to 1, and another has a range of 0 to 100,000, the Euclidean distance calculation will be completely dominated by the feature with the larger range. The clustering algorithm will essentially ignore the smaller-scale feature.
  • The Solution: Standardize the data so all features contribute equally.
    • Z-score Standardization (StandardScaler): Centers data around a mean of 0 with a standard deviation of 1. Formula:
    • Min-Max Scaling: Scales features to a fixed range, usually [0, 1]. Formula:

6. MiniBatch k-Means for Large-Scale Datasets

Standard k-Means requires computing the distance between every data point and every centroid at each iteration, which becomes computationally prohibitive for massive datasets (e.g., millions of rows).

MiniBatch k-Means is a variant designed to handle large-scale data:

  • Mechanism: Instead of using the entire dataset at each iteration, it uses small, randomly sampled batches (mini-batches) of data to update the centroids.
  • Centroid Update: Centroids are updated incrementally using an exponentially decaying learning rate, combining the new mini-batch data with the existing centroid positions.
  • Trade-offs:
    • Speed: Drastically reduces computation time and memory requirements.
    • Accuracy: Produces slightly lower quality clusters (higher WCSS) compared to standard k-Means, but the difference is usually negligible in practice.

7. Cluster Validation

Because unsupervised learning lacks ground-truth labels, evaluating the quality of clusters requires internal validation metrics.

1. Inertia (WCSS)

  • Definition: The sum of squared distances of samples to their closest cluster center.
  • Interpretation: Lower is better. However, Inertia is not normalized and naturally decreases as increases (reaching 0 if ). It is primarily used for the Elbow Method rather than as an absolute metric of success.

2. Silhouette Coefficient

Measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

  • Formula: For a given point :
    • : Mean distance between point and all other points in the same cluster.
    • : Mean distance between point and all points in the nearest neighboring cluster.
  • Interpretation:
    • Range is [-1, 1].
    • Near +1: The point is far away from neighboring clusters (excellent clustering).
    • Near 0: The point is on or very close to the decision boundary between two neighboring clusters.
    • Near -1: The point is likely assigned to the wrong cluster.

3. Davies–Bouldin Index (DBI)

Measures the average "similarity" between clusters, where similarity is a measure that compares the distance between clusters with the size of the clusters themselves.

  • Formula Logic: It computes the ratio of intra-cluster scatter (cluster diameter) to inter-cluster separation (distance between centroids).
  • Interpretation: The minimum score is 0. Lower values indicate better clustering, signifying that clusters are compact and well-separated.

4. Elbow Method Pitfalls

The Elbow Method involves plotting Inertia (WCSS) against the number of clusters and looking for a "knee" or "elbow" where the rate of decrease sharply shifts.

  • The Pitfalls:
    • Ambiguity: In real-world, noisy data, the curve is often smooth. There is no clear, distinct elbow, making the choice of highly subjective.
    • Overlooking Cluster Structure: It only looks at variance. It doesn't account for overlapping clusters or cluster densities.
    • Best Practice: The Elbow Method should never be used in isolation. It must be combined with the Silhouette Score, DBI, and domain knowledge to determine the optimal .