Unit 3 - Notes
INT423
Unit 3: Clustering Metrics
Clustering evaluation is critical in determining the quality of the clusters generated by an algorithm. Unlike supervised learning, clustering often lacks "ground truth" labels. Therefore, metrics are divided into two primary categories:
- Internal Evaluation: Used when true labels are unknown (Unsupervised). Evaluates based on cluster cohesion and separation.
- External Evaluation: Used when true class labels are known (Supervised). Evaluates how well the clustering matches the pre-existing structure.
Part 1: Internal Evaluation Metrics
Used when ground truth labels are not available.
1. Silhouette Score
The Silhouette Score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
- Mechanism: For each point $i$:
  - Calculate $a(i)$: the average distance between $i$ and all other points in the same cluster (intra-cluster distance).
  - Calculate $b(i)$: the average distance between $i$ and all points in the nearest neighboring cluster (inter-cluster distance).
  - The Silhouette Coefficient is:
    $$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$
- Interpretation:
  - Range: $[-1, 1]$
  - +1: The point is perfectly clustered (far from neighboring clusters, close to its own).
  - 0: The point is on or very close to the decision boundary between two clusters.
  - -1: The point has likely been assigned to the wrong cluster.
- Pros: Easy to interpret; visualizes cluster density and separation.
- Cons: Computationally expensive ($O(n^2)$); tends to score higher for convex (spherical) clusters and lower for density-based structures (like DBSCAN results).
Python Implementation:
```python
from sklearn.metrics import silhouette_score

# X is the dataset; labels are the predicted cluster assignments
score = silhouette_score(X, labels)
```
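Beyond scoring a single clustering, the average silhouette is commonly used to pick the number of clusters $k$. A minimal sketch (the synthetic `make_blobs` data and candidate range are illustrative assumptions, not part of these notes):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated blobs (assumed for illustration).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Score each candidate k; the highest average silhouette suggests the best k.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On clean blob data the winning $k$ tends to match the true number of centers; on real data, the silhouette curve is usually flatter and the peak less decisive.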
2. Davies-Bouldin Index (DBI)
The Davies-Bouldin Index represents the average "similarity" between clusters, where similarity is a measure that compares the distance between clusters with the size of the clusters themselves.
- Mechanism:
  - $S_i$: Average distance between each point of cluster $i$ and the centroid of cluster $i$ (cluster diameter/dispersion).
  - $M_{ij}$: Distance between the centroids of clusters $i$ and $j$.
  - $R_{ij}$: Similarity ratio $\dfrac{S_i + S_j}{M_{ij}}$.
  - The index is the average of the maximum $R_{ij}$ for each cluster:
    $$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} R_{ij}$$
- Interpretation:
  - Range: $[0, \infty)$
  - Lower Score: Better clustering. A score of 0 indicates perfectly separated clusters with zero scatter.
  - Logic: We want low intra-cluster scatter ($S_i$) and high inter-cluster distance ($M_{ij}$).
- Pros: Simpler computation than the Silhouette Score.
- Cons: Limits the analysis to centroid-based distances; generally assumes spherical clusters.
Python Implementation:
```python
from sklearn.metrics import davies_bouldin_score

score = davies_bouldin_score(X, labels)
```
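A quick sanity check of the "lower is better" interpretation, on hand-made points (the data and labelings are illustrative, not from the notes):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Two tight, well-separated groups of points.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]])

good = [0, 0, 0, 1, 1, 1]  # labels that match the geometry
bad = [0, 1, 0, 1, 0, 1]   # labels that ignore the geometry

print(davies_bouldin_score(X, good))  # low: compact clusters, far apart
print(davies_bouldin_score(X, bad))   # high: dispersed, overlapping clusters
```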
3. Dunn Index
The Dunn Index aims to identify dense and well-separated clusters. It is defined as the ratio of the minimal inter-cluster distance to the maximal intra-cluster distance.
- Formula:
  $$D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)}$$
  - Numerator: Minimum distance between any two different clusters (separation).
  - Denominator: Maximum diameter of any cluster (compactness).
- Interpretation:
  - Range: $[0, \infty)$
  - Higher Score: Better clustering. It implies clusters are compact (small denominator) and far apart (large numerator).
- Pros: Does not rely on centroids/means.
- Cons: Highly sensitive to noise (outliers can artificially increase the diameter or decrease the separation); computationally expensive.
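scikit-learn does not ship a Dunn Index function, so here is a minimal NumPy/SciPy sketch following the definition above (the function name `dunn_index` and the toy data are ours):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Min inter-cluster distance divided by max intra-cluster diameter."""
    X, labels = np.asarray(X), np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Denominator: largest pairwise distance within any single cluster.
    max_diameter = max(pdist(c).max() for c in clusters if len(c) > 1)
    # Numerator: smallest distance between points of two different clusters.
    min_separation = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_separation / max_diameter

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
print(dunn_index(X, [0, 0, 1, 1]))  # large: tight clusters, far apart
```

Note how the single max/min structure makes the index fragile: one outlier in any cluster changes the denominator for the whole score.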
Part 2: External Evaluation Metrics
Used when ground truth labels are available.
4. Adjusted Rand Index (ARI)
The Rand Index (RI) computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings. The ARI "adjusts" the RI for chance.
- Components:
  - $a$ (TP): Number of pairs of elements that are in the same set in the True labels and in the same set in the Predicted labels.
  - $b$ (TN): Pairs in different sets in True and in different sets in Predicted.
  - RI: $RI = \dfrac{a + b}{\binom{n}{2}}$
  - ARI: $ARI = \dfrac{RI - E[RI]}{\max(RI) - E[RI]}$, adjusted so that a random clustering gets a score of 0.
- Interpretation:
  - Range: $[-1, 1]$
  - +1: Perfect labeling.
  - 0: Random labeling (independent of the number of clusters).
  - Negative: Worse than random labeling.
- Pros: Interpretable; works well when the number of clusters is large; symmetric.
Python Implementation:
```python
from sklearn.metrics import adjusted_rand_score

score = adjusted_rand_score(labels_true, labels_pred)
```
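To see the chance adjustment in action, compare ARI with the unadjusted Rand Index on two independent random labelings (illustrative; `rand_score` is available in scikit-learn >= 0.24):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, rand_score

rng = np.random.default_rng(0)
labels_true = rng.integers(0, 3, size=1000)
labels_pred = rng.integers(0, 3, size=1000)  # independent of labels_true

ari = adjusted_rand_score(labels_true, labels_pred)
ri = rand_score(labels_true, labels_pred)
print(round(ari, 3), round(ri, 3))  # ARI near 0; raw RI well above 0
```

The raw RI stays high for random labels because most pairs land in "different clusters" in both labelings by chance; the adjustment subtracts exactly that baseline.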
5. Normalized Mutual Information (NMI)
Based on the concepts of Entropy ($H$) and Mutual Information ($MI$). MI measures the reduction in uncertainty about one clustering given knowledge of the other.
- Formula:
  $$NMI(U, V) = \frac{MI(U, V)}{\operatorname{mean}(H(U), H(V))}$$
  Where $MI$ is Mutual Information and $H$ is Entropy (scikit-learn's default normalizer is the arithmetic mean of the two entropies).
- Interpretation:
  - Range: $[0, 1]$
  - 1: Perfect correlation between true labels and predicted clusters.
  - 0: No mutual information (independent labelings).
- Pros: Independent of the absolute values of the labels (only the structure matters).
- Cons: Not adjusted for chance (unlike AMI), meaning it tends to increase as the number of clusters increases, even for random data.
Python Implementation:
```python
from sklearn.metrics import normalized_mutual_info_score

score = normalized_mutual_info_score(labels_true, labels_pred)
```
6. Homogeneity, 7. Completeness, and 8. V-measure
These three metrics are derived from conditional entropy analysis and are often analyzed together.
6. Homogeneity
- Definition: A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.
- Concept: "Each cluster should be pure."
- Formula: $h = 1 - \dfrac{H(C \mid K)}{H(C)}$
7. Completeness
- Definition: A clustering result satisfies completeness if all the data points that are members of a given class are assigned to the same cluster.
- Concept: "All members of class A should be in cluster 1."
- Formula: $c = 1 - \dfrac{H(K \mid C)}{H(K)}$
8. V-measure
- Definition: The harmonic mean of Homogeneity and Completeness. It is analogous to the F1-score in classification, but for clustering.
- Formula:
  $$V = \frac{2 \cdot h \cdot c}{h + c}$$
  (Note: Weighted variants exist, denoted $V_\beta$, to prioritize homogeneity or completeness.)
- Interpretation: All three metrics range over $[0, 1]$; 1 is perfect.
- Pros: Intuitive breakdown of error types (is the cluster mixed, or is the class split?).
- Cons: Like NMI, V-measure is not adjusted for chance.
Python Implementation:
```python
from sklearn.metrics import homogeneity_completeness_v_measure

h, c, v = homogeneity_completeness_v_measure(labels_true, labels_pred)
```
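A small example of the two failure modes, over-splitting versus over-merging, using hand-made labels (illustrative, not from the notes):

```python
from sklearn.metrics import homogeneity_completeness_v_measure

labels_true = [0, 0, 0, 0, 1, 1, 1, 1]

# Over-splitting: class 0 is spread across clusters 0 and 1. Every
# cluster is still pure (h = 1), but completeness suffers (c < 1).
over_split = [0, 0, 1, 1, 2, 2, 2, 2]
h, c, v = homogeneity_completeness_v_measure(labels_true, over_split)

# Over-merging: everything lands in one cluster. Each class stays
# together (c = 1), but the single cluster is mixed (h = 0).
merged = [0, 0, 0, 0, 0, 0, 0, 0]
h2, c2, v2 = homogeneity_completeness_v_measure(labels_true, merged)

print(h, c)    # 1.0, below 1.0
print(h2, c2)  # 0.0, 1.0
```

This asymmetry is exactly why the two scores are reported together: either one alone can be gamed by a degenerate clustering.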
9. Fowlkes-Mallows Index (FMI)
The Fowlkes-Mallows score is defined as the geometric mean of the pairwise precision and recall.
- Mechanism:
  - TP (True Positive): Pairs of points that belong to the same cluster in both the True and Predicted labels.
  - FP (False Positive): Pairs in the same Predicted cluster but different True clusters.
  - FN (False Negative): Pairs in different Predicted clusters but the same True cluster.
- Formula:
  $$FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$$
- Interpretation:
  - Range: $[0, 1]$
  - 1: Perfect agreement.
  - 0: Purely random assignment.
- Pros: Very effective when class sizes are unbalanced.
Python Implementation:
```python
from sklearn.metrics import fowlkes_mallows_score

score = fowlkes_mallows_score(labels_true, labels_pred)
```
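The pair counts above can be computed by brute force and checked against scikit-learn (the two labelings are illustrative):

```python
from itertools import combinations
from sklearn.metrics import fowlkes_mallows_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 1, 1]

# Classify every pair of points by whether they share a cluster
# in each labeling.
tp = fp = fn = 0
for i, j in combinations(range(len(labels_true)), 2):
    same_true = labels_true[i] == labels_true[j]
    same_pred = labels_pred[i] == labels_pred[j]
    if same_pred and same_true:
        tp += 1
    elif same_pred:
        fp += 1
    elif same_true:
        fn += 1

fmi_manual = tp / ((tp + fp) * (tp + fn)) ** 0.5
print(fmi_manual, fowlkes_mallows_score(labels_true, labels_pred))
```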
10. Adjusted Mutual Information (AMI)
AMI is an adjustment of the Mutual Information (MI) score to account for chance. It is similar to ARI but based on information theory rather than pair counting.
- Why Adjustment is Needed: Standard NMI tends to favor clusterings with a higher number of clusters. If you have as many clusters as data points, NMI can reach 1.0 (depending on the normalizer) even though the clustering is useless. AMI corrects this bias.
- Formula:
  $$AMI(U, V) = \frac{MI(U, V) - E[MI(U, V)]}{\operatorname{mean}(H(U), H(V)) - E[MI(U, V)]}$$
  Where $E[MI]$ is the expected mutual information under a random permutation model.
Interpretation:
- Range: (Though typically 0 to 1).
- 1: Perfect match.
- 0: Random clustering.
- Pros: Best suited for comparing clustering algorithms with different numbers of clusters; robust against cluster size imbalance.
- Cons: Computationally more intensive than NMI.
Python Implementation:
```python
from sklearn.metrics import adjusted_mutual_info_score

score = adjusted_mutual_info_score(labels_true, labels_pred)
```
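The singleton-cluster pathology described above, made concrete (illustrative; we pass `average_method="min"` so that NMI hits exactly 1.0 here, while scikit-learn's default arithmetic mean would still be substantially inflated):

```python
from sklearn.metrics import (adjusted_mutual_info_score,
                             normalized_mutual_info_score)

labels_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
singletons = list(range(9))  # every point in its own cluster: useless

nmi = normalized_mutual_info_score(labels_true, singletons,
                                   average_method="min")
ami = adjusted_mutual_info_score(labels_true, singletons)
print(nmi, ami)  # NMI = 1.0 (misleading); AMI near 0
```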
Summary Comparison Table
| Metric | Type | Ground Truth Req? | Range | Best for... |
|---|---|---|---|---|
| Silhouette | Internal | No | -1 to 1 | Determining optimal $k$; spherical clusters. |
| Davies-Bouldin | Internal | No | 0 to $\infty$ | Fast evaluation; centroid-based clusters. |
| Dunn | Internal | No | 0 to $\infty$ | Detecting compact, well-separated clusters. |
| ARI | External | Yes | -1 to 1 | General purpose; comparing algorithms. |
| NMI | External | Yes | 0 to 1 | Text clustering; entropy-based analysis. |
| V-Measure | External | Yes | 0 to 1 | Analyzing homogeneity vs completeness balance. |
| Fowlkes-Mallows | External | Yes | 0 to 1 | Unbalanced class sizes. |
| AMI | External | Yes | 0 to 1 | Comparing solutions with different $k$. |