Unit 4 - Notes

INT234

Unit 4: UNSUPERVISED LEARNING: CLUSTERING AND PATTERN DETECTION

1. Introduction to Unsupervised Learning

Unsupervised learning involves training machine learning models using datasets that are neither classified nor labeled. The algorithm acts on the information without guidance, aiming to discover hidden structures, patterns, or groupings within the data.

Key Characteristics:

  • Input: Raw data without target variables (no "Y" variable).
  • Goal: To model the underlying structure or distribution in the data.
  • Primary Applications: Clustering (grouping), dimensionality reduction, and association rule learning.

2. K-Means Clustering

K-Means is a centroid-based, iterative algorithm that partitions a dataset into distinct, non-overlapping subgroups (clusters). Every data point belongs to only one group.

K-Means Clustering Intuition

The fundamental intuition behind K-Means is to minimize the variance within each cluster (intra-cluster variance) while maximizing the distinction between clusters. It assigns data points to the nearest cluster center (centroid).

The Objective Function:
The algorithm aims to minimize the Within-Cluster Sum of Squares (WCSS), also known as inertia:

  WCSS = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

  • K: Number of clusters.
  • x_i: A data point assigned to cluster C_j.
  • μ_j: The centroid (mean) of cluster C_j.

The Algorithm Steps:

  1. Specify K: Choose the number of clusters.
  2. Initialization: Randomly select K data points as the initial centroids.
  3. Assignment: Assign each data point to the closest centroid (usually using Euclidean distance).
  4. Update: Calculate the new mean (centroid) of the points assigned to each cluster.
  5. Iterate: Repeat steps 3 and 4 until the centroids stop changing position (convergence) or a maximum number of iterations is reached.
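
A minimal sketch of these five steps using scikit-learn's KMeans on synthetic data; the blob parameters, the choice of K = 3, and the variable names are illustrative assumptions, not part of the notes:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic data: 300 points scattered around 3 true centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Steps 1-5: specify K, initialize, then alternate assignment/update until convergence.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10,
                max_iter=300, random_state=42)
labels = kmeans.fit_predict(X)        # cluster index assigned to every point
centroids = kmeans.cluster_centers_   # final centroid coordinates
wcss = kmeans.inertia_                # Within-Cluster Sum of Squares

print(f"WCSS: {wcss:.2f}")
```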

K-Means Random Initialization Trap

A major limitation of standard K-Means is its sensitivity to the initial placement of centroids.

  • The Problem: If the initial centroids are placed poorly (e.g., two centroids within the same actual cluster), the algorithm may converge to a local minimum rather than the global minimum. This results in suboptimal clustering.
  • The Solution (K-Means++): This is an initialization algorithm used to select the starting centroids.
    1. Choose the first centroid at random from the data points.
    2. Compute the distance of all points to the nearest existing centroid.
    3. Choose the next centroid from the remaining points such that the probability of choosing a point is proportional to the square of its distance to the nearest existing centroid (favoring points far away).
    4. Repeat until K centroids are chosen.
      • Result: This ensures centroids are spread out initially, leading to faster convergence and better results.
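
The seeding procedure can be sketched in a few lines of NumPy; the function name kmeans_pp_init and the data handling are hypothetical, for illustration only:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Sketch of K-Means++ seeding: pick initial centroids spread far apart."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]          # 1. first centroid chosen at random
    for _ in range(k - 1):
        # 2. squared distance of each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # 3. sample the next centroid with probability proportional to d2
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)                # 4. k centroids chosen

```

In scikit-learn this seeding is the default behaviour: KMeans uses init="k-means++" unless told otherwise.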

Selecting the Number of Clusters (K)

Since K is a hyperparameter, it must be determined before training.

1. The Elbow Method

This is a graphical method to find the optimal K.

  • Procedure: Run K-Means for a range of K values (e.g., 1 to 10). Calculate the WCSS for each K.
  • Plot: Plot K (x-axis) vs. WCSS (y-axis).
  • Interpretation: As K increases, WCSS decreases (eventually reaching 0 when K equals the number of data points). The optimal K is the point where the rate of decrease sharply shifts, forming the "elbow" of the curve. This indicates that adding more clusters provides diminishing returns in explaining variance.
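
A rough sketch of the Elbow Method with scikit-learn and matplotlib; the synthetic data and the range 1 to 10 are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # illustrative data

wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)            # WCSS for this K

plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()                              # look for the 'elbow' in the curve
```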

2. Silhouette Score

Measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

  • Range: -1 to +1.
    • +1: Perfectly clustered (close to own centroid, far from neighbors).
    • 0: Point is on the decision boundary between two clusters.
    • -1: Point is assigned to the wrong cluster.
  • Selection: Pick the K that maximizes the average Silhouette Score across all points.
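
Silhouette-based selection follows the same pattern; this sketch again uses made-up blob data, and the score is undefined for K = 1:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # illustrative data

scores = {}
for k in range(2, 11):                  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean silhouette over all points

best_k = max(scores, key=scores.get)
print("Best K by silhouette:", best_k)
```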

3. Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters. Unlike K-Means, you do not need to pre-specify the number of clusters. The result is a tree-like representation called a Dendrogram.

Types of Hierarchical Clustering

1. Agglomerative (Bottom-Up)

This is the most common approach.

  • Step 1: Treat each data point as a single cluster (N clusters for N points).
  • Step 2: Identify the two clusters that are closest together.
  • Step 3: Merge the two most similar clusters into a single cluster.
  • Step 4: Repeat steps 2 and 3 until only one cluster remains (containing all data points).
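
A minimal agglomerative sketch with scikit-learn; the synthetic data and the choice of 3 clusters are illustrative (n_clusters simply tells the algorithm where to stop merging):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Bottom-up merging; stop when 3 clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])                      # cluster index of the first 10 points
```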

2. Divisive (Top-Down)

This is the inverse of Agglomerative.

  • Step 1: Start with all data points in one giant cluster.
  • Step 2: Recursively split each cluster into the two least similar sub-clusters.
  • Step 3: Continue until each data point is its own cluster.

The Dendrogram

A diagram representing the tree of clusters.

  • X-axis: The data points.
  • Y-axis: The Euclidean distance at which clusters merge.
  • Cutting the Tree: To determine the number of clusters, draw a horizontal line across the dendrogram. The number of vertical lines it intersects is the number of clusters at that distance threshold. The optimal cut is usually placed through the tallest vertical gap that no horizontal merge line crosses.
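
Dendrograms are commonly drawn with SciPy; in this sketch the cut at distance 10 is an arbitrary, illustrative threshold:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

Z = linkage(X, method="ward")           # full merge history (the hierarchy)

dendrogram(Z)                           # x-axis: points, y-axis: merge distance
plt.axhline(y=10, color="r", linestyle="--")   # 'cutting' the tree at distance 10
plt.show()

labels = fcluster(Z, t=10, criterion="distance")   # clusters below the cut
print("Clusters at this cut:", len(set(labels)))
```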

Types of Linkages

When merging clusters in Agglomerative clustering, we must define how to measure the "distance" between two clusters (which may contain multiple points).

  1. Single Linkage (Nearest Neighbor):

    • Distance between the two closest points in the two clusters.
    • Pros: Can handle non-elliptical shapes.
    • Cons: Susceptible to the "chaining" effect (long, stringy clusters) and sensitive to outliers.
  2. Complete Linkage (Farthest Neighbor):

    • Distance between the two farthest points in the two clusters.
    • Pros: Produces more compact, spherical clusters.
    • Cons: Sensitive to outliers.
  3. Average Linkage:

    • Average distance between all pairs of points, where one point is in cluster A and the other is in cluster B.
    • Pros: Robust to outliers; a compromise between Single and Complete linkage.
  4. Centroid Linkage:

    • Distance between the centroids (means) of the two clusters.
    • Pros: Mathematically intuitive.
    • Cons: Can result in "inversions" (where the merge height in the dendrogram is lower than the previous step), which makes visualization difficult.
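
SciPy's linkage function exposes all four strategies through its method argument; a short comparison sketch on made-up data, using the silhouette score only as a rough quality check:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=7)

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")   # force 3 clusters
    print(f"{method:>9}: silhouette = {silhouette_score(X, labels):.3f}")
```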

4. Association Rules and Pattern Detection

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases.

Finding Patterns

The goal is to find regularities in data.

  • If-Then rules: If {A} (the antecedent), Then {B} (the consequent).
  • Example: "If a customer buys Bread and Milk, they are likely to buy Eggs."

Market Basket Analysis

This is the most famous application of Association Rules. Retailers use it to identify items that are frequently purchased together.

  • Use Cases: Store layout optimization (placing beer near diapers), cross-selling, recommendation engines (Amazon's "Customers who bought this also bought...").

Key Metrics in Association Rules

To determine if a rule is strong or just a coincidence, we use three key metrics:

1. Support

Indicates how frequently the itemset appears in the dataset. It filters out infrequent items.

  \text{Support}(A) = \frac{\text{Transactions containing } A}{\text{Total transactions}}

2. Confidence

Measures the reliability of the rule. Given that item A is purchased, what is the probability that item B is also purchased? (Conditional Probability).

  \text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)} = P(B \mid A)

  • Note: High confidence alone can be misleading if B is a very popular item that everyone buys anyway.

3. Lift

The most critical metric. It measures the strength of the association independent of the popularity of item B, by comparing the observed frequency of A and B appearing together to the frequency expected if A and B were independent.

  \text{Lift}(A \Rightarrow B) = \frac{\text{Confidence}(A \Rightarrow B)}{\text{Support}(B)} = \frac{\text{Support}(A \cup B)}{\text{Support}(A)\,\text{Support}(B)}

  • Lift = 1: No association (Items are independent).
  • Lift > 1: Positive association (Items are likely to be bought together).
  • Lift < 1: Negative association (Items are unlikely to be bought together; e.g., substitute goods).
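
These three metrics can be computed by hand; the tiny basket list below is made up purely for illustration:

```python
# Illustrative transactions (each set is one customer's basket).
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"bread", "milk", "eggs"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / N

A, B = {"bread", "milk"}, {"eggs"}
supp_ab = support(A | B)
confidence = supp_ab / support(A)       # P(B | A)
lift = confidence / support(B)          # > 1 means positive association

print(f"Support(A∪B)    = {supp_ab:.2f}")
print(f"Confidence(A→B) = {confidence:.2f}")
print(f"Lift(A→B)       = {lift:.2f}")
```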

The Apriori Algorithm

A classic algorithm used to generate association rules efficiently.

  1. Apriori Principle: If an itemset is frequent, then all of its non-empty subsets must also be frequent. Conversely, if an itemset is infrequent, all its supersets will be infrequent.
  2. Process:
    • Set a minimum support and confidence threshold.
    • Find all frequent 1-itemsets.
    • Generate 2-itemsets from the frequent 1-itemsets and prune those below the support threshold.
    • Repeat for k-itemsets until no more frequent itemsets can be found.
    • Generate rules from the frequent itemsets that meet the confidence threshold.
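
A sketch of the full Apriori pipeline, assuming the third-party mlxtend library is available; the baskets and the support/confidence thresholds are illustrative:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative baskets.
transactions = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "eggs"],
    ["bread", "milk", "eggs"],
]

# One-hot encode the baskets into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets above the minimum support threshold.
frequent = apriori(onehot, min_support=0.4, use_colnames=True)

# Rules from those itemsets that also meet the minimum confidence threshold.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```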