Unit 1 - Notes
INT423
Unit 1: Introduction to Unsupervised Learning & K-Means
1. Introduction to Unsupervised Learning
Definition
Unsupervised learning is a paradigm in machine learning where the algorithm learns patterns from unlabeled data. Unlike supervised learning, where the model is provided with input-output pairs $(x, y)$ and trained to predict $y$ from $x$, unsupervised learning deals with input data that has no corresponding labels.
Core Objectives
The primary goal is to model the underlying structure or distribution of the data. The algorithm must navigate the data without external guidance (labels) to find:
- Hidden Structures: Grouping data based on similarities.
- Data Density: Estimating the distribution of data points.
- Dimensionality Reduction: Compressing information while retaining essential features.
Comparison: Supervised vs. Unsupervised
| Feature | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Input Data | Labeled data | Unlabeled data |
| Goal | Prediction / Classification / Regression | Discovery of patterns / Structure / Clustering |
| Feedback | Direct feedback (loss based on actual vs. predicted) | No direct feedback; evaluation is often subjective or based on internal metrics |
| Examples | Linear Regression, Logistic Regression, Neural Networks | K-Means, PCA, Apriori Algorithm |
2. What is Clustering?
Clustering is the most common application of unsupervised learning. It involves the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups.
Key Characteristics
- Intra-cluster similarity: Points within a cluster should be close together (high density).
- Inter-cluster difference: Points in different clusters should be far apart.
Common Use Cases
- Market Segmentation: Grouping customers based on purchasing behavior or demographics to target marketing campaigns.
- Social Network Analysis: Identifying communities within a social graph.
- Image Segmentation: Grouping pixels with similar colors or textures.
- Anomaly Detection: Identifying data points that do not fit well into any cluster (outliers).
3. K-Means Intuition
K-Means is an iterative clustering algorithm that aims to partition the observations into $K$ clusters.
The "Centroid" Concept
The core intuition behind K-Means relies on Centroids (cluster centers).
- Imagine a 2D scatter plot of data points.
- We want to find specific points (centroids) that represent the center of each cluster.
- A data point belongs to the cluster associated with the nearest centroid.
- The centroid is then recalculated as the average (mean) position of all the points assigned to it.
This process creates a feedback loop:
- Assignment depends on centroid position.
- Centroid position depends on assignment.
- The algorithm repeats until stability is reached.
4. The K-Means Algorithm
Algorithm Inputs
- $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$: Training set, where each $x^{(i)} \in \mathbb{R}^n$.
- $K$: The number of clusters to form.
The Iterative Process
Step 1: Initialization
Randomly initialize $K$ cluster centroids: $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$.
Step 2: Cluster Assignment Step
For each training example $x^{(i)}$, assign it to the closest centroid.
Mathematically, we choose the cluster index that minimizes the squared Euclidean distance:
$$c^{(i)} := \arg\min_{k} \left\| x^{(i)} - \mu_k \right\|^2$$
Step 3: Move Centroid Step
For each cluster $k$ (from $1$ to $K$), move the centroid to the average (mean) of the points assigned to that cluster:
$$\mu_k := \frac{1}{|C_k|} \sum_{x^{(i)} \in C_k} x^{(i)}$$
Where $C_k$ is the set of examples currently assigned to cluster $k$.
Step 4: Convergence
Repeat Steps 2 and 3 until convergence. Convergence occurs when:
- The centroids do not change position between iterations.
- The assignments ($c^{(i)}$) do not change.
- Or a maximum number of iterations is reached (a sketch of the full loop follows this list).
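The four steps map almost line-for-line onto code. Below is a minimal NumPy sketch; the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic K-Means: X is an (m, n) data matrix, K is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as K randomly chosen training examples
    centroids = X[rng.choice(len(X), size=K, replace=False)]

    for _ in range(max_iters):
        # Step 2: cluster assignment - index of the nearest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (m, K)
        c = dists.argmin(axis=1)

        # Step 3: move each centroid to the mean of its assigned points
        # (if a cluster ends up empty, keep its old centroid)
        new_centroids = np.array([
            X[c == k].mean(axis=0) if np.any(c == k) else centroids[k]
            for k in range(K)
        ])

        # Step 4: convergence check - stop when centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, c
```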
5. Optimization Objective (Cost Function)
K-Means is not just moving points randomly; it is mathematically optimizing a specific cost function. This cost function is often called the Distortion Function or Inertia.
The Variables
- $c^{(i)}$: Index of the cluster (from $1$ to $K$) to which example $x^{(i)}$ is currently assigned.
- $\mu_k$: The centroid of cluster $k$.
- $\mu_{c^{(i)}}$: The centroid of the cluster to which example $x^{(i)}$ is assigned.
The Objective Function
$$J\left(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K\right) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
Goal
The goal of the K-Means algorithm is to minimize $J$:
- Cluster Assignment Step minimizes $J$ with respect to $c^{(1)}, \dots, c^{(m)}$ (holding $\mu_1, \dots, \mu_K$ constant).
- Move Centroid Step minimizes $J$ with respect to $\mu_1, \dots, \mu_K$ (holding $c^{(1)}, \dots, c^{(m)}$ constant).
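The distortion can be computed directly from the assignments and centroids. A small sketch, assuming the same array shapes as the earlier snippet:

```python
import numpy as np

def distortion(X, c, centroids):
    """Average squared distance between each point and the centroid of its assigned cluster."""
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
```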
6. Initializing K-Means
The outcome of K-Means can depend heavily on the initial placement of the centroids.
The Local Optima Problem
If initialized poorly, K-Means can get stuck in a local optimum, where the cost function is minimized locally but is not the lowest possible value (global minimum). This often results in split clusters or merged clusters that don't represent the data well.
Random Initialization
The standard method involves randomly picking $K$ training examples from the dataset and setting the $K$ initial centroids equal to these examples. However, this is risky: an unlucky draw can place two initial centroids inside the same natural cluster.
Solution 1: Multiple Random Initializations
To avoid local optima:
- Run K-Means multiple times (e.g., 50 to 1000 times) with different random initializations.
- Compute the cost function (Distortion $J$) for each run.
- Select the clustering result that yields the lowest cost $J$ (see the sketch below).
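A short sketch of this loop, assuming the illustrative kmeans and distortion helpers from the earlier snippets. In scikit-learn, the same behaviour comes built in via KMeans(n_clusters=K, init="random", n_init=50), which keeps the run with the lowest inertia.

```python
def kmeans_best_of(X, K, n_init=50):
    """Run K-Means n_init times with different random seeds and keep the lowest-cost result."""
    best_J, best_result = float("inf"), None
    for seed in range(n_init):
        centroids, c = kmeans(X, K, seed=seed)   # illustrative helper defined earlier
        J = distortion(X, c, centroids)          # illustrative helper defined earlier
        if J < best_J:
            best_J, best_result = J, (centroids, c)
    return best_J, best_result
```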
Solution 2: K-Means++ (Industry Standard)
K-Means++ is a smarter initialization algorithm designed to choose centroids that are far away from each other.
- Choose the first centroid uniformly at random from the data points.
- For each data point $x$, compute $D(x)$, the distance between $x$ and the nearest centroid that has already been chosen.
- Choose the next centroid using a weighted probability distribution where a point is chosen with probability proportional to $D(x)^2$ (see the sketch after this list).
- Repeat until $K$ centroids are chosen.
- Proceed with standard K-Means.
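A sketch of the K-Means++ seeding step in NumPy. This standalone version is illustrative only; in scikit-learn the same idea is available as the default init="k-means++".

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Pick K initial centroids that tend to be spread far apart (K-Means++ seeding)."""
    rng = np.random.default_rng(seed)
    # 1. First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # 2. D(x)^2: squared distance from each point to its nearest already-chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # 3. Sample the next centroid with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```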
7. Hard versus Soft Clustering
This distinction refers to how strictly a data point is assigned to a cluster.
Hard Clustering
- Definition: Each data point belongs to exactly one cluster.
- Method: This is the standard K-Means approach.
- Mathematical representation: each example is given a single cluster index $c^{(i)} \in \{1, 2, \dots, K\}$; equivalently, its membership weight is 1 for one cluster and 0 for all others.
- Example: A customer is segmented as "High Spender." They cannot be "High Spender" and "Low Spender" simultaneously.
Soft Clustering (Fuzzy Clustering)
- Definition: Each data point has a probability or a "degree of membership" of belonging to each cluster.
- Method: Fuzzy C-Means or Gaussian Mixture Models (GMM).
- Mathematical representation: A point might belong to Cluster A with 0.7 probability and Cluster B with 0.3 probability.
- Example: A news article might be 60% "Politics" and 40% "Economics."
- Relation to K-Means: K-Means can be converted to soft clustering by calculating the distance to all centroids and converting those distances into probabilities (e.g., using Softmax).
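One illustrative way to produce such soft memberships from K-Means-style distances, as described in the last bullet. This is a rough sketch only; it is not how Fuzzy C-Means or GMMs actually estimate memberships.

```python
import numpy as np

def soft_memberships(X, centroids, temperature=1.0):
    """Convert distances to each centroid into a per-point softmax membership distribution."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (m, K)
    # Softmax over negative distances: closer centroids receive higher membership
    logits = -dists / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
```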
8. Choosing the Number of Clusters ($K$)
One of the main challenges in K-Means is that $K$ is a hyperparameter; it is not learned from the data itself. The algorithm requires you to define $K$ before it starts.
Methods for Choosing $K$:
- Domain Knowledge / Business Requirements:
- Example: A T-shirt manufacturer wants to size shirts. They specifically need 3 clusters (Small, Medium, Large) to fit their manufacturing process.
- Visualization:
- If data is 2D or 3D, visual inspection can suggest natural groupings.
- The Elbow Method: (Detailed below).
- Silhouette Score: (Advanced metric evaluating the separation distance between the resulting clusters; see the sketch below).
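A short sketch of comparing candidate values of $K$ with scikit-learn's silhouette_score (higher is better); the helper name and the range of $K$ values tried are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 11)):
    """Return the K in k_range with the highest mean silhouette score, plus all scores."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```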
9. The Elbow Method
The Elbow Method is a heuristic used to determine a reasonable number of clusters by trading off the reduction in the cost function against the complexity of adding more clusters.
The Concept
As the number of clusters $K$ increases, the average distortion (Inertia / WCSS, the Within-Cluster Sum of Squares) will always decrease.
- If $K = m$ (the number of data points), the distortion is 0 (every point is its own cluster).
- However, we want the smallest $K$ that still provides a low distortion.
Procedure
- Run K-Means for a range of values of $K$ (e.g., $K = 1$ to $K = 10$).
- For each $K$, calculate the final Cost Function value (Total WCSS).
- Plot $K$ (x-axis) versus WCSS/Cost (y-axis), as in the sketch below.
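A sketch of this procedure with scikit-learn and matplotlib, using the fitted model's inertia_ attribute (the within-cluster sum of squares) as the cost.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_max=10):
    """Fit K-Means for K = 1..k_max and plot K versus the within-cluster sum of squares."""
    ks = range(1, k_max + 1)
    wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(list(ks), wcss, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("WCSS (inertia)")
    plt.title("Elbow Method")
    plt.show()
```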
Interpreting the Plot
- The plot usually resembles an arm.
- The Elbow Point: Look for the point where the curve bends most sharply, where the drop in distortion changes from rapid to slow.
- Interpretation: Before the elbow, adding a cluster drastically reduces variance (a good value of $K$). After the elbow, adding a cluster only marginally reduces variance (diminishing returns, potentially overfitting).
Limitations
- Sometimes the curve is smooth and has no clear "elbow." In such cases, the method is ambiguous, and domain expertise becomes the deciding factor.