Unit 2 - Notes
INT423
Unit 2: Advanced Clustering Techniques
1. Anomaly Detection Algorithms
Anomaly detection (or outlier detection) identifies data points, events, or observations that deviate so significantly from the dataset's majority that they arouse suspicion of being generated by a different mechanism.
Isolation Forest (iForest)
Isolation Forest is an unsupervised learning algorithm that identifies anomalies by isolating outliers in the data. Unlike other methods that profile normal data points to identify anomalies (e.g., density-based or distance-based), iForest explicitly isolates anomalies.
Core Concept
The algorithm is built on two properties of anomalies:
- Few: They are a minority, consisting of far fewer instances than the normal points.
- Different: They have attribute-values that are very different from those of normal instances.
Because of these properties, anomalies are susceptible to isolation.
Working Mechanism
- Random Partitioning: The algorithm recursively partitions the dataset by randomly selecting an attribute and then randomly selecting a split value between the maximum and minimum values of that attribute.
- Tree Construction: This recursive process produces a tree structure (iTree).
- Path Length:
- Normal points lie deep within the tree (require many cuts to isolate).
- Anomalies lie close to the root of the tree (require very few cuts to isolate).
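The partitioning idea can be illustrated with a toy 1-D sketch (not the full multi-attribute iForest; the function name and data below are made up): a point far from the bulk of the data is usually separated after only one or two random cuts, while a point inside the bulk needs several more.

```python
import random

def isolation_depth(values, x, depth=0, max_depth=50):
    """Number of random cuts needed to isolate x within `values` (toy 1-D iTree)."""
    if len(values) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)                                 # random split value
    same_side = [v for v in values if (v < split) == (x < split)]  # keep x's side only
    return isolation_depth(same_side, x, depth + 1, max_depth)

data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 45.0]   # 45.0 is the obvious outlier
print(isolation_depth(data, 45.0))   # usually isolated after 1-2 cuts
print(isolation_depth(data, 10.0))   # typically needs several more cuts
```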
The Anomaly Score
The anomaly score of an instance $x$ is defined based on its average path length over a forest of trees:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

Where:
- $h(x)$ is the path length of instance $x$, and $E(h(x))$ is its average over the forest of iTrees.
- $c(n)$ is the average path length of an unsuccessful search in a Binary Search Tree with $n$ nodes (normalization factor).
- Interpretation:
- If $s$ is close to 1, the instance is likely an anomaly.
- If $s$ is much smaller than 0.5, the instance is likely normal.
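A minimal sketch of this score in code, assuming the standard normalization $c(n) = 2H(n-1) - \frac{2(n-1)}{n}$ with the harmonic number approximated as $H(i) \approx \ln(i) + 0.5772$ (function names are illustrative, not from any library):

```python
import math

def c(n):
    """Average path length of an unsuccessful BST search on n points."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649   # H(n-1) via Euler-Mascheroni approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))

# In a sample of 256 points: ~2 cuts gives a score well above 0.5 (likely anomaly),
# while ~10 cuts gives a score close to 0.5 (likely normal).
print(round(anomaly_score(2.0, 256), 3))    # ~0.87
print(round(anomaly_score(10.0, 256), 3))   # ~0.51
```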
Advantages
- Efficiency: It has a linear time complexity (roughly O(n)), making it scalable to large datasets.
- No Distance Calculation: Unlike K-Means or KNN, it does not rely on computationally expensive distance calculations.
- Handling High Dimensions: It performs reasonably well in high-dimensional feature spaces compared to distance-based methods.
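In practice a library implementation is typically used. A minimal sketch with scikit-learn's IsolationForest on synthetic data (the contamination value is an assumption about the expected anomaly fraction):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))   # dense "normal" cloud
outliers = rng.uniform(low=-6, high=6, size=(10, 2))     # scattered anomalies
X = np.vstack([normal, outliers])

# contamination = expected fraction of anomalies in the data
clf = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
labels = clf.fit_predict(X)       # +1 = normal, -1 = anomaly
scores = clf.score_samples(X)     # lower (more negative) = more anomalous

print("flagged as anomalies:", int(np.sum(labels == -1)))
```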
2. Anomaly Detection vs. Supervised Learning
Understanding when to use anomaly detection versus standard supervised classification is critical for system design.
| Feature | Anomaly Detection | Supervised Learning |
|---|---|---|
| Label Availability | Unlabeled (Unsupervised) or Semi-supervised (mostly normal data, very few labeled anomalies). | Labeled. Both positive and negative examples are available and labeled. |
| Class Balance | Extremely Skewed. Positive examples (anomalies) might be < 1% of the data (e.g., 1 fraud per 10,000 transactions). | Balanced or moderately imbalanced. There are enough positive examples to learn a pattern. |
| Nature of Anomalies | Anomalies are unknown or evolving. The algorithm looks for "anything different." (e.g., a new type of cyber attack). | The positive class is well-defined. The algorithm looks for "specific patterns" it has seen before. |
| Goal | Detect outliers/novelties. | Predict a specific class. |
| Examples | Fraud detection, manufacturing defect detection (where defects are rare), data center monitoring. | Spam filtering, cancer detection (if historical data is abundant), handwritten digit recognition. |
When to choose Anomaly Detection:
- Positive examples (anomalies) are extremely rare.
- The "abnormal" behavior is not consistent; tomorrow's anomaly may look completely different from today's.
When to choose Supervised Learning:
- You have a large number of positive and negative examples.
- The future positive examples will likely look similar to the ones in the training set.
3. Choosing Features & PCA for Anomaly Detection
Feature selection is paramount in anomaly detection because irrelevant features can add "noise," making normal points look like outliers or masking actual outliers (the "Curse of Dimensionality").
Feature Selection Criteria
- Relevance: Choose features where anomalies are expected to deviate significantly.
- Gaussian Distribution: Many anomaly detection algorithms assume data is normally distributed. Transforming non-Gaussian features (e.g., using log-transform) to look Gaussian often improves performance.
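For example, a heavily right-skewed feature often looks much more Gaussian after a log-transform. A small sketch on synthetic data (np.log1p is used so zero values remain valid):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.exponential(scale=2.0, size=10_000)   # heavily right-skewed feature

x_log = np.log1p(x)                           # log(1 + x): much closer to Gaussian

# Rough skewness check (no SciPy needed): near 0 means roughly symmetric.
def skew(v):
    return float(np.mean(((v - v.mean()) / v.std()) ** 3))

print(round(skew(x), 2), round(skew(x_log), 2))
```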
Principal Component Analysis (PCA) for Anomaly Detection
PCA is traditionally a dimensionality reduction technique, but it can be used effectively for anomaly detection via the Reconstruction Error.
The Logic
PCA captures the directions (Principal Components) of maximum variance in the data, i.e., its correlation structure. The top eigenvectors capture the "normal" patterns of the data.
- Training: Train PCA on the dataset (mostly normal data) and select the top k components.
- Projection: Project a new data point into this lower-dimensional subspace.
- Reconstruction: Project the point back to the original dimension.
- Error Calculation: Calculate the distance between the original point and the reconstructed point.
Interpretation
- Normal Data: Correlates well with the principal components. The reconstruction error will be low.
- Anomalies: Do not conform to the correlation structure of the normal data. When projected down and back up, significant information is lost. The reconstruction error will be high.
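A minimal sketch of this reconstruction-error approach with scikit-learn's PCA (the synthetic data, the number of components, and the 99th-percentile threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# "Normal" data lies close to a 2-D plane embedded in a 5-D space.
latent = rng.normal(size=(500, 2))
W = rng.normal(size=(2, 5))
X_train = latent @ W + 0.05 * rng.normal(size=(500, 5))

pca = PCA(n_components=2).fit(X_train)          # keep the top k components

def reconstruction_error(X):
    X_hat = pca.inverse_transform(pca.transform(X))   # project down, then back up
    return np.linalg.norm(X - X_hat, axis=1)

threshold = np.percentile(reconstruction_error(X_train), 99)

x_new = rng.normal(size=(1, 5)) * 5.0               # does not follow the plane
print(reconstruction_error(x_new)[0] > threshold)   # likely True -> flagged as anomaly
```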
4. Hierarchical Clustering
Hierarchical clustering groups data into a tree of clusters. Unlike K-Means, it does not require specifying the number of clusters (K) beforehand.
Organizing Clusters as a Hierarchical Tree (Dendrogram)
The output of hierarchical clustering is a Dendrogram, a tree diagram that records the sequences of merges or splits.
- X-axis: Represents the individual data points.
- Y-axis: Represents the distance (dissimilarity) at which clusters merge.
- Cutting the Tree: To obtain a specific number of clusters, you "cut" the dendrogram at a specific height (distance threshold). A cut at a higher distance yields fewer, larger clusters.
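A minimal sketch with SciPy (its linkage function performs the agglomerative merging described in the next subsection; the Ward method and the cut height t are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.RandomState(1)
X = np.vstack([rng.normal((0, 0), 0.5, (20, 2)),
               rng.normal((5, 0), 0.5, (20, 2)),
               rng.normal((0, 5), 0.5, (20, 2))])

Z = linkage(X, method="ward")             # merge history (the hierarchy)
tree = dendrogram(Z, no_plot=True)        # tree structure; plotting it requires matplotlib

# "Cutting" the dendrogram at height t assigns cluster labels.
labels = fcluster(Z, t=10.0, criterion="distance")
print(np.unique(labels))                  # a higher t yields fewer, larger clusters
```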
Agglomerative Clustering (Bottom-Up)
This is the most common approach to hierarchical clustering.
Algorithm Steps:
- Initialization: Treat each data point as a single cluster. If there are N points, there are N clusters.
- Distance Matrix: Calculate the distance matrix between all clusters.
- Merge: Find the two clusters that are closest to each other and merge them into a single cluster.
- Update: Update the distance matrix to reflect the distance between the new merged cluster and all remaining clusters.
- Repeat: Repeat steps 3 and 4 until only one single cluster remains (containing all data points).
Linkage Criteria (Measuring distance between clusters)
How do we define the distance between Cluster A and Cluster B?
- Single Linkage (Min): Distance between the closest pair of points (one in A, one in B).
- Pros: Good for non-elliptical shapes.
- Cons: Sensitive to noise; produces "chaining" effects (long, thin clusters).
- Complete Linkage (Max): Distance between the farthest pair of points.
- Pros: Produces compact, spherical clusters.
- Cons: Sensitive to outliers.
- Average Linkage: Average distance between all pairs of points.
- Ward’s Method: Merges the two clusters that result in the minimum increase in total intra-cluster variance (similar to K-Means objective). Generally the default choice.
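A short comparison sketch with scikit-learn's AgglomerativeClustering on a half-moon dataset (the dataset and parameters are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

for linkage in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    # Single linkage typically recovers the two half-moons; complete and Ward
    # tend to cut them into compact, roughly spherical groups instead.
    print(linkage, labels[:10])
```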
5. DBSCAN Clustering
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It clusters points based on the density of points in a region, rather than distance from a centroid.
Key Characteristics
- Does not require K (the number of clusters) to be defined.
- Can find arbitrarily shaped clusters (e.g., one cluster surrounding another).
- Robust to outliers (explicitly labels them as noise).
Parameters
- ε (Eps): The radius of the neighborhood around a data point.
- MinPts: The minimum number of data points (including the point itself) required within the radius to form a dense region.
Point Classification
DBSCAN classifies every data point into three categories:
- Core Point: A point is a Core Point if it has at least MinPts points within its ε radius. These points form the interior of a cluster.
- Border Point: A point that has fewer than MinPts points within its ε radius, but is in the neighborhood of a Core Point. It is part of the cluster but on the edge.
- Noise Point (Outlier): A point that is neither a Core Point nor a Border Point. It does not belong to any cluster.
Algorithm Steps
- Pick an arbitrary unvisited point P.
- Retrieve the neighborhood of P using radius ε.
- If the size of the neighborhood is ≥ MinPts:
- Mark P as a Core Point.
- Create a new cluster.
- Expand the cluster: Add all points in P's neighborhood to the cluster. If any of those points are also Core Points, add their neighbors as well (recursive expansion).
- If the size of the neighborhood is < MinPts:
- Mark P as Noise (temporarily; it might later be found to be a Border Point of a different Core Point).
- Repeat until all points have been visited.
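A minimal sketch with scikit-learn's DBSCAN (the eps and min_samples values are illustrative and data-dependent):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                       # cluster ids; -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```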
Pros and Cons
- Pros:
- Handles noise effectively.
- Can find non-linearly separable clusters (e.g., "half-moon" shapes).
- Cons:
- Struggles with clusters of varying densities (it is hard to find a single ε that works for both sparse and dense clusters).
- Sensitive to the selection of the parameters ε and MinPts.
- Computational cost is higher (O(n²) in the worst case) without spatial indexing (like KD-trees).