Unit 6 - Subjective Questions
CSE274 • Practice Questions with Detailed Answers
Explain the fundamental role of Unsupervised Learning and distinguish it from Supervised Learning.
Role of Unsupervised Learning:
Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses. Its primary roles include:
- Clustering: Grouping similar data points together (e.g., customer segmentation).
- Dimensionality Reduction: Compressing data while maintaining structure (e.g., PCA).
- Anomaly Detection: Identifying rare events or outliers.
Differences:
| Feature | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Input Data | Labeled data (Input-Output pairs). | Unlabeled data (Input only). |
| Goal | Predict outcomes or classify data. | Find hidden patterns or structures. |
| Feedback | Direct feedback (error minimization). | No external feedback/ground truth. |
| Algorithms | Linear Regression, SVM, Decision Trees. | K-Means, Apriori, PCA. |
Mathematically define Euclidean Distance and Manhattan Distance. In which scenarios is Manhattan distance preferred over Euclidean?
1. Euclidean Distance:
It is the straight-line distance between two points in a space. For two points $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ in an $n$-dimensional space:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
2. Manhattan Distance:
It is the sum of absolute differences between the coordinates of the points. It represents the path a taxicab takes in a grid-like street system:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
Preference for Manhattan Distance:
- High Dimensionality: In very high-dimensional spaces, the difference between the furthest and closest points using Euclidean distance becomes negligible (Curse of Dimensionality). Manhattan distance is often more robust in these cases.
- Grid Structures: When movement is restricted to grid-like paths (e.g., city blocks, VLSI layout).
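As a quick numeric illustration of the two definitions (a minimal NumPy sketch; the points are made up for the example):

```python
import numpy as np

# Two points in 3-dimensional space
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))  # sqrt(9 + 16 + 0) = 5.0

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(p - q))          # 3 + 4 + 0 = 7.0
```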
What is Cosine Distance? Derive the formula for Cosine Similarity and explain why it is commonly used in text mining.
Cosine Distance:
Cosine distance is defined as $1 - \text{cosine similarity}$, where cosine similarity is the cosine of the angle between two non-zero vectors. It depends on the orientation of the vectors, not their magnitude.
Cosine Similarity Formula:
Given two vectors $A$ and $B$, the similarity is defined as:
$$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
Usage in Text Mining:
- Magnitude Independence: In text data, document length varies significantly. One document might contain the word "science" 5 times, and another 50 times, but they discuss the same topic. Euclidean distance would treat them as far apart due to magnitude, whereas Cosine similarity focuses on the direction (content overlap), making them appear similar.
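A small sketch of this magnitude independence, using hypothetical term-count vectors (the counts are invented for the example):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Term-count vectors for two documents on the same topic;
# the second is ten times longer but proportionally identical.
doc_a = np.array([5.0, 2.0, 1.0])
doc_b = np.array([50.0, 20.0, 10.0])

sim = cosine_similarity(doc_a, doc_b)  # 1.0: identical direction
dist = 1.0 - sim                       # cosine distance: 0.0
```

Euclidean distance between these two vectors would be large, even though they describe the same content.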
Describe the K-Means Clustering algorithm step-by-step. What is the objective function it tries to minimize?
K-Means Algorithm Steps:
- Initialization: Choose K random points as initial cluster centroids.
- Assignment: Assign every data point to the nearest centroid (usually based on Euclidean distance).
- Update: Recalculate the centroid of each cluster by taking the mean of all data points assigned to that cluster.
- Repeat: Repeat steps 2 and 3 until convergence is reached (i.e., centroids do not change, or the change is below a threshold).
Objective Function (Inertia/SSE):
K-Means tries to minimize the Within-Cluster Sum of Squares (WCSS):
$$\text{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
Where:
- $K$ is the number of clusters.
- $x_i$ is a data point in cluster $C_k$.
- $\mu_k$ is the centroid of cluster $C_k$.
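The steps and objective above can be sketched as a minimal NumPy implementation (an illustrative sketch only — production code such as scikit-learn's `KMeans` adds smarter initialization like k-means++ and multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means: random init, assign to nearest centroid, update means."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    wcss = ((X - centroids[labels]) ** 2).sum()  # the WCSS objective
    return labels, centroids, wcss
```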
Explain the 'Elbow Method' for determining the optimal number of clusters (K) in K-Means.
The Elbow Method:
The Elbow Method is a heuristic used to determine the optimal number of clusters in a dataset.
Process:
- Run the K-Means algorithm for a range of K values (e.g., from 1 to 10).
- For each K, calculate the Within-Cluster Sum of Squares (WCSS) or Inertia.
- Plot K (x-axis) versus WCSS (y-axis).
Interpretation:
- As K increases, WCSS decreases (trivially, if K equals the number of data points, WCSS is 0).
- The goal is to find the point where adding another cluster does not significantly reduce the WCSS.
- This point typically forms an "elbow" or bend in the graph.
- The K value at the elbow is chosen as the optimal number of clusters.
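A sketch of the procedure using scikit-learn's `KMeans` on synthetic blobs (the data and parameter values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Run K-Means for K = 1..10 and record the WCSS (inertia)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 11)]

# Plotting K against inertias shows a sharp bend (the "elbow") at K = 3:
# the drop from K=2 to K=3 is large, while K=3 to K=4 gains little.
```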
Compare K-Means and K-Medoids (PAM). Why is K-Medoids considered more robust to outliers?
Comparison:
| Feature | K-Means | K-Medoids (PAM) |
|---|---|---|
| Center Representation | Mean of points in the cluster. | Actual data point (Medoid) from the cluster. |
| Complexity | Faster, roughly O(nKt) for n points, K clusters, t iterations. | Slower, about O(K(n−K)²) per iteration; expensive for large datasets. |
| Robustness | Sensitive to outliers. | Robust to outliers. |
Robustness to Outliers:
K-Means uses the mean to update centroids. A single outlier with an extreme value can significantly shift the mean, dragging the cluster centroid toward the outlier. K-Medoids uses an actual data point (medoid) that minimizes the sum of dissimilarities. Since it uses the most central item rather than an average value, extreme outliers do not affect the position of the representative point as drastically.
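A tiny 1-D example of this effect (values chosen purely for illustration):

```python
import numpy as np

# A tight cluster plus one extreme outlier
points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Centroid (mean): dragged far away from the bulk of the data
centroid = points.mean()  # (1+2+3+4+100)/5 = 22.0

# Medoid: the actual data point minimizing total absolute dissimilarity
dissim = np.abs(points[:, None] - points[None, :]).sum(axis=1)
medoid = points[dissim.argmin()]  # 3.0, still inside the cluster
```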
Discuss Hierarchical Clustering. Distinguish between Agglomerative and Divisive approaches.
Hierarchical Clustering:
A method of cluster analysis which seeks to build a hierarchy of clusters. It does not require specifying K beforehand and produces a tree-based representation called a Dendrogram.
Agglomerative (Bottom-Up):
- Treat each data point as a single cluster (N clusters).
- Merge the two closest clusters based on a linkage criterion.
- Repeat until only one single cluster remains containing all data points.
Divisive (Top-Down):
- Start with all data points in one single cluster.
- Split the cluster recursively into smaller clusters.
- Repeat until each data point forms its own cluster.
Agglomerative is more common due to lower computational complexity compared to Divisive.
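The agglomerative approach can be sketched with SciPy (illustrative data; `method='ward'` is one of several linkage choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four points forming two obvious pairs
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9]])

# Agglomerative clustering: start from singletons, merge closest clusters.
# Z records the (n-1) merges row by row, with the merge distance.
Z = linkage(X, method='ward')

# Cut the tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
```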
Explain the different Linkage Criteria used in Hierarchical Clustering: Single, Complete, Average, and Ward's Linkage.
Linkage criteria determine how the distance between two clusters is calculated:
- Single Linkage (Min distance):
  - Distance between the two closest points in different clusters.
  - $d(A, B) = \min \{ d(a, b) : a \in A, b \in B \}$.
  - Effect: Can result in "chaining" (long, thin clusters).
- Complete Linkage (Max distance):
  - Distance between the two farthest points in different clusters.
  - $d(A, B) = \max \{ d(a, b) : a \in A, b \in B \}$.
  - Effect: Tends to yield compact, spherical clusters.
- Average Linkage:
  - Average distance between all pairs of points in the two clusters.
  - $d(A, B) = \frac{1}{|A| \, |B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$.
  - Effect: Robust to noise.
- Ward's Linkage:
  - Minimizes the increase in variance (or SSE) when merging clusters.
  - Effect: Produces clusters of similar sizes, similar to the K-Means objective.
Describe the DBSCAN algorithm. Define the concepts of Core Points, Border Points, and Noise Points.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together (points with many nearby neighbors).
Parameters:
- ε (Epsilon): The radius of the neighborhood.
- MinPts: Minimum number of points required to form a dense region.
Point Definitions:
- Core Point: A point is a Core Point if there are at least MinPts points within its ε-neighborhood (including itself).
- Border Point: A point that is reachable from a Core Point (within ε distance) but has fewer than MinPts points in its own neighborhood.
- Noise Point (Outlier): A point that is neither a Core Point nor a Border Point. It does not belong to any cluster.
Algorithm:
It starts with an arbitrary point. If it's a core point, a cluster is formed. It recursively adds connected core and border points. If it's noise, the algorithm moves to the next unvisited point.
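A minimal sketch with scikit-learn's `DBSCAN` (the data and the eps/min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])  # far-away outlier

# eps = neighborhood radius, min_samples = MinPts (including the point itself)
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# db.labels_ assigns a cluster index to each point; noise gets -1
```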
What are the primary advantages of Density-Based Clustering (DBSCAN) over Partitioning methods like K-Means?
Advantages of DBSCAN over K-Means:
- Arbitrary Shapes: K-Means assumes spherical clusters. DBSCAN can find clusters of arbitrary shapes (e.g., nested circles, moons) because it relies on density rather than distance from a center.
- No need to specify K: Unlike K-Means, DBSCAN does not require the number of clusters (K) to be specified beforehand.
- Robust to Outliers: DBSCAN explicitly handles noise. Points that do not satisfy the density criteria are labeled as noise/outliers, whereas K-Means forces every point into a cluster, potentially distorting the centroids.
- Parameter sensitivity: While DBSCAN requires ε and MinPts, it is not affected by random centroid initialization the way K-Means is.
Explain the concept of Anomaly Detection in the context of Unsupervised Learning.
Anomaly Detection:
Anomaly detection (or outlier detection) is the identification of rare items, events, or observations which raise suspicions by differing significantly from the majority of the data.
In Unsupervised Context:
Since we do not have labels indicating which samples are "normal" and which are "anomalies," unsupervised algorithms look for deviations in the data structure.
Approaches:
- Density-Based: Points in low-density regions are considered anomalies (e.g., noise points in DBSCAN).
- Distance-Based: Points far away from cluster centroids (e.g., in K-Means).
- Isolation Forest: Explicitly isolates anomalies rather than profiling normal data points, based on the fact that anomalies are few and different, making them easier to isolate in a tree structure.
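A sketch of the Isolation Forest approach with scikit-learn (synthetic data; the contamination value is an illustrative assumption about the expected anomaly fraction):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # bulk of the data
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])           # far-away anomalies
X = np.vstack([normal, outliers])

# Isolation Forest: anomalies are isolated in fewer random splits
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)   # +1 = normal, -1 = anomaly
```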
Define Inertia as an evaluation metric for clustering. What are its limitations?
Inertia (Within-Cluster Sum of Squares):
Inertia measures how well a dataset was clustered by K-Means. It is calculated as the sum of squared distances of samples to their closest cluster center.
- Goal: Lower inertia indicates that points are closer to their centroids (tighter clusters).
Limitations:
- Depends on K: Inertia decreases as the number of clusters (K) increases. It cannot be used as a standalone metric to pick K without the Elbow method.
- Assumption of Convex Shapes: It assumes clusters are convex (spherical) and isotropic. It fails to evaluate irregular shapes (e.g., elongated clusters) correctly.
- Scale Dependent: It is not a normalized metric; its value depends on the scale of the data (requires feature scaling beforehand).
Explain the Silhouette Score. How is it calculated and how do you interpret its value?
Silhouette Score:
A metric used to calculate the goodness of a clustering technique. It measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
Calculation:
For a point $i$:
- Calculate $a(i)$: The average distance between $i$ and all other points in the same cluster (cohesion).
- Calculate $b(i)$: The average distance between $i$ and all points in the nearest neighboring cluster (separation).
- Silhouette coefficient:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
Interpretation:
The score ranges from -1 to +1.
- +1: The sample is far away from the neighboring clusters (Good clustering).
- 0: The sample is on or very close to the decision boundary between two neighboring clusters.
- -1: The sample has likely been assigned to the wrong cluster.
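A sketch comparing silhouette scores for a good versus an over-split choice of K (synthetic blobs; parameter values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

labels_good = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_bad = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X)

score_good = silhouette_score(X, labels_good)  # close to +1: clean separation
score_bad = silhouette_score(X, labels_bad)    # lower: clusters are over-split
```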
What is the Davies–Bouldin Index? How does it differ from the Silhouette Score in terms of optimization?
Davies–Bouldin Index (DBI):
The DBI is defined as the average similarity measure of each cluster with its most similar cluster. Similarity is the ratio of within-cluster distances to between-cluster distances.
Formula Logic:
$$DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$$
Where $\sigma_i$ is the average distance of points in cluster $i$ to its centroid $c_i$, and $d(c_i, c_j)$ is the distance between centroids $c_i$ and $c_j$.
Interpretation & Difference:
- Optimization: A lower DBI indicates better clustering (clusters are compact and far apart). In contrast, a higher Silhouette score indicates better clustering.
- Computation: DBI is generally simpler and faster to compute than Silhouette Score but is restricted to Euclidean distance concepts usually.
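A sketch showing the opposite optimization directions of the two metrics on the same clustering (synthetic data; parameter values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

dbi = davies_bouldin_score(X, labels)   # lower is better (compact, far apart)
sil = silhouette_score(X, labels)       # higher is better (toward +1)
```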
Discuss the 'Curse of Dimensionality' and its impact on choosing distance metrics for clustering.
Curse of Dimensionality:
This refers to various phenomena that arise when analyzing data in high-dimensional spaces (many features). As the number of dimensions increases, the volume of the space increases exponentially, and the data becomes sparse.
Impact on Distance Metrics:
- Distance Concentration: In high dimensions, the relative difference between the distances to the nearest and farthest points becomes negligible. All points tend to become nearly equidistant from each other.
- Metric Selection:
- Euclidean Distance loses meaning in high dimensions as the squared error amplifies noise.
- Cosine Similarity is often preferred (e.g., in text clustering) because it focuses on direction rather than geometric distance.
- Dimensionality Reduction (like PCA) is often required before clustering to make distance metrics effective again.
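A small experiment illustrating distance concentration (random uniform data; the dimensions and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def concentration_ratio(dim, n=500):
    """Relative spread (max - min) / min of distances from n random points
    in the unit hypercube to the origin."""
    X = rng.random((n, dim))
    d = np.linalg.norm(X, axis=1)
    return (d.max() - d.min()) / d.min()

low = concentration_ratio(2)      # 2-D: distances vary widely
high = concentration_ratio(1000)  # 1000-D: distances concentrate tightly
```

As the dimension grows, the ratio shrinks toward zero: nearest and farthest points become almost indistinguishable by Euclidean distance.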
Compare Partitioning (K-Means), Hierarchical, and Density-based (DBSCAN) clustering methods.
| Feature | Partitioning (K-Means) | Hierarchical | Density-Based (DBSCAN) |
|---|---|---|---|
| Shape of Clusters | Spherical/Convex. | Any shape (depends on linkage). | Arbitrary shapes. |
| Parameters | Number of clusters (K). | Number of clusters or cutoff distance. | Radius (ε) and MinPts. |
| Outliers | Sensitive (forces assignment). | Sensitive (often merges with clusters). | Robust (identifies as noise). |
| Complexity | Linear, O(nKt). Efficient for large data. | Quadratic O(n²) to cubic O(n³). Slow. | O(n log n) with a spatial index, O(n²) otherwise. |
| Determinism | Non-deterministic (random init). | Deterministic. | Deterministic (mostly). |
What is a Dendrogram? How is it used to determine the number of clusters in Hierarchical Clustering?
Dendrogram:
A dendrogram is a tree-like diagram that records the sequences of merges (or splits) in hierarchical clustering. The y-axis usually represents the Euclidean distance (or dissimilarity) at which clusters merge, and the x-axis represents the data points.
Determining Clusters:
- Visual Inspection: Look for the longest vertical line in the dendrogram that is not crossed by any horizontal line (which represents a merge).
- Cut-off: This long vertical distance indicates a large jump in dissimilarity, implying that the clusters being merged are quite different.
- Horizontal Cut: Drawing a horizontal line across this vertical span allows you to count the number of vertical lines it intersects. This count represents the optimal number of clusters.
Why is scaling or normalization of features critical before applying distance-based clustering algorithms?
Importance of Scaling:
Distance-based algorithms (like K-Means, KNN, Hierarchical) calculate the distance between points to determine similarity (e.g., Euclidean distance).
The Problem:
If features have different units or scales (e.g., "Income" in thousands vs. "Age" in tens), the feature with the larger numerical range will dominate the distance calculation.
Example:
- Feature A (0 to 100,000)
- Feature B (0 to 100)
- A small change in A contributes massively to the squared difference compared to B.
Solution:
Normalization (Min-Max scaling) or Standardization (Z-score scaling) ensures all features contribute equally to the distance metric, preventing bias toward variables with larger magnitudes.
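A sketch of the problem and the fix with scikit-learn's `StandardScaler` (the Income/Age numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: Income (tens of thousands), Age (tens)
X = np.array([[50_000.0, 25.0],
              [52_000.0, 60.0],
              [90_000.0, 30.0]])

# Unscaled: Income dominates; the 35-year Age gap barely registers
d_unscaled = np.linalg.norm(X[0] - X[1])  # ~2000, driven almost entirely by Income

# Standardization: each feature rescaled to mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])  # both features now contribute
```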
Explain the concept of 'Centroid' in K-Means versus 'Medoid' in K-Medoids.
Centroid (K-Means):
- A centroid is a geometric center of a cluster.
- It is an artificial point calculated as the arithmetic mean of all data points in that cluster.
- It may not correspond to any actual observation in the dataset.
- Formula: $\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$
Medoid (K-Medoids):
- A medoid is the most centrally located actual data point within a cluster.
- It is the point in the cluster whose average dissimilarity to all other points in the cluster is minimal.
- Since it is a real data point, it makes the interpretation of the cluster representative easier and is more robust to noise.
Describe the main challenges associated with Unsupervised Learning compared to Supervised Learning.
Challenges in Unsupervised Learning:
- No Ground Truth: Unlike supervised learning, there are no labels to verify the results. We cannot calculate simple accuracy or error rates.
- Subjective Evaluation: Evaluation often relies on internal metrics (Silhouette, Inertia) or domain expert validation, which can be subjective. What looks like a good cluster to the algorithm might not make business sense.
- Curse of Dimensionality: Unsupervised methods rely heavily on distance metrics, which degrade in high-dimensional spaces.
- Parameter Selection: Many algorithms require hyperparameter tuning (e.g., K in K-Means, ε in DBSCAN) without a clear, objective way to tune them perfectly.
- Interpretability: The patterns found (e.g., a specific cluster of customers) may be difficult to explain or label semantically.