Unit 1 - Notes
Unit 1: Foundations of Unsupervised Learning
1. Learning Paradigms: Supervised vs. Unsupervised vs. Semi-Supervised Learning
Understanding the distinction between different machine learning paradigms is fundamental to formulating data-driven solutions. The key differentiator lies in the presence and utilization of "labels" or target variables in the training data.
Supervised Learning
- Definition: The model learns from a fully labeled dataset. Every input vector $x_i$ is associated with a known target output or label $y_i$.
- Objective: To learn a mapping function $f$ with $f(x) \approx y$, such that the model can accurately predict the output for new, unseen inputs.
- Key Tasks:
- Classification: Predicting discrete class labels (e.g., Spam vs. Not Spam).
- Regression: Predicting continuous numerical values (e.g., Predicting house prices based on square footage).
- Pros/Cons: Often highly accurate and straightforward to evaluate against held-out labels, but acquiring large, labeled datasets is often expensive and time-consuming, and it frequently requires human domain expertise.
Unsupervised Learning
- Definition: The model learns from an unlabeled dataset. The data consists of input vectors $x_i$ without any corresponding target labels $y_i$.
- Objective: To discover hidden structures, patterns, relationships, or representations within the data without explicit guidance on what to look for.
- Key Tasks:
- Clustering: Grouping similar data points together (e.g., K-Means, DBSCAN).
- Dimensionality Reduction: Compressing data while retaining essential information (e.g., PCA, t-SNE).
- Association Rule Learning: Discovering interesting relations between variables (e.g., Market Basket Analysis).
- Pros/Cons: Can utilize vast amounts of readily available unlabeled data. However, evaluating performance is inherently difficult and subjective since there is no "ground truth" to compare against.
Semi-Supervised Learning
- Definition: A hybrid approach that utilizes a small amount of labeled data in conjunction with a large amount of unlabeled data during training.
- Objective: To leverage the underlying distribution of the unlabeled data to improve the learning accuracy of the supervised task.
- Mechanism: Assumptions like the Continuity Assumption (points close to each other are more likely to share a label) or the Cluster Assumption (data tends to form discrete clusters, and points in the same cluster share a label) are used to propagate labels from the small labeled set to the larger unlabeled set (e.g., Pseudo-labeling).
- Pros/Cons: Can significantly reduce the cost of manual labeling while often approaching fully supervised accuracy, provided the assumptions above actually hold for the data. It is particularly useful in domains like medical imaging or speech recognition, where data is abundant but expert labeling is scarce.
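The pseudo-labeling mechanism described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `pseudo_label`, the `margin` confidence rule, the nearest-centroid base classifier, and the toy points are all hypothetical choices, not a standard API. Each round, unlabeled points whose nearest class centroid is clearly closer than the second nearest receive that centroid's label and join the labeled set.

```python
import numpy as np

def pseudo_label(X_lab, y_lab, X_unlab, margin=1.0, max_rounds=10):
    """Self-training with a nearest-centroid classifier (illustrative sketch).

    Assumes at least two classes are present in y_lab.
    """
    X_lab = np.asarray(X_lab, dtype=float)
    y_lab = np.asarray(y_lab)
    remaining = np.asarray(X_unlab, dtype=float)
    classes = np.unique(y_lab)
    for _ in range(max_rounds):
        if len(remaining) == 0:
            break
        # One centroid per class, computed from everything labeled so far
        centroids = np.stack([X_lab[y_lab == c].mean(axis=0) for c in classes])
        dists = np.linalg.norm(remaining[:, None, :] - centroids[None, :, :], axis=2)
        # "Confident" = nearest centroid beats the runner-up by at least `margin`
        sorted_d = np.sort(dists, axis=1)
        confident = (sorted_d[:, 1] - sorted_d[:, 0]) >= margin
        if not confident.any():
            break
        new_labels = classes[dists[confident].argmin(axis=1)]
        X_lab = np.vstack([X_lab, remaining[confident]])
        y_lab = np.concatenate([y_lab, new_labels])
        remaining = remaining[~confident]
    return X_lab, y_lab, remaining

# Toy data: two labeled points, four unlabeled ones
X_lab = np.array([[0.0, 0.0], [10.0, 10.0]])
y_lab = np.array([0, 1])
X_unlab = np.array([[0.5, 0.2], [9.5, 10.1], [1.0, 0.5], [5.0, 5.0]])

X_all, y_all, still_unlabeled = pseudo_label(X_lab, y_lab, X_unlab)
print(y_all)            # labels after propagation
print(still_unlabeled)  # points that never became confident
```

In this toy run the point halfway between the two groups stays unlabeled, which is the intended behavior: propagating labels to ambiguous points is how pseudo-labeling reinforces its own mistakes.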
2. Problem Formulation for Unsupervised Learning
Unlike supervised learning, where the goal is minimizing a specific prediction error (like Mean Squared Error or Cross-Entropy), unsupervised learning requires a different mathematical formulation.
The Dataset
Let the dataset be represented as a matrix $X \in \mathbb{R}^{n \times d}$, where:
- $n$ is the number of observations (data points).
- $d$ is the number of features (dimensions).
- $x_i \in \mathbb{R}^d$ is the $i$-th observation, represented as a feature vector.
- Crucially, there is no target vector $y$.
Formulating Specific Objectives
Because there is no explicit target, the objective function depends entirely on the specific task:
- Clustering (Grouping):
- Goal: Partition the dataset $X$ into $K$ disjoint subsets (clusters) $C_1, C_2, \dots, C_K$.
- Optimization: Minimize intra-cluster variance (distance between points in the same cluster) and maximize inter-cluster variance (distance between different clusters).
- Formulation Example (K-Means): $\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$ (where $\mu_k$ is the mean of cluster $C_k$).
- Dimensionality Reduction (Representation):
- Goal: Find a transformation function $f: \mathbb{R}^d \to \mathbb{R}^k$ where $k \ll d$.
- Optimization: Retain as much variance or local structure of the original data as possible while minimizing reconstruction error.
- Density Estimation:
- Goal: Estimate the underlying Probability Density Function (PDF) $p(x)$ that generated the dataset $X$.
- Optimization: Maximize the likelihood of the observed data using methods like Gaussian Mixture Models (GMMs).
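Of the three objectives above, the K-Means one is the easiest to compute directly. The following is a minimal sketch rather than a production implementation: the function names (`assign`, `kmeans_objective`, `lloyd_kmeans`), the toy data, and the hand-picked initial centroids are assumptions made for illustration. It evaluates the within-cluster sum of squares and reduces it with the usual alternating assign/update procedure (Lloyd's algorithm).

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_objective(X, labels, centroids):
    """Within-cluster sum of squares: sum over i of ||x_i - mu_{label_i}||^2."""
    return float(((X - centroids[labels]) ** 2).sum())

def lloyd_kmeans(X, init_centroids, n_iter=10):
    """Alternate assignment and mean-update steps for a fixed number of rounds."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for k in range(len(centroids)):
            if np.any(labels == k):  # leave an empty cluster's centroid unchanged
                centroids[k] = X[labels == k].mean(axis=0)
    return assign(X, centroids), centroids

# Toy data: two visually obvious groups
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
init = np.array([[0.0, 0.0], [10.0, 10.0]])

obj_before = kmeans_objective(X, assign(X, init), init)
labels, centroids = lloyd_kmeans(X, init)
obj_after = kmeans_objective(X, labels, centroids)

print(obj_before)  # 30.0
print(obj_after)   # about 2.667 (exactly 8/3)
print(labels)
```

Note that the objective is only meaningful for comparing partitions with the same number of clusters; it always decreases as $K$ grows.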
3. Real-Life Use Cases
Unsupervised learning is ubiquitous in modern data science due to the sheer volume of unlabeled data generated daily.
Market Segmentation
- Concept: Dividing a broad consumer or business market into sub-groups (segments) based on shared characteristics.
- Application: A retail company uses clustering algorithms on demographic data, purchase history, and website interaction metrics to identify distinct buyer personas (e.g., "Bargain Hunters," "Brand Loyalists," "Impulse Buyers"). This allows for highly targeted marketing campaigns, optimizing advertising spend.
Customer Behavior Analysis
- Concept: Understanding how customers interact with a product or service over time.
- Application: E-commerce platforms utilize association rule learning (like the Apriori algorithm) to perform Market Basket Analysis. By discovering rules like "If a customer buys a laptop and a mouse, they are 80% likely to buy a laptop bag," companies can optimize product placement and build effective recommender systems.
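The "80% likely" figure in a rule like the one above is the rule's confidence: the support of the full itemset divided by the support of the antecedent. The sketch below computes both quantities on a made-up list of ten transactions (the data and the helper names `support` and `confidence` are purely illustrative). It does not implement Apriori's candidate generation, only the two measures Apriori relies on.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Made-up baskets for illustration only
transactions = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"laptop"},
    {"mouse", "keyboard"},
    {"keyboard"},
    {"laptop bag"},
    {"monitor"},
]

rule_support = support(transactions, {"laptop", "mouse", "laptop bag"})
rule_confidence = confidence(transactions, {"laptop", "mouse"}, {"laptop bag"})
print(rule_support)     # 0.4
print(rule_confidence)  # 0.8
```

Here 5 of 10 baskets contain both a laptop and a mouse, and 4 of those 5 also contain a laptop bag, giving a confidence of 0.8.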
Anomaly & Fraud Detection
- Concept: Identifying data points, events, or observations that deviate significantly from a dataset's normal behavior.
- Application: Credit card companies use unsupervised density estimation and isolation forests to flag fraudulent transactions. Because fraud is rare and constantly evolving, supervised models often struggle to keep up. Unsupervised models map "normal" spending patterns and flag any transaction (e.g., sudden large foreign purchases) that falls into a low-density/high-distance region of the feature space.
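The "low-density region" idea can be illustrated with the simplest possible density model. The sketch below is not an isolation forest and not what any particular card issuer does: it fits a single multivariate Gaussian to synthetic "normal" transactions and flags new points whose Mahalanobis distance from the mean exceeds a threshold. The two features, the synthetic data, and the threshold of 3 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" behavior: columns are [amount, distance_from_home_km]
normal = rng.normal(loc=[50.0, 5.0], scale=[10.0, 2.0], size=(500, 2))

# Fit a single Gaussian: mean vector and covariance matrix
mean = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def mahalanobis(points):
    """Distance from the fitted mean, measured in units of the fitted covariance."""
    diff = np.atleast_2d(points) - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

threshold = 3.0  # hypothetical cut-off; in practice tuned on historical data
new_transactions = np.array([
    [55.0, 6.0],       # close to typical behavior
    [900.0, 4000.0],   # large purchase far from home
])
scores = mahalanobis(new_transactions)
flags = scores > threshold
print(flags)
```

A large Mahalanobis distance corresponds to a low value of the fitted Gaussian density, so thresholding one is equivalent to thresholding the other. Real spending data is rarely Gaussian, which is one reason mixture models or isolation forests are used instead.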
Pattern Discovery in Biological and Social Data
- Biological Data: In bioinformatics, clustering is used on gene expression data (microarrays) to group genes that exhibit similar expression patterns under different conditions. This helps in identifying unknown gene functions and discovering disease subtypes (e.g., finding distinct molecular variations of breast cancer).
- Social Data: In network analysis, community detection algorithms are used to find tight-knit groups of friends on social media platforms, map the spread of information/misinformation, or identify influential nodes (hubs) within a network.
4. Distance & Similarity Metrics
In unsupervised learning, since we lack labels, the algorithm's understanding of the data relies entirely on the mathematical definition of "similarity" or "distance" between data points.
Euclidean Distance (L2 Norm)
- Definition: The straight-line distance between two points in Euclidean space. It is the most common distance metric.
- Formula: For two vectors $a$ and $b$ in $\mathbb{R}^d$: $d_{\text{Euclidean}}(a, b) = \sqrt{\sum_{j=1}^{d} (a_j - b_j)^2}$
- Characteristics: Highly sensitive to large differences in individual features (because the differences are squared before summing). It measures the absolute distance in space.
Manhattan Distance (L1 Norm / City Block Distance)
- Definition: The distance between two points measured along axes at right angles. It is named after the grid-like street geography of Manhattan.
- Formula: $d_{\text{Manhattan}}(a, b) = \sum_{j=1}^{d} |a_j - b_j|$, for vectors $a, b \in \mathbb{R}^d$
- Characteristics: Less sensitive to outliers compared to Euclidean distance because it does not square the differences. It is often preferred in high-dimensional spaces or when features are strictly discrete/grid-based.
Cosine Similarity
- Definition: A measure of similarity between two non-zero vectors that calculates the cosine of the angle between them.
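- Formula: $\cos(\theta) = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \dfrac{\sum_{j=1}^{d} a_j b_j}{\sqrt{\sum_{j=1}^{d} a_j^2} \, \sqrt{\sum_{j=1}^{d} b_j^2}}$, for non-zero vectors $a, b \in \mathbb{R}^d$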
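- Characteristics: The output ranges from -1 (exactly opposite directions) to 1 (exactly the same direction). A value of 0 indicates orthogonality (the vectors share no common direction, which is not the same as statistical independence). It measures the direction of the vectors, not their magnitude. For vectors with only non-negative entries, such as word counts, the value lies between 0 and 1.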
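The three measures in this section are short enough to write out directly. The following NumPy sketch uses illustrative function names and an arbitrary pair of vectors; it is meant only to make the formulas concrete, not to replace library routines.

```python
import numpy as np

def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """L1 distance: sum of absolute differences."""
    return float(np.sum(np.abs(a - b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; both must be non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(euclidean(a, b))          # 5.0  (differences are 3, 4, 0)
print(manhattan(a, b))          # 7.0
print(cosine_similarity(a, b))  # 25 / sqrt(14 * 61), about 0.855
```

Note that the first two are distances (smaller means more alike) while the third is a similarity (larger means more alike); a common way to turn cosine similarity into a dissimilarity is to use one minus its value.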
5. When and Why Distance Choice Matters
The choice of distance metric acts as the "lens" through which an unsupervised algorithm views the data. Choosing the wrong metric can lead to meaningless results.
1. The Role of Data Magnitude (Euclidean vs. Cosine)
- Why it matters: If you are analyzing text documents represented as word frequency vectors, a long document and a short document might be about the exact same topic (same direction in vector space), but they will have very different magnitudes (frequencies).
- The Choice: Cosine Similarity is well suited here because it ignores the length of the documents and only looks at the angle (content distribution). Euclidean distance would treat these documents as highly dissimilar purely because one is longer.
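A small numeric case makes the difference visible. The word-count vectors below are invented for illustration: a short document, a document on the same topic that is ten times longer, and a short document with a different word mix.

```python
import numpy as np

short_doc = np.array([1.0, 2.0, 0.0])   # word counts for three vocabulary terms
long_doc = 10 * short_doc               # same proportions, ten times the length
other_doc = np.array([0.0, 1.0, 2.0])   # similar length, different word mix

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Euclidean distance says the different-topic document is the closer one
print(euclidean(short_doc, long_doc))    # about 20.12
print(euclidean(short_doc, other_doc))   # about 2.45

# Cosine similarity ranks the same-topic document as the closer one
print(cosine_similarity(short_doc, long_doc))   # 1.0 up to rounding
print(cosine_similarity(short_doc, other_doc))  # about 0.4
```

The two measures disagree on which document is the nearest neighbor of `short_doc`, which is exactly the situation where the choice of metric changes the clustering result.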
2. Sensitivity to Outliers (Euclidean vs. Manhattan)
- Why it matters: Euclidean distance squares the differences between feature values. If two data points differ significantly in just one dimension due to a data anomaly, that squared difference will dominate the entire distance calculation.
- The Choice: If your dataset is prone to outliers or noisy data, Manhattan distance is more robust because the absolute differences do not magnify large anomalies as heavily as squared differences do.
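To see the effect of squaring, compare two points that are equally far from a reference point under Manhattan distance: one differs slightly in every feature, the other differs by a large amount in a single feature (as an anomalous reading would). The vectors are invented for illustration.

```python
import numpy as np

base = np.zeros(10)

spread = np.ones(10)     # differs from base by 1 in each of the 10 features
spike = np.zeros(10)
spike[0] = 10.0          # differs from base by 10 in a single feature

def manhattan(a, b):
    return float(np.sum(np.abs(a - b)))

def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

print(manhattan(base, spread), manhattan(base, spike))   # 10.0 10.0
print(euclidean(base, spread), euclidean(base, spike))   # about 3.16 and 10.0
```

Manhattan distance treats the two deviations as equal, while Euclidean distance rates the single-feature spike as roughly three times farther away.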
3. The Curse of Dimensionality
- Why it matters: As the number of dimensions (features) grows, the volume of the space increases exponentially and the data becomes sparse. In extremely high-dimensional spaces, the gap between a point's distance to its nearest neighbor and its distance to its farthest neighbor becomes negligible relative to the distances themselves. All points start to seem equidistant.
- The Choice: Euclidean distance degrades rapidly in high dimensions. In such scenarios, Cosine similarity or fractional distance measures ($L_p$ distances with $p < 1$, which are not true norms) can provide better differentiation between similar and dissimilar points. Alternatively, applying Dimensionality Reduction (like PCA) before calculating Euclidean distance is common practice.
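The "all points seem equidistant" effect can be observed empirically. The sketch below draws uniform random points, measures Euclidean distances from one random query point, and reports the relative contrast (farthest minus nearest, divided by nearest). The function name, sample size, and choice of uniform data are assumptions for illustration; real data with strong structure usually concentrates less severely.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """(max distance - min distance) / min distance from a random query point."""
    X = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(X - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

rng = np.random.default_rng(0)
contrasts = {d: relative_contrast(1000, d, rng) for d in (2, 10, 100, 1000)}

for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.3f}")
```

The printed contrast shrinks as the number of dimensions grows: in two dimensions the farthest point is many times farther than the nearest, while in a thousand dimensions the two distances differ by only a small fraction.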
4. Scale of Features
- Why it matters: If one feature is measured in millions (e.g., annual income) and another in single digits (e.g., number of children), the feature with the larger range will completely dominate any distance metric based on absolute values (Euclidean or Manhattan).
- The Solution: Whenever using distance metrics that rely on magnitude, feature scaling (Standardization or Min-Max Normalization) is an absolute necessity.
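# Example: Standardizing data before calculating Euclidean distance in Python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import euclidean_distances
# X is a numpy array of shape (n_samples, n_features).
# Toy values for illustration: columns are annual income and number of children.
X = np.array([[50_000.0, 2.0], [52_000.0, 0.0], [120_000.0, 2.0]])
# Rescale every feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Pairwise distances are no longer dominated by the income column
distances = euclidean_distances(X_scaled)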