Unit 4 - Notes

INT394

Unit 4: Regression and Clustering

1. Difference Between Classification and Regression

Both classification and regression are types of Supervised Learning, where the model learns a mapping from input variables ($X$) to an output variable ($Y$). The fundamental difference lies in the nature of the output variable.

| Feature | Regression | Classification |
| --- | --- | --- |
| Output Type | Continuous (Numerical) | Discrete (Categorical) |
| Goal | Predict a quantity or value. | Predict a class label or category. |
| Mapping Function | $f: X \to Y$, where $Y$ is a continuous value. | $f: X \to Y$, where $Y$ is a discrete class label. |
| Ordering | Outputs have a natural ordering (e.g., $10 < 20 < 30$). | No natural ordering; labels do not imply magnitude (e.g., "Cat" is not greater than "Dog"). |
| Examples | House price prediction, stock market forecasting, temperature prediction. | Spam detection, handwritten digit recognition, medical diagnosis (Sick/Healthy). |
| Algorithms | Linear Regression, SVR, Regression Trees. | Logistic Regression, SVM, Decision Trees, Naive Bayes. |
| Evaluation Metrics | MSE, RMSE, MAE, $R^2$ Score. | Accuracy, Precision, Recall, F1-Score, ROC-AUC. |

2. Linear vs. Non-linear Regression

Regression models are categorized based on the relationship between the independent variables and the dependent variable.

Linear Regression

Assumes a linear relationship between the input variables ($X$) and the single output variable ($y$). The model fits a straight line (or hyperplane in higher dimensions).

  • Equation: $y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n + \epsilon$
    • Where $w_0, w_1, \dots, w_n$ are the weights (parameters) and $\epsilon$ is the error term.
  • Characteristics:
    • Computationally efficient and simple to interpret.
    • Susceptible to high bias (underfitting) if the underlying data relationship is complex.
    • Solved using Ordinary Least Squares (OLS) or Gradient Descent.
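
A minimal NumPy sketch of fitting this model with OLS via the normal equation. The synthetic data, variable names, and the pseudo-inverse solver are illustrative choices, not part of the notes; gradient descent would be an equally valid solver.

```python
import numpy as np

# Synthetic data: y = 3 + 2*x + noise (illustrative example)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 + 2 * X[:, 0] + rng.normal(0, 1, size=100)

# Add a column of ones so that w[0] acts as the intercept term
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# Ordinary Least Squares closed form: w = (X^T X)^{-1} X^T y
w = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
print("Learned weights (intercept, slope):", w)

# Predict for a new input x = 5 (first entry is the bias feature)
x_new = np.array([[1.0, 5.0]])
print("Prediction at x = 5:", x_new @ w)
```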

Non-linear Regression

Assumes the relationship between inputs and output is modeled by a non-linear function. The model fits a curve rather than a straight line.

  • Equation: $y = f(x, \beta) + \epsilon$, where $f$ is non-linear with respect to the parameters $\beta$ or the input variables.
  • Polynomial Regression: A specific form of linear regression where the input features are transformed to higher powers (e.g., $x, x^2, x^3$). While the fitted curve is non-linear in $x$, the model is still linear with respect to the weights $w_i$ (see the sketch after the characteristics below).
  • Characteristics:
    • More flexible; can capture complex patterns.
    • Susceptible to high variance (overfitting).
    • Requires more computational power.
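
A brief sketch showing why polynomial regression stays linear in the weights: the feature transform $[1, x, x^2]$ is built explicitly and the same OLS solution is reused. The synthetic quadratic data are illustrative.

```python
import numpy as np

# Synthetic non-linear data: y = 1 - 2x + 0.5x^2 + noise (illustrative)
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = 1 - 2 * x + 0.5 * x**2 + rng.normal(0, 0.3, size=200)

# Transform the single feature into polynomial features [1, x, x^2];
# the model remains linear in the weights, so OLS still applies.
X_poly = np.column_stack([np.ones_like(x), x, x**2])
w = np.linalg.pinv(X_poly.T @ X_poly) @ X_poly.T @ y
print("Recovered weights (w0, w1, w2):", w)
```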

3. Loss Functions for Regression

A loss function quantifies the error between the predicted value ($\hat{y}$) and the actual value ($y$). The goal of training is to minimize this loss.

A. Mean Squared Error (MSE)

The average of the squared differences between predicted and actual values.

  • Pros: Differentiable (great for Gradient Descent); penalizes large errors heavily.
  • Cons: Very sensitive to outliers.

B. Mean Absolute Error (MAE)

The average of the absolute differences.

  • Pros: Robust to outliers compared to MSE.
  • Cons: Non-differentiable at 0 (requires sub-gradient methods); gradients are constant regardless of how close the prediction is.

C. Root Mean Squared Error (RMSE)

The square root of MSE.

  • Pros: Represents error in the same units as the target variable ($y$).

D. Huber Loss

A hybrid loss function that acts like MSE for small errors (quadratic) and MAE for large errors (linear).

  • Pros: Combines the best of both worlds: differentiable at 0 and robust to outliers.
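
A small NumPy sketch of the four losses defined above. The function names and the $\delta = 1.0$ threshold for Huber loss are illustrative choices.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: heavily penalizes large errors
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: more robust to outliers
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root Mean Squared Error: same units as the target variable
    return np.sqrt(mse(y_true, y_pred))

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small,
                            0.5 * err ** 2,
                            delta * (np.abs(err) - 0.5 * delta)))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print(mse(y_true, y_pred), mae(y_true, y_pred),
      rmse(y_true, y_pred), huber(y_true, y_pred))
```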

4. Non-parametric Regression

Unlike parametric regression (e.g., Linear Regression), which assumes a fixed functional form with a finite set of parameters, non-parametric regression makes no strong assumptions about the underlying function. The structure of the model is determined by the data.

Key Characteristics

  • Flexible: Can model very complex distributions.
  • Data-Driven: The complexity of the model grows with the number of data points ($n$).
  • Risk: Requires large amounts of data to avoid overfitting (Curse of Dimensionality).

Common Algorithms

1. K-Nearest Neighbors (KNN) Regression

Predicts the output based on the average of the $k$ nearest training points.

  • Logic: To predict $\hat{y}$ for a new point $x$, find the $k$ closest examples in the training set and calculate their mean (or weighted mean).
  • Hyperparameter: $k$ (number of neighbors). Small $k$ = overfitting; large $k$ = underfitting.
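
A minimal NumPy sketch of KNN regression. The helper name knn_regress, the choice $k = 5$, and the synthetic 1-D data are illustrative, not a library function.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5):
    # Distance from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Prediction = mean of their targets (unweighted average)
    return y_train[nearest].mean()

# Illustrative 1-D example
rng = np.random.default_rng(2)
X_train = rng.uniform(0, 10, size=(50, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, size=50)
print(knn_regress(X_train, y_train, np.array([3.0]), k=5))
```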

2. Kernel Regression (Nadaraya-Watson)

Calculates a weighted average of training outputs, where weights are determined by a kernel function (e.g., Gaussian Kernel) based on distance.

  • Formula (with kernel $K$ and bandwidth $h$): $\hat{y}(x) = \dfrac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)}$
  • Points closer to the query point receive larger weights and therefore have higher influence.
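
A minimal 1-D sketch of the Nadaraya-Watson estimator with a Gaussian kernel. The bandwidth $h = 0.5$ and the synthetic data are illustrative choices.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, h=0.5):
    # Gaussian kernel weights based on distance to the query point
    weights = np.exp(-((x_train - x_query) ** 2) / (2 * h ** 2))
    # Weighted average of the training targets
    return np.sum(weights * y_train) / np.sum(weights)

rng = np.random.default_rng(3)
x_train = rng.uniform(0, 10, size=100)
y_train = np.cos(x_train) + rng.normal(0, 0.1, size=100)
print(nadaraya_watson(x_train, y_train, x_query=4.0, h=0.5))
```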

3. Regression Trees (Decision Trees)

Splits the input space into distinct, non-overlapping regions (boxes). For a new observation, the prediction is the mean of the training observations in that region.

  • Algorithm: CART (Classification and Regression Trees).
  • Uses metrics like Variance Reduction or Standard Deviation Reduction to decide splits.
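
A short scikit-learn sketch of CART regression (assuming scikit-learn is available); max_depth=3 and the synthetic sine data are illustrative choices. The default squared-error split criterion corresponds to the variance-reduction idea described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# CART splits the input space into regions; max_depth limits the number of splits
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, y)

# Prediction = mean of the training targets in the region the query falls into
print(tree.predict(np.array([[2.5]])))
```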

5. Difference Between Classification and Clustering

| Feature | Classification | Clustering |
| --- | --- | --- |
| Learning Paradigm | Supervised Learning. | Unsupervised Learning. |
| Input Data | Labeled data $(X, y)$: input variables and target labels are provided. | Unlabeled data $X$: only input variables are provided. |
| Goal | Assign data points to predefined classes. | Group data points based on similarity; discover inherent structures. |
| Phases | Distinct Training and Testing phases. | No predefined "training" phase; the algorithm runs on the dataset directly to find patterns. |
| Validation | Easy to validate against Ground Truth labels (Accuracy). | Difficult to validate; relies on internal metrics (Silhouette Score) or domain expertise. |
| Example | Is this email Spam or Not Spam? | Segment customers based on purchasing behavior. |

6. Similarity and Distance Measures

Clustering relies heavily on measuring how "close" or "similar" two data points are. If $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ are two data vectors:

A. Minkowski Distance (General Metric)

$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

B. Euclidean Distance ($p = 2$)

The straight-line distance between two points. Most common for K-Means.

  • Note: Highly sensitive to the scale of features (normalization is required).

C. Manhattan Distance ($p = 1$)

The sum of absolute differences (Taxi-cab geometry).

  • Use Case: High-dimensional data or robust regression.

D. Cosine Similarity

Measures the cosine of the angle between two vectors: $\cos(\theta) = \frac{x \cdot y}{\|x\|\,\|y\|}$. Focuses on orientation, not magnitude.

  • Use Case: Text mining, Document clustering.

E. Jaccard Similarity

Used for binary attributes or sets: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$.
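
A compact NumPy sketch computing these measures on toy vectors; the example vectors and the small lambda helper are illustrative.

```python
import numpy as np

x = np.array([1.0, 0.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.0, 3.0])

minkowski = lambda x, y, p: np.sum(np.abs(x - y) ** p) ** (1 / p)
euclidean = minkowski(x, y, 2)                                 # straight-line (L2) distance
manhattan = minkowski(x, y, 1)                                 # sum of absolute differences (L1)
cosine_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))   # orientation only, not magnitude

# Jaccard similarity on binary attribute vectors (treated as sets)
a = np.array([1, 1, 0, 1, 0], dtype=bool)
b = np.array([1, 0, 0, 1, 1], dtype=bool)
jaccard = np.sum(a & b) / np.sum(a | b)

print(euclidean, manhattan, cosine_sim, jaccard)
```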


7. Partition-based Clustering

Partitioning methods divide data into distinct, non-overlapping groups.

K-Means Clustering

The most popular centroid-based algorithm.

Algorithm Steps:

  1. Initialization: Select $K$ random points as initial centroids.
  2. Assignment: Assign every data point to the nearest centroid (typically using Euclidean distance).
  3. Update: Recalculate the centroids by taking the mean of all points assigned to that cluster.
  4. Repeat: Steps 2 and 3 until convergence (centroids do not move or max iterations reached).

Objective Function:
Minimize the Within-Cluster Sum of Squares (WCSS), also called the Sum of Squared Errors (SSE): $J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \|x_i - \mu_j\|^2$, where $\mu_j$ is the centroid of cluster $C_j$.
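
A from-scratch NumPy sketch of the four steps above, iterating toward a lower WCSS. The synthetic two-blob data, the iteration cap, and the lack of empty-cluster handling are simplifications for illustration.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: each point goes to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its assigned points
        #    (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until convergence: stop when centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```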

Limitations of K-Means:

  • Requires specifying $K$ in advance.
  • Sensitive to initial centroid placement (can get stuck in local optima).
  • Sensitive to outliers (centroids get pulled away).
  • Assumes clusters are spherical and roughly equal in size.

K-Medoids (PAM - Partitioning Around Medoids)

Similar to K-Means, but the center of a cluster must be an actual data point (medoid), not the calculated mean.

  • Pros: More robust to noise and outliers than K-Means.
  • Cons: Computationally more expensive.

8. Hierarchical Clustering

Builds a hierarchy of clusters. The result is a tree-like representation called a Dendrogram. This method does not require specifying $K$ beforehand.

Types of Hierarchical Clustering

1. Agglomerative (Bottom-Up)

  • Start: Treat each data point as a single cluster (N clusters).
  • Step: Merge the two closest clusters.
  • Repeat: Until only one huge cluster remains.
  • Most common approach.

2. Divisive (Top-Down)

  • Start: All points in one cluster.
  • Step: Split the cluster recursively.
  • Repeat: Until each point is its own cluster.

Linkage Criteria (How to measure distance between two clusters)

When merging clusters $A$ and $B$, how do we define the distance between them?

  1. Single Linkage (Min): Distance between the two closest members of clusters A and B.
    • Issue: "Chaining effect" (long, stringy clusters).
  2. Complete Linkage (Max): Distance between the two farthest members of clusters A and B.
    • Result: Compact, spherical clusters.
  3. Average Linkage: Average distance between all pairs of points in A and B.
  4. Ward’s Method: Minimizes the increase in variance (SSE) when merging clusters.
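
A short sketch of agglomerative clustering using SciPy's hierarchy module (assuming SciPy is available). The method string selects one of the linkage criteria listed above; the synthetic data and the cut into two flat clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

# Agglomerative clustering: 'single', 'complete', 'average', or 'ward' linkage
Z = linkage(X, method="ward")

# Cut the dendrogram so that 2 flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```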

9. Cluster Validation and Evaluation

Since clustering is unsupervised, evaluation is challenging.

A. Internal Evaluation (Using only the data)

  1. Elbow Method (For K-Means):

    • Plot $K$ (x-axis) vs. WCSS/SSE (y-axis).
    • Look for the "elbow" point where the decrease in error slows down significantly. This is the optimal $K$.
  2. Silhouette Coefficient:

    • Measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation): $s = \frac{b - a}{\max(a, b)}$, where $a$ is the mean intra-cluster distance and $b$ is the mean distance to the nearest neighboring cluster.
    • Range: $[-1, 1]$.
    • $\approx +1$: Well clustered.
    • $\approx 0$: On the border between two clusters.
    • $\approx -1$: Likely assigned to the wrong cluster.
  3. Davies-Bouldin Index:

    • Ratio of within-cluster scatter to between-cluster separation.
    • Lower is better.
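
A short sketch combining the Elbow and Silhouette ideas with scikit-learn (assumed available): KMeans exposes the WCSS/SSE as inertia_, and silhouette_score computes the average coefficient. The synthetic three-blob data and the range of $K$ values tried are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.normal((0, 8), 1, (50, 2))])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # km.inertia_ is the WCSS/SSE used for the Elbow plot
    sil = silhouette_score(X, km.labels_)
    print(f"K={k}  SSE={km.inertia_:.1f}  silhouette={sil:.2f}")
```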

B. External Evaluation (When ground truth labels are available)

Used mostly for benchmarking algorithms on known datasets.

  1. Purity: Assign each cluster to its most frequent ground-truth class; purity is the fraction of all data points that match their cluster's assigned class.
  2. Adjusted Rand Index (ARI): Measures the similarity between the clustering result and the ground truth, correcting for chance. Values near $1$ indicate strong agreement, values near $0$ indicate chance-level agreement, and negative values indicate worse-than-chance clusterings.
  3. Normalized Mutual Information (NMI): Information-theoretic measure of shared information between predicted clusters and true classes.
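
A brief sketch of the chance-corrected external metrics using scikit-learn (assumed available); the toy label vectors are illustrative. Note that both metrics are invariant to permutations of the cluster IDs.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Illustrative ground-truth classes vs. predicted cluster assignments
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]

print("ARI:", adjusted_rand_score(y_true, y_pred))          # agreement corrected for chance
print("NMI:", normalized_mutual_info_score(y_true, y_pred)) # shared information, normalized to [0, 1]
```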