Unit 4 - Notes
INT395
Unit 4: Regression with scikit-learn
1. Introduction to Regression
Regression analysis is a subfield of supervised machine learning that models the relationship between one or more features (independent variables) and a continuous target variable (dependent variable).
- Goal: Predict a continuous numerical value (e.g., house prices, temperature, stock prices) based on input features.
- Mathematical Representation: Given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, the goal is to learn a mapping function $f$ such that the prediction $\hat{y} = f(x)$ is as close to the true $y$ as possible.
Types of Relationships
- Simple Regression: One independent variable and one dependent variable.
- Multiple Regression: Multiple independent variables and one dependent variable.
- Multivariate Regression: Multiple dependent variables (distinct from multiple regression).
2. Exploratory Data Analysis (EDA)
Before building a model, it is crucial to understand the data structure, outliers, and relationships.
Visualizing Relationships
- Scatter Plot Matrix (Pairplot): Visualizes the pairwise relationships between features and the target as scatter plots. Useful for detecting linearity and outliers.
- Correlation Matrix (Heatmap): Quantifies linear relationships.
- Pearson correlation coefficient ($r$): Ranges from $-1$ to $1$.
- $r = 1$: Perfect positive correlation.
- $r = -1$: Perfect negative correlation.
- $r = 0$: No linear correlation.
Implementation with Python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Load dataset
df = pd.read_csv('data.csv')
# Scatter plot matrix
sns.pairplot(df, height=2.5)
plt.show()
# Correlation Matrix Heatmap
cm = df.corr()  # keep the DataFrame so feature names appear as axis labels
sns.heatmap(cm, annot=True, fmt='.2f')
plt.show()
3. Evaluation Metrics
To measure the performance of a regression model, we calculate the difference between the predicted values ($\hat{y}$) and the actual values ($y$).
Key Metrics
- Mean Absolute Error (MAE): $\frac{1}{n}\sum_{i=1}^{n} |y^{(i)} - \hat{y}^{(i)}|$. The average of the absolute differences between predictions and actual values. It is more robust to outliers than MSE.
- Mean Squared Error (MSE): $\frac{1}{n}\sum_{i=1}^{n} \big(y^{(i)} - \hat{y}^{(i)}\big)^2$. The average of the squared differences. It heavily penalizes large errors and is convenient for deriving gradient descent updates.
- Root Mean Squared Error (RMSE): $\sqrt{\mathrm{MSE}}$. It is interpretable because it is in the same units as the target variable.
- Coefficient of Determination ($R^2$ Score): Represents the proportion of variance in the dependent variable that is predictable from the independent variables.
- Range: $-\infty$ to $1$ (it becomes negative when the model performs worse than simply predicting the mean).
- $R^2 = 1$: Perfect fit.
- $R^2 = 0$: Model predicts the mean of the target.
- Formula: $R^2 = 1 - \frac{SSE}{SST}$ (where SSE is the Sum of Squared Errors and SST is the Total Sum of Squares).
Implementation
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('R2:', r2_score(y_test, y_pred))
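As a quick check of the $R^2$ formula above, RMSE and $R^2$ can also be computed by hand with NumPy. This is a minimal sketch that assumes y_test and y_pred are NumPy arrays, as in the snippet above.
import numpy as np
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # RMSE, in the same units as the target
sse = np.sum((y_test - y_pred) ** 2)                 # Sum of Squared Errors
sst = np.sum((y_test - np.mean(y_test)) ** 2)        # Total Sum of Squares
print('RMSE:', rmse)
print('R2 (manual):', 1 - sse / sst)                 # should match r2_score(y_test, y_pred)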
4. Linear Regression (Ordinary Least Squares)
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
- Equation: $y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m$
- $w_0$: Bias (Intercept)
- $w_1, \dots, w_m$: Weights (Coefficients)
- Objective: Minimize the Residual Sum of Squares (RSS) as the cost function: $J(w) = \sum_{i=1}^{n} \big(y^{(i)} - \hat{y}^{(i)}\big)^2$.
Assumptions
- Linearity: The relationship between X and y is linear.
- Independence: Observations are independent of each other.
- Homoscedasticity: The variance of residuals is constant across all levels of X.
- Normality: The residuals of the model are normally distributed.
Implementation
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(X_train, y_train)
y_pred = slr.predict(X_test)
print('Slope:', slr.coef_)
print('Intercept:', slr.intercept_)
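To visually check the linearity and homoscedasticity assumptions listed above, it is common to plot the residuals against the predicted values after fitting; they should scatter randomly around zero with roughly constant spread. A minimal sketch using matplotlib and the y_pred computed above:
import matplotlib.pyplot as plt
residuals = y_test - y_pred                      # actual minus predicted
plt.scatter(y_pred, residuals, alpha=0.6)        # look for random scatter with constant spread
plt.axhline(y=0, color='red', linestyle='--')    # reference line at zero error
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.show()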
5. RANSAC (RANdom SAmple Consensus)
Standard Linear Regression is highly sensitive to outliers (anomalies). RANSAC is a robust regression algorithm that fits a model using a subset of "inliers" and ignores outliers.
The Algorithm
- Select a random subset of samples to treat as inliers.
- Fit the model to the subset.
- Test all other data points against the fitted model.
- Add points that fall within a user-given tolerance to the inlier set.
- Re-fit the model using all inliers.
- Repeat until performance meets a threshold or max iterations are reached.
Implementation
from sklearn.linear_model import RANSACRegressor
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         loss='absolute_error',
                         residual_threshold=5.0)
ransac.fit(X, y)
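After fitting, the samples RANSAC treated as inliers and the line learned from them can be inspected through the fitted object's attributes; a short sketch:
import numpy as np
inlier_mask = ransac.inlier_mask_                # boolean mask of samples used as inliers
outlier_mask = np.logical_not(inlier_mask)       # the remaining samples were treated as outliers
print('Inliers:', inlier_mask.sum(), 'Outliers:', outlier_mask.sum())
print('Slope:', ransac.estimator_.coef_)         # model fitted on the inliers only
print('Intercept:', ransac.estimator_.intercept_)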
6. Polynomial Regression
When data shows a curvilinear relationship, a straight line ($y = w_0 + w_1 x$) will underfit. Polynomial regression models the relationship as a $d$-th degree polynomial.
- Concept: It is still considered "linear regression" because the model is linear in the coefficients ($w$), even though the features ($x$) are transformed.
- Transformation: If we have one feature $x$, we create new features $x^2, x^3, \dots, x^d$.
- Scikit-learn approach: Use the PolynomialFeatures transformer followed by LinearRegression.
Implementation
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Transform features to 2nd degree
quadratic = PolynomialFeatures(degree=2)
X_quad = quadratic.fit_transform(X)
# Fit linear regression on transformed features
pr = LinearRegression()
pr.fit(X_quad, y)
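To see what the quadratic features buy us, the fit can be compared against a plain straight-line fit on the same data; a minimal sketch that continues from the snippet above and scores both models on the training data:
from sklearn.metrics import r2_score
# Plain straight-line fit on the raw feature, for comparison
lr = LinearRegression()
lr.fit(X, y)
print('Linear R2:   ', r2_score(y, lr.predict(X)))
# Quadratic fit is scored on the transformed features
print('Quadratic R2:', r2_score(y, pr.predict(X_quad)))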
7. Regularized Regression
Regularization adds a penalty term to the cost function to prevent overfitting (high variance) by shrinking the coefficient values towards zero.
A. Ridge Regression (L2 Regularization)
- Adds the squared magnitude of coefficients as a penalty term to the loss function.
- Cost: $J(w) = \sum_{i=1}^{n} \big(y^{(i)} - \hat{y}^{(i)}\big)^2 + \lambda \sum_{j=1}^{m} w_j^2$
- Effect: Shrinks weights but rarely enforces them to exactly zero. Good for handling multicollinearity.
B. Lasso Regression (L1 Regularization)
- Adds the absolute value of magnitude of coefficients as penalty.
- Cost: $J(w) = \sum_{i=1}^{n} \big(y^{(i)} - \hat{y}^{(i)}\big)^2 + \lambda \sum_{j=1}^{m} |w_j|$
- Effect: Can shrink weights to exactly zero. Useful for feature selection (sparse models).
C. ElasticNet
- A compromise between Ridge and Lasso.
- Useful when there are multiple features correlated with one another (Lasso might pick one at random, ElasticNet is likely to pick both).
Implementation
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Ridge (alpha is the regularization strength, lambda)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# ElasticNet (l1_ratio=0.5 means 50% Lasso, 50% Ridge)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X_train, y_train)
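To illustrate the feature-selection effect of Lasso described above, the fitted coefficients can be inspected: any coefficient shrunk to exactly zero means the corresponding feature has effectively been dropped. A short sketch using the lasso model fitted above:
import numpy as np
print('Lasso coefficients:', lasso.coef_)
n_selected = np.sum(lasso.coef_ != 0)            # coefficients at exactly zero are excluded
print('Features kept:', n_selected, 'of', lasso.coef_.shape[0])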
8. Support Vector Regression (SVR)
SVR applies the principles of Support Vector Machines (SVM) to regression problems.
- Concept: Instead of a dividing line (classification), SVR tries to fit a "tube" (margin of tolerance $\epsilon$) around the data.
- Goal: The algorithm ignores errors situated within the $\epsilon$-tube. It only penalizes points falling outside the tube.
- Kernels: SVR uses the Kernel trick to map data into higher dimensions to solve non-linear problems (e.g., RBF Kernel).
- Importance of Scaling: SVR is distance-based; feature scaling (StandardScaler) is mandatory.
Implementation
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
y_std = sc_y.fit_transform(y.reshape(-1,1))
# Radial Basis Function Kernel
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_std, y_std.flatten())
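Because the target was standardized, the model's predictions come out in standardized units and must be mapped back with the target scaler. A minimal sketch, where X_new is a hypothetical array of new samples in the original (unscaled) feature units:
# X_new: hypothetical new samples in the original feature units
X_new_std = sc_x.transform(X_new)                            # reuse the already-fitted scaler
y_pred_std = svr.predict(X_new_std)                          # predictions in standardized units
y_pred = sc_y.inverse_transform(y_pred_std.reshape(-1, 1))   # back to the original target units
print(y_pred.flatten())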
9. Decision Tree Regression
Decision trees split the data into subsets based on feature values to make predictions.
- Structure:
- Root/Internal Nodes: Represent a decision rule on a feature (e.g., $x_j \le t$ for some threshold $t$).
- Leaf Nodes: Represent the output value (usually the average of the target values of samples in that leaf).
- Splitting Criterion: In classification, we use Entropy/Gini. In regression, we use MSE (Variance Reduction). The split is chosen to minimize the variance of the target variable in the resulting child nodes.
- Pros: Can model non-linear relationships; requires no feature scaling.
- Cons: Highly prone to overfitting (high variance) if the tree grows too deep.
Implementation
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_train, y_train)
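To inspect the splits the tree actually learned, scikit-learn can print the fitted rules as plain text; a minimal sketch, assuming X_train is a pandas DataFrame so its column names can be passed as feature names:
from sklearn.tree import export_text
# Print each decision rule and the value predicted at each leaf
rules = export_text(tree, feature_names=list(X_train.columns))
print(rules)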
10. Random Forest Regression
Random Forest is an ensemble method that combines multiple Decision Trees to improve generalization and reduce overfitting.
- Method (Bagging):
- Create bootstrap samples (random samples with replacement) from the training data.
- Train a decision tree on each sample.
- Randomness: At each node split, only a random subset of features is considered.
- Prediction: The final output is the average of the predictions of all individual trees.
- Advantages:
- More robust to noise and outliers than single trees.
- Reduces overfitting significantly.
- Provides "Feature Importance" scores.
Implementation
from sklearn.ensemble import RandomForestRegressor
# n_estimators = number of trees
forest = RandomForestRegressor(n_estimators=1000,
                               criterion='squared_error',
                               random_state=1,
                               n_jobs=-1)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
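The "Feature Importance" scores mentioned above are exposed on the fitted model as feature_importances_; a short sketch that again assumes X_train is a pandas DataFrame with named columns:
import numpy as np
importances = forest.feature_importances_        # one score per feature, summing to 1
indices = np.argsort(importances)[::-1]          # most important features first
for rank, idx in enumerate(indices, start=1):
    print(f'{rank}) {X_train.columns[idx]}: {importances[idx]:.3f}')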