Unit 4 - Notes

INT395

Unit 4: Regression with scikit-learn

1. Introduction to Regression

Regression analysis is a subfield of supervised machine learning that aims to model the relationship between one or more features (independent variables) and a continuous target variable (dependent variable).

  • Goal: Predict a continuous numerical value (e.g., house prices, temperature, stock prices) based on input features.
  • Mathematical Representation: Given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, the goal is to learn a mapping function $f$ such that the prediction $\hat{y} = f(x)$ is as close to the true $y$ as possible.

Types of Relationships

  • Simple Regression: One independent variable and one dependent variable.
  • Multiple Regression: Multiple independent variables and one dependent variable.
  • Multivariate Regression: Multiple dependent variables (distinct from multiple regression).
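
These cases differ mainly in the shapes of the arrays passed to scikit-learn. A minimal sketch with made-up synthetic data (the array shapes, not the numbers, are the point here):

PYTHON
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple regression: one feature, one target -> X has shape (n, 1)
X_simple = rng.random((100, 1))
y_simple = 3 * X_simple[:, 0] + rng.normal(size=100)
LinearRegression().fit(X_simple, y_simple)

# Multiple regression: several features, one target -> X has shape (n, m)
X_multi = rng.random((100, 3))
y_single = X_multi @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)
LinearRegression().fit(X_multi, y_single)

# Multivariate regression: several targets -> y has shape (n, k)
Y_multi = np.column_stack([y_single, 2 * y_single])
LinearRegression().fit(X_multi, Y_multi)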

2. Exploratory Data Analysis (EDA)

Before building a model, it is crucial to understand the data structure, outliers, and relationships.

Visualizing Relationships

  • Scatter Plot Matrix (Pairplot): Visualizes the pairwise correlation between features and the target. Useful for detecting linearity.
  • Correlation Matrix (Heatmap): Quantifies linear relationships.
    • Pearson correlation coefficient ($r$): Ranges from -1 to 1.
    • $r = 1$: Perfect positive correlation.
    • $r = -1$: Perfect negative correlation.
    • $r = 0$: No linear correlation.
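
For reference, the sample Pearson coefficient between a feature $x$ and the target $y$ is

$$ r = \frac{\sum_{i=1}^{n} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x^{(i)} - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y^{(i)} - \bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the sample means.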

Implementation with Python

PYTHON
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load dataset
df = pd.read_csv('data.csv')

# Scatter plot matrix
sns.pairplot(df, height=2.5)
plt.show()

# Correlation Matrix Heatmap
cm = df.corr()  # keep the DataFrame so feature names label the heatmap axes
sns.heatmap(cm, annot=True)
plt.show()


3. Evaluation Metrics

To measure the performance of a regression model, we calculate the difference between predicted values ($\hat{y}$) and actual values ($y$).

Key Metrics

  1. Mean Absolute Error (MAE): The average of the absolute differences between predictions and actual values. It is more robust to outliers than MSE.
  2. Mean Squared Error (MSE): The average of the squared differences. It heavily penalizes large errors. Useful for gradient descent derivation.
  3. Root Mean Squared Error (RMSE): The square root of MSE. It is interpretable as it is in the same units as the target variable.
  4. Coefficient of Determination ($R^2$ Score): Represents the proportion of variance in the dependent variable that is predictable from the independent variables.
    • Formula: $R^2 = 1 - \frac{SSE}{SST}$ (where SSE is the Sum of Squared Errors and SST is the Total Sum of Squares).
    • Range: $-\infty$ to $1$ (on unseen data it can be negative if the model is worse than predicting the mean).
    • $R^2 = 1$: Perfect fit.
    • $R^2 = 0$: Model predicts the mean of the target.

Implementation

PYTHON
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('R2:', r2_score(y_test, y_pred))
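
As a sanity check, the same quantities can be computed directly from their definitions (a minimal sketch, assuming y_test and y_pred are 1-D NumPy arrays):

PYTHON
import numpy as np

errors = y_test - y_pred
mae = np.mean(np.abs(errors))                    # Mean Absolute Error
mse = np.mean(errors ** 2)                       # Mean Squared Error
rmse = np.sqrt(mse)                              # Root Mean Squared Error
sse = np.sum(errors ** 2)                        # Sum of Squared Errors
sst = np.sum((y_test - np.mean(y_test)) ** 2)    # Total Sum of Squares
r2 = 1 - sse / sst                               # Coefficient of Determination
print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse, 'R2:', r2)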


4. Linear Regression (Ordinary Least Squares)

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.

  • Equation: $\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m$
    • $w_0$: Bias (Intercept)
    • $w_1, \dots, w_m$: Weights (Coefficients)
  • Objective: Minimize the Residual Sum of Squares (RSS), the cost function: $RSS = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$.

Assumptions

  1. Linearity: The relationship between X and y is linear.
  2. Independence: Observations are independent of each other.
  3. Homoscedasticity: The variance of residuals is constant across all levels of X.
  4. Normality: The residuals of the model are normally distributed. (Assumptions 3 and 4 are commonly checked with a residual plot; see the sketch after the implementation below.)

Implementation

PYTHON
from sklearn.linear_model import LinearRegression

slr = LinearRegression()
slr.fit(X_train, y_train)
y_pred = slr.predict(X_test)

print('Slope:', slr.coef_)
print('Intercept:', slr.intercept_)
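
The homoscedasticity and normality assumptions above are usually checked visually with residual plots. A minimal sketch, reusing y_pred from the block above and assuming y_test holds the corresponding true targets:

PYTHON
import matplotlib.pyplot as plt

residuals = y_test - y_pred

# Residuals vs. predicted values: a random horizontal band suggests the
# linearity and constant-variance assumptions hold; a funnel or curve does not.
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.show()

# Histogram of residuals: should look roughly bell-shaped if they are normal.
plt.hist(residuals, bins=30)
plt.xlabel('Residual')
plt.show()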


5. RANSAC (RANdom SAmple Consensus)

Standard Linear Regression is highly sensitive to outliers (anomalies). RANSAC is a robust regression algorithm that fits a model using a subset of "inliers" and ignores outliers.

The Algorithm

  1. Select a random subset of samples and treat them as inliers.
  2. Fit the model to the subset.
  3. Test all other data points against the fitted model.
  4. Add points that fall within a user-given tolerance to the inlier set.
  5. Re-fit the model using all inliers.
  6. Repeat until performance meets a threshold or max iterations are reached.

Implementation

PYTHON
from sklearn.linear_model import LinearRegression, RANSACRegressor

ransac = RANSACRegressor(LinearRegression(), 
                         max_trials=100, 
                         min_samples=50, 
                         loss='absolute_error', 
                         residual_threshold=5.0)
ransac.fit(X, y)
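
After fitting, the samples RANSAC treated as inliers and the parameters of the final model (refit on those inliers) can be inspected; a short sketch continuing from the block above:

PYTHON
import numpy as np

inlier_mask = ransac.inlier_mask_          # boolean mask of inlier samples
outlier_mask = np.logical_not(inlier_mask)

print('Number of inliers:', inlier_mask.sum())
print('Slope:', ransac.estimator_.coef_)
print('Intercept:', ransac.estimator_.intercept_)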


6. Polynomial Regression

When data shows a curvilinear relationship, a straight line ($y = w_0 + w_1 x$) will underfit. Polynomial regression models the relationship as a $d$-th degree polynomial.

  • Concept: It is still considered "linear regression" because the model is linear in the coefficients ($w$), even though the features ($x$) are transformed.
  • Transformation: If we have one feature $x$, we create new features $x^2, x^3, \dots, x^d$.
  • Scikit-learn approach: Use PolynomialFeatures transformer followed by LinearRegression.

Implementation

PYTHON
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Transform features to 2nd degree
quadratic = PolynomialFeatures(degree=2)
X_quad = quadratic.fit_transform(X)

# Fit linear regression on transformed features
pr = LinearRegression()
pr.fit(X_quad, y)
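
To predict for new inputs, the same polynomial transformation must be applied before calling predict. A short sketch continuing from the block above, where X_new is a placeholder for new data with the same columns as X:

PYTHON
# Transform the new inputs with the already-fitted PolynomialFeatures object,
# then predict with the linear model trained on the quadratic features.
X_new_quad = quadratic.transform(X_new)
y_new_pred = pr.predict(X_new_quad)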


7. Regularized Regression

Regularization adds a penalty term to the cost function to prevent overfitting (high variance) by shrinking the coefficient values towards zero.

A. Ridge Regression (L2 Regularization)

  • Adds the squared magnitude of coefficients as a penalty term to the loss function.
  • Cost: $J(w) = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} w_j^2$
  • Effect: Shrinks weights but rarely forces them to exactly zero. Good for handling multicollinearity.

B. Lasso Regression (L1 Regularization)

  • Adds the absolute magnitude of the coefficients as the penalty term.
  • Cost: $J(w) = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} |w_j|$
  • Effect: Can shrink weights to exactly zero. Useful for feature selection (sparse models).

C. ElasticNet

  • A compromise between Ridge and Lasso.
  • Useful when there are multiple features correlated with one another (Lasso might pick one at random, ElasticNet is likely to pick both).

Implementation

PYTHON
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Ridge (alpha is the regularization strength, lambda)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# ElasticNet (l1_ratio=0.5 means 50% Lasso, 50% Ridge)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X_train, y_train)
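
The feature-selection effect of L1 can be seen by inspecting the fitted coefficients: with a sufficiently large alpha, some entries of lasso.coef_ are exactly zero, while ridge.coef_ stays dense. A short sketch continuing from the block above:

PYTHON
import numpy as np

print('Ridge coefficients:', ridge.coef_)
print('Lasso coefficients:', lasso.coef_)
print('Features dropped by Lasso:', int(np.sum(lasso.coef_ == 0)))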


8. Support Vector Regression (SVR)

SVR applies the principles of Support Vector Machines (SVM) to regression problems.

  • Concept: Instead of a dividing line (classification), SVR tries to fit a "tube" (margin of tolerance $\epsilon$) around the data.
  • Goal: The algorithm ignores errors situated within the $\epsilon$-tube. It only penalizes points falling outside the tube.
  • Kernels: SVR uses the Kernel trick to map data into higher dimensions to solve non-linear problems (e.g., RBF Kernel).
  • Importance of Scaling: SVR is distance-based; feature scaling (StandardScaler) is mandatory.

Implementation

PYTHON
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
y_std = sc_y.fit_transform(y.reshape(-1,1))

# Radial Basis Function Kernel
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_std, y_std.flatten())
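
Because the target was standardized, predictions come back in scaled units and must be mapped back to the original scale; a short sketch continuing from the block above:

PYTHON
# Predictions are in standardized units; invert the target scaling to get
# values back in the original units of y.
y_pred_std = svr.predict(X_std)
y_pred = sc_y.inverse_transform(y_pred_std.reshape(-1, 1)).flatten()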


9. Decision Tree Regression

Decision trees split the data into subsets based on feature values to make predictions.

  • Structure:
    • Root/Internal Nodes: Represent a decision rule on a feature (e.g., $x_j \leq t$ for some feature $x_j$ and threshold $t$).
    • Leaf Nodes: Represent the output value (usually the average of the target values of samples in that leaf).
  • Splitting Criterion: In classification, we use Entropy/Gini. In regression, we use MSE (Variance Reduction). The split is chosen to minimize the variance of the target variable in the resulting child nodes.
  • Pros: Can model non-linear relationships; requires no feature scaling.
  • Cons: Highly prone to overfitting (high variance) if the tree grows too deep.

Implementation

PYTHON
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_train, y_train)
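
Because a depth-limited tree is small, the learned split rules and leaf values (the averages used as predictions) can be printed directly with scikit-learn's export_text; a short sketch for the tree fitted above:

PYTHON
from sklearn.tree import export_text

# Shows each split rule and the constant value predicted at every leaf
print(export_text(tree))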


10. Random Forest Regression

Random Forest is an ensemble method that combines multiple Decision Trees to improve generalization and reduce overfitting.

  • Method (Bagging):
    1. Create bootstrap samples (random samples with replacement) from the training data.
    2. Train a decision tree on each sample.
    3. Randomness: At each node split, only a random subset of features is considered.
  • Prediction: The final output is the average of the predictions of all individual trees.
  • Advantages:
    • More robust to noise and outliers than single trees.
    • Reduces overfitting significantly.
    • Provides "Feature Importance" scores.

Implementation

PYTHON
from sklearn.ensemble import RandomForestRegressor

# n_estimators = number of trees
forest = RandomForestRegressor(n_estimators=1000, 
                               criterion='squared_error', 
                               random_state=1, 
                               n_jobs=-1)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
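
The feature importance scores mentioned above are exposed on the fitted model as feature_importances_ (impurity-based scores that sum to 1); a short sketch continuing from the block above:

PYTHON
import numpy as np

importances = forest.feature_importances_

# Print features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f'Feature {idx}: {importances[idx]:.3f}')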