Unit 4 - Notes
INT395
Unit 4: Regression with scikit-learn
1. Introduction to Regression
Regression analysis is a subfield of supervised machine learning that models the relationship between one or more features (independent variables) and a continuous target variable (dependent variable).
- Goal: Predict a continuous numerical value (e.g., house prices, temperature, stock prices) based on input features.
- Mathematical Representation: Given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, the goal is to learn a mapping function $f$ such that the prediction $\hat{y} = f(x)$ is as close to the true $y$ as possible.
Types of Relationships
- Simple Regression: One independent variable and one dependent variable.
- Multiple Regression: Multiple independent variables and one dependent variable.
- Multivariate Regression: Multiple dependent variables (distinct from multiple regression).
2. Exploratory Data Analysis (EDA)
Before building a model, it is crucial to understand the data structure, outliers, and relationships.
Visualizing Relationships
- Scatter Plot Matrix (Pairplot): Visualizes the pairwise relationships between features and the target as scatter plots. Useful for detecting linearity and outliers.
- Correlation Matrix (Heatmap): Quantifies linear relationships.
- Pearson correlation coefficient ($r$): Ranges from $-1$ to $1$.
- $r = 1$: Perfect positive correlation.
- $r = -1$: Perfect negative correlation.
- $r = 0$: No linear correlation.
Implementation with Python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Load dataset
df = pd.read_csv('data.csv')
# Scatter plot matrix
sns.pairplot(df, height=2.5)
plt.show()
# Correlation Matrix Heatmap
cm = df.corr()  # keep the DataFrame so feature names appear as axis labels
sns.heatmap(cm, annot=True, fmt='.2f')
plt.show()
3. Evaluation Metrics
To measure the performance of a regression model, we calculate the difference between the predicted values ($\hat{y}$) and the actual values ($y$).
Key Metrics
- Mean Absolute Error (MAE): $\frac{1}{n}\sum_{i=1}^{n} |y^{(i)} - \hat{y}^{(i)}|$. The average of the absolute differences between predictions and actual values. It is more robust to outliers than MSE.
- Mean Squared Error (MSE): $\frac{1}{n}\sum_{i=1}^{n} \big(y^{(i)} - \hat{y}^{(i)}\big)^2$. The average of the squared differences. It heavily penalizes large errors and is convenient for deriving gradient descent updates.
- Root Mean Squared Error (RMSE): $\sqrt{\mathrm{MSE}}$. It is interpretable because it is in the same units as the target variable.
- Coefficient of Determination ($R^2$ Score): Represents the proportion of variance in the dependent variable that is predictable from the independent variables.
- Range: $-\infty$ to $1$ (it becomes negative when the model performs worse than simply predicting the mean).
- $R^2 = 1$: Perfect fit.
- $R^2 = 0$: Model predicts the mean of the target.
- Formula: $R^2 = 1 - \frac{SSE}{SST}$ (where SSE is the Sum of Squared Errors and SST is the Total Sum of Squares).
Implementation
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('R2:', r2_score(y_test, y_pred))
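As a quick check of the $R^2$ formula above, RMSE and $R^2$ can also be computed by hand with NumPy. This is a minimal sketch that assumes y_test and y_pred are NumPy arrays, as in the snippet above.
import numpy as np
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # RMSE, in the same units as the target
sse = np.sum((y_test - y_pred) ** 2)                 # Sum of Squared Errors
sst = np.sum((y_test - np.mean(y_test)) ** 2)        # Total Sum of Squares
print('RMSE:', rmse)
print('R2 (manual):', 1 - sse / sst)                 # should match r2_score(y_test, y_pred)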
4. Linear Regression (Ordinary Least Squares)
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
- Equation: $y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m$
- $w_0$: Bias (Intercept)
- $w_1, \dots, w_m$: Weights (Coefficients)
- Objective: Minimize the Residual Sum of Squares (RSS) as the cost function: $J(w) = \sum_{i=1}^{n} \big(y^{(i)} - \hat{y}^{(i)}\big)^2$.
Assumptions
- Linearity: The relationship between X and y is linear.
- Independence: Observations are independent of each other.
- Homoscedasticity: The variance of residuals is constant across all levels of X.
- Normality: The residuals of the model are normally distributed.
Implementation
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(X_train, y_train)
y_pred = slr.predict(X_test)
print('Slope:', slr.coef_)
print('Intercept:', slr.intercept_)
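To visually check the linearity and homoscedasticity assumptions listed above, it is common to plot the residuals against the predicted values after fitting; they should scatter randomly around zero with roughly constant spread. A minimal sketch using matplotlib and the y_pred computed above:
import matplotlib.pyplot as plt
residuals = y_test - y_pred                      # actual minus predicted
plt.scatter(y_pred, residuals, alpha=0.6)        # look for random scatter with constant spread
plt.axhline(y=0, color='red', linestyle='--')    # reference line at zero error
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.show()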
5. RANSAC (RANdom SAmple Consensus)
Standard Linear Regression is highly sensitive to outliers (anomalies). RANSAC is a robust regression algorithm that fits a model using a subset of "inliers" and ignores outliers.
The Algorithm
- Select a random subset of samples to treat as inliers.
- Fit the model to the subset.
- Test all other data points against the fitted model.
- Add points that fall within a user-given tolerance to the inlier set.
- Re-fit the model using all inliers.
- Repeat until performance meets a threshold or max iterations are reached.
Implementation
from sklearn.linear_model import RANSACRegressor
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         loss='absolute_error',
                         residual_threshold=5.0)
ransac.fit(X, y)
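After fitting, the samples RANSAC treated as inliers and the line learned from them can be inspected through the fitted object's attributes; a short sketch:
import numpy as np
inlier_mask = ransac.inlier_mask_                # boolean mask of samples used as inliers
outlier_mask = np.logical_not(inlier_mask)       # the remaining samples were treated as outliers
print('Inliers:', inlier_mask.sum(), 'Outliers:', outlier_mask.sum())
print('Slope:', ransac.estimator_.coef_)         # model fitted on the inliers only
print('Intercept:', ransac.estimator_.intercept_)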
6. Polynomial Regression
When data shows a curvilinear relationship, a straight line ($y = w_0 + w_1 x$) will underfit. Polynomial regression models the relationship as a $d$-th degree polynomial.
- Concept: It is still considered "linear regression" because the model is linear in the coefficients ($w$), even though the features ($x$) are transformed.
- Transformation: If we have one feature $x$, we create new features $x^2, x^3, \dots, x^d$.
- Scikit-learn approach: Use the PolynomialFeatures transformer followed by LinearRegression.
Implementation
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Transform features to 2nd degree
quadratic = PolynomialFeatures(degree=2)
X_quad = quadratic.fit_transform(X)
# Fit linear regression on transformed features
pr = LinearRegression()
pr.fit(X_quad, y)
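To see what the quadratic features buy us, the fit can be compared against a plain straight-line fit on the same data; a minimal sketch that continues from the snippet above and scores both models on the training data:
from sklearn.metrics import r2_score
# Plain straight-line fit on the raw feature, for comparison
lr = LinearRegression()
lr.fit(X, y)
print('Linear R2:   ', r2_score(y, lr.predict(X)))
# Quadratic fit is scored on the transformed features
print('Quadratic R2:', r2_score(y, pr.predict(X_quad)))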
7. Regularized Regression
Regularization adds a penalty term to the cost function to prevent overfitting (high variance) by shrinking the coefficient values towards zero.
A. Ridge Regression (L2 Regularization)
- Adds the squared magnitude of coefficients as a penalty term to the loss function.
- Cost: $J(w) = \sum_{i=1}^{n} \big(y^{(i)} - \hat{y}^{(i)}\big)^2 + \lambda \sum_{j=1}^{m} w_j^2$
- Effect: Shrinks weights but rarely enforces them to exactly zero. Good for handling multicollinearity.
B. Lasso Regression (L1 Regularization)
- Adds the absolute value of magnitude of coefficients as penalty.
- Cost: $J(w) = \sum_{i=1}^{n} \big(y^{(i)} - \hat{y}^{(i)}\big)^2 + \lambda \sum_{j=1}^{m} |w_j|$
- Effect: Can shrink weights to exactly zero. Useful for feature selection (sparse models).
C. ElasticNet
- A compromise between Ridge and Lasso.
- Useful when there are multiple features correlated with one another (Lasso might pick one at random, ElasticNet is likely to pick both).
Implementation
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Ridge (alpha is the regularization strength, lambda)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# ElasticNet (l1_ratio=0.5 means 50% Lasso, 50% Ridge)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X_train, y_train)
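To illustrate the feature-selection effect of Lasso described above, the fitted coefficients can be inspected: any coefficient shrunk to exactly zero means the corresponding feature has effectively been dropped. A short sketch using the lasso model fitted above:
import numpy as np
print('Lasso coefficients:', lasso.coef_)
n_selected = np.sum(lasso.coef_ != 0)            # coefficients at exactly zero are excluded
print('Features kept:', n_selected, 'of', lasso.coef_.shape[0])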
8. Support Vector Regression (SVR)
SVR applies the principles of Support Vector Machines (SVM) to regression problems.
- Concept: Instead of a dividing line (classification), SVR tries to fit a "tube" (margin of tolerance $\epsilon$) around the data.
- Goal: The algorithm ignores errors situated within the $\epsilon$-tube. It only penalizes points falling outside the tube.
- Kernels: SVR uses the Kernel trick to map data into higher dimensions to solve non-linear problems (e.g., RBF Kernel).
- Importance of Scaling: SVR is distance-based; feature scaling (StandardScaler) is mandatory.
Implementation
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
y_std = sc_y.fit_transform(y.reshape(-1,1))
# Radial Basis Function Kernel
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_std, y_std.flatten())
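Because the target was standardized, the model's predictions come out in standardized units and must be mapped back with the target scaler. A minimal sketch, where X_new is a hypothetical array of new samples in the original (unscaled) feature units:
# X_new: hypothetical new samples in the original feature units
X_new_std = sc_x.transform(X_new)                            # reuse the already-fitted scaler
y_pred_std = svr.predict(X_new_std)                          # predictions in standardized units
y_pred = sc_y.inverse_transform(y_pred_std.reshape(-1, 1))   # back to the original target units
print(y_pred.flatten())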
9. Decision Tree Regression
Decision trees split the data into subsets based on feature values to make predictions.
- Structure:
- Root/Internal Nodes: Represent a decision rule on a feature (e.g., $x_j \le t$ for some threshold $t$).
- Leaf Nodes: Represent the output value (usually the average of the target values of samples in that leaf).
- Splitting Criterion: In classification, we use Entropy/Gini. In regression, we use MSE (Variance Reduction). The split is chosen to minimize the variance of the target variable in the resulting child nodes.
- Pros: Can model non-linear relationships; requires no feature scaling.
- Cons: Highly prone to overfitting (high variance) if the tree grows too deep.
Implementation
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_train, y_train)
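To inspect the splits the tree actually learned, scikit-learn can print the fitted rules as plain text; a minimal sketch, assuming X_train is a pandas DataFrame so its column names can be passed as feature names:
from sklearn.tree import export_text
# Print each decision rule and the value predicted at each leaf
rules = export_text(tree, feature_names=list(X_train.columns))
print(rules)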
10. Random Forest Regression
Random Forest is an ensemble method that combines multiple Decision Trees to improve generalization and reduce overfitting.
- Method (Bagging):
- Create bootstrap samples (random samples with replacement) from the training data.
- Train a decision tree on each sample.
- Randomness: At each node split, only a random subset of features is considered.
- Prediction: The final output is the average of the predictions of all individual trees.
- Advantages:
- More robust to noise and outliers than single trees.
- Reduces overfitting significantly.
- Provides "Feature Importance" scores.
Implementation
from sklearn.ensemble import RandomForestRegressor
# n_estimators = number of trees
forest = RandomForestRegressor(n_estimators=1000,
                               criterion='squared_error',
                               random_state=1,
                               n_jobs=-1)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
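The "Feature Importance" scores mentioned above are exposed on the fitted model as feature_importances_; a short sketch that again assumes X_train is a pandas DataFrame with named columns:
import numpy as np
importances = forest.feature_importances_        # one score per feature, summing to 1
indices = np.argsort(importances)[::-1]          # most important features first
for rank, idx in enumerate(indices, start=1):
    print(f'{rank}) {X_train.columns[idx]}: {importances[idx]:.3f}')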