Unit 4 - Notes

CSE274

Unit 4: Regression

1. Regression vs. Classification

In supervised machine learning, the distinction between regression and classification lies in the nature of the target variable (output).

Feature	Regression	Classification
Output Type	Continuous (numerical) values.	Categorical (discrete) class labels.
Goal	To predict a specific quantity (e.g., price, temperature, sales).	To predict group membership (e.g., spam/not spam, digit 0-9).
Evaluation Metrics	RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), $R^2$.	Accuracy, Precision, Recall, F1-Score, ROC-AUC.
Boundary	Fits a line, curve, or hyperplane through the data points.	Finds a decision boundary that separates data points into classes.
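The regression metrics named in the table can be computed directly from their definitions. The sketch below uses only the standard library; the function name `regression_metrics` and the sample values are assumptions made for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                     # undefined if y_true is constant
    return rmse, mae, r2

# Hypothetical values, chosen only for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(rmse, mae, r2)
```

RMSE penalizes large errors more heavily than MAE because errors are squared before averaging, while $R^2$ compares the model against always predicting the mean.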

Visualizing the Difference

Figure: Split comparison with two panels; the left panel is titled 'Regression'.

2. Bias-Variance Considerations in Regression

The bias-variance tradeoff is fundamental to understanding model generalization errors.

Bias (Underfitting): Error introduced by approximating a real-world problem (which may be complex) with a much simpler model.
- High Bias: The model pays too little attention to the training data and oversimplifies the underlying relationship. Example: Fitting a straight line to quadratic data.
Variance (Overfitting): Error introduced by the model's sensitivity to small fluctuations in the training set.
- High Variance: The model pays too much attention to the training data, capturing random noise rather than the underlying pattern. Example: A high-degree polynomial passing through every data point.

Total Error Equation:
$E[(y - \hat{f}(x))^2] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$
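(Where $\sigma^2$ is the irreducible error, i.e., the variance of the noise $\epsilon$, and the expectation is taken over training sets and noise.)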
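A small experiment can make the tradeoff concrete. The sketch below assumes NumPy is available and uses synthetic quadratic data (the data, noise level, and degrees 1, 2, and 9 are all illustrative choices): a straight line underfits, while higher degrees keep lowering the training error without necessarily helping on new points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic quadratic data with noise (an assumption for illustration).
x_train = np.linspace(-1.0, 1.0, 20)
y_train = 1.0 + 2.0 * x_train**2 + rng.normal(0.0, 0.1, x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = 1.0 + 2.0 * x_test**2 + rng.normal(0.0, 0.1, x_test.size)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse[degree] = mse(y_train, np.polyval(coeffs, x_train))
    test_mse[degree] = mse(y_test, np.polyval(coeffs, x_test))
    print(degree, round(train_mse[degree], 4), round(test_mse[degree], 4))
```

Training error can only go down as the degree grows, because each lower-degree polynomial is a special case of the higher-degree one. Test error is the quantity that reveals whether the extra flexibility is fitting signal or noise.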

Figure: Line graph of the Bias-Variance Tradeoff; the X-axis is labeled 'Model Complexity'.

3. Simple Linear Regression (SLR)

SLR models the relationship between a single independent variable ($x$) and a dependent variable ($y$) using a straight line.

The Equation:
$y = \beta_0 + \beta_1x + \epsilon$

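$\beta_0$: Intercept (value of $y$ when $x=0$).
$\beta_1$: Slope (change in $y$ per one-unit change in $x$).
$\epsilon$: Error term (the part of $y$ the line does not explain; the residuals $y_i - \hat{y}_i$ are its estimates from the fitted model).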

Cost Function (Ordinary Least Squares - OLS):
The goal is to minimize the Residual Sum of Squares (RSS):
$RSS = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2$
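For a single predictor, minimizing the RSS has a well-known closed-form solution: $\hat{\beta}_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. A minimal sketch follows, with the function names and sample data chosen purely for illustration.

```python
def fit_slr(x, y):
    """Closed-form OLS estimates (b0, b1) for y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = y_mean - b1 * x_mean    # intercept
    return b0, b1

def rss(x, y, b0, b1):
    """Residual sum of squares for a candidate line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data lying close to y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_slr(x, y)
print(b0, b1, rss(x, y, b0, b1))
```

Any other line gives a larger RSS on the same data, which can be checked by perturbing `b0` or `b1` and calling `rss` again.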

Assumptions:

Linearity: The relationship between $X$ and $Y$ is linear.
Independence: Observations are independent of each other.
Homoscedasticity: The variance of the residuals is the same for any value of $X$.
Normality: The error terms are normally distributed.

4. Multiple Linear Regression (MLR)

MLR extends SLR to include multiple independent variables ($x_1, x_2, ..., x_p$).

The Equation:
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p + \epsilon$

Matrix Form:
$Y = X\beta + \epsilon$

With $p$ predictors, the model fits a $p$-dimensional hyperplane in the $(p+1)$-dimensional space of predictors and response, rather than a line.
Multicollinearity: A specific challenge in MLR where predictor variables are highly correlated with each other, making coefficient estimates unstable.
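The matrix form can be solved numerically as a least-squares problem. The sketch below assumes NumPy is available; the data are hypothetical and noise-free so the recovered coefficients are easy to check. It also computes the correlation between the two predictors as a rough first look at multicollinearity.

```python
import numpy as np

# Hypothetical data: 5 observations, p = 2 predictors.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
y = 1.0 + 2.0 * X_raw[:, 0] - 0.5 * X_raw[:, 1]  # noise-free for clarity

# Prepend a column of ones so beta[0] plays the role of the intercept.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Least-squares solution of Y = X beta; lstsq avoids explicitly inverting X^T X.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Rough multicollinearity check: correlation between the two predictors.
corr = np.corrcoef(X_raw[:, 0], X_raw[:, 1])[0, 1]
print(beta, corr)
```

A pairwise correlation close to 1 or -1 is a warning sign; with more than two predictors, a pairwise check can miss collinearity that involves several variables at once.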

5. Interpretation of Coefficients

Interpreting $\beta$ values is crucial for inference.

Intercept ($\beta_0$): The expected value of $y$ when all predictors $x_j$ are zero. This may not have a physical meaning if zero lies outside the observed range of the data.
Slope Coefficients ($\beta_j$):
- In SLR: A one-unit increase in $x$ is associated with a change of $\beta_1$ in the expected value of $y$.
- In MLR: A one-unit increase in $x_j$ is associated with a change of $\beta_j$ in the expected value of $y$, holding all other predictors constant.
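The "holding all other predictors constant" reading can be checked numerically. In this sketch the coefficients are hypothetical, not fitted from any dataset: raising `x1` by one unit while `x2` stays fixed changes the prediction by exactly the `x1` coefficient.

```python
# Hypothetical fitted coefficients for y = b0 + b1*x1 + b2*x2.
beta = {"intercept": 10.0, "x1": 3.0, "x2": -2.0}

def predict(x1, x2):
    return beta["intercept"] + beta["x1"] * x1 + beta["x2"] * x2

base = predict(x1=4.0, x2=1.0)
bumped = predict(x1=5.0, x2=1.0)   # x1 up by one unit, x2 held constant
delta = bumped - base              # equals beta["x1"]
print(base, bumped, delta)
```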

Statistical Significance:

p-value: Tests the null hypothesis that $\beta_j = 0$ (no linear relationship, given the other predictors in the model). If $p < 0.05$ (a conventional threshold), the predictor is considered statistically significant.

6. Polynomial Feature Expansion

When data shows a non-linear pattern (curved), linear regression can still be used by transforming the features. This is a form of Basis Expansion.

Concept:
Instead of fitting $y = \beta_0 + \beta_1x$ , we fit:
$y = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + ...$

Although the relationship between $y$ and $x$ is non-linear, the model is still linear in the parameters ($\beta$), so OLS can still be used.

Implementation Example:
If input $X = [a, b]$, a degree-2 polynomial expansion creates features: $[1, a, b, a^2, b^2, ab]$.
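The expansion can be generated for any number of inputs and any degree by enumerating monomials. The sketch below uses only the standard library; the function name `polynomial_features` is an assumption for illustration. It emits the same six features as the example above, in the order $[1, a, b, a^2, ab, b^2]$.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(row, degree):
    """Return all monomials of the inputs up to `degree`, starting with the bias term 1."""
    features = []
    for d in range(degree + 1):
        # Each combination of d input indices (with repeats) is one monomial of degree d.
        for combo in combinations_with_replacement(range(len(row)), d):
            features.append(prod(row[i] for i in combo))
    return features

print(polynomial_features([2, 3], degree=2))  # [1, 2, 3, 4, 6, 9]
```

The number of features grows quickly with the degree and the number of inputs, which is one reason high-degree expansions tend to overfit.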

7. Regularized Regression Models

Regularization introduces a penalty term to the loss function to prevent overfitting by constraining the size of the coefficients.

A. Ridge Regression (L2 Regularization)

Adds a penalty proportional to the sum of the squared coefficients.
$Cost = RSS + \lambda \sum_{j=1}^{p} \beta_j^2$

Effect: Shrinks coefficients toward zero but rarely exactly to zero.
Use case: Good when many variables are correlated (handles multicollinearity).

B. Lasso Regression (L1 Regularization)

Adds a penalty proportional to the sum of the absolute values of the coefficients.
$Cost = RSS + \lambda \sum_{j=1}^{p} |\beta_j|$

Effect: Can shrink coefficients exactly to zero.
Use case: Feature selection (produces sparse models).

C. Elastic Net

Combines L1 and L2 penalties.
$Cost = RSS + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2$
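With a single coefficient and no intercept, all three cost functions have a closed-form minimizer, which makes the different shrinkage behaviors easy to see. This simplification is an assumption made for illustration: setting the derivative of $\sum (y_i - \beta x_i)^2 + \lambda_1 |\beta| + \lambda_2 \beta^2$ to zero gives $\beta = S(\sum x_i y_i, \lambda_1 / 2) / (\sum x_i^2 + \lambda_2)$, where $S$ is the soft-thresholding function. The function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net_1d(x, y, lam1=0.0, lam2=0.0):
    """Minimiser of sum((y - b*x)^2) + lam1*|b| + lam2*b^2 for a single coefficient b."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam1 / 2.0) / (sxx + lam2)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]                              # exactly y = 2x
ols = elastic_net_1d(x, y)                       # no penalty
ridge = elastic_net_1d(x, y, lam2=14.0)          # L2 only: shrunk, not zero
lasso = elastic_net_1d(x, y, lam1=28.0)          # L1 only: shrunk
lasso_zero = elastic_net_1d(x, y, lam1=56.0)     # L1 large enough: exactly zero
print(ols, ridge, lasso, lasso_zero)
```

The ridge penalty enlarges the denominator, so the estimate shrinks but stays nonzero, while the lasso penalty subtracts from the numerator and can set it to exactly zero.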

Figure: Geometric comparison of the L1 (Lasso) and L2 (Ridge) regularization constraints.

8. Effect of Regularization on Model Complexity

The hyperparameter $\lambda$ (lambda) controls the regularization strength.

$\lambda = 0$: Identical to OLS (no regularization).
Small $\lambda$: Low bias, high variance. The model is flexible and can fit complex patterns.
Large $\lambda$: High bias, low variance. Coefficients are heavily penalized (shrunk), and the model becomes very simple (a flatter line).
- In Lasso, once $\lambda$ is large enough, all coefficients are exactly zero.
- In Ridge, as $\lambda \to \infty$, all coefficients approach zero but generally do not reach it for any finite $\lambda$.

Selection: The optimal $\lambda$ is usually selected via Cross-Validation.
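Cross-validated selection of $\lambda$ can be sketched as follows. To keep the example short it assumes a one-coefficient ridge model without an intercept, interleaved folds, and a small hand-picked grid; the data and function names are hypothetical.

```python
def ridge_1d(x, y, lam):
    """Ridge estimate for y ~ b*x (no intercept): b = sum(xy) / (sum(x^2) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def cv_error(x, y, lam, k=3):
    """Mean squared validation error of ridge_1d over k interleaved folds."""
    total, count = 0.0, 0
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        valid = [i for i in range(len(x)) if i % k == fold]
        b = ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - b * x[i]) ** 2 for i in valid)
        count += len(valid)
    return total / count

# Hypothetical noisy data around y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 3.8, 6.1, 8.4, 9.7, 12.2]
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(x, y, lam) for lam in grid}
best_lam = min(errors, key=errors.get)   # lambda with the lowest validation error
print(errors, best_lam)
```

Each candidate $\lambda$ is scored only on data the model did not see during fitting, so an overly large $\lambda$ (here 100) is penalized for underfitting.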

9. Tree-Based Regression Models

Instead of a global linear equation, tree-based models partition the feature space into rectangular regions.

Decision Trees for Regression:

Splitting: The algorithm recursively splits data based on feature values to minimize variance (usually MSE) within the resulting nodes.
Leaf Value: The prediction for a new data point is the average (mean) value of the training samples falling into that leaf node.
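A single split on one feature shows both ideas at once. The sketch below, with hypothetical data and function names, tries every midpoint between consecutive feature values, keeps the one with the lowest total squared error, and predicts with the mean of the matching side.

```python
def best_split(x, y):
    """Find the threshold on a single feature that minimises total squared error."""
    def sse(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    xs = sorted(set(x))
    best = None  # stays None if the feature has only one distinct value
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2.0  # midpoint between consecutive distinct values
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best  # (cost, threshold, left_mean, right_mean)

x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
cost, threshold, left_mean, right_mean = best_split(x, y)

# Predict for a new point x = 2.5: use the mean of the leaf it falls into.
prediction = left_mean if 2.5 <= threshold else right_mean
print(threshold, left_mean, right_mean, prediction)
```

A full regression tree applies the same search recursively to each side, over all features, until a stopping rule such as a maximum depth or minimum leaf size is reached.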

Ensemble Methods:

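Random Forest: Builds many trees independently, each on a bootstrap sample of the data (Bagging) and typically with a random subset of features considered at each split, then averages their predictions. Primarily reduces variance.
Gradient Boosting (e.g., XGBoost, LightGBM): Builds trees sequentially. Each new tree is fit to the errors (residuals) left by the previous trees. Primarily reduces bias; variance is kept in check with shrinkage (a learning rate), shallow trees, and subsampling.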
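The sequential residual-fitting idea behind gradient boosting can be sketched with depth-1 trees (stumps) on one feature and squared loss. This is a simplified illustration under those assumptions, not how XGBoost or LightGBM are implemented; the data, learning rate, and function names are hypothetical.

```python
def fit_stump(x, r):
    """Depth-1 regression tree on one feature: returns (threshold, left_mean, right_mean)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    return best[1:]

def boost(x, y, rounds=20, learning_rate=0.5):
    """Boosting for squared loss: each stump is fit to the current residuals."""
    base = sum(y) / len(y)            # start from the mean prediction
    pred = [base] * len(y)
    stumps, history = [], []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        # Add a damped version of the new stump's output to the running prediction.
        pred = [pi + learning_rate * (lm if xi <= t else rm) for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return base, stumps, history

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 1.5, 1.8, 3.9, 4.2, 4.1, 7.0, 7.3]
base, stumps, history = boost(x, y)
print(history[0], history[-1])   # training MSE after the first and last round
```

The training error never increases from one round to the next here, which is why the number of rounds and the learning rate need to be tuned on held-out data rather than on training error.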

10. Time-Series Regression Models

Time-series regression involves predicting future values based on past history. The "independent variables" are often derived from the target variable itself.

Key Characteristics:

Autocorrelation: Data points are not independent (today's price depends on yesterday's).
Stationarity: Statistical properties (mean, variance) should ideally remain constant over time.

Feature Engineering for Time Series:

Lag Features: Using $y_{t-1}, y_{t-2}$ as input to predict $y_t$.
Rolling Windows: Calculating moving averages or standard deviations over a window (e.g., past 7 days).
Date-Time Features: Extracting components like Day of Week, Month, Quarter to capture seasonality.

Data Structure Transformation:
To use standard regression algorithms (like Linear Regression or Random Forest) on time series, the data must be transformed from a sequence to a supervised matrix.
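The sequence-to-matrix transformation amounts to sliding a window over the series. A minimal sketch follows, assuming a plain Python list as input; the function name `make_supervised` and the sample series are illustrative.

```python
def make_supervised(series, n_lags):
    """Turn a sequence into (lag-feature rows, targets), most recent lag first."""
    rows, targets = [], []
    for t in range(n_lags, len(series)):
        # Features are y_{t-1}, y_{t-2}, ..., y_{t-n_lags}; the target is y_t.
        rows.append([series[t - k] for k in range(1, n_lags + 1)])
        targets.append(series[t])
    return rows, targets

X, y = make_supervised([10, 20, 30, 40], n_lags=2)
print(X)  # [[20, 10], [30, 20]]
print(y)  # [30, 40]
```

The first `n_lags` observations produce no rows because they lack a full history. When evaluating such a model, the split should respect time order so that no row is trained on values from its own future.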

Figure: Transformation of time-series data into a supervised learning format. The values shift diagonally to create the lag features of the "Supervised Matrix":

X (t-1)	X (t-2)	Y (t)
20	10	30
30	20	40
