Unit 2 - Notes

INT234

Unit 2: SUPERVISED LEARNING: REGRESSION

1. Introduction to Regression Analysis

Regression analysis is a fundamental supervised learning technique used for predictive modeling. Its primary goal is to investigate the relationship between a dependent variable (target) and one or more independent variables (predictors/features). In regression, the target variable is always continuous (numerical).


2. Simple Linear Regression (SLR)

Definition

Simple Linear Regression is the most basic form of regression analysis. It models the relationship between a single independent variable ($x$) and a dependent variable ($y$) by fitting a straight line to the observed data.

The Equation

The mathematical representation of the population model is:

$y = \beta_0 + \beta_1 x + \epsilon$

Where:

  • $y$: The dependent variable (what we want to predict).
  • $x$: The independent variable (input).
  • $\beta_0$: The Y-intercept (the value of $y$ when $x = 0$).
  • $\beta_1$: The Slope (the change in $y$ for a one-unit change in $x$).
  • $\epsilon$: The Error term (residuals), representing the variability in $y$ not explained by the linear relationship.

The Prediction Function

When the model is trained, we produce a prediction equation:

$\hat{y} = b_0 + b_1 x$

  • $\hat{y}$: The predicted value.
  • $b_0, b_1$: The estimated coefficients derived from the data.
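
For illustration, here is a minimal sketch of fitting and using a simple linear regression with scikit-learn; the hours-studied/exam-score data is hypothetical and chosen only to make the output easy to read.

```python
# Minimal sketch: fit y-hat = b0 + b1*x with scikit-learn on hypothetical data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied (x) vs. exam score (y)
X = np.array([[1], [2], [3], [4], [5]])   # shape (n_samples, n_features)
y = np.array([52, 58, 65, 70, 78])

model = LinearRegression().fit(X, y)

print("Intercept b0:", model.intercept_)
print("Slope b1:", model.coef_[0])
print("Prediction for x = 6:", model.predict([[6]])[0])
```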

3. Ordinary Least Squares (OLS) Estimation

Concept

OLS is the optimization method used to estimate the unknown parameters ($\beta_0$ and $\beta_1$) in a linear regression model. The goal of OLS is to find the line of "best fit."

Mechanism

The "best fit" line is defined as the line that minimizes the sum of the squared vertical differences (residuals) between the observed values and the predicted values.

The Cost Function (Sum of Squared Errors - SSE):

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

OLS uses calculus (partial derivatives) to find the values of $b_0$ and $b_1$ that make SSE as small as possible.
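
As a sketch of what this minimization produces, the closed-form estimates that come from setting the partial derivatives of SSE to zero can be computed directly with NumPy (the data below is hypothetical):

```python
# Sketch of OLS "by hand" using the closed-form solution:
#   b1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),  b0 = y_mean - b1 * x_mean
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 65.0, 70.0, 78.0])

x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

residuals = y - (b0 + b1 * x)
sse = np.sum(residuals ** 2)     # the quantity OLS minimizes

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.3f}")
```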

Key Assumptions of OLS

For OLS to provide valid statistical inferences, the following assumptions must hold:

  1. Linearity: The relationship between $x$ and $y$ is linear.
  2. Independence: Observations are independent of each other.
  3. Homoscedasticity: The variance of the residual terms is constant at every level of $x$.
  4. Normality: The error terms are normally distributed (important for hypothesis testing/confidence intervals, less critical for pure prediction).

4. Correlations

Pearson Correlation Coefficient ($r$)

Correlation measures the strength and direction of the linear relationship between two continuous variables.

  • Range: $-1 \le r \le +1$
  • $r = +1$: Perfect positive linear relationship.
  • $r = -1$: Perfect negative linear relationship.
  • $r = 0$: No linear relationship.

Relationship to Regression

  • Correlation quantifies association; Regression quantifies prediction.
  • In Simple Linear Regression, the coefficient of determination ($R^2$) is the square of the Pearson correlation coefficient ($R^2 = r^2$).
  • Warning: Correlation does not imply causation. A high correlation between $x$ and $y$ does not mean $x$ causes $y$.
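
A quick sketch of computing $r$ with NumPy and checking the $R^2 = r^2$ identity on hypothetical data:

```python
# Sketch: Pearson correlation with NumPy; for SLR, r squared equals R^2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 65.0, 70.0, 78.0])

r = np.corrcoef(x, y)[0, 1]       # Pearson correlation coefficient
print("r   =", round(r, 4))
print("r^2 =", round(r ** 2, 4))  # equals the R^2 of a simple linear regression on the same data
```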

5. Multiple Linear Regression (MLR)

Definition

MLR extends simple linear regression to include two or more independent variables. It accounts for the fact that a dependent variable is often influenced by multiple factors simultaneously.

The Equation

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$

Where:

  • $x_1, x_2, \dots, x_n$: Multiple distinct independent variables.
  • $\beta_1, \beta_2, \dots, \beta_n$: Partial regression coefficients. $\beta_i$ represents the change in $y$ for a one-unit change in $x_i$, holding all other variables constant.
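
A minimal sketch of fitting an MLR model with two hypothetical predictors in scikit-learn; each fitted coefficient is read as a partial effect:

```python
# Sketch: multiple linear regression with two hypothetical predictors
# (hours studied, hours slept) predicting an exam score.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 8], [2, 7], [3, 6], [4, 8], [5, 5]])  # columns: x1, x2
y = np.array([52, 58, 65, 74, 75])

model = LinearRegression().fit(X, y)

# Each coefficient is a partial effect: the change in y for a one-unit
# change in that feature, holding the other feature constant.
print("b0:", model.intercept_)
print("b1, b2:", model.coef_)
```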

Multicollinearity

A specific challenge in MLR is multicollinearity, which occurs when independent variables are highly correlated with each other.

  • Effect: It makes it difficult to determine the individual effect of each independent variable.
  • Detection: Variance Inflation Factor (VIF).
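
A sketch of VIF-based detection using statsmodels; the feature values below are hypothetical, and the "VIF above roughly 5-10 signals a problem" threshold is a common rule of thumb rather than a hard rule.

```python
# Sketch: detecting multicollinearity with the Variance Inflation Factor (statsmodels).
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],   # nearly an exact multiple of x1 -> very high VIF
    "x3": [5, 3, 6, 2, 7, 4],
})
X_const = add_constant(X)          # VIF is computed on a design matrix that includes an intercept

for i, col in enumerate(X_const.columns):
    # The row for "const" can be ignored; the feature rows are what matter.
    print(col, variance_inflation_factor(X_const.values, i))
```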

6. Polynomial Regression

Definition

Polynomial regression is a form of regression analysis used when the relationship between the independent and dependent variables is non-linear (curved).

The Concept

Although it models a non-linear relationship between $x$ and $y$, it is still considered a linear model because the equation is linear in its coefficients. We transform the input features by raising them to a power (e.g., adding $x^2$ and $x^3$ terms) and then estimate the coefficients exactly as in ordinary linear regression.

The Equation (Degree 2)

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$

Bias-Variance Tradeoff

  • Underfitting: Using a linear model (Degree 1) on curved data.
  • Overfitting: Using a very high degree polynomial (e.g., Degree 15) which passes through every data point but fails to generalize to new data.
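
A short sketch of degree-2 polynomial regression in scikit-learn, built as a feature transform followed by ordinary least squares; the curved data is hypothetical:

```python
# Sketch: degree-2 polynomial regression = polynomial feature transform + OLS.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.1, 4.9, 10.2, 17.1, 25.8, 37.0])   # roughly quadratic in x

# PolynomialFeatures adds the x^2 term; the model remains linear in its
# coefficients, which is why ordinary least squares can still fit it.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

print("Prediction at x = 7:", poly_model.predict([[7]])[0])
```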

7. Logistic Regression

Definition

Despite its name, Logistic Regression is a classification algorithm, not a regression algorithm in the traditional sense. It is used to predict a discrete outcome (e.g., Yes/No, 0/1, True/False).

Why "Regression"?

It is called regression because it estimates the probability of an event occurring using a regression-like formula, which is then mapped to a class.

The Sigmoid Function (Logistic Function)

Linear regression produces values from $-\infty$ to $+\infty$. To map this to a probability (0 to 1), we wrap the linear equation $z = b_0 + b_1 x_1 + \dots + b_n x_n$ in a Sigmoid function:

$p = \sigma(z) = \dfrac{1}{1 + e^{-z}}$

  • S-Curve: The output forms an "S" shape.
  • Decision Boundary: A threshold (usually 0.5) is applied to the probability to classify the result.
    • If $p \ge 0.5$: Class 1
    • If $p < 0.5$: Class 0
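
A minimal sketch of logistic regression as a binary classifier in scikit-learn; the single feature and pass/fail labels are hypothetical:

```python
# Sketch: binary classification with logistic regression (hours studied -> pass/fail).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = fail, 1 = pass

clf = LogisticRegression().fit(X, y)

# predict_proba returns the sigmoid output (probability of class 1);
# predict applies the default 0.5 threshold to that probability.
print("P(pass | 2.2 hours):", clf.predict_proba([[2.2]])[0, 1])
print("Predicted class:", clf.predict([[2.2]])[0])
```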

8. Evaluating Model Performance

Evaluating how well a regression model predicts the target variable is crucial. We compare the Predicted values ($\hat{y}$) against the Actual values ($y$).

A. Mean Absolute Error (MAE)

The average of the absolute differences between predictions and actual values:

$MAE = \dfrac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$

  • Interpretation: Represents the average magnitude of errors.
  • Pros: Robust to outliers compared to MSE.
  • Cons: Not differentiable at 0 (harder for some optimization algorithms).

B. Mean Squared Error (MSE)

The average of the squared differences between predictions and actual values:

$MSE = \dfrac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

  • Interpretation: Measures the variance of the residuals.
  • Pros: Heavily penalizes large errors (squaring makes big errors much bigger). Differentiable (good for gradient descent).
  • Cons: Not in the original unit of the target variable (e.g., if predicting dollars, MSE is "dollars squared").

C. Root Mean Squared Error (RMSE)

The square root of the MSE:

$RMSE = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

  • Interpretation: Represents the standard deviation of the residuals.
  • Pros: It is in the same unit as the target variable, making it highly interpretable. Like MSE, it penalizes large errors.

D. R-squared ($R^2$) Score

Also known as the Coefficient of Determination. It represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

$R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}} = 1 - \dfrac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$

  • Where $\bar{y}$ is the mean of the observed data.
  • Range: Usually $0$ to $1$.
    • $R^2 = 1$: Model explains 100% of the variance (perfect fit).
    • $R^2 = 0$: Model explains none of the variance (equivalent to guessing the mean).
  • Limitation: $R^2$ always increases as you add more variables, even if they are irrelevant. (Adjusted $R^2$ is used to counter this in advanced MLR).
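
All four metrics can be computed directly; here is a sketch using scikit-learn's metrics module on hypothetical predictions:

```python
# Sketch: computing MAE, MSE, RMSE and R^2 with scikit-learn.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.5, 7.0, 11.0])   # hypothetical model predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                         # same units as the target variable
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```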

Summary Cheat Sheet

Metric | Formula | Concept / Use Case
SLR | $\hat{y} = b_0 + b_1 x$ | One input, one output; linear relationship.
MLR | $\hat{y} = b_0 + b_1 x_1 + \dots + b_n x_n$ | Multiple inputs, one output.
Logistic | $p = \frac{1}{1 + e^{-z}}$ | Sigmoid function; binary classification (0 or 1).
MAE | $\frac{1}{n}\sum \lvert y_i - \hat{y}_i \rvert$ | Average absolute error; use when outliers shouldn't be penalized excessively.
RMSE | $\sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2}$ | Standard metric; penalizes large errors; interpretable units.
$R^2$ | $1 - \frac{SS_{res}}{SS_{tot}}$ | Goodness of fit (0 to 1).