Unit4 - Subjective Questions
CSE274 • Practice Questions with Detailed Answers
Differentiate between Regression and Classification in the context of supervised machine learning. Provide an example for each.
1. Nature of the Output Variable:
- Regression: The output variable is continuous (numerical). The model predicts a quantity or a value.
- Classification: The output variable is categorical (discrete). The model predicts a class label or probability of membership.
2. Goal:
- Regression: To find the relationship between dependent and independent variables to estimate a value.
- Classification: To find a decision boundary that separates the data into different classes.
3. Evaluation Metrics:
- Regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R² Score.
- Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC.
Examples:
- Regression: Predicting house prices based on square footage and location (Output: $450,000, $500,000).
- Classification: Predicting whether an email is spam or not spam based on keywords (Output: Spam, Not Spam).
Explain the Bias-Variance Tradeoff in regression models. How does model complexity influence this tradeoff?
The Bias-Variance Tradeoff describes the tension between two sources of error that prevent supervised learning algorithms from generalizing beyond their training set.
1. Bias (Error from Erroneous Assumptions):
- High bias causes the algorithm to miss relevant relations between features and target outputs (underfitting).
- Example: Applying a linear model to highly non-linear data.
2. Variance (Error from Sensitivity to Small Fluctuations):
- High variance causes the algorithm to model the random noise in the training data (overfitting).
- Example: A high-degree polynomial fitting every single data point.
3. Tradeoff and Complexity:
- Low Complexity (Simple models): High Bias, Low Variance.
- High Complexity (Complex models): Low Bias, High Variance.
- Total Error: Expected Error = Bias² + Variance + Irreducible Error.
The goal is to find the sweet spot in model complexity where the total error is minimized.
Define Simple Linear Regression. State the hypothesis function and the cost function used to estimate the parameters.
Definition:
Simple Linear Regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables: one independent variable (x) and one dependent variable (y). It assumes a linear relationship between them.
Hypothesis Function:
The relationship is modeled as:
ŷ = β0 + β1x
Where:
- β0 is the y-intercept.
- β1 is the slope (coefficient).
Cost Function (Ordinary Least Squares):
To find the best parameters, we minimize the Mean Squared Error (MSE):
J(β0, β1) = (1/n) Σᵢ (yᵢ − ŷᵢ)²
Where n is the number of training examples.
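The hypothesis and cost function above can be sketched in plain Python. The toy data and the trial coefficients below are assumed for illustration, not taken from the question:

```python
# Minimal sketch of the simple-linear-regression hypothesis and MSE cost.
# The data points and the trial parameters beta0, beta1 are made-up examples.

def predict(x, beta0, beta1):
    """Hypothesis: y_hat = beta0 + beta1 * x."""
    return beta0 + beta1 * x

def mse_cost(xs, ys, beta0, beta1):
    """Mean Squared Error averaged over the n training examples."""
    n = len(xs)
    return sum((y - predict(x, beta0, beta1)) ** 2 for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                    # generated exactly by y = 1 + 2x
print(mse_cost(xs, ys, 1.0, 2.0))    # perfect fit -> 0.0
print(mse_cost(xs, ys, 0.0, 2.0))    # off by 1 everywhere -> 1.0
```

A lower cost means the candidate line sits closer to the data, which is exactly what the estimation step minimizes.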
Derive the formulas for the optimal coefficients β1 (slope) and β0 (intercept) for Simple Linear Regression using the method of Ordinary Least Squares (OLS).
We aim to minimize the Sum of Squared Errors (SSE):
SSE = Σᵢ (yᵢ − β0 − β1xᵢ)²
Step 1: Partial Derivative with respect to β0:
∂SSE/∂β0 = −2 Σᵢ (yᵢ − β0 − β1xᵢ) = 0
Dividing by n, we get:
β0 = ȳ − β1x̄
Step 2: Partial Derivative with respect to β1:
∂SSE/∂β1 = −2 Σᵢ xᵢ(yᵢ − β0 − β1xᵢ) = 0
Substitute β0 = ȳ − β1x̄:
Σᵢ xᵢ(yᵢ − ȳ + β1x̄ − β1xᵢ) = 0
Rearranging for β1:
β1 = Σᵢ xᵢ(yᵢ − ȳ) / Σᵢ xᵢ(xᵢ − x̄)
Using the algebraic property Σᵢ (xᵢ − x̄) = 0, the numerator and denominator can be rewritten in the standard covariance/variance form:
β1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)² = Cov(x, y) / Var(x)
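The closed-form result of this derivation can be checked with a short pure-Python routine (toy data assumed):

```python
# Closed-form OLS fit following the derivation:
# beta1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
# beta0 = y_mean - beta1 * x_mean
# The data set below is an assumed toy example.

def ols_fit(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))  # Cov numerator
    sxx = sum((x - x_mean) ** 2 for x in xs)                        # Var numerator
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # generated exactly by y = 2x
beta0, beta1 = ols_fit(xs, ys)
print(beta0, beta1)            # -> 0.0 2.0
```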
Explain Multiple Linear Regression (MLR) using Matrix Notation. What is the Normal Equation to solve for the coefficients?
Multiple Linear Regression extends simple regression to multiple input features (x1, x2, …, xp).
Equation:
y = β0 + β1x1 + β2x2 + … + βpxp + ε
Matrix Notation:
We can represent the entire dataset and parameters as:
y = Xβ + ε
Where:
- y is an (n × 1) vector of target values.
- X is an (n × (p + 1)) matrix of input features (with a column of 1s for the intercept).
- β is a ((p + 1) × 1) vector of coefficients.
- ε is the (n × 1) error vector.
Normal Equation:
To minimize the squared error ‖y − Xβ‖², the analytical solution for β is:
β̂ = (XᵀX)⁻¹ Xᵀy
This provides the optimal coefficients in one step, assuming XᵀX is invertible.
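To keep the sketch dependency-free, the Normal Equation below is worked out for the smallest interesting case: one feature plus an intercept, where XᵀX is a 2×2 matrix whose inverse can be written by hand. The data is an assumed toy example:

```python
# Normal-equation sketch: beta_hat = (X^T X)^(-1) X^T y for X = [1, x].
# With one feature plus intercept, X^T X = [[n, sum(x)], [sum(x), sum(x^2)]],
# so the 2x2 inverse is written explicitly. Toy data assumed.

def normal_equation_2col(xs, ys):
    n = len(xs)
    sx = sum(xs); sxx = sum(x * x for x in xs)
    sy = sum(ys); sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx            # det(X^T X); nonzero means invertible
    beta0 = (sxx * sy - sx * sxy) / det
    beta1 = (n * sxy - sx * sy) / det
    return beta0, beta1

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]                      # generated exactly by y = 1 + 2x
print(normal_equation_2col(xs, ys))    # -> (1.0, 2.0)
```

For more features the same formula applies with a general matrix inverse (or, better, a linear solver), but the structure of the computation is identical.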
What are the core assumptions of Linear Regression models? Explain any three.
For Linear Regression to provide valid estimates and statistical inferences, the following assumptions must hold:
- Linearity: The relationship between the independent variables (X) and the dependent variable (y) is linear. If the true relationship is non-linear, the model will perform poorly.
- Homoscedasticity: The variance of the residual errors (noise) is constant across all levels of the independent variables. If the variance changes (heteroscedasticity), the model predictions may be inefficient.
- Independence of Errors: The observations are independent of each other. There is no correlation between consecutive error terms (crucial in time-series data).
- Normality of Errors: For hypothesis testing (confidence intervals, p-values), it is assumed that the residuals follow a normal distribution.
- No Multicollinearity: The independent variables should not be highly correlated with each other.
How do you interpret the coefficients in a Multiple Linear Regression model?
In Multiple Linear Regression, the model is given by y = β0 + β1x1 + β2x2 + … + βpxp + ε.
Interpretation:
- Intercept (β0): This is the expected value of y when all independent variables (x1, …, xp) are equal to zero.
- Slope Coefficients (βj): The coefficient βj represents the average change in the dependent variable y for a one-unit increase in the independent variable xj, holding all other independent variables constant.
Example:
If y = β0 + 20 × Experience + β2 × Age:
- The coefficient 20 on Experience means that for every additional year of experience, y increases by 20 units, assuming Age remains the same.
What is Regularization in regression? Why is it needed?
Definition:
Regularization is a technique used to prevent overfitting by adding a penalty term to the cost function (loss function). It discourages the learning of a model that is too complex or flexible.
Why it is needed:
- Overfitting: When a model has too many features or the coefficients become very large, it fits the noise in the training data rather than the underlying pattern.
- Multicollinearity: When features are highly correlated, OLS estimates become unstable with high variance. Regularization stabilizes these estimates.
- Feature Selection: Some regularization techniques (like Lasso) help in automatic feature selection by shrinking irrelevant feature coefficients to zero.
Describe Ridge Regression (L2 Regularization). State its cost function.
Description:
Ridge Regression is a regularized version of linear regression that adds a penalty equivalent to the square of the magnitude of coefficients to the loss function. It shrinks the coefficients towards zero but rarely makes them exactly zero.
Objective:
It balances minimizing the prediction error and keeping the model weights small to reduce model variance.
Cost Function:
J(β) = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ βj²
Where:
- Σᵢ (yᵢ − ŷᵢ)² is the Residual Sum of Squares (RSS).
- λ (Lambda) is the regularization hyperparameter.
- λ Σⱼ βj² is the L2 penalty term (sum of squared weights).
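The shrinkage effect is easiest to see in the one-feature case on mean-centered data, where the ridge solution has the closed form β = Σxy / (Σx² + λ). This simplified form and the toy data are assumptions made for illustration:

```python
# One-feature ridge sketch on mean-centered data:
# beta = sum(x*y) / (sum(x^2) + lam)
# Larger lambda shrinks the coefficient toward (but not exactly to) zero.
# The centered toy data below is assumed.

def ridge_beta(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [-2, -1, 0, 1, 2]
ys = [-4, -2, 0, 2, 4]                 # OLS slope is exactly 2
print(ridge_beta(xs, ys, 0.0))         # -> 2.0 (no penalty: plain OLS)
print(ridge_beta(xs, ys, 10.0))        # -> 1.0 (shrunk, but still nonzero)
```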
Describe Lasso Regression (L1 Regularization). How does it assist in feature selection?
Description:
Lasso (Least Absolute Shrinkage and Selection Operator) Regression adds a penalty equivalent to the absolute value of the magnitude of coefficients.
Cost Function:
J(β) = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βj|
Feature Selection:
Unlike Ridge regression, which shrinks coefficients toward zero but never exactly to zero, Lasso has a unique property: it can shrink coefficients exactly to zero.
- Geometrically, the constraint region for Lasso is a diamond (polytope) which has corners. The optimal solution often hits a corner where one or more parameters are zero.
- This effectively removes features from the model, making Lasso a sparse model and useful for feature selection.
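The mechanism behind this exact-zeroing is the soft-thresholding operator used in coordinate-descent solvers for the lasso; a minimal sketch (the inputs are assumed standardized so the operator applies directly):

```python
# Soft-thresholding sketch: the one-coordinate lasso update used in
# coordinate descent. Values inside the [-lam, lam] band are cut to exactly 0,
# which is how lasso performs feature selection.

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; set it to 0.0 if it lands inside the band."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

print(soft_threshold(3.0, 1.0))    # -> 2.0   (shrunk toward zero)
print(soft_threshold(0.5, 1.0))    # -> 0.0   (coefficient dropped entirely)
print(soft_threshold(-3.0, 1.0))   # -> -2.0
```

Ridge shrinkage, by contrast, multiplies a coefficient by a factor below one, so it approaches zero without ever reaching it.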
Compare and contrast Ridge and Lasso regression. When would you use one over the other?
Comparison:
1. Penalty Term:
- Ridge: L2 norm (squared magnitude: λ Σ βj²).
- Lasso: L1 norm (absolute magnitude: λ Σ |βj|).
2. Effect on Coefficients:
- Ridge: Shrinks coefficients towards zero uniformly. Does not eliminate features.
- Lasso: Can shrink coefficients to exactly zero, creating a sparse model.
3. Differentiability:
- Ridge: Differentiable everywhere.
- Lasso: Not differentiable at zero (requires specific optimization algorithms like coordinate descent).
Usage Scenarios:
- Use Lasso when you suspect only a few features are actually important (Sparse data) and you need feature selection.
- Use Ridge when most features are useful and contribute slightly to the output, or when multicollinearity is present.
Discuss the effect of the regularization parameter λ (Lambda) on model complexity and the Bias-Variance tradeoff.
The parameter λ controls the strength of the penalty applied to the coefficients.
1. Small λ (λ → 0):
- The penalty term becomes negligible.
- The model behaves like standard OLS Linear Regression.
- Effect: High Complexity, Low Bias, High Variance (Risk of Overfitting).
2. Large λ (λ → ∞):
- The penalty term dominates the cost function.
- Coefficients are forced to be very small (Ridge) or exactly zero (Lasso).
- The model approaches a constant prediction (a flat line).
- Effect: Low Complexity, High Bias, Low Variance (Risk of Underfitting).
3. Optimal λ:
- Selected via cross-validation to balance bias and variance, minimizing the total error.
What is Polynomial Regression? How does it utilize linear regression techniques to fit non-linear data?
Definition:
Polynomial Regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an n-th degree polynomial in x.
Mechanism (Polynomial Feature Expansion):
Although the relationship represents a curve, Polynomial Regression is considered a linear model because it is linear in the parameters (coefficients).
- Original Feature: x
- Transformed Features: We create new features x, x², x³, …, xⁿ.
- Equation: y = β0 + β1x + β2x² + … + βnxⁿ + ε
By treating x, x², …, xⁿ as distinct features (e.g., let z1 = x, z2 = x², …), we can use standard Multiple Linear Regression algorithms (OLS) to solve for the betas. This allows capturing non-linear patterns using linear algebra machinery.
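The feature-expansion step can be sketched in a few lines; the inputs and degree are assumed examples:

```python
# Polynomial feature expansion sketch: turn a single feature x into the
# columns [x^0, x^1, ..., x^degree] so that ordinary (multiple) linear
# regression machinery can fit a curve. Inputs below are toy values.

def polynomial_features(x, degree):
    """Expand one value x into [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]

def expand_dataset(xs, degree):
    """Build the design matrix: one expanded row per sample."""
    return [polynomial_features(x, degree) for x in xs]

print(polynomial_features(3, 3))    # -> [1, 3, 9, 27]
print(expand_dataset([1, 2], 2))    # -> [[1, 1, 1], [1, 2, 4]]
```

The resulting rows are exactly what an OLS solver (e.g., the Normal Equation) expects, with the leading 1 serving as the intercept column.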
Explain the concept of Tree-Based Regression. How is the predicted value for a leaf node calculated?
Concept:
Tree-based regression (Decision Tree Regressor) uses a decision tree structure to predict continuous values. It recursively partitions the feature space into smaller, rectangular regions. The goal is to split data such that the data points within each region are as similar as possible.
Structure:
- Root/Internal Nodes: Represent conditions on features (e.g., Is xj ≤ t?).
- Branches: Represent the outcome of the test (Yes/No).
- Leaf Nodes: Represent the final partitioned region.
Prediction Calculation:
Unlike classification trees (which predict the mode/majority class), regression trees predict a specific value for a leaf node.
- The predicted value for a specific leaf node is typically the mean (average) of the target values of all training samples falling into that leaf.
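A one-split sketch makes the leaf-prediction rule concrete; the threshold and data are assumed toy values:

```python
# Leaf-prediction sketch for a regression tree: after a single split on a
# threshold, each leaf predicts the MEAN of the training targets that fall
# into it. The data and threshold below are assumed toy values.

def split_and_predict(xs, ys, threshold):
    left  = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return sum(left) / len(left), sum(right) / len(right)

xs = [1, 2, 3, 10, 11, 12]
ys = [5, 6, 7, 20, 21, 22]
left_pred, right_pred = split_and_predict(xs, ys, 5)
print(left_pred, right_pred)   # -> 6.0 21.0
```

Every sample landing in the left leaf is predicted as 6.0 and every sample in the right leaf as 21.0, regardless of its exact feature value.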
What splitting criteria are used in Regression Trees to determine the best split?
In classification, we use Gini Impurity or Entropy. In Regression Trees, we need a metric to measure the "spread" or error of continuous values within a node. The most common criteria are:
1. MSE (Mean Squared Error) / Variance Reduction:
- The algorithm attempts to minimize the weighted average of the Mean Squared Error in the child nodes.
- Split Criterion: Maximize the reduction in variance:
Variance Reduction = Var(parent) − Σ_children (n_child / n_parent) × Var(child)
- Where Var = (1/n) Σᵢ (yᵢ − ȳ)².
2. MAE (Mean Absolute Error):
- Uses the mean absolute deviation from the median.
- More robust to outliers compared to MSE.
The split that results in the highest reduction of error (most homogeneous child nodes) is chosen.
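A brute-force variance-reduction search over candidate thresholds can be sketched as follows; the toy data is assumed:

```python
# Variance-reduction sketch: scan candidate thresholds on one feature and
# pick the split maximizing Var(parent) - weighted Var(children).
# Toy data with two obvious clusters is assumed.

def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def best_split(xs, ys):
    parent_var, n = variance(ys), len(ys)
    best_t, best_gain = None, 0.0
    for t in sorted(set(xs))[:-1]:       # each unique value except the max
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
        gain = parent_var - weighted     # variance reduction for this split
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t

xs = [1, 2, 3, 10, 11, 12]
ys = [5, 6, 7, 20, 21, 22]
print(best_split(xs, ys))   # -> 3 (separates the two target clusters)
```

Real implementations repeat this search over every feature at every node, which is what makes tree training recursive.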
Discuss Time-series Regression. How does it differ from standard regression regarding the assumption of independence?
Time-series Regression:
This involves predicting future values based on previously observed values of the same variable and potentially other external variables. The data is indexed by time.
Key Difference - Independence Assumption:
- Standard Regression: Assumes observations are independent and identically distributed (i.i.d). The order of data points does not matter.
- Time-series Regression: Violates the independence assumption. Data points are autocorrelated (a value at time t is highly dependent on values at t − 1, t − 2, etc.).
Implication:
Standard OLS techniques may produce biased standard errors and unreliable significance tests if autocorrelation is not addressed. Time-series models (like AR, ARIMA) or regression with lag features explicitly model this temporal dependence.
Explain the concept of Autoregressive (AR) models in Time-series regression.
Definition:
An Autoregressive (AR) model predicts the variable of interest using a linear combination of its own past values. It operates on the premise that past values have a correlation with current values.
Model Formulation:
An AR model of order p, denoted AR(p), is defined as:
yₜ = c + φ1·yₜ₋₁ + φ2·yₜ₋₂ + … + φp·yₜ₋ₚ + εₜ
Where:
- yₜ is the value at time t.
- yₜ₋₁, …, yₜ₋ₚ are the lagged values (past observations).
- φ1, …, φp are the coefficients (parameters to be learned).
- c is a constant.
- εₜ is white noise (random error).
This is essentially a linear regression where the "features" are the previous time steps of the target variable.
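Building the lagged design matrix is the step that turns a series into a regression problem; a minimal sketch with an assumed toy series:

```python
# Lag-feature sketch for an AR(p) model: each row of the design matrix holds
# the p previous observations, and the target is the current observation.
# The series below is an assumed toy example.

def make_lagged(series, p):
    """Return (features, targets) for fitting an AR(p) model by regression."""
    X, y = [], []
    for t in range(p, len(series)):
        X.append(series[t - p:t])   # the p lagged values [y_{t-p}, ..., y_{t-1}]
        y.append(series[t])         # the value to predict, y_t
    return X, y

series = [1, 2, 3, 4, 5, 6]
X, y = make_lagged(series, 2)
print(X)   # -> [[1, 2], [2, 3], [3, 4], [4, 5]]
print(y)   # -> [3, 4, 5, 6]
```

Any linear-regression solver can then fit the φ coefficients on these (X, y) pairs.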
Define and explain the following Evaluation Metrics for Regression: R-squared (R²) and RMSE.
1. R-squared (R²) - Coefficient of Determination:
- Definition: It represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
- Formula: R² = 1 − (SS_res / SS_tot) = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²
- Interpretation: A value of 1.0 indicates a perfect fit. A value of 0 indicates the model does no better than simply predicting the mean of the data.
2. RMSE (Root Mean Squared Error):
- Definition: The square root of the average of squared differences between prediction and actual observation.
- Formula: RMSE = √((1/n) Σᵢ (yᵢ − ŷᵢ)²)
- Interpretation: It is in the same units as the target variable y. It penalizes large errors more heavily than MAE due to the squaring term. Lower is better.
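Both metrics follow directly from their definitions; a short sketch with assumed toy predictions:

```python
# R-squared and RMSE computed from their definitions.
# The target values and predictions below are assumed toy examples.
import math

def r_squared(ys, preds):
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))   # residual sum of squares
    ss_tot = sum((y - y_mean) ** 2 for y in ys)             # total sum of squares
    return 1 - ss_res / ss_tot

def rmse(ys, preds):
    n = len(ys)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / n)

ys = [2, 4, 6, 8]
print(r_squared(ys, [2, 4, 6, 8]))   # perfect predictions -> 1.0
print(rmse(ys, [2, 4, 6, 8]))        # -> 0.0
print(rmse(ys, [3, 5, 7, 9]))        # off by 1 everywhere -> 1.0
```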
Explain the Gradient Descent algorithm as applied to Linear Regression optimization.
Concept:
Gradient Descent is an iterative optimization algorithm used to minimize the cost function (e.g., MSE) by updating coefficients in the opposite direction of the gradient.
Algorithm Steps:
- Initialize: Start with random values for coefficients (e.g., all zeros).
- Calculate Gradient: Compute the partial derivative of the cost function J with respect to each coefficient βj.
- Update Rule: Adjust the coefficients:
βj := βj − α × (∂J/∂βj)
Where α is the Learning Rate (step size).
- Repeat: Continue until convergence (when the cost function stops decreasing significantly).
This method allows finding optimal coefficients even when the dataset is too large for the analytical Normal Equation solution.
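The loop above can be sketched for simple linear regression with the MSE cost; the learning rate, iteration count, and data are assumed values, not tuned recommendations:

```python
# Gradient-descent sketch for simple linear regression with MSE cost.
# alpha (learning rate) and iters are assumed values for this toy problem.

def gradient_descent(xs, ys, alpha=0.05, iters=2000):
    n = len(xs)
    b0 = b1 = 0.0                                      # initialize coefficients
    for _ in range(iters):
        errs = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        g0 = (2 / n) * sum(errs)                                  # dJ/db0
        g1 = (2 / n) * sum(e * x for e, x in zip(errs, xs))       # dJ/db1
        b0 -= alpha * g0                               # step opposite the gradient
        b1 -= alpha * g1
    return b0, b1

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]                        # generated exactly by y = 1 + 2x
b0, b1 = gradient_descent(xs, ys)
print(round(b0, 3), round(b1, 3))        # converges close to 1.0 and 2.0
```

On this toy problem the iterates recover the same coefficients the Normal Equation would give in one step.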
What is Elastic Net Regression? Why is it considered a hybrid of Ridge and Lasso?
Definition:
Elastic Net is a regularized regression method that linearly combines the penalties of the Lasso and Ridge methods.
Why it is a Hybrid:
It includes both the L1 (Lasso) and L2 (Ridge) regularization terms in its cost function.
Cost Function:
J(β) = Σᵢ (yᵢ − ŷᵢ)² + λ1 Σⱼ |βj| + λ2 Σⱼ βj²
Benefits:
- Handling Correlated Features: Lasso tends to pick one variable from a group of correlated features and ignore the others. Elastic Net (via the Ridge part) encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together.
- Stability: It overcomes the limitations of Lasso when the number of predictors (p) is greater than the number of observations (n).
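Evaluating the combined cost makes the hybrid nature explicit; the coefficients and lambda values below are assumed for illustration:

```python
# Elastic-net cost sketch: RSS plus both the L1 (lasso) and L2 (ridge)
# penalty terms. All inputs below are assumed toy values.

def elastic_net_cost(ys, preds, betas, lam1, lam2):
    rss = sum((y - p) ** 2 for y, p in zip(ys, preds))   # residual sum of squares
    l1 = lam1 * sum(abs(b) for b in betas)               # lasso penalty
    l2 = lam2 * sum(b * b for b in betas)                # ridge penalty
    return rss + l1 + l2

ys, preds = [1.0, 2.0], [1.0, 2.0]       # zero residual, to isolate the penalties
betas = [3.0, -4.0]
print(elastic_net_cost(ys, preds, betas, lam1=1.0, lam2=0.0))   # -> 7.0  (pure L1)
print(elastic_net_cost(ys, preds, betas, lam1=0.0, lam2=1.0))   # -> 25.0 (pure L2)
```

Setting lam2 = 0 recovers Lasso and lam1 = 0 recovers Ridge, which is exactly why Elastic Net is described as a hybrid of the two.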