1. What is the primary goal of a regression algorithm?
Difference between regression and classification
Easy
A.To reduce the dimensionality of the data.
B.To predict a continuous numerical value.
C.To cluster data into groups.
D.To classify data into discrete categories.
Correct Answer: To predict a continuous numerical value.
Explanation:
Regression models are used for prediction tasks where the output variable is a continuous quantity, such as predicting a price, temperature, or height.
2. Which of the following problems is an example of regression?
Difference between regression and classification
Easy
A.Predicting the price of a house based on its features.
B.Categorizing news articles into topics like 'sports' or 'politics'.
C.Recognizing a handwritten digit as 0 through 9.
D.Identifying if an email is spam or not spam.
Correct Answer: Predicting the price of a house based on its features.
Explanation:
Predicting a house price involves forecasting a continuous numerical value, which is a regression task. The other options are classification tasks, as they involve assigning data to discrete, predefined categories.
3. A regression model with high bias makes strong assumptions about the data and is likely to...?
Bias-variance considerations in regression
Easy
A.Overfit the data.
B.Underfit the data.
C.Have high variance.
D.Perfectly fit the data.
Correct Answer: Underfit the data.
Explanation:
High bias means the model is too simple and cannot capture the underlying patterns in the data, leading to underfitting. This results in high error on both training and test sets.
4. A model that performs extremely well on the training data but poorly on unseen test data is said to have...?
Bias-variance considerations in regression
Easy
A.Low complexity.
B.High bias.
C.High variance.
D.A good bias-variance tradeoff.
Correct Answer: High variance.
Explanation:
High variance is a characteristic of a model that is too complex and has learned the training data too well, including its random noise. This phenomenon is known as overfitting.
5. In Simple Linear Regression, how many independent (predictor) variables are used to predict the dependent (target) variable?
Simple Linear Regression
Easy
A.One or more.
B.Exactly two.
C.Exactly one.
D.Zero.
Correct Answer: Exactly one.
Explanation:
The term 'simple' in Simple Linear Regression refers to the use of a single independent variable to model a linear relationship with a single dependent variable.
6. In the Simple Linear Regression equation y = β₀ + β₁x + ε, what does the term β₁ represent?
Simple Linear Regression
Easy
A.The y-intercept of the regression line.
B.The predicted value of y.
C.The error term.
D.The slope of the regression line.
Correct Answer: The slope of the regression line.
Explanation:
β₁ is the slope, which represents the change in the dependent variable y for a one-unit change in the independent variable x.
7. What is the main difference between Multiple Linear Regression and Simple Linear Regression?
Multiple Linear Regression
Easy
A.Multiple Linear Regression models non-linear relationships, while Simple Linear Regression models linear ones.
B.Simple Linear Regression is always more accurate.
C.Multiple Linear Regression is used for classification, while Simple Linear Regression is for regression.
D.Multiple Linear Regression uses two or more independent variables, while Simple Linear Regression uses only one.
Correct Answer: Multiple Linear Regression uses two or more independent variables, while Simple Linear Regression uses only one.
Explanation:
Multiple Linear Regression extends Simple Linear Regression by allowing the use of multiple predictor variables (x₁, x₂, ..., xₙ) to predict a single outcome variable (y).
8. Which equation represents a Multiple Linear Regression model with two predictor variables, x₁ and x₂?
Multiple Linear Regression
Easy
A.y = β₀ + β₁(x₁ + x₂)
B.y = β₀ + β₁x₁ + β₂x₂ + ε
C.y = β₀ + β₁x₁x₂
D.y = β₁x₁² + β₂x₂²
Correct Answer: y = β₀ + β₁x₁ + β₂x₂ + ε
Explanation:
This is the standard form for a multiple linear regression model, where β₀ is the intercept and β₁ and β₂ are the coefficients for the independent variables x₁ and x₂, respectively.
9. In a multiple regression model for predicting salary, the coefficient for the 'Years of Experience' feature is +5000. What is the correct interpretation of this coefficient?
Interpretation of coefficients
Easy
A.The maximum possible salary is $5000.
B.A person with 0 years of experience will have a salary of $5000.
C.The average salary for all individuals is $5000.
D.Holding all other features constant, for each additional year of experience, the predicted salary increases by $5000.
Correct Answer: Holding all other features constant, for each additional year of experience, the predicted salary increases by $5000.
Explanation:
Each coefficient in multiple regression represents the change in the dependent variable for a one-unit change in the corresponding independent variable, assuming all other variables are held constant.
10. If a regression coefficient for a variable is zero, what does this imply about the model's prediction?
Interpretation of coefficients
Easy
A.There is no linear relationship between that variable and the target variable.
B.The data for that variable contains errors.
C.The variable is the most important predictor.
D.The model is underfitting.
Correct Answer: There is no linear relationship between that variable and the target variable.
Explanation:
A coefficient of zero means that a change in the predictor variable has no effect on the predicted outcome in the model, indicating the absence of a linear relationship as captured by the model.
11. Which of the following are two common types of regularized linear regression models?
Regularized Regression models
Easy
A.Linear and Logistic Regression.
B.Decision Tree and Random Forest Regression.
C.Ridge and Lasso Regression.
D.K-Means and DBSCAN Regression.
Correct Answer: Ridge and Lasso Regression.
Explanation:
Ridge (L2 regularization) and Lasso (L1 regularization) are the two most well-known techniques for adding a penalty term to the loss function of a linear regression model to prevent overfitting.
12. What is the primary motivation for using regularized regression instead of standard linear regression?
Regularized Regression models
Easy
A.To handle categorical variables automatically.
B.To prevent overfitting by penalizing large coefficients.
C.To speed up the model training process significantly.
D.To ensure the model always finds a perfect fit.
Correct Answer: To prevent overfitting by penalizing large coefficients.
Explanation:
Regularization adds a penalty for model complexity (i.e., large coefficient values) to the cost function, which helps to reduce model variance and prevent overfitting the training data.
13. In Ridge or Lasso regression, what happens to the model's coefficients as the regularization hyperparameter (α or λ) is increased?
Effect of regularization on model complexity
Easy
A.The magnitudes of the coefficients are pushed towards zero.
B.The coefficients remain unchanged.
C.The magnitudes of the coefficients grow larger.
D.The y-intercept is pushed to zero.
Correct Answer: The magnitudes of the coefficients are pushed towards zero.
Explanation:
The regularization term penalizes the magnitude of the coefficients. A larger hyperparameter (α or λ) means a stronger penalty, which forces the optimization process to find smaller coefficient values, thus reducing model complexity.
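This shrinkage is easy to see numerically. The sketch below (synthetic data and penalty values of my choosing, not from the quiz) fits Ridge in closed form, β = (XᵀX + λI)⁻¹Xᵀy, and tracks the coefficient norm as the penalty grows:

```python
import numpy as np

# Closed-form Ridge on synthetic data: beta = (X'X + lam*I)^(-1) X'y.
# As the penalty lam grows, the fitted coefficient vector shrinks toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.5]) + rng.normal(scale=0.5, size=100)

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# L2 norms of the fitted coefficients for increasing penalties.
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in (0.0, 10.0, 1000.0)]
print(norms)  # strictly decreasing for this data
```

With λ = 0 this reduces to ordinary least squares; each larger λ produces a smaller coefficient norm.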
14. Which regularization technique has the ability to shrink some coefficients to exactly zero, thus performing automatic feature selection?
Effect of regularization on model complexity
Easy
A.Polynomial Regression.
B.Principal Component Regression.
C.Lasso Regression (L1).
D.Ridge Regression (L2).
Correct Answer: Lasso Regression (L1).
Explanation:
The L1 penalty used in Lasso Regression (sum of the absolute values of coefficients) has the property of shrinking less important feature coefficients all the way to zero, effectively removing them from the model.
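For intuition, in the special case of an orthonormal design the Lasso solution is just a soft-thresholding of the OLS coefficients. This sketch (made-up numbers, not a full Lasso solver) shows small coefficients landing exactly at zero:

```python
import numpy as np

# Orthonormal-design special case: the Lasso estimate is the OLS estimate
# soft-thresholded by the penalty, so entries smaller than lam become 0.
def soft_threshold(beta_ols, lam):
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([4.0, -0.3, 2.5, 0.1])   # made-up OLS estimates
beta_lasso = soft_threshold(beta_ols, lam=0.5)
print(beta_lasso)  # the -0.3 and 0.1 entries are set exactly to zero
```

In the general case the same thresholding operation appears inside coordinate-descent Lasso solvers.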
15. Why would a data scientist use polynomial feature expansion with a linear regression model?
Polynomial feature expansion
Easy
A.To convert categorical features into numerical ones.
B.To model non-linear relationships between features and the target.
C.To ensure all features are on the same scale.
D.To reduce the number of features in the dataset.
Correct Answer: To model non-linear relationships between features and the target.
Explanation:
By creating polynomial terms (like x², x³), a linear model can fit more complex, non-linear patterns in the data without changing the underlying linear regression algorithm itself.
16. If you start with a single feature x and apply a polynomial feature expansion of degree 2, what new feature will be added to your model (in addition to the original feature and an intercept)?
Polynomial feature expansion
Easy
A.√x
B.2x
C.x²
D.x³
Correct Answer: x²
Explanation:
A polynomial expansion of degree 2 for a feature x creates a new feature corresponding to x raised to the power of 2, which is x². The full set of features available to the model would be [1, x, x²].
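The expansion can be built by hand in one line; this sketch constructs the design matrix [1, x, x²] with NumPy (in practice a transformer such as scikit-learn's PolynomialFeatures does the same thing):

```python
import numpy as np

# Degree-2 polynomial expansion of one feature: columns [1, x, x^2].
x = np.array([1.0, 2.0, 3.0])
X_poly = np.column_stack([np.ones_like(x), x, x ** 2])
print(X_poly)  # the row for x=3 is [1., 3., 9.]
```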
17. Which of the following is a classic example of a tree-based model that can be used for regression?
Tree-Based regression models
Easy
A.Logistic Regression.
B.K-Means Clustering.
C.Decision Tree Regressor.
D.Support Vector Machine (for classification).
Correct Answer: Decision Tree Regressor.
Explanation:
A Decision Tree can be adapted for regression tasks (becoming a Decision Tree Regressor) by predicting a continuous value in its leaf nodes, typically the average of the target values of the training samples in that leaf.
18. In a basic regression tree, what value is typically predicted for a new data point that falls into a specific leaf node?
Tree-Based regression models
Easy
A.The most frequent target value in that leaf.
B.A class label like 'A' or 'B'.
C.The coefficient of the most important feature.
D.The average of the target values of all training samples in that leaf.
Correct Answer: The average of the target values of all training samples in that leaf.
Explanation:
Unlike a classification tree which predicts a class, a regression tree's leaf node predicts a single continuous value, which is usually the mean of the outcomes for the training data that ended up in that leaf.
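A depth-1 regression tree (a "stump") makes this concrete. In the sketch below (made-up data and an assumed split point), each leaf predicts the mean target of the training samples that landed in it:

```python
# Minimal regression-stump sketch: one split, two leaves, each leaf
# predicts the mean of the training targets that fall in it.
train = [(1.0, 10.0), (2.0, 12.0), (8.0, 30.0), (9.0, 34.0)]  # (x, y) pairs
threshold = 5.0  # assumed split point for illustration

left = [y for x, y in train if x <= threshold]
right = [y for x, y in train if x > threshold]

def predict(x_new):
    leaf = left if x_new <= threshold else right
    return sum(leaf) / len(leaf)

print(predict(1.5))  # left leaf -> mean of [10, 12] = 11.0
print(predict(8.5))  # right leaf -> mean of [30, 34] = 32.0
```

A real tree learner chooses the threshold to minimize the squared error within each resulting leaf, and recurses; the leaf prediction rule is the same.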
19. What is the defining characteristic of time-series data?
Time-series Regression models
Easy
A.The data has no target variable.
B.The data is always normally distributed.
C.The data contains many categorical features.
D.The data points are ordered chronologically.
Correct Answer: The data points are ordered chronologically.
Explanation:
Time-series data is a sequence of data points indexed in time order. This temporal dependence is a crucial aspect that must be considered during modeling.
20. In the context of a time-series regression for predicting today's sales (yₜ), what would a 'lag-1' feature typically be?
Time-series Regression models
Easy
A.The average of all past sales.
B.The sales from the same day last year.
C.Tomorrow's sales (yₜ₊₁)
D.Yesterday's sales (yₜ₋₁)
Correct Answer: Yesterday's sales (yₜ₋₁)
Explanation:
A lag variable (or lag feature) is created by using a variable's value from a previous time step as a predictor for the current time step. A 'lag-1' feature is the value from the immediately preceding time step, e.g., using yesterday's sales to predict today's sales.
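Building the lag-1 feature is a one-line shift; this toy example (made-up sales numbers) pairs each day's sales with the previous day's value, dropping the first day, which has no lag:

```python
# Sketch: lag-1 feature from a toy daily-sales series.
# Row t uses yesterday's value sales[t-1] as the predictor for sales[t].
sales = [100, 120, 90, 110, 130]

rows = [(sales[t - 1], sales[t]) for t in range(1, len(sales))]
print(rows)  # [(100, 120), (120, 90), (90, 110), (110, 130)]
```

With pandas, `Series.shift(1)` produces the same lagged column.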
21. A data scientist is building a model to predict the exact amount of rainfall (in millimeters) for the next day. A colleague suggests they use Logistic Regression. Why is this suggestion inappropriate for the problem as stated?
Difference between regression and classification
Medium
A.Because rainfall data often contains outliers, which Logistic Regression cannot handle.
B.Because Logistic Regression is a classification algorithm that predicts a probability or a discrete class, not a continuous value.
C.Because Logistic Regression assumes a linear relationship, which is unlikely for rainfall prediction.
D.Because Logistic Regression is computationally more expensive than linear regression models.
Correct Answer: Because Logistic Regression is a classification algorithm that predicts a probability or a discrete class, not a continuous value.
Explanation:
The core task is to predict a continuous numerical value (rainfall in mm), which is a regression problem. Logistic Regression is designed for classification tasks, where the goal is to predict a discrete outcome (e.g., 'Rain' or 'No Rain'). Its output is a probability that is mapped to a class, not a continuous quantity.
22. A regression model has been trained and evaluated. It shows a very low training error (RMSE of 5.5) but a very high validation error (RMSE of 50.2). Which of the following best describes the model's condition and a suitable remedy?
Bias-variance considerations in regression
Medium
A.High bias (underfitting); simplify the model by removing features.
B.High variance (overfitting); apply L2 regularization or reduce model complexity.
C.High bias (underfitting); increase model complexity by adding polynomial features.
D.High variance (overfitting); gather more training data with different features.
Correct Answer: High variance (overfitting); apply L2 regularization or reduce model complexity.
Explanation:
The large gap between a low training error and a high validation error is a classic sign of high variance, also known as overfitting. The model has learned the training data's noise instead of the underlying pattern. To remedy this, one can introduce regularization (like L2/Ridge) to penalize large coefficients or simplify the model (e.g., use fewer features or a less complex algorithm).
23. You have built a simple linear regression model to predict house prices based on square footage. The model's R-squared (R²) value is 0.65. How should this value be interpreted?
Simple Linear Regression
Medium
A.For every 1 square foot increase, the price increases by 65%.
B.The correlation between house price and square footage is 0.65.
C.65% of the variability in house prices can be explained by the square footage.
D.The model's predictions are correct 65% of the time.
Correct Answer: 65% of the variability in house prices can be explained by the square footage.
Explanation:
R-squared, or the coefficient of determination, measures the proportion of the variance in the dependent variable (house price) that is predictable from the independent variable(s) (square footage). An R² of 0.65 means that 65% of the variance in house prices is accounted for by the model.
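The definition can be checked directly from R² = 1 − SS_res/SS_tot; this toy calculation (made-up numbers) does exactly that:

```python
import numpy as np

# R^2 = 1 - SS_res / SS_tot: the fraction of target variance the
# predictions explain, computed on toy numbers.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 38.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # 21.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 500.0
r2 = 1 - ss_res / ss_tot
print(r2)  # 0.958
```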
24. In a multiple linear regression model, you notice that the p-values for two features, 'years_of_experience' and 'age', are very high, suggesting they are not statistically significant. However, when you remove either one, the other's p-value becomes very low (significant). What is the most likely cause of this phenomenon?
Multiple Linear Regression
Medium
A.Heteroscedasticity in the model's residuals.
B.Multicollinearity between 'years_of_experience' and 'age'.
C.The model is suffering from high bias (underfitting).
D.Non-linear relationships between the features and the target.
Correct Answer: Multicollinearity between 'years_of_experience' and 'age'.
Explanation:
Multicollinearity occurs when independent variables in a regression model are highly correlated. 'Age' and 'years_of_experience' are likely to be strongly correlated. When both are in the model, they explain the same variance in the target, making it difficult for the model to estimate their individual effects, resulting in high p-values and unstable coefficients. Removing one allows the other to capture that shared effect, making it appear significant.
25. A multiple linear regression model is built to predict the 'price' of a used car. In the fitted model, the coefficient for 'age' is −200, where 'age' is in years, 'mileage' is in miles, and 'is_luxury' is a binary variable (1 if luxury, 0 otherwise). What is the correct interpretation of the coefficient for 'age'?
Interpretation of coefficients
Medium
A.A car that is one year older is worth $200 less than a brand new car.
B.Holding mileage and luxury status constant, for each additional year of age, the car's price is predicted to decrease by $200.
C.For each additional year of age, the car's price decreases by $200.
D.The coefficient is negative, which indicates an error in the model.
Correct Answer: Holding mileage and luxury status constant, for each additional year of age, the car's price is predicted to decrease by $200.
Explanation:
In multiple regression, the coefficient of a variable represents the change in the predicted outcome for a one-unit change in that variable, assuming all other variables in the model are held constant. This 'ceteris paribus' condition is crucial for correct interpretation.
26. You are working on a regression problem with 100 features, and you suspect that many of them are redundant or irrelevant. You want to build a model that automatically performs feature selection. Which regularization technique would be most suitable for this specific goal?
Regularized Regression models
Medium
A.Ridge Regression (L2)
B.Lasso Regression (L1)
C.Elastic Net Regression with a high L2 ratio
D.Principal Component Regression (PCR)
Correct Answer: Lasso Regression (L1)
Explanation:
Lasso (Least Absolute Shrinkage and Selection Operator) Regression uses an L1 penalty term (λ∑|βⱼ|), which has the property of shrinking some feature coefficients to exactly zero. This effectively removes the feature from the model, thus performing automatic feature selection. Ridge regression (L2) shrinks coefficients towards zero but never sets them exactly to zero.
27. Consider a Ridge regression model. What is the effect of increasing the regularization parameter alpha (or lambda, λ) towards infinity (λ → ∞)?
Effect of regularization on model complexity
Medium
A.The model coefficients will all be forced towards zero, resulting in a model that only predicts the mean of the target variable.
B.The model will become perfectly fit to the training data, resulting in zero training error.
C.The model will be identical to an unregularized Ordinary Least Squares model.
D.The model's coefficients will grow infinitely large, causing numerical instability.
Correct Answer: The model coefficients will all be forced towards zero, resulting in a model that only predicts the mean of the target variable.
Explanation:
The Ridge regression cost function is RSS + λ∑βⱼ². As λ becomes very large, the penalty for having non-zero coefficients dominates the cost function. To minimize the cost, the model must shrink the coefficients (βⱼ) to be as close to zero as possible. In the limit, all coefficients become zero, and the model's prediction is simply the intercept, which is the mean of the target variable in a standardized setting.
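This limit is easy to verify numerically. The sketch below (synthetic data of my choosing; the intercept is left unpenalized by centering, a common convention) shows predictions collapsing to the mean of y for a huge λ:

```python
import numpy as np

# Ridge with an unpenalized intercept: center X and y, solve
# beta = (Xc'Xc + lam*I)^(-1) Xc'yc, recover the intercept afterwards.
# For a huge lam the slopes are ~0 and every prediction is ~mean(y).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 5.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)

Xc, yc = X - X.mean(axis=0), y - y.mean()
lam = 1e9
beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(2), Xc.T @ yc)
intercept = y.mean() - X.mean(axis=0) @ beta

pred = X @ beta + intercept
print(np.max(np.abs(pred - y.mean())))  # tiny: predictions ~ mean(y)
```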
28. A scatter plot of your single feature 'X' against your target 'y' shows a clear U-shaped (parabolic) relationship. You fit a simple linear regression model (ŷ = β₀ + β₁x) and find it has a very high error. What is the most appropriate next step to improve the model?
Polynomial feature expansion
Medium
A.Create a new feature x² and fit the model ŷ = β₀ + β₁x + β₂x².
B.Apply L1 regularization to the existing linear model.
C.Gather more data points for the existing feature 'X'.
D.Transform the target variable 'y' using a logarithm.
Correct Answer: Create a new feature x² and fit the model ŷ = β₀ + β₁x + β₂x².
Explanation:
The U-shaped relationship indicates a non-linear pattern that a simple linear model cannot capture (high bias). By performing a polynomial feature expansion and adding a quadratic term (β₂x²), you allow the model to fit a curve (a parabola) to the data. This increases the model's complexity to better match the underlying relationship.
29. How does a Random Forest Regressor typically improve upon a single, fully-grown Decision Tree Regressor?
Tree-Based regression models
Medium
A.It reduces variance by averaging the predictions of many decorrelated trees.
B.It reduces bias by growing deeper trees than a single decision tree.
C.It is guaranteed to find the globally optimal set of splits.
D.It is much more interpretable than a single decision tree.
Correct Answer: It reduces variance by averaging the predictions of many decorrelated trees.
Explanation:
A single, deep decision tree is prone to overfitting, meaning it has high variance. A Random Forest builds multiple decision trees on different bootstrap samples of the data and considers only a random subset of features for each split. By averaging the predictions of these diverse (decorrelated) trees, it significantly reduces the overall variance of the model, leading to better generalization performance.
30. You are modeling monthly sales data and notice a strong seasonal pattern that repeats every 12 months, as well as an upward trend over time. Which of the following models is explicitly designed to handle both trend and seasonality?
Time-series Regression models
Medium
A.Simple Linear Regression on the raw sales values.
B.A non-seasonal ARIMA(p, d, q) model.
C.A SARIMA model.
D.Ridge Regression.
Correct Answer: A SARIMA model.
Explanation:
SARIMA models are an extension of ARIMA models specifically designed for time series data with a seasonal component. SARIMA includes seasonal parameters (P, D, Q, m) to model the seasonality in addition to the standard ARIMA parameters (p, d, q) for trend and autocorrelation. While a linear regression with time and dummy variables for months can capture some of this, SARIMA is built from the ground up to handle these complex time-series dynamics.
31. A data scientist is trying to decide on the degree for a polynomial regression model. They find that a degree-2 polynomial has a validation RMSE of 15. A degree-10 polynomial has a training RMSE of 2 but a validation RMSE of 40. What does the performance of the degree-10 model indicate?
Bias-variance considerations in regression
Medium
A.The model has very high bias and very high variance.
B.The model has very high bias and very low variance.
C.The model has very low bias but very high variance.
D.The model has achieved the optimal bias-variance tradeoff.
Correct Answer: The model has very low bias but very high variance.
Explanation:
The extremely low training RMSE (2) for the degree-10 polynomial suggests it fits the training data almost perfectly, indicating low bias. However, its performance on the validation set is very poor (RMSE of 40), which is much worse than the simpler degree-2 model. This large gap between training and validation performance is a hallmark of high variance (overfitting).
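The same pattern is easy to reproduce on toy data. In this sketch (made-up quadratic data with noise; degrees and sample sizes are my choices), the degree-10 fit interpolates the 11 training points almost exactly, while its validation error is typically far worse than the degree-2 model's:

```python
import numpy as np

# Bias-variance sketch: the true relation is quadratic, so degree 2 is
# well-specified and degree 10 has enough freedom to chase the noise.
rng = np.random.default_rng(42)
x_train = np.linspace(-1, 1, 11)
y_train = x_train ** 2 + rng.normal(scale=0.2, size=x_train.size)
x_val = np.linspace(-0.95, 0.95, 50)
y_val = x_val ** 2 + rng.normal(scale=0.2, size=x_val.size)

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

results = {}
for degree in (2, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (rmse(y_train, np.polyval(coefs, x_train)),
                       rmse(y_val, np.polyval(coefs, x_val)))

# Degree 10 drives training RMSE to ~0 (11 points, 11 parameters);
# the gap to its validation RMSE is the overfitting signature.
print(results)
```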
32. When building a multiple linear regression model, you add a new feature that is completely uncorrelated with the target variable. What is the likely effect on the model's R² and Adjusted R²?
Multiple Linear Regression
Medium
A.R² will slightly increase or stay the same, while Adjusted R² will likely decrease.
B.Both R² and Adjusted R² will decrease.
C.Both R² and Adjusted R² will increase significantly.
D.R² will decrease, while Adjusted R² will increase.
Correct Answer: R² will slightly increase or stay the same, while Adjusted R² will likely decrease.
Explanation:
The standard R² metric can only increase or stay the same when a new feature is added, regardless of its utility, because the model can simply assign it a coefficient of zero if it's useless. Adjusted R², however, penalizes the addition of features that do not improve the model more than would be expected by chance. Therefore, adding a useless feature will cause the Adjusted R² to decrease.
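The adjustment is a simple formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The toy numbers below (assumed for illustration, not from the quiz) show a tiny R² gain from an extra feature turning into an adjusted-R² drop:

```python
# Adjusted R^2 penalizes each extra parameter p for a given sample size n.
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 30
before = adjusted_r2(0.700, n, p=3)  # model with 3 features
after = adjusted_r2(0.701, n, p=4)   # tiny R^2 gain from a 4th, useless feature
print(round(before, 4), round(after, 4))  # adjusted R^2 goes DOWN
```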
33. After fitting a simple linear regression model, you create a residual plot (residuals vs. fitted values). You observe that the points form a distinct funnel shape, widening as the fitted values increase. Which assumption of linear regression is most clearly violated?
Simple Linear Regression
Medium
A.Independence of errors
B.Linearity
C.Normality of residuals
D.Homoscedasticity (constant variance of errors)
Correct Answer: Homoscedasticity (constant variance of errors)
Explanation:
The assumption of homoscedasticity states that the variance of the residuals should be constant across all levels of the independent variable(s). A funnel shape in the residual plot indicates that the variance of the errors is not constant; it increases (or decreases) as the predicted value changes. This violation is called heteroscedasticity.
34. In a regression model predicting employee salary, one of the predictors is 'Department', a categorical feature with three levels: 'Sales', 'HR', and 'Engineering'. 'Sales' is used as the reference category. The fitted model has a coefficient of +15000 for the 'Department_Engineering' dummy variable. What is the correct interpretation?
Interpretation of coefficients
Medium
A.The predicted salary for an employee in Engineering is, on average, $15,000 higher than for a similar employee in HR.
B.The average salary in the Engineering department is $15,000.
C.The predicted salary for an employee in Engineering is, on average, $15,000 higher than for a similar employee in Sales.
D.Moving an employee from Sales to Engineering is predicted to increase their salary by $15,000.
Correct Answer: The predicted salary for an employee in Engineering is, on average, $15,000 higher than for a similar employee in Sales.
Explanation:
When using dummy variables for categorical features, the coefficient of a specific level (e.g., 'Engineering') represents the average difference in the target variable between that level and the reference (or baseline) category ('Sales'), holding all other model variables constant.
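A sketch of the encoding itself (a hypothetical helper, not a specific library API): with 'Sales' as the baseline, only the other two levels get dummy columns, so a Sales row is all zeros and the Engineering coefficient is measured relative to Sales:

```python
# Dummy coding with 'Sales' dropped as the reference level: only the
# non-reference levels get columns, and the baseline row is all zeros.
levels = ["HR", "Engineering"]  # 'Sales' is the dropped baseline

def encode(dept):
    return [1 if dept == lvl else 0 for lvl in levels]

print(encode("Sales"))        # [0, 0] -> baseline
print(encode("Engineering"))  # [0, 1] -> its coefficient is the offset vs Sales
```

Libraries like pandas (`get_dummies(..., drop_first=True)`) produce the same layout.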
35. A Lasso regression model is trained on a dataset. When the regularization strength λ is set to a very small, non-zero value, most coefficients are large. As λ is gradually increased, what is the expected behavior of the model's coefficients?
Effect of regularization on model complexity
Medium
A.Only the smallest coefficients will be set to zero, while the largest ones remain unchanged.
B.The magnitudes of all coefficients will shrink, and some will become exactly zero.
C.The coefficients will be randomly set to zero based on the value of λ.
D.All coefficients will shrink towards zero proportionally but will never reach it.
Correct Answer: The magnitudes of all coefficients will shrink, and some will become exactly zero.
Explanation:
This describes the regularization path of a Lasso model. As the penalty term increases, it puts more pressure on the model to have smaller coefficients. The L1 penalty used by Lasso has the unique effect of forcing some coefficients to become precisely zero once the penalty is strong enough, effectively removing them from the model.
36. You have a dataset with highly correlated features. You decide to use a regularized regression model to prevent overfitting and improve stability. Why might Ridge Regression be a better choice than Lasso Regression in this specific scenario?
Regularized Regression models
Medium
A.Lasso is unable to handle multicollinearity and will fail to converge.
B.Ridge tends to shrink the coefficients of correlated features towards each other, keeping all of them, while Lasso might arbitrarily pick one and eliminate the others.
C.Ridge is computationally faster than Lasso when there are many features.
D.Ridge can perform automatic feature selection, which is useful for correlated features.
Correct Answer: Ridge tends to shrink the coefficients of correlated features towards each other, keeping all of them, while Lasso might arbitrarily pick one and eliminate the others.
Explanation:
When features are highly correlated, Lasso's behavior can be unstable; it often arbitrarily selects one feature from a group of correlated features and sets the coefficients of the others to zero. Ridge, on the other hand, tends to distribute the coefficient weight among the correlated features, shrinking them together. This can lead to a more stable and sometimes more interpretable model when you believe the correlated features are all relevant.
37. What is the primary risk associated with using a very high-degree polynomial (e.g., degree 20) in a polynomial regression model?
Polynomial feature expansion
Medium
A.The model will be unable to capture complex non-linear relationships.
B.The model is very likely to overfit the training data, leading to poor generalization on new data.
C.The model's coefficients will be difficult to interpret due to multicollinearity between polynomial terms.
D.The computational cost will be too high for most modern computers to handle.
Correct Answer: The model is very likely to overfit the training data, leading to poor generalization on new data.
Explanation:
A high-degree polynomial creates a very flexible model with high complexity. While it might achieve a very low error on the training set by weaving through the data points, it is likely capturing noise rather than the true underlying function. This results in high variance (overfitting) and poor performance on unseen data.
38. When a Decision Tree Regressor makes a prediction for a new, unseen data point, how is the prediction value determined?
Tree-Based regression models
Medium
A.The prediction is the target value of the single closest training sample in the feature space.
B.The prediction is determined by a linear regression model fitted on the training samples within the final leaf node.
C.The tree calculates the weighted average of the target values of all training samples, with weights determined by the path taken.
D.The new data point is passed down the tree, and the prediction is the average of the target values of all training samples in the leaf node it reaches.
Correct Answer: The new data point is passed down the tree, and the prediction is the average of the target values of all training samples in the leaf node it reaches.
Explanation:
A decision tree works by recursively partitioning the feature space. For a new data point, it follows the splitting rules from the root node down to a terminal (leaf) node. The prediction for any point that falls into that leaf node is simply the mean of the target variable ('y' values) of all the training data points that ended up in that same leaf during training.
39. An analyst is using an Autoregressive model of order p, AR(p), to forecast a time series. What is the fundamental principle of an AR(p) model?
Time-series Regression models
Medium
A.It predicts the future value as a function of 'p' external predictor variables.
B.It predicts the future value by differencing the series 'p' times to make it stationary.
C.It predicts the future value of the series based on the past 'p' forecast errors (shocks).
D.It predicts the future value of the series as a linear combination of its own 'p' most recent past values.
Correct Answer: It predicts the future value of the series as a linear combination of its own 'p' most recent past values.
Explanation:
An Autoregressive (AR) model is based on the idea that the current value of a time series is dependent on its own previous values. The 'p' in AR(p) specifies that the model uses the 'p' most recent time steps (lags) as predictors in a linear equation to forecast the next value.
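The "past values as predictors" idea can be sketched in a few lines: simulate an AR(1) process yₜ = 0.8·yₜ₋₁ + εₜ (a toy setup of my choosing) and recover the lag coefficient by regressing yₜ on yₜ₋₁:

```python
import numpy as np

# An AR(1) fit is just a linear regression of y_t on y_{t-1}.
rng = np.random.default_rng(7)
y = np.zeros(500)
for t in range(1, y.size):
    y[t] = 0.8 * y[t - 1] + rng.normal(scale=0.5)

# Regress y_t on its lag-1 value; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(y[:-1], y[1:], 1)
print(round(slope, 2))  # close to the true autoregressive coefficient 0.8
```

An AR(p) model extends this to a multiple regression on the p most recent lags.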
40. Which pair of evaluation metrics is most appropriate for a regression task versus a binary classification task, respectively?
Difference between regression and classification
Medium
A.Mean Squared Error (MSE) and Log Loss
B.Root Mean Squared Error (RMSE) and Area Under the ROC Curve (AUC)
C.Accuracy and Mean Absolute Error (MAE)
D.R-squared and Precision
Correct Answer: Root Mean Squared Error (RMSE) and Area Under the ROC Curve (AUC)
Explanation:
Regression models predict continuous values, so their performance is measured by error metrics like RMSE, MAE, or MSE, which quantify the magnitude of the prediction errors. Classification models predict discrete classes, and their performance is often evaluated using metrics like Accuracy, Precision, Recall, F1-score, or AUC, which measure how well the model distinguishes between classes.
Incorrect! Try again.
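Both metric families are easy to compute by hand; a small numpy sketch with made-up values (AUC via the pairwise-ranking formulation, assuming no tied scores):

```python
import numpy as np

# Regression: RMSE measures the magnitude of continuous prediction errors.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Classification: AUC measures how well scores rank positives above negatives.
labels = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
pos, neg = scores[labels == 1], scores[labels == 0]
auc = np.mean([s_p > s_n for s_p in pos for s_n in neg])
print(rmse, auc)  # sqrt(5/3) and 8/9
```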
41In a multiple linear regression scenario with two highly correlated predictor variables, x1 and x2, both having a true positive relationship with the target y, how would the estimated coefficients β1 and β2 likely behave in a Ridge regression versus a Lasso regression as the regularization strength λ is increased?
Regularized Regression models
Hard
A.Both Ridge and Lasso will shrink both coefficients towards zero at exactly the same rate, maintaining their initial ratio.
B.Both Ridge and Lasso will drive one coefficient to zero and keep the other, as this is the optimal way to handle multicollinearity.
C.Ridge will shrink both coefficients towards each other and then towards zero. Lasso is likely to arbitrarily drive one coefficient to zero while keeping the other.
D.Lasso will shrink both coefficients towards each other and then towards zero. Ridge is likely to arbitrarily drive one coefficient to zero while keeping the other.
Correct Answer: Ridge will shrink both coefficients towards each other and then towards zero. Lasso is likely to arbitrarily drive one coefficient to zero while keeping the other.
Explanation:
Ridge's penalty term (λ Σ βj²) is minimized when the coefficients for correlated variables are close to each other (it 'prefers' to split the predictive power). Thus, it shrinks them together. Lasso's penalty (λ Σ |βj|) is indifferent to how the power is split and performs feature selection. Due to the geometry of the L1 constraint, it will often find a solution where one coefficient is pushed to exactly zero, effectively selecting one variable from the correlated group.
Incorrect! Try again.
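The Ridge half of this behavior can be verified with the closed-form solution; the sketch below duplicates one feature to make the correlation perfect, and Ridge splits the signal evenly between the two copies (Lasso's zeroing behavior needs an iterative solver and is not shown here):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
X = np.column_stack([x, x])           # two perfectly correlated predictors
y = 3.0 * x + rng.normal(scale=0.1, size=n)

lam = 1.0
p = X.shape[1]
# Ridge closed form: (X'X + lam*I)^-1 X'y  (intercept omitted for simplicity)
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta)  # the L2 penalty splits the signal: both coefficients equal, ~1.5
```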
42You are fitting a polynomial regression model to a dataset where the true underlying relationship is a simple linear function but the data has a high level of irreducible error (high variance noise). You fit two models: a degree-1 polynomial (linear) and a degree-10 polynomial. Which statement most accurately describes the bias and variance of the degree-10 model compared to the degree-1 model?
Bias-variance considerations in regression
Hard
A.The degree-10 model will have high bias and high variance because its complexity prevents it from capturing the simple true trend.
B.The degree-10 model will have low variance but high bias, as it over-simplifies the noisy data by fitting a complex curve.
C.The degree-10 model will have low bias and low variance because its flexibility allows it to perfectly model both the trend and the noise.
D.The degree-10 model will have low bias on the training set but very high variance, leading to poor generalization on a test set.
Correct Answer: The degree-10 model will have low bias on the training set but very high variance, leading to poor generalization on a test set.
Explanation:
A high-degree polynomial (degree-10) is a highly flexible model. On the training data, it can fit both the underlying linear trend and the random noise, resulting in low training error and thus low bias with respect to the training set. However, this fitting of noise means the model's parameters will vary wildly with different samples of training data, leading to very high variance. This high variance results in poor performance on unseen test data. The simple degree-1 model would have low variance but slightly higher bias (as it cannot fit the noise perfectly).
Incorrect! Try again.
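A quick numpy experiment on synthetic linear data with noise shows the degree-10 fit chasing noise: its training error is lower, which is exactly the low-training-bias symptom described above:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # linear truth + noise

def train_mse(deg):
    coeffs = np.polyfit(x, y, deg)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

mse1, mse10 = train_mse(1), train_mse(10)
print(mse1, mse10)  # degree-10 fits training noise: lower training error
```

On a fresh sample from the same process, the ordering would typically reverse — that gap is the variance penalty.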
43A Gradient Boosting Regressor (GBR) and a Random Forest Regressor (RFR) are trained on the same dataset. The GBR is trained with a small learning rate and a large number of estimators, while the RFR is trained with deep, unpruned trees. If both models show signs of overfitting, how does the nature of this overfitting typically differ between the two models?
Tree-Based regression models
Hard
A.The GBR's overfitting is primarily due to reducing bias to an extremely low level at the cost of increased variance, while the RFR's overfitting is due to averaging many low-bias, high-variance models where the variance is not reduced enough.
B.The GBR overfits by sequentially fitting to residuals, which leads to high bias. The RFR overfits by creating trees that are too simple, leading to high variance.
C.The GBR's overfitting is due to high variance in its individual weak learners. The RFR's overfitting is due to a systematic bias introduced by the bagging process.
D.Both models overfit primarily by reducing variance at the cost of bias.
Correct Answer: The GBR's overfitting is primarily due to reducing bias to an extremely low level at the cost of increased variance, while the RFR's overfitting is due to averaging many low-bias, high-variance models where the variance is not reduced enough.
Explanation:
GBR is a sequential, boosting ensemble method that aims to reduce bias by fitting new models to the residuals of previous models. Overfitting in GBR occurs when it starts fitting the noise in the residuals, leading to extremely low bias on the training data but high variance. RFR is a parallel, bagging ensemble method that averages many deep, high-variance, low-bias trees. The averaging reduces variance. Overfitting in RFR occurs when the individual trees are too complex and the averaging process is insufficient to cancel out the variance, often because the trees are too correlated.
Incorrect! Try again.
44A multiple regression model is used to predict house prices: Price = β0 + β1·Size + β2·Age + β3·(Size × Age) + ε. The fitted model yields a statistically significant coefficient β3 = -2.5. How should this coefficient be interpreted?
Interpretation of coefficients
Hard
A.The model is misspecified because an interaction between size and age cannot have a negative effect on price.
B.For every additional year of age, the expected marginal effect of an additional square foot of size on price decreases by $2.5.
C.For every additional square foot of size, the house price is expected to decrease by $2.5, holding age constant.
D.For every additional year of age, the house price is expected to decrease by $2.5, holding size constant.
Correct Answer: For every additional year of age, the expected marginal effect of an additional square foot of size on price decreases by $2.5.
Explanation:
The coefficient β3 represents the interaction effect. The marginal effect of Size on Price is the partial derivative of Price with respect to Size, which is β1 + β3·Age. This shows the effect of size is not constant but depends on age. The interpretation of β3 is how this marginal effect changes for a one-unit change in Age. Specifically, for each one-unit increase in Age, the slope of Price with respect to Size changes by β3 = -2.5.
Incorrect! Try again.
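Numerically, with hypothetical coefficient values (only the interaction coefficient -2.5 comes from the question; the others are invented for the demo), the marginal effect of one extra square foot shrinks by $2.5 per year of age:

```python
import numpy as np

# Hypothetical fitted coefficients; b3 = -2.5 is the Size x Age interaction.
b0, b1, b2, b3 = 50_000.0, 120.0, -300.0, -2.5

def price(size, age):
    return b0 + b1 * size + b2 * age + b3 * size * age

# Marginal effect of one extra square foot, evaluated at two different ages:
effect_age10 = price(1001, 10) - price(1000, 10)   # b1 + b3*10
effect_age11 = price(1001, 11) - price(1000, 11)   # b1 + b3*11
print(effect_age10, effect_age11, effect_age11 - effect_age10)  # drops by 2.5
```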
45You are building a linear regression model to forecast sales (y) using advertising spend (x) as a predictor. After fitting the model y = β0 + β1·x + ε, you perform a Durbin-Watson test on the residuals and get a test statistic of 0.35. What is the most critical implication of this result for your model's statistical inference?
Time-series Regression models
Hard
A.The residuals exhibit strong positive autocorrelation, which violates the independence of errors assumption and leads to underestimated standard errors of the coefficients.
B.The residuals are not normally distributed, which invalidates the t-tests and F-tests for coefficient significance.
C.The model suffers from severe multicollinearity, making the coefficient estimates unreliable.
D.The relationship between sales and advertising spend is non-linear, meaning the model has high bias.
Correct Answer: The residuals exhibit strong positive autocorrelation, which violates the independence of errors assumption and leads to underestimated standard errors of the coefficients.
Explanation:
The Durbin-Watson statistic tests for first-order autocorrelation in residuals. A value near 2 indicates no autocorrelation. Values approaching 0 indicate strong positive autocorrelation. A value of 0.35 is very close to 0, indicating a severe violation of the OLS assumption of independent errors. A major consequence of positive autocorrelation is that the standard errors of the regression coefficients will be underestimated, leading to overly optimistic (inflated) t-statistics and p-values. This makes coefficients appear more statistically significant than they actually are.
Incorrect! Try again.
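The Durbin-Watson statistic itself is a one-line computation; the sketch below compares strongly autocorrelated residuals (rho = 0.9, so DW should land near 2(1 - rho) = 0.2) with independent ones:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
# AR(1) residuals with strong positive autocorrelation (rho = 0.9)
e = np.zeros(1000)
for t in range(1, len(e)):
    e[t] = 0.9 * e[t-1] + rng.normal()

dw_ar = durbin_watson(e)
dw_iid = durbin_watson(rng.normal(size=1000))
print(dw_ar)   # well below 2: positive autocorrelation
print(dw_iid)  # near 2: no autocorrelation
```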
46In the context of regularized linear regression (like Ridge or Lasso), how does the concept of "effective degrees of freedom" change as the regularization parameter λ is increased from 0 to infinity?
Effect of regularization on model complexity
Hard
A.It decreases monotonically from p (the number of predictors) towards 0.
B.It remains constant at p regardless of the value of λ.
C.It first increases as the model finds important features and then decreases.
D.It increases monotonically from 0 towards p.
Correct Answer: It decreases monotonically from p (the number of predictors) towards 0.
Explanation:
The effective degrees of freedom of a model measure its complexity. For an unregularized linear model with p predictors, the effective degrees of freedom is p. As the regularization penalty λ increases, the coefficients are constrained and shrunk towards zero. This reduces the model's flexibility to fit the data, thus lowering its complexity. As λ → ∞, all coefficients are forced to zero (for a model without an intercept), and the model becomes a null model with 0 effective degrees of freedom. Therefore, the effective degrees of freedom decrease from p towards 0 as λ increases.
Incorrect! Try again.
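For Ridge specifically, the effective degrees of freedom have a closed form via the singular values d_i of the design matrix: df(λ) = Σ d_i² / (d_i² + λ). A sketch on random data shows the monotone decrease from p to 0:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))        # n=100 observations, p=5 predictors
d = np.linalg.svd(X, compute_uv=False)

def eff_df(lam):
    """Effective degrees of freedom of ridge: trace of the smoother matrix."""
    return np.sum(d**2 / (d**2 + lam))

print(eff_df(0.0))      # = p = 5 (no regularization: ordinary OLS)
print(eff_df(10.0))     # strictly between 0 and 5
print(eff_df(1e9))      # -> 0 as lambda -> infinity
```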
47You are building a multiple linear regression model. You include predictors for a person's weight in kilograms (x1) and their weight in pounds (x2). Assuming no measurement error (x2 = 2.20462·x1), what is the precise mathematical consequence for the Ordinary Least Squares (OLS) estimation process?
Multiple Linear Regression
Hard
A.The model will produce coefficient estimates, but their standard errors will be infinitely large, making them useless.
B.The OLS algorithm in most software will automatically detect the collinearity and drop one of the two variables.
C.The R-squared of the model will be artificially inflated to 1.0, regardless of the target variable.
D.The matrix XᵀX becomes singular and its inverse does not exist, so no unique solution for the coefficient vector β can be found.
Correct Answer: The matrix XᵀX becomes singular and its inverse does not exist, so no unique solution for the coefficient vector β can be found.
Explanation:
This is a case of perfect multicollinearity, where one predictor is a perfect linear combination of another. The OLS solution for the coefficients is given by β̂ = (XᵀX)⁻¹Xᵀy. When perfect multicollinearity exists, the columns of the design matrix X are linearly dependent. This causes the matrix XᵀX to be singular (or non-invertible), meaning its determinant is zero. As a result, its inverse is not defined, and the OLS procedure fails to find a unique set of coefficients. While some software packages implement a practical workaround (like option B), the fundamental mathematical result is the non-invertibility of XᵀX.
Incorrect! Try again.
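This rank deficiency is easy to observe directly (2.20462 is the usual kg-to-pounds conversion factor):

```python
import numpy as np

rng = np.random.default_rng(5)
kg = rng.uniform(50, 100, size=20)
pounds = kg * 2.20462                  # exact linear function of kg
X = np.column_stack([np.ones(20), kg, pounds])

# Linearly dependent columns: X has rank 2, not 3, so X'X is singular.
print(np.linalg.matrix_rank(X))        # 2
print(np.linalg.det(X.T @ X))          # ~0 (singular, up to float error)
```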
48Consider a regression problem where you apply a polynomial feature expansion of degree 5 to a single feature x. You then train a Ridge regression model on these 5 derived features (x, x², x³, x⁴, x⁵). What is the primary effect of a very large regularization parameter λ on the resulting regression function f(x)?
Polynomial feature expansion
Hard
A.The function will still be a complex 5th-degree polynomial but with significantly smaller oscillations.
B.The function will approximate a constant function (the mean of the target).
C.The function will approximate a simple linear function of x (a line, but not necessarily flat).
D.The function will become exactly zero for all x, i.e., f(x) = 0.
Correct Answer: The function will approximate a constant function (the mean of the target).
Explanation:
Ridge regression minimizes the objective function RSS + λ Σ βj². As the regularization parameter λ becomes extremely large, the penalty for non-zero coefficients dominates the RSS term. To minimize this objective, the model must force all slope coefficients (β1, ..., β5) to approach zero. The intercept term, β0, is typically not regularized. The model that minimizes the RSS with all slope coefficients being zero is a model that predicts the mean of the target variable for all inputs. Therefore, the function will approximate a constant function, where f(x) ≈ ȳ.
Incorrect! Try again.
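A sketch with an unpenalized intercept (handled here by centering, a common convention) shows the fitted function collapsing to the target mean once λ is huge:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=100)

# Degree-5 polynomial features, centered; the intercept is handled separately
# (unpenalized), as is standard practice.
P = np.column_stack([x**k for k in range(1, 6)])
Pc = P - P.mean(axis=0)
yc = y - y.mean()

lam = 1e12
beta = np.linalg.solve(Pc.T @ Pc + lam * np.eye(5), Pc.T @ yc)
preds = y.mean() + Pc @ beta
print(np.max(np.abs(preds - y.mean())))   # tiny: f(x) collapses to the mean
```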
49According to the Gauss-Markov theorem, the Ordinary Least Squares (OLS) estimator for the coefficients in a simple linear regression model is the Best Linear Unbiased Estimator (BLUE). What does "Best" in this context specifically refer to?
Simple Linear Regression
Hard
A.It has the minimum sampling variance among all linear unbiased estimators.
B.It is robust to violations of the normality of errors assumption.
C.It provides the highest possible R-squared value for the training data.
D.It is the most computationally efficient estimator to calculate.
Correct Answer: It has the minimum sampling variance among all linear unbiased estimators.
Explanation:
The Gauss-Markov theorem states that under the standard OLS assumptions (linearity, exogeneity, homoscedasticity, and no perfect multicollinearity), the OLS estimator is BLUE. Each word has a specific meaning: "Linear" means it's a linear function of the observed values. "Unbiased" means its expected value is the true population parameter (E[β̂] = β). "Best" specifically means that it has the lowest sampling variance compared to any other linear unbiased estimator. This implies that OLS estimates are the most precise or reliable in this class of estimators.
Incorrect! Try again.
50You are tasked with modeling the number of customer support tickets received per hour. The target variable is a non-negative integer (0, 1, 2, ...). A colleague suggests using a Poisson regression model. How does this model blur the line between typical regression and classification tasks?
Difference between regression and classification
Hard
A.It predicts a continuous rate parameter (λ) for a count distribution, but the ultimate output variable is discrete, sharing characteristics with both regression (predicting a numeric value) and classification (predicting from a set of integer classes).
B.It is purely a classification task because the output is from a discrete, ordered set of integers.
C.It is purely a regression task because it uses a generalized linear model framework to predict an expected value.
D.It is neither regression nor classification; it belongs to a separate category of 'counting models' that have no overlap with either.
Correct Answer: It predicts a continuous rate parameter (λ) for a count distribution, but the ultimate output variable is discrete, sharing characteristics with both regression (predicting a numeric value) and classification (predicting from a set of integer classes).
Explanation:
Standard regression predicts a continuous value (e.g., price, temperature). Standard classification predicts a discrete, categorical label (e.g., cat, dog, bird). A Poisson regression model uses a linear model to predict the logarithm of a continuous, positive real number, the rate parameter λ (the expected number of events). This prediction of a continuous parameter is a regression-like task. However, this parameter then defines a probability distribution over a discrete, countably infinite set of outcomes (the integers 0, 1, 2, ...). This makes the target variable discrete, which is a characteristic of classification. Therefore, it sits in a gray area, using regression techniques to model a discrete (count) outcome.
Incorrect! Try again.
51When tuning an XGBoost regressor, you observe that decreasing the eta (learning rate) parameter significantly improves the model's performance on a validation set, but only if you also substantially increase the n_estimators parameter. Why is this combined adjustment necessary for improved performance?
Tree-Based regression models
Hard
A.A smaller eta forces the model to focus only on the most important features, and more trees are needed to eventually consider all features.
B.A smaller eta makes each tree contribute less to the final prediction, requiring more trees (n_estimators) to reach a good cumulative model. This slower, more gradual learning process is less likely to overfit.
C.A smaller eta increases the variance of each individual tree, which must be compensated for by averaging more trees (n_estimators).
D.The eta and n_estimators parameters are inversely proportional by definition in the XGBoost algorithm to maintain a constant model complexity.
Correct Answer: A smaller eta makes each tree contribute less to the final prediction, requiring more trees (n_estimators) to reach a good cumulative model. This slower, more gradual learning process is less likely to overfit.
Explanation:
The eta (learning rate) scales the contribution of each new tree added to the ensemble. A small eta (e.g., 0.01) means each tree makes a very small correction to the overall model. This prevents the model from making drastic changes based on any single tree, which helps avoid overfitting. However, since each step is small, many more steps (n_estimators) are required for the model to 'travel' to a good minimum in the loss function. This combination allows for a more robust and fine-tuned model that generalizes better by taking smaller, more careful steps.
Incorrect! Try again.
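The step-size/step-count tradeoff (though not the overfitting aspect) can be caricatured with a "boosting" loop whose weak learner is just the mean of the current residuals — a deliberately trivial stand-in for a tree, not a real GBM:

```python
import numpy as np

y = np.array([10.0, 12.0, 8.0, 14.0])

def boost_constant(eta, n_estimators):
    """Toy boosting: each 'tree' is the mean of the current residuals,
    scaled by the learning rate eta. Returns training MSE."""
    pred = np.zeros_like(y)
    for _ in range(n_estimators):
        residual = y - pred
        pred = pred + eta * residual.mean()
    return np.mean((y - pred) ** 2)

# Small eta with few trees underfits; many more trees recover the fit.
print(boost_constant(0.01, 50))    # still far from the target mean
print(boost_constant(0.01, 1000))  # many small steps: fit recovered
print(boost_constant(0.3, 50))     # larger steps converge in fewer trees
```

Each round only removes a fraction eta of the remaining error, so halving eta roughly doubles the number of estimators needed to reach the same training loss.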
52You are working with a dataset that has a very large number of features, many of which are highly correlated with each other in groups. You want to perform feature selection, but also want to avoid arbitrarily discarding correlated features that might be collectively predictive. Which regression model is explicitly designed to handle this "grouping effect" among correlated features?
Regularized Regression models
Hard
A.Ridge Regression
B.Lasso Regression
C.Elastic Net Regression
D.Principal Component Regression (PCR)
Correct Answer: Elastic Net Regression
Explanation:
Elastic Net combines the L1 penalty of Lasso and the L2 penalty of Ridge. The L1 part enables sparse solutions (feature selection). The L2 part encourages correlated features to be selected together, giving them similar coefficient values. This is known as the "grouping effect." Lasso, by itself, tends to arbitrarily select one feature from a correlated group and zero out the others. Ridge will keep all correlated features but won't perform feature selection. PCR handles correlation but by transforming features into uncorrelated components, losing original feature interpretability. Elastic Net is specifically designed for sparse selection within groups of correlated features.
Incorrect! Try again.
53A financial services company is building a regression model to predict the exact dollar amount of a potential loan default. The cost of under-predicting the default amount is extremely high, while the cost of over-predicting is relatively low. Given this asymmetric cost function, what kind of model characteristic should be prioritized during development, even if it harms standard metrics like Mean Squared Error (MSE)?
Bias-variance considerations in regression
Hard
A.A model with higher bias and lower variance, as simpler, more stable models are always preferable in finance.
B.A model that focuses exclusively on minimizing the irreducible error through better data collection.
C.A model that minimizes the Median Absolute Error instead of the Mean Squared Error, as it's more robust.
D.A model with lower bias and potentially higher variance, as its flexibility is needed to capture extreme high-default events.
Correct Answer: A model with lower bias and potentially higher variance, as its flexibility is needed to capture extreme high-default events.
Explanation:
Standard metrics like MSE penalize over- and under-predictions equally. The business problem describes an asymmetric cost. High-cost events (large defaults) are often in the tails of the distribution. A high-bias, low-variance model (e.g., a simple linear model) might be stable but could systematically under-predict these extreme values. A more complex, low-bias, high-variance model (e.g., a complex GBT or a well-tuned neural network) is more flexible and has a better chance of capturing these rare but critical high-default events. The priority is to avoid catastrophic under-prediction, which means accepting a more complex model that can produce large predictions, even at the risk of higher variance.
Incorrect! Try again.
54In a regression model, both the independent variable X (e.g., advertising spend) and the dependent variable Y (e.g., sales) are log-transformed: log(Y) = β0 + β1·log(X) + ε. The fitted coefficient is β1 = 0.8. How is this coefficient correctly interpreted in a practical sense?
Interpretation of coefficients
Hard
A.A 1% increase in X is associated with a 0.8-unit increase in Y.
B.A 1% increase in X is associated with an expected 0.8% increase in Y.
C.A 1-unit increase in X is associated with a 0.8-unit increase in Y.
D.A 1-unit increase in X is associated with a 0.8% increase in Y.
Correct Answer: A 1% increase in X is associated with an expected 0.8% increase in Y.
Explanation:
This is a log-log model, and the coefficient β1 represents the elasticity of Y with respect to X. Mathematically, β1 = (%ΔY)/(%ΔX). This ratio of percentage changes is interpreted as: a 1% change in X is associated with a β1% change in Y. So, a 1% increase in advertising spend is associated with an expected 0.8% increase in sales. While option D is mathematically literal, it is not the standard, practical interpretation used to convey the business impact.
Incorrect! Try again.
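Checking the elasticity interpretation numerically: multiplying X by 1.01 multiplies Y by 1.01 raised to the 0.8 power, which is very nearly a 0.8% increase:

```python
import numpy as np

beta1 = 0.8  # fitted log-log coefficient

# log(Y) = b0 + b1*log(X): multiplying X by 1.01 multiplies Y by 1.01**b1.
multiplier = 1.01 ** beta1
print((multiplier - 1) * 100)  # ~0.799% -- i.e., about a 0.8% increase in Y
```

The approximation is tight only for small percentage changes; for large changes the exact multiplicative form must be used.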
55You are analyzing the coefficient paths of a Lasso regression as the regularization parameter λ varies. The path plot shows the magnitude of each coefficient as a function of λ. What critical information for model selection can be derived from the order in which coefficients become non-zero as λ decreases from a very large value towards zero?
Effect of regularization on model complexity
Hard
A.It determines the sign (positive or negative) of the relationship, which is fixed regardless of λ once the coefficient is non-zero.
B.It shows the optimal value of λ directly, which is the point where the first coefficient becomes non-zero.
C.It indicates which features are most correlated with each other, as their paths will have identical slopes.
D.It provides a data-driven ranking of feature importance, as stronger predictors tend to enter the model (become non-zero) at higher levels of regularization.
Correct Answer: It provides a data-driven ranking of feature importance, as stronger predictors tend to enter the model (become non-zero) at higher levels of regularization.
Explanation:
The Lasso path algorithm (like LARS) effectively builds the model sequentially. As you relax the penalty (decrease λ), the algorithm allows coefficients to become non-zero one by one. The feature whose coefficient 'peels off' from zero first is the one that has the strongest correlation with the current residuals. This sequence of entry into the model provides a powerful heuristic for ranking feature importance. Features that can withstand stronger penalties (higher λ) before being zeroed out are generally considered more important by the model.
Incorrect! Try again.
56In a multiple linear regression analysis, what is the key distinction between a high-leverage point and a high-influence outlier?
Multiple Linear Regression
Hard
A.A high-leverage point is always influential, but a high-influence outlier may not necessarily have high leverage.
B.A high-leverage point has an extreme value for the target variable (y), while a high-influence outlier has extreme values for predictor variables (x).
C.High-leverage and high-influence are synonymous terms for any outlier that significantly affects the regression line's slope or intercept.
D.A high-leverage point has an extreme value for one or more predictor variables (x), while a high-influence outlier is a point that, if removed, would cause a large change in the regression model's coefficients. A point can have high leverage without being influential.
Correct Answer: A high-leverage point has an extreme value for one or more predictor variables (x), while a high-influence outlier is a point that, if removed, would cause a large change in the regression model's coefficients. A point can have high leverage without being influential.
Explanation:
Leverage is a measure determined solely by the predictor variables (x values). A point has high leverage if its x values are far from the center of the other points' x values. Influence is a measure of how much the model's parameters (e.g., coefficients) change when a point is removed. Influence is a function of both leverage and the size of the point's residual. A point with high leverage can be non-influential if it lies close to the regression line determined by the other points (i.e., it has a small residual). However, a high-leverage point with a large residual will almost always be highly influential.
Incorrect! Try again.
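The distinction can be demonstrated by planting a point with an extreme x value that sits on the true line: its hat-matrix leverage is by far the largest, yet deleting it barely moves the fitted slope (all data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=30)

# Add a high-leverage point that sits ON the true line (tiny residual).
x = np.append(x, 50.0)
y = np.append(y, 2.0 * 50.0 + 1.0)

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
leverage = np.diag(H)
print(leverage.argmax())                   # 30: the extreme-x point

def slope(xs, ys):
    A = np.column_stack([np.ones_like(xs), xs])
    return np.linalg.lstsq(A, ys, rcond=None)[0][1]

# Removing the high-leverage, on-line point barely changes the slope:
change = abs(slope(x, y) - slope(x[:-1], y[:-1]))
print(change)   # small -> high leverage, but low influence
```

Moving that same point far off the line (a large residual) would make it highly influential, which is the Cook's-distance intuition.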
57You are creating polynomial features of degree 3 from a set of 10 original features. The original features have vastly different scales (e.g., age from 20-60, income from 50,000-200,000). Why is it critically important to scale the original features before applying the polynomial expansion?
Polynomial feature expansion
Hard
A.To prevent features with large scales from numerically dominating the polynomial terms, which can cause instability in model fitting algorithms and render regularization ineffective for small-scale features.
B.To ensure that the resulting design matrix is orthogonal, which simplifies the calculation of the OLS coefficients.
C.Because polynomial expansion is only mathematically defined for features scaled between 0 and 1.
D.To reduce the total number of polynomial features generated, as scaling can merge redundant terms.
Correct Answer: To prevent features with large scales from numerically dominating the polynomial terms, which can cause instability in model fitting algorithms and render regularization ineffective for small-scale features.
Explanation:
Consider income (x1) on a scale of about 10^5 and age (x2) on a scale of about 10^1. The polynomial term x1³ will be on the order of 10^15, while x2³ will be on the order of 10^4. When these terms are fed into a linear model, the huge disparity in scale can lead to numerical precision issues. Furthermore, for regularized models like Ridge or Lasso, the single penalty term is applied to all coefficients. The penalty will be dominated by the coefficients of the large-scale features, effectively ignoring the small-scale features. Scaling first (e.g., using StandardScaler) ensures all original and generated polynomial features are on a comparable scale, leading to stable training and fair regularization.
Incorrect! Try again.
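The numerical side of this argument shows up in the condition number of the expanded design matrix (powers only, interaction terms omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(8)
age = rng.uniform(20, 60, size=200)
income = rng.uniform(50_000, 200_000, size=200)

def poly_design(a, b):
    """Degree-3 power terms for two features (no interactions, for brevity)."""
    return np.column_stack([a, b, a**2, b**2, a**3, b**3])

X_raw = poly_design(age, income)

# Standardize the ORIGINAL features first, then expand.
za = (age - age.mean()) / age.std()
zi = (income - income.mean()) / income.std()
X_scaled = poly_design(za, zi)

print(np.linalg.cond(X_raw))     # astronomically large: numerically fragile
print(np.linalg.cond(X_scaled))  # orders of magnitude smaller
```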
58You are building an autoregressive model to forecast a time series. You notice the series has a clear upward trend and a strong seasonal pattern. What is the most severe consequence of fitting a standard AR(p) model directly to this non-stationary data without any transformation?
Time-series Regression models
Hard
A.The model will perfectly fit the training data but will only be able to forecast the mean of the series for all future time steps.
B.The model will fail to compute because the time-series matrix will be singular due to the deterministic trend.
C.The model's residuals will be perfectly normally distributed, but the coefficient estimates will be biased towards zero due to the trend.
D.The model will likely produce a spurious regression, where variables appear to have a statistically significant relationship that is driven by the common trend, not a true causal link, leading to unreliable forecasts.
Correct Answer: The model will likely produce a spurious regression, where variables appear to have a statistically significant relationship that is driven by the common trend, not a true causal link, leading to unreliable forecasts.
Explanation:
A key assumption for many time-series models, including AR(p), is that the underlying series is stationary (i.e., its statistical properties like mean and variance do not change over time). A series with a trend and seasonality is non-stationary. Fitting a model directly to such data can lead to a spurious regression. The model might find a very high R-squared and seemingly significant coefficients simply because the target (y_t) and the lagged predictors (y_{t-1}, ..., y_{t-p}) are all increasing together due to the common trend. This relationship is not real and will break down for out-of-sample forecasting. The standard practice is to first difference the series (and/or take seasonal differences) to make it stationary before fitting the model.
Incorrect! Try again.
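A sketch of a spurious regression: two independently generated series that share only a deterministic trend produce a near-perfect R², which collapses after differencing:

```python
import numpy as np

rng = np.random.default_rng(9)
t = np.arange(300)

# Two series that share only a deterministic upward trend -- no causal link.
a = 0.5 * t + rng.normal(scale=2.0, size=t.size)
b = 1.3 * t + rng.normal(scale=2.0, size=t.size)

X = np.column_stack([np.ones_like(t), a])
coef, *_ = np.linalg.lstsq(X, b, rcond=None)
r2 = 1 - (b - X @ coef).var() / b.var()
print(r2)        # near 1: a 'great' fit driven entirely by the common trend

# Differencing removes the trend; the relationship largely evaporates.
da, db = np.diff(a), np.diff(b)
Xd = np.column_stack([np.ones_like(da), da])
cd, *_ = np.linalg.lstsq(Xd, db, rcond=None)
r2_diff = 1 - (db - Xd @ cd).var() / db.var()
print(r2_diff)   # near 0: the 'relationship' was the trend
```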
59You have trained a Random Forest Regressor on a dataset where the feature X ranges from 0 to 100. The model has learned the relationship well within this range. What value will the trained model most likely predict for a new data point with X = 200?
Tree-Based regression models
Hard
A.A value close to the overall average of the target variable across the entire training set.
B.A value close to the average prediction for the training instances where X was at its maximum observed value (near 100).
C.The model will return a NaN or an error because the value is outside the training domain.
D.It will extrapolate the learned trend linearly and predict a value significantly higher than any seen in the training data.
Correct Answer: A value close to the average prediction for the training instances where X was at its maximum observed value (near 100).
Explanation:
Tree-based models, including Random Forests, cannot extrapolate beyond the range of the training data. A prediction is made by averaging the values in the terminal leaves where a new data point falls. For a value of X=200, which is outside the training range of [0, 100], the data point will traverse each tree down a path. At any split on X, since 200 is greater than any split point based on X in the training data, it will always follow the same branch as a data point with the maximum X value from training. It will therefore end up in the same terminal leaves as the training points with the largest X values. The final prediction will be the average of the target values in those leaves, effectively capping the prediction at the level seen at the edge of the training data.
Incorrect! Try again.
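The routing argument can be mimicked with a hand-built stand-in for one tree's splits along X (thresholds and leaf means are hypothetical): any x beyond the largest learned threshold lands in the same rightmost leaf as the largest training values:

```python
import numpy as np

# A tiny regression 'tree' stored as sorted split thresholds on X with a
# leaf mean for each resulting interval (hypothetical fitted values).
thresholds = np.array([25.0, 50.0, 75.0])     # learned from X in [0, 100]
leaf_means = np.array([10.0, 30.0, 55.0, 80.0])

def tree_predict(x):
    # Route x through the splits: count how many thresholds it exceeds.
    return leaf_means[np.searchsorted(thresholds, x)]

print(tree_predict(90.0))    # 80.0: rightmost leaf
print(tree_predict(200.0))   # 80.0 again -- same leaf, no extrapolation
```

A Random Forest averages many such trees, so the same capping happens in every tree and therefore in the ensemble.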
60The ability of Lasso regression (L1 penalty) to produce sparse models (i.e., set some coefficients to exactly zero) is often explained by its geometric interpretation. In a two-coefficient case (β1, β2), what is the key geometric property of the Lasso constraint region (|β1| + |β2| ≤ t) that leads to this sparsity?
Regularized Regression models
Hard
A.The constraint region is a circle (β1² + β2² ≤ t²), which allows the RSS contours to touch tangentially at a point where both coefficients are non-zero.
B.The constraint region is an unbounded square, which allows coefficients to be pushed to exactly zero without violating the constraint.
C.The constraint region is a rhombus with sharp corners at the axes. The elliptical contours of the residual sum of squares (RSS) are likely to make their first contact with the constraint region at one of these corners.
D.The constraint region is a non-convex shape, which creates multiple local minima, some of which are on the axes where coefficients are zero.
Correct Answer: The constraint region is a rhombus with sharp corners at the axes. The elliptical contours of the residual sum of squares (RSS) are likely to make their first contact with the constraint region at one of these corners.
Explanation:
The solution to a regularized regression problem is the point where the elliptical contours of the RSS (centered at the OLS solution) first touch the boundary of the constraint region defined by the penalty. For Ridge regression (L2 penalty), the constraint is a circle (in 2D), a smooth shape. For Lasso (L1 penalty), the constraint is a rhombus (a diamond shape in 2D). This shape has sharp corners that lie exactly on the axes (e.g., at β1 = 0 or β2 = 0). It is geometrically probable that the expanding RSS ellipse will hit one of these sharp corners before it touches any of the flat sides. A point on a corner means one of the coefficients is exactly zero, thus inducing sparsity.