1. What is the primary motivation for using a Variance Threshold in feature selection?
A. To remove features that have a high correlation with the target variable.
B. To remove features that contain constant or quasi-constant values.
C. To increase the dimensionality of the dataset.
D. To normalize the data distribution.
Correct Answer: To remove features that contain constant or quasi-constant values.
Explanation: Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. Features with zero or very low variance (constant features) carry little to no information for distinguishing between samples.
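As a minimal sketch of this idea, scikit-learn's `VarianceThreshold` can drop a constant column from a toy matrix (the data here is made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the first column is constant (zero variance)
X = np.array([[0, 2, 1],
              [0, 1, 3],
              [0, 3, 2],
              [0, 2, 4]])

selector = VarianceThreshold(threshold=0.0)  # keep only features with variance > 0
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)         # the constant column is gone: (4, 2)
print(selector.get_support())  # boolean mask of kept features
```

For binary features, the threshold is often set using the Bernoulli variance formula from the next question, e.g. `threshold=0.8 * (1 - 0.8)` to drop features that are the same value in more than 80% of samples.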
2. If a Bernoulli random variable (binary feature) takes the value 1 with probability p and 0 with probability 1 − p, what is its variance?
A. p
B. p(1 − p)
C. p^2
D. 1 − p
Correct Answer: p(1 − p)
Explanation: The variance of a Bernoulli variable is calculated as Var = p(1 − p). This formula is often used to set default thresholds for binary feature selection.
3. Which of the following problems arises when two features in a dataset have a Pearson correlation coefficient close to 1 or -1?
A. Overfitting due to noise
B. Multicollinearity
C. Underfitting due to high bias
D. The curse of dimensionality
Correct Answer: Multicollinearity
Explanation: High correlation between independent features leads to multicollinearity, where one feature can be linearly predicted from the other. This can make the model parameters unstable and hard to interpret.
4. In Correlation-based Feature Selection, what is the standard strategy when two features are highly correlated?
A. Keep both features to maximize information.
B. Create a new feature by multiplying them.
C. Remove one of the features.
D. Apply PCA to both features immediately.
Correct Answer: Remove one of the features.
Explanation: Since highly correlated features provide redundant information, the standard strategy is to remove one of them to reduce dimensionality and computational cost without losing significant information.
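A common pandas recipe for this strategy (the threshold 0.95 and the column names are illustrative choices, not fixed rules) scans the upper triangle of the correlation matrix and drops one feature from each highly correlated pair:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=100),  # almost collinear with a
    "c": rng.normal(size=100),                      # independent
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

df_reduced = df.drop(columns=to_drop)
print(to_drop)            # ['b'] is redundant given 'a'
print(df_reduced.columns.tolist())
```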
5. Which of the following describes the Forward Selection algorithm?
A. Start with all features and remove the least significant one iteratively.
B. Start with no features and add the most significant one iteratively.
C. Select features randomly and evaluate performance.
D. Calculate the variance of all features and filter them simultaneously.
Correct Answer: Start with no features and add the most significant one iteratively.
Explanation: Forward Selection is a wrapper method that starts with an empty set of features and adds the single feature that improves the model performance the most in each iteration.
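A minimal sketch of forward selection using scikit-learn's `SequentialFeatureSelector` (the synthetic data and the choice of `LinearRegression` are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2]   # only features 0 and 2 carry signal

sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
)
sfs.fit(X, y)
print(sfs.get_support())  # True exactly for the two informative features
```

Setting `direction="backward"` turns the same tool into Backward Elimination, which is discussed in Question 7.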
6. What is a major disadvantage of Wrapper Methods (like Forward Selection and Backward Elimination) compared to Filter Methods?
A. They do not consider feature interactions.
B. They are computationally expensive because they train the model multiple times.
C. They always result in lower accuracy.
D. They cannot handle categorical data.
Correct Answer: They are computationally expensive because they train the model multiple times.
Explanation: Wrapper methods evaluate subsets of variables by training a model for each subset, making them computationally intensive, especially for datasets with a large number of features.
7. In Backward Elimination, what is the starting point of the algorithm?
A. An empty set of features.
B. A randomly selected subset of features.
C. The set containing all available features.
D. The single feature with the highest variance.
Correct Answer: The set containing all available features.
Explanation: Backward Elimination starts with the full model (all features included) and iteratively removes the least significant feature based on a specific metric (e.g., p-value or performance drop).
8. Tree-based feature importance (e.g., in Random Forest) is typically calculated based on:
A. The correlation of the feature with the target.
B. The magnitude of the regression coefficients.
C. The reduction in impurity (Gini or Entropy) contributed by the feature across all trees.
D. The variance of the feature independent of the target.
Correct Answer: The reduction in impurity (Gini or Entropy) contributed by the feature across all trees.
Explanation: In decision trees and ensembles like Random Forests, feature importance is derived from the total decrease in node impurity (weighted by the probability of reaching that node) brought by that feature.
9. What is the primary difference between Feature Selection and Feature Extraction?
A. Feature Selection creates new variables; Feature Extraction keeps original variables.
B. Feature Selection selects a subset of existing features; Feature Extraction transforms data into a new lower-dimensional space.
C. Feature Selection is unsupervised; Feature Extraction is always supervised.
D. There is no difference; the terms are interchangeable.
Correct Answer: Feature Selection selects a subset of existing features; Feature Extraction transforms data into a new lower-dimensional space.
Explanation: Selection keeps the original feature meanings but reduces the count. Extraction (like PCA) creates new features (components) that are combinations of the original ones.
10. Which of the following is an example of creating a Polynomial Feature?
A. Calculating the mean of a time series.
B. Creating a feature x1 · x2 from existing features x1 and x2.
C. Scaling a feature to the range [0, 1].
D. Encoding a categorical variable using One-Hot Encoding.
Correct Answer: Creating a feature x1 · x2 from existing features x1 and x2.
Explanation: Polynomial features involve creating interaction terms (like products x1 · x2) or power terms (x1^2) to capture non-linear relationships in linear models.
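Scikit-learn's `PolynomialFeatures` generates these terms automatically; a small sketch with one sample (x1 = 2, x2 = 3):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample: x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2, include_bias=False)
Xp = poly.fit_transform(X)

# Output columns: x1, x2, x1^2, x1*x2, x2^2
print(Xp)  # [[2. 3. 4. 6. 9.]]
```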
11. What are Aggregation Features typically used for?
A. To reduce the variance of a single image pixel.
B. To summarize information from multiple records related to a single entity (e.g., average transaction amount per user).
C. To split a dataset into training and testing sets.
D. To visualize high-dimensional data in 2D.
Correct Answer: To summarize information from multiple records related to a single entity (e.g., average transaction amount per user).
Explanation: Aggregation features are created by grouping data (usually by an ID or time window) and calculating statistics like mean, sum, count, min, or max to represent the group's behavior.
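With pandas, such features come from a group-by followed by aggregation (the transaction table below is a made-up example):

```python
import pandas as pd

tx = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount":  [10.0, 20.0, 5.0, 5.0, 20.0],
})

# One row per user, summarizing that user's transaction behavior
user_feats = tx.groupby("user_id")["amount"].agg(["mean", "sum", "count"])
print(user_feats)
```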
12. The Curse of Dimensionality refers to the phenomenon where:
A. Models become too simple as dimensions increase.
B. Data becomes sparse and distance metrics lose meaning as the number of features increases.
C. Computation time decreases as the number of features increases.
D. The correlation between features always decreases in high dimensions.
Correct Answer: Data becomes sparse and distance metrics lose meaning as the number of features increases.
Explanation: As dimensionality increases, the volume of the space increases exponentially, making data sparse. Euclidean distance between points tends to become uniform, making it hard for distance-based algorithms (like k-NN) to distinguish between near and far points.
13. In the context of the Curse of Dimensionality, what happens to the amount of data required to maintain statistical significance as dimensions increase?
A. It decreases linearly.
B. It remains constant.
C. It increases exponentially.
D. It increases logarithmically.
Correct Answer: It increases exponentially.
Explanation: To maintain the same density of data points in the feature space, the number of samples required grows exponentially with the number of dimensions.
14. Principal Component Analysis (PCA) is best described as a technique for:
A. Non-linear dimensionality reduction.
B. Supervised feature selection.
C. Linear unsupervised dimensionality reduction.
D. Clustering categorical data.
Correct Answer: Linear unsupervised dimensionality reduction.
Explanation: PCA is an unsupervised linear transformation technique that identifies the directions (principal components) of maximum variance in the data.
15. In PCA, the first Principal Component (PC1) is the direction that:
A. Maximizes the correlation with the target variable.
B. Minimizes the variance of the projected data.
C. Maximizes the variance of the projected data.
D. Is orthogonal to the direction of maximum variance.
Correct Answer: Maximizes the variance of the projected data.
Explanation: The objective of PCA is to find the axis (PC1) along which the data varies the most, thereby retaining the most information.
16. What is the relationship between the first Principal Component (PC1) and the second Principal Component (PC2)?
A. They are parallel to each other.
B. They are orthogonal (perpendicular) to each other.
C. They are highly correlated.
D. PC2 always captures more variance than PC1.
Correct Answer: They are orthogonal (perpendicular) to each other.
Explanation: Principal components are constructed to be mutually orthogonal (uncorrelated). PC2 is the direction of maximum variance subject to the constraint that it is orthogonal to PC1.
17. Before applying PCA, it is crucial to perform which preprocessing step?
A. One-Hot Encoding
B. Feature Scaling (Standardization)
C. Target Encoding
D. Upsampling
Correct Answer: Feature Scaling (Standardization)
Explanation: PCA seeks to maximize variance. If features are on different scales (e.g., meters vs. millimeters), the feature with the larger scale will dominate the variance calculation, leading to biased components. Standardization ensures all features contribute equally.
18. Mathematically, Principal Components are the ____ of the covariance matrix of the data.
A. Eigenvalues
B. Eigenvectors
C. Inverse
D. Determinant
Correct Answer: Eigenvectors
Explanation: The Principal Components are the eigenvectors of the covariance matrix, and their corresponding eigenvalues represent the magnitude of variance in those directions.
19. What does the Explained Variance Ratio of a Principal Component indicate?
A. The ratio of training error to testing error.
B. The percentage of the dataset's total variance captured by that component.
C. The correlation between that component and the target.
D. The ratio of the number of features to the number of samples.
Correct Answer: The percentage of the dataset's total variance captured by that component.
Explanation: The explained variance ratio tells us how much information (variance) is retained by projecting the data onto a specific principal component. It is calculated as λ_i / Σ_j λ_j, the component's eigenvalue divided by the sum of all eigenvalues.
20. Which plot is commonly used to determine the optimal number of Principal Components to retain?
A. Box plot
B. Scree plot
C. Scatter plot
D. Histogram
Correct Answer: Scree plot
Explanation: A Scree plot displays the eigenvalues (or explained variance) for each component. The 'elbow' point in the plot usually indicates the optimal number of components to keep.
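The PCA machinery from Questions 14-20 (centering, covariance matrix, eigendecomposition, explained variance ratio) can be sketched in a few lines of NumPy; the synthetic 2-D data is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)
# 200 samples in 2-D, stretched strongly along the first axis
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

Xc = X - X.mean(axis=0)                # centre the data
C = Xc.T @ Xc / (len(Xc) - 1)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # sort descending: PC1 first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_variance_ratio = eigvals / eigvals.sum()
print(explained_variance_ratio)        # PC1 dominates
print(eigvecs[:, 0] @ eigvecs[:, 1])   # orthogonality: ~0
```

Plotting `explained_variance_ratio` against the component index gives exactly the scree plot described above.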
21. Linear Discriminant Analysis (LDA) differs from PCA in that LDA is:
A. Unsupervised
B. Supervised
C. Based on kernel methods
D. Only applicable to regression
Correct Answer: Supervised
Explanation: LDA uses the class labels (target variable) to find a linear combination of features that maximizes class separability, whereas PCA ignores class labels and focuses only on total variance.
22. What is the main objective function of Linear Discriminant Analysis (LDA)?
A. Maximize within-class variance and minimize between-class variance.
B. Maximize total variance regardless of class.
C. Maximize between-class variance and minimize within-class variance.
D. Minimize the reconstruction error.
Correct Answer: Maximize between-class variance and minimize within-class variance.
Explanation: LDA aims to project data such that samples from the same class are close together (low within-class variance) and the centers of different classes are far apart (high between-class variance).
23. If you have a classification problem with C classes, what is the maximum number of linear discriminants (components) LDA can produce?
A. C
B. C + 1
C. The number of features
D. C − 1
Correct Answer: C − 1
Explanation: The number of non-zero eigenvalues in LDA is limited by the rank of the between-class scatter matrix, which is at most C − 1. Thus, you can project data into at most C − 1 dimensions.
24. Which of the following is a key assumption of Linear Discriminant Analysis?
A. Data is distributed uniformly.
B. Classes have identical covariance matrices and are normally distributed.
C. Features are strictly categorical.
D. The relationship between features and target is non-linear.
Correct Answer: Classes have identical covariance matrices and are normally distributed.
Explanation: LDA assumes that the data for each class comes from a Gaussian distribution with different means but the same covariance matrix (homoscedasticity).
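For the two-class case, the LDA objective from Questions 21-24 reduces to computing the Fisher direction w = S_w^{-1}(m1 − m0). A NumPy sketch on made-up Gaussian classes separated along the first axis:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian classes with identical covariance, separated along axis 0
X0 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
X1 = rng.normal(loc=[4.0, 0.0], size=(100, 2))

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter: sum of per-class scatter matrices
S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)

w = np.linalg.solve(S_w, m1 - m0)  # Fisher direction S_w^{-1} (m1 - m0)
w /= np.linalg.norm(w)
print(w)  # points almost exactly along the separating axis
```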
25. In the context of dimensionality reduction, what does Intrinsic Dimensionality mean?
A. The total number of features in the raw dataset.
B. The minimum number of parameters needed to account for the observed properties of the data.
C. The number of rows in the dataset.
D. The maximum possible dimensions a computer can handle.
Correct Answer: The minimum number of parameters needed to account for the observed properties of the data.
Explanation: Intrinsic dimensionality represents the true number of variables required to describe the data structure, which is often lower than the number of observed features due to correlations.
26. Which feature selection technique is generally known as Recursive Feature Elimination (RFE)?
A. A filter method based on correlation.
B. A wrapper method that recursively removes the weakest feature.
C. An embedded method using L1 regularization.
D. A dimensionality reduction technique similar to SVD.
Correct Answer: A wrapper method that recursively removes the weakest feature.
Explanation: RFE works by training the model, ranking features by importance (weights or coefficients), removing the least important features, and repeating the process until the desired number of features remains.
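A minimal sketch with scikit-learn's `RFE` wrapping a linear model (the synthetic signal on features 0 and 2 is an illustrative assumption):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2]   # features 0 and 2 carry the signal

rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # True for the two surviving features
print(rfe.ranking_)  # 1 = kept; larger values were eliminated earlier
```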
27. When creating Interaction Features (e.g., x1 · x2), what is the primary goal?
A. To remove outliers.
B. To capture the combined effect of two variables that affects the target differently than the sum of their individual effects.
C. To reduce the number of features.
D. To linearly separate the classes.
Correct Answer: To capture the combined effect of two variables that affects the target differently than the sum of their individual effects.
Explanation: Interaction features allow linear models to learn effects where the impact of one feature depends on the value of another feature.
28. Which formula represents Fisher's criterion used in LDA for a two-class problem?
A. J(w) = (s1^2 + s2^2) / (m1 − m2)^2
B. J(w) = (m1 − m2)^2 / (s1^2 + s2^2)
C. J(w) = (m1 + m2)^2 / (s1^2 − s2^2)
D. J(w) = m1 · m2 / (s1^2 · s2^2)
Correct Answer: J(w) = (m1 − m2)^2 / (s1^2 + s2^2)
Explanation: Fisher's Linear Discriminant maximizes the ratio of the squared difference between the projected class means (between-class variance) to the sum of the within-class variances.
29. Why might one choose Lasso Regression (L1 regularization) as an Embedded Feature Selection method?
A. It shrinks coefficients to exactly zero, effectively performing feature selection.
B. It shrinks coefficients towards zero but never reaches exactly zero.
C. It maximizes the correlation between features.
D. It creates orthogonal components.
Correct Answer: It shrinks coefficients to exactly zero, effectively performing feature selection.
Explanation: Lasso (L1) regularization adds a penalty equal to the absolute value of the coefficients. This geometry forces the coefficients of less important features to become exactly zero, thereby selecting a subset of features.
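The exact-zero behavior is easy to observe with scikit-learn's `Lasso` (the data, the coefficients 3 and −2, and `alpha=0.5` are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 influence the target; 2-4 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # noise features get coefficients of exactly 0.0
```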
30. In a dataset with high dimensionality, points tend to concentrate:
A. At the center (origin) of the space.
B. In the corners or shells of the hypercube.
C. Uniformly throughout the space.
D. Along a single axis.
Correct Answer: In the corners or shells of the hypercube.
Explanation: Due to the geometry of high-dimensional space, the volume of a hypersphere inscribed in a hypercube becomes negligible compared to the hypercube's volume. Most data points land in the 'corners', making them equidistant from the center.
31. When calculating the explained variance ratio in PCA, what does the eigenvalue λ_i represent?
A. The mean of the i-th feature.
B. The variance of the data along the i-th principal component.
C. The covariance between the i-th and j-th feature.
D. The reconstruction error of the i-th component.
Correct Answer: The variance of the data along the i-th principal component.
Explanation: The eigenvalue λ_i corresponding to an eigenvector (principal component) quantifies the amount of variance in the data captured along that specific direction.
32. If the first two principal components explain 95% of the variance, what does this imply?
A. The data is essentially 2-dimensional despite having more features.
B. The other 5% of variance contains the most important class information.
C. The model is overfitting.
D. You should discard the first two components.
Correct Answer: The data is essentially 2-dimensional despite having more features.
Explanation: High cumulative explained variance in the first few components suggests that the intrinsic dimensionality of the data is low, and most information can be represented in 2D with minimal loss.
33. What is a limitation of using Pearson Correlation for feature removal?
A. It is computationally very expensive.
B. It only detects linear relationships.
C. It cannot handle negative numbers.
D. It requires the target variable to be calculated.
Correct Answer: It only detects linear relationships.
Explanation: Pearson correlation measures linear dependence. It might miss strong non-linear relationships (e.g., quadratic), leading to the incorrect removal of valuable features.
34. In the context of creating features from datetime data, which of the following is NOT a typical extracted feature?
A. Day of the week
B. Hour of the day
C. Time elapsed since a specific event
D. The raw Unix timestamp treated as a categorical variable
Correct Answer: The raw Unix timestamp treated as a categorical variable
Explanation: Treating a raw timestamp as a category would create a unique category for every instant, resulting in massive cardinality and no generalization. Typical extraction involves cyclical components (hour, day) or durations.
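The typical extractions are one-liners with the pandas `.dt` accessor (the two timestamps below are made-up examples):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2024-01-01 09:30", "2024-01-06 22:15"]))
feats = pd.DataFrame({
    "dayofweek":  ts.dt.dayofweek,        # Monday = 0 ... Sunday = 6
    "hour":       ts.dt.hour,
    "is_weekend": ts.dt.dayofweek >= 5,
})
print(feats)
```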
35. When performing PCA, if the centered data matrix is X (n samples × d features), the covariance matrix is computed as:
A. X X^T / (n − 1)
B. X^T X / (n − 1)
C. X^{-1} X
D. X + X^T
Correct Answer: X^T X / (n − 1)
Explanation: The covariance matrix for a centered data matrix X (where rows are samples and columns are features) is proportional to X^T X, conventionally normalized as X^T X / (n − 1).
36. Which of the following is a disadvantage of PCA?
A. It is sensitive to the scale of features.
B. The resulting principal components are often hard to interpret physically.
C. It assumes linear relationships.
D. All of the above.
Correct Answer: All of the above.
Explanation: PCA requires scaling, produces linear combinations of features (losing original physical meaning), and only captures linear variance structures.
37. In Linear Discriminant Analysis, the 'Within-Class Scatter Matrix' S_W represents:
A. How far the class means are from the global mean.
B. The scatter (variance) of samples around their respective class means.
C. The correlation between features.
D. The noise in the dataset.
Correct Answer: The scatter (variance) of samples around their respective class means.
Explanation: S_W aggregates the covariance matrices of each individual class, representing how tightly grouped the samples are within their own classes.
38. Which technique would be most appropriate if you want to visualize a dataset with 50 features and 3 classes in a 2D plot while keeping the classes as distinct as possible?
A. Variance Thresholding
B. Principal Component Analysis (PCA)
C. Linear Discriminant Analysis (LDA)
D. Forward Selection
Correct Answer: Linear Discriminant Analysis (LDA)
Explanation: While PCA can visualize data in 2D, LDA is supervised and specifically optimizes for class separability, making it better for visualizing distinct classes.
39. What is the computational complexity of calculating the Covariance Matrix for a dataset with n samples and d features?
A. O(n · d)
B. O(n + d)
C. O(n · d^2)
D. O(d^3)
Correct Answer: O(n · d^2)
Explanation: Calculating the covariance between two features takes O(n). Since there are d^2 entries in the covariance matrix, the total complexity is O(n · d^2).
40. When creating Binning/Discretization features (e.g., turning Age into Age Groups), what is a potential benefit?
A. It increases the precision of the data.
B. It allows linear models to capture non-linear relationships.
C. It always increases the variance.
D. It eliminates the need for a target variable.
Correct Answer: It allows linear models to capture non-linear relationships.
Explanation: Binning transforms a continuous variable into categories. If the relationship is non-linear (e.g., U-shaped), a linear model can learn a different coefficient for each bin, approximating the non-linearity.
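With pandas, `cut` does the binning and `get_dummies` turns the bins into per-bin columns a linear model can weight independently (the bin edges and labels below are illustrative choices):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])
groups = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                labels=["child", "young", "middle", "senior"])
dummies = pd.get_dummies(groups)  # one indicator column per age group

print(list(groups.astype(str)))
print(dummies.shape)
```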
41. Forward selection is an example of a ____ algorithm.
A. Greedy
B. Dynamic Programming
C. Divide and Conquer
D. Backtracking
Correct Answer: Greedy
Explanation: Forward selection makes the locally optimal choice at each step (adding the best single feature) with the hope of finding a global optimum. It does not re-evaluate previous choices.
42. Why is the Curse of Dimensionality particularly problematic for k-Nearest Neighbors (k-NN)?
A. k-NN requires a training phase.
B. In high dimensions, all points become approximately equidistant from each other.
C. k-NN only works with binary features.
D. k-NN cannot handle negative values.
Correct Answer: In high dimensions, all points become approximately equidistant from each other.
Explanation: As dimensions increase, the ratio of the distance to the nearest neighbor vs. the farthest neighbor approaches 1, meaning 'nearest' neighbors are no longer statistically closer than random points.
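This distance concentration is easy to demonstrate empirically; the simulation below (sample sizes and dimensions are arbitrary illustrative choices) measures the nearest-to-farthest distance ratio from one query point:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(d, n=500):
    """Ratio of nearest to farthest distance from one query point
    among n uniform points in the d-dimensional unit cube."""
    X = rng.uniform(size=(n, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return dists.min() / dists.max()

low_d = nearest_to_farthest_ratio(2)
high_d = nearest_to_farthest_ratio(1000)
print(low_d, high_d)  # the ratio moves toward 1 as d grows
```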
43. In PCA, what is the geometric interpretation of an eigenvalue being zero?
A. The data has no noise.
B. The data lies perfectly on a lower-dimensional subspace (hyperplane).
C. The features are completely uncorrelated.
D. The mean of the data is zero.
Correct Answer: The data lies perfectly on a lower-dimensional subspace (hyperplane).
Explanation: If an eigenvalue is zero, the variance along that principal component is zero. This means the data is completely flat in that direction, indicating it resides in a lower-dimensional subspace.
44. Which of the following creates a new feature by calculating the ratio of two existing features (e.g., TotalPrice / Quantity)?
A. Polynomial Expansion
B. Domain-specific feature construction
C. One-Hot Encoding
D. Normalization
Correct Answer: Domain-specific feature construction
Explanation: Creating ratios (like Unit Price from Total Price and Quantity) is a manual feature construction technique often driven by domain knowledge.
45. If you perform PCA on a dataset where all features are completely uncorrelated and have equal variance, what will the Principal Components look like?
A. They will be the original axes (features) themselves.
B. They will be rotated by 45 degrees.
C. PCA will fail to find any components.
D. The eigenvalues will be negative.
Correct Answer: They will be the original axes (features) themselves.
Explanation: If features are uncorrelated, the covariance matrix is diagonal. The eigenvectors of a diagonal matrix are the standard basis vectors (the original axes).
46. What is a 'Constant Feature'?
A. A feature that increases constantly over time.
B. A feature that has the same value for all observations.
C. A feature that is constantly missing.
D. A feature with a variance of 1.
Correct Answer: A feature that has the same value for all observations.
Explanation: A constant feature has zero variance (all values are identical) and provides no discriminative information to a machine learning model.
47. In the context of LDA, what is a Singular matrix problem?
A. When the scatter matrix has a determinant of zero and cannot be inverted.
B. When the matrix has only one class.
C. When the matrix is square.
D. When the eigenvalues are all 1.
Correct Answer: When the scatter matrix has a determinant of zero and cannot be inverted.
Explanation: LDA requires inverting the within-class scatter matrix (S_W). If features are collinear or the number of samples is smaller than the number of features, S_W becomes singular (non-invertible).
48. What does a correlation of 0 between two features imply?
A. They are statistically independent.
B. There is no linear relationship between them.
C. One is a constant multiple of the other.
D. They are negatively related.
Correct Answer: There is no linear relationship between them.
Explanation: Zero Pearson correlation implies no linear relationship, but there could still be a strong non-linear relationship (e.g., y = x^2 over a symmetric interval).
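The classic counterexample from the explanation, checked numerically: y is fully determined by x, yet Pearson r is (numerically) zero over a symmetric interval.

```python
import numpy as np

x = np.linspace(-1, 1, 201)
y = x ** 2                     # perfect quadratic dependence on x
r = np.corrcoef(x, y)[0, 1]
print(r)                       # essentially 0: no *linear* relationship
```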
49. Which dimensionality reduction technique transforms variables into a set of linearly uncorrelated variables called principal components?
A. LDA
B. PCA
C. t-SNE
D. Forward Selection
Correct Answer: PCA
Explanation: This is the definition of Principal Component Analysis.
50. When creating lag features for time-series data (e.g., value at t − 1), what issue is introduced for the first few rows of the dataset?
A. Infinite values
B. Missing values (NaN)
C. Zero variance
D. Multicollinearity
Correct Answer: Missing values (NaN)
Explanation: If you shift a column down to create a lag, the top rows will not have a previous value to reference, resulting in Missing Values (NaN) that must be handled.
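In pandas this is exactly what `shift` produces; the tiny series below is a made-up example:

```python
import pandas as pd

s = pd.Series([10, 12, 15, 14], name="value")
df = pd.DataFrame({"value": s, "lag_1": s.shift(1)})  # value at t-1

print(df)                         # the first lag_1 entry is NaN
print(df["lag_1"].isna().sum())   # exactly one missing value to handle
```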