1. What is the primary motivation for using a Variance Threshold in feature selection?
A. To remove features that have a high correlation with the target variable.
B. To remove features that contain constant or quasi-constant values.
C. To increase the dimensionality of the dataset.
D. To normalize the data distribution.
Correct Answer: To remove features that contain constant or quasi-constant values.
Explanation: Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. Features with zero or very low variance (constant features) carry little to no information for distinguishing between samples.
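As a minimal sketch of this idea, scikit-learn's `VarianceThreshold` can drop a constant column from a toy matrix (the data here is made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the first column is constant (zero variance)
X = np.array([[0, 2, 1],
              [0, 1, 3],
              [0, 3, 2],
              [0, 2, 4]])

selector = VarianceThreshold(threshold=0.0)  # keep only features with variance > 0
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)         # the constant column is gone: (4, 2)
print(selector.get_support())  # boolean mask of kept features
```

For binary features, the threshold is often set using the Bernoulli variance formula from the next question, e.g. `threshold=0.8 * (1 - 0.8)` to drop features that are the same value in more than 80% of samples.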
2. If a Bernoulli random variable (binary feature) takes the value 1 with probability p and 0 with probability 1 − p, what is its variance?
A. p
B. p(1 − p)
C. p^2
D. 1 − p
Correct Answer: p(1 − p)
Explanation: The variance of a Bernoulli variable is calculated as Var = p(1 − p). This formula is often used to set default thresholds for binary feature selection.
3. Which of the following problems arises when two features in a dataset have a Pearson correlation coefficient close to 1 or -1?
A. Overfitting due to noise
B. Multicollinearity
C. Underfitting due to high bias
D. The curse of dimensionality
Correct Answer: Multicollinearity
Explanation: High correlation between independent features leads to multicollinearity, where one feature can be linearly predicted from the other. This can make the model parameters unstable and hard to interpret.
4. In Correlation-based Feature Selection, what is the standard strategy when two features are highly correlated?
A. Keep both features to maximize information.
B. Create a new feature by multiplying them.
C. Remove one of the features.
D. Apply PCA to both features immediately.
Correct Answer: Remove one of the features.
Explanation: Since highly correlated features provide redundant information, the standard strategy is to remove one of them to reduce dimensionality and computational cost without losing significant information.
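A common pandas recipe for this strategy (the threshold 0.95 and the column names are illustrative choices, not fixed rules) scans the upper triangle of the correlation matrix and drops one feature from each highly correlated pair:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=100),  # almost collinear with a
    "c": rng.normal(size=100),                      # independent
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

df_reduced = df.drop(columns=to_drop)
print(to_drop)            # ['b'] is redundant given 'a'
print(df_reduced.columns.tolist())
```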
5. Which of the following describes the Forward Selection algorithm?
A. Start with all features and remove the least significant one iteratively.
B. Start with no features and add the most significant one iteratively.
C. Select features randomly and evaluate performance.
D. Calculate the variance of all features and filter them simultaneously.
Correct Answer: Start with no features and add the most significant one iteratively.
Explanation: Forward Selection is a wrapper method that starts with an empty set of features and adds the single feature that improves the model performance the most in each iteration.
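A minimal sketch of forward selection using scikit-learn's `SequentialFeatureSelector` (the synthetic data and the choice of `LinearRegression` are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2]   # only features 0 and 2 carry signal

sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
)
sfs.fit(X, y)
print(sfs.get_support())  # True exactly for the two informative features
```

Setting `direction="backward"` turns the same tool into Backward Elimination, which is discussed in Question 7.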
6. What is a major disadvantage of Wrapper Methods (like Forward Selection and Backward Elimination) compared to Filter Methods?
A. They do not consider feature interactions.
B. They are computationally expensive because they train the model multiple times.
C. They always result in lower accuracy.
D. They cannot handle categorical data.
Correct Answer: They are computationally expensive because they train the model multiple times.
Explanation: Wrapper methods evaluate subsets of variables by training a model for each subset, making them computationally intensive, especially for datasets with a large number of features.
7. In Backward Elimination, what is the starting point of the algorithm?
A. An empty set of features.
B. A randomly selected subset of features.
C. The set containing all available features.
D. The single feature with the highest variance.
Correct Answer: The set containing all available features.
Explanation: Backward Elimination starts with the full model (all features included) and iteratively removes the least significant feature based on a specific metric (e.g., p-value or performance drop).
8. Tree-based feature importance (e.g., in Random Forest) is typically calculated based on:
A. The correlation of the feature with the target.
B. The magnitude of the regression coefficients.
C. The reduction in impurity (Gini or Entropy) contributed by the feature across all trees.
D. The variance of the feature independent of the target.
Correct Answer: The reduction in impurity (Gini or Entropy) contributed by the feature across all trees.
Explanation: In decision trees and ensembles like Random Forests, feature importance is derived from the total decrease in node impurity (weighted by the probability of reaching that node) brought by that feature.
9. What is the primary difference between Feature Selection and Feature Extraction?
A. Feature Selection creates new variables; Feature Extraction keeps original variables.
B. Feature Selection selects a subset of existing features; Feature Extraction transforms data into a new lower-dimensional space.
C. Feature Selection is unsupervised; Feature Extraction is always supervised.
D. There is no difference; the terms are interchangeable.
Correct Answer: Feature Selection selects a subset of existing features; Feature Extraction transforms data into a new lower-dimensional space.
Explanation: Selection keeps the original feature meanings but reduces the count. Extraction (like PCA) creates new features (components) that are combinations of the original ones.
10. Which of the following is an example of creating a Polynomial Feature?
A. Calculating the mean of a time series.
B. Creating a feature x1 · x2 from existing features x1 and x2.
C. Scaling a feature to the range [0, 1].
D. Encoding a categorical variable using One-Hot Encoding.
Correct Answer: Creating a feature x1 · x2 from existing features x1 and x2.
Explanation: Polynomial features involve creating interaction terms (like products x1 · x2) or power terms (x1^2) to capture non-linear relationships in linear models.
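Scikit-learn's `PolynomialFeatures` generates these terms automatically; a small sketch with one sample (x1 = 2, x2 = 3):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample: x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2, include_bias=False)
Xp = poly.fit_transform(X)

# Output columns: x1, x2, x1^2, x1*x2, x2^2
print(Xp)  # [[2. 3. 4. 6. 9.]]
```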
11. What are Aggregation Features typically used for?
A. To reduce the variance of a single image pixel.
B. To summarize information from multiple records related to a single entity (e.g., average transaction amount per user).
C. To split a dataset into training and testing sets.
D. To visualize high-dimensional data in 2D.
Correct Answer: To summarize information from multiple records related to a single entity (e.g., average transaction amount per user).
Explanation: Aggregation features are created by grouping data (usually by an ID or time window) and calculating statistics like mean, sum, count, min, or max to represent the group's behavior.
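With pandas, such features come from a group-by followed by aggregation (the transaction table below is a made-up example):

```python
import pandas as pd

tx = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount":  [10.0, 20.0, 5.0, 5.0, 20.0],
})

# One row per user, summarizing that user's transaction behavior
user_feats = tx.groupby("user_id")["amount"].agg(["mean", "sum", "count"])
print(user_feats)
```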
12. The Curse of Dimensionality refers to the phenomenon where:
A. Models become too simple as dimensions increase.
B. Data becomes sparse and distance metrics lose meaning as the number of features increases.
C. Computation time decreases as the number of features increases.
D. The correlation between features always decreases in high dimensions.
Correct Answer: Data becomes sparse and distance metrics lose meaning as the number of features increases.
Explanation: As dimensionality increases, the volume of the space increases exponentially, making data sparse. Euclidean distance between points tends to become uniform, making it hard for distance-based algorithms (like k-NN) to distinguish between near and far points.
13. In the context of the Curse of Dimensionality, what happens to the amount of data required to maintain statistical significance as dimensions increase?
A. It decreases linearly.
B. It remains constant.
C. It increases exponentially.
D. It increases logarithmically.
Correct Answer: It increases exponentially.
Explanation: To maintain the same density of data points in the feature space, the number of samples required grows exponentially with the number of dimensions.
14. Principal Component Analysis (PCA) is best described as a technique for:
A. Non-linear dimensionality reduction.
B. Supervised feature selection.
C. Linear unsupervised dimensionality reduction.
D. Clustering categorical data.
Correct Answer: Linear unsupervised dimensionality reduction.
Explanation: PCA is an unsupervised linear transformation technique that identifies the directions (principal components) of maximum variance in the data.
15. In PCA, the first Principal Component (PC1) is the direction that:
A. Maximizes the correlation with the target variable.
B. Minimizes the variance of the projected data.
C. Maximizes the variance of the projected data.
D. Is orthogonal to the direction of maximum variance.
Correct Answer: Maximizes the variance of the projected data.
Explanation: The objective of PCA is to find the axis (PC1) along which the data varies the most, thereby retaining the most information.
16. What is the relationship between the first Principal Component (PC1) and the second Principal Component (PC2)?
A. They are parallel to each other.
B. They are orthogonal (perpendicular) to each other.
C. They are highly correlated.
D. PC2 always captures more variance than PC1.
Correct Answer: They are orthogonal (perpendicular) to each other.
Explanation: Principal components are constructed to be mutually orthogonal (uncorrelated). PC2 is the direction of maximum variance subject to the constraint that it is orthogonal to PC1.
17. Before applying PCA, it is crucial to perform which preprocessing step?
A. One-Hot Encoding
B. Feature Scaling (Standardization)
C. Target Encoding
D. Upsampling
Correct Answer: Feature Scaling (Standardization)
Explanation: PCA seeks to maximize variance. If features are on different scales (e.g., meters vs. millimeters), the feature with the larger scale will dominate the variance calculation, leading to biased components. Standardization ensures all features contribute equally.
18. Mathematically, Principal Components are the ____ of the covariance matrix of the data.
A. Eigenvalues
B. Eigenvectors
C. Inverse
D. Determinant
Correct Answer: Eigenvectors
Explanation: The Principal Components are the eigenvectors of the covariance matrix, and their corresponding eigenvalues represent the magnitude of variance in those directions.
19. What does the Explained Variance Ratio of a Principal Component indicate?
A. The ratio of training error to testing error.
B. The percentage of the dataset's total variance captured by that component.
C. The correlation between that component and the target.
D. The ratio of the number of features to the number of samples.
Correct Answer: The percentage of the dataset's total variance captured by that component.
Explanation: The explained variance ratio tells us how much information (variance) is retained by projecting the data onto a specific principal component. It is calculated as λ_i / Σ_j λ_j, the component's eigenvalue divided by the sum of all eigenvalues.
20. Which plot is commonly used to determine the optimal number of Principal Components to retain?
A. Box plot
B. Scree plot
C. Scatter plot
D. Histogram
Correct Answer: Scree plot
Explanation: A Scree plot displays the eigenvalues (or explained variance) for each component. The 'elbow' point in the plot usually indicates the optimal number of components to keep.
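The PCA machinery from Questions 14-20 (centering, covariance matrix, eigendecomposition, explained variance ratio) can be sketched in a few lines of NumPy; the synthetic 2-D data is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)
# 200 samples in 2-D, stretched strongly along the first axis
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

Xc = X - X.mean(axis=0)                # centre the data
C = Xc.T @ Xc / (len(Xc) - 1)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # sort descending: PC1 first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_variance_ratio = eigvals / eigvals.sum()
print(explained_variance_ratio)        # PC1 dominates
print(eigvecs[:, 0] @ eigvecs[:, 1])   # orthogonality: ~0
```

Plotting `explained_variance_ratio` against the component index gives exactly the scree plot described above.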
21. Linear Discriminant Analysis (LDA) differs from PCA in that LDA is:
A. Unsupervised
B. Supervised
C. Based on kernel methods
D. Only applicable to regression
Correct Answer: Supervised
Explanation: LDA uses the class labels (target variable) to find a linear combination of features that maximizes class separability, whereas PCA ignores class labels and focuses only on total variance.
22. What is the main objective function of Linear Discriminant Analysis (LDA)?
A. Maximize within-class variance and minimize between-class variance.
B. Maximize total variance regardless of class.
C. Maximize between-class variance and minimize within-class variance.
D. Minimize the reconstruction error.
Correct Answer: Maximize between-class variance and minimize within-class variance.
Explanation: LDA aims to project data such that samples from the same class are close together (low within-class variance) and the centers of different classes are far apart (high between-class variance).
23. If you have a classification problem with C classes, what is the maximum number of linear discriminants (components) LDA can produce?
A. C
B. C + 1
C. The number of features
D. C − 1
Correct Answer: C − 1
Explanation: The number of non-zero eigenvalues in LDA is limited by the rank of the between-class scatter matrix, which is at most C − 1. Thus, you can project data into at most C − 1 dimensions.
24. Which of the following is a key assumption of Linear Discriminant Analysis?
A. Data is distributed uniformly.
B. Classes have identical covariance matrices and are normally distributed.
C. Features are strictly categorical.
D. The relationship between features and target is non-linear.
Correct Answer: Classes have identical covariance matrices and are normally distributed.
Explanation: LDA assumes that the data for each class comes from a Gaussian distribution with different means but the same covariance matrix (homoscedasticity).
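For the two-class case, the LDA objective from Questions 21-24 reduces to computing the Fisher direction w = S_w^{-1}(m1 − m0). A NumPy sketch on made-up Gaussian classes separated along the first axis:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian classes with identical covariance, separated along axis 0
X0 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
X1 = rng.normal(loc=[4.0, 0.0], size=(100, 2))

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter: sum of per-class scatter matrices
S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)

w = np.linalg.solve(S_w, m1 - m0)  # Fisher direction S_w^{-1} (m1 - m0)
w /= np.linalg.norm(w)
print(w)  # points almost exactly along the separating axis
```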
25. In the context of dimensionality reduction, what does Intrinsic Dimensionality mean?
A. The total number of features in the raw dataset.
B. The minimum number of parameters needed to account for the observed properties of the data.
C. The number of rows in the dataset.
D. The maximum possible dimensions a computer can handle.
Correct Answer: The minimum number of parameters needed to account for the observed properties of the data.
Explanation: Intrinsic dimensionality represents the true number of variables required to describe the data structure, which is often lower than the number of observed features due to correlations.
26. Which feature selection technique is generally known as Recursive Feature Elimination (RFE)?
A. A filter method based on correlation.
B. A wrapper method that recursively removes the weakest feature.
C. An embedded method using L1 regularization.
D. A dimensionality reduction technique similar to SVD.
Correct Answer: A wrapper method that recursively removes the weakest feature.
Explanation: RFE works by training the model, ranking features by importance (weights or coefficients), removing the least important features, and repeating the process until the desired number of features remains.
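A minimal sketch with scikit-learn's `RFE` wrapping a linear model (the synthetic signal on features 0 and 2 is an illustrative assumption):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2]   # features 0 and 2 carry the signal

rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # True for the two surviving features
print(rfe.ranking_)  # 1 = kept; larger values were eliminated earlier
```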
27. When creating Interaction Features (e.g., x1 · x2), what is the primary goal?
A. To remove outliers.
B. To capture the combined effect of two variables that affects the target differently than the sum of their individual effects.
C. To reduce the number of features.
D. To linearly separate the classes.
Correct Answer: To capture the combined effect of two variables that affects the target differently than the sum of their individual effects.
Explanation: Interaction features allow linear models to learn effects where the impact of one feature depends on the value of another feature.
28. Which formula represents Fisher's criterion used in LDA for a two-class problem?
A. J(w) = (s1^2 + s2^2) / (m1 − m2)^2
B. J(w) = (m1 − m2)^2 / (s1^2 + s2^2)
C. J(w) = (m1 + m2)^2 / (s1^2 − s2^2)
D. J(w) = m1 · m2 / (s1^2 · s2^2)
Correct Answer: J(w) = (m1 − m2)^2 / (s1^2 + s2^2)
Explanation: Fisher's Linear Discriminant maximizes the ratio of the squared difference between the projected class means (between-class variance) to the sum of the within-class variances.
29. Why might one choose Lasso Regression (L1 regularization) as an Embedded Feature Selection method?
A. It shrinks coefficients to exactly zero, effectively performing feature selection.
B. It shrinks coefficients towards zero but never reaches exactly zero.
C. It maximizes the correlation between features.
D. It creates orthogonal components.
Correct Answer: It shrinks coefficients to exactly zero, effectively performing feature selection.
Explanation: Lasso (L1) regularization adds a penalty equal to the absolute value of the coefficients. This geometry forces the coefficients of less important features to become exactly zero, thereby selecting a subset of features.
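The exact-zero behavior is easy to observe with scikit-learn's `Lasso` (the data, the coefficients 3 and −2, and `alpha=0.5` are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 influence the target; 2-4 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # noise features get coefficients of exactly 0.0
```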
30. In a dataset with high dimensionality, points tend to concentrate:
A. At the center (origin) of the space.
B. In the corners or shells of the hypercube.
C. Uniformly throughout the space.
D. Along a single axis.
Correct Answer: In the corners or shells of the hypercube.
Explanation: Due to the geometry of high-dimensional space, the volume of a hypersphere inscribed in a hypercube becomes negligible compared to the hypercube's volume. Most data points land in the 'corners', making them equidistant from the center.
31. When calculating the explained variance ratio in PCA, what does the eigenvalue λ_i represent?
A. The mean of the i-th feature.
B. The variance of the data along the i-th principal component.
C. The covariance between the i-th and j-th feature.
D. The reconstruction error of the i-th component.
Correct Answer: The variance of the data along the i-th principal component.
Explanation: The eigenvalue λ_i corresponding to an eigenvector (principal component) quantifies the amount of variance in the data captured along that specific direction.
32. If the first two principal components explain 95% of the variance, what does this imply?
A. The data is essentially 2-dimensional despite having more features.
B. The other 5% of variance contains the most important class information.
C. The model is overfitting.
D. You should discard the first two components.
Correct Answer: The data is essentially 2-dimensional despite having more features.
Explanation: High cumulative explained variance in the first few components suggests that the intrinsic dimensionality of the data is low, and most information can be represented in 2D with minimal loss.
33. What is a limitation of using Pearson Correlation for feature removal?
A. It is computationally very expensive.
B. It only detects linear relationships.
C. It cannot handle negative numbers.
D. It requires the target variable to be calculated.
Correct Answer: It only detects linear relationships.
Explanation: Pearson correlation measures linear dependence. It might miss strong non-linear relationships (e.g., quadratic), leading to the incorrect removal of valuable features.
34. In the context of creating features from datetime data, which of the following is NOT a typical extracted feature?
A. Day of the week
B. Hour of the day
C. Time elapsed since a specific event
D. The raw Unix timestamp treated as a categorical variable
Correct Answer: The raw Unix timestamp treated as a categorical variable
Explanation: Treating a raw timestamp as a category would create a unique category for every instant, resulting in massive cardinality and no generalization. Typical extraction involves cyclical components (hour, day) or durations.
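The typical extractions are one-liners with the pandas `.dt` accessor (the two timestamps below are made-up examples):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2024-01-01 09:30", "2024-01-06 22:15"]))
feats = pd.DataFrame({
    "dayofweek":  ts.dt.dayofweek,        # Monday = 0 ... Sunday = 6
    "hour":       ts.dt.hour,
    "is_weekend": ts.dt.dayofweek >= 5,
})
print(feats)
```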
35. When performing PCA, if the centered data matrix is X (n samples × d features), the covariance matrix is computed as:
A. X X^T / (n − 1)
B. X^T X / (n − 1)
C. X^{-1} X
D. X + X^T
Correct Answer: X^T X / (n − 1)
Explanation: The covariance matrix for a centered data matrix X (where rows are samples and columns are features) is proportional to X^T X, conventionally normalized as X^T X / (n − 1).
36. Which of the following is a disadvantage of PCA?
A. It is sensitive to the scale of features.
B. The resulting principal components are often hard to interpret physically.
C. It assumes linear relationships.
D. All of the above.
Correct Answer: All of the above.
Explanation: PCA requires scaling, produces linear combinations of features (losing original physical meaning), and only captures linear variance structures.
37. In Linear Discriminant Analysis, the 'Within-Class Scatter Matrix' S_W represents:
A. How far the class means are from the global mean.
B. The scatter (variance) of samples around their respective class means.
C. The correlation between features.
D. The noise in the dataset.
Correct Answer: The scatter (variance) of samples around their respective class means.
Explanation: S_W aggregates the covariance matrices of each individual class, representing how tightly grouped the samples are within their own classes.
38. Which technique would be most appropriate if you want to visualize a dataset with 50 features and 3 classes in a 2D plot while keeping the classes as distinct as possible?
A. Variance Thresholding
B. Principal Component Analysis (PCA)
C. Linear Discriminant Analysis (LDA)
D. Forward Selection
Correct Answer: Linear Discriminant Analysis (LDA)
Explanation: While PCA can visualize data in 2D, LDA is supervised and specifically optimizes for class separability, making it better for visualizing distinct classes.
39. What is the computational complexity of calculating the Covariance Matrix for a dataset with n samples and d features?
A. O(n · d)
B. O(n + d)
C. O(n · d^2)
D. O(d^3)
Correct Answer: O(n · d^2)
Explanation: Calculating the covariance between two features takes O(n). Since there are d^2 entries in the covariance matrix, the total complexity is O(n · d^2).
40. When creating Binning/Discretization features (e.g., turning Age into Age Groups), what is a potential benefit?
A. It increases the precision of the data.
B. It allows linear models to capture non-linear relationships.
C. It always increases the variance.
D. It eliminates the need for a target variable.
Correct Answer: It allows linear models to capture non-linear relationships.
Explanation: Binning transforms a continuous variable into categories. If the relationship is non-linear (e.g., U-shaped), a linear model can learn a different coefficient for each bin, approximating the non-linearity.
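With pandas, `cut` does the binning and `get_dummies` turns the bins into per-bin columns a linear model can weight independently (the bin edges and labels below are illustrative choices):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])
groups = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                labels=["child", "young", "middle", "senior"])
dummies = pd.get_dummies(groups)  # one indicator column per age group

print(list(groups.astype(str)))
print(dummies.shape)
```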
41. Forward selection is an example of a ____ algorithm.
A. Greedy
B. Dynamic Programming
C. Divide and Conquer
D. Backtracking
Correct Answer: Greedy
Explanation: Forward selection makes the locally optimal choice at each step (adding the best single feature) with the hope of finding a global optimum. It does not re-evaluate previous choices.
42. Why is the Curse of Dimensionality particularly problematic for k-Nearest Neighbors (k-NN)?
A. k-NN requires a training phase.
B. In high dimensions, all points become approximately equidistant from each other.
C. k-NN only works with binary features.
D. k-NN cannot handle negative values.
Correct Answer: In high dimensions, all points become approximately equidistant from each other.
Explanation: As dimensions increase, the ratio of the distance to the nearest neighbor vs. the farthest neighbor approaches 1, meaning 'nearest' neighbors are no longer statistically closer than random points.
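This distance concentration is easy to demonstrate empirically; the simulation below (sample sizes and dimensions are arbitrary illustrative choices) measures the nearest-to-farthest distance ratio from one query point:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(d, n=500):
    """Ratio of nearest to farthest distance from one query point
    among n uniform points in the d-dimensional unit cube."""
    X = rng.uniform(size=(n, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return dists.min() / dists.max()

low_d = nearest_to_farthest_ratio(2)
high_d = nearest_to_farthest_ratio(1000)
print(low_d, high_d)  # the ratio moves toward 1 as d grows
```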
43. In PCA, what is the geometric interpretation of an eigenvalue being zero?
A. The data has no noise.
B. The data lies perfectly on a lower-dimensional subspace (hyperplane).
C. The features are completely uncorrelated.
D. The mean of the data is zero.
Correct Answer: The data lies perfectly on a lower-dimensional subspace (hyperplane).
Explanation: If an eigenvalue is zero, the variance along that principal component is zero. This means the data is completely flat in that direction, indicating it resides in a lower-dimensional subspace.
44. Which of the following creates a new feature by calculating the ratio of two existing features (e.g., TotalPrice / Quantity)?
A. Polynomial Expansion
B. Domain-specific feature construction
C. One-Hot Encoding
D. Normalization
Correct Answer: Domain-specific feature construction
Explanation: Creating ratios (like Unit Price from Total Price and Quantity) is a manual feature construction technique often driven by domain knowledge.
45. If you perform PCA on a dataset where all features are completely uncorrelated and have equal variance, what will the Principal Components look like?
A. They will be the original axes (features) themselves.
B. They will be rotated by 45 degrees.
C. PCA will fail to find any components.
D. The eigenvalues will be negative.
Correct Answer: They will be the original axes (features) themselves.
Explanation: If features are uncorrelated, the covariance matrix is diagonal. The eigenvectors of a diagonal matrix are the standard basis vectors (the original axes).
46. What is a 'Constant Feature'?
A. A feature that increases constantly over time.
B. A feature that has the same value for all observations.
C. A feature that is constantly missing.
D. A feature with a variance of 1.
Correct Answer: A feature that has the same value for all observations.
Explanation: A constant feature has zero variance (all values are identical) and provides no discriminative information to a machine learning model.
47. In the context of LDA, what is a Singular matrix problem?
A. When the scatter matrix has a determinant of zero and cannot be inverted.
B. When the matrix has only one class.
C. When the matrix is square.
D. When the eigenvalues are all 1.
Correct Answer: When the scatter matrix has a determinant of zero and cannot be inverted.
Explanation: LDA requires inverting the within-class scatter matrix (S_W). If features are collinear or the number of samples is smaller than the number of features, S_W becomes singular (non-invertible).
48. What does a correlation of 0 between two features imply?
A. They are statistically independent.
B. There is no linear relationship between them.
C. One is a constant multiple of the other.
D. They are negatively related.
Correct Answer: There is no linear relationship between them.
Explanation: Zero Pearson correlation implies no linear relationship, but there could still be a strong non-linear relationship (e.g., y = x^2 over a symmetric interval).
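The classic counterexample from the explanation, checked numerically: y is fully determined by x, yet Pearson r is (numerically) zero over a symmetric interval.

```python
import numpy as np

x = np.linspace(-1, 1, 201)
y = x ** 2                     # perfect quadratic dependence on x
r = np.corrcoef(x, y)[0, 1]
print(r)                       # essentially 0: no *linear* relationship
```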
49. Which dimensionality reduction technique transforms variables into a set of linearly uncorrelated variables called principal components?
A. LDA
B. PCA
C. t-SNE
D. Forward Selection
Correct Answer: PCA
Explanation: This is the definition of Principal Component Analysis.
50. When creating lag features for time-series data (e.g., value at t − 1), what issue is introduced for the first few rows of the dataset?
A. Infinite values
B. Missing values (NaN)
C. Zero variance
D. Multicollinearity
Correct Answer: Missing values (NaN)
Explanation: If you shift a column down to create a lag, the top rows will not have a previous value to reference, resulting in Missing Values (NaN) that must be handled.
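In pandas this is exactly what `shift` produces; the tiny series below is a made-up example:

```python
import pandas as pd

s = pd.Series([10, 12, 15, 14], name="value")
df = pd.DataFrame({"value": s, "lag_1": s.shift(1)})  # value at t-1

print(df)                         # the first lag_1 entry is NaN
print(df["lag_1"].isna().sum())   # exactly one missing value to handle
```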