Unit 2 - Practice Quiz

CSE274 50 Questions

1 What is the primary motivation for using a Variance Threshold in feature selection?

A. To remove features that have a high correlation with the target variable.
B. To remove features that contain constant or quasi-constant values.
C. To increase the dimensionality of the dataset.
D. To normalize the data distribution.
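A quick scikit-learn sketch of the idea behind this question (the data is illustrative): `VarianceThreshold` with its default threshold of 0 drops constant columns.

```python
# Illustrative sketch: VarianceThreshold removes zero-variance (constant)
# columns; the data here is made up.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 5.0],
              [1.0, 7.0],
              [1.0, 9.0]])   # first column is constant

selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)       # only the non-constant column survives
```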

2 If a Bernoulli random variable (binary feature) takes the value 1 with probability p and 0 with probability 1 - p, what is its variance?

A. p
B. p(1 - p)
C. p^2
D. 1 - p
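For reference, the Bernoulli variance formula Var(X) = p(1 - p) can be checked numerically from the exact pmf; the value below corresponds to p = 0.3.

```python
# Worked check of Var(X) = p(1 - p) for a Bernoulli variable,
# using the exact pmf: E[X] = p and E[X^2] = p.
p = 0.3
mean = 1 * p + 0 * (1 - p)       # E[X] = p
second_moment = 1 ** 2 * p       # E[X^2] = p
variance = second_moment - mean ** 2
print(variance)                  # p - p^2 = p(1 - p) = 0.21, up to float rounding
```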

3 Which of the following problems arises when two features in a dataset have a Pearson correlation coefficient close to 1 or -1?

A. Overfitting due to noise
B. Multicollinearity
C. Underfitting due to high bias
D. The curse of dimensionality

4 In Correlation-based Feature Selection, what is the standard strategy when two features are highly correlated?

A. Keep both features to maximize information.
B. Create a new feature by multiplying them.
C. Remove one of the features.
D. Apply PCA to both features immediately.
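A common pandas recipe for this strategy (column names and the 0.9 cutoff are illustrative): inspect the upper triangle of the absolute correlation matrix and drop one column of each highly correlated pair.

```python
# Sketch of correlation-based filtering: drop one feature of any pair
# whose |Pearson r| exceeds a cutoff. Data and names are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.standard_normal(100)
df = pd.DataFrame({"a": a,
                   "b": 2 * a + 0.01 * rng.standard_normal(100),  # near-copy of a
                   "c": rng.standard_normal(100)})

corr = df.corr().abs()
# keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)   # 'b' is nearly a linear copy of 'a'
```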

5 Which of the following describes the Forward Selection algorithm?

A. Start with all features and remove the least significant one iteratively.
B. Start with no features and add the most significant one iteratively.
C. Select features randomly and evaluate performance.
D. Calculate the variance of all features and filter them simultaneously.
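A minimal greedy sketch of forward selection, assuming a least-squares R^2 as the scoring function (the helper name `r2_score_ls` and the data are hypothetical):

```python
# Forward selection sketch: start with no features and, at each step,
# add the feature that most improves a model score (here, OLS R^2).
import numpy as np

def r2_score_ls(X, y):
    # R^2 of an ordinary least-squares fit with an intercept
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = 3 * X[:, 1] + 0.5 * X[:, 3] + 0.1 * rng.standard_normal(200)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(2):   # greedily pick the 2 best features
    best = max(remaining, key=lambda j: r2_score_ls(X[:, selected + [j]], y))
    selected.append(best)
    remaining.remove(best)
print(selected)      # feature 1 first (strongest), then feature 3
```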

6 What is a major disadvantage of Wrapper Methods (like Forward Selection and Backward Elimination) compared to Filter Methods?

A. They do not consider feature interactions.
B. They are computationally expensive because they train the model multiple times.
C. They always result in lower accuracy.
D. They cannot handle categorical data.

7 In Backward Elimination, what is the starting point of the algorithm?

A. An empty set of features.
B. A randomly selected subset of features.
C. The set containing all available features.
D. The single feature with the highest variance.

8 Tree-based feature importance (e.g., in Random Forest) is typically calculated based on:

A. The correlation of the feature with the target.
B. The magnitude of the regression coefficients.
C. The reduction in impurity (Gini or Entropy) contributed by the feature across all trees.
D. The variance of the feature independent of the target.
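A short scikit-learn sketch on synthetic data: the `feature_importances_` attribute holds the normalized mean impurity reductions across all trees.

```python
# Impurity-based importances from a random forest; only feature 0
# determines the (synthetic) label, so it should dominate.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 3))
y = (X[:, 0] > 0).astype(int)      # label depends only on feature 0

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(forest.feature_importances_)  # sums to 1; feature 0 dominates
```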

9 What is the primary difference between Feature Selection and Feature Extraction?

A. Feature Selection creates new variables; Feature Extraction keeps original variables.
B. Feature Selection selects a subset of existing features; Feature Extraction transforms data into a new lower-dimensional space.
C. Feature Selection is unsupervised; Feature Extraction is always supervised.
D. There is no difference; the terms are interchangeable.

10 Which of the following is an example of creating a Polynomial Feature?

A. Calculating the mean of a time series.
B. Creating a feature x1 * x2 (or x1^2) from existing features x1 and x2.
C. Scaling a feature to the range [0, 1].
D. Encoding a categorical variable using One-Hot Encoding.

11 What are Aggregation Features typically used for?

A. To reduce the variance of a single image pixel.
B. To summarize information from multiple records related to a single entity (e.g., average transaction amount per user).
C. To split a dataset into training and testing sets.
D. To visualize high-dimensional data in 2D.

12 The Curse of Dimensionality refers to the phenomenon where:

A. Models become too simple as dimensions increase.
B. Data becomes sparse and distance metrics lose meaning as the number of features increases.
C. Computation time decreases as the number of features increases.
D. The correlation between features always decreases in high dimensions.

13 In the context of the Curse of Dimensionality, what happens to the amount of data required to maintain statistical significance as dimensions increase?

A. It decreases linearly.
B. It remains constant.
C. It increases exponentially.
D. It increases logarithmically.
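A back-of-envelope illustration of this exponential growth, assuming 10 bins per axis and at least one sample per grid cell:

```python
# The number of grid cells (and hence the minimum sample count for one
# observation per cell) grows as 10**d when each of d axes has 10 bins.
bins = 10
cells = {d: bins ** d for d in (1, 2, 5, 10)}
print(cells)   # {1: 10, 2: 100, 5: 100000, 10: 10000000000}
```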

14 Principal Component Analysis (PCA) is best described as a technique for:

A. Non-linear dimensionality reduction.
B. Supervised feature selection.
C. Linear unsupervised dimensionality reduction.
D. Clustering categorical data.

15 In PCA, the first Principal Component (PC1) is the direction that:

A. Maximizes the correlation with the target variable.
B. Minimizes the variance of the projected data.
C. Maximizes the variance of the projected data.
D. Is orthogonal to the direction of maximum variance.

16 What is the relationship between the first Principal Component (PC1) and the second Principal Component (PC2)?

A. They are parallel to each other.
B. They are orthogonal (perpendicular) to each other.
C. They are highly correlated.
D. PC2 always captures more variance than PC1.

17 Before applying PCA, it is crucial to perform which preprocessing step?

A. One-Hot Encoding
B. Feature Scaling (Standardization)
C. Target Encoding
D. Upsampling
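The usual pipeline order, sketched with scikit-learn on synthetic data: standardize first so that no feature dominates the covariance purely because of its units.

```python
# Standardize, then fit PCA. Without scaling, the large-scale columns
# would swamp the principal components. Data is illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) * [1, 10, 100, 1000, 10000]  # wildly different scales

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
pca = PCA(n_components=2).fit(X_scaled)
print(pca.explained_variance_ratio_)
```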

18 Mathematically, Principal Components are the ____ of the covariance matrix of the data.

A. Eigenvalues
B. Eigenvectors
C. Inverse
D. Determinant

19 What does the Explained Variance Ratio of a Principal Component indicate?

A. The ratio of training error to testing error.
B. The percentage of the dataset's total variance captured by that component.
C. The correlation between that component and the target.
D. The ratio of the number of features to the number of samples.

20 Which plot is commonly used to determine the optimal number of Principal Components to retain?

A. Box plot
B. Scree plot
C. Scatter plot
D. Histogram
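The numbers behind a scree plot can be read off `explained_variance_ratio_`; a common rule of thumb (illustrated here on synthetic rank-2 data) keeps enough components to reach, say, 95% cumulative variance.

```python
# Choosing the number of components from the explained-variance curve
# (the data behind a scree plot). Synthetic data with 2 latent signals.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.standard_normal((200, 2))
X = np.column_stack([base, base @ rng.standard_normal((2, 4)) * 0.9])

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep)   # small: the 6 columns come from only 2 latent signals
```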

21 Linear Discriminant Analysis (LDA) differs from PCA in that LDA is:

A. Unsupervised
B. Supervised
C. Based on kernel methods
D. Only applicable to regression

22 What is the main objective function of Linear Discriminant Analysis (LDA)?

A. Maximize within-class variance and minimize between-class variance.
B. Maximize total variance regardless of class.
C. Maximize between-class variance and minimize within-class variance.
D. Minimize the reconstruction error.

23 If you have a classification problem with C classes, what is the maximum number of linear discriminants (components) LDA can produce?

A. C
B. C - 1
C. The number of features
D. C + 1
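A scikit-learn sketch on synthetic data: LDA yields at most one fewer discriminant than the number of classes, so 3 classes and 10 features allow at most 2.

```python
# LDA caps the number of components at min(n_classes - 1, n_features):
# here 3 classes, 10 features -> 2 discriminants. Data is synthetic.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 10)) + np.repeat([0, 3, 6], 50)[:, None]  # 3 shifted clusters
y = np.repeat([0, 1, 2], 50)

X_lda = LinearDiscriminantAnalysis().fit_transform(X, y)
print(X_lda.shape)   # (150, 2): at most n_classes - 1 discriminants
```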

24 Which of the following is a key assumption of Linear Discriminant Analysis?

A. Data is distributed uniformly.
B. Classes have identical covariance matrices and are normally distributed.
C. Features are strictly categorical.
D. The relationship between features and target is non-linear.

25 In the context of dimensionality reduction, what does Intrinsic Dimensionality mean?

A. The total number of features in the raw dataset.
B. The minimum number of parameters needed to account for the observed properties of the data.
C. The number of rows in the dataset.
D. The maximum possible dimensions a computer can handle.

26 Which feature selection technique is generally known as Recursive Feature Elimination (RFE)?

A. A filter method based on correlation.
B. A wrapper method that recursively removes the weakest feature.
C. An embedded method using L1 regularization.
D. A dimensionality reduction technique similar to SVD.

27 When creating Interaction Features (e.g., x1 * x2), what is the primary goal?

A. To remove outliers.
B. To capture the combined effect of two variables that affects the target differently than the sum of their individual effects.
C. To reduce the number of features.
D. To linearly separate the classes.

28 Which formula represents the Fisher's criterion used in LDA for a two-class problem (with projected class means m1, m2 and projected scatters s1^2, s2^2)?

A. J(w) = (m1 - m2)^2 / (s1^2 + s2^2)
B. J(w) = (s1^2 + s2^2) / (m1 - m2)^2
C. J(w) = (m1 + m2)^2 / (s1^2 + s2^2)
D. J(w) = (m1 - m2) / (s1 + s2)

29 Why might one choose Lasso Regression (L1 regularization) as an Embedded Feature Selection method?

A. It shrinks coefficients to exactly zero, effectively performing feature selection.
B. It shrinks coefficients towards zero but never reaches exactly zero.
C. It maximizes the correlation between features.
D. It creates orthogonal components.
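A small synthetic demonstration with scikit-learn's `Lasso`: coefficients of uninformative features land at exactly 0.0, unlike Ridge, which only shrinks them toward zero.

```python
# L1 regularization performs selection by zeroing out coefficients of
# irrelevant features. Data is synthetic; only feature 0 is informative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = 10 * X[:, 0]                 # target depends only on feature 0

model = Lasso(alpha=2.0).fit(X, y)
print(model.coef_)               # feature 0 keeps a large weight; the
                                 # irrelevant ones are exactly 0.0
```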

30 In a dataset with high dimensionality, points tend to concentrate:

A. At the center (origin) of the space.
B. In the corners or shells of the hypercube.
C. Uniformly throughout the space.
D. Along a single axis.

31 When calculating the explained variance ratio in PCA, what does the i-th eigenvalue represent?

A. The mean of the i-th feature.
B. The variance of the data along the i-th principal component.
C. The covariance between the i-th and j-th features.
D. The reconstruction error of the i-th component.

32 If the first two principal components explain 95% of the variance, what does this imply?

A. The data is essentially 2-dimensional despite having more features.
B. The other 5% of variance contains the most important class information.
C. The model is overfitting.
D. You should discard the first two components.

33 What is a limitation of using Pearson Correlation for feature removal?

A. It is computationally very expensive.
B. It only detects linear relationships.
C. It cannot handle negative numbers.
D. It requires the target variable to be calculated.

34 In the context of creating features from datetime data, which of the following is NOT a typical extracted feature?

A. Day of the week
B. Hour of the day
C. Time elapsed since a specific event
D. The raw Unix timestamp treated as a categorical variable
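Typical datetime-derived features, sketched with pandas (the column names are hypothetical):

```python
# Extracting day-of-week, hour, and elapsed-time features from a
# datetime column with the pandas .dt accessor.
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2024-01-01 08:30",
                                         "2024-01-06 23:10"])})
df["day_of_week"] = df["ts"].dt.dayofweek               # Monday = 0
df["hour"] = df["ts"].dt.hour
df["days_since_start"] = (df["ts"] - df["ts"].min()).dt.days
print(df)
```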

35 When performing PCA, if the data matrix X is n x d (n samples, d features), the covariance matrix is computed as (assuming centered data):

A. (1 / (n - 1)) X^T X
B. (1 / (n - 1)) X X^T
C. X X^T
D. X^-1 X

36 Which of the following is a disadvantage of PCA?

A. It is sensitive to the scale of features.
B. The resulting principal components are often hard to interpret physically.
C. It assumes linear relationships.
D. All of the above.

37 In Linear Discriminant Analysis, the 'Within-Class Scatter Matrix' represents:

A. How far the class means are from the global mean.
B. The scatter (variance) of samples around their respective class means.
C. The correlation between features.
D. The noise in the dataset.

38 Which technique would be most appropriate if you want to visualize a dataset with 50 features and 3 classes in a 2D plot while keeping the classes as distinct as possible?

A. Variance Thresholding
B. Principal Component Analysis (PCA)
C. Linear Discriminant Analysis (LDA)
D. Forward Selection

39 What is the computational complexity of calculating the Covariance Matrix for a dataset with n samples and d features?

A. O(n d)
B. O(n d^2)
C. O(n^2 d)
D. O(d^3)

40 When creating Binning/Discretization features (e.g., turning Age into Age Groups), what is a potential benefit?

A. It increases the precision of the data.
B. It handles non-linear relationships using linear models.
C. It always increases the variance.
D. It eliminates the need for a target variable.

41 Forward selection is an example of a ____ algorithm.

A. Greedy
B. Dynamic Programming
C. Divide and Conquer
D. Backtracking

42 Why is the Curse of Dimensionality particularly problematic for k-Nearest Neighbors (k-NN)?

A. k-NN requires a training phase.
B. In high dimensions, all points become approximately equidistant from each other.
C. k-NN only works with binary features.
D. k-NN cannot handle negative values.
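A quick numerical illustration of this effect on uniform random points; the "relative contrast" (d_max - d_min) / d_min is one common way to quantify how distances concentrate.

```python
# Distance concentration: as dimensionality grows, the spread between a
# query point's nearest and farthest neighbours shrinks relative to the
# distances themselves, which undermines k-NN.
import numpy as np

rng = np.random.default_rng(0)
contrast = {}
for d in (2, 1000):
    points = rng.random((500, d))       # 500 uniform points in [0, 1]^d
    query = rng.random(d)
    dist = np.linalg.norm(points - query, axis=1)
    contrast[d] = (dist.max() - dist.min()) / dist.min()
print(contrast)   # the relative contrast collapses in high dimensions
```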

43 In PCA, what is the geometric interpretation of the eigenvalues being zero?

A. The data has no noise.
B. The data lies perfectly on a lower-dimensional subspace (hyperplane).
C. The features are completely uncorrelated.
D. The mean of the data is zero.

44 Which of the following creates a new feature by calculating the ratio of two existing features (e.g., TotalPrice / Quantity)?

A. Polynomial Expansion
B. Domain-specific feature construction
C. One-Hot Encoding
D. Normalization

45 If you perform PCA on a dataset where all features are completely uncorrelated and have equal variance, what will the Principal Components look like?

A. They will be the original axes (features) themselves.
B. They will be rotated by 45 degrees.
C. PCA will fail to find any components.
D. The eigenvalues will be negative.

46 What is a 'Constant Feature'?

A. A feature that increases constantly over time.
B. A feature that has the same value for all observations.
C. A feature that is constantly missing.
D. A feature with a variance of 1.

47 In the context of LDA, what is a Singular matrix problem?

A. When the scatter matrix has a determinant of zero and cannot be inverted.
B. When the matrix has only one class.
C. When the matrix is square.
D. When the eigenvalues are all 1.

48 What does a correlation of 0 between two features imply?

A. They are statistically independent.
B. There is no linear relationship between them.
C. One is a constant multiple of the other.
D. They are negatively related.

49 Which dimensionality reduction technique transforms variables into a set of linearly uncorrelated variables called principal components?

A. LDA
B. PCA
C. t-SNE
D. Forward Selection

50 When creating lag features for time-series data (e.g., the value at time t - 1), what issue is introduced for the first few rows of the dataset?

A. Infinite values
B. Missing values (NaN)
C. Zero variance
D. Multicollinearity
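A pandas sketch of the problem: `shift(1)` builds the lag column, and the first row has no predecessor, so it becomes NaN.

```python
# A one-step lag feature with pandas; shift(1) leaves NaN in the first
# row because there is no earlier observation to copy.
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 14.0], name="value")
df = pd.DataFrame({"value": s, "lag_1": s.shift(1)})
print(df)   # lag_1 is NaN in row 0
```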