Unit 2 - Subjective Questions
CSE274 • Practice Questions with Detailed Answers
What is the Variance Threshold method in feature selection, and why is it used?
Variance Threshold is a simple, unsupervised baseline approach to feature selection. It works on the principle that features with low variance likely contain little information needed for a model to distinguish between instances.
Key Characteristics:
- Method: It removes all features whose variance doesn't meet some threshold ($t$). By default, it removes all zero-variance features (features that have the same value in all samples).
- Formula: For Boolean features (Bernoulli random variables), variance is calculated as $\mathrm{Var}[X] = p(1 - p)$, where $p$ is the probability of the feature taking the value 1.
- Usage: It is usually the first step in a feature selection pipeline to eliminate constant or quasi-constant features before applying more complex algorithms.
- Limitation: It does not consider the relationship between features and the target variable.
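A minimal NumPy sketch of the idea (toy data; the threshold value is illustrative):

```python
import numpy as np

# Toy matrix: 4 samples x 3 features; the last feature is constant (zero variance).
X = np.array([[0.0, 2.0, 1.0],
              [1.0, 4.0, 1.0],
              [0.0, 6.0, 1.0],
              [1.0, 8.0, 1.0]])

threshold = 0.0                   # default behaviour: drop only zero-variance features
variances = X.var(axis=0)         # per-feature variance
keep = variances > threshold      # mask of features to retain
X_reduced = X[:, keep]

print(X_reduced.shape)  # (4, 2) — the constant column is gone
```

scikit-learn provides this directly as `sklearn.feature_selection.VarianceThreshold`.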
Explain the concept of Correlation-based Feature Selection. How does handling multicollinearity improve model performance?
Correlation-based Feature Selection involves evaluating the relationship between features (independent variables) and the target (dependent variable), as well as the relationships among the features themselves.
Process:
- Feature-Target Correlation: Select features that have a high correlation with the target variable, as they are strong predictors.
- Feature-Feature Correlation (Multicollinearity): Identify pairs of features that are highly correlated with each other (e.g., Pearson coefficient > 0.8).
Handling Multicollinearity:
- If two features convey almost the same information, one should be removed.
- Benefits: Removing collinear features reduces the complexity of the model, prevents overfitting, and ensures linear models (like Linear Regression) remain stable and interpretable. It reduces the redundancy in the dataset.
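A short pandas/NumPy sketch of the feature-feature step (synthetic data; the 0.8 cutoff follows the text):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, n)
salary = age * 1000 + rng.normal(0, 500, n)   # nearly collinear with age
debt = rng.normal(0, 1, n)                    # independent of both
df = pd.DataFrame({"age": age, "salary": salary, "debt": debt})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(to_drop)  # ['salary'] — one feature of the collinear pair is removed
```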
Distinguish between Forward Selection and Backward Elimination methods.
Both are wrapper methods for feature selection, but they traverse the feature space in opposite directions.
Forward Selection:
- Start: Starts with an empty model (no features).
- Process: Iteratively adds the feature that best improves the model performance.
- Stop: Stops when adding a new feature does not improve performance significantly.
- Pros: Computationally cheaper if the optimal subset is small.
Backward Elimination:
- Start: Starts with a model containing all available features.
- Process: Iteratively removes the least significant feature (the one whose removal hurts performance the least or improves it).
- Stop: Stops when removing a feature significantly degrades performance.
- Pros: Can capture interacting features better than forward selection.
- Cons: Computationally expensive for high-dimensional data.
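The forward-selection loop can be sketched with plain NumPy (synthetic data; the stopping tolerance 1e-3 is an assumed hyperparameter):

```python
import numpy as np

def r2(X, y):
    # Fit ordinary least squares (with intercept) and return R^2 on the same data.
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 4))
y = 3 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=n)  # only 0 and 2 matter

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    scores = {j: r2(X[:, selected + [j]], y) for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] - best_score < 1e-3:   # stop: no meaningful improvement
        break
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print(sorted(selected))  # [0, 2]
```

Backward elimination is the mirror image: start with all four features and repeatedly drop the one whose removal hurts the score least.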
How is Tree-based Feature Importance calculated? Give an example using Random Forest.
Tree-based algorithms (like Decision Trees, Random Forests, and Gradient Boosting) provide an intrinsic method to rank features based on how well they improve the purity of the node.
Mechanism:
- Gini Importance (or Mean Decrease in Impurity): Every time a node is split on variable $X_j$, the impurity criterion (Gini or Entropy) for the two descendant nodes is lower than that of the parent node. Summing the weighted impurity decreases over all nodes where $X_j$ is used, averaged over all trees in the forest, gives a fast measure of feature importance.
- Permutation Importance: Alternatively, values of a feature are randomly shuffled. If the feature is important, the model's error will increase significantly after shuffling.
Example: In a Random Forest attempting to predict housing prices, the feature 'Square Footage' might appear near the root of many trees, leading to large impurity decreases, thus receiving a high importance score.
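Permutation importance (the second mechanism above) can be sketched without any tree library — here a least-squares model stands in for the trained model:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 3))          # columns: e.g. [sqft, age, noise]
y = 5 * X[:, 0] + 1 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Fit a simple least-squares model as the "trained model".
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
base_mse = np.mean((y - X @ coef) ** 2)

importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])     # break feature j's link to y
    mse = np.mean((y - Xp @ coef) ** 2)
    importance.append(mse - base_mse)        # error increase = importance

ranking = np.argsort(importance)[::-1]
print(ranking)  # the strong predictor (feature 0) ranks first
```

With a Random Forest, the same idea applies: shuffle one column, re-score the forest, and rank features by the resulting error increase.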
Define Feature Extraction. How does it differ from Feature Selection?
Feature Extraction and Feature Selection are both dimensionality reduction techniques, but they approach the problem differently.
Feature Selection:
- Definition: Selecting a subset of the original features without changing them.
- Example: Choosing 'Age' and 'Salary' from a dataset of 10 columns and ignoring the rest.
- Advantage: Preserves the physical meaning of original variables.
Feature Extraction:
- Definition: Transforming the data from a high-dimensional space to a lower-dimensional space. The new features are linear or non-linear combinations of the original features.
- Example: Principal Component Analysis (PCA) creating 'Principal Component 1' which is a mix of 'Age', 'Salary', and 'Debt'.
- Advantage: Often compresses information better than selection, but the new features lose interpretability.
What are Aggregation Features? Provide examples of how they are created from transactional data.
Aggregation Features are new features created by summarizing historical or granular data, often used when transforming one-to-many relationships (like a user having multiple transactions) into a single row per entity.
Creation Process:
Data is grouped by a unique identifier (e.g., CustomerID) and statistical operations are applied to other columns.
Examples:
- Count: Total number of transactions per user.
- Sum/Mean: Total amount spent or Average transaction value.
- Min/Max: Minimum or Maximum purchase value.
- Variance: Variability in transaction amounts.
- Time-based: Days since the last transaction.
These features capture user behavior patterns that raw transactional rows cannot represent directly in standard ML models.
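A typical pandas `groupby` construction (toy transactions; column names are illustrative):

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [10.0, 30.0, 20.0, 5.0, 15.0],
})

# Collapse the one-to-many transaction table into one row per customer
features = tx.groupby("customer_id")["amount"].agg(
    tx_count="count", tx_sum="sum", tx_mean="mean", tx_max="max"
).reset_index()

print(features)
```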
Explain the Curse of Dimensionality and its impact on Machine Learning models.
The Curse of Dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of features).
Key Impacts:
- Data Sparsity: As dimensions increase, the volume of the space increases exponentially. The available data becomes sparse, making it difficult to find reliable patterns.
- Distance Concentration: In high dimensions, the distance between the nearest and farthest data points becomes negligible. Distance-based algorithms (like KNN or K-Means) fail because "everyone is far from everyone else".
- Overfitting: With more features than observations, models can easily learn noise rather than the signal, leading to poor generalization.
- Computational Cost: Training time and storage requirements increase significantly.
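The distance-concentration effect can be observed directly (uniform random points; the dimensions are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1000
contrasts = {}
for d in (2, 10, 1000):
    X = rng.uniform(size=(n, d))          # n random points in the unit hypercube
    q = rng.uniform(size=d)               # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # Relative gap between the farthest and nearest point shrinks as d grows
    contrasts[d] = (dist.max() - dist.min()) / dist.min()

print({d: round(c, 2) for d, c in contrasts.items()})
```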
Describe the step-by-step mathematical algorithm for Principal Component Analysis (PCA).
PCA is a linear transformation technique that yields the axes of maximum variance.
Steps:
- Standardization: Scale the data matrix $X$ (dimensions $n \times d$) so each feature has a mean of 0 and variance of 1.
- Covariance Matrix Computation: Calculate the covariance matrix $\Sigma$ to understand how variables vary together.
- Eigen Decomposition: Compute the eigenvectors ($v_i$) and eigenvalues ($\lambda_i$) of the covariance matrix.
- Eigenvectors represent the directions (Principal Components).
- Eigenvalues represent the magnitude of variance in those directions.
- Sort and Select: Sort eigenvalues in descending order. Choose the top $k$ eigenvectors corresponding to the largest eigenvalues to form a projection matrix $W$.
- Projection: Transform the original samples onto the new subspace: $Z = XW$.
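The steps above map directly onto NumPy (random correlated data for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Generate correlated 3-D data via a linear mixing matrix
X = rng.normal(size=(200, 3)) @ np.array([[3, 0, 0], [1, 1, 0], [0, 0, 0.1]])

# 1. Standardize
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix (features x features)
C = np.cov(Xs, rowvar=False)
# 3. Eigen decomposition (eigh: C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)
# 4. Sort descending, keep top k
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
W = eigvecs[:, :k]            # projection matrix (3 x 2)
# 5. Project
Z = Xs @ W

print(Z.shape)  # (200, 2)
```

Note that the projected components are decorrelated: the off-diagonal entries of the covariance of `Z` are (numerically) zero.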
What is the Explained Variance Ratio in PCA, and how is it used to select the number of components?
Explained Variance Ratio indicates the proportion of the dataset's variance that lies along the axis of each principal component.
Calculation:
If $\lambda_i$ is the eigenvalue for the $i$-th component, the explained variance ratio is $\lambda_i / \sum_{j=1}^{d} \lambda_j$.
Selection of Components ($k$):
- Cumulative Variance: Plot the cumulative sum of explained variance ratios.
- Threshold: Select $k$ such that the cumulative variance reaches a desired threshold (e.g., 95% or 99%).
- Scree Plot: Plot eigenvalues against component numbers. Look for the "elbow point" where the drop in variance levels off, indicating that subsequent components add little information.
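A small numeric example (hypothetical eigenvalues; the 95% threshold follows the text):

```python
import numpy as np

eigvals = np.array([4.0, 2.5, 1.0, 0.3, 0.2])   # hypothetical sorted eigenvalues
ratio = eigvals / eigvals.sum()                 # explained variance ratio per component
cumulative = np.cumsum(ratio)

k = int(np.searchsorted(cumulative, 0.95) + 1)  # smallest k reaching 95%
print(cumulative.round(3))
print(k)  # 4
```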
Compare and contrast Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
Both are linear transformation techniques used for dimensionality reduction, but they have different objectives.
| Feature | PCA (Principal Component Analysis) | LDA (Linear Discriminant Analysis) |
|---|---|---|
| Type | Unsupervised (ignores class labels). | Supervised (uses class labels). |
| Goal | Maximize the variance of the data. | Maximize the separation between multiple classes. |
| Focus | Preserves the global structure of data. | Preserves the discriminatory information. |
| Axes | Finds directions of maximum spread. | Finds directions that maximize the ratio of between-class variance to within-class variance. |
| Usage | General dimensionality reduction, noise reduction. | Pre-processing for classification tasks. |
Summary: PCA is about spread; LDA is about separation.
Derive the mathematical criterion (Fisher's Criterion) used in Linear Discriminant Analysis (LDA).
LDA seeks a projection vector $w$ that maximizes class separability.
1. Scatter Matrices:
- Within-class Scatter ($S_W$): Measures the spread of data within the same class.
- Between-class Scatter ($S_B$): Measures the separation between class means ($\mu_i$) and the global mean ($\mu$).
2. Fisher's Criterion:
We want to maximize the distance between the projected means (captured by $w^\top S_B w$) and minimize the projected within-class variance ($w^\top S_W w$). The objective function is:
$$J(w) = \frac{w^\top S_B w}{w^\top S_W w}$$
3. Solution:
To maximize $J(w)$, we solve the generalized eigenvalue problem:
$$S_B w = \lambda S_W w \quad\Longleftrightarrow\quad S_W^{-1} S_B w = \lambda w$$
The eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$ form the new axes.
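A NumPy sketch of the two-class case (synthetic Gaussians; the class means are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two 2-D Gaussian classes with different means
X0 = rng.normal([0, 0], 1.0, size=(100, 2))
X1 = rng.normal([4, 4], 1.0, size=(100, 2))
X = np.vstack([X0, X1])
mu0, mu1, mu = X0.mean(0), X1.mean(0), X.mean(0)

# Within-class scatter: sum of per-class scatter matrices
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
# Between-class scatter: class means around the global mean
Sb = 100 * np.outer(mu0 - mu, mu0 - mu) + 100 * np.outer(mu1 - mu, mu1 - mu)

# Generalized eigenproblem: Sw^{-1} Sb w = lambda w
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = eigvecs[:, np.argmax(eigvals.real)].real    # top discriminant direction

# The projected class means are well separated along w
sep = abs((mu1 - mu0) @ w)
print(sep)
```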
Discuss the strategies for creating new features from existing continuous and categorical variables.
Creating new features (Feature Construction) can significantly boost model power.
Strategies:
- Polynomial Features: Creating interaction terms ($x_1 x_2$) or power terms ($x^2$, $x^3$) to capture non-linear relationships in linear models.
- Binning/Discretization: Converting continuous variables (e.g., Age) into bins (e.g., Child, Adult, Senior) to handle outliers and non-linearities.
- Domain-Specific Ratios: Combining features based on logic. E.g., in real estate: Price per Square Foot = Price / Area.
- Date/Time Decomposition: Extracting Day, Month, Year, DayOfWeek, or Hour from a timestamp.
- One-Hot/Target Encoding: Converting categorical variables into numerical formats suitable for algorithms.
- Log/Box-Cox Transformations: Applying mathematical functions to normalize skewed distributions.
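Several of these strategies in one pandas sketch (all column names and bin edges are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [8, 35, 70],
    "price": [100.0, 250.0, 900.0],
    "sqft":  [400.0, 500.0, 1000.0],
    "ts":    pd.to_datetime(["2024-01-15", "2024-06-01", "2024-12-31"]),
})

df["age_sq"] = df["age"] ** 2                                  # polynomial term
df["age_bin"] = pd.cut(df["age"], [0, 18, 60, 120],
                       labels=["child", "adult", "senior"])    # binning
df["price_per_sqft"] = df["price"] / df["sqft"]                # domain ratio
df["month"] = df["ts"].dt.month                                # date decomposition
df["log_price"] = np.log1p(df["price"])                        # skew correction

print(df[["age_bin", "price_per_sqft", "month"]])
```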
What is the role of the Covariance Matrix in Dimensionality Reduction?
The Covariance Matrix is central to techniques like PCA.
Role:
- Capturing Relationships: It is a square symmetric matrix ($d \times d$) that expresses how every variable covaries with every other variable.
- Diagonal elements: Variance of individual features.
- Off-diagonal elements: Covariance between two different features.
- Geometry of Data: The covariance matrix defines the shape and orientation of the data cloud. Positive covariance indicates variables move together; negative indicates they move inversely.
- Basis for Transformation: In PCA, diagonalizing the covariance matrix (finding eigenvectors) rotates the axes to align with the directions where the data varies the most (decorrelating the features).
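A quick NumPy illustration (synthetic variables):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)   # moves with x (positive covariance)
z = rng.normal(size=500)                      # independent of both

C = np.cov(np.stack([x, y, z]))   # 3 x 3 symmetric matrix; rows are variables
print(C.round(2))
# Diagonal: variances; off-diagonal: covariances (C[0,1] is large, C[0,2] near 0).
```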
Explain the concept of Wrapper Methods in feature selection. What are their advantages and disadvantages?
Wrapper Methods evaluate a subset of features by actually training a machine learning model and measuring its performance (e.g., accuracy, RMSE).
Mechanism:
It treats feature selection as a search problem. Examples include Forward Selection, Backward Elimination, and Recursive Feature Elimination (RFE).
Advantages:
- Interaction: Considers the interaction between features.
- Accuracy: Usually results in the best performing feature subset for that specific model because it optimizes the specific metric.
Disadvantages:
- Computationally Expensive: Requires training the model multiple times (exponential complexity in worst case).
- Overfitting: High risk of overfitting to the training data, especially if the dataset is small.
Why is Dimensionality Reduction considered necessary before applying algorithms like K-Nearest Neighbors (KNN)?
Dimensionality reduction is critical for KNN due to the distance-based nature of the algorithm.
- Distance Calculation: KNN relies on calculating Euclidean (or Manhattan) distances between points.
- Curse of Dimensionality: As dimensions increase, points spread out. The distance between the nearest neighbor and the farthest neighbor converges, making the concept of "nearest" meaningless.
- Noise Reduction: High-dimensional data often contains irrelevant features (noise) that distort distance calculations.
- Computational Efficiency: Calculating distances in $\mathbb{R}^d$ for large $d$ is significantly slower than in a reduced space $\mathbb{R}^k$ with $k \ll d$.
Reducing dimensions compacts the signal and makes distance metrics more robust.
What are the limitations of Principal Component Analysis (PCA)?
While powerful, PCA has specific limitations:
- Linearity Assumption: PCA assumes that the principal components are linear combinations of original features. It fails to unfold non-linear manifolds (e.g., the Swiss Roll dataset). Techniques like t-SNE or Kernel PCA are needed for non-linear data.
- Variance Reliance: It assumes that high variance implies high information (signal) and low variance implies noise. This isn't always true; low variance features might be critical for class separation.
- Orthogonality: PCA forces the new features to be orthogonal (perpendicular). The underlying natural factors might not be strictly orthogonal.
- Interpretability: The resulting Principal Components are mathematical abstractions (e.g., $PC_1 = a_1 x_1 + a_2 x_2 + \dots$) which are hard to interpret in domain terms compared to original features.
Explain Embedded Methods for feature selection with an example.
Embedded Methods perform feature selection during the model training process itself. They combine the qualities of Filter and Wrapper methods.
Characteristics:
- The feature selection is built into the algorithm's objective function.
- They are more efficient than wrappers and more accurate than filters.
Example: LASSO Regression (L1 Regularization)
- LASSO adds a penalty term proportional to the absolute value of the coefficients: $\text{Loss} = \sum_i (y_i - \hat{y}_i)^2 + \alpha \sum_j |w_j|$.
- Selection Mechanism: This penalty forces the coefficients of less important features to shrink to exactly zero.
- Features with non-zero coefficients after training are the "selected" features.
- Other examples include Decision Trees and Random Forests (which select features at node splits).
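The selection mechanism can be sketched with a tiny coordinate-descent LASSO in NumPy (synthetic data; α = 0.3 is an assumed penalty strength):

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    """Minimal LASSO via coordinate descent (features assumed standardized)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]        # partial residual excluding j
            rho = X[:, j] @ r / n
            z = (X[:, j] ** 2).sum() / n
            # Soft-thresholding: small coefficients are driven to exactly zero
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0) / z
    return w

rng = np.random.default_rng(5)
n = 300
X = rng.normal(size=(n, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # features 2-4 irrelevant

w = lasso_cd(X, y, alpha=0.3)
selected = np.nonzero(np.abs(w) > 1e-8)[0]
print(selected)   # only the truly informative features survive
```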
How does Recursive Feature Elimination (RFE) work?
Recursive Feature Elimination (RFE) is a popular greedy optimization algorithm (a type of backward elimination wrapper).
Algorithm Steps:
- Train: Train the model on the full set of features.
- Rank: Compute feature importance (e.g., coefficients in linear regression or feature importance in trees).
- Prune: Identify the least important feature(s) and remove them from the set.
- Repeat: Re-train the model on the remaining features.
- Stop: Continue until the desired number of features remains.
RFE is effective because it repeatedly re-evaluates feature strength in the context of the current subset, capturing dependencies.
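A compact NumPy sketch of the loop, using |coefficient| of a least-squares fit as the importance score (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
X = rng.normal(size=(n, 5))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.3, size=n)  # only 0 and 3 matter

features = list(range(X.shape[1]))
n_keep = 2
while len(features) > n_keep:
    coef, *_ = np.linalg.lstsq(X[:, features], y, rcond=None)   # 1. train
    rank = np.abs(coef)                                         # 2. rank by |coefficient|
    weakest = features[int(np.argmin(rank))]                    # 3. least important feature
    features.remove(weakest)                                    # 4. prune, then repeat

print(sorted(features))  # [0, 3]
```

scikit-learn packages this pattern as `sklearn.feature_selection.RFE`.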
What is the distinction between Univariate and Multivariate Feature Selection?
Univariate Feature Selection:
- Approach: Evaluates each feature individually against the target variable.
- Methods: Chi-Square test, ANOVA F-value, Mutual Information.
- Pros: Fast, scalable.
- Cons: Ignores dependencies between features. A feature might be useless on its own but powerful when combined with another.
Multivariate Feature Selection:
- Approach: Evaluates subsets of features together.
- Methods: RFE, Forward Selection, LASSO.
- Pros: Captures feature interactions and redundancies.
- Cons: Computationally intensive and prone to overfitting on small datasets.
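A univariate scoring sketch in NumPy, using squared Pearson correlation as the per-feature score (synthetic data; k = 2 is an assumed number of features to keep):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 4))
y = 2 * X[:, 1] + rng.normal(scale=0.5, size=n)   # only feature 1 is informative

# Univariate score: each feature is judged against the target in isolation
scores = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1])])
k = 2
top_k = np.argsort(scores)[::-1][:k]

print(top_k)   # feature 1 ranks first
```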
In the context of LDA, explain the terms Within-class Scatter Matrix ($S_W$) and Between-class Scatter Matrix ($S_B$).
LDA relies on projecting data to a space where classes are well-separated. This is mathematically defined by two matrices:
1. Within-class Scatter Matrix ($S_W$):
- Definition: Represents the scatter (variance) of samples around their respective class means.
- Goal: We want to minimize this. We want samples of Class A to be tightly clustered together.
- Formula: Sum of the scatter matrices of the individual classes: $S_W = \sum_i \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^\top$.
2. Between-class Scatter Matrix ($S_B$):
- Definition: Represents the scatter of the class means around the overall global mean of the data: $S_B = \sum_i n_i (\mu_i - \mu)(\mu_i - \mu)^\top$.
- Goal: We want to maximize this. We want the center of Class A to be as far as possible from the center of Class B.
The optimal projection in LDA maximizes the ratio $\frac{w^\top S_B w}{w^\top S_W w}$.