Unit 1 - Subjective Questions
CSE274 • Practice Questions with Detailed Answers
Explain the four different levels of data measurement (scales of measurement) with examples.
The four levels of data measurement, often referred to as scales of measurement, are:
1. Nominal Scale:
- This is the lowest level of measurement.
- Data is categorized into distinct classes or labels without any inherent order.
- Examples: Gender (Male, Female), Colors (Red, Blue, Green), Zip Codes.
- Statistical operations: Mode, Frequency distribution.
2. Ordinal Scale:
- Data categories have a meaningful order or ranking, but the intervals between the ranks are not necessarily equal or known.
- Examples: Education Level (High School, Bachelor's, Master's), Customer Satisfaction (Unhappy, Neutral, Happy), Movie Ratings (1 to 5 stars).
- Statistical operations: Median, Percentile.
3. Interval Scale:
- Data has an order, and the difference between two values is meaningful and constant.
- It lacks a 'true zero' point (zero does not mean the absence of the attribute).
- Examples: Temperature in Celsius or Fahrenheit (0°C is not 'no temperature'), Calendar years.
- Statistical operations: Mean, Standard Deviation, Correlation.
4. Ratio Scale:
- This is the highest level of measurement.
- It possesses all properties of the interval scale but has a true zero point (0 implies none).
- Ratios between numbers are meaningful (e.g., 20kg is twice as heavy as 10kg).
- Examples: Height, Weight, Age, Income.
- Statistical operations: All statistical operations including Coefficient of Variation.
What is Data Pre-processing? Why is it considered a crucial step in the Machine Learning pipeline?
Data Pre-processing is the process of converting raw data into a clean, organized, and understandable format suitable for machine learning models. Real-world data is often incomplete, inconsistent, and noisy, and is likely to contain errors.
Why it is crucial:
- Improving Data Quality: It handles noise, missing values, and inconsistencies which can otherwise lead to misleading results.
- Algorithm Requirements: Many algorithms (like SVM or K-Means) require numerical input and scaled data to function correctly.
- Better Accuracy: Clean data leads to better training, which improves the model's predictive accuracy.
- Efficiency: Pre-processed data (e.g., after dimensionality reduction) often reduces the computational cost and training time.
- Model Convergence: Techniques like scaling help optimization algorithms (like Gradient Descent) converge faster.
Define Data Leakage. Describe the common causes of data leakage and how to prevent it.
Data Leakage occurs when information from outside the training dataset is used to create the model. The model effectively 'sees' information it will not have at prediction time, leading to overly optimistic performance during training but poor performance on real-world data.
Common Causes:
- Target Leakage: When predictors include data that will not be available at the time of prediction (e.g., including 'Approval_Status' variable when predicting 'Loan_Risk').
- Train-Test Contamination: When preprocessing (like scaling or missing value imputation) is applied to the entire dataset before splitting into train and test sets.
Prevention:
- Split First: Always split data into training and testing sets before any preprocessing.
- Pipelines: Use pipeline structures (e.g., Scikit-Learn Pipelines) to ensure transformations are fit only on the training set and applied to the test set.
- Time-Series Validation: For time-dependent data, use time-based splitting (train on past, test on future) rather than random shuffling.
- Feature Review: Carefully analyze features to ensure they are causally prior to the target variable.
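The split-first and pipeline rules above can be sketched with scikit-learn. This is a minimal illustration on a hypothetical toy dataset (generated with make_classification); the key point is that the scaler is fit only on the training fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; any tabular dataset works the same way.
X, y = make_classification(n_samples=200, random_state=0)

# Split FIRST, before any statistic is computed from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Inside the pipeline the scaler is fit on the training fold only;
# the test fold is transformed with the training statistics.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```

Calling pipe.fit on raw training data guarantees that no test-set statistic ever reaches the model.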
Compare and contrast Standardization and Normalization (Min-Max Scaling). Provide the mathematical formula for each.
Both are feature scaling techniques used to bring features to a similar scale.
1. Standardization (Z-Score Normalization):
- Definition: Rescales data to have a mean (μ) of 0 and a standard deviation (σ) of 1.
- Formula: z = (x - μ) / σ
- When to use: Preferred when algorithms assume a Gaussian distribution (e.g., Logistic Regression, Linear Discriminant Analysis) or when outliers are present (as it is less sensitive to outliers than Min-Max).
- Range: Unbounded (though mostly between -3 and 3).
2. Normalization (Min-Max Scaling):
- Definition: Rescales the data to a fixed range, typically [0, 1].
- Formula: x' = (x - x_min) / (x_max - x_min)
- When to use: Useful when the data does not follow a Gaussian distribution or for algorithms that compute distances (e.g., K-Nearest Neighbors, Neural Networks). It preserves the shape of the original distribution.
- Range: Strictly [0, 1].
Key Difference: Standardization changes the distribution to a normal shape (if the original was roughly normal) and is centered at 0. Normalization squashes the data into a strict range.
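The two formulas above can be checked side by side with a small numpy sketch on hypothetical values:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: z = (x - mu) / sigma  ->  mean 0, std 1, unbounded
z = (x - x.mean()) / x.std()

# Min-Max normalization: (x - min) / (max - min)  ->  strictly [0, 1]
m = (x - x.min()) / (x.max() - x.min())
```

After the transforms, z has mean 0 and standard deviation 1, while m is bounded in [0, 1].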
Explain the concept of Missing Values in a dataset. Discuss the three mechanisms of missing data (MCAR, MAR, MNAR).
Missing values occur when no data value is stored for the variable in an observation. Understanding why data is missing is crucial for choosing the right handling method.
Mechanisms of Missing Data:
1. Missing Completely at Random (MCAR):
- The probability of being missing is the same for all observations.
- There is no relationship between the missing data and any other values (observed or missing).
- Example: A weighing scale runs out of battery randomly.
- Handling: Deletion is often acceptable here without introducing bias.
2. Missing at Random (MAR):
- The probability of being missing is related to the observed data but not the missing data itself.
- Example: Men might be less likely to report their 'depression score', but this depends on 'Gender' (which is observed), not on how depressed they actually are.
- Handling: Imputation based on other features is preferred.
3. Missing Not at Random (MNAR):
- The missingness depends on the value of the missing data itself.
- Example: People with very high incomes may refuse to disclose their income.
- Handling: Requires modeling the missingness mechanism; simple imputation may bias the model.
Describe different strategies to handle missing values, specifically focusing on Deletion and Imputation methods.
1. Deletion Strategies:
- Listwise Deletion (Complete Case Analysis): Removing entire rows that contain any missing value. Simple but causes data loss and potential bias if data isn't MCAR.
- Pairwise Deletion: Uses all available data for specific analyses (e.g., correlation) even if rows have missing values elsewhere.
- Dropping Columns: Removing an entire feature if it has a very high percentage (e.g., >60%) of missing values.
2. Imputation Strategies:
- Simple Imputation: Replacing missing values with a statistic.
- Mean: For continuous data (sensitive to outliers).
- Median: For continuous data (robust to outliers).
- Mode: For categorical data.
- K-Nearest Neighbors (KNN) Imputation: Finds 'k' samples in the dataset that are most similar to the sample with missing data and imputes the value based on the average/mode of neighbors.
- Predictive Imputation: Treating the missing column as a target variable and training a regression or classification model on known data to predict the missing values.
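The imputation strategies above can be sketched with scikit-learn's imputers on a hypothetical array with missing entries:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])

# Median imputation: robust to outliers in the column.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: fill from the 2 most similar rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```

The missing value in column 0 becomes 4.0 (the median of 1, 7, 4); KNNImputer instead averages the nearest neighbors' values.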
What are Outliers? Explain how the IQR (Interquartile Range) method is used to detect and handle outliers.
Outliers are data points that differ significantly from other observations. They may be due to variability in the measurement or experimental errors.
IQR Method for Detection:
The Interquartile Range (IQR) measures statistical dispersion.
- Calculate the first quartile (Q1, 25th percentile) and third quartile (Q3, 75th percentile).
- Calculate IQR = Q3 - Q1.
- Define bounds:
- Lower Bound: Q1 - 1.5 × IQR
- Upper Bound: Q3 + 1.5 × IQR
- Any data point outside these bounds is considered an outlier.
Handling:
- Trimming: Remove the data points outside the bounds.
- Capping (Winsorization): Replace values below the lower bound with the lower bound value, and values above the upper bound with the upper bound value.
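The IQR detection and capping steps above can be sketched with numpy on hypothetical data:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 14, -40, 13], dtype=float)

# Quartiles and the IQR fence.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Detection: points outside the fence.
outliers = data[(data < lower) | (data > upper)]

# Handling by capping (Winsorization): clip to the bounds.
capped = np.clip(data, lower, upper)
```

Here Q1 = 11 and Q3 = 13, so the fence is [8, 16]: the values 95 and -40 are flagged and then clipped to 16 and 8.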
Differentiate between Label Encoding and One-Hot Encoding. When should each be used?
1. Label Encoding:
- Concept: Converts each category into a specific integer (e.g., Low=0, Medium=1, High=2).
- Pros: Does not add new columns; computationally efficient.
- Cons: Introduces an artificial order (e.g., 0 < 1 < 2). If the data is nominal (e.g., Red, Blue, Green), the model might misinterpret 'Green' as being 'greater' than 'Red'.
- Usage: Best for Ordinal categorical data (where order matters) or tree-based models which can handle numeric categories well.
2. One-Hot Encoding:
- Concept: Creates a new binary column (dummy variable) for each category. (e.g., Color_Red, Color_Blue).
- Pros: Does not assume any order; fair representation of nominal data.
- Cons: Increases dimensionality significantly (Curse of Dimensionality) if cardinality is high.
- Usage: Best for Nominal categorical data with low-to-medium cardinality.
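Both encodings can be sketched with pandas on a hypothetical frame holding one ordinal and one nominal column:

```python
import pandas as pd

df = pd.DataFrame({"size": ["Low", "High", "Medium", "Low"],    # ordinal
                   "color": ["Red", "Blue", "Green", "Red"]})   # nominal

# Label encoding for the ordinal feature: an explicit map preserves the order.
order = {"Low": 0, "Medium": 1, "High": 2}
df["size_enc"] = df["size"].map(order)

# One-hot encoding for the nominal feature: one binary column per category.
onehot = pd.get_dummies(df["color"], prefix="color")
```

The one-hot frame has three new columns (color_Blue, color_Green, color_Red), one per category.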
What is the Class Imbalance problem? List three metrics that should be used instead of 'Accuracy' when dealing with imbalanced datasets.
Class Imbalance refers to a scenario in a classification problem where the number of observations in each class is not equally distributed. For example, in fraud detection, 99% of transactions might be legitimate and only 1% fraudulent. A model predicting 'Legitimate' for every case would have 99% accuracy but is useless.
Metrics to use instead of Accuracy:
- Precision: Out of all instances predicted as positive, how many are actually positive?
- Recall (Sensitivity): Out of all actual positive instances, how many did the model correctly identify?
- F1-Score: The harmonic mean of Precision and Recall. It provides a balance between the two.
- AUC-ROC: Area Under the Receiver Operating Characteristic Curve.
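The accuracy trap described above can be reproduced with scikit-learn's metrics on a hypothetical imbalanced label set:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 1 = fraud (rare). A model that predicts "legitimate" (0) everywhere
# looks great on accuracy but never finds a single fraud case.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)                      # misleadingly high
prec = precision_score(y_true, y_pred, zero_division=0)   # no positives predicted
rec = recall_score(y_true, y_pred, zero_division=0)       # 0 of 5 frauds found
f1 = f1_score(y_true, y_pred, zero_division=0)
```

Accuracy is 0.95 while precision, recall, and F1 are all 0.0, exposing the useless classifier.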
Explain SMOTE (Synthetic Minority Over-sampling Technique). How does it differ from simple random oversampling?
SMOTE is an advanced oversampling technique used to address class imbalance.
How it works:
Instead of duplicating existing minority class samples (which causes overfitting), SMOTE synthesizes new examples.
- It selects a minority sample x_i.
- It finds the k nearest neighbors of x_i belonging to the same class.
- It selects a random neighbor x_j.
- It creates a new point anywhere on the line segment connecting x_i and x_j.
- Mathematically: x_new = x_i + λ × (x_j - x_i), where λ is a random number in [0, 1].
Difference from Random Oversampling:
- Random Oversampling: Simply duplicates existing minority samples. This creates exact copies, leading to a high risk of overfitting because the model memorizes specific points.
- SMOTE: Generates plausible new data points that lie in the feature space of the minority class, helping the model generalize better.
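The interpolation step at SMOTE's core can be sketched by hand with numpy (a simplified one-point version on hypothetical minority samples; production code would use the imbalanced-learn library):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_point(X_min, i, k=3):
    """Create one synthetic sample from the minority-class matrix X_min."""
    x = X_min[i]
    d = np.linalg.norm(X_min - x, axis=1)   # distance to every minority sample
    neighbors = np.argsort(d)[1:k + 1]      # k nearest, skipping the point itself
    xj = X_min[rng.choice(neighbors)]       # pick one neighbor at random
    lam = rng.random()                      # random lambda in [0, 1)
    return x + lam * (xj - x)               # point on the connecting segment

X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0], [2.5, 2.5]])
synthetic = smote_point(X_min, 0)
```

The synthetic point is a convex combination of two real minority samples, so it always lands inside the minority region rather than duplicating an existing point.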
What is Discretization (Binning)? Explain Equal-Width Binning and Equal-Frequency Binning.
Discretization (or Binning) is the process of converting continuous variables into discrete categorical 'bins' or intervals. It is used to handle outliers, improve value spread, or fit models that require categorical input.
1. Equal-Width Binning:
- Divides the range of data into intervals of equal size.
- Formula: Width = (Max - Min) / N, where N is the number of bins.
- Pros: Preserves the data distribution shape.
- Cons: Sensitive to outliers (outliers can squash the rest of the data into a single bin).
2. Equal-Frequency (Quantile) Binning:
- Divides the data into groups where each group contains approximately the same number of observations.
- Pros: Handles outliers well; creates a uniform distribution.
- Cons: Can disrupt the natural relationship/distance between values.
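The two binning schemes, including the outlier squashing noted above, can be compared with pandas on hypothetical values:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])  # 100 is an outlier

# Equal-width: the outlier stretches the range, so almost all points
# end up squashed into the first bin.
width_bins = pd.cut(values, bins=3)

# Equal-frequency: each bin holds roughly the same number of points.
freq_bins = pd.qcut(values, q=3)
```

With equal width the bin counts come out as 8, 0, 1; with equal frequency they are 3, 3, 3.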
Why do distance-based algorithms like K-Means and KNN require feature scaling? Explain with an example.
Distance-based algorithms calculate the distance (typically Euclidean) between data points to determine similarity. If features have different scales, the feature with the larger magnitude will dominate the distance calculation.
Example:
Consider a dataset with two features:
- Age: Range [20, 60]
- Salary: Range [30,000, 100,000]
If we calculate the Euclidean distance between two persons:
A difference of 10 years in Age contributes 10² = 100 to the squared sum.
A difference of 1,000 in Salary contributes 1,000² = 1,000,000.
The Salary feature completely overpowers the Age feature solely because of its magnitude, not its importance. Scaling brings both features to a comparable range (e.g., 0 to 1), allowing the algorithm to weight them equally.
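The Age/Salary example above can be verified numerically with numpy (two hypothetical people, min-max scaled with the stated feature ranges):

```python
import numpy as np

# Two people as (age, salary) vectors.
a = np.array([25.0, 50_000.0])
b = np.array([35.0, 51_000.0])

raw_dist = np.linalg.norm(a - b)  # dominated by the salary term

# Min-max scale each feature using the ranges Age [20, 60], Salary [30k, 100k].
lo = np.array([20.0, 30_000.0])
hi = np.array([60.0, 100_000.0])
a_scaled = (a - lo) / (hi - lo)
b_scaled = (b - lo) / (hi - lo)
scaled_dist = np.linalg.norm(a_scaled - b_scaled)
```

The raw distance is roughly 1000 (almost entirely salary), while after scaling the 10-year age gap dominates the 1000-unit salary gap, as it arguably should.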
What is Target Encoding (Mean Encoding)? What is the risk associated with it, and how can it be mitigated?
Target Encoding involves replacing a categorical value with the mean of the target variable for that category.
Example: If predicting 'Default' (0 or 1) and the category is 'City=NY', replace 'NY' with the average default rate of all people in NY.
Risk: Overfitting / Data Leakage.
If a category has very few samples (e.g., only 1 person in 'City=SmallTown' who defaulted), the encoding becomes exactly 1. The model memorizes this perfectly, leading to poor generalization on test data.
Mitigation:
- Smoothing: Weigh the category mean with the global mean: Encoded = (n × Mean_category + m × Mean_global) / (n + m), where n is the number of samples in the category and m is a smoothing parameter.
- Cross-Validation: Compute encodings inside cross-validation folds (calculate means on the training fold, map to validation fold) to prevent leakage.
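The smoothing formula can be sketched with pandas on a hypothetical frame (m = 10 is an arbitrary smoothing strength chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "NY", "NY", "SmallTown"],
                   "default": [1, 0, 1, 0, 1]})

global_mean = df["default"].mean()  # 0.6
stats = df.groupby("city")["default"].agg(["mean", "count"])

m = 10  # smoothing strength; a hypothetical choice
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
```

SmallTown's raw mean is a perfectly memorizable 1.0 (one sample), but the smoothed encoding pulls it back toward the global mean, to about 0.636.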
Describe the Z-Score method for outlier detection. What are its limitations?
Z-Score Method:
This method assumes the data follows a Gaussian (Normal) distribution. The Z-score tells us how many standard deviations a data point is away from the mean.
Formula: Z = (x - μ) / σ
Detection:
Typically, if the absolute Z-score is greater than a threshold (usually 3), the point is considered an outlier, since 99.7% of data in a normal distribution lies within three standard deviations of the mean.
Limitations:
- Mean and SD sensitivity: The mean and standard deviation themselves are sensitive to outliers. A massive outlier can shift the mean and inflate the SD, masking other outliers.
- Assumption of Normality: It only works reliably if the data is normally distributed. For skewed distributions, it is inaccurate.
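A small numpy sketch on hypothetical data shows the masking limitation in action: the extreme value inflates the mean and standard deviation enough that its own Z-score stays below 3.

```python
import numpy as np

data = np.array([10, 11, 12, 11, 10, 12, 11, 200], dtype=float)

# Z = (x - mean) / std for every point.
z = (data - data.mean()) / data.std()

# The value 200 drags the mean up to ~34.6 and inflates the std to ~62.5,
# so even this obvious outlier scores |z| of only about 2.65.
flagged = data[np.abs(z) > 3]
```

Nothing is flagged at the usual threshold of 3, even though 200 is visibly anomalous; a robust method (IQR, median-based) would catch it.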
Explain the Curse of Dimensionality in the context of data pre-processing. How does handling categorical data relate to this?
The Curse of Dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces. As the number of features (dimensions) increases, the volume of the space increases so fast that the available data becomes sparse.
Consequences:
- Distance becomes meaningless (all points are roughly equidistant).
- Risk of overfitting increases dramatically.
- Computational complexity explodes.
Relation to Categorical Data:
Techniques like One-Hot Encoding increase dimensionality. If a categorical feature has high cardinality (e.g., 'Zip Code' with 10,000 unique values), One-Hot Encoding adds 10,000 new sparse features. This triggers the curse of dimensionality. Therefore, pre-processing steps like dimensionality reduction (PCA) or using alternative encodings (Target Encoding, Embedding) are necessary.
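The dimensionality blow-up can be demonstrated with pandas on a hypothetical high-cardinality code column:

```python
import pandas as pd

# A hypothetical high-cardinality feature: 1,000 rows over 500 distinct codes.
codes = pd.Series([f"Z{i % 500}" for i in range(1000)], name="zip_code")

onehot = pd.get_dummies(codes)
# One sparse binary column per distinct code: 500 new dimensions
# from a single original feature.
```

A single column becomes a 1000 × 500 matrix in which each row is almost entirely zeros, exactly the sparsity the curse of dimensionality warns about.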
Discuss Transformation techniques used to handle skewed data, specifically Log Transformation and Box-Cox Transformation.
Many machine learning algorithms (like Linear Regression) assume that residuals are normally distributed. If input data is highly skewed, these assumptions fail.
1. Log Transformation:
- Applying log(x) or log(x + 1) to the data.
- Usage: Effective for right-skewed data (e.g., Income distribution). Compresses large values and expands small values.
- Constraint: Inputs must be positive.
2. Box-Cox Transformation:
- A parameterized transformation that searches for the best power exponent (λ) to transform the data toward a normal distribution.
- Formula: y(λ) = (y^λ - 1) / λ for λ ≠ 0, and y(λ) = log(y) for λ = 0.
- Usage: More flexible than the log transform, as it finds the optimal λ.
- Constraint: Data must be strictly positive.
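Both transforms can be sketched with scipy on a hypothetical right-skewed sample; scipy's boxcox estimates λ by maximum likelihood:

```python
import numpy as np
from scipy import stats

# Right-skewed, strictly positive data (a hypothetical income-like sample).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_t = np.log(skewed)                        # simple log transform
boxcox_t, best_lambda = stats.boxcox(skewed)  # scipy fits lambda by MLE

# For log-normal data the fitted lambda comes out close to 0,
# the case where Box-Cox reduces to the log transform.
```

That agreement between the fitted λ and 0 is expected here, since the log of a log-normal sample is exactly normal.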
You are given a dataset for Credit Card Fraud detection. The dataset has missing values, outliers, categorical variables, and a severe class imbalance. Design a step-by-step pre-processing pipeline to handle this.
A robust pipeline would follow this sequence:
- Train-Test Split: Split data first (e.g., 80/20) to prevent leakage, ensuring Stratified Splitting to maintain the ratio of fraud cases in both sets.
- Handling Missing Values:
- Check mechanisms. If numeric columns are skewed, use Median Imputation. If categorical, use Mode or create a 'Missing' category.
- Handling Outliers:
- Apply Winsorization or Capping (e.g., 99th percentile) on continuous variables like 'Transaction Amount'. Avoid removing rows due to the rarity of fraud data.
- Feature Encoding:
- Convert categorical data. Use One-Hot Encoding for low cardinality features (e.g., 'Card Type'). Use Frequency or Target Encoding for high cardinality features (e.g., 'Merchant ID').
- Scaling:
- Apply StandardScaler or RobustScaler (if outliers were not fully capped) to numerical features.
- Handling Class Imbalance (Applied on Training Set ONLY):
- Apply SMOTE or ADASYN to oversample the minority class (Fraud) in the training data.
- Alternatively, use Undersampling on the majority class if the dataset is massive.
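The pipeline above can be sketched end to end with scikit-learn. The data here is a hypothetical stand-in for a fraud table, and class_weight="balanced" stands in for SMOTE (which lives in the separate imbalanced-learn package); everything leakage-sensitive is fit on the training split only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table standing in for a fraud dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.lognormal(3, 1, 500),                 # skewed, continuous
    "card_type": rng.choice(["debit", "credit"], 500),  # low-cardinality categorical
    "fraud": (rng.random(500) < 0.05).astype(int),      # severe imbalance
})
df.loc[::50, "amount"] = np.nan                         # inject missing values

X, y = df[["amount", "card_type"]], df["fraud"]

# Step 1: stratified split FIRST, so no statistic leaks from the test set
# and the fraud ratio is preserved in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Remaining steps live inside the pipeline and are fit on training data only.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["card_type"]),
])
model = Pipeline([
    ("pre", pre),
    # class_weight="balanced" is a simple substitute for SMOTE here.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)
```

With real data, the imputation and outlier-capping choices would be driven by the exploratory analysis described in the steps above.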
What is the difference between Structured, Semi-Structured, and Unstructured data? Give examples for each.
1. Structured Data:
- Highly organized and formatted.
- Fits neatly into relational databases (rows and columns).
- Searchable via SQL.
- Examples: Excel spreadsheets, SQL databases, CSV files (Customer ID, Name, Age).
2. Semi-Structured Data:
- Does not reside in a relational database but has some organizational properties.
- Uses tags or markers to separate semantic elements and enforce hierarchies.
- Examples: JSON files, XML files, HTML code, NoSQL databases (MongoDB).
3. Unstructured Data:
- Lacks any specific form or structure.
- Cannot be stored in standard relational databases without transformation.
- Makes up the majority (~80%) of world data.
- Examples: Text documents, PDF files, Images, Audio, Video files.
Explain the concept of Robust Scaling. How is it different from Standard Scaling and when is it preferred?
Robust Scaling is a scaling technique that is robust to outliers.
Formula: x_scaled = (x - Median) / IQR
Where Median is the median of the feature and IQR is the Interquartile Range (Q3 - Q1).
Difference from Standard Scaling:
- Standard Scaler uses the Mean and Variance, which are highly influenced by outliers. One large outlier can squash the transformed values of the inliers.
- Robust Scaler uses the Median and IQR, which are resistant to outliers.
When Preferred:
It is preferred when the dataset contains significant outliers that you do not wish to remove or cap, but you still need to scale the data for an algorithm (like SVM or Neural Networks).
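The contrast above can be sketched with scikit-learn's two scalers on a hypothetical column containing one extreme outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

robust = RobustScaler().fit_transform(X)      # (x - median) / IQR
standard = StandardScaler().fit_transform(X)  # (x - mean) / std

# The outlier inflates the mean and std, so StandardScaler squashes the
# inliers together; RobustScaler keeps them well separated.
inlier_spread_robust = robust[3, 0] - robust[0, 0]
inlier_spread_standard = standard[3, 0] - standard[0, 0]
```

The median value 3 maps exactly to 0 under RobustScaler, and the four inliers span a range of 1.5 there versus less than 0.01 under StandardScaler.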
What is Undersampling? Discuss the risks involved and techniques like Tomek Links.
Undersampling reduces the size of the majority class to match the minority class to balance the dataset.
Risks:
- Loss of Information: You discard potentially valuable data from the majority class.
- Biased Sample: The remaining sample might not be representative of the actual population density of the majority class.
Techniques:
- Random Undersampling: Randomly removing samples.
- Tomek Links:
- A Tomek link exists if two samples from different classes are each other's nearest neighbors.
- They represent noise or borderline cases.
- Method: Remove the majority class sample (or both) from the Tomek link pair. This cleans the decision boundary between classes rather than just blindly balancing counts.
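The mutual-nearest-neighbor definition of a Tomek link can be checked directly with a small numpy sketch on hypothetical 2-D points:

```python
import numpy as np

# Toy 2-D data: two tight same-class clusters plus one cross-class
# pair sitting on the boundary between them.
X = np.array([[0.0, 0.0], [0.1, 0.0],    # class 0 cluster
              [1.0, 1.0], [1.1, 1.0],    # class 1 cluster
              [0.55, 0.5], [0.6, 0.5]])  # borderline pair
y = np.array([0, 0, 1, 1, 0, 1])

# Pairwise Euclidean distances, then each point's nearest neighbor.
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(d, np.inf)
nn = d.argmin(axis=1)

# A Tomek link: mutual nearest neighbors that belong to different classes.
links = [(i, j) for i, j in enumerate(nn)
         if nn[j] == i and y[i] != y[j] and i < j]
```

Only the borderline pair (points 4 and 5) forms a link; removing the majority-class member of that pair would clean the decision boundary as described above.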