Unit 1 - Subjective Questions
CSE274 • Practice Questions with Detailed Answers
Explain the four different levels of data measurement (scales of measurement) with examples.
The four levels of data measurement, often referred to as scales of measurement, are:
1. Nominal Scale:
- This is the lowest level of measurement.
- Data is categorized into distinct classes or labels without any inherent order.
- Examples: Gender (Male, Female), Colors (Red, Blue, Green), Zip Codes.
- Statistical operations: Mode, Frequency distribution.
2. Ordinal Scale:
- Data categories have a meaningful order or ranking, but the intervals between the ranks are not necessarily equal or known.
- Examples: Education Level (High School, Bachelor's, Master's), Customer Satisfaction (Unhappy, Neutral, Happy), Movie Ratings (1 to 5 stars).
- Statistical operations: Median, Percentile.
3. Interval Scale:
- Data has an order, and the difference between two values is meaningful and constant.
- It lacks a 'true zero' point (zero does not mean the absence of the attribute).
- Examples: Temperature in Celsius or Fahrenheit (0°C is not 'no temperature'), Calendar years.
- Statistical operations: Mean, Standard Deviation, Correlation.
4. Ratio Scale:
- This is the highest level of measurement.
- It possesses all properties of the interval scale but has a true zero point (0 implies none).
- Ratios between numbers are meaningful (e.g., 20kg is twice as heavy as 10kg).
- Examples: Height, Weight, Age, Income.
- Statistical operations: All statistical operations including Coefficient of Variation.
What is Data Pre-processing? Why is it considered a crucial step in the Machine Learning pipeline?
Data Pre-processing is the process of converting raw data into a clean, organized, and understandable format suitable for machine learning models. Real-world data is often incomplete, inconsistent, and noisy, and is likely to contain errors.
Why it is crucial:
- Improving Data Quality: It handles noise, missing values, and inconsistencies which can otherwise lead to misleading results.
- Algorithm Requirements: Many algorithms (like SVM or K-Means) require numerical input and scaled data to function correctly.
- Better Accuracy: Clean data leads to better training, which improves the model's predictive accuracy.
- Efficiency: Pre-processed data (e.g., after dimensionality reduction) often reduces the computational cost and training time.
- Model Convergence: Techniques like scaling help optimization algorithms (like Gradient Descent) converge faster.
Define Data Leakage. Describe the common causes of data leakage and how to prevent it.
Data Leakage occurs when information from outside the training dataset is used to create the model. The model effectively 'sees' information it will not have at prediction time, leading to overly optimistic performance during training but poor performance on real-world data.
Common Causes:
- Target Leakage: When predictors include data that will not be available at the time of prediction (e.g., including 'Approval_Status' variable when predicting 'Loan_Risk').
- Train-Test Contamination: When preprocessing (like scaling or missing value imputation) is applied to the entire dataset before splitting into train and test sets.
Prevention:
- Split First: Always split data into training and testing sets before any preprocessing.
- Pipelines: Use pipeline structures (e.g., Scikit-Learn Pipelines) to ensure transformations are fit only on the training set and applied to the test set.
- Time-Series Validation: For time-dependent data, use time-based splitting (train on past, test on future) rather than random shuffling.
- Feature Review: Carefully analyze features to ensure they are causally prior to the target variable.
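The split-first and pipeline rules above can be sketched with scikit-learn. This is a minimal illustration on a hypothetical toy dataset (generated with make_classification); the key point is that the scaler is fit only on the training fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; any tabular dataset works the same way.
X, y = make_classification(n_samples=200, random_state=0)

# Split FIRST, before any statistic is computed from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Inside the pipeline the scaler is fit on the training fold only;
# the test fold is transformed with the training statistics.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```

Calling pipe.fit on raw training data guarantees that no test-set statistic ever reaches the model.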
Compare and contrast Standardization and Normalization (Min-Max Scaling). Provide the mathematical formula for each.
Both are feature scaling techniques used to bring features to a similar scale.
1. Standardization (Z-Score Normalization):
- Definition: Rescales data to have a mean (μ) of 0 and a standard deviation (σ) of 1.
- Formula: z = (x - μ) / σ
- When to use: Preferred when algorithms assume a Gaussian distribution (e.g., Logistic Regression, Linear Discriminant Analysis) or when outliers are present (as it is less sensitive to outliers than Min-Max).
- Range: Unbounded (though mostly between -3 and 3).
2. Normalization (Min-Max Scaling):
- Definition: Rescales the data to a fixed range, typically [0, 1].
- Formula: x' = (x - x_min) / (x_max - x_min)
- When to use: Useful when the data does not follow a Gaussian distribution or for algorithms that compute distances (e.g., K-Nearest Neighbors, Neural Networks). It preserves the shape of the original distribution.
- Range: Strictly [0, 1].
Key Difference: Standardization changes the distribution to a normal shape (if the original was roughly normal) and is centered at 0. Normalization squashes the data into a strict range.
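The two formulas above can be checked side by side with a small numpy sketch on hypothetical values:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: z = (x - mu) / sigma  ->  mean 0, std 1, unbounded
z = (x - x.mean()) / x.std()

# Min-Max normalization: (x - min) / (max - min)  ->  strictly [0, 1]
m = (x - x.min()) / (x.max() - x.min())
```

After the transforms, z has mean 0 and standard deviation 1, while m is bounded in [0, 1].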
Explain the concept of Missing Values in a dataset. Discuss the three mechanisms of missing data (MCAR, MAR, MNAR).
Missing values occur when no data value is stored for the variable in an observation. Understanding why data is missing is crucial for choosing the right handling method.
Mechanisms of Missing Data:
1. Missing Completely at Random (MCAR):
- The probability of being missing is the same for all observations.
- There is no relationship between the missing data and any other values (observed or missing).
- Example: A weighing scale runs out of battery randomly.
- Handling: Deletion is often acceptable here without introducing bias.
2. Missing at Random (MAR):
- The probability of being missing is related to the observed data but not the missing data itself.
- Example: Men might be less likely to report their 'depression score', but this depends on 'Gender' (which is observed), not on how depressed they actually are.
- Handling: Imputation based on other features is preferred.
3. Missing Not at Random (MNAR):
- The missingness depends on the value of the missing data itself.
- Example: People with very high incomes may refuse to disclose their income.
- Handling: Requires modeling the missingness mechanism; simple imputation may bias the model.
Describe different strategies to handle missing values, specifically focusing on Deletion and Imputation methods.
1. Deletion Strategies:
- Listwise Deletion (Complete Case Analysis): Removing entire rows that contain any missing value. Simple but causes data loss and potential bias if data isn't MCAR.
- Pairwise Deletion: Uses all available data for specific analyses (e.g., correlation) even if rows have missing values elsewhere.
- Dropping Columns: Removing an entire feature if it has a very high percentage (e.g., >60%) of missing values.
2. Imputation Strategies:
- Simple Imputation: Replacing missing values with a statistic.
- Mean: For continuous data (sensitive to outliers).
- Median: For continuous data (robust to outliers).
- Mode: For categorical data.
- K-Nearest Neighbors (KNN) Imputation: Finds 'k' samples in the dataset that are most similar to the sample with missing data and imputes the value based on the average/mode of neighbors.
- Predictive Imputation: Treating the missing column as a target variable and training a regression or classification model on known data to predict the missing values.
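The imputation strategies above can be sketched with scikit-learn's imputers on a hypothetical array with missing entries:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])

# Median imputation: robust to outliers in the column.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: fill from the 2 most similar rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```

The missing value in column 0 becomes 4.0 (the median of 1, 7, 4); KNNImputer instead averages the nearest neighbors' values.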
What are Outliers? Explain how the IQR (Interquartile Range) method is used to detect and handle outliers.
Outliers are data points that differ significantly from other observations. They may be due to variability in the measurement or experimental errors.
IQR Method for Detection:
The Interquartile Range (IQR) measures statistical dispersion.
- Calculate the first quartile (Q1, 25th percentile) and third quartile (Q3, 75th percentile).
- Calculate IQR = Q3 - Q1.
- Define bounds:
- Lower Bound: Q1 - 1.5 × IQR
- Upper Bound: Q3 + 1.5 × IQR
- Any data point outside these bounds is considered an outlier.
Handling:
- Trimming: Remove the data points outside the bounds.
- Capping (Winsorization): Replace values below the lower bound with the lower bound value, and values above the upper bound with the upper bound value.
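The IQR detection and capping steps above can be sketched with numpy on hypothetical data:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 14, -40, 13], dtype=float)

# Quartiles and the IQR fence.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Detection: points outside the fence.
outliers = data[(data < lower) | (data > upper)]

# Handling by capping (Winsorization): clip to the bounds.
capped = np.clip(data, lower, upper)
```

Here Q1 = 11 and Q3 = 13, so the fence is [8, 16]: the values 95 and -40 are flagged and then clipped to 16 and 8.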
Differentiate between Label Encoding and One-Hot Encoding. When should each be used?
1. Label Encoding:
- Concept: Converts each category into a specific integer (e.g., Low=0, Medium=1, High=2).
- Pros: Does not add new columns; computationally efficient.
- Cons: Introduces an artificial order (e.g., 0 < 1 < 2). If the data is nominal (e.g., Red, Blue, Green), the model might misinterpret 'Green' as being 'greater' than 'Red'.
- Usage: Best for Ordinal categorical data (where order matters) or tree-based models which can handle numeric categories well.
2. One-Hot Encoding:
- Concept: Creates a new binary column (dummy variable) for each category. (e.g., Color_Red, Color_Blue).
- Pros: Does not assume any order; fair representation of nominal data.
- Cons: Increases dimensionality significantly (Curse of Dimensionality) if cardinality is high.
- Usage: Best for Nominal categorical data with low-to-medium cardinality.
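Both encodings can be sketched with pandas on a hypothetical frame holding one ordinal and one nominal column:

```python
import pandas as pd

df = pd.DataFrame({"size": ["Low", "High", "Medium", "Low"],    # ordinal
                   "color": ["Red", "Blue", "Green", "Red"]})   # nominal

# Label encoding for the ordinal feature: an explicit map preserves the order.
order = {"Low": 0, "Medium": 1, "High": 2}
df["size_enc"] = df["size"].map(order)

# One-hot encoding for the nominal feature: one binary column per category.
onehot = pd.get_dummies(df["color"], prefix="color")
```

The one-hot frame has three new columns (color_Blue, color_Green, color_Red), one per category.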
What is the Class Imbalance problem? List three metrics that should be used instead of 'Accuracy' when dealing with imbalanced datasets.
Class Imbalance refers to a scenario in a classification problem where the number of observations in each class is not equally distributed. For example, in fraud detection, 99% of transactions might be legitimate and only 1% fraudulent. A model predicting 'Legitimate' for every case would have 99% accuracy but is useless.
Metrics to use instead of Accuracy:
- Precision: Out of all instances predicted as positive, how many are actually positive?
- Recall (Sensitivity): Out of all actual positive instances, how many did the model correctly identify?
- F1-Score: The harmonic mean of Precision and Recall. It provides a balance between the two.
- AUC-ROC: Area Under the Receiver Operating Characteristic Curve.
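The accuracy trap described above can be reproduced with scikit-learn's metrics on a hypothetical imbalanced label set:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 1 = fraud (rare). A model that predicts "legitimate" (0) everywhere
# looks great on accuracy but never finds a single fraud case.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)                      # misleadingly high
prec = precision_score(y_true, y_pred, zero_division=0)   # no positives predicted
rec = recall_score(y_true, y_pred, zero_division=0)       # 0 of 5 frauds found
f1 = f1_score(y_true, y_pred, zero_division=0)
```

Accuracy is 0.95 while precision, recall, and F1 are all 0.0, exposing the useless classifier.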
Explain SMOTE (Synthetic Minority Over-sampling Technique). How does it differ from simple random oversampling?
SMOTE is an advanced oversampling technique used to address class imbalance.
How it works:
Instead of duplicating existing minority class samples (which causes overfitting), SMOTE synthesizes new examples.
- It selects a minority sample x_i.
- It finds the k nearest neighbors of x_i belonging to the same class.
- It selects a random neighbor x_j.
- It creates a new point anywhere on the line segment connecting x_i and x_j.
- Mathematically: x_new = x_i + λ × (x_j - x_i), where λ is a random number in [0, 1].
Difference from Random Oversampling:
- Random Oversampling: Simply duplicates existing minority samples. This creates exact copies, leading to a high risk of overfitting because the model memorizes specific points.
- SMOTE: Generates plausible new data points that lie in the feature space of the minority class, helping the model generalize better.
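The interpolation step at SMOTE's core can be sketched by hand with numpy (a simplified one-point version on hypothetical minority samples; production code would use the imbalanced-learn library):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_point(X_min, i, k=3):
    """Create one synthetic sample from the minority-class matrix X_min."""
    x = X_min[i]
    d = np.linalg.norm(X_min - x, axis=1)   # distance to every minority sample
    neighbors = np.argsort(d)[1:k + 1]      # k nearest, skipping the point itself
    xj = X_min[rng.choice(neighbors)]       # pick one neighbor at random
    lam = rng.random()                      # random lambda in [0, 1)
    return x + lam * (xj - x)               # point on the connecting segment

X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0], [2.5, 2.5]])
synthetic = smote_point(X_min, 0)
```

The synthetic point is a convex combination of two real minority samples, so it always lands inside the minority region rather than duplicating an existing point.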
What is Discretization (Binning)? Explain Equal-Width Binning and Equal-Frequency Binning.
Discretization (or Binning) is the process of converting continuous variables into discrete categorical 'bins' or intervals. It is used to handle outliers, improve value spread, or fit models that require categorical input.
1. Equal-Width Binning:
- Divides the range of data into intervals of equal size.
- Formula: Width = (Max - Min) / N, where N is the number of bins.
- Pros: Preserves the data distribution shape.
- Cons: Sensitive to outliers (outliers can squash the rest of the data into a single bin).
2. Equal-Frequency (Quantile) Binning:
- Divides the data into groups where each group contains approximately the same number of observations.
- Pros: Handles outliers well; creates a uniform distribution.
- Cons: Can disrupt the natural relationship/distance between values.
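The two binning schemes, including the outlier squashing noted above, can be compared with pandas on hypothetical values:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])  # 100 is an outlier

# Equal-width: the outlier stretches the range, so almost all points
# end up squashed into the first bin.
width_bins = pd.cut(values, bins=3)

# Equal-frequency: each bin holds roughly the same number of points.
freq_bins = pd.qcut(values, q=3)
```

With equal width the bin counts come out as 8, 0, 1; with equal frequency they are 3, 3, 3.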
Why do distance-based algorithms like K-Means and KNN require feature scaling? Explain with an example.
Distance-based algorithms calculate the distance (typically Euclidean) between data points to determine similarity. If features have different scales, the feature with the larger magnitude will dominate the distance calculation.
Example:
Consider a dataset with two features:
- Age: Range [20, 60]
- Salary: Range [30,000, 100,000]
If we calculate the Euclidean distance between two persons:
A difference of 10 years in Age contributes 10² = 100 to the squared sum.
A difference of 1,000 in Salary contributes 1,000² = 1,000,000.
The Salary feature completely overpowers the Age feature solely because of its magnitude, not its importance. Scaling brings both features to a comparable range (e.g., 0 to 1), allowing the algorithm to weight them equally.
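The Age/Salary example above can be verified numerically with numpy (two hypothetical people, min-max scaled with the stated feature ranges):

```python
import numpy as np

# Two people as (age, salary) vectors.
a = np.array([25.0, 50_000.0])
b = np.array([35.0, 51_000.0])

raw_dist = np.linalg.norm(a - b)  # dominated by the salary term

# Min-max scale each feature using the ranges Age [20, 60], Salary [30k, 100k].
lo = np.array([20.0, 30_000.0])
hi = np.array([60.0, 100_000.0])
a_scaled = (a - lo) / (hi - lo)
b_scaled = (b - lo) / (hi - lo)
scaled_dist = np.linalg.norm(a_scaled - b_scaled)
```

The raw distance is roughly 1000 (almost entirely salary), while after scaling the 10-year age gap dominates the 1000-unit salary gap, as it arguably should.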
What is Target Encoding (Mean Encoding)? What is the risk associated with it, and how can it be mitigated?
Target Encoding involves replacing a categorical value with the mean of the target variable for that category.
Example: If predicting 'Default' (0 or 1) and the category is 'City=NY', replace 'NY' with the average default rate of all people in NY.
Risk: Overfitting / Data Leakage.
If a category has very few samples (e.g., only 1 person in 'City=SmallTown' who defaulted), the encoding becomes exactly 1. The model memorizes this perfectly, leading to poor generalization on test data.
Mitigation:
- Smoothing: Weigh the category mean with the global mean: Encoded = (n × Mean_category + m × Mean_global) / (n + m), where n is the number of samples in the category and m is a smoothing parameter.
- Cross-Validation: Compute encodings inside cross-validation folds (calculate means on the training fold, map to validation fold) to prevent leakage.
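The smoothing formula can be sketched with pandas on a hypothetical frame (m = 10 is an arbitrary smoothing strength chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "NY", "NY", "SmallTown"],
                   "default": [1, 0, 1, 0, 1]})

global_mean = df["default"].mean()  # 0.6
stats = df.groupby("city")["default"].agg(["mean", "count"])

m = 10  # smoothing strength; a hypothetical choice
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
```

SmallTown's raw mean is a perfectly memorizable 1.0 (one sample), but the smoothed encoding pulls it back toward the global mean, to about 0.636.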
Describe the Z-Score method for outlier detection. What are its limitations?
Z-Score Method:
This method assumes the data follows a Gaussian (Normal) distribution. The Z-score tells us how many standard deviations a data point is away from the mean.
Formula: Z = (x - μ) / σ
Detection:
Typically, if the absolute Z-score is greater than a threshold (usually 3), the point is considered an outlier, since 99.7% of data in a normal distribution lies within three standard deviations of the mean.
Limitations:
- Mean and SD sensitivity: The mean and standard deviation themselves are sensitive to outliers. A massive outlier can shift the mean and inflate the SD, masking other outliers.
- Assumption of Normality: It only works reliably if the data is normally distributed. For skewed distributions, it is inaccurate.
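A small numpy sketch on hypothetical data shows the masking limitation in action: the extreme value inflates the mean and standard deviation enough that its own Z-score stays below 3.

```python
import numpy as np

data = np.array([10, 11, 12, 11, 10, 12, 11, 200], dtype=float)

# Z = (x - mean) / std for every point.
z = (data - data.mean()) / data.std()

# The value 200 drags the mean up to ~34.6 and inflates the std to ~62.5,
# so even this obvious outlier scores |z| of only about 2.65.
flagged = data[np.abs(z) > 3]
```

Nothing is flagged at the usual threshold of 3, even though 200 is visibly anomalous; a robust method (IQR, median-based) would catch it.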
Explain the Curse of Dimensionality in the context of data pre-processing. How does handling categorical data relate to this?
The Curse of Dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces. As the number of features (dimensions) increases, the volume of the space increases so fast that the available data becomes sparse.
Consequences:
- Distance becomes meaningless (all points are roughly equidistant).
- Risk of overfitting increases dramatically.
- Computational complexity explodes.
Relation to Categorical Data:
Techniques like One-Hot Encoding increase dimensionality. If a categorical feature has high cardinality (e.g., 'Zip Code' with 10,000 unique values), One-Hot Encoding adds 10,000 new sparse features. This triggers the curse of dimensionality. Therefore, pre-processing steps like dimensionality reduction (PCA) or using alternative encodings (Target Encoding, Embedding) are necessary.
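The dimensionality blow-up can be demonstrated with pandas on a hypothetical high-cardinality code column:

```python
import pandas as pd

# A hypothetical high-cardinality feature: 1,000 rows over 500 distinct codes.
codes = pd.Series([f"Z{i % 500}" for i in range(1000)], name="zip_code")

onehot = pd.get_dummies(codes)
# One sparse binary column per distinct code: 500 new dimensions
# from a single original feature.
```

A single column becomes a 1000 × 500 matrix in which each row is almost entirely zeros, exactly the sparsity the curse of dimensionality warns about.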
Discuss Transformation techniques used to handle skewed data, specifically Log Transformation and Box-Cox Transformation.
Many machine learning algorithms (like Linear Regression) assume that residuals are normally distributed. If input data is highly skewed, these assumptions fail.
1. Log Transformation:
- Applying log(x) or log(x + 1) to the data.
- Usage: Effective for right-skewed data (e.g., Income distribution). Compresses large values and expands small values.
- Constraint: Inputs must be positive.
2. Box-Cox Transformation:
- A parameterized transformation that searches for the best power exponent (λ) to transform the data toward a normal distribution.
- Formula: y(λ) = (y^λ - 1) / λ for λ ≠ 0, and y(λ) = log(y) for λ = 0.
- Usage: More flexible than the log transform, as it finds the optimal λ.
- Constraint: Data must be strictly positive.
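Both transforms can be sketched with scipy on a hypothetical right-skewed sample; scipy's boxcox estimates λ by maximum likelihood:

```python
import numpy as np
from scipy import stats

# Right-skewed, strictly positive data (a hypothetical income-like sample).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

log_t = np.log(skewed)                        # simple log transform
boxcox_t, best_lambda = stats.boxcox(skewed)  # scipy fits lambda by MLE

# For log-normal data the fitted lambda comes out close to 0,
# the case where Box-Cox reduces to the log transform.
```

That agreement between the fitted λ and 0 is expected here, since the log of a log-normal sample is exactly normal.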
You are given a dataset for Credit Card Fraud detection. The dataset has missing values, outliers, categorical variables, and a severe class imbalance. Design a step-by-step pre-processing pipeline to handle this.
A robust pipeline would follow this sequence:
- Train-Test Split: Split data first (e.g., 80/20) to prevent leakage, ensuring Stratified Splitting to maintain the ratio of fraud cases in both sets.
- Handling Missing Values:
- Check mechanisms. If numeric columns are skewed, use Median Imputation. If categorical, use Mode or create a 'Missing' category.
- Handling Outliers:
- Apply Winsorization or Capping (e.g., 99th percentile) on continuous variables like 'Transaction Amount'. Avoid removing rows due to the rarity of fraud data.
- Feature Encoding:
- Convert categorical data. Use One-Hot Encoding for low cardinality features (e.g., 'Card Type'). Use Frequency or Target Encoding for high cardinality features (e.g., 'Merchant ID').
- Scaling:
- Apply StandardScaler or RobustScaler (if outliers were not fully capped) to numerical features.
- Handling Class Imbalance (Applied on Training Set ONLY):
- Apply SMOTE or ADASYN to oversample the minority class (Fraud) in the training data.
- Alternatively, use Undersampling on the majority class if the dataset is massive.
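The pipeline above can be sketched end to end with scikit-learn. The data here is a hypothetical stand-in for a fraud table, and class_weight="balanced" stands in for SMOTE (which lives in the separate imbalanced-learn package); everything leakage-sensitive is fit on the training split only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table standing in for a fraud dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.lognormal(3, 1, 500),                 # skewed, continuous
    "card_type": rng.choice(["debit", "credit"], 500),  # low-cardinality categorical
    "fraud": (rng.random(500) < 0.05).astype(int),      # severe imbalance
})
df.loc[::50, "amount"] = np.nan                         # inject missing values

X, y = df[["amount", "card_type"]], df["fraud"]

# Step 1: stratified split FIRST, so no statistic leaks from the test set
# and the fraud ratio is preserved in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Remaining steps live inside the pipeline and are fit on training data only.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["card_type"]),
])
model = Pipeline([
    ("pre", pre),
    # class_weight="balanced" is a simple substitute for SMOTE here.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)
```

With real data, the imputation and outlier-capping choices would be driven by the exploratory analysis described in the steps above.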
What is the difference between Structured, Semi-Structured, and Unstructured data? Give examples for each.
1. Structured Data:
- Highly organized and formatted.
- Fits neatly into relational databases (rows and columns).
- Searchable via SQL.
- Examples: Excel spreadsheets, SQL databases, CSV files (Customer ID, Name, Age).
2. Semi-Structured Data:
- Does not reside in a relational database but has some organizational properties.
- Uses tags or markers to separate semantic elements and enforce hierarchies.
- Examples: JSON files, XML files, HTML code, NoSQL databases (MongoDB).
3. Unstructured Data:
- Lacks any specific form or structure.
- Cannot be stored in standard relational databases without transformation.
- Makes up the majority (~80%) of world data.
- Examples: Text documents, PDF files, Images, Audio, Video files.
Explain the concept of Robust Scaling. How is it different from Standard Scaling and when is it preferred?
Robust Scaling is a scaling technique that is robust to outliers.
Formula: x_scaled = (x - Median) / IQR
Where Median is the median of the feature and IQR is the Interquartile Range (Q3 - Q1).
Difference from Standard Scaling:
- Standard Scaler uses the Mean and Variance, which are highly influenced by outliers. One large outlier can squash the transformed values of the inliers.
- Robust Scaler uses the Median and IQR, which are resistant to outliers.
When Preferred:
It is preferred when the dataset contains significant outliers that you do not wish to remove or cap, but you still need to scale the data for an algorithm (like SVM or Neural Networks).
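The contrast above can be sketched with scikit-learn's two scalers on a hypothetical column containing one extreme outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

robust = RobustScaler().fit_transform(X)      # (x - median) / IQR
standard = StandardScaler().fit_transform(X)  # (x - mean) / std

# The outlier inflates the mean and std, so StandardScaler squashes the
# inliers together; RobustScaler keeps them well separated.
inlier_spread_robust = robust[3, 0] - robust[0, 0]
inlier_spread_standard = standard[3, 0] - standard[0, 0]
```

The median value 3 maps exactly to 0 under RobustScaler, and the four inliers span a range of 1.5 there versus less than 0.01 under StandardScaler.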
What is Undersampling? Discuss the risks involved and techniques like Tomek Links.
Undersampling reduces the size of the majority class to match the minority class to balance the dataset.
Risks:
- Loss of Information: You discard potentially valuable data from the majority class.
- Biased Sample: The remaining sample might not be representative of the actual population density of the majority class.
Techniques:
- Random Undersampling: Randomly removing samples.
- Tomek Links:
- A Tomek link exists if two samples from different classes are each other's nearest neighbors.
- They represent noise or borderline cases.
- Method: Remove the majority class sample (or both) from the Tomek link pair. This cleans the decision boundary between classes rather than just blindly balancing counts.
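The mutual-nearest-neighbor definition of a Tomek link can be checked directly with a small numpy sketch on hypothetical 2-D points:

```python
import numpy as np

# Toy 2-D data: two tight same-class clusters plus one cross-class
# pair sitting on the boundary between them.
X = np.array([[0.0, 0.0], [0.1, 0.0],    # class 0 cluster
              [1.0, 1.0], [1.1, 1.0],    # class 1 cluster
              [0.55, 0.5], [0.6, 0.5]])  # borderline pair
y = np.array([0, 0, 1, 1, 0, 1])

# Pairwise Euclidean distances, then each point's nearest neighbor.
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(d, np.inf)
nn = d.argmin(axis=1)

# A Tomek link: mutual nearest neighbors that belong to different classes.
links = [(i, j) for i, j in enumerate(nn)
         if nn[j] == i and y[i] != y[j] and i < j]
```

Only the borderline pair (points 4 and 5) forms a link; removing the majority-class member of that pair would clean the decision boundary as described above.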