Unit 1 - Notes

CSE274

Unit 1: Data Pre-processing

1. Introduction to Data and Pre-processing

Data pre-processing is the step in data mining and machine learning that transforms raw data into a clean, consistent format that algorithms can use. Real-world data is often incomplete (missing values), inconsistent (conflicting formats or codes), and noisy (errors and outliers). Pre-processing is the standard way to resolve these issues before modeling.

The "Garbage In, Garbage Out" (GIGO) Principle:
If the input data is of poor quality, the output of the machine learning model will be of poor quality, regardless of how sophisticated the algorithm is.

Key Steps in Pre-processing Pipeline:

Data Cleaning: Handling missing data, noisy data, etc.
Data Integration: Combining data from multiple sources.
Data Reduction: Dimensionality reduction, numerosity reduction.
Data Transformation: Normalization, discretization, hierarchy generation.

2. Types of Data

Understanding the data type is crucial for selecting the correct visualization techniques and machine learning algorithms.

A. Qualitative (Categorical) Data

Data that describes characteristics or qualities. It cannot be counted or measured in the traditional sense.

Nominal Data:
- Categories without any intrinsic ordering.
- Examples: Gender (Male/Female), Color (Red/Blue/Green), Zip Codes.
- Valid summary statistic: Mode.
Ordinal Data:
- Categories with a clear ordering or ranking.
- Examples: Education Level (High School < Bachelor's < Master's), Customer Satisfaction (Low < Medium < High).
- Valid summary statistics: Mode, Median, Percentiles.

B. Quantitative (Numerical) Data

Data that deals with numbers and things you can measure objectively.

Interval Data:
- Numeric scales where we know the order and the exact difference between values.
- Key characteristic: No true zero point (0 does not mean "nothing").
- Examples: Temperature in Celsius (0°C is not "no temperature"), pH scale.
- Operations: Addition/Subtraction allowed; Multiplication/Division not meaningful.
Ratio Data:
- Numeric scales with a clear definition of zero.
- Key characteristic: True zero point exists.
- Examples: Height, Weight, Salary, Age.
- Operations: All arithmetic operations allowed (e.g., A is twice as heavy as B).
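The difference between interval and ratio scales can be checked with simple arithmetic. The sketch below uses two made-up temperatures (not from the notes) to show that a ratio computed in Celsius changes when the same values are expressed in Kelvin, while the difference does not.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are not meaningful.
c_low, c_high = 10.0, 20.0
celsius_ratio = c_high / c_low  # 2.0, but 20°C is not "twice as hot" as 10°C

# Kelvin is a ratio scale (true zero), so its ratio reflects the physical quantity.
k_low, k_high = c_low + 273.15, c_high + 273.15
kelvin_ratio = k_high / k_low  # roughly 1.035

# Differences are meaningful on both scales and agree with each other.
celsius_diff = c_high - c_low
kelvin_diff = k_high - k_low

print(celsius_ratio, round(kelvin_ratio, 3), celsius_diff, round(kelvin_diff, 2))
```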

C. Structured vs. Unstructured

Structured: Highly organized (SQL tables, CSVs).
Unstructured: No pre-defined model (Text, Images, Audio, Video).
Semi-Structured: JSON, XML (contains tags/markers).

[Figure: hierarchical tree diagram classifying "Types of Data", with "Data" as the root node.]

3. The Concept of Data Leakage

Data leakage occurs when information that would not be available at prediction time, such as test-set data or the target itself, is used to build the model. The model effectively "sees" the test data during training, which leads to overly optimistic performance scores that drop significantly on real-world data.

Common Causes:

Leaking Test Data into Training Data: Performing pre-processing (like imputation or scaling) on the entire dataset before splitting into train/test sets.
Leaking Future Information: Including features that would not be available at the time of prediction (e.g., using "Time_to_Churn" to predict "Will_Churn").

Prevention:

Split First, Process Later: Always split data into Train/Test sets before fitting any transformation (imputation, scaling, encoding).
Pipelines: Use sklearn.pipeline.Pipeline to ensure transformations fit only on training data and transform the test data using those learned parameters.
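A minimal sketch of the two prevention rules, assuming scikit-learn is installed. The data is synthetic and the step names ("impute", "scale", "model") are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (not from the notes).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Fit every transformation on the training split only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# 3. The test split is transformed with the parameters learned in step 2.
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

Because the imputer and scaler live inside the pipeline, calling `pipe.fit` never touches `X_test`, so the medians, means, and standard deviations come from the training rows alone.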

[Figure: comparison diagram split into two vertical panels titled "Incorrect Approach (Leakage)" and "Corre..."]

4. Handling Missing Values

Missing data is typically marked as NaN (Not a Number), null, or a sentinel placeholder such as -999.
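Sentinel placeholders need to be converted before any imputation, otherwise they are treated as real values. A small sketch with pandas, using a hypothetical frame and column names invented for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which -999 was used as a "missing" placeholder.
df = pd.DataFrame({"age": [25, -999, 40], "income": [50000.0, 62000.0, None]})

# Convert the sentinel to NaN so pandas treats it as missing.
df_clean = df.replace(-999, np.nan)

# Count missing values per column.
missing_counts = df_clean.isna().sum()
print(missing_counts)
```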

Mechanisms of Missingness:

MCAR (Missing Completely at Random): Probability of missingness is unrelated to any data.
MAR (Missing at Random): Missingness depends only on observed data, not on the missing value itself (e.g., men may be less likely than women to report depression, but within each gender group the missingness is random).
MNAR (Missing Not at Random): Missingness depends on the missing value itself (e.g., rich people not disclosing income).
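The practical consequence of the mechanism can be illustrated with a small simulation. The sketch below generates synthetic incomes (an assumption for illustration) and compares the mean of the observed values under MCAR with the mean under an extreme MNAR scenario where the highest incomes are withheld.

```python
import numpy as np

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=0.5, size=10_000)  # synthetic incomes

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(income.size) < 0.3

# MNAR: the top 30% of incomes are withheld, so missingness depends on the value itself.
mnar_mask = income > np.quantile(income, 0.7)

true_mean = income.mean()
mcar_mean = income[~mcar_mask].mean()  # stays close to the true mean
mnar_mean = income[~mnar_mask].mean()  # biased downward

print(round(true_mean), round(mcar_mean), round(mnar_mean))
```

Under MCAR the observed rows remain a random sample, so simple statistics stay roughly unbiased; under MNAR they do not, which is why deletion or mean imputation can mislead.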

Handling Techniques:

1. Deletion

Listwise Deletion: Drop entire rows with nulls. (Risk: Loss of data).
Pairwise Deletion: Only analyze cases with available data for specific variables.
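The two deletion strategies can be compared with pandas. This is a sketch on a made-up frame: `dropna()` performs listwise deletion, while `DataFrame.corr()` excludes missing values pair by pair, which corresponds to pairwise deletion.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each of three rows.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, np.nan, 1.0],
})

# Listwise deletion: keep only rows with no missing value in any column.
listwise = df.dropna()

# Pairwise deletion: for each pair of columns, corr() uses every row
# in which both columns are present.
pairwise_corr = df.corr()

print(len(listwise), "complete rows out of", len(df))
print(pairwise_corr)
```

Here listwise deletion discards three of the five rows, whereas the correlation between `a` and `b` is still computed from the three rows where both are observed.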

2. Imputation (Simple)

Mean/Median: Good for numerical data. Median is robust to outliers.
Mode: Used for categorical data.
Constant: Fill with 0 or "Unknown".

3. Imputation (Advanced)

KNN Imputation: Find $K$ nearest neighbors and use their average to fill the gap.
Multivariate Imputation by Chained Equations (MICE): Models each feature with missing values as a function of other features.

PYTHON

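# Python example using scikit-learn
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data; replace with your own train/test split
X_train = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])
X_test = np.array([[np.nan, 8.0]])

# Mean imputation: learn the column means from the training data only
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)  # reuse the training means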
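The advanced imputers listed above can be sketched in the same way. This example assumes scikit-learn's `KNNImputer` and its experimental `IterativeImputer` (a MICE-inspired estimator that must be enabled with an explicit import); the toy array is made up for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

# Toy training data in which the second column is about twice the first.
X_train = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

# KNN imputation: fill each gap with the average of that feature
# over the 2 nearest rows that have it.
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X_train)

# MICE-style imputation: model each feature with missing values
# as a function of the other features, iterating until stable.
mice = IterativeImputer(max_iter=10, random_state=0)
X_mice = mice.fit_transform(X_train)

print(X_knn)
print(X_mice)
```

As with the simple imputer, both estimators should be fitted on the training split and then applied to the test split with `transform`.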

5. Outlier Handling

Outliers are data points that differ significantly from other observations. They can skew statistical measures such as the mean and standard deviation, and degrade the performance of distance-based algorithms (like KNN), SVMs, and linear models.

Detection Methods:

Z-Score: Measures how many standard deviations a point is from the mean, $Z = \frac{x - \mu}{\sigma}$.
- Rule: If $|Z| > 3$, the point is treated as an outlier.
- Assumption: Data follows an approximately Gaussian (Normal) distribution.
IQR (Interquartile Range) Method: Robust to non-normal distributions.
- $IQR = Q3 - Q1$, where $Q1$ is the 25th percentile and $Q3$ is the 75th percentile.
- Lower Bound: $Q1 - 1.5 \times IQR$
- Upper Bound: $Q3 + 1.5 \times IQR$
- Rule: Points below the lower bound or above the upper bound are treated as outliers.
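Both detection rules take only a few lines of NumPy. The sketch below uses a synthetic sample (50 roughly normal values plus one injected extreme value of 120) chosen purely for illustration.

```python
import numpy as np

# Illustrative sample: 50 well-behaved values plus one extreme value.
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=50), 120.0)

# Z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

Note that the extreme value inflates the mean and standard deviation used by the Z-score rule, which is one reason the IQR rule is often preferred on small or skewed samples.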

Treatment Methods:

Trimming: Remove the outliers.
Capping (Winsorizing): Replace outliers with the upper/lower bound values.
Transformation: Log transformation or Box-Cox transformation to reduce the impact of extreme values.
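The three treatments can be sketched on a small made-up array, reusing the IQR bounds from the detection step. `np.log1p` (which computes log(1 + x)) is used here as one possible log transformation; it assumes values greater than -1.

```python
import numpy as np

values = np.array([12.0, 14.0, 15.0, 15.0, 16.0, 18.0, 19.0, 250.0])

# IQR bounds used to decide what counts as an outlier.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trimming: drop points outside the bounds.
trimmed = values[(values >= lower) & (values <= upper)]

# Capping (winsorizing): pull points outside the bounds back to the bounds.
capped = np.clip(values, lower, upper)

# Transformation: compress large values instead of removing them.
logged = np.log1p(values)

print(trimmed)
print(capped)
print(logged.round(2))
```

Trimming changes the number of rows, while capping and transformation keep every observation, which matters when the rows must stay aligned with a target column.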

[Figure: anatomy of a box-and-whisker plot (boxplot) used for outlier detection.]

6. Handling Categorical Data

Most machine learning models require numerical input, so categorical text data must be converted to numbers.

1. One-Hot Encoding (Nominal Data)

Creates a new binary column for each unique category.

Example: Color (Red, Blue) $\rightarrow$ Is_Red (1,0), Is_Blue (0,1).
Pros: No order implied.
Cons: Curse of dimensionality (if cardinality is high).
Dummy Variable Trap: Creating $N$ columns for $N$ categories introduces perfect multicollinearity, because any one column can be derived from the others. The usual fix is to drop one column and keep $N-1$.
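A short sketch of one-hot encoding with pandas, using a made-up `Color` column. `drop_first=True` is one way to avoid the dummy variable trap.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# Full one-hot encoding: one binary column per category (N columns).
one_hot = pd.get_dummies(df, columns=["Color"], dtype=int)

# Dropping the first category leaves N-1 columns; the dropped
# category is implied when all remaining columns are 0.
one_hot_reduced = pd.get_dummies(df, columns=["Color"], drop_first=True, dtype=int)

print(one_hot)
print(one_hot_reduced)
```

In a train/test workflow, scikit-learn's `OneHotEncoder` is usually the safer choice because it remembers the categories seen during fitting.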

2. Label Encoding (Ordinal Data)

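Assigns an integer to each category. For ordinal data the integers should follow the category rank; encoders that assign integers alphabetically (such as scikit-learn's LabelEncoder) do not preserve the rank unless the alphabetical order happens to match it.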

Example: Low=0, Medium=1, High=2.
Pros: Preserves order.
Cons: If used on nominal data, the model might learn false relationships (e.g., Blue(2) > Red(1)).
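To get the Low=0, Medium=1, High=2 mapping reliably, the category order can be passed explicitly. The sketch below assumes scikit-learn's `OrdinalEncoder` and a hypothetical `satisfaction` column, and contrasts the explicit order with the default alphabetical one.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"satisfaction": ["Medium", "Low", "High", "Low"]})

# Explicit category order fixes the mapping Low=0, Medium=1, High=2.
ranked = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = ranked.fit_transform(df[["satisfaction"]]).ravel()

# Default behaviour sorts categories alphabetically: High=0, Low=1, Medium=2,
# which breaks the intended ranking.
alphabetical = OrdinalEncoder()
default_encoded = alphabetical.fit_transform(df[["satisfaction"]]).ravel()

print(encoded)
print(default_encoded)
```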

3. Frequency/Count Encoding

Replace category with the count of its occurrences in the train set.
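A minimal pandas sketch of count encoding. The `city` column and its values are invented for this example, and mapping unseen test categories to 0 is one possible convention rather than a fixed rule.

```python
import pandas as pd

train = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Pune", "Delhi", "Mumbai"]})
test = pd.DataFrame({"city": ["Mumbai", "Chennai"]})

# Learn the counts from the training set only.
counts = train["city"].value_counts()

# Replace each category with its training-set count.
train["city_count"] = train["city"].map(counts)

# Categories never seen in training have no count; this sketch assigns them 0.
test["city_count"] = test["city"].map(counts).fillna(0).astype(int)

print(train)
print(test)
```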

7. Scaling and Normalization

Feature scaling puts features on comparable scales so that no feature dominates simply because of its units. Without scaling, a feature with range [0, 10000] (e.g., Salary) will dominate a feature with range [0, 1] (e.g., Age/100) in distance calculations.

A. Standardization (Z-Score Normalization)

Rescales data to have a mean ( $\mu$ ) of 0 and standard deviation ( $\sigma$ ) of 1.
$x' = \frac{x - \mu}{\sigma}$

Best for: SVM, Logistic Regression, Neural Networks.
Properties: Does not bound the output range, so outliers remain extreme values (they are not capped).

B. Normalization (Min-Max Scaling)

Rescales data to a fixed range, usually [0, 1].
$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$

Best for: Neural Networks, Image Processing, algorithms requiring bounded input.
Properties: Highly sensitive to outliers.

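| Feature | Standardization | Normalization (Min-Max) |
| --- | --- | --- |
| Range | Unbounded | [0, 1] |
| Outliers | Less sensitive (but $\mu$ and $\sigma$ are still affected) | Highly sensitive |
| Suitable distribution | Approximately Gaussian (bell curve) | Non-Gaussian or unknown |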
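The two scalers can be compared side by side. This sketch assumes scikit-learn's `StandardScaler` and `MinMaxScaler` and a made-up training array (salary, Age/100) with one extreme salary.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Columns: salary, age/100. The last row has an outlying salary.
X_train = np.array([
    [20_000.0, 0.25],
    [40_000.0, 0.40],
    [60_000.0, 0.55],
    [500_000.0, 0.30],
])

# Standardization: each column ends up with mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X_train)

# Min-max normalization: each column is mapped onto [0, 1].
normalized = MinMaxScaler().fit_transform(X_train)

print(standardized.round(2))
print(normalized.round(3))
```

In the normalized salary column the single outlier takes the value 1 and squeezes the three ordinary salaries into a narrow band near 0, which illustrates the sensitivity to outliers noted above.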

[Figure: visualization comparing "Unscaled Data" vs "Scaled Data" using 2D scatter plots and gradient desce...]

8. Class Imbalance Handling

Class imbalance occurs when the target classes have an uneven distribution of observations (e.g., fraud detection: 99% normal, 1% fraud). Models trained on such data tend to be biased toward the majority class.

Techniques:

1. Resampling

Random Undersampling: Removing examples from the majority class.
- Issue: Loss of potentially valuable information.
Random Oversampling: Duplicating examples from the minority class.
- Issue: Overfitting (model memorizes duplicates).
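Both resampling strategies can be sketched with `sklearn.utils.resample`. The data below is synthetic (90 majority rows, 10 minority rows); in practice resampling should be applied to the training split only, to avoid the leakage discussed earlier.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)  # 90 majority rows, 10 minority rows

X_major, X_minor = X[y == 0], X[y == 1]

# Random undersampling: draw majority rows without replacement
# until the majority matches the minority size.
X_major_down = resample(X_major, replace=False, n_samples=len(X_minor), random_state=0)

# Random oversampling: draw minority rows with replacement
# until the minority matches the majority size.
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)

print(X_major_down.shape, X_minor_up.shape)
```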

2. SMOTE (Synthetic Minority Over-sampling Technique)

Instead of duplicating, SMOTE creates synthetic new data points.

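Select a minority sample $A$.
Find its $k$ nearest neighbors within the minority class and pick one of them at random (e.g., $B$).
Draw a line segment between $A$ and $B$.
Generate a new point at a random position on that segment: $x_{new} = A + \lambda (B - A)$, with $\lambda \in [0, 1]$.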
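The steps above can be sketched directly with NumPy and scikit-learn's `NearestNeighbors`. The function name `smote_sketch` and its parameters are hypothetical and written for illustration; this is a simplified version of the idea, not the reference implementation (the imbalanced-learn package provides a full `SMOTE` class).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is returned as its own
    # nearest neighbor (assuming there are no duplicate rows).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_minority))      # select a minority sample A
        b = rng.choice(neighbor_idx[a, 1:])    # pick one of its k neighbors, B
        gap = rng.random()                     # random position between A and B
        synthetic[i] = X_minority[a] + gap * (X_minority[b] - X_minority[a])
    return synthetic


# Synthetic minority class for illustration.
X_minority = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_minority, n_new=30, k=5)
print(X_new.shape)
```

Every synthetic point lies on a segment between two existing minority samples, so the new points stay inside the region already occupied by the minority class.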

3. Algorithmic Approaches

Class Weights: Modify the loss function to penalize the model more heavily for misclassifying the minority class (e.g., class_weight='balanced' in sklearn).
Tree-based models: Random Forests and XGBoost are often more robust to imbalance than linear models, though they can still benefit from class weights or resampling.
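What `class_weight='balanced'` does can be inspected with scikit-learn's `compute_class_weight`, which assigns each class the weight n_samples / (n_classes * class_count). The sketch below uses a synthetic 95/5 split chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data: 95 majority rows, 5 minority rows.
rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(size=(100, 2)) + y[:, None] * 2.0  # shift the minority class

# Weights used by the 'balanced' setting: rarer classes get larger weights.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights.round(3))))

# The same weighting applied inside the loss function of a classifier.
model = LogisticRegression(class_weight="balanced").fit(X, y)
print(model.predict(X[-5:]))
```

With these counts, a misclassified minority row costs about 19 times as much as a misclassified majority row (10.0 versus roughly 0.526).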