Unit 1 - Notes
INT395
Unit 1: Introduction and Data Preprocessing
1. Overview of Supervised Learning
Definition
Supervised learning is a paradigm of machine learning where the model is trained on a labeled dataset. The algorithm learns a mapping function from input variables (X, also called features) to output variables (Y, also called labels or targets).
- Goal: To approximate the mapping function so well that when you have new input data (X), you can predict the output variables (Y) for that data.
- The Teacher Analogy: The process is supervised because we know the correct answers. The algorithm makes predictions on the training data and is corrected by the "teacher" (the loss function) whenever the predictions are wrong.
Key Use Cases
- Email Filtering: Classifying emails as "Spam" or "Not Spam."
- Credit Scoring: Predicting if a customer will default on a loan based on credit history.
- Real Estate: Predicting house prices based on square footage, location, and number of bedrooms.
- Medical Diagnosis: Predicting disease presence (Positive/Negative) based on symptoms and test results.
- Image Recognition: Identifying objects (e.g., cats vs. dogs) within an image.
2. Types of Supervised Learning
Supervised learning problems are grouped based on the nature of the target variable (Y).
A. Classification
The output variable is a category (discrete value).
- Binary Classification: Two possible classes (e.g., Yes/No, Churn/Retain).
- Multi-class Classification: More than two classes (e.g., Categorizing news articles into Sports, Politics, Tech).
- Algorithms: Logistic Regression, Decision Trees, Support Vector Machines (SVM), k-Nearest Neighbors (KNN), Random Forest.
B. Regression
The output variable is a real value (continuous value).
- Examples: Predicting temperature, stock prices, age, or sales revenue.
- Algorithms: Linear Regression, Polynomial Regression, Support Vector Regression (SVR), Decision Tree Regression.
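The distinction is easiest to see in code. A minimal sketch (not part of the course material) fitting a classifier and a regressor on small synthetic datasets created with scikit-learn's make_classification and make_regression helpers:
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
# Classification: the target is a discrete class label (0 or 1)
X_cls, y_cls = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict(X_cls[:3]))   # discrete class labels, e.g. 0 or 1
# Regression: the target is a continuous value
X_reg, y_reg = make_regression(n_samples=200, n_features=4, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))   # real-valued predictions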
3. Setting up Python and Scikit-learn Environment
To perform supervised learning, the Python ecosystem relies on a stack of scientific libraries.
Essential Libraries
- NumPy: Fundamental package for scientific computing (arrays, matrices).
- Pandas: Data manipulation and analysis (DataFrames).
- Matplotlib/Seaborn: Data visualization.
- Scikit-learn (sklearn): The core machine learning library containing algorithms and preprocessing tools.
Installation
pip install numpy pandas matplotlib seaborn scikit-learn jupyter
Basic Import Convention
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# Specific models are imported as needed, e.g.:
# from sklearn.linear_model import LinearRegression
4. Types of Data
Understanding the nature of features is critical for preprocessing.
- Numerical (Quantitative):
- Continuous: Can take any value within a range (e.g., Height, Weight, Price).
- Discrete: Countable integers (e.g., Number of children, Number of purchases).
- Categorical (Qualitative):
- Nominal: Categories with no intrinsic ordering (e.g., Color: Red/Blue/Green, Gender: M/F).
- Ordinal: Categories with a clear order (e.g., T-shirt size: S/M/L/XL, Rating: Low/Medium/High).
- Time-Series: Data points indexed in time order (e.g., Daily stock closing prices).
5. Loading Datasets
Data is usually stored in CSV, Excel, or SQL databases.
Using Pandas
# Loading CSV
df = pd.read_csv('dataset.csv')
# Loading Excel
df = pd.read_excel('dataset.xlsx')
Using Scikit-learn (Toy Datasets)
Useful for practice and prototyping.
from sklearn.datasets import load_iris
# Load Iris dataset (Classification)
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
6. Data Exploration (EDA) with Pandas and Visualization
Exploratory Data Analysis (EDA) allows us to understand the data structure, patterns, and anomalies before modeling.
Pandas Inspection Methods
- df.head(): View the first 5 rows.
- df.shape: Dimensions of the dataset (rows, columns).
- df.info(): Summarizes data types and non-null counts per column.
- df.describe(): Statistical summary (mean, std, min, max, quartiles) for numerical features.
- df['column'].value_counts(): Frequency of unique values (for categorical data).
Visualization (Matplotlib & Seaborn)
- Histograms: Check distribution of numerical data (Normal vs. Skewed).
sns.histplot(df['age'], kde=True)
- Box Plots: Identify outliers and quartiles.
sns.boxplot(x=df['salary'])
- Scatter Plots: Analyze relationships between two numerical variables.
plt.scatter(df['experience'], df['salary'])
- Heatmaps: Visualize correlation matrices to find collinearity.
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
7. Common Data Issues and Cleaning
Real-world data is "dirty." Models trained on bad data produce bad predictions (Garbage In, Garbage Out).
Common Issues
- Noise: Random error or variance in a measured variable.
- Inconsistency: Formatting differences (e.g., "NY", "N.Y.", "New York").
- Duplicates: Identical rows that bias the model.
df.drop_duplicates(inplace=True)
Handling Missing Values
Missing data is represented as NaN (Not a Number) or None.
Detection: df.isnull().sum()
Strategies:
- Dropping:
- Remove rows (df.dropna()) if the dataset is large and missing rows are few.
- Remove columns if >50% of the data is missing.
- Imputation (Filling):
- Mean: Use for normally distributed numerical data.
- Median: Use for skewed numerical data (robust to outliers).
- Mode: Use for categorical data.
- Model-based: Use KNN or Regression to predict missing values.
from sklearn.impute import SimpleImputer
# Fill missing 'age' values with the column mean
imputer = SimpleImputer(strategy='mean')
df[['age']] = imputer.fit_transform(df[['age']])
Handling Outliers
Outliers are extreme values that deviate significantly from other observations.
- Detection:
- Z-Score: Values more than 3 standard deviations from the mean (|z| > 3).
- IQR (Interquartile Range): Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
- Treatment:
- Trimming: Remove the outlier rows.
- Capping (Winsorizing): Set outliers to the 5th or 95th percentile value.
- Log Transformation: Compresses the range of high-magnitude values.
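A small sketch of IQR-based detection and the three treatments, assuming a numeric df['salary'] column (the column name is only illustrative):
import numpy as np
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
# Trimming: keep only rows inside the IQR fences
df_trimmed = df[(df['salary'] >= lower) & (df['salary'] <= upper)]
# Capping (Winsorizing): set extremes to the 5th/95th percentile values
low_p, high_p = df['salary'].quantile(0.05), df['salary'].quantile(0.95)
df['salary_capped'] = df['salary'].clip(lower=low_p, upper=high_p)
# Log transformation: compress high-magnitude values (positive data only)
df['salary_log'] = np.log1p(df['salary'])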
8. Handling Class Imbalance
Occurs in classification when one class significantly outnumbers the other (e.g., Fraud Detection: 99% legitimate, 1% fraud). Models become biased toward the majority class.
Techniques
- Resampling:
- Undersampling: Randomly remove samples from the majority class (loss of information).
- Oversampling: Duplicate samples from the minority class (risk of overfitting).
- SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic (artificial) data points for the minority class by interpolating between existing samples.
- Class Weights: Modify the algorithm to penalize errors on the minority class more heavily (e.g., class_weight='balanced' in Scikit-learn).
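A brief sketch of both approaches, assuming X_train and y_train from a train/test split (SMOTE lives in the separate imbalanced-learn package, which is assumed to be installed):
from sklearn.linear_model import LogisticRegression
# Class weights: penalize minority-class errors by inverse class frequency
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, y_train)
# SMOTE oversampling (requires: pip install imbalanced-learn)
from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)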
9. Feature Scaling
Scaling ensures all features contribute equally to the result. It is crucial for distance-based algorithms (KNN, SVM, K-Means) and Gradient Descent optimization. Tree-based models (Random Forest) generally do not require scaling.
Techniques
- Standardization (Z-score Normalization):
- Scales features to have Mean (μ) = 0 and Standard Deviation (σ) = 1.
- Preferred when data follows a Gaussian distribution or has outliers.
- Formula: z = (x - μ) / σ
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
- Normalization (Min-Max Scaling):
- Scales features to a range [0, 1].
- Preferred for neural networks or image processing.
- Formula: x' = (x - x_min) / (x_max - x_min)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
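To avoid data leakage, the scaler should be fit on the training split only and then applied to both splits; a minimal sketch assuming X_train and X_test from a train/test split:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)         # reuse the same parameters on the test data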
10. Encoding Categorical Variables
Machine learning models require numerical input. Categorical text data must be converted.
1. Label Encoding
Assigns a unique integer to each category (e.g., Low=0, Medium=1, High=2).
- Use case: Ordinal data (where order matters).
- Library: sklearn.preprocessing.LabelEncoder
2. One-Hot Encoding
Creates a new binary column for each category.
- Use case: Nominal data (where no order exists, e.g., Cities).
- Issue: Can increase dimensionality significantly.
- Library: pd.get_dummies() or sklearn.preprocessing.OneHotEncoder.
- Dummy Variable Trap: Multicollinearity caused by including all dummy variables. Solution: Drop one column (drop_first=True).
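A short sketch of both encodings, assuming a DataFrame with an ordinal 'size' column and a nominal 'city' column (column names are illustrative):
import pandas as pd
# Ordinal data: an explicit mapping preserves the intended order
# (LabelEncoder assigns integers by sorted order, which may not match the ranking)
size_order = {'S': 0, 'M': 1, 'L': 2, 'XL': 3}
df['size_encoded'] = df['size'].map(size_order)
# Nominal data: one-hot encode and drop one dummy to avoid the dummy variable trap
df = pd.get_dummies(df, columns=['city'], drop_first=True)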
11. Feature Engineering
The art of creating new features from existing ones to improve model performance.
Examples:
- Binning: Converting continuous age into groups (Child, Teen, Adult, Senior).
- Polynomial Features: Creating squared and interaction terms (e.g., x^2, x1*x2) to capture non-linear relationships.
- Date/Time Extraction: Extracting 'Day of Week', 'Month', or 'Hour' from a timestamp.
- Domain Specific: Creating a "Debt-to-Income Ratio" from "Total Debt" and "Income" columns.
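Illustrative snippets for three of these (column names such as 'age', 'signup_date', 'total_debt', and 'income' are assumed):
import pandas as pd
# Binning: convert continuous age into categories
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 19, 64, 120],
                         labels=['Child', 'Teen', 'Adult', 'Senior'])
# Date/Time extraction from a timestamp column
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['signup_month'] = df['signup_date'].dt.month
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek
# Domain-specific feature: debt-to-income ratio
df['debt_to_income'] = df['total_debt'] / df['income']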
12. Feature Selection
Selecting the most relevant subset of features to reduce overfitting, improve accuracy, and speed up training.
Methods
- Filter Methods: Use statistical tests to select features independent of the model.
- Correlation Matrix: Remove features highly correlated with each other (redundant).
- Chi-Square Test: For categorical features.
- Wrapper Methods: Evaluate specific subsets of features by training a model.
- RFE (Recursive Feature Elimination): Recursively removes the least important features.
- Embedded Methods: Feature selection occurs during model training.
- Lasso Regression (L1 Regularization): Shrinks coefficients of less important features to zero.
- Tree-based Importance: Random Forests provide a feature_importances_ attribute.
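A brief sketch of one wrapper method (RFE) and one embedded method (tree importances), assuming X is a feature DataFrame and y the target:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
# Wrapper: recursively drop the least important features
rfe = RFE(estimator=RandomForestClassifier(random_state=42), n_features_to_select=5)
rfe.fit(X, y)
print(X.columns[rfe.support_])        # names of the selected features
# Embedded: importances from a fitted random forest
forest = RandomForestClassifier(random_state=42).fit(X, y)
print(sorted(zip(forest.feature_importances_, X.columns), reverse=True))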
13. Data Splitting
We must separate data to evaluate how the model performs on unseen data.
- Training Set: Used to train the model (usually 70-80%).
- Testing Set: Used to evaluate performance (usually 20-30%).
- Validation Set: (Optional) Used for hyperparameter tuning during training.
from sklearn.model_selection import train_test_split
X = df.drop('target', axis=1)
y = df['target']
# random_state ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
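For classification, especially with imbalanced classes, stratified splitting keeps the class proportions the same in both splits:
# stratify=y preserves the class distribution in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)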
14. Dimensionality Reduction
Reducing the number of input variables while retaining the meaningful information. Used to mitigate the Curse of Dimensionality.
Principal Component Analysis (PCA)
A linear unsupervised technique that projects data onto new orthogonal axes (Principal Components) that maximize variance.
- Usage: Visualization (reducing to 2D/3D) or noise reduction.
- Note: Data must be scaled (Standardized) before applying PCA.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
Linear Discriminant Analysis (LDA)
A supervised technique that finds a linear combination of features that separates two or more classes. Unlike PCA, LDA focuses on maximizing class separability.
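A minimal LDA sketch, assuming X_scaled and y from the earlier steps (the number of components can be at most n_classes - 1):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)   # needs at least 3 classes for 2 components
X_lda = lda.fit_transform(X_scaled, y)             # supervised: uses the class labels y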